2011 08H046
Contents
1
2
3
3.1 RSS Dripper [1]
3.2 Whazzup [2]
3.3 Summify [3]
3.4 Paper.li [4]
3.5
4
4.1
4.2
4.3
4.4
4.5
4.6 Python
4.7
4.8
4.9
4.10
4.11
4.12
5
6
6.1
A
A.1 main.py
1
*1 RSS*2 Atom*3 [1, 2] PC*4 *5

*1
*2 RDF Site Summary: XML
*3 RSS 2.0
*4 Operating systems such as GNU/Linux, BSD, OS X, and MS Windows
*5 Operating systems such as Android and iOS
2 Google Reader RSS RSS Google Reader Google Reader 2
3 3.1 RSS Dripper [1] Web 3.2 Whazzup [2] Python/web.py 3.3 Summify [3] Summify Twitter,Facebook,Google Reader Web ios RSS 1 1 3.4 Paper.li [4] Paper.li Twitter Facebook Web 3.5 Google Reader 1, 2 3
[Figure 1: Google Reader]
[Figure 2: Google Reader]
4
4.1
(MacBook Air Late 2010)
CPU: 1.6 GHz Intel Core 2 Duo
Memory: 4 GB 1067 MHz DDR3
OS: Mac OS X Lion 10.7.2 (11C74)
Language: Python 2.7.1
4.2 Google Reader 3
[Figure 3]
4.3 2 Google Reader 4.3.1 Rss Dripper Whazzup 4.3.2 A B X B Y A Y Summify Paper.li 7
4.4 4.4.1 MeCab( ) [5] ChaSen Juman KAKASI ChaSen 3 4 OS X Spotlight,iPhone OS 2.1 ChaSen( ) [6] Juman JUMAN [7] ChaSen KAKASI [8] kanji kana simple inverter MeCab 8
4.4.2 UniDic [9] Chasen MeCab mecab-ipadic [10] IPA IPA CRF MeCab mecab-jumandic [10] Juman CRF 30000 mecab-naist-jdic [10, 11] IPA / IPADIC (ICOT ) 4 4 6 9
4 1. MeCab 4 Xbox360 UniDic Xbox naist-jdic 360 naist-jdic ipadic, jumandic 10
5 2. MeCab 4 jumandic UniDic ipadic, naist-jdic 11
6 3. MeCab 4 ipadic, naist-jdic 2 jumandic 3 ipadic, naist-jdic naist-jdic UniDic jumandic ipadic ipadic MeCab ipadic
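4.5 [12]
Let P(B) be the probability of event B (the prior probability), and let P(B|A) be the probability of B given that event A has occurred (the posterior probability, a conditional probability). For P(A) > 0, Bayes' theorem states

$$P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(B)} \tag{1}$$

The right-hand side is defined only when P(B) > 0.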
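As a worked illustration of Equation (1), the sketch below computes a posterior from made-up numbers. The events ("article is starred", "title contains a given word") and every probability are hypothetical and chosen only to show the arithmetic; they are not results from this work.

```python
def posterior(p_a, p_b_given_a, p_b):
    """Return P(A|B) = P(A) * P(B|A) / P(B), following Equation (1)."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_a * p_b_given_a / p_b


# Hypothetical inputs: A = "article is starred", B = "title contains a given word".
p_a = 0.2              # P(A)
p_b_given_a = 0.5      # P(B|A)
p_b_given_not_a = 0.1  # P(B|not A)

# Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = posterior(p_a, p_b_given_a, p_b)
```

With these inputs P(B) is 0.18, so observing the word raises the probability of "starred" from 0.2 to roughly 0.56.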
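4.6 Python
The following Python libraries are used.
gdata-python-client (2.0.15) [13]: Google's Python client library for the Google Data API, used here to access the Google Reader API.
MeCab (0.98) [5]: used for word segmentation (see main.py in the appendix).
Reverend (0.3) [14]: used for the Bayesian classifier (see main.py in the appendix).
4.7 Google Reader
4.8 Google API*6 [15, 16] Google Reader SID*7 Google
*6 Application Programming Interface
*7 Session ID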
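4.9
Data is stored with sqlite3 in two tables, feeds and results.
4.9.1 feeds
crawltime : time at which Google Reader crawled the item
feedurl : URL of the feed
itemurl : URL of the item
itemid : ID of the item
title : title of the item
body : body text of the item
status : status of the item
4.9.2 results
feedurl : URL of the feed
itemid : ID of the item
title : title of the item
result_sub : score of the item (Equation (2), Section 4.11.2)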
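The two tables can be sketched with the standard sqlite3 module as follows. This is a minimal illustration rather than the thesis code. It assumes the column order used by the insert statement in the appendix listing, and it uses an in-memory database and a made-up sample row.

```python
import sqlite3

# In-memory database for illustration only; the appendix listing uses a file.
db = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode

# Column order is an assumption taken from the insert statement in main.py.
db.execute("create table feeds "
           "(crawltime, status, feedurl, itemurl, itemid, title, body)")
db.execute("create table results (feedurl, itemid, title, result_sub)")

# Hypothetical row; every value here is invented.
row = ["1322700000000", "star", "feed/http://example.com/rss",
       "http://example.com/item-1", "item-1", "Sample title",
       "Sample title and body text"]
db.execute("insert into feeds values (?,?,?,?,?,?,?)", row)

# Parameterized query, in the same style as the train() method.
starred = db.execute("select title from feeds where status = ?",
                     ["star"]).fetchall()
```

Passing values through `?` placeholders keeps feed titles that contain quotes from breaking the SQL statement.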
4.10 API Google Reader Google Reader API 1000 XML *8 *9 4.10.1 XML XML XML parser * 10 API DOM * 11 4.10.2 7 8 URL ExtractContent * 12 ExtractContent XML *8 Extensible Markup Language: XML *9 feeds *10 XML *11 Document Object Model:XML *12 Web 16
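The DOM-based parsing in Section 4.10.1 can be sketched with xml.dom.minidom from the standard library. The XML fragment below is a hypothetical stand-in for a Google Reader response. The attribute names follow the appendix listing, while the namespace URI, URLs, and values are assumptions made for illustration.

```python
from xml.dom import minidom

# Hypothetical Atom fragment; not an actual Google Reader response.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:gr="http://www.google.com/schemas/reader/atom/">
  <entry gr:crawl-timestamp-msec="1322700000000">
    <id>tag:example.com,2011:item-1</id>
    <title>Sample title</title>
    <link rel="alternate" href="http://example.com/item-1"/>
    <source gr:stream-id="feed/http://example.com/rss"/>
  </entry>
</feed>"""


def parse_entries(xml_text):
    """Return one dict per <entry> element, read through the DOM API."""
    rows = []
    for entry in minidom.parseString(xml_text).getElementsByTagName("entry"):
        source = entry.getElementsByTagName("source")[0]
        link = entry.getElementsByTagName("link")[0]
        rows.append({
            "crawltime": entry.getAttribute("gr:crawl-timestamp-msec"),
            "feedurl": source.getAttribute("gr:stream-id"),
            "itemurl": link.getAttribute("href"),
            "itemid": entry.getElementsByTagName("id")[0].firstChild.data,
            "title": entry.getElementsByTagName("title")[0].firstChild.data,
        })
    return rows


rows = parse_entries(SAMPLE)
```

Each resulting dict carries the fields that are stored in the feeds table, apart from body and status.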
[Figure 7: Web Google Reader]
[Figure 8: Web Google Reader]
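4.10.3
HTML tags are removed from the text taken from the XML (Figures 9 and 10).
[Figure 9: html]
[Figure 10: html]
4.10.4 status
The status column takes one of three values.
star : starred item
read : item already read
unread : unread item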
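The tag removal of Section 4.10.3 can be sketched with a non-greedy regular expression, which is the approach the get_subbody method in the appendix appears to take. The helper name strip_tags and the sample string are hypothetical.

```python
import re


def strip_tags(text):
    """Remove every substring that looks like an HTML tag.

    The non-greedy pattern stops at the first '>' after each '<',
    so the text between two tags is kept.
    """
    return re.sub(r"<.*?>", "", text)


plain = strip_tags('<p>Hello <a href="http://example.com/">world</a></p>')
```

Because `.` does not match a newline by default, a tag that spans several lines is left in place. A full HTML parser would be more robust, but for building classifier input this simple filter is often enough.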
4.11 feeds 4.11.1 4.11.2 results feeds XML result sub = (2) 4.12 results result sub Google Reader 19
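The score stored as result_sub can be sketched as follows. This is a reading of the appendix listing, not a confirmed formula: the operator between the two scores is not legible in the listing, and subtraction is assumed here from the name result_sub. The input is assumed to be a list of (label, score) pairs sorted by descending score, and the function name score_margin is hypothetical.

```python
def score_margin(results):
    """Return the score for an item whose top label is 'star', else None.

    Assumption: with two candidate labels the score is the margin between
    the best and the second-best score; with one label it is that score.
    """
    if not results or results[0][0] != "star":
        return None
    if len(results) == 2:
        return results[0][1] - results[1][1]
    return results[0][1]


# Hypothetical classifier outputs.
clear_win = score_margin([("star", 0.9), ("read", 0.3)])
only_star = score_margin([("star", 0.7)])
rejected = score_margin([("read", 0.8), ("star", 0.2)])
```

Using the margin rather than the raw top score ranks an item higher when the classifier clearly separates "star" from "read", and lower when the two labels score almost the same.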
5 Google Reader 11 20 12 13 1 Web Google Reader 12,14 PC Google Reader 15,16 11 Google Reader 20
[Figure 12: Google Reader]
[Figure 13]
[Figure 14: Web Google Reader (iPhone 4, Safari)]
[Figure 15: iOS, Sylfeed Version 2.1.1]
[Figure 16: iOS, Reeder Version 2.5.4]
6 Google Reader PC Google Reader 6.1 6.1.1 Web Python Web 6.1.2 OAuth SID OAuth Google Reader API 6.1.3 24
[1] RSS Dripper. http://ns.oblique-project.com/rssdripper/.
[2] Whazzup. http://code.google.com/p/whazzup/.
[3] Summify. http://summify.com/.
[4] Paper.li. http://paper.li/.
[5] MeCab: Yet another part-of-speech and morphological analyzer. http://mecab.sourceforge.net/.
[6] ChaSen. http://chasen-legacy.sourceforge.jp/.
[7] JUMAN - Kurohashi-Kawahara Lab. http://nlp.ist.i.kyoto-u.ac.jp/index.php?juman.
[8] KAKASI. http://kakasi.namazu.org.
[9] UniDic. http://www.tokuteicorpus.jp/dist/.
[10] mecab - downloads. http://code.google.com/p/mecab/downloads/list.
[11] NAIST-jdic wiki. http://sourceforge.jp/projects/naist-jdic/wiki/frontpage.
[12] Bayes' theorem. http://ja.wikipedia.org/wiki/%e3%83%99%e3%82%a4%e3%82%ba%e3%81%ae%e5%ae%9A%E7%90%86.
[13] gdata-python-client. http://code.google.com/p/gdata-python-client/.
[14] Reverend. https://github.com/arnaudsj/reverend.
[15] Koji Yamashita. google reader api api. http://colo-ri.jp/develop/2009/12/google-reader-apiapi.html, 2009.
[16] MOIMOI. Google python google reader api. http://moimoitei.blogspot.com/2011/03/google-python-google-reader-api.html, 2011.
A
A.1 main.py

Listing 1 main.py

#!/usr/bin/env python
# coding: utf-8
# Lines ending in [...] are cut off at the page margin in the source listing.
# Indentation, quotes, underscores, hyphens and '*' were lost in extraction
# and are restored here from context.

import gdata.service
import sqlite3
import os
import MeCab
import urllib
import re
from xml.dom import minidom
from reverend.thomas import Bayes

USER_NAME = '@gmail.com'
USER_PASSWD = ''
EXTRACT_FEED_NUM = 20
LABEL_NAME = 'NiceFeed'
GET_FEED_NUM = 1000


class Reader():
    def __init__(self):
        self.auth()
        self.load_database('feeddata.db')

    # Log in with ClientLogin and fetch an edit token.
    def auth(self):
        self.service = gdata.service.GDataService(account_type='GOOGLE', service='reader', serve[...]
        self.service.ClientLogin(USER_NAME, USER_PASSWD)
        self.token = self.service.Get('/reader/api/0/token', converter=lambda x: x)

    # Open the database, creating both tables on the first run.
    def load_database(self, filename):
        if os.path.isfile(filename):
            self.database = sqlite3.connect(filename, isolation_level=None)
        else:
            self.database = sqlite3.connect(filename, isolation_level=None)
            self.database.execute('create table feeds (crawltime, status, feedurl, itemurl, item[...]
            self.database.execute('create table results (feedurl, itemid, title, result_sub)')
        table = self.database.execute("select * from sqlite_master where type = 'table' and name[...]
        if table.fetchone() != None:
            self.add_label()
            self.database.execute('delete from results')

    # Build the query for one status, starting after the newest stored item.
    def query_selecter(self, status):
        try:
            crawltime = int(self.database.execute('select max(crawltime) from feeds where status[...]
            crawltime = str(crawltime + 1)
        except:
            crawltime = ''
        if status == 'star':
            return gdata.service.Query(feed='/reader/atom/user/-/state/com.google/starred', param[...]
        elif status == 'read':
            return gdata.service.Query(feed='/reader/atom/user/-/state/com.google/read', params={[...]
        elif status == 'unread':
            return gdata.service.Query(feed='/reader/atom/user/-/state/com.google/reading-list',[...]

    # Fetch entries for one status and store them in the feeds table.
    def add_entry_data(self, status):
        query = self.query_selecter(status)
        feedxml = self.service.Get(query.ToUri(), converter=lambda x: x)
        entries = minidom.parseString(feedxml).getElementsByTagName('entry')
        for entry in entries:
            crawltime = entry.attributes['gr:crawl-timestamp-msec'].value
            feedurl = entry.getElementsByTagName('source')[0].attributes['gr:stream-id'].value
            itemurl = entry.getElementsByTagName('link')[0].attributes['href'].value
            itemid = entry.getElementsByTagName('id')[0].childNodes[0].data
            title = entry.getElementsByTagName('title')[0].childNodes[0].data
            body = title + self.get_subbody(entry, 'content') + self.get_subbody(entry, 'summary')
            values = [crawltime, status, feedurl, itemurl, itemid, title, body]
            self.database.execute('insert into feeds values (?,?,?,?,?,?,?)', values)

    # Return the text of the given child element with HTML tags removed.
    def get_subbody(self, entry, tag):
        try:
            data = entry.getElementsByTagName(tag)[0].childNodes[0].data
            return re.sub('<.*?>', '', data)
        except:
            return ''

    # Train the classifier on the word-segmented bodies of one status.
    def train(self, status):
        for body in self.database.execute('select body from feeds where status = ?', [status]):
            wakati_body = MeCab.Tagger('-Owakati').parse(body[0].encode('utf-8'))
            self.guesser.train(status, wakati_body)

    # Classify items and store those judged 'star' in the results table.
    def extract_feed(self):
        self.guesser = Bayes()
        self.train('star')
        self.train('read')
        for body, feedurl, itemid, title in self.database.execute('select body, feedurl, itemid, tit[...]
            wakati_body = MeCab.Tagger('-Owakati').parse(title.encode('utf-8'))
            results = self.guesser.guess(wakati_body)
            if len(results) > 0 and results[0][0] == 'star' and not title.startswith(('PR:', 'AD:'[...]
                if len(results) == 2:
                    result_sub = results[0][1] - results[1][1]
                else:
                    result_sub = results[0][1]
                values = [feedurl, itemid, title, result_sub]
                self.database.execute('insert into results values (?,?,?,?)', values)

    # Attach the label to the extracted items in Google Reader.
    def add_label(self):
        for feedurl, itemid, title, result_sub in self.database.execute('select feedurl, itemid, ti[...]
            params = urllib.urlencode({'s': feedurl, 'i': itemid, 'a': 'user/-/label/' + LABEL_NAME,[...]
            self.service.Post(params, '/reader/api/0/edit-tag', converter=lambda x: x, extra_headers[...]
            self.database.execute('delete from feeds where itemid = ?', [itemid])
            print result_sub, title
        self.database.execute('delete from results')

    def main(self):
        self.add_entry_data('star')
        self.add_entry_data('read')
        self.add_entry_data('unread')
        self.extract_feed()
        self.add_label()


if __name__ == '__main__':
    reader = Reader()
    reader.main()