Vol. 48 No. SIG 14 (TOD 35), Sep. 2007

BLOGRANGER: Implementation of Goal-oriented Blog Search Engine

Hiroyuki Toda, Ko Fujimura, Takafumi Inoue, Nobuaki Hiroshima, Masayuki Sugizaki, Ryoji Kataoka and Masahiro Oku

NTT Cyber Solutions Laboratories, NTT Corporation; NTT Resonant Inc.

Topics mentioned in the blogspace are biased toward interesting, funny, or entertainment-related topics compared with the general web space, and many articles contain personal opinions on goods or services. Making good use of these characteristics, we introduce a new blog search engine that provides multiple interfaces, each targeted at a different goal, e.g., topic search, blogger search, and reputation search. To evaluate the effectiveness of the system, we conducted a user survey and collected 2,191 answers. For the specific search conducted, twice as many people answered that BLOGRANGER is superior to general web search.

1. World Wide Web Web ping push RSS pull 2006 3 868 19) 20),21),23),24),27),30) Web Web Web Web 20),24),27),30)
BLOGRANGER BLOGRANGER BLOGRANGER 2 BLOGRANGER 3 2 BLOGRANGER 4 5 6

2. Web Web 1 2 3 BLOGRANGER 4
Fig. 1 GUI of BLOGRANGER.

BLOGRANGER

3. BLOGRANGER

3.1 2 Scatter/Gather 3),8) Scatter/Gather BLOGRANGER 15),17) Scatter/Gather Gather 15),17) Web BLOGRANGER 1 BLOGRANGER 3 4
1 3.3

3.2 BLOGRANGER 2 2 2

3.2.1 17) 2 17) 7), 17)
3 Web 2 Web CD DVD Web Web goo 7 40 Web Web Web 11 http://movie.goo.ne.jp/schedule/upcoming.html

Fig. 2 Outline of dictionary construction process.

Web Web Web Web Web 18) Shinzato 13) HTML (1) (a) 5 10 (b) Web Web (c) Web (d) HTML XML TR LI (e)
Fig. 3 Construction process of Topic Filter.

(2) (3) Web (4) Fujii 4) (1) Web (2) Web (3) (4) (5) (6) 2 BLOGRANGER DVD CD

Fig. 4 Example of Topic Filter.

17) 3 4
Fig. 5 Construction process of Refer Filter.

6

3.2.2 BLOGRANGER 2),10) Web URL URL URL HTML TITLE URL URL 5

3.2.3 7000 BLOGRANGER
Fig. 6 Example of Sentiment Filter.

6 7 8

3.2.4 EigenRumor 6) EigenRumor authority hub reputation 3 reputation authority hub authority hub reputation reputation reputation authority
Fig. 7 Example of Sentiment Filter (a case in which a sentiment word is selected).

Fig. 8 Construction process of Sentiment Filter.

URL EigenRumor 1 9

3.3 BLOGRANGER 4 BLOGRANGER EigenRumor BLOGRANGER BLOGRANGER 1 RSS Web
Fig. 9 Construction process of Blogger Filter.

1),9) RSS RSS Web 1 2004 10 16 2005 2 3 10 305,000 9,280,000 6) 9,280,000 1 1,520,000 16.3% 116,000 1.25% 107,000 1.15% PageRank 98.85% 0 EigenRumor 6) EigenRumor hub authority authority 36,200 11.9% 28,300 9.28% 28,300 862,000 9.3% EigenRumor 9.3% 9.3% 9.3% EigenRumor HITS HITS Web EigenRumor
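EigenRumor BBS identity ID EigenRumor

Let $m$ be the number of agents and $n$ the number of objects, with agent index $i$ and object index $j$. The provisioning matrix is

$$ P = [p_{i,j}] \quad (i = 1, \ldots, m;\ j = 1, \ldots, n), $$

where $p_{i,j} = 1$ if agent $i$ provides object $j$ and $p_{i,j} = 0$ otherwise. The evaluation matrix is

$$ E = [e_{i,j}] \quad (i = 1, \ldots, m;\ j = 1, \ldots, n), $$

where $e_{i,j} \in [0, 1]$ is the evaluation score that agent $i$ gives to object $j$.

1 EigenRumor P E identity URL URL identity blog identity blog URL URL URL 2 i j i j $e_{i,j} = 1$ $e_{i,j} = 0$ URL URL 2 EigenRumor P E 2

The algorithm uses three score vectors: the authority vector $a = [a_1, \ldots, a_m]^{T}$, where $a_i$ is the authority score of agent $i$ ($i = 1, \ldots, m$); the hub vector $h = [h_1, \ldots, h_m]^{T}$, where $h_i$ is the hub score of agent $i$ ($i = 1, \ldots, m$); and the reputation vector $r = [r_1, \ldots, r_n]^{T}$, where $r_j$ is the reputation score of object $j$ ($j = 1, \ldots, n$).

reputation authority reputation hub reputation reputation authority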
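As a rough illustration of the matrices P and E defined above, the following sketch builds both from lists of (agent, object) pairs. The function name `build_matrices`, the toy link data, the use of 0-based indices (the text uses 1-based indices), and the choice to treat each evaluation as a binary link are all assumptions made for this example, not details taken from the BLOGRANGER implementation.

```python
def build_matrices(num_agents, num_objects, provisions, evaluations):
    """Build the provisioning matrix P and evaluation matrix E as m x n lists.

    provisions:  iterable of (i, j) pairs, agent i provides object j  -> p[i][j] = 1
    evaluations: iterable of (i, j) pairs, agent i evaluates object j -> e[i][j] = 1
    Indices are 0-based here (an assumption of this sketch).
    """
    P = [[0.0] * num_objects for _ in range(num_agents)]
    E = [[0.0] * num_objects for _ in range(num_agents)]
    for i, j in provisions:
        P[i][j] = 1.0
    for i, j in evaluations:
        E[i][j] = 1.0  # binary evaluation; values in [0, 1] would also fit the definition
    return P, E


# Hypothetical toy data: 3 bloggers (agents) and 4 blog entries (objects).
provisions = [(0, 0), (0, 1), (1, 2), (2, 3)]   # who wrote which entry
evaluations = [(0, 2), (1, 0), (2, 0), (2, 2)]  # who linked to which entry
P, E = build_matrices(3, 4, provisions, evaluations)

for name, matrix in (("P", P), ("E", E)):
    print(name)
    for row in matrix:
        print("  ", row)
```

Each row corresponds to one agent and each column to one object, so a row of P lists what an agent has provided and a row of E lists what that agent has evaluated.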
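reputation hub 4

$$ r = P^{T} a \tag{1} $$
$$ r = E^{T} h \tag{2} $$
$$ a = P r \tag{3} $$
$$ h = E r \tag{4} $$

Equations (1) and (2) are combined with a weight $\alpha \in [0, 1]$:

$$ r = \alpha P^{T} a + (1 - \alpha) E^{T} h \tag{5} $$

With $\alpha = 1$ the reputation depends only on the authority scores; with $\alpha = 0$ it depends only on the hub scores. Substituting (3) and (4) into (5) gives

$$ r = \alpha P^{T} P r + (1 - \alpha) E^{T} E r = (\alpha P^{T} P + (1 - \alpha) E^{T} E)\, r = S r, \tag{6} $$

where $S = \alpha P^{T} P + (1 - \alpha) E^{T} E$. The reputation vector $r$ in (6) is therefore an eigenvector of $S$:

$$ \lambda r = S r \tag{7} $$

As in HITS, $r$ is obtained as the principal eigenvector of $S$, and $a$ and $h$ then follow from (3) and (4).

EigenRumor EigenRumor BLOGRANGER EigenRumor HITS PageRank HITS HITS 1) HITS HITS BLOGRANGER E 2 EigenRumor E P authority hub PageRank E P 1 Web 1 7 10 1 E P 1 BLOGRANGER E P BLOGRANGER 1 E P 0.98

3.4 BLOGRANGER 10 BLOGRANGER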
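The EigenRumor update in Eqs. (3) to (6) of Section 3.3 can be sketched as a power iteration, which converges to the principal eigenvector of S when that eigenvector is unique. This is a minimal pure-Python sketch under stated assumptions: the L2 normalization at each step, the stopping tolerance, the value alpha = 0.5, and the toy matrices are choices made for this example and are not taken from the paper, which may normalize P and E differently.

```python
import math


def mat_vec(M, v):
    """Return M v for a nested-list matrix M and a vector v."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in M]


def mat_t_vec(M, v):
    """Return M^T v without building the transpose explicitly."""
    cols = len(M[0])
    return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(cols)]


def apply_s(P, E, alpha, r):
    """Return S r with S = alpha P^T P + (1 - alpha) E^T E, as in Eq. (6)."""
    a = mat_vec(P, r)  # Eq. (3): authority score of each agent
    h = mat_vec(E, r)  # Eq. (4): hub score of each agent
    from_a = mat_t_vec(P, a)
    from_h = mat_t_vec(E, h)
    return [alpha * x + (1.0 - alpha) * y for x, y in zip(from_a, from_h)]


def eigenrumor(P, E, alpha=0.5, max_iter=200, tol=1e-12):
    """Power iteration for r; returns (a, h, r) with r scaled to unit L2 norm."""
    n = len(P[0])
    r = [1.0 / math.sqrt(n)] * n  # uniform start vector
    for _ in range(max_iter):
        s_r = apply_s(P, E, alpha, r)
        norm = math.sqrt(sum(x * x for x in s_r))
        if norm == 0.0:
            break  # no provisioning or evaluation links at all
        new_r = [x / norm for x in s_r]
        delta = max(abs(x - y) for x, y in zip(new_r, r))
        r = new_r
        if delta < tol:
            break
    return mat_vec(P, r), mat_vec(E, r), r


# Hypothetical toy data: 3 agents (rows) and 4 objects (columns).
P = [[1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
E = [[0.0, 0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [1.0, 0.0, 1.0, 0.0]]

a, h, r = eigenrumor(P, E, alpha=0.5)
print("reputation r:", [round(x, 4) for x in r])
print("authority  a:", [round(x, 4) for x in a])
print("hub        h:", [round(x, 4) for x in h])
```

In this toy case the first object is provided by the most productive agent and evaluated by two agents, so it receives the highest reputation score. For realistic data sizes a sparse-matrix library would be the natural choice; nested lists are used here only to keep the sketch self-contained.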
Fig. 10 System overview.

4 URL URL BLOGRANGER 1

4. 2 BLOGRANGER BLOGRANGER Intelliseek 28)

4.1 22 Web 2006 2 10 2 12 6,700 2,191 32.7% Web BLOGRANGER 1 40 1 Web 24) 25) BLOGRANGER 22) 1 1 BLOGRANGER 5 100 1,000 Web Google Web 2500 BLOGRANGER Web 2 BLOGRANGER 1 Web Web 2006 1 goo Web 20 26) 2006 2 9 10 30) 2006 1 9 2 8 BLOGRANGER 10
Table 1 Search goals for the selected keywords.
38.02%  36.92%  48.52%  21.50%  7.99%  1.32%

Table 2 Comparison of the usefulness between Web search and BLOGRANGER.
        Web   BLOGRANGER
2191    907   698          586

40 HIS DELL JTB 2

4.2 1 5 Web 2 2

4.3 Web 2 Web BLOGRANGER 8 Web BLOGRANGER Web 3 5 8 Web BLOGRANGER 2 BLOGRANGER BLOGRANGER Web 5% 4 Web BLOGRANGER 8 9 2 3 Web BLOGRANGER 2 Web BLOGRANGER 2 Web

4.4 BLOGRANGER Web 6 BLOGRANGER Web 8 BLOGRANGER 11 6 2 2
Table 3 Comparison of the usefulness between Web search and BLOGRANGER.
        Web   BLOGRANGER
 833    411   201          221
 809    345   257          207
1063    362   430          271
 471    167   177          127
 175     51    89           35

Table 4 Comparison of the usefulness between Web search and BLOGRANGER.
        Web   BLOGRANGER
 754    530    62          162
 732    346   228          158
 968    225   487          256
 302     75   155           72
 101      7    71           23

30% 40% Web

Fig. 11 Comparison of usefulness between traditional blog search and BLOGRANGER.

8 BLOGRANGER Web 19% BLOGRANGER 34% BLOGRANGER 5% BLOGRANGER Web BLOGRANGER BLOGRANGER 3 50% 65% BLOGRANGER BLOGRANGER 4
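The survey figures can be cross-checked arithmetically. The sketch below assumes that the first number in each Table 3 row is the row total and that the next two numbers are the Web and BLOGRANGER counts; the printed header preserves only those two labels, so the column assignment and the name `other` for the last column are assumptions of this example. It also checks that each row total divided by the 2,191 respondents reproduces the first five percentages printed in Table 1, and that 2,191 out of the 6,700 printed in Section 4.1 matches the printed 32.7%.

```python
TOTAL_RESPONDENTS = 2191

# Table 3 rows in printed order: (row total, Web, BLOGRANGER, other).
# The column meanings are assumed, see the note above.
table3 = [
    (833, 411, 201, 221),
    (809, 345, 257, 207),
    (1063, 362, 430, 271),
    (471, 167, 177, 127),
    (175, 51, 89, 35),
]

# First five percentages printed in Table 1, in printed order.
table1_percent = [38.02, 36.92, 48.52, 21.50, 7.99]

shares = []
for (total, web, blogranger, other), printed in zip(table3, table1_percent):
    # Each row total should equal the sum of its three answer columns.
    assert web + blogranger + other == total
    share = round(100.0 * total / TOTAL_RESPONDENTS, 2)
    shares.append(share)
    print(f"{total:5d}  share={share:5.2f}%  printed={printed:5.2f}%  "
          f"BLOGRANGER/Web={blogranger / web:.2f}")

# 2,191 answers against the 6,700 figure printed in Section 4.1.
response_rate = round(100.0 * TOTAL_RESPONDENTS / 6700, 1)
print(f"response rate = {response_rate}%")
```

The agreement between the computed shares and the Table 1 percentages suggests that the rows of Table 3 follow the same order as the search goals in Table 1.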
BLOGRANGER BLOGRANGER 3 BLOGRANGER BLOGRANGER

4.5 7 BLOGRANGER 4 12

Fig. 12 Comparison of usefulness among the proposed filters.

Table 5 Questionnaire result for the usability.
58.97%  10.59%  30.31%
56.37%   9.27%  34.14%

4.6 13 BLOGRANGER 14 5 10% BLOGRANGER

5. BlogPulse 21) Conversation Tracker BlogPulse Profiles
BlogPulse Nakajima 12) Agitator Summarizer EigenRumor Nakajima blogwatcher 16),23) 5) blogwatcher 3.5 kizasi.jp 29) kizasi.jp 1 14) Mishne 11) Web

6. 2 3 BLOGRANGER 4 2 BLOGRANGER Web 5
1) Brin, S. and Page, L.: The anatomy of a large-scale hypertextual Web search engine, Proc. 7th International Conference on World Wide Web (WWW7), Brisbane, Australia, pp.107-117 (Apr. 1998).
2) Chang, C.H., Lui, S.C. and Pu, C.: IEPAD: Information Extraction Based on Pattern Discovery, Proc. 12th International Conference of World Wide Web, Hong Kong, China, pp.4-15 (May 2001).
3) Cutting, D., Karger, D., Pedersen, J. and Tukey, J.: Scatter/Gather: A cluster-based approach to browsing large document collections, Proc. 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Copenhagen, Denmark, pp.318-329 (June 1992).
4) Fujii, A., Itoh, K., Akiba, T. and Ishikawa, T.: Exploiting Anchor Text for the Navigational Web Retrieval at NTCIR-5, Proc. NTCIR-5 Workshop Meeting, Tokyo, Japan (Dec. 2005).
5) Fujiki, T., Nanno, T., Suzuki, Y. and Okumura, M.: Identification of Bursts in a Document Stream, Proc. 1st International Workshop on Knowledge Discovery in Data Streams, Pisa, Italy, pp.55-64 (Sep. 2004).
6) Fujimura, K., Inoue, T. and Sugizaki, M.: The EigenRumor Algorithm for Ranking Blogs, Proc. WWW 2nd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, Chiba, Japan (May 2005).
7) Grishman, R. and Sundheim, B.: Message Understanding Conference 6: A Brief History, Proc. 16th International Conference on Computational Linguistics, Copenhagen, Denmark, pp.466-471 (Aug. 1996).
8) Hearst, M. and Pedersen, J.: Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results, Proc. 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Zurich, Switzerland, pp.318-329 (Aug. 1996).
9) Kleinberg, J.: Authoritative sources in a hyperlinked environment, J. ACM, Vol.46, No.5, pp.604-632 (1999).
10) Kushmerick, N.: Wrapper Induction: Efficiency and Expressiveness, Artificial Intelligence, Vol.118, pp.15-68 (2000).
11) Mishne, G. and Rijke, M.: A Study of Blog Search, Proc. 28th European Conference on Information Retrieval, London, UK, pp.289-301 (Apr. 2006).
12) Nakajima, S., Tatemura, J., Hino, Y., Hara, Y. and Tanaka, K.: Discovering Important Bloggers based on Analyzing Blog Threads, Proc. WWW 2005 2nd Annual Workshop on the Weblogging Ecosystem, Chiba, Japan (May 2005).
13) Shinzato, K. and Torisawa, K.: A Simple WWW-based Method for Semantic Word Class Acquisition, Proc. International Conference on Recent Advances in Natural Language Processing 2005, pp.493-500 (Sep. 2005).
14) Suhara, Y., Toda, H. and Sakurai, A.: Event mining from the Blogosphere using topic words, Proc. 1st International Conference on Weblogs and Social Media (ICWSM 2007), Boulder, Colorado, U.S.A. (Mar. 2007).
15) Zeng, H., He, Q., Zheng, C., Ma, W. and Ma, J.: Learning to cluster web search results, Proc. 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, United Kingdom, pp.210-217 (Aug. 2004).
16) blog SIG-SW & ONT-A401-01, pp.01-01 01-08 (2004).
17) Vol.46, No.SIG13 (TOD27), pp.40-52 (2005).
18) SoftPath Vol.2002, No.105, pp.15-20 (2002).
19) SNS (2006).
20) ask.jp. http://ask.jp/bloghome.asp
21) BlogPulse. http://www.blogpulse.com/
22) BLOGRANGER. http://ranger.labs.goo.ne.jp/
23) blogwatcher. http://blogwatcher.pi.titech.ac.jp/
24) goo. http://www.goo.ne.jp/
25) goo Search. http://blog.goo.ne.jp/
26) goo. http://ranking.goo.ne.jp/
27) Google Blog Search. http://blogsearch.google.com/
28) Intelliseek, WWE-2006 Weblog Data Challenge. http://www.blogpulse.com/www2006-workshop/datashare-instructions.txt
29) kizasi.jp. http://kizasi.jp/
30) Technorati JAPAN. http://technorati.jp/
A.1 4 1 40 2 5 4 4 Web goo goo 6 Web Web 4 BLOGRANGER 7 4 BLOGRANGER 4 8 4 Web BLOGRANGER Web 9 4 BLOGRANGER 13 BLOGRANGER 14 BLOGRANGER

A.2
189 8.63%
187 8.53%
171 7.80%
DS 155 7.07%
125 5.71%
111 5.07%
98 4.47%
85 3.88%
79 3.61%
72 3.29%
68 3.10%
68 3.10%
64 2.92%
61 2.78%
55 2.51%
JTB 48 2.19%
45 2.05%
HIS 42 1.92%
42 1.92%
41 1.87%
40 1.83%
W-ZERO3 38 1.73%
38 1.73%
KAT-TUN 35 1.60%
30 1.37%
Dell 27 1.23%
27 1.23%
21 0.96%
19 0.87%
18 0.82%
Web2.0 15 0.68%
15 0.68%
14 0.64%
13 0.59%
Opera 12 0.55%
ENDLICHERI 8 0.37%
4gamer 5 0.23%
Feedpath 4 0.18%
foobar2000 4 0.18%
2 0.09%
(Received March 20, 2007)
(Accepted July 4, 2007)

1997 1999 2007 1999 Web NTT ACM SIGIR 1998 2000 NTT 1993 1995 NTT 1984 1989 NTT 1990 1992 NTT 1985 1987 NTT 1982 1984 NTT ALT-J/E REVISE-T NTT goo http://labs.goo.ne.jp/