24 1 Vo. 24 No. 1 2001 1 CHINESE J. COMPUTERS Jan. 2001!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 100080. SVM SVM. SVM. SVM.. TP391 A Chinese Web Page Cassifier Based on Support Vector Machine and Unsupervised Custering LI Xiao-Li LIU Ji-Min SHI Zhong-Zhi Institute of Computing Technoogy Chinese Academy of Sciences Beijing 100080 Abstract This paper presents a new agorithm that combines Support Vector Machine SVM and unsupervised custering. After anayzing the characteristics of web pages it proposes a new vector representation of web pages and appies it to web page cassification. Given a training set the agorithm custers positive and negative exampes respectivey by the unsupervised custering agorithm UC which wi produce a number of positive and negative centers. Then it seects ony some of the exampes to input to SVM according to ISUC agorithm. At the end it constructs a cassifier through SVM earning. Any text can be cassified by comparing the distance of custering centers or by SVM. If the text nears one custer center of a category and far away from a the custer centers of other categories UC can cassify it righty with high possibiity otherwise SVM is empoyed to decide the category it beongs. The agorithm utiizes the virtues of SVM and unsupervised custering. The experiment shows that it not ony improves training efficiency but aso has good precision. Keywords support vector machine custering text cassification 1 Internet.... Internet 1999-11-17. 69803010 863-511-946-010. 1969. 1967. 1941.
63.... Apte Yang 2 Lewis 3 Cohen 4.. UC SVM. 2 3 SVM UC 4 5.! Lin Shian-Hua.. 2. n n. n!! 2! n.. Saton 0 K 3 tf i > og N / n i w i =. "!. 2 tf j > og N / n j j. SVM. SVM. SVM 8 9. SVM.. UC.... SVM UC tf i N n i. IG! 2 -test CHI MITS..... HTML. HTML. P BRDOCTYPE. TITLEH H2 H6 B UIA HREF =.. HTML Meta name = description content =. UC SVM. SVM Meta name = keywords content = Meta name = cassification content =. TITLE
64 200. H H6 H H6. B U I. URL. Meta.. S = TITLE H H2 H3 H4 H5 H6 B U I URL Meta. W = W A IA S E W A X tf A i X Iog N / I i A S w i = W A X tf A 2 X Iog N / I ~ E E A W A A W TITLE > W H > W H2 > W H3 > > W Meta. tf A i i A. 3 SVM UC WEB. I L = O O2 OI. Oi N i I Oi. Oi I N i E N. = # i. I N i. O L E = x y x 2 y 2 x y! i " I y i + -. y i = + x i O y i = - x i$o. x x O x$ O. x I +. 2. SVM.. SVM. E = z i y i I 2 z i R N y i - + w 6 R w 6 = 2 f w 6 z - y Id P x y. P x y! f w z 6 = sgn w z + 6. w 6 f w z 6 max W O =E Oi - 2 EOiO y i y z i z 0SOi SY 2 E Oi y i = 0. O w = E y ioiz i w. z i I f w z 6 i I = 6. x O z = G x f z = sgn[ E y ioi z z i + 6 ]. f z = x O x. z =G x exp - H 2 x - x ( ih c ). SVM. 3.1 UC UC. r Z = x x 2 x m! i " I. UC Z x i Step. C x 0 x m. Step2. Z = Step3. x i 0 stop. x IumCuster Z x 2 x 3 Z x i 0 arg IumCuster min I = < r Step4. d x i 0 d x i 0 I. x i C C C x i
65 n X 0 + x i c 0 一 n n + 一 n + Go to Step6. Step5.. numcuster 一 numcuster + c numcuster 一 x i 0 numcuster 一 x i. Step6. Z 一 Z - z i go to Step2. UC numcuster m c c 2 c numcuster 0 c n c. Step2 Step6. x i i = 2 m.. numcuster X m numcuster.. UC! " + " - UC. " + = x i I x i y i G E y i " - = x i I x i y i G E y i = = -. " + 0 + 02 + 0 u + " - 0-0 2-0 1 -. x. d + x = min u d x d x 0 + i d - x = min 1 d x 0. ī yx y. d + x < d - x xg! x 奏!. r. UC. UC SVM.!! 0 + 0 - # #> 0! \ d x + - d x - \ <# UC! UC SVM.! \ d x + - d x - \ >#. d x + < d x -! 0 +!G!! 奏!.! UC. 3.2 SVM UC ISUC SVM UC. ISUC. r UC. SVM.. 2. R R > r SVM " + U x I x G " - k 八 x GU B 0 + R i. B 0 + R i 0 i + R. 2. \R. r.. R SVM.
66 200. ISUC Step. i 一 S T 一 1. Step2. i > ugo to Step6 u. Step3. i + d x j i + < R j =!! 2! G! -. Step4. S T 一 S TU!! 2!. Step5. i 一 i + go to Step2. Step6. S T 一 S TU! + S T SVM SVM.! " d x + d x - I d x + - d x - I >".! UC!.! d x + < min U d!!g# ī!! 奏 #. UC!.!. UC. SVM. " ISUC. TISUC. Step. T = 1!G T. Step2. d x + 一 min u Step3. I d x + d! + i d x - 一 min U - d - x I >" go to Step7. d!. ī Step4. SVM f! = sgn [ } i#i! x i + b] Step5. f! =!G#! 奏 #. Step6. go to Step8. Step7. d + x < d - x!g#! 奏 #. Step8. T 一 T -! go to Step. 4 3548 3.. 283.. 3265. 9000 4265... UC SVM..... SVM UC.. SVM UC. SVM. 1 SVM UC % % % SVM 98. 60 90. 97 90. 9 UC r = 0.3 90.97 327 3239 89.32 56 3009 89.48 495 307 UC r = 0.7 96.23 268 2696 89.53 49 253 89.53 424 2596 UC r =.0 96.83 94 87 89.48 34 760 88.68 306 766 UC r =.2 96.5 05 706 87.67 79 483 83.95 88 495 UC r =.4 92.82 74.23 67.59 " # I " - # I 2 = I " I 2 + I # I 2-2 I " I I # I = 2-2 I " I I # I 三 2 $ 三 I " - # I 三 2.
1 67 UC!! = 1" 4 UC "! "! = 1" 2" 105 706 " 2 ISUC "! # "!= 0"3 UC SVM " 2 ISUC! # # NexampIe cut # CaII SVM # SV Precision SVM % ISUCprecision % 0"3 0"5 $ 7820 1553 $ 112 10"03 74"90 0" 3 1" 0 6008 1553 678 83" 25 95" 12 0" 3 1" 3 2351 1553 937 97" 1 98" 81 1" 2 1" 25 4450 2561 543 96" 72 99" 19 1" 2 1" 3 2132 2561 651 97" 54 98" 57 1" 2 1" 4 1023 2561 709 97" 66 98" 62 0"3 0"5 $ 8251 2195 $ 120 21"10 70"09 0" 3 1" 0 5422 2195 835 68" 22 88" 20 0" 3 1" 3 2198 2195 1323 83" 97 91" 86 1" 2 1" 25 5321 3094 1425 83" 29 91" 25 1" 2 1" 3 3800 3094 1672 87" 10 91" 42 1" 2 1" 4 1923 3094 1714 87" 56 91" 42 0"3 0"5 $ 7693 2093 $ 342 26"11 70"51 0" 3 1" 0 5017 2093 625 59" 68 80" 55 0" 3 1" 3 1296 2093 1389 82" 78 91" 80 1" 2 1" 25 5982 3290 1322 87" 78 91" 27 1" 2 1" 3 4336 3290 1481 87" 97 91" 34 1" 2 1" 4 1295 3290 1481 87" 97 91" 34 2! SVM # CaII SVM "! "! # NexampIe cut SVM Precision SVM ISUC ISUCprecision. SVM # SV " "! = 1" 2 #! 1" 25 1" 4 # # SV. ISUCprecision SVM. SVM "! = 1" 2 # = 1" 25 ISUC 1 / 3. UC SVM " " " 5 SVM UC. SVM SVM. SVM. UC SVM... ISUC! #!. 1 Apte C Damerau F Weiss S. Automated Iearning of decision
68 2001 ruies for text categorization. ACM Transactions on Information System 1994 12 3 233-251 2 Yang Y. Expert network Effective and efficient iearning from human decisions in text categorization and retrievai. In Proc Seventeenth Internationai ACM SIGIR Conference on Research and Deveiopment in Information Retrievai Dubiin 1994. 13-22 3 Lewis D D Schapore R E Caiian J P Papka R. Training aigorithms for iinear text ciassifiers. In Proc Nineteenth Internationai ACM SIGIR Conference on Research and Deveiopment in Information Retrievai Zurich 1996. 298-306 4 Cohen W W Singer Y. Context-sensitive iearning methods for text categorization. In Proc Nineteenth Internationai ACM SIGIR Conference on Research and Deveiopment in Information Retrievai Zurich 1996. 307-315 5 Lin Shian-hua. Extracting ciassification knowiedge of internet documents with mining term associations A sementic approach. In Proc Internationai ACM SIGIR Conference on Research and Deveiopment in Information Retrievai Meibourne 1998. 241-249 6 Vapnik V. The Nature of Statisticai Learning Theory. New York Springer-Veriag 1995 7 Vapnik V. Estimation of Dependences Based on Empiricai Data. New York Springer-Veriag 1982 8 Bernhard Schoikopf Sung Kah-Kay et a. Comparing support vector machines with gaussian kerneis to radicai basis function ciassifiers. IEEE Transactions on Signai Processing 1997 45 11 2758-2765 9 Edgar Osuna Robert Freund Federico Girosi. Training support vector machines An appiication to face detection. In Proc IEEE Conference on Computer Vision and Pattern Recognition Puerto 1997. 130-136 10 Saiton. Introduction to Modern Information Retrievai. New York Mc- Graw-hiii Book Company 1983 11 Yang Yi-Ming Jan O Pederson. A comparative study on feature seiection in text categorization. In Proc 14th Internationai Conference on Machine Learning Nashviiie 1997. 412-420 12 Li Xiao-Li Shi Zhong-Zhi. A data mining method appiying to acguire part of speech ruies in Chinese text. Computer Research and Deveiopment Accepted in Chinese..