Improvement of Classification Accuracy in an ISCO Automatic Coding System: Results of Experiments Using Both the SSM Dataset and the JGSS Dataset

Kazuko TAKAHASHI
Faculty of International Studies, Keiai University

In social surveys, occupation data collected through open-ended questions must be coded. Manual occupation coding is time-consuming and complicated, and it can produce inconsistent results when the coders are not experts. For this reason, an automatic coding system that combines a rule-based method with Support Vector Machines (SVMs) has been developed and used for SSM occupation codes, which are commonly used in Japanese social surveys. Recently, coders are often asked to assign both SSM occupation codes and ISCO (International Standard Classification of Occupations) codes, so an automatic coding system for ISCO codes is also needed. This paper reports the results of experiments designed to improve classification accuracy in the ISCO automatic coding system, using real datasets from the 2005 SSM survey (2005SSM) and the JGSS.

Key Words: JGSS, ISCO automatic coding system, Support Vector Machines
1. ISCO International Standard Classification of Occupation SSM JGSS 2 1 1984 (1) 2004, 2006 (2) 1988 ILO ISCO -88 ISCO Bureau of Statistics, International Labour Office 2001 ISCO-68 JSCO ; Japanese Standard Classification of Occupation JSCO JSCO SSM Social Stratification and Social Mobility 95 SSM 1995 SSM 1995 2000, 2004 SSM ISCO 2 ISCO 4 10 28 116 390 task duty job SSM 3 196 ISCO ISCO-68 skill level skill specialization 2008 SSM ISCO (3) (4) Kunz 2003, Creecy et al. 1992, Riviere 1994, 2004 2-gram (5) 3-gram 2 SSM 2001; 2002; 2003; 2004; 2005, Takahashi et al. 2005 ISCO SSM ISCO SSM SSM ISCO ISCO 194
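The 2-gram features mentioned above can be illustrated with the example from note 5, "work at the office". The following is a minimal sketch, not the system's actual feature extractor. It assumes whitespace tokenization and character 2-grams taken within each word; the function names `word_ngrams` and `char_ngrams` are hypothetical.

```python
def word_ngrams(text, n=2):
    """Return word n-grams of a whitespace-tokenized answer (assumed tokenization)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n=2):
    """Return character n-grams taken inside each word (an assumption of this sketch)."""
    grams = []
    for token in text.split():
        grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams


answer = "work at the office"
print(word_ngrams(answer))  # ['work at', 'at the', 'the office']
print(char_ngrams(answer))
```

For Japanese answers, a morphological analyzer would replace the whitespace split used here.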
2 SSM ISCO ISCO (6) ISCO ISCO 400 2007; 2008 ISCO 2003 SSM 767 2.2 2010 ISCO (7) ISCO 2005 SSM 2005SSM JGSS ISCO 2008 SSM 2005SSM JGSS JGSS (8) JGSS SSM 2005 SSM ISCO ISCO ISCO 2 2005SSM JGSS ISCO 3 4 5 195
2. SSM ISCO 2.1 SSM SSM 3 1 (9) 2000 2 (10) 2004 3 4 (11) 2005, Takahashi et al. 2005 add-code add-code JGSS-2003 80.7 (12) 2004 ISCO 1 2.2 ISCO 2.2.1 add-code add-code ISCO 2006 SSM ISCO add-code 2 SSM add-code SSM SSM SSM 2 2008 2003 SSM SSM SSM 3.3 9.6 SSM SSM SSM SSM SSM 2.2.2 1 ISCO SSM 2008 2008 1 196
1 2008 4 4 2 3 1 7 1.3 5 2008 ISCO 3. 3.1 ISCO 2 1 2 SSM JGSS 1 SSM 2 SSM 4 2005 SSM baseline (13) (13) 10 13 SSM 200 SSM 3.2 3.2.1 SVM (14) SVM 2 one-versus-rest Kressel 1999 SVM SSM 3.2.2 2005SSM JGSS JGSS-2006 JGSS-2008 JGSS-2010 2 2005SSM 4,133 5,542 2,915 3,499 16,089 JGSS-2006 JGSS-2008 JGSS-2010 ISCO SSM 2,224 1,375 2,570 197
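Because an SVM is a binary classifier, the one-versus-rest scheme named in Section 3.2.1 trains one classifier per occupation code and assigns the code whose classifier returns the largest decision value. The sketch below shows only that decision rule, under stated assumptions: the linear weights, biases, feature names, and code labels (`code_A` and so on) are hypothetical toy values, not trained models or real ISCO codes.

```python
def decision_value(weights, bias, features):
    """Linear decision value w.x + b for one binary (code vs. rest) classifier."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items()) + bias


def predict_one_vs_rest(models, features):
    """Score every per-code classifier and return the best code with all scores."""
    scores = {code: decision_value(w, b, features) for code, (w, b) in models.items()}
    best_code = max(scores, key=scores.get)
    return best_code, scores


# Hypothetical toy models: code -> (feature weights, bias).
MODELS = {
    "code_A": ({"teach": 1.0, "school": 0.5}, -0.2),
    "code_B": ({"drive": 1.2, "truck": 0.8}, -0.1),
    "code_C": ({"sell": 0.9, "shop": 0.7}, -0.3),
}

best, scores = predict_one_vs_rest(MODELS, {"drive": 1.0, "truck": 1.0})
print(best)  # code_B
```

Returning the full score dictionary alongside the winner keeps the second- and third-ranked candidates available, which is useful when a human coder reviews the system's suggestions.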
2 2 5 4/5 1/5 5 5 2 3-1 3-2 2 3.1 4 1 2 1-1 2005SSM JGSS-2006 JGSS-2008 JGSS-2010 5 1 2005SSM & JGSS-2006 2005SSM & JGSS-2006 & JGSS-2008 1-2 2005SSM & JGSS-2006 & JGSS-2008 & JGSS-2010 5 2005SSM 2005SSM JGSS-2006 2 JGSS-2008 2 JGSS-2010 3-1 JGSS-2006 JGSS-2006 & JGSS-2008 JGSS-2010 3-2 2005SSM & JGSS-2006 2005SSM & JGSS-2006 & JGSS-2008 3.2.3 classification accuracy recall 4. 2 3.1 4.1 1-1 1-2 1-1 SSM 3 2.2.2 2005SSM 198
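The 4/5 training and 1/5 test division described in Section 3.2.2, and the classification accuracy of Section 3.2.3, can be sketched as follows. This is an illustration under assumptions rather than the experimental code: it assumes a five-fold rotation with interleaved fold assignment and defines accuracy as the share of answers whose predicted code equals the gold code. The function names are hypothetical.

```python
def five_fold_splits(n_items, k=5):
    """Yield (train_indices, test_indices) pairs; each fold is held out once."""
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train_idx, test_idx


def accuracy(gold_codes, predicted_codes):
    """Fraction of items whose predicted code matches the gold code."""
    if len(gold_codes) != len(predicted_codes):
        raise ValueError("gold and predicted lists must have the same length")
    if not gold_codes:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(g == p for g, p in zip(gold_codes, predicted_codes))
    return correct / len(gold_codes)


# With 2,570 items (the JGSS-2010 size in Table 2), each split has 2,056 training items.
for train_idx, test_idx in five_fold_splits(2570):
    print(len(train_idx), len(test_idx))  # 2056 514
```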
3 4 2005SSM JGSS-2006 JGSS-2008 JGSS-2010 15,271 1,779 1,086 2,056 baseline 0.6834 0.5323 0.5356 0.6051 0.6832 0.5323 0.4945 0.6051 0.7448 0.5863 0.6571 0.6882 0.7425 0.5863 0.6531 0.6882 1-2 2005SSM JGSS 4 SSM 2005SSM 6.1 JGSS 17.4 1-1 13 2.2.2 1-1 2.2.2 7 5 1 4 4 4 2005SSM & JGSS-2006 17,050 2005SSM & JGSS-2006 & JGSS-2008 18,136 2005SSM & JGSS-2006 & JGSS-2008 & JGSS-2010 20,192 baseline 0.6780 0.6308 0.5513 0.6785 0.6342 0.5536 0.7368 0.7156 0.7252 0.6833 0.6849 0.6851 4.2 2 3-1 3-2 JGSS 2005SSM 2 2005SSM 2005SSM JGSS 5 2005SSM 5 2005SSM 2005SSM 6.6 199
10.3 JGSS 3 JGSS-2008 2005SSM JGSS-2006 JGSS-2010 4 ISCO JGSS-2010 3 ISCO 3418 5249 5164 2008 (15) JGSS-2010 3410 3412 0110 JGSS-2010 JGSS 2005SSM JGSS-2010 2005SSM JGSS JGSS-2010 SSM SSM 5 2005SSM JGSS 2005SSM 16,089 2005SSM JGSS-2006 JGSS-2008 JGSS-2010 0.6834 0.5899 0.5770 0.5697 0.6832 0.5863 0.5925 0.5805 0.7448 0.6093 0.6699 0.6997 0.7425 0.6057 0.7015 0.6482 0.7010 0.5978 0.6352 0.6245 3-1 2005SSM JGSS 6 JGSS-2006 JGSS-2008 JGSS-2010 6 JGSS JGSS-2006 JGSS-2008 SSM 61.5 JGSS 5 8.5 2005SSM 200
6 JGSS JGSS-2010 JGSS-2006 JGSS-2006 & JGSS-2008 2,224 3,581 0.5148 0.5529 0.5148 0.5502 0.5852 0.6148 0.5852 0.6109 3-2 2005SSM JGSS 7 JGSS-2010 2005SSM JGSS-2006 JGSS-2008 SSM 5 7 2005SSM 4.6 SSM 5 7 2005SSM JGSS 7 2005SSM JGSS JGSS-2010 2005SSM & JGSS-2006 2005SSM & JGSS-2006 & JGSS-2008 18,313 19,670 0.5724 0.6016 0.5856 0.5969 0.6412 0.6533 0.6191 0.6521 ISCO 20,000 1 SSM 2 SSM 2005SSM SSM 2005SSM JGSS JGSS 1 201
1 3 X Y JGSS-2010 JGSS 2005SSM 2005SSM JGSS 1 5 (16) Takahashi et al. 2008 1 JGSS-2006 & JGSS-2008 2 2005SSM 3 2005SSM & JGSS-2006 & JGSS-2008 5. ISCO SSM JGSS SSM SSM ISCO SSM ISCO SSM ISCO 1 2 ISCO ISCO ISCO Web SSM ISCO 202
Acknowledgement General Social Surveys JGSS JGSS 2005 SSM 2005 SSM 1 SOC 2000 ASCO2 SOC NOC-S2001 353 340 449 520 ASCO2 SOC 986 821 2 1970 1980 Rubin 2004 3 SSM95 ISCO-88 2005 SSM 2004 2003 1 1 30 40 2005 SSM 2005 SSM ISCO SSM ISCO SSM ISCO 2008 4 Precision Data AIOCS PACE ACTR SICORE Keogh 1998 5 work at the office 2-gram work at at the the office 2-gram wo or ic ce 6 ISCO 7 2006 8 JGSS 34,521 JGSS-2003 15,000 8.3 9 1995 SSM 1996, 2001, 2006 ROCCO Rule-based OCcupation COding ROCCO JGSS 10 SVM Vapnik 1998, Sebastiani 2002 SVM 11 12 13,300 20,000 13 JUMAN 1998 203
14 http://chasen.org/~taku/software/tinysvm/ 3.3.1 15 3 1 16 SVM

1995 SSM, 1995, SSM 95 1995 SSM.
1995 SSM, 1996, 1995 SSM 1995 SSM.
2005 SSM, 2004, SSM95 ISCO-88.
Bureau of Statistics, International Labour Office, 2001, Coding Occupation and Industry, Bureau of Statistics; International Labour Office.
Creecy, R. H., Mas, B. M., Smith, S. J., and Waltz, D. L., 1992, Trading MIPS and Memory for Knowledge Engineering, Communications of the ACM 35(8): 48-63.
Dumais, S., Platt, J., Heckerman, D., and Sahami, M., 1998, Inductive Learning Algorithms and Representations for Text Categorization, Proceedings of the ACM-CIKM98: 145-155.
Gillman, D. W., and Appel, M. V., 1999, Developing an Automated Industry and Occupation Coding System for CENSUS 2000, 2000 Proceeding of the American Statistical Association Annual Meeting, Government Statistics Section.
, 1984,.
, 2006, 2005 SSM SSM95.
Joachims, T., 1998, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, Proceedings of the European Conference on Machine Learning: 137-142.
, 2008, SSM EGP SIOPS ISEI 2005 SSM 12 16 19 2005 SSM : 69-94.
Keogh, G., 1998, Automatically Coding Occupation Description from the 1996 Census of Population of Ireland, Technical report, Central Statistics Office (CSO).
Kunz, C., 2003, CENSUS: OCCUPATION (Census Paper No.03/06), Australian Bureau of Statistics.
, 1998, JUMAN version 3.61.
Kressel, U., 1999, Pairwise Classification and Support Vector Machines, Schölkopf, B., Burges, C. J. C., and Smola, A. J. [eds.], Advances in Kernel Methods: Support Vector Learning, The MIT Press, 255-268.
, 2006,.
, 2001, SSJ Data Archive Research Paper Series JGSS-2000.
, 2006, No.57.
, 2004, 2 : 46-77.
Riviere, P., 1997, SICORE - general automatic coding system, Statistical Data Editing Vol.2 Methods and Techniques, United Nations Statistical Commission and Economic Commission for Europe, 222-231.
Rubin, D. B., 2004, Multiple Imputation for Nonresponse in Surveys, John Wiley & Sons, Hoboken, New Jersey.
Sebastiani, F., 2002, Machine Learning in Automated Text Categorization, ACM Computing Surveys 34(1): 1-47.
, 2004,.
, 2000, SSM 15(1): 149-164.
, 2001, A 2 8(1): 31-52.
, 2002, JGSS-2000 General Social Surveys 1: 171-183.
, 2003, JGSS-2001 General Social Surveys 2: 179-191.
, 2004, ROCCO SVM General Social Surveys 3: 163-174.
, 2004, 15(1): 177-196.
, 2005, 12(2): 3-24.
Takahashi, K., Takamura, H., and Okumura, M., 2005, Automatic Occupation Coding with Combination of Machine Learning and Hand-Crafted Rules, Bao, H. T., David, C., and Huan, L. [eds.], Advances in Knowledge Discovery and Data Mining, Lecture Notes in Computer Science, Lecture Notes in Artificial Intelligence 3518: 269-279, Springer-Verlag, Berlin Heidelberg.
, 2008, ISCO 2005 SSM 12 16 19 2005 SSM : 47-68.
, 2008, 15(2): 3-38.
Takahashi, K., Takamura, H., and Okumura, M., 2008, Direct estimation of class membership probabilities for multiclass classification using multiple scores, Knowledge and Information Systems (KAIS) 19(2): 185-210, Springer-Verlag, London.
, 2010, 24 https://kaigi.org/jsai/webprogram/2010/pdf/260.pdf
, 2008, SSM ISCO-88 2005 SSM - 16 19 2005 SSM, 31-47.
, 2008,.
Vapnik, V., 1998, Statistical Learning Theory, John Wiley, New York.