wnbahasa_kuri0_mac


Orthographic variation problems and the Japanese Wordnet
Takayuki KURIBAYASHI
Division of Linguistics and Multilingual Studies, Nanyang Technological University

At the outset: what is "orthographic variation"?
Words can be written in more than one form.
Orthographic variants share the same meaning and the same reading.
English has only a few such patterns, e.g. center / centre, color / colour.
2

Source of orthographic variation problems in Japanese 3

The 3 scripts in Japanese
Kanji ("漢字", Chinese characters): ideograms. A kanji sometimes has different shapes, and kanji are combined in a string.
  New letter shape: "学校"
  Old letter shape: "學校"
Kana: phonograms, of 2 types.
  Katakana: "ガッコウ"
  Hiragana: "がっこう"
(The examples are for gakkou, "school".)
4

Choice of the script(s)
In modern Japanese, a word string usually consists of a single script or of kanji + hiragana.
The choice of script depends on the writer and on the type of document, e.g. [犬, イヌ, いぬ] for "dog".
In informal documents such as novels and blogs, the choice depends more on the writer.
5

Kanji + hiragana strings
Kanji often need okurigana ("送り仮名", accompanying letters).
To begin with, Japanese readings do not fit the kanji's original readings.
Most kanji have more than one meaning.
Okurigana are needed to reduce the ambiguity.
6

Examples of okurigana
"重": original readings juu, chou
"重": e, juu (numeral classifier)
"重い": omo-i, "heavy"
"重さ": omo-sa, "weight"
"重ねる": kasa-neru, "pile"
"重なる": kasa-naru, "overlap"
"重ねて": kasa-nete, "again"
7

Okurigana rules
The Japanese government has issued a guideline for okurigana, but it is applied only in newspapers, official documents, legal texts, and so on.
There is no strict rule for usage in other kinds of writing.
The conjugating part cannot be omitted.
Omitting okurigana is not recommended if it obstructs disambiguation, that is, if the reader can no longer tell which word is meant.
8

Sources of orthographic variation (review)
The script to use is decided freely. Scripts: kanji, katakana and hiragana.
Kanji often need okurigana. How much okurigana to use is relatively free, too.
The choices depend on the type of document and/or the writer's liking.
9

Other examples of variation
osoroshii (terrible): 4 written forms
hifu (skin): 5 written forms
mazeawaseru (mix): consists of mazeru & awaseru, giving 8 x 4 = 32 variants
  mazeru (mix): 8 written forms
  awaseru (combine): 4 written forms
10
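The multiplication behind the 32 variants of mazeawaseru can be sketched as a Cartesian product. This is only an illustration: the spellings below are example forms chosen for this sketch (with the first verb in its stem form, as it appears inside a compound), not the lists shown on the slide, and `compound_variants` is a hypothetical helper name.

```python
from itertools import product

# Illustrative spellings only (assumption); the slide's own lists are not reproduced.
MAZE_STEMS = ["混ぜ", "交ぜ", "雑ぜ", "まぜ", "混", "交", "雑", "マゼ"]  # 8 forms
AWASERU_FORMS = ["合わせる", "合せる", "あわせる", "併せる"]          # 4 forms

def compound_variants(first_parts, second_parts):
    """Return every concatenation of a first-part form with a second-part form."""
    return [a + b for a, b in product(first_parts, second_parts)]

variants = compound_variants(MAZE_STEMS, AWASERU_FORMS)
print(len(variants))  # number of compound spellings generated
```

The point of the sketch is that variation in each component multiplies, so a compound can have far more written forms than either of its parts.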

Actual problems 11

Actual problems
1. Japanese Wordnet (JWN) 1.1 does not cover all the variants. This affects coverage when annotating corpora.
2. A variant sometimes appears in one synset but is missing from other synsets.
3. Are the numbers of synonyms and senses (synonym-synset pairs) reasonable? We counted variant forms of the same word separately.
12

1. Strings not covered when annotating
In a newspaper corpus (Kyoto University Text Corpus): for boukuu-gou ("bombproof"), the form in the text is not covered; JWN has a different form in 02868638-n.
In a novel: for ayaui ("dangerous"), the form in the text is not covered; JWN has a different form in 02058794-a.
In an old Japanese novel: did some Meiji-era novelists prefer one form to another?
13

Actual problems (continued)
2. A variant sometimes appears in one synset but is missing from other synsets, e.g. one form of a word appears in 6 synsets while another form of it appears in 5 synsets.
14

Actual problems (continued)
3. Are the numbers of synonyms and senses (synonym-synset pairs) reasonable? We counted variant forms of the same word separately.
15

To solve the problem 16

Our method
1. Create variant sets with help from open-licensed dictionaries.
2. Apply the variant sets to the JWN 1.1 synonyms.
3. Hand-check, adding and grouping variants.
17

Dictionaries
3 dictionaries:
JUMANdic, by Kyoto University, for their morphological analysis system JUMAN. Entries can be grouped by canonical form & reading.
JMdict, managed by EDRDG. Entries can be grouped by meaning & reading.
IPAdic, by NAIST. Used only to give readings to synonyms that are in neither JUMANdic nor JMdict.
(Step 1: create variant sets)
18

Merging 2 dictionaries
Merge the JUMANdic entries and the JMdict entries that can be identified as the same word or as its variants.
e.g. arasu (desolate): the JUMANdic entry with 2 forms and the JMdict entry with 3 forms are merged into one set of 3 forms.
19
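One way to picture the merge is to group entries by reading and join those that share at least one written form. The sketch below is an assumption about how "can be identified as the same word" might be approximated, not the authors' procedure; `merge_entries`, the entry layout `(reading, forms)` and the arasu spellings are all illustrative, and readings are assumed to be already normalized to katakana.

```python
from collections import defaultdict

def merge_entries(entries_a, entries_b):
    """Merge (reading, forms) entries that share a reading and at least one form."""
    by_reading = defaultdict(list)
    for reading, forms in list(entries_a) + list(entries_b):
        by_reading[reading].append(set(forms))

    merged = []
    for reading, groups in by_reading.items():
        result = []  # mutually disjoint sets built so far for this reading
        for group in groups:
            # Absorb every already-built set that overlaps with this group.
            for existing in [s for s in result if s & group]:
                group = group | existing
                result.remove(existing)
            result.append(group)
        merged.extend((reading, sorted(s)) for s in result)
    return merged

# Illustrative toy entries (assumption), keyed by katakana reading.
jumandic = [("アラス", ["荒らす", "荒す"])]
jmdict = [("アラス", ["荒らす", "荒す", "あらす"])]
merged = merge_entries(jumandic, jmdict)
print(merged)
```

Requiring a shared reading keeps homographs with different readings apart, which matters later when the sets are applied to synsets.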

Giving readings
Give each merged set a katakana string as its reading, obtained by converting the hiragana string in JMdict.
20
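The hiragana-to-katakana conversion can be sketched with the Unicode layout of the two kana blocks: each hiragana letter in U+3041 to U+3096 has its katakana counterpart exactly 0x60 code points higher. The function name is hypothetical, and this is a minimal sketch rather than the conversion actually used for JWN.

```python
def hiragana_to_katakana(text):
    """Shift hiragana letters (U+3041..U+3096) to katakana by the fixed offset 0x60."""
    return "".join(
        chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )

print(hiragana_to_katakana("がっこう"))
```

Characters outside the hiragana range, such as kanji or the long-vowel mark, pass through unchanged.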

Why do we need kana strings?
Kana serve as phonograms in Japanese, so adding reading information amounts to adding kana strings.
In addition, a difference in reading can contribute to Word Sense Disambiguation (WSD) in some cases.
e.g. a string can be read as: a) tsura, omote, men; b) men.
21

Giving readings (cont'd)
If a synonym is in neither JUMANdic nor JMdict, run morphological analysis and give it a reading with IPAdic.
e.g. jouhoukikan (intelligent agent): IPAdic gives a reading for each of the two morphemes, and the two readings are joined into the reading of the whole string.
22

In case readings are not found
Give the synonym a tag that means its reading is unknown.
e.g. suidan ("play", in 01725051-v): the set carries the tag YOMI in place of a reading.
23
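The fallback order described on the last slides (dictionary reading first, then a reading assembled from morphemes, then an unknown tag) can be sketched as follows. Everything here is a stand-in: the toy dictionaries, the `toy_segment` lookup that replaces a real morphological analyzer, and the tag name `YOMI_UNKNOWN` are assumptions for illustration only.

```python
UNKNOWN_READING = "YOMI_UNKNOWN"  # hypothetical tag name, not the one used in JWN

# Toy data (illustrative assumptions).
KNOWN_READINGS = {"荒らす": "アラス"}                       # merged dictionary sets
MORPHEME_READINGS = {"情報": "ジョウホウ", "機関": "キカン"}  # IPAdic-like readings
TOY_SEGMENTS = {"情報機関": ["情報", "機関"]}

def toy_segment(word):
    """Stand-in for a morphological analyzer: look up a fixed segmentation."""
    return TOY_SEGMENTS.get(word, [])

def reading_for(word):
    """Return a katakana reading, or the unknown tag if none can be found."""
    if word in KNOWN_READINGS:
        return KNOWN_READINGS[word]
    parts = toy_segment(word)
    if parts and all(part in MORPHEME_READINGS for part in parts):
        # Join the readings of the morphemes into one reading for the string.
        return "".join(MORPHEME_READINGS[part] for part in parts)
    return UNKNOWN_READING

print(reading_for("情報機関"))
```

Keeping an explicit unknown tag, rather than an empty string, makes the unresolved entries easy to find during hand checking.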

Deciding the display form
Decide a display form for each variant set.
We do not say "standard form", since no one can decide an undisputed one.
The display form serves merely to:
  create a key for each set,
  show only one form when searching JWN,
  generate sentences.
24

The display form is the variant that, in order of priority:
1. Has the highest frequency (N/A as of now)
2. Agrees with JUMANdic's canonical form
3. Consists of more Chinese characters
4. Consists of more new-letter-shape characters
5. Is longer, if 1 to 4 cannot settle it
25
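The priority list above maps naturally onto a tuple-valued sort key, where earlier elements dominate later ones. The sketch below is one possible reading of the five criteria: `display_form` is a hypothetical name, kanji are detected with the basic CJK range U+4E00 to U+9FFF, and `OLD_SHAPE_KANJI` is a tiny illustrative set rather than a real table of old letter shapes.

```python
OLD_SHAPE_KANJI = {"學", "國", "體"}  # tiny illustrative set (assumption)

def is_kanji(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def display_form(variants, canonical=None, frequency=None):
    """Pick one variant by the five criteria, compared in priority order."""
    frequency = frequency or {}  # criterion 1 is a no-op when no data is given

    def rank(form):
        kanji = [ch for ch in form if is_kanji(ch)]
        new_shape = [ch for ch in kanji if ch not in OLD_SHAPE_KANJI]
        return (
            frequency.get(form, 0),  # 1. higher frequency
            form == canonical,       # 2. agrees with the canonical form
            len(kanji),              # 3. more Chinese characters
            len(new_shape),          # 4. more new-letter-shape characters
            len(form),               # 5. longer string
        )

    return max(variants, key=rank)

forms = ["がっこう", "ガッコウ", "學校", "学校"]
print(display_form(forms))
```

With no frequency data and no canonical form, the two kanji spellings tie on criterion 3 and the new-shape spelling wins on criterion 4.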

Creating the key
To make a variant set's ID, append one digit to its display form.
This is to deal with variant sets that have the same display form.
e.g. in a set whose display form carries the digit 0, the first element is the key and the last element is the reading.
==> Hand-check all variant sets (done)
26
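A minimal sketch of the key scheme: each display form gets a running digit, so that two sets with the same display form still receive distinct IDs. `assign_keys` and the sample forms are illustrative assumptions; a single digit also implies that at most ten sets can share one display form.

```python
from collections import Counter

def assign_keys(display_forms):
    """Append a running digit to each display form so that keys are unique."""
    seen = Counter()
    keys = []
    for form in display_forms:
        keys.append(f"{form}{seen[form]}")
        seen[form] += 1
    return keys

# Illustrative input: two sets share the display form "面" (assumption).
print(assign_keys(["面", "面", "顔"]))
```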

Apply variant sets
Apply the hand-checked variant sets to the JWN 1.1 synonyms: when a synonym is in a variant set, that set is applied to the synset.
e.g. a string that appears in 6 variant sets has all 6 sets applied to each JWN synset that contains it.
Hand-check again to remove variant sets that were applied incorrectly.
e.g. in 03724870-n ("mask"), the set whose string is read as men is kept and the set read as tsura is removed.
(Step 2: apply variant sets to JWN)
27
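The application step can be sketched as an inverted index from written form to variant-set keys, followed by a lookup for every synonym in a synset. The data layout, the function names and the two toy variant sets are assumptions made for this sketch; the over-application it produces is exactly what the second hand check has to prune.

```python
from collections import defaultdict

def build_index(variant_sets):
    """Map each written form to the keys of all variant sets that contain it."""
    index = defaultdict(list)
    for key, (_reading, forms) in variant_sets.items():
        for form in forms:
            index[form].append(key)
    return index

def apply_variant_sets(synsets, variant_sets):
    """Attach to each synset every variant set that contains one of its synonyms."""
    index = build_index(variant_sets)
    applied = {}
    for synset_id, synonyms in synsets.items():
        keys = []
        for synonym in synonyms:
            for key in index.get(synonym, []):
                if key not in keys:
                    keys.append(key)
        applied[synset_id] = keys
    return applied

# Toy data (illustrative assumptions): one string, two readings.
variant_sets = {
    "面0": ("メン", ["面", "めん", "メン"]),
    "面1": ("ツラ", ["面", "つら", "ツラ"]),
}
synsets = {"03724870-n": ["面", "仮面"]}
applied = apply_variant_sets(synsets, variant_sets)
print(applied)
```

Both sets are attached here because the lookup is purely string-based; deciding that only the men reading fits the "mask" synset is left to the hand check.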

Status of the JWN (as of Jan 2016)
91,961 unique words
83,174 variant sets
213,986 unique strings
158,074 senses (synset-synonym pairs)
148,005 synset-variant set pairs
449,240 synset-string pairs
(the numbers include error correction)
(Results of applying)
28

Examples
02765464-v ("absorb, take in"): a variant set (key digit 0) shown with its forms, next to the JWN 1.1 synonym list of the synset.
29

Coverage (as of 2012)
Corpus                    Total words  Content words  Covered content words  Coverage
Dancing Men                    13,483          4,752                  3,874     81.5%
                                                                      4,332     91.2%
Speckled Band                  13,896          4,848                  4,097     84.5%
                                                                      4,501     92.8%
Cathedral & Bazaar             18,067          7,509                  5,858     78.0%
                                                                      6,618     88.1%
Kyoto Corpus (articles)        24,615         11,939                  9,385     78.6%
                                                                      9,766     81.8%
Kyoto Corpus (editorial)       27,906         13,300                 10,958     82.4%
                                                                     11,542     86.8%
30
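Each coverage figure in the table is the number of covered content words divided by the number of content words. The short check below recomputes all ten percentages from the table's own numbers; the slide gives two covered-word figures per corpus without a surviving label for what distinguishes them, so the sketch simply calls them "first" and "second".

```python
# (corpus, content words, first covered count, second covered count), from the table.
ROWS = [
    ("Dancing Men", 4752, 3874, 4332),
    ("Speckled Band", 4848, 4097, 4501),
    ("Cathedral & Bazaar", 7509, 5858, 6618),
    ("Kyoto Corpus (articles)", 11939, 9385, 9766),
    ("Kyoto Corpus (editorial)", 13300, 10958, 11542),
]

def coverage(covered, content_words):
    """Coverage in percent, rounded to one decimal place."""
    return round(100 * covered / content_words, 1)

results = {
    name: (coverage(first, total), coverage(second, total))
    for name, total, first, second in ROWS
}
for name, (first_pct, second_pct) in results.items():
    print(f"{name}: {first_pct}% / {second_pct}%")
```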

Problems and future work 31

Increased ambiguity
1. The hand checking takes time.
The data before checking contained many errors that come from ambiguity, since we prioritized improving coverage.
Kana strings especially increase ambiguity.
e.g. each kana string read tai in JWN 1.1 had 10 variant sets applied before checking.
32

Rare forms
2. A variant set contains rare forms in some cases, which increases ambiguity.
Rare forms should be removed, or suppressed from appearing, by using frequency data in the future.
e.g. a rare form in the variant set for tsura.
33

Need to merge further
3. Not all the variants are merged into each variant set.
Target: strings that are in neither JUMANdic nor JMdict.
Variant sets that appear in the same synset and have the same reading should be merged (such as in 02765464-v, slide 29).
Reading (kana string) information is important in this respect, too.
34
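Finding the merge candidates described here amounts to grouping the variant sets of one synset by reading and keeping the groups with more than one member. The sketch assumes each variant-set key maps to a katakana reading; `merge_candidates` and the toy keys are hypothetical, and the output is meant as a list for hand checking, not an automatic merge.

```python
from collections import defaultdict

def merge_candidates(synset_keys, readings):
    """Group a synset's variant-set keys by reading; return groups of two or more."""
    groups = defaultdict(list)
    for key in synset_keys:
        groups[readings[key]].append(key)
    return [keys for keys in groups.values() if len(keys) > 1]

# Toy data (illustrative assumptions).
readings = {"吸う0": "スウ", "すう0": "スウ", "飲む0": "ノム"}
candidates = merge_candidates(["吸う0", "すう0", "飲む0"], readings)
print(candidates)
```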

Relationship with OMW
4. This attempt has proceeded independently of our Open Multilingual Wordnet.
Errors are being corrected on both sides independently.
How should the data be merged?
35

Conclusion
We need to handle orthographic variants. Without them, our coverage is poor.
We need to group variants. We do this by:
  finding one or more dictionaries in which orthographic variants are grouped,
  connecting the dictionaries to your Wordnet by reading information,
  checking the results.
36