Method for Using Wikipedia as a Japanese Corpus

Yoichiro HASEBE

Keywords: corpus linguistics, Japanese, Wikipedia

Linguistic research based on large-scale corpora, and the methods such research relies on, have been attracting more and more attention in recent years. Major corpus-construction projects are now being carried out not only for English, which already has several major corpora such as the British National Corpus, but also for many other languages. At present, however, few corpora of written Japanese are widely available to researchers. One reason a large corpus is difficult to come by is that numerous procedures must be completed before copyright issues are cleared; it is not simply a matter of collecting a large amount of text and sharing it among researchers. There is, however, one source to which a great deal of Japanese text is continually submitted and accumulated in a form that is completely open to the public: Wikipedia. Although some restrictions apply, as with any other medium, Wikipedia offers a large body of linguistic data that reflects the present state of both the grammar and the vocabulary of the Japanese language. This makes it well suited to many linguistic approaches taken from a synchronic perspective. Moreover, since a compressed package of all the articles is published regularly for archiving purposes, it is hoped that researchers will also use these data to investigate semi-diachronic phenomena. With these facts as background, this paper proposes a method for utilizing Wikipedia in corpus-based research on written
Japanese. A computational toolkit for efficiently accessing and analyzing the text data in the archived file is presented. The toolkit comprises two programs written in the programming language Ruby. One, called wp2txt.rb, extracts article text from the original XML data and converts the raw text, in which Wiki and HTML tags are scattered throughout, into plain text suitable for analysis. The other, called mconc.rb, matches morphological collocation patterns (described using regular expressions in a configuration file) against the Wikipedia article text. It outputs the results in CSV format so that the data can be processed in ordinary spreadsheet software. Using the archived Wikipedia data and the toolkit introduced in this paper, researchers can easily retrieve examples of a particular linguistic phenomenon along with statistical data about it. Moreover, since both the Wikipedia data and the toolkit are available as open source, the procedure of a particular study and its resulting data can be tested, refined, or expanded by other researchers, making it possible for communities to build an effective inter-research network.
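To give a concrete idea of the first step, the following is a minimal Ruby sketch of the kind of markup stripping that wp2txt.rb performs on text extracted from the XML dump. The method name and the particular substitution rules are illustrative assumptions for this abstract, not the tool's actual interface.

    require 'cgi'

    # Illustrative sketch: strip common Wiki and HTML markup from article text.
    def strip_wiki_markup(text)
      text = CGI.unescapeHTML(text)                   # decode HTML entities such as &amp;
      text = text.gsub(/<ref[^>]*>.*?<\/ref>/m, "")   # drop inline <ref>...</ref> elements
      text = text.gsub(/<[^>]+>/, "")                 # drop remaining HTML tags
      text = text.gsub(/\{\{[^{}]*\}\}/, "")          # drop simple {{...}} templates
      text = text.gsub(/\[\[(?:[^\[\]|]*\|)?([^\[\]]*)\]\]/, '\1')  # keep only link labels
      text = text.gsub(/'{2,}/, "")                   # remove bold/italic quote markup
      text.gsub(/\n{3,}/, "\n\n").strip               # normalize blank lines
    end

    puts strip_wiki_markup("'''コーパス'''とは[[言語|言語資料]]の集合である。<ref>出典</ref>")
    # => コーパスとは言語資料の集合である。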
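The pattern-matching step can be sketched in the same spirit. The file names, the sample collocation pattern, and the CSV column layout below are assumptions for illustration only; in mconc.rb the pattern is read from a configuration file and applied to morphologically analyzed (whitespace-segmented) article text.

    require 'csv'

    # Example collocation pattern over segmented text; in mconc.rb, patterns of
    # this kind are read from a configuration file rather than hard-coded.
    pattern = /(\S+)\s+について\s+(\S+)/

    CSV.open("matches.csv", "w") do |csv|
      csv << ["file", "left", "right", "sentence"]
      Dir.glob("corpus/*.txt").each do |path|        # plain-text articles produced by wp2txt.rb
        File.foreach(path) do |line|
          line.scan(pattern) do |left, right|        # collect every match with its context
            csv << [File.basename(path), left, right, line.strip]
          end
        end
      end
    end

The resulting CSV file can then be opened directly in ordinary spreadsheet software for counting and further analysis, as described above.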