JPH01205377A

JPH01205377A - Japanese language document analyzing device

Info

Publication number: JPH01205377A
Application number: JP8830188A
Authority: JP
Inventors: Shinichiro Takagi; 伸一郎高木; Tsuneo Yasuda; 安田　恒雄; Katsumi Shimazaki; 島崎　勝美; Satoru Ikehara; 池原　悟
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1988-02-12
Filing date: 1988-02-12
Publication date: 1989-08-17
Anticipated expiration: 2010-02-22
Also published as: JPH0715690B2

Abstract

PURPOSE: To improve word recognition accuracy and analysis speed by adding, to a document analysis device, a dictionary search information extraction unit that extracts the dictionary search information flag attached to a word extracted by a kanji term word extraction unit, and a word string narrowing unit that narrows down the word strings generated by a word string generation unit according to that flag. CONSTITUTION: A clause cutting unit 3 extracts clauses from a database 2, and a word candidate extraction unit 22 consults a dictionary 21 for each clause and exhaustively extracts word strings. The extraction unit 22 comprises a kanji term word extraction unit 23 that extracts words of two or more characters, a single-kanji word extraction unit 24, and a dictionary search information extraction unit 25 that extracts the dictionary search information flag attached to each extracted word. During word candidate extraction, under the extraction control unit 8, whether to extract the single-kanji words contained in an extracted word is decided according to that word's dictionary search information flag, so that candidates are extracted efficiently. Word string candidates are then generated according to positional connection conditions, and a word string narrowing unit 26 narrows them down, for example by deletion, according to the flag. As a result, word recognition accuracy improves and analysis becomes faster.

Description

[Detailed Description of the Invention]

[Industrial Application Field]

The present invention relates to a Japanese document analysis device that, in order to create a Japanese document database, analyzes Japanese document character strings of mixed kanji and kana read from an input device into Japanese words.

[Conventional technology]

Large volumes of Japanese documents such as newspaper articles, manuscripts for publication, and scientific and technical papers are converted into electronic files to create Japanese document databases. When such a database is used to build a language processing system that detects errors such as typographical errors, translates the Japanese text into other languages, or converts kanji to kana and outputs the result as synthesized Japanese speech, morphological analysis, the foundation of all natural language processing, is indispensable.

In word analysis of Japanese text, to improve analysis speed, processing is generally adopted in which words of two or more characters containing kanji are extracted preferentially, and the numerous single-kanji words are searched for only when no kanji term word of two or more characters containing kanji is found.

FIG. 4 is a block diagram of a conventional word analysis system.

In the figure, reference numeral 1 denotes the word analysis processing device as a whole. Numeral 2 denotes an input Japanese text database in which Japanese text read by a Japanese input device such as a kanji reader, pen-touch device, or keyboard is recorded on a magnetic device in the form of character codes. Numeral 3 denotes a clause cutting unit that extracts, from the Japanese character string recorded in database 2, clauses consisting of independent words or attached words at the points where the character type changes. Numeral 4 denotes a Japanese word dictionary that stores, for each word, its heading, reading, grammatical information, and attribute information. Numeral 5 denotes a word candidate extraction unit that consults the Japanese word dictionary 4 for each clause and exhaustively extracts word strings. Numeral 6 denotes a kanji term word extraction unit, inside the word candidate extraction unit 5, that extracts words of two or more characters containing kanji. Numeral 7 denotes a single-kanji word extraction unit, also inside unit 5, that extracts words consisting of one kanji. Numeral 8 denotes a word candidate extraction control unit that controls word candidate extraction, including candidates of character types other than kanji. Numeral 9 denotes a connected word string generation unit that creates candidate chains of positionally connectable words from the extracted word strings. Numeral 10 denotes a connection test unit that applies connection conditions between word headings or parts of speech to the word strings from unit 9 and deletes those that violate them grammatically. Numeral 11 denotes a recording device that records the word recognition results.
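The clause cutting unit works from the points where the character type changes. A minimal Python sketch of locating those change points follows. The Unicode ranges, the function names `char_type`, `change_points`, and `split_runs`, and the sample string are illustrative assumptions rather than details from the patent, and the sketch stops short of deciding which runs together form a clause.

```python
from itertools import groupby

def char_type(ch):
    """Classify one character by script (simplified, assumed Unicode ranges)."""
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    if "\u3040" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    return "other"

def change_points(text):
    """Indices at which the character type differs from the previous character."""
    return [i for i in range(1, len(text))
            if char_type(text[i]) != char_type(text[i - 1])]

def split_runs(text):
    """Group the text into maximal runs of a single character type."""
    return [(kind, "".join(chars)) for kind, chars in groupby(text, key=char_type)]

sample = "その間引き取る"  # hypothetical input, not the string shown in the figures
print(change_points(sample))
print(split_runs(sample))
```

A real clause cutter would combine such runs (for example a kanji run followed by its kana suffixes) into clauses; that rule is not specified in this text and is left out here.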

FIG. 5 shows an example of word analysis processing by the word analysis processing system configured as described above. The original character string 12 shown in FIG. 5 is divided into clauses by the clause cutting unit 3; the Japanese word dictionary 4 is then searched, and the word candidate extraction unit 5 and the word candidate extraction control unit 8 extract the word candidate group indicated by numeral 13 in FIG. 5. The result is the word candidates 14, extracted with priority given to words of two or more characters containing kanji.

The extracted word candidate group 13 is then turned into the connected word strings indicated by numeral 15 by the connected word string generation unit 9 and the connection test unit 10. In FIG. 5, numeral 16 denotes a word for which word recognition failed, and numeral 17 denotes a position where the connection between words failed.

[Problem to be solved by the invention]

In the conventional word analysis method described above, kanji terms of two or more characters containing kanji are searched for preferentially. Consequently, the short-unit single-kanji word 「間」 contained in the long-unit word 「間引き」 ("thinning out") is not searched for, and recognition of the word string 「間引き」 fails. In addition, 「間引き」 and 「取る」 ("to take") do not connect grammatically. Such word string recognition failures lower word recognition accuracy.

On the other hand, if single-kanji words are searched for without giving priority to words of two or more characters containing kanji in order to remedy these missed lookups, the very large number of single-kanji words increases the number of unnecessary word strings. This lowers analysis speed, and the larger number of word strings in turn lowers word recognition accuracy.

The present invention solves the problems described above, and its object is to provide a Japanese document analysis device that can improve both word recognition accuracy and analysis speed.

[Means to solve the problem]

The Japanese document analysis device of the present invention is an analysis system comprising a clause cutting unit, a Japanese word dictionary, a word candidate extraction unit having a kanji term word extraction unit and a single-kanji word extraction unit, a word candidate extraction control unit, a connected word string generation unit, and a connection test unit, to which two units are added: a dictionary search information extraction unit that extracts the dictionary search information flag attached to a kanji term word extracted by the kanji term word extraction unit, and a word string narrowing unit that narrows down the word strings generated by the word string generation unit according to the dictionary search information flag.

[Operation]

In the present invention, during word candidate extraction for kanji terms of two or more characters containing kanji, extraction of the necessary single-kanji words contained in a kanji term is controlled according to the word search flag, a dictionary search information flag extracted from the Japanese word dictionary along with that kanji term. Furthermore, when the connected word strings generated from the positional connection relationships of characters are narrowed down, positional connection is permitted for each heading that includes a single-kanji word contained in the kanji term, so that those connected word strings are not deleted.

Therefore, the present invention can improve word recognition accuracy and analysis speed.

[Embodiment]

An embodiment of the present invention is described in detail below with reference to the drawings.

FIG. 1 is an overall configuration diagram of an embodiment of the analysis method of the present invention. Parts identical to those in FIG. 4 are given the same reference numerals.

In the figure, the word analysis processing system indicated as a whole by numeral 20 comprises: a clause cutting unit 3 that extracts, from the Japanese character string recorded in database 2, clauses consisting of independent words or attached words at the points where the character type changes; a Japanese word dictionary 21 that stores, for each word, its heading, reading, grammatical information, attribute information, and dictionary search information flags concerning the extraction and connection of kanji words; a word candidate extraction unit 22 that consults the Japanese word dictionary 21 for each clause and exhaustively extracts word strings; a kanji term word extraction unit 23, inside the word candidate extraction unit 22, that extracts words of two or more characters containing kanji; a single-kanji word extraction unit 24, also inside unit 22, that extracts words consisting of one kanji; a dictionary search information extraction unit 25 that extracts the dictionary search information flags stored in advance in the Japanese word dictionary 21 along with the kanji term words extracted by unit 23; a word candidate extraction control unit 8 that controls word candidate extraction, including candidates of character types other than kanji; a connected word string generation unit 9 that creates candidate chains of positionally connectable words from the extracted word strings; a word string narrowing unit 26 that, according to the dictionary search information flags attached to the words, permits positional connection for each heading that includes a single-kanji word contained in a kanji term and narrows down the generated connected word strings; and a connection test unit 10 that tests the grammatical connection between the words of each connected word string and generates a recognized word string. Numeral 11 denotes a recording device that records the word recognition results.
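The dictionary 21 described above differs from the conventional dictionary 4 in that each entry also carries flags for extraction and connection. One possible in-memory layout is sketched below in Python. The class name `DictionaryEntry`, the field names, the readings, and the part-of-speech labels are assumptions made for illustration; only the flag values for 「間引き」 and 「後生」 follow the cases shown in FIG. 2 and FIG. 3, and the attribute information field is omitted for brevity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DictionaryEntry:
    heading: str                     # surface form of the word
    reading: str                     # kana reading (assumed values below)
    part_of_speech: str              # stands in for "grammatical information"
    word_search_flag: bool = False   # on: also look up contained single-kanji words
    word_chain_flag: bool = False    # on: contained words may chain positionally

# Hypothetical toy dictionary keyed by heading.
DICTIONARY = {
    entry.heading: entry
    for entry in [
        DictionaryEntry("間引き", "まびき", "noun",
                        word_search_flag=True, word_chain_flag=False),
        DictionaryEntry("後生", "ごしょう", "noun",
                        word_search_flag=True, word_chain_flag=True),
        DictionaryEntry("間", "あいだ", "noun"),
    ]
}

def search_flags(heading):
    """Return (word_search_flag, word_chain_flag); both off for unknown headings."""
    entry = DICTIONARY.get(heading)
    if entry is None:
        return (False, False)
    return (entry.word_search_flag, entry.word_chain_flag)

print(search_flags("間引き"))
print(search_flags("後生"))
```

Keeping both flags on the long-unit entry means the extraction step and the narrowing step can each consult the same record without a second lookup table.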

In the method of the embodiment described above, when the word candidate extraction unit 22 extracts word candidates, whether to extract the single-kanji words contained in an extracted kanji term word is controlled according to that word's dictionary search information flag, so that word candidates are extracted efficiently. After word string candidates are generated under the positional connection conditions, those candidates that consist of words contained in a kanji term are narrowed down, for example by deleting the connected word string, according to the dictionary search information flag of the extracted kanji term word. This improves word recognition accuracy and speeds up analysis.
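The flag-controlled candidate extraction can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the patented implementation: the dictionary is reduced to a mapping from heading to word search flag, lookups restart from the end of every extracted word, and the sample text and entries (which use dictionary forms such as 「引き取る」) are hypothetical.

```python
from collections import deque

def is_kanji(ch):
    return "\u4e00" <= ch <= "\u9fff"

def extract_candidates(text, dictionary):
    """Return (start, word) candidates; dictionary maps heading -> word search flag."""
    candidates, seen, queue = [], set(), deque([0])
    while queue:
        pos = queue.popleft()
        if pos >= len(text) or pos in seen:
            continue
        seen.add(pos)
        # Words of two or more characters containing kanji are looked up first.
        multi = [w for w in dictionary
                 if len(w) >= 2 and any(map(is_kanji, w)) and text.startswith(w, pos)]
        ch = text[pos]
        has_single = is_kanji(ch) and ch in dictionary
        # Conventional rule: single-kanji lookup only when no longer word matched.
        # Added rule: also look it up when a matched word has its search flag on.
        want_single = not multi or any(dictionary[w] for w in multi)
        found = multi + ([ch] if has_single and want_single else [])
        for word in found:
            candidates.append((pos, word))
            queue.append(pos + len(word))  # continue searching after each word
        if not found:
            queue.append(pos + 1)          # skip characters with no entry
    return candidates

flags_on = {"間引き": True, "間": False, "引き": False, "引き取る": False, "取る": False}
flags_off = {**flags_on, "間引き": False}

print(extract_candidates("間引き取る", flags_off))  # conventional behaviour
print(extract_candidates("間引き取る", flags_on))   # search flag on for 間引き
```

With the flag off, only 「間引き」 and 「取る」 are found, so the reading that starts with 「間」 can never be produced. With the flag on, 「間」 is added, and the search restarted after it also finds 「引き」 and 「引き取る」.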

FIG. 2 shows a concrete example of word analysis by the analysis processing system configured as above (the case where the word chain flag is off).

The original character string 12 shown in FIG. 2 is cut into clauses by the clause cutting unit 3; the Japanese word dictionary 21 is then searched, and the word candidate extraction unit 22 and the word candidate extraction control unit 8 extract the word candidate group 13 shown in FIG. 2. In FIG. 2, numeral 14 denotes word candidates of two or more characters containing kanji, 15 the generated connected word strings, 16 a word for which word recognition failed, and 17 a position where the connection between words failed. Numeral 30 denotes the kanji term 「間引き」, which carries dictionary search information flags; 31 and 32 denote the word search flag and the word chain flag attached to the kanji term 「間引き」 30; 33 denotes the word group searched for in addition to the conventional candidates because the word search flag 31 is on; 34 denotes the extracted single-kanji word; 35 denotes the kanji term words newly found as a result of extracting the single-kanji word 34; 36 denotes the position, within the word string contained in 「間引き」, at which chaining is prohibited by the word chain flag 32; 37 denotes the connected word strings narrowed down by the chain-prohibited position 36; 38 denotes the connected word strings narrowed down by the failed word 16 and the failed connection position 17; and 39 denotes the connected word string finally recognized by the connection test unit 10.

In this processing example, dictionary search information flags are attached in advance to the long-unit kanji term 「間引き」. Because the word search flag 31 is on, the contained single-kanji word 「間」, which would not normally be searched for, is extracted, and a normal search starting from the position after 「間」 then extracts 「引き」 and 「引き取」.

As a result, the positionally connectable connected word strings 15 are created. 「間引き」 and 「取」 cannot be connected under the part-of-speech connection conditions, and the word string 「間」 and 「引き」 contained in 「間引き」 is prohibited from connecting because the word chain flag of 「間引き」 is off. Both are deleted from the connected word strings, which are thereby narrowed down, and the connected word string 39, the correct word recognition result, is finally selected.

FIG. 3 shows another example of word analysis processing according to the present invention (the case where the word chain flag is on). Numeral 40 denotes the kanji term 「後生」, which carries dictionary search information flags; 41 and 42 denote the word search flag and the word chain flag attached to the kanji term 「後生」 40; 43 denotes the word group searched for in addition to the conventional candidates because the word search flag 41 is on; 44 denotes the extracted single-kanji word; 45 denotes the kanji term words newly found as a result of extracting the single-kanji word 44; 46 denotes the position, within the word string contained in 「後生」, at which chaining is permitted by the word chain flag 42; 47 denotes the connected word strings narrowed down by the failed word 16 and the failed connection position 17; and 48 denotes the connected word string finally recognized by the connection test unit.

In this processing example, dictionary search information flags are attached in advance to the long-unit kanji term 「後生」. Because the word search flag 41 is on, the contained single-kanji word 「後」, which would not normally be searched for, is extracted, and a normal search starting from the position after 「後」 then extracts 「生」. As a result, the positionally connectable connected word strings 15 are created. Among them, 「後生」 and 「ま」 cannot be connected under the part-of-speech connection conditions, and that string is deleted, whereas the word string 「後」 and 「生」 contained in 「後生」 is permitted to connect because the word chain flag of 「後生」 is on, and a connected word string is generated. In this way, the connected word string 48, the correct word recognition result, is finally selected.
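The narrowing and connection test steps of the two worked examples can be sketched in Python as follows. The sketch assumes a toy candidate list and a hypothetical table of forbidden word pairs standing in for the part-of-speech connection conditions; the function names are illustrative, and the sample text is not the string shown in the figures.

```python
def enumerate_paths(candidates, length):
    """All sequences of (start, word) candidates that exactly tile [0, length)."""
    by_start = {}
    for start, word in candidates:
        by_start.setdefault(start, []).append(word)
    paths = []

    def walk(pos, acc):
        if pos == length:
            paths.append(list(acc))
            return
        for word in by_start.get(pos, []):
            acc.append((pos, word))
            walk(pos + len(word), acc)
            acc.pop()

    walk(0, [])
    return paths

def tiles_compound(path, start, end):
    """True if two or more shorter words in the path exactly cover [start, end)."""
    inside = [w for s, w in path if s >= start and s + len(w) <= end]
    return len(inside) >= 2 and sum(map(len, inside)) == end - start

def narrow(paths, compound, chain_flag):
    """Delete paths that rebuild the compound from its parts unless chaining is on."""
    start, word = compound
    if chain_flag:
        return paths
    return [p for p in paths if not tiles_compound(p, start, start + len(word))]

def passes_connection_test(path, forbidden_pairs):
    words = [w for _, w in path]
    return all(pair not in forbidden_pairs for pair in zip(words, words[1:]))

# Hypothetical candidates for the text "間引き取る" (5 characters).
candidates = [(0, "間引き"), (0, "間"), (1, "引き"), (1, "引き取る"), (3, "取る")]
forbidden = {("間引き", "取る")}  # assumed grammatical connection failure

paths = enumerate_paths(candidates, 5)
chain_off = [p for p in narrow(paths, (0, "間引き"), False)
             if passes_connection_test(p, forbidden)]
chain_on = [p for p in narrow(paths, (0, "間引き"), True)
            if passes_connection_test(p, forbidden)]
print(chain_off)
print(chain_on)
```

With the chain flag off, as for 「間引き」 in FIG. 2, the path that rebuilds the compound from 「間」 and 「引き」 is deleted before the connection test, leaving a single reading. With the flag on, as for 「後生」 in FIG. 3, the corresponding path survives narrowing and is judged only by the connection test.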

As these results show, word recognition accuracy improves compared with the conventional technology, and overall analysis speed can be improved because unnecessary single-kanji word searches are eliminated and the connected word strings subject to testing are narrowed down.

[Effect of the invention]

As described above, the present invention does not merely give priority to kanji terms of two or more characters containing kanji when extracting words, searching for single-kanji words only when no such kanji term exists. It also extracts the single-kanji words in the headings contained in a kanji term when the word search flag, a dictionary search information flag stored in advance in the Japanese dictionary along with that kanji term, is on. Furthermore, when narrowing down the connected word strings generated from the positional connection relationships of characters, it consults the word chain flag, another dictionary search information flag of the kanji term; when that flag is on, it permits positional connection for each heading that includes a single-kanji word contained in the kanji term and does not delete those connected word strings. The invention therefore improves word recognition accuracy and overall analysis speed and realizes effective word analysis processing.

[Brief explanation of the drawing]

FIG. 1 is an overall configuration diagram showing an example of the basic configuration of the present invention; FIG. 2 is an explanatory diagram showing an example of word analysis processing with that basic configuration; FIG. 3 is an explanatory diagram showing another example of word analysis processing with that basic configuration; FIG. 4 is a block diagram of a conventional word analysis system; and FIG. 5 is an explanatory diagram showing an example of conventional word analysis processing.

[Explanation of reference numerals for main parts]

2 ... input Japanese text database
3 ... clause cutting unit
8 ... word candidate extraction control unit
9 ... connected word string generation unit
10 ... connection test unit
11 ... recording device for word recognition string results
20 ... word analysis processing system
21 ... Japanese word dictionary
22 ... word candidate extraction unit
23 ... kanji term word extraction unit
24 ... single-kanji word extraction unit
25 ... dictionary search information extraction unit
26 ... word string narrowing unit

Claims

[Claims]

(1) A Japanese document analysis device comprising: a clause cutting unit that extracts, from an input Japanese character string, clauses consisting of independent words or attached words at points where the character type changes; a Japanese word dictionary that stores, for each word, its heading, reading, grammatical information, and dictionary search information flags concerning the extraction and connection of kanji words; a word candidate extraction unit that uses the Japanese word dictionary to exhaustively extract, for each clause, the possible words and their associated information; within the word candidate extraction unit, a kanji term word extraction unit that uses the Japanese word dictionary to extract, from among the word candidates, kanji term candidates, which are words of two or more characters containing kanji; a dictionary search information extraction unit that controls whether single-kanji words are searched for according to the dictionary search information flag stored in advance in the Japanese word dictionary along with the kanji term word; a single-kanji word extraction unit that uses the Japanese word dictionary to extract single-kanji word candidates; a word candidate extraction control unit that controls the extraction of word candidates, including candidates of character types other than kanji; a connected word string generation unit that uses the positional connection relationships of characters to create, from each extracted word group, candidate word strings that can form a clause; a word string narrowing unit that, according to the dictionary search information flag, permits positional connection for each heading that includes a single-kanji word contained in the kanji term and narrows down these connected word strings; and a connection test unit that tests the grammatical connection between the words of each connected word string and generates a recognized word string.