JPH01205377A - Japanese language document analyzing device - Google Patents

Japanese language document analyzing device

Info

Publication number
JPH01205377A
Authority
JP
Japan
Prior art keywords
word
kanji
dictionary
words
japanese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP8830188A
Other languages
Japanese (ja)
Other versions
JPH0715690B2 (en)
Inventor
Shinichiro Takagi
伸一郎 高木
Tsuneo Yasuda
安田 恒雄
Katsumi Shimazaki
島崎 勝美
Satoru Ikehara
池原 悟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp
Priority to JP63030188A
Publication of JPH01205377A
Publication of JPH0715690B2
Current legal status: Expired - Lifetime

Landscapes

  • Machine Translation (AREA)

Abstract

PURPOSE: To improve word recognition accuracy and analysis speed by adding to a document analysis device a dictionary search information extraction unit, which extracts the dictionary search information flags attached to words extracted by a kanji term word extraction unit, and a word string narrowing unit, which narrows down the word strings generated by a word string generation unit according to those flags. CONSTITUTION: A clause cutting unit 3 extracts clauses from a database 2, and for each clause a word candidate extraction unit 22 consults a dictionary 21 and exhaustively extracts word strings. The extraction unit 22 contains a kanji term word extraction unit 23, which extracts words of two or more characters, a single-kanji word extraction unit 24, and a dictionary search information extraction unit 25, which extracts the dictionary search information flags attached to the extracted words. During word candidate extraction, under the word candidate extraction control unit 8, whether to extract the single-kanji words contained in an extracted word is decided according to that word's dictionary search information flag, so that candidates are extracted efficiently. Word string candidates are then generated according to positional connection conditions, and a word string narrowing unit 26 narrows them down, for example by deletion, according to the flags. As a result, word recognition accuracy improves and analysis becomes faster.

Description

[Detailed Description of the Invention]

[Industrial Application Field]

The present invention relates to a Japanese document analysis device that, in order to build a Japanese document database, analyzes Japanese document character strings of mixed kanji and kana read from an input device into Japanese words.

[Conventional Technology]

Large volumes of Japanese documents such as newspaper articles, manuscripts for publication, and scientific and technical papers can be converted into electronic files to build a Japanese document database. When that database is used to build language processing systems that detect errors such as typos, translate Japanese into other languages, or convert kanji to kana and output synthesized Japanese speech, morphological analysis, the basis of all natural language processing, is indispensable.

To increase analysis speed, word analysis of Japanese text generally extracts words of two or more characters that contain kanji with priority, and searches for single-kanji words, of which there are many, only when no kanji term word of two or more characters containing kanji is found.
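
The priority rule described above can be sketched in Python as follows. This is only an illustrative sketch, not the patent's implementation: the function name `conventional_candidates`, the toy dictionary, and the shortened input string 「間引き取」 are assumptions, and the search is assumed to restart at the end of each word already found.

```python
# Toy dictionary; the entries are hypothetical and chosen only for illustration.
TOY_DICTIONARY = {"間引き", "間", "引き", "引き取", "取"}


def conventional_candidates(text, dictionary):
    """Return sorted (start, word) pairs found by the priority rule.

    At each search position, words of two or more characters are looked up
    first; the single-character word is looked up only if none was found.
    """
    candidates = []
    pending = [0]      # start positions still to be searched
    seen = set()
    while pending:
        start = pending.pop()
        if start in seen or start >= len(text):
            continue
        seen.add(start)
        longer = [text[start:end]
                  for end in range(start + 2, len(text) + 1)
                  if text[start:end] in dictionary]
        if longer:
            words = longer
        elif text[start] in dictionary:
            words = [text[start]]
        else:
            words = []
        for word in words:
            candidates.append((start, word))
            pending.append(start + len(word))  # search resumes after each word
    return sorted(candidates)


print(conventional_candidates("間引き取", TOY_DICTIONARY))
# prints [(0, '間引き'), (3, '取')]
```

In this sketch the single-kanji word 「間」 and the words starting at the second character are never looked up, because the longer word found at position 0 suppresses them.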

FIG. 4 is a block diagram of a conventional word analysis system.

In the figure, reference numeral 1 denotes the word analysis processing device as a whole. 2 is an input Japanese text database in which Japanese text read by a Japanese input device such as a kanji reader, pen-touch device, or keyboard is recorded on a magnetic device in the form of character codes. 3 is a clause cutting unit that extracts clauses, each consisting of independent words or attached words, at the points where the character type changes in the Japanese character strings recorded in database 2. 4 is a Japanese word dictionary that stores, for each word, its heading, reading, grammatical information, and attribute information. 5 is a word candidate extraction unit that consults the Japanese word dictionary 4 for each clause and exhaustively extracts word strings. 6 is a kanji term word extraction unit, inside the word candidate extraction unit 5, that extracts words of two or more characters containing kanji. 7 is a single-kanji word extraction unit, also inside unit 5, that extracts words consisting of one kanji. 8 is a word candidate extraction control unit that controls word candidate extraction, including candidates of character types other than kanji. 9 is a connected word string generation unit that creates candidate chains of positionally connectable words from the extracted word strings. 10 is a connection test unit that applies connection conditions between word headings or parts of speech to the word strings from unit 9 and deletes the strings that violate them grammatically. 11 is a recording device that records the word recognition results.

FIG. 5 shows an example of word analysis by the system configured as described above. The original character string 12 shown in FIG. 5 is cut into clauses by the clause cutting unit 3. The Japanese word dictionary 4 is then searched, and the word candidate extraction unit 5 and the word candidate extraction control unit 8 extract the word candidate group indicated by reference numeral 13 in FIG. 5. The result is the word candidates 14, extracted with priority given to words of two or more characters containing kanji.

The connected word string generation unit 9 and the connection test unit 10 then turn the extracted word candidate group 13 into the connected word strings indicated by reference numeral 15. In FIG. 5, 16 is a word whose recognition failed, and 17 is a position where the connection between words failed.

[Problem to Be Solved by the Invention]

Because the conventional word analysis method described above gives search priority to kanji terms of two or more characters containing kanji, the short-unit single-kanji word 「間」 contained in the long-unit word string 「間引き」 is never searched, so the word string 「間引き」 is recognized incorrectly. In addition, 「間引き」 and 「取る」 do not connect grammatically. Word string recognition failures of this kind lower word recognition accuracy.

Conversely, suppose that, to fix these missed lookups, single-kanji words are searched without giving priority to words of two or more characters containing kanji. Because there are very many single-kanji words, the number of unnecessary word strings grows. This slows the analysis, and the larger number of word strings actually lowers word recognition accuracy.

The present invention solves the problems described above, and its object is to provide a Japanese document analysis device that can improve both word recognition accuracy and analysis speed.

[Means for Solving the Problem]

The Japanese document analysis device of the present invention is based on an analysis system comprising a clause cutting unit, a Japanese word dictionary, a word candidate extraction unit containing a kanji term word extraction unit and a single-kanji word extraction unit, a word candidate extraction control unit, a connected word string generation unit, and a connection test unit. To this system it adds a dictionary search information extraction unit, which extracts the dictionary search information flags attached to the kanji term words extracted by the kanji term word extraction unit, and a word string narrowing unit, which narrows down the word strings generated by the word string generation unit according to the dictionary search information flags.

[Operation]

In the present invention, during candidate extraction for kanji terms of two or more characters containing kanji, extraction of the necessary single-kanji words contained in a kanji term is controlled according to the word search flag, a dictionary search information flag extracted from the Japanese word dictionary together with that kanji term. Then, when the connected word strings generated from the positional connection relationships of the characters are narrowed down, positional connection is permitted for each heading that includes a single-kanji word contained in the kanji term, so that these connected word strings are not deleted.

The present invention can therefore improve word recognition accuracy and analysis speed.

[Embodiment]

Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 1 is an overall block diagram of an embodiment of the analysis method of the present invention. Parts identical to those in FIG. 4 are given the same reference numerals.

In the figure, the word analysis processing system indicated as a whole by reference numeral 20 consists of the following: a clause cutting unit 3 that extracts clauses, each consisting of independent words or attached words, at the points where the character type changes in the Japanese character strings recorded in database 2; a Japanese word dictionary 21 that stores, for each word, its heading, reading, grammatical information, attribute information, and the dictionary search information flags concerning the extraction and connection of kanji words; a word candidate extraction unit 22 that consults the Japanese word dictionary 21 for each clause and exhaustively extracts word strings; a kanji term word extraction unit 23, inside unit 22, that extracts words of two or more characters containing kanji; a single-kanji word extraction unit 24, also inside unit 22, that extracts words consisting of one kanji; a dictionary search information extraction unit 25 that extracts the dictionary search information flags stored in advance in dictionary 21 together with the kanji term words extracted by unit 23; a word candidate extraction control unit 8 that controls word candidate extraction, including candidates of character types other than kanji; a connected word string generation unit 9 that creates candidate chains of positionally connectable words from the extracted word strings; a word string narrowing unit 26 that, according to the dictionary search information flags attached to the words, permits positional connection for each heading that includes a single-kanji word contained in a kanji term, and narrows down the generated connected word strings; and a connection test unit 10 that tests the grammatical connection between the words of each connected word string and produces the recognized word string. 11 is a recording device that records the word recognition results.
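
One possible shape for an entry of the Japanese word dictionary 21 is sketched below. The description only states that each entry holds a heading, a reading, grammatical information, attribute information, and flags for extraction and connection; the class name `DictionaryEntry`, the field names, the readings, and the part-of-speech labels are assumptions made for illustration. The flag values for 「間引き」 and 「後生」 follow the FIG. 2 and FIG. 3 examples in this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictionaryEntry:
    heading: str                # surface form of the word
    reading: str                # reading in kana (illustrative values)
    part_of_speech: str         # stands in for the grammatical information
    search_flag: bool = False   # word search flag: also extract the contained single-kanji word
    chain_flag: bool = False    # word chain flag: permit chains of the contained words


# Hypothetical sample entries; only the flag values come from the description.
ENTRIES = [
    DictionaryEntry("間引き", "まびき", "noun", search_flag=True, chain_flag=False),
    DictionaryEntry("後生", "ごしょう", "noun", search_flag=True, chain_flag=True),
    DictionaryEntry("間", "あいだ", "noun"),
]

# Index by heading so that a lookup during candidate extraction is one dict access.
DICTIONARY = {entry.heading: entry for entry in ENTRIES}

print(DICTIONARY["間引き"].search_flag, DICTIONARY["間引き"].chain_flag)
# prints True False
```

Keeping the two flags as separate fields mirrors the text, where the word search flag acts during extraction and the word chain flag acts during narrowing.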

In the embodiment described above, when the word candidate extraction unit 22 extracts word candidates, the dictionary search information flag of each extracted kanji term word controls whether the single-kanji words contained in that kanji term are extracted, so that candidates are extracted efficiently. After word string candidates have been generated under the positional connection conditions, those candidates that fall inside a kanji term are narrowed down, for example by deleting the connected word string, according to the dictionary search information flag of the extracted kanji term word. This improves word recognition accuracy and speeds up the analysis.

FIG. 2 shows a concrete example of word analysis by the system configured as above (the case where the word chain flag is off).

The original character string 12 shown in FIG. 2 is cut into clauses by the clause cutting unit 3. The Japanese word dictionary 21 is then searched, and the word candidate extraction unit 22 and the word candidate extraction control unit 8 extract the word candidate group 13 shown in FIG. 2. In FIG. 2, 14 is a word candidate of two or more characters containing kanji; 15 is a generated connected word string; 16 is a word whose recognition failed; 17 is a position where the connection between words failed; 30 is the kanji term 「間引き」, which carries dictionary search information flags; 31 and 32 are the word search flag and the word chain flag attached to the kanji term 「間引き」 30; 33 is the group of words searched in addition to the conventional ones because the word search flag 31 is on; 34 is the extracted single-kanji word; 35 is the kanji term words newly found as a result of extracting the single-kanji word 34; 36 is the position, in the word string contained in 「間引き」, where chaining is prohibited by the word chain flag 32; 37 is a connected word string narrowed down by the chain-prohibited position 36; 38 is a connected word string narrowed down by the failed word 16 and the failed connection position 17; and 39 is the connected word string finally recognized by the connection test unit 10.

In this example, dictionary search information flags are attached in advance to the extracted long-unit kanji term 「間引き」. Because the word search flag 31 is on, the contained single-kanji word 「間」, which would normally not be searched, is extracted, and an ordinary search from the position following 「間」 then extracts 「引き」 and 「引き取」.
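
The flag-controlled extraction in this example can be sketched as follows. This is an assumption-laden illustration rather than the device's actual logic: the function name `extract_candidates`, the `(search_flag, chain_flag)` tuple layout, the shortened input 「間引き取」, and the rule that the search restarts at the end of every word found are all hypothetical.

```python
# word -> (search_flag, chain_flag). The flags of 間引き follow the FIG. 2
# example; the other entries and their flag values are toy assumptions.
DICTIONARY = {
    "間引き": (True, False),
    "間": (False, False),
    "引き": (False, False),
    "引き取": (False, False),
    "取": (False, False),
}


def extract_candidates(text, dictionary):
    """Return sorted (start, word) pairs, honoring the word search flag."""
    candidates = []
    pending = [0]      # start positions still to be searched
    seen = set()
    while pending:
        start = pending.pop()
        if start in seen or start >= len(text):
            continue
        seen.add(start)
        words = [text[start:end]
                 for end in range(start + 2, len(text) + 1)
                 if text[start:end] in dictionary]
        single = text[start]
        search_flag_on = any(dictionary[word][0] for word in words)
        # The single-character word is extracted when no longer word exists
        # (conventional rule) or when a longer word has its search flag on.
        if single in dictionary and (not words or search_flag_on):
            words.append(single)
        for word in words:
            candidates.append((start, word))
            pending.append(start + len(word))
    return sorted(candidates)


print(extract_candidates("間引き取", DICTIONARY))
# prints [(0, '間'), (0, '間引き'), (1, '引き'), (1, '引き取'), (3, '取')]
```

Compared with the conventional rule, the only change is the `search_flag_on` condition, so single-kanji lookups are added only where a dictionary entry asks for them.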

As a result, the positionally connectable connected word strings 15 are created. 「間引き」 and 「取」 cannot be connected under the part-of-speech connection conditions, and the word string 「間」 + 「引き」 contained in 「間引き」 is prohibited from connecting because the word chain flag of 「間引き」 is off. Both are deleted from the connected word strings, and after this narrowing the connected word string 39, the correct word recognition result, is finally selected.

FIG. 3 shows another example of word analysis according to the present invention (the case where the word chain flag is on). 40 is the kanji term 「後生」, which carries dictionary search information flags; 41 and 42 are the word search flag and the word chain flag attached to the kanji term 「後生」 40; 43 is the group of words searched in addition to the conventional ones because the word search flag 41 is on; 44 is the extracted single-kanji word; 45 is the kanji term words newly found as a result of extracting the single-kanji word 44; 46 is the position, in the word string contained in 「後生」, where chaining is permitted by the word chain flag 42; 47 is a connected word string narrowed down by the failed word 16 and the failed connection position 17; and 48 is the connected word string finally recognized by the connection test unit.

In this example, dictionary search information flags are attached in advance to the extracted long-unit kanji term 「後生」. Because the word search flag 41 is on, the contained single-kanji word 「後」, which would normally not be searched, is extracted, and an ordinary search from the position following 「後」 then extracts 「生」. As a result, the positionally connectable connected word strings 15 are created. Among them, 「後生」 and 「ま」 cannot be connected under the part-of-speech connection conditions and are deleted, whereas the word string 「後」 + 「生」 contained in 「後生」 is permitted to connect because the word chain flag of 「後生」 is on, and a connected word string is generated from it. In this way the connected word string 48, the correct word recognition result, is finally selected.
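
The narrowing step in the two processing examples (FIG. 2 with the word chain flag off, FIG. 3 with it on) can be sketched as follows. The function names, the shortened inputs 「間引き取」 and 「後生」, and the interpretation that a chain is dropped when consecutive words exactly spell out a kanji term whose chain flag is off are assumptions; the part-of-speech connection test performed afterwards by the connection test unit 10 is not modeled.

```python
def positional_chains(text, candidates):
    """Enumerate word sequences that tile the text from start to end."""
    by_start = {}
    for start, word in candidates:
        by_start.setdefault(start, []).append(word)

    def extend(pos):
        if pos == len(text):
            yield []
            return
        for word in by_start.get(pos, []):
            for rest in extend(pos + len(word)):
                yield [word] + rest

    return list(extend(0))


def splits_unchainable_term(chain, chain_flags):
    """True if consecutive words in the chain spell a term whose chain flag is off."""
    for i in range(len(chain)):
        for j in range(i + 2, len(chain) + 1):
            if chain_flags.get("".join(chain[i:j])) is False:
                return True
    return False


def narrow(chains, chain_flags):
    """Keep only the chains allowed by the word chain flags."""
    return [chain for chain in chains
            if not splits_unchainable_term(chain, chain_flags)]


fig2 = positional_chains(
    "間引き取",
    [(0, "間"), (0, "間引き"), (1, "引き"), (1, "引き取"), (3, "取")])
print(narrow(fig2, {"間引き": False}))
# prints [['間', '引き取'], ['間引き', '取']]

fig3 = positional_chains("後生", [(0, "後"), (0, "後生"), (1, "生")])
print(narrow(fig3, {"後生": True}))
# prints [['後', '生'], ['後生']]
```

With the flag off, the chain 「間」 + 「引き」 + 「取」 is removed before grammatical testing, which reduces the number of strings the connection test has to examine; with the flag on, 「後」 + 「生」 survives as a candidate.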

As these results show, word recognition accuracy improves over the conventional technology. In addition, because unnecessary single-kanji word searches are eliminated and the connected word strings to be tested are narrowed down, overall analysis speed can be improved.

[Effect of the Invention]

As described above, the present invention does more than simply give search priority to kanji terms of two or more characters containing kanji and search for single-kanji words only when no such kanji term exists. According to the word search flag, a dictionary search information flag stored in advance in the Japanese dictionary together with each kanji term of two or more characters containing kanji, the single-kanji word of the heading contained in the kanji term is extracted if the flag is on. Further, when the connected word strings generated from the positional connection relationships of the characters are narrowed down, the word chain flag, another dictionary search information flag of the kanji term, is consulted: if it is on, positional connection is permitted for each heading that includes a single-kanji word contained in the kanji term, and these connected word strings are not deleted. The invention therefore improves word recognition accuracy and overall analysis speed, and achieves effective word analysis processing.

[Brief Description of the Drawings]

FIG. 1 is an overall block diagram showing an example of the basic configuration of the present invention. FIG. 2 is an explanatory diagram showing an example of word analysis processing with that basic configuration. FIG. 3 is an explanatory diagram showing another example of word analysis processing with that basic configuration. FIG. 4 is a block diagram of a conventional word analysis system. FIG. 5 is an explanatory diagram showing an example of conventional word analysis processing.

[Explanation of Reference Numerals for Main Parts]
2 ... Input Japanese text database
3 ... Clause cutting unit
8 ... Word candidate extraction control unit
9 ... Connected word string generation unit
10 ... Connection test unit
11 ... Recording device for word recognition results
20 ... Word analysis processing system
21 ... Japanese word dictionary
22 ... Word candidate extraction unit
23 ... Kanji term word extraction unit
24 ... Single-kanji word extraction unit
25 ... Dictionary search information extraction unit
26 ... Word string narrowing unit

Claims (1)

[Claims]
(1) A Japanese document analysis device comprising:
a clause cutting unit that extracts, from an input Japanese character string, clauses consisting of independent words or attached words at the points where the character type changes;
a Japanese word dictionary that stores each word's heading, reading, and grammatical information together with dictionary search information flags concerning the extraction and connection of kanji words;
a word candidate extraction unit that, for each clause, uses the Japanese word dictionary to exhaustively extract the possible words and the information attached to them;
a kanji term word extraction unit, within the word candidate extraction unit, that uses the Japanese word dictionary to extract candidate kanji terms, that is, words of two or more characters containing kanji;
a dictionary search information extraction unit that controls whether single-kanji words are searched, according to the dictionary search information flags stored in advance in the Japanese word dictionary together with the words extracted by the kanji term word extraction unit;
a single-kanji word extraction unit that uses the Japanese word dictionary to extract single-kanji candidates other than kanji terms;
a word candidate extraction control unit that controls word candidate extraction, including candidates of character types other than kanji;
a connected word string generation unit that uses the positional connection relationships of the characters to create, from each extracted word group, candidate word strings that can form a clause;
a word string narrowing unit that, for the generated connected word strings, permits positional connection for each heading that includes a single-kanji word contained in a kanji term, according to the dictionary search information flags stored in advance in the Japanese word dictionary together with the words, and narrows down these connected word strings; and
a connection test unit that tests the grammatical connection between the words of each connected word string and generates the recognized word string.
JP63030188A 1988-02-12 1988-02-12 Japanese document analysis device Expired - Lifetime JPH0715690B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP63030188A JPH0715690B2 (en) 1988-02-12 1988-02-12 Japanese document analysis device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP63030188A JPH0715690B2 (en) 1988-02-12 1988-02-12 Japanese document analysis device

Publications (2)

Publication Number Publication Date
JPH01205377A (en) 1989-08-17
JPH0715690B2 (en) 1995-02-22

Family

ID=12296779

Family Applications (1)

Application Number Title Priority Date Filing Date
JP63030188A Expired - Lifetime JPH0715690B2 (en) 1988-02-12 1988-02-12 Japanese document analysis device

Country Status (1)

Country Link
JP (1) JPH0715690B2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011065384A (en) * 2009-09-16 2011-03-31 Nippon Telegr & Teleph Corp <Ntt> Text analysis device, method, and program coping with wrong letter and omitted letter

Also Published As

Publication number Publication date
JPH0715690B2 (en) 1995-02-22

Similar Documents

Publication Publication Date Title
US5276616A (en) Apparatus for automatically generating index
JP5130892B2 (en) Character encoding processing method and system
JPH11110416A (en) Method and device for retrieving document from data base
JP5231698B2 (en) How to predict how to read Japanese ideograms
KR101072100B1 (en) Document processing apparatus and method for extraction of expression and description
CN109344389B (en) Method and system for constructing Chinese blind comparison bilingual corpus
Hollingsworth et al. Retrieving hierarchical text structure from typeset scientific articles–a prerequisite for e-science text mining
JPH01205377A (en) Japanese language document analyzing device
JP3952964B2 (en) Reading information determination method, apparatus and program
Souter et al. Using Parsed Corpora: A review of current practice
JP2560656B2 (en) Document filing system
JP3767180B2 (en) Document structure analysis method and apparatus, and storage medium storing document structure analysis program
Zaghal et al. Arabic morphological analyzer with text to voice
JPS62267872A (en) Language analyzing device
Yusof et al. Identifying Relation Between Miriek and Kenyah Badeng Language by Using Morphological Analyzer
Diaconescu et al. A rule-based approach to generating large phonetic databases for Romanian results of the AFLR project
JPH0262668A (en) Sentence information retrieving system using sentence information analyzing technique
JPS63163956A (en) Document preparation and correction supporting device
JPS58127230A (en) Kanji (chinese character)-kana (japanese syllabary) converter
Holstege et al. Visual parsing: an aid to text understanding
Breuel Applying the OCRopus OCR System to Scholarly Sanskrit Literature
JPH01258069A (en) Morpheme analyzing system for japanese character string
JPH0630100B2 (en) Kana-Kanji conversion method
JPS6389975A (en) Language analyzer
JPH04253262A (en) Reading kana addition system

Legal Events

Date Code Title Description
EXPY Cancellation because of completion of term