JPS60124769A

JPS60124769A - Word extracting system

Info

Publication number: JPS60124769A
Application number: JP58232576A
Authority: JP
Inventors: Yasuyuki Numata; 泰之沼田
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1983-12-09
Filing date: 1983-12-09
Publication date: 1985-07-03

Abstract

PURPOSE: To speed up retrieval of words containing Kanji (Chinese characters) by registering Kanji sounds whose readings are two or more characters long, using them to section an input Kana (Japanese syllabary) character string, classifying the form of the sectioning result, and forming the character strings to be retrieved based on that classification. CONSTITUTION: The Kana character string entered from a keyboard input section 1 is stored in an input character string temporary storage section 2 and fed to a Kanji sound processing section 3. The Kanji sound processing section 3 has a classification module and a register; it classifies the processing result according to given conditions and transmits the result to a retrieved character string forming section 4. The retrieved character string forming section 4 sets the six characters starting from the set analysis start position in a buffer. The contents of this buffer are matched against a Kanji sound table 5, in which Kanji sounds having readings of two or more characters are registered; the string is sectioned at the Kanji sound element level, and the result is set in another buffer. Based on a comparison between that buffer and the classification register, only the corresponding number of characters is cut out of the input character string to form a retrieved character string.

Description

[Detailed Description of the Invention]

Technical Field

The present invention relates to a word extraction method for Japanese word processors and the like. In particular, it relates to a word extraction method that introduces the concept of kanji sounds into the word extraction process and keeps the character strings to be searched during dictionary lookup to the necessary minimum, so that unnecessary candidates are not extracted, thereby reducing erroneous analysis and improving dictionary search speed.

Prior Art

In conventional kana-kanji conversion processing devices, the algorithm for extracting words from an input kana character string has generally been as follows.

(1) Setting the analysis start position for a character string. Except in special cases, the first character of the character string is initially set as the analysis start position. If a word beginning at that position is successfully extracted, the first character of the string remaining after the extracted word is set as the new analysis start position.

(Example) Input character string:

かいせきによりたんごの〜
↑: first analysis start position

If "かいせき" (解析, "analysis") is successfully extracted, the start position moves as follows:

かいせきによりたんごの〜
　　　　↑: next analysis start position

(2) Creating the searched character strings for dictionary lookup. Assuming that the longest reading in the dictionary is six characters, the following searched character strings are set for the example sentence above.

a) Extraction of the first word:
(1) かいせきによ
(2) かいせきに
(3) かいせき
(4) かいせ
(5) かい
(6) か

b) Extraction of the next word after "解析" has been successfully extracted:
(1) によりたんご
(2) によりたん
(3) によりた
(4) により
(5) によ
(6) に

(3) Extracting candidates by matching the searched character strings against the headword strings in the dictionary. For the above example, the result is as follows.

(1) "かいせきによ": no candidates can be extracted
(2) "かいせきに": no candidates can be extracted
(3) "かいせき": extracts 会席, 解析, 懐石
(4) "かいせ": no candidates can be extracted
(5) "かい": extracts 会, 回, 快, 戒, etc.
(6) "か": extracts 可, 香, 蚊, 課, etc.

(4) Various evaluations are applied to the candidate group extracted in (3), and the candidate judged most appropriate is selected.
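The conventional procedure in steps (2) and (3) can be sketched roughly as follows. This is only an illustrative sketch: the function names and the toy dictionary are assumptions made for the example, and the dictionary holds only the readings quoted in the text above.

```python
# Toy dictionary: reading -> candidate words. The entries are the ones quoted
# in the example above; a real word dictionary would be far larger.
DICTIONARY = {
    "かいせき": ["会席", "解析", "懐石"],
    "かい": ["会", "回", "快", "戒"],
    "か": ["可", "香", "蚊", "課"],
}

MAX_READING_LEN = 6  # longest reading assumed to exist in the dictionary


def conventional_search_strings(text, start, max_len=MAX_READING_LEN):
    """Return every prefix of text[start:], longest first, up to max_len characters."""
    window = text[start:start + max_len]
    return [window[:n] for n in range(len(window), 0, -1)]


def extract_candidates(text, start, dictionary=DICTIONARY):
    """Look up each searched string and collect the (reading, words) pairs that hit."""
    hits = []
    for key in conventional_search_strings(text, start):
        if key in dictionary:
            hits.append((key, dictionary[key]))
    return hits


sentence = "かいせきにより"
keys = conventional_search_strings(sentence, 0)
hits = extract_candidates(sentence, 0)
print(keys)  # six searched strings, from かいせきによ down to か
print(hits)  # hits for かいせき, かい and か
```

Every analysis position costs six dictionary lookups in this scheme, and short keys such as か return many unrelated candidates that step (4) must then evaluate.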

However, with the word extraction method described above, an extremely large number of candidates may be extracted depending on the input character string, which causes erroneous analysis and lowers dictionary search speed.

Object

The present invention has been made in view of the above circumstances. Its object is to provide a word extraction method that eliminates the above problems of the conventional word extraction method, reducing erroneous analysis and improving dictionary search speed.

Configuration

The configuration of the present invention is described in detail below on the basis of an embodiment.

FIG. 1 is a block diagram outlining a kana-kanji conversion processing device that is one embodiment of the present invention, and FIG. 2 shows part of the contents of the kanji sound (on) table, which is its main component. In FIG. 1, 1 is a keyboard input unit, 2 is an input character string temporary storage unit, 3 is a kanji sound processing unit, 4 is a searched character string creation unit, 5 is a kanji sound table storage unit, 6 is a dictionary search unit, and 7 is a word dictionary. Note that FIG. 2 shows only one example of a kanji sound table, and the present invention is not limited to it.

The kanji sound processing unit 3 has a classification module for classifying processing results according to conditions described later, and a register for holding the classification result. Hereinafter, this register is called "TYPE".

The operation of this embodiment is described below using the same example sentence as in the Prior Art section, かいせきによりたんごの〜.

The searched character string creation unit 4 sets the six characters starting from the analysis start position, which is set in the same way as in the conventional method, in a buffer prepared in advance (see FIG. 3). This buffer need only be able to hold six characters one-dimensionally; hereinafter it is called "WINDOW".

Next, the character string in WINDOW is matched against each element of the kanji sound table shown in FIG. 2, and the string in WINDOW is segmented at the kanji sound element level. The result is set in a buffer or the like prepared in advance, in a form that can represent it concretely. Here, a buffer called WINDOW2, represented as a one-dimensional array of size 6, is prepared.

FIG. 4 shows an example of the segmentation applied to the character string in WINDOW and the resulting contents of WINDOW2. The arrows attached to WINDOW indicate the segment boundaries at the kanji sound level, and each number in WINDOW2 indicates that a kanji sound of that many characters exists in the kanji sound table.

Here:

WINDOW2(1) = 2 (corresponds to "かい")
WINDOW2(2) = 2 (corresponds to "せき")
WINDOW2(3) = 1 (corresponds to "に")
WINDOW2(4) = 1 (corresponds to "よ")
WINDOW2(5) = 0
WINDOW2(6) = 0
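One way WINDOW2 could be filled from WINDOW is sketched below. The text does not spell out the matching strategy, so this sketch assumes a greedy longest match at each position, with any kana that matches no table entry counted as a segment of length 1. `KANJI_SOUNDS` is a hypothetical excerpt, not the actual table of FIG. 2, and `fill_window2` is an illustrative name.

```python
# Hypothetical excerpt of a kanji sound table (readings of two or more kana).
KANJI_SOUNDS = {"かい", "せき", "だい", "がく", "りゃく", "のう", "よう"}

WINDOW_SIZE = 6


def fill_window2(window, table=KANJI_SOUNDS, size=WINDOW_SIZE):
    """Segment `window` and return the segment lengths, zero-padded to `size`."""
    max_sound_len = max(len(s) for s in table)
    lengths = []
    pos = 0
    while pos < len(window):
        step = 1  # a kana with no table match stands alone
        # Assumption: prefer the longest registered sound starting at `pos`.
        for n in range(min(max_sound_len, len(window) - pos), 1, -1):
            if window[pos:pos + n] in table:
                step = n
                break
        lengths.append(step)
        pos += step
    # Unused trailing slots stay 0, as in WINDOW2(5) and WINDOW2(6) above.
    return lengths + [0] * (size - len(lengths))


window2 = fill_window2("かいせきによ")
print(window2)  # [2, 2, 1, 1, 0, 0]
```

Python lists are 0-indexed, so `window2[0]` plays the role of WINDOW2(1) in the text.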

The results of the above processing are classified according to the following conditions.

(1) If WINDOW2(1) ≥ 2 and WINDOW2(2) ≥ 2, TYPE is set to "1".

(2) If WINDOW2(1) ≥ 2 and WINDOW2(2) = 1, TYPE is set to "2".

(3) If WINDOW2(1) = 1 and WINDOW2(2) ≥ 2, TYPE is set to "3".

(4) In all cases other than (1) to (3) above, TYPE is set to "4".
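The four conditions translate directly into a small function. The sketch below assumes WINDOW2 is held as a 0-indexed Python list, so `window2[0]` and `window2[1]` stand in for WINDOW2(1) and WINDOW2(2); the name `classify` is illustrative.

```python
def classify(window2):
    """Return the TYPE value (1 to 4) for a WINDOW2 array."""
    first, second = window2[0], window2[1]
    if first >= 2 and second >= 2:
        return 1
    if first >= 2 and second == 1:
        return 2
    if first == 1 and second >= 2:
        return 3
    return 4  # everything not covered by conditions (1) to (3)


# The WINDOW2 contents given above for かいせきによ fall under condition (1).
print(classify([2, 2, 1, 1, 0, 0]))  # 1
```

Only the first two segment lengths are inspected, so the classification costs the same regardless of what follows in the window.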

Thereafter, the searched character strings are created as follows according to the classification result.

If TYPE is set to any of "1" to "3", two searched character strings are created by cutting out of the input character string only the number of characters corresponding to WINDOW2(1) + WINDOW2(2) and to WINDOW2(1), respectively.

In the example sentence above, TYPE is "1", so two searched character strings are set: (1) かいせき and (2) かい. Dictionary search is then performed with these searched character strings in the same way as in the conventional method.
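The rule for creating searched character strings can be sketched as follows. The function name and argument layout are assumptions made for illustration; the TYPE 4 branch simply falls back to the conventional longest-first prefixes of up to six characters.

```python
def search_strings(text, start, window2, type_value, max_len=6):
    """Build the searched character strings for one analysis position."""
    if type_value in (1, 2, 3):
        first = window2[0]                 # WINDOW2(1)
        both = window2[0] + window2[1]     # WINDOW2(1) + WINDOW2(2)
        return [text[start:start + both], text[start:start + first]]
    # TYPE 4: conventional method, every prefix from longest to shortest.
    window = text[start:start + max_len]
    return [window[:n] for n in range(len(window), 0, -1)]


keys = search_strings("かいせきにより", 0, [2, 2, 1, 1, 0, 0], 1)
print(keys)  # ['かいせき', 'かい']
```

For TYPE 1 to 3 this means two dictionary lookups instead of six, and the single-kana key か, which returns many unrelated candidates, is never searched.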

When TYPE is "4", the searched character strings are created in exactly the same way as in the conventional method.

FIGS. 5(A) to 5(D) give a specific example of each classification to supplement the explanation. (A) corresponds to classification 1 (TYPE = 1), (B) to classification 2 (TYPE = 2), and so on. The input sentence and the searched character strings in each case are as follows.

(A) だいがくでは (大学では): (1) だいがく (2) だい
(B) りゃくぎにて (略儀にて): (1) りゃくぎ (2) りゃく
(C) かのうなばあい (可能な場合): (1) かのう (2) か
(D) このようにして (same as at left): (1) このようにし (2) このように (3) このよう (4) このよ (5) この (6) こ

In the above embodiment, WINDOW and WINDOW2 are both buffers six characters in size, but the size is not necessarily limited to six characters. Also, instead of a buffer such as WINDOW, it is possible to provide a buffer in which the input character string is set, a plurality of pointers indicating positions in that buffer, and registers or the like in which the values of those pointers are set.
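The pointer-based alternative might look roughly like the sketch below, which records segment boundaries as offsets into the input buffer instead of copying characters into a separate window. The function name, the greedy longest-match strategy, and the two-entry table are all assumptions made for illustration.

```python
def segment_pointers(text, start, table, span=6):
    """Return the end offset in `text` of each segment within `span` characters of `start`."""
    limit = min(len(text), start + span)
    max_sound_len = max(len(s) for s in table)
    pointers = []
    pos = start
    while pos < limit:
        step = 1  # a kana with no table match forms its own segment
        for n in range(min(max_sound_len, limit - pos), 1, -1):
            if text[pos:pos + n] in table:
                step = n
                break
        pos += step
        pointers.append(pos)  # pointer to the position just past this segment
    return pointers


text = "かいせきにより"
ptrs = segment_pointers(text, 0, {"かい", "せき"})
print(ptrs)              # [2, 4, 5, 6]
print(text[0:ptrs[1]])   # かいせき (first two segments)
print(text[0:ptrs[0]])   # かい (first segment only)
```

The segment lengths held in WINDOW2 can be recovered as differences between consecutive pointers, so the same classification applies without a second character buffer.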

Effects

As described above, according to the present invention, candidate words are cut out using kanji sounds when words are extracted, which has the notable effect of speeding up the extraction of words containing kanji.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention, FIG. 2 is a diagram showing part of the contents of the kanji sound table, FIG. 3 is a diagram showing an example of the contents of the input character string buffer, FIG. 4 is a diagram showing the input character string matched against the contents of the kanji sound table, and FIG. 5 is a diagram showing specific processing examples.

1: keyboard input unit, 2: input character string temporary storage unit, 3: kanji sound processing unit, 4: searched character string creation unit, 5: kanji sound table storage unit, 6: dictionary search unit, 7: word dictionary.

Patent applicant: Ricoh Co., Ltd.

Claims

[Claims]

(1) A word extraction method for a kana-kanji conversion processing device comprising word dictionary storage means for storing a plurality of words in correspondence with character strings representing their readings, means for temporarily storing an input kana character string, and means for searching the word dictionary, characterized in that there are provided table storage means in which kanji sounds having readings of two or more characters are registered, means for dividing the input kana character string using the kanji sounds, and means for classifying the form of the processing result of the dividing means, and in that a character string to be searched is created based on the classification result of the classifying means.