JPH05342194A

JPH05342194A - Continued kana/kanji clause conversion system

Info

Publication number: JPH05342194A
Application number: JP4177511A
Authority: JP
Inventors: Masaru Yokomori; 優横森
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1992-06-11
Filing date: 1992-06-11
Publication date: 1993-12-24

Abstract

PURPOSE: To make Japanese input smoother by improving the accuracy of kana-kanji conversion results. CONSTITUTION: A dictionary word reading unit 3 reads, from a dictionary file 8, the word information corresponding to the kana character string input from a data input unit 1 via an index input unit 2, and outputs it to a clause extraction unit 4. Based on this word information, the clause extraction unit 4 prepares clause candidates and outputs them to a priority evaluation unit 5. The priority evaluation unit 5 evaluates the priority of the clause candidates using the usage frequency information of the independent words they contain, decides on the clause candidates with the highest points, and outputs them to a conversion result output unit 6. The conversion result output unit 6 displays these clause candidates on a screen via a data output unit 7.

Description

Detailed Description of the Invention

[0001]

[Technical Field] The present invention relates to a continuous-clause kana-kanji conversion system, and more particularly to clause segmentation processing in such a system.

[0002]

[Prior Art] Conventionally, a continuous-clause kana-kanji conversion system uses, as the target of priority evaluation, the length of a clause (bunsetsu) consisting of a retrieved independent word and the ancillary words attached to it.

[0003] Because the dictionary supplied with a Japanese input system is built for general use, it stores a wide variety of words. It therefore contains many words that an individual user does not need or rarely uses. As a result, the retrieved independent word is often off the mark.

[0004] However, if priority evaluation relies only on the length of the clause formed from an independent word and ancillary words, the validity of the retrieved independent word cannot be evaluated directly. An incorrect clause is therefore created, which causes erroneous conversion.

[0005] Moreover, the longer a clause is made, the more it is favored in priority evaluation, so ancillary words end up being attached indiscriminately. This is also a cause of erroneous conversion.

[0006]

[Object of the Invention] The present invention has been made to eliminate the problems of the prior art described above. Its object is to provide a continuous-clause kana-kanji conversion system that can improve the accuracy of kana-kanji conversion results and make Japanese input smoother.

[0007]

[Constitution of the Invention] The continuous-clause kana-kanji conversion system according to the present invention converts a hiragana character string input as consecutive clauses into a mixed kanji-kana character string. It is characterized by comprising: storage means for storing, for each word, usage frequency information indicating the number of times that word has been used; creating means for creating, based on the usage frequency information stored in the storage means, evaluation information for determining the clause boundaries of the hiragana character string; and means for dividing the hiragana character string into clauses based on the evaluation information created by the creating means.

[0008]

[Embodiment] Next, an embodiment of the present invention will be described with reference to the drawings.

[0009] FIG. 1 is a block diagram showing the configuration of an embodiment of the present invention. In the figure, when a hiragana character string is input from the data input unit 1, it is output to the dictionary word reading unit 3 via the index input unit 2. When the string is entered at the data input unit 1 by romaji input or kana input, the index input unit 2 converts the entered string into hiragana notation.

[0010] The dictionary word reading unit 3 reads, from the dictionary file 8, the word information corresponding to the hiragana character string received from the index input unit 2 (hereinafter, the "index") and outputs it to the clause extraction unit 4. Because the dictionary file 8 stores word information on independent words, the dictionary word reading unit 3 retrieves every independent word that can be read from the dictionary file 8 using the index received from the index input unit 2.

[0011] The clause extraction unit 4 creates clause candidates based on the word information received from the dictionary word reading unit 3. That is, for each independent word retrieved by the dictionary word reading unit 3, the clause extraction unit 4 checks whether the immediately following character can be connected to it, in other words whether that character can be an ancillary word. It then outputs the clause candidates created from the result of this check to the priority evaluation unit 5.

[0012] The priority evaluation unit 5 evaluates the priority of the clause candidates received from the clause extraction unit 4 using the usage frequency information of the independent word in each candidate. It decides which clause candidates to adopt by selecting the combination that yields the highest points from the usage frequency information of the individual words, and outputs those candidates to the conversion result output unit 6. The conversion result output unit 6 displays the clause candidates received from the priority evaluation unit 5 on a screen (not shown) via the data output unit 7.
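The selection step described above can be illustrated with a small exhaustive search. This is only a sketch under stated assumptions: the table `CLAUSES`, its frequency values, and the function name `best_segmentation` are hypothetical and do not come from the patent figures, and the score used is the average frequency per clause that the later paragraphs describe.

```python
from functools import lru_cache

# Hypothetical clause table: reading -> (written form, usage frequency of
# the independent word). The values are illustrative assumptions only.
CLAUSES = {
    "きょう": ("今日", 12),
    "きょうみ": ("興味", 16),
    "みやざき": ("宮崎", 9),
    "みやざきまで": ("宮崎まで", 9),
    "やざきまで": ("矢崎まで", 2),
    "まで": ("まで", 5),
    "いく": ("行く", 7),
}


def best_segmentation(reading):
    """Return (converted text, priority) for the highest-scoring split."""

    @lru_cache(maxsize=None)
    def segmentations(pos):
        # Enumerate every way to cover reading[pos:] with known clauses.
        if pos == len(reading):
            return ((),)
        found = []
        for end in range(pos + 1, len(reading) + 1):
            piece = reading[pos:end]
            if piece in CLAUSES:
                for rest in segmentations(end):
                    found.append((piece,) + rest)
        return tuple(found)

    def score(segmentation):
        # Priority = total points / number of clauses.
        total = sum(CLAUSES[piece][1] for piece in segmentation)
        return total / len(segmentation)

    candidates = segmentations(0)
    if not candidates:
        return None
    best = max(candidates, key=score)
    return "".join(CLAUSES[piece][0] for piece in best), score(best)


print(best_segmentation("きょうみやざきまでいく"))
```

With this hypothetical table, three segmentations cover the whole string, and the one with the highest average frequency is chosen rather than the one with the longest first clause.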

[0013] FIG. 2 shows an example of romaji input from the data input unit 1 of FIG. 1, and FIG. 3 shows an example of kana input from the data input unit 1 of FIG. 1. These figures show that the sentence 「今日宮崎まで行く」 ("I am going to Miyazaki today") becomes 「KYOUMIYAZAKIMADEIKU」 when entered in romaji and 「キョウミヤサ゛キマテ゛イク」 when entered in kana.

[0014] FIG. 4 shows an example of the hiragana notation produced by the index input unit 2 of FIG. 1. In the figure, the index input unit 2 converts 「KYOUMIYAZAKIMADEIKU」 or 「キョウミヤサ゛キマテ゛イク」 from the data input unit 1 into the hiragana notation 「きょうみやざきまでいく」.
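As a rough illustration of the kana-input branch of this conversion, the sketch below turns a katakana string with separately typed voiced sound marks into hiragana. The function name `kana_to_hiragana` and the approach (Unicode NFC composition followed by a fixed code-point offset) are assumptions made for illustration; the patent does not describe how the index input unit performs the conversion, and romaji input would need a separate mapping table that is not shown here.

```python
import unicodedata


def kana_to_hiragana(text):
    """Convert katakana (with standalone voiced marks) to hiragana."""
    # A voiced mark typed as its own key arrives as a spacing character
    # (U+309B / U+309C). Swap it for the combining form so that NFC
    # normalization can merge it with the preceding kana.
    text = text.replace("\u309B", "\u3099").replace("\u309C", "\u309A")
    text = unicodedata.normalize("NFC", text)
    converted = []
    for ch in text:
        code = ord(ch)
        # Katakana U+30A1..U+30F6 sit exactly 0x60 above their hiragana.
        if 0x30A1 <= code <= 0x30F6:
            converted.append(chr(code - 0x60))
        else:
            converted.append(ch)
    return "".join(converted)


# The kana input example from the text, with U+309B after サ and テ.
print(kana_to_hiragana("キョウミヤサ\u309Bキマテ\u309Bイク"))
# prints きょうみやざきまでいく
```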

[0015] FIG. 5 shows an example of processing by the dictionary word reading unit 3 and the clause extraction unit 4 of FIG. 1. FIG. 5(a) shows an example of the word information that the dictionary word reading unit 3 retrieves based on the hiragana notation of FIG. 4. FIG. 5(b) shows examples of the clause candidates that the clause extraction unit 4 creates from the word information of FIG. 5(a).

[0016] FIG. 5(a) shows the case where the hiragana notation 「きょうみやざきまでいく」 of FIG. 4 is searched after being divided into the indexes 「き」, 「きょ」, 「きょう」, 「きょうみ」, 「う」, 「うみ」, 「み」, 「みや」, 「みやざき」, 「や」, 「やざき」, 「ざ」, 「ま」, 「まで」, 「で」, 「い」, and 「く」. In accordance with this search, the word notation information and usage frequency information corresponding to each of these indexes are read from the dictionary file 8 by the dictionary word reading unit 3 and output to the clause extraction unit 4.
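The lookup described here can be sketched as a scan over every substring of the hiragana string. The dictionary contents, the frequency values, and the name `lookup_all` below are hypothetical; the actual entries of the dictionary file appear only in FIG. 5(a), which is not reproduced in the text.

```python
# Hypothetical dictionary file: index -> list of (written form, frequency).
# All entries and numbers are illustrative assumptions.
DICTIONARY = {
    "きょう": [("今日", 12), ("京", 5)],
    "きょうみ": [("興味", 16)],
    "うみ": [("海", 8)],
    "みや": [("宮", 3)],
    "みやざき": [("宮崎", 9)],
    "やざき": [("矢崎", 2)],
    "い": [("行", 7)],
}


def lookup_all(reading, dictionary):
    """Return (start, index, written form, frequency) for every match."""
    hits = []
    for start in range(len(reading)):
        for end in range(start + 1, len(reading) + 1):
            index = reading[start:end]
            for surface, freq in dictionary.get(index, []):
                hits.append((start, index, surface, freq))
    return hits


for hit in lookup_all("きょうみやざきまでいく", DICTIONARY):
    print(hit)
```

Recording the start position with each hit lets a later step check which characters follow the word when it builds clause candidates.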

[0017] FIG. 5(b) shows the clause candidates created either from an independent word indicated by the word notation information of FIG. 5(a) alone, or by connecting the immediately following character to that independent word. Each clause candidate is output to the priority evaluation unit 5 together with its corresponding independent word and usage frequency information.

[0018] FIGS. 6 to 11 show examples of the results of priority evaluation by the priority evaluation unit 5 of FIG. 1. Each figure shows: the indexes, which indicate how the hiragana notation 「きょうみやざきまでいく」 of FIG. 4 is divided; the corresponding clause candidates and usage frequency information; the total points obtained by accumulating the usage frequency information; the number of clauses, which indicates into how many pieces the hiragana notation is divided; and the priority, which is the average usage frequency obtained as total points ÷ number of clauses.

[0019] FIG. 6 shows the case where the hiragana notation 「きょうみやざきまでいく」 of FIG. 4 is divided finely into 「き」, 「ょ」, 「う」, 「み」, 「や」, 「ざ」, 「き」, 「ま」, 「で」, 「い」, and 「く」. In this case, the total points obtained by accumulating the usage frequency information of the clause candidates are 47 and the number of clauses is 11, so the priority is 47 ÷ 11 = 4.3.

[0020] FIG. 7 shows the case where the hiragana notation 「きょうみやざきまでいく」 of FIG. 4 is divided into 「きょう」, 「みやざき」, 「まで」, and 「いく」. In this case, the total points are 33 and the number of clauses is 4, so the priority is 33 ÷ 4 = 8.3.

[0021] FIG. 8 shows the case where the hiragana notation 「きょうみやざきまでいく」 of FIG. 4 is divided into 「きょう」, 「みやざきまで」, and 「いく」. In this case, the total points are 28 and the number of clauses is 3, so the priority is 28 ÷ 3 = 9.3.

[0022] FIG. 9 also shows the case where the hiragana notation 「きょうみやざきまでいく」 of FIG. 4 is divided into 「きょう」, 「みやざきまで」, and 「いく」. In this case, however, 「京」, which is used less frequently than the 「今日」 selected in FIG. 8, is selected for the index 「きょう」. The total points are therefore 21 and the number of clauses is 3, so the priority is 21 ÷ 3 = 7.

[0023] FIG. 10 shows the case where the hiragana notation 「きょうみやざきまでいく」 of FIG. 4 is divided into 「きょうみ」, 「やざきまで」, and 「いく」. In this case, 「興味」 has a high usage frequency but 「矢崎」 has a low one, so the total points are 25 and the number of clauses is 3, giving a priority of 25 ÷ 3 = 8.3. The pattern shown in FIG. 10 is an output pattern often seen in conventional continuous-clause kana-kanji conversion.

[0024] FIG. 11 shows the case where the hiragana notation 「きょうみやざきまでいく」 of FIG. 4 is divided into 「きょうみや」, 「ざ」, 「き」, 「まで」, and 「いく」. In this case, the first clause is made as long as possible, but because 「ざき」 does not exist in the dictionary file 8, the second and subsequent clauses cannot be divided well, and words with low usage frequency have to be selected. The total points are therefore 39 and the number of clauses is 5, so the priority is 39 ÷ 5 = 7.8.

[0025] Among the processing results shown in FIGS. 6 to 11, the division shown in FIG. 8 has the highest priority, so the clause candidates shown in FIG. 8 are displayed on the screen from the conversion result output unit 6 via the data output unit 7.
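The comparison across FIGS. 6 to 11 can be checked with a few lines of code. The totals and clause counts below are the ones stated in the text; the variable names are assumptions for illustration. The unrounded quotients are compared, so the small gap between FIG. 8 (28 ÷ 3) and FIG. 10 (25 ÷ 3) is preserved.

```python
# (total points, number of clauses) as stated for each figure in the text.
RESULTS = {
    "FIG. 6": (47, 11),
    "FIG. 7": (33, 4),
    "FIG. 8": (28, 3),
    "FIG. 9": (21, 3),
    "FIG. 10": (25, 3),
    "FIG. 11": (39, 5),
}

# Priority = total points / number of clauses, kept unrounded.
priorities = {name: total / count for name, (total, count) in RESULTS.items()}

# The segmentation with the highest priority is the one that is output.
best = max(priorities, key=priorities.get)
print(best)  # prints FIG. 8
```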

[0026] In this way, the priority evaluation unit 5 determines the clause boundaries using the usage frequency information stored in the dictionary file 8 for each item of word information, which makes it possible to give preference to frequently used words. The accuracy of kana-kanji conversion results can therefore be improved, and Japanese input can be made smoother.

[0027]

[Effect of the Invention] As described above, according to the present invention, usage frequency information indicating the number of times each word has been used is stored, evaluation information for determining the clause boundaries of a hiragana character string is created based on the stored usage frequency information, and the hiragana character string is divided into clauses based on that evaluation information. This improves the accuracy of kana-kanji conversion results and makes Japanese input smoother.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of an embodiment of the present invention.

FIG. 2 is a diagram showing an example of romaji input from the data input unit of FIG. 1.

FIG. 3 is a diagram showing an example of kana input from the data input unit of FIG. 1.

FIG. 4 is a diagram showing an example of the hiragana notation produced by the index input unit of FIG. 1.

FIG. 5 is a diagram showing an example of processing by the dictionary word reading unit and the clause extraction unit of FIG. 1.

FIG. 6 is a diagram showing an example of the result of priority evaluation by the priority evaluation unit of FIG. 1.

FIG. 7 is a diagram showing an example of the result of priority evaluation by the priority evaluation unit of FIG. 1.

FIG. 8 is a diagram showing an example of the result of priority evaluation by the priority evaluation unit of FIG. 1.

FIG. 9 is a diagram showing an example of the result of priority evaluation by the priority evaluation unit of FIG. 1.

FIG. 10 is a diagram showing an example of the result of priority evaluation by the priority evaluation unit of FIG. 1.

FIG. 11 is a diagram showing an example of the result of priority evaluation by the priority evaluation unit of FIG. 1.

[Explanation of symbols]

2 Index input unit
3 Dictionary word reading unit
4 Clause extraction unit
5 Priority evaluation unit
8 Dictionary file

Claims

[Claims]

1. A continuous-clause kana-kanji conversion system for converting a hiragana character string input as consecutive clauses into a mixed kanji-kana character string, comprising: storage means for storing, for each word, usage frequency information indicating the number of times that word has been used; creating means for creating, based on the usage frequency information stored in the storage means, evaluation information for determining the clause boundaries of the hiragana character string; and means for dividing the hiragana character string into clauses based on the evaluation information created by the creating means.

2. The continuous-clause kana-kanji conversion system according to claim 1, wherein the creating means comprises: extracting means for extracting a usage frequency for each clause division based on the usage frequency information in the storage means; accumulating means for accumulating the usage frequencies extracted by the extracting means for each clause division; and calculating means for calculating, from the accumulation result of the accumulating means, an average value per clause division in the hiragana character string, and wherein the average value calculated by the calculating means is used as the evaluation information.