JP5630138B2 - Sentence creation program and sentence creation apparatus

Info

Publication number: JP5630138B2
Application number: JP2010180772A
Authority: JP
Inventors: 基行鷹合; 洋平山根; 圭悟服部; 増市　博; 博増市
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2010-08-12
Filing date: 2010-08-12
Publication date: 2014-11-26
Anticipated expiration: 2030-08-12
Also published as: JP2012042991A

Description

The present invention relates to a sentence creation program and a sentence creation apparatus.

A technique has been proposed that, given a sentence with a simple sentence pattern and a phrase as input, creates a sentence with a more complex syntax than the simple sentence.

As a related technique, Patent Document 1 discloses a sentence creation apparatus that, while the body of a text is being input, accepts a modifying clause entered separately from the body, parses the body and the modifying clause, extracts from the body multiple candidate phrases that the modifying clause can modify, selects the candidate phrase that best fits the modifying clause as the modified phrase, determines the optimal insertion position, conjugated form, and the like of the modifying clause for modifying the selected phrase, and inserts the clause into the body.

Patent Document 1: JP-A-11-259480

An object of the present invention is to provide a sentence creation program and a sentence creation apparatus that create a sentence from a character string when the input character string does not constitute a sentence.

[1] A sentence creation program for causing a computer to function as:
receiving means for receiving a character string;
dividing means for dividing the character string received by the receiving means into words;
expansion means for expanding the words divided by the dividing means by a predetermined method to generate an extended character string;
first estimation means for calculating a first value for each extended character string using a semantic preservation rate value associated in advance with the predetermined method used by the expansion means, and for estimating, from the first value, the degree to which the meaning of the extended character string generated by the expansion means matches the meaning of the received character string;
evaluation means for evaluating, based on the first value, the validity of the method used by the expansion means to generate the extended character string; and
output means for outputting, based on the evaluation result of the evaluation means, the extended character string as a candidate for a sentence created from the received character string.

[2] The sentence creation program according to [1], wherein the expansion means generates the extended character string using, as the predetermined method, insertion of a word between the words divided by the dividing means, a change of the conjugated form of a divided word, or replacement of a divided word with a synonym, or using, in order, a plurality of predetermined methods included in a combination of these predetermined methods.

[3] The sentence creation program according to [2], wherein the expansion means sets an upper limit on the number of predetermined methods used in the combination of predetermined methods or on the number of combinations.

[4] The sentence creation program according to [2] or [3], wherein the first estimation means estimates the first value using each of the semantic preservation rate values associated in advance with the predetermined methods included in the combination formed by the expansion means.

[5] The sentence creation program according to any one of [1] to [4], further causing the computer to function as second estimation means for parsing the extended character string to calculate a second value and for estimating, from the second value, the syntactic likelihood of the extended character string,
wherein the evaluation means evaluates, based on the first value and the second value, the validity of the method used by the expansion means to generate the extended character string.

[6] The sentence creation program according to [5], further causing the computer to function as third estimation means for calculating a third value based on the difference in the number of characters between the extended character string and the character string or on the number of expansion methods used by the expansion means, and for estimating, from the third value, the processing cost required to edit the extended character string,
wherein the evaluation means evaluates, based on the first value, the second value, and the third value, the validity of the method used by the expansion means to generate the extended character string.

[7] The sentence creation program according to any one of [1] to [6], wherein the dividing means divides the character string received by the receiving means into words and, when a divided word partially matches a word contained in a word dictionary prepared in advance, replaces the divided word with the matching word.

[8] A sentence creation apparatus comprising:
receiving means for receiving a character string;
dividing means for dividing the character string received by the receiving means into words;
expansion means for expanding the words divided by the dividing means by a predetermined method to generate an extended character string;
first estimation means for calculating a first value for each extended character string using a semantic preservation rate value associated in advance with the predetermined method used by the expansion means, and for estimating, from the first value, the degree to which the meaning of the extended character string generated by the expansion means matches the meaning of the received character string;
evaluation means for evaluating, based on the first value, the validity of the method used by the expansion means to generate the extended character string; and
output means for outputting, based on the evaluation result of the evaluation means, the extended character string as a candidate for a sentence created from the received character string.

According to the invention of claim 1 or 8, a sentence can be created from a character string when the input character string does not constitute a sentence.

According to the invention of claim 2, insertion of a word between the divided words, a change of the conjugated form of a divided word, replacement of a divided word with a synonym, or a combination of these can be used as the predetermined method.

According to the invention of claim 3, an upper limit can be placed on the number of extended character strings generated.

According to the invention of claim 4, the first value can be calculated using values determined according to the predetermined methods.

According to the invention of claim 5, the validity of the method used by the expansion means to generate the extended character string can be evaluated taking syntactic likelihood into account.

According to the invention of claim 6, the validity of the method used by the expansion means to generate the extended character string can be evaluated taking the processing cost required for editing into account.

According to the invention of claim 7, a sentence can be created from a character string even when an incomplete word is received.

FIG. 1 is a block diagram showing a configuration example of the sentence creation apparatus.
FIG. 2 is a schematic diagram showing an example of the semantic preservation rate information.
FIG. 3 is a schematic diagram showing an input character string input to the sentence creation apparatus.
FIGS. 4(a) to 4(g) are schematic diagrams showing the input character string divided by the character string dividing unit and the extended character strings generated by the character string expansion unit.
FIGS. 5(a) to 5(g) are schematic diagrams showing the likelihoods of the input character string divided by the character string dividing unit and of the extended character strings generated by the character string expansion unit.
FIGS. 6(a) to 6(f) are schematic diagrams showing the evaluation values of the extended character strings calculated by the extended character string evaluation unit.
FIG. 7 is a flowchart showing an operation example of the sentence creation apparatus.

(Configuration of the sentence creation apparatus)
FIG. 1 is a block diagram showing a configuration example of the sentence creation apparatus.

The sentence creation apparatus 1 includes a control unit 10, which consists of a CPU or the like, controls each unit, and executes various programs; a storage unit 11, which consists of a storage medium such as an HDD (Hard Disk Drive) or flash memory and stores information; an operation unit 12 such as a keyboard and a mouse; and a display unit 13 such as a liquid crystal display. The sentence creation apparatus 1 is, for example, an electronic device such as a personal computer, a PDA, or a mobile phone, and creates a sentence from a received character string. The sentence creation apparatus 1 may also be something like a server apparatus that has neither the operation unit 12 nor the display unit 13, in which case the operation unit and display unit of a terminal apparatus connected via a network or the like take over those functions.

By executing a sentence creation program 110 (described later), the control unit 10 functions as a character string receiving unit 100, a character string dividing unit 101, a character string expansion unit 102, a semantic preservation rate estimation unit 103, a likelihood estimation unit 104, an extended character string evaluation unit 105, a sentence candidate output unit 106, and the like.

The character string receiving unit 100 may receive, as text information or the like, a character string input through operation of the operation unit 12, or may acquire a character string prepared in advance. It may also receive a character string from outside via a communication unit (not shown).

The character string dividing unit 101 divides the character string received by the character string receiving unit 100 into words or the like. As a concrete way of dividing the input character string, the character string dividing unit 101 may recognize specific symbols contained in it, such as spaces, and divide at those symbols, or may divide it by performing morphological analysis or the like.

The character string expansion unit 102 expands the character string either by applying once any one of several expansion methods, such as inserting a word or particle between the words of the character string divided by the character string dividing unit 101, replacing a word, changing the conjugated form of a word, deleting a word, or changing the order of the words, or by combining arbitrary expansion methods and applying them in sequence. Hereinafter, an expanded character string is referred to as an "extended character string". The character string expansion unit 102 generates multiple extended character strings by varying the expansion method. The character string expansion unit 102 may set a predetermined upper limit on the number of methods used in a combination of expansion methods, or on the number of combinations of expansion methods. Alternatively, it may apply a single expansion method or combine arbitrary expansion methods until the semantic preservation rate of the whole extended character string (described later) falls below a predetermined lower limit.
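The enumeration of expansion-method sequences with an upper limit on the number of methods could be sketched as follows. This is only an illustration under stated assumptions: the helper names (`insert_after`, `substitute`, `expand`), the operation labels, and the `max_methods` parameter are hypothetical and do not come from the patent, and the sketch enumerates sequences blindly without any linguistic knowledge.

```python
from itertools import product


def insert_after(target, particle):
    """Build an operation that inserts `particle` right after `target`."""
    def op(words):
        i = words.index(target)
        return words[: i + 1] + [particle] + words[i + 1:]
    return op


def substitute(target, replacement):
    """Build an operation that swaps `target` for `replacement`
    (usable for a conjugation change or a synonym replacement)."""
    def op(words):
        return [replacement if w == target else w for w in words]
    return op


# Hypothetical operation set; the labels are illustrative only.
OPERATIONS = {
    "insert_ga": insert_after("子供", "が"),
    "conjugate": substitute("書く", "書いた"),
}


def expand(words, operations, max_methods=2):
    """Apply every ordered sequence of 1..max_methods operations.

    Returns {extended string: tuple of operation labels}, keeping the
    first sequence that produced each distinct string.
    """
    results = {}
    for n in range(1, max_methods + 1):
        for names in product(operations, repeat=n):
            current = list(words)
            for name in names:
                current = operations[name](current)
            results.setdefault(" ".join(current), names)
    return results


candidates = expand(["子供", "書く", "本"], OPERATIONS)
for text, methods in candidates.items():
    print(text, methods)
```

Capping `max_methods` bounds the number of generated strings, which is the role the upper limit plays in the description. Applying the same insertion twice yields the doubled-particle string that the description later treats as unnatural.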

The semantic preservation rate estimation unit 103 estimates a "semantic preservation rate", which indicates how much of the meaning of the input character string is preserved in an extended character string generated by the character string expansion unit 102, that is, the degree to which the two have the same meaning. The semantic preservation rate of the whole extended character string is calculated from the semantic preservation rates quantitatively defined in advance, in the semantic preservation rate information 111 (described later), for each of the expansion methods executed by the character string expansion unit 102; for example, it is obtained as the product of the rates defined for the expansion methods used.

The likelihood estimation unit 104 estimates a "likelihood", which indicates how plausible the syntax of an extended character string is as a sentence in the language. The likelihood may be calculated from a probabilistic language model or from parsing by a parser. In the present embodiment, it is calculated using a bi-gram probabilistic language model.
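A bi-gram likelihood of this kind could be sketched as below. The patent does not specify the training data or any smoothing, so the toy corpus, the add-k smoothing, and the names `train_bigram` and `likelihood` are all assumptions made for illustration.

```python
from collections import Counter


def train_bigram(corpus):
    """Count context words and word pairs, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams


def likelihood(words, unigrams, bigrams, vocab_size, k=1.0):
    """Product of add-k smoothed conditional probabilities P(w_i | w_{i-1})."""
    tokens = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        p *= (bigrams[(prev, cur)] + k) / (unigrams[prev] + k * vocab_size)
    return p


# Toy corpus, invented for this sketch.
corpus = [
    ["子供", "が", "書いた", "本"],
    ["子供", "が", "書く", "本"],
    ["大人", "が", "読む", "本"],
]
unigrams, bigrams = train_bigram(corpus)
vocab_size = len({w for s in corpus for w in s} | {"</s>"})

natural = likelihood(["子供", "が", "書く", "本"], unigrams, bigrams, vocab_size)
doubled = likelihood(["子供", "が", "が", "書く", "本"], unigrams, bigrams, vocab_size)
print(natural, doubled)
```

Under this model the string with the repeated particle scores lower because it contains a word pair never seen in the corpus.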

The extended character string evaluation unit 105 calculates an evaluation value for each extended character string based on the semantic preservation rate estimated by the semantic preservation rate estimation unit 103 and the likelihood estimated by the likelihood estimation unit 104.

The sentence candidate output unit 106 outputs sentence candidates from among the multiple extended character strings based on the evaluation values calculated by the extended character string evaluation unit 105.

The storage unit 11 stores the sentence creation program 110, which causes the control unit 10 to operate as the units 100 to 106 described above; semantic preservation rate information 111, which defines a predetermined semantic preservation rate for each method by which the character string expansion unit 102 expands a character string; and sentence candidate information 112 output by the sentence candidate output unit 106.

FIG. 2 is a schematic diagram showing an example of the semantic preservation rate information 111.

The semantic preservation rate information 111 has an expansion ID column 111a for identifying each expansion method, an expansion method column 111b indicating the specific content of each expansion method, and a semantic preservation rate column 111c indicating the predetermined semantic preservation rate of each expansion method.

(Operation of the sentence creation apparatus)
An operation example of the sentence creation apparatus 1 is described below with reference to FIGS. 1 to 7, divided into (1) the character string expansion operation and (2) the extended character string evaluation operation.

(1) Character string expansion operation
First, the user operates the operation unit 12 of the sentence creation apparatus 1 to input a desired character string.

FIG. 7 is a flowchart showing an operation example of the sentence creation apparatus 1.

The character string receiving unit 100 receives, as text information or the like, the character string input through operation of the operation unit 12 (S1).

FIG. 3 is a schematic diagram showing an input character string input to the sentence creation apparatus 1.

The input character string 100A is, for example, 「子供 書く 本」 ("child", "write", "book"), with a space inserted between 「子供」 and 「書く」 and between 「書く」 and 「本」.

Next, the character string dividing unit 101 recognizes the spaces contained in the input character string 100A received by the character string receiving unit 100 and divides it into the words 「子供」 (child), 「書く」 (write), and 「本」 (book) (S2).

FIGS. 4(a) to 4(g) are schematic diagrams illustrating the input character string 100A divided by the character string dividing unit 101 and the extended character strings generated by the character string expansion unit 102.

As shown in FIG. 4(a), the character string dividing unit 101 divides the input character string 100A into words 101a to 101c.
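The space-based division in step S2 could be sketched as follows. The function name `split_words` is hypothetical, and the morphological-analysis alternative mentioned in the description is not shown.

```python
def split_words(text):
    """Split an input string into words at runs of whitespace.

    str.split() with no argument treats any Unicode whitespace as a
    separator, so the full-width space (U+3000) common in Japanese
    input is handled as well as the ASCII space.
    """
    return text.split()


words = split_words("子供 書く 本")
print(words)
```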

Next, the character string expansion unit 102 expands the input character string 100A divided by the character string dividing unit 101 (S3). There are multiple expansion methods, and expansion yields, for example, the extended character strings shown below.

As shown in FIG. 4(b), the character string expansion unit 102 inserts a particle 102a, 「が」 (ga), between the words 101a and 101b of the input character string 100A to generate an extended character string 100B.

As shown in FIG. 4(c), the character string expansion unit 102 inserts a particle 102b, 「が」, between the words 101a and 101b of the input character string 100A and replaces the word 101b, 「書く」, with a word 102c, 「書いた」 (wrote), which is a conjugated form of it, to generate an extended character string 100C.

As shown in FIG. 4(d), the character string expansion unit 102 inserts a particle 102d, 「に」 (ni), between the words 101a and 101b of the input character string 100A and replaces the word 101b, 「書く」, with a word 102e, 「書いた」, which is a conjugated form of it, to generate an extended character string 100D.

As shown in FIG. 4(e), the character string expansion unit 102 inserts a particle 102f, 「が」, between the words 101a and 101b of the input character string 100A, replaces the word 101b, 「書く」, with a word 102g, 「書いた」, which is a conjugated form of it, and replaces the word 101c, 「本」, with a word 102h, its synonym 「書籍」 (book, publication), to generate an extended character string 100E.

As shown in FIG. 4(f), the character string expansion unit 102 inserts a particle 102j, 「が」, between the words 101a and 101b of the input character string 100A, replaces the word 101b, 「書く」, with a word 102k, 「書いた」, which is a conjugated form of it, and replaces the word 101a, 「子供」, with a word 102i, 「大人」 (adult), which is not a synonym, to generate an extended character string 100F.

As shown in FIG. 4(g), the character string expansion unit 102 inserts two particles 102l and 102m, both 「が」, between the words 101a and 101b of the input character string 100A to generate an extended character string 100G.

Next, the semantic preservation rate estimation unit 103 estimates, based on the semantic preservation rate information 111, the semantic preservation rate with respect to the input character string 100A of each of the extended character strings 100B to 100G generated by the character string expansion unit 102 (S4).

As shown in FIG. 4(b), the extended character string 100B is obtained by expanding the input character string 100A through insertion of 「が」. This corresponds to "insertion of a word other than an independent word" in the expansion method column 111b of the semantic preservation rate information 111, so α = 0.95 from the semantic preservation rate column 111c. Since no other method is used, the semantic preservation rate of the whole extended character string 100B is A = 0.95.

As shown in FIG. 4(c), the extended character string 100C is obtained by expanding the input character string 100A through insertion of 「が」 and replacement with another conjugated form. These correspond to "insertion of a word other than an independent word" and "replacement with another conjugated form" in the expansion method column 111b of the semantic preservation rate information 111, so α = 0.95 and α = 0.95 from the semantic preservation rate column 111c. From the product of these rates, the semantic preservation rate of the whole extended character string 100C is A = 0.95 × 0.95 = 0.9025.
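The product rule used for 100B and 100C can be written as a short sketch. The dictionary keys and the function name are hypothetical labels chosen for this example; only the two rate values of 0.95 are taken from the description, and the rates of the other expansion methods in FIG. 2 are not reproduced here.

```python
from math import prod

# Per-method semantic preservation rates (alpha). The keys are
# illustrative labels; only these two values appear in the description.
PRESERVATION_RATE = {
    "insert_non_independent_word": 0.95,
    "replace_with_other_conjugated_form": 0.95,
}


def overall_preservation_rate(methods):
    """Overall rate A: the product of the rates of all methods applied."""
    return prod(PRESERVATION_RATE[m] for m in methods)


rate_100b = overall_preservation_rate(["insert_non_independent_word"])
rate_100c = overall_preservation_rate(
    ["insert_non_independent_word", "replace_with_other_conjugated_form"]
)
print(rate_100b, rate_100c)
```

Because every rate is at most 1, each additional expansion step can only lower A, which is what makes a lower limit on A usable as a stopping condition for expansion.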

By the same calculation, the semantic preservation rates A of the extended character strings 100D to 100G are obtained as the values shown in FIGS. 4(d) to 4(g).

Next, the likelihood estimation unit 104 estimates the "likelihood", which indicates how plausible each of the extended character strings 100B to 100G is as a sentence (S5).

FIGS. 5(a) to 5(g) are schematic diagrams illustrating the likelihoods of the input character string 100A divided by the character string dividing unit 101 and of the extended character strings generated by the character string expansion unit 102.

The likelihood β is calculated using a bi-gram probabilistic language model. For example, as shown in FIG. 5(g), the extended character string 100G is an unnatural sentence in which 「が」 occurs twice in succession, so its likelihood is β = 0.000000001, smaller than the likelihoods β of the extended character strings 100B to 100F shown in FIGS. 5(b) to 5(f).

Next, the extended character string evaluation unit 105 calculates the evaluation value X of each extended character string as the product of the semantic preservation rate A estimated by the semantic preservation rate estimation unit 103 and the likelihood β estimated by the likelihood estimation unit 104 (S6).

FIGS. 6(a) to 6(f) are schematic diagrams illustrating the evaluation values of the extended character strings calculated by the extended character string evaluation unit 105.

As shown in FIG. 6(a), the evaluation value of the extended character string 100B is calculated from the product of the semantic preservation rate A and the likelihood β as X = A × β = 0.95 × 0.006 = 0.0057. By the same calculation, the evaluation values of the extended character strings 100C to 100G are obtained as the values shown in FIGS. 6(b) to 6(f).

The sentence candidate output unit 106 outputs sentence candidates from among the multiple extended character strings based on the evaluation values calculated by the extended character string evaluation unit 105 (S7). Because a larger evaluation value indicates that the meaning is preserved and that the string is more plausible as a sentence, the sentence candidate output unit 106 ranks the extended character strings in the order 100D, 100C, 100B, 100E, 100F, 100G and stores them in the storage unit 11 as the sentence candidate information 112.
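Scoring and ranking by X = A × β could look like the sketch below. Only the values for 100B (A = 0.95, β = 0.006) come from the description; the other two entries are placeholders with invented numbers, included solely to show the sort, and the `Candidate` class and `rank` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    preservation_rate: float  # A
    likelihood: float         # beta

    @property
    def score(self) -> float:
        """Evaluation value X = A * beta."""
        return self.preservation_rate * self.likelihood


def rank(candidates):
    """Order candidates from highest to lowest evaluation value."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)


candidates = [
    Candidate("子供 が 書く 本", 0.95, 0.006),       # 100B, values from the description
    Candidate("placeholder high", 0.9025, 0.02),  # hypothetical values
    Candidate("placeholder low", 0.9025, 1e-9),   # hypothetical values
]
ranked = rank(candidates)
for c in ranked:
    print(c.text, c.score)
```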

The sentence candidate output unit 106 may also display the sentence candidate information 112 on the display unit 13, giving priority to candidates with high evaluation values.

[Other embodiments]
The present invention is not limited to the above embodiment, and various modifications are possible without departing from the gist of the invention. For example, the semantic preservation rate estimation unit 103 need not rely on the semantic preservation rate information 111 given in advance and may instead calculate the semantic preservation rate dynamically. For example, when the character string expansion unit 102 replaces a word, the degree of synonymy between the word before replacement and the word after replacement may be used as the estimated semantic preservation rate. The degree of synonymy between words can be obtained, for example, by referring to a database that stores, for each word, its synonyms together with their degrees of synonymy, or by calculating it dynamically using a thesaurus with a network structure that describes the relationships between words.

The sentence creation apparatus 1 may also be combined with a natural sentence search program that performs a natural sentence search using an input sentence (rather than words). In the natural sentence search, the user enters keywords, selects from the sentence candidates created by the sentence creation apparatus 1 the candidate most likely to match the user's intent, and can thereby run a natural sentence search without typing a natural sentence.

The extended character string evaluation unit 105 may also calculate the evaluation value of an extended character string based on an editing cost, in addition to the semantic preservation rate estimated by the semantic preservation rate estimation unit 103 and the likelihood estimated by the likelihood estimation unit 104. Here, the "editing cost" is obtained from, for example, the number of characters in the extended character string minus the number of characters in the character string before expansion (taken as 0 when the result is negative), or a value based on the number of expansion methods executed. The upper limit described earlier, on the number of methods used in a combination of expansion methods executed by the character string expansion unit 102 or on the number of combinations of expansion methods, may also be determined based on the editing cost.
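The two editing-cost measures named above can be sketched as follows. The function names are hypothetical, the strings are assumed to be compared without word separators, and how the cost is folded into the evaluation value is left open because the description does not state it.

```python
def edit_cost_by_length(original, extended):
    """Characters added by the expansion, clamped to 0 when the
    extended string is shorter than the original."""
    return max(0, len(extended) - len(original))


def edit_cost_by_methods(methods):
    """Number of expansion methods that were executed."""
    return len(methods)


added = edit_cost_by_length("子供書く本", "子供が書いた本")
shrunk = edit_cost_by_length("子供書く本", "子供本")
steps = edit_cost_by_methods(["insert_particle", "change_conjugation"])
print(added, shrunk, steps)
```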

When the character string received by the character string receiving unit 100 contains an incomplete word, such as "compu" (コンピュ) in "store compu" (店コンピュ), the character string dividing unit 101 may look up "compu" in a word dictionary prepared in advance using a search such as prefix matching, treat a word matched by the search, for example "computer" (コンピュータ), as a word contained in the character string, and output it to the character string expansion unit 102.
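A prefix-match lookup of this kind might look like the sketch below. The dictionary contents and the function name are hypothetical, and a sorted list with binary search is only one of several ways to implement prefix matching.

```python
from bisect import bisect_left

# Hypothetical word dictionary, kept sorted so that words sharing a prefix
# are adjacent.
WORD_DICTIONARY = sorted(
    ["コンピュータ", "コンビニ", "コンピューティング", "店", "買う"]
)


def complete_by_prefix(fragment, dictionary):
    """Return all words in the sorted dictionary that start with fragment."""
    start = bisect_left(dictionary, fragment)
    matches = []
    for word in dictionary[start:]:
        if not word.startswith(fragment):
            break  # sorted order: no later word can share the prefix
        matches.append(word)
    return matches


candidates = complete_by_prefix("コンピュ", WORD_DICTIONARY)
```

Here `candidates` contains "コンピュータ" and "コンピューティング", either of which could then be passed on as the completed word. How to choose among several matches is left open in this sketch.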

The sentence creation program 110 may also be provided stored on a storage medium such as a CD-ROM, or may be downloaded from a server device or the like connected to a network such as the Internet into a storage unit in the device. Some or all of the character string receiving unit 100, the character string dividing unit 101, the character string expansion unit 102, the semantic preservation rate estimation unit 103, the likelihood estimation unit 104, the extended character string evaluation unit 105, and the sentence candidate output unit 106 may be implemented in hardware such as an ASIC. The steps shown in the description of operation of the above embodiment may be reordered, omitted, or supplemented with additional steps.

DESCRIPTION OF SYMBOLS: 1: sentence creation device; 10: control unit; 11: storage unit; 12: operation unit; 13: display unit; 100: character string receiving unit; 100A: input character string; 100B-100G: extended character strings; 101: character string dividing unit; 101a-101c: words; 102: character string expansion unit; 102a: particle; 102b: particle; 102c: word; 102d: particle; 102e: word; 102f: particle; 102g: word; 102h: word; 102i: word; 102j: particle; 102k: word; 102l: particle; 103: semantic preservation rate estimation unit; 104: likelihood estimation unit; 105: extended character string evaluation unit; 106: sentence candidate output unit; 110: sentence creation program; 111: semantic preservation rate information; 111a: expansion ID column; 111b: expansion method column; 111c: semantic preservation rate column; 112: sentence candidate information

Claims

A sentence creation program for causing a computer to function as:
receiving means for receiving a character string;
dividing means for dividing the character string received by the receiving means into words;
expansion means for expanding the words divided by the dividing means by a predetermined method to generate extended character strings;
first estimating means for calculating a first value for each extended character string using a value of a semantic preservation rate associated in advance with the predetermined method used by the expansion means, and for estimating, based on the first value, a degree of coincidence between the meaning of the extended character string expanded by the expansion means and the meaning of the received character string;
evaluation means for evaluating, based on the first value, the validity of the method used by the expansion means to generate the extended character string; and
output means for outputting the extended character string, based on the evaluation result of the evaluation means, as a sentence candidate created from the received character string.

The sentence creation program according to claim 1, wherein the expansion means generates the extended character string by using, as the predetermined method, insertion of a word between the words divided by the dividing means, a change of the conjugated form of a divided word, or replacement of a divided word with a synonym, or by sequentially applying a plurality of predetermined methods included in a combination of these predetermined methods.

The sentence creation program according to claim 2, wherein the expansion means sets an upper limit on the number of the predetermined methods used in the combination of the predetermined methods or on the number of the combinations.

The sentence creation program according to claim 2 or 3, wherein the first estimating means estimates the first value using the values of the semantic preservation rate associated with the predetermined methods included in the combination combined by the expansion means.

The sentence creation program according to any one of claims 1 to 4, further causing the computer to function as second estimating means for parsing the extended character string to calculate a second value and for estimating, based on the second value, the likelihood of the extended character string as a syntactic structure,
wherein the evaluation means evaluates, based on the first value and the second value, the validity of the method used by the expansion means to generate the extended character string.

The sentence creation program according to claim 5, further causing the computer to function as third estimating means for calculating a third value based on the difference in the number of characters between the extended character string and the character string, or on the number of expansion methods used by the expansion means, and for estimating, based on the third value, the processing cost required to edit the extended character string,
wherein the evaluation means evaluates, based on the first value, the second value, and the third value, the validity of the method used by the expansion means to generate the extended character string.

The sentence creation program according to any one of claims 1 to 6, wherein the dividing means divides the character string received by the receiving means into words and, when a divided word partially matches a word included in a word dictionary prepared in advance, replaces the divided word with the matched word.

A sentence creation device comprising:
receiving means for receiving a character string;
dividing means for dividing the character string received by the receiving means into words;
expansion means for expanding the words divided by the dividing means by a predetermined method to generate extended character strings;
first estimating means for calculating a first value for each extended character string using a value of a semantic preservation rate associated in advance with the predetermined method used by the expansion means, and for estimating, based on the first value, a degree of coincidence between the meaning of the extended character string expanded by the expansion means and the meaning of the received character string;
evaluation means for evaluating, based on the first value, the validity of the method used by the expansion means to generate the extended character string; and
output means for outputting the extended character string, based on the evaluation result of the evaluation means, as a sentence candidate created from the received character string.