JP2015095182A - Character string processing device, method, and program - Google Patents

Info

Publication number
JP2015095182A
JP2015095182A JP2013235384A JP2013235384A JP2015095182A JP 2015095182 A JP2015095182 A JP 2015095182A JP 2013235384 A JP2013235384 A JP 2013235384A JP 2013235384 A JP2013235384 A JP 2013235384A JP 2015095182 A JP2015095182 A JP 2015095182A
Authority
JP
Japan
Prior art keywords
character string
character
partial
replacement
input
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2013235384A
Other languages
Japanese (ja)
Other versions
JP5833087B2 (en
Inventor
Akira Ono (小野 朗)
Masahide Mizushima (水島 昌英)
Hajime Uchino (内野 一)
Sen Yoshida (吉田 仙)
Takafumi Hikichi (引地 孝文)
Taichi Katayama (片山 太一)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Application filed by Nippon Telegraph and Telephone Corp
Priority to JP2013235384A
Publication of JP2015095182A
Application granted
Publication of JP5833087B2
Legal status: Active

Landscapes

  • Machine Translation (AREA)

Abstract

PROBLEM TO BE SOLVED: To obtain the character positions of an input character string before replacement, even when character positions change as a result of performing replacement processing on the input character string. SOLUTION: A character string processing device 10 includes a replacement unit 14 that replaces a first partial character string included in an input character string with a second partial character string on the basis of a set of predetermined replacement rules for replacing the first partial character string with the second partial character string, and that generates and assigns character position information and length information of the second partial character string in the input character string after the replacement, on the basis of the character position information and length information of the first partial character string in the input character string before the replacement.

Description

The present invention relates to a character string processing device, method, and program, and more particularly to a character string processing device, method, and program for processing a character string before translation in machine translation.

Machine translation, in which a computer program mechanically converts a sentence in one natural language into a sentence in another natural language, has long been known. In recent years, statistical machine translation, which performs machine translation automatically using statistical models, has also become known (see, for example, Non-Patent Document 1).

In statistical machine translation, for example when an English sentence is translated into a Japanese sentence, there is frequent demand to know which character position in the pre-translation English sentence each character of the resulting Japanese sentence corresponds to.

In addition, statistical machine translation commonly applies replacement processing during pre-translation processing, such as converting mathematical formulas, chemical formulas, long proper nouns, and the like into virtual words, or deleting articles.

Non-Patent Document 1: P. Koehn et al. 2007. "Moses: Open Source Toolkit for Statistical Machine Translation"

However, general statistical machine translation only aligns phrases before and after translation. Consequently, if character positions change because the above pre-translation processing is applied to the input sentence to be translated, it becomes difficult to associate character positions of the input sentence with the translated sentence.

The present invention has been made in view of the above circumstances, and its object is to provide a character string processing device, method, and program that can obtain the character positions of the input character string before replacement, even when character positions change as a result of performing replacement processing on the input character string.

To achieve the above object, the character string processing device of the present invention includes a replacement unit that, on the basis of a set of predetermined replacement rules for replacing a first partial character string with a second partial character string, replaces the first partial character string included in an input character string with the second partial character string according to the replacement rules, and that generates and assigns character position information and length information of the second partial character string in the input character string after replacement, on the basis of the character position information and length information of the first partial character string in the input character string before replacement.

The first partial character string in the replacement rules may include the case of an empty character, and the replacement unit may include an information generation unit and a character string processing unit. For each character obtained by dividing the input character string into individual characters, the information generation unit generates the character position information, which indicates the character position of that character with the first character of the input character string at position 0, and the length information, in which the length of each character of the input character string is 1. When the first partial character string replaced according to a replacement rule is an empty character, the character string processing unit generates 0 as the character position information and 0 as the length information of each character of the replaced second partial character string. When the first partial character string replaced according to the replacement rule is not an empty character, the character string processing unit generates, for each character of the replaced second partial character string, the minimum value of the character position information of the characters of the first partial character string as the character position information, and the length of the first partial character string as the length information.

The character string processing method of the present invention is a character string processing method executed in a character string processing device including a replacement unit. The method includes a step in which the replacement unit, on the basis of a set of predetermined replacement rules for replacing a first partial character string with a second partial character string, replaces the first partial character string included in an input character string with the second partial character string according to the replacement rules, and generates and assigns character position information and length information of the second partial character string in the input character string after replacement, on the basis of the character position information and length information of the first partial character string in the input character string before replacement.

The character string processing program of the present invention is a program for causing a computer to function as each unit of the character string processing device according to claim 1 or claim 2.

As described above, the character string processing device, method, and program of the present invention have the effect that the character positions of the input character string before replacement can be obtained, even when character positions change as a result of performing replacement processing on the input character string.

FIG. 1 is a block diagram showing an example of the functional configuration of the character string processing device according to the present embodiment.
FIG. 2 is a diagram showing an example of a pre-replacement character string with its character position information and length information.
FIG. 3 is a diagram showing an example of a post-replacement character string with its character position information and length information.
FIG. 4 is a diagram showing an example of a pre-replacement character string with its character position information and length information.
FIG. 5 is a diagram showing an example of a post-replacement character string with its character position information and length information.
FIG. 6 is a flowchart showing the character string processing routine in the present embodiment.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
<Summary of the invention>
First, an outline of an embodiment of the present invention will be described.
In the present embodiment, as pre-translation processing, a first partial character string contained in the pre-replacement character string (the input sentence) is replaced with a second partial character string. At this time, character position information and length information for the second partial character string after replacement are generated and assigned on the basis of the character position information and length information of the first partial character string. As a result, even when character positions change because pre-translation processing is applied to the input sentence to be translated, the character positions of the input sentence can be appropriately associated with the translated sentence.

<System configuration>
The character string processing device 10 according to the present embodiment includes an input unit 12, a replacement unit 14, and an output unit 16.

The input unit 12 receives the input sentence to be translated. The input unit 12 is realized by an input device such as a keyboard, a mouse, or a storage device.

Functionally, the replacement unit 14 includes a replacement rule storage unit 18, an information generation unit 20, and a character string processing unit 22. The replacement unit 14 is realized by, for example, a computer including a CPU (Central Processing Unit), a RAM (Random Access Memory), and a ROM (Read Only Memory) that stores the replacement rules and a character string processing program for executing the character string processing routine described later. A nonvolatile memory may be used instead of the ROM.

As pre-processing for statistical machine translation, the replacement rule storage unit 18 stores a set of replacement rules, each describing how a first partial character string contained in the input sentence to be translated is replaced with a second partial character string. In the present embodiment, "character" covers not only characters that represent the languages of various countries, such as Latin letters and kanji, but also digits and symbols. A "partial character string" may consist of a single character, and may also be an empty character. In the following, the input sentence entered via the input unit 12 may be referred to as the pre-translation character string, and the character string obtained by replacing the pre-translation character string according to the replacement rules may be referred to as the post-replacement character string.

The present embodiment describes two replacement rules. The first replacement rule is, for example: "If no specific character is present at the end of the input sentence, insert the specific character at the end of the sentence. The character position information of the inserted specific character is 0, and its length information is 0."

The specific character is a character representing a full stop (period) that is normally written at the end of a sentence, such as "." or "。". Put another way, the first replacement rule replaces the empty character at the end of the pre-replacement character string (the first partial character string) with the specific character (the second partial character string).

The second replacement rule is, for example: "Replace a specific character string, consisting of a consecutive run of symbols and/or digits, with a virtual word. The character position information of every character of the inserted virtual word is the same as the minimum character position information of the specific character string. The length information of every character of the inserted virtual word is (character position of the last character of the specific character string + length of that last character - minimum character position among the characters of the specific character string)."

The virtual word is, for example, a word such as "スウシキ" (romanized "suushiki") when the specific character string is a mathematical formula containing digits and arithmetic symbols. The expression (character position of the last character of the specific character string + length of that last character - minimum character position among the characters of the specific character string) is, in other words, the length of the specific character string.

The information generation unit 20 divides the input sentence entered via the input unit 12, that is, the pre-replacement character string, into individual characters, and generates and assigns character position information and length information to each character. For example, as shown in FIG. 2, when the pre-replacement character string is "This is a pencil", it consists of 16 characters including the empty characters (spaces), so it is divided into 16 characters. For the first (leftmost) character of the pre-replacement character string, character position information with a character position of 0 is generated; moving toward the last (rightmost) character, the character position is incremented by one for each character. As a result, as shown in FIG. 2, 0 is generated as the character position information for the first character "T", and 15 for the last character "l". As for the length information, 1 is generated for every character, as shown in FIG. 2.
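The per-character annotation performed by the information generation unit 20 can be sketched as follows. This is a minimal illustration, not the patent's implementation; the names `AnnotatedChar` and `annotate` are hypothetical and chosen here only for readability.

```python
from typing import List, NamedTuple


class AnnotatedChar(NamedTuple):
    char: str      # one character of the string
    position: int  # character position in the pre-replacement string (0-based)
    length: int    # number of pre-replacement characters this character covers


def annotate(text: str) -> List[AnnotatedChar]:
    """Split text into characters and attach position and length information."""
    return [AnnotatedChar(ch, i, 1) for i, ch in enumerate(text)]


annotated = annotate("This is a pencil")
print(len(annotated))  # 16
print(annotated[0])    # AnnotatedChar(char='T', position=0, length=1)
print(annotated[-1])   # AnnotatedChar(char='l', position=15, length=1)
```

Spaces are treated like any other character, which is why the example sentence yields 16 entries.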

The character string processing unit 22 replaces the pre-replacement character string entered via the input unit 12 on the basis of the replacement rules stored in the replacement rule storage unit 18. For example, for a sentence such as "This is a pencil" shown in FIG. 2, where no period is present as the specific character at the end of the pre-translation character string, a period is inserted at the end according to the first replacement rule, giving "This is a pencil." as shown in FIG. 3. That is, the empty character at the end of the pre-replacement character string (the first partial character string) is replaced with the specific character (the second partial character string).

As shown in FIG. 3, the character string processing unit 22 also generates and assigns 0 as the character position information and 0 as the length information for the period inserted as the specific character.
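A minimal sketch of the first replacement rule, assuming annotated characters are represented as plain `(character, position, length)` tuples. The names `apply_first_rule` and `TERMINATORS` are hypothetical; only the behavior (append a terminator with position 0 and length 0 when none is present) comes from the text.

```python
from typing import List, Tuple

# (character, position, length) triples; an assumed representation for illustration
Annotated = List[Tuple[str, int, int]]

TERMINATORS = (".", "。")  # the "specific characters" named in the text


def apply_first_rule(chars: Annotated, terminator: str = ".") -> Annotated:
    """Append a terminator when the sentence does not already end with one."""
    if chars and chars[-1][0] in TERMINATORS:
        return list(chars)
    # The replaced first partial string is empty, so position and length are both 0.
    return list(chars) + [(terminator, 0, 0)]


before = [(ch, i, 1) for i, ch in enumerate("This is a pencil")]
after = apply_first_rule(before)
print("".join(ch for ch, _, _ in after))  # This is a pencil.
print(after[-1])                          # ('.', 0, 0)
```

The length of 0 marks the period as having no counterpart in the original sentence, so the position value 0 carries no meaning on its own.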

Likewise, for a sentence containing a specific character string (the first partial character string) that represents a mathematical formula such as "2+3=5", as shown in FIG. 4, the specific character string is replaced according to the second replacement rule with a virtual word (the second partial character string) such as "スウシキ", as shown in FIG. 5.

Here, the character position of "5", the last character of the specific character string, is 4, and its length is 1. The minimum character position among the characters of the specific character string, that is, the character position of its first character "2", is 0. The character string processing unit 22 therefore generates, for each character of the virtual word, 0 as the character position information and 5 (= 4 + 1 - 0) as the length information. That is, the length information of each character of the virtual word is the same as the length of the specific character string.
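The second replacement rule can be sketched as below. The regular expression used to detect a "specific character string" is an assumption made for this example (two or more consecutive digits or arithmetic symbols), and the sample sentence is hypothetical, since the exact text of FIG. 4 is not reproduced here. The function name `apply_second_rule` is also illustrative.

```python
import re
from typing import List, Tuple

Annotated = List[Tuple[str, int, int]]  # (character, position, length)

# Assumed detector for a "specific character string":
# two or more consecutive digits or arithmetic symbols.
FORMULA = re.compile(r"[0-9+\-*/=]{2,}")
VIRTUAL_WORD = "スウシキ"


def apply_second_rule(chars: Annotated) -> Annotated:
    """Replace each detected run with the virtual word, carrying over position and length."""
    text = "".join(ch for ch, _, _ in chars)
    out: Annotated = []
    cursor = 0
    for m in FORMULA.finditer(text):
        out.extend(chars[cursor:m.start()])
        span = chars[m.start():m.end()]
        min_pos = min(pos for _, pos, _ in span)
        _, last_pos, last_len = span[-1]
        length = last_pos + last_len - min_pos  # length of the replaced string
        out.extend((ch, min_pos, length) for ch in VIRTUAL_WORD)
        cursor = m.end()
    out.extend(chars[cursor:])
    return out


before = [(ch, i, 1) for i, ch in enumerate("2+3=5 is a formula.")]
after = apply_second_rule(before)
print("".join(ch for ch, _, _ in after))  # スウシキ is a formula.
print(after[0])                           # ('ス', 0, 5)
print(after[4])                           # (' ', 5, 1)
```

Characters outside the replaced run keep their original positions, so the space after the virtual word still reports position 5.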

In this way, when the first partial character string to be replaced is an empty character, the character string processing unit 22 generates and assigns 0 as the character position information and 0 as the length information of each character of the replaced second partial character string. When the first partial character string to be replaced is not an empty character, it generates and assigns, for each character of the replaced second partial character string, the minimum value of the character position information of the first partial character string as the character position information, and the length of the first partial character string as the length information.
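The two cases can be written as a single expression. The notation is introduced here for illustration only and does not appear in the original text: the replaced first partial character string has n characters with character positions p_1, ..., p_n and lengths l_1, ..., l_n, and (p', l') is the pair assigned to every character of the second partial character string. The block assumes the `amsmath` package for `cases` and `\text`.

```latex
(p', \ell') =
\begin{cases}
(0,\ 0) & \text{if } n = 0 \text{ (empty character)},\\[4pt]
\bigl(\min_{1 \le i \le n} p_i,\ \ p_n + \ell_n - \min_{1 \le i \le n} p_i\bigr) & \text{if } n \ge 1.
\end{cases}
```

For the formula "2+3=5" occupying positions 0 to 4, this gives (0, 4 + 1 - 0) = (0, 5).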

<Operation of the character string processing device>
Next, the character string processing routine executed in the character string processing device 10 according to the present embodiment will be described with reference to FIG. 6.

In step S100, the input of a character string to be translated is received. The present embodiment describes, as an example, the case where the input character string is an English character string and this English character string is replaced using the first replacement rule and the second replacement rule.

In step S102, the input sentence received in step S100 is divided into individual characters, and character position information representing the character position and length information representing the character length are generated and assigned to each character. For example, as shown in FIG. 2, when the input sentence serving as the pre-replacement character string is "This is a pencil", it is divided into 16 characters, and the character position information and length information shown in FIG. 2 are generated for each character.

In step S104, the pre-replacement character string is replaced on the basis of the set of replacement rules stored in the replacement rule storage unit 18. When the first partial character string to be replaced is an empty character, the character position information of each character of the replaced second partial character string is set to 0 and the length information of each character is set to 0. When the first partial character string to be replaced is not an empty character, character string processing is performed in which the character position information of each character of the replaced second partial character string is set to the minimum value of the character position information of the first partial character string, and the length information of each character is set to the length of the first partial character string.

For example, when the pre-replacement character string is the one shown in FIG. 2, no period is present at the end of the sentence. Therefore, according to the first replacement rule, a period is added at the end of the pre-replacement character string, as shown in FIG. 3. That is, the empty character at the end of the pre-replacement character string (the first partial character string) is replaced with the specific character (the second partial character string).

Because the first partial character string being replaced is an empty character, 0 is generated as the character position information and 0 as the length information for the period that serves as the replaced second partial character string.

The pre-replacement character string shown in FIG. 2 does not contain a consecutive run of symbols and/or digits, so no replacement according to the second replacement rule is performed.

When the pre-replacement character string is, for example, the one shown in FIG. 4, a period is present at the end of the sentence, so no replacement based on the first replacement rule is performed. On the other hand, the character string shown in FIG. 4 contains the formula "2+3=5" as a specific character string (the first partial character string) consisting of a consecutive run of symbols and/or digits, so according to the second replacement rule the formula is replaced with "スウシキ" as the virtual word (the second partial character string).

Because the formula serving as the first partial character string being replaced is not an empty character, 0 is generated as the character position information and 5 as the length information for each character of "スウシキ", the replaced second partial character string.

In step S106, the post-replacement character string produced in step S104 is output to a downstream translation device. The downstream translation device is, for example, a device that converts the post-replacement character string into another language, that is, translates it.
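Steps S100 to S106 can be put together as a small pipeline. The sketch below is one possible arrangement, not the patent's implementation: it assumes the rule set is held as a table of (regular expression, replacement) pairs, where an empty match stands for an empty first partial character string. The names `RULES`, `apply_rule`, and `process`, and both patterns, are hypothetical.

```python
import re
from typing import List, Pattern, Tuple

Annotated = List[Tuple[str, int, int]]  # (character, position, length)

# Hypothetical rule table standing in for the replacement rule storage unit 18.
RULES: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"(?<![.。])\Z"), "."),          # rule 1: empty match at an unterminated sentence end
    (re.compile(r"[0-9+\-*/=]{2,}"), "スウシキ"),  # rule 2: run of digits/symbols -> virtual word
]


def apply_rule(chars: Annotated, pattern: Pattern[str], replacement: str) -> Annotated:
    text = "".join(ch for ch, _, _ in chars)
    out: Annotated = []
    cursor = 0
    for m in pattern.finditer(text):
        out.extend(chars[cursor:m.start()])
        span = chars[m.start():m.end()]
        if span:  # non-empty first partial string
            pos = min(p for _, p, _ in span)
            length = span[-1][1] + span[-1][2] - pos
        else:     # empty first partial string
            pos, length = 0, 0
        out.extend((ch, pos, length) for ch in replacement)
        cursor = m.end()
    out.extend(chars[cursor:])
    return out


def process(sentence: str) -> Annotated:
    chars = [(ch, i, 1) for i, ch in enumerate(sentence)]  # S102: annotate each character
    for pattern, replacement in RULES:                     # S104: apply each rule in turn
        chars = apply_rule(chars, pattern, replacement)
    return chars                                           # S106: hand over to the translator


print("".join(ch for ch, _, _ in process("This is a pencil")))     # This is a pencil.
print("".join(ch for ch, _, _ in process("2+3=5 is a formula.")))  # スウシキ is a formula.
```

Treating insertion as replacement of an empty match lets both rules share one code path, mirroring the text's view that the first rule replaces an empty character.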

Thus, according to the present embodiment, even when a first partial character string contained in the pre-replacement character string (the input sentence) is replaced with a second partial character string as pre-translation processing, character position information and length information for the second partial character string after replacement are generated on the basis of the character position information and length information of the first partial character string. Therefore, even when character positions change because pre-translation processing is applied to the input sentence to be translated, the character positions of the input sentence can be appropriately associated with the translated sentence.
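As an illustration of how the attached information might be used downstream, the sketch below maps a range of post-replacement characters back to a range of the original sentence. The text does not describe this lookup; the function `source_span` and the choice to ignore zero-length (inserted) characters are assumptions made for this example.

```python
from typing import List, Tuple

Annotated = List[Tuple[str, int, int]]  # (character, position, length)


def source_span(chars: Annotated, start: int, end: int) -> Tuple[int, int]:
    """Return the [begin, end) range of the original sentence covered by
    post-replacement characters start..end-1 (zero-length characters are ignored)."""
    covered = [(p, p + n) for _, p, n in chars[start:end] if n > 0]
    if not covered:
        return (0, 0)  # only inserted characters: nothing in the original corresponds
    return (min(b for b, _ in covered), max(e for _, e in covered))


original = "2+3=5 is ok"
# Hand-built post-replacement annotation for "スウシキ is ok."
post = (
    [(ch, 0, 5) for ch in "スウシキ"]
    + [(ch, 5 + i, 1) for i, ch in enumerate(" is ok")]
    + [(".", 0, 0)]
)

b, e = source_span(post, 1, 3)    # the characters "ウシ"
print(original[b:e])              # 2+3=5
b, e = source_span(post, 5, 7)    # the characters "is"
print(original[b:e])              # is
print(source_span(post, 10, 11))  # (0, 0)
```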

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the first partial character string in the second replacement rule is not limited to a mathematical formula, and may be a chemical formula or a proper noun of at least a predetermined length. The replacement rules are also not limited to the first and second replacement rules described above, and other replacement rules may be stored.

The character string processing device 10 described above contains a computer system. When a WWW system is used, the term "computer system" also includes a homepage providing environment (or display environment).

Although this specification describes an embodiment in which the program is installed in advance, the program can also be provided stored on a computer-readable recording medium such as a CD-ROM or a memory card.

Description of symbols
10 Character string processing device
12 Input unit
14 Replacement unit
16 Output unit
18 Replacement rule storage unit
20 Information generation unit
22 Character string processing unit

Claims (4)

A character string processing device comprising a replacement unit that, on the basis of a set of predetermined replacement rules for replacing a first partial character string with a second partial character string, replaces the first partial character string included in an input character string with the second partial character string according to the replacement rules, and that generates and assigns character position information and length information of the second partial character string in the input character string after replacement, on the basis of the character position information and length information of the first partial character string in the input character string before replacement.
The character string processing device according to claim 1, wherein
the first partial character string in the replacement rules includes an empty character, and
the replacement unit includes:
an information generation unit that, for each character obtained by dividing the input character string into individual characters, generates the character position information indicating the character position of the character, with the first character of the input character string at position 0, and the length information in which the length of each character of the input character string is 1; and
a character string processing unit that, when the first partial character string replaced according to the replacement rule is an empty character, generates 0 as the character position information of each character of the replaced second partial character string and 0 as the length information of each character of the second partial character string, and that, when the first partial character string replaced according to the replacement rule is not an empty character, generates the minimum value of the character position information of the characters of the first partial character string as the character position information of each character of the replaced second partial character string and the length of the first partial character string as the length information of each character of the second partial character string.
A character string processing method executed in a character string processing device including a replacement unit, the method comprising a step in which the replacement unit, based on a set of predetermined replacement rules each replacing a first partial character string with a second partial character string, replaces the first partial character string included in an input character string with the second partial character string according to the replacement rules, and generates and assigns character position information and length information for the second partial character string in the input character string after replacement, based on the character position information and length information of the first partial character string in the input character string before replacement.
A character string processing program for causing a computer to function as each unit of the character string processing device according to claim 1 or claim 2.
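The position and length bookkeeping recited in the claims can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `generate_info`, `annotate_replacement`, and `apply_rules` are hypothetical, and the matching strategy (a single left-to-right pass in which the first matching rule in list order wins) is an assumption that the claims do not specify. Because the claims do not state where a rule with an empty first partial string is applied, `apply_rules` skips such rules, and only `annotate_replacement` shows the position 0, length 0 annotation for that case.

```python
from typing import List, Tuple

# (character, position in the original input, length in the original input)
Annotated = Tuple[str, int, int]
# (first partial string, second partial string)
Rule = Tuple[str, str]


def generate_info(text: str) -> List[Annotated]:
    """Annotate each input character with its 0-based position and length 1."""
    return [(ch, pos, 1) for pos, ch in enumerate(text)]


def annotate_replacement(matched: List[Annotated], second: str) -> List[Annotated]:
    """Annotate every character of the second partial string.

    `matched` holds the annotated characters of the first partial string.
    """
    if not matched:
        # Empty first partial string: position 0 and length 0 for each character.
        return [(ch, 0, 0) for ch in second]
    start = min(pos for _, pos, _ in matched)  # minimum original position
    length = len(matched)                      # length of the first partial string
    return [(ch, start, length) for ch in second]


def apply_rules(text: str, rules: List[Rule]) -> List[Annotated]:
    """Apply rules in one left-to-right pass (assumed strategy, see lead-in)."""
    chars = generate_info(text)
    out: List[Annotated] = []
    i = 0
    while i < len(chars):
        for first, second in rules:
            # Assumption: rules whose first string is empty are skipped here.
            if first and text.startswith(first, i):
                out.extend(annotate_replacement(chars[i:i + len(first)], second))
                i += len(first)
                break
        else:
            # No rule matched at this position: keep the character unchanged.
            out.append(chars[i])
            i += 1
    return out


# Usage: replace "bc" with "XYZ", then map a replaced character back.
result = apply_rules("abcde", [("bc", "XYZ")])
replaced = "".join(ch for ch, _, _ in result)
_, pos, length = result[1]                  # annotation carried by "X"
original_span = "abcde"[pos:pos + length]   # span of the input before replacement
```

In this example every character of `XYZ` carries position 1 and length 2, so slicing the original input with those values recovers `bc` even though the replacement changed the string length.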
JP2013235384A 2013-11-13 2013-11-13 Character string processing apparatus, method, and program Active JP5833087B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2013235384A JP5833087B2 (en) 2013-11-13 2013-11-13 Character string processing apparatus, method, and program

Publications (2)

Publication Number Publication Date
JP2015095182A (en) 2015-05-18
JP5833087B2 (en) 2015-12-16

Family

ID=53197521

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2013235384A Active JP5833087B2 (en) 2013-11-13 2013-11-13 Character string processing apparatus, method, and program

Country Status (1)

Country Link
JP (1) JP5833087B2 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1097539A (en) * 1996-05-29 1998-04-14 Matsushita Electric Ind Co Ltd Document conversion device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018190287A (en) * 2017-05-10 2018-11-29 日本特許翻訳株式会社 Machine translation device and program
JP2020077134A (en) * 2018-11-06 2020-05-21 株式会社椿知財サービス Translation apparatus, control program of translation apparatus, and translation method using translation apparatus
JP2020077356A (en) * 2018-11-06 2020-05-21 株式会社椿知財サービス Translation apparatus, control program of translation apparatus, and translation method using translation apparatus
JP7333933B2 (en) 2018-11-06 2023-08-28 株式会社椿知財サービス TRANSLATION DEVICE, CONTROL PROGRAM FOR TRANSLATION DEVICE, AND TRANSLATION METHOD USING TRANSLATION DEVICE

Also Published As

Publication number Publication date
JP5833087B2 (en) 2015-12-16

Similar Documents

Publication Publication Date Title
Salloum et al. Elissa: A dialectal to standard Arabic machine translation system
JP5239307B2 (en) Translation apparatus and translation program
JP2015038731A (en) Method for disambiguating multiple readings in language conversion
Erdmann et al. Addressing noise in multidialectal word embeddings
Chinnakotla et al. Transliteration for resource-scarce languages
JP5973986B2 (en) Translation system, method, and program
JP5833087B2 (en) Character string processing apparatus, method, and program
Ganfure et al. Design and implementation of morphology based spell checker
Lehal et al. Sangam: A Perso-Arabic to Indic script machine transliteration model
JP2017151553A (en) Machine translation device, machine translation method, and program
JP2013186189A (en) Sign language translation apparatus and sign language translation program
WO2018179729A1 (en) Index generating program, data search program, index generating device, data search device, index generating method, and data search method
JP2016189154A (en) Translation method, device, and program
Knauth et al. A dictionary data processing environment and its application in algorithmic processing of Pali dictionary data for future NLP tasks
JP6076285B2 (en) Translation apparatus, translation method, and translation program
Singvongsa et al. Lao-Thai machine translation using statistical model
JP6373198B2 (en) Text conversion apparatus, method, and program
JP4708682B2 (en) Bilingual word pair learning method, apparatus, and recording medium on which parallel word pair learning program is recorded
WO2009144890A1 (en) Pre-translation rephrasing rule generating system
JPWO2014030258A1 (en) Morphological analyzer, text analysis method, and program thereof
Kaur et al. A Review on Hindi to English Transliteration System for Proper Nouns Using Hybrid Approach
JP6472466B2 (en) Stylistic conversion device, method, and program
JP2006252290A (en) Machine translation device and computer program
Sudarma et al. Transliteration Balinese Latin Text Becomes Aksara Bali Using Rule Base And Levenshtein Distance Approach
JP6116014B2 (en) Stylistic conversion device, method, and program

Legal Events

Date Code Title Description
A131 Notification of reasons for refusal
Free format text: JAPANESE INTERMEDIATE CODE: A131
Effective date: 20150310

A521 Written amendment
Free format text: JAPANESE INTERMEDIATE CODE: A523
Effective date: 20150423

TRDD Decision of grant or rejection written

A01 Written decision to grant a patent or to grant a registration (utility model)
Free format text: JAPANESE INTERMEDIATE CODE: A01
Effective date: 20150929

A61 First payment of annual fees (during grant procedure)
Free format text: JAPANESE INTERMEDIATE CODE: A61
Effective date: 20151028

R150 Certificate of patent or registration of utility model
Ref document number: 5833087
Country of ref document: JP
Free format text: JAPANESE INTERMEDIATE CODE: R150