JP2018077677A - Character string converting device, model learning device, method, and program

Info

Publication number: JP2018077677A
Application number: JP2016218997A
Authority: JP
Inventors: Itsumi Saito (斉藤 いつみ); Jun Suzuki (鈴木 潤); Hisako Asano (浅野 久子); Kuniko Saito (齋藤 邦子); Yoshihiro Matsuo (松尾 義博)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-11-09
Filing date: 2016-11-09
Publication date: 2018-05-17
Anticipated expiration: 2036-11-09
Also published as: JP6684693B2

Abstract

PROBLEM TO BE SOLVED: To accurately convert a character string that contains broken words.

SOLUTION: A partial character string specifying unit 230 uses a determination model to decide, character by character, whether each character of an input character string belongs to a broken word, that is, a spelling variant of a regular word (the normalized expression), and specifies the partial character strings made up of the characters so identified. A character string converting unit 232 converts each specified partial character string using a conversion model that maps broken words to regular words, thereby generating a character string in which the broken words have been replaced by regular words.

SELECTED DRAWING: Figure 8

Description

The present invention relates to a character string conversion device, a model learning device, a method, and a program, and in particular to those for robustly analyzing broken notation, that is, notation such as colloquial spellings that does not appear in standard orthography.

Conventionally, normalizing morphological analysis based on character string normalization patterns has been known as a technique for normalizing broken notation. For example, as shown in FIG. 14, normalization strings are defined in advance and are used to expand the dictionary or the input sentence during analysis (see Non-Patent Documents 1 and 2). Examples of conversions from a broken string to a regular string include "ー" → null, "ー" → "う", "っ" → null, and "しー" → "しい".
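The pattern-based expansion described above can be sketched as follows. This is a minimal illustration, not the method of Non-Patent Documents 1 or 2: the `PATTERNS` table and the `expand` function are hypothetical names, the empty string stands in for "null", and a real system would score the candidates with a morphological analyzer rather than simply enumerate them.

```python
# Hypothetical normalization patterns: broken substring -> possible regular substrings.
# The empty string plays the role of "null" (deletion).
PATTERNS = {
    "しー": ["しい"],
    "ー": ["", "う"],
    "っ": [""],
}


def expand(text, patterns=PATTERNS):
    """Return every candidate string obtained by optionally applying the patterns."""
    if not text:
        return {""}
    results = set()
    # Option 1: keep the first character unchanged.
    for tail in expand(text[1:], patterns):
        results.add(text[0] + tail)
    # Option 2: rewrite a prefix that matches one of the predefined patterns.
    for src, targets in patterns.items():
        if text.startswith(src):
            for tail in expand(text[len(src):], patterns):
                for tgt in targets:
                    results.add(tgt + tail)
    return results


candidates = expand("うれしー")
```

Only strings covered by the predefined table can be rewritten, which is the limitation the invention sets out to remove.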

A technique is also known in which an encoder-decoder neural network projects an input sequence x into a vector space and predicts an output sequence y while referring to that vector space (Non-Patent Document 3). In this technique, the probability of each character that may be output at each position t is calculated from an encoder-decoder neural network model such as the one shown in FIG. 15 according to the following expression, and the character sequence with the highest probability is output.

Here, c_t is the weighted average of h_t with a_t as the weights, and is calculated as follows for each position on the input source side (x).

Non-Patent Document 1: Kenta Katsuki, Ryohei Sasano, Daisuke Kawahara, Sadao Kurohashi, "Robust Morphological Analysis for Various Language Variations on the Web", Proceedings of the 17th Annual Meeting of the Association for Natural Language Processing, 2011.
Non-Patent Document 2: Saito, Sadamitsu, Asano, Matsuo, "Japanese Morphological Analysis Based on Normalization of Broken Notation Using Regular-Broken String Alignment and Character Type Conversion", 20th Annual Meeting of the Association for Natural Language Processing, March 10, 2014.
Non-Patent Document 3: Minh-Thang Luong, Hieu Pham, Christopher D. Manning, "Effective Approaches to Attention-based Neural Machine Translation", Computer Science Department, Stanford University, Stanford, CA 94305, 2015.

The conventional techniques of Non-Patent Documents 1 and 2 normalize by means of predefined string-level patterns, so only strings that were set in advance can be normalized. In addition, only context within the limited range of the defined strings can be taken into account.

The technique of Non-Patent Document 3 performs conversion with the entire character string in view, so training takes a long time. Moreover, particularly when training data is scarce or the input string is long, degradation becomes significant; for example, characters that need no conversion are converted anyway.

The present invention has been made to solve the above problems, and its object is to provide a character string conversion device, method, and program capable of accurately converting a character string that contains broken words.

Another object of the present invention is to provide a model learning device, method, and program capable of learning models that can accurately convert a character string containing broken words.

To achieve the above object, the character string conversion device according to the first invention includes: a partial character string specifying unit that determines, for each character of an input character string, whether the character belongs to a broken word, based on a determination model for making that determination, a broken word being a notation that deviates from a regular word (the normalized expression), and that specifies a partial character string consisting of the characters determined to belong to a broken word; and a character string conversion unit that converts the partial character string specified by the partial character string specifying unit based on a conversion model for converting broken words into regular words, thereby generating a character string in which the broken words contained in the input character string have been converted into regular words.

In the character string conversion device according to the first invention, the conversion model may be a model that converts a broken word into a regular word in consideration of the context before and the context after the broken word, and the character string conversion unit may convert the partial character string specified by the partial character string specifying unit based on the conversion model together with the context preceding and the context following the partial character string in the character string.

Also, in the character string conversion device according to the first invention, the determination model may be a neural network that takes as input the character to be judged, the character string preceding it, and the character string following it, and outputs the probability that the character to be judged belongs to a broken word; and the conversion model may be a neural network that takes as input the partial character string to be converted, the character string preceding it, and the character string following it, and outputs, for each character position of the converted partial character string, the probability of each candidate character.

The model learning device according to the second invention includes: a character string alignment unit that, given a plurality of input pairs each consisting of a normalized sentence made up of regular words (normalized expressions) and a broken sentence containing broken words (notations that deviate from the regular words), obtains for each pair the correspondence between the characters of the normalized sentence and the characters of the broken sentence; a determination model learning unit that learns, based on the correspondence obtained for each pair by the character string alignment unit, a determination model for determining whether a character belongs to a broken word; and a conversion model learning unit that learns, based on the same correspondence, a conversion model for converting broken words into regular words.

The character string conversion method according to the third invention includes: a step in which a partial character string specifying unit determines, for each character of an input character string, whether the character belongs to a broken word, based on a determination model for making that determination, and specifies a partial character string consisting of the characters determined to belong to a broken word; and a step in which a character string conversion unit converts the specified partial character string based on a conversion model for converting broken words into regular words, thereby generating a character string in which the broken words contained in the input character string have been converted into regular words.

The character string conversion method according to the fourth invention includes: a step in which a character string alignment unit, given a plurality of input pairs each consisting of a normalized sentence made up of regular words and a broken sentence containing broken words, obtains for each pair the correspondence between the characters of the normalized sentence and the characters of the broken sentence; a step in which a determination model learning unit learns, based on the correspondence obtained for each pair, a determination model for determining whether a character belongs to a broken word; and a step in which a conversion model learning unit learns, based on the correspondence obtained for each pair, a conversion model for converting broken words into regular words.

The program according to the fifth invention causes a computer to function as each unit of the character string conversion device according to any one of claims 1 to 3.

The program according to the sixth invention causes a computer to function as each unit of the model learning device according to the second invention.

According to the character string conversion device, method, and program of the present invention, each character of an input character string is judged, based on a determination model, as to whether it belongs to a broken word (a notation that deviates from a regular word, which is the normalized expression); a partial character string consisting of the characters judged to belong to a broken word is specified; and the specified partial character string is converted based on a conversion model for converting broken words into regular words, generating a character string in which the broken words have been converted into regular words. This makes it possible to accurately convert a character string that contains broken words.

According to the model learning device, method, and program of the present invention, for each of a plurality of input pairs of a normalized sentence made up of regular words and a broken sentence containing broken words, the correspondence between the characters of the normalized sentence and the characters of the broken sentence is obtained; a determination model for determining whether a character belongs to a broken word is learned from those correspondences; and a conversion model for converting broken words into regular words is learned from the same correspondences. This makes it possible to learn models that can accurately convert a character string containing broken words.

FIG. 1 is a block diagram showing the configuration of the model learning device according to the embodiment of the present invention.
FIG. 2 is a diagram showing an example of the correspondence between the change label of each character of a broken sentence (for determination model learning) and the set of partial change strings (for conversion model learning).
FIG. 3 is a diagram showing an example of correct answer data created from a pair of a normalized sentence and a broken sentence.
FIG. 4 is a diagram showing an example of learning the determination model from the change labels of the correspondence.
FIG. 5 is a diagram showing an example of the LSTM neural network used in the determination model.
FIG. 6 is a diagram showing an example of learning the conversion model from the set of partial change strings of the correspondence.
FIG. 7 is a diagram showing an example of the encoder-decoder neural network used in the conversion model.
FIG. 8 is a block diagram showing the configuration of the character string conversion device according to the embodiment of the present invention.
FIG. 9 is a diagram showing an example of using the determination model to determine whether a character belongs to a broken word.
FIG. 10 is a diagram showing an example of using the conversion model to convert each partial character string of a broken word into a regular word.
FIG. 11 is a schematic diagram showing the overall processing flow of the character string conversion device 200.
FIG. 12 is a flowchart showing the model learning processing routine in the model learning device according to the embodiment of the present invention.
FIG. 13 is a flowchart showing the character string conversion processing routine in the character string conversion device according to the embodiment of the present invention.
FIG. 14 is a diagram showing an example of normalizing morphological analysis based on character string normalization patterns.
FIG. 15 is a diagram showing an example of an encoder-decoder neural network model.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.

<Overview of the Embodiment of the Present Invention>

First, an overview of the embodiment of the present invention will be given.

The problems of Non-Patent Documents 1 and 2 are addressed by using a neural-network-based determination model, which estimates conversion candidates for characters while taking information from the entire sentence into account. This removes the need to fix in advance the length of the strings considered and allows a wider context to be used. In addition, because an abstracted context is considered rather than the surface string itself, similarities such as "different surface forms but similar meaning" can be taken into account.

The problem of Non-Patent Document 3 is addressed by introducing a mechanism that identifies the partial character strings that should be converted and converts only those, so that only the places requiring conversion are normalized, efficiently and accurately. The characteristics of the partial character strings to be converted and the conversion of those strings can be learned separately, which makes model training efficient. Because only characters identified as needing conversion are converted, degradation is suppressed and the result is less sensitive to the length of the input sentence.

<Configuration of the Model Learning Device According to the Embodiment of the Present Invention>

Next, the configuration of the model learning device according to the embodiment of the present invention will be described. As shown in FIG. 1, the model learning device 100 can be implemented by a computer that includes a CPU, a RAM, and a ROM storing various data and a program for executing the model learning processing routine described later. Functionally, the model learning device 100 includes an input unit 10 and a calculation unit 20, as shown in FIG. 1.

The input unit 10 receives, as correct answer data, a plurality of pairs, each consisting of a normalized sentence made up of regular words (normalized expressions) and a broken sentence containing broken words (notations that deviate from the regular words).

The calculation unit 20 includes a character string alignment unit 30, a determination model learning unit 32, a conversion model learning unit 34, a determination model 40, and a conversion model 42.

Based on the plurality of pairs of normalized and broken sentences received by the input unit 10, the character string alignment unit 30 obtains, for each pair, the correspondence between the characters of the normalized sentence and the characters of the broken sentence. As this correspondence it obtains, as shown in FIG. 2, a change label for each character of the broken sentence (for training the determination model) and a set of partial change strings (for training the conversion model).

Specifically, for each pair of a normalized sentence and a broken sentence, correct answer data is created as shown in FIG. 3, where the parts enclosed by dotted lines are broken strings, made up of broken characters that should be converted, and the rest are strings that need no conversion. When the determination model learning unit 32 trains its model, it must discriminate whether each character of the broken sentence belongs to a broken word (that is, whether it should be converted), so the correct answer data must be aligned at the character level. Characters that changed are therefore given the change label "1", and characters that did not change are given the change label "0".

To obtain the correspondence between the characters of the normalized sentence and those of the broken sentence, string-level DP (dynamic programming) alignment is performed between each broken string in the broken sentence and the corresponding normalized string in the normalized sentence. From the alignment result, partial character strings delimited at the smallest unit of change in the normalized sentence are created as the correct answer data for the partial change strings to be converted.
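The labeling and alignment steps above might be sketched as follows. This is an illustration under stated assumptions: Python's `difflib.SequenceMatcher` is used as a stand-in for the DP alignment (the source does not specify the alignment cost function), the function name `align` is hypothetical, and the example sentence pair is made up around the "きょー" → "今日" and "ー" → null conversions mentioned in the text.

```python
from difflib import SequenceMatcher


def align(broken, normalized):
    """Return (change labels per broken character, list of (broken, regular) pairs)."""
    labels = [0] * len(broken)
    pairs = []
    matcher = SequenceMatcher(None, broken, normalized, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged characters keep the label 0
        # Characters of the broken sentence that were replaced or deleted get label 1.
        for i in range(i1, i2):
            labels[i] = 1
        # The differing span is one partial change string and its regular form.
        pairs.append((broken[i1:i2], normalized[j1:j2]))
    return labels, pairs


labels, pairs = align("きょーはさむーい", "今日はさむい")
```

The labels would serve as training targets for the determination model and the pairs as training data for the conversion model. A pure insertion on the normalized side has no broken character to label, which this sketch does not handle specially.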

Based on the correspondence obtained for each pair by the character string alignment unit 30, the determination model learning unit 32 learns a determination model for determining whether a character belongs to a broken word, and stores it as the determination model 40.

Specifically, as shown in FIG. 4, the weight parameters of the determination model 40, which estimates the change label of each character of a broken sentence, are learned from the change labels in the correspondence. FIG. 5 shows the LSTM neural network used in the determination model 40 of this embodiment. This neural network outputs the label output probability p(j) for each character position j, given by equation (1) below. The change label is then output as a binary value, 1 or 0, based on p(j).

... (1)

Here, h_fj denotes the hidden layer of the forward LSTM at character position j, h_bj denotes the hidden layer of the backward LSTM at character position j, and e_j denotes the embedding of the character at position j. W_fj, W_bj, and W_j are the corresponding weight parameters.
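The body of equation (1) is not reproduced in this text. As a sketch only, one form consistent with the symbols defined above would be the following; the softmax and the absence of a bias term are assumptions, not taken from the source.

```latex
p(j) = \mathrm{softmax}\left( W_{fj}\, h_{fj} + W_{bj}\, h_{bj} + W_{j}\, e_{j} \right)
```

Under this reading, p(j) is a two-element distribution over the change labels 0 and 1 at character position j.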

Based on the correspondence obtained for each pair by the character string alignment unit 30, the conversion model learning unit 34 learns a conversion model for converting broken words into regular words, and stores it as the conversion model 42.

Specifically, as shown in FIG. 6, the weight parameters of the conversion model 42, which estimates the regular word corresponding to a broken word in a broken sentence, are learned from the set of partial change strings in the correspondence. FIG. 7 shows the encoder-decoder neural network used in the conversion model 42 of this embodiment. For a partial change string, that is, a broken word to be converted, the output for each converted character is formulated as follows.

... (2)

The model is based on the existing attention-based encoder-decoder model, but differs in that conversion is performed in units of the input partial change string and in that the left context (preceding context) and the right context (following context) are taken into account. Here, h_cr denotes the hidden layer of the right-context LSTM and h_cl the hidden layer of the left-context LSTM. h_c denotes the hidden layer of the bidirectional LSTM and corresponds to c_t in the encoder-decoder neural network model described for Non-Patent Document 3. h_t denotes the hidden layer at the current decoder position t. W_ht, W_c, W_cl, and W_cr are the corresponding weight parameters.

Finally, the probability vector p(t) for the t-th character is estimated by equation (3) below. The t-th character after conversion is determined from this probability vector p(t).

... (3)

In this way, the conversion model 42 converts a broken word into a regular word in consideration of the context before and the context after the broken word.
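The bodies of equations (2) and (3) are not reproduced in this text. A sketch consistent with the symbols defined above and with the attention model of Non-Patent Document 3 might look as follows; the tanh nonlinearity, the output projection W_o (a hypothetical name), and the softmax are assumptions that the source does not confirm.

```latex
\begin{aligned}
\tilde{h}_t &= \tanh\left( W_{ht}\, h_t + W_{c}\, h_c + W_{cl}\, h_{cl} + W_{cr}\, h_{cr} \right) \\
o_t &= W_{o}\, \tilde{h}_t \\
p(t) &= \mathrm{softmax}(o_t)
\end{aligned}
```

The first two lines would play the role of equation (2) and the last line that of equation (3). The terms in h_cl and h_cr are what distinguish this from the plain attention model, since they inject the left and right contexts of the partial change string.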

<Configuration of the Character String Conversion Device According to the Embodiment of the Present Invention>

Next, the configuration of the character string conversion device according to the embodiment of the present invention will be described. As shown in FIG. 8, the character string conversion device 200 can be implemented by a computer that includes a CPU, a RAM, and a ROM storing various data and a program for executing the character string conversion processing routine described later. Functionally, the character string conversion device 200 includes an input unit 210, a calculation unit 220, and an output unit 250, as shown in FIG. 8.

The input unit 210 receives a character string containing broken words to be converted.

The calculation unit 220 includes a partial character string specifying unit 230, a character string conversion unit 232, a determination model 240, and a conversion model 242.

The determination model 240 is the same model as the determination model 40 learned by the model learning device 100 for determining whether a character belongs to a broken word. It is a neural network that takes as input the character to be judged, the character string preceding it, and the character string following it, and outputs the probability that the character belongs to a broken word.

The conversion model 242 is the same model as the conversion model 42 learned by the model learning device 100 for converting broken words into regular words. It is a neural network that takes as input the partial character string to be converted, the character string preceding it, and the character string following it, and outputs, for each character position of the converted partial character string, the probability of each candidate character.

For the character string received by the input unit 210, the partial character string specifying unit 230 determines, character by character and based on the determination model 240, whether each character is a character of a broken word, and specifies each partial character string composed of characters determined to be characters of a broken word.

For example, as shown in FIG. 9, the determination model 240 is used to determine, for each character of the character string, whether it is a character of a broken word. Each character is input to the LSTM-type neural network of the determination model 240, the label output probability p(j) at character position j is obtained according to equation (1) above, and the character is judged on that basis. The change label "1" is output for a character determined to be a character of a broken word and the change label "0" for a character determined not to be one, which identifies the partial character strings of broken words.
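The labeling step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the probabilities are made-up stand-ins for the output p(j) of the determination model, and the 0.5 decision threshold and the function names `label_characters` and `find_spans` are assumptions introduced here.

```python
from typing import List, Tuple


def label_characters(probs: List[float], threshold: float = 0.5) -> List[int]:
    """Emit change label 1 where p(j) reaches the (assumed) threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probs]


def find_spans(labels: List[int]) -> List[Tuple[int, int]]:
    """Group consecutive 1-labels into half-open (start, end) index spans."""
    spans: List[Tuple[int, int]] = []
    start = None
    for j, label in enumerate(labels):
        if label == 1 and start is None:
            start = j                      # a broken-word span begins here
        elif label == 0 and start is not None:
            spans.append((start, j))       # the span ended at the previous character
            start = None
    if start is not None:                  # span runs to the end of the string
        spans.append((start, len(labels)))
    return spans


text = "きょーは晴れ"
probs = [0.9, 0.8, 0.95, 0.1, 0.05, 0.2]   # hypothetical p(j) values, one per character
labels = label_characters(probs)
spans = find_spans(labels)
print(labels)                               # [1, 1, 1, 0, 0, 0]
print([text[s:e] for s, e in spans])        # ['きょー']
```

Half-open spans are used so that `text[s:e]` yields the partial character string directly and the text before and after it is available as `text[:s]` and `text[e:]`.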

For each partial character string specified by the partial character string specifying unit 230, the character string conversion unit 232 converts the partial character string based on the conversion model 242, the context preceding the partial character string in the character string, and the context following it, thereby generating a character string in which the broken words contained in the character string have been converted into regular words.

For example, as shown in FIG. 10, each partial character string of a broken word is converted into a regular word using the conversion model 242. The characters of the partial character string, the characters of the preceding context, and the characters of the following context are input to the encoder-decoder type neural network. According to equation (2) above, ~h_t and then o_t are computed for the t-th character of the partial character string, taking the preceding and following contexts into account. Then, according to equation (3) above, the probability vector p(t) for the t-th character is estimated, and the character is converted to the character with the highest probability. For instance, the partial character string 「きょー」 is converted to 「今日」, and the character 「ー」 is converted to "null". The converted partial character strings are then merged back into the original character string, and the result is output to the output unit 250 as the normalized character string. FIG. 11 is a schematic diagram of the overall processing flow of the character string conversion device 200.
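The final two steps of this paragraph, picking the most probable character at each position and merging the converted partial character strings back into the original string, can be sketched as below. The probability vectors are invented for illustration and do not come from a trained model; treating "null" as an output symbol that emits no character, and the names `greedy_decode` and `merge`, are assumptions of this sketch.

```python
from typing import Dict, List, Tuple

NULL = "null"  # assumed symbol meaning "emit no character at this position"


def greedy_decode(prob_vectors: List[Dict[str, float]]) -> str:
    """Pick the highest-probability character for each position t."""
    out: List[str] = []
    for p_t in prob_vectors:
        best = max(p_t, key=lambda ch: p_t[ch])
        if best != NULL:
            out.append(best)
    return "".join(out)


def merge(text: str, replacements: List[Tuple[int, int, str]]) -> str:
    """Splice converted substrings into text; spans are half-open and non-overlapping."""
    pieces: List[str] = []
    cursor = 0
    for start, end, converted in sorted(replacements):
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(converted)           # normalized replacement
        cursor = end
    pieces.append(text[cursor:])           # untouched tail
    return "".join(pieces)


# Hypothetical p(t) for the three characters of 「きょー」.
p = [
    {"今": 0.8, "き": 0.2},
    {"日": 0.7, "ょ": 0.3},
    {NULL: 0.9, "ー": 0.1},
]
converted = greedy_decode(p)
print(converted)                                      # 今日
print(merge("きょーは晴れ", [(0, 3, converted)]))       # 今日は晴れ
```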

<Operation of the Model Learning Device According to the Embodiment of the Present Invention>

Next, the operation of the model learning device 100 according to the embodiment of the present invention will be described. When the input unit 10 receives a plurality of pairs, each consisting of a normalized sentence composed of regular words and a broken sentence containing broken words, the model learning device 100 executes the model learning processing routine shown in FIG. 12.

First, in step S100, based on the plurality of pairs received by the input unit 10, each consisting of a normalized sentence composed of regular words and a broken sentence containing broken words, the correspondence between each character of the normalized sentence and each character of the broken sentence is obtained for each pair.
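As a rough illustration of step S100, the sketch below aligns a broken sentence with its normalized counterpart and derives per-character change labels and (broken, regular) substring pairs that could serve as training data for the two models. The alignment technique is a stand-in: this passage does not specify the alignment algorithm, so Python's standard `difflib.SequenceMatcher` is used here purely as an assumption, and `align_pair` is a hypothetical helper name.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def align_pair(broken: str, normalized: str) -> Tuple[List[int], List[Tuple[str, str]]]:
    """Return change labels for the broken sentence and (broken, regular) substring pairs."""
    labels = [0] * len(broken)
    pairs: List[Tuple[str, str]] = []
    matcher = SequenceMatcher(None, broken, normalized)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue                       # characters that correspond one-to-one unchanged
        for i in range(i1, i2):
            labels[i] = 1                  # broken-side characters that differ
        pairs.append((broken[i1:i2], normalized[j1:j2]))
    return labels, pairs


labels, pairs = align_pair("きょーは晴れ", "今日は晴れ")
print(labels)   # [1, 1, 1, 0, 0, 0]
print(pairs)    # [('きょー', '今日')]
```

The labels correspond to the targets a determination model would be trained on, and the substring pairs to the source and target sides a conversion model would be trained on.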

Next, in step S102, based on the correspondence obtained in step S100 for each of the pairs, a determination model for determining whether a character is a character of a broken word is learned using an LSTM-type neural network and is stored as the determination model 40.

In step S104, based on the correspondence obtained by the character string alignment unit 30 for each of the pairs, a conversion model for converting broken words into regular words is learned using an encoder-decoder type neural network and is stored as the conversion model 42.

As described above, the model learning device according to the present embodiment receives a plurality of pairs, each consisting of a normalized sentence composed of regular words, which are normalized expressions, and a broken sentence containing broken words, which are notations that deviate from the regular words. For each pair, it obtains the correspondence between each character of the normalized sentence and each character of the broken sentence. Based on the correspondence obtained by the character string alignment unit for each pair, it learns a determination model for determining whether a character is a character of a broken word and a conversion model for converting broken words into regular words. This makes it possible to learn models that can accurately convert character strings containing broken words.

<Operation of the Character String Conversion Device According to the Embodiment of the Present Invention>

Next, the operation of the character string conversion device 200 according to the embodiment of the present invention will be described. When the input unit 210 receives a character string that contains broken words and is to be converted, the character string conversion device 200 executes the character string conversion processing routine shown in FIG. 13.

First, in step S200, for the character string received by the input unit 210, it is determined character by character, based on the determination model 240, whether each character is a character of a broken word, and each partial character string composed of characters determined to be characters of a broken word is specified.

Next, in step S202, each partial character string specified by the partial character string specifying unit 230 is converted based on the conversion model 242, the context preceding the partial character string in the character string, and the context following it. A character string in which the broken words contained in the character string have been converted into regular words is thereby generated and output to the output unit 250.
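Steps S200 and S202 together can be sketched as a single routine. In this sketch the two models are passed in as plain callables whose argument order (target, preceding text, following text) mirrors the inputs described for the determination model 240 and the conversion model 242. The toy stand-in models, the 0.5 threshold, and the names `normalize`, `toy_judge`, and `toy_convert` are assumptions made for illustration; real models would be the trained neural networks.

```python
from typing import Callable, List

# (character, preceding text, following text) -> probability of being a broken-word character
Judge = Callable[[str, str, str], float]
# (partial string, preceding context, following context) -> converted string
Convert = Callable[[str, str, str], str]


def normalize(text: str, judge: Judge, convert: Convert, threshold: float = 0.5) -> str:
    # Step S200: judge every character using its left and right context.
    flags = [judge(text[j], text[:j], text[j + 1:]) >= threshold for j in range(len(text))]

    # Step S202: convert each run of flagged characters and keep the rest as-is.
    out: List[str] = []
    j = 0
    while j < len(text):
        if not flags[j]:
            out.append(text[j])
            j += 1
            continue
        end = j
        while end < len(text) and flags[end]:
            end += 1
        out.append(convert(text[j:end], text[:j], text[end:]))
        j = end
    return "".join(out)


# Toy stand-ins for the trained models (hypothetical, for demonstration only).
def toy_judge(ch: str, left: str, right: str) -> float:
    return 1.0 if ch in "きょー" else 0.0


def toy_convert(sub: str, left: str, right: str) -> str:
    return {"きょー": "今日"}.get(sub, sub)


print(normalize("きょーは晴れ", toy_judge, toy_convert))   # 今日は晴れ
```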

As described above, the character string conversion device according to the embodiment of the present invention determines, for each character of the input character string, whether the character is a character of a broken word, based on a determination model for making that determination, a broken word being a notation that deviates from a regular word that is a normalized expression. It specifies each partial character string composed of characters determined to be characters of a broken word, and converts each specified partial character string based on a conversion model for converting broken words into regular words, thereby generating a character string in which the broken words have been converted into regular words. A character string containing broken words can thus be converted accurately.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the embodiment above describes the case in which neural networks are used for the determination model 40 (240) and the conversion model 42 (242), but the invention is not limited to this. A determination model and a conversion model based on any other technique may be used, provided that the technique can output probabilities for an input sequence.

DESCRIPTION OF SYMBOLS
10 Input unit
20 Calculation unit
30 Character string alignment unit
32 Determination model learning unit
34 Conversion model learning unit
40, 240 Determination model
42, 242 Conversion model
100 Model learning device
200 Character string conversion device
210 Input unit
220 Calculation unit
230 Partial character string specifying unit
232 Character string conversion unit
250 Output unit

Claims

1. A character string conversion device comprising:
a partial character string specifying unit that, for an input character string, determines character by character whether each character is a character of a broken word, based on a determination model for determining whether a character is a character of a broken word, a broken word being a notation that deviates from a regular word that is a normalized expression, and specifies a partial character string composed of characters determined to be characters of a broken word; and
a character string conversion unit that converts the partial character string specified by the partial character string specifying unit based on a conversion model for converting the broken word into the regular word, thereby generating a character string in which the broken word contained in the character string has been converted into the regular word.

2. The character string conversion device according to claim 1, wherein
the conversion model is a conversion model for converting the broken word into the regular word in consideration of the context preceding the broken word and the context following the broken word, and
the character string conversion unit converts the partial character string specified by the partial character string specifying unit based on the conversion model, the context preceding the partial character string in the character string, and the context following the partial character string.

3. The character string conversion device according to claim 1, wherein
the determination model is a neural network that receives as input a determination target character, the character string preceding the determination target character, and the character string following the determination target character, and outputs the probability that the determination target character is a character of a broken word, and
the conversion model is a neural network that receives as input a partial character string to be converted, the character string preceding the partial character string to be converted, and the character string following the partial character string to be converted, and outputs, for each character of the converted partial character string, the probability of each candidate character.

4. A model learning device comprising:
a character string alignment unit that, based on a plurality of input pairs each consisting of a normalized sentence composed of regular words that are normalized expressions and a broken sentence containing a broken word that is a notation that deviates from a regular word, obtains for each of the plurality of pairs the correspondence between each character included in the normalized sentence and each character included in the broken sentence;
a determination model learning unit that learns a determination model for determining whether a character is a character of a broken word, based on the correspondence obtained by the character string alignment unit for each of the plurality of pairs; and
a conversion model learning unit that learns a conversion model for converting the broken word into the regular word, based on the correspondence obtained by the character string alignment unit for each of the plurality of pairs.

5. A character string conversion method comprising:
a step in which a partial character string specifying unit, for an input character string, determines character by character whether each character is a character of a broken word, based on a determination model for determining whether a character is a character of a broken word, a broken word being a notation that deviates from a regular word that is a normalized expression, and specifies a partial character string composed of characters determined to be characters of a broken word; and
a step in which a character string conversion unit converts the partial character string specified by the partial character string specifying unit based on a conversion model for converting the broken word into the regular word, thereby generating a character string in which the broken word contained in the character string has been converted into the regular word.

6. A model learning method comprising:
a step in which a character string alignment unit, based on a plurality of input pairs each consisting of a normalized sentence composed of regular words that are normalized expressions and a broken sentence containing a broken word that is a notation that deviates from a regular word, obtains for each of the plurality of pairs the correspondence between each character included in the normalized sentence and each character included in the broken sentence;
a step in which a determination model learning unit learns a determination model for determining whether a character is a character of a broken word, based on the correspondence obtained by the character string alignment unit for each of the plurality of pairs; and
a step in which a conversion model learning unit learns a conversion model for converting the broken word into the regular word, based on the correspondence obtained by the character string alignment unit for each of the plurality of pairs.

7. A program for causing a computer to function as each unit of the character string conversion device according to any one of claims 1 to 3.

8. A program for causing a computer to function as each unit of the model learning device according to claim 4.