JP7017533B2

JP7017533B2 - Classification device, learning device, classification method and program

Info

Publication number: JP7017533B2
Application number: JP2019030780A
Authority: JP
Inventors: ソンホアンコックグエン; フンタオトラン; 清良披田野; 晋作清本
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2019-02-22
Filing date: 2019-02-22
Publication date: 2022-02-08
Anticipated expiration: 2039-02-22
Also published as: JP2020135644A

Description

The present invention relates to a device for classifying machine translation and human translation.

Conventionally, attackers have used machine translation to create malicious documents such as spam emails. Determining whether a document was machine-translated or translated by a human therefore makes it possible to detect malicious documents with high probability and to reduce security risks.
For example, Non-Patent Documents 1 to 8 propose methods for classifying machine translation and human translation.

Non-Patent Document 1: Chae, J., Nenkova, A.: Predicting the fluency of text with shallow structural features: case studies of machine translation and human-written text. In: EACL, pp. 139-147 (2009).
Non-Patent Document 2: Li, Y., Wang, R., Zhao, H.: A machine learning method to distinguish machine translation from human translation. In: PACLIC, pp. 354-360 (2015).
Non-Patent Document 3: Arase, Y., Zhou, M.: Machine translation detection from monolingual web-text. In: ACL (1), pp. 1597-1607 (2013).
Non-Patent Document 4: Aharoni, R., Koppel, M., Goldberg, Y.: Automatic detection of machine translated text and translation quality estimation. In: ACL (2014).
Non-Patent Document 5: Nguyen-Son, H.Q., Echizen, I.: Detecting computer-generated text using fluency and noise features. In: International Conference of the Pacific Association for Computational Linguistics, pp. 288-300. Springer (2017).
Non-Patent Document 6: Nguyen-Son, H.Q., Tieu, N.D.T., Nguyen, H.H., Yamagishi, J., Zen, I.E.: Identifying computer-generated text using statistical analysis. In: Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2017, pp. 1504-1511. IEEE (2017).
Non-Patent Document 7: Labbe, C., Labbe, D.: Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science? In: Scientometrics 94(1), 379-396 (2013).
Non-Patent Document 8: Nguyen-Son, H.Q., Tieu, N.D.T., Nguyen, H.H., Yamagishi, J., Echizen, I.: Identifying computer-generated paragraphs using coherence and fluency features. In: Proceedings PACLIC (2018).

However, conventional classification methods depend heavily on the quality of the machine translation. They can separate the two classes when machine translation is markedly inferior to human translation, but their classification accuracy has declined as machine translation performance has improved in recent years.

An object of the present invention is to provide a classification device, a learning device, a classification method, and a program capable of accurately classifying machine translation and human translation.

The classification device according to the present invention includes: a tagging unit that divides each piece of document data into words and assigns a part-of-speech tag to each of the words; a distance calculation unit that, for each piece of document data, calculates, for combinations of the words, the distance between the word vectors defined for the respective words; an extraction unit that, for each piece of document data and each combination of part-of-speech tags, extracts the minimum value from each group of the distances relating to the same word; a feature amount calculation unit that, for each piece of document data and each combination of part-of-speech tags, calculates a statistic over the group of the minimum values as a feature amount; and a classification unit that, for each piece of document data, takes the feature amounts for the combinations of part-of-speech tags as input and outputs a classification result using a model trained on document data labeled in advance as machine translation or human translation.

The statistic may include at least one of a mean and a variance.

The learning device according to the present invention includes: a tagging unit that divides each piece of document data into words and assigns a part-of-speech tag to each of the words; a distance calculation unit that, for each piece of document data, calculates, for combinations of the words, the distance between the word vectors defined for the respective words; an extraction unit that, for each piece of document data and each combination of part-of-speech tags, extracts the minimum value from each group of the distances relating to the same word; a feature amount calculation unit that, for each piece of document data and each combination of part-of-speech tags, calculates a statistic over the group of the minimum values as a feature amount; and a learning unit that, for each piece of document data, takes the feature amounts for the combinations of part-of-speech tags as input and generates a model that has learned the machine translation or human translation category labeled in advance.

The learning unit may generate a plurality of the models using a plurality of learning algorithms and select the model that outputs the category with the highest accuracy.

In the classification method according to the present invention, a computer executes: a tagging step of dividing each piece of document data into words and assigning a part-of-speech tag to each of the words; a distance calculation step of calculating, for each piece of document data and for combinations of the words, the distance between the word vectors defined for the respective words; an extraction step of extracting, for each piece of document data and each combination of part-of-speech tags, the minimum value from each group of the distances relating to the same word; a feature amount calculation step of calculating, for each piece of document data and each combination of part-of-speech tags, a statistic over the group of the minimum values as a feature amount; and a classification step of taking, for each piece of document data, the feature amounts for the combinations of part-of-speech tags as input and outputting a classification result using a model trained on document data labeled in advance as machine translation or human translation.

The classification program according to the present invention causes a computer to function as the classification device.

The learning program according to the present invention causes a computer to function as the learning device.

According to the present invention, machine translation and human translation can be classified accurately.

FIG. 1 is a diagram showing the functional configuration of the classification device according to the embodiment.
FIG. 2 is a diagram illustrating the types of part-of-speech tags according to the embodiment.
FIG. 3 is a diagram illustrating the procedure by which part-of-speech tags are assigned to the words constituting the document data according to the embodiment.
FIG. 4 is a diagram showing a specific example of the unit of distance calculation according to the embodiment.
FIG. 5 is a diagram showing the procedure for extracting the minimum value of the distance between words according to the embodiment.
FIG. 6 is a diagram illustrating the distance data between words for each piece of document data and each part-of-speech pair according to the embodiment.
FIG. 7 is a diagram illustrating the feature amounts for each piece of document data according to the embodiment.

An example of an embodiment of the present invention is described below.
FIG. 1 is a diagram showing the functional configuration of the classification device 1 according to the present embodiment.
The classification device 1, which serves as both a classification device and a learning device, is an information processing device (computer) such as a server device or a personal computer. It includes a control unit 10 and a storage unit 20, as well as input/output devices for various data, communication devices, and the like.

The control unit 10 controls the entire classification device 1 and realizes each function of the present embodiment by reading and executing, as appropriate, the various programs stored in the storage unit 20. The control unit 10 may be a CPU.

The storage unit 20 is a storage area for the various programs and data that make the hardware group function as the classification device 1, and may be a ROM, a RAM, a flash memory, a hard disk drive (HDD), or the like. Specifically, the storage unit 20 stores the program that causes the control unit 10 to execute each function of the present embodiment, as well as dictionary data 21, a corpus 22, a learning model 23, and the like.

The control unit 10 includes an input unit 11, a tagging unit 12, a distance calculation unit 13, an extraction unit 14, a feature amount calculation unit 15, a learning unit 16, and a classification unit 17. Using these functional units, the control unit 10 generates the learning model 23 that classifies machine translation and human translation, classifies new document data as either machine-translated or human-translated, and outputs the result.

The input unit 11 accepts document data (text) as input, either as training data for the learning model 23 or as a target to be classified by the learning model 23.

The tagging unit 12 divides each piece of input document data into words and assigns a part-of-speech tag to each of these words.
Existing morphological analysis methods can be used to assign the part-of-speech tags. In doing so, the tagging unit refers to the dictionary data 21, in which parts of speech are defined for the language concerned, such as Japanese or English.
The dictionary data 21 may be stored in a device other than the classification device 1, or may be stored in, for example, a public database accessible via the Internet or the like.
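The tagging step can be sketched as a simple dictionary lookup. This is a minimal illustration, not the patent's morphological analyzer: `TOY_DICTIONARY`, `tag_document`, the tokenizing regular expression, and the `UNK` fallback tag are all hypothetical names and choices introduced here, and a real system would use an existing tagger with its own dictionary.

```python
import re

# Hypothetical stand-in for the dictionary data 21: word -> part-of-speech tag.
# The tag names are illustrative; the source only says that nouns such as
# "school" receive NN and verbs such as "go" receive VB.
TOY_DICTIONARY = {
    "i": "PRP", "go": "VB", "walk": "VB", "to": "TO",
    "school": "NN", "morning": "NN", "bag": "NN", "every": "DT",
}

def tag_document(text, dictionary, default_tag="UNK"):
    """Split text into lowercase words and attach a POS tag to each word.

    Words missing from the dictionary get default_tag (an assumption made
    for this sketch; the source does not say how unknown words are handled).
    """
    words = re.findall(r"[a-z]+", text.lower())
    return [(word, dictionary.get(word, default_tag)) for word in words]

tagged = tag_document("I go to school every morning.", TOY_DICTIONARY)
```

The result is a list of `(word, tag)` tuples in document order, which is the shape the later sketches assume.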

FIG. 2 is a diagram illustrating the types of part-of-speech tags according to the present embodiment.
Taking the processing of English document data as an example, the figure shows the 45 types of part-of-speech tags (POS tags) assigned to the segmented words, together with their meanings.

FIG. 3 is a diagram illustrating the procedure by which part-of-speech tags are assigned to the words constituting the document data according to the present embodiment.
For example, the nouns "school", "morning", "bag", and so on in document 1 are given the part-of-speech tag "NN", and the verbs "go", "walk", and so on are given the part-of-speech tag "VB".
In this way, for each of the plurality of input document data, the tagging unit 12 assigns one of the 45 part-of-speech tags described above to every word constituting the document data.

For each piece of document data, the distance calculation unit 13 calculates, for combinations of words, the distance between the word vectors defined for the respective words.
The multidimensional (for example, 300-dimensional) word vector specific to each word is learned from a large dataset, and the relative closeness of two vectors indicates the semantic similarity or relatedness of the corresponding words. The word vectors are stored in the corpus 22 in advance.
The corpus 22 may be stored in a device other than the classification device 1, or may be stored in, for example, a public database accessible via the Internet or the like.

The distance calculation unit 13 calculates, for example, the Euclidean distance d by the following formula. Here, p and q denote two words, and p_i and q_i denote the i-th element (1 ≤ i ≤ n) of the respective n-dimensional word vectors.

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
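As a small illustration of the Euclidean distance between two word vectors, the sketch below uses made-up 3-dimensional vectors; the function name `euclidean_distance` and the vector values are assumptions for this example, and real word vectors would have far more dimensions (for example, 300).

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two word vectors of equal dimension."""
    if len(p) != len(q):
        raise ValueError("word vectors must have the same dimension")
    return math.sqrt(sum((p_i - q_i) ** 2 for p_i, q_i in zip(p, q)))

# Hypothetical toy vectors standing in for entries of the corpus 22.
vectors = {
    "school": [0.0, 3.0, 0.0],
    "bag": [4.0, 0.0, 0.0],
}

# sqrt(4^2 + 3^2 + 0^2) = 5
d = euclidean_distance(vectors["school"], vectors["bag"])
```

On Python 3.8 and later, the standard library function `math.dist` computes the same quantity.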

FIG. 4 is a diagram showing a specific example of the unit of distance calculation according to the present embodiment.
For the combination of noun (NN) and noun (a part-of-speech pair) in document 1, the distance calculation unit 13 extracts each combination of two words in turn, such as "school" and "morning", "school" and "bag", and "morning" and "bag", and calculates the distance d for each combination.
Similarly, when there are 45 types of part of speech, for example, the distance d between words is calculated per document for each of the 1035 possible part-of-speech pairs.
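The figure of 1035 is consistent with counting unordered pairs of 45 tags with same-tag pairs such as NN-NN included: 45 + C(45, 2) = 45 + 990 = 1035. The sketch below checks that count and groups word combinations by part-of-speech pair. The helper names, the placeholder tag names, and the choice to treat repeated words as a single word are assumptions made for illustration.

```python
from itertools import combinations, combinations_with_replacement

def pos_pairs(tags):
    """All unordered POS pairs, including same-tag pairs such as (NN, NN)."""
    return list(combinations_with_replacement(sorted(tags), 2))

def word_pairs_by_pos_pair(tagged_words):
    """Group every unordered pair of distinct words by the POS pair of its members."""
    unique = list(dict.fromkeys(tagged_words))  # drop repeats, keep document order
    grouped = {}
    for (w1, t1), (w2, t2) in combinations(unique, 2):
        key = tuple(sorted((t1, t2)))
        grouped.setdefault(key, []).append((w1, w2))
    return grouped

# 45 placeholder tag names; the real tag set is the one shown in FIG. 2.
placeholder_tags = [f"T{i:02d}" for i in range(45)]
n_pairs = len(pos_pairs(placeholder_tags))  # 1035

doc1 = [("school", "NN"), ("morning", "NN"), ("bag", "NN"),
        ("go", "VB"), ("walk", "VB")]
grouped = word_pairs_by_pos_pair(doc1)
nn_nn = grouped[("NN", "NN")]
```

For the NN-NN pair, `nn_nn` holds the same three combinations named in the text: school and morning, school and bag, morning and bag.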

For each piece of document data and each part-of-speech pair, the extraction unit 14 extracts the minimum value from each group of distances between one word and the other words.
For example, for the part-of-speech pair "NN-NN" in document 1, the minimum distance is extracted from each of several groups: the group of distances between "school" and the other words ("morning", "bag", and so on), the group of distances between "morning" and the other words, and the group of distances between "bag" and the other words.

FIG. 5 is a diagram showing the procedure for extracting the minimum value of the distance between words according to the present embodiment.
For part-of-speech pair 1 (NN-NN) of document 1, the minimum value 0.6 is extracted from the group of distances "6.3, 4.6, 2.8, 0.6, 9.2" relating to one word. Likewise, the minimum value 2.1 is extracted from the group of distances "3.9, 6.5, 2.1, 5.8, 4.6" relating to another word.
In this way, one or more minimum-value data items are extracted for each piece of document data and each part-of-speech pair, and the same processing is performed for all part-of-speech pairs and all document data.
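The extraction step can be sketched in two parts: taking the minimum of each distance group, and building those groups for a same-tag pair such as NN-NN. The function names and the 2-dimensional toy vectors are hypothetical; the two distance groups are the ones quoted in the text.

```python
import math

def extract_minima(distance_groups):
    """Return the minimum of each non-empty group of distances."""
    return [min(group) for group in distance_groups if group]

# The two groups given for part-of-speech pair 1 (NN-NN) of document 1.
groups = [
    [6.3, 4.6, 2.8, 0.6, 9.2],
    [3.9, 6.5, 2.1, 5.8, 4.6],
]
minima = extract_minima(groups)  # [0.6, 2.1]

def min_distance_per_word(words, vectors):
    """For each word, the smallest Euclidean distance to any other word in the list."""
    result = {}
    for word in words:
        group = [math.dist(vectors[word], vectors[other])
                 for other in words if other != word]
        if group:  # a word with no partner contributes no minimum
            result[word] = min(group)
    return result

# Hypothetical 2-dimensional vectors, for illustration only.
toy_vectors = {"school": [0.0, 0.0], "morning": [3.0, 4.0], "bag": [0.0, 1.0]}
per_word = min_distance_per_word(["school", "morning", "bag"], toy_vectors)
```

Each word contributes one minimum, so a part-of-speech pair with k participating words yields up to k values for the later statistics.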

FIG. 6 is a diagram illustrating the distance data between words for each piece of document data and each part-of-speech pair according to the present embodiment.
In this example, the minimum-value data extracted by the extraction unit 14 is written as a matrix of document data against part-of-speech pairs. As described above, each element of the matrix holds one or more minimum-value data items.
In addition, when the document data is training data, each piece of document data carries a classification label indicating machine translation or human translation.

For each piece of document data and each part-of-speech pair, the feature amount calculation unit 15 calculates a statistic over the group of minimum values as a feature amount.
Writing the n minimum-value data items contained in a matrix element (group) as a_i (1 ≤ i ≤ n), the statistic may include, for example, at least one of the average and the variance of the a_i.
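The formulas for the average and the variance are not reproduced in this text. Under the usual definitions, and assuming the population form of the variance (division by n, which the source does not confirm), they would read as follows. The symbols \bar{a} and \sigma^2 are notation introduced here; a_i and n are as defined above.

```latex
\bar{a} = \frac{1}{n} \sum_{i=1}^{n} a_i,
\qquad
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} \left( a_i - \bar{a} \right)^2
```

For the two minimum values 0.6 and 2.1 from the earlier example, this gives an average of 1.35 and a variance of 0.5625.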

FIG. 7 is a diagram illustrating the feature amounts for each piece of document data according to the present embodiment.
In this example, two feature amounts, the average and the variance, are calculated for each of the 1035 part-of-speech pairs in each piece of document data.
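With two statistics per part-of-speech pair, each document is represented by 1035 × 2 = 2070 feature values. The sketch below assembles such a vector for a shortened list of pairs. The function name, the use of the population variance, and the fill value of 0.0 for pairs that do not occur in a document are assumptions; the source does not say how empty pairs are handled.

```python
from statistics import fmean, pvariance

def document_features(minima_by_pair, all_pairs, fill=0.0):
    """Flatten per-pair minima into [mean, variance, mean, variance, ...].

    all_pairs fixes the feature order so every document yields a vector of
    the same length. Pairs without any minima are filled with `fill`.
    """
    features = []
    for pair in all_pairs:
        values = minima_by_pair.get(pair, [])
        if values:
            features.extend([fmean(values), pvariance(values)])
        else:
            features.extend([fill, fill])
    return features

# Shortened pair list for illustration; the embodiment uses 1035 pairs.
all_pairs = [("NN", "NN"), ("NN", "VB"), ("VB", "VB")]
doc1_minima = {("NN", "NN"): [0.6, 2.1]}
features = document_features(doc1_minima, all_pairs)
```

Here `features` has six entries: roughly 1.35 and 0.5625 for NN-NN, followed by the fill value for the two pairs with no data.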

For each piece of document data, the learning unit 16 takes the feature amounts for the part-of-speech pairs as input and generates the learning model 23, which has learned the machine translation or human translation category labeled in advance.
The method for generating the learning model 23 may be selected as appropriate from various learning algorithms, such as logistic regression, a linear classifier, a support vector machine trained by stochastic gradient descent, and a support vector machine trained by sequential minimal optimization.
The learning unit 16 may also generate a plurality of learning models 23 using a plurality of learning algorithms, in which case the learning model 23 with the highest output accuracy may be selected.
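Selecting the most accurate of several models can be sketched as a loop over candidate learners. To keep the example self-contained, the two learners below (a nearest-centroid classifier and a majority-class baseline) are simple stand-ins, not the logistic regression or support vector machines named in the text. Measuring accuracy on a held-out validation set is also an assumption, as are all names and the toy feature values.

```python
def accuracy(model, X, y):
    """Fraction of samples for which the model predicts the true label."""
    return sum(1 for x_i, y_i in zip(X, y) if model(x_i) == y_i) / len(y)

def train_centroid(X, y):
    """Stand-in learner 1: predict the label of the nearest class centroid."""
    centroids = {}
    for label in set(y):
        rows = [x for x, l in zip(X, y) if l == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    def predict(x):
        return min(centroids, key=lambda l: sum(
            (a - b) ** 2 for a, b in zip(x, centroids[l])))
    return predict

def train_majority(X, y):
    """Stand-in learner 2: always predict the most common training label."""
    majority = max(set(y), key=y.count)
    return lambda x: majority

def select_best_model(learners, X_train, y_train, X_val, y_val):
    """Train every learner and keep the one with the highest validation accuracy."""
    best = (None, None, -1.0)
    for name, learn in learners.items():
        model = learn(X_train, y_train)
        acc = accuracy(model, X_val, y_val)
        if acc > best[2]:
            best = (name, model, acc)
    return best

# Hypothetical 2-feature vectors; real inputs would hold 2070 features each.
X_train = [[0.5, 0.1], [0.6, 0.2], [2.0, 1.0], [2.2, 1.1], [2.1, 0.9]]
y_train = ["human", "human", "machine", "machine", "machine"]
X_val = [[0.55, 0.15], [2.05, 1.0]]
y_val = ["human", "machine"]

learners = {"centroid": train_centroid, "majority": train_majority}
best_name, best_model, best_acc = select_best_model(
    learners, X_train, y_train, X_val, y_val)
```

The selected `best_model` is a callable that maps a feature vector to a label, which is the role the learning model 23 plays for the classification unit 17.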

When document data to be classified is input, the classification unit 17 takes as input the feature amounts (for example, the average and the variance) for the part-of-speech pairs calculated by the feature amount calculation unit 15, and outputs a classification result using the learning model 23.

According to the present embodiment, the classification device 1 assigns part-of-speech tags to the words constituting the document data and calculates the distance between words for each part-of-speech pair. The classification device 1 then takes as input the feature amounts obtained by statistically processing these distances, and generates the learning model 23 based on the known category of machine translation or human translation.
Compared with machine translation, human translation tends to use highly similar or related words consistently, not only within a single sentence but across multiple sentences in a document. The classification device 1 expresses this difference in tendency through feature amounts based on the distance between words, and can thereby generate an appropriate learning model 23.
The classification device 1 can therefore classify machine translation and human translation accurately. As a result, document data that users do not want, such as spam emails, can be identified with high accuracy.

In addition, by using at least one of the average and the variance of the per-word minimum distances as the feature amount, the classification device 1 can classify document data accurately with simple calculations.
Furthermore, because the classification device 1 generates learning models 23 using a plurality of learning algorithms and selects the most accurate one, it can classify document data with higher accuracy.

Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment. The effects described in the embodiment are merely a list of the most favorable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiment.

The learning method and the classification method of the classification device 1 are realized by software. When they are realized by software, the programs constituting the software are installed on an information processing device (computer). These programs may be recorded on removable media such as a CD-ROM and distributed to users, or may be distributed by being downloaded to a user's computer via a network. Furthermore, these programs may be provided to a user's computer as a Web service over a network without being downloaded.

1 Classification device (learning device)
10 Control unit
11 Input unit
12 Tagging unit
13 Distance calculation unit
14 Extraction unit
15 Feature amount calculation unit
16 Learning unit
17 Classification unit
20 Storage unit
23 Learning model

Claims

1. A classification device comprising:
a tagging unit that divides each piece of document data into words and assigns a part-of-speech tag to each of the words;
a distance calculation unit that, for each piece of document data, calculates, for combinations of the words, the distance between the word vectors defined for the respective words;
an extraction unit that, for each piece of document data and each combination of part-of-speech tags, extracts the minimum value from each group of the distances relating to the same word;
a feature amount calculation unit that, for each piece of document data and each combination of part-of-speech tags, calculates a statistic over the group of the minimum values as a feature amount; and
a classification unit that, for each piece of document data, takes the feature amounts for the combinations of part-of-speech tags as input and outputs a classification result using a model trained on document data labeled in advance as machine translation or human translation.

2. The classification device according to claim 1, wherein the statistic includes at least one of a mean and a variance.

3. A learning device comprising:
a tagging unit that divides each piece of document data into words and assigns a part-of-speech tag to each of the words;
a distance calculation unit that, for each piece of document data, calculates, for combinations of the words, the distance between the word vectors defined for the respective words;
an extraction unit that, for each piece of document data and each combination of part-of-speech tags, extracts the minimum value from each group of the distances relating to the same word;
a feature amount calculation unit that, for each piece of document data and each combination of part-of-speech tags, calculates a statistic over the group of the minimum values as a feature amount; and
a learning unit that, for each piece of document data, takes the feature amounts for the combinations of part-of-speech tags as input and generates a model that has learned the machine translation or human translation category labeled in advance.

4. The learning device according to claim 3, wherein the learning unit generates a plurality of the models using a plurality of learning algorithms and selects the model that outputs the category with the highest accuracy.

5. A classification method in which a computer executes:
a tagging step of dividing each piece of document data into words and assigning a part-of-speech tag to each of the words;
a distance calculation step of calculating, for each piece of document data and for combinations of the words, the distance between the word vectors defined for the respective words;
an extraction step of extracting, for each piece of document data and each combination of part-of-speech tags, the minimum value from each group of the distances relating to the same word;
a feature amount calculation step of calculating, for each piece of document data and each combination of part-of-speech tags, a statistic over the group of the minimum values as a feature amount; and
a classification step of taking, for each piece of document data, the feature amounts for the combinations of part-of-speech tags as input and outputting a classification result using a model trained on document data labeled in advance as machine translation or human translation.

6. A classification program for causing a computer to function as the classification device according to claim 1 or 2.

7. A learning program for causing a computer to function as the learning device according to claim 3 or 4.