CN111368563A - Uyghur-Chinese machine translation system fusing a clustering algorithm

Info

Publication number: CN111368563A
Application number: CN202010140937.4A
Authority: CN
Inventors: 艾山·吾买尔; 刘文其; 斯拉吉艾合麦提·如则麦麦提; 西热艾力·海热拉; 早克热·卡德尔; 买合木提·买买提; 汪烈军; 刘胜全
Original assignee: Xinjiang University
Current assignee: Xinjiang University
Priority date: 2020-03-03
Filing date: 2020-03-03
Publication date: 2020-07-03

Abstract

The invention discloses a Uyghur-Chinese machine translation system fusing a clustering algorithm. A Uyghur sentence vector model is trained using Doc2vec in gensim; text clustering of Uyghur sentences is performed with the k-means method; a Uyghur-Chinese machine translation model is trained using the Transformer structure; and the machine translation model is fine-tuned on each class of clustered data to obtain k sub-translation models. Combining these methods realizes Uyghur sentence vectorization, clustering, and training of the Uyghur-Chinese translation models. The system is characterized in that sentences of short length carry little semantic information and are therefore handled separately. The method further subdivides the corpus according to sentence length and k-means clustering, thereby improving the translation quality of Uyghur-Chinese machine translation.

Description

Uyghur-Chinese machine translation system fusing a clustering algorithm

Technical Field

The invention belongs to the field of machine translation, and particularly relates to a Uyghur-Chinese machine translation system fusing a clustering algorithm.

Background

Machine translation (MT) is automated translation in which a computer translates one language into another, greatly reducing the communication barriers between people who use different languages. In recent years, neural machine translation systems have advanced significantly and have essentially replaced traditional statistical machine translation. Both statistical and neural machine translation rely on large-scale bilingual parallel corpora. Although the Transformer model markedly improves translation quality for resource-rich languages, the shortage of language resources remains a persistent obstacle for machine translation of low-resource languages, and a system with better translation quality is urgently needed for low-resource languages such as Uyghur.

End-to-end neural machine translation (NMT) has made considerable progress. Unlike conventional statistical machine translation, NMT trains a neural network that maps one sequence directly to another, so the basic structure of an NMT model is the encoder-decoder structure shown in FIG. 1. Given a source language sentence $x = (x_1, x_2, \ldots, x_J)$ and a target language sentence $y = (y_1, y_2, \ldots, y_K)$, the probability of the target sentence is modeled directly with an end-to-end neural network:

$$P(y \mid x; \theta) = \prod_{k=1}^{K} p(y_k \mid y_{<k}, x; \theta) \tag{1}$$

where $\theta$ denotes the parameters of the entire model and $y_{<k} = (y_1, y_2, \ldots, y_{k-1})$ denotes the $k-1$ words that have already been translated. This encoder-decoder framework uses recurrent neural networks: the encoder compresses the source language sentence into a semantic vector, and the decoder recurrent neural network takes that vector as input and generates the words of the target language sentence one after another. A machine translation model based on recurrent neural networks works well on relatively short sentences, but its translation quality degrades severely on long sentences.

The attention mechanism was introduced mainly to select dynamically among the distributed representations of the source-side words, so that the model attends only to the information relevant to the next target word. It further improves the translation quality of the encoder-decoder framework on long sentences, making neural machine translation superior overall to traditional statistical machine translation. An encoder-decoder framework with an attention mechanism is shown in FIG. 2. The process by which the RNN generates the next word is expressed as:

$$p(y_k \mid y_{<k}, x; \theta) = \mathrm{softmax}(g(y_{k-1}, t_k, c_k)) \tag{2}$$

where $g$ is a nonlinear function, $t_k$ is the $k$-th hidden state at the decoder, and $c_k$ is the context vector used to generate the $k$-th target word. The attention model can be expressed as

$$c_k = \sum_{j=1}^{J} \alpha_{jk} s_j \tag{3}$$

$$\alpha_{jk} = \frac{\exp(e_{jk})}{\sum_{j'=1}^{J} \exp(e_{j'k})} \tag{4}$$

$$e_{jk} = v^{T} \arctan(U t_k + W s_j) \tag{5}$$

where $s_j$ is the distributed representation vector of the $j$-th source word at the encoder, $\alpha_{jk}$ is the attention weight relating the hidden-layer representations at the encoder and the decoder, and $v$, $U$ and $W$ are parameters. Given a training set, the model is optimized by maximum likelihood estimation.

Subsequently, Vaswani et al. proposed the Transformer, a neural machine translation model based on attention networks, which again greatly improved the translation quality of neural machine translation and is currently one of the mainstream translation models; its architecture is shown in FIG. 5. An attention network can be viewed as mapping a query (Q) onto key (K)-value (V) pairs and producing a weighted output, computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$

where $\sqrt{d_k}$ is an introduced temperature factor that mitigates instability during training. Because the temperature factor is derived from the vector length $d_k$ and the dot product is used as the similarity function, this is called the scaled dot-product attention network.
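As an illustration of the scaled dot-product attention described above, the following is a minimal pure-Python sketch of the commonly published form softmax(QK^T / sqrt(d_k)) V. The function names and the list-of-lists matrix layout are assumptions made for this example and are not taken from the patent.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v (lists of lists).

    Returns (outputs, weights): one output row and one weight row per query.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)  # the temperature factor derived from d_k
    outputs, weights = [], []
    for q in Q:
        # Dot-product similarity between this query and every key, scaled.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value rows.
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        )
    return outputs, weights
```

When all keys are identical, every key receives the same weight and the output is simply the mean of the value rows, which is a convenient sanity check.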

Doc2Vec, also called Paragraph Vector or sentence embeddings, was proposed by Mikolov on the basis of the word2vec model. It is an unsupervised algorithm that yields vector representations of sentences, paragraphs, or documents. The algorithm learns a vector to represent each document, and the structure of the model overcomes the shortcomings of the bag-of-words model. Doc2vec is inspired by word2vec: when word2vec learns word vectors by prediction, the resulting vectors capture word meaning, and Doc2vec builds the same kind of structure, so it overcomes the lack of semantics in the bag-of-words model. Like word2vec, Doc2vec has two training modes. One is PV-DM (Distributed Memory Model of Paragraph Vectors), whose structure is shown in FIG. 3 and which is similar to the CBOW model in word2vec; the other is PV-DBOW (Distributed Bag of Words version of Paragraph Vector), whose structure is shown in FIG. 4 and which is similar to the skip-gram model in word2vec.

A clustering algorithm automatically divides a collection of unlabeled data into different classes. Clustering is an unsupervised learning method and must ensure that data in the same class share similar characteristics. The K-Means algorithm is a partition-based clustering method and one of the ten classic data mining algorithms. Put simply, K-Means divides data into K parts without any supervisory signal, where K is the number of classes in the clustering result, that is, the number of classes into which the data are to be divided. In view of the above, the invention provides a Uyghur-Chinese machine translation system fusing a clustering algorithm.
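To make the K-Means procedure concrete, the following is a minimal sketch of the standard Lloyd iteration (assign each point to its nearest centre, then recompute each centre as the mean of its members). It is an illustration only: the function names, the random initialisation from the data points, and the iteration cap are assumptions, and the patent itself relies on a library implementation.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Partition `points` into k clusters; returns (centers, labels)."""
    rng = random.Random(seed)
    # Initialise the centres with k distinct data points (an assumption).
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centre for every point.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centres are too
        labels = new_labels
        # Update step: each centre becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels
```

No supervisory signal is used anywhere: the only inputs are the vectors and the chosen value of K.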

Disclosure of Invention

In order to solve the existing problems, the invention provides a Uyghur-Chinese machine translation system fusing a clustering algorithm.

The invention is realized by the following technical scheme:

A Uyghur-Chinese machine translation system fusing a clustering algorithm trains a Uyghur sentence vector model using Doc2vec in gensim; performs text clustering of Uyghur sentences with the k-means method; trains a Uyghur-Chinese machine translation model using the Transformer structure; and fine-tunes the machine translation model on each class of clustered data to obtain k sub-translation models. Combining these methods realizes Uyghur sentence vectorization, clustering, and training of the Uyghur-Chinese translation models. The system is characterized in that, because sentences of short length carry little semantic information, they degrade the quality of the sentence vector model during sentence vectorization and in turn worsen the clustering result; sentences with a length of less than 10 are therefore filtered out before sentence vector training, and only longer sentences are retained.

As a further optimization of the invention, in the Transformer-based Uyghur-Chinese machine translation model, the Uyghur side first tokenizes the Uyghur sentences and then applies byte pair encoding (BPE); the Chinese side first segments the Chinese sentences with the THULAC word segmentation tool and then applies BPE segmentation.
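For readers unfamiliar with BPE, the following sketch shows the usual merge-learning idea: repeatedly merge the most frequent adjacent symbol pair in the vocabulary. It is a simplified illustration on English toy words, not the patent's tooling; the function names, the `</w>` end-of-word marker, and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of words."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself so the
        # result is deterministic (an arbitrary choice for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment one word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

In the pipeline described above, the input words would come from the tokenized Uyghur text or from the THULAC-segmented Chinese text rather than from a toy word list.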

Compared with the prior art, the invention has the following beneficial effects. The invention further subdivides the corpus according to sentence length and k-means clustering, thereby improving the translation quality of Uyghur-Chinese machine translation; it is a Uyghur-Chinese machine translation system based on sentence length and a clustering algorithm. First, a complete machine translation model is trained on all training corpora as the baseline system; then sentences with a length greater than 10 are selected and clustered, and the model is fine-tuned. The method trains a sentence vector model with Doc2vec to obtain a vector for each sentence, clusters the vectors with K-means to obtain K classes, and then fine-tunes the machine translation model on each class of data separately to obtain K translation models. The translation models use character and BPE segmentation. At test time, a model is selected according to sentence length and cluster, with short sentences translated directly by the baseline model, to obtain the final translation result.

Drawings

FIG. 1 is a diagram of an encoder-decoder model of the present invention;

FIG. 2 is a diagram of the encoder-decoder + Attention model of the present invention;

FIG. 3 is a diagram of a PV-DM model of the present invention;

FIG. 4 is a diagram of a PV-DBOW model of the present invention;

FIG. 5 is a structural diagram of the Uyghur-Chinese machine translation model trained with the Transformer structure according to the present invention;

FIG. 6 is a training flow diagram of the present invention;

FIG. 7 is a testing flow diagram of the present invention.

In the figures: 1. the Uyghur-Chinese machine translation model trained with the Transformer structure.

Detailed Description

The invention is described in further detail below with reference to the following detailed description and accompanying drawings:

As shown in FIGS. 1-7, a Uyghur-Chinese machine translation system fusing a clustering algorithm trains a Uyghur sentence vector model using Doc2vec in gensim; performs text clustering of Uyghur sentences with the k-means method; trains a Uyghur-Chinese machine translation model (1) using the Transformer structure; and fine-tunes the machine translation model on each class of clustered data to obtain k sub-translation models. Combining these methods realizes Uyghur sentence vectorization, clustering, and training of the Uyghur-Chinese translation models. Because sentences of short length carry little semantic information, they degrade the quality of the sentence vector model during sentence vectorization and in turn worsen the clustering result; sentences with a length of less than 10 are therefore filtered out before sentence vector training, and only longer sentences are retained. In the Transformer-based Uyghur-Chinese machine translation model, the Uyghur side first tokenizes the Uyghur sentences and then applies byte pair encoding (BPE); the Chinese side first segments the Chinese sentences with the THULAC word segmentation tool and then applies BPE segmentation.

In this embodiment, all the Uyghur-Chinese bilingual data are first preprocessed, and a Transformer-based Uyghur-Chinese machine translation model is trained as the baseline model. Uyghur sentences with a length greater than 10 are then selected to train the sentence vector model, followed by k-means clustering. Finally, the machine translation model is fine-tuned on each class of data to obtain k sub-translation models.

The data used are the 170,000 training sentence pairs of the CCMT2019 Uyghur-Chinese machine translation data and 910,000 parallel sentence pairs collected by the laboratory, about 1.08 million pairs in total. Of these, 625,000 are sentence pairs in which the Uyghur sentence contains more than 10 words. Both the development set and the test set use the 1,000 sentences of the CCMT2019 Uyghur-Chinese machine translation development set.

The invention trains the machine translation model with the Transformer in the open-source system OpenNMT, with the following parameter settings: 16 multi-head attention heads, 6 encoder and decoder layers, a word vector dimension of 768, a dropout of 0.3, and the Adam optimizer with adam_beta1 = 0.9 and adam_beta2 = 0.998. A Uyghur-Chinese machine translation model is trained on all the data as the baseline model.
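The settings above can be collected in one place and sanity-checked, as in the sketch below. The dictionary keys other than `adam_beta1` and `adam_beta2` are illustrative names chosen for this example, not OpenNMT option names, and the check assumes that the attention model dimension equals the stated word vector dimension.

```python
# Hyperparameters as stated in the description; key names are illustrative.
TRANSFORMER_SETTINGS = {
    "attention_heads": 16,
    "layers": 6,            # used for both the encoder and the decoder
    "word_vector_dim": 768,
    "dropout": 0.3,
    "optimizer": "adam",
    "adam_beta1": 0.9,
    "adam_beta2": 0.998,
}


def per_head_dim(settings):
    """Dimension handled by each attention head.

    Assumes the model dimension equals the word vector dimension, which
    multi-head attention requires to be divisible by the head count.
    """
    dim = settings["word_vector_dim"]
    heads = settings["attention_heads"]
    if dim % heads != 0:
        raise ValueError("word_vector_dim must be divisible by attention_heads")
    return dim // heads
```

With the stated values each of the 16 heads works on a 48-dimensional slice.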

The invention trains the Uyghur sentence vector model with the Doc2vec model in gensim, using Doc2vec's default PV-DM structure. To obtain more accurate semantic information, only sentences with a length greater than 10 are selected for sentence vector training.
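A minimal sketch of this step is given below, assuming the sentences are already tokenized into lists of words. The helper names, `vector_size`, `min_count`, and `epochs` are assumptions for illustration; the patent only states that gensim's Doc2vec is used with the PV-DM structure (`dm=1` in gensim). gensim is imported inside the training function so that the length filter can be used on its own.

```python
def filter_long_sentences(sentences, min_length=10):
    """Keep only tokenized sentences with more than `min_length` words."""
    return [tokens for tokens in sentences if len(tokens) > min_length]


def train_sentence_vectors(sentences, vector_size=100, epochs=20):
    """Train a PV-DM Doc2Vec model on tokenized sentences.

    vector_size and epochs are placeholder values, not taken from the patent.
    """
    # Imported lazily: requires the third-party gensim package.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    documents = [
        TaggedDocument(words=tokens, tags=[index])
        for index, tokens in enumerate(sentences)
    ]
    # dm=1 selects the PV-DM training mode.
    return Doc2Vec(
        documents, dm=1, vector_size=vector_size, min_count=1, epochs=epochs
    )
```

A trained model can then vectorize an unseen tokenized sentence with `model.infer_vector(tokens)`.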

The invention uses the KMeans clustering method in sklearn for text clustering; only the 626,000 Uyghur sentences with a length greater than 10 are vectorized and clustered. Because k-means is an unsupervised learning method, the number of sentences in each class cannot be controlled and the value of k must be chosen manually; if k is set inappropriately, some classes are likely to contain very little data and others too much. After repeated clustering with k from 2 to 20, k = 5 was found to give a good clustering result with balanced classes. The number of sentences in each class is shown in Table 1.

Table 1. Number of sentences in each cluster
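The search over k can be sketched as follows. The patent does not state how class balance was judged, so the largest-to-smallest size ratio used here is a hypothetical criterion, and the helper names, `n_init`, and `seed` are assumptions. scikit-learn is imported inside `sweep_k` so that the counting helpers work without it.

```python
from collections import Counter


def cluster_sizes(labels, k):
    """Number of items assigned to each of the k clusters."""
    counts = Counter(labels)
    return [counts.get(c, 0) for c in range(k)]


def imbalance(labels, k):
    """Largest cluster size divided by the smallest (inf if one is empty)."""
    sizes = cluster_sizes(labels, k)
    if min(sizes) == 0:
        return float("inf")
    return max(sizes) / min(sizes)


def sweep_k(vectors, k_values=range(2, 21), seed=0):
    """Cluster `vectors` for each k and report the cluster sizes."""
    # Imported lazily: requires the third-party scikit-learn package.
    from sklearn.cluster import KMeans

    report = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(vectors).tolist()
        report[k] = cluster_sizes(labels, k)
    return report
```

A ratio close to 1 indicates evenly sized classes; very large ratios correspond to the situation described above in which some classes hold very little data.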

The baseline translation model is fine-tuned as follows: after k-means clustering yields 5 classes of data, BPE segmentation is applied to the 5 classes, and the baseline model is fine-tuned on each class separately, finally yielding 5 sub-translation models. The number of BPE iterations is 24k on the Uyghur side and 30k on the Chinese side.

When a sentence is translated, its length is computed first. If the sentence contains 10 words or fewer, it is translated directly with the baseline model. If it contains more than 10 words, it is vectorized with the sentence vector model, its distance to the cluster centre of each class is computed, and the model fine-tuned on the class with the shortest distance is selected for translation. The 1,000 test sentences are thus divided into 6 classes and translated by 6 translation models. Because every sentence in the test set used here contains more than 10 words, the short-sentence class of the test set contains 0 sentences. The number of test sentences in each class is shown in Table 2.

Table 2. Number of test sentences in each class
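The model selection rule described above can be sketched as a small routing function. The names `choose_model`, `infer_vector`, and `centroids` are hypothetical; in practice `infer_vector` could be the sentence vector model's inference method and `centroids` the cluster centres from k-means. Euclidean distance is assumed because it is the distance k-means itself minimises; the patent only says "distance".

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def choose_model(tokens, infer_vector, centroids, max_short_length=10):
    """Return "baseline" for short sentences, else the nearest cluster index.

    tokens:       the tokenized source sentence
    infer_vector: callable mapping a token list to a sentence vector
    centroids:    list of cluster-centre vectors, one per fine-tuned model
    """
    if len(tokens) <= max_short_length:
        return "baseline"
    vector = infer_vector(tokens)
    # Squared distance preserves the ordering, so no square root is needed.
    return min(
        range(len(centroids)),
        key=lambda c: squared_distance(vector, centroids[c]),
    )
```

The returned value is either the string `"baseline"` or an index into the list of fine-tuned sub-translation models, which gives the six routes (baseline plus five clusters) mentioned above when k = 5.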

The BLEU score is used to evaluate the translation quality of the models. The baseline BLEU score is obtained by translating the test set directly with the baseline system. For the models fine-tuned after clustering, the BLEU score of each model is computed separately, and the overall BLEU score computed over all test subsets combined is shown in Table 3.

Table 3. Translation quality of the models
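The patent does not say which BLEU implementation was used. As a rough illustration of the metric, the sketch below computes an unsmoothed, single-reference, corpus-level BLEU over tokenized sentences; the overall score is obtained by pooling the outputs of all sub-models into one list before scoring rather than by averaging per-class scores. The function names are assumptions for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus BLEU with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            # Clipped counts: an n-gram is credited at most as often as it
            # occurs in the reference.
            matches[n - 1] += sum(
                min(count, ref_counts[gram]) for gram, count in hyp_counts.items()
            )
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for output shorter than the reference.
    penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return penalty * math.exp(log_precision)
```

The function returns a value between 0 and 1; reported BLEU scores are usually this value multiplied by 100.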

To show the translation quality of the model fine-tuned on each class more precisely, Table 4 lists, for the test subset of each class, the translation quality obtained with the baseline system and with the fine-tuned model.

Table 4. Translation quality of each model when k = 5

As can be seen from Table 3, the BLEU score of the Uyghur-Chinese machine translation models after clustering is 1.06 BLEU points higher than that of the baseline model trained directly on all training data, which indicates that clustering divides the training set meaningfully to some extent. At test time, selecting the model whose class is most similar to the test data improves translation quality to a certain extent.

The foregoing illustrates and describes the principles, general features, and advantages of the present invention. Those skilled in the art will understand that the present invention is not limited to the embodiments described above; the embodiments and the description merely illustrate the principle of the invention, and various changes and modifications may be made without departing from its spirit and scope, all of which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.

Claims

1. A Uyghur-Chinese machine translation system fusing a clustering algorithm, which trains a Uyghur sentence vector model using Doc2vec in gensim; performs text clustering of Uyghur sentences with the k-means method; trains a Uyghur-Chinese machine translation model using the Transformer structure; fine-tunes the machine translation model on each class of clustered data to obtain k sub-translation models; and combines these methods to realize Uyghur sentence vectorization, clustering, and training of the Uyghur-Chinese translation models, characterized in that: because sentences of short length carry little semantic information, they degrade the quality of the sentence vector model during sentence vectorization and in turn worsen the clustering result, so sentences with a length of less than 10 are filtered out before sentence vector training and only longer sentences are retained.

2. The Uyghur-Chinese machine translation system fusing a clustering algorithm according to claim 1, wherein: in the Transformer-based Uyghur-Chinese machine translation model, the Uyghur side first tokenizes the Uyghur sentences and then applies byte pair encoding (BPE); the Chinese side first segments the Chinese sentences with the THULAC word segmentation tool and then applies BPE segmentation.