CN111368563A - Uyghur-Chinese machine translation system fusing a clustering algorithm - Google Patents

Uyghur-Chinese machine translation system fusing a clustering algorithm

Info

Publication number
CN111368563A
Authority
CN
China
Prior art keywords
model
machine translation
clustering
Uyghur
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010140937.4A
Other languages
Chinese (zh)
Inventor
艾山·吾买尔
刘文其
斯拉吉艾合麦提·如则麦麦提
西热艾力·海热拉
早克热·卡德尔
买合木提·买买提
汪烈军
刘胜全
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinjiang University
Original Assignee
Xinjiang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinjiang University
Priority to CN202010140937.4A
Publication of CN111368563A
Legal status: Pending

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Uyghur-Chinese machine translation system fusing a clustering algorithm, which trains a Uyghur sentence-vector model using Doc2vec in gensim; realizes text clustering of Uyghur sentences with the k-means method; trains a Uyghur-Chinese machine translation model with the Transformer architecture; and fine-tunes the machine translation model on each class of the clustered data to obtain k sub-translation models. Fusing these methods realizes vectorization of Uyghur and clustered training of the Uyghur-Chinese translation model. The system is characterized in that, because short sentences rarely carry rich semantic information, the method further subdivides the characteristics of the corpus by sentence length and by the k-means clustering method, thereby improving the translation quality of Uyghur-Chinese machine translation.

Description

Uyghur-Chinese machine translation system fusing a clustering algorithm
Technical Field
The invention belongs to the field of machine translation, and particularly relates to a Uyghur-Chinese machine translation system fusing a clustering algorithm.
Background
Machine Translation (MT) is automated translation that uses a computer to translate from one language to another, greatly reducing the communication barriers between speakers of different languages. In recent years, neural machine translation systems have advanced significantly, essentially replacing traditional statistical machine translation. However, both statistical and neural machine translation rely on large-scale bilingual parallel corpora. Although the Transformer model markedly improves translation quality for resource-rich languages, the shortage of language resources remains a persistent obstacle for machine translation of low-resource languages, and languages like Uyghur urgently need a system with better translation quality.
End-to-end Neural Machine Translation (NMT) systems have made considerable progress. Compared with traditional statistical machine translation, NMT trains a single neural network that maps one sequence to another, so the basic structure of a neural machine translation model is the encoder-decoder architecture; its block diagram is shown in FIG. 1. Given a source-language sentence $x = (x_1, x_2, \ldots, x_J)$ and a target-language sentence $y = (y_1, y_2, \ldots, y_K)$, the sentence probability is modeled directly by an end-to-end neural network:

$$P(y \mid x; \theta) = \prod_{k=1}^{K} p(y_k \mid y_{<k}, x; \theta) \qquad (1)$$

where $\theta$ denotes the parameters of the entire model and $y_{<k} = (y_1, y_2, \ldots, y_{k-1})$ denotes the $k-1$ words already translated. This encoder-decoder framework uses a recurrent neural network to compress the source-language sentence into a semantic vector, which is fed to the decoder recurrent neural network to generate the target-language words in turn. Machine translation models based on recurrent neural networks perform well on relatively short sentences, but translation quality degrades severely on long sentences.
The attention mechanism was introduced primarily to dynamically select distributed representations of the source-language words, allowing the model to attend only to the information relevant to the next target word. It further improves the translation quality of the encoder-decoder framework on long sentences, making neural machine translation comprehensively superior to traditional statistical machine translation; an encoder-decoder framework with an attention mechanism is shown in FIG. 2. The RNN's generation of the next word is expressed as:
$$p(y_k \mid y_{<k}, x; \theta) = \mathrm{softmax}\bigl(g(y_{k-1}, t_k, c_k)\bigr) \qquad (2)$$

where $g$ is a nonlinear function, $t_k$ is the hidden state of the decoder at step $k$, and $c_k$ is the context vector used to generate the $k$-th target word. The attention model can be expressed as

$$c_k = \sum_{j=1}^{J} \alpha_{jk} s_j \qquad (3)$$

$$\alpha_{jk} = \frac{\exp(e_{jk})}{\sum_{j'=1}^{J} \exp(e_{j'k})} \qquad (4)$$

$$e_{jk} = v^{\top} \tanh(U t_k + W s_j) \qquad (5)$$

where $s_j$ is the distributed representation of the $j$-th source word at the encoder, $\alpha_{jk}$ is the attention weight linking encoder position $j$ to decoder step $k$, and $v$, $U$, and $W$ are parameters. Given a training set $D$, the model is optimized by maximum-likelihood estimation:

$$\theta^{\ast} = \operatorname*{arg\,max}_{\theta} \sum_{(x, y) \in D} \log P(y \mid x; \theta) \qquad (6)$$
Subsequently, Vaswani et al. proposed the Transformer, a neural machine translation model based entirely on attention networks, which again greatly improved the translation quality of neural machine translation and is currently one of the mainstream translation models; the architecture of the model is shown in FIG. 5. An attention network can be viewed as mapping a query $Q$ onto key ($K$)-value ($V$) pairs and producing a weighted output, computed as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \qquad (7)$$

where $\sqrt{d_k}$ is an introduced temperature factor that alleviates instability in the training process; because the key dimension $d_k$ supplies the temperature and the dot product serves as the similarity function, this is called the scaled dot-product attention network.
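As a concrete illustration of equation (7), the following minimal NumPy sketch (our own illustration; the array shapes and function name are assumptions, not part of the patent) computes scaled dot-product attention:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V as in equation (7).

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity scaled by sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values

# Toy check: 2 queries, 3 key-value pairs, d_k = 4, d_v = 5
rng = np.random.default_rng(0)
out = scaled_dot_product_attention(rng.normal(size=(2, 4)),
                                   rng.normal(size=(3, 4)),
                                   rng.normal(size=(3, 5)))
print(out.shape)  # (2, 5)
```

Subtracting the row maximum before exponentiation is a standard numerical-stability trick and does not change the softmax output.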
Doc2vec, also called Paragraph Vector or sentence embeddings, was proposed by Mikolov et al. on the basis of the word2vec model. It is an unsupervised algorithm that obtains vector representations of sentences, paragraphs, or documents. The algorithm learns a vector to represent each document, and the model structure potentially overcomes the weaknesses of the bag-of-words model. Doc2vec is inspired by word2vec: just as the words predicted by word2vec carry word senses, Doc2vec builds the same structure at the document level, thereby overcoming the bag-of-words model's lack of semantics. Like word2vec, Doc2vec has two training modes: PV-DM (Distributed Memory Model of Paragraph Vectors), whose structure is shown in FIG. 3 and which is similar to the CBOW model in word2vec, and PV-DBOW (Distributed Bag of Words of Paragraph Vectors), shown in FIG. 4, which is similar to the skip-gram model in word2vec.
A clustering algorithm automatically divides unlabeled data into different classes. Clustering is an unsupervised learning method that must ensure data within the same class share similar features. K-Means is a partition-based clustering method and one of the ten classic data-mining algorithms. Simply put, K-Means divides data into K parts without any supervisory signal, where K is the number of classes in the clustering result, that is, the number of classes into which we wish to partition the data. On this basis, we designed a Uyghur-Chinese machine translation system fusing a clustering algorithm.
Disclosure of Invention
In order to solve the existing problems, the invention provides a Uyghur-Chinese machine translation system fusing a clustering algorithm.
The invention is realized by the following technical scheme:
a Uyghur machine translation system fusing clustering algorithm trains a Uyghur sentence vector model by using Doc2vec in genim; realizing text clustering of Uygur languages by using a k-means method; training a Weihan machine translation model by using a transformer structure; the method is characterized in that a fine tuning method is used for finely tuning a machine translation model of each type of clustered data to obtain k sub-translation models, vectorization of Uygur language is realized by combining the methods, the Uygur language translation models are clustered and trained, and because sentences with shorter lengths hardly have rich semantic information, when sentence vectorization is carried out, short sentences can influence the quality of a sentence vector model to further cause poor clustering effect, so sentences with lengths smaller than 10 are filtered before sentence vector training, and only longer sentences are reserved.
As a further optimization of the invention, in the Transformer-based Uyghur-Chinese machine translation model, the Uyghur end first tokenizes the Uyghur sentence and then applies Byte Pair Encoding (BPE); the Chinese end first segments Chinese sentences with the THULAC word-segmentation tool and then applies BPE segmentation.
Compared with the prior art, the invention has the following beneficial effects: the invention further subdivides the characteristics of the corpus by sentence length and by the k-means clustering method, thereby improving the translation quality of Uyghur-Chinese machine translation; it is a Uyghur-Chinese machine translation system based on sentence length and a clustering algorithm. First, a complete machine translation model is trained on the full training corpus as the baseline system; then sentences longer than 10 words are selected for clustering, followed by model fine-tuning. Doc2vec is used to train a sentence-vector model and obtain a vector for each sentence; K-means clustering yields K classes, and each class of data is then used to fine-tune the machine translation model separately, producing K translation models. The translation models use character-level and BPE segmentation. At test time, a model is selected according to sentence length and cluster membership, with short sentences translated directly by the baseline model, yielding the final translation result.
Drawings
FIG. 1 is a diagram of an encoder-decoder model of the present invention;
FIG. 2 is a diagram of the encoder-decoder + Attention model of the present invention;
FIG. 3 is a diagram of a PV-DM model of the present invention;
FIG. 4 is a diagram of a PV-DBOW model of the present invention;
FIG. 5 is a structural diagram of the Transformer-based Uyghur-Chinese machine translation model of the present invention;
FIG. 6 is a training flow diagram of the present invention;
fig. 7 is a testing flow diagram of the present invention.
In the figures: 1. the Transformer-based Uyghur-Chinese machine translation model.
Detailed Description
The invention is described in further detail below with reference to the following detailed description and accompanying drawings:
Referring to FIGS. 1-7, a Uyghur-Chinese machine translation system fusing a clustering algorithm trains a Uyghur sentence-vector model using Doc2vec in gensim; realizes text clustering of Uyghur sentences with the k-means method; trains a Uyghur-Chinese machine translation model (1) with the Transformer architecture; and fine-tunes the machine translation model on each class of the clustered data to obtain k sub-translation models. Fusing these methods realizes vectorization of Uyghur and clustered training of the Uyghur-Chinese translation model. Because short sentences rarely carry rich semantic information, they degrade the quality of the sentence-vector model during sentence vectorization and in turn worsen the clustering; therefore, sentences shorter than 10 words are filtered out before sentence-vector training, and only longer sentences are retained. In the Transformer-based Uyghur-Chinese machine translation model, the Uyghur end first tokenizes the Uyghur sentence and then applies Byte Pair Encoding (BPE); the Chinese end first segments Chinese sentences with the THULAC word-segmentation tool and then applies BPE segmentation.
In this embodiment, all the Uyghur-Chinese bilingual data are first preprocessed, and a Transformer-based Uyghur-Chinese machine translation model is trained as the baseline model; Uyghur sentences longer than 10 words are selected to train a sentence-vector model, followed by k-means clustering. Finally, the machine translation model is fine-tuned with each class of data to obtain k sub-translation models.
The data used comprise the 170,000 training sentence pairs of the CCMT2019 Uyghur-Chinese machine translation task and 910,000 parallel sentence pairs collected by our laboratory, about 1.08 million pairs in total. Among them, 625,000 sentence pairs have more than 10 words on the Uyghur side. The development and test sets both use the 1,000-sentence development set of the CCMT2019 Uyghur-Chinese machine translation task.
The invention trains the machine translation model with the Transformer implementation in the open-source OpenNMT toolkit, with the following parameter settings: 16 attention heads, 6 encoder layers and 6 decoder layers, word-vector dimension 768, and dropout 0.3; the optimizer is Adam with adam_beta1 = 0.9 and adam_beta2 = 0.998. A Uyghur-Chinese translation model trained on all the data serves as the baseline model.
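For reference, the stated hyperparameters could be passed to OpenNMT roughly as follows (a sketch assuming the OpenNMT-py 1.x command-line interface; the data paths are hypothetical and flag names should be verified against the installed version):

```python
import subprocess

# Hypothetical corpus prefix and model path; hyperparameters follow the patent.
subprocess.run([
    "onmt_train",
    "-data", "data/uy-zh",                  # preprocessed corpus prefix (assumed)
    "-save_model", "models/baseline",
    "-encoder_type", "transformer", "-decoder_type", "transformer",
    "-position_encoding",
    "-layers", "6",                         # 6 encoder and 6 decoder layers
    "-heads", "16",                         # 16-head multi-head attention
    "-word_vec_size", "768", "-rnn_size", "768",
    "-dropout", "0.3",
    "-optim", "adam", "-adam_beta1", "0.9", "-adam_beta2", "0.998",
], check=True)
```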
The invention trains the Uyghur sentence-vector model with the Doc2vec model in gensim, where Doc2vec uses the default PV-DM structure. When training sentence vectors, only sentences longer than 10 words are selected, in order to obtain more accurate semantic information.
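A minimal gensim sketch of this step (the file names, vector size, and other hyperparameters not stated in the patent are our own assumptions; only the PV-DM choice and the length-10 filter come from the text):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical corpus file: one tokenized Uyghur sentence per line.
with open("uyghur.tok.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

# Keep only sentences longer than 10 words, as in the patent.
docs = [TaggedDocument(words=toks, tags=[i])
        for i, toks in enumerate(sentences) if len(toks) > 10]

# dm=1 selects the default PV-DM structure; vector_size etc. are assumptions.
model = Doc2Vec(documents=docs, dm=1, vector_size=300, window=5,
                min_count=2, epochs=20, workers=4)
model.save("uyghur_doc2vec.model")

# Each training sentence's vector is then model.dv[i] (gensim >= 4.0).
```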
The invention uses the KMeans clustering method in scikit-learn to realize text clustering; only the 626,000 Uyghur sentences longer than 10 words are vectorized and clustered. Because k-means clustering is an unsupervised learning method, the number of sentences in each class cannot be controlled and the value of k must be chosen manually; if k is set poorly, some classes may end up with very little data and others with far too much. After clustering repeatedly with k from 2 to 20, we found that k = 5 gives a good clustering with balanced classes; the number of sentences in each class is shown in Table 1:
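The clustering step could look roughly like this in scikit-learn (a sketch; the vector file and variable names are hypothetical, while k = 5 follows the patent):

```python
import numpy as np
from sklearn.cluster import KMeans

# Matrix of Doc2vec sentence vectors for the long sentences, e.g. stacked
# from the model trained above (hypothetical file).
X = np.load("uyghur_sentence_vectors.npy")

kmeans = KMeans(n_clusters=5, random_state=0, n_init=10).fit(X)
labels = kmeans.labels_            # cluster id (0..4) per sentence
centers = kmeans.cluster_centers_  # used later to route test sentences

# Split the bilingual corpus into 5 sub-corpora by cluster id.
for c in range(5):
    idx = np.where(labels == c)[0]
    print(f"class {c}: {len(idx)} sentences")
```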
Table 1: Number of sentences per cluster
[Table content is provided as an image in the original publication.]
The baseline translation model is then adapted with a fine-tuning method: after the five classes of data are obtained by k-means clustering, BPE segmentation is applied to each class, and the baseline model is fine-tuned with each class separately, finally yielding five sub-translation models. The number of BPE merge operations is 24k on the Uyghur side and 30k on the Chinese side.
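As an illustration of the BPE step, a sketch using the subword-nmt package (the patent does not name its BPE implementation, so this is an assumption; file names are hypothetical, and the 24k/30k merge counts follow the text):

```python
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn 24k merges on the Uyghur side and 30k on the Chinese side
# (train.zh.seg is assumed to be the THULAC-segmented Chinese corpus).
with open("train.uy", encoding="utf-8") as fin, \
     open("bpe.uy.codes", "w", encoding="utf-8") as fout:
    learn_bpe(fin, fout, num_symbols=24000)
with open("train.zh.seg", encoding="utf-8") as fin, \
     open("bpe.zh.codes", "w", encoding="utf-8") as fout:
    learn_bpe(fin, fout, num_symbols=30000)

# Apply the learned codes to each clustered sub-corpus before fine-tuning.
with open("bpe.uy.codes", encoding="utf-8") as f:
    bpe_uy = BPE(f)
print(bpe_uy.process_line("example tokenized uyghur sentence"))
```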
When the invention translates a sentence with the model, the sentence length is computed first. If the sentence contains 10 or fewer words, the baseline model translates it directly; if it contains more than 10 words, the sentence is vectorized with the sentence-vector model, its distance to each cluster center is computed, and the model fine-tuned on the nearest class is selected for translation. The 1,000 test sentences are thus divided into six classes and translated by six translation models; because every sentence in our test set contains more than 10 words, the short-sentence class of the test set is empty. The number of test sentences per class is shown in Table 2:
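The test-time routing just described can be sketched as follows (continuing the hypothetical objects from the earlier sketches; the translation models are stand-in callables rather than the patent's actual interfaces):

```python
import numpy as np

def route_and_translate(tokens, doc2vec_model, centers, baseline_model, sub_models):
    """Select a translation model by sentence length and nearest cluster center.

    tokens:         tokenized Uyghur sentence (list of str)
    doc2vec_model:  trained gensim Doc2Vec model
    centers:        (k, dim) array of k-means cluster centers
    baseline_model, sub_models: callables that translate a token list
    """
    if len(tokens) <= 10:
        # Short sentences are translated directly by the baseline model.
        return baseline_model(tokens)
    vec = doc2vec_model.infer_vector(tokens)        # vectorize the sentence
    dists = np.linalg.norm(centers - vec, axis=1)   # distance to each center
    nearest = int(dists.argmin())
    return sub_models[nearest](tokens)              # model fine-tuned on nearest class
```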
Table 2: Number of test sentences per class
[Table content is provided as an image in the original publication.]
The BLEU score is used to evaluate the translation quality of the models; the baseline BLEU is obtained by translating the test set directly with the baseline system. For the models fine-tuned after clustering, the BLEU score of each model is computed separately, and the overall BLEU over all test subsets is shown in Table 3:
Table 3: Translation quality of the models
[Table content is provided as an image in the original publication.]
To examine more precisely the translation quality of the model fine-tuned for each class, Table 4 shows, for each class of the test set, the translation quality of the baseline system and of the corresponding fine-tuned model:
Table 4: Per-class translation quality at k = 5
[Table content is provided as an image in the original publication.]
As can be seen from Table 3, the clustered Uyghur-Chinese machine translation models improve BLEU by 1.06 points over the baseline model trained directly on all the training data, indicating that clustering partitions the training set to some extent; at test time, selecting the model whose class most resembles the test data improves translation quality.
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above, which are presented in the specification only to illustrate its principles; various changes and modifications may be made without departing from the spirit and scope of the invention, and such changes and modifications fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (2)

1. A Uyghur-Chinese machine translation system fusing a clustering algorithm, which trains a Uyghur sentence-vector model using Doc2vec in gensim; realizes text clustering of Uyghur sentences with the k-means method; trains a Uyghur-Chinese machine translation model with the Transformer architecture; and fine-tunes the machine translation model on each class of the clustered data to obtain k sub-translation models, fusing these methods to realize vectorization of Uyghur and clustered training of the Uyghur-Chinese translation model, characterized in that: because short sentences rarely carry rich semantic information, they degrade the quality of the sentence-vector model during sentence vectorization and in turn worsen the clustering; therefore, sentences shorter than 10 words are filtered out before sentence-vector training, and only longer sentences are retained.
2. The Uyghur-Chinese machine translation system fusing a clustering algorithm of claim 1, characterized in that: in the Transformer-based Uyghur-Chinese machine translation model, the Uyghur end first tokenizes the Uyghur sentence and then applies Byte Pair Encoding (BPE); the Chinese end first segments Chinese sentences with the THULAC word-segmentation tool and then applies BPE segmentation.
CN202010140937.4A 2020-03-03 2020-03-03 Uyghur-Chinese machine translation system fusing a clustering algorithm Pending CN111368563A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010140937.4A CN111368563A (en) 2020-03-03 2020-03-03 Uyghur-Chinese machine translation system fusing a clustering algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010140937.4A CN111368563A (en) 2020-03-03 2020-03-03 Uyghur-Chinese machine translation system fusing a clustering algorithm

Publications (1)

Publication Number Publication Date
CN111368563A true CN111368563A (en) 2020-07-03

Family

ID=71208388

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010140937.4A Pending CN111368563A (en) Uyghur-Chinese machine translation system fusing a clustering algorithm

Country Status (1)

Country Link
CN (1) CN111368563A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507734A (en) * 2020-11-19 2021-03-16 南京大学 Roman Uygur language-based neural machine translation system
CN112711943A (en) * 2020-12-17 2021-04-27 厦门市美亚柏科信息股份有限公司 Uygur language identification method, device and storage medium
CN113642535A (en) * 2021-10-13 2021-11-12 聊城高新生物技术有限公司 Biological branch detection method and device and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079028A (en) * 2007-05-29 2007-11-28 中国科学院计算技术研究所 On-line translation model selection method of statistic machine translation
CN103049436A (en) * 2011-10-12 2013-04-17 北京百度网讯科技有限公司 Method and device for obtaining corpus, method and system for generating translation model and method and system for mechanical translation
CN110674646A (en) * 2019-09-06 2020-01-10 内蒙古工业大学 Mongolian Chinese machine translation system based on byte pair encoding technology
CN110705320A (en) * 2019-10-08 2020-01-17 中国船舶工业综合技术经济研究院 State-defense military-industry-field machine translation method and system for subdivision field

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079028A (en) * 2007-05-29 2007-11-28 中国科学院计算技术研究所 On-line translation model selection method of statistic machine translation
CN103049436A (en) * 2011-10-12 2013-04-17 北京百度网讯科技有限公司 Method and device for obtaining corpus, method and system for generating translation model and method and system for mechanical translation
CN110674646A (en) * 2019-09-06 2020-01-10 内蒙古工业大学 Mongolian Chinese machine translation system based on byte pair encoding technology
CN110705320A (en) * 2019-10-08 2020-01-17 中国船舶工业综合技术经济研究院 State-defense military-industry-field machine translation method and system for subdivision field

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507734A (en) * 2020-11-19 2021-03-16 南京大学 Roman Uygur language-based neural machine translation system
CN112507734B (en) * 2020-11-19 2024-03-19 南京大学 Neural machine translation system based on romanized Uygur language
CN112711943A (en) * 2020-12-17 2021-04-27 厦门市美亚柏科信息股份有限公司 Uygur language identification method, device and storage medium
CN112711943B (en) * 2020-12-17 2023-11-24 厦门市美亚柏科信息股份有限公司 Uygur language identification method, device and storage medium
CN113642535A (en) * 2021-10-13 2021-11-12 聊城高新生物技术有限公司 Biological branch detection method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN110348016B (en) Text abstract generation method based on sentence correlation attention mechanism
CN110489555B (en) Language model pre-training method combined with similar word information
CN108984683B (en) Method, system, equipment and storage medium for extracting structured data
CN109508462B (en) Neural network Mongolian Chinese machine translation method based on encoder-decoder
CN110019839B (en) Medical knowledge graph construction method and system based on neural network and remote supervision
CN109325112B (en) A kind of across language sentiment analysis method and apparatus based on emoji
JP7253848B2 (en) Fine Grained Emotion Analysis Method for Supporting Interlanguage Transition
CN110929030A (en) Text abstract and emotion classification combined training method
CN112487143A (en) Public opinion big data analysis-based multi-label text classification method
CN111368563A (en) Uyghur-Chinese machine translation system fusing a clustering algorithm
CN109472026A (en) Accurate emotion information extracting methods a kind of while for multiple name entities
CN110119443B (en) Emotion analysis method for recommendation service
CN114757182A (en) BERT short text sentiment analysis method for improving training mode
CN111553159B (en) Question generation method and system
CN114880461A (en) Chinese news text summarization method combining contrast learning and pre-training technology
CN116011456B (en) Chinese building specification text entity identification method and system based on prompt learning
CN115859980A (en) Semi-supervised named entity identification method, system and electronic equipment
CN113806547A (en) Deep learning multi-label text classification method based on graph model
CN116756303A (en) Automatic generation method and system for multi-topic text abstract
Cao Generating natural language descriptions from tables
CN115114940A (en) Machine translation style migration method and system based on curriculum pre-training
CN114625879A (en) Short text clustering method based on self-adaptive variational encoder
CN114328899A (en) Text summary generation method, device, equipment and storage medium
CN113407711A (en) Gibbs limited text abstract generation method by using pre-training model
CN111382583A (en) Chinese-Uygur name translation system with mixed multiple strategies

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200703