CN112084794A - Tibetan-Chinese translation method and device - Google Patents

Tibetan-Chinese translation method and device Download PDF

Info

Publication number
CN112084794A
Authority
CN
China
Prior art keywords
vector
tibetan
target language
sequence
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010987775.8A
Other languages
Chinese (zh)
Inventor
尼玛扎西
于永斌
头旦才让
仁青东珠
王昊
邓权芯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tibet University
Original Assignee
Tibet University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tibet University filed Critical Tibet University
Priority to CN202010987775.8A priority Critical patent/CN112084794A/en
Publication of CN112084794A publication Critical patent/CN112084794A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/40 - Processing or translation of natural language
    • G06F 40/42 - Data-driven translation
    • G06F 40/49 - Data-driven translation using very large corpora, e.g. the web
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/237 - Lexical tools
    • G06F 40/242 - Dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/40 - Processing or translation of natural language
    • G06F 40/42 - Data-driven translation
    • G06F 40/44 - Statistical methods, e.g. probability models
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/40 - Processing or translation of natural language
    • G06F 40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Abstract

The invention relates to a Tibetan-Chinese translation method and device. A Tibetan-Chinese bilingual parallel original corpus is constructed and preprocessed to obtain a trainable Tibetan-Chinese bilingual parallel target corpus, from which a source language sequence and a target language sequence are obtained. The source and target language sequences are expanded into a source language vector and a target language vector; the source language vector is input to an encoder module for processing to obtain a processing result, and the target language vector together with the processing result is input to a decoder module for training to obtain an output vector. The output vector is mapped back to the target language dictionary, the probability of each word in the target language sequence is calculated and output in vector form to obtain a training model, and inference is performed on the training model with a cluster search (beam search) algorithm. Compared with the traditional LSTM network, the Tibetan-Chinese translation method provided by the invention has better parallelism and higher computational efficiency, and its application and popularization in Tibetan-Chinese translation are explored.

Description

Tibetan-Chinese translation method and device
Technical Field
The invention relates to a Tibetan-Chinese translation method and a Tibetan-Chinese translation device.
Background
Machine translation technology is mainly implemented in two ways: rule-based translation and statistics-based translation. Within statistics-based translation, with the development of deep neural networks in recent years, neural machine translation based on deep learning has become the mainstream translation model. In neural machine translation, the encoder-decoder model is the main translation framework: the encoder neural network maps source language features into vectors, which the decoder neural network decodes into the target language. Before 2017, the neural networks mainly used in the encoder-decoder were the long short-term memory network (LSTM) and the gated recurrent unit (GRU). The attention mechanism redistributes the weights over word vectors by weighted summation so that important content is focused on; it significantly improves machine translation quality and is an indispensable component of the encoder-decoder framework.
The LSTM can extract features of the source and target languages, alleviates the gradient explosion and gradient vanishing problems of the plain RNN, and can capture dependency information over longer sequences than the plain RNN. The GRU is an improvement on the LSTM with a simpler structure, saving about one third of the parameters compared with the LSTM and significantly reducing training time. The attention mechanism forces the encoder-decoder to learn the correspondence between the source and target languages, alleviates the problems caused by the long distance between a source word being translated and its target word in the encoder-decoder framework, and improves translation quality.
Tibetan belongs to the Sino-Tibetan language family and is the main language of the Tibetan people, spoken mainly in the Tibet Autonomous Region of China and in the Tibetan autonomous prefectures of Qinghai, Sichuan, Yunnan and Gansu provinces. The Tibetan script is the writing system of the Tibetan language and is used in common across all Tibetan dialects. Tibetan has a long history and completely records a large body of Buddhist teachings, philosophy and literary works; Tibetan-language literature is vast, second in quantity only to Chinese-language literature. Given the large number of Tibetan speakers, researching and deploying a machine translation system between Tibetan and Chinese is of great significance.
Neural machine translation has simpler components than traditional statistical translation models, but it requires a large amount of training data. At present, about 70% of the world's bilingual corpora involve English, while corpora for minority languages such as Tibetan are scarce, which makes training Tibetan-Chinese translation models difficult. Moreover, the sequential dependency of LSTM and GRU networks makes parallel computation very difficult. A model whose computation cannot be parallelized is hard to accelerate on a GPU and hard to scale to large corpora, which poses a serious obstacle to the large-scale application and deployment of neural machine translation models.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a Tibetan-Chinese translation method and device.
In order to solve the problems, the invention adopts the following technical scheme:
A Tibetan-Chinese translation method, comprising:
constructing a Tibetan-Chinese bilingual parallel original corpus;
preprocessing the Tibetan-Chinese bilingual parallel original corpus to obtain a trainable Tibetan-Chinese bilingual parallel target corpus;
acquiring a source language sequence and a target language sequence according to the Tibetan-Chinese bilingual parallel target corpus;
performing vector expansion on the source language sequence and the target language sequence to obtain a source language vector and a target language vector;
inputting the source language vector into an encoder module for processing to obtain a processing result, and inputting the target language vector and the processing result into a decoder module for training to obtain an output vector;
mapping the output vector back to a target language dictionary;
calculating the probability value of each word in the target language sequence, and outputting the probability value in a vector form to obtain a training model;
and deducing the training model by using a cluster search algorithm to obtain a translation model.
Preferably, preprocessing the Tibetan-Chinese bilingual parallel original corpus to obtain a trainable Tibetan-Chinese bilingual parallel target corpus includes:
segmenting the sentence-level original corpus pair in the Tibetan-Chinese parallel original corpus according to a preset segmentation algorithm to obtain word-level corpus pairs;
and processing the original corpus pairs of the word level in the Tibetan-Chinese bilingual parallel original corpus and the obtained corpus pairs of the word level according to a preset BPE byte pair coding algorithm.
Preferably, before segmenting the sentence-level original corpus in the tibetan bilingual parallel original corpus according to a preset segmentation algorithm to obtain word-level corpus pairs, the length filtering, the length truncation and the mode filtering are sequentially performed on the original corpus pairs in the tibetan bilingual parallel original corpus, wherein,
the length filtering process comprises the following steps: filtering the original corpus with the length smaller than a preset short threshold value in the Tibetan-Chinese bilingual parallel original corpus;
the length truncation processing procedure comprises the following steps: truncating the original corpus pairs with the length larger than a preset long threshold value in the Tibetan-Chinese bilingual parallel original corpus, dividing the original corpus pairs into at least two corpus pairs, and enabling the length of each divided corpus pair to be smaller than or equal to the preset long threshold value;
the mode filtering processing procedure comprises the following steps: and filtering the original corpus pairs meeting preset filtering rules in the Tibetan-Chinese bilingual parallel original corpus.
Preferably, the vector expansion of the source language sequence and the target language sequence to obtain a source language vector and a target language vector includes:
expanding the source language sequence and the target language sequence from a real number sequence into a D-dimensional real number vector;
and embedding the position information into the D-dimensional real number vector to obtain the source language vector and the target language vector.
Preferably, the inputting the source language vector into an encoder module for processing to obtain a processing result includes:
dividing the source language vector into h heads, and applying a self-attention mechanism to each head to obtain a processed source language sequence vector;
processing the source language sequence vector by using a feedforward layer to generate a new expression vector corresponding to the source language sequence vector;
the encoder module calculates the new representation vector corresponding to the source language sequence vector at each layer of the encoder module using residual concatenation and LN regularization using a residual concatenation structure;
the processing steps are repeated until the last layer of the encoder module is reached, and K, V matrix of the self-attention mechanism is obtained, wherein the K, V matrix is the processing result.
Preferably, the inputting the target language vector and the processing result into a decoder module for training to obtain an output vector includes:
processing the target language vector through an attention mechanism with a mask to obtain a Q matrix;
performing multi-head attention processing on the K, V matrix and the Q matrix to obtain a target language sequence vector;
processing the target language sequence vector by using a feedforward layer to generate a new expression vector corresponding to the target language sequence vector;
the decoder module uses a residual connection structure to calculate the new representation vector corresponding to the target language sequence vector at each layer of the decoder module by using residual connection and LN regularization;
and repeating the processing steps until the last layer of the decoder module is reached to obtain the output vector.
Preferably, the deducing the training model by using the bundle searching algorithm to obtain a translation model includes:
selecting a maximum search branch number k;
putting the target language sequence into the training model, selecting the first k values each time, and deducing;
and comparing the prediction results of the k branches, selecting the maximum value, and circulating until the whole sequence is processed to obtain the translation model.
Preferably, the calculating a probability value of each word in the target language sequence, and outputting the probability value in a vector form to obtain a training model includes:
and calculating the probability value of each word in the target language sequence by using a Softmax function and taking the cross entropy as a loss function, and outputting the probability value in a vector form to obtain the training model.
A Tibetan-Chinese translation device, comprising:
the original corpus establishing module is used for establishing a Tibetan-Chinese bilingual parallel original corpus;
the target language database acquisition module is used for preprocessing the Tibetan-Chinese bilingual parallel original corpus to obtain a trainable Tibetan-Chinese bilingual parallel target corpus;
the language sequence acquisition module is used for acquiring a source language sequence and a target language sequence according to the Tibetan-Chinese bilingual parallel target corpus;
the language vector acquisition module is used for performing vector expansion on the source language sequence and the target language sequence to obtain a source language vector and a target language vector;
the language vector processing module is used for inputting the source language vector into the encoder module for processing to obtain a processing result, and inputting the target language vector and the processing result into the decoder module for training to obtain an output vector;
a mapping module for mapping the output vector back to a target language dictionary;
the training model acquisition module is used for calculating the probability value of each word in the target language sequence and outputting the probability value in a vector form to obtain a training model;
and the inference module is used for inferring the training model by using a cluster search algorithm to obtain a translation model.
The beneficial effects of the invention are as follows: after the Tibetan-Chinese bilingual parallel original corpus is obtained, it is preprocessed into a trainable Tibetan-Chinese bilingual parallel target corpus, from which a source language sequence and a target language sequence are obtained, and a training model is obtained from these sequences. Specifically, the source and target language sequences are first expanded into a source language vector and a target language vector; the source language vector is processed by the encoder module to obtain a processing result; the target language vector and the processing result output by the encoder module are input into the decoder module for training to obtain an output vector; the output vector is mapped back to the target language dictionary; finally, the probability of each word in the target language sequence is calculated and output in vector form to obtain the training model, and the training model is inferred with a cluster search (beam search) algorithm to obtain the translation model. Through this training process a reliable Tibetan-Chinese translation model can be obtained, and translation between Tibetan and Chinese can be carried out accurately and reliably according to it. Therefore, compared with the traditional LSTM network, the Tibetan-Chinese translation method has better parallelism and higher computational efficiency, and its application and popularization in Tibetan-Chinese translation are explored.
Drawings
In order to more clearly illustrate the technical solution of the embodiment of the present invention, the drawings needed to be used in the embodiment will be briefly described as follows:
fig. 1 is a schematic overall flow chart of a tibetan chinese translation method according to an embodiment of the present application;
FIG. 2 is a pre-processing flow diagram;
FIG. 3 is a schematic diagram of a self-attention mechanism neural machine translation network;
FIG. 4 is a schematic of an attention algorithm;
FIG. 5 is a diagram of operation effect during human-computer interaction;
fig. 6 is a schematic overall structure diagram of a tibetan chinese translation apparatus according to a second embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to" determining "or" in response to detecting ". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
The Tibetan-Chinese translation method provided by the embodiment of the application can be applied to terminal devices such as mobile phones, tablet computers, wearable devices, vehicle-mounted devices, Augmented Reality (AR)/Virtual Reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPCs), netbooks, Personal Digital Assistants (PDAs), and the like, and the embodiment of the application does not limit the specific types of the terminal devices at all. That is, the carrier of the client corresponding to the tibetan chinese translation method provided in the embodiment of the present application may be any one of the above terminal devices.
In order to explain the technical means described in the present application, the following description will be given by way of specific embodiments.
Fig. 1 is a flowchart of an implementation process of a tibetan chinese translation method provided in an embodiment of the present application, and only a part related to the embodiment of the present application is shown for convenience of description.
The Tibetan-Han translation method comprises the following steps:
step S101: constructing a Tibetan-Chinese bilingual parallel original corpus:
and acquiring the original corpus pairs of the Tibetan-Chinese bilingual, and constructing a parallel original corpus of the Tibetan-Chinese bilingual according to the original corpus pairs of the Tibetan-Chinese bilingual.
Step S102: preprocessing the Tibetan-Chinese bilingual parallel original corpus to obtain a trainable Tibetan-Chinese bilingual parallel target corpus:
The Tibetan-Chinese bilingual parallel original corpus is preprocessed to obtain a trainable Tibetan-Chinese bilingual parallel target corpus. The specific preprocessing procedure can be set according to actual needs; a concrete procedure is given here, in which length filtering, length truncation, pattern filtering, word segmentation and BPE processing are performed in sequence. The five preprocessing steps are as follows:
the length filtering process is to filter the too short original corpus, specifically: and filtering the original corpus pairs with the length smaller than a preset short threshold s in the Tibetan-Chinese bilingual parallel original corpus, and for a certain original corpus pair, if the length of the original corpus pair is smaller than the preset short threshold s, removing the original corpus pair from the original corpus.
The length truncation process is used for truncating the overlong original corpus pair to separate the corpus pair into a plurality of corpus pairs, specifically: for any original corpus pair in the Tibetan-Chinese bilingual parallel original corpus, if the length of the original corpus pair is greater than a preset length threshold max, the original corpus pair is cut off and divided into at least two corpus pairs, and the length of each divided corpus pair is less than or equal to the preset length threshold max.
The pattern filtering process filters out the original corpus pairs in the Tibetan-Chinese bilingual parallel original corpus that match a preset filtering rule; the preset filtering rule is defined by a rule set R (as shown in Fig. 2), and text matching the rule set is filtered out. A Chinese pattern set R1 and a Tibetan pattern set R2 are given; in this case R1 = R2. If a pattern in an original corpus pair matches R1 or R2, the pair is filtered out of the original corpus. The pattern sets mainly describe features that do not conform to the language being processed, such as Arabic numerals in Tibetan text or special characters outside the character set. A minimal sketch of these three cleaning passes is given below.
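The following is a minimal Python sketch of the length filtering, length truncation and pattern filtering passes, assuming word- or character-count lengths. The threshold values, the example regular expressions standing in for the rule sets R1/R2, and the function names are illustrative assumptions, not values taken from this disclosure; in particular, the fixed-window split used for truncation is only a placeholder, since the disclosure does not specify how an over-long pair is divided.

import re

SHORT_THRESHOLD = 3    # assumed value of the preset short threshold s
LONG_THRESHOLD = 100   # assumed value of the preset long threshold max
R1 = [re.compile(r"[0-9]")]   # assumed Chinese-side pattern set R1
R2 = [re.compile(r"[0-9]")]   # assumed Tibetan-side pattern set R2 (here R1 = R2)

def clean_corpus(pairs):
    """pairs: list of (tibetan_sentence, chinese_sentence) tuples."""
    cleaned = []
    for bo, zh in pairs:
        # Length filtering: drop pairs shorter than the short threshold.
        if len(bo) < SHORT_THRESHOLD or len(zh) < SHORT_THRESHOLD:
            continue
        # Pattern filtering: drop pairs matching any rule in R1 or R2.
        if any(p.search(zh) for p in R1) or any(p.search(bo) for p in R2):
            continue
        # Length truncation: split over-long pairs into chunks <= LONG_THRESHOLD.
        if len(bo) > LONG_THRESHOLD or len(zh) > LONG_THRESHOLD:
            for i in range(0, max(len(bo), len(zh)), LONG_THRESHOLD):
                cleaned.append((bo[i:i + LONG_THRESHOLD], zh[i:i + LONG_THRESHOLD]))
        else:
            cleaned.append((bo, zh))
    return cleaned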
The word segmentation processing process comprises the following steps: and segmenting the sentence-level original corpus pair in the Tibetan-Chinese bilingual parallel original corpus according to a preset segmentation algorithm to obtain word-level corpus pairs. In this embodiment, the word segmentation algorithm is a word segmentation algorithm provided by Hanlp. And for the sentence-level original corpus pair in the Tibetan-Chinese bilingual parallel original corpus, performing word segmentation on the Chinese by using a word segmentation algorithm provided by Hanlp, processing the Tibetan by using a Tibetan word segmentation device, and outputting the word-level corpus pair.
The BPE treatment process comprises the following steps: and processing the original corpus pairs of the word level in the Tibetan-Chinese bilingual parallel original corpus and the corpus pairs of the word level obtained by word segmentation according to a preset BPE byte pair coding algorithm, and outputting the corpus pairs with finer granularity. Given the size Vocab, this value is used to process the corpus pairs with the BPE byte pair encoding algorithm and generate a joint dictionary.
The BPE byte pair encoding algorithm receives the dictionary generated from the original corpus, splits each word into a character sequence and counts word frequencies, merges character sequences into new subword units according to frequency, and repeats this process until a preset iteration limit is reached or no further pairs can be merged. The main significance of the BPE algorithm is to convert the basic unit of the corpus from the word level to a mixture between the character level and the word level. Because the corpus is often insufficient and model generalization is limited, taking the word level as the minimum learning granularity alone cannot fully alleviate the data scarcity problem. A BPE-processed corpus strikes a balance between word-level granularity, which requires more data, and character-level granularity, which limits the precision of the model's generalization ability, and is therefore well suited to relatively scarce data. A sketch of the merge-learning procedure is shown below.
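A minimal sketch of the merge-learning loop just described (split words into characters, count adjacent-pair frequencies, merge the most frequent pair, repeat). It follows the commonly published BPE formulation; the function names and the end-of-word marker </w> are assumptions, and the actual subword tool and joint-dictionary generation used by the disclosure are not reproduced here.

import re
import collections

def get_pair_stats(vocab):
    """Count the frequency of each adjacent symbol pair in the split vocabulary."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the chosen symbol pair into a single symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    """word_freqs: dict mapping each word to its corpus frequency."""
    # Split every word into a character sequence with an end-of-word marker.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):            # stop at the preset iteration limit ...
        pairs = get_pair_stats(vocab)
        if not pairs:                      # ... or when nothing can be merged
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges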
The effect of the preprocessing is shown in Table 1.
Table 1 (presented as an image in the original publication; its contents are not reproduced here)
Step S103: according to the Tibetan-Chinese bilingual parallel target corpus, acquiring a source language sequence and a target language sequence:
and acquiring a source language sequence and a target language sequence according to the Tibetan-Chinese bilingual parallel target corpus. In this embodiment, words in the source language sequence and the target language sequence may be replaced with values in the dictionary, and specific positions in the source language sequence and the target language sequence, such as the beginning, the end, and the blank, may be replaced with values in the dictionary.
Step S104: performing vector expansion on the source language sequence and the target language sequence to obtain a source language vector and a target language vector:
performing vector expansion on a source language sequence and a target language sequence to obtain a source language vector and a target language vector, where this embodiment provides a processing procedure:
and expanding the source language sequence and the target language sequence from a real number sequence into a D-dimensional real number vector. For two real sequences, a source language sequence and a target language sequence, let j be the position of each word in the dictionary, there is a function fwe(j) Wherein
Figure BDA0002689831040000101
This process, called the word embedding layer, maps the dictionary number to a vector in a D-dimensional space, where each real number component in the vector is a parameter to be learned.
And embedding the position information into the D-dimensional real number vector to obtain a source language vector and a target language vector. For the real number vector processed above, there is a function fpe(pos, i), where pos is the position of each word in the sequence,
Figure BDA0002689831040000102
specifically, for the above process, given the sequence position pos and the D-dimensional vector dimension i, there are:
Figure BDA0002689831040000104
Figure BDA0002689831040000103
in summary, the above process is written as f (j, pos) ═ fwe(j)+fpe(pos)。
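A minimal NumPy sketch of f(j, pos) = f_we(j) + f_pe(pos) under the sinusoidal form given above; the embedding-table initialization and the assumption that D is even are illustrative choices, not part of the disclosure.

import numpy as np

def word_embedding(token_ids, emb_table):
    """f_we: look up each dictionary index j in a learned |V| x D table."""
    return emb_table[token_ids]                      # shape (seq_len, D)

def positional_encoding(seq_len, D):
    """f_pe: sinusoidal position information (D assumed even)."""
    pe = np.zeros((seq_len, D))
    pos = np.arange(seq_len)[:, None]                # word position in the sequence
    i = np.arange(0, D, 2)[None, :]                  # even dimension indices 2i
    angle = pos / np.power(10000.0, i / D)
    pe[:, 0::2] = np.sin(angle)                      # f_pe(pos, 2i)
    pe[:, 1::2] = np.cos(angle)                      # f_pe(pos, 2i+1)
    return pe

# Usage: embed the token ids, then add the position information.
# emb_table = 0.01 * np.random.randn(vocab_size, D)  # learned during training
# x = word_embedding(ids, emb_table) + positional_encoding(len(ids), D)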
Step S105: inputting the source language vector into an encoder module for processing to obtain a processing result, inputting the target language vector and the processing result into a decoder module for training to obtain an output vector:
as shown in fig. 3, inputting the source language vector into the encoder module for processing to obtain a processing result, specifically including the following processes:
(1) The source language vector is divided into h heads (h ≥ 2), and a self-attention mechanism is applied to each head to obtain the processed source language sequence vector.
For a source language vector x, the q, k and v vectors are obtained by three linear mappings with unshared parameters:
q = Linear_q(x)
k = Linear_k(x)
v = Linear_v(x)
For a batch of data, all q, k and v vectors are concatenated to obtain the matrices Q, K and V. For the h heads into which they are divided, there is
MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) W^O
Specifically, for each head:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
where
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
For each source language vector x, the attention mechanism above outputs attention weights through the softmax function, and the attention value of each part is obtained by weighted summation. The vector is divided into h heads in order to improve the generalization performance of the model.
(2) The source language sequence vector is processed using a feed-forward layer to generate a new representation vector corresponding to the source language sequence vector. The feed-forward layer is FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, where W_1, W_2, b_1 and b_2 are the parameters to be learned by the feed-forward layer.
(3) The encoder module uses a residual connection structure: at each layer of the encoder module, the new representation vector corresponding to the source language sequence vector is computed using residual connections and LN (layer normalization).
Denoting the processing result of steps (1) and (2) above by Sublayer(x), we have f(x) = LN(x + Sublayer(x)). Specifically, the LN formula is LN(x) = g ⊙ N(x) + b, where ⊙ denotes element-wise multiplication, N(x) denotes the normalized x, and g and b are parameters to be learned.
(4) The above processing steps are repeated until the last layer of the encoder module is reached, and the K and V matrices of the self-attention mechanism are obtained; these K and V matrices are the processing result. A sketch of one encoder layer combining steps (1)-(3) is given below.
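A minimal NumPy sketch of one encoder layer, combining the multi-head self-attention, the feed-forward layer and the residual-plus-LN wrapping from steps (1)-(3). The parameter names, shapes and dictionary layout are assumptions for illustration, not taken verbatim from the disclosure.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(x_q, x_kv, W_q, W_k, W_v, W_o, h):
    """Project Q, K, V, split into h heads, attend per head, concatenate, project."""
    Q, K, V = x_q @ W_q, x_kv @ W_k, x_kv @ W_v          # linear, unshared parameters
    def split(M):                                        # (L, D) -> (h, L, D/h)
        L, D = M.shape
        return M.reshape(L, h, D // h).transpose(1, 0, 2)
    heads = attention(split(Q), split(K), split(V))      # one attention per head
    concat = heads.transpose(1, 0, 2).reshape(x_q.shape[0], -1)
    return concat @ W_o                                  # Concat(head_1..head_h) W^O

def layer_norm(x, g, b, eps=1e-6):
    """LN(x) = g * N(x) + b, with N(x) the per-position normalization of x."""
    mean = x.mean(-1, keepdims=True)
    std = x.std(-1, keepdims=True)
    return g * (x - mean) / (std + eps) + b

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, p, h):
    """p: dict of this layer's parameters; each sub-layer is wrapped as LN(x + Sublayer(x))."""
    a = multi_head_attention(x, x, p["W_q"], p["W_k"], p["W_v"], p["W_o"], h)
    x = layer_norm(x + a, p["g1"], p["b1"])
    f = feed_forward(x, p["W1"], p["c1"], p["W2"], p["c2"])
    return layer_norm(x + f, p["g2"], p["b2"])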
The target language vector and the processing result are then input into the decoder module for training to obtain an output vector. During training, the processing differs from that of the source language vector in that a multi-head attention mechanism with a mask is used. The significance of the mask is that the decoder module must infer the next position from the information already available, so for each sequence position pos it must be ensured that the attention results for positions greater than pos are 0.
As shown in fig. 3, the target language vector and the processing result are input to the decoder module for training to obtain an output vector, and the specific process includes:
(1) and processing the target language vector through an attention mechanism with a mask to obtain a Q matrix.
(2) And taking the K, V matrix output by the last layer of the encoder module as the input of the multi-head attention module, forming the input of the multi-head attention module together with the Q matrix, and carrying out multi-head attention processing on the K, V matrix and the Q matrix to obtain a target language sequence vector.
(3) The target language sequence vector is processed using a feedforward layer to generate a new representation vector corresponding to the target language sequence vector.
(4) The decoder module uses a residual concatenation structure to compute a new representation vector corresponding to the target language sequence vector at each layer of the decoder module using residual concatenation and LN regularization.
(5) And repeating the processing steps until the last layer of the decoder module is reached to obtain an output vector.
In particular, the masked attention mechanism uses a matrix M_mask of the same size as the V matrix, in which the elements above the diagonal (inclusive) are 0 and the remaining elements are 1. It is applied during the dot product of the Q and K matrices, so that the positions whose mask element is 0 are masked out, as in the scaled dot-product stage of Fig. 4 (the mask flow). A sketch of a common way to realize this masking is given below.
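The sketch below shows a common way of realizing the stated constraint in code: instead of multiplying a 0/1 matrix into the dot product as described above, the scores of positions greater than pos are set to negative infinity before the softmax, which likewise drives their attention weights to 0. This substitution, and the function name, are illustrative assumptions.

import numpy as np

def masked_attention_scores(Q, K):
    """Decoder self-attention scores with future positions masked out."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # scaled dot product
    L = scores.shape[0]
    future = np.triu(np.ones((L, L), dtype=bool), k=1)    # positions > pos
    scores[future] = -np.inf                              # softmax weight becomes 0
    return scores                                         # apply softmax, then multiply by V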
Step S106: mapping the output vector back to a target language dictionary:
the output vector obtained in step S105 is mapped back to the target language dictionary using a linear layer, the input size of which is the vector size of the source language vector and the target language vector, and the output size is the size of the target language dictionary.
Step S107: calculating the probability value of each word in the target language sequence, and outputting the probability value in a vector form to obtain a training model:
in this embodiment, a Softmax function is used, the cross entropy is used as a loss function, a probability value of each word in the target language sequence is calculated, and a probability value is output in a vector form to obtain a training model.
Specifically, the cross-entropy loss function is L = -Σ_t log p(y_t | y_<t, x), i.e., the negative log-likelihood of each reference target word under the predicted distribution, summed over the target sequence.
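A minimal sketch of the Softmax-plus-cross-entropy computation over the decoder's output scores; the variable names and the per-position averaging are illustrative assumptions.

import numpy as np

def cross_entropy_loss(logits, target_ids):
    """logits: (T, |V_target|) scores from the final linear layer; target_ids: length-T reference."""
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)                 # Softmax over the target dictionary
    picked = probs[np.arange(len(target_ids)), target_ids]
    return -np.mean(np.log(picked))                       # average negative log-likelihood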
To summarize the above: after the source language sequence and the target language sequence are obtained, they are encoded into vector sequences and the position information is embedded. During training, the encoder module computes the K and V matrices from the source sequence, the decoder module computes the Q matrix from the target language sequence, and the attention mechanism is computed from the Q, K and V matrices together. After processing by the decoder module, the vectors are mapped back to the target language dictionary through the linear layer, and the cross entropy is used as the loss function to calculate the probability of each word.
Step S108: deducing the training model by using a cluster search algorithm to obtain a translation model:
and deducing the trained model to generate a translated text, namely a translation model. Deducing the training model by using a cluster search algorithm to obtain a translation model, wherein the method comprises the following steps:
(1) selecting a maximum search branch number k;
(2) putting the target language sequence into a training model, selecting the first k values each time, and deducing;
(3) comparing the prediction results of the k branches, and selecting the maximum value, namely selecting the branch with the best effect;
(4) and circulating until the full sequence processing is completed, namely until the model inference is finished, and obtaining the translation model.
The cluster (beam) search algorithm searches the solution space and effectively avoids getting trapped in a locally optimal solution. Empirically, the value of k ranges from 5 to 10. A sketch of this inference procedure is given below.
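A minimal sketch of the search loop described above: keep at most k branches, extend each branch with its top-k next words, compare the branches and keep the best k, and stop once every kept branch has produced the end-of-sequence token. The step_fn interface and the token-id conventions are assumptions for illustration.

import numpy as np

def beam_search(step_fn, bos_id, eos_id, k=5, max_len=50):
    """step_fn(prefix) -> log-probabilities over the target dictionary for the next word."""
    beams = [([bos_id], 0.0)]                       # (prefix, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos_id:                # finished branch is carried over
                candidates.append((prefix, score))
                continue
            log_probs = step_fn(prefix)
            for tok in np.argsort(log_probs)[-k:]:  # top-k next words for this branch
                candidates.append((prefix + [int(tok)], score + float(log_probs[tok])))
        # Compare the prediction results of the branches and keep the k best.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(prefix[-1] == eos_id for prefix, _ in beams):
            break                                   # whole sequence processed
    return beams[0][0]                              # best-scoring branch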
In this embodiment, the tibetan-chinese translation method further includes a back-end management process and a front-end display function.
The back-end management process is responsible for organizing the model files and providing translation services externally. The Base class is the common base class of the translation factory class and the word segmentation factory class; it specifies the behavior a factory should have and the attributes it must maintain. TransFactory is the concrete translation factory class, responsible for creating translation objects. All translation classes inherit from the base class TransBase, which specifies how a translation class is initialized; each translation class's initialization method is called by the factory class at creation time. TransProxy is a proxy class responsible for the functions related to proxying the translation classes; it combines two components, a translation object and a word segmentation object, which are selected by the settings in the configuration file. When the system is called, the related functions are completed through the proxy class, and the translation class and word segmentation class that actually execute the functions are invisible to the caller. Both the processing of the input data and the conversion of the results are performed in the proxy class. A hypothetical sketch of this class layout is given below.
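A hypothetical sketch of the class layout just described. The class names Base, TransBase, TransFactory and TransProxy come from the text above, but every method body, signature and the registration mechanism shown here are assumptions, not the actual implementation.

class Base:
    """Common base class of the translation factory and the word segmentation factory."""
    def create(self, name, config):
        raise NotImplementedError

class TransBase:
    """Base class of all translation classes; specifies how they are initialized."""
    def init_model(self, config):
        raise NotImplementedError
    def translate(self, tokens):
        raise NotImplementedError

class TransFactory(Base):
    """Concrete translation factory: creates translation objects and initializes them."""
    registry = {}                                    # name -> translation class (assumed)

    def create(self, name, config):
        translator = self.registry[name]()
        translator.init_model(config)                # initializer called at creation time
        return translator

class TransProxy:
    """Proxy combining a translator and a segmenter chosen by the configuration file.

    Callers only see the proxy; input processing and result conversion happen here."""
    def __init__(self, translator, segmenter):
        self.translator = translator
        self.segmenter = segmenter

    def translate(self, text):
        tokens = self.segmenter.segment(text)        # word segmentation component
        return "".join(self.translator.translate(tokens))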
The front-end display function comprises a display module composed of Vue and Bootstrap components and provides human-computer interaction; Fig. 5 shows the operation effect during human-computer interaction.
In summary, applying the self-attention-based neural machine translation model to Tibetan-Chinese translation proceeds as follows: a Tibetan-Chinese bilingual corpus is first constructed; the corpus is then cleaned, word-segmented and processed with the BPE algorithm so that the original corpus becomes trainable; the model is trained on this corpus; during inference the cluster (beam) search algorithm is used; the back-end management process manages the model files; and the front-end display function provides human-computer interaction.
Fig. 6 shows a block diagram of a Tibetan language translation apparatus provided in the second embodiment of the present application, which corresponds to the Tibetan language translation method described in the foregoing embodiment of the Tibetan language translation method.
Referring to fig. 6, the tibetan translation apparatus 200 includes:
an original corpus construction module 201, configured to construct a Tibetan-Chinese bilingual parallel original corpus;
a target language database obtaining module 202, configured to preprocess the parallel original corpus of tibetan bilinguals to obtain a trainable parallel target corpus of tibetan bilinguals;
a language sequence obtaining module 203, configured to obtain a source language sequence and a target language sequence according to the tibetan bilingual parallel target corpus;
a language vector obtaining module 204, configured to perform vector expansion on the source language sequence and the target language sequence to obtain a source language vector and a target language vector;
a language vector processing module 205, configured to input the source language vector into an encoder module for processing to obtain a processing result, and input the target language vector and the processing result into a decoder module for training to obtain an output vector;
a mapping module 206 for mapping the output vector back to a target language dictionary;
a training model obtaining module 207, configured to calculate a probability value of each word in the target language sequence, and output a probability value in a vector form to obtain a training model;
and the inference module 208 is configured to infer the training model by using a cluster search algorithm to obtain a translation model.
Further, the preprocessing the Tibetan-Chinese bilingual parallel original corpus to obtain a trainable Tibetan-Chinese bilingual parallel target corpus includes:
segmenting the sentence-level original corpus pair in the Tibetan-Chinese parallel original corpus according to a preset segmentation algorithm to obtain word-level corpus pairs;
and processing the original corpus pairs of the word level in the Tibetan-Chinese bilingual parallel original corpus and the obtained corpus pairs of the word level according to a preset BPE byte pair coding algorithm.
Further, before segmenting the sentence-level original corpus in the tibetan bilingual parallel original corpus according to a preset segmentation algorithm to obtain word-level corpus pairs, sequentially performing length filtering, length truncation and mode filtering on the original corpus pairs in the tibetan bilingual parallel original corpus, wherein,
the length filtering process comprises the following steps: filtering the original corpus with the length smaller than a preset short threshold value in the Tibetan-Chinese bilingual parallel original corpus;
the length truncation processing procedure comprises the following steps: truncating the original corpus pairs with the length larger than a preset long threshold value in the Tibetan-Chinese bilingual parallel original corpus, dividing the original corpus pairs into at least two corpus pairs, and enabling the length of each divided corpus pair to be smaller than or equal to the preset long threshold value;
the mode filtering processing procedure comprises the following steps: and filtering the original corpus pairs meeting preset filtering rules in the Tibetan-Chinese bilingual parallel original corpus.
Further, the vector expansion of the source language sequence and the target language sequence to obtain a source language vector and a target language vector includes:
expanding the source language sequence and the target language sequence from a real number sequence into a D-dimensional real number vector;
and embedding the position information into the D-dimensional real number vector to obtain the source language vector and the target language vector.
Further, the inputting the source language vector into an encoder module for processing to obtain a processing result includes:
dividing the source language vector into h heads, and applying a self-attention mechanism to each head to obtain a processed source language sequence vector;
processing the source language sequence vector by using a feedforward layer to generate a new expression vector corresponding to the source language sequence vector;
the encoder module calculates the new representation vector corresponding to the source language sequence vector at each layer of the encoder module using residual concatenation and LN regularization using a residual concatenation structure;
the processing steps are repeated until the last layer of the encoder module is reached, and K, V matrix of the self-attention mechanism is obtained, wherein the K, V matrix is the processing result.
Further, the inputting the target language vector and the processing result into a decoder module for training to obtain an output vector includes:
processing the target language vector through an attention mechanism with a mask to obtain a Q matrix;
performing multi-head attention processing on the K, V matrix and the Q matrix to obtain a target language sequence vector;
processing the target language sequence vector by using a feedforward layer to generate a new expression vector corresponding to the target language sequence vector;
the decoder module uses a residual connection structure to calculate the new representation vector corresponding to the target language sequence vector at each layer of the decoder module by using residual connection and LN regularization;
and repeating the processing steps until the last layer of the decoder module is reached to obtain the output vector.
Further, the deducing the training model by using the cluster search algorithm to obtain a translation model includes:
selecting a maximum search branch number k;
putting the target language sequence into the training model, selecting the first k values each time, and deducing;
and comparing the prediction results of the k branches, selecting the maximum value, and circulating until the whole sequence is processed to obtain the translation model.
Further, the calculating a probability value of each word in the target language sequence, and outputting the probability value in a vector form to obtain a training model includes:
and calculating the probability value of each word in the target language sequence by using a Softmax function and taking the cross entropy as a loss function, and outputting the probability value in a vector form to obtain the training model.
It should be noted that, because the contents of information interaction, execution process, and the like between the above devices/modules are based on the same concept as that of the embodiment of the tibetan chinese translation method of the present application, specific functions and technical effects thereof may be specifically referred to in the section of the embodiment of the tibetan chinese translation method, and details thereof are not described herein again.
It will be clear to those skilled in the art that, for convenience and simplicity of description, the above-mentioned division of the functional modules is merely used as an example, and in practical applications, the above-mentioned function distribution can be performed by different functional modules according to needs, that is, the internal structure of the tibetan translation apparatus 200 is divided into different functional modules to perform all or part of the above-mentioned functions. Each functional module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional modules are only used for distinguishing one functional module from another, and are not used for limiting the protection scope of the application. The specific working process of each functional module in the above description may refer to the corresponding process in the foregoing embodiment of the tibetan-chinese translation method, and is not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (9)

1. A Tibetan-Chinese translation method, comprising:
constructing a Tibetan-Chinese bilingual parallel original corpus;
preprocessing the Tibetan-Chinese bilingual parallel original corpus to obtain a trainable Tibetan-Chinese bilingual parallel target corpus;
acquiring a source language sequence and a target language sequence according to the Tibetan-Chinese bilingual parallel target corpus;
performing vector expansion on the source language sequence and the target language sequence to obtain a source language vector and a target language vector;
inputting the source language vector into an encoder module for processing to obtain a processing result, and inputting the target language vector and the processing result into a decoder module for training to obtain an output vector;
mapping the output vector back to a target language dictionary;
calculating the probability value of each word in the target language sequence, and outputting the probability value in a vector form to obtain a training model;
and deducing the training model by using a cluster search algorithm to obtain a translation model.
2. The Tibetan-Chinese translation method of claim 1, wherein the preprocessing of the Tibetan-Chinese bilingual parallel original corpus to obtain a trainable Tibetan-Chinese bilingual parallel target corpus comprises:
segmenting the sentence-level original corpus pair in the Tibetan-Chinese parallel original corpus according to a preset segmentation algorithm to obtain word-level corpus pairs;
and processing the original corpus pairs of the word level in the Tibetan-Chinese bilingual parallel original corpus and the obtained corpus pairs of the word level according to a preset BPE byte pair coding algorithm.
3. The Tibetan-Chinese translation method according to claim 2, wherein before segmenting the original corpus pairs at sentence level in the Tibetan-Chinese bilingual parallel original corpus according to a preset segmentation algorithm to obtain the corpus pairs at word level, the original corpus pairs in the Tibetan-Chinese bilingual parallel original corpus are further processed by length filtering, length truncation and mode filtering in sequence, wherein,
the length filtering process comprises the following steps: filtering the original corpus with the length smaller than a preset short threshold value in the Tibetan-Chinese bilingual parallel original corpus;
the length truncation processing procedure comprises the following steps: truncating the original corpus pairs with the length larger than a preset long threshold value in the Tibetan-Chinese bilingual parallel original corpus, dividing the original corpus pairs into at least two corpus pairs, and enabling the length of each divided corpus pair to be smaller than or equal to the preset long threshold value;
the mode filtering processing procedure comprises the following steps: and filtering the original corpus pairs meeting preset filtering rules in the Tibetan-Chinese bilingual parallel original corpus.
4. The Tibetan-Chinese translation method of claim 1, wherein the vector expansion of the source language sequence and the target language sequence to obtain a source language vector and a target language vector comprises:
expanding the source language sequence and the target language sequence from a real number sequence into a D-dimensional real number vector;
and embedding the position information into the D-dimensional real number vector to obtain the source language vector and the target language vector.
5. The Tibetan language translation method of claim 1, wherein the inputting of the source language vector into an encoder module for processing results in processing results comprises:
dividing the source language vector into h heads, and applying a self-attention mechanism to each head to obtain a processed source language sequence vector;
processing the source language sequence vector by using a feedforward layer to generate a new expression vector corresponding to the source language sequence vector;
the encoder module calculates the new representation vector corresponding to the source language sequence vector at each layer of the encoder module using residual concatenation and LN regularization using a residual concatenation structure;
the processing steps are repeated until the last layer of the encoder module is reached, and K, V matrix of the self-attention mechanism is obtained, wherein the K, V matrix is the processing result.
6. The Tibetan language translation method of claim 5, wherein the inputting the target language vector and the processing result into a decoder module for training to obtain an output vector comprises:
processing the target language vector through an attention mechanism with a mask to obtain a Q matrix;
performing multi-head attention processing on the K, V matrix and the Q matrix to obtain a target language sequence vector;
processing the target language sequence vector by using a feedforward layer to generate a new expression vector corresponding to the target language sequence vector;
the decoder module uses a residual connection structure to calculate the new representation vector corresponding to the target language sequence vector at each layer of the decoder module by using residual connection and LN regularization;
and repeating the processing steps until the last layer of the decoder module is reached to obtain the output vector.
7. The Tibetan-Han translation method of claim 1, wherein the inferring the training model using a cluster search algorithm to obtain a translation model comprises:
selecting a maximum search branch number k;
putting the target language sequence into the training model, selecting the first k values each time, and deducing;
and comparing the prediction results of the k branches, selecting the maximum value, and circulating until the whole sequence is processed to obtain the translation model.
8. The Tibetan-Han translation method of claim 1, wherein the calculating a probability value of each occurrence of a word in the target language sequence and outputting the probability value in a vector form to obtain a training model comprises:
and calculating the probability value of each word in the target language sequence by using a Softmax function and taking the cross entropy as a loss function, and outputting the probability value in a vector form to obtain the training model.
9. A Tibetan-Chinese translation device, comprising:
the original corpus establishing module is used for establishing a Tibetan-Chinese bilingual parallel original corpus;
the target language database acquisition module is used for preprocessing the Tibetan-Chinese bilingual parallel original corpus to obtain a trainable Tibetan-Chinese bilingual parallel target corpus;
the language sequence acquisition module is used for acquiring a source language sequence and a target language sequence according to the Tibetan-Chinese bilingual parallel target corpus;
the language vector acquisition module is used for performing vector expansion on the source language sequence and the target language sequence to obtain a source language vector and a target language vector;
the language vector processing module is used for inputting the source language vector into the encoder module for processing to obtain a processing result, and inputting the target language vector and the processing result into the decoder module for training to obtain an output vector;
a mapping module for mapping the output vector back to a target language dictionary;
the training model acquisition module is used for calculating the probability value of each word in the target language sequence and outputting the probability value in a vector form to obtain a training model;
and the inference module is used for inferring the training model by using a cluster search algorithm to obtain a translation model.
CN202010987775.8A 2020-09-18 2020-09-18 Tibetan-Chinese translation method and device Pending CN112084794A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010987775.8A CN112084794A (en) 2020-09-18 2020-09-18 Tibetan-Chinese translation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010987775.8A CN112084794A (en) 2020-09-18 2020-09-18 Tibetan-Chinese translation method and device

Publications (1)

Publication Number Publication Date
CN112084794A true CN112084794A (en) 2020-12-15

Family

ID=73738953

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010987775.8A Pending CN112084794A (en) 2020-09-18 2020-09-18 Tibetan-Chinese translation method and device

Country Status (1)

Country Link
CN (1) CN112084794A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109190131A (en) * 2018-09-18 2019-01-11 北京工业大学 Unified prediction of English words and their capitalization based on neural machine translation
CN109492232A (en) * 2018-10-22 2019-03-19 内蒙古工业大学 Transformer-based Mongolian-Chinese machine translation method with enhanced semantic feature information
CN110334361A (en) * 2019-07-12 2019-10-15 电子科技大学 Neural machine translation method for low-resource languages
CN110674646A (en) * 2019-09-06 2020-01-10 内蒙古工业大学 Mongolian-Chinese machine translation system based on byte pair encoding technology
CN110765784A (en) * 2019-09-12 2020-02-07 内蒙古工业大学 Mongolian-Chinese machine translation method based on dual learning
CN110879940A (en) * 2019-11-21 2020-03-13 哈尔滨理工大学 Machine translation method and system based on deep neural network

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
慈祯嘉措; 桑杰端珠; 孙茂松; 色差甲; 周毛先: "Research on a Tibetan-Chinese Machine Translation Method Incorporating a Monolingual Language Model", 中文信息学报 (Journal of Chinese Information Processing), vol. 1, no. 12, pages 125 - 70 *
李京谕 et al.: "Document-Level Machine Translation Based on a Joint Attention Mechanism", 中文信息学报 (Journal of Chinese Information Processing), vol. 33, no. 12, pages 45 - 53 *
桑杰端珠: "Research on Tibetan-Chinese Machine Translation under Low-Resource Conditions", 硕士电子期刊 (master's thesis, electronic journal), no. 01, pages 138 - 2518 *
沙九; 冯冲; 张天夫; 郭宇航; 刘芳: "Research on Tibetan-Chinese Bidirectional Neural Machine Translation with Multiple Segmentation Granularity Strategies", 厦门大学学报(自然科学版) (Journal of Xiamen University, Natural Science Edition), vol. 59, no. 02, pages 71 - 77 *
赵亚平; 苏依拉; 牛向华; 仁庆道尔吉: "A Mongolian-Chinese Machine Translation Method Based on Neural Network Transfer Learning", 计算机应用与软件 (Computer Applications and Software), no. 01, pages 185 - 191 *
赵阳 et al.: "Research on Neural Machine Translation Techniques for Low-Resource Minority-Chinese Language Pairs", 江西师范大学学报(自然科学版) (Journal of Jiangxi Normal University, Natural Science Edition), vol. 43, no. 06, pages 630 - 637 *
高芬 et al.: "Research on Mongolian-Chinese Neural Machine Translation Based on Transformer", 计算机应用与软件 (Computer Applications and Software), vol. 37, no. 2, pages 141 - 146 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112613326A (en) * 2020-12-18 2021-04-06 北京理工大学 Tibetan language neural machine translation method fusing syntactic structure
CN112613326B (en) * 2020-12-18 2022-11-08 北京理工大学 Tibetan language neural machine translation method fusing syntactic structure
CN113157865A (en) * 2021-04-25 2021-07-23 平安科技(深圳)有限公司 Cross-language word vector generation method and device, electronic equipment and storage medium
CN113157865B (en) * 2021-04-25 2023-06-23 平安科技(深圳)有限公司 Cross-language word vector generation method and device, electronic equipment and storage medium
WO2022267674A1 (en) * 2021-06-22 2022-12-29 康键信息技术(深圳)有限公司 Deep learning-based text translation method and apparatus, device and storage medium
CN113515959A (en) * 2021-06-23 2021-10-19 网易有道信息技术(北京)有限公司 Training method of machine translation model, machine translation method and related equipment
CN115329783A (en) * 2022-08-09 2022-11-11 拥措 Tibetan Chinese neural machine translation method based on cross-language pre-training model

Similar Documents

Publication Publication Date Title
CN112084794A (en) Tibetan-Chinese translation method and device
CN110175221B (en) Junk short message identification method by combining word vector with machine learning
CN110442711B (en) Text intelligent cleaning method and device and computer readable storage medium
CN114970522B (en) Pre-training method, device, equipment and storage medium of language model
CN110826298B (en) Statement coding method used in intelligent auxiliary password-fixing system
CN111079447B (en) Chinese-oriented pre-training method and system
CN111460797B (en) Keyword extraction method and device, electronic equipment and readable storage medium
CN111680494A (en) Similar text generation method and device
CN107357785A (en) Theme feature word abstracting method and system, feeling polarities determination methods and system
US11615247B1 (en) Labeling method and apparatus for named entity recognition of legal instrument
CN115759119B (en) Financial text emotion analysis method, system, medium and equipment
CN113516136A (en) Handwritten image generation method, model training method, device and equipment
CN111597807B (en) Word segmentation data set generation method, device, equipment and storage medium thereof
CN112417891A (en) Text relation automatic labeling method based on open type information extraction
CN111090734B (en) Method and system for optimizing machine reading understanding capability based on hierarchical attention mechanism
CN112418320A (en) Enterprise association relation identification method and device and storage medium
Zhao et al. Recognition of the agricultural named entities with multifeature fusion based on albert
CN111858933A (en) Character-based hierarchical text emotion analysis method and system
CN111402365A (en) Method for generating picture from characters based on bidirectional architecture confrontation generation network
CN111353032B (en) Community question and answer oriented question classification method and system
CN110597982A (en) Short text topic clustering algorithm based on word co-occurrence network
CN113726730A (en) DGA domain name detection method and system based on deep learning algorithm
CN113779966A (en) Mongolian emotion analysis method of bidirectional CNN-RNN depth model based on attention
CN111581332A (en) Similar judicial case matching method and system based on triple deep hash learning
CN111090462A (en) API (application program interface) matching method and device based on API document

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination