CN112084794A - Tibetan-Chinese translation method and device - Google Patents

Tibetan-Chinese translation method and device Download PDF

Info

Publication number
CN112084794A
Authority
CN
China
Prior art keywords
vector
tibetan
target language
sequence
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010987775.8A
Other languages
Chinese (zh)
Inventor
尼玛扎西
于永斌
头旦才让
仁青东珠
王昊
邓权芯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tibet University
Original Assignee
Tibet University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tibet University filed Critical Tibet University
Priority to CN202010987775.8A priority Critical patent/CN112084794A/en
Publication of CN112084794A publication Critical patent/CN112084794A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/40 - Processing or translation of natural language
    • G06F 40/42 - Data-driven translation
    • G06F 40/49 - Data-driven translation using very large corpora, e.g. the web
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/237 - Lexical tools
    • G06F 40/242 - Dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/40 - Processing or translation of natural language
    • G06F 40/42 - Data-driven translation
    • G06F 40/44 - Statistical methods, e.g. probability models
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/40 - Processing or translation of natural language
    • G06F 40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Abstract

The invention relates to a Tibetan-Chinese translation method and device. A Tibetan-Chinese bilingual parallel original corpus is constructed and preprocessed to obtain a trainable Tibetan-Chinese bilingual parallel target corpus, from which a source language sequence and a target language sequence are obtained. The source and target language sequences are expanded into a source language vector and a target language vector; the source language vector is input to an encoder module for processing to obtain a processing result, and the target language vector together with the processing result is input to a decoder module for training to obtain an output vector. The output vector is mapped back to the target language dictionary, the probability of each word in the target language sequence is calculated and output in vector form to obtain a training model, and inference is performed on the training model with a cluster search (beam search) algorithm. Compared with the traditional LSTM network, the Tibetan-Chinese translation method provided by the invention has better parallelism and higher computational efficiency, and its application and popularization in Tibetan-Chinese translation are explored.

Description

Tibetan-Chinese translation method and device
Technical Field
The invention relates to a Tibetan-Chinese translation method and a Tibetan-Chinese translation device.
Background
Machine translation technology is mainly implemented in two ways: rule-based translation and statistics-based translation. Within statistics-based translation, with the development of deep neural networks in recent years, neural machine translation based on deep learning has become the mainstream translation model. In neural machine translation, the encoder-decoder model is the main translation framework: the encoder neural network maps source language features into vectors, which the decoder neural network decodes into the target language. Before 2017, the neural networks mainly used in the encoder-decoder were the long short-term memory network (LSTM) and the gated recurrent unit (GRU). The attention mechanism redistributes the weights over word vectors by weighted summation so that important content is focused on; it significantly improves machine translation quality and is an indispensable component of the encoder-decoder framework.
The LSTM can extract features of the source and target languages, alleviates the gradient explosion and gradient vanishing problems of the plain RNN, and can capture dependency information over longer sequences than the plain RNN. The GRU is an improvement on the LSTM with a simpler structure, saving about one third of the parameters compared with the LSTM and significantly reducing training time. The attention mechanism forces the encoder-decoder to learn the correspondence between the source and target languages, alleviates the problems caused by the long distance between a source word being translated and its target word in the encoder-decoder framework, and improves translation quality.
Tibetan belongs to the Sino-Tibetan language family and is the main language of the Tibetan people, spoken mainly in the Tibet Autonomous Region of China and in the Tibetan autonomous prefectures of Qinghai, Sichuan, Yunnan and Gansu provinces. The Tibetan script is the writing system of the Tibetan language and is used in common across all Tibetan dialects. Tibetan has a long history and completely records a large body of Buddhist teachings, philosophy and literary works; Tibetan-language literature is vast, second in quantity only to Chinese-language literature. Given the large number of Tibetan speakers, researching and deploying a machine translation system between Tibetan and Chinese is of great significance.
Neural machine translation has simpler components than traditional statistical translation models, but it requires a large amount of training data. At present, about 70% of the world's bilingual corpora involve English, while corpora for minority languages such as Tibetan are scarce, which makes training Tibetan-Chinese translation models difficult. Moreover, the sequential dependency of LSTM and GRU networks makes parallel computation very difficult. A model whose computation cannot be parallelized is hard to accelerate on a GPU and hard to scale to large corpora, which poses a serious obstacle to the large-scale application and deployment of neural machine translation models.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a Tibetan-Chinese translation method and device.
In order to solve the problems, the invention adopts the following technical scheme:
A Tibetan-Chinese translation method, comprising:
constructing a Tibetan-Chinese bilingual parallel original corpus;
preprocessing the Tibetan-Chinese bilingual parallel original corpus to obtain a trainable Tibetan-Chinese bilingual parallel target corpus;
acquiring a source language sequence and a target language sequence according to the Tibetan-Chinese bilingual parallel target corpus;
performing vector expansion on the source language sequence and the target language sequence to obtain a source language vector and a target language vector;
inputting the source language vector into an encoder module for processing to obtain a processing result, and inputting the target language vector and the processing result into a decoder module for training to obtain an output vector;
mapping the output vector back to a target language dictionary;
calculating the probability value of each word in the target language sequence, and outputting the probability value in a vector form to obtain a training model;
and deducing the training model by using a cluster search algorithm to obtain a translation model.
Preferably, preprocessing the Tibetan-Chinese bilingual parallel original corpus to obtain a trainable Tibetan-Chinese bilingual parallel target corpus includes:
segmenting the sentence-level original corpus pair in the Tibetan-Chinese parallel original corpus according to a preset segmentation algorithm to obtain word-level corpus pairs;
and processing the original corpus pairs of the word level in the Tibetan-Chinese bilingual parallel original corpus and the obtained corpus pairs of the word level according to a preset BPE byte pair coding algorithm.
Preferably, before segmenting the sentence-level original corpus in the tibetan bilingual parallel original corpus according to a preset segmentation algorithm to obtain word-level corpus pairs, the length filtering, the length truncation and the mode filtering are sequentially performed on the original corpus pairs in the tibetan bilingual parallel original corpus, wherein,
the length filtering process comprises the following steps: filtering the original corpus with the length smaller than a preset short threshold value in the Tibetan-Chinese bilingual parallel original corpus;
the length truncation processing procedure comprises the following steps: truncating the original corpus pairs with the length larger than a preset long threshold value in the Tibetan-Chinese bilingual parallel original corpus, dividing the original corpus pairs into at least two corpus pairs, and enabling the length of each divided corpus pair to be smaller than or equal to the preset long threshold value;
the mode filtering processing procedure comprises the following steps: and filtering the original corpus pairs meeting preset filtering rules in the Tibetan-Chinese bilingual parallel original corpus.
Preferably, the vector expansion of the source language sequence and the target language sequence to obtain a source language vector and a target language vector includes:
expanding the source language sequence and the target language sequence from a real number sequence into a D-dimensional real number vector;
and embedding the position information into the D-dimensional real number vector to obtain the source language vector and the target language vector.
Preferably, the inputting the source language vector into an encoder module for processing to obtain a processing result includes:
dividing the source language vector into h heads, and applying a self-attention mechanism to each head to obtain a processed source language sequence vector;
processing the source language sequence vector by using a feedforward layer to generate a new expression vector corresponding to the source language sequence vector;
the encoder module calculates the new representation vector corresponding to the source language sequence vector at each layer of the encoder module using residual concatenation and LN regularization using a residual concatenation structure;
the processing steps are repeated until the last layer of the encoder module is reached, and K, V matrix of the self-attention mechanism is obtained, wherein the K, V matrix is the processing result.
Preferably, the inputting the target language vector and the processing result into a decoder module for training to obtain an output vector includes:
processing the target language vector through an attention mechanism with a mask to obtain a Q matrix;
performing multi-head attention processing on the K, V matrix and the Q matrix to obtain a target language sequence vector;
processing the target language sequence vector by using a feedforward layer to generate a new expression vector corresponding to the target language sequence vector;
the decoder module uses a residual connection structure to calculate the new representation vector corresponding to the target language sequence vector at each layer of the decoder module by using residual connection and LN regularization;
and repeating the processing steps until the last layer of the decoder module is reached to obtain the output vector.
Preferably, the deducing the training model by using the bundle searching algorithm to obtain a translation model includes:
selecting a maximum search branch number k;
putting the target language sequence into the training model, selecting the first k values each time, and deducing;
and comparing the prediction results of the k branches, selecting the maximum value, and circulating until the whole sequence is processed to obtain the translation model.
Preferably, the calculating a probability value of each word in the target language sequence, and outputting the probability value in a vector form to obtain a training model includes:
and calculating the probability value of each word in the target language sequence by using a Softmax function and taking the cross entropy as a loss function, and outputting the probability value in a vector form to obtain the training model.
A Tibetan-Chinese translation device, comprising:
the original corpus establishing module is used for establishing a Tibetan-Chinese bilingual parallel original corpus;
the target language database acquisition module is used for preprocessing the Tibetan-Chinese bilingual parallel original corpus to obtain a trainable Tibetan-Chinese bilingual parallel target corpus;
the language sequence acquisition module is used for acquiring a source language sequence and a target language sequence according to the Tibetan-Chinese bilingual parallel target corpus;
the language vector acquisition module is used for performing vector expansion on the source language sequence and the target language sequence to obtain a source language vector and a target language vector;
the language vector processing module is used for inputting the source language vector into the encoder module for processing to obtain a processing result, and inputting the target language vector and the processing result into the decoder module for training to obtain an output vector;
a mapping module for mapping the output vector back to a target language dictionary;
the training model acquisition module is used for calculating the probability value of each word in the target language sequence and outputting the probability value in a vector form to obtain a training model;
and the inference module is used for inferring the training model by using a cluster search algorithm to obtain a translation model.
The beneficial effects of the invention are as follows: after the Tibetan-Chinese bilingual parallel original corpus is obtained, it is preprocessed into a trainable Tibetan-Chinese bilingual parallel target corpus, from which a source language sequence and a target language sequence are obtained, and a training model is obtained from these sequences. Specifically, the source and target language sequences are first expanded into a source language vector and a target language vector; the source language vector is processed by the encoder module to obtain a processing result; the target language vector and the processing result output by the encoder module are input into the decoder module for training to obtain an output vector; the output vector is mapped back to the target language dictionary; finally, the probability of each word in the target language sequence is calculated and output in vector form to obtain the training model, and the training model is inferred with a cluster search (beam search) algorithm to obtain the translation model. Through this training process a reliable Tibetan-Chinese translation model can be obtained, and translation between Tibetan and Chinese can be carried out accurately and reliably according to it. Therefore, compared with the traditional LSTM network, the Tibetan-Chinese translation method has better parallelism and higher computational efficiency, and its application and popularization in Tibetan-Chinese translation are explored.
Drawings
In order to more clearly illustrate the technical solution of the embodiment of the present invention, the drawings needed to be used in the embodiment will be briefly described as follows:
fig. 1 is a schematic overall flow chart of a tibetan chinese translation method according to an embodiment of the present application;
FIG. 2 is a pre-processing flow diagram;
FIG. 3 is a schematic diagram of a self-attention mechanism neural machine translation network;
FIG. 4 is a schematic of an attention algorithm;
FIG. 5 is a diagram of operation effect during human-computer interaction;
fig. 6 is a schematic overall structure diagram of a tibetan chinese translation apparatus according to a second embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to" determining "or" in response to detecting ". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
The Tibetan-Chinese translation method provided by the embodiment of the application can be applied to terminal devices such as mobile phones, tablet computers, wearable devices, vehicle-mounted devices, Augmented Reality (AR)/Virtual Reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPCs), netbooks, Personal Digital Assistants (PDAs), and the like, and the embodiment of the application does not limit the specific types of the terminal devices at all. That is, the carrier of the client corresponding to the tibetan chinese translation method provided in the embodiment of the present application may be any one of the above terminal devices.
In order to explain the technical means described in the present application, the following description will be given by way of specific embodiments.
Fig. 1 is a flowchart of an implementation process of a tibetan chinese translation method provided in an embodiment of the present application, and only a part related to the embodiment of the present application is shown for convenience of description.
The Tibetan-Han translation method comprises the following steps:
step S101: constructing a Tibetan-Chinese bilingual parallel original corpus:
and acquiring the original corpus pairs of the Tibetan-Chinese bilingual, and constructing a parallel original corpus of the Tibetan-Chinese bilingual according to the original corpus pairs of the Tibetan-Chinese bilingual.
Step S102: preprocessing the Tibetan-Chinese bilingual parallel original corpus to obtain a trainable Tibetan-Chinese bilingual parallel target corpus:
The Tibetan-Chinese bilingual parallel original corpus is preprocessed to obtain a trainable Tibetan-Chinese bilingual parallel target corpus. The specific preprocessing procedure can be set according to actual needs; a concrete procedure is given here, in which length filtering, length truncation, pattern filtering, word segmentation and BPE processing are performed in sequence. The five preprocessing steps are as follows:
the length filtering process is to filter the too short original corpus, specifically: and filtering the original corpus pairs with the length smaller than a preset short threshold s in the Tibetan-Chinese bilingual parallel original corpus, and for a certain original corpus pair, if the length of the original corpus pair is smaller than the preset short threshold s, removing the original corpus pair from the original corpus.
The length truncation process is used for truncating the overlong original corpus pair to separate the corpus pair into a plurality of corpus pairs, specifically: for any original corpus pair in the Tibetan-Chinese bilingual parallel original corpus, if the length of the original corpus pair is greater than a preset length threshold max, the original corpus pair is cut off and divided into at least two corpus pairs, and the length of each divided corpus pair is less than or equal to the preset length threshold max.
The pattern filtering process filters out the original corpus pairs in the Tibetan-Chinese bilingual parallel original corpus that match a preset filtering rule; the preset filtering rule is defined by a rule set R (as shown in Fig. 2), and text matching the rule set is filtered out. A Chinese pattern set R1 and a Tibetan pattern set R2 are given; in this case R1 = R2. If a pattern in an original corpus pair matches R1 or R2, the pair is filtered out of the original corpus. The pattern sets mainly describe features that do not conform to the language being processed, such as Arabic numerals in Tibetan text or special characters outside the character set. A minimal sketch of these three cleaning passes is given below.
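The following is a minimal Python sketch of the length filtering, length truncation and pattern filtering passes, assuming word- or character-count lengths. The threshold values, the example regular expressions standing in for the rule sets R1/R2, and the function names are illustrative assumptions, not values taken from this disclosure; in particular, the fixed-window split used for truncation is only a placeholder, since the disclosure does not specify how an over-long pair is divided.

import re

SHORT_THRESHOLD = 3    # assumed value of the preset short threshold s
LONG_THRESHOLD = 100   # assumed value of the preset long threshold max
R1 = [re.compile(r"[0-9]")]   # assumed Chinese-side pattern set R1
R2 = [re.compile(r"[0-9]")]   # assumed Tibetan-side pattern set R2 (here R1 = R2)

def clean_corpus(pairs):
    """pairs: list of (tibetan_sentence, chinese_sentence) tuples."""
    cleaned = []
    for bo, zh in pairs:
        # Length filtering: drop pairs shorter than the short threshold.
        if len(bo) < SHORT_THRESHOLD or len(zh) < SHORT_THRESHOLD:
            continue
        # Pattern filtering: drop pairs matching any rule in R1 or R2.
        if any(p.search(zh) for p in R1) or any(p.search(bo) for p in R2):
            continue
        # Length truncation: split over-long pairs into chunks <= LONG_THRESHOLD.
        if len(bo) > LONG_THRESHOLD or len(zh) > LONG_THRESHOLD:
            for i in range(0, max(len(bo), len(zh)), LONG_THRESHOLD):
                cleaned.append((bo[i:i + LONG_THRESHOLD], zh[i:i + LONG_THRESHOLD]))
        else:
            cleaned.append((bo, zh))
    return cleaned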
The word segmentation processing process comprises the following steps: and segmenting the sentence-level original corpus pair in the Tibetan-Chinese bilingual parallel original corpus according to a preset segmentation algorithm to obtain word-level corpus pairs. In this embodiment, the word segmentation algorithm is a word segmentation algorithm provided by Hanlp. And for the sentence-level original corpus pair in the Tibetan-Chinese bilingual parallel original corpus, performing word segmentation on the Chinese by using a word segmentation algorithm provided by Hanlp, processing the Tibetan by using a Tibetan word segmentation device, and outputting the word-level corpus pair.
The BPE treatment process comprises the following steps: and processing the original corpus pairs of the word level in the Tibetan-Chinese bilingual parallel original corpus and the corpus pairs of the word level obtained by word segmentation according to a preset BPE byte pair coding algorithm, and outputting the corpus pairs with finer granularity. Given the size Vocab, this value is used to process the corpus pairs with the BPE byte pair encoding algorithm and generate a joint dictionary.
The BPE byte pair encoding algorithm receives the dictionary generated from the original corpus, splits each word into a character sequence and counts word frequencies, merges character sequences into new subword units according to frequency, and repeats this process until a preset iteration limit is reached or no further pairs can be merged. The main significance of the BPE algorithm is to convert the basic unit of the corpus from the word level to a mixture between the character level and the word level. Because the corpus is often insufficient and model generalization is limited, taking the word level as the minimum learning granularity alone cannot fully alleviate the data scarcity problem. A BPE-processed corpus strikes a balance between word-level granularity, which requires more data, and character-level granularity, which limits the precision of the model's generalization ability, and is therefore well suited to relatively scarce data. A sketch of the merge-learning procedure is shown below.
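A minimal sketch of the merge-learning loop just described (split words into characters, count adjacent-pair frequencies, merge the most frequent pair, repeat). It follows the commonly published BPE formulation; the function names and the end-of-word marker </w> are assumptions, and the actual subword tool and joint-dictionary generation used by the disclosure are not reproduced here.

import re
import collections

def get_pair_stats(vocab):
    """Count the frequency of each adjacent symbol pair in the split vocabulary."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the chosen symbol pair into a single symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    """word_freqs: dict mapping each word to its corpus frequency."""
    # Split every word into a character sequence with an end-of-word marker.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):            # stop at the preset iteration limit ...
        pairs = get_pair_stats(vocab)
        if not pairs:                      # ... or when nothing can be merged
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges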
The effect of the preprocessing is shown in Table 1.
Table 1 (presented as an image in the original publication; its contents are not reproduced here)
Step S103: according to the Tibetan-Chinese bilingual parallel target corpus, acquiring a source language sequence and a target language sequence:
and acquiring a source language sequence and a target language sequence according to the Tibetan-Chinese bilingual parallel target corpus. In this embodiment, words in the source language sequence and the target language sequence may be replaced with values in the dictionary, and specific positions in the source language sequence and the target language sequence, such as the beginning, the end, and the blank, may be replaced with values in the dictionary.
Step S104: performing vector expansion on the source language sequence and the target language sequence to obtain a source language vector and a target language vector:
performing vector expansion on a source language sequence and a target language sequence to obtain a source language vector and a target language vector, where this embodiment provides a processing procedure:
and expanding the source language sequence and the target language sequence from a real number sequence into a D-dimensional real number vector. For two real sequences, a source language sequence and a target language sequence, let j be the position of each word in the dictionary, there is a function fwe(j) Wherein
Figure BDA0002689831040000101
This process, called the word embedding layer, maps the dictionary number to a vector in a D-dimensional space, where each real number component in the vector is a parameter to be learned.
And embedding the position information into the D-dimensional real number vector to obtain a source language vector and a target language vector. For the real number vector processed above, there is a function fpe(pos, i), where pos is the position of each word in the sequence,
Figure BDA0002689831040000102
specifically, for the above process, given the sequence position pos and the D-dimensional vector dimension i, there are:
Figure BDA0002689831040000104
Figure BDA0002689831040000103
in summary, the above process is written as f (j, pos) ═ fwe(j)+fpe(pos)。
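A minimal NumPy sketch of f(j, pos) = f_we(j) + f_pe(pos) under the sinusoidal form given above; the embedding-table initialization and the assumption that D is even are illustrative choices, not part of the disclosure.

import numpy as np

def word_embedding(token_ids, emb_table):
    """f_we: look up each dictionary index j in a learned |V| x D table."""
    return emb_table[token_ids]                      # shape (seq_len, D)

def positional_encoding(seq_len, D):
    """f_pe: sinusoidal position information (D assumed even)."""
    pe = np.zeros((seq_len, D))
    pos = np.arange(seq_len)[:, None]                # word position in the sequence
    i = np.arange(0, D, 2)[None, :]                  # even dimension indices 2i
    angle = pos / np.power(10000.0, i / D)
    pe[:, 0::2] = np.sin(angle)                      # f_pe(pos, 2i)
    pe[:, 1::2] = np.cos(angle)                      # f_pe(pos, 2i+1)
    return pe

# Usage: embed the token ids, then add the position information.
# emb_table = 0.01 * np.random.randn(vocab_size, D)  # learned during training
# x = word_embedding(ids, emb_table) + positional_encoding(len(ids), D)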
Step S105: inputting the source language vector into an encoder module for processing to obtain a processing result, inputting the target language vector and the processing result into a decoder module for training to obtain an output vector:
as shown in fig. 3, inputting the source language vector into the encoder module for processing to obtain a processing result, specifically including the following processes:
(1) The source language vector is divided into h heads (h ≥ 2), and a self-attention mechanism is applied to each head to obtain the processed source language sequence vector.
For a source language vector x, the q, k and v vectors are obtained by three linear mappings with unshared parameters:
q = Linear_q(x)
k = Linear_k(x)
v = Linear_v(x)
For a batch of data, all q, k and v vectors are concatenated to obtain the matrices Q, K and V. For the h heads into which they are divided, there is
MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) W^O
Specifically, for each head:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
where
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
For each source language vector x, the attention mechanism above outputs attention weights through the softmax function, and the attention value of each part is obtained by weighted summation. The vector is divided into h heads in order to improve the generalization performance of the model.
(2) The source language sequence vector is processed using a feed-forward layer to generate a new representation vector corresponding to the source language sequence vector. The feed-forward layer is FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, where W_1, W_2, b_1 and b_2 are the parameters to be learned by the feed-forward layer.
(3) The encoder module uses a residual connection structure: at each layer of the encoder module, the new representation vector corresponding to the source language sequence vector is computed using residual connections and LN (layer normalization).
Denoting the processing result of steps (1) and (2) above by Sublayer(x), we have f(x) = LN(x + Sublayer(x)). Specifically, the LN formula is LN(x) = g ⊙ N(x) + b, where ⊙ denotes element-wise multiplication, N(x) denotes the normalized x, and g and b are parameters to be learned.
(4) The above processing steps are repeated until the last layer of the encoder module is reached, and the K and V matrices of the self-attention mechanism are obtained; these K and V matrices are the processing result. A sketch of one encoder layer combining steps (1)-(3) is given below.
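A minimal NumPy sketch of one encoder layer, combining the multi-head self-attention, the feed-forward layer and the residual-plus-LN wrapping from steps (1)-(3). The parameter names, shapes and dictionary layout are assumptions for illustration, not taken verbatim from the disclosure.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(x_q, x_kv, W_q, W_k, W_v, W_o, h):
    """Project Q, K, V, split into h heads, attend per head, concatenate, project."""
    Q, K, V = x_q @ W_q, x_kv @ W_k, x_kv @ W_v          # linear, unshared parameters
    def split(M):                                        # (L, D) -> (h, L, D/h)
        L, D = M.shape
        return M.reshape(L, h, D // h).transpose(1, 0, 2)
    heads = attention(split(Q), split(K), split(V))      # one attention per head
    concat = heads.transpose(1, 0, 2).reshape(x_q.shape[0], -1)
    return concat @ W_o                                  # Concat(head_1..head_h) W^O

def layer_norm(x, g, b, eps=1e-6):
    """LN(x) = g * N(x) + b, with N(x) the per-position normalization of x."""
    mean = x.mean(-1, keepdims=True)
    std = x.std(-1, keepdims=True)
    return g * (x - mean) / (std + eps) + b

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, p, h):
    """p: dict of this layer's parameters; each sub-layer is wrapped as LN(x + Sublayer(x))."""
    a = multi_head_attention(x, x, p["W_q"], p["W_k"], p["W_v"], p["W_o"], h)
    x = layer_norm(x + a, p["g1"], p["b1"])
    f = feed_forward(x, p["W1"], p["c1"], p["W2"], p["c2"])
    return layer_norm(x + f, p["g2"], p["b2"])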
The target language vector and the processing result are then input into the decoder module for training to obtain an output vector. During training, the processing differs from that of the source language vector in that a multi-head attention mechanism with a mask is used. The significance of the mask is that the decoder module must infer the next position from the information already available, so for each sequence position pos it must be ensured that the attention results for positions greater than pos are 0.
As shown in fig. 3, the target language vector and the processing result are input to the decoder module for training to obtain an output vector, and the specific process includes:
(1) and processing the target language vector through an attention mechanism with a mask to obtain a Q matrix.
(2) And taking the K, V matrix output by the last layer of the encoder module as the input of the multi-head attention module, forming the input of the multi-head attention module together with the Q matrix, and carrying out multi-head attention processing on the K, V matrix and the Q matrix to obtain a target language sequence vector.
(3) The target language sequence vector is processed using a feedforward layer to generate a new representation vector corresponding to the target language sequence vector.
(4) The decoder module uses a residual concatenation structure to compute a new representation vector corresponding to the target language sequence vector at each layer of the decoder module using residual concatenation and LN regularization.
(5) And repeating the processing steps until the last layer of the decoder module is reached to obtain an output vector.
In particular, the masked attention mechanism uses a matrix M_mask of the same size as the V matrix, in which the elements above the diagonal (inclusive) are 0 and the remaining elements are 1. It is applied during the dot product of the Q and K matrices, so that the positions whose mask element is 0 are masked out, as in the scaled dot-product stage of Fig. 4 (the mask flow). A sketch of a common way to realize this masking is given below.
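The sketch below shows a common way of realizing the stated constraint in code: instead of multiplying a 0/1 matrix into the dot product as described above, the scores of positions greater than pos are set to negative infinity before the softmax, which likewise drives their attention weights to 0. This substitution, and the function name, are illustrative assumptions.

import numpy as np

def masked_attention_scores(Q, K):
    """Decoder self-attention scores with future positions masked out."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # scaled dot product
    L = scores.shape[0]
    future = np.triu(np.ones((L, L), dtype=bool), k=1)    # positions > pos
    scores[future] = -np.inf                              # softmax weight becomes 0
    return scores                                         # apply softmax, then multiply by V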
Step S106: mapping the output vector back to a target language dictionary:
the output vector obtained in step S105 is mapped back to the target language dictionary using a linear layer, the input size of which is the vector size of the source language vector and the target language vector, and the output size is the size of the target language dictionary.
Step S107: calculating the probability value of each word in the target language sequence, and outputting the probability value in a vector form to obtain a training model:
in this embodiment, a Softmax function is used, the cross entropy is used as a loss function, a probability value of each word in the target language sequence is calculated, and a probability value is output in a vector form to obtain a training model.
Specifically, the cross-entropy loss function is L = -Σ_t log p(y_t | y_<t, x), i.e., the negative log-likelihood of each reference target word under the predicted distribution, summed over the target sequence.
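A minimal sketch of the Softmax-plus-cross-entropy computation over the decoder's output scores; the variable names and the per-position averaging are illustrative assumptions.

import numpy as np

def cross_entropy_loss(logits, target_ids):
    """logits: (T, |V_target|) scores from the final linear layer; target_ids: length-T reference."""
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)                 # Softmax over the target dictionary
    picked = probs[np.arange(len(target_ids)), target_ids]
    return -np.mean(np.log(picked))                       # average negative log-likelihood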
To summarize the above: after the source language sequence and the target language sequence are obtained, they are encoded into vector sequences and the position information is embedded. During training, the encoder module computes the K and V matrices from the source sequence, the decoder module computes the Q matrix from the target language sequence, and the attention mechanism is computed from the Q, K and V matrices together. After processing by the decoder module, the vectors are mapped back to the target language dictionary through the linear layer, and the cross entropy is used as the loss function to calculate the probability of each word.
Step S108: deducing the training model by using a cluster search algorithm to obtain a translation model:
and deducing the trained model to generate a translated text, namely a translation model. Deducing the training model by using a cluster search algorithm to obtain a translation model, wherein the method comprises the following steps:
(1) selecting a maximum search branch number k;
(2) putting the target language sequence into a training model, selecting the first k values each time, and deducing;
(3) comparing the prediction results of the k branches, and selecting the maximum value, namely selecting the branch with the best effect;
(4) and circulating until the full sequence processing is completed, namely until the model inference is finished, and obtaining the translation model.
The cluster (beam) search algorithm searches the solution space and effectively avoids getting trapped in a locally optimal solution. Empirically, the value of k ranges from 5 to 10. A sketch of this inference procedure is given below.
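A minimal sketch of the search loop described above: keep at most k branches, extend each branch with its top-k next words, compare the branches and keep the best k, and stop once every kept branch has produced the end-of-sequence token. The step_fn interface and the token-id conventions are assumptions for illustration.

import numpy as np

def beam_search(step_fn, bos_id, eos_id, k=5, max_len=50):
    """step_fn(prefix) -> log-probabilities over the target dictionary for the next word."""
    beams = [([bos_id], 0.0)]                       # (prefix, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos_id:                # finished branch is carried over
                candidates.append((prefix, score))
                continue
            log_probs = step_fn(prefix)
            for tok in np.argsort(log_probs)[-k:]:  # top-k next words for this branch
                candidates.append((prefix + [int(tok)], score + float(log_probs[tok])))
        # Compare the prediction results of the branches and keep the k best.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(prefix[-1] == eos_id for prefix, _ in beams):
            break                                   # whole sequence processed
    return beams[0][0]                              # best-scoring branch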
In this embodiment, the tibetan-chinese translation method further includes a back-end management process and a front-end display function.
The back-end management process is responsible for organizing the model files and providing translation services externally. The Base class is the common base class of the translation factory class and the word segmentation factory class; it specifies the behavior a factory should have and the attributes it must maintain. TransFactory is the concrete translation factory class, responsible for creating translation objects. All translation classes inherit from the base class TransBase, which specifies how a translation class is initialized; each translation class's initialization method is called by the factory class at creation time. TransProxy is a proxy class responsible for the functions related to proxying the translation classes; it combines two components, a translation object and a word segmentation object, which are selected by the settings in the configuration file. When the system is called, the related functions are completed through the proxy class, and the translation class and word segmentation class that actually execute the functions are invisible to the caller. Both the processing of the input data and the conversion of the results are performed in the proxy class. A hypothetical sketch of this class layout is given below.
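A hypothetical sketch of the class layout just described. The class names Base, TransBase, TransFactory and TransProxy come from the text above, but every method body, signature and the registration mechanism shown here are assumptions, not the actual implementation.

class Base:
    """Common base class of the translation factory and the word segmentation factory."""
    def create(self, name, config):
        raise NotImplementedError

class TransBase:
    """Base class of all translation classes; specifies how they are initialized."""
    def init_model(self, config):
        raise NotImplementedError
    def translate(self, tokens):
        raise NotImplementedError

class TransFactory(Base):
    """Concrete translation factory: creates translation objects and initializes them."""
    registry = {}                                    # name -> translation class (assumed)

    def create(self, name, config):
        translator = self.registry[name]()
        translator.init_model(config)                # initializer called at creation time
        return translator

class TransProxy:
    """Proxy combining a translator and a segmenter chosen by the configuration file.

    Callers only see the proxy; input processing and result conversion happen here."""
    def __init__(self, translator, segmenter):
        self.translator = translator
        self.segmenter = segmenter

    def translate(self, text):
        tokens = self.segmenter.segment(text)        # word segmentation component
        return "".join(self.translator.translate(tokens))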
The front-end display function comprises a display module composed of Vue and Bootstrap components and provides human-computer interaction; Fig. 5 shows the operation effect during human-computer interaction.
In summary, applying the self-attention-based neural machine translation model to Tibetan-Chinese translation proceeds as follows: a Tibetan-Chinese bilingual corpus is first constructed; the corpus is then cleaned, word-segmented and processed with the BPE algorithm so that the original corpus becomes trainable; the model is trained on this corpus; during inference the cluster (beam) search algorithm is used; the back-end management process manages the model files; and the front-end display function provides human-computer interaction.
Fig. 6 shows a block diagram of a Tibetan language translation apparatus provided in the second embodiment of the present application, which corresponds to the Tibetan language translation method described in the foregoing embodiment of the Tibetan language translation method.
Referring to fig. 6, the tibetan translation apparatus 200 includes:
an original corpus construction module 201, configured to construct a Tibetan-Chinese bilingual parallel original corpus;
a target language database obtaining module 202, configured to preprocess the parallel original corpus of tibetan bilinguals to obtain a trainable parallel target corpus of tibetan bilinguals;
a language sequence obtaining module 203, configured to obtain a source language sequence and a target language sequence according to the tibetan bilingual parallel target corpus;
a language vector obtaining module 204, configured to perform vector expansion on the source language sequence and the target language sequence to obtain a source language vector and a target language vector;
a language vector processing module 205, configured to input the source language vector into an encoder module for processing to obtain a processing result, and input the target language vector and the processing result into a decoder module for training to obtain an output vector;
a mapping module 206 for mapping the output vector back to a target language dictionary;
a training model obtaining module 207, configured to calculate a probability value of each word in the target language sequence, and output a probability value in a vector form to obtain a training model;
and the inference module 208 is configured to infer the training model by using a cluster search algorithm to obtain a translation model.
Further, the preprocessing the Tibetan-Chinese bilingual parallel original corpus to obtain a trainable Tibetan-Chinese bilingual parallel target corpus includes:
segmenting the sentence-level original corpus pair in the Tibetan-Chinese parallel original corpus according to a preset segmentation algorithm to obtain word-level corpus pairs;
and processing the original corpus pairs of the word level in the Tibetan-Chinese bilingual parallel original corpus and the obtained corpus pairs of the word level according to a preset BPE byte pair coding algorithm.
Further, before segmenting the sentence-level original corpus in the tibetan bilingual parallel original corpus according to a preset segmentation algorithm to obtain word-level corpus pairs, sequentially performing length filtering, length truncation and mode filtering on the original corpus pairs in the tibetan bilingual parallel original corpus, wherein,
the length filtering process comprises the following steps: filtering the original corpus with the length smaller than a preset short threshold value in the Tibetan-Chinese bilingual parallel original corpus;
the length truncation processing procedure comprises the following steps: truncating the original corpus pairs with the length larger than a preset long threshold value in the Tibetan-Chinese bilingual parallel original corpus, dividing the original corpus pairs into at least two corpus pairs, and enabling the length of each divided corpus pair to be smaller than or equal to the preset long threshold value;
the mode filtering processing procedure comprises the following steps: and filtering the original corpus pairs meeting preset filtering rules in the Tibetan-Chinese bilingual parallel original corpus.
Further, the vector expansion of the source language sequence and the target language sequence to obtain a source language vector and a target language vector includes:
expanding the source language sequence and the target language sequence from a real number sequence into a D-dimensional real number vector;
and embedding the position information into the D-dimensional real number vector to obtain the source language vector and the target language vector.
Further, the inputting the source language vector into an encoder module for processing to obtain a processing result includes:
dividing the source language vector into h heads, and applying a self-attention mechanism to each head to obtain a processed source language sequence vector;
processing the source language sequence vector by using a feedforward layer to generate a new expression vector corresponding to the source language sequence vector;
the encoder module calculates the new representation vector corresponding to the source language sequence vector at each layer of the encoder module using residual concatenation and LN regularization using a residual concatenation structure;
the processing steps are repeated until the last layer of the encoder module is reached, and K, V matrix of the self-attention mechanism is obtained, wherein the K, V matrix is the processing result.
Further, the inputting the target language vector and the processing result into a decoder module for training to obtain an output vector includes:
processing the target language vector through an attention mechanism with a mask to obtain a Q matrix;
performing multi-head attention processing on the K, V matrix and the Q matrix to obtain a target language sequence vector;
processing the target language sequence vector by using a feedforward layer to generate a new expression vector corresponding to the target language sequence vector;
the decoder module uses a residual connection structure to calculate the new representation vector corresponding to the target language sequence vector at each layer of the decoder module by using residual connection and LN regularization;
and repeating the processing steps until the last layer of the decoder module is reached to obtain the output vector.
Further, the deducing the training model by using the cluster search algorithm to obtain a translation model includes:
selecting a maximum search branch number k;
putting the target language sequence into the training model, selecting the first k values each time, and deducing;
and comparing the prediction results of the k branches, selecting the maximum value, and circulating until the whole sequence is processed to obtain the translation model.
Further, the calculating a probability value of each word in the target language sequence, and outputting the probability value in a vector form to obtain a training model includes:
and calculating the probability value of each word in the target language sequence by using a Softmax function and taking the cross entropy as a loss function, and outputting the probability value in a vector form to obtain the training model.
It should be noted that, because the contents of information interaction, execution process, and the like between the above devices/modules are based on the same concept as that of the embodiment of the tibetan chinese translation method of the present application, specific functions and technical effects thereof may be specifically referred to in the section of the embodiment of the tibetan chinese translation method, and details thereof are not described herein again.
It will be clear to those skilled in the art that, for convenience and simplicity of description, the above-mentioned division of the functional modules is merely used as an example, and in practical applications, the above-mentioned function distribution can be performed by different functional modules according to needs, that is, the internal structure of the tibetan translation apparatus 200 is divided into different functional modules to perform all or part of the above-mentioned functions. Each functional module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional modules are only used for distinguishing one functional module from another, and are not used for limiting the protection scope of the application. The specific working process of each functional module in the above description may refer to the corresponding process in the foregoing embodiment of the tibetan-chinese translation method, and is not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (9)

1. A Tibetan-Chinese translation method, comprising:
constructing a Tibetan-Chinese bilingual parallel original corpus;
preprocessing the Tibetan-Chinese bilingual parallel original corpus to obtain a trainable Tibetan-Chinese bilingual parallel target corpus;
acquiring a source language sequence and a target language sequence according to the Tibetan-Chinese bilingual parallel target corpus;
performing vector expansion on the source language sequence and the target language sequence to obtain a source language vector and a target language vector;
inputting the source language vector into an encoder module for processing to obtain a processing result, and inputting the target language vector and the processing result into a decoder module for training to obtain an output vector;
mapping the output vector back to a target language dictionary;
calculating the probability value of each word in the target language sequence, and outputting the probability value in a vector form to obtain a training model;
and deducing the training model by using a cluster search algorithm to obtain a translation model.
2. The Tibetan-Chinese translation method of claim 1, wherein the preprocessing of the Tibetan-Chinese bilingual parallel original corpus to obtain a trainable Tibetan-Chinese bilingual parallel target corpus comprises:
segmenting the sentence-level original corpus pair in the Tibetan-Chinese parallel original corpus according to a preset segmentation algorithm to obtain word-level corpus pairs;
and processing the original corpus pairs of the word level in the Tibetan-Chinese bilingual parallel original corpus and the obtained corpus pairs of the word level according to a preset BPE byte pair coding algorithm.
3. The Tibetan-Chinese translation method according to claim 2, wherein before segmenting the original corpus pairs at sentence level in the Tibetan-Chinese bilingual parallel original corpus according to a preset segmentation algorithm to obtain the corpus pairs at word level, the original corpus pairs in the Tibetan-Chinese bilingual parallel original corpus are further processed by length filtering, length truncation and mode filtering in sequence, wherein,
the length filtering process comprises the following steps: filtering the original corpus with the length smaller than a preset short threshold value in the Tibetan-Chinese bilingual parallel original corpus;
the length truncation processing procedure comprises the following steps: truncating the original corpus pairs with the length larger than a preset long threshold value in the Tibetan-Chinese bilingual parallel original corpus, dividing the original corpus pairs into at least two corpus pairs, and enabling the length of each divided corpus pair to be smaller than or equal to the preset long threshold value;
the mode filtering processing procedure comprises the following steps: and filtering the original corpus pairs meeting preset filtering rules in the Tibetan-Chinese bilingual parallel original corpus.
4. The Tibetan-Chinese translation method of claim 1, wherein the vector expansion of the source language sequence and the target language sequence to obtain a source language vector and a target language vector comprises:
expanding the source language sequence and the target language sequence from a real number sequence into a D-dimensional real number vector;
and embedding the position information into the D-dimensional real number vector to obtain the source language vector and the target language vector.
5. The Tibetan language translation method of claim 1, wherein the inputting of the source language vector into an encoder module for processing results in processing results comprises:
dividing the source language vector into h heads, and applying a self-attention mechanism to each head to obtain a processed source language sequence vector;
processing the source language sequence vector by using a feedforward layer to generate a new expression vector corresponding to the source language sequence vector;
the encoder module calculates the new representation vector corresponding to the source language sequence vector at each layer of the encoder module using residual concatenation and LN regularization using a residual concatenation structure;
the processing steps are repeated until the last layer of the encoder module is reached, and K, V matrix of the self-attention mechanism is obtained, wherein the K, V matrix is the processing result.
6. The Tibetan language translation method of claim 5, wherein the inputting the target language vector and the processing result into a decoder module for training to obtain an output vector comprises:
processing the target language vector through an attention mechanism with a mask to obtain a Q matrix;
performing multi-head attention processing on the K, V matrix and the Q matrix to obtain a target language sequence vector;
processing the target language sequence vector by using a feedforward layer to generate a new expression vector corresponding to the target language sequence vector;
the decoder module uses a residual connection structure to calculate the new representation vector corresponding to the target language sequence vector at each layer of the decoder module by using residual connection and LN regularization;
and repeating the processing steps until the last layer of the decoder module is reached to obtain the output vector.
7. The Tibetan-Han translation method of claim 1, wherein the inferring the training model using a cluster search algorithm to obtain a translation model comprises:
selecting a maximum search branch number k;
putting the target language sequence into the training model, selecting the first k values each time, and deducing;
and comparing the prediction results of the k branches, selecting the maximum value, and circulating until the whole sequence is processed to obtain the translation model.
8. The Tibetan-Han translation method of claim 1, wherein the calculating a probability value of each occurrence of a word in the target language sequence and outputting the probability value in a vector form to obtain a training model comprises:
and calculating the probability value of each word in the target language sequence by using a Softmax function and taking the cross entropy as a loss function, and outputting the probability value in a vector form to obtain the training model.
9. A Tibetan-Chinese translation device, comprising:
the original corpus establishing module is used for establishing a Tibetan-Chinese bilingual parallel original corpus;
the target language database acquisition module is used for preprocessing the Tibetan-Chinese bilingual parallel original corpus to obtain a trainable Tibetan-Chinese bilingual parallel target corpus;
the language sequence acquisition module is used for acquiring a source language sequence and a target language sequence according to the Tibetan-Chinese bilingual parallel target corpus;
the language vector acquisition module is used for performing vector expansion on the source language sequence and the target language sequence to obtain a source language vector and a target language vector;
the language vector processing module is used for inputting the source language vector into the encoder module for processing to obtain a processing result, and inputting the target language vector and the processing result into the decoder module for training to obtain an output vector;
a mapping module for mapping the output vector back to a target language dictionary;
the training model acquisition module is used for calculating the probability value of each word in the target language sequence and outputting the probability value in a vector form to obtain a training model;
and the inference module is used for inferring the training model by using a cluster search algorithm to obtain a translation model.
CN202010987775.8A 2020-09-18 2020-09-18 Tibetan-Chinese translation method and device Pending CN112084794A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010987775.8A CN112084794A (en) 2020-09-18 2020-09-18 Tibetan-Chinese translation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010987775.8A CN112084794A (en) 2020-09-18 2020-09-18 Tibetan-Chinese translation method and device

Publications (1)

Publication Number Publication Date
CN112084794A true CN112084794A (en) 2020-12-15

Family

ID=73738953

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010987775.8A Pending CN112084794A (en) 2020-09-18 2020-09-18 Tibetan-Chinese translation method and device

Country Status (1)

Country Link
CN (1) CN112084794A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109190131A (en) * 2018-09-18 2019-01-11 北京工业大学 Unified prediction of English words and their capitalization based on neural machine translation
CN109492232A (en) * 2018-10-22 2019-03-19 内蒙古工业大学 Transformer-based Mongolian-Chinese machine translation method with enhanced semantic feature information
CN110334361A (en) * 2019-07-12 2019-10-15 电子科技大学 Neural machine translation method for low-resource languages
CN110674646A (en) * 2019-09-06 2020-01-10 内蒙古工业大学 Mongolian-Chinese machine translation system based on byte pair encoding technology
CN110765784A (en) * 2019-09-12 2020-02-07 内蒙古工业大学 Mongolian-Chinese machine translation method based on dual learning
CN110879940A (en) * 2019-11-21 2020-03-13 哈尔滨理工大学 Machine translation method and system based on deep neural network

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
慈祯嘉措; 桑杰端珠; 孙茂松; 色差甲; 周毛先: "Research on a Tibetan-Chinese Machine Translation Method Incorporating a Monolingual Language Model", 中文信息学报 (Journal of Chinese Information Processing), vol. 1, no. 12, pages 125 - 70 *
李京谕 et al.: "Document-Level Machine Translation Based on a Joint Attention Mechanism", 中文信息学报 (Journal of Chinese Information Processing), vol. 33, no. 12, pages 45 - 53 *
桑杰端珠: "Research on Tibetan-Chinese Machine Translation under Low-Resource Conditions", 硕士电子期刊 (master's thesis, electronic journal), no. 01, pages 138 - 2518 *
沙九; 冯冲; 张天夫; 郭宇航; 刘芳: "Research on Tibetan-Chinese Bidirectional Neural Machine Translation with Multiple Segmentation Granularity Strategies", 厦门大学学报(自然科学版) (Journal of Xiamen University, Natural Science Edition), vol. 59, no. 02, pages 71 - 77 *
赵亚平; 苏依拉; 牛向华; 仁庆道尔吉: "A Mongolian-Chinese Machine Translation Method Based on Neural Network Transfer Learning", 计算机应用与软件 (Computer Applications and Software), no. 01, pages 185 - 191 *
赵阳 et al.: "Research on Neural Machine Translation Techniques for Low-Resource Minority-Chinese Language Pairs", 江西师范大学学报(自然科学版) (Journal of Jiangxi Normal University, Natural Science Edition), vol. 43, no. 06, pages 630 - 637 *
高芬 et al.: "Research on Mongolian-Chinese Neural Machine Translation Based on Transformer", 计算机应用与软件 (Computer Applications and Software), vol. 37, no. 2, pages 141 - 146 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112613326A (en) * 2020-12-18 2021-04-06 北京理工大学 Tibetan language neural machine translation method fusing syntactic structure
CN112613326B (en) * 2020-12-18 2022-11-08 北京理工大学 Tibetan language neural machine translation method fusing syntactic structure
CN113157865A (en) * 2021-04-25 2021-07-23 平安科技(深圳)有限公司 Cross-language word vector generation method and device, electronic equipment and storage medium
CN113157865B (en) * 2021-04-25 2023-06-23 平安科技(深圳)有限公司 Cross-language word vector generation method and device, electronic equipment and storage medium
WO2022267674A1 (en) * 2021-06-22 2022-12-29 康键信息技术(深圳)有限公司 Deep learning-based text translation method and apparatus, device and storage medium
CN113515959A (en) * 2021-06-23 2021-10-19 网易有道信息技术(北京)有限公司 Training method of machine translation model, machine translation method and related equipment
CN115329783A (en) * 2022-08-09 2022-11-11 拥措 Tibetan Chinese neural machine translation method based on cross-language pre-training model

Similar Documents

Publication Publication Date Title
CN112084794A (en) Tibetan-Chinese translation method and device
CN110175221B (en) Junk short message identification method by combining word vector with machine learning
CN110442711B (en) Text intelligent cleaning method and device and computer readable storage medium
CN114970522B (en) Pre-training method, device, equipment and storage medium of language model
CN110826298B (en) Statement coding method used in intelligent auxiliary password-fixing system
CN111079447B (en) Chinese-oriented pre-training method and system
CN111460797B (en) Keyword extraction method and device, electronic equipment and readable storage medium
CN111680494A (en) Similar text generation method and device
CN107357785A (en) Theme feature word abstracting method and system, feeling polarities determination methods and system
US11615247B1 (en) Labeling method and apparatus for named entity recognition of legal instrument
CN115759119B (en) Financial text emotion analysis method, system, medium and equipment
CN113516136A (en) Handwritten image generation method, model training method, device and equipment
CN111597807B (en) Word segmentation data set generation method, device, equipment and storage medium thereof
CN112417891A (en) Text relation automatic labeling method based on open type information extraction
CN111090734B (en) Method and system for optimizing machine reading understanding capability based on hierarchical attention mechanism
CN112418320A (en) Enterprise association relation identification method and device and storage medium
Zhao et al. Recognition of the agricultural named entities with multifeature fusion based on albert
CN111858933A (en) Character-based hierarchical text emotion analysis method and system
CN111402365A (en) Method for generating picture from characters based on bidirectional architecture confrontation generation network
CN111353032B (en) Community question and answer oriented question classification method and system
CN110597982A (en) Short text topic clustering algorithm based on word co-occurrence network
CN113726730A (en) DGA domain name detection method and system based on deep learning algorithm
CN113779966A (en) Mongolian emotion analysis method of bidirectional CNN-RNN depth model based on attention
CN111581332A (en) Similar judicial case matching method and system based on triple deep hash learning
CN111090462A (en) API (application program interface) matching method and device based on API document

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination