CN113761944B - Corpus processing method, device and equipment for translation model and storage medium

Corpus processing method, device and equipment for translation model and storage medium

Info

Publication number
CN113761944B
Authority
CN
China
Prior art keywords
parallel
sentence
quality
original
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110553522.4A
Other languages
Chinese (zh)
Other versions
CN113761944A (en)
Inventor
王龙跃
刘宏烨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110553522.4A
Publication of CN113761944A
Application granted
Publication of CN113761944B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/42 Data-driven translation
    • G06F 40/44 Statistical methods, e.g. probability models
    • G06F 40/51 Translation evaluation
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to a corpus processing method, apparatus, and device for a translation model, and a storage medium, in the technical field of natural language processing. The method comprises the following steps: acquiring an original training corpus used for training a translation model; obtaining at least two groups of trained general language models whose model structures differ from group to group, obtaining quality scores corresponding to the parallel sentences in the original training corpus through each group of general language models, and filtering the original training corpus according to the quality scores to obtain a high-quality training corpus; and obtaining, through a trained target domain language model and a general domain language model, domain scores corresponding to the parallel sentences in the high-quality training corpus, and screening the high-quality training corpus of the target domain from the high-quality training corpus according to the domain scores. With this method, corpus of the target domain can be screened while high quality is guaranteed, so that the resulting training corpus can greatly improve the translation performance of the translation model.

Description

Corpus processing method, device and equipment for translation model and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a corpus processing method, apparatus, computer device, and storage medium for a translation model.
Background
Currently, neural networks are widely used in the field of artificial intelligence, including speech recognition, computer vision, and natural language processing, and neural network models perform excellently on various natural language processing tasks such as machine translation. In machine translation, as the scale of translation corpora has continued to grow in recent years, translation model performance improved markedly at first, which shows that large-scale corpora play a very large role in training translation models; later, however, training a translation model with an even larger corpus no longer yields further improvement.
The inventors' research identified two reasons for this: 1) sentence quality in large-scale corpora is uneven, and the corpora contain considerable noise; 2) translation corpora drawn from different domains exhibit distribution differences, and the domains of large-scale corpora are unevenly distributed.
At present, large-scale corpora are only cleaned with manual rules or filtered on a single quality dimension with a single language model. These approaches do not process large-scale corpora comprehensively enough, so the resulting corpora still fail to improve translation performance.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a corpus processing method, apparatus, computer device, and storage medium for a translation model, so as to obtain a training corpus that is both high-quality and specific to a target domain for model training, thereby improving the performance of the translation model in the target domain.
A method for processing a training corpus of a translation model, the method comprising:
acquiring an original training corpus used for training a translation model;
obtaining at least two groups of trained universal language models, obtaining quality scores corresponding to parallel sentences in the original training corpus through each group of universal language models, and filtering the original training corpus according to the quality scores to obtain training corpus meeting preset quality conditions, wherein the model structures of each group of universal language models are different;
obtaining, through a trained target domain language model and a general domain language model, domain scores corresponding to parallel sentences in the training corpus meeting the preset quality condition, and screening, according to the domain scores, a training corpus of the target domain that meets the preset quality condition from the training corpus meeting the preset quality condition, wherein the target domain language model and the general domain language model have the same model structure;
The selected training corpus is used to perform model training on the translation model to obtain a translation model for the target domain.
A corpus processing apparatus of a translation model, the apparatus comprising:
the corpus acquisition module is used for acquiring an original training corpus used for training the translation model;
the quality filtering module is used for acquiring at least two groups of trained general language models, obtaining quality scores corresponding to parallel sentences in the original training corpus through each group of general language models, filtering the original training corpus according to the quality scores to obtain training corpus meeting preset quality conditions, wherein the model structures of each group of general language models are different;
the domain screening module is used for obtaining, through a trained target domain language model and a general domain language model, domain scores corresponding to parallel sentences in the training corpus meeting the preset quality condition, and for screening, according to the domain scores, a training corpus of the target domain that meets the preset quality condition from the training corpus meeting the preset quality condition, wherein the target domain language model and the general domain language model have the same model structure; the selected training corpus is used to perform model training on the translation model to obtain a translation model for the target domain.
In one embodiment, the quality filtering module is further configured to score, through each group of general language models, the original text sentence and the translated text sentence in each parallel sentence in the original training corpus, so as to obtain the original text quality score and the translated text quality score of each parallel sentence corresponding to each group of general language models; and to fuse the original text quality scores and the translated text quality scores of each parallel sentence corresponding to each group of general language models to obtain the quality score corresponding to each parallel sentence.
In one embodiment, the quality filtering module comprises an original scoring unit and a translated version scoring unit;
the original text scoring unit is used for scoring the original text sentences in each parallel sentence in the original training corpus through the original text language model in each group of general language models, and respectively obtaining the original text quality scores of the parallel sentences;
the translation scoring unit is used for scoring the translation sentences in each parallel sentence in the original training corpus through the translation language models in each group of general language models, and obtaining the translation quality scores of the parallel sentences respectively.
In one embodiment, the quality filtering module is further configured to normalize the textual quality scores of parallel sentences obtained by the same set of general language models according to the highest score and the lowest score in the textual quality scores of parallel sentences in the original training corpus obtained by the same set of general language models, so as to obtain normalized textual quality scores; normalizing the translation quality scores of the parallel sentences obtained by the same group of general language models according to the highest score and the lowest score in the translation quality scores of the parallel sentences in the original training corpus obtained by the same group of general language models to obtain normalized translation quality scores; and merging the normalized original text quality scores and the normalized translated text quality scores of the parallel sentences corresponding to each group of general language models to obtain the quality scores corresponding to the parallel sentences.
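As an illustrative sketch (not part of the claimed subject matter), the min-max normalization described above can be expressed in Python; the function name and list-based interface are assumptions for illustration:

    def min_max_normalize(scores):
        """Scale one model group's quality scores into [0, 1] using the
        highest and lowest scores that group produced over the corpus."""
        lo, hi = min(scores), max(scores)
        if hi == lo:                      # degenerate case: all scores equal
            return [0.0 for _ in scores]
        return [(s - lo) / (hi - lo) for s in scores]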
In one embodiment, the quality filtering module is further configured to sum the textual quality score and the translation quality score of each set of generic language models to obtain a set-level score; obtaining a weighting coefficient corresponding to each group of general language models; and carrying out weighted summation on the group-level scores of the parallel sentences corresponding to each group of the universal language models based on the weighting coefficients corresponding to each group of the universal language models, and obtaining the quality scores corresponding to the parallel sentences.
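The group-level summation and weighted fusion just described might look as follows; this is a minimal sketch in which the score pairing and the equal-length weight list are assumed interfaces, not taken from the text:

    def fuse_quality_scores(group_scores, weights):
        """For one parallel sentence:
        group_scores: one (original_score, translation_score) pair per
        language-model group, already normalized;
        weights: one weighting coefficient per group."""
        group_level = [src + tgt for src, tgt in group_scores]    # sum within each group
        return sum(w * g for w, g in zip(weights, group_level))   # weighted sum across groups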
In one embodiment, when the general language model is a statistical language model obtained based on a high-quality corpus, the quality filtering module is further configured to sequentially obtain parallel sentences from the original training corpus; input the original text sentence of a parallel sentence into the statistical language model of the original text, and obtain, through that model, the original text quality score of the parallel sentence based on the conditional frequency corresponding to each word in the original text sentence; input the translated text sentence of the parallel sentence into the statistical language model of the translated text, and obtain, through that model, the translated text quality score of the parallel sentence based on the conditional frequency corresponding to each word in the translated text sentence; and fuse the original text quality score and the translated text quality score of each parallel sentence to obtain the quality score of each parallel sentence corresponding to the statistical language model.
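A toy statistical (bigram) scorer along these lines is sketched below. It is a simplified stand-in with add-one smoothing; the embodiment may instead use a KenLM-style N-gram model with more sophisticated smoothing:

    import math
    from collections import Counter

    class BigramLM:
        """Scores a sentence by the average log conditional frequency of
        each word given its predecessor (illustrative only)."""
        def __init__(self, corpus):
            self.bigrams, self.unigrams = Counter(), Counter()
            for sent in corpus:
                tokens = ["<s>"] + sent.split()
                self.unigrams.update(tokens)
                self.bigrams.update(zip(tokens, tokens[1:]))

        def score(self, sentence):
            tokens = ["<s>"] + sentence.split()
            logp = 0.0
            for prev, word in zip(tokens, tokens[1:]):
                # add-one smoothing so unseen pairs do not zero out the score
                p = (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + len(self.unigrams))
                logp += math.log(p)
            return logp / (len(tokens) - 1)   # length-normalized quality score

An original text model and a translated text model would each be built this way, and their two scores fused as described above.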
In one embodiment, when the generic language model is an autoregressive language model, the quality filtering module is further configured to sequentially obtain parallel sentences from the original training corpus; inputting an original sentence in the parallel sentence into an autoregressive language model of the original sentence, predicting the conditional probability of each word from left to right or from right to left in the original sentence through the autoregressive language model of the original sentence, and obtaining an original quality score of the parallel sentence according to the conditional probability corresponding to each word; inputting the translated sentence in the parallel sentence into an autoregressive language model of the translated sentence, predicting the conditional probability of each word from left to right or from right to left in the translated sentence through the autoregressive language model of the translated sentence, and obtaining the translated quality score of the parallel sentence according to the conditional probability corresponding to each word; and fusing the original text quality scores and the translated text quality scores of the parallel sentences to obtain the quality scores of the autoregressive language models corresponding to the parallel sentences.
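The autoregressive scoring loop reduces to the chain rule of probability. In this sketch, next_token_logprob stands in for the trained left-to-right model (for example, a GPT-style network); its interface is an assumption for illustration:

    def autoregressive_score(tokens, next_token_logprob):
        """Length-normalized log-probability of a sentence under a
        left-to-right autoregressive language model."""
        total = 0.0
        for i, token in enumerate(tokens):
            total += next_token_logprob(tokens[:i], token)   # log P(w_i | w_1..w_{i-1})
        return total / max(len(tokens), 1)

A right-to-left variant simply runs the same loop over the reversed sequence.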
In one embodiment, when the generic language model is a self-coding language model, the quality filtering module is further configured to sequentially obtain parallel sentences from the original training corpus; sequentially taking each word in the original sentence of the parallel sentence as a masking word, inputting the masked original sentence into a self-coding language model of the original sentence, outputting a prediction probability corresponding to the masking word through the self-coding language model of the original sentence, and obtaining an original quality score of the parallel sentence according to the prediction probability corresponding to each masking word; sequentially taking each word in the translation sentence of the parallel sentence as a masking word, inputting the masked translation sentence into a self-coding language model of the translation, outputting a prediction probability corresponding to the masking word through the self-coding language model of the translation, and obtaining a translation quality score of the parallel sentence according to the prediction probability corresponding to each masking word; and fusing the original text quality score and the translated text quality score of each parallel sentence to obtain the quality score of each parallel sentence corresponding to the self-coding language model.
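The mask-one-word-at-a-time procedure corresponds to a pseudo-log-likelihood. Here masked_token_prob is an assumed stand-in for a BERT-style masked language model, not an API from the text:

    import math

    def masked_lm_score(tokens, masked_token_prob, mask="[MASK]"):
        """Mask each position in turn and sum the log-probability the
        self-encoding model assigns to the original token there."""
        total = 0.0
        for i, token in enumerate(tokens):
            masked = tokens[:i] + [mask] + tokens[i + 1:]
            total += math.log(masked_token_prob(masked, i, token))
        return total / max(len(tokens), 1)   # length-normalized quality score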
In one embodiment, the domain screening module is further configured to score, through a target domain language model of the original text, the original text sentence in each parallel sentence in the training corpus meeting the preset quality condition, so as to obtain a first domain score corresponding to the parallel sentence; scoring original text sentences in each parallel sentence in the training corpus meeting a preset quality condition through a general field language model of the original text to obtain a second field score corresponding to the parallel sentence; and obtaining the domain scores corresponding to the parallel sentences in the training corpus meeting the preset quality conditions according to the difference between the first domain scores corresponding to the parallel sentences and the second domain scores.
In one embodiment, the domain screening module is further configured to score, through a target domain language model of the translation, translation sentences in each parallel sentence in the training corpus that meets a preset quality condition, and obtain a third domain score corresponding to the parallel sentence; scoring translation sentences in each parallel sentence in the training corpus meeting a preset quality condition through a universal field language model of the translation to obtain a fourth field score corresponding to the parallel sentence; and obtaining the domain scores corresponding to the parallel sentences in the training corpus meeting the preset quality condition according to the difference between the third domain score corresponding to the parallel sentences and the fourth domain score.
In one embodiment, the domain screening module is further configured to score, through a target domain language model of the original text, the original text sentence in each parallel sentence in the training corpus meeting the preset quality condition, so as to obtain a first domain score corresponding to the parallel sentence; scoring original text sentences in each parallel sentence in the training corpus meeting a preset quality condition through a general field language model of the original text to obtain a second field score corresponding to the parallel sentence; scoring translation sentences in each parallel sentence in the training corpus meeting a preset quality condition through a target domain language model of the translation to obtain a third domain score corresponding to the parallel sentence; scoring translation sentences in each parallel sentence in the training corpus meeting a preset quality condition through a universal field language model of the translation to obtain a fourth field score corresponding to the parallel sentence; and fusing the difference between the first domain score and the second domain score corresponding to each parallel sentence and the difference between the third domain score and the fourth domain score to obtain the domain score corresponding to each parallel sentence in the training corpus meeting the preset quality condition.
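Combining the four domain scores of this embodiment can be sketched as below; averaging the two differences is one plausible choice of fusion, since the text does not fix the exact operation:

    def parallel_sentence_domain_score(first, second, third, fourth):
        """first/second: target- and general-domain scores of the original
        text; third/fourth: the same for the translation. Higher output
        means closer to the target domain."""
        original_diff = first - second      # original-text domain difference
        translation_diff = third - fourth   # translation domain difference
        return 0.5 * (original_diff + translation_diff)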
In one embodiment, the apparatus further comprises:
the first training module is used for obtaining parallel corpus of the target field, and carrying out model training on a language model to be trained by using the parallel corpus of the target field to obtain the language model of the target field;
the second training module is used for sampling the training corpus meeting the preset quality condition to obtain a sampling corpus, and carrying out model training on the language model to be trained by using the sampling corpus to obtain the general field language model.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring an original training corpus used for training a translation model;
obtaining at least two groups of trained universal language models, obtaining quality scores corresponding to parallel sentences in the original training corpus through each group of universal language models, and filtering the original training corpus according to the quality scores to obtain training corpus meeting preset quality conditions, wherein the model structures of each group of universal language models are different;
obtaining, through a trained target domain language model and a general domain language model, domain scores corresponding to parallel sentences in the training corpus meeting the preset quality condition, and screening, according to the domain scores, a training corpus of the target domain that meets the preset quality condition from the training corpus meeting the preset quality condition, wherein the target domain language model and the general domain language model have the same model structure;
The selected training corpus meeting the preset quality condition is used to perform model training on the translation model to obtain a translation model for the target domain.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring an original training corpus used for training a translation model;
obtaining at least two groups of trained universal language models, obtaining quality scores corresponding to parallel sentences in the original training corpus through each group of universal language models, and filtering the original training corpus according to the quality scores to obtain training corpus meeting preset quality conditions, wherein the model structures of each group of universal language models are different;
obtaining, through a trained target domain language model and a general domain language model, domain scores corresponding to parallel sentences in the training corpus meeting the preset quality condition, and screening, according to the domain scores, a training corpus of the target domain that meets the preset quality condition from the training corpus meeting the preset quality condition, wherein the target domain language model and the general domain language model have the same model structure;
The selected training corpus is used to perform model training on the translation model to obtain a translation model for the target domain.
A computer program comprising computer instructions stored in a computer readable storage medium, the computer instructions being read from the computer readable storage medium by a processor of a computer device, the computer instructions being executed by the processor to cause the computer device to perform the steps of the corpus processing method of a translation model as described above.
According to the corpus processing method, apparatus, computer device, and storage medium for a translation model described above, each parallel sentence in the original training corpus is scored jointly by at least two groups of trained general language models. Compared with quality filtering based on a single language model, the integrated groups of language models, whose model structures differ from one another, can score parallel sentences comprehensively from different angles, so that a training corpus meeting the preset quality condition, that is, a high-quality training corpus, can be filtered out of the original training corpus. Further, on this basis, domain scores corresponding to the parallel sentences in the high-quality training corpus are obtained through a trained target domain language model and a general domain language model, so that a training corpus that both belongs to the target domain and meets the preset quality condition is further screened out of the high-quality training corpus for model training of the translation model in the target domain. Compared with filtering on a single quality dimension, corpus of the target domain is screened while high quality is guaranteed, which can greatly improve the translation performance of the translation model. In addition, scoring for quality first and then screening for the target domain on that high-quality basis, rather than directly performing mixed quality-and-domain scoring on the original training corpus, better ensures that the output corpus is of high quality and closest to the target domain.
Drawings
FIG. 1 is an application environment diagram of a corpus processing method of a translation model in one embodiment;
FIG. 2 is a schematic diagram of the performance of a translation model as a function of the scale of a training corpus in one embodiment;
FIG. 3 is a flow chart of a method for processing a training corpus of a translation model in one embodiment;
FIG. 4 is a schematic diagram of a corpus processing flow of a translation model in one embodiment;
FIG. 5 is a flowchart of obtaining quality scores corresponding to parallel sentences in an original training corpus according to an embodiment;
FIG. 6 is a flowchart of obtaining an original text quality score and a translated text quality score of parallel sentences corresponding to each set of universal language models according to an embodiment;
FIG. 7 is a flow chart of merging the original text quality scores and the translated text quality scores of each parallel sentence corresponding to each group of general language models to obtain quality scores corresponding to each parallel sentence in one embodiment;
FIG. 8 is a flow diagram of quality filtering in one embodiment;
FIG. 9 is a flowchart of another embodiment of merging the original text quality scores and the translated text quality scores of each parallel sentence corresponding to each group of the universal language models to obtain quality scores corresponding to each parallel sentence;
FIG. 10 is a flowchart of obtaining quality scores corresponding to parallel sentences in an original training corpus through a statistical language model in one embodiment;
FIG. 11 is a flowchart of obtaining quality scores corresponding to parallel sentences in an original training corpus by an autoregressive language model in an embodiment;
FIG. 12 is a flowchart of obtaining quality scores corresponding to parallel sentences in an original training corpus by a self-coding language model in one embodiment;
FIG. 13 is a flowchart of obtaining domain scores corresponding to parallel sentences in a training corpus satisfying a preset quality condition according to an embodiment;
FIG. 14 is a schematic diagram of domain screening of training corpus meeting preset quality conditions in one embodiment;
FIG. 15 is a block diagram of a training corpus processing device of a translation model in one embodiment;
FIG. 16 is an internal structural view of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Before introducing embodiments of the present application, some terms involved in the embodiments of the present application are explained as follows:
Deep Learning (DL for short): a branch of machine learning; algorithms that attempt to abstract data at a high level using multiple processing layers composed of complex structures or multiple nonlinear transformations.
Neural network (NN for short): a deep learning model in the fields of machine learning and cognitive science that imitates the structure and function of biological neural networks.
Machine translation (Machine Translation, MT for short): automatically translating text in one language into text in another language by means of a computer.
Statistical machine translation (Statistical Machine Translation, SMT for short): machine translation technology based on traditional statistical methods. Before the advent of neural network methods, machine translation was primarily based on statistical translation models.
Neural network machine translation (Neural Machine Translation, NMT for short): the latest generation of machine translation technology based on neural networks.
Recurrent neural network (Recurrent Neural Network, RNN for short): a network model that converts sequence modeling into temporal modeling by cycling states through its network.
Self-attention network (Self-Attention Network, SAN for short): a neural network structural model based on the self-attention mechanism.
Convolutional neural network (Convolutional Neural Network, CNN for short): consists of one or more convolutional layers and a fully connected layer at the top, together with associated weights and pooling layers.
Attention mechanism (Attention Mechanism): a method for modeling hidden state dependency of an encoder and a decoder in a neural network.
BLEU (Bilingual Evaluation Understudy): a machine translation evaluation metric; the higher its value, the better the translation.
RNNsearch: RNN-based encoder-decoder framework.
LightConv: CNN-based encoder-decoder framework.
Transformer: an encoder-decoder framework based on SAN networks; currently the most popular model structure for sequence-to-sequence generation.
Parallel corpus (parallel corpora): a bilingual or multilingual corpus composed of original texts and their parallel corresponding translations. The alignment granularity can be word level, sentence level, paragraph level, or document level. A sentence pair composed of an original text sentence and a translated text sentence is called a parallel sentence, and a large number of parallel sentences form a parallel corpus.
Language Model (LM): uses statistical and probabilistic techniques to determine the probability that a given word occurs in a sentence; a language model bases its word predictions on analysis of a body of text data. Informally, a language model can judge whether a sentence is syntactically fluent.
Masked LM (Masked Language Model): a language model that randomly masks part of the words of the input sequence and predicts those masked words, which prevents words from "seeing themselves" while still exploiting context information from both sides.
N-Gram LM (N-Gram Language Model): an algorithm based on a statistical language model. The basic idea is to slide a window of size N over the content of the text, forming a sequence of fragments of length N.
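A worked example of the sliding-window idea (illustrative Python, not from the source):

    def ngrams(tokens, n):
        """All contiguous windows of size n over a token sequence."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    # ngrams("the cat sat on the mat".split(), 2) yields
    # [('the','cat'), ('cat','sat'), ('sat','on'), ('on','the'), ('the','mat')]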
Transformer LM (Transformer Language Model): a language model that uses the Transformer, the SAN-based encoder-decoder framework that is currently the most popular neural network structure for sequence-to-sequence generation, to model and score sentences.
BERT (Bidirectional Encoder Representations from Transformers): a pre-trained neural network model that aims to obtain deep bidirectional representations by conditioning on context from both sides in all layers.
PPL (Perplexity): a metric of how well a language model predicts the sentence quality of a sample; the lower the perplexity, the higher the sentence probability.
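For reference, the standard definition of perplexity for a sentence of N words (the formula is supplied here; it does not appear in the original text) is:

    \mathrm{PPL}(w_1,\dots,w_N) = P(w_1,\dots,w_N)^{-1/N} = \exp\Big(-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i \mid w_1,\dots,w_{i-1})\Big)

so a lower perplexity corresponds directly to a higher model probability for the sentence.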
The embodiment of the present application provides a corpus processing method for a translation model that relates to artificial intelligence (Artificial Intelligence, AI) technology. Artificial intelligence is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The corpus processing method provided by the embodiment of the present application mainly relates to natural language processing (Natural Language Processing, NLP) technology, an important direction in the fields of computer science and artificial intelligence. Natural language processing studies the theories and methods that enable effective communication between humans and computers in natural language. It is a science that integrates linguistics, computer science, and mathematics; research in this field involves natural language, that is, the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
The corpus processing method provided by the embodiment of the present application mainly relates to the machine translation (Machine Translation) technology within natural language processing, in which a computer device automatically translates text in one language into text in another language. For example, in the embodiment of the present application, parallel sentences are scored for quality using trained language models (LM). For another example, after quality filtering and domain screening are performed on the original training corpus to obtain a training corpus of the target domain that meets a preset quality condition, a translation model may be trained on that corpus to obtain a translation model for translating text in the target domain.
The corpus processing method of the translation model can be applied to an application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. Each terminal 102 can crawl the training corpus from the network, and after collecting the large-scale original training corpus for training the translation model, the server 104 can obtain the original training corpus; obtaining at least two groups of trained general language models, obtaining quality scores corresponding to parallel sentences in original training corpus through each group of general language models, filtering the original training corpus according to the quality scores to obtain training corpus meeting preset quality conditions, wherein the model structures of each group of general language models are different; obtaining the domain scores corresponding to the parallel sentences in the training corpus meeting the preset quality conditions through the trained target domain language model and the universal domain language model, and screening the training corpus meeting the preset quality conditions from the training corpus meeting the preset quality conditions according to the domain scores, wherein the model structure of the target domain language model and the model structure of the universal domain language model are the same. Optionally, the server 104 may further perform model training on the translation model based on the training corpus that satisfies the preset quality condition in the selected target domain, to obtain a translation model for translating the text in the target domain.
In some embodiments, after the terminal 102 obtains the original corpus, the corpus processing method of the translation model provided in the embodiments of the present application may be directly executed, so as to obtain the corpus in the target field and meeting the preset quality condition. In some embodiments, the terminal 102 may perform model training on the translation model based on the training corpus of the screened target domain and meeting the preset quality condition to obtain a translation model for translating the text of the target domain.
The terminal 102 may be, but not limited to, various personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices, and the server 104 may be implemented by a stand-alone server or a server cluster composed of a plurality of servers. In some embodiments, the server 104 may be a blockchain node in a blockchain network, and the training corpus of the selected target domain that satisfies the preset quality condition may be stored on the blockchain node.
Both the aforementioned statistics-based statistical machine translation and the neural-network-based neural machine translation are data-driven: the performance of the translation model is closely tied to the training corpus. For example, in recent years the scale of the training corpora used to train deep-learning-based translation models has grown rapidly; FIG. 2 is a schematic diagram of the performance of a neural network translation model as a function of training corpus scale from 2016 to 2021. As the data scale increased, the translation model improved markedly at first, which shows that large-scale corpora played a very large role in improving the performance of neural machine translation models; later, however, even further increases in corpus scale no longer brought large improvements.
The inventors' research identified two reasons for this: 1) Sentence quality in large-scale corpora is uneven, and the corpora contain considerable noise. Because many large-scale parallel corpora are crawled from websites, a great deal of noise is inevitably mixed in when related corpora are collected at scale, and this data noise makes it difficult for the model to learn correct semantic representations during training. 2) Translation corpora drawn from different domains exhibit distribution differences, and the domains of large-scale corpora are unevenly distributed. The accuracy of machine translation depends to some extent on the domain distribution; translation corpora from different domains have distribution differences, and when they are mixed together, the model's learning of the characteristics of different domains suffers from mutual interference.
That is, it has become difficult to obtain a better model merely by increasing the scale of the training corpus; the quality of the data must be guaranteed while the corpus is expanded. The inventors therefore propose the corpus processing method for a translation model described herein, which performs quality filtering and domain screening on the original training corpus to obtain a high-quality corpus of the target domain before training the machine translation model. In this way, better translation performance can be achieved with less corpus, saving resources and training costs; filtering the noise out of the training corpus can further improve translation performance; and adjusting the domain distribution of the training corpus yields a translation model with better performance.
In one embodiment, as shown in fig. 3, a corpus processing method of a translation model is provided, and the method is applied to the computer device (the terminal 102 or the server 104) in fig. 1 for illustration, and includes the following steps:
step 302, an original training corpus for training a translation model is obtained.
The translation model is an algorithm that translates text in one language into text in one or more other languages by computer. The translation model to be trained may be a statistics-based statistical translation model or a neural-network-based neural translation model. The statistical translation model may be, for example, based on an N-Gram LM. The neural translation model may be, for example, a Transformer model based on an encoder-decoder framework, which may be implemented with a recurrent neural network, a convolutional neural network (CNN), or a self-attention neural network. The original training corpus used for training the translation model may be a corpus that has not undergone any filtering or screening, or a corpus that has been simply screened by hand using artificial rules, for example, filtered in a targeted way with regular expressions.
In order to train the translation model, the computer device needs to acquire the original training corpus. The original training corpus may be a single-language corpus of one language in a parallel corpus. For example, when the computer device only needs to process the original text corpus in the parallel corpus, it may obtain the original text corpus and then perform quality filtering and domain screening on it to obtain a high-quality original text corpus of the target domain. For another example, when the computer device only needs to process the translated corpus in the parallel corpus, it may obtain the original translated corpus and subsequently perform quality filtering and domain screening on it to obtain a high-quality translated corpus of the target domain. Of course, the original training corpus obtained by the computer device may also comprise both the original text corpus of the parallel corpus and its corresponding translated corpus, which are processed together to obtain a high-quality training corpus of the target domain. The parallel corpus may be a bilingual or multilingual corpus; the embodiments of the present application are described mainly in terms of bilingual corpora. The computer device may obtain the original training corpus for training the translation model from a parallel corpus database.
Step 304, obtaining at least two groups of trained universal language models, obtaining quality scores corresponding to parallel sentences in the original training corpus through each group of universal language models, and filtering the original training corpus according to the quality scores to obtain training corpus meeting preset quality conditions, wherein the model structures of each group of universal language models are different.
The trained general language model is a language model obtained by model training with a general training corpus; compared with a language model for the target domain obtained by training with a corpus of the target domain, it is more general and less domain-specific. A general language model is used first because step 304 mainly aims to improve the quality of the training corpus: high-quality corpus is ensured first, and then the subsequent step 306 performs domain screening on that high-quality basis, so that the training corpus belongs to a domain strongly correlated with the task.
The at least two groups of trained general language models for quality filtering obtained by the computer device may include a statistical language model. A statistical language model describes, from the perspective of probability and statistics, the probability distribution of different grammatical units of a document, such as words and sentences, and can measure whether a sentence or word sequence conforms to the way people ordinarily write in that language environment. For example, the N-Gram model is based on the Markov assumption, namely that the t-th word in a text is related only to the preceding N-1 words; according to the value of N, N-Gram models include unigram, bigram, and trigram models. The statistical language model may be KenLM, an N-Gram-based language model written in C++ that is faster and uses less memory.
The at least two groups of trained general language models for quality filtering obtained by the computer device may include neural network language models, which may be autoregressive language models (Autoregressive Language Model) or self-encoding language models (Autoencoder LM). An autoregressive language model predicts the probability of each word in a sentence one by one, from left to right or from right to left, that is, it predicts the next word from the preceding (or following) context; examples include GPT, which uses a unidirectional Transformer, and modified autoregressive language models that concatenate a forward and a backward LSTM. A self-encoding language model can fuse context information into the model, that is, it predicts the word at the current position using the context of that position, for example Bert-LM.
In some embodiments, the general language models of at least two different groups of model structures obtained by the computer device may include at least two of a statistical language model, an autoregressive language model, and a self-encoding language model: two, three, or more groups. Because the model structures of the groups differ, integrating several groups of language models scores parallel sentences comprehensively from different angles, so that the training corpus meeting the preset quality condition, that is, the high-quality training corpus, can be filtered out of the original training corpus. For example, the first group of trained general language models may be statistical language models, the second group autoregressive language models, and the third group self-encoding language models. Different language models have different characteristics, so their sentence scoring has different emphases; scoring sentences with only a single language model is not comprehensive enough. For example, when Bert-LM alone is used to score the corpus, it cannot score certain out-of-vocabulary words well, and some rare words are treated as out-of-vocabulary during Bert-LM processing and thus cannot be scored well, so some sentences that actually fit the training requirements are lost. An N-Gram model can alleviate this problem, which is why multiple models are integrated to perform mixed scoring and filtering on the training corpus.
In some embodiments, the general language models of at least two different groups obtained by the computer device may be language models of the same type trained in different ways. For example, the model structure of the first group of trained general language models is a statistical language model N-Gram with N equal to 5; the model structure of the second group is also an N-Gram statistical language model, but with N equal to 7; and the third group is implemented with a neural-network-based language model, which may be an autoregressive or self-encoding language model. The structures of the first and second groups are the same, but their training differs slightly, and this difference gives the models different sentence-scoring abilities; integrating the two therefore still achieves, to a certain extent, a mixed and comprehensive scoring of the corpus.
In some embodiments, the computer device may directly obtain the trained generic language model, or may self-train with some training corpus to obtain the generic language model.
After the computer device obtains at least two groups of trained general language models, each parallel sentence in the original training corpus is scored by each group of general language models to obtain the quality score corresponding to each parallel sentence. Optionally, the computer device may score only the original text sentence in each parallel sentence, take the score of the original text sentence as the quality score of the parallel sentence, and quality-filter the original training corpus accordingly. Alternatively, the computer device may score only the translated text sentence in each parallel sentence, take the score of the translated text sentence as the quality score of the parallel sentence, and quality-filter the original training corpus accordingly. Optionally, the computer device may score both the original text sentence and the translated text sentence in each parallel sentence to obtain the quality score of the parallel sentence, and quality-filter the original training corpus accordingly.
It can be understood that, when the computer device scores both the original text sentence and the translated text sentence in each parallel sentence of the original training corpus to obtain the quality score of each parallel sentence, each group of trained general language models obtained by the computer device needs to include an original text language model and a translated text language model with the same model structure. For example, the first group of trained general language models includes a statistical language model of the original text and a statistical language model of the translated text, the second group includes an autoregressive language model of the original text and an autoregressive language model of the translated text, and the third group includes a self-encoding language model of the original text and a self-encoding language model of the translated text.
If the computer device only needs to score the original text sentence or the translated text sentence in each parallel sentence of the original training corpus to achieve quality filtering, then each of the at least two groups of trained general language models it obtains only needs to include an original text language model or a translated text language model. For example, the first group of general language models is a statistical language model of the original text, the second group is an autoregressive language model of the original text, and the third group is a self-encoding language model of the original text. For another example, the first group is a statistical language model of the translated text, the second group is an autoregressive language model of the translated text, and the third group is a self-encoding language model of the translated text.
Since the at least two groups of general language models obtained by the computer device are trained language models, they have the ability to determine which word sequences are more likely, or to predict the most likely next word given a preceding context or several context words. The predicted likelihood of the next word indicates, to some extent, the fluency of a sentence, that is, its quality, and can therefore serve as a quality indicator, namely a quality score. The higher the predicted probability of each word in a parallel sentence, the more fluent the sentence, so the training corpus meeting the preset quality condition, that is, the high-quality training corpus, can be filtered out of the original training corpus according to the quality scores. The preset quality condition is a preset filtering condition for selecting high-quality corpus from the original training corpus and can be set as needed; for example, the quality-score rank exceeds a preset threshold, the quality score exceeds a preset threshold, or the quality-score percentile exceeds a preset threshold. Specifically, after obtaining the quality scores of the parallel sentences in the original training corpus, the computer device can filter the high-quality training corpus from the original training corpus according to the quality scores and the preset quality condition. For example, the computer device may keep the sentences whose quality scores rank in the top M, where M may be a percentage or a rank count, or keep the sentences whose quality scores exceed N, where N is a preset threshold.
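A minimal sketch of such threshold- or rank-based filtering; the parameter names and the half-the-corpus default are assumptions for illustration:

    def filter_high_quality(sentences, scores, top_ratio=0.5, min_score=None):
        """Keep sentences meeting a preset quality condition: either all
        sentences scoring above min_score, or the top fraction by score."""
        if min_score is not None:
            return [s for s, sc in zip(sentences, scores) if sc > min_score]
        ranked = sorted(zip(sentences, scores), key=lambda p: p[1], reverse=True)
        keep = max(1, int(len(ranked) * top_ratio))
        return [s for s, _ in ranked[:keep]]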
Step 306, obtaining, through a trained target domain language model and a general domain language model, the domain scores corresponding to the parallel sentences in the training corpus meeting the preset quality condition, and screening, according to the domain scores, the training corpus of the target domain that meets the preset quality condition from the training corpus meeting the preset quality condition, wherein the model structure of the target domain language model and the model structure of the general domain language model are the same.
At present, some corpus filtering methods filter only for quality with a single model, whereas the inventors consider that translation must guarantee both quality and domain. So that the computer device can select domain-specific data for domain-specific model training on the basis of guaranteed high-quality data, after the high-quality training corpus is obtained through the first-step quality filtering, the computer device obtains a target domain language model and a general domain language model and uses them to further screen out the high-quality training corpus of the target domain.
The target domain language model is a language model trained on a training corpus of the target domain; the target domain is, for example, the news, financial, medical, or computer-technology domain. The general domain language model is a language model of a more general domain whose domain correlation is weaker than that of the target domain language model. The target domain language model and the general domain language model have the same model structure. The computer device may input the training corpus meeting the preset quality condition, obtained after quality filtering, into the target domain language model and the general domain language model respectively, obtain domain scores according to the difference between the outputs of the two models, and, according to the domain scores, screen out the sentences with stronger target-domain correlation, that is, the training corpus of the target domain that meets the preset quality condition.
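Written compactly (a sketch: the text specifies only that the domain score derives from the difference between the two models' outputs), with log-probability outputs the domain score of a sentence s would be

    \mathrm{score}(s) = \log P_{\mathrm{target}}(s) - \log P_{\mathrm{general}}(s)

and the sentences with the largest scores are kept; this resembles cross-entropy difference data selection.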
The target domain language model has the same model structure as the general domain language model. For example, the two models may both be statistical language models, both be autoregressive language models, or both be self-encoding language models.
Alternatively, the target domain language model may include a target domain language model of the original text and a target domain language model of the translated text; correspondingly, the general domain language model may include a general domain language model of the original text and a general domain language model of the translated text. For example, the target domain language model includes a statistical language model of the original text trained on original texts of the target domain and a statistical language model of the translated text trained on translated texts of the target domain, and the general domain language model likewise includes a statistical language model of the original text and a statistical language model of the translated text.
Optionally, since the high-quality training corpus was screened from the original training corpus in the previous step, both the original text and the translated text of the parallel sentences in the high-quality training corpus can already be considered high quality, and the current step only needs to perform a secondary screening targeting domain correlation. The target domain language model may therefore include only a target domain language model of the original text, and correspondingly the general domain language model may include only a general domain language model of the original text. The computer device uses the target domain language model of the original text and the general domain language model of the original text to obtain the domain score corresponding to the original text sentence in each parallel sentence of the high-quality training corpus, as the domain score corresponding to that parallel sentence. For example, the target domain language model is a statistical language model of the original text, and the general domain language model is also a statistical language model of the original text.
Optionally, the target domain language model may include only a target domain language model of the translated text, and correspondingly the general domain language model may include only a general domain language model of the translated text. The computer device uses the target domain language model of the translated text and the general domain language model of the translated text to obtain the domain score corresponding to the translated text sentence in each parallel sentence of the training corpus, as the domain score corresponding to that parallel sentence. For example, the target domain language model is a statistical language model of the translated text, and the general domain language model is also a statistical language model of the translated text.
In some embodiments, borrowing the idea of integrating multiple models for quality filtering, the computer device can likewise integrate multiple models for domain screening, so as to obtain the domain correlation of each sentence comprehensively and thereby improve the accuracy of screening the target-domain training corpus from the high-quality corpus. For example, the trained target domain language models and general domain language models include two groups: the first group includes a target domain statistical language model of the original text and a general domain statistical language model of the original text, and the second group includes a target domain autoregressive language model of the original text and a general domain autoregressive language model of the original text. For another example, the first group includes a target domain statistical language model of the original text, a target domain statistical language model of the translated text, a general domain statistical language model of the original text and a general domain statistical language model of the translated text, and the second group includes a target domain autoregressive language model of the original text, a target domain autoregressive language model of the translated text, a general domain autoregressive language model of the original text and a general domain autoregressive language model of the translated text.
Using two or more groups of target domain language models and general domain language models, the computer device performs hybrid scoring on the parallel sentences in the high-quality corpus: it obtains the score of each parallel sentence under each group of models, then fuses the scores across the groups to obtain the final domain score of the parallel sentence. The computer device then screens, according to the fused domain scores, the target-domain training corpus satisfying the preset quality condition from the training corpus satisfying the preset quality condition.
Because the trained target domain language model and general domain language model obtained by the computer device are trained language models with the same model structure, both have the ability to judge which word sequence is more likely, or to predict the most likely next word given a context. The target domain language model is trained on a target-domain corpus, so its predictions carry a certain domain correlation, while the predictions of the general domain language model carry a weaker domain correlation. If the difference between the likelihood predicted by the target domain language model and that predicted by the general domain language model is smaller, the sentence is more strongly correlated with the target domain; conversely, a larger difference indicates a weaker correlation with the target domain. Based on this idea, the computer device can obtain the domain score of the training corpus. After obtaining the domain score of each parallel sentence in the training corpus satisfying the preset quality condition, the computer device may screen, according to the domain scores, the target-domain training corpus satisfying the preset quality condition. For example, the computer device may take the sentences whose domain scores rank in the top M as the target-domain training corpus satisfying the preset quality condition, where M may be a percentage or a rank count; alternatively, the computer device may take the sentences whose domain scores pass a preset threshold N.
In one embodiment, the method further comprises: obtaining a parallel corpus of the target domain, and performing model training on a language model to be trained using the parallel corpus of the target domain to obtain the target domain language model; sampling the training corpus satisfying the preset quality condition to obtain a sampled corpus, and performing model training on a language model to be trained using the sampled corpus to obtain the general domain language model.
Specifically, the computer device may obtain a parallel corpus of the target domain and perform model training on the constructed language model using this corpus to obtain the target domain language model. Meanwhile, the computer device can sample from the entire high-quality training corpus to be domain-screened to obtain a sampled corpus, and use the sampled corpus to train a constructed language model with the same model structure, obtaining the general domain language model. The computer device may obtain the parallel corpus of the target domain, for example a corpus of a specific domain such as the news domain or the financial domain, through a crawler or other means. Of course, the general domain language model may also be obtained by training on a training corpus with weak domain correlation obtained through other channels.
Optionally, when the computer device only needs to use the target domain language model and general domain language model of the original text to obtain the domain scores, it may train the constructed language model using only the original text sentences in the parallel corpus of the target domain to obtain the target domain language model of the original text, and likewise train the constructed language model using only the original text sentences in the sampled corpus to obtain the general domain language model of the original text. Optionally, when the computer device only needs to use the target domain language model and general domain language model of the translated text to obtain the domain scores, it may train the constructed language model using only the translated text sentences in the parallel corpus of the target domain to obtain the target domain language model of the translated text, and likewise train the constructed language model using only the translated text sentences in the sampled corpus to obtain the general domain language model of the translated text.
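As a minimal sketch of this training-data preparation (the train_lm helper is hypothetical and stands in for whatever language-model trainer is used, e.g. an n-gram or neural trainer), the following Python snippet samples a general-domain training set from the high-quality corpus and builds both models from original-text sentences only:

    import random

    def prepare_domain_models(target_domain_pairs, high_quality_pairs,
                              sample_size, train_lm):
        """target_domain_pairs / high_quality_pairs: lists of (source, target).

        train_lm is a hypothetical callable that trains a language model
        of a fixed structure on a list of sentences and returns it.
        """
        # Target domain LM of the original text: trained on in-domain source sides.
        target_lm = train_lm([src for src, _ in target_domain_pairs])

        # General domain LM of the original text: trained on a random sample
        # of the high-quality corpus awaiting domain screening.
        sampled = random.sample(high_quality_pairs, sample_size)
        general_lm = train_lm([src for src, _ in sampled])

        return target_lm, general_lm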
FIG. 4 is a schematic diagram of a corpus processing flow for a translation model in one embodiment. Referring to FIG. 4, the original training corpus is first hybrid-scored by at least two groups of general language models to obtain the quality score of each parallel sentence, and is filtered according to the quality scores to obtain the training corpus satisfying the preset quality condition; the trained target domain language model and general domain language model are then used to domain-score the training corpus satisfying the preset quality condition, and the target-domain training corpus satisfying the preset quality condition is screened out according to the domain scores.
After the computer device obtains the target-domain training corpus satisfying the preset quality condition, this corpus can be used to train the translation model, yielding a translation model for translating text in the target domain.
According to the above corpus processing method for a translation model, at least two groups of trained general language models perform hybrid scoring on each parallel sentence in the original training corpus. Compared with quality filtering based on a single language model, since each group of general language models has a different model structure, the integrated groups can score parallel sentences comprehensively from different angles, ensuring that the training corpus satisfying the preset quality condition can be filtered from the original training corpus. Further, on this basis, the domain score of each parallel sentence in the training corpus satisfying the preset quality condition is obtained through the trained target domain language model and general domain language model, so that the target-domain training corpus satisfying the preset quality condition is screened out for training the target-domain translation model. Compared with filtering on quality alone, this screens target-domain corpus on the basis of guaranteed high quality, and the resulting corpus can greatly improve the translation performance of the translation model. In addition, the arrangement of scoring for quality first and then screening the target-domain training corpus on that high-quality basis, compared with directly performing mixed quality-and-domain scoring on the original training corpus, better ensures that the output corpus is both high in quality and closest to the target domain.
In one embodiment, as shown in fig. 5, obtaining, by each set of generic language models, a quality score corresponding to each parallel sentence in the original training corpus includes:
Step 502, scoring the original text sentence and the translated text sentence in each parallel sentence of the original training corpus through each group of general language models, and obtaining the original text quality score and the translated text quality score of each parallel sentence under each group of general language models.
In this embodiment, each group of general language models includes an original text language model and a translated text language model. The original text language model scores the original text sentence in each parallel sentence of the original training corpus to obtain the corresponding original text quality score, and the translated text language model scores the translated text sentence to obtain the corresponding translated text quality score. In this way, the computer device not only integrates multiple language models to score the parallel sentences, but also scores both the original text and the translated text, obtaining the original text quality score and the translated text quality score of each parallel sentence under each group of general language models, thereby ensuring that both sides of a retained parallel sentence are of high quality.
Step 504, fusing the original text quality score and the translated text quality score of each parallel sentence under each group of general language models to obtain the quality score corresponding to the parallel sentence.
In this embodiment, by performing hybrid scoring on each parallel sentence in the original training corpus by using multiple models, compared with quality filtering based on a single language model, due to different model structures of each group of general language models, the integrated multiple groups of language models can comprehensively score the parallel sentences from different angles, so as to ensure that high-quality training corpus can be filtered from the original training corpus.
In one embodiment, as shown in FIG. 6, step 502 includes:
step 602, scoring original text sentences in parallel sentences in the original training corpus respectively through original text language models in each group of general language models, and obtaining original text quality scores of the parallel sentences respectively.
Wherein each set of generic language models includes an original language model. Each original text language model needs to score original text sentences in parallel sentences in the original training corpus, and the original text quality scores of the parallel sentences corresponding to each original text language model are obtained respectively. That is, the textual language models with different model structures in each set of generic language models require scoring of parallel sentences.
Step 604, scoring the translated text sentences in each parallel sentence of the original training corpus through the translated text language models in each group of general language models, and obtaining the translated text quality scores of the parallel sentences respectively.
Wherein each set of generic language models further includes a translation language model. Each translation language model needs to score the translation sentences in each parallel sentence in the original training corpus, and the translation quality scores of the parallel sentences corresponding to each translation language model are respectively obtained. That is, parallel sentences need to be scored by the language models of the translations with different model structures in each group of general language models.
In this embodiment, each group of general language models includes an original text language model and a translated text language model. Since the model structure of each group differs, the original text language models of the groups differ in structure, as do the translated text language models. Multiple original text language models can thus be integrated to score the original text sentences comprehensively, and multiple translated text language models can be integrated to score the translated text sentences comprehensively; after this two-sided comprehensive scoring, the training corpus satisfying the preset quality condition can be filtered from the original training corpus.
In one embodiment, as shown in FIG. 7, step 504 includes:
step 702, summing the original text quality score and the translated text quality score of each group of general language models to obtain a group level score.
Specifically, the computer device obtains a group-level score corresponding to each group of the generic language models by summing the textual quality score and the translation quality score of the parallel sentences corresponding to each group of the generic language models.
Alternatively, the computer device may also average the textual quality scores and the translation quality scores for each set of generic language models to obtain a set level score.
It can be understood that this fusion mode takes into account the quality of a parallel sentence on both the original text side and the translated text side: only when the score sum or score average is high enough is the parallel sentence retained as high-quality corpus.
It can be understood that if each set of general language models only includes an original language model, the original quality score corresponding to the parallel sentence is a set level score. If each group of general language models only comprises a translation language model, the translation quality scores corresponding to parallel sentences are group-level scores.
Step 704, obtaining a weighting coefficient corresponding to each group of general language models.
Specifically, the computer device may obtain the preset weighting coefficient corresponding to each group of general language models. The weighting coefficients may be set randomly or manually. Of course, they may also depend on the performance of each group of general language models: for example, if the computer device obtains three groups of general language models and the first group performs best, the weighting coefficient corresponding to the first group is the largest; conversely, if the first group performs worst among the three, its weighting coefficient is the smallest.
In addition, the computer device may set a constraint condition for the weighting coefficients, for example, the weighting coefficients corresponding to each group of the generic language models are λ1, λ2 and λ3, and when setting the values of λ1, λ2 and λ3, the constraint condition needs to be satisfied:
λ1 + λ2 + λ3 = 0.5.
Step 706, performing weighted summation on the group-level scores of each parallel sentence under each group of general language models, based on the weighting coefficient corresponding to each group, to obtain the quality score corresponding to the parallel sentence.
It can be understood that if each set of general language models only includes an original language model, the computer device only needs to perform weighted summation on the original quality scores of the parallel sentences according to the weighting coefficients to obtain the quality scores corresponding to the parallel sentences. If each group of general language models only comprises a translation language model, the computer equipment only needs to carry out weighted summation on the translation quality scores of the parallel sentences according to the weighting coefficients to obtain the quality scores corresponding to the parallel sentences.
For example, the computer device obtains 3 groups of general language models, each group including an original text language model and a translated text language model. The 1st group scores the original text quality of a parallel sentence as S1 and the translated text quality as S1*; the 2nd group scores them as S2 and S2*; the 3rd group scores them as S3 and S3*. The quality score Q_Score corresponding to the parallel sentence S can then be expressed by the following formula:
Q_Score = λ1×(S1+S1*) + λ2×(S2+S2*) + λ3×(S3+S3*).
For another example, the computer device obtains 3 groups of general language models, each group including only original text language models, and the quality score Q_Score corresponding to the parallel sentence may be expressed by the following formula:
Q_Score = λ1×S1 + λ2×S2 + λ3×S3.
For another example, the computer device obtains 3 groups of general language models, each group including only translated text language models, and the quality score Q_Score corresponding to the parallel sentence may be expressed by the following formula:
Q_Score = λ1×S1* + λ2×S2* + λ3×S3*.
In this embodiment, the quality scores on the original text side and the translated text side are weighted and summed using the weighting coefficients, and the resulting quality score measures the quality of a parallel sentence on both sides.
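As a minimal sketch of this weighted fusion (variable names are illustrative), the following Python snippet computes the quality score of one parallel sentence from per-group original-text and translated-text scores:

    def fuse_quality_scores(group_scores, weights):
        """group_scores: list of (src_score, tgt_score) pairs, one per model group.
        weights: list of per-group weighting coefficients (lambda_i).

        Each group-level score is the sum of the original-text and
        translated-text scores; the final quality score is their
        weighted sum, as in Q_Score = sum_i lambda_i * (S_i + S_i*).
        """
        assert len(group_scores) == len(weights)
        return sum(w * (s_src + s_tgt)
                   for w, (s_src, s_tgt) in zip(weights, group_scores))

    # e.g. three groups with scores (S1, S1*), (S2, S2*), (S3, S3*)
    # q = fuse_quality_scores([(0.8, 0.7), (0.6, 0.9), (0.75, 0.8)],
    #                         weights=[0.2, 0.2, 0.1])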
FIG. 8 is a flow diagram of quality filtering in one embodiment. Referring to FIG. 8, take as an example an original training corpus that is a Chinese-English bilingual corpus and three acquired groups of general language models, each group including an original text language model and a translated text language model: the first group includes Chinese and English statistical language models, such as Ken-LM; the second group includes Chinese and English autoregressive language models, such as Transformer-LM; and the third group includes Chinese and English self-encoding language models, such as Bert-LM. The Chinese side of each parallel sentence is scored by the Chinese statistical language model and the corresponding English side by the English statistical language model, and the two scores are fused into the first group score of the parallel sentence; similarly, the second group score is obtained through the Chinese and English autoregressive language models, and the third group score through the Chinese and English self-encoding language models. The three group scores are then weighted and summed by a weighter to obtain the quality score corresponding to the parallel sentence. The computer device may filter the training corpus satisfying the preset quality condition from the original training corpus according to the quality scores.
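Tying the three groups together, a pipeline sketch along the lines of FIG. 8 (reusing the illustrative fuse_quality_scores helper above; all names are assumptions, not the patent's implementation) might look like:

    def quality_score(src, tgt, groups, weights):
        """groups: list of (src_scorer, tgt_scorer) callables, one pair per
        model group (e.g. statistical, autoregressive, self-encoding).
        Returns the weighted fusion of the per-group score sums."""
        group_scores = [(score_src(src), score_tgt(tgt))
                        for score_src, score_tgt in groups]
        return fuse_quality_scores(group_scores, weights)

    # groups = [(zh_ngram.score, en_ngram.score),
    #           (zh_ar_score, en_ar_score),
    #           (zh_mlm_score, en_mlm_score)]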
In one embodiment, as shown in FIG. 9, step 504 includes:
Step 902, normalizing the original text quality scores of the parallel sentences obtained by each group of general language models according to the highest score and the lowest score among the original text quality scores that the same group of general language models produced for all parallel sentences in the original training corpus, to obtain normalized original text quality scores.
Language models with different model structures may produce quality scores over inconsistent ranges. For example, for the first group of general language models a predicted quality score above 80 points may indicate good sentence quality, while for the second group a score above 0.85 may indicate good quality. If the quality scores from the different groups were weighted together directly, the final quality score might not objectively represent the quality of the parallel sentence. Therefore, in order to measure the quality of parallel sentences more reasonably and accurately, the computer device may normalize the quality scores produced by the different groups of general language models after obtaining them.
Specifically, for the original text quality scores of the parallel sentences obtained by the original text language models in the same group of general language models, the computer equipment can determine the highest score and the lowest score in the parallel sentences, and then normalize the original text quality scores output by the original text language models according to the highest score and the lowest score to obtain normalized original text quality scores.
In one embodiment, normalizing the original text quality scores of the parallel sentences obtained by the same group of general language models to obtain normalized original text quality scores can be realized by the following min-max formula:

S_i' = (S_i - S_i^min) / (S_i^max - S_i^min)

where S_i represents the original text quality score of parallel sentence S obtained by the i-th group of general language models, S_i^min represents the lowest score among the original text quality scores of all parallel sentences obtained by the i-th group, S_i^max represents the highest such score, and S_i' represents the normalized original text quality score of parallel sentence S.
Step 904, normalizing the translated text quality scores of the parallel sentences obtained by each group of general language models according to the highest score and the lowest score among the translated text quality scores that the same group of general language models produced for all parallel sentences in the original training corpus, to obtain normalized translated text quality scores.
Specifically, for the quality scores of the parallel sentences obtained by the translation language models in the same group of general language models, the computer equipment can determine the highest score and the lowest score in the parallel sentences, and then normalize the quality scores of the translations output by the translation language models according to the highest score and the lowest score to obtain normalized quality scores of the translations.
In one embodiment, normalizing the translated text quality scores of the parallel sentences obtained by the same group of general language models to obtain normalized translated text quality scores can likewise be realized by the following formula:

S_i*' = (S_i* - S_i*^min) / (S_i*^max - S_i*^min)

where S_i* represents the translated text quality score of parallel sentence S obtained by the i-th group of general language models, S_i*^min represents the lowest score among the translated text quality scores of all parallel sentences obtained by the i-th group, S_i*^max represents the highest such score, and S_i*' represents the normalized translated text quality score of parallel sentence S.
Step 906, merging the normalized original text quality scores and normalized translated text quality scores of the parallel sentences corresponding to each group of general language models to obtain quality scores corresponding to the parallel sentences.
For example, the computer device obtains 3 groups of general language models, each group including an original text language model and a translated text language model. The 1st group scores the original text quality of a parallel sentence as S1 and the translated text quality as S1*, the 2nd group as S2 and S2*, and the 3rd group as S3 and S3*. After normalization, the quality score Q_Score corresponding to the parallel sentence S can be expressed by the following formula:
Q_Score = λ1×(S1'+S1*') + λ2×(S2'+S2*') + λ3×(S3'+S3*').
It should be noted that, if the ranges of the quality scores output by the different groups are basically consistent among the at least two groups of general language models obtained by the computer device, the computer device does not need to normalize the original text quality scores and the translated text quality scores; it can directly weight and sum the original text quality scores and the translated text quality scores to obtain the final quality score of each parallel sentence.
In this embodiment, by normalizing the quality scores output by the universal language models of each group, the quality scores of the parallel sentences in the original corpus can be more reasonably scored, so that the accuracy of filtering the high-quality corpus from the original corpus is improved.
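A minimal sketch of this per-group min-max normalization (names are illustrative) might look as follows; it rescales each group's scores to [0, 1] before fusion:

    def min_max_normalize(scores):
        """Rescale one group's quality scores to [0, 1].

        scores: list of raw quality scores that one group of general
        language models produced over all parallel sentences.
        """
        lo, hi = min(scores), max(scores)
        if hi == lo:                      # degenerate case: all scores equal
            return [0.0 for _ in scores]
        return [(s - lo) / (hi - lo) for s in scores]

    # Normalize each group's original-text and translated-text scores
    # separately, then fuse with fuse_quality_scores as sketched above.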
The specific manner in which each type of general language model outputs quality scores is described below:
in one embodiment, as shown in fig. 10, when the generic language model is a statistical language model obtained based on a high-quality corpus, obtaining, by each set of generic language models, a quality score corresponding to each parallel sentence in the original training corpus includes:
step 1002, parallel sentences are sequentially obtained from the original training corpus.
Step 1004, inputting the original sentence in the parallel sentence into the statistical language model of the original sentence, and obtaining the original quality score of the parallel sentence based on the condition frequency corresponding to each word in the original sentence through the statistical language model of the original sentence.
Wherein, the statistical language model of the original text is obtained based on the high-quality corpus of the original text. The computer equipment can construct a statistical language model in advance, carry out statistical training on the high-quality corpus of the original text, score the original text sentences of each parallel sentence in the original training corpus by using the statistical language model of the original text obtained by training, and obtain the corresponding quality score of the original text.
The statistical language model is based on a Markov assumption: the probability that the t-th word of a sentence occurs is related only to the finite N-1 words preceding it. The probability of the sentence is then the product of the conditional probabilities of the words in it, i.e., the probability of each word given the words before it. The computer device can determine the probability of the sentence from the conditional frequencies of its words obtained through the statistics of the statistical language model, and use this probability as the quality score of the sentence.
The statistical language model uses a multinomial distribution to predict the probability of each word in a sentence given its context; its scoring is fast and very simple, so a statistical language model can be used in the at least two groups of general language models acquired by the computer device.
In one embodiment, the original text quality score of the original text sentence S output by the statistical language model of the original text may be calculated by the following formulas:

p(S) = prod_t p(w_t | w_{t-1}, w_{t-2}, ..., w_1) ~= prod_t p(w_t | w_{t-1}, w_{t-2}, ..., w_{t-N+1})

For example, when N = 3:

p(S) = prod_t p(w_t | w_{t-1}, w_{t-2})

where p(S) is the original text quality score of the original text sentence S, w_t is the t-th word in S, and p(w_t | w_{t-1}, w_{t-2}, ..., w_1) expresses that, given the 1st through (t-1)-th words, the probability that the next word is w_t is approximately equal to its probability given only the preceding N-1 words. Each conditional probability is estimated from conditional frequencies:

p(w_t | w_{t-1}, w_{t-2}) = C(w_{t-2} w_{t-1} w_t) / C(w_{t-2} w_{t-1})

where C(·) represents the number of occurrences of the word sequence in brackets in the high-quality corpus used for statistical training.
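As an illustrative sketch (not the patent's implementation), the following Python snippet estimates trigram conditional frequencies from a high-quality corpus and scores a sentence as the product of conditional probabilities, computed in log space for numerical stability:

    import math
    from collections import Counter

    class TrigramLM:
        def __init__(self, corpus):
            """corpus: iterable of tokenized sentences (lists of words)."""
            self.tri, self.bi, self.vocab = Counter(), Counter(), set()
            for sent in corpus:
                self.vocab.update(sent)
                padded = ["<s>", "<s>"] + sent + ["</s>"]
                for i in range(2, len(padded)):
                    self.tri[tuple(padded[i - 2:i + 1])] += 1
                    self.bi[tuple(padded[i - 2:i])] += 1

        def score(self, sent):
            """Log-probability of the sentence: sum of log p(w_t | w_{t-2}, w_{t-1}),
            with add-one smoothing so unseen trigrams do not yield zero."""
            v = len(self.vocab) + 1          # +1 for the </s> symbol
            padded = ["<s>", "<s>"] + sent + ["</s>"]
            logp = 0.0
            for i in range(2, len(padded)):
                c3 = self.tri[tuple(padded[i - 2:i + 1])]
                c2 = self.bi[tuple(padded[i - 2:i])]
                logp += math.log((c3 + 1) / (c2 + v))
            return logp

    # lm = TrigramLM(high_quality_sentences); lm.score(["I", "love", "China"])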
Step 1006, inputting the translated sentence in the parallel sentence into a statistical language model of the translation, and obtaining the quality score of the translation of the parallel sentence based on the condition frequency corresponding to each word in the translated sentence through the statistical language model of the translation.
Similarly, the computer device may construct a statistical language model in advance, perform statistical training on the high-quality corpus of translations, score the translated sentences of each parallel sentence in the original training corpus by using the statistical model of the translations obtained by training, and obtain a corresponding quality score of the translations.
Likewise, the translated text quality score of the translated text sentence S output by the statistical language model of the translated text may be calculated by the following formula:

p(S) = prod_t p(w_t | w_{t-1}, w_{t-2}, ..., w_1) ~= prod_t p(w_t | w_{t-1}, w_{t-2}, ..., w_{t-N+1})

where p(S) is the translated text quality score of the translated text sentence S, w_t is the t-th word in S, and p(w_t | w_{t-1}, w_{t-2}, ..., w_1) expresses that, given the 1st through (t-1)-th words, the probability that the next word is w_t is approximately equal to its probability given only the preceding N-1 words.
And step 1008, merging the original text quality scores and the translated text quality scores of the parallel sentences to obtain the quality scores of the statistical language models corresponding to the parallel sentences.
For the way in which the statistical language model fuses the textual quality score with the translation quality score, reference may be made specifically to the manner mentioned above, and the description will not be repeated here.
In one embodiment, as shown in fig. 11, when the universal language model is an autoregressive language model, obtaining, by each set of universal language models, a quality score corresponding to each parallel sentence in the original training corpus includes:
step 1102, obtaining parallel sentences from the original training corpus in turn.
In step 1104, the original text sentence of each parallel sentence is input into the autoregressive language model of the original text, the conditional probability of each word in the original text sentence is predicted from left to right or from right to left through that model, and the original text quality score of the parallel sentence is obtained according to the conditional probabilities of the words.
The original autoregressive language model is obtained by model training based on an original training corpus. The computer equipment can construct an autoregressive language model in advance, perform model training on original text training corpus, score the original text sentences of each parallel sentence in the original training corpus by using the autoregressive language model obtained by training, and obtain corresponding original text quality scores.
The training goal of the autoregressive model is to predict the word in the next position, so during training the words to the left or right of the predicted position must be shielded by a masking mechanism, ensuring that the model learns the ability to predict word by word from left to right or from right to left. For example, take the training sentence "I love China". To teach the model to predict from left to right, when predicting the 2nd word the words to the right of the 2nd position must be masked, so that the 2nd word is predicted based only on the preceding "I"; when predicting the 3rd word, the words to the right of the 3rd position must be masked, so that the 3rd word is predicted based only on the preceding "I love".
After obtaining the trained autoregressive language model of the original text, the computer device may input the original text sentence of each parallel sentence in the original training corpus into the model, predict word by word from left to right or from right to left, and obtain the quality score of the original text sentence from the predicted conditional probabilities according to the following formula:

Score(S) = sum_t log P_ALM(w_t | W_{<t}; Θ)

where w_t represents the t-th word in the original text sentence S, W_{<t} represents the antecedent of the t-th word, i.e., the words that have already appeared to its left (or right), P_ALM(w_t | W_{<t}; Θ) represents the conditional probability of the t-th word under the model's word-by-word prediction, Θ represents the manually set hyper-parameters of model training, and Score(S) represents the quality score of the original text sentence S output by the autoregressive language model. The computer device obtains the conditional probability of each word in the sentence according to this formula, performs density estimation of the joint probability of the sentence from these conditional probabilities, and uses the resulting value to measure sentence quality.
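As an illustrative sketch under the assumption that the autoregressive model is a Transformer causal language model loaded through the Hugging Face transformers library (the model name below is a placeholder, not the patent's model), the sentence score can be computed as the summed log-probability of its tokens:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder model
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def ar_score(sentence):
        """Sum over t of log P(w_t | w_<t), i.e. Score(S) above."""
        ids = tokenizer(sentence, return_tensors="pt").input_ids
        with torch.no_grad():
            # With labels=input_ids the model returns the mean token
            # cross-entropy; multiply by the prediction count for the sum.
            loss = model(ids, labels=ids).loss
        return -loss.item() * (ids.size(1) - 1)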
Step 1106, inputting the translated sentence in the parallel sentence into an autoregressive language model of the translation, predicting the conditional probability of each word from left to right or from right to left in the translated sentence through the autoregressive language model of the translation, and obtaining the quality score of the translation of the parallel sentence according to the conditional probability corresponding to each word.
Similarly, the computer device may construct an autoregressive language model in advance, perform model training on the translation training corpus, score the translation sentences of each parallel sentence in the original training corpus by using the autoregressive language model obtained by training, and obtain a corresponding translation quality score.
And step 1108, fusing the original text quality scores and the translated text quality scores of the parallel sentences to obtain the quality scores of the autoregressive language models corresponding to the parallel sentences.
For the way in which the autoregressive language model fuses the textual quality score with the translation quality score, reference may be made specifically to the manner mentioned above, and the description will not be repeated here.
In one embodiment, as shown in fig. 12, when the universal language model is a self-coding language model, obtaining, by each set of universal language models, a quality score corresponding to each parallel sentence in the original training corpus includes:
step 1202, obtaining parallel sentences from the original training corpus in turn.
Step 1204, sequentially taking each word in the original sentence of the parallel sentence as a masking word, inputting the masked original sentence into a self-coding language model of the original sentence, outputting a prediction probability corresponding to the masking word through the self-coding language model of the original sentence, and obtaining the original quality score of the parallel sentence according to the prediction probability corresponding to each masking word.
The self-coding language model of the original text is obtained by model training based on training corpus of the original text. The computer equipment can construct a self-coding language model in advance, perform model training on original text training corpus, score the original text sentences of each parallel sentence in the original training corpus by using the self-coding language model obtained by training, and obtain corresponding original text quality scores.
The training goal of the self-coding model is to predict the probability of a certain word appearing at a certain position according to the context information of the position in a sentence, so that during training, the word at the predicted position needs to be shielded by a masking means (masking), the context information of the position is input, and the self-coding language model is trained by taking the word at the predicted position as label information, so that the model can learn the prediction capability according to the context information of each position.
After obtaining the trained self-coding language model of the original text, the computer device can mask each word of the original text sentence of each parallel sentence in turn, input the masked sentences into the self-coding language model of the original text, and output the prediction probability corresponding to each masked word. It can be understood that when the output predicted words do not include the masked word, the prediction probability corresponding to that masked word is 0. The computer device may then determine the quality score of the original text sentence from the prediction probabilities of the masked words.
Specifically, the computer device obtains the quality score of the original text sentence from the predicted occurrence probability of the masked word at each position, according to the trained self-coding language model and the following formula:

Score(S) = sum_t log P_Mask_LM(w_t | S_\t; Θ)

where w_t represents the t-th word in the original text sentence S, S_\t represents the sequence obtained by masking out the t-th word of S, P_Mask_LM(w_t | S_\t; Θ) represents the prediction probability of the t-th masked word predicted by the model, Θ represents the manually set hyper-parameters of model training, and Score(S) represents the quality score of the original text sentence S output by the self-coding language model. The computer device obtains the prediction probability of each masked word in the sentence according to this formula and derives the original text quality score of the sentence from these prediction probabilities.
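A sketch of this masked scoring, assuming a BERT-style masked language model from the transformers library (the model name is a placeholder), masks each token in turn and accumulates the log-probability the model assigns to the original token, i.e. the pseudo-log-likelihood of the sentence:

    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
    model.eval()

    def mlm_score(sentence):
        """Sum over t of log P(w_t | S with position t masked)."""
        ids = tokenizer(sentence, return_tensors="pt").input_ids[0]
        logp = 0.0
        for t in range(1, ids.size(0) - 1):        # skip [CLS] and [SEP]
            masked = ids.clone()
            masked[t] = tokenizer.mask_token_id
            with torch.no_grad():
                logits = model(masked.unsqueeze(0)).logits[0, t]
            logp += torch.log_softmax(logits, dim=-1)[ids[t]].item()
        return logp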
Step 1206, sequentially taking each word in the translated sentence of the parallel sentence as a masking word, inputting the masked translated sentence into a self-coding language model of the translated sentence, outputting a prediction probability corresponding to the masking word through the self-coding language model of the translated sentence, and obtaining the quality score of the translated sentence of the parallel sentence according to the prediction probability corresponding to each masking word.
Similarly, the computer device may construct a self-coding language model in advance, perform model training on the translation training corpus, score the translation sentences of each parallel sentence in the original training corpus by using the self-coding language model obtained by training, and obtain a corresponding translation quality score.
Step 1208, fusing the original text quality score and the translated text quality score of each parallel sentence to obtain the quality score of the self-coding language model corresponding to each parallel sentence.
For the way in which the original text quality score and the translated text quality score are fused from the coding language model, reference may be made specifically to the way mentioned above, and the description thereof will not be repeated here.
After the quality filtering is completed, the following continues to describe specific embodiments of the domain screening.
Since the training corpus satisfying the preset quality condition was screened from the original training corpus in the previous step, both the original text and the translated text of its parallel sentences can already be considered to satisfy the preset quality condition, and the domain screening only needs to perform a secondary screening targeting domain correlation. The more an original text sentence fits the target domain, the more its corresponding translated text sentence necessarily fits it as well, so the computer device only needs to domain-score the original text sentences. That is, the target domain language model includes only a target domain language model of the original text, and correspondingly the general domain language model includes only a general domain language model of the original text; the computer device uses these two models to obtain the domain score corresponding to the original text sentence of each parallel sentence in the training corpus satisfying the preset quality condition, as the domain score corresponding to that parallel sentence.
That is, in one embodiment, as shown in fig. 13, step 306, obtaining, by the trained target domain language model and the universal domain language model, the domain score corresponding to each parallel sentence in the training corpus that satisfies the preset quality condition includes:
in step 1302, scoring original text sentences in each parallel sentence in the training corpus meeting the preset quality condition through the target domain language model of the original text, and obtaining a first domain score corresponding to the parallel sentence.
Specifically, the target domain language model is trained on a corpus of the target domain, so its prediction ability carries a certain domain correlation. The computer device can therefore use the target domain language model of the original text to score the original text sentence of each parallel sentence in the training corpus satisfying the preset quality condition, obtaining a first domain score for the parallel sentence, which may be denoted H1(S).
And 1304, scoring original text sentences in each parallel sentence in the training corpus meeting the preset quality condition through a universal field language model of the original text, and obtaining a second field score corresponding to the parallel sentence.
Specifically, the general domain language model is trained on a corpus that has not undergone domain screening, so its prediction ability carries only weak domain correlation. The computer device can use the general domain language model of the original text to score the original text sentence of each parallel sentence in the training corpus satisfying the preset quality condition, obtaining a second domain score for the parallel sentence, which may be denoted H2(S).
Step 1306, obtaining the domain score corresponding to each parallel sentence in the training corpus meeting the preset quality condition according to the difference between the first domain score and the second domain score corresponding to each parallel sentence.
The target domain language model has the same model structure as the general domain language model; they differ only in the corpus used for training, the former using the target-domain corpus and the latter a corpus without domain screening, so their output domain scores lie in a consistent range, for example values between 0 and 1. If the difference between the first domain score and the second domain score is larger, the domain correlation of the original text sentence is lower; conversely, a smaller difference indicates a higher domain correlation.
In one embodiment, the computer device may obtain, from the difference between the first domain score and the second domain score, a cross-entropy loss corresponding to each original text sentence as the domain score of that sentence. Following the principle that a smaller cross-entropy loss is better, the computer device can screen, according to these domain scores, the target-domain training corpus satisfying the preset quality condition from the training corpus satisfying the preset quality condition.
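A minimal sketch of this cross-entropy-difference screening (the cross_entropy method is a hypothetical interface; names are illustrative) scores each original text sentence by |H1(S) - H2(S)| and keeps the sentences with the smallest differences:

    def domain_filter(pairs, target_lm, general_lm, keep_fraction=0.3):
        """pairs: list of (source_sentence, target_sentence) from the
        high-quality corpus; target_lm / general_lm expose a hypothetical
        cross_entropy(sentence) method with outputs on a consistent scale.

        Smaller |H1(S) - H2(S)| means the sentence is closer to the
        target domain, so we sort ascending and keep the head.
        """
        scored = []
        for src, tgt in pairs:
            h1 = target_lm.cross_entropy(src)
            h2 = general_lm.cross_entropy(src)
            scored.append((abs(h1 - h2), src, tgt))
        scored.sort(key=lambda x: x[0])
        keep = int(len(scored) * keep_fraction)
        return [(src, tgt) for _, src, tgt in scored[:keep]]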
Referring to fig. 14 (a), a schematic diagram of domain filtering on textual sentences in a corpus meeting a preset quality condition in one embodiment is shown. The computer equipment inputs the original sentence in the training corpus of the target field and the original sentence in the training corpus meeting the preset quality condition to be subjected to field screening into the language model. Training the language model by using original text sentences in the training corpus of the target field to obtain the target field language model of the original text. And training the language model by using a part of original text sentences in the training corpus which is to be subjected to domain screening and meets the preset quality condition, so as to obtain a general domain language model of the original text. And scoring original text sentences in the training corpus meeting the preset quality conditions to be subjected to domain screening through the target domain language model and the general domain language model respectively, and screening the training corpus meeting the preset quality conditions in the target domain from the training corpus meeting the preset quality conditions according to scoring differences.
In one embodiment, step 306, obtaining, by the trained target domain language model and the universal domain language model, a domain score corresponding to each parallel sentence in the training corpus meeting the preset quality condition includes: scoring translation sentences in each parallel sentence in the training corpus meeting the preset quality condition through a target field language model of the translation to obtain a third field score corresponding to the parallel sentence; scoring translation sentences in each parallel sentence in the training corpus meeting the preset quality condition through a universal field language model of the translation to obtain a fourth field score corresponding to the parallel sentence; and obtaining the domain scores corresponding to the parallel sentences in the training corpus meeting the preset quality conditions according to the difference between the third domain scores and the fourth domain scores corresponding to the parallel sentences.
Similarly, the computer device may use the target domain language model of the translated text to score the translated text sentence of each parallel sentence in the training corpus satisfying the preset quality condition, obtaining a third domain score for the parallel sentence, which may be denoted H1(S*); and use the general domain language model of the translated text to score the translated text sentence, obtaining a fourth domain score, which may be denoted H2(S*). The computer device may obtain, from the difference between the third domain score and the fourth domain score, a cross-entropy loss corresponding to each translated text sentence as the domain score of that sentence. Following the principle that a smaller cross-entropy loss is better, the computer device can screen, according to these domain scores, the target-domain training corpus satisfying the preset quality condition from the training corpus satisfying the preset quality condition.
Referring to fig. 14 (b), a schematic diagram of domain filtering on translated sentences in a corpus that satisfies a preset quality condition in one embodiment is shown. The computer equipment inputs the translation sentences in the training corpus of the target field and the translation sentences in the training corpus meeting the preset quality condition to be subjected to field screening into the language model. And training the language model by using translation sentences in the training corpus of the target domain to obtain the target domain language model of the translation. And training the language model by using a part of translation sentences in the training corpus which is to be subjected to domain screening and meets the preset quality condition, so as to obtain a universal domain language model of the translation. Scoring translation sentences in the training corpus meeting the preset quality conditions to be subjected to domain screening through the target domain language model and the general domain language model respectively, and screening the training corpus meeting the preset quality conditions in the target domain according to scoring differences.
In one embodiment, obtaining, by the trained target domain language model and the universal domain language model, a domain score corresponding to each parallel sentence in the training corpus meeting a preset quality condition includes: scoring original text sentences in each parallel sentence in the training corpus meeting the preset quality condition through a target field language model of the original text to obtain a first field score corresponding to the parallel sentence; scoring original text sentences in each parallel sentence in the training corpus meeting the preset quality condition through a universal field language model of the original text to obtain a second field score corresponding to the parallel sentence; scoring translation sentences in each parallel sentence in the training corpus meeting the preset quality condition through a target field language model of the translation to obtain a third field score corresponding to the parallel sentence; scoring translation sentences in each parallel sentence in the training corpus meeting the preset quality condition through a universal field language model of the translation to obtain a fourth field score corresponding to the parallel sentence; and fusing the difference between the first domain score and the second domain score corresponding to each parallel sentence and the difference between the third domain score and the fourth domain score to obtain the domain score corresponding to each parallel sentence in the training corpus meeting the preset quality condition.
In this embodiment, the target domain language model may include a target domain language model of an original text and a target domain language model of a translated text, and correspondingly, the general domain language model may include a general domain language model of an original text and a general domain language model of a translated text, so that the computer device may fuse domain scores on both sides of the original text and the translated text, and select a training corpus of the target domain and meeting a preset quality condition from training corpuses meeting a preset quality condition.
The method of fusing the domain scores on both sides can be specifically direct addition, weighted addition or averaging.
For example, the computer device obtains the first domain score corresponding to the original text sentence in a parallel sentence using the target domain language model of the original text, denoted H1(S); obtains the second domain score corresponding to the original text sentence using the general domain language model of the original text, denoted H2(S); obtains the third domain score corresponding to the translated text sentence using the target domain language model of the translated text, denoted H1(S*); obtains the fourth domain score corresponding to the translated text sentence using the general domain language model of the translated text, denoted H2(S*); and obtains the domain score corresponding to the parallel sentence according to the following formula:
F_Score = |H1(S) - H2(S)| + |H1(S*) - H2(S*)|.
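For illustration only, the following minimal Python sketch computes this fused bilingual domain score for one parallel sentence. The four scoring functions are hypothetical callables (for example, each returning an average log-probability); they are not an interface defined by this embodiment:

```python
def domain_score(src, tgt, lm_target_src, lm_general_src,
                 lm_target_tgt, lm_general_tgt):
    """Fused bilingual domain score F_Score for one parallel sentence.

    Each lm_* argument is assumed to be a callable returning a
    sentence-level score such as an average log-probability.
    """
    h1_s = lm_target_src(src)    # H1(S):  target-domain LM, original text
    h2_s = lm_general_src(src)   # H2(S):  general-domain LM, original text
    h1_t = lm_target_tgt(tgt)    # H1(S*): target-domain LM, translation
    h2_t = lm_general_tgt(tgt)   # H2(S*): general-domain LM, translation
    # Fuse the per-side differences by direct addition, per the formula above
    return abs(h1_s - h2_s) + abs(h1_t - h2_t)
```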
In some embodiments, the computer device may also integrate multiple models to perform domain screening, so as to obtain the domain relevance of each sentence comprehensively, thereby improving the accuracy of screening the target-domain training corpus from the high-quality corpus. For example, the trained target domain language model and general domain language model each comprise two sets, and the computer device performs a weighted summation of the domain scores obtained from each set to obtain the final domain score of the parallel sentence:
F_Score = λ1(|H1(S) - H2(S)| + |H1(S*) - H2(S*)|) + λ2(|H1'(S) - H2'(S)| + |H1'(S*) - H2'(S*)|), where λ1 is the weighting coefficient corresponding to the first set of language models, λ2 is the weighting coefficient corresponding to the second set of language models, the first term (|H1(S) - H2(S)| + |H1(S*) - H2(S*)|) is the domain score output by the first set of language models, the second term (|H1'(S) - H2'(S)| + |H1'(S*) - H2'(S*)|) is the domain score output by the second set of language models (the primed scores denote the second set's models), and F_Score is the final fused domain score.
In some embodiments, when the two sets of language models adopt different model structures and the ranges of their output domain scores differ considerably, the computer device may normalize the domain scores output by each set of language models, in the same manner as the normalization of the quality scores described above, which is not repeated here.
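As an illustration of this normalization-and-fusion step, a minimal sketch follows. The min-max normalization and the λ coefficients mirror the description above, while the function names and the [0, 1] target range are assumptions made for the sketch:

```python
def min_max_normalize(scores):
    """Min-max normalize a list of scores to [0, 1] (all-equal case guarded)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def fuse_domain_scores(set1_scores, set2_scores, lambda1=0.5, lambda2=0.5):
    """Weighted fusion of per-sentence domain scores from two sets of
    language models whose raw score ranges may differ considerably."""
    n1 = min_max_normalize(set1_scores)
    n2 = min_max_normalize(set2_scores)
    return [lambda1 * a + lambda2 * b for a, b in zip(n1, n2)]
```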
In a specific embodiment, the corpus processing method of the translation model includes the following steps:
1. acquiring an original training corpus used for training a translation model;
2. acquiring at least two groups of trained universal language models, wherein the model structure of each group of universal language models is different;
3. scoring the original text sentences in each parallel sentence in the original training corpus through the original text language model in each group of general language models, and obtaining the original text quality scores of the parallel sentences respectively;
4. scoring the translated sentences in each parallel sentence in the original training corpus through the translated language models in each group of general language models, and obtaining the translated quality scores of the parallel sentences respectively;
5. normalizing the original text quality scores of the parallel sentences obtained by the same group of general language models according to the highest score and the lowest score in the original text quality scores of the parallel sentences in the original training corpus obtained by the same group of general language models to obtain normalized original text quality scores;
6. according to the highest score and the lowest score in the translation quality scores of the parallel sentences in the original training corpus obtained by the same group of general language models, carrying out normalization processing on the translation quality scores of the parallel sentences obtained by the same group of general language models to obtain normalized translation quality scores;
7. fusing the normalized original text quality scores and normalized translated text quality scores of the parallel sentences corresponding to each group of general language models to obtain quality scores corresponding to the parallel sentences;
8. filtering the original training corpus according to the quality scores to obtain training corpus meeting the preset quality conditions;
9. obtaining a parallel corpus of the target field, and performing model training on a language model to be trained by using the original text sentences in the parallel corpus of the target field to obtain a target field language model of the original text;
10. sampling the training corpus meeting the preset quality condition to obtain a sampled corpus, and performing model training on the language model to be trained by using original text sentences in the sampled corpus to obtain a general field language model of the original text;
11. scoring original text sentences in each parallel sentence in the training corpus meeting the preset quality condition through a target field language model of the original text to obtain a first field score corresponding to the parallel sentence;
12. scoring original text sentences in each parallel sentence in the training corpus meeting the preset quality condition through a universal field language model of the original text to obtain a second field score corresponding to the parallel sentence;
13. Obtaining the domain scores corresponding to the parallel sentences in the training corpus meeting the preset quality conditions according to the difference between the first domain scores and the second domain scores corresponding to the parallel sentences;
14. screening the training corpus which belongs to the target field and meets the preset quality condition from the training corpus meeting the preset quality condition according to the field scores corresponding to the parallel sentences;
15. training the translation model by using the training corpus which belongs to the target field and meets the preset quality condition to obtain a translation model for translating text in the target field, as sketched in the example below.
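The following consolidated sketch strings steps 1 to 15 together. Every interface here (the scorer callables, train_lm, train_mt), the keep ratios, and the selection policy for the domain scores are hypothetical placeholders rather than definitions from this embodiment:

```python
import random

def _min_max(scores):
    # Min-max normalization, guarding the degenerate all-equal case
    lo, hi = min(scores), max(scores)
    return [0.0] * len(scores) if hi == lo else [(s - lo) / (hi - lo) for s in scores]

def process_corpus(original_corpus, target_domain_corpus, generic_lm_sets,
                   train_lm, train_mt, quality_keep_ratio=0.5,
                   domain_keep_ratio=0.3, sample_size=100_000):
    """Consolidated sketch of steps 1-15: quality filtering, then domain screening.

    original_corpus: list of (src, tgt) parallel sentence pairs.
    generic_lm_sets: two or more sets of (src_scorer, tgt_scorer) callables
    with different model structures. train_lm / train_mt are training
    callables returning a scorer and a translation model respectively.
    """
    # Steps 3-7: score both sides with every set, normalize per set, then fuse
    fused = [0.0] * len(original_corpus)
    for src_lm, tgt_lm in generic_lm_sets:
        src_q = _min_max([src_lm(s) for s, _ in original_corpus])
        tgt_q = _min_max([tgt_lm(t) for _, t in original_corpus])
        for i in range(len(original_corpus)):
            fused[i] += (src_q[i] + tgt_q[i]) / len(generic_lm_sets)

    # Step 8: keep the top-scoring pairs as the high-quality corpus
    ranked = sorted(zip(fused, original_corpus), key=lambda x: x[0], reverse=True)
    high_quality = [p for _, p in ranked[: int(len(ranked) * quality_keep_ratio)]]

    # Steps 9-10: train target-field and general-field LMs on the source side
    target_lm = train_lm([s for s, _ in target_domain_corpus])
    sampled = random.sample(high_quality, min(sample_size, len(high_quality)))
    general_lm = train_lm([s for s, _ in sampled])

    # Steps 11-14: field score from the two LM scores; keeping the top-scoring
    # pairs is one possible selection policy, assumed here for illustration
    dom = sorted(high_quality,
                 key=lambda p: abs(target_lm(p[0]) - general_lm(p[0])),
                 reverse=True)
    in_domain = dom[: int(len(dom) * domain_keep_ratio)]

    # Step 15: train the translation model on the screened in-domain corpus
    return train_mt(in_domain)
```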
According to this corpus processing method for the translation model, the manual workload can be greatly reduced while high-quality corpus of the target field is selected, so that the performance of the translation model in the specific field is greatly improved and the user experience is noticeably improved; the specific performance is shown in Table 1 below.
Table 1
Corpus scale                                                   BLEU
1 million novel (baseline model)                               20.03
1 million novel + 1 million novel-field data (mixed training)  20.51
1 million novel + 3 million novel-field data (mixed training)  20.91
1 million novel + 5 million novel-field data (mixed training)  21.56
The table above can be read as follows: by adopting the corpus processing method for the translation model provided by the embodiments of this application, a corpus of 1 million novel-field sentences is used to perform quality filtering and field screening on a large-scale bilingual corpus of 600 million sentence pairs, yielding 5 million sentences of novel-field data; the 1 million novel-field sentences and the 5 million screened novel-field sentences then form a new corpus for training the translation model, whose BLEU score reaches 21.56.
It should be understood that, although the steps in the above-described flowcharts are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described above may include a plurality of steps or stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of execution of the steps or stages is not necessarily sequential, but may be performed in turn or alternately with at least a part of other steps or stages.
In one embodiment, as shown in fig. 15, a corpus processing apparatus 1500 of a translation model is provided, where the apparatus may use a software module or a hardware module, or a combination of the two, and the apparatus specifically includes: a corpus acquisition module 1502, a quality filtering module 1504, and a domain screening module 1506, wherein:
a corpus acquisition module 1502, configured to acquire an original training corpus for training a translation model;
The quality filtering module 1504 is configured to obtain at least two sets of trained universal language models, obtain quality scores corresponding to parallel sentences in an original training corpus through each set of universal language models, filter the original training corpus according to the quality scores to obtain a training corpus meeting a preset quality condition, and each set of universal language models has a different model structure;
the domain screening module 1506 is configured to obtain, through the trained target domain language model and the universal domain language model, domain scores corresponding to parallel sentences in the training corpus that satisfies the preset quality condition, and to screen, according to the domain scores, the training corpus that belongs to the target field and satisfies the preset quality condition from the training corpus that satisfies the preset quality condition, where the target domain language model has the same model structure as the universal domain language model; the screened training corpus that belongs to the target field and satisfies the preset quality condition is used for performing model training on the translation model to obtain a translation model for translating text in the target field.
In one embodiment, the quality filtering module 1504 is further configured to score, through each set of universal language models, an original sentence and a translated sentence in each parallel sentence in the original training corpus, so as to obtain an original quality score and a translated quality score of each set of universal language models corresponding to the parallel sentence; and fusing the original text quality scores and the translated text quality scores of the parallel sentences corresponding to each group of general language models to obtain the quality scores corresponding to the parallel sentences.
In one embodiment, the quality filtering module 1504 includes an original scoring unit and a translation scoring unit;
the original text scoring unit is used for scoring the original text sentences in each parallel sentence in the original training corpus through the original text language models in each group of general language models, and obtaining the original text quality scores of the parallel sentences respectively;
the translation scoring unit is used for scoring the translation sentences in each parallel sentence in the original training corpus through the translation language models in each group of general language models, and obtaining the quality scores of the translations of the parallel sentences respectively.
In one embodiment, the quality filtering module 1504 is further configured to normalize the textual quality scores of parallel sentences obtained by the same set of general language models according to the highest score and the lowest score in the textual quality scores of parallel sentences in the original training corpus obtained by the same set of general language models, so as to obtain normalized textual quality scores; according to the highest score and the lowest score in the translation quality scores of the parallel sentences in the original training corpus obtained by the same group of general language models, carrying out normalization processing on the translation quality scores of the parallel sentences obtained by the same group of general language models to obtain normalized translation quality scores; and fusing the normalized original text quality scores and normalized translated text quality scores of the parallel sentences corresponding to each group of general language models to obtain the quality scores corresponding to the parallel sentences.
In one embodiment, the quality filtering module 1504 is further configured to sum the textual quality scores and the translation quality scores of each set of generic language models to obtain a set level score; obtaining a weighting coefficient corresponding to each group of general language models; and carrying out weighted summation on the group-level scores of the parallel sentences corresponding to each group of the universal language models based on the weighting coefficients corresponding to each group of the universal language models, and obtaining the quality scores corresponding to the parallel sentences.
In one embodiment, when the generic language model is a statistical language model obtained based on a high-quality corpus, the quality filtering module 1504 is further configured to sequentially obtain parallel sentences from the original training corpus; inputting original text sentences in the parallel sentences into a statistical language model of the original text, and obtaining an original text quality score of the parallel sentences based on the conditional frequency corresponding to each word in the original text sentences through the statistical language model of the original text; inputting translated sentences in parallel sentences into a statistical language model of the translation, and obtaining the translation quality score of the parallel sentences based on the conditional frequency corresponding to each word in the translated sentences through the statistical language model of the translation; and fusing the original text quality score and the translated text quality score of each parallel sentence to obtain the quality score of the corresponding statistical language model of each parallel sentence.
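By way of illustration of the statistical-language-model case, the sketch below scores a sentence with a tiny bigram model trained on a high-quality corpus. The add-one smoothing and the use of average log conditional frequency as the score are assumptions made for the sketch, not details specified by this embodiment:

```python
import math
from collections import Counter

class BigramLM:
    """Tiny bigram statistical language model trained on a high-quality corpus."""

    def __init__(self, sentences):
        # sentences: iterable of token lists
        self.unigrams = Counter()
        self.bigrams = Counter()
        for words in sentences:
            padded = ["<s>"] + list(words)
            self.unigrams.update(padded)
            self.bigrams.update(zip(padded, padded[1:]))

    def score(self, words):
        """Average log conditional frequency of each word given its
        predecessor, with add-one smoothing so unseen bigrams do not
        zero out the score."""
        padded = ["<s>"] + list(words)
        vocab = len(self.unigrams) + 1
        logp = 0.0
        for prev, cur in zip(padded, padded[1:]):
            logp += math.log((self.bigrams[(prev, cur)] + 1) /
                             (self.unigrams[prev] + vocab))
        return logp / max(len(words), 1)
```

The original-text side and the translation side would each use such a model trained on the corresponding language, with the two scores then fused as described above.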
In one embodiment, when the generic language model is an autoregressive language model, the quality filtering module 1504 is further configured to sequentially obtain parallel sentences from the original training corpus; inputting original text sentences in parallel sentences into an autoregressive language model of the original text, predicting the conditional probability of each word from left to right or from right to left in the original text sentence through the autoregressive language model of the original text, and obtaining the original text quality score of the parallel sentences according to the conditional probability corresponding to each word; inputting translated sentences in parallel sentences into an autoregressive language model of the translation, predicting the conditional probabilities of each word from left to right or from right to left in the translated sentences through the autoregressive language model of the translation, and obtaining the translation quality scores of the parallel sentences according to the conditional probabilities corresponding to each word; and fusing the original text quality score and the translated text quality score of each parallel sentence to obtain the quality score of the autoregressive language model corresponding to each parallel sentence.
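For the autoregressive case, one possible realization scores a sentence by its average per-token log-probability under a pretrained left-to-right model. The choice of GPT-2 via the Hugging Face transformers library is illustrative only and is not prescribed by this embodiment:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def autoregressive_quality_score(sentence: str) -> float:
    """Score a sentence by its average per-token log-probability under a
    left-to-right language model (higher means more fluent)."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean token-level
        # cross-entropy; its negation is the average log-probability.
        loss = model(ids, labels=ids).loss
    return -loss.item()
```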
In one embodiment, when the generic language model is a self-coding language model, the quality filtering module 1504 is further configured to sequentially obtain parallel sentences from the original training corpus; sequentially taking each word in the original sentence of the parallel sentence as a masking word, inputting the masked original sentence into a self-coding language model of the original sentence, outputting a prediction probability corresponding to the masking word through the self-coding language model of the original sentence, and obtaining an original quality score of the parallel sentence according to the prediction probability corresponding to each masking word; sequentially taking each word in the translated sentence of the parallel sentence as a masking word, inputting the masked translated sentence into a self-coding language model of the translation, outputting a prediction probability corresponding to the masking word through the self-coding language model of the translation, and obtaining a translation quality score of the parallel sentence according to the prediction probability corresponding to each masking word; and fusing the original text quality score and the translated text quality score of each parallel sentence to obtain the quality score of the self-coding language model corresponding to each parallel sentence.
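For the self-coding case, the masking procedure described above corresponds to a pseudo-log-likelihood score: each word is masked in turn and the model's probability for the original word at that position is accumulated. The sketch below uses a BERT masked language model from the transformers library as an illustrative stand-in; the model choice is an assumption:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
mlm.eval()

def masked_lm_quality_score(sentence: str) -> float:
    """Pseudo-log-likelihood: mask each token in turn and average the
    model's log-probability of the original token at the masked position."""
    ids = tok(sentence, return_tensors="pt").input_ids[0]
    total, count = 0.0, 0
    for i in range(1, len(ids) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
        count += 1
    return total / max(count, 1)
```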
In one embodiment, the domain filtering module 1506 is further configured to score, through a target domain language model of the original text, the original text sentence in each parallel sentence in the training corpus that meets a preset quality condition, so as to obtain a first domain score corresponding to the parallel sentence; scoring original text sentences in each parallel sentence in the training corpus meeting the preset quality condition through a universal field language model of the original text to obtain a second field score corresponding to the parallel sentence; and obtaining the domain scores corresponding to the parallel sentences in the training corpus meeting the preset quality conditions according to the difference between the first domain scores corresponding to the parallel sentences and the second domain scores.
In one embodiment, the domain filtering module 1506 is further configured to score, through a target domain language model of the translation, translation sentences in each parallel sentence in the training corpus that meets a preset quality condition, to obtain a third domain score corresponding to the parallel sentence; scoring translation sentences in each parallel sentence in the training corpus meeting the preset quality condition through a universal field language model of the translation to obtain a fourth field score corresponding to the parallel sentence; and obtaining the domain scores corresponding to the parallel sentences in the training corpus meeting the preset quality conditions according to the difference between the third domain scores and the fourth domain scores corresponding to the parallel sentences.
In one embodiment, the domain filtering module 1506 is further configured to score, through a target domain language model of the original text, the original text sentence in each parallel sentence in the training corpus that meets a preset quality condition, so as to obtain a first domain score corresponding to the parallel sentence; scoring original text sentences in each parallel sentence in the training corpus meeting the preset quality condition through a universal field language model of the original text to obtain a second field score corresponding to the parallel sentence; scoring translation sentences in each parallel sentence in the training corpus meeting the preset quality condition through a target field language model of the translation to obtain a third field score corresponding to the parallel sentence; scoring translation sentences in each parallel sentence in the training corpus meeting the preset quality condition through a universal field language model of the translation to obtain a fourth field score corresponding to the parallel sentence; and fusing the difference between the first domain score and the second domain score corresponding to each parallel sentence and the difference between the third domain score and the fourth domain score to obtain the domain score corresponding to each parallel sentence in the training corpus meeting the preset quality condition.
In one embodiment, the corpus processing apparatus 1500 of the translation model further includes:
The first training module is used for obtaining parallel corpus of the target field, and carrying out model training on the language model to be trained by using the parallel corpus of the target field to obtain the language model of the target field;
the second training module is used for sampling the training corpus meeting the preset quality condition to obtain a sampling corpus, and performing model training on the language model to be trained by using the sampling corpus to obtain the language model in the general field.
The corpus processing device 1500 of the translation model uses at least two sets of trained general language models to jointly score each parallel sentence in the original training corpus; compared with quality filtering based on a single language model, since the model structure of each set of general language models is different, the integrated sets of language models can score parallel sentences comprehensively from different angles, so that corpus meeting the preset quality condition can be filtered out of the original training corpus. Further, on this basis, the domain scores corresponding to the parallel sentences in the training corpus meeting the preset quality condition are obtained through the trained target domain language model and general domain language model, so that the training corpus that belongs to the target field and meets the preset quality condition is further screened from the training corpus meeting the preset quality condition and can be used for model training of the translation model in the target field; compared with filtering based on quality alone, target-field corpus can be screened on the basis of guaranteed high quality, and the resulting corpus can greatly improve the translation performance of the translation model. In addition, by scoring for quality first and then screening the target-field training corpus on top of the high-quality corpus, compared with directly performing mixed quality-and-field scoring on the original training corpus, this process better ensures that the output corpus is both of high quality and closest to the target field.
For specific limitations of the corpus processing apparatus of the translation model, reference may be made to the above limitations of the corpus processing method of the translation model, which are not repeated here. The modules in the corpus processing apparatus of the translation model may be implemented completely or partially by software, hardware, or a combination thereof. The above modules may be embedded in or independent of a processor in the computer device in the form of hardware, or may be stored in a memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be the server or terminal of fig. 1, and the internal structure diagram thereof may be as shown in fig. 16. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a method for processing a training corpus of a translation model.
It will be appreciated by those skilled in the art that the structure shown in fig. 16 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application is applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the steps in the above-described method embodiments.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, or the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above embodiments merely represent several implementations of the present application, and although they are described in relatively specific detail, they should not be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art may make various modifications and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Accordingly, the protection scope of the present application shall be subject to the appended claims.

Claims (15)

1. A method for processing a training corpus of a translation model, the method comprising:
acquiring an original training corpus used for training a translation model;
obtaining at least two groups of trained universal language models, obtaining quality scores corresponding to parallel sentences in the original training corpus through each group of universal language models, and filtering the original training corpus according to the quality scores to obtain training corpus meeting preset quality conditions, wherein the model structures of each group of universal language models are different;
obtaining field scores corresponding to parallel sentences in the training corpus meeting the preset quality conditions through the trained target field language model and the universal field language model, and screening the training corpus which belongs to the target field and meets the preset quality conditions from the training corpus meeting the preset quality conditions according to the field scores, wherein the target field language model and the universal field language model have the same model structure;
The selected training corpus is used for obtaining the translation model of the target field after model training is carried out on the translation model.
2. The method of claim 1, wherein the obtaining, by each set of generic language models, a quality score corresponding to each parallel sentence in the original training corpus comprises:
scoring original sentence and translated sentence in each parallel sentence in the original training corpus through each group of general language models, and respectively obtaining original quality score and translated quality score of each group of general language models corresponding to the parallel sentence;
and fusing the original text quality scores and the translated text quality scores of the parallel sentences corresponding to each group of general language models to obtain the quality scores corresponding to the parallel sentences.
3. The method according to claim 2, wherein the scoring, by each set of generic language models, the original sentence and the translated sentence in each parallel sentence in the original training corpus, respectively, to obtain an original quality score and a translated quality score of each set of generic language models corresponding to the parallel sentence, respectively, includes:
scoring the original text sentences in each parallel sentence in the original training corpus through the original text language model in each group of general language models, and respectively obtaining the original text quality scores of the parallel sentences;
And scoring the translated sentences in each parallel sentence in the original training corpus through the translated language models in each group of general language models, and obtaining the translated quality scores of the parallel sentences respectively.
4. The method according to claim 2, wherein the fusing the textual quality scores and the translation quality scores of the parallel sentences corresponding to each set of the universal language models to obtain quality scores corresponding to the parallel sentences comprises:
normalizing the original text quality scores of the parallel sentences obtained by the same group of general language models according to the highest score and the lowest score in the original text quality scores of the parallel sentences in the original training corpus obtained by the same group of general language models to obtain normalized original text quality scores;
normalizing the translation quality scores of the parallel sentences obtained by the same group of general language models according to the highest score and the lowest score in the translation quality scores of the parallel sentences in the original training corpus obtained by the same group of general language models to obtain normalized translation quality scores;
and merging the normalized original text quality scores and the normalized translated text quality scores of the parallel sentences corresponding to each group of general language models to obtain the quality scores corresponding to the parallel sentences.
5. The method according to claim 2, wherein the fusing the textual quality scores and the translation quality scores of the parallel sentences corresponding to each set of the universal language models to obtain quality scores corresponding to the parallel sentences comprises:
summing the original text quality scores and the translated text quality scores of each group of general language models to obtain group-level scores;
obtaining a weighting coefficient corresponding to each group of general language models;
and carrying out weighted summation on the group-level scores of the parallel sentences corresponding to each group of the universal language models based on the weighting coefficients corresponding to each group of the universal language models, and obtaining the quality scores corresponding to the parallel sentences.
6. The method according to claim 1, wherein when the generic language model is a statistical language model obtained based on a high-quality corpus, the obtaining, by each set of generic language models, a quality score corresponding to each parallel sentence in the original training corpus includes:
sequentially obtaining parallel sentences from the original training corpus;
inputting an original sentence in the parallel sentence into a statistical language model of the original text, and obtaining an original quality score of the parallel sentence based on a conditional frequency corresponding to each word in the original sentence through the statistical language model of the original text;
inputting the translated sentence in the parallel sentence into a statistical language model of the translation, and obtaining a translation quality score of the parallel sentence based on a conditional frequency corresponding to each word in the translated sentence through the statistical language model of the translation;
and fusing the original text quality scores and the translated text quality scores of the parallel sentences to obtain the quality scores of the parallel sentences corresponding to the statistical language model.
7. The method according to claim 1, wherein when the generic language model is an autoregressive language model, the obtaining, by each set of generic language models, a quality score corresponding to each parallel sentence in the original training corpus includes:
sequentially obtaining parallel sentences from the original training corpus;
inputting an original sentence in the parallel sentence into an autoregressive language model of the original sentence, predicting the conditional probability of each word from left to right or from right to left in the original sentence through the autoregressive language model of the original sentence, and obtaining an original quality score of the parallel sentence according to the conditional probability corresponding to each word;
inputting the translated sentence in the parallel sentence into an autoregressive language model of the translated sentence, predicting the conditional probability of each word from left to right or from right to left in the translated sentence through the autoregressive language model of the translated sentence, and obtaining the translated quality score of the parallel sentence according to the conditional probability corresponding to each word;
And fusing the original text quality scores and the translated text quality scores of the parallel sentences to obtain the quality scores of the autoregressive language models corresponding to the parallel sentences.
8. The method according to claim 1, wherein when the generic language model is a self-coding language model, the obtaining, by each set of generic language models, a quality score corresponding to each parallel sentence in the original training corpus includes:
sequentially obtaining parallel sentences from the original training corpus;
sequentially taking each word in the original sentence of the parallel sentence as a masking word, inputting the masked original sentence into a self-coding language model of the original sentence, outputting a prediction probability corresponding to the masking word through the self-coding language model of the original sentence, and obtaining an original quality score of the parallel sentence according to the prediction probability corresponding to each masking word;
sequentially taking each word in the translation sentence of the parallel sentence as a masking word, inputting the masked translation sentence into a self-coding language model of the translation, outputting a prediction probability corresponding to the masking word through the self-coding language model of the translation, and obtaining a translation quality score of the parallel sentence according to the prediction probability corresponding to each masking word;
And fusing the original text quality score and the translated text quality score of each parallel sentence to obtain the quality score of each parallel sentence corresponding to the self-coding language model.
9. The method according to claim 1, wherein the obtaining, by the trained target domain language model and the universal domain language model, the domain score corresponding to each parallel sentence in the training corpus satisfying the preset quality condition includes:
scoring original text sentences in each parallel sentence in the training corpus meeting a preset quality condition through a target field language model of the original text to obtain a first field score corresponding to the parallel sentence;
scoring original text sentences in each parallel sentence in the training corpus meeting a preset quality condition through a general field language model of the original text to obtain a second field score corresponding to the parallel sentence;
and obtaining the domain scores corresponding to the parallel sentences in the training corpus meeting the preset quality conditions according to the difference between the first domain scores corresponding to the parallel sentences and the second domain scores.
10. The method according to claim 1, wherein the obtaining, by the trained target domain language model and the universal domain language model, the domain score corresponding to each parallel sentence in the training corpus satisfying the preset quality condition includes:
Scoring translation sentences in each parallel sentence in the training corpus meeting a preset quality condition through a target domain language model of the translation to obtain a third domain score corresponding to the parallel sentence;
scoring translation sentences in each parallel sentence in the training corpus meeting a preset quality condition through a universal field language model of the translation to obtain a fourth field score corresponding to the parallel sentence;
and obtaining the domain scores corresponding to the parallel sentences in the training corpus meeting the preset quality condition according to the difference between the third domain score corresponding to the parallel sentences and the fourth domain score.
11. The method according to claim 1, wherein the obtaining, by the trained target domain language model and the universal domain language model, the domain score corresponding to each parallel sentence in the training corpus satisfying the preset quality condition includes:
scoring original text sentences in each parallel sentence in the training corpus meeting a preset quality condition through a target field language model of the original text to obtain a first field score corresponding to the parallel sentence;
Scoring original text sentences in each parallel sentence in the training corpus meeting a preset quality condition through a general field language model of the original text to obtain a second field score corresponding to the parallel sentence;
scoring translation sentences in each parallel sentence in the training corpus meeting a preset quality condition through a target domain language model of the translation to obtain a third domain score corresponding to the parallel sentence;
scoring translation sentences in each parallel sentence in the training corpus meeting a preset quality condition through a universal field language model of the translation to obtain a fourth field score corresponding to the parallel sentence;
and fusing the difference between the first domain score and the second domain score corresponding to each parallel sentence and the difference between the third domain score and the fourth domain score to obtain the domain score corresponding to each parallel sentence in the training corpus meeting the preset quality condition.
12. The method according to any one of claims 1 to 11, further comprising:
obtaining a parallel corpus of a target field, and performing model training on a language model to be trained by using the parallel corpus of the target field to obtain the language model of the target field;
Sampling the training corpus meeting the preset quality condition to obtain a sampling corpus, and carrying out model training on the language model to be trained by using the sampling corpus to obtain the universal field language model.
13. A corpus processing apparatus for a translation model, the apparatus comprising:
the corpus acquisition module is used for acquiring an original training corpus used for training the translation model;
the quality filtering module is used for acquiring at least two groups of trained general language models, obtaining quality scores corresponding to parallel sentences in the original training corpus through each group of general language models, filtering the original training corpus according to the quality scores to obtain training corpus meeting preset quality conditions, wherein the model structures of each group of general language models are different;
the domain screening module is used for obtaining domain scores corresponding to parallel sentences in the training corpus meeting the preset quality conditions through a trained target domain language model and a universal domain language model, and screening the training corpus which belongs to the target domain and meets the preset quality conditions from the training corpus meeting the preset quality conditions according to the domain scores, wherein the target domain language model and the universal domain language model have the same model structure; the screened training corpus is used for performing model training on the translation model to obtain the translation model in the target field.
14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 12 when the computer program is executed.
15. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method of any one of claims 1 to 12.
CN202110553522.4A 2021-05-20 2021-05-20 Corpus processing method, device and equipment for translation model and storage medium Active CN113761944B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110553522.4A CN113761944B (en) 2021-05-20 2021-05-20 Corpus processing method, device and equipment for translation model and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110553522.4A CN113761944B (en) 2021-05-20 2021-05-20 Corpus processing method, device and equipment for translation model and storage medium

Publications (2)

Publication Number Publication Date
CN113761944A CN113761944A (en) 2021-12-07
CN113761944B true CN113761944B (en) 2024-03-15

Family

ID=78787149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110553522.4A Active CN113761944B (en) 2021-05-20 2021-05-20 Corpus processing method, device and equipment for translation model and storage medium

Country Status (1)

Country Link
CN (1) CN113761944B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9235567B2 (en) * 2013-01-14 2016-01-12 Xerox Corporation Multi-domain machine translation model adaptation
US11037028B2 (en) * 2018-12-31 2021-06-15 Charles University Faculty of Mathematics and Physics Computer-implemented method of creating a translation model for low resource language pairs and a machine translation system using this translation model

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009140499A (en) * 2007-12-07 2009-06-25 Toshiba Corp Method and apparatus for training target language word inflection model based on bilingual corpus, tlwi method and apparatus, and translation method and system for translating source language text into target language
JP2017021422A (en) * 2015-07-07 2017-01-26 国立研究開発法人情報通信研究機構 Statistical translation optimization device, statistical translation system, and computer program
CN110874536A (en) * 2018-08-29 2020-03-10 阿里巴巴集团控股有限公司 Corpus quality evaluation model generation method and bilingual sentence pair inter-translation quality evaluation method
CN110263349A (en) * 2019-03-08 2019-09-20 腾讯科技(深圳)有限公司 Corpus assessment models training method, device, storage medium and computer equipment
CN111767712A (en) * 2019-04-02 2020-10-13 北京地平线机器人技术研发有限公司 Business data screening method and device based on language model, medium and equipment
CN110162800A (en) * 2019-05-08 2019-08-23 北京百度网讯科技有限公司 The training method and device of translation model
CN111460838A (en) * 2020-04-23 2020-07-28 腾讯科技(深圳)有限公司 Pre-training method and device of intelligent translation model and storage medium
CN112151021A (en) * 2020-09-27 2020-12-29 北京达佳互联信息技术有限公司 Language model training method, speech recognition device and electronic equipment
CN112347795A (en) * 2020-10-04 2021-02-09 北京交通大学 Machine translation quality evaluation method, device, equipment and medium
CN112257472A (en) * 2020-11-13 2021-01-22 腾讯科技(深圳)有限公司 Training method of text translation model, and text translation method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Joint Training for Neural Machine Translation Models with Monolingual Data; Zhirui Zhang; Thirty-Second AAAI Conference on Artificial Intelligence; 2018-04-25; Vol. 32, No. 1; pp. 555-562 *
Understanding and Improving Encoder Layer Fusion in Sequence-to-Sequence Learning; Xuebo Liu, Longyue Wang; Computer Science; 2020-12-29; pp. 1-14 *
A Bilingual Sentence-Pair Selection Method Fusing a Translation Model and a Language Model; Yao Liang; Hong Yu; Liu Hao; Liu Le; Yao Jianmin; Journal of Chinese Information Processing; 2016-09-15 (05); pp. 149-156 *
Unsupervised Monocular Depth Estimation Fusing Dilated Convolutional Networks and SLAM; Dai Renyue; Fang Zhijun; Gao Yongbin; Laser & Optoelectronics Progress; 2020-12-31 (06); pp. 114-122 *

Also Published As

Publication number Publication date
CN113761944A (en) 2021-12-07

Similar Documents

Publication Publication Date Title
CN108416065B (en) Hierarchical neural network-based image-sentence description generation system and method
Badjatiya et al. Attention-based neural text segmentation
CN106484674B (en) Chinese electronic medical record concept extraction method based on deep learning
CN109933789B (en) Neural network-based judicial domain relation extraction method and system
Tamkin et al. Language through a prism: A spectral approach for multiscale language representations
CN112270196A (en) Entity relationship identification method and device and electronic equipment
CN112784532B (en) Multi-head attention memory system for short text sentiment classification
CN109741824B (en) Medical inquiry method based on machine learning
Fu et al. Long short-term memory network over rhetorical structure theory for sentence-level sentiment analysis
US20220138185A1 (en) Scene graph modification based on natural language commands
US20220147713A1 (en) Social bias mitigation in textual models
CN114023412A (en) ICD code prediction method and system based on joint learning and denoising mechanism
Adi et al. Analysis of sentence embedding models using prediction tasks in natural language processing
CN114757210A (en) Translation model training method, sentence translation method, device, equipment and program
US20220156489A1 (en) Machine learning techniques for identifying logical sections in unstructured data
Avramidis Sentence-level ranking with quality estimation
CN113761944B (en) Corpus processing method, device and equipment for translation model and storage medium
CN112765201A (en) Method and device for analyzing SQL (structured query language) statement into specific field query statement
CN110083842B (en) Translation quality detection method, device, machine translation system and storage medium
Afra et al. Developing Sentiment Analysis of Indonesian Social Media Based on Convolutional Neural Network for Smarter Society
CN115938530A (en) Intelligent medical image diagnosis opinion automatic generation method for resisting backdoor attack
KR20200040032A (en) A method ofr classification of korean postings based on bidirectional lstm-attention
CN114942977A (en) Multitask document level relation extraction method and device based on support sentence prediction
Dutta et al. Unfolding sarcasm in twitter using c-rnn approach
CN112380874B (en) Multi-person-to-speech analysis method based on graph convolution network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant