CN113761944A - Corpus processing method, apparatus, device and storage medium for translation model

Info

Publication number: CN113761944A
Application number: CN202110553522.4A
Authority: CN (China)
Other languages: Chinese (zh)
Other versions: CN113761944B (granted)
Prior art keywords: parallel, sentences, original, quality, domain
Inventors: 王龙跃 (Wang Longyue), 刘宏烨 (Liu Hongye)
Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Legal status: Granted; Active

Classifications

    • G06F 40/42: Data-driven translation (G: Physics; G06: Computing, calculating or counting; G06F: Electric digital data processing; G06F 40/00: Handling natural language data; G06F 40/40: Processing or translation of natural language)
    • G06F 40/44: Statistical methods, e.g. probability models
    • G06F 40/51: Translation evaluation
    • G06F 40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06N 3/045: Combinations of networks (G06N: Computing arrangements based on specific computational models; G06N 3/00: Computing arrangements based on biological models; G06N 3/02: Neural networks; G06N 3/04: Architecture, e.g. interconnection topology)
    • G06N 3/08: Learning methods

Abstract

The present application relates to a corpus processing method, apparatus, device, and storage medium for a translation model, in the technical field of natural language processing. The method includes the following steps: acquiring an original training corpus for training a translation model; acquiring at least two groups of trained universal language models, each group having a different model structure, obtaining a quality score for each parallel sentence in the original training corpus through each group of universal language models, and filtering the original training corpus according to the quality scores to obtain a high-quality training corpus; and obtaining a domain score for each parallel sentence in the high-quality training corpus through a trained target-domain language model and a general-domain language model, and screening a high-quality training corpus of the target domain from the high-quality training corpus according to the domain scores. With this method, target-domain corpora can be screened on a guaranteed high-quality basis, so that the resulting corpus can greatly improve the translation performance of the translation model.

Description

Corpus processing method, apparatus, device and storage medium for translation model
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a corpus processing method and apparatus for a translation model, a computer device, and a storage medium.
Background
Currently, neural networks are widely used in artificial intelligence, including speech recognition, computer vision, and natural language processing, and neural network models perform well in many natural language processing tasks, such as machine translation. In machine translation, as the scale of translation corpora has grown continuously in recent years, translation model performance improved markedly at first, showing that large-scale corpora play a very large role in training translation models; at later stages, however, further enlarging the corpus no longer improves translation performance.
After investigation, the inventors identified two causes: 1) sentence quality in large-scale corpora is uneven, with considerable noisy data; 2) in large-scale corpora, translation corpora drawn from different domains differ in distribution, and the domains are not uniformly distributed.
At present, large-scale corpora are processed only by cleaning them with manual rules or by filtering them on the single dimension of quality with a single language model. Such processing is not comprehensive, so the resulting corpus still fails to improve translation performance.
Disclosure of Invention
Therefore, it is necessary to provide a corpus processing method, apparatus, computer device, and storage medium for a translation model that can obtain, alongside a high-quality corpus, a corpus suitable for training a translation model in a target domain, thereby improving the performance of the translation model in that domain.
A corpus processing method of a translation model, the method comprising:
acquiring an original training corpus for training a translation model;
obtaining at least two groups of trained universal language models, each group having a different model structure; obtaining a quality score corresponding to each parallel sentence in the original training corpus through each group of universal language models; and filtering the original training corpus according to the quality scores to obtain a training corpus meeting a preset quality condition;
obtaining a domain score corresponding to each parallel sentence in the training corpus meeting the preset quality condition through a trained target-domain language model and a trained general-domain language model, the two models having the same structure, and screening, according to the domain scores, the training corpus that belongs to a target domain and meets the preset quality condition out of the training corpus meeting the preset quality condition;
wherein the screened training corpus is used for model training of the translation model to obtain a translation model for the target domain.
A corpus processing apparatus for a translation model, the apparatus comprising:
the corpus acquiring module is used for acquiring an original training corpus used for training the translation model;
the quality filtering module is configured to acquire at least two groups of trained universal language models, each group having a different model structure, obtain a quality score corresponding to each parallel sentence in the original training corpus through each group of universal language models, and filter the original training corpus according to the quality scores to obtain a training corpus meeting a preset quality condition;
the domain screening module is configured to obtain a domain score corresponding to each parallel sentence in the training corpus meeting the preset quality condition through a trained target-domain language model and a trained general-domain language model, the two models having the same structure, and to screen, according to the domain scores, the training corpus that belongs to a target domain and meets the preset quality condition out of the training corpus meeting the preset quality condition; the screened training corpus is used for model training of the translation model to obtain a translation model for the target domain.
In an embodiment, the quality filtering module is further configured to score, through each group of universal language models, the original sentence and the translated sentence in each parallel sentence in the original training corpus, obtaining, for each parallel sentence, an original-text quality score and a translation quality score from each group of universal language models; and to fuse the original-text quality scores and translation quality scores from each group of universal language models for each parallel sentence to obtain the quality score corresponding to that parallel sentence.
In one embodiment, the quality filtering module comprises an original text scoring unit and a translated text scoring unit;
the original text scoring unit is used for scoring original text sentences in parallel sentences in the original training corpus respectively through the original text language models in each group of general language models to respectively obtain original text quality scores of the parallel sentences;
and the translation scoring unit is used for scoring the translation sentences in each parallel sentence in the original training corpus respectively through the translation language models in each group of universal language models to respectively obtain the translation quality scores of the parallel sentences.
In one embodiment, the quality filtering module is further configured to perform normalization processing on the original text quality scores of the parallel sentences obtained by the same group of universal language models according to the highest score and the lowest score of the original text quality scores of the parallel sentences in the original corpus obtained by the same group of universal language models, so as to obtain normalized original text quality scores; normalizing the translation quality scores of the parallel sentences obtained by the same group of universal language models according to the highest score and the lowest score in the translation quality scores of the parallel sentences in the original training corpus obtained by the same group of universal language models to obtain normalized translation quality scores; and fusing the normalized original text quality scores and the normalized translation quality scores of each group of general language models corresponding to the parallel sentences to obtain the quality scores corresponding to the parallel sentences.
In one embodiment, the quality filtering module is further configured to sum, for each parallel sentence, the original-text quality score and the translation quality score of each group of universal language models to obtain a group-level score; acquire a weighting coefficient corresponding to each group of universal language models; and perform a weighted summation of the group-level scores of each group of universal language models for the parallel sentence, based on the weighting coefficients, to obtain the quality score corresponding to the parallel sentence.
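As a concrete illustration of the min-max normalization and weighted fusion described in the two embodiments above, the following minimal Python sketch computes per-sentence quality scores. The function names, data layout, and weight values are illustrative assumptions, not taken from the patent.

```python
# A minimal sketch of min-max normalization plus weighted group fusion.
# All identifiers here are illustrative, not from the patent.

def normalize(scores):
    """Min-max normalize one model group's scores over the whole corpus."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def fuse_quality_scores(src_scores_per_group, tgt_scores_per_group, weights):
    """src/tgt_scores_per_group: one per-sentence score list per model group;
    weights: one weighting coefficient per group."""
    n = len(src_scores_per_group[0])
    fused = [0.0] * n
    for src, tgt, w in zip(src_scores_per_group, tgt_scores_per_group, weights):
        src_n, tgt_n = normalize(src), normalize(tgt)
        for i in range(n):
            # group-level score = original-text score + translation score,
            # then weighted across groups
            fused[i] += w * (src_n[i] + tgt_n[i])
    return fused
```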
In one embodiment, when the universal language model is a statistical language model obtained from a high-quality corpus, the quality filtering module is further configured to sequentially obtain parallel sentences from the original training corpus; input the original sentence of a parallel sentence into a statistical language model of the original text, which computes the original-text quality score of the parallel sentence based on the conditional frequency of each word in the original sentence; input the translated sentence of the parallel sentence into a statistical language model of the translation, which computes the translation quality score of the parallel sentence based on the conditional frequency of each word in the translated sentence; and fuse the original-text quality score and the translation quality score of the parallel sentence to obtain the quality score of the parallel sentence for the statistical language model.
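The following toy sketch illustrates the kind of conditional-frequency scoring such a statistical language model performs. It uses a bigram (N=2) model with add-alpha smoothing; that choice, and all function names, are our simplifying assumptions rather than the patent's implementation.

```python
# Toy bigram language model: each word is scored by its smoothed
# conditional frequency given the previous word.
import math
from collections import Counter

def train_bigram(corpus_sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus_sentences:
        words = ["<s>"] + sent.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def score_sentence(sentence, unigrams, bigrams, vocab_size, alpha=1.0):
    words = ["<s>"] + sentence.split()
    log_prob = 0.0
    for prev, cur in zip(words, words[1:]):
        # add-alpha smoothed conditional frequency P(cur | prev)
        p = (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
        log_prob += math.log(p)
    return log_prob / max(len(words) - 1, 1)  # length-normalized quality score
```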
In one embodiment, when the universal language model is an autoregressive language model, the quality filtering module is further configured to sequentially obtain parallel sentences from the original training corpus; input the original sentence of a parallel sentence into an autoregressive language model of the original text, which predicts the conditional probability of each word in the original sentence appearing from left to right or from right to left and derives the original-text quality score of the parallel sentence from the conditional probability of each word; input the translated sentence of the parallel sentence into an autoregressive language model of the translation, which predicts the conditional probability of each word in the translated sentence appearing from left to right or from right to left and derives the translation quality score of the parallel sentence from the conditional probability of each word; and fuse the original-text quality score and the translation quality score of the parallel sentence to obtain the quality score of the parallel sentence for the autoregressive language model.
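For illustration, the sketch below scores a sentence with a pretrained autoregressive language model via the HuggingFace transformers library. The choice of GPT-2 is our assumption; the patent does not prescribe a specific autoregressive model.

```python
# Hedged sketch: left-to-right scoring with a pretrained GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def autoregressive_score(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # the model predicts each token left-to-right; the returned loss is
        # the mean negative log-likelihood over those conditional probabilities
        loss = model(ids, labels=ids).loss
    return -loss.item()  # higher (less negative) = more fluent
```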
In one embodiment, when the universal language model is a self-coding language model, the quality filtering module is further configured to sequentially obtain parallel sentences from the original training corpus; take each word of the original sentence of a parallel sentence in turn as the masked word, input the masked original sentence into a self-coding language model of the original text, output the prediction probability of the masked word through that model, and derive the original-text quality score of the parallel sentence from the prediction probability of each masked word; take each word of the translated sentence in turn as the masked word, input the masked translated sentence into a self-coding language model of the translation, output the prediction probability of the masked word through that model, and derive the translation quality score of the parallel sentence from the prediction probability of each masked word; and fuse the original-text quality score and the translation quality score of the parallel sentence to obtain the quality score of the parallel sentence for the self-coding language model.
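The mask-one-word-at-a-time scoring described above can be sketched as a pseudo-log-likelihood, here with BERT via the transformers library. The model name and helper function are illustrative assumptions.

```python
# Hedged sketch: pseudo-log-likelihood scoring with a masked LM.
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def masked_lm_score(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids[0]
    total, count = 0.0, 0
    for i in range(1, len(ids) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id   # mask one token at a time
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        total += log_probs[ids[i]].item()     # prediction prob of true token
        count += 1
    return total / max(count, 1)              # length-normalized score
```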
In one embodiment, the domain screening module is further configured to score, through a target-domain language model of the original text, the original sentence in each parallel sentence in the training corpus meeting the preset quality condition, obtaining a first domain score corresponding to the parallel sentence; score, through a general-domain language model of the original text, the original sentence in each parallel sentence in the training corpus meeting the preset quality condition, obtaining a second domain score corresponding to the parallel sentence; and obtain the domain score corresponding to each parallel sentence in the corpus meeting the preset quality condition from the difference between the first domain score and the second domain score of that parallel sentence.
In one embodiment, the domain screening module is further configured to score, through a target-domain language model of the translated text, the translated sentence in each parallel sentence in the training corpus meeting the preset quality condition, obtaining a third domain score corresponding to the parallel sentence; score, through a general-domain language model of the translated text, the translated sentence in each parallel sentence in the training corpus meeting the preset quality condition, obtaining a fourth domain score corresponding to the parallel sentence; and obtain the domain score corresponding to each parallel sentence in the corpus meeting the preset quality condition from the difference between the third domain score and the fourth domain score of that parallel sentence.
In one embodiment, the domain screening module is further configured to score, through a target-domain language model of the original text, the original sentence in each parallel sentence in the training corpus meeting the preset quality condition, obtaining a first domain score corresponding to the parallel sentence; score, through a general-domain language model of the original text, the original sentence in each parallel sentence, obtaining a second domain score; score, through a target-domain language model of the translated text, the translated sentence in each parallel sentence, obtaining a third domain score; score, through a general-domain language model of the translated text, the translated sentence in each parallel sentence, obtaining a fourth domain score; and fuse the difference between the first and second domain scores with the difference between the third and fourth domain scores of each parallel sentence to obtain the domain score corresponding to each parallel sentence in the corpus meeting the preset quality condition, as sketched below.
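The arithmetic of this fused source-and-target-side embodiment can be sketched as follows, in the spirit of cross-entropy-difference data selection. This is a hedged reading of the text: the patent fixes the differencing but not the exact fusion, and the additive combination below is our assumption.

```python
# Hedged sketch of bidirectional domain scoring (names are ours).

def domain_score(src_in: float, src_gen: float,
                 tgt_in: float, tgt_gen: float) -> float:
    """Each argument is a per-sentence log-probability from, respectively:
    the target-domain and general-domain LMs of the original text, and the
    target-domain and general-domain LMs of the translation."""
    source_side = src_in - src_gen    # first score minus second score
    target_side = tgt_in - tgt_gen    # third score minus fourth score
    return source_side + target_side  # fused: higher = closer to the domain
```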
In one embodiment, the apparatus further comprises:
the first training module is used for obtaining the parallel corpora of the target field and carrying out model training on a language model to be trained by using the parallel corpora of the target field to obtain the language model of the target field;
and the second training module is used for sampling the training corpora meeting the preset quality condition to obtain a sampling corpora, and using the sampling corpora to carry out model training on the language model to be trained to obtain the universal field language model.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring an original training corpus for training a translation model;
obtaining at least two groups of trained universal language models, each group having a different model structure; obtaining a quality score corresponding to each parallel sentence in the original training corpus through each group of universal language models; and filtering the original training corpus according to the quality scores to obtain a training corpus meeting a preset quality condition;
obtaining a domain score corresponding to each parallel sentence in the training corpus meeting the preset quality condition through a trained target-domain language model and a trained general-domain language model, the two models having the same structure, and screening, according to the domain scores, the training corpus that belongs to a target domain and meets the preset quality condition out of the training corpus meeting the preset quality condition;
wherein the screened training corpus of the target domain meeting the preset quality condition is used for model training of the translation model to obtain a translation model for the target domain.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring an original training corpus for training a translation model;
obtaining at least two groups of trained universal language models, each group having a different model structure; obtaining a quality score corresponding to each parallel sentence in the original training corpus through each group of universal language models; and filtering the original training corpus according to the quality scores to obtain a training corpus meeting a preset quality condition;
obtaining a domain score corresponding to each parallel sentence in the training corpus meeting the preset quality condition through a trained target-domain language model and a trained general-domain language model, the two models having the same structure, and screening, according to the domain scores, the training corpus that belongs to a target domain and meets the preset quality condition out of the training corpus meeting the preset quality condition;
wherein the screened training corpus is used for model training of the translation model to obtain a translation model for the target domain.
A computer program comprising computer instructions stored in a computer-readable storage medium, the computer instructions being read by a processor of a computer apparatus from the computer-readable storage medium, the computer instructions being executed by the processor to cause the computer apparatus to perform the steps of the corpus processing method of the translation model described above.
According to the above corpus processing method, apparatus, computer device, and storage medium for a translation model, at least two groups of trained universal language models jointly score each parallel sentence in the original training corpus. Compared with quality filtering based on a single language model, because each group of universal language models has a different model structure, the integrated multiple groups of language models can score parallel sentences comprehensively from different angles, ensuring that a training corpus meeting the preset quality condition, i.e., a high-quality training corpus, can be filtered out of the original training corpus. On this basis, the domain score for each parallel sentence in the high-quality training corpus is obtained through the trained target-domain language model and general-domain language model, so that the training corpus that belongs to the target domain and meets the preset quality condition is further screened out of the high-quality training corpus and can be used to train the translation model for the target domain. Compared with filtering on quality alone, screening target-domain corpora on a high-quality basis allows the resulting corpus to greatly improve the translation performance of the translation model. In addition, by first scoring for quality and only then screening target-domain corpora on that high-quality basis, rather than directly mixing quality and domain scores over the original training corpus, this pipeline maximally ensures that the output corpus is both high quality and closest to the target domain.
Drawings
FIG. 1 is a diagram illustrating an exemplary environment for a corpus processing method of a translation model;
FIG. 2 is a diagram illustrating the performance of a translation model as a function of corpus size in one embodiment;
FIG. 3 is a flowchart illustrating a corpus processing method of the translation model according to an embodiment;
FIG. 4 is a diagram illustrating corpus processing for the translation model in one embodiment;
FIG. 5 is a flowchart illustrating a process of obtaining quality scores corresponding to parallel sentences in an original corpus, according to an embodiment;
FIG. 6 is a flowchart illustrating a process of obtaining quality scores of original documents and quality scores of translated documents of each set of universal language models corresponding to parallel sentences according to an embodiment;
FIG. 7 is a flowchart illustrating a process of fusing the original quality scores and the translated quality scores of each group of universal language models corresponding to each parallel sentence to obtain quality scores corresponding to each parallel sentence according to an embodiment;
FIG. 8 is a schematic flow diagram of quality filtering in one embodiment;
FIG. 9 is a schematic flow chart illustrating a process of fusing the original text quality scores and the translated text quality scores of each group of the universal language models corresponding to each parallel sentence to obtain quality scores corresponding to each parallel sentence according to another embodiment;
FIG. 10 is a flowchart illustrating an embodiment of obtaining quality scores corresponding to parallel sentences in an original corpus using a statistical language model;
FIG. 11 is a flowchart illustrating an embodiment of obtaining quality scores corresponding to parallel sentences in an original corpus using an autoregressive language model;
FIG. 12 is a flowchart illustrating obtaining quality scores corresponding to parallel sentences in an original corpus by using a self-coding language model according to an embodiment;
FIG. 13 is a flowchart illustrating a process of obtaining a domain score corresponding to each parallel sentence in a corpus meeting a predetermined quality condition according to an embodiment;
FIG. 14 is a diagram illustrating domain screening of corpus satisfying predetermined quality criteria in an embodiment;
FIG. 15 is a block diagram showing a corpus processing apparatus for a translation model in accordance with an embodiment;
FIG. 16 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Before describing the embodiments of the present application, some terms used in them are explained as follows:
Deep Learning (DL): a branch of machine learning; a family of algorithms that attempt to perform high-level abstraction of data using multiple processing layers containing complex structures or consisting of multiple non-linear transformations.
Neural Network (NN for short): a deep learning model simulating the structure and function of a biological neural network in the fields of machine learning and cognitive science.
Machine Translation (MT): automatically translating one language text into another language text using a computer or the like.
Statistical Machine Translation (SMT): machine translation technology based on traditional statistical methods. Before the advent of neural network methods, machine translation was primarily based on statistical translation models.
Neural Network Machine Translation (NMT): the latest generation of machine translation technologies based on neural networks.
Recurrent Neural Network (RNN): a network model that casts sequence modeling as time-step modeling, recurrently passing state through the network.
Self-Attention Network (SAN): a neural network structure model based on the self-attention mechanism.
Convolutional Neural Network (CNN): consists of one or more convolutional layers and a top fully-connected layer, and also includes associated weights and pooling layers.
Attention Mechanism (Attention Mechanism): a method for modeling hidden state dependency of an encoder and a decoder in a neural network.
BLEU (Bilingual Evaluation Understudy): a machine translation evaluation metric; the higher the value, the better the translation.
RNNsearch: RNN-based encoder-decoder framework.
LightConv: CNN-based encoder-decoder framework.
Transformer, a SAN network-based encoder-decoder framework, is the most popular sequence-to-sequence generation (sequence-to-sequence generation) model structure at present.
Parallel corpus (parallel corpora): a bilingual or multilingual corpus consisting of original texts and the translations aligned in parallel with them. The alignment may be at the word, sentence, paragraph, or chapter level. A sentence pair consisting of an original text and its translation is called a parallel sentence, and a large number of parallel sentences form a parallel corpus.
Language Model (LM): uses statistical and probabilistic techniques to determine the probability of a given word occurring in a sentence. A language model derives its word predictions by analyzing a body of text data. Colloquially, a language model can judge whether a sentence reads fluently.
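For reference, a language model's sentence probability is conventionally factored by the chain rule (a standard textbook identity, not quoted from the patent):

```latex
P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
```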
Mask LM (Masked Language Model): a language model that randomly masks some words of the input sequence and predicts the masked words using context information from both sides, preventing each word from "seeing itself".
N-Gram LM (N-Gram Language Model): an algorithm based on a statistical language model. The basic idea is to perform a sliding window operation with the size of N on the content in the text according to bytes, and form a byte fragment sequence with the length of N.
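A one-function illustration of the sliding-window operation in this definition; the function name and word-level tokenization are our assumptions (the definition above slides over bytes):

```python
# Slide a window of size n over a token sequence to form n-grams.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# ngrams("the cat sat on the mat".split(), 2)
# -> [('the','cat'), ('cat','sat'), ('sat','on'), ('on','the'), ('the','mat')]
```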
Transformer LM (Transformer Language Model): the Transformer is an encoder-decoder framework based on the SAN network and is currently the most mainstream neural network structure for sequence-to-sequence generation. A Transformer LM is a language model that uses a Transformer to model and score sentences.
BERT (Bidirectional Encoder Representations from Transformers): a pre-trained neural network model that aims to obtain deep bidirectional representations by conditioning on context from both sides in all layers.
PPL (Perplexity): a metric of how well a language model predicts a sample sentence; the lower the perplexity, the higher the probability the model assigns to the sentence.
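The standard definition of perplexity over a sentence of n words is (textbook formula, not quoted from the patent):

```latex
\mathrm{PPL}(w_1, \ldots, w_n) = \exp\left( -\frac{1}{n} \sum_{i=1}^{n} \log P(w_i \mid w_1, \ldots, w_{i-1}) \right)
```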
The corpus processing method for a translation model provided by the embodiments of the present application involves Artificial Intelligence (AI) technology. AI is the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive discipline of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason, and make decisions.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic AI infrastructure includes sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
The embodiments of the present application provide a method for processing the training corpus of a translation model that mainly involves Natural Language Processing (NLP), an important direction in computer science and artificial intelligence. NLP studies theories and methods that enable effective communication between humans and computers in natural language, and is a science integrating linguistics, computer science, and mathematics. Research in this field involves natural language, i.e., the language people use every day, so it is closely related to linguistics. NLP techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
The corpus processing method for a translation model provided by the embodiments mainly involves the Machine Translation (MT) technology within natural language processing: using computer devices to automatically translate text in one language into text in another. For example, in embodiments of the present application, a trained Language Model (LM) is used to score parallel sentences for quality. As another example, after quality filtering and domain screening of the original training corpus yield a corpus of the target domain that meets the preset quality condition, a translation model can be trained on that corpus to obtain a translation model for translating text in the target domain.
The corpus processing method of the translation model provided by the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. Each terminal 102 may crawl the corpus from the network, and after collecting large-scale original corpus for training the translation model, the server 104 may obtain the original corpus; obtaining at least two groups of trained universal language models, obtaining quality scores corresponding to parallel sentences in an original training corpus through each group of universal language models, filtering the original training corpus according to the quality scores to obtain the training corpus meeting preset quality conditions, wherein the model structures of each group of universal language models are different; and obtaining the field scores corresponding to all parallel sentences in the training corpus meeting the preset quality condition through the trained target field language model and the trained general field language model, and screening the training corpus meeting the preset quality condition in the target field from the training corpus meeting the preset quality condition according to the field scores, wherein the model structures of the target field language model and the general field language model are the same. Optionally, the server 104 may further perform model training on the translation model based on the selected corpus of the target domain and meeting the preset quality condition, so as to obtain a translation model for translating the text of the target domain.
In some embodiments, after the original corpus is obtained, the terminal 102 may directly execute the corpus processing method of the translation model provided in the embodiment of the present application, so as to obtain the corpus in the target field and meeting the preset quality condition. In some embodiments, the translation model may be model-trained by the terminal 102 based on the selected corpus of the target domain and meeting the preset quality condition, so as to obtain a translation model for translating the text of the target domain.
The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 104 may be implemented by an independent server or a server cluster formed by a plurality of servers. In some embodiments, the server 104 may be a blockchain node in a blockchain network, and the corpus of the screened target domain and satisfying the predetermined quality condition may be stored on the blockchain node.
Both statistics-based statistical machine translation and neural-network-based neural machine translation mentioned above are data-driven models; that is, the performance of the translation model is closely tied to the training corpus. Through the inventors' research, the scale of the corpora used to train deep-learning-based translation models has also grown rapidly in recent years; FIG. 2 is a schematic diagram showing how the performance of a certain neural network translation model has changed with corpus scale from 2016 onwards. As data scale increases, the translation model improves markedly at first, showing that large-scale corpora play a very large role in improving the performance of neural machine translation models; at later stages, however, even adding larger-scale corpora yields no major performance gains.
After investigation, the inventors identified two causes: 1) in large-scale corpora, sentence quality is uneven and noisy data is abundant. Because many large-scale parallel corpora come from websites, a great deal of noise is inevitably mixed in when large quantities of related corpora are collected, and this data noise makes it difficult for the model to learn correct semantic representations during training. 2) In large-scale corpora, translation corpora from different domains differ in distribution, and the domains are not uniformly distributed. The accuracy of machine translation depends to some extent on the domain distribution; translation corpora from different domains have different distributions, and mixing them together causes mutual interference when the model learns features of different domains.
That is, simply increasing corpus scale can hardly guarantee a better model; data quality must be ensured while the corpus is extended. The inventors therefore provide, in the embodiments of the present application, a corpus processing method for a translation model that performs quality filtering and domain screening on the original training corpus to obtain a high-quality corpus of the target domain before training the machine translation model. In this way, better translation performance can be achieved with less corpus, saving resources and training cost; filtering noise out of the training corpus further improves translation performance; and adjusting the domain distribution of the training corpus yields a translation model with better performance.
In one embodiment, as shown in fig. 3, a corpus processing method of a translation model is provided, which is described by taking the example that the method is applied to a computer device (terminal 102 or server 104) in fig. 1, and includes the following steps:
step 302, obtaining an original corpus used for training a translation model.
A translation model is a computer algorithm for translating text in one language into text in one or more other languages. The translation model to be trained may be a statistics-based statistical translation model or a neural-network-based neural translation model. A statistical translation model may be, for example, an N-Gram LM. A neural translation model may be, for example, a Transformer model based on an encoder-decoder framework, which may be implemented with a recurrent neural network, a convolutional neural network (CNN), or a self-attention neural network. The original training corpus used to train the translation model may be a corpus that has not been filtered or screened, or a corpus that has been roughly screened by hand using some manual rules, for example filtered with targeted regular expressions.
To train the translation model, the computer device needs to obtain the original training corpus. The original training corpus may be a monolingual corpus of one language drawn from a parallel corpus. For example, when the computer device only needs to process the original-text side of the parallel corpus, it may obtain the original-text corpus and then perform quality filtering and domain screening on it to obtain a high-quality original-text corpus of the target domain. Likewise, when the computer device only needs to process the translation side of the parallel corpus, it may obtain the translation corpus and then perform quality filtering and domain screening on it to obtain a high-quality translation corpus of the target domain. Of course, the original training corpus obtained by the computer device may also consist of the original texts in the parallel corpus together with the translations aligned with them, processed simultaneously to obtain a high-quality corpus of the target domain. The parallel corpus may be bilingual or multilingual; the embodiments of the present application are described mainly in terms of bilingual corpora. The computer device may obtain the original training corpus used to train the translation model from a number of parallel corpus databases.
Step 304: acquire at least two groups of trained universal language models, obtain the quality score corresponding to each parallel sentence in the original training corpus through each group of universal language models, and filter the original training corpus according to the quality scores to obtain a training corpus meeting the preset quality condition, where each group of universal language models has a different model structure.
The trained universal language model is a language model obtained by model training on a general-purpose training corpus; compared with a target-domain language model trained on target-domain corpora, it is more general and less specialized. A universal language model is used first because step 304 mainly improves corpus quality: a high-quality corpus is secured first, and then, on that high-quality basis, the subsequent step 306 performs domain screening so that the corpus is strongly relevant to the task's domain. Compared with directly mixing quality and domain scores over the original training corpus and weighting them, this pipeline maximally ensures that the output corpus is high quality and closest to the target domain, and avoids the situation where the output corpus is strongly in-domain but mixed with many low-quality sentences.
The at least two groups of trained universal language models for quality filtering acquired by the computer device may include a statistical language model. A statistical language model describes, from the perspective of probability and statistics, the probability distribution of words, sentences, and even larger grammatical units of a document, and can be used to measure whether a sentence or word sequence conforms to natural everyday usage in the given language environment. For example, the N-Gram model is built on the Markov assumption, i.e., the T-th word in a text is related only to the preceding N-1 words; according to the value of N, it can be further classified as a unigram language model (Uni-Gram), a bigram language model (Bi-Gram), or a trigram language model (Tri-Gram). The statistical language model may be KenLM, an N-Gram-based language model written in C++ that is faster and uses less memory.
The at least two groups of trained universal language models for quality filtering acquired by the computer device may include a neural network language model, which may be an autoregressive language model or a self-encoding language model (autoencoder LM). An autoregressive language model, such as the unidirectional-Transformer GPT, or a variant that concatenates two LSTMs reading the preceding and following context, predicts the probability of each word in a sentence one by one, from left to right or from right to left; that is, it predicts the next word from the preceding (or following) context. A self-encoding language model can fuse context information into the model; that is, it predicts the word at the current position using the context around that position, e.g., BERT-LM.
In some embodiments, the at least two groups of universal language models with different model structures obtained by the computer device may include at least two, three, or more of a statistical language model, an autoregressive language model, and a self-encoding language model. Because each group of universal language models has a different model structure, the integrated groups of language models can score parallel sentences comprehensively from different angles, so that a training corpus meeting the preset quality condition, i.e., a high-quality training corpus, can be filtered out of the original training corpus. For example, the first group of trained universal language models may be statistical language models, the second group autoregressive language models, and the third group self-encoding language models. Different language models have different characteristics and therefore weigh different aspects when scoring sentences; scoring with only a single language model is not comprehensive enough. For example, if only BERT-LM is used for scoring, some out-of-vocabulary words cannot be scored well: BERT-LM treats obscure words as unknown words, so such sentences cannot be scored properly and some sentences that actually suit the training requirements are lost. The N-Gram model can alleviate this problem, which is why multiple models are integrated to jointly score and filter the corpus.
In some embodiments, the at least two groups of universal language models obtained by the computer device may be language models of the same type obtained with different training settings. For example, the first group of trained universal language models may use the statistical language model N-Gram with N set to 5; the second group may also use the statistical language model N-Gram, with N set to 7; and the third group may use a neural-network-based language model, which may be an autoregressive or self-encoding language model. Although the first and second groups share the same model type, their training settings differ slightly, giving them different scoring abilities over sentences; integrating the two to jointly score the corpus therefore still achieves, to some extent, the effect of comprehensive mixed scoring.
In some embodiments, the computer device may directly obtain pre-trained universal language models, or may train the universal language models itself on suitable training corpora.
After the computer device obtains the at least two groups of trained universal language models, each parallel sentence in the original training corpus is scored through each group of universal language models to obtain the quality score corresponding to each parallel sentence. Optionally, the computer device may score only the original sentences in the parallel sentences, take each original sentence's score as the quality score of its parallel sentence, and quality-filter the original training corpus according to these quality scores. Optionally, the computer device may score only the translated sentences in the parallel sentences, take each translated sentence's score as the quality score of its parallel sentence, and quality-filter the original training corpus accordingly. Optionally, the computer device may score both the original sentence and the translated sentence of each parallel sentence to obtain the quality score of the parallel sentence, and quality-filter the original training corpus according to these quality scores.
It can be understood that when the computer device scores both the original sentence and the translated sentence of each parallel sentence in the original training corpus to obtain the quality score of each parallel sentence, each group of trained universal language models obtained by the computer device needs to include an original-text language model and a translation language model with the same model structure. For example, the first group of trained universal language models includes statistical language models of the original text and of the translation, the second group includes autoregressive language models of the original text and of the translation, and the third group includes self-coding language models of the original text and of the translation.
If the computer device only needs to score the original sentences or the translated sentences in the parallel sentences to perform quality filtering, each of the at least two groups of trained universal language models only needs to include an original-text language model or a translation language model. For example, the first group of universal language models may be a statistical language model of the original text, the second group an autoregressive language model of the original text, and the third group a self-encoding language model of the original text. As another example, the first group may be a statistical language model of the translation, the second group an autoregressive language model of the translation, and the third group a self-encoding language model of the translation.
Because the at least two groups of universal language models acquired by the computer device are trained language models, they can determine which word sequences are more likely, or predict the most likely next word given several words of preceding, following, or surrounding context. The predicted likelihood of the next word reflects, to a certain extent, the fluency of the sentence, i.e., its quality, and can therefore be regarded as a metric of sentence quality, i.e., a quality score: if each word in a parallel sentence is predicted with high probability, the sentence is fluent. A training corpus meeting the preset quality condition, i.e., a high-quality training corpus, can thus be filtered out of the original training corpus according to the quality scores. The preset quality condition is a preset filtering condition for selecting the high-quality corpus from the original training corpus and can be set according to actual needs; for example, it may require that a sentence's quality-score rank, or rank percentile, exceed a preset threshold. Specifically, after obtaining the quality score of each parallel sentence in the original training corpus, the computer device may filter out the high-quality corpus according to the quality scores and the preset quality condition. For example, the computer device may keep the sentences whose quality scores rank in the top M, where M may be a percentage or an index, or keep the sentences whose quality scores exceed a preset threshold N.
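The two example filtering rules at the end of this paragraph can be sketched as follows; the function names and the default fraction are illustrative assumptions.

```python
# Hedged sketch of the two filtering rules: keep the top-M-ranked
# sentences, or keep sentences scoring above a threshold N.

def filter_top_fraction(pairs, scores, fraction=0.2):
    """Keep the top `fraction` of parallel sentences by quality score."""
    ranked = sorted(zip(pairs, scores), key=lambda x: x[1], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return [pair for pair, _ in ranked[:keep]]

def filter_by_threshold(pairs, scores, threshold):
    """Keep parallel sentences whose quality score exceeds the threshold."""
    return [pair for pair, s in zip(pairs, scores) if s > threshold]
```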
Step 306: obtain the domain score corresponding to each parallel sentence in the training corpus meeting the preset quality condition through the trained target-domain language model and general-domain language model, and screen, according to the domain scores, the training corpus of the target domain meeting the preset quality condition out of the training corpus meeting the preset quality condition, where the target-domain language model and the general-domain language model have the same model structure.
At present, some corpus filtering methods are based on a single model to perform quality-aspect filtering, and the inventor thinks that translation needs to ensure quality and field simultaneously. In order to screen data in a specific field for model training in the specific field on the basis of ensuring high-quality data, after obtaining high-quality training corpus through first-step quality filtering, the computer device needs to obtain a target-field language model and a general-field language model, and further screens the high-quality training corpus in the target field through the target-field language model and the general-field language model.
The target domain language model is a language model obtained by training with a corpus of a target domain, such as a news domain, a financial domain, a medical domain, a computer technology domain, and the like. The generic domain language model is a language model that is less relevant in the generic domain than the target domain language model. The model structure of the target field language model is the same as that of the general field language model, the computer equipment can respectively input the training corpora meeting the preset quality conditions obtained after the quality is filtered into the target field language model and the general field language model, field scores are obtained according to the difference between the outputs of the target field language model and the general field language model, and the training corpora with stronger target field relevance, namely the training corpora meeting the preset quality conditions in the target field, are screened out from the training corpora meeting the preset quality conditions according to the field scores.
The model structure of the target domain language model is the same as that of the general domain language model. For example, the target domain language model and the general domain language model may both be statistical language models, both be autoregressive language models, or both be self-encoding language models.
Alternatively, the target domain language model may include a target domain language model of the original and a target domain language model of the translated text, and accordingly, the general domain language model may include a general domain language model of the original and a general domain language model of the translated text. For example, the target domain language model includes a statistical language model of the original obtained by model training using the original of the target domain, and also includes a statistical language model of the translated obtained by model training using the translated of the target domain, and the general domain language model also includes a statistical language model of the original and a statistical language model of the translated.
Optionally, since the high-quality corpus has been screened from the original corpus in the previous step, both the original text and the translation of the parallel sentences in the high-quality corpus may be considered high quality, so the current step only needs to perform secondary screening with domain relevance as the target. Therefore, the target domain language model may include only the target domain language model of the original text, and correspondingly, the general domain language model may include only the general domain language model of the original text; the computer device obtains, by using the target domain language model and the general domain language model of the original text, the domain score corresponding to the original sentence in each parallel sentence in the high-quality corpus, as the domain score corresponding to that parallel sentence. For example, the target domain language model is a statistical language model of the original text, and the general domain language model is a statistical language model of the original text.
Optionally, the target domain language model may include only a target domain language model of the translated text, and correspondingly, the general domain language model may include only a general domain language model of the translated text; the computer device obtains, by using the target domain language model and the general domain language model of the translated text, the domain score corresponding to the translated sentence in each parallel sentence in the corpus meeting the preset quality condition, as the domain score corresponding to that parallel sentence. For example, the target domain language model is a statistical language model of the translated text, and the general domain language model is a statistical language model of the translated text.
In some embodiments, following the idea of integrating multiple models for quality filtering, the computer device may also integrate multiple models for domain screening, so as to obtain the domain relevance of each sentence comprehensively and improve the accuracy of screening the corpus of the target domain from the high-quality corpus. For example, the trained target domain language models and general domain language models include two groups: the first group includes a target domain statistical language model of the original text and a general domain statistical language model of the original text, and the second group includes a target domain autoregressive language model of the original text and a general domain autoregressive language model of the original text. For another example, the first group includes a target domain statistical language model of the original text, a target domain statistical language model of the translated text, a general domain statistical language model of the original text, and a general domain statistical language model of the translated text, and the second group includes a target domain autoregressive language model of the original text, a target domain autoregressive language model of the translated text, a general domain autoregressive language model of the original text, and a general domain autoregressive language model of the translated text.
The computer device uses two or more groups of target domain language models and general domain language models to perform integrated, mixed domain screening on each parallel sentence in the high-quality corpus: it obtains the score of each parallel sentence under each group of models, and then fuses the scores of the groups to obtain the final domain score of each parallel sentence. According to the domain scores after the integrated scoring, the computer device screens out, from the training corpus meeting the preset quality condition, the training corpus that is in the target domain and meets the preset quality condition.
Since the trained target domain language model and general domain language model obtained by the computer device are trained language models with the same structure, they have the capability of determining which word sequence is more likely, or of predicting the most likely next word given several words of the preceding context, the following context, or both. The target domain language model is trained on target domain corpus, so its predictions carry a certain domain relevance, while the predictions of the general domain language model have weaker domain relevance. If the difference between the likelihood predicted by the target domain language model and that predicted by the general domain language model is smaller, the sentence is more relevant to the target domain; conversely, if the difference is larger, the sentence is less relevant. Based on this idea, the computer device can obtain the domain score of the training corpus from the difference. After obtaining the domain scores of the parallel sentences in the corpus meeting the preset quality condition, the computer device can screen, according to the domain scores, the corpus that is in the target domain and meets the preset quality condition. For example, the computer device may use the sentences whose domain scores rank in the top M as the corpus that is in the target domain and meets the preset quality condition, where M may be a percentage or a sequence number, or use the sentences whose domain scores pass a preset threshold N.
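The following minimal sketch illustrates such difference-based screening; the per-sentence log-likelihood interface and the keep_ratio parameter are assumptions, and the absolute difference is one concrete reading of the score difference described above (a smaller difference meaning stronger target-domain relevance).

```python
from typing import Callable, List

# Assumed interface: each model maps a sentence to a (length-normalized)
# log-likelihood; a smaller |difference| means stronger target-domain relevance.
LogLikelihood = Callable[[str], float]

def domain_score(sentence: str,
                 target_lm: LogLikelihood,
                 general_lm: LogLikelihood) -> float:
    return abs(target_lm(sentence) - general_lm(sentence))

def screen_target_domain(sentences: List[str],
                         target_lm: LogLikelihood,
                         general_lm: LogLikelihood,
                         keep_ratio: float = 0.3) -> List[str]:
    """Keep the sentences most relevant to the target domain."""
    ranked = sorted(sentences,
                    key=lambda s: domain_score(s, target_lm, general_lm))
    return ranked[: int(len(ranked) * keep_ratio)]
```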
In one embodiment, the method further comprises: obtaining parallel corpora of the target field, and performing model training on the language model to be trained by using the parallel corpora of the target field to obtain a language model of the target field; and sampling the training corpora meeting the preset quality condition to obtain sampling corpora, and performing model training on the language model to be trained by using the sampling corpora to obtain the universal field language model.
Specifically, the computer device may obtain the parallel corpus of the target domain and use it to perform model training on the constructed language model, obtaining the target domain language model. Meanwhile, the computer device can sample the whole high-quality training corpus that is to undergo domain screening to obtain a sampling corpus, and use the sampling corpus to perform model training on a constructed language model with the same model structure, obtaining the general domain language model. The computer device may obtain the parallel corpus of the target domain, for example corpus of a specific domain such as the news domain or the financial domain, by a crawler or in other manners. Of course, the general domain language model may also be obtained by performing model training on training corpus with weak domain relevance acquired through other channels.
Optionally, when the computer device only needs to obtain the domain score using the target domain language model and the general domain language model of the original text, the computer device may perform model training on the constructed language model using only the original text sentences in the parallel corpus of the target domain to obtain the target domain language model of the original text, and meanwhile, the computer device performs model training on the constructed language model using only the original text sentences in the sampling corpus to obtain the general domain language model of the original text. Alternatively, when the computer device only needs to obtain the domain score using the target domain language model and the general domain language model of the translated text, the computer device may perform model training on the constructed language model using only the translated text sentences in the parallel corpus of the target domain to obtain the target domain language model of the translated text, and at the same time, the computer device may perform model training on the constructed language model using only the translated text sentences in the sampling corpus to obtain the general domain language model of the translated text.
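A minimal sketch of preparing the two training corpora described above is given below; the in-memory corpus representation and the uniform random sampling strategy are assumptions of the sketch.

```python
import random
from typing import List, Tuple

def prepare_domain_lm_corpora(
    target_parallel: List[Tuple[str, str]],  # parallel corpus of the target domain
    high_quality: List[Tuple[str, str]],     # corpus surviving quality filtering
    sample_size: int,
    side: int = 0,  # 0 = original text sentences, 1 = translated text sentences
) -> Tuple[List[str], List[str]]:
    """Return (target domain LM corpus, general domain LM corpus) for one side."""
    target_corpus = [pair[side] for pair in target_parallel]
    sampled = random.sample(high_quality, min(sample_size, len(high_quality)))
    general_corpus = [pair[side] for pair in sampled]
    return target_corpus, general_corpus
```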
FIG. 4 is a diagram illustrating a corpus processing flow of the translation model in an embodiment. Referring to fig. 4, the original corpus is subjected to mixed scoring by at least two groups of general language models to obtain quality scores corresponding to parallel sentences, the corpus meeting preset quality conditions is obtained by filtering according to the quality scores, then the trained target domain language model and the general domain language model are utilized to perform domain scoring on the corpus meeting the preset quality conditions, and the corpus meeting the preset quality conditions in the target domain is screened out according to the domain scores.
After the computer equipment obtains the training corpus which is in the target field and meets the preset quality condition, model training can be carried out on the translation model by using the corpus, and the translation model for translating the text in the target field is obtained.
In the corpus processing method of the translation model, at least two groups of trained general language models are used to perform mixed scoring on each parallel sentence in the original corpus. Compared with quality filtering based on a single language model, because the model structures of the groups of general language models are different, the integrated groups of language models can score the parallel sentences comprehensively from different angles, ensuring that the corpus meeting the preset quality condition can be filtered from the original corpus. Furthermore, on this basis, the domain scores corresponding to the parallel sentences in the training corpus meeting the preset quality condition are obtained through the trained target domain language model and general domain language model, so that the training corpus that is in the target domain and meets the preset quality condition is further screened out for model training of the translation model in the target domain. Compared with filtering based on quality alone, the corpus of the target domain can be screened on the basis of ensuring high quality, so that the obtained corpus can greatly improve the translation performance of the translation model. In addition, through the process arrangement of first scoring for quality and then screening the training corpus of the target domain on that high-quality basis, compared with directly performing mixed quality-and-domain scoring on the original training corpus, the output corpus can be guaranteed, to the greatest extent, to be of high quality and closest to the target domain.
In one embodiment, as shown in fig. 5, obtaining the quality score corresponding to each parallel sentence in the original corpus by using each set of generic language models includes:
step 502, scoring the original sentences and the translated sentences in the parallel sentences in the original training corpus respectively through each group of universal language models, and respectively obtaining the original quality scores and the translated quality scores of the parallel sentences corresponding to each group of universal language models.
In this embodiment, each group of general language models includes a language model of the original text and a language model of the translated text. The language model of the original text scores the original sentences in the parallel sentences of the original training corpus to obtain corresponding original text quality scores, and the language model of the translated text scores the translated sentences to obtain corresponding translated text quality scores. In this way, the computer device not only integrates multiple language models to score the parallel sentences, but also scores the original text and the translated text jointly, obtaining the original text quality score and the translated text quality score of each group of general language models for each parallel sentence in the original training corpus, which further ensures that the parallel sentences are of high quality on both sides.
And step 504, fusing the original text quality scores and the translated text quality scores of each group of general language models corresponding to each parallel statement to obtain the quality scores corresponding to each parallel statement.
In this embodiment, multiple models are used to perform mixed scoring on each parallel sentence in the original corpus. Compared with quality filtering based on a single language model, because the model structures of the groups of general language models are different, the integrated groups of language models can score the parallel sentences comprehensively from different angles, ensuring that high-quality corpus is filtered from the original corpus.
In one embodiment, as shown in FIG. 6, step 502 includes:
and step 602, scoring the original sentences in the parallel sentences in the original training corpus respectively through the original language models in each group of universal language models to respectively obtain the original quality scores of the parallel sentences.
Wherein each group of general language models includes an original text language model. Each original text language model scores the original sentence in each parallel sentence in the original training corpus, yielding the original text quality score of each parallel sentence under each original text language model. That is, the original text language models, whose model structures differ across groups, each score the parallel sentences.
And step 604, scoring the translation sentences in each parallel sentence in the original training corpus respectively through the translation language models in each group of universal language models to respectively obtain the translation quality scores of the parallel sentences.
Wherein each group of general language models further includes a translated text language model. Each translated text language model scores the translated sentence in each parallel sentence in the original training corpus, yielding the translated text quality score of each parallel sentence under each translated text language model. That is, the translated text language models, whose model structures differ across groups, each score the parallel sentences.
In this embodiment, each group of general language models includes an original text language model and a translated text language model. Since the model structures of the groups differ, the model structures of the original text language models differ across groups, as do those of the translated text language models. Multiple original text language models can therefore be integrated to fully score the original sentences, and multiple translated text language models to fully score the translated sentences, and after the scores of both sides are integrated, the training corpus meeting the preset quality condition is filtered from the original training corpus.
In one embodiment, as shown in FIG. 7, step 504 includes:
and step 702, summing the original text quality scores and the translated text quality scores of each group of universal language models to obtain group-level scores.
Specifically, the computer device sums the original text quality scores and the translated text quality scores of each group of general language models corresponding to the parallel sentences to obtain group-level scores corresponding to each group of general language models.
Optionally, the computer device may also average the original text quality score and the translated text quality score of each group of the universal language models to obtain a group-level score.
It can be understood that whether the sum or the mean is calculated, this fusion considers the quality of the parallel sentence on both the original text side and the translated text side; that is, only when the summed or averaged score is high enough can the parallel sentence be kept as high-quality corpus.
It can be understood that if each group of general language models only includes the original text language model, the original text quality score corresponding to the parallel sentences is the group-level score. And if each group of general language models only comprises a translation language model, the translation quality score corresponding to the parallel sentences is the group-level score.
And step 704, acquiring the weighting coefficients corresponding to each group of the universal language models.
Specifically, the computer device may obtain the set weighting coefficient corresponding to each group of general language models. The weighting coefficient of each group can be set randomly or manually. Of course, the setting of the weighting coefficients may also depend on the performance of each group of general language models: for example, if the computer device obtains three groups of general language models and the first group performs best, its weighting coefficient is the largest; conversely, if the first group performs worst among the three, its weighting coefficient is the smallest.
In addition, the computer device may also set a limiting condition for the weighting coefficients, for example, the weighting coefficients corresponding to each group of the universal language models are λ 1, λ 2, and λ 3, and when setting values of λ 1, λ 2, and λ 3, it needs to satisfy:
λ1+λ2+λ3=0.5。
and 706, carrying out weighted summation on the group-level scores of the general language models corresponding to each group of the parallel statements based on the weighting coefficients corresponding to each group of the general language models, so as to obtain the quality scores corresponding to the parallel statements.
It can be understood that if each group of general language models only includes the original language model, the computer device only needs to perform weighted summation on the original quality scores of the parallel statements according to the weighting coefficients to obtain the quality scores corresponding to the parallel statements. If each group of general language models only comprises a translation language model, the computer equipment only needs to carry out weighted summation on the translation quality scores of the parallel statements according to the weighting coefficients to obtain the quality scores corresponding to the parallel statements.
For example, the computer device obtains 3 groups of general language models, each including an original text language model and a translated text language model. The 1st group scores the original text quality of a parallel sentence as S1 and its translated text quality as S1*, the 2nd group scores them as S2 and S2*, and the 3rd group as S3 and S3*. The quality score Q_Score corresponding to the parallel sentence S can then be expressed by the following formula:
Q_Score=λ1×(S1+S1*)+λ2×(S2+S2*)+λ3×(S3+S3*)。
for another example, the computer device obtains 3 sets of general language models, each set of general language models includes only the original language model, and the quality Score Q _ Score corresponding to the parallel statement can be represented by the following formula:
Q_Score=λ1×S1+λ2×S2+λ3×S3。
for another example, the computer device obtains 3 sets of general language models, each set of general language models only includes a translation language model, and the quality Score Q _ Score corresponding to the parallel sentence can be represented by the following formula:
Q_Score=λ1×S1*+λ2×S2*+λ3×S3*。
in this embodiment, the quality scores of the two sides of the original text and the translated text are weighted and summed through the weighting coefficients, and the obtained quality scores can measure the quality of sentences of the parallel sentences on the two sides of the original text and the translated text.
FIG. 8 is a schematic flow diagram of quality filtering in one embodiment. Referring to FIG. 8, the original training corpus is a Chinese-English bilingual corpus, and the obtained general language models are three groups, each including a language model of the original text and a language model of the translated text. The first group includes a Chinese statistical language model and an English statistical language model, such as Ken-LM; the second group includes a Chinese autoregressive language model and an English autoregressive language model, such as Transformer-LM; and the third group includes a Chinese self-encoding language model and an English self-encoding language model, such as Bert-LM. The Chinese side of each parallel sentence is scored by the Chinese statistical language model and the corresponding English side by the English statistical language model, and the two are fused into the first group score of the parallel sentence; by analogy, the second group score is obtained through the Chinese and English autoregressive language models, and the third group score through the Chinese and English self-encoding language models. A weighter then performs weighted summation on the three group scores to obtain the quality score corresponding to the parallel sentence. The computer device can filter the training corpus meeting the preset quality condition from the original training corpus according to the quality scores.
In one embodiment, as shown in FIG. 9, step 504 includes:
and 902, normalizing the original text quality scores of the parallel sentences obtained by the same group of universal language models according to the highest score and the lowest score in the original text quality scores of the parallel sentences in the original training corpus obtained by the same group of universal language models to obtain normalized original text quality scores.
Language models with different model structures may produce quality scores in inconsistent ranges. For example, for the first group of general language models a predicted quality score above 80 points may indicate good quality, while for the second group a score above 0.85 may. If the quality scores obtained from different general language models were directly weighted and summed, the resulting quality score might not objectively represent the quality of the parallel sentence. Therefore, in this case, to measure the quality of the parallel sentences more reasonably and accurately, the computer device may further normalize the quality scores obtained from the different general language models after obtaining the quality score predicted by each group.
Specifically, for the original text quality scores of the parallel sentences obtained by the original text language models in the same group of general language models, the computer device may determine the highest score and the lowest score, and then normalize the original text quality scores output by the original text language models according to the highest score and the lowest score to obtain normalized original text quality scores.
In one embodiment, the normalization processing is performed on the raw text quality scores of the parallel sentences obtained by the same group of common language models to obtain normalized raw text quality scores, which can be implemented by the following formula:
Si' = (Si − Si_min) / (Si_max − Si_min)

wherein Si represents the original text quality score of the parallel sentence S obtained by the i-th group of general language models, Si_min represents the lowest score among the original text quality scores obtained by the i-th group of general language models for the parallel sentences, Si_max represents the highest such score, and Si' denotes the normalized original text quality score of the parallel sentence S.
And 904, normalizing the translation quality scores of the parallel sentences obtained by the same group of universal language models according to the highest score and the lowest score in the translation quality scores of the parallel sentences in the original training corpus obtained by the same group of universal language models to obtain normalized translation quality scores.
Specifically, for the translation quality scores of the parallel sentences obtained by the translation language models in the same group of general language models, the computer device may determine the highest score and the lowest score, and then normalize the translation quality scores output by the translation language models according to the highest score and the lowest score to obtain normalized translation quality scores.
In one embodiment, the normalization processing is performed on the translation quality scores of the parallel sentences obtained by the same set of common language models to obtain normalized translation quality scores, which can be implemented by the following formula:
Si*' = (Si* − Si*_min) / (Si*_max − Si*_min)

wherein Si* represents the translated text quality score of the parallel sentence S obtained by the i-th group of general language models, Si*_min represents the lowest score among the translated text quality scores obtained by the i-th group of general language models for the parallel sentences, Si*_max represents the highest such score, and Si*' denotes the normalized translated text quality score of the parallel sentence S.
And 906, fusing the normalized original text quality scores and the normalized translation quality scores of each group of general language models corresponding to each parallel statement to obtain the quality scores corresponding to each parallel statement.
For example, the computer device obtains 3 groups of general language models, each including an original text language model and a translated text language model. The 1st group scores the original text quality of a parallel sentence as S1 and its translated text quality as S1*, the 2nd group scores them as S2 and S2*, and the 3rd group as S3 and S3*. After normalization, the quality score Q_Score corresponding to the parallel sentence S can be expressed by the following formula:
Q_Score=λ1×(S1′+S1*′)+λ2×(S2′+S2*′)+λ3×(S3′+S3*′)。
It should be noted that, if the ranges of the quality scores output by the different groups among the at least two groups of general language models acquired by the computer device are substantially consistent, the computer device does not need to normalize the original text quality scores and the translated text quality scores, and can directly perform weighted summation on the original text quality scores and the translated text quality scores to obtain the final quality score of each parallel sentence.
In this embodiment, the quality scores output by each group of the universal language models are normalized, so that the quality scores of the parallel sentences in the original corpus can be more reasonably scored, and the accuracy of filtering out the high-quality corpus from the original corpus is improved.
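A minimal sketch of the per-group min-max normalization is given below; returning 0 when all scores are equal is an assumed convention for the degenerate case.

```python
from typing import List

def min_max_normalize(scores: List[float]) -> List[float]:
    """Si' = (Si - Si_min) / (Si_max - Si_min), per the formula above."""
    lo, hi = min(scores), max(scores)
    if hi == lo:                       # all sentences scored identically
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]
```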
The following description continues with specific steps for the quality score output by each generic language model:
in one embodiment, as shown in fig. 10, when the universal language model is a statistical language model obtained based on a high-quality corpus, obtaining, through each group of universal language models, a quality score corresponding to each parallel sentence in the original corpus includes:
step 1002, parallel sentences are sequentially obtained from the original training corpus.
And 1004, inputting the original sentences in the parallel sentences into the statistical language model of the original text, and obtaining the quality scores of the original texts of the parallel sentences based on the condition frequency numbers corresponding to the words in the original sentences through the statistical language model of the original text.
Wherein, the statistical language model of the original text is obtained based on the high-quality corpus of the original text. The computer equipment can construct a statistical language model in advance, perform statistical training on the high-quality corpus of the original text, and score the original text sentences of each parallel sentence in the original training corpus by using the statistical language model of the original text obtained by training to obtain corresponding original text quality scores.
The statistical language model is based on the Markov assumption: the probability of the t-th word in a sentence occurring is assumed to depend only on the N−1 finite words that occur before it, so the probability of the sentence occurring is the product of the conditional probabilities of each word given its preceding words. The computer device can determine the probability of the sentence according to the condition frequencies of the words in the sentence counted by the statistical language model, and use this probability as the quality score of the sentence.
The statistical language model predicts the probability of each word occurring in the sentence according to its context using a multinomial distribution; it scores quickly and is very simple, so it can be used as one of the at least two groups of general language models acquired by the computer device.
In one embodiment, the original text quality score of the original text sentence S output by the statistical language model of the original text can be calculated by the following formulas:

p(S) = ∏_t p(wt | wt−1, wt−2, …, w1)

p(wt | wt−1, wt−2, …, w1) ≈ p(wt | wt−1, wt−2, …, wt−N+1)

For example, when N is 3:

p(S) = ∏_t p(wt | wt−1, wt−2)

wherein p(S) is the original text quality score of the original text sentence S, wt is the t-th word in the original text sentence S, and p(wt | wt−1, wt−2, …, w1) represents the probability of the next word wt occurring given the 1st to (t−1)-th words, which is approximately equal to the probability of the word occurring given the preceding N−1 words.

The conditional probability is estimated from condition frequencies:

p(wt | wt−1, …, wt−N+1) = C(wt−N+1, …, wt−1, wt) / C(wt−N+1, …, wt−1)

where C represents the number of occurrences of the word sequence in parentheses in the high-quality corpus used for statistical training.
Step 1006, inputting the translation statement in the parallel statement into a statistical language model of the translation, and obtaining a translation quality score of the parallel statement based on the condition frequency corresponding to each word in the translation statement through the statistical language model of the translation.
Similarly, the computer device may construct a statistical language model in advance, perform statistical training on the high-quality corpus of the translation, and score the translation sentences of each parallel sentence in the original training corpus using the statistical model of the translation obtained by training to obtain corresponding translation quality scores.
Similarly, the translation quality score of the translation sentence S output by the statistical language model of the translated text can be calculated by the following formulas:

p(S) = ∏_t p(wt | wt−1, wt−2, …, w1)

p(wt | wt−1, wt−2, …, w1) ≈ p(wt | wt−1, wt−2, …, wt−N+1)

wherein p(S) is the translation quality score of the translation sentence S, wt is the t-th word in the translation sentence S, and p(wt | wt−1, wt−2, …, w1) represents the probability of the next word wt occurring given the 1st to (t−1)-th words, which is approximately equal to the probability of the word occurring given the preceding N−1 words.
And 1008, fusing the original text quality scores and the translated text quality scores of the parallel sentences to obtain the quality scores of the statistical language models corresponding to the parallel sentences.
Regarding the way of fusing the original text quality score and the translated text quality score by the statistical language model, reference may be made to the above mentioned way, and the description is not repeated here.
In an embodiment, as shown in fig. 11, when the universal language model is an autoregressive language model, obtaining, through each group of universal language models, a quality score corresponding to each parallel sentence in the original corpus includes:
step 1102, parallel sentences are sequentially obtained from the original training corpus.
And 1104, inputting the original sentence in the parallel sentences into an autoregressive language model of the original, predicting the conditional probability of each word appearing from left to right or from right to left in the original sentence through the autoregressive language model of the original, and obtaining the original quality score of the parallel sentences according to the conditional probability corresponding to each word.
The autoregressive language model of the original text is obtained by performing model training based on a training corpus of the original text. The computer device may construct an auto-regression language model in advance, perform model training on the original text training corpus, and use the auto-regression language model obtained by training to score original text sentences of each parallel sentence in the original training corpus to obtain corresponding original text quality scores.
The training goal of the autoregressive model is to predict the word at the next position, so during training, the words to the right (or left) of the predicted position must be shielded by masking to ensure the model learns the ability to predict word by word from left to right (or from right to left). For example, for the training sentence "I love China.", to make the model learn to predict from left to right, when predicting the 2nd word, "China." to the right of the 2nd position must be masked so that the 2nd word is predicted only from the preceding "I"; and when predicting the 3rd word, "." to the right of the 3rd position must be masked so that the 3rd word is predicted only from the preceding "I" and "love".
After obtaining the trained auto-regression language model of the original text, the computer device may input the original text sentences of each parallel sentence in the original training corpus into the auto-regression language model of the original text, predict words one by one from left to right or from right to left, and obtain the quality scores corresponding to the original text sentences according to the following formula according to the predicted conditional probability of each word:
score(S) = Σ_t log P_ALM(wt | W<t; θ)

wherein wt denotes the t-th word in the original sentence S, W<t denotes the antecedents of the t-th word, that is, the words already present to its left (or right), P_ALM(wt | W<t; θ) represents the conditional probability the model predicts for the t-th word, θ represents the manually set hyper-parameters during model training, and score(S) represents the quality score corresponding to the original sentence S output by the autoregressive language model. The computer device obtains the conditional probability of each word in the original sentence according to this formula, performs density estimation of the joint probability of the sentence from these conditional probabilities, and uses the value to measure the quality of the sentence.
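A sketch of this scoring is given below, assuming an already-trained left-to-right model exposed through a hypothetical log_prob(word, prefix) interface.

```python
from typing import Callable, List, Sequence

def autoregressive_score(
    sentence: List[str],
    log_prob: Callable[[str, Sequence[str]], float],  # assumed model interface
) -> float:
    """score(S) = sum_t log P_ALM(wt | W<t; theta): sum of per-word
    conditional log probabilities, conditioning only on the prefix."""
    return sum(log_prob(word, sentence[:t]) for t, word in enumerate(sentence))
```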
Step 1106, inputting the translated sentence in the parallel sentence into the autoregressive language model of the translated sentence, predicting the conditional probability of each word appearing from left to right or from right to left in the translated sentence through the autoregressive language model of the translated sentence, and obtaining the translated sentence quality score of the parallel sentence according to the conditional probability corresponding to each word.
Similarly, the computer device may construct an auto-regression language model in advance, perform model training on the translated text corpus, and use the auto-regression language model obtained by training to score the translated text sentences of each parallel sentence in the original corpus to obtain corresponding translated text quality scores.
Step 1108, merging the original text quality scores and the translated text quality scores of the parallel sentences to obtain the quality scores of the autoregressive language models corresponding to the parallel sentences.
Regarding the way of fusing the quality scores of the original text and the quality scores of the translated text by the autoregressive language model, the above mentioned ways can be specifically referred to, and the description is not repeated here.
In an embodiment, as shown in fig. 12, when the universal language model is a self-coding language model, obtaining, by each group of universal language models, a quality score corresponding to each parallel sentence in the original corpus includes:
step 1202, parallel sentences are sequentially obtained from the original training corpus.
And 1204, sequentially taking each word in the original sentence of the parallel sentence as a masking word, inputting the masked original sentence into a self-coding language model of the original, outputting the prediction probability corresponding to the masking word through the self-coding language model of the original, and obtaining the original quality score of the parallel sentence according to the prediction probability corresponding to each masking word.
The self-coding language model of the original text is obtained by performing model training based on a training corpus of the original text. The computer equipment can construct a self-coding language model in advance, perform model training on the original text training corpus, and score original text sentences of each parallel sentence in the original training corpus by using the self-coding language model obtained by training to obtain corresponding original text quality scores.
Since the training of the self-encoding model aims to predict the probability of the word at a certain position in a sentence according to the context information at that position, during training the word at the predicted position is shielded by masking, the context information at that position is input, and the self-encoding language model is trained with the word at the predicted position as the label information, ensuring that the model learns the ability to predict according to the context information at each position.
After obtaining the trained self-encoding language model of the original text, the computer device can mask each word of the original sentence of each parallel sentence in the original training corpus in turn, input the masked sentences into the self-encoding language model of the original text, and output the prediction probability corresponding to each masked word. It can be understood that when the output predicted words do not include the masked word, the prediction probability corresponding to the masked word is 0. The computer device may determine the quality score corresponding to the original sentence based on the prediction probability of each masked word.
The computer device obtains the quality score corresponding to the original sentence from the trained self-encoding language model, according to the predicted probability of the masked word occurring at each position, by the following formula:

score(S) = Σ_t log P_Mask_LM(wt | S\t; θ)

wherein wt denotes the t-th word in the original sentence S, S\t denotes the sequence obtained by removing (masking) the t-th word from the original sentence S, P_Mask_LM(wt | S\t; θ) represents the prediction probability corresponding to the t-th masked word predicted by the model, θ represents the manually set hyper-parameters during model training, and score(S) represents the quality score corresponding to the original sentence S output by the self-encoding language model. The computer device obtains the prediction probability corresponding to each masked word in the original sentence according to this formula and obtains the original text quality score of the original sentence S from these prediction probabilities.
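A sketch of the masked scoring loop is given below, assuming a hypothetical mask_prob(masked_sentence, position, target_word) interface onto the trained model and a [MASK] placeholder token.

```python
import math
from typing import Callable, List, Sequence

def self_encoding_score(
    sentence: List[str],
    mask_prob: Callable[[Sequence[str], int, str], float],  # assumed interface
) -> float:
    """score(S) = sum_t log P_Mask_LM(wt | S\\t; theta): mask each position
    in turn and sum the log probability of the original word there."""
    total = 0.0
    for t, word in enumerate(sentence):
        masked = list(sentence)
        masked[t] = "[MASK]"                 # shield the word being predicted
        p = mask_prob(masked, t, word)
        total += math.log(p) if p > 0 else float("-inf")  # unpredicted word
    return total
```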
And 1206, taking each word in the translation sentence of the parallel sentence as a masking word in sequence, inputting the masked translation sentence into the self-coding language model of the translation, outputting the prediction probability corresponding to the masking word through the self-coding language model of the translation, and obtaining the translation quality score of the parallel sentence according to the prediction probability corresponding to each masking word.
Similarly, the computer device may construct a self-coding language model in advance, perform model training on the translated text training corpus, and use the self-coding language model obtained by training to score the translated text sentences of each parallel sentence in the original training corpus to obtain corresponding translated text quality scores.
And 1208, fusing the original text quality scores and the translation quality scores of the parallel sentences to obtain the quality scores of the coding language models corresponding to the parallel sentences.
Regarding the way of fusing the original text quality score and the translated text quality score by the self-coding language model, reference may be made to the above mentioned way, and a description thereof is not repeated here.
After the quality filtering is completed, the following description will proceed with specific embodiments of field screening.
Since the training corpus meeting the preset quality condition has been screened from the original training corpus in the previous step, both the original text and the translation of the parallel sentences in that corpus can be considered to meet the preset quality condition, and the domain screening only needs to perform secondary screening with domain relevance as the target. The more an original sentence conforms to the target domain, the more its corresponding translated sentence necessarily conforms to the target domain. Therefore, the computer device only needs to perform domain scoring on the original sentences; that is, the target domain language model may include only the target domain language model of the original text, and correspondingly, the general domain language model may include only the general domain language model of the original text. The computer device obtains, by using the target domain language model and the general domain language model of the original text, the domain score corresponding to the original sentence in each parallel sentence in the training corpus meeting the preset quality condition, as the domain score corresponding to that parallel sentence.
That is, in an embodiment, as shown in fig. 13, in step 306, obtaining a domain score corresponding to each parallel sentence in the corpus meeting the preset quality condition through the trained target domain language model and the trained general domain language model includes:
step 1302, scoring the original sentences in each parallel sentence in the training corpus meeting the preset quality condition through the target domain language model of the original text, and obtaining first domain scores corresponding to the parallel sentences.
Specifically, the target domain language model is obtained by corpus training of the target domain, so that the predictive capability of the target domain language model has domain correlation to a certain extent, and therefore, the computer device can use the target domain language model of the original text to score original sentences in parallel sentences in the corpus meeting preset quality conditions, obtain first domain scores corresponding to the parallel sentences, and can record the first domain scores as H1 (S).
And 1304, scoring the original sentences in the parallel sentences in the training corpus meeting the preset quality condition through the universal domain language model of the original text to obtain second domain scores corresponding to the parallel sentences.
Specifically, the universal domain language model is obtained by corpus training without domain screening, the prediction capability of the universal domain language model has weaker domain correlation, and the computer device can use the universal domain language model of the original text to score original sentences in parallel sentences in the training corpus meeting preset quality conditions, so as to obtain second domain scores corresponding to the parallel sentences, which can be recorded as H2 (S).
Step 1306, according to the difference between the first domain score and the second domain score corresponding to each parallel statement, obtaining the domain score corresponding to each parallel statement in the corpus meeting the preset quality condition.
The model structure of the target domain language model is the same as that of the general domain language model; they differ only in the corpus used for model training, the former using target domain corpus and the latter using corpus not subjected to domain screening, so the ranges of the domain scores they output should be consistent, for example values between 0 and 1. If the difference between the first domain score and the second domain score is larger, the domain relevance of the original sentence is lower; conversely, if the difference is smaller, the domain relevance of the original sentence is higher.
In one embodiment, the computer device may obtain, according to the difference between the first domain score and the second domain score, the cross entropy loss corresponding to each original sentence as the domain score corresponding to that original sentence. The computer device can then screen out, from the training corpus meeting the preset quality condition, the training corpus that is in the target domain and meets the preset quality condition, on the principle that a smaller cross entropy loss indicates a better domain score for the original sentence.
Referring to part (a) of fig. 14, a schematic diagram of performing domain filtering on the original sentences in the corpus that satisfy the preset quality condition in one embodiment is shown. And inputting the original sentences in the training corpus of the target field and the original sentences in the training corpus meeting the preset quality condition to be subjected to field screening into the language model by the computer equipment. And training the language model by using the original sentences in the training corpus of the target field to obtain the target field language model of the original text. And training the language model by using a part of the original sentences in the training corpus meeting the preset quality condition to be subjected to the field screening to obtain the universal field language model of the original text. And respectively scoring original sentences in the training corpora meeting the preset quality condition to be subjected to field screening through the target field language model and the general field language model, and screening the training corpora meeting the preset quality condition in the target field from the training corpora meeting the preset quality condition according to scoring differences.
In one embodiment, step 306, obtaining, through the trained target domain language model and the trained general domain language model, a domain score corresponding to each parallel statement in the corpus that meets the preset quality condition, includes: scoring translation sentences in parallel sentences in a training corpus meeting preset quality conditions through a target domain language model of the translation to obtain third domain scores corresponding to the parallel sentences; scoring translation sentences in parallel sentences in a training corpus meeting preset quality conditions through a universal domain language model of the translation to obtain fourth domain scores corresponding to the parallel sentences; and obtaining the domain score corresponding to each parallel statement in the training corpus meeting the preset quality condition according to the difference between the third domain score and the fourth domain score corresponding to each parallel statement.
Similarly, the computer device may use the target domain language model of the translated text to score the translated sentence in each parallel sentence in the corpus meeting the preset quality condition, obtaining the third domain score corresponding to the parallel sentence, which may be denoted as H1(S*), and use the general domain language model of the translated text to score the translated sentences, obtaining the fourth domain score corresponding to the parallel sentence, which may be denoted as H2(S*). The computer device may obtain, according to the difference between the third domain score and the fourth domain score, the cross entropy loss corresponding to each translated sentence as its domain score, and screen out, from the training corpus meeting the preset quality condition, the training corpus that is in the target domain and meets the preset quality condition, on the principle that a smaller cross entropy loss is better.
Referring to part (b) of fig. 14, a schematic diagram of performing domain filtering on translated sentences in the corpus that satisfy the preset quality condition in one embodiment is shown. And inputting the translation sentences in the training corpuses in the target field and the translation sentences in the training corpuses which meet the preset quality condition and are to be subjected to field screening into the language model by the computer equipment. And training the language model by using the translation sentences in the training corpus of the target field to obtain the target field language model of the translation. And training the language model by using a part of the translation sentences in the training corpus meeting the preset quality condition to be subjected to the domain screening to obtain a universal domain language model of the translation. And (3) respectively scoring translated sentences in the training corpuses meeting the preset quality condition to be subjected to field screening through the target field language model and the general field language model, and screening the training corpuses meeting the preset quality condition in the target field from the training corpuses meeting the preset quality condition according to scoring differences.
In one embodiment, obtaining a domain score corresponding to each parallel sentence in a training corpus that meets a preset quality condition through a trained target domain language model and a trained general domain language model includes: scoring original sentences in parallel sentences in a training corpus meeting preset quality conditions through a target domain language model of the original texts to obtain first domain scores corresponding to the parallel sentences; scoring original sentences in parallel sentences in a training corpus meeting preset quality conditions through a general domain language model of the original texts to obtain second domain scores corresponding to the parallel sentences; scoring translation sentences in parallel sentences in a training corpus meeting preset quality conditions through a target domain language model of the translation to obtain third domain scores corresponding to the parallel sentences; scoring translation sentences in parallel sentences in a training corpus meeting preset quality conditions through a universal domain language model of the translation to obtain fourth domain scores corresponding to the parallel sentences; and fusing the difference between the first domain score and the second domain score and the difference between the third domain score and the fourth domain score corresponding to each parallel sentence to obtain the domain score corresponding to each parallel sentence in the training corpus meeting the preset quality condition.
In this embodiment, the target domain language model may include a target domain language model of the original text and a target domain language model of the translated text, and accordingly, the general domain language model may include a general domain language model of the original text and a general domain language model of the translated text, so that the computer device may fuse the domain scores on both sides of the original text and the translated text, and screen out the training corpus that satisfies the preset quality condition and is in the target domain from the training corpus that satisfies the preset quality condition.
The method for fusing the domain scores on the two sides can be direct addition, weighted addition or averaging.
For example, the computer device obtains the first domain score corresponding to the original sentence in the parallel sentence using the target domain language model of the original text, denoted as H1(S); the second domain score corresponding to the original sentence using the general domain language model of the original text, denoted as H2(S); the third domain score corresponding to the translated sentence using the target domain language model of the translated text, denoted as H1(S*); and the fourth domain score corresponding to the translated sentence using the general domain language model of the translated text, denoted as H2(S*); and obtains the domain score corresponding to the parallel sentence according to the following formula:
F_Score = |H1(S) - H2(S)| + |H1(S*) - H2(S*)|.
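As an illustrative sketch in Python (the function and argument names are hypothetical, not part of the claimed method), the bilateral domain score can be computed as:

def bilateral_domain_score(h1_src, h2_src, h1_tgt, h2_tgt):
    # h1_src: H1(S), score of the original sentence under the target domain LM of the original text
    # h2_src: H2(S), score of the original sentence under the general domain LM of the original text
    # h1_tgt: H1(S*), score of the translated sentence under the target domain LM of the translation
    # h2_tgt: H2(S*), score of the translated sentence under the general domain LM of the translation
    return abs(h1_src - h2_src) + abs(h1_tgt - h2_tgt)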
In some embodiments, the computer device may also integrate multiple groups of models for domain screening, so as to capture the domain relevance of each sentence more comprehensively and thereby improve the accuracy of screening the corpus of the target domain from the high-quality corpus. For example, when the trained target domain language models and general domain language models include two groups, the computer device performs a weighted summation of the domain scores obtained from each group to obtain the final domain score of a parallel sentence:
F_Score = λ1(|H1(S) - H2(S)| + |H1(S*) - H2(S*)|) + λ2(|H1'(S) - H2'(S)| + |H1'(S*) - H2'(S*)|), where λ1 is the weighting coefficient corresponding to the first group of language models, λ2 is the weighting coefficient corresponding to the second group of language models, the first term (|H1(S) - H2(S)| + |H1(S*) - H2(S*)|) is the domain score output by the first group of language models, the second term (|H1'(S) - H2'(S)| + |H1'(S*) - H2'(S*)|) is the domain score output by the second group of language models (the primed models denoting the target domain and general domain language models of the second group), and F_Score is the final domain score after fusion.
In some embodiments, when the two groups of language models have different model structures and the domain scores they output differ greatly in scale, the computer device may further perform normalization processing on the domain scores output by each group of language models, in the same way as the normalization of the quality scores described above, which is not repeated here.
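A minimal Python sketch of such normalization and two-group weighted fusion follows (the helper names and the default weights λ1 = λ2 = 0.5 are illustrative assumptions, not prescribed by the embodiment):

def min_max_normalize(scores):
    # Rescale one group's scores to [0, 1] using the highest and lowest score in that group.
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def fuse_group_scores(group1_scores, group2_scores, lam1=0.5, lam2=0.5):
    # Weighted summation of the per-sentence domain scores output by two groups
    # of language models, after normalizing each group onto a common scale.
    n1 = min_max_normalize(group1_scores)
    n2 = min_max_normalize(group2_scores)
    return [lam1 * a + lam2 * b for a, b in zip(n1, n2)]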
In a specific embodiment, the corpus processing method of the translation model includes the following steps:
1. acquiring an original training corpus for training a translation model;
2. acquiring at least two groups of trained universal language models, wherein the model structures of each group of universal language models are different;
3. scoring the original sentences in each parallel sentence in the original training corpus respectively through the original text language models in each group of universal language models, to respectively obtain the original text quality scores of the parallel sentences;
4. scoring the translation sentences in each parallel sentence in the original training corpus respectively through the translation language models in each group of universal language models, to respectively obtain the translation quality scores of the parallel sentences;
5. normalizing the original text quality scores of the parallel sentences obtained by the same group of universal language models according to the highest score and the lowest score in the original text quality scores of the parallel sentences in the original training corpus obtained by the same group of universal language models to obtain normalized original text quality scores;
6. normalizing the translation quality scores of the parallel sentences obtained by the same group of universal language models according to the highest score and the lowest score in the translation quality scores of all the parallel sentences in the original training corpus obtained by the same group of universal language models to obtain normalized translation quality scores;
7. fusing the normalized original text quality score and the normalized translation quality score of each group of universal language models corresponding to each parallel sentence to obtain the quality score corresponding to that parallel sentence;
8. filtering the original training corpus according to the quality score to obtain the training corpus meeting the preset quality condition;
9. obtaining a parallel corpus of the target domain, and performing model training on a language model to be trained by using the original sentences in the parallel corpus of the target domain to obtain a target domain language model of the original text;
10. sampling the training corpus meeting the preset quality condition to obtain a sampled corpus, and performing model training on a language model to be trained by using the original sentences in the sampled corpus to obtain a general domain language model of the original text;
11. scoring original sentences in parallel sentences in a training corpus meeting preset quality conditions through a target domain language model of the original texts to obtain first domain scores corresponding to the parallel sentences;
12. scoring original sentences in parallel sentences in a training corpus meeting preset quality conditions through a general domain language model of the original texts to obtain second domain scores corresponding to the parallel sentences;
13. obtaining a domain score corresponding to each parallel sentence in the training corpus meeting the preset quality condition according to the difference between the first domain score and the second domain score corresponding to each parallel sentence;
14. screening out, from the training corpus meeting the preset quality condition, the training corpus which belongs to the target domain and meets the preset quality condition according to the domain score corresponding to each parallel sentence;
15. performing model training on the translation model by using the screened training corpus which belongs to the target domain and meets the preset quality condition, to obtain a translation model for translating text of the target domain.
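Taken together, steps 1 to 15 form a two-stage filter: quality filtering first, then domain screening over the surviving sentence pairs. The following condensed Python sketch illustrates this flow; the scoring callables, the quality threshold, and the top-k selection rule are illustrative assumptions standing in for the trained models and preset conditions described above:

def two_stage_filter(pairs, quality_score, domain_score, quality_threshold, top_k):
    # Stage 1: quality filtering -- keep parallel sentence pairs whose fused
    # quality score satisfies the preset quality condition.
    high_quality = [p for p in pairs if quality_score(p) >= quality_threshold]
    # Stage 2: domain screening -- rank the surviving pairs by domain score and
    # keep the top_k pairs judged closest to the target domain.
    ranked = sorted(high_quality, key=domain_score, reverse=True)
    return ranked[:top_k]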
The corpus processing method of the translation model provided by the embodiments of the application can greatly reduce the manual workload and can also select high-quality corpus of the target domain, so that the performance of the translation model in the specific domain is greatly improved and the user experience is remarkably improved; the specific performance is shown in Table 1 below.
Table 1
Training corpus scale | BLEU
1 million novel-domain pairs (baseline model) | 20.03
1 million novel-domain pairs + 1 million screened novel-domain data (hybrid training) | 20.51
1 million novel-domain pairs + 3 million screened novel-domain data (hybrid training) | 20.91
1 million novel-domain pairs + 5 million screened novel-domain data (hybrid training) | 21.56
The above table can be read as follows: by adopting the corpus processing method of the translation model provided by the embodiments of the application, 1 million sentence pairs of novel-domain corpus are used to perform quality filtering and domain screening on 600 million sentence pairs of large-scale bilingual data, yielding 5 million sentence pairs of novel-domain data; a new training corpus composed of the 1 million novel-domain pairs and the 5 million screened novel-domain pairs is then used to train the translation model, whose BLEU score reaches 21.56.
It should be understood that, although the steps in the above flowcharts are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in the above flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed sequentially but may be performed in turn or alternately with at least some of the other steps or of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 15, there is provided a corpus processing apparatus 1500 of a translation model, which may be a part of a computer device implemented as a software module, a hardware module, or a combination of the two. The apparatus specifically includes a corpus obtaining module 1502, a quality filtering module 1504 and a domain screening module 1506, wherein:
a corpus obtaining module 1502, configured to obtain an original training corpus used for training a translation model;
the quality filtering module 1504 is used for acquiring at least two groups of trained universal language models, acquiring quality scores corresponding to parallel sentences in the original training corpus through each group of universal language models, and filtering the original training corpus according to the quality scores to acquire the training corpus meeting preset quality conditions, wherein the model structures of each group of universal language models are different;
the domain screening module 1506 is used for obtaining a domain score corresponding to each parallel sentence in the training corpus meeting the preset quality condition through a trained target domain language model and a trained general domain language model, and screening the training corpus which belongs to the target domain and meets the preset quality condition from the training corpus meeting the preset quality condition according to the domain scores, wherein the model structures of the target domain language model and the general domain language model are the same; and the screened training corpus which belongs to the target domain and meets the preset quality condition is used for obtaining, after model training is carried out on the translation model, a translation model for translating text of the target domain.
In one embodiment, the quality filtering module 1504 is further configured to score, through each group of universal language models, the original sentences and the translation sentences in each parallel sentence in the original training corpus, to obtain the original text quality score and the translation quality score of each group of universal language models corresponding to the parallel sentences respectively; and fuse the original text quality scores and the translation quality scores of each group of universal language models corresponding to each parallel sentence to obtain the quality score corresponding to each parallel sentence.
In one embodiment, the quality filtering module 1504 includes an original scoring unit and a translated scoring unit;
the original text scoring unit is used for scoring original text sentences in parallel sentences in the original training corpus respectively through the original text language models in each group of general language models to respectively obtain original text quality scores of the parallel sentences;
and the translation scoring unit is used for scoring the translation sentences in each parallel sentence in the original training corpus respectively through the translation language models in each group of universal language models to respectively obtain the translation quality scores of the parallel sentences.
In one embodiment, the quality filtering module 1504 is further configured to perform normalization processing on the original text quality scores of the parallel sentences obtained by the same group of universal language models according to the highest score and the lowest score of the original text quality scores of the parallel sentences in the original training corpus obtained by that group, so as to obtain normalized original text quality scores; perform normalization processing on the translation quality scores of the parallel sentences obtained by the same group of universal language models according to the highest score and the lowest score of the translation quality scores of the parallel sentences in the original training corpus obtained by that group, so as to obtain normalized translation quality scores; and fuse the normalized original text quality scores and the normalized translation quality scores of each group of universal language models corresponding to each parallel sentence to obtain the quality score corresponding to each parallel sentence.
In one embodiment, the quality filtering module 1504 is further configured to sum the original text quality score and the translation quality score of each group of universal language models to obtain a group-level score; acquire a weighting coefficient corresponding to each group of universal language models; and perform, based on the weighting coefficients, a weighted summation of the group-level scores of each group of universal language models corresponding to the parallel sentences to obtain the quality scores corresponding to the parallel sentences.
In one embodiment, when the generic language model is a statistical language model obtained based on high-quality corpus, the quality filtering module 1504 is further configured to sequentially obtain parallel sentences from the original training corpus; input the original sentences in the parallel sentences into a statistical language model of the original text, and obtain the original text quality scores of the parallel sentences through the statistical language model of the original text based on the conditional frequency corresponding to each word in the original sentences; input the translation sentences in the parallel sentences into a statistical language model of the translation, and obtain the translation quality scores of the parallel sentences through the statistical language model of the translation based on the conditional frequency corresponding to each word in the translation sentences; and fuse the original text quality scores and the translation quality scores of the parallel sentences to obtain the quality scores of the parallel sentences corresponding to the statistical language model.
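For illustration, a minimal bigram sketch of such conditional-frequency scoring follows (a sketch only: the actual statistical language model may use higher-order n-grams and a different smoothing scheme):

import math
from collections import Counter

class BigramLM:
    def __init__(self, corpus):
        # corpus: an iterable of tokenized sentences from high-quality monolingual text.
        self.unigrams, self.bigrams = Counter(), Counter()
        for sent in corpus:
            tokens = ["<s>"] + list(sent)
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))

    def score(self, sent):
        # Length-normalized sum of log conditional frequencies, with add-one
        # smoothing so unseen word pairs do not zero out the whole sentence.
        tokens = ["<s>"] + list(sent)
        vocab = len(self.unigrams)
        logp = sum(
            math.log((self.bigrams[(a, b)] + 1) / (self.unigrams[a] + vocab))
            for a, b in zip(tokens, tokens[1:])
        )
        return logp / max(len(sent), 1)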
In one embodiment, when the generic language model is an autoregressive language model, the quality filtering module 1504 is further configured to sequentially obtain parallel sentences from the original corpus; inputting the original sentences in the parallel sentences into an autoregressive language model of the original text, predicting the conditional probability of each word appearing from left to right or from right to left in the original sentences through the autoregressive language model of the original text, and obtaining the original text quality scores of the parallel sentences according to the conditional probability corresponding to each word; inputting the translated sentences in the parallel sentences into an autoregressive language model of the translated sentences, predicting the conditional probability of each word appearing from left to right or from right to left in the translated sentences through the autoregressive language model of the translated sentences, and obtaining the translated sentence quality scores of the parallel sentences according to the conditional probability corresponding to each word; and fusing the original text quality scores and the translated text quality scores of the parallel sentences to obtain the quality scores of the autoregressive language models corresponding to the parallel sentences.
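A sketch of the left-to-right case follows; next_token_prob is an assumed interface onto the trained autoregressive model, not a real library call:

import math

def autoregressive_quality_score(tokens, next_token_prob):
    # next_token_prob(prefix, token) -> P(token | prefix), supplied by the
    # autoregressive language model (assumed interface).
    total = 0.0
    for i, token in enumerate(tokens):
        total += math.log(next_token_prob(tokens[:i], token))
    # Length normalization keeps long sentences comparable with short ones.
    return total / max(len(tokens), 1)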
In one embodiment, when the generic language model is a self-coding language model, the quality filtering module 1504 is further configured to sequentially obtain parallel sentences from the original training corpus; sequentially take each word in the original sentence of a parallel sentence as a masking word, input the masked original sentence into a self-coding language model of the original text, output the prediction probability corresponding to the masking word through the self-coding language model of the original text, and obtain the original text quality score of the parallel sentence according to the prediction probability corresponding to each masking word; sequentially take each word in the translation sentence of the parallel sentence as a masking word, input the masked translation sentence into a self-coding language model of the translation, output the prediction probability corresponding to the masking word through the self-coding language model of the translation, and obtain the translation quality score of the parallel sentence according to the prediction probability corresponding to each masking word; and fuse the original text quality scores and the translation quality scores of the parallel sentences to obtain the quality scores of the parallel sentences corresponding to the self-coding language model.
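Similarly, the self-coding case can be sketched as a pseudo-log-likelihood over masked positions; masked_token_prob is again an assumed interface rather than a real API:

import math

def self_coding_quality_score(tokens, masked_token_prob):
    # masked_token_prob(masked_tokens, position, original_token) -> predicted
    # probability of the original token at the masked position (assumed interface).
    total = 0.0
    for i, token in enumerate(tokens):
        masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        total += math.log(masked_token_prob(masked, i, token))
    return total / max(len(tokens), 1)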
In one embodiment, the domain screening module 1506 is further configured to score, through a target domain language model of the original text, the original sentences in each parallel sentence in the training corpus meeting the preset quality condition, so as to obtain a first domain score corresponding to the parallel sentence; score the same original sentences through a general domain language model of the original text to obtain a second domain score corresponding to the parallel sentence; and obtain the domain score corresponding to each parallel sentence in the training corpus meeting the preset quality condition according to the difference between the first domain score and the second domain score corresponding to each parallel sentence.
In one embodiment, the domain screening module 1506 is further configured to score, through a target domain language model of the translation, the translated sentences in each parallel sentence in the training corpus meeting the preset quality condition, so as to obtain a third domain score corresponding to the parallel sentence; score the same translated sentences through a general domain language model of the translation to obtain a fourth domain score corresponding to the parallel sentence; and obtain the domain score corresponding to each parallel sentence in the training corpus meeting the preset quality condition according to the difference between the third domain score and the fourth domain score corresponding to each parallel sentence.
In one embodiment, the domain screening module 1506 is further configured to score the original sentences in each parallel sentence through the target domain language model of the original text to obtain a first domain score; score the original sentences through the general domain language model of the original text to obtain a second domain score; score the translated sentences through the target domain language model of the translation to obtain a third domain score; score the translated sentences through the general domain language model of the translation to obtain a fourth domain score; and fuse the difference between the first and second domain scores with the difference between the third and fourth domain scores corresponding to each parallel sentence, to obtain the domain score corresponding to each parallel sentence in the training corpus meeting the preset quality condition.
In an embodiment, the corpus processing apparatus 1500 of the translation model further includes:
the first training module is used for obtaining a parallel corpus of the target domain, and performing model training on a language model to be trained by using the parallel corpus of the target domain to obtain the target domain language model;
and the second training module is used for sampling the training corpus meeting the preset quality condition to obtain a sampled corpus, and performing model training on a language model to be trained by using the sampled corpus to obtain the general domain language model.
In the corpus processing apparatus 1500 of the above translation model, at least two groups of trained universal language models are used to jointly score each parallel sentence in the original training corpus. Compared with quality filtering based on a single language model, because the model structure of each group of universal language models is different, the integrated multiple groups of language models can score the parallel sentences fully and from different angles, ensuring that the training corpus meeting the preset quality condition can be filtered out of the original training corpus. Further, on this basis, the domain scores corresponding to the parallel sentences in the training corpus meeting the preset quality condition are obtained through the trained target domain language model and general domain language model, so that the training corpus which belongs to the target domain and meets the preset quality condition is further screened out for model training of the translation model in the target domain. Compared with filtering based on quality alone, the corpus of the target domain can be screened on the basis of ensuring high quality, so that the obtained corpus can greatly improve the translation performance of the translation model. In addition, by first scoring for quality and then screening the target-domain training corpus on the high-quality basis, compared with directly performing a mixed quality-and-domain scoring of the original training corpus, the output corpus is guaranteed, to the greatest extent, to be of high quality and closest to the target domain.
For the specific definition of the corpus processing apparatus of the translation model, reference may be made to the above definition of the corpus processing method of the translation model, which is not repeated here. All or part of the modules in the corpus processing apparatus of the translation model can be implemented by software, by hardware, or by a combination thereof. The modules can be embedded, in hardware form, in or independent of a processor in the computer device, or stored, in software form, in a memory in the computer device, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, and the computer device may be the server or the terminal in fig. 1, and its internal structure diagram may be as shown in fig. 16. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a corpus processing method for a translation model.
Those skilled in the art will appreciate that the architecture shown in fig. 16 is merely a block diagram of part of the structure associated with the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments only express several implementations of the present application, and the descriptions thereof are relatively specific and detailed, but they cannot therefore be construed as limiting the scope of the invention patent. It should be noted that, for a person of ordinary skill in the art, several variations and improvements can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (15)

1. A corpus processing method of a translation model is characterized by comprising the following steps:
acquiring an original training corpus for training a translation model;
obtaining at least two groups of trained universal language models, obtaining quality scores corresponding to parallel sentences in the original training corpus through each group of universal language models, filtering the original training corpus according to the quality scores to obtain training corpora meeting preset quality conditions, wherein the model structures of each group of universal language models are different;
obtaining a domain score corresponding to each parallel sentence in the training corpus meeting the preset quality condition through a trained target domain language model and a trained general domain language model, and screening the training corpus which belongs to a target domain and meets the preset quality condition from the training corpus meeting the preset quality condition according to the domain scores, wherein the model structures of the target domain language model and the general domain language model are the same;
and the screened training corpus is used for obtaining the translation model of the target domain after model training is carried out on the translation model.
2. The method according to claim 1, wherein the obtaining the quality score corresponding to each parallel sentence in the original training corpus through each group of universal language models comprises:
scoring the original sentences and the translated sentences in the parallel sentences in the original training corpus respectively through each group of universal language models, and respectively obtaining original quality scores and translated quality scores of the parallel sentences corresponding to each group of universal language models;
and fusing the original text quality scores and the translation quality scores of each group of universal language models corresponding to each parallel sentence to obtain the quality score corresponding to each parallel sentence.
3. The method according to claim 2, wherein the scoring the original sentences and the translated sentences in the parallel sentences in the original corpus by each group of universal language models to obtain the original text quality scores and the translated text quality scores of the parallel sentences corresponding to each group of universal language models respectively comprises:
respectively scoring original sentences in parallel sentences in an original training corpus through original language models in each group of general language models to respectively obtain original quality scores of the parallel sentences;
and scoring the translation sentences in each parallel sentence in the original training corpus respectively through the translation language models in each group of universal language models to respectively obtain the translation quality scores of the parallel sentences.
4. The method according to claim 2, wherein the fusing the original text quality scores and the translation quality scores of each group of universal language models corresponding to each parallel sentence to obtain the quality score corresponding to each parallel sentence comprises:
normalizing the original text quality scores of the parallel sentences obtained by the same group of universal language models according to the highest score and the lowest score in the original text quality scores of the parallel sentences in the original training corpus obtained by the same group of universal language models to obtain normalized original text quality scores;
normalizing the translation quality scores of the parallel sentences obtained by the same group of universal language models according to the highest score and the lowest score in the translation quality scores of the parallel sentences in the original training corpus obtained by the same group of universal language models to obtain normalized translation quality scores;
and fusing the normalized original text quality scores and the normalized translation quality scores of each group of general language models corresponding to the parallel sentences to obtain the quality scores corresponding to the parallel sentences.
5. The method according to claim 2, wherein the fusing the original text quality scores and the translation quality scores of each group of universal language models corresponding to each parallel sentence to obtain the quality score corresponding to each parallel sentence comprises:
summing the original text quality scores and the translated text quality scores of each group of universal language models to obtain group-level scores;
acquiring a weighting coefficient corresponding to each group of universal language models;
and carrying out weighted summation on the group-level scores of each group of universal language models corresponding to the parallel sentences based on the weighting coefficients corresponding to each group of universal language models, to obtain the quality scores corresponding to the parallel sentences.
6. The method according to claim 1, wherein when the generic language model is a statistical language model obtained based on high-quality corpus, the obtaining, through each group of generic language models, a quality score corresponding to each parallel sentence in the original corpus comprises:
sequentially acquiring parallel sentences from the original training corpus;
inputting the original sentences in the parallel sentences into a statistical language model of the original text, and obtaining the original text quality scores of the parallel sentences through the statistical language model of the original text based on the conditional frequency corresponding to each word in the original sentences;
inputting the translation sentences in the parallel sentences into a statistical language model of the translation, and obtaining the translation quality scores of the parallel sentences through the statistical language model of the translation based on the conditional frequency corresponding to each word in the translation sentences;
and fusing the original text quality scores and the translation quality scores of the parallel sentences to obtain the quality scores of the parallel sentences corresponding to the statistical language model.
7. The method according to claim 1, wherein when the common language model is an autoregressive language model, the obtaining the quality score corresponding to each parallel sentence in the original corpus through each group of common language models comprises:
sequentially acquiring parallel sentences from the original training corpus;
inputting the original sentences in the parallel sentences into an autoregressive language model of the original text, predicting the conditional probability of each word appearing from left to right or from right to left in the original sentences through the autoregressive language model of the original text, and obtaining the original text quality scores of the parallel sentences according to the conditional probability corresponding to each word;
inputting the translated sentence in the parallel sentences into an autoregressive language model of the translated sentence, predicting the conditional probability of each word appearing from left to right or from right to left in the translated sentence through the autoregressive language model of the translated sentence, and obtaining the translated sentence quality score of the parallel sentences according to the conditional probability corresponding to each word;
and fusing the original text quality scores and the translation quality scores of the parallel sentences to obtain the quality scores of the parallel sentences corresponding to the autoregressive language model.
8. The method according to claim 1, wherein when the generic language model is a self-coding language model, the obtaining the quality score corresponding to each parallel sentence in the original corpus through each group of generic language models comprises:
sequentially acquiring parallel sentences from the original training corpus;
sequentially taking each word in the original text sentences of the parallel sentences as a masking word, inputting the masked original text sentences into a self-coding language model of the original text, outputting the corresponding prediction probability of the masking word through the self-coding language model of the original text, and obtaining the original text quality scores of the parallel sentences according to the corresponding prediction probability of each masking word;
sequentially taking each word in the translation sentence of the parallel sentence as a masking word, inputting the masked translation sentence into a self-coding language model of the translation, outputting a prediction probability corresponding to the masking word through the self-coding language model of the translation, and obtaining a translation quality score of the parallel sentence according to the prediction probability corresponding to each masking word;
and fusing the original text quality scores and the translation quality scores of the parallel sentences to obtain the quality scores of the parallel sentences corresponding to the self-coding language model.
9. The method according to claim 1, wherein the obtaining, through the trained target domain language model and the trained general domain language model, the domain score corresponding to each parallel sentence in the corpus that satisfies the preset quality condition comprises:
scoring the original sentences in the parallel sentences in the training corpus meeting the preset quality condition through a target domain language model of the original text to obtain first domain scores corresponding to the parallel sentences;
scoring the original sentences in the parallel sentences in the training corpus meeting the preset quality condition through a general domain language model of the original text to obtain second domain scores corresponding to the parallel sentences;
and obtaining the domain score corresponding to each parallel sentence in the training corpus meeting the preset quality condition according to the difference between the first domain score and the second domain score corresponding to each parallel sentence.
10. The method according to claim 1, wherein the obtaining, through the trained target domain language model and the trained general domain language model, the domain score corresponding to each parallel sentence in the corpus that satisfies the preset quality condition comprises:
scoring translation sentences in parallel sentences in the training corpus meeting the preset quality condition through a target domain language model of the translation to obtain third domain scores corresponding to the parallel sentences;
scoring translation sentences in parallel sentences in the training corpus meeting the preset quality condition through a general domain language model of the translation to obtain fourth domain scores corresponding to the parallel sentences;
and obtaining the domain score corresponding to each parallel sentence in the training corpus meeting the preset quality condition according to the difference between the third domain score and the fourth domain score corresponding to each parallel sentence.
11. The method according to claim 1, wherein the obtaining, through the trained target domain language model and the trained general domain language model, the domain score corresponding to each parallel sentence in the corpus that satisfies the preset quality condition comprises:
scoring the original sentences in the parallel sentences in the training corpus meeting the preset quality condition through a target domain language model of the original text to obtain first domain scores corresponding to the parallel sentences;
scoring the original sentences in the parallel sentences in the training corpus meeting the preset quality condition through a general domain language model of the original text to obtain second domain scores corresponding to the parallel sentences;
scoring translation sentences in parallel sentences in the training corpus meeting the preset quality condition through a target domain language model of the translation to obtain third domain scores corresponding to the parallel sentences;
scoring translation sentences in parallel sentences in the training corpus meeting the preset quality condition through a general domain language model of the translation to obtain fourth domain scores corresponding to the parallel sentences;
and fusing the difference between the first domain score and the second domain score and the difference between the third domain score and the fourth domain score corresponding to each parallel sentence to obtain the domain score corresponding to each parallel sentence in the corpus meeting the preset quality condition.
12. The method according to any one of claims 1 to 11, further comprising:
obtaining a parallel corpus of the target domain, and performing model training on a language model to be trained by using the parallel corpus of the target domain to obtain the target domain language model;
and sampling the training corpus meeting the preset quality condition to obtain a sampled corpus, and performing model training on a language model to be trained by using the sampled corpus to obtain the general domain language model.
13. A corpus processing apparatus for a translation model, the apparatus comprising:
the corpus acquiring module is used for acquiring an original training corpus used for training the translation model;
the quality filtering module is used for acquiring at least two groups of trained universal language models, acquiring a quality score corresponding to each parallel sentence in the original training corpus through each group of universal language models, and filtering the original training corpus according to the quality scores to obtain the training corpus meeting a preset quality condition, wherein the model structures of each group of universal language models are different;
the domain screening module is used for obtaining a domain score corresponding to each parallel sentence in the training corpus meeting the preset quality condition through a trained target domain language model and a trained general domain language model, and screening the training corpus which belongs to a target domain and meets the preset quality condition from the training corpus meeting the preset quality condition according to the domain scores, wherein the model structures of the target domain language model and the general domain language model are the same; and the screened training corpus is used for obtaining the translation model of the target domain after model training is carried out on the translation model.
14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 12.
15. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 12.
CN202110553522.4A 2021-05-20 2021-05-20 Corpus processing method, device and equipment for translation model and storage medium Active CN113761944B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110553522.4A CN113761944B (en) 2021-05-20 2021-05-20 Corpus processing method, device and equipment for translation model and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110553522.4A CN113761944B (en) 2021-05-20 2021-05-20 Corpus processing method, device and equipment for translation model and storage medium

Publications (2)

Publication Number Publication Date
CN113761944A true CN113761944A (en) 2021-12-07
CN113761944B CN113761944B (en) 2024-03-15

Family

ID=78787149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110553522.4A Active CN113761944B (en) 2021-05-20 2021-05-20 Corpus processing method, device and equipment for translation model and storage medium

Country Status (1)

Country Link
CN (1) CN113761944B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009140499A (en) * 2007-12-07 2009-06-25 Toshiba Corp Method and apparatus for training target language word inflection model based on bilingual corpus, tlwi method and apparatus, and translation method and system for translating source language text into target language
US20140200878A1 (en) * 2013-01-14 2014-07-17 Xerox Corporation Multi-domain machine translation model adaptation
JP2017021422A (en) * 2015-07-07 2017-01-26 国立研究開発法人情報通信研究機構 Statistical translation optimization device, statistical translation system, and computer program
CN110874536A (en) * 2018-08-29 2020-03-10 阿里巴巴集团控股有限公司 Corpus quality evaluation model generation method and bilingual sentence pair inter-translation quality evaluation method
US20200210772A1 (en) * 2018-12-31 2020-07-02 Charles University Faculty of Mathematics and Physics A Computer-Implemented Method of Creating a Translation Model for Low Resource Language Pairs and a Machine Translation System using this Translation Model
CN110263349A (en) * 2019-03-08 2019-09-20 腾讯科技(深圳)有限公司 Corpus assessment models training method, device, storage medium and computer equipment
CN111767712A (en) * 2019-04-02 2020-10-13 北京地平线机器人技术研发有限公司 Business data screening method and device based on language model, medium and equipment
CN110162800A (en) * 2019-05-08 2019-08-23 北京百度网讯科技有限公司 The training method and device of translation model
CN111460838A (en) * 2020-04-23 2020-07-28 腾讯科技(深圳)有限公司 Pre-training method and device of intelligent translation model and storage medium
CN112151021A (en) * 2020-09-27 2020-12-29 北京达佳互联信息技术有限公司 Language model training method, speech recognition device and electronic equipment
CN112347795A (en) * 2020-10-04 2021-02-09 北京交通大学 Machine translation quality evaluation method, device, equipment and medium
CN112257472A (en) * 2020-11-13 2021-01-22 腾讯科技(深圳)有限公司 Training method of text translation model, and text translation method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
XUEBO LIU; LONGYUE WANG: "Understanding and Improving Encoder Layer Fusion in Sequence-to-Sequence Learning", Computer Science, 29 December 2020 (2020-12-29), pages 1-14 *
ZHIRUI ZHANG: "Joint Training for Neural Machine Translation Models with Monolingual Data", Thirty-Second AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 25 April 2018 (2018-04-25), pages 555-562 *
YAO LIANG; HONG YU; LIU HAO; LIU LE; YAO JIANMIN: "Bilingual Sentence Pair Selection Based on the Fusion of Translation Model and Language Model", Journal of Chinese Information Processing, no. 05, 15 September 2016 (2016-09-15), pages 149-156 *
DAI RENYUE; FANG ZHIJUN; GAO YONGBIN: "Unsupervised Monocular Depth Estimation Fusing Dilated Convolutional Network and SLAM", Laser & Optoelectronics Progress, no. 06, 31 December 2020 (2020-12-31), pages 114-122 *

Also Published As

Publication number Publication date
CN113761944B (en) 2024-03-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant