WO2021098397A1 - Data processing method, device, and storage medium - Google Patents

Data processing method, device, and storage medium

Info

Publication number
WO2021098397A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
source language
training data
target
feature
Prior art date
Application number
PCT/CN2020/119523
Other languages
English (en)
French (fr)
Inventor
袁松岭
文心杰
王晓利
伍海江
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2021098397A1
Priority to US17/517,075 (published as US20220058349A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/44 Statistical methods, e.g. probability models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • the embodiments of the present application relate to the field of computer technology, and in particular, to a data processing method, device, and storage medium.
  • the bilingual training data is composed of source language data and labeled language data corresponding to the source language data.
  • the cost of obtaining labeled language data in bilingual training data is relatively high. Therefore, in order to obtain high-quality bilingual training data under fixed cost constraints, a large amount of source language data needs to be filtered first, and then the markup language data corresponding to the filtered source language data is obtained.
  • source language data is filtered based on word frequency or model confidence.
  • the adaptation scenarios of these filtering rules are relatively limited.
  • the quality of the filtered source language data is not good, so the translation performance of the machine translation model obtained based on the filtered source language data and the annotation language data corresponding to the filtered source language data is poor.
  • the embodiments of the present application provide a data processing method, device, and storage medium, which can be used to improve the quality of source language data after screening.
  • the technical solution is as follows:
  • an embodiment of the present application provides a data processing method, and the method includes:
  • obtaining a data set to be screened, the data set to be screened including a plurality of source language data to be screened; screening each source language data in the data set to be screened based on a target data screening model to obtain screened target source language data, the target data screening model being obtained by training with a reinforcement learning algorithm; obtaining the markup language data corresponding to the target source language data; and obtaining a machine translation model based on the target source language data and the markup language data.
  • In another aspect, a data processing device includes:
  • the first obtaining module is configured to obtain a data set to be screened, and the data set to be screened includes a plurality of source language data to be screened;
  • the screening module is configured to screen each source language data in the data set to be screened based on the target data screening model to obtain the screened target source language data, and the target data screening model is obtained by training with a reinforcement learning algorithm;
  • the second acquiring module is configured to acquire the annotation language data corresponding to the target source language data
  • the third acquisition module is configured to acquire a machine translation model based on the target source language data and the annotation language data.
  • In another aspect, a computer device includes a processor and a memory, where at least one piece of program code is stored in the memory, and the at least one piece of program code is loaded and executed by the processor to implement any of the aforementioned data processing methods.
  • In another aspect, a non-transitory computer-readable storage medium stores at least one piece of program code, and the at least one piece of program code is loaded and executed by a processor to implement any of the above-mentioned data processing methods.
  • In another aspect, a computer program product stores at least one computer program, and the at least one computer program is loaded and executed by a processor to implement any of the above-mentioned data processing methods.
  • Based on the target data screening model, each source language data in the data set to be filtered is filtered, and then the machine translation model is obtained based on the filtered target source language data and the annotation language data corresponding to the target source language data.
  • the screening rules in the target data screening model are automatically learned by the machine in the process of reinforcement learning.
  • the target data screening model has a wide range of adaptation scenarios, and the quality of the source language data after screening is high.
  • the translation performance of the machine translation model obtained based on the filtered source language data and the annotation language data corresponding to the filtered source language data is better.
  • FIG. 1 is a schematic diagram of an implementation environment of a data processing method provided by an embodiment of the present application
  • Fig. 2 is a flowchart of a data processing method provided by an embodiment of the present application
  • FIG. 3 is a flowchart of a data processing method provided by an embodiment of the present application.
  • FIG. 4 is a flowchart of a method for obtaining a second data screening model provided by an embodiment of the present application
  • FIG. 5 is a schematic diagram of a process of obtaining a screening result of any source language training data in a first target training data set according to an embodiment of the present application
  • FIG. 6 is a schematic diagram of a process of obtaining an updated first data screening model provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of an active learning process provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a data processing device provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a data processing device provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a first training module provided by an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • Natural language processing (Natural Language Processing, NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods for effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics; research in this field involves natural language, that is, the language people use daily, so natural language processing is closely related to the study of linguistics. Natural language processing technology usually includes text processing, semantic understanding, machine translation, question answering, knowledge graphs, and other technologies.
  • Machine translation refers to the process of using a machine to translate one natural language (the natural language that needs to be translated, generally called the source language) into another natural language (the translated natural language, called the target language), realizing conversion between natural languages of different languages.
  • machine translation is generally implemented through a machine translation model, such as a neural network-based NMT (Neural Machine Translation) model.
  • a sufficient amount of bilingual training data is required.
  • the bilingual training data is composed of source language data and labeled language data corresponding to the source language data.
  • professional translators are often required to manually translate the source language data into labeled language data.
  • Because the cost of manual translation is high, the cost of obtaining bilingual training data is relatively high. Therefore, in order to obtain high-quality bilingual training data under fixed cost constraints, it is necessary to filter a large amount of source language data first, and then obtain the labeled language data corresponding to the filtered source language data, thereby improving the translation performance of the machine translation model obtained based on the filtered source language data and the annotation language data corresponding to the filtered source language data.
  • FIG. 1 shows a schematic diagram of an implementation environment of the data processing method provided by an embodiment of the present application.
  • the implementation environment includes: a terminal 11 and a server 12.
  • the terminal 11 can obtain the source language data to be filtered from the Internet, and send the source language data to be filtered to the server 12.
  • the terminal 11 can also receive the filtered source language data returned by the server 12, and display the filtered source language data.
  • a professional translator translates the filtered source language data into annotated language data.
  • the terminal 11 sends the annotation language data to the server 12.
  • the server 12 can use the reinforcement learning algorithm to train to obtain a target data screening model. Based on the target data screening model, the source language data to be screened sent by the terminal 11 can be screened.
  • the server 12 can also send the screened source language data to the terminal 11.
  • The server 12 then acquires the markup language data corresponding to the filtered source language data sent back by the terminal 11.
  • the server 12 can obtain a machine translation model based on the filtered source language data and the markup language data corresponding to the filtered source language data.
  • In addition, the terminal 11 can also use a reinforcement learning algorithm to train and obtain a target data screening model, filter the acquired source language data to be screened based on the target data screening model, and then obtain a machine translation model according to the filtered source language data and the marked language data corresponding to the filtered source language data.
  • the terminal 11 is a smart device such as a mobile phone, a tablet computer, a personal computer, and the like.
  • the server 12 is a server, or a server cluster composed of multiple servers, or a cloud computing service center.
  • the terminal 11 and the server 12 establish a communication connection through a wired or wireless network.
  • The foregoing terminal 11 and server 12 are only examples, and other existing or future terminals or servers that are applicable to this application should also be included in the scope of protection of this application, and are incorporated herein by reference.
  • an embodiment of the present application provides a data processing method, which is applied to a computer device, and the computer device is a server or a terminal.
  • the method is applied to a server as an example.
  • the method provided by the embodiment of the present application includes the following steps:
  • In step 201, a data set to be filtered is obtained, where the data set to be filtered includes a plurality of source language data to be filtered.
  • the data set to be filtered is the data set that needs to be filtered.
  • the data set to be filtered includes multiple source language data to be filtered. It should be noted that in the embodiments of the present application, the language corresponding to the source language data is referred to as the first language. Exemplarily, the source language data refers to sentences in the first language.
  • the manner in which the server obtains the data set to be filtered includes but is not limited to the following two:
  • Method 1: The server obtains the data set to be filtered from the database of the first language.
  • the server randomly selects the first reference number of sentences from the database of the first language to form the data set to be filtered.
  • the first reference quantity is determined according to the quantity of bilingual data that needs to be acquired, or can be adjusted freely according to actual conditions, which is not limited in the embodiment of the present application.
  • Method 2: The server receives the network data sent by the terminal, parses the sentences in the first language in the network data, and obtains the data set to be filtered based on the parsed sentences in the first language.
  • the terminal can obtain network data, which may include sentences in different languages; after the terminal sends the network data to the server, the server can parse the sentences in the first language in the network data.
  • the process for the server to obtain the data set to be filtered based on the parsed sentences in the first language is: the server selects the first reference number of sentences from the parsed sentences in the first language to form the data set to be filtered .
  • After the data set to be filtered is obtained, step 202 may be executed.
  • a sufficient amount of bilingual training data is required.
  • the amount of bilingual training data in the existing bilingual database may be less.
  • the server needs to obtain new bilingual training data to expand the existing bilingual database.
  • the cost of acquiring new bilingual training data is relatively high. Therefore, the server needs to filter a large amount of source language data to be filtered based on step 202 to improve the quality of the acquired bilingual training data.
  • In step 202, based on the target data screening model, each source language data in the data set to be screened is screened to obtain the screened target source language data, where the target data screening model is obtained by training with a reinforcement learning algorithm.
  • After the server obtains the data set to be filtered, it can filter each source language data in the data set to be filtered based on the target data screening model to obtain the filtered target source language data.
  • the target data screening model is trained using reinforcement learning algorithms. That is to say, the screening rules of the target data screening model are automatically learned by the machine in the process of reinforcement learning.
  • The screening rules of the target data screening model can therefore be adapted to a variety of different scenarios and have a wide range of applications.
  • The server filters each source language data in the data set to be filtered based on the target data filtering model, and the process of obtaining the filtered target source language data is: obtain the features of each source language data in the data set to be filtered, and input the features of each source language data into the target data screening model; the target data screening model processes the input features of each source language data and outputs the screening result of each source language data; the server then obtains the filtered target source language data based on the screening results of each source language data.
  • the embodiment of the present application does not limit the method of obtaining the characteristics of each source language data in the data set to be filtered.
  • Exemplarily, for any source language data, the feature of that source language data is obtained based on the word embedding features corresponding to the sub-data in that source language data, the length of that source language data, and the like.
  • the feature of any source language data is expressed in the form of a vector.
  • the server inputs the characteristics of each source language data into the target data screening model, including but not limited to the following two:
  • Method 1: The server inputs the features of one source language data into the target data screening model for processing at a time, until the features of each source language data have been input into the target data screening model.
  • the target data filtering model only outputs the filtering results of one source language data at a time.
  • Method 2: The server divides the source language data into a second reference number of source language data groups, and each time inputs the features of all the source language data in one source language data group into the target data screening model for simultaneous processing, until the features of all the source language data in all the source language data groups have been input into the target data screening model.
  • the target data screening model outputs the screening results of all source language data in one source language data group at a time.
  • the second reference quantity is set according to experience or freely adjusted according to application scenarios, which is not limited in the embodiment of the present application. Exemplarily, when the second reference quantity is set to 1, the characteristics of each source language data are input into the target data screening model in the same batch for processing, and the target data screening model outputs the screening results of each source language data in the same batch.
  • the screening result is the first result or the second result.
  • the first result is used to indicate that the reliability of the source language data is high
  • the second result is used to indicate that the reliability of the source language data is low.
  • When the screening result of any source language data is the first result, it indicates that the reliability of that source language data is high, that is, that source language data is high-quality source language data; when the screening result of any source language data is the second result, it indicates that the reliability of that source language data is low, that is, that source language data is low-quality source language data.
  • the first result and the second result are represented by a value of 1 and a value of 0, respectively.
  • When the screening result of a certain source language data output by the target data screening model is 1, it means that the screening result of that source language data is the first result; when the screening result of a certain source language data output by the target data screening model is 0, it means that the screening result of that source language data is the second result.
  • the server obtains the filtered target source language data based on the filtering results of each source language data: the server uses the source language data whose filtering result is the first result as the filtered target source language data.
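  • As an illustration of the screening flow just described, the following sketch applies a trained screening model to a matrix of per-sentence features and keeps the data whose screening result is 1; the model, feature layout, and all names here are assumptions for illustration, not the application's actual implementation.

```python
import torch

def screen_dataset(model, features, batch_size=64):
    # model: a trained screening model mapping a batch of feature vectors
    # to two logits per sentence (index 0 = second result, 1 = first result).
    # features: float tensor of shape (num_sentences, feature_dim).
    # Returns the indices of sentences whose screening result is 1.
    model.eval()
    kept = []
    with torch.no_grad():
        for start in range(0, features.size(0), batch_size):
            batch = features[start:start + batch_size]
            results = model(batch).argmax(dim=-1)
            for i, r in enumerate(results.tolist()):
                if r == 1:  # first result: high reliability, keep the sentence
                    kept.append(start + i)
    return kept
```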
  • After the server obtains the filtered target source language data, it can execute step 203 based on the filtered target source language data.
  • Before step 202 is executed, it is necessary to use a reinforcement learning algorithm to train and obtain the target data screening model.
  • the process of using the reinforcement learning algorithm to train to obtain the target data screening model is detailed in the embodiment shown in step 301 to step 303, and will not be repeated here.
  • In step 203, the markup language data corresponding to the target source language data is obtained, and a machine translation model is obtained based on the target source language data and the markup language data.
  • the filtered source language data is high-quality source language data
  • the filtered source language data is used as the target source language data
  • the annotation language data corresponding to the target source language data is further obtained.
  • the language corresponding to the labeling language data is referred to as the second language.
  • the labeling language data refers to sentences in the second language.
  • the tagging language data is obtained by professional translators translating the target source language data.
  • The process for the server to obtain the labeling language data corresponding to the target source language data is as follows: the server sends the target source language data to the terminal; the terminal displays the target source language data for professional translators to view and manually translate; when a translation confirmation instruction from the professional translator is detected, the terminal obtains the markup language data corresponding to the target source language data; the terminal then sends the markup language data corresponding to the target source language data to the server.
  • the server obtains the markup language data corresponding to the target source language data.
  • After obtaining the markup language data corresponding to the target source language data, the server obtains a machine translation model based on the target source language data and the markup language data. It should be noted that, in the process of obtaining a machine translation model based on the target source language data and the annotated language data, the server either directly trains the machine translation model based on the target source language data and the annotated language data, or the server adds the target source language data and the annotated language data to the existing bilingual training data to obtain expanded bilingual training data and then trains the machine translation model based on the expanded bilingual training data.
  • the embodiment of the present application does not limit the specific method of obtaining the machine translation model.
  • The translation performance of the machine translation model obtained according to the method provided in the embodiments of the present application has been compared through experiments with that of machine translation models obtained according to other methods.
  • The experiment process is as follows: in the data set to be screened, a target number of target source language data is obtained according to the method provided in the embodiments of this application, the annotated language data corresponding to the target source language data is obtained, and the target source language data together with the corresponding annotated language data is used as the first bilingual training sample; translation model 1 is trained based on the first bilingual training sample.
  • Translation model 2 is obtained by training on bilingual training samples screened according to other methods. The translation performance of translation model 1 and translation model 2 is tested on the WMT (Workshop on Machine Translation) field test set, the economic field test set, and the political field test set.
  • the translation model 1 obtained according to the method provided in the embodiments of the present application has a higher translation performance than the translation model 2 on the test set of various fields.
  • translation performance is represented by BLEU (Bilingual Evaluation Understudy) value.
  • In addition, the method provided in the embodiments of this application can obtain more effective and higher-quality source language data, reduce the translation cost of professional translators, and has important value in reducing budget and cost.
  • Based on the target data screening model, each source language data in the data set to be filtered is screened, and then the machine translation model is obtained based on the screened target source language data and the annotation language data corresponding to the target source language data.
  • the screening rules in the target data screening model are automatically learned by the machine in the process of reinforcement learning.
  • the target data screening model has a wide range of adaptation scenarios, and the quality of the source language data after screening is high.
  • the translation performance of the machine translation model obtained based on the filtered source language data and the annotation language data corresponding to the filtered source language data is better.
  • the embodiment of the present application provides a method for obtaining a target data screening model by training with a reinforcement learning algorithm, and the method is applied to a server as an example. As shown in FIG. 3, the method provided by the embodiment of the present application includes the following steps:
  • In step 301, a first training data set is initialized, where the first training data set includes a plurality of source language training data.
  • the first training data set is a data set to be screened required for training to obtain the target data screening model, and the first training data set includes multiple source language training data.
  • the source language training data is the source language data to be screened required for training to obtain the target data screening model.
  • the method of initializing the first training data set is to initialize the first training data set randomly, or to initialize the first training data set according to a preset manner, which is not limited in the embodiment of the present application.
  • the process of randomly initializing the first training data set is to randomly shuffle the order of the source language training data in the first training data set. Initializing the first training data set randomly is beneficial to improve the generalization ability of the target data screening model obtained by training.
  • In step 302, based on the initialized first training data set, a reinforcement learning algorithm is used to train the first data screening model to obtain a second data screening model.
  • the first data screening model is an initial data screening model corresponding to the initialized first training data set
  • the second data screening model is a final data screening model corresponding to the initialized first training data set.
  • the embodiment of the present application does not limit the specific form of the data screening model.
  • Exemplarily, the data screening model is a DQN (Deep Q-Network) model.
  • Step 302 is a process of obtaining a second data screening model, that is, obtaining a final data screening model corresponding to the initialized first training data set. As shown in FIG. 4, the process includes steps 3021 to 3026.
  • Step 3021 Divide the initialized first training data set into at least one target training data set.
  • the initialized first training data set includes multiple source language training data, and the initialized first training data set is divided into at least one target training data set, so that each target training data set includes part of the source in the initialized first training data set Language training data.
  • each target training data set is used for training each time. Compared with using one source language training data for training each time, this method can shorten the training time and improve the stability of the training process. It should be noted that after dividing into at least one target training data set, each target training data set is sorted, and in the subsequent training process, each target training data set is selected in sequence according to the sorting order. According to the arrangement order, each target training data set is the first target training data set, the second target training data set, ..., the nth target training data set (n is an integer greater than 0).
  • Step 3022 Invoke the first data screening model to process the target features of each source language training data in the first target training data set to obtain the screening results of each source language training data in the first target training data set, where the first target training data set is the first target training data set in the at least one target training data set.
  • the target feature of each source language training data in the first target training data set needs to be acquired. That is, after dividing the initialized first training data set into at least one target training data set, the target features of each source language training data in the first target training data set are acquired.
  • the first target training data set is the first target training data set in at least one target training data set.
  • the process of obtaining the target feature of any source language training data in the first target training data set includes the following steps 3022A to 3022C:
  • Step 3022A Obtain the first feature of any source language training data based on each sub-data in any source language training data.
  • the first feature is used to indicate the feature of any source language training data itself, and the first feature is obtained based on each sub-data in the any source language training data.
  • Any source language training data includes multiple sub-data. For example, when any source language training data is a sentence, each word in that sentence is one sub-data of the source language training data.
  • the process of obtaining the first feature of any source language training data includes the following steps 1 to 4:
  • Step 1 Obtain the third feature of any source language training data based on the word embedding feature of each sub-data in any source language training data.
  • Obtain the word embedding feature of each sub-data in any source language training data based on the vocabulary, pad the word embedding features of the sub-data to the same length, and then obtain the third feature of that source language training data based on the padded word embedding features of each sub-data.
  • the vocabulary refers to a table that stores the word embedding characteristics corresponding to each word.
  • the vocabulary can be constructed based on an existing corpus. The embodiment of the present application does not limit the construction process of the vocabulary.
  • the word embedding feature corresponding to each word in the vocabulary can be represented by a vector, and the dimension of the vector is set according to experience, for example, the dimension of the vector is set to 512 dimensions.
  • Exemplarily, the way to obtain the third feature of any source language training data is: input the word embedding features of each sub-data, padded to the same length, into the first neural network, and use the features obtained through the processing of the convolutional layer and the fully connected layer in the first neural network as the third feature of that source language training data.
  • the embodiment of the present application does not limit the settings of the convolutional layer and the fully connected layer in the first neural network.
  • the convolutional layer further includes a ReLU (Rectified Linear Unit, linear rectification function) processing module.
  • the first neural network is a CNN (Convolutional Neural Networks, convolutional neural network) network
  • the filter size of the convolutional layer is set to 3, 4, and 5, respectively.
  • the number of convolution kernels (filter number) is set to 128; a 384*256-dimensional feature vector is obtained after the fully connected layer is processed, and the feature vector is used as the third feature.
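  • A minimal PyTorch sketch of a network matching the configuration just described (parallel convolutions with filter sizes 3, 4, and 5, 128 filters each, followed by a fully connected layer); it assumes the 384*256 fully connected layer maps the 384 pooled values to a 256-dimensional vector, consistent with the 1*256 dimensions of the other features, and all class and argument names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceCNN(nn.Module):
    # Sketch of the first neural network: convolutions of size 3, 4, 5
    # (128 filters each) over padded word embeddings, ReLU, max pooling,
    # then a 384 -> 256 fully connected layer.
    def __init__(self, embed_dim=512, num_filters=128, out_dim=256):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, kernel_size=k) for k in (3, 4, 5)
        )
        self.fc = nn.Linear(3 * num_filters, out_dim)  # 384 -> 256

    def forward(self, embeddings):
        # embeddings: (batch, seq_len, embed_dim), padded so seq_len >= 5
        x = embeddings.transpose(1, 2)  # (batch, embed_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))  # (batch, 256)
```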
  • Step 2 Obtain the fourth feature of any source language training data based on the comparison result of each sub-data in any source language training data and the existing corpus database.
  • Exemplarily, the comparison is performed based on N-grams, where the N-grams include one or more of 2-gram, 3-gram, and 4-gram.
  • the way to obtain the fourth feature of any source language training data is: input the comparison result into the second neural network, and use the feature obtained through the processing of the second neural network as The fourth feature of any source language training data.
  • the embodiment of the present application does not limit the setting of the second neural network. Exemplarily, as shown in FIG. 5, after processing by the second neural network, a 1*256-dimensional feature vector is obtained, and the feature vector is used as the fourth feature.
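  • The text does not specify the exact comparison statistic, so the following sketch uses one plausible choice: the fraction of the sentence's n-grams (n = 2, 3, 4) that already occur in the existing corpus, yielding one scalar per n as the comparison result fed to the second neural network. All names are illustrative.

```python
def ngram_overlap(tokens, corpus_ngrams, n):
    # Fraction of the sentence's n-grams already present in the corpus;
    # corpus_ngrams is a set of n-gram tuples from the existing corpus.
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return sum(g in corpus_ngrams for g in grams) / len(grams)

def comparison_result(tokens, corpus_ngrams_by_n):
    # One scalar per n in {2, 3, 4}; this vector is what the second
    # neural network would consume in this sketch.
    return [ngram_overlap(tokens, corpus_ngrams_by_n[n], n) for n in (2, 3, 4)]
```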
  • Step 3 Determine the length of any source language training data based on the number of each sub-data in any source language training data, and obtain the fifth feature of any source language training data based on the length of any source language training data.
  • the length of any source language training data can be determined. For example, when any source language training data is a sentence and the sub-data is a word, the number of words included in the sentence is the length of the sentence.
  • the way to obtain the fifth feature of any source language training data is: input the length of any source language training data into the third neural network, and then The feature obtained by the processing of the third neural network is used as the fifth feature of any source language training data.
  • the embodiment of the present application does not limit the setting of the third neural network. Exemplarily, as shown in FIG. 5, after processing by the third neural network, a 1*256-dimensional feature vector can be obtained, and the feature vector can be used as the fifth feature.
  • Step 4 Based on the third feature, the fourth feature, and the fifth feature, obtain the first feature of any source language training data.
  • the first feature of any source language training data can be obtained.
  • the way to obtain the first feature of any source language training data is: combine the third feature, fourth feature of any source language training data The feature and the fifth feature are spliced together to obtain the first feature.
  • Step 3022B Based on any source language training data and the third translation model, obtain a second feature of any source language training data.
  • the second feature is used to indicate the feature of any source language training data obtained on the basis of comprehensively considering the translation result of the third translation model.
  • the third translation model is any model that can translate the source language training data, which is not limited in the embodiment of the present application.
  • the process of obtaining the second feature of any source language training data includes the following steps a to d:
  • Step a Obtain the translation data of any source language training data based on the third translation model, and obtain the sixth feature of any source language training data based on the word embedding feature of the translation data.
  • The process of obtaining the translation data of any source language training data is: input that source language training data into the third translation model, and use the translation data output by the third translation model as the translation data of that source language training data.
  • the word embedding feature can be represented by a vector, and the dimension of the vector is set according to experience, for example, the dimension of the vector is set to 512 dimensions.
  • The way to obtain the sixth feature of any source language training data is: input the word embedding features of the translation data into the fourth neural network, and use the features obtained through the processing of the convolutional layer and the fully connected layer in the fourth neural network as the sixth feature of that source language training data.
  • the embodiment of the present application does not limit the settings of the convolutional layer and the fully connected layer in the fourth neural network.
  • a ReLU processing module is also included in the convolutional layer.
  • the fourth neural network is a CNN network
  • the filter size of the convolution layer is set to 3, 4, and 5, and the number of convolution kernels is set to 128 .
  • the fourth neural network is the same as the first neural network.
  • Step b Based on the third translation model, obtain the target translation sub-data corresponding to each sub-data in any source language training data, and obtain the seventh feature of that source language training data based on the word embedding features of the target translation sub-data corresponding to each sub-data. The target translation sub-data corresponding to any sub-data is the candidate translation sub-data whose translation probability is the largest among the translation probabilities of the candidate translation sub-data corresponding to that sub-data.
  • Inputting any source language training data into the third translation model can obtain the candidate translation sub-data and the translation probability of the candidate translation sub-data corresponding to each sub-data in any source language training data output by the third translation model.
  • The number of candidate translation sub-data corresponding to any sub-data is set based on experience. For example, if the number of candidate translation sub-data is set to 10, the third translation model outputs, for each sub-data, the 10 candidate translation sub-data with the largest translation probabilities and the translation probabilities of those 10 candidate translation sub-data.
  • the target translation sub-data corresponding to each sub-data in any source language training data can be determined.
  • the target translation sub-data corresponding to any sub-data is the candidate translation sub-data with the largest translation probability among the candidate translation sub-data corresponding to the any sub-data.
  • the word embedding feature of the target translation sub-data is used to obtain the seventh feature of any source language training data.
  • Exemplarily, the way to obtain the seventh feature of any source language training data is: input the word embedding features of the target translation sub-data corresponding to each sub-data, padded to the same length, into the fifth neural network, and use the feature obtained through the processing of the convolutional layer and the fully connected layer in the fifth neural network as the seventh feature of that source language training data.
  • the embodiment of the present application does not limit the settings of the convolutional layer and the fully connected layer in the fifth neural network.
  • a ReLU processing module is also included in the convolutional layer.
  • the fifth neural network is a CNN (Convolutional Neural Networks, convolutional neural network) network
  • the convolution kernel size (filter size) of the convolutional layer is set to 5, and the number of convolution kernels (filter number) is set to 64.
  • After the fully connected layer is processed, a 64*256-dimensional feature vector is obtained, and the feature vector is taken as the seventh feature.
  • the fifth neural network is the same as the first neural network or the fourth neural network.
  • Step c Obtain the translation probability of the target translation sub-data corresponding to each sub-data, and obtain the eighth feature of any source language training data based on the translation probability of the target translation sub-data corresponding to each sub-data and the length of the translation data.
  • the translation probability of the target translation sub-data corresponding to each sub-data can also be obtained.
  • Exemplarily, the process of obtaining the eighth feature of any source language training data is: the translation probabilities of the target translation sub-data corresponding to each sub-data are added to obtain the total probability, and the eighth feature of that source language training data is obtained based on the ratio of the total probability to the length of the translation data.
  • the eighth feature is used to indicate the confidence score (Confidence Score) of any source language training data.
  • the way to obtain the eighth feature of any source language training data based on the ratio of the total probability to the length of the translation data is: input the ratio of the total probability to the length of the translation data into the sixth neural network, and The feature obtained through the processing of the sixth neural network is used as the eighth feature of any source language training data.
  • the embodiment of the present application does not limit the setting of the sixth neural network. Exemplarily, as shown in FIG. 5, after processing by the sixth neural network, a 1*256-dimensional feature vector can be obtained, and the feature vector can be used as the eighth feature.
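  • A worked sketch of the scalar fed to the sixth neural network, assuming the length of the translation data is the number of target tokens (the text does not pin down the length definition, so this is one reading):

```python
def confidence_score(target_token_probs):
    # Eighth-feature input: sum of the translation probabilities of the
    # target translation sub-data divided by the length of the translation.
    if not target_token_probs:
        return 0.0
    return sum(target_token_probs) / len(target_token_probs)

# Example: per-token probabilities of the highest-probability candidates.
print(confidence_score([0.91, 0.72, 0.88]))  # ~0.8367
```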
  • Step d Obtain the second feature of any source language training data based on the sixth feature, the seventh feature, and the eighth feature.
  • the second feature of any source language training data can be obtained.
  • the way to obtain the second feature of any source language training data is: combine the sixth feature, the seventh feature of any source language training data The feature and the eighth feature are spliced together to obtain the second feature.
  • step 3022A is performed first, and then step 3022B; or, step 3022B is performed first, and then step 3022A; or, step 3022A and step 3022B are performed simultaneously.
  • Step 3022C Based on the first feature and the second feature, obtain the target feature of any source language training data.
  • the target feature of any source language training data is acquired.
  • the method of obtaining the target feature of any source language training data is: splicing the first feature and the second feature, and using the spliced feature as any The target feature of the source language training data. It should be noted that the embodiments of the present application do not limit the splicing sequence of the first feature and the second feature.
  • the target feature of any source language training data can be obtained based on the third feature, fourth feature, fifth feature, sixth feature, seventh feature, and eighth feature of any source language training data.
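  • A minimal sketch of how the six feature vectors could be spliced into the state s_i, following the concatenations described in step 4, step d, and step 3022C; the dimensions assume the 1*256 configuration illustrated in FIG. 5, and the splicing order is not constrained by the text.

```python
import torch

def target_feature(f3, f4, f5, f6, f7, f8):
    # First feature = concat(f3, f4, f5); second feature = concat(f6, f7, f8);
    # target feature (the state s_i) = concat(first, second).
    # Each f* is assumed to be a (batch, 256) tensor.
    first = torch.cat([f3, f4, f5], dim=-1)
    second = torch.cat([f6, f7, f8], dim=-1)
    return torch.cat([first, second], dim=-1)  # (batch, 1536)
```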
  • the target features of each source language training data in the first target training data set can be obtained. Then the first data screening model is called to screen the target features of each source language training data in the first target training data set.
  • After the target features of each source language training data in the first target training data set are input into the first data screening model, the first data screening model processes the target features of each source language training data and then outputs the screening results of each source language training data based on the classifier. For example, the process of obtaining the screening results of each source language training data in the first target training data set is shown in FIG. 5.
  • the embodiment of the present application does not limit the manner in which the first data screening model processes the target feature.
  • Exemplarily, the first data screening model processes the target features to obtain the probabilities that the source language training data corresponds to different screening results, and then, through the classifier, outputs the screening result with the higher probability as the screening result of the source language training data.
  • Exemplarily, the screening result is obtained based on the following formula: a_i = argmax_a Q_θ(s_i, a), where a_i is the screening result, s_i is the target feature of the source language training data, and Q_θ(s_i, a) is the objective function corresponding to the first data screening model.
  • the screening results include two types, namely the first result and the second result.
  • the first result is used to indicate that the reliability of the source language training data is high
  • the second result is used to indicate that the reliability of the source language training data is low.
  • the screening result is represented by a numerical value, and the corresponding relationship between the screening result and the numerical value is preset according to experience, for example, the numerical value corresponding to the first result is 1, and the numerical value corresponding to the second result is 0.
  • Step 3023 For any source language training data in the first target training data set, a weight value of any source language training data is determined based on the screening result of any source language training data.
  • the source language training data of different screening results corresponds to different weight values.
  • Exemplarily, the process of determining the weight value of any source language training data is: in response to the screening result of that source language training data being the first result, the first weight value is used as the weight value of that source language training data; in response to the screening result of that source language training data being the second result, the second weight value is used as the weight value of that source language training data.
  • the second weight value is a preset weight value corresponding to the source language training data whose screening result is the second result.
  • the embodiment of the present application does not limit the manner of setting the second weight value, for example, the second weight value is set to 0.
  • the first weight value needs to be obtained first.
  • the process of obtaining the first weight value includes the following steps A to D:
  • Step A Obtain labeled language training data corresponding to each target source language training data in the first target training data set, and the screening results of each target source language training data are the first results.
  • Each source language training data in the first target training data set whose screening result is the first result is used as target source language training data, and then the labeled language training data corresponding to each target source language training data is obtained.
  • the labeling language training data corresponding to each source language training data in the first training data set is obtained in advance and stored.
  • the labeled language training data corresponding to each target source language training data is obtained from the storage to save training time.
  • Based on step A, the labeled language training data corresponding to each target source language training data whose screening result is the first result in the first target training data set can be obtained, and then step B is executed.
  • Step B Add each target source language training data and the labeled language training data corresponding to each target source language training data as training data to the second training data set.
  • the initial value of the second training data set is an empty set, and the second training data set is used to store bilingual training data.
  • Any bilingual training data is composed of a source language training data and annotated language data corresponding to the source language training data.
  • After that, each target source language training data and the labeled language training data corresponding to each target source language training data can be added as training data to the second training data set.
  • Exemplarily, record any target source language training data as x_i, record the labeled language training data corresponding to x_i as y_i, and record the second training data set as D_l; then (x_i, y_i) is added to D_l.
  • Based on step B, all the target source language training data whose screening result is the first result in the first target training data set, together with the labeled language training data corresponding to all the target source language training data, are added to the second training data set. Based on the second training data set obtained in this way, the accuracy of the obtained first weight value can be improved.
  • Step C Train the first translation model based on the second training data set to obtain the second translation model.
  • the first translation model is a translation model pre-trained using known bilingual training data.
  • the embodiment of the application does not limit the specific form of the first translation model.
  • the first translation model is an NMT (Neural Machine Translation, Neural Machine Translation) model.
  • After the foregoing steps, the updated second training data set is obtained. Since the data in the second training data set are all bilingual training data, the first translation model can be trained based on the second training data set; the embodiment of the present application does not limit the way of training the first translation model. The trained translation model is used as the second translation model.
  • Step D Obtain the first weight value based on the second translation model and the first translation model.
  • the first weight value is used to indicate the performance difference between the second translation model and the first translation model.
  • Exemplarily, the process of obtaining the first weight value is: use the verification data set (held-out data set) to verify the first translation model and the second translation model respectively, obtain the model performance of the first translation model and the model performance of the second translation model, and obtain the first weight value based on the two model performances.
  • Exemplarily, the first weight value is obtained based on the following formula 1: R(s_{i-1}, a) = Acc(θ_i) - Acc(θ_{i-1}).
  • Acc(θ_i) represents the model performance of the second translation model, and Acc(θ_{i-1}) represents the model performance of the first translation model.
  • R(s_{i-1}, a) represents the first weight value (Reward). The value of the first weight value may be positive or negative, indicating that the influence of the bilingual training sample (x_i, y_i) added to the second training data set D_l on the model performance may be a positive influence or a negative influence.
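  • A sketch of formula 1 as a procedure, with train_fn and evaluate_fn standing in for the unspecified training and verification routines; all function and argument names here are illustrative placeholders.

```python
def first_weight_value(train_fn, evaluate_fn, prev_model, d_l, held_out):
    # Formula 1: R(s_{i-1}, a) = Acc(theta_i) - Acc(theta_{i-1}).
    new_model = train_fn(prev_model, d_l)         # second translation model
    acc_new = evaluate_fn(new_model, held_out)    # Acc(theta_i)
    acc_prev = evaluate_fn(prev_model, held_out)  # Acc(theta_{i-1})
    return acc_new - acc_prev, new_model          # reward may be +/-
```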
  • After the first weight value is obtained, it can be used as the weight value of each source language training data in the first target training data set whose screening result is the first result.
  • Step 3024 Based on the target feature of any source language training data, the screening result of that source language training data, the weight value of that source language training data, and the target feature of the reference source language training data, generate the candidate data corresponding to that source language training data.
  • the reference source language training data is the source language data corresponding to any source language training data in the second target training data set.
  • the second target training data set is the next target training data set of the first target training data set in the at least one target training data set.
  • the candidate data is data used to update the parameters of the first data screening model.
  • Exemplarily, the method of generating the candidate data corresponding to any source language training data is as follows.
  • In response to the screening result of that source language training data being the first result, generate the first candidate data corresponding to that source language training data based on its target feature, the first result, the first weight value, and the target feature of the reference source language data.
  • In response to the screening result of that source language training data being the second result, generate the second candidate data corresponding to that source language training data based on its target feature, the second result, the second weight value, and the target feature of the reference source language data.
  • each source language training data corresponds to one candidate data
  • the candidate data is the first candidate data or the second candidate data.
  • Exemplarily, the target feature of any source language training data is denoted as s_i, and the target feature of the reference source language training data is denoted as s_{i+1}; the candidate data corresponding to that source language training data is denoted as (s_i, a_i, r_i, s_{i+1}).
  • a_i and r_i are determined according to the screening result of that source language training data.
  • When a_i represents the first result, r_i represents the first weight value, and (s_i, a_i, r_i, s_{i+1}) represents the first candidate data; when a_i represents the second result, r_i represents the second weight value, and (s_i, a_i, r_i, s_{i+1}) represents the second candidate data.
  • steps 3023 and 3024 introduce the process of generating candidate data corresponding to any source language training data from the perspective of any source language training data in the first target training data set.
  • candidate data corresponding to each source language training data in the first target training data set can be generated.
  • step 3025 is executed.
  • Step 3025 Based on the candidate data corresponding to each source language training data in the first target training data set, select a target number of candidate data, and update the parameters of the first data screening model based on the target number of candidate data to obtain the updated first data screening model.
  • After the candidate data corresponding to each source language training data are generated, a target number of candidate data is selected, and the parameters of the first data screening model are updated based on the target number of candidate data.
  • the target number is set according to experience, or freely adjusted according to the number of all candidate data, which is not limited in the embodiment of the present application.
  • In one method, the target number of candidate data are randomly selected from the candidate data respectively corresponding to each source language training data.
  • In another method, the way to select the target number of candidate data is: add the first candidate data among the candidate data corresponding to each source language training data in the first target training data set to the first candidate data set, and add the second candidate data among the candidate data corresponding to each source language training data in the first target training data set to the second candidate data set; then select from the first candidate data set and the second candidate data set in equal proportions to obtain the target number of candidate data.
  • the candidate data selected based on this selection method is more representative, which is beneficial to improve the stability of the training process of the data screening model.
  • the first candidate data set is used to continuously collect the newly generated first candidate data during the process of training to obtain the target data screening model
  • the second candidate data set is used to continuously collect new data during the process of training the target training data screening model.
  • the second candidate data generated.
  • the initial values of the first candidate data set and the second candidate data set are both empty sets.
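  • A hedged sketch of this equal-proportion selection: two pools accumulate the first and second candidate data respectively, and each update draws the target number of transitions half-and-half, which keeps both screening outcomes represented in every minibatch. Pool and function names are assumptions:

    import random

    first_pool, second_pool = [], []     # initial values: empty sets

    def collect(candidate, is_first):
        (first_pool if is_first else second_pool).append(candidate)

    def sample_equal(target_number):
        """Draw candidate data in equal proportion from both pools."""
        half = target_number // 2
        batch = random.sample(first_pool, min(half, len(first_pool)))
        batch += random.sample(second_pool,
                               min(target_number - len(batch), len(second_pool)))
        random.shuffle(batch)
        return batch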
  • Based on the target number of candidate data, the parameters of the first data screening model are updated; the process of obtaining the updated first data screening model includes the following steps I to III:
  • Step I: Based on the target number of candidate data, update the objective function corresponding to the first data screening model. In one possible implementation, the objective function takes the form Q_π(s, a), and it is updated according to the Bellman equation (formula 2).
  • Step II: According to the updated objective function, calculate the loss function corresponding to the first data screening model. In one possible implementation, the loss function is calculated based on the following formula 3:
  • L(θ) = E_{s,a,r,s′}[(y_i(r, s′) − Q(s, a; θ))²]
  • where L(θ) denotes the loss function, and y_i(r, s′) = r + γ·max_{a′} Q(s′, a′; θ_{i−1}) is the objective function value obtained with the current parameters θ_{i−1} of the first data screening model.
  • Step III: Based on the loss function, update the parameters of the first data screening model to obtain the updated first data screening model. The parameters are updated with the goal of minimizing the loss function; in one possible implementation, the SGD (Stochastic Gradient Descent) algorithm is used to minimize L(θ).
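  • To illustrate steps I to III, the sketch below computes the Bellman target y = r + γ·max_{a′} Q(s′, a′; θ_{i−1}), the squared loss of formula 3, and one SGD step. It assumes, for brevity, a linear Q-function Q(s, a; θ) = θ[a]·s; the deep Q-network of the embodiments would replace the linear form but follow the same update:

    import numpy as np

    GAMMA, LR = 0.99, 1e-3   # assumed discount factor and learning rate

    def q_values(theta, s):
        """Linear stand-in for Q(s, a; theta): one weight row per action."""
        return theta @ s                  # shape: (num_actions,)

    def sgd_step(theta, theta_old, batch):
        """One update of formula 3 on a batch of (s, a, r, s') tuples."""
        grad = np.zeros_like(theta)
        for s, a, r, s_next in batch:
            y = r + GAMMA * q_values(theta_old, s_next).max()  # Bellman target
            td_error = q_values(theta, s)[a] - y               # Q(s,a;θ) − y
            grad[a] += 2.0 * td_error * s   # gradient of (Q − y)^2 w.r.t. θ[a]
        return theta - LR * grad / len(batch)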
  • In summary, the process of obtaining the updated first data screening model is shown in FIG. 6. Based on each source language training data x_i in a target training data set of the first training data set D_u and the neural network, the target feature s_i is obtained and input into the first data screening model, which determines the screening result by a_i = argmax Q_π(s_i, a). When the screening result is 0, 0 is used as the weight value r_i; when the screening result is 1, the annotation language data y_i is obtained, (x_i, y_i) is added to the second training data set D_l, and the first translation model is trained on D_l to obtain the second translation model. The model performance of the first and second translation models is then computed on the held-out verification data set, and the difference in model performance is used as the weight value of the source language training data whose screening result is 1.
  • Candidate data (s_i, a_i, r_i, s_{i+1}) are generated; a target number of candidate data are selected, and the SGD algorithm is used to minimize the loss function L(θ), yielding the updated first data screening model.
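  • The weight value assigned to screened-in data corresponds to formula 1, R(s_{i−1}, a) = Acc(Φ_i) − Acc(Φ_{i−1}): the change in held-out performance after retraining the translation model. A minimal sketch, assuming an evaluate helper that returns a scalar score such as BLEU:

    def first_weight_value(model_before, model_after, held_out, evaluate):
        """Held-out performance gain used as the reward (formula 1).
        The sign can be positive or negative, since adding (x_i, y_i)
        to D_l may help or hurt the translation model."""
        return evaluate(model_after, held_out) - evaluate(model_before, held_out)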
  • Step 3026: Train the updated first data screening model based on the second target training data set, and so on, until the second training termination condition is met, obtaining the second data screening model.
  • The process of training the updated first data screening model based on the second target training data set is: perform steps 3022 to 3025 based on the second target training data set and the updated first data screening model to obtain a further-updated first data screening model; and so on, until the second training termination condition is met.
  • Each time the first data screening model is updated, it is checked once whether the second training termination condition is satisfied. If it is not met, steps 3022 to 3025 are executed based on the next target training data set and the current latest first data screening model to continue updating the model; if it is met, the iterative training stops, and the updated first data screening model obtained at this point is used as the second data screening model.
  • Satisfying the second training termination condition includes, but is not limited to, the following two cases:
  • Case 1: No qualifying target training data set remains in the first training data set, a qualifying target training data set being one whose source language training data's target features have not yet been screened. In this case, all source language training data in the first training data set have participated in the training that yields the second data screening model, so the condition is considered satisfied.
  • Case 2: The number of source language training data whose screening result is the first result reaches the number threshold. Exemplarily, the number threshold is set according to the training cost (budget); when this number reaches the threshold, a sufficient amount of source language training data has been screened out, and the second training termination condition is considered satisfied.
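  • Both cases reduce to a simple check, sketched below with assumed names (unscreened_batches, kept_count):

    def second_termination_met(unscreened_batches, kept_count, number_threshold):
        """Case 1: no unscreened target training data set remains.
        Case 2: enough data screened in (budget-derived threshold)."""
        return unscreened_batches == 0 or kept_count >= number_threshold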
  • In step 303, in response to the first training termination condition not being satisfied, the first training data set is reinitialized, and based on the reinitialized first training data set, the reinforcement learning algorithm is used to train the second data screening model to obtain a third data screening model; and so on, until the first training termination condition is met and the target data screening model is obtained.
  • That is, after the second data screening model is obtained in step 3026, the target data screening model is further obtained from it: in response to the first training termination condition being satisfied, the second data screening model is used as the target data screening model; in response to it not being satisfied, steps 301 and 302 are executed again to obtain the third data screening model corresponding to the reinitialized first training data set, and this process repeats, with the data screening model obtained when the first training termination condition is finally satisfied used as the target data screening model.
  • Each time a data screening model is obtained, it is checked once whether the first training termination condition is satisfied. If not, steps 301 and 302 continue to be executed to obtain further data screening models; if so, the iterative training stops, and the data screening model obtained at this point is used as the target data screening model. In one possible implementation, the first training termination condition is that the number of times the first training data set has been initialized reaches a count threshold.
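  • Putting the two termination conditions together, the outer procedure of steps 301 to 303 can be sketched as follows; train_one_episode stands in for steps 3021 to 3026 and is an assumed name:

    import random

    def train_target_screening_model(model, data_pool, episodes, train_one_episode):
        """Reinitialize (shuffle) the first training data set and run one
        inner training pass per episode; the episode count plays the role
        of the initialization-count threshold here."""
        for _ in range(episodes):
            random.shuffle(data_pool)                    # reinitialize D_u
            model = train_one_episode(model, data_pool)  # steps 3021-3026
        return model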
  • In one possible implementation, the process of obtaining the target data screening model is regarded as the process of obtaining the policy π; the algorithm flow for obtaining the policy π is given in the pseudocode listing later in the description.
  • In practical application scenarios, the data screening model can be applied to an active learning process. Active learning is a simple technique for labeling data: some instances are first selected from the unlabeled data set, these instances are labeled manually, and the procedure is repeated many times until a termination condition is met.
  • As shown in FIG. 7, the data screening model is updated based on the existing labeled training set L; based on the data screening model, part of the data to be labeled is selected from the unlabeled data pool U and labeled manually by professionals; the labeled data are then added to the labeled training set L, and this process repeats until the termination condition is met. For example, the termination condition is that the number of data in the labeled training set L reaches a threshold.
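  • A compact sketch of the active-learning loop of FIG. 7, with assumed helper names (update_model, select, annotate):

    def active_learning(labeled_L, unlabeled_U, size_threshold,
                        update_model, select, annotate):
        """Retrain the screen on L, pick data from U, label it, grow L."""
        model = update_model(labeled_L)
        while len(labeled_L) < size_threshold:      # termination condition
            picked = select(model, unlabeled_U)     # screen unlabeled pool U
            labeled_L += [(x, annotate(x)) for x in picked]
            unlabeled_U = [x for x in unlabeled_U if x not in picked]
            model = update_model(labeled_L)
        return labeled_L, model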
  • In the embodiments of this application, the target data screening model is trained by a reinforcement learning algorithm, so the screening rules in the target data screening model are learned automatically by the machine during reinforcement learning. The target data screening model therefore adapts to a wide range of scenarios, the quality of the source language data screened by it is higher, and this in turn helps improve the translation performance of the machine translation model obtained from the screened source language data and its corresponding annotation language data.
  • an embodiment of the present application provides a data processing device, which includes:
  • the first acquisition module 801 is configured to acquire a data set to be filtered, and the data set to be filtered includes a plurality of source language data to be filtered;
  • the screening module 802 is configured to screen, based on the target data screening model, each source language data in the data set to be screened to obtain screened target source language data, the target data screening model being obtained by training with a reinforcement learning algorithm;
  • the second acquiring module 803 is configured to acquire the annotation language data corresponding to the target source language data;
  • the third acquisition module 804 is configured to acquire a machine translation model based on the target source language data and the annotation language data.
  • the device further includes:
  • the initialization module 805 is configured to initialize a first training data set, where the first training data set includes multiple source language training data;
  • the first training module 806 is configured to use a reinforcement learning algorithm to train the first data screening model based on the initialized first training data set to obtain the second data screening model;
  • the second training module 807 is configured to re-initialize the first training data set in response to the first training termination condition not being satisfied, and use the reinforcement learning algorithm to train the second data screening model based on the re-initialized first training data set to obtain the third data screening model; and so on, until the first training termination condition is met and the target data screening model is obtained.
  • the first training module 806 includes:
  • the dividing unit 8061 is configured to divide the initialized first training data set into at least one target training data set;
  • the processing unit 8062 is configured to call the first data screening model to screen the target features of each source language training data in the first target training data set to obtain the screening results of each source language training data in the first target training data set, the first target training data set being the first target training data set in the at least one target training data set;
  • the determining unit 8063 is configured to determine the weight value of any source language training data based on the screening result of any source language training data for any source language training data in the first target training data set;
  • the generating unit 8064 is configured to generate, based on the target feature of any source language training data, the screening result of that source language training data, the weight value of that source language training data, and the target feature of the reference source language training data, candidate data corresponding to that source language training data; the reference source language training data is the source language data corresponding to that source language training data in the second target training data set, and the second target training data set is the target training data set following the first target training data set in the at least one target training data set;
  • the selecting unit 8065 is configured to select a target quantity of candidate data based on the candidate data corresponding to each source language training data in the first target training data set;
  • the updating unit 8066 is configured to update the parameters of the first data screening model based on the target number of candidate data to obtain the updated first data screening model;
  • the training unit 8067 is configured to train the updated first data screening model based on the second target training data set, and so on, until the second training termination condition is met, and the second data screening model is obtained.
  • In one possible implementation, the determining unit 8063 is configured to, in response to the screening result of any source language training data being the first result, use the first weight value as the weight value of that source language training data; and in response to the screening result of any source language training data being the second result, use the second weight value as the weight value of that source language training data.
  • the first training module 806 further includes:
  • the obtaining unit 8068 is configured to obtain labeled language training data corresponding to each target source language training data in the first target training data set, and the screening result of each target source language training data is the first result;
  • the first training module 806 further includes:
  • the adding unit 8069 is configured to add each target source language training data and the labeled language training data corresponding to each target source language training data as training data to the second training data set;
  • the training unit 8067 is also used to train the first translation model based on the second training data set to obtain the second translation model;
  • the obtaining unit 8068 is further configured to obtain the first weight value based on the second translation model and the first translation model.
  • In one possible implementation, the obtaining unit 8068 is further configured to, for any source language training data in the first target training data set, obtain the first feature of that source language training data based on each sub-data in it; obtain the second feature based on that source language training data and the third translation model; and obtain the target feature based on the first feature and the second feature.
  • In one possible implementation, the obtaining unit 8068 is further configured to obtain the third feature of any source language training data based on the word-embedding feature of each sub-data in it; obtain the fourth feature based on the comparison result of each sub-data with the existing corpus database; determine the length of the source language training data based on the number of its sub-data and obtain the fifth feature based on that length; and obtain the first feature based on the third, fourth, and fifth features.
  • In one possible implementation, the obtaining unit 8068 is further configured to: obtain the translation data of any source language training data based on the third translation model, and obtain the sixth feature based on the word-embedding feature of the translation data; obtain, based on the third translation model, the target translation sub-data corresponding to each sub-data in the source language training data, and obtain the seventh feature based on the word-embedding features of those target translation sub-data, the translation probability of the target translation sub-data corresponding to any sub-data being the largest among the translation probabilities of the candidate translation sub-data for that sub-data; obtain the translation probabilities of the target translation sub-data corresponding to each sub-data, and obtain the eighth feature based on those translation probabilities and the length of the translation data; and obtain the second feature based on the sixth, seventh, and eighth features.
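  • The first and second features described above are concatenations of simpler statistics. Below is a hedged sketch of the first-feature path (third, fourth, and fifth features), using a mean over word embeddings as a simplified stand-in for the convolutional encoders of FIG. 5; embed, ngram_hit_rate, and length_encoder are assumed helpers:

    import numpy as np

    def first_feature(tokens, embed, ngram_hit_rate, length_encoder):
        """Concatenate the third, fourth and fifth features of a sentence.

        embed(token)          -- word-embedding lookup (e.g. 512-d)
        ngram_hit_rate(t, n)  -- probability of the sentence's n-grams
                                 appearing in the existing corpus database
        length_encoder(k)     -- maps sentence length to a small 1-D vector
        """
        third = np.mean([embed(t) for t in tokens], axis=0)
        fourth = np.asarray([ngram_hit_rate(tokens, n) for n in (2, 3, 4)])
        fifth = length_encoder(len(tokens))
        return np.concatenate([third, fourth, fifth])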
  • In one possible implementation, the generating unit 8064 is configured to, in response to the screening result of any source language training data being the first result, generate first candidate data corresponding to that source language training data based on its target feature, the first result, the first weight value, and the target feature of the reference source language training data; and in response to the screening result being the second result, generate second candidate data corresponding to that source language training data based on its target feature, the second result, the second weight value, and the target feature of the reference source language training data.
  • In one possible implementation, the adding unit 8069 is further configured to add the first candidate data among the candidate data corresponding to each source language training data in the first target training data set to the first candidate data set, and add the second candidate data among those candidate data to the second candidate data set;
  • the selecting unit 8065 is further configured to perform equal ratio selection in the first candidate data set and the second candidate data set to obtain the target number of candidate data.
  • In one possible implementation, the updating unit 8066 is configured to update, based on the target number of candidate data, the objective function corresponding to the first data screening model; calculate, according to the updated objective function, the loss function corresponding to the first data screening model; and update, based on the loss function, the parameters of the first data screening model to obtain the updated first data screening model.
  • In one possible implementation, satisfying the second training termination condition includes:
  • no qualifying target training data set remains in the first training data set, a qualifying set being one whose source language training data's target features have not been screened; or, the number of source language training data whose screening result is the first result reaches the number threshold.
  • In the embodiments of this application, each source language data in the data set to be screened is screened based on the target data screening model trained with a reinforcement learning algorithm, and a machine translation model is then obtained based on the screened target source language data and the annotation language data corresponding to it. Because the screening rules in the target data screening model are learned automatically by the machine during reinforcement learning, the target data screening model adapts to a wide range of scenarios and the quality of the screened source language data is high, so the translation performance of the machine translation model obtained from the screened source language data and its corresponding annotation language data is better.
  • FIG. 11 is a schematic structural diagram of a data processing device provided by an embodiment of this application. Exemplarily, the data processing device is a server, and servers may differ considerably in configuration or performance. The server includes one or more processors (Central Processing Units, CPU) 1101 and one or more memories 1102, where at least one piece of program code is stored in the one or more memories 1102 and is loaded and executed by the one or more processors 1101 to implement the data processing method provided by the foregoing method embodiments. Of course, the server can also have components such as a wired or wireless network interface, a keyboard, and an input/output interface, and can include other components for implementing device functions, which are not repeated here.
  • In an exemplary embodiment, a computer device is further provided, including a processor and a memory storing at least one piece of program code, the at least one piece of program code being loaded and executed by one or more processors to implement any of the foregoing data processing methods.
  • In an exemplary embodiment, a non-transitory computer-readable storage medium is further provided, storing at least one piece of program code, the at least one piece of program code being loaded and executed by a processor of a computer device to implement any of the foregoing data processing methods.
  • Optionally, the aforementioned non-transitory computer-readable storage medium is a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
  • In an exemplary embodiment, a computer program product is further provided, storing at least one section of a computer program, the at least one section of computer program being loaded and executed by a processor of a computer device to implement any of the foregoing data processing methods.

Abstract

Data processing method, device, and storage medium, belonging to the field of computer technology. The method includes: acquiring a data set to be screened, the data set including multiple source language data to be screened (201); screening each source language data in the data set based on a target data screening model to obtain screened target source language data, the target data screening model being trained with a reinforcement learning algorithm (202); and acquiring annotation language data corresponding to the target source language data, and acquiring a machine translation model based on the target source language data and the annotation language data (203). In this data processing process, the screening rules in the target data screening model are learned automatically by the machine during reinforcement learning; the target data screening model adapts to a wide range of scenarios, and the screened source language data is of higher quality, so the acquired machine translation model has better translation performance.

Description

数据处理方法、设备及存储介质
本申请要求于2019年11月21日提交的申请号为201911149101.4、发明名称为“数据处理方法、装置、设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及计算机技术领域,特别涉及一种数据处理方法、设备及存储介质。
背景技术
在机器翻译领域,要训练一个精确的机器翻译模型,需要足够数量的双语训练数据。双语训练数据由源语言数据和与源语言数据对应的标注语言数据组成。通常,获取双语训练数据中的标注语言数据的成本较高,因此,为了在固定成本约束下获取高质量的双语训练数据,需要先对大量的源语言数据进行筛选,然后再获取与筛选后的源语言数据对应的标注语言数据。
相关技术中,基于词频或者基于模型置信度对源语言数据进行筛选,这些筛选规则的适应场景较局限,筛选后的源语言数据的质量不佳,使得基于筛选后的源语言数据和与筛选后的源语言数据对应的标注语言数据获取的机器翻译模型的翻译性能较差。
发明内容
本申请实施例提供了一种数据处理方法、设备及存储介质,可用于提高筛选后的源语言数据的质量。所述技术方案如下:
一方面,本申请实施例提供了一种数据处理方法,所述方法包括:
获取待筛选数据集,所述待筛选数据集包括多个待筛选的源语言数据;
基于目标数据筛选模型,对所述待筛选数据集中的各个源语言数据进行筛选,得到筛选后的目标源语言数据,所述目标数据筛选模型利用强化学习算法训练得到;
获取与所述目标源语言数据对应的标注语言数据,基于所述目标源语言数据和所述标注语言数据获取机器翻译模型。
另一方面,提供了一种数据处理装置,所述装置包括:
第一获取模块,用于获取待筛选数据集,所述待筛选数据集包括多个待筛选的源语言数据;
筛选模块,用于基于目标数据筛选模型,对所述待筛选数据集中的各个源语言数据进行筛选,得到筛选后的目标源语言数据,所述目标数据筛选模型利用强化学习算法训练得到;
第二获取模块,用于取与所述目标源语言数据对应的标注语言数据;
第三获取模块,用于基于所述目标源语言数据和所述标注语言数据获取机器翻译模型。
另一方面,提供了一种计算机设备,所述计算机设备包括处理器和存储器,所述存储器中存储有至少一条程序代码,所述至少一条程序代码由所述处理器加载并执行,以实现上述任一所述的数据处理方法。
另一方面,还提供了一种非临时性计算机可读存储介质,所述非临时性计算机可读存储介质中存储有至少一条程序代码,所述至少一条程序代码由处理器加载并执行,以实现上述任一所述的数据处理方法。
另一方面,还提供了一种计算机程序产品,所述计算机程序产品中存储有至少一段计算机程序,所述至少一段计算机程序由处理器加载并执行,以实现上述任一所述的数据处理方法。
本申请实施例提供的技术方案至少带来如下有益效果:
基于利用强化学习算法训练得到的目标数据筛选模型,对待筛选数据集中的各个源语言数据进行筛选,进而基于筛选后的目标源语言数据和与目标源语言数据对应的标注语言数据获取机器翻译模型。在此种数据处理的过程中,目标数据筛选模型中的筛选规则为机器在强化学习的过程中自动学习出来的,目标数据筛选模型的适应场景广泛,筛选后的源语言数据的质量较高,使得基于筛选后的源语言数据和与筛选后的源语言数据对应的标注语言数据获取的机器翻译模型的翻译性能较好。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的 附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还能够根据这些附图获得其他的附图。
图1是本申请实施例提供的一种数据处理方法的实施环境的示意图;
图2是本申请实施例提供的一种数据处理方法的流程图;
图3是本申请实施例提供的一种数据处理方法的流程图;
图4是本申请实施例提供的一种获取第二数据筛选模型的方法的流程图;
图5是本申请实施例提供的一种获取第一目标训练数据集中的任一源语言训练数据的筛选结果的过程示意图;
图6是本申请实施例提供的一种获取更新后的第一数据筛选模型的过程示意图;
图7是本申请实施例提供的一种主动学习过程的示意图;
图8是本申请实施例提供的一种数据处理装置的示意图;
图9是本申请实施例提供的一种数据处理装置的示意图;
图10是本申请实施例提供的一种第一训练模块的结构示意图;
图11是本申请实施例提供的一种数据处理设备的结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
自然语言处理(Nature Language Processing,NLP)是计算机科学领域与人工智能领域中的一个重要方向。自然语言处理研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。自然语言处理是一门融语言学、计算机科学、数学于一体的科学。因此,这一领域的研究将涉及自然语言,即人们日常使用的语言,所以自然语言处理与语言学的研究有着密切的联系。自然语言处理技术通常包括文本处理、语义理解、机器翻译、机器人问答、知识图谱等技术。机器翻译是指使用机器将一种自然语言(需翻译的自然语言一般称为源语言)翻译为另一种自然语言(翻译后的自然语言称为目标语言),实现不同语种的自然语言的转换过程。
目前,机器翻译一般通过机器翻译模型实现,如,基于神经网络的NMT(Neural Machine Translation,神经网络机器翻译)模型等。要训练一个精确的机器翻译模型,需要足够数量的双语训练数据。双语训练数据由源语言数据和与源语言数据对应的标注语言数据组成。在获取双语训练数据过程中,常常需要专业翻译人员将源语言数据人工翻译成标注语言数 据,由于人工翻译的成本昂贵,所以获取双语训练数据的成本较高。因此,为了在固定成本约束下获取高质量的双语训练数据,需要先对大量的源语言数据进行筛选,然后再获取与筛选后的源语言数据对应的标注语言数据,进而提高基于筛选后的源语言数据和与筛选后的源语言数据对应的标注语言数据获取的机器翻译模型的翻译性能。
对此,本申请实施例提供了一种数据处理方法,请参考图1,图1示出了本申请实施例提供的数据处理方法的实施环境的示意图。该实施环境包括:终端11和服务器12。
终端11能够从网络上获取待筛选的源语言数据,将待筛选的源语言数据发送至服务器12,终端11也能够接收服务器12返回的筛选后的源语言数据,展示筛选后的源语言数据,以由专业翻译人员将该筛选后的源语言数据翻译成标注语言数据。然后,终端11将标注语言数据发送至服务器12。服务器12能够利用强化学习算法训练得到目标数据筛选模型,基于该目标数据筛选模型对终端11发送的待筛选的源语言数据进行筛选,服务器12还能够将筛选后的源语言数据发送至终端11,获取终端11发送的与筛选后的源语言数据对应的标注语言数据。然后,服务器12能够基于筛选后的源语言数据和与筛选后的源语言数据对应的标注语言数据获取机器翻译模型。
在示例性实施例中,终端11也能够利用强化学习算法训练得到目标数据筛选模型,基于该目标数据筛选模型对获取的待筛选的源语言数据进行筛选,进而根据筛选后的源语言数据和与筛选后的源语言数据对应的标注语言数据获取机器翻译模型。
可选地,终端11是诸如手机、平板电脑、个人计算机等的智能设备。服务器12是一台服务器,或者是由多台服务器组成的服务器集群,或者是一个云计算服务中心。终端11与服务器12通过有线或无线网络建立通信连接。
本领域技术人员应能理解上述终端11和服务器12仅为举例,其他现有的或今后可能出现的终端或服务器如可适用于本申请,也应包含在本申请保护范围以内,并在此以引用方式包含于此。
基于上述图1所示的实施环境,本申请实施例提供一种数据处理方法,该方法应用于计算机设备,该计算机设备为服务器或者终端。本申请实施例以该方法应用于服务器为例。如图2所示,本申请实施例提供的方法包括如下步骤:
在步骤201中,获取待筛选数据集,待筛选数据集包括多个待筛选的源语言数据。
待筛选数据集为需要进行筛选的数据集。待筛选数据集包括多个待筛选的源语言数据。需要说明的是,在本申请实施例中,将源语言数据对应的语种称为第一语种。示例性地,源语言数据是指第一语种的语句。
在一种可能实现方式中,服务器获取待筛选数据集的方式包括但不限于以下两种:
方式一:服务器从第一语种的数据库中获取待筛选数据集。
在一种可能实现方式中,服务器从第一语种的数据库中随机选取第一参考数量的语句组成待筛选数据集。第一参考数量根据需要获取的双语数据的数量确定,或者根据实际情况自由调整,本申请实施例对此不加以限定。
方式二:服务器接收终端发送的网络数据,在网络数据中解析出第一语种的语句,基于解析出的第一语种的语句获取待筛选数据集。
终端在处理互联网业务的过程中,能够获取网络数据,在网络数据中可能包括不同语种的语句;终端将网络数据发送至服务器后,服务器能够在网络数据中解析出第一语种的语句。在一种可能实现方式中,服务器基于解析出的第一语种的语句获取待筛选数据集的过程为:服务器在解析出的第一语种的语句中选取第一参考数量的语句组成待筛选数据集。
服务器在获取待筛选数据集后,即可执行步骤202。
在实际应用场景中,要训练一个精确的机器翻译模型,需要足够数量的双语训练数据。但是,当双语中的一方语种为不常见语种时,已有的双语数据库中的双语训练数据的数据量可能较少。在此种情况下,服务器需要获取新的双语训练数据扩充已有的双语数据库。获取新的双语训练数据的成本较高,因此,服务器需要先基于步骤202对大量的待筛选的源语言数据进行筛选,以提高获取的双语训练数据的质量。
在步骤202中,基于目标数据筛选模型,对待筛选数据集中的各个源语言数据进行筛选,得到筛选后的目标源语言数据,目标数据筛选模型利用强化学习算法训练得到。
服务器在获取待筛选数据集后,即可基于目标数据筛选模型,对待筛选数据集中的各个源语言数据进行筛选,以得到筛选后的目标源语言数据。目标数据筛选模型利用强化学习算法训练得到,也就是说,目标数据筛选模型的筛选规则为机器在强化学习的过程中自动学习到的,目标数据筛选模型的筛选规则能够适应各种不同的场景,应用范围广泛。
在一种可能实现方式中,服务器基于目标数据筛选模型,对待筛选数据集中的各个源语言数据进行筛选,得到筛选后的目标源语言数据的过程为:获取待筛选数据集中的各个源语言数据的特征,将各个源语言数据的特征输入目标数据筛选模型;目标数据筛选模型对输入的各个源语言数据的特征进行处理,输出各个源语言数据的筛选结果;服务器基于各个源语言数据的筛选结果得到筛选后的目标源语言数据。
本申请实施例对待筛选数据集中的各个源语言数据的特征的获取方式不加以限定。例如,对于各个源语言数据中的任一源语言数据,基于该任一源语言数据中的每个子数据对应的词嵌入(Embedding)特征以及该任一源语言数据的长度获取该任一源语言数据的特征 等。示例性地,任一源语言数据的特征以向量的形式表示。
在一种可能实现方式中,服务器将各个源语言数据的特征输入目标数据筛选模型的方式包括但不限于以下两种:
方式一:服务器每次将一个源语言数据的特征输入目标数据筛选模型进行处理,直至将各个源语言数据的特征均输入目标数据筛选模型。
在此种方式一下,目标数据筛选模型每次仅输出一个源语言数据的筛选结果。
方式二:服务器将各个源语言数据划分到第二参考数量的源语言数据组中,每次将一个源语言数据组中的全部源语言数据的特征同时输入目标数据筛选模型进行处理,直至将所有的源语言数据组中的全部源语言数据的特征均输入目标数据筛选模型。
在此种方式二下,目标数据筛选模型每次输出一个源语言数据组中的全部源语言数据的筛选结果。第二参考数量根据经验设置,或者根据应用场景自由调整,本申请实施例对此不加以限定。示例性地,当第二参考数量设置为1时,将各个源语言数据的特征同一批次输入目标数据筛选模型进行处理,目标数据筛选模型同一批次输出各个源语言数据的筛选结果。
在一种可能实现方式中,筛选结果为第一结果或第二结果。第一结果用于指示源语言数据的可靠性高,第二结果用于指示源语言数据的可靠性低。对于任一源语言数据,当该任一源语言数据的筛选结果为第一结果时,说明该任一源语言数据的可靠性高,也就是说,该任一源语言数据为高质量的源语言数据;当该任一源语言数据的筛选结果为第二结果时,说明该任一源语言数据的可靠性低,也就是说,该任一源语言数据为低质量的源语言数据。
在一种可能实现方式中,第一结果和第二结果分别用数值1和数值0表示。当目标数据筛选模型输出的某一源语言数据的筛选结果为1时,说明该源语言数据的筛选结果为第一结果;当目标数据筛选模型输出的某一源语言数据的筛选结果为0时,说明该源语言数据的筛选结果为第二结果。
在一种可能实现方式中,服务器基于各个源语言数据的筛选结果得到筛选后的目标源语言数据的方式为:服务器将筛选结果为第一结果的源语言数据作为筛选后的目标源语言数据。
服务器在得到筛选后的目标源语言数据后,即可基于筛选后的目标源语言数据执行步骤203。
需要说明的是,在服务器执行步骤202之前,需要先利用强化学习算法训练得到目标数据筛选模型。利用强化学习算法训练得到目标数据筛选模型的过程详见步骤301至步骤303所示的实施例,此处暂不赘述。
在步骤203中,获取与目标源语言数据对应的标注语言数据,基于目标源语言数据和标注语言数据获取机器翻译模型。
由于筛选后的源语言数据为质量高的源语言数据,因此,将筛选后的源语言数据作为目标源语言数据,进一步获取与目标源语言数据对应的标注语言数据。在本申请实施例中,将标注语言数据对应的语种称为第二语种。示例性地,标注语言数据是指第二语种的语句。
在一种可能实现方式中,标注语言数据由专业翻译人员对目标源语言数据进行翻译得到。服务器获取与目标源语言数据对应的标注语言数据的过程为:服务器将目标源语言数据发送至终端;终端展示目标源语言数据,以供专业翻译人员查看目标源语言数据且对目标源语言数据进行人工翻译;当检测到专业翻译人员的翻译确认指令时,终端获取与目标源语言数据对应的标注语言数据;终端将与目标源语言数据对应的标注语言数据发送至服务器。由此,服务器获取与目标源语言数据对应的标注语言数据。
在获取与目标源语言数据对应的标注语言数据后,服务器基于目标源语言数据和标注语言数据获取机器翻译模型。需要说明的是,在基于目标源语言数据和标注语言数据获取机器翻译模型的过程中,服务器直接基于目标源语言数据和标注语言数据训练得到机器翻译模型;或者,服务器将目标源语言数据和标注语言数据添加至已有的双语训练数据中,得到扩充后的双语训练数据,然后基于扩充后的双语训练数据训练得到机器翻译模型。本申请实施例对获取机器翻译模型的具体方式不加以限定。
在实际应用过程中,通过实验比对了根据本申请实施例提供的方法获取的机器翻译模型和根据其他方法获取的机器翻译模型的翻译性能。实验过程为:在待筛选数据集中,根据本申请实施例提供的方法获取目标数量的目标源语言数据,获取与目标源语言数据对应的标注语言数据,将目标源语言数据以及与目标源语言数据对应的标注语言数据作为第一双语训练样本;基于第一双语训练样本,训练得到翻译模型1。在同样的待筛选数据集中,随机选取目标数量的选定源语言数据,获取与选定源语言数据对应的标注语言数据,将选定源语言数据和与选定源语言数据对应的标注语言数据作为第二双语训练样本;基于第二双语训练样本,训练得到翻译模型2。分别测试翻译模型1和翻译模型2在WMT(Workshop on Machine Translation,机器翻译比赛)领域测试集、经济领域测试集和政治领域测试集上的翻译性能。
分别以源语言数据为中文数据、标注语言数据为英文数据(中英机器翻译),以及源语言数据为英文数据、标注语言数据为中文数据(英中机器翻译)为例,翻译模型1和翻译模型2的性能的比对结果如表1所示。
表1
(表1以图片形式给出翻译模型1与翻译模型2在WMT、经济、政治领域测试集上的BLEU值对比,具体数值无法从文本中还原)
基于表1可知,无论是中英机器翻译还是英中机器翻译,根据本申请实施例提供的方法获取的翻译模型1在各个领域的测试集上均具有比翻译模型2更高的翻译性能。其中,翻译性能用BLEU(Bilingual Evaluation Understudy,双语评估替补)值表示。
在机器翻译任务中,为达到预定的机器翻译性能,利用本申请实施例提供的方法能够获取更有效质量更高的源语言数据,减少专业翻译人员的翻译成本,在降低预算和成本方面具有重要的价值。
在本申请实施例中,基于利用强化学习算法训练得到的目标数据筛选模型,对待筛选数据集中的各个源语言数据进行筛选,进而基于筛选后的目标源语言数据和与目标源语言数据对应的标注语言数据获取机器翻译模型。在此种数据处理的过程中,目标数据筛选模型中的筛选规则为机器在强化学习的过程中自动学习出来的,目标数据筛选模型的适应场景广泛,筛选后的源语言数据的质量较高,使得基于筛选后的源语言数据和与筛选后的源语言数据对应的标注语言数据获取的机器翻译模型的翻译性能较好。
本申请实施例提供一种利用强化学习算法训练得到目标数据筛选模型的方法,以该方法应用于服务器为例。如图3所示,本申请实施例提供的方法包括如下步骤:
在步骤301中,初始化第一训练数据集,第一训练数据集包括多个源语言训练数据。
第一训练数据集为训练得到目标数据筛选模型所需的待筛选数据集,第一训练数据集包括多个源语言训练数据。源语言训练数据为训练得到目标数据筛选模型的所需的待筛选源语言数据。
在一种可能实现方式中,初始化第一训练数据集的方式为随机初始化第一训练数据集,或者根据预先设置的方式初始化第一训练数据集,本申请实施例对此不加以限定。
在一种可能实现方式中,随机初始化第一训练数据集的过程为:将第一训练数据集中的各个源语言训练数据的顺序随机打乱。将第一训练数据集随机初始化,有利于提高训练得到的目标数据筛选模型的泛化能力。
在步骤302中,基于初始化的第一训练数据集,利用强化学习算法对第一数据筛选模型进行训练,得到第二数据筛选模型。
第一数据筛选模型为与初始化的第一训练数据集对应的初始数据筛选模型,第二数据筛选模型为与初始化的第一训练数据集对应的最终数据筛选模型。本申请实施例对数据筛选模型的具体形式不加以限定。例如,数据筛选模型为DQN(Deep Q-Learning,深度Q学习)模型。
步骤302为获取第二数据筛选模型,也就是获取与初始化的第一训练数据集对应的最终数据筛选模型的过程,如图4所示,该过程包括步骤3021至步骤3026。
步骤3021,将初始化的第一训练数据集划分为至少一个目标训练数据集。
初始化的第一训练数据集中包括多个源语言训练数据,将初始化的第一训练数据集划分为至少一个目标训练数据集,使得每个目标训练数据集中包括初始化的第一训练数据集中的部分源语言训练数据。
在划分为至少一个目标训练数据集后,在获取与该初始化的第一训练数据集对应的第二数据筛选模型的过程中,每次使用一个目标训练数据集进行训练。相比于每次使用一个源语言训练数据进行训练,此方式能够缩短训练时间,提高训练过程的稳定性。需要说明的是,在划分为至少一个目标训练数据集后,对各个目标训练数据集进行排序,在后续训练过程中,按照排列顺序依次选取各个目标训练数据集。根据排列顺序,各个目标训练数据集依次为第一目标训练数据集,第二目标训练数据集、……、第n目标训练数据集(n为大于0的整数)。
在一种可能实现方式中,目标训练数据集的数量n根据第一训练数据集中的源语言训练数据的总数量M和小批量尺寸(Mini-batch size)S确定,确定方式为n=M/S。小批量尺寸S根据经验设置,或者根据源语言训练数据的总数量进行调整,本申请实施例对此不加以限定。例如,小批量尺寸设置为16。也就是说,每个目标训练数据集中包括16个源语言训练数据。此时,目标训练数据集的数量n=M/16。
步骤3022,调用第一数据筛选模型对第一目标训练数据集中的各个源语言训练数据的目标特征进行处理,得到第一目标训练数据集中的各个源语言训练数据的筛选结果,第一目标训练数据集为至少一个目标训练数据集中的第一个目标训练数据集。
在一种可能实现方式中,在实现步骤3022之前,需要先获取第一目标训练数据集中的各个源语言训练数据的目标特征。也就是说,在将初始化的第一训练数据集划分为至少一个目标训练数据集后,获取第一目标训练数据集中各个源语言训练数据的目标特征。其中,第一目标训练数据集为至少一个目标训练数据集中的第一个目标训练数据集。
在一种可能实现方式中,获取第一目标训练数据集中的任一源语言训练数据的目标特征的过程包括以下步骤3022A至步骤3022C:
步骤3022A:基于任一源语言训练数据中的各个子数据,获取任一源语言训练数据的第一特征。
第一特征用于指示该任一源语言训练数据本身的特征,第一特征基于该任一源语言训练数据中的各个子数据获取到。任一源语言训练数据中包括多个子数据,示例性地,当任一源语言训练数据为语句时,该任一源语言训练数据中的每个词均为该任一源语言训练数据中的一个子数据。
在一种可能实现方式中,基于任一源语言训练数据中的各个子数据,获取任一源语言训练数据的第一特征的过程包括以下步骤1至步骤4:
步骤1:基于任一源语言训练数据中的各个子数据的词嵌入特征,获取任一源语言训练数据的第三特征。
基于词表查询任一源语言训练数据中各个子数据的词嵌入(Embedding)特征,将各个子数据的词嵌入特征补充(Pad)到同一长度,基于同一长度的各个子数据的词嵌入特征,即可获取该任一源语言训练数据的第三特征。
词表是指存储各个词对应的词嵌入特征的表,词表能够基于已有的语料库构建得到,本申请实施例对词表的构建过程不加以限定。词表中每个词对应的词嵌入特征能够用向量表示,向量的维度根据经验设置,例如,将向量的维度设置为512维。
在一种可能实现方式中,基于同一长度的各个子数据的词嵌入特征,获取任一源语言训练数据的第三特征的方式为:将同一长度的各个子数据的词嵌入特征输入第一神经网络,将经过第一神经网络中的卷积层和全连接层的处理得到的特征作为任一源语言训练数据的第三特征。本申请实施例对第一神经网络中的卷积层和全连接层的设置不加以限定。示例性地,在卷积层中还包含ReLU(Rectified Linear Unit,线形整流函数)处理模块。例如,如图5所示,第一神经网络为CNN(Convolutional Neural Networks,卷积神经网络)网络,卷积层的卷积核尺寸(filter size)分别设置为3、4、和5,卷积核的数量(filter number)设置为128,经过全连接层处理后得到384*256维的特征向量,将该特征向量作为第三特征。
步骤2:基于任一源语言训练数据中的各个子数据和已有语料数据库的比对结果,获取任一源语言训练数据的第四特征。
通过将任一源语言训练数据中的各个子数据和已有语料数据库进行比对,能够统计该任一源语言训练数据中N-gram(N元)的子数据在已有语料数据库中出现的概率,将该任一源语言训练数据中N-gram的子数据在已有语料数据库中出现的概率作为比对结果。然后基于比对结果,获取任一源语言训练数据的第四特征。示例性地,N-gram包括2-gram、3-gram和4-gram中的一种或多种。
在一种可能实现方式中,基于比对结果,获取任一源语言训练数据的第四特征的方式为:将比对结果输入第二神经网络,将经过第二神经网络的处理得到的特征作为任一源语言训练数据的第四特征。本申请实施例对第二神经网络的设置不加以限定。示例性地,如图5所示,经过第二神经网络的处理后,得到1*256维的特征向量,将该特征向量作为第四特征。
步骤3:基于任一源语言训练数据中的各个子数据的数量,确定任一源语言训练数据的长度,基于任一源语言训练数据的长度,获取任一源语言训练数据的第五特征。
根据任一源语言训练数据中的子数据的数量,即可确定该任一源语言训练数据的长度。例如,当任一源语言训练数据为语句,子数据为词时,语句中包括的词的数量即为该语句的长度。
在一种可能实现方式中,基于任一源语言训练数据的长度,获取任一源语言训练数据的第五特征的方式为:将任一源语言训练数据的长度输入第三神经网络,将经过第三神经网络的处理得到的特征作为任一源语言训练数据的第五特征。本申请实施例对第三神经网络的设置不加以限定。示例性地,如图5所示,经过第三神经网络的处理后,能够得到1*256维的特征向量,将该特征向量作为第五特征。
步骤4:基于第三特征、第四特征和第五特征,获取任一源语言训练数据的第一特征。
在根据步骤1至步骤3获取该任一源语言训练数据的第三特征、第四特征和第五特征后,即可获取该任一源语言训练数据的第一特征。在一种可能实现方式中,基于第三特征、第四特征和第五特征,获取任一源语言训练数据的第一特征的方式为:将任一源语言训练数据的第三特征、第四特征和第五特征拼接起来得到第一特征。
步骤3022B:基于任一源语言训练数据和第三翻译模型,获取任一源语言训练数据的第二特征。
第二特征用于指示该任一源语言训练数据在综合考虑第三翻译模型的翻译结果的基础上得到的特征。示例性地,第三翻译模型为任意一个能够对源语言训练数据进行翻译的模型,本申请实施例对此不加以限定。在一种可能实现方式中,基于任一源语言训练数据和第三翻译模型,获取任一源语言训练数据的第二特征的过程包括以下步骤a至步骤d:
步骤a:基于第三翻译模型,获取任一源语言训练数据的翻译数据,基于翻译数据的词嵌入特征,获取任一源语言训练数据的第六特征。
基于第三翻译模型,获取任一源语言训练数据的翻译数据的过程为:将该任一源语言训练数据输入第三翻译模型,将第三翻译模型输出的翻译数据作为该任一源语言训练数据的翻译数据。
在获取任一源语言训练数据的翻译数据后,在词表中查询该翻译数据的词嵌入特征,基于翻译数据的词嵌入特征,获取任一源语言训练数据的第六特征。词嵌入特征能够用向量表示,向量的维度根据经验设置,例如,将向量的维度设置为512维。
在一种可能实现方式中,基于翻译数据的词嵌入特征,获取任一源语言训练数据的第六特征的方式为:将翻译数据的词嵌入特征输入第四神经网络,将经过第四神经网络中的卷积层和全连接层的处理得到的特征作为任一源语言训练数据的第六特征。本申请实施例对第四神经网络中的卷积层和全连接层的设置不加以限定。在示例性实施例中,在卷积层中还包含ReLU处理模块。例如,如图5所示,第四神经网络为CNN网络,卷积层的卷积核尺寸(filter size)分别设置为3、4、和5,卷积核的数量(filter number)设置为128。经过全连接层处理后能够得到384*256维的特征向量,将该特征向量作为第六特征。在示例性实施例中,第四神经网络与第一神经网络相同。
步骤b:基于第三翻译模型,获取与任一源语言训练数据中的各个子数据分别对应的目标翻译子数据,基于各个子数据分别对应的目标翻译子数据的词嵌入特征,获取任一源语言训练数据的第七特征,任一子数据对应的目标翻译子数据的翻译概率在任一子数据对应的各个候选翻译子数据的翻译概率中最大。
将任一源语言训练数据输入第三翻译模型,能够得到第三翻译模型输出的与任一源语言训练数据中的各个子数据分别对应的候选翻译子数据及候选翻译子数据的翻译概率。在一种可能实现方式中,与任一子数据对应的候选翻译子数据的数量根据经验设置,例如,将候选翻译子数据的数量设置为10,则第三翻译模型输出各个子数据分别对应的翻译概率最大的10个候选翻译子数据及10个候选翻译子数据的翻译概率。
根据与任一源语言训练数据中的各个子数据分别对应的候选翻译子数据及候选翻译子数据的翻译概率,能够确定与任一源语言训练数据中的各个子数据分别对应的目标翻译子数据。任一子数据对应的目标翻译子数据为该任一子数据对应的各个候选翻译子数据中翻译概率最大的候选翻译子数据。在词表中查找各个子数据分别对应的目标翻译子数据的词嵌入特征,将各个子数据分别对应的目标翻译子数据的词嵌入特征补充到同一长度,基于同一长度的各个子数据分别对应的目标翻译子数据的词嵌入特征,获取任一源语言训练数据的第七特征。
在一种可能实现方式中,基于同一长度的各个子数据分别对应的目标翻译子数据的词嵌入特征,获取任一源语言训练数据的第七特征的方式为:将同一长度的各个子数据分别对应的目标翻译子数据的词嵌入特征输入第五神经网络,将经过第五神经网络中的卷积层和全连接层的处理得到的特征作为任一源语言训练数据的第七特征。本申请实施例对第五 神经网络中的卷积层和全连接层的设置不加以限定。示例性地,在卷积层中还包含ReLU处理模块。例如,如图5所示,第五神经网络为CNN(Convolutional Neural Networks,卷积神经网络)网络,卷积层的卷积核尺寸(filter size)设置为5,卷积核的数量(filter number)设置为64。经过全连接层处理后得到64*256维的特征向量,将该特征向量作为第七特征。在示例性实施例中,第五神经网络与第一神经网络或第四神经网络相同。
步骤c:获取各个子数据分别对应的目标翻译子数据的翻译概率,基于各个子数据分别对应的目标翻译子数据的翻译概率和翻译数据的长度,获取任一源语言训练数据的第八特征。
根据步骤b还能够获取各个子数据分别对应的目标翻译子数据的翻译概率。在一种可能实现方式中,基于各个子数据分别对应的目标翻译子数据的翻译概率和翻译数据的长度,获取任一源语言训练数据的第八特征的过程为:将各个子数据分别对应的目标翻译子数据的翻译概率相加得到总概率,基于总概率与翻译数据的长度的比值获取任一源语言训练数据的第八特征。示例性地,第八特征用于指示任一源语言训练数据的置信分数(Confidence Score)。
在一种可能实现方式中,基于总概率与翻译数据的长度的比值获取任一源语言训练数据的第八特征的方式为:将总概率与翻译数据的长度的比值输入第六神经网络,将经过第六神经网络的处理得到的特征作为任一源语言训练数据的第八特征。本申请实施例对第六神经网络的设置不加以限定。示例性地,如图5所示,经过第六神经网络的处理后,能够得到1*256维的特征向量,将该特征向量作为第八特征。
步骤d:基于第六特征、第七特征和第八特征,获取任一源语言训练数据的第二特征。
在根据步骤a至步骤d获取该任一源语言训练数据的第六特征、第七特征和第八特征后,即可获取该任一源语言训练数据的第二特征。在一种可能实现方式中,基于第六特征、第七特征和第八特征,获取任一源语言训练数据的第二特征的方式为:将任一源语言训练数据的第六特征、第七特征和第八特征拼接起来得到第二特征。
需要说明的是,本申请实施例对步骤3022A和步骤3022B的执行顺序不加以限定。在示例性实施例中,先执行步骤3022A,再执行步骤3022B;或者,先执行步骤3022B,再执行步骤3022A;再或者,同时执行步骤3022A和步骤3022B。
步骤3022C:基于第一特征和第二特征,获取任一源语言训练数据的目标特征。
在获取到该任一源语言训练数据的第一特征和第二特征后,基于第一特征和第二特征,获取任一源语言训练数据的目标特征。在一种可能实现方式中,基于第一特征和第二特征,获取任一源语言训练数据的目标特征的方式为:将第一特征和第二特征进行拼接,将拼接 后的特征作为任一源语言训练数据的目标特征。需要说明的是,本申请实施例对第一特征和第二特征的拼接顺序不加以限定。
在一种可能实现方式中,由于第一特征是基于第三特征、第四特征和第五特征获取到的,第二特征是基于第六特征、第七特征和第八特征获取到的,所以,任一源语言训练数据的目标特征能够基于该任一源语言训练数据的第三特征、第四特征、第五特征、第六特征、第七特征和第八特征获取得到。
根据上述步骤3022A至步骤3022C的方式,能够获取第一目标训练数据集中各个源语言训练数据的目标特征。然后调用第一数据筛选模型对第一目标训练数据集中的各个源语言训练数据的目标特征进行筛选处理。
将第一目标训练数据集中各个源语言训练数据的目标特征输入第一数据筛选模型后,第一数据筛选模型对各个源语言训练数据的目标特征进行处理。第一数据筛选模型对目标特征进行处理后,基于分类器输出每个源语言训练数据的筛选结果。例如,得到第一目标训练数据集中的各个源语言训练数据的筛选结果的过程如图5所示。
本申请实施例对第一数据筛选模型处理目标特征的方式不加以限定。例如,第一数据筛选模型将目标特征通过两个全连接层进行处理。在通过第一个全连接层进行处理后,得到源语言训练数据的全连接特征;将全连接特征送入另外一个全连接层,基于公式a i=argmaxQ π(s i,a)输出源语言训练数据对应不同筛选结果的概率,然后经过分类器,输出概率大的筛选结果作为该源语言训练数据的筛选结果。在公式a i=argmaxQ π(s i,a)中,a i表示筛选结果,Q π(s i,a)表示第一数据筛选模型对应的目标函数。由此,服务器能够得到第一目标训练数据集中的各个源语言训练数据的筛选结果。
在一种可能实现方式中,筛选结果包括两种,分别为第一结果和第二结果。其中,第一结果用于指示源语言训练数据的可靠性高,第二结果用户指示源语言训练数据的可靠性低。示例性地,筛选结果用数值表示,筛选结果和数值的对应关系根据经验预先设置,例如,第一结果对应的数值为1,第二结果对应的数值为0。
步骤3023,对于第一目标训练数据集中的任一源语言训练数据,基于任一源语言训练数据的筛选结果,确定任一源语言训练数据的权重值。
不同筛选结果的源语言训练数据对应有不同的权重值。在一种可能实现方式中,基于任一源语言训练数据的筛选结果,确定任一源语言训练数据的权重值的过程为:响应于任一源语言训练数据的筛选结果为第一结果,将第一权重值作为任一源语言训练数据的权重值;响应于任一源语言训练数据的筛选结果为第二结果,将第二权重值作为任一源语言训练数据的权重值。
在一种可能实现方式中,第二权重值为预先设置的与筛选结果为第二结果的源语言训练数据对应的权重值。本申请实施例对第二权重值的设置方式不加以限定,例如,将第二权重值设置为0。
在一种可能实现方式中,在将第一权重值作为任一源语言训练数据的权重值之前,需要先获取第一权重值。获取第一权重值的过程包括以下步骤A至步骤D:
步骤A:获取与第一目标训练数据集中的各个目标源语言训练数据分别对应的标注语言训练数据,各个目标源语言训练数据的筛选结果均为第一结果。
当源语言训练数据的筛选结果为第一结果时,说明该源语言训练数据的可靠性高,将第一目标训练数据集中的筛选结果为第一结果的各个源语言训练数据作为各个目标源语言训练数据,然后获取与各个目标源语言训练数据分别对应的标注语言训练数据。
在示例性实施例中,在训练之前,预先获取第一训练数据集中的各个源语言训练数据对应的标注语言训练数据并存储。在执行步骤A时,从存储中获取与各个目标源语言训练数据分别对应的标注语言训练数据,以节省训练时间。
基于步骤A,即可获取到与第一目标训练数据集中的筛选结果为第一结果的各个目标源语言训练数据分别对应的标注语言训练数据,然后执行步骤B。
步骤B:将各个目标源语言训练数据和与各个目标源语言训练数据分别对应的标注语言训练数据作为训练数据添加至第二训练数据集中。
第二训练数据集的初始值为空集,第二训练数据集用于存储双语训练数据。任一双语训练数据由一个源语言训练数据和与该源语言训练数据对应的标注语言数据组成。
在获取与各个目标源语言训练数据对应的标注语言训练数据后,即可将各个目标源语言训练数据和与各个目标源语言训练数据分别对应的标注语言训练数据作为训练数据添加至第二训练数据集中。示例性地,将任一目标源语言训练数据记作x i,将与x i对应的标注语言训练数据记作y i,将第二训练数据集记作D l,则将(x i,y i)添加至D l中。
需要说明的是,经过步骤B,将第一目标训练数据集中的筛选结果为第一结果的全部目标源语言训练数据和与全部目标源语言训练数据对应的标注语言训练数据均对应添加至第二训练数据集中。基于此种方式得到的第二训练数据集,能够提高获取的第一权重值的准确性。
步骤C:基于第二训练数据集对第一翻译模型进行训练,得到第二翻译模型。
第一翻译模型为利用已知的双语训练数据预训练得到的翻译模型。本申请实施例对第一翻译模型的具体形式不加以限定。例如,第一翻译模型为NMT(Neural Machine Translation,神经机器翻译)模型。
在经过步骤B后,得到更新后的第二训练数据集。由于第二训练数据集中的数据均为双语训练数据,所以能够基于第二训练数据集对第一翻译模型进行训练。本申请实施例对训练第一翻译模型的方式不加以限定。将训练得到的翻译模型作为第二翻译模型。
步骤D:基于第二翻译模型和第一翻译模型,获取第一权重值。
第一权重值用于指示第二翻译模型与第一翻译模型的性能差异。在一种可能实现方式中,基于第二翻译模型和第一翻译模型,获取第一权重值的过程为:利用验证数据集(held out数据集)分别对第一翻译模型和第二翻译模型进行验证,得到第一翻译模型的模型性能和第二翻译模型的模型性能,基于第一翻译模型的模型性能和第二翻译模型的模型性能,获取第一权重值。
在一种可能实现方式中,基于下述公式1获取第一权重值:
R(s_{i−1}, a) = Acc(Φ_i) − Acc(Φ_{i−1})   (公式1)
其中,Acc(Φ_i)表示第二翻译模型的模型性能,Acc(Φ_{i−1})表示第一翻译模型的模型性能,R(s_{i−1}, a)表示第一权重值(Reward)。第一权重值的取值有正有负,表示第二训练数据集D_l中增加的双语训练样本(x_i, y_i)对模型性能的影响可能是正向影响,也可能是负向影响。
在获取第一权重值后,即可将第一权重值作为第一目标训练数据集中的筛选结果为第一结果的各个源语言训练数据的权重值。
步骤3024,基于任一源语言训练数据的目标特征、任一源语言训练数据的筛选结果、任一源语言训练数据的权重值和参考源语言训练数据的目标特征,生成与任一源语言训练数据对应的候选数据,参考源语言训练数据为第二目标训练数据集中与任一源语言训练数据对应的源语言数据。
第二目标训练数据集为至少一个目标训练数据集中的第一目标训练数据集的下一个目标训练数据集。候选数据为用于更新第一数据筛选模型的参数的数据。
在一种可能实现方式中,生成与任一源语言训练数据对应的候选数据的方式为:
响应于任一源语言训练数据的筛选结果为第一结果,基于任一源语言训练数据的目标特征、第一结果、第一权重值和参考源语言数据的目标特征,生成与任一源语言训练数据对应的第一候选数据;
响应于任一源语言训练数据的筛选结果为第二结果,基于任一源语言训练数据的目标特征、第二结果、第二权重值和参考源语言数据的目标特征,生成与任一源语言训练数据对应的第二候选数据。
也就是说,每个源语言训练数据均对应一个候选数据,该候选数据为第一候选数据或 者第二候选数据。将任一源语言训练数据的目标特征记作s i、筛选结果记作a i、权重值记作r i、参考源语言数据的目标特征记作s i+1,则与任一源语言训练数据对应的候选数据记作(s i,a i,r i,s i+1)。其中,a i和r i根据该任一源语言训练数据的筛选结果确定。当a i表示第一结果时,r i表示第一权重值,(s i,a i,r i,s i+1)表示第一候选数据;当a i表示第二结果时,r i表示第二权重值,(s i,a i,r i,s i+1)表示第二候选数据。
上述步骤3023和步骤3024从第一目标训练数据集中的任一源语言训练数据的角度,介绍了生成与该任一源语言训练数据对应的候选数据的过程。按照步骤3023和步骤3024的方式能够生成与第一目标训练数据集中的各个源语言训练数据分别对应的候选数据。在生成与第一目标训练数据集中的各个源语言训练数据分别对应的候选数据后,执行步骤3025。
步骤3025,基于与第一目标训练数据集中的各个源语言训练数据分别对应的候选数据,选取目标数量的候选数据,基于目标数量的候选数据,更新第一数据筛选模型的参数,得到更新后的第一数据筛选模型。
在生成与第一目标训练数据集中的各个源语言训练数据对应的候选数据后,基于与第一目标训练数据集中的各个源语言训练数据分别对应的候选数据,选取目标数量的候选数据,以基于目标数量的候选数据更新第一数据筛选模型的参数。目标数量根据经验设置,或者根据全部的候选数据的数量自由调整,本申请实施例对此不加以限定。
在一种可能实现方式中,基于与第一目标训练数据集中的各个源语言训练数据分别对应的候选数据,选取目标数量的候选数据的方式为:在与第一目标训练数据集中的各个源语言训练数据分别对应的候选数据中随机选取目标数量的候选数据。
在一种可能实现方式中,基于与第一目标训练数据集中的各个源语言训练数据分别对应的候选数据,选取目标数量的候选数据的方式为:将与第一目标训练数据集中的各个源语言训练数据分别对应的候选数据中的第一候选数据添加至第一候选数据集中,将与第一目标训练数据集中的各个源语言训练数据分别对应的候选数据中的第二候选数据添加至第二候选数据集中;在第一候选数据集和第二候选数据集中进行等比例选取,得到目标数量的候选数据。基于此种选取方式选取的候选数据更具有代表性,有利于提高数据筛选模型的训练过程的稳定性。
第一候选数据集用于在训练得到目标数据筛选模型的过程中不断归集新生成的第一候选数据,第二候选数据集用于在训练得到目标训练数据筛选模型的过程中不断归集新生成的第二候选数据。在示例性实施例中,第一候选数据集和第二候选数据集的初始值均为空集。
在一种可能实现方式中,基于目标数量的候选数据,更新第一数据筛选模型的参数,得到更新后的第一数据筛选模型的过程包括以下步骤I至步骤III:
步骤I:基于目标数量的候选数据,更新与第一数据筛选模型对应的目标函数。
在一种可能实现方式中,目标函数的形式为Q π(s,a),更新与第一数据筛选模型对应的目标函数的方式为:基于贝尔曼方程(公式2)更新与第一数据筛选模型对应的目标函数。
Q_π(s, a) = E[R_i | s_i = s, a_i = a, π]   (公式2)
其中,R_i = Σ_{t≥i} γ^{t−i}·r_t,γ∈[0,1],R_i 是折扣后的长期权重,γ 为折扣因子。
步骤II:根据更新后的目标函数,计算与第一数据筛选模型对应的损失函数。
在得到更新后的目标函数后,即可根据更新后的目标函数,计算当前的损失函数。在一种可能实现方式中,基于下述公式3计算损失函数:
L(θ) = E_{s,a,r,s′}[(y_i(r, s′) − Q(s, a; θ))²]   (公式3)
其中,L(θ)表示损失函数,y_i(r, s′) = r + γ·max_{a′} Q(s′, a′; θ_{i−1}) 为基于第一数据筛选模型的当前参数 θ_{i−1} 得到的目标函数值。
步骤III:基于损失函数,更新第一数据筛选模型的参数,得到更新后的第一数据筛选模型。
在得到损失函数后,基于最小化损失函数的目标,更新第一数据筛选模型的参数,以得到更新后的第一数据筛选模型。
在一种可能实现方式中,利用SGD(Stochastic Gradient Descent,随机梯度下降)算法最小化损失函数L(θ)。
综上所述,获取更新后的第一数据筛选模型的过程如图6所示。基于第一训练数据集D u中的任一目标训练数据集的各个源语言训练数据x i和神经网络,获取各个源语言训练数据的目标特征s i;将s i输入第一数据筛选模型中,第一数据筛选模型基于公式a i=argmaxQ π(s i,a)确定各个源语言训练数据的筛选结果。当筛选结果为0时,将0作为权重值r i;当筛选结果为1时,获取标注语言数据y i,将(x i,y i)添加至第二训练数据集D l中,利用第二训练数据集D l对第一翻译模型进行训练,得到第二翻译模型;利用held-out验证数据集分别计算第一翻译模型和第二翻译模型的模型性能,将模型性能的差值作为筛选结果为1的源语言训练数据的权重值。生成候选数据(s i,a i,r i,s i+1)。选取目标数量的候选数据,利用SGD算法最小化损失函数L(θ),得到更新后的第一数据筛选模型。
步骤3026,基于第二目标训练数据集对更新后的第一数据筛选模型进行训练,以此类推,直至满足第二训练终止条件,得到第二数据筛选模型。
基于第二目标训练数据集对更新后的第一数据筛选模型进行训练的过程为:基于第二目标训练数据集和更新后的第一数据筛选模型执行步骤3022至步骤3025,得到进一步更新后的第一数据筛选模型。以此类推,直至满足第二训练终止条件。
在一种可能实现方式中,每对第一数据筛选模型更新一次,即判断一次是否满足第二训练终止条件。若不满足第二训练终止条件,则基于下一个目标训练数据集和当前最新的第一数据筛选模型执行步骤3022至步骤3025,以继续更新第一数据筛选模型;若满足第二训练终止条件,停止迭代训练,将此时得到的更新后的第一数据筛选模型作为第二数据筛选模型。
在一种可能实现方式中,满足第二训练终止条件,包括但不限于以下两种情况:
情况一:第一训练数据集中不存在满足条件的目标训练数据集,满足条件的目标训练数据集中的各个源语言训练数据的目标特征未进行过筛选处理。
当第一训练数据集中不存在满足条件的目标训练数据集时,说明第一训练数据集中的全部源语言训练数据均作为训练数据参与了获取第二数据筛选模型的训练过程,此时认为满足第二训练终止条件。
情况二:筛选结果为第一结果的源语言训练数据的数量达到数量阈值。
示例性地,数量阈值根据训练成本(budget)进行设置,当筛选结果为第一结果的源语言训练数据的数量达到数量阈值时,说明已筛选出足够数量的源语言训练数据,此时认为满足第二训练终止条件。
当满足上述两种情况中的任一种情况时,即认为满足第二训练终止条件,得到第二数据筛选模型。
在步骤303中,响应于不满足第一训练终止条件,重新初始化第一训练数据集,基于重新初始化的第一训练数据集,利用强化学习算法对第二数据筛选模型进行训练,得到第三数据筛选模型;以此类推,直至满足第一训练终止条件,得到目标数据筛选模型。
在基于步骤3026得到第二数据筛选模型后,进一步基于第二数据筛选模型获取目标数据筛选模型。
在一种可能实现方式中,基于第二数据筛选模型获取目标数据筛选模型的方式为:响应于满足第一训练终止条件,将第二数据筛选模型作为目标数据筛选模型;响应于不满足第一训练终止条件,重新初始化第一训练数据集,基于重新初始化的第一训练数据集,利用强化学习算法对第二数据筛选模型进行训练,得到第三数据筛选模型,以此类推,直至满足第一训练终止条件,将满足第一训练终止条件时得到的数据筛选模型作为目标数据筛选模型。也就是说,当不满足第一训练终止条件时,再次执行步骤301和步骤302,得到与 重新初始化的第一训练数据集对应的第三数据筛选模型;循环进行上述过程。
在一种可能实现方式中,每得到一个数据筛选模型,即判断一次是否满足第一训练终止条件。若不满足第一训练终止条件,则继续执行步骤301和步骤302,以继续获取数据筛选模型;若满足第一训练终止条件,则停止迭代训练,将此时得到的数据筛选模型作为目标数据筛选模型。在一种可能实现方式中,满足第一训练终止条件为:初始化第一训练数据集的次数达到次数阈值。
综上所述,在一种可能实现方式中,将获取目标数据筛选模型的过程看作获取策略π(policy π)的过程,获取策略π的算法流程如下:
Input: data D_u, budget B, NMT model Φ //输入:第一训练数据集D_u,成本B,翻译模型Φ
Output: π //输出:策略π
1: for episode = 1, 2, …, N do //在每个时期均执行下述步骤
2:   D_l ← ∅, and shuffle D_u //第二训练数据集D_l置为空集,随机打乱第一训练数据集D_u
3:   Φ ← Init NMT //初始化翻译模型Φ
4:   for mini-batch (x_1, x_2, …, x_k) sampled from D_u //对于第一训练数据集D_u中的每个目标训练数据集(x_1, x_2, …, x_k),执行下述步骤
5:     Construct the state (s_1, s_2, …, s_k) using (x_1, x_2, …, x_k) //构建目标训练数据集的目标特征(s_1, s_2, …, s_k)
6:     The agent makes a decision according to a_i = argmax Q_π(s_i, a), i∈(1, …, k) //智能体(本申请中的数据筛选模型)输出筛选结果
7:     for i in k do: //对于每个源语言训练数据,执行下述操作
8:       if a_i = 1 then //若筛选结果为1
9:         Obtain the annotation y_i //获取标注语言数据y_i
10:        D_l ← D_l + (x_i, y_i) //将(x_i, y_i)添加到第二训练数据集D_l中
11:      end if
12:    end for
13:    Update model Φ based on D_l //利用第二训练数据集D_l更新翻译模型Φ
14:    Receive a reward r_i using held-out set //利用验证数据集获取奖励值(本申请中的第一权重值)r_i
15:    if |D_l| = B then //若第二训练数据集达到成本B
16:      Store (s_i, a_i, r_i, Terminate) in M //将(s_i, a_i, r_i, 停止)存储在候选数据集M中
17:      Break
18:    end if
19:    Construct the new state (s_{k+1}, s_{k+2}, …, s_{2k}) //构建新的目标特征(s_{k+1}, s_{k+2}, …, s_{2k})
20:    Store transition (s_i, a_i, r_i, s_{i+1}) in M //将(s_i, a_i, r_i, s_{i+1})存储在候选数据集M中
21:    Sample random minibatch of transitions {(s_j, a_j, r_j, s_{j+1})} from M, and perform gradient descent step on L(θ) //从M中随机选取目标数量的候选数据{(s_j, a_j, r_j, s_{j+1})},并对损失函数L(θ)执行梯度下降
22:    Update policy π with θ //利用θ更新策略π
23:  end for
24: end for
25: return the latest policy π //返回最新的策略π
在实际应用场景中,数据筛选模型能够应用于主动学习过程。主动学习是一种标记数据的简单技术,主动学习首先从未标注的数据集中选择一些实例,然后由人工标注这些实例,然后重复多次,直到满足终止条件。如图7所示,基于已有的标注数据训练集L更新数据筛选模型,基于数据筛选模型在无标注数据池U中筛选出部分待标注数据,由专业人员进行人工标注,然后将标注后的数据添加至标注数据训练集L中,循环上述过程,直至满足终止条件。例如,终止条件是指标注数据训练集L中的数据的数量达到阈值。
在本申请实施例中,利用强化学习算法训练得到目标数据筛选模型,目标数据筛选模型中的筛选规则为机器在强化学习的过程中自动学习出来的,目标数据筛选模型的适应场景广泛,使得基于目标数据筛选模型筛选后的源语言数据的质量较高,进而有利于提高基于筛选后的源语言数据和与筛选后的源语言数据对应的标注语言数据获取的机器翻译模型的翻译性能。
参见图8,本申请实施例提供了一种数据处理装置,该装置包括:
第一获取模块801,用于获取待筛选数据集,待筛选数据集包括多个待筛选的源语言数据;
筛选模块802,用于基于目标数据筛选模型,对待筛选数据集中的各个源语言数据进行筛选,得到筛选后的目标源语言数据,目标数据筛选模型利用强化学习算法训练得到;
第二获取模块803,用于获取与目标源语言数据对应的标注语言数据;
第三获取模块804,用于基于目标源语言数据和标注语言数据获取机器翻译模型。
在一种可能实现方式中,参见图9,该装置还包括:
初始化模块805,用于初始化第一训练数据集,第一训练数据集包括多个源语言训练数据;
第一训练模块806,用于基于初始化的第一训练数据集,利用强化学习算法对第一数据筛选模型进行训练,得到第二数据筛选模型;
第二训练模块807,用于响应于不满足第一训练终止条件,重新初始化第一训练数据集,基于重新初始化的第一训练数据集,利用强化学习算法对第二数据筛选模型进行训练,得到第三数据筛选模型;以此类推,直至满足第一训练终止条件,得到目标数据筛选模型。
在一种可能实现方式中,参见图10,第一训练模块806,包括:
划分单元8061,用于将初始化的第一训练数据集划分为至少一个目标训练数据集;
处理单元8062,用于调用第一数据筛选模型对第一目标训练数据集中的各个源语言训练数据的目标特征进行筛选处理,得到第一目标训练数据集中的各个源语言训练数据的筛选结果,第一目标训练数据集为至少一个目标训练数据集中的第一个目标训练数据集;
确定单元8063,用于对于第一目标训练数据集中的任一源语言训练数据,基于任一源语言训练数据的筛选结果,确定任一源语言训练数据的权重值;
生成单元8064,用于基于任一源语言训练数据的目标特征、任一源语言训练数据的筛选结果、任一源语言训练数据的权重值和参考源语言训练数据的目标特征,生成与任一源语言训练数据对应的候选数据,参考源语言训练数据为第二目标训练数据集中与任一源语言训练数据对应的源语言数据,第二目标训练数据集为至少一个目标训练数据集中的第一目标训练数据集的下一个目标训练数据集;
选取单元8065,用于基于与第一目标训练数据集中的各个源语言训练数据分别对应的候选数据,选取目标数量的候选数据;
更新单元8066,用于基于目标数量的候选数据,更新第一数据筛选模型的参数,得到更新后的第一数据筛选模型;
训练单元8067,用于基于第二目标训练数据集对更新后的第一数据筛选模型进行训练,以此类推,直至满足第二训练终止条件,得到第二数据筛选模型。
在一种可能实现方式中,确定单元8063,用于响应于任一源语言训练数据的筛选结果为第一结果,将第一权重值作为任一源语言训练数据的权重值;响应于任一源语言训练数据的筛选结果为第二结果,将第二权重值作为任一源语言训练数据的权重值。
在一种可能实现方式中,参见图10,第一训练模块806,还包括:
获取单元8068,用于获取与第一目标训练数据集中的各个目标源语言训练数据分别对应的标注语言训练数据,各个目标源语言训练数据的筛选结果均为第一结果;
参见图10,第一训练模块806,还包括:
添加单元8069,用于将各个目标源语言训练数据和与各个目标源语言训练数据分别对应的标注语言训练数据作为训练数据添加至第二训练数据集中;
训练单元8067,还用于基于第二训练数据集对第一翻译模型进行训练,得到第二翻译模型;
获取单元8068,还用于基于第二翻译模型和第一翻译模型,获取第一权重值。
在一种可能实现方式中,获取单元8068,还用于对于第一目标训练数据集中的任一源语言训练数据,基于任一源语言训练数据中的各个子数据,获取任一源语言训练数据的第一特征;基于任一源语言训练数据和第三翻译模型,获取任一源语言训练数据的第二特征;基于第一特征和第二特征,获取任一源语言训练数据的目标特征。
在一种可能实现方式中,获取单元8068,还用于基于任一源语言训练数据中的各个子数据的词嵌入特征,获取任一源语言训练数据的第三特征;基于任一源语言训练数据中的各个子数据和已有语料数据库的比对结果,获取任一源语言训练数据的第四特征;基于任一源语言训练数据中的各个子数据的数量,确定任一源语言训练数据的长度,基于任一源语言训练数据的长度,获取任一源语言训练数据的第五特征;基于第三特征、第四特征和第五特征,获取任一源语言训练数据的第一特征。
在一种可能实现方式中,获取单元8068,还用于基于第三翻译模型,获取任一源语言训练数据的翻译数据,基于翻译数据的词嵌入特征,获取任一源语言训练数据的第六特征;基于第三翻译模型,获取与任一源语言训练数据中的各个子数据分别对应的目标翻译子数据,基于各个子数据分别对应的目标翻译子数据的词嵌入特征,获取任一源语言训练数据的第七特征,任一子数据对应的目标翻译子数据的翻译概率在任一子数据对应的各个候选翻译子数据的翻译概率中最大;获取各个子数据分别对应的目标翻译子数据的翻译概率,基于各个子数据分别对应的目标翻译子数据的翻译概率和翻译数据的长度,获取任一源语言训练数据的第八特征;基于第六特征、第七特征和第八特征,获取任一源语言训练数据的第二特征。
在一种可能实现方式中,生成单元8064,用于响应于任一源语言训练数据的筛选结果为第一结果,基于任一源语言训练数据的目标特征、第一结果、第一权重值和参考源语言训练数据的目标特征,生成与任一源语言训练数据对应的第一候选数据;
响应于任一源语言训练数据的筛选结果为第二结果,基于任一源语言训练数据的目标特征、第二结果、第二权重值和参考源语言训练数据的目标特征,生成与任一源语言训练数据对应的第二候选数据。
在一种可能实现方式中,添加单元8069,还用于将与第一目标训练数据集中的各个源语言训练数据分别对应的候选数据中的第一候选数据添加至第一候选数据集中,将与第一目标训练数据集中的各个源语言训练数据分别对应的候选数据中的第二候选数据添加至第二候选数据集中;
选取单元8065,还用于在第一候选数据集和第二候选数据集中进行等比例选取,得到目标数量的候选数据。
在一种可能实现方式中,更新单元8066,用于基于目标数量的候选数据,更新与第一数据筛选模型对应的目标函数;根据更新后的目标函数,计算与第一数据筛选模型对应的损失函数;基于损失函数,更新第一数据筛选模型的参数,得到更新后的第一数据筛选模型。
在一种可能实现方式中,满足第二训练终止条件,包括:
第一训练数据集中不存在满足条件的目标训练数据集,满足条件的目标训练数据集中的各个源语言训练数据的目标特征未进行过筛选处理;或者,
筛选结果为第一结果的源语言训练数据的数量达到数量阈值。
在本申请实施例中,基于利用强化学习算法训练得到的目标数据筛选模型,对待筛选数据集中的各个源语言数据进行筛选,进而基于筛选后的目标源语言数据和与目标源语言数据对应的标注语言数据获取机器翻译模型。在此种数据处理的过程中,目标数据筛选模型中的筛选规则为机器在强化学习的过程中自动学习出来的,目标数据筛选模型的适应场景广泛,筛选后的源语言数据的质量较高,使得基于筛选后的源语言数据和与筛选后的源语言数据对应的标注语言数据获取的机器翻译模型的翻译性能较好。
需要说明的是,上述实施例提供的装置在实现其功能时,仅以上述各功能模块的划分进行举例说明,实际应用中,能够根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的装置与方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
图11是本申请实施例提供的一种数据处理设备的结构示意图,示例性地,该数据处理设备为服务器,该服务器可因配置或性能不同而产生比较大的差异,该服务器包括一个或多个处理器(Central Processing Units,CPU)1101和一个或多个存储器1102,其中,该一个或多个存储器1102中存储有至少一条程序代码,该至少一条程序代码由该一个或多个处理器1101加载并执行,以实现上述各个方法实施例提供的数据处理方法。当然,该服务器 还能够具有有线或无线网络接口、键盘以及输入输出接口等部件,以便进行输入输出,该服务器还能够包括其他用于实现设备功能的部件,在此不做赘述。
在示例性实施例中,还提供了一种计算机设备,该计算机设备包括处理器和存储器,该存储器中存储有至少一条程序代码。该至少一条程序代码由一个或者一个以上处理器加载并执行,以实现上述任一种数据处理方法。
在示例性实施例中,还提供了一种非临时性计算机可读存储介质,该非临时性计算机可读存储介质中存储有至少一条程序代码,该至少一条程序代码由计算机设备的处理器加载并执行,以实现上述任一种数据处理方法。
可选地,上述非临时性计算机可读存储介质是只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、只读光盘(Compact Disc Read-Only Memory,CD-ROM)、磁带、软盘和光数据存储设备等。
在示例性实施例中,还提供了一种计算机程序产品,该计算机程序产品中存储有至少一段计算机程序,该至少一段计算机程序由计算机设备的处理器加载并执行,以实现上述任一种数据处理方法。
应当理解的是,在本文中提及的“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示存在三种关系,例如,A和/或B,表示:单独存在A,同时存在A和B,单独存在B这三种情况。字符“/”一般表示前后关联对象是一种“或”的关系。
以上所述仅为本申请的示例性实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (15)

  1. 一种数据处理方法,其中,所述方法应用于计算机设备,所述方法包括:
    获取待筛选数据集,所述待筛选数据集包括多个待筛选的源语言数据;
    基于目标数据筛选模型,对所述待筛选数据集中的各个源语言数据进行筛选,得到筛选后的目标源语言数据,所述目标数据筛选模型利用强化学习算法训练得到;
    获取与所述目标源语言数据对应的标注语言数据,基于所述目标源语言数据和所述标注语言数据获取机器翻译模型。
  2. 根据权利要求1所述的方法,其中,所述基于目标数据筛选模型,对所述待筛选数据集中的各个源语言数据进行筛选之前,所述方法还包括:
    初始化第一训练数据集,所述第一训练数据集包括多个源语言训练数据;
    基于初始化的第一训练数据集,利用强化学习算法对第一数据筛选模型进行训练,得到第二数据筛选模型;
    响应于不满足第一训练终止条件,重新初始化所述第一训练数据集,基于重新初始化的第一训练数据集,利用强化学习算法对所述第二数据筛选模型进行训练,得到第三数据筛选模型;以此类推,直至满足所述第一训练终止条件,得到目标数据筛选模型。
  3. 根据权利要求2所述的方法,其中,所述基于初始化的第一训练数据集,利用强化学习算法对第一数据筛选模型进行训练,得到第二数据筛选模型,包括:
    将所述初始化的第一训练数据集划分为至少一个目标训练数据集;
    调用所述第一数据筛选模型对第一目标训练数据集中的各个源语言训练数据的目标特征进行筛选处理,得到所述第一目标训练数据集中的各个源语言训练数据的筛选结果,所述第一目标训练数据集为所述至少一个目标训练数据集中的第一个目标训练数据集;
    对于所述第一目标训练数据集中的任一源语言训练数据,基于所述任一源语言训练数据的筛选结果,确定所述任一源语言训练数据的权重值;
    基于所述任一源语言训练数据的目标特征、所述任一源语言训练数据的筛选结果、所述任一源语言训练数据的权重值和参考源语言训练数据的目标特征,生成与所述任一源语言训练数据对应的候选数据,所述参考源语言训练数据为第二目标训练数据集中与所述任一源语言训练数据对应的源语言数据,所述第二目标训练数据集为所述至少一个目标训练数据集中的所述第一目标训练数据集的下一个目标训练数据集;
    基于与所述第一目标训练数据集中的各个源语言训练数据分别对应的候选数据,选取目标数量的候选数据;基于所述目标数量的候选数据,更新所述第一数据筛选模型的参数,得到更新后的第一数据筛选模型;
    基于所述第二目标训练数据集对所述更新后的第一数据筛选模型进行训练,以此类推,直至满足第二训练终止条件,得到第二数据筛选模型。
  4. 根据权利要求3所述的方法,其中,所述基于所述任一源语言训练数据的筛选结果,确定所述任一源语言训练数据的权重值,包括:
    响应于所述任一源语言训练数据的筛选结果为第一结果,将第一权重值作为所述任一源语言训练数据的权重值;
    响应于所述任一源语言训练数据的筛选结果为第二结果,将第二权重值作为所述任一源语言训练数据的权重值。
  5. 根据权利要求4所述的方法,其中,所述响应于所述任一源语言训练数据的筛选结果为第一结果,将第一权重值作为所述任一源语言训练数据的权重值之前,所述方法还包括:
    获取与所述第一目标训练数据集中的各个目标源语言训练数据分别对应的标注语言训练数据,所述各个目标源语言训练数据的筛选结果均为第一结果;
    将所述各个目标源语言训练数据和与所述各个目标源语言训练数据分别对应的标注语言训练数据作为训练数据添加至第二训练数据集中;
    基于所述第二训练数据集对第一翻译模型进行训练,得到第二翻译模型;
    基于所述第二翻译模型和所述第一翻译模型,获取所述第一权重值。
  6. 根据权利要求3-5任一所述的方法,其中,所述调用所述第一数据筛选模型对所述第一目标训练数据集中的各个源语言训练数据的目标特征进行筛选处理之前,所述方法还包括:
    对于所述第一目标训练数据集中的任一源语言训练数据,基于所述任一源语言训练数据中的各个子数据,获取所述任一源语言训练数据的第一特征;
    基于所述任一源语言训练数据和第三翻译模型,获取所述任一源语言训练数据的第二特征;
    基于所述第一特征和所述第二特征,获取所述任一源语言训练数据的目标特征。
  7. 根据权利要求6所述的方法,其中,所述基于所述任一源语言训练数据中的各个子数据,获取所述任一源语言训练数据的第一特征,包括:
    基于所述任一源语言训练数据中的各个子数据的词嵌入特征,获取所述任一源语言训练数据的第三特征;
    基于所述任一源语言训练数据中的各个子数据和已有语料数据库的比对结果,获取所述任一源语言训练数据的第四特征;
    基于所述任一源语言训练数据中的各个子数据的数量,确定所述任一源语言训练数据的长度,基于所述任一源语言训练数据的长度,获取所述任一源语言训练数据的第五特征;
    基于所述第三特征、所述第四特征和所述第五特征,获取所述任一源语言训练数据的第一特征。
  8. 根据权利要求6所述的方法,其中,所述基于所述任一源语言训练数据和第三翻译模型,获取所述任一源语言训练数据的第二特征,包括:
    基于所述第三翻译模型,获取所述任一源语言训练数据的翻译数据,基于所述翻译数据的词嵌入特征,获取所述任一源语言训练数据的第六特征;
    基于所述第三翻译模型,获取与所述任一源语言训练数据中的各个子数据分别对应的目标翻译子数据,基于所述各个子数据分别对应的目标翻译子数据的词嵌入特征,获取所述任一源语言训练数据的第七特征,任一子数据对应的目标翻译子数据的翻译概率在所述任一子数据对应的各个候选翻译子数据的翻译概率中最大;
    获取所述各个子数据分别对应的目标翻译子数据的翻译概率,基于所述各个子数据分别对应的目标翻译子数据的翻译概率和所述翻译数据的长度,获取所述任一源语言训练数据的第八特征;
    基于所述第六特征、所述第七特征和所述第八特征,获取所述任一源语言训练数据的第二特征。
  9. 根据权利要求4所述的方法,其中,所述基于所述任一源语言训练数据的目标特征、所述任一源语言训练数据的筛选结果、所述任一源语言训练数据的权重值和参考源语言训练数据的目标特征,生成与所述任一源语言训练数据对应的候选数据,包括:
    响应于所述任一源语言训练数据的筛选结果为第一结果,基于所述任一源语言训练数据的目标特征、所述第一结果、所述第一权重值和所述参考源语言训练数据的目标特征, 生成与所述任一源语言训练数据对应的第一候选数据;
    响应于所述任一源语言训练数据的筛选结果为第二结果,基于所述任一源语言训练数据的目标特征、所述第二结果、所述第二权重值和所述参考源语言训练数据的目标特征,生成与所述任一源语言训练数据对应的第二候选数据。
  10. 根据权利要求9所述的方法,其中,所述基于与所述第一目标训练数据集中的各个源语言训练数据分别对应的候选数据,选取目标数量的候选数据,包括:
    将与所述第一目标训练数据集中的各个源语言训练数据分别对应的候选数据中的第一候选数据添加至第一候选数据集中,将与所述第一目标训练数据集中的各个源语言训练数据分别对应的候选数据中的第二候选数据添加至第二候选数据集中;
    在所述第一候选数据集和所述第二候选数据集中进行等比例选取,得到目标数量的候选数据。
  11. 根据权利要求3所述的方法,其中,所述基于所述目标数量的候选数据,更新所述第一数据筛选模型的参数,得到更新后的第一数据筛选模型,包括:
    基于所述目标数量的候选数据,更新与所述第一数据筛选模型对应的目标函数;
    根据更新后的目标函数,计算与所述第一数据筛选模型对应的损失函数;
    基于所述损失函数,更新所述第一数据筛选模型的参数,得到更新后的第一数据筛选模型。
  12. 根据权利要求3所述的方法,其中,所述满足第二训练终止条件,包括:
    所述第一训练数据集中不存在满足条件的目标训练数据集,所述满足条件的目标训练数据集中的各个源语言训练数据的目标特征未进行过筛选处理;或者,
    筛选结果为第一结果的源语言训练数据的数量达到数量阈值。
  13. 一种计算机设备,其中,所述计算机设备包括处理器和存储器,所述存储器中存储有至少一条程序代码,所述至少一条程序代码由所述处理器加载并执行,以实现如权利要求1至12任一所述的数据处理方法。
  14. 一种非临时性计算机可读存储介质,其中,所述非临时性计算机可读存储介质中存储有至少一条程序代码,所述至少一条程序代码由处理器加载并执行,以实现如权利要 求1至12任一所述的数据处理方法。
  15. 一种计算机程序产品,其中,所述计算机程序产品中存储有至少一段计算机程序,所述至少一段计算机程序由处理器加载并执行,以实现如权利要求1至12任一所述的数据处理方法。
