CN112602098A - Contrastive sequence-to-sequence data selector - Google Patents

Contrastive sequence-to-sequence data selector

Info

Publication number
CN112602098A
Authority
CN
China
Prior art keywords
data
target
batch
pairs
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201980046281.5A
Other languages
Chinese (zh)
Inventor
Wei Wang
Bowen Liang
Macduff Hughes
Taro Watanabe
Tetsuji Nakagawa
Alexander Rudnick
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to US201862668650P priority Critical
Priority to US62/668,650 priority
Application filed by Google LLC filed Critical Google LLC
Priority to PCT/US2019/026003 priority patent/WO2019217013A1/en
Publication of CN112602098A publication Critical patent/CN112602098A/en
Pending legal-status Critical Current

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Computing arrangements based on biological models using neural network models
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00: Computing arrangements based on specific mathematical models
    • G06N7/005: Probabilistic networks
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network-specific arrangements or communication protocols supporting networked applications
    • H04L67/10: Network-specific arrangements or communication protocols supporting networked applications in which an application is distributed across nodes in the network

Abstract

The method (1000) includes generating a base model (134) by training on a first data set (132) of data pairs (133), and generating an adapted model (144) by training the base model on a second data set (142) of data pairs (143). The method also includes determining a comparison score (154) for each data pair (153) of a third data set (152) of data pairs using the base model and the adapted model. The comparison score indicates a probability of the quality of the respective data pair. The method also includes training a target model (230) using the data pairs of the third data set and the comparison scores.

Description

Contrastive sequence-to-sequence data selector
Technical Field
The present disclosure relates to a contrastive sequence-to-sequence data selector for training a neural translation model on noisy data.
Background
A neural translation model learns to distribute probability mass over translations. A model trainer typically trains the model on parallel data so that better translations become more probable than worse ones. If the parallel data used for training is noisy, the learned distribution may be inaccurate, resulting in inaccurate translations.
However, large-scale, high-quality data that is clean and matches the test domain is rare. An automatic data miner typically produces parallel data, and a sentence aligner then processes it. This processing can introduce severe noise into the parallel data. Typically, a trainer treats this as a classification problem, training a convolutional network on a small amount of clean (or in-domain) data to classify data as good or bad. The trainer then uses the selected data to train a system having a different architecture than the selector. Consequently, the data that the selector identifies as good is not necessarily good data for the final model.
Disclosure of Invention
One aspect of the present disclosure provides a method for training a target model. The method includes generating, by data processing hardware, a base model by training on a first data set of data pairs, and generating, by the data processing hardware, an adapted model by training the base model on a second data set of data pairs. The method also includes determining, by the data processing hardware, a comparison score for each data pair of a third data set of data pairs using the base model and the adapted model. The comparison score indicates a probability of the quality of the respective data pair. The method also includes training, by the data processing hardware, the target model using the data pairs of the third data set and the comparison scores.
Implementations of the disclosure may include one or more of the following optional features. In some implementations, training the target model further includes using data pairs of the third data set that satisfy a threshold comparison score. In some examples, the method further includes: determining, by the data processing hardware, that the target model is the same size as the base model; replacing, by the data processing hardware, the base model with the adapted model; replacing, by the data processing hardware, the adapted model with the target model; determining, by the data processing hardware, a comparison score for each data pair of a fourth data set of data pairs using the base model and the replaced adapted model; and training, by the data processing hardware, a subsequent target model using the data pairs of the fourth data set and the comparison scores. In other examples, the target model is larger than the base model.
The first data set may include random data. Here, when the first data set includes random data, the second data set may include cleaner data than the random data of the first data set. Additionally or alternatively, the comparison score may include a Kullback-Leibler (KL) divergence, and/or each data set may include sentence language pairs.
In some implementations, the method further includes sorting, by the data processing hardware, the data pairs of the third data set based on the respective comparison scores. In these examples, training the target model may further include generating a plurality of data batches and training the target model using each data batch. Here, each data batch includes at least one data pair, the probability of including a selected data pair in a selected data batch is based on the respective comparison score of the selected data pair, and the probability increases as the respective comparison score increases. Further, in these examples, generating the plurality of data batches may include: determining a selection rate for each data batch; determining a batch size for each data batch based on the selection rate and the number of data pairs in the third data set; selecting a number of data pairs from the third data set corresponding to the determined batch size; sorting the selected data pairs based on the respective comparison scores; and removing from the data batch a removal rate of the selected data pairs having the lowest comparison scores, the removal rate comprising an inverse of the selection rate. The selection rate may decrease with training time. In this case, the batch size may be equal to a fixed batch size divided by the selection rate.
Another aspect of the present disclosure provides a system for training a target model. The system includes data processing hardware and memory hardware in communication with the data processing hardware and storing instructions that, when executed by the data processing hardware, cause the data processing hardware to perform operations. The operations include generating a base model by training using a first data set of data pairs, and generating an adapted model by training the base model on a second data set of data pairs. The operations further include determining a comparison score for each data pair of the third data set of data pairs using the base model and the adaptation model. The comparison score indicates the probability of quality of the respective data pair. The operations further include training a target model using the data pairs of the third data set and the comparison scores.
This aspect may include one or more of the following optional features. In some implementations, training the target model further includes using data pairs of the third data set that satisfy a threshold comparison score. In some examples, the operations further include: determining that the target model is the same size as the base model; replacing the base model with the adapted model; replacing the adapted model with the target model; determining a comparison score for each data pair of a fourth data set of data pairs using the base model and the replaced adapted model; and training a subsequent target model using the data pairs of the fourth data set and the comparison scores. In other examples, the target model is larger than the base model.
The first data set may include random data. Here, when the first data set includes random data, the second data set may include cleaner data than the random data of the first data set. Additionally or alternatively, the comparison score may include a Kullback-Leibler (KL) divergence, and/or each data set may include sentence language pairs.
In some implementations, the operations further include sorting the data pairs of the third data set based on the respective comparison scores. In these examples, training the target model may further include generating a plurality of data batches and training the target model using each data batch. Here, each data batch includes at least one data pair, the probability of including a selected data pair in a selected data batch is based on the respective comparison score of the selected data pair, and the probability increases as the respective comparison score increases. Further, in these examples, generating the plurality of data batches may include: determining a selection rate for each data batch; determining a batch size for each data batch based on the selection rate and the number of data pairs in the third data set; selecting a number of data pairs from the third data set corresponding to the determined batch size; sorting the selected data pairs based on the respective comparison scores; and removing from the data batch a removal rate of the selected data pairs having the lowest comparison scores, the removal rate comprising an inverse of the selection rate. The selection rate may decrease with training time. In this case, the batch size may be equal to a fixed batch size divided by the selection rate.
Drawings
FIG. 1 is a schematic diagram of an example system for a contrastive sequence-to-sequence data selector.
FIG. 2 is a schematic diagram of example components of the target model trainer of FIG. 1.
FIG. 3 is a schematic diagram of the example data batch generator of FIG. 2.
FIG. 4 is an exemplary graph of the selection rate decreasing over time.
FIG. 5 is an exemplary graph of data batch size increasing over time.
FIG. 6 is an exemplary diagram of dynamic data scheduling.
FIG. 7 is an example flow diagram for training a target model using the example system of FIG. 1.
FIGS. 8A and 8B are example graphs comparing the comparison scores to manual scores for the example system of FIG. 1.
FIG. 9 is another example flow diagram for training a target model using the example system of FIG. 1.
FIG. 10 is a flow diagram of an example method for iteratively training a target model.
FIG. 11 is a schematic diagram of an example computing device that may be used to implement the systems and methods described herein.
Like reference symbols in the various drawings indicate like elements.
Detailed Description
Embodiments herein are directed to a model trainer configured to generate a small sequence-to-sequence base model (e.g., for a neural network) by training it on a first data set of noisy data pairs. Noisy data is defined as data that is unclean or non-parallel, or that does not closely match the test domain. Such noisy data can lead to inaccurate probability distributions over the examples. The model trainer then generates a small sequence-to-sequence adapted model by further training the base model on a second data set of data pairs. The second data set comprises data of higher quality than the first data set. The model trainer then generates the target model by determining a comparison score for each data pair of a third data set, ranking the third data set according to the comparison scores, and selecting the best-quality portion of the ranked data set to train the target model.
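The three-stage pipeline just described can be sketched at a high level. This is a hypothetical illustration: the function names and interfaces (`train`, `fine_tune`, `score_pair`, `fit_target`) are assumptions for exposition, not the patent's implementation.

```python
def train_target_model(first_set, second_set, third_set,
                       train, fine_tune, score_pair, fit_target):
    """Sketch of the model trainer 120 pipeline:
    - train: builds the base model from the noisy first data set;
    - fine_tune: adapts the base model on the cleaner second data set;
    - score_pair: comparison score for one pair under both models;
    - fit_target: trains the target model on the scored third data set."""
    base = train(first_set)                 # base model 134
    adapted = fine_tune(base, second_set)   # adapted model 144
    scores = [score_pair(base, adapted, pair) for pair in third_set]
    return fit_target(third_set, scores)    # target model 230
```

In practice the two selector models would be small neural translation models and `fit_target` would apply the batch-scheduling strategy described later.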
Referring to FIG. 1, in some implementations, an example system 100 includes a computing system 110 that executes a model trainer 120. Computing system 110 may correspond to a remote system or computing device, such as a desktop workstation or a laptop computer workstation. The remote system 110 may be a single computer, multiple computers, or a distributed system (e.g., a cloud environment) with extensible/resilient computing resources 112 (e.g., data processing hardware) and/or storage resources 114 (e.g., storage hardware).
In some examples, the data processing hardware 112 of the computing system 110 executes a model trainer 120 that includes a base model generator 130, an adaptation model generator 140, a score determiner 150, and a target model trainer 200. The base model generator 130 receives a first data set 132 of sentence data pairs 133, 133a-n and trains the sequence-to-sequence base model 134 until convergence. Each sentence data pair 133 includes a first sentence in a first language and a second sentence that is a potential translation of the first sentence into a second language. The first data set 132 typically includes random, noisy data.
The adaptation model generator 140 uses the sequence-to-sequence base model 134 (generated by the base model generator 130 from the first data set 132 of sentence data pairs 133) and a second data set 142 of sentence pairs 143, 143a-n to incrementally train the sequence-to-sequence adapted model 144. As with the first data set 132, each sentence data pair 143a-n of the second data set 142 includes a first sentence in a first language and a second sentence that is a potential translation of the first sentence into a second language. The second data set 142 may include cleaner data than the random data of the first data set 132. For example, the second data set 142 may include a relatively small amount (compared to the first data set 132) of manually curated, high-quality data. This causes the adapted model 144 to shift probability mass from less parallel (noisier) data toward more parallel (cleaner) data. This shift allows the contrastive information to be used to determine the quality of data pairs evaluated by the base model 134 and the adapted model 144.
Typically, for a given data set of sentence pairs S = {s_0, …, s_i, …}, where s_i is the i-th sentence pair, the score determiner 150 performing the data selection method assigns each s_i a score 154 based on a scoring function. The score reflects a desired quality: for example, the higher the score, the cleaner the data (or the more closely it matches the domain, or the more difficult it is for curriculum learning, or the more uncertain it is for active learning). The trainer 120 may use the scores 154 as a hard data filter, e.g., based on meeting a threshold comparison score. Alternatively, the trainer 120 may use the scores 154 softly, e.g., for weighting.
The selection method also defines a policy for scheduling data based on the scores 154. With a static selection policy, data is selected offline and used in random order during training. A dynamic selection policy, on the other hand, schedules data in an order based on the scoring function during ongoing training. Static selection is a special case of dynamic selection. Dynamic scheduling through a dynamic selection policy produces a dynamic example-sampling effect that can achieve example weighting.
Still referring to FIG. 1, the score determiner 150 is configured to receive the third data set 152 of sentence pairs 153, 153a-n (e.g., data pairs 153, 153a-n), the sequence-to-sequence base model 134, and the sequence-to-sequence adapted model 144, and to determine a respective comparison score 154, 154a-n for each sentence pair 153 in the third data set 152. Each comparison score 154 indicates a probability of the quality or cleanliness of the respective data pair 153. Optionally, the comparison score 154 includes a Kullback-Leibler (KL) divergence. The KL divergence (also called relative entropy) measures how one probability distribution deviates from a second, expected probability distribution. Specifically, the KL divergence between the adapted model 144, p(S_i) = p(t_i | s_i), and the base model 134, q(S_i) = q(t_i | s_i), can be used to determine the comparison score 154 (or quality metric) for a sentence pair S_i = (s_i, t_i) according to the following equation:
score(S_i) = log p(t_i | s_i) - log q(t_i | s_i)    (1)
Because the distribution p is the cleaner model, it shifts probability mass from poorer data toward better data. Therefore, if p(S_i) is greater than q(S_i), then S_i is likely to provide good information gain. However, even with information gain, S_i may still be a rare instance or usage, and p(S_i) is used to determine this. Because the score determiner 150 uses the probabilities of the sequence-to-sequence base model 134 and the adapted model 144, no separate measure of data quality, cleanliness, or domain is required. Because data quality relates to the probability mass distribution, high-quality data (e.g., clean or in-domain data) enables a model to generate a more accurate distribution. Thus, equation (1) can serve as a unified metric for data quality.
As discussed in more detail below, the target model trainer 200 receives the comparison scores 154 and the third data set 152 and uses them to train the target model 230. In some examples, the target model trainer 200 ranks the data pairs 153 of the third data set 152 based on the respective comparison scores 154. Because training time correlates with model size, the target model 230 may be larger than both the adapted model 144 and the base model 134. Thus, generating a small base model 134 and an equally small adapted model 144 significantly reduces computational overhead and trains much faster than the larger target model 230, saving a significant amount of time. Ideally, the base model 134 and the adapted model 144 share a similar architecture (e.g., sequence-to-sequence) with the target model 230, as the similarity between the models 134, 144, 230 enables the selector (the base model 134 and the adapted model 144) to select the best sentence pairs for the target model 230.
Referring now to FIG. 2, in some implementations, the target model trainer 200 includes a data batch generator 300 that uses the third data set 152 and the comparison scores 154 to generate data batches 210, 210a-n of sentence pairs 153. That is, each data batch 210 is a subset of the sentence pairs 153 of the third data set 152. The data batch generator 300 generates a plurality of data batches 210, each containing a different subset of the data pairs 153. The comparison score 154 of a selected sentence pair 153 determines the probability that the data pair 153 is included in a selected data batch 210. For example, a higher comparison score 154 reflects a correspondingly higher probability of including the data pair 153 in the data batch 210. The trainer 600 trains the target model 230 using the data batches 210.
The comparison scores 154 may be used to rank the sentence pairs 153, statically select the top x% of the data to train the target model 230, and discard the remaining sentence pairs 153. However, when the training data is small, this static offline selection is problematic, because discarding part of the data (e.g., the bottom 1 - x%) reduces the size of the training data. Furthermore, when the training data is mostly lower quality (e.g., largely non-parallel or out of domain), only a small x% yields selected data of sufficiently good quality. In both cases, the selected training data may be insufficient to train the target model 230, while a larger x% makes the selected training data noisy again, compromising the effectiveness of the data selection.
The dynamic data scheduling approach allows the model trainer 120 to train the target model 230 over the entire data set while still benefiting from the quality of data selection. The model trainer 120, via the target model trainer 200, achieves this by training the target model 230 on unselected data at the beginning of training and progressively selecting higher-quality data toward the end of training. In other words, dynamic data scheduling allows the target model trainer 200 to exploit training data of different qualities, moving from lower-quality training data to higher-quality training data.
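The static offline selection described above can be sketched as follows. This is a hypothetical illustration of top-x% filtering, not the patent's preferred approach (which is the dynamic scheduling described next):

```python
def static_top_x_percent(pairs, scores, x):
    """Keep the top x% of sentence pairs by comparison score and
    discard the rest (static offline selection)."""
    ranked = sorted(zip(pairs, scores), key=lambda ps: ps[1], reverse=True)
    keep = max(1, int(len(ranked) * x / 100))  # at least one pair survives
    return [pair for pair, _ in ranked[:keep]]
```

The drawback is visible directly in the code: everything outside `ranked[:keep]` is lost for the whole of training, however large or small the data set.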
Typically, a model trainer uses random data to train a model with data batches of a fixed batch size b (e.g., b = 256). For example, a typical trainer may randomly select 256 data pairs from the data set for each data batch. For dynamic data selection, however, the data batch size b(t) increases over time, and data selection is used to keep only the higher-quality data so that the effective batch size stays fixed. Referring now to FIG. 3, to generate the data batches 210, the data batch generator 300 of the target model trainer 200 may include a selection rate determiner 310 that determines a selection rate 312, 312a-n for each data batch 210. For example, the selection rate r(t) 312 may be defined as a function of the global step (time) t, as follows:
r(t) = max(R, 0.5^(t/T))
referring to fig. 4, an example graph 400 illustrates that selection R (T)312 decreases exponentially over time such that it is halved every T steps until it reaches a determined lower limit value R (e.g., R ═ 0.2). That is, the selectivity r (t)312 decreases with each generated data batch 210. The lower limit value R is determined to ensure that R (t) does not become too small to introduce a selection bias. Referring again to data batch generator 300 of FIG. 3, batch sizer 320 may determine a corresponding batch size 322, 322a-n for each data batch 210. The data batch size 322 may be based on the selection rate r (t)312 and the fixed batch size b 324. For example, the data batch size b (t)322 may be defined as follows:
FIG. 5 shows an example graph 500 depicting the data batch size b(t) 322 increasing as the selection rate r(t) 312 of FIG. 4 decreases, until the data batch size b(t) 322 reaches a maximum value b/R and remains there until training is complete. Thus, the selection rate r(t) 312 (FIG. 4) may decrease with training time. Referring again to FIG. 3, after the batch size determiner 320 determines the data batch size 322, the data pair selector 330 selects a number of data pairs 342 from the third data set 152 corresponding to the determined batch size b(t) 322. The selection is typically random, but other selection methods may be used. After selection, the data pair ranker 340 ranks the selected data pairs 342 based on the respective comparison score 154 of each selected data pair. The data pair remover 350 then removes from the data batch 210 a removal rate of the ranked data pairs having the lowest comparison scores 154. The removal rate equals the inverse of the selection rate, i.e., 1 - r(t). For example, when r(t) = 0.5, the 50% of the selected data pairs 342 having the lowest comparison scores 154 are removed from the data batch 210. Thus, the effective batch size for training the target model 230 remains the same as in typical training, but the data batch 210 contains only the top r(t) 312 fraction of the selected data pairs 153, so quality improves as training progresses (as t increases). For example, for b = 256 and r(t) = 0.5, b(t) equals 512, and the 256 pairs with the highest comparison scores (since r(t) = 0.5) make up the final batch.
Referring now to FIG. 6, as training time t progresses, the trainer 600 of the target model trainer 200 receives data batches 210 of increasingly higher quality (i.e., less noisy and cleaner), although this occurs per data batch rather than globally across all data. This reflects cross-batch example weighting. Typical example weighting operates within a data batch 210, with the model trainer assigning weights to examples according to their quality. Even though in-batch weighting can down-weight low-quality examples, the selector may still mix them in and contaminate the data with noise. Cross-batch example weighting instead increases the weight of good examples by using them more frequently in different, future data batches. In the example shown, the trainer 600 selects the deepest-shaded (best) example three times across three time steps and selects the shallowest-shaded (worst) example only once. Low-quality examples disappear from future data batches, decreasing their weight and improving the data quality of those batches. A target model 230 trained on successively higher-quality data batches 210 may generally achieve improved translation quality.
FIG. 7 illustrates a flow diagram 700 for training the target model 230 with dynamic contrastive data selection. The flow diagram 700 may be described with reference to FIGS. 1-3. The score determiner 150 receives the third data set 152 of sentence pairs 153, 153a-n, the sequence-to-sequence base model 134, and the sequence-to-sequence adapted model 144 for determining a respective comparison score 154, 154a-n for each sentence pair 153 in the third data set 152. In particular, the score determiner 150 scores b(t) random examples and feeds the selected examples to the target model trainer 200 to compute the loss. When training a neural network, the loss reflects the error the network makes relative to the best or most reliable model (e.g., the gold standard) available at that point in training. The target model trainer 200 trains only the parameters of the target model 230, while the parameters of the base model 134 and the adapted model 144 remain frozen. As previously described, the contrastive models may be much smaller than the target model 230 to reduce computational overhead. Importantly, the comparison score 154 correlates with data quality. Here, the model trainer 120 determines that the size of the target model 230 (e.g., 8x1024) is larger than the size of the base model 134 (e.g., 3x512) and the size of the adapted model 144 (e.g., 3x512).
Referring now to FIGS. 8A and 8B, manual cleanliness scores for two thousand (2,000) sentence pairs are plotted against the associated comparison scores 154. In FIG. 8A, English-to-Spanish and English-to-Chinese translations are averaged and plotted against the oracle (manual) scores in graph 800a. In FIG. 8B, English-to-Bengali and English-to-Hindi translations are averaged and plotted against the oracle scores in graph 800b. These graphs 800a, 800b show that as the comparison score 154 decreases, the data quality decreases accordingly.
Referring now to the flowchart 900 of FIG. 9, in some implementations, the model trainer 120 determines that the target model 230 is the same size (e.g., 3x512) as the base model 134 and the adapted model 144. If the target model 230 is the same size as the base model 134 and the adapted model 144, the model trainer 120 replaces the base model 134 with the adapted model 144 and replaces the adapted model 144 with the target model 230. The model trainer 120 then determines a comparison score 154 for each data pair of a fourth data set 910 of data pairs 911 using the new base model 134 and the new adapted model 144. The model trainer 120 then trains a subsequent target model 230 using the data pairs of the fourth data set 910. This process may continue indefinitely, progressively refining the target model 230. If the size of the target model 230 differs from that of the adapted model 144 and the base model 134, the model trainer 120 may derive from the target model 230 a modified target model that is the same size as the adapted model 144 and the base model 134. After the training iteration completes, the original-size target model 230 may be updated or regenerated using the modified target model.
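The iterative refinement loop of FIG. 9 can be sketched as follows. This is a hypothetical illustration assuming the target model is the same size as the selector models; `score_fn` and `train_target` stand in for the scoring and training steps described above.

```python
def refine(base, adapted, data_sets, score_fn, train_target):
    """One refinement round per data set: score the data with the
    current (base, adapted) pair, train a target model on it, then
    promote adapted -> base and target -> adapted for the next round."""
    target = None
    for data in data_sets:
        scores = [score_fn(base, adapted, pair) for pair in data]
        target = train_target(data, scores)
        base, adapted = adapted, target  # roll the selector models forward
    return target
```

Each round's target becomes the next round's cleaner selector, so the selector and the selected data can improve together.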
FIG. 10 is a flow diagram of an example method 1000 for training a target model using the contrastive sequence-to-sequence data selector. The flowchart begins at operation 1002, where the base model 134 is generated, by the data processing hardware 112, by training on the first data set 132 of data pairs 133. In some examples, the first data set 132 includes random data. At operation 1004, the method 1000 includes generating, by the data processing hardware 112, the adapted model 144 by training the base model 134 on the second data set 142 of data pairs 143. Alternatively, the second data set 142 may include data that is cleaner (e.g., curated by a person) than the random data of the first data set 132. At operation 1006, the method 1000 includes determining, by the data processing hardware 112, a comparison score 154 for each data pair 153 of the third data set 152 of data pairs 153, 153a-n using the base model 134 and the adapted model 144. The comparison score 154 may include a KL divergence. At operation 1008, the method 1000 further includes training, by the data processing hardware 112, the target model 230 using the data pairs 153 of the third data set 152 and the comparison scores 154. In some implementations, the method 1000 includes training the target model 230 using the data pairs 153 of the third data set 152 that satisfy a threshold comparison score 154. Each data set 132, 142, 152 may include sentence language pairs. Further, the target model 230 may be larger than the base model 134 and the adapted model 144.
In some examples, the method 1000 further includes sorting, by the data processing hardware 112, the data pairs 153 of the third data set 152 based on the respective comparison scores 154. Optionally, the method 1000 includes generating a plurality of data batches 210, where each data batch 210 includes at least one data pair 153, and where the probability that a selected data pair 153a is included in a selected data batch 210a is based on the respective comparison score 154a of the selected data pair 153a. As the respective comparison score 154a increases, the probability of including the selected data pair 153a increases. The method 1000 then includes training the target model 230 using each data batch 210. Generating the plurality of data batches 210 may include determining a selection rate 312 for each data batch 210 and determining a batch size 322 for each data batch 210, where the batch size 322 is based on the selection rate 312 and a fixed batch size 324. Further, generating the plurality of data batches 210 may include selecting a number of data pairs 153 from the third data set 152 corresponding to the determined batch size 322, sorting the selected data pairs 342 based on the respective comparison scores 154, and removing from the data batches 210 a removal rate of the selected pairs 342 having the lowest comparison scores 154. Optionally, the selection rate 312 decreases with training time. The batch size 322 may be equal to the fixed batch size 324 divided by the selection rate 312.
In some examples, the method 1000 includes determining, by the data processing hardware 112, that the size of the target model 230 is the same as the size of the base model 134. When the size of the target model 230 is the same as the size of the base model 134, the method 1000 further includes replacing, by the data processing hardware 112, the base model 134 with the adapted model 144, and replacing, by the data processing hardware 112, the adapted model 144 with the target model 230. The method 1000 then includes determining, by the data processing hardware 112, a contrastive score 154 for each data pair of a fourth data set 910 of data pairs 911 using the replaced base model 134 and the replaced adapted model 144, and training, by the data processing hardware 112, a subsequent target model 230 using the data pairs of the fourth data set 910 and their contrastive scores 154.
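The replacement step above amounts to a self-improvement loop: once the trained target model matches the base model's size, the adapted model is promoted to base, the fresh target is promoted to adapted, and scoring restarts on the next data set. A hypothetical driver for that loop is sketched below; `score_fn` and `train_fn` are stand-ins for operations 1006 and 1008 and are assumptions, not APIs from the source.

```python
def selection_rounds(base, adapted, data_sets, score_fn, train_fn):
    """Run one training round per data set, promoting models between
    rounds as described: base <- adapted, adapted <- new target."""
    target = None
    for data_set in data_sets:
        # Operation 1006: score every pair with the current model pair.
        scored = [(pair, score_fn(base, adapted, pair)) for pair in data_set]
        # Operation 1008: train a new target model on the scored pairs.
        target = train_fn(scored)
        base, adapted = adapted, target  # promote models for the next round
    return target
```

Each round thus scores its data set with a strictly newer pair of models than the round before, which is the sense in which the fourth data set 910 is scored with the replaced models.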
FIG. 11 is a schematic diagram of an example computing device 1100 that can be used to implement the systems and methods described herein. Computing device 1100 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not intended to limit implementations of the inventions described and/or claimed in this document.
The computing device 1100 includes: a processor 1110 (e.g., data processing hardware), a memory 1120, a storage device 1130, a high-speed interface/controller 1140 connected to the memory 1120 and the high-speed expansion ports 1150, and a low-speed interface/controller 1160 connected to the low-speed bus 1170 and the storage device 1130. Each of the components 1110, 1120, 1130, 1140, 1150, and 1160 is interconnected using a different bus, and may be mounted on a common motherboard or in other manners as needed. Processor 1110 may process instructions for execution within computing device 1100, including instructions stored in memory 1120 or on storage device 1130 to display graphical information for a Graphical User Interface (GUI) on an external input/output device, such as display 1180 coupled to high-speed interface 1140. In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories and types of memory, as desired. Also, multiple computing devices 1100 may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system).
The memory 1120 stores information non-transitorily within the computing device 1100. The memory 1120 may be a computer-readable medium, volatile memory unit(s), or non-volatile memory unit(s). The non-volatile memory 1120 may be a physical device used to store programs (e.g., sequences of instructions) or data (program state information) for use by the computing device 1100, on a temporary or permanent basis. Examples of non-volatile memory include, but are not limited to, flash memory and Read Only Memory (ROM)/Programmable Read Only Memory (PROM)/Erasable Programmable Read Only Memory (EPROM)/Electrically Erasable Programmable Read Only Memory (EEPROM) (e.g., typically used for firmware such as boot programs). Examples of volatile memory include, but are not limited to, Random Access Memory (RAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Phase Change Memory (PCM), and optical disks or magnetic tape.
The storage device 1130 is capable of providing mass storage for the computing device 1100. In some implementations, the storage device 1130 is a computer-readable medium. In various embodiments, the storage device 1130 may be a floppy disk device, a hard disk device, an optical disk device, a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. In additional embodiments, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as the methods described above. The information carrier is a computer-readable or machine-readable medium, such as the memory 1120, the storage device 1130, or memory on the processor 1110.
The high speed controller 1140 manages bandwidth-intensive operations for the computing device 1100, while the low speed controller 1160 manages lower bandwidth-intensive operations. Such allocation of functions is merely exemplary. In some embodiments, the high-speed controller 1140 is coupled to memory 1120, display 1180 (e.g., through a graphics processor or accelerator), and to high-speed expansion ports 1150, which high-speed expansion ports 1150 may accept various expansion cards (not shown). In some embodiments, low-speed controller 1160 is coupled to storage device 1130 and low-speed expansion port 1190. The low-speed expansion port 1190 may include various communication ports (e.g., USB, bluetooth, ethernet, and wireless ethernet) that may be coupled to one or more input/output devices, e.g., a keyboard, a pointing device, a scanner, or a network device such as a switch or router, e.g., via a network adapter.
As shown, the computing device 1100 may be implemented in a variety of forms. For example, the computing device 1100 may be implemented as a standard server 1100a, or multiple times in a group of such servers 1100a, or as a laptop 1100b, or as part of a rack server system 1100 c.
Various implementations of the systems and techniques described here can be realized in digital electronic and/or optical circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A software application (i.e., a software resource) may refer to computer software that causes a computing device to perform a task. In some examples, a software application may be referred to as an "application," an "app," or a "program." Example applications include, but are not limited to, system diagnostic applications, system management applications, system maintenance applications, word processing applications, spreadsheet applications, messaging applications, media streaming applications, social networking applications, and gaming applications.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, non-transitory computer-readable medium, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical, or optical disks. However, a computer need not have such devices. Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, by way of example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
To provide for interaction with a user, one or more aspects of the disclosure can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, or a touch screen for displaying information to the user, and optionally a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device used by the user, for example, by sending web pages to a web browser on the user's client device in response to requests received from the web browser.
A number of embodiments have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. Accordingly, other implementations are within the scope of the following claims.

Claims (26)

1. A method (1000), comprising:
generating, by data processing hardware (112), a base model (134) by training on a first data set (132) of data pairs (133);
generating, by the data processing hardware (112), an adapted model (144) by training the base model (134) on a second data set (142) of data pairs (143);
determining, by the data processing hardware (112), a contrastive score (154) for each data pair (153) of a third data set (152) of data pairs (153) using the base model (134) and the adapted model (144), the contrastive score (154) representing a probability of quality of the respective data pair (153); and
training, by the data processing hardware (112), a target model (230) using the data pairs (153) of the third data set (152) and the contrastive scores (154).
2. The method (1000) of claim 1, wherein training the target model (230) further comprises using data pairs (153) of the third data set (152) that satisfy a threshold contrastive score (154).
3. The method (1000) of claim 1 or 2, further comprising: sorting, by the data processing hardware (112), the data pairs (153) of the third data set (152) based on the respective contrastive scores (154).
4. The method (1000) of claim 3, wherein training the target model (230) further comprises:
generating a plurality of data batches (210), wherein each data batch (210) includes at least one data pair, and wherein a probability of a selected data pair being included in a selected data batch (210) is based on the respective contrastive score (154) of the selected data pair, and wherein the probability increases as the respective contrastive score (154) increases; and
training the target model (230) using each data batch (210).
5. The method (1000) of claim 4, wherein generating the plurality of data batches (210) comprises:
determining a selection rate (312) for each data batch (210);
determining a batch size (322) for each data batch (210), wherein the batch size (322) is based on the selection rate (312) and a number of data pairs (153) in the third data set (152);
selecting a number of data pairs (153) from the third data set (152) corresponding to the determined batch size (322);
sorting the selected data pairs (342) based on the respective contrastive scores (154); and
removing from the data batch (210), at a removal rate, the selected data pairs (342) having the lowest contrastive scores (154), the removal rate comprising an inverse of the selection rate (312).
6. The method (1000) of claim 5, wherein the selection rate (312) decreases with training time.
7. The method (1000) of claim 6, wherein the batch size (322) is equal to a fixed batch size (324) divided by the selection rate (312).
8. The method (1000) according to any one of claims 1-7, wherein the target model (230) is larger than the base model (134).
9. The method (1000) of any one of claims 1-8, further comprising:
determining, by the data processing hardware (112), that a size of the target model (230) is the same as a size of the base model (134); and
when the size of the target model (230) is the same as the size of the base model (134):
replacing, by the data processing hardware (112), the base model (134) with the adapted model (144);
replacing, by the data processing hardware (112), the adapted model (144) with the target model (230);
determining, by the data processing hardware (112), the contrastive score (154) for each data pair (911) of a fourth data set (910) of data pairs (911) using the base model (134) and the adapted model (144) after the replacing; and
training, by the data processing hardware (112), a subsequent target model (230) using the data pairs (911) and the contrastive scores (154) of the fourth data set (910).
10. The method (1000) according to any one of claims 1-9, wherein the first data set (132) includes random data.
11. The method (1000) of claim 10, wherein the second data set (142) comprises cleaner data than the random data of the first data set (132).
12. The method (1000) according to any one of claims 1-11, wherein the contrastive score (154) comprises a Kullback-Leibler (KL) divergence.
13. The method (1000) of any of claims 1-12, wherein each data set (132, 142, 152) comprises sentence language pairs (133, 143, 153).
14. A system (100), comprising:
data processing hardware (112); and
storage hardware in communication with the data processing hardware (112), the storage hardware storing instructions that, when executed on the data processing hardware (112), cause the data processing hardware (112) to:
generating a base model (134) by training on a first data set (132) of data pairs (133);
generating an adapted model (144) by training the base model (134) on a second data set (142) of data pairs (143);
determining a contrastive score (154) for each data pair (153) of a third data set (152) of data pairs (153) using the base model (134) and the adapted model (144), the contrastive score (154) representing a probability of quality of the respective data pair (153); and
training a target model (230) using the data pairs (153) and the contrastive scores (154) of the third data set (152).
15. The system (100) of claim 14, wherein training the target model (230) comprises using data pairs (153) in the third data set (152) that satisfy a threshold contrastive score (154).
16. The system (100) of claim 14 or 15, wherein the operations further comprise: sorting, by the data processing hardware (112), the data pairs (153) of the third data set (152) based on the respective contrastive scores (154).
17. The system (100) of claim 16, wherein training the target model (230) further comprises:
generating a plurality of data batches (210), wherein each data batch (210) includes at least one data pair, and wherein a probability of a selected data pair being included in a selected data batch (210) is based on the respective contrastive score (154) of the selected data pair, and wherein the probability increases as the respective contrastive score (154) increases; and
training the target model (230) using each data batch (210).
18. The system (100) of claim 17, wherein generating the plurality of data batches (210) comprises:
determining a selection rate (312) for each data batch (210);
determining a batch size (322) for each data batch (210), wherein the batch size (322) is based on the selection rate (312) and a number of data pairs (153) in the third data set (152);
selecting a number of data pairs (153) from the third data set (152) corresponding to the determined batch size (322);
sorting the selected data pairs (342) based on the respective contrastive scores (154); and
removing from the data batch (210), at a removal rate, the selected data pairs (342) having the lowest contrastive scores (154), the removal rate comprising an inverse of the selection rate (312).
19. The system (100) of claim 18, wherein the selection rate (312) decreases over training time.
20. The system (100) of claim 19, wherein the batch size (322) is equal to a fixed batch size (324) divided by the selection rate (312).
21. The system (100) according to any one of claims 14-20, wherein the target model (230) is larger than the base model (134).
22. The system (100) of any one of claims 14-21, wherein the operations further comprise:
determining that the target model (230) is the same size as the base model (134); and
when the target model (230) and the base model (134) are the same size:
replacing the base model (134) with the adapted model (144);
replacing the adapted model (144) with the target model (230);
determining the contrastive score (154) for each data pair (911) of a fourth data set (910) of data pairs (911) using the base model (134) and the adapted model (144) after the replacing; and
training a subsequent target model (230) using data pairs (911) in the fourth data set (910) that satisfy the threshold contrastive score (154).
23. The system (100) according to any one of claims 14-22, wherein the first data set (132) includes random data.
24. The system (100) of claim 23, wherein the second data set (142) comprises cleaner data than the random data of the first data set (132).
25. The system (100) according to any one of claims 14-24, wherein the contrastive score comprises a Kullback-Leibler (KL) divergence.
26. The system (100) according to any one of claims 14-24, wherein each data set (132, 142, 152) includes sentence language pairs (133, 143, 153).
CN201980046281.5A 2018-05-08 2019-04-05 Contrastive sequence-to-sequence data selector Pending CN112602098A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US201862668650P true 2018-05-08 2018-05-08
US62/668,650 2018-05-08
PCT/US2019/026003 WO2019217013A1 (en) 2018-05-08 2019-04-05 Contrastive sequence-to-sequence data selector

Publications (1)

Publication Number Publication Date
CN112602098A true CN112602098A (en) 2021-04-02

Family

ID=66248717

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980046281.5A Pending CN112602098A (en) Contrastive sequence-to-sequence data selector

Country Status (4)

Country Link
US (1) US20190347570A1 (en)
EP (1) EP3791330A1 (en)
CN (1) CN112602098A (en)
WO (1) WO2019217013A1 (en)

Also Published As

Publication number Publication date
WO2019217013A1 (en) 2019-11-14
EP3791330A1 (en) 2021-03-17
US20190347570A1 (en) 2019-11-14

Similar Documents

Publication Publication Date Title
CN107301171B (en) Text emotion analysis method and system based on emotion dictionary learning
CN105824802B (en) It is a kind of to obtain the method and device that knowledge mapping vectorization indicates
WO2019223384A1 (en) Feature interpretation method and device for gbdt model
Atzmon et al. Learning to generalize to new compositions in image understanding
US20190236412A1 (en) Data processing method and device, classifier training method and system, and storage medium
US20190347571A1 (en) Classifier training
WO2017079568A1 (en) Regularizing machine learning models
JPH10187754A (en) Device and method for classifying document
US8620837B2 (en) Determination of a basis for a new domain model based on a plurality of learned models
Poulis et al. Learning with feature feedback: from theory to practice
US11157779B2 (en) Differential classification using multiple neural networks
KR20220062065A (en) Robust training in the presence of label noise
CN113011602A (en) Method and device for training federated model, electronic equipment and storage medium
CN112602098A (en) Alignment to sequence data selector
Mountassir et al. Some methods to address the problem of unbalanced sentiment classification in an arabic context
US20210034976A1 (en) Framework for Learning to Transfer Learn
JP6172317B2 (en) Method and apparatus for mixed model selection
Balasubramani et al. User involvement in ontology matching using an online active learning approach.
US20190286703A1 (en) Clustering program, clustering method, and clustering device for generating distributed representation of words
Chen et al. Ensemble of diverse sparsifications for link prediction in large-scale networks
JP6625507B2 (en) Association device, association method and program
US20210056417A1 (en) Active learning via a sample consistency assessment
US20210158137A1 (en) New learning dataset generation method, new learning dataset generation device and learning method using generated learning dataset
CN110574047A (en) Generating output examples using bit blocks
JP5813498B2 (en) Model learning device, related information extraction device, related information prediction device, model learning method, related information extraction method, related information prediction method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination