CN109388808B - Training data sampling method for establishing word translation model - Google Patents

Training data sampling method for establishing word translation model

Info

Publication number
CN109388808B
CN109388808B (application CN201710678325.9A)
Authority
CN
China
Prior art keywords
data
sampling
word
round
training
Prior art date
Legal status
Active
Application number
CN201710678325.9A
Other languages
Chinese (zh)
Other versions
CN109388808A (en)
Inventor
陈虎
尹文鹏
Current Assignee
Individual
Original Assignee
Individual
Priority date
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201710678325.9A
Publication of CN109388808A
Application granted
Publication of CN109388808B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/55Rule-based translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A training data sampling method for building word translation models, comprising: firstly, in a first round of sampling, randomly sampling from the raw data a first proportion of a target number of example sentences for the word; attaching corresponding labels to the example sentences obtained in the first round of sampling and storing them in a label data pool; performing word embedding preprocessing on the label data in the label data pool and acquiring the data center points of the categories corresponding to the interpretations of the word; heuristically clustering the raw data by using the center points of the different categories; and performing data post-processing on the example sentences acquired in the first round of sampling, feeding the processing result back for the next round of sampling, and repeating in cycles until the total sampling number reaches the target sampling number.

Description

Training data sampling method for establishing word translation model
Technical Field
The invention relates to the field of artificial intelligence, in particular to a training data sampling method for establishing a word translation model.
Background
Currently, translation systems produce their results by machine translation, i.e., the process of using a computer to convert one natural source language into another natural target language. To improve the accuracy of machine translation, many translation systems now provide a mechanism offering multiple candidate translation results: when a user is not satisfied with the current result, a candidate can be selected at the word or phrase level.
However, the resources in the corpora on which existing machine translation is based are limited, and the descriptive capability of the translation model itself is limited. There is therefore a certain gap between the translation result provided by machine translation and the result the user actually requires. When the translation result cannot meet the user's needs, the user cannot edit the result corresponding to the source-language sentence within the translation system and can only translate the sentence manually in some document outside it. In that manual process, however, the user must type every target-language word by hand; the translation workload is large, the efficiency is low, and the manual translation experience is far from ideal.
Furthermore, machine translation techniques based on deep learning are known; they were introduced to reduce the cost of data training and to improve cross-language capability. Such a technique uses an encoder and a decoder, both implemented by recurrent neural networks, as described in Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate" (CoRR, abs/1409.0473, 2014). The encoder reads and encodes the foreign-language input sentence word by word, continually updating the states of the middle-layer recurrent units; the state values of the middle layer are then used as input to the decoder, which produces the target-language translation word by word. This technique can effectively exploit grammatical structure to produce fluent translations. Its disadvantage is that word-level accuracy is not high enough; in particular, the translation accuracy of rare words and ambiguous words needs improvement.
In addition, text classifiers based on deep learning are also known; they mainly use convolutional or recurrent neural networks, as described in Ian Goodfellow, Yoshua Bengio and Aaron Courville, "Deep Learning" (MIT Press, 2016). The text to be classified is represented as a matrix and used as input. A convolutional neural network divides the entire text into several successive patches, performs feature extraction on each patch, and finally takes the most representative features of all patches as input to the classifier, as described in Yin and Schütze, "Multichannel variable-size convolution for sentence classification" (CoRR, abs/1603.04513, 2016). This hierarchical structure can effectively improve the accuracy of the text classifier. Its disadvantage is that a large amount of labeled data is required for training: if the amount of data is insufficient, the algorithm struggles to converge, so the cost of this technique is relatively high.
Currently, more and more readers use mobile devices such as smartphones and e-book readers to read foreign-language books and web pages. When an unknown word is encountered and must be looked up, the word can be selected and copied into separate dictionary software to obtain a translation; some readers or reading software have a built-in dictionary, so the user can translate by directly clicking or selecting the word. However, most words have multiple interpretations, and existing dictionary software feeds back to the user every interpretation of the word looked up. On the one hand, some words have so many interpretations that a long drop-down text box is needed to display them all, which makes browsing on small-screen smart devices difficult; on the other hand, many words are ambiguous, with no sharp boundary between several different interpretations, which are easily confused. How to determine, from the context the user is reading, the single accurate interpretation of the word the user is currently looking up is therefore a challenging and much-desired goal.
Disclosure of Invention
In order to solve the above problems, the present invention proposes a translation method comprising creating a word translation model and performing translation using the created model, wherein creating the word translation model comprises: adaptively sampling from the raw data, for each interpretation of each word to be trained, a target number of example sentences containing the word; labeling each adaptively sampled example sentence with the corresponding interpretation of the word; storing the labeled example sentences in a label database as label data; and performing cyclic transfer learning training on the label data stored in the label database to obtain the word translation model.
Advantageously, the raw data comes from free network resources.
Advantageously, an example sentence is a context that contains only the particular word to be trained.
Advantageously, the adaptive sampling for each word to be trained comprises the following. First, in the first round of sampling, a first proportion of the target number of example sentences is randomly sampled for the word, for example 25%, although other proportions are conceivable depending on the language or other conditions. Then, the example sentences obtained in the first round are given corresponding labels and stored in a label data pool. Next, word embedding preprocessing is performed on the label data in the pool, and the data center points of the categories corresponding to the word's interpretations are acquired; word embedding preprocessing converts each label data item into a real vector of fixed length, and the average of the real vectors of all the data in a category is its center point. Then, the raw data are heuristically clustered using the center points of the different categories (interpretations), each piece of raw training data being assigned to one category. Finally, data post-processing is performed on the example sentences acquired in the current round, and the result is used to decide the sampling strategy of the next round; several rounds of sampling follow until the total number of samples reaches the target number. In general, apart from the first round, which randomly samples a first proportion of the target number, the rounds from the second round until the end of sampling are adaptive and each typically samples a second proportion of the target number. The second proportion is typically much smaller than the first, such as 5%, although other proportions are also conceivable.
Adaptive sampling here means that the sampling of each round from the second round onward (including the second round) is determined by the sampling result of the previous round. Specifically, the result of the previous round's data post-processing is fed back into the adaptive sampling and determines the relative target sample amounts of the categories in the next round, thereby influencing which raw data will be sampled. The aim is to use adaptive sampling to avoid sampling too much data for common interpretations (categories) while sampling too little, or none at all, for rarely used ones. This improves the translation accuracy of the subsequently trained translation model.
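As one concrete reading of this loop, the following Python sketch walks through the rounds. The helpers embed() (word-embedding preprocessing) and label_manually() (human annotation) are placeholder names of our own, not code from the patent, and the within-cluster choice is simplified to random selection; the stricter two-constraint selection rule is given later in the detailed description.

    import random
    import numpy as np

    def embed(sentence):
        """Placeholder: word-embedding preprocessing that converts a context
        into a fixed-length real vector (e.g., the mean of its word vectors)."""
        raise NotImplementedError

    def label_manually(sentence):
        """Placeholder: a human annotator returns the interpretation id."""
        raise NotImplementedError

    def adaptive_sample(raw_data, n_interpretations,
                        target_per_class=200, first_ratio=0.25, round_ratio=0.05):
        """Multi-round adaptive sampling for one word, as described above.
        Assumes every interpretation receives at least one label in round 1."""
        target_total = target_per_class * n_interpretations
        # Round 1: random sampling of a first proportion (e.g., 25%) of the target.
        labeled = [(s, label_manually(s))
                   for s in random.sample(raw_data, int(first_ratio * target_total))]
        while len(labeled) < target_total:
            # Center point of each category: mean embedding of its labeled data.
            centers = {k: np.mean([embed(s) for s, lab in labeled if lab == k], axis=0)
                       for k in range(n_interpretations)}
            # Heuristic clustering: assign each raw context to the nearest center.
            clusters = {k: [] for k in range(n_interpretations)}
            for s in raw_data:
                v = embed(s)
                clusters[min(centers, key=lambda k: np.linalg.norm(v - centers[k]))].append(s)
            # Post-processing feedback: the next round targets the interpretation
            # with the fewest labeled samples so far.
            counts = {k: sum(1 for _, lab in labeled if lab == k)
                      for k in range(n_interpretations)}
            k_min = min(counts, key=counts.get)
            batch_size = min(int(round_ratio * target_per_class), len(clusters[k_min]))
            labeled += [(s, label_manually(s))
                        for s in random.sample(clusters[k_min], batch_size)]
        return labeled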
Advantageously, the processing results include the center point of the data of each category, the already-labeled data in each category, and all the raw data classified into each category.
Besides the translation method realized by translation training on static massive raw data, the invention also provides a translation method realized with data fed back by users, comprising updating a word translation model and translating with the updated model, wherein updating the word translation model comprises: obtaining training data from the user, for example from user feedback on an error in, or user suggestions for, translation software or a translation apparatus built on an existing translation model; labeling each example sentence from the user feedback with the interpretation of the corresponding word, with the user feedback fully taken into account; storing the labeled example sentences in a label database as label data; and performing cyclic transfer learning training on the label data stored in the label database to obtain the word translation model.
In this respect, the translation method realized with user feedback may adopt the user's feedback or may retain the original interpretation; it is thus more targeted and further safeguards translation precision and quality.
Alternatively, one may wait until the user feedback accumulates to a certain amount, such as 100 items, or re-label after a certain time, such as a week. This improves the reliability of the user feedback and avoids retraining on a sporadic or possibly erroneous piece of feedback, which would waste training resources.
In addition, the invention also provides a training data sampling method for establishing a word translation model, comprising the following steps: firstly, in a first round of sampling, randomly sampling from the raw data a first proportion of a target number of example sentences for the word; attaching corresponding labels to the example sentences obtained in the first round of sampling and storing them in a label data pool; performing word embedding preprocessing on the label data in the label data pool and acquiring the data center points of the categories corresponding to the interpretations of the word; heuristically clustering the raw data by using the center points of the different categories; and performing data post-processing on the example sentences acquired in the first round of sampling, feeding the processing result back for the next round of sampling, and repeating in cycles until the total sampling number reaches the target sampling number.
Advantageously, in each round of sampling after said first round, a second proportion of the target number is sampled in the current round, determined from the result of the previous round of sampling.
Advantageously, the first proportion is greater than the second proportion.
Advantageously, the first proportion is more than 5, 6, 8, or even 10 times the second proportion.
Advantageously, the example sentence is a context that contains only the particular word to be trained.
Advantageously, the processing results include the center point of the data of each category, the already-labeled data in each category, and all the raw data classified into each category.
Advantageously, said first proportion may be chosen to be 25%.
Advantageously, said second proportion may be chosen to be 5%.
Advantageously, the raw data is from free network resources.
Advantageously, the raw data is derived from feedback of the translation user.
In addition, the invention also provides a cyclic transfer learning method for establishing a word translation model, comprising: training the first of N > 1 words according to a machine training method; after the first word has been trained to a certain extent, migrating those partial parameters that are similar across the N words into the training of the second word; and continuing in a cycle, the partial parameters of the Nth word eventually being used for training the first word, until the translation models of all N words converge.
Convergence means that after each round of training, the translation result of the current model is compared with the completely correct result, such as the result given by manual labels, and the accuracy is calculated. At the beginning of learning, the accuracy jumps markedly while trending overall toward a better level. In the final stage, the improvement in accuracy flattens out, which means the model has learned its way to the optimal solution, i.e., converged; the model is essentially fully trained and cannot improve further, so training should end.
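One simple way to operationalize this stopping rule, as a sketch; the plateau window and tolerance below are our own assumptions, not values from the invention:

    def has_converged(accuracy_history, patience=3, min_delta=0.001):
        """True when the per-round accuracy has flattened out: no gain larger
        than min_delta over the last `patience` training rounds."""
        if len(accuracy_history) <= patience:
            return False
        window = accuracy_history[-(patience + 1):]
        return max(window[1:]) - window[0] < min_delta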
Advantageously, the partial parameters represent words and collocations that often occur in the contexts of different words.
Advantageously, each training of each word uses a target amount of training data. The target number may be, for example, 200 training data items for each interpretation of each word.
Advantageously, "to a certain extent" means that all of the target amount of training data for the single word has been trained for at least 1 pass.
Advantageously, "to a certain extent" means that all of the target amount of training data for the single word has been trained for 10 passes. Of course, other numbers are also conceivable, depending empirically on the language and the type of vocabulary.
Advantageously, the N words trained simultaneously are partially or fully related.
Advantageously, the machine training method is a text classifier training method.
Advantageously, in the transfer learning process, a specific parameter in a specific round of learning for a specific word is determined jointly from the corresponding parameter of the previous word and the parameter of the previous round of learning of the word itself.
Advantageously, in the transfer learning process, a specific parameter in a specific round of learning for a specific word is determined as a weighted average of the corresponding parameter of the previous word and the parameter of the previous round of learning of the word itself.
Specifically, assume a certain parameter of word 1 is $h_1$, the corresponding parameter of word 2 is $h_2$, and the value of $h_2$ in the previous round is $h_2'$; then the weighted migration of the parameter from word 1 to word 2 is calculated according to the following formula:

$h_2 = w_{12} \cdot h_1 + (1 - w_{12}) \cdot h_2'$

where the weight $w_{12}$ is defined in the detailed description as the correlation coefficient of the word vectors of word 1 and word 2.
the cyclic transfer learning method provided by the invention has the advantages that the training data of each word can be indirectly expanded into the training of other words by adopting the method, so that the training is more efficient, the average required training data quantity of each word can be reduced, and the training cost is greatly saved. This advantage becomes more pronounced especially in cases where the relevance of the individual words is relatively strong using a cyclic shift learning method.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It will be apparent that the figures in the following description depict only some embodiments of the invention. These drawings are not limiting to the invention, but serve an exemplary purpose.
Wherein:
FIG. 1 shows a flow chart of adaptive sampling of training data according to the method proposed by the invention;
FIG. 2 illustrates a flow chart of training a translation model by performing cyclic transfer learning in accordance with the method of the present invention;
FIG. 3 illustrates a flow chart of sampling data from a static corpus and training a model according to one embodiment of the method of the present invention;
FIG. 4 shows a flow chart of updating a translation model by means of a user's translation-query records and the corresponding feedback information according to an embodiment of the method of the invention.
Detailed Description
FIG. 1 shows a flow chart of adaptive sampling of source-language training data in accordance with the present invention. The goal of this sampling is, for any word to be translated, to sample a certain number, e.g., 200, of example sentences for each interpretation of that word from massive raw translation data such as free web material. Assuming an average of 5 interpretations per word, the sampling target for each word is approximately 1000 example sentences. An example sentence is understood here to mean a context that contains only this word; the context may be one sentence or several sentences.
Subsequently, the example sentences are manually labeled according to the interpretation this word takes in each context. A label is the sequence number of an interpretation of the word; for example, if a word has 5 different interpretations, the labels are label 1, label 2, label 3, label 4, and label 5. The labeled data can then be used to train a word translation model by machine learning techniques.
The adaptive sampling process of the training data is described in detail below. In general, the sampling of training data is done in multiple rounds, each round sampling a certain proportion of the target amount of data.
First, the first round employs a random sampling strategy. For example, the first round may sample 25% of the total target number: as described above, if 200 pieces of data (i.e., example sentences) are ultimately required for each of 5 interpretations, the first round samples 250 example sentences.
Subsequently, the 250 example sentences sampled and labeled in the first round are stored in the label data pool.
Then, word embedding preprocessing is performed on the 250 example sentences in the label data pool to obtain the center point of the data under each interpretation (i.e., each category). Word embedding preprocessing converts each context (data item) into a real vector of fixed length; the average of the real vectors of all contexts in a category is its center point.
Next, the raw data are heuristically clustered using the center points of the different interpretations, with each piece of raw data assigned to one interpretation: for any piece of data, the distances from the center points of all categories to that piece are calculated, and the piece is assigned to the category whose center point is closest.
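The center-point computation and nearest-center assignment just described can be written compactly as follows; the toy embed() below (hash-seeded random vectors averaged over tokens) merely stands in for a real word-embedding step and is our own assumption:

    import zlib
    import numpy as np

    def embed(sentence, dim=64):
        """Toy stand-in for word-embedding preprocessing: derive a stable
        random vector per token and average, giving a fixed-length vector."""
        vecs = [np.random.RandomState(zlib.crc32(w.encode())).randn(dim)
                for w in sentence.split()]
        return np.mean(vecs, axis=0)

    def category_centers(labeled):
        """Center point per interpretation: mean embedding of its labeled contexts."""
        groups = {}
        for sentence, category in labeled:
            groups.setdefault(category, []).append(embed(sentence))
        return {k: np.mean(v, axis=0) for k, v in groups.items()}

    def heuristic_cluster(raw_data, centers):
        """Assign every raw context to the category with the nearest center point."""
        clusters = {k: [] for k in centers}
        for sentence in raw_data:
            v = embed(sentence)
            clusters[min(centers, key=lambda k: np.linalg.norm(v - centers[k]))].append(sentence)
        return clusters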
Finally, post-processing is performed on the data acquired in the first round, mainly yielding three results: the center point of the contexts (data) of each interpretation (category), the already-labeled contexts (data) in each interpretation (category), and all the raw contexts (data) classified into each interpretation (category). The result of this first-round data post-processing is fed back into the adaptive sampling, where it determines the target sample amount of each category for the second round and influences which raw data will be sampled. After the first round of sampling, the next round is performed.
Table 1 below lists the algorithm variables of data post-processing and adaptive sampling and their meanings:

    Variable                              Meaning
    N                                     Number of categories
    c_k, k = 0, 1, 2, ..., N-1            Center point of the labeled data of the kth category
    |a - b|                               Distance between the two vectors a and b
    A_i, i = 0, 1, 2, ..., N-1            Set of raw data heuristically clustered into the ith category
    a_j^i, j = 0, 1, 2, ..., M_i - 1      One piece of raw data heuristically clustered into the ith category
    s_j^i, j = 0, 1, 2, ..., M_i - 1      Labeled sample data of the ith category
Adaptive sampling for category i is subject to two constraints. First, the distance from a sampled data item to the center point of category i must be less than or equal to the distance from the center point of category i to the center point of any other category; data satisfying this condition are more likely to belong to category i than to any other category, preferably with a probability margin exceeding 10%, although other values are conceivable for languages of other types or vocabulary of other domains. It should be understood that sampling is only a theoretical estimate and cannot be completely determinate and accurate; the category of a sampled item is finally fixed only by the subsequent manual labeling. Second, the average distance between the newly sampled data item (i.e., context) and the already-sampled data of the same interpretation is maximized. Strictly speaking, the next data point $a^*$ sampled for the ith interpretation can be expressed by the following formula:

$a^* = \underset{a \in A_i,\; |a - c_i| \le \min_{k \ne i} |c_i - c_k|}{\arg\max} \; \frac{1}{M_i} \sum_{j=0}^{M_i - 1} \left| a - s_j^i \right|$
the end of each round (including the first round) counts the class of the N interpretations (e.g., 5 in this embodiment) that has the least number of current samples, and the next round adaptively samples for this interpretation. The number of samples from the second round is typically fixed, with an empirical value of 5% of the target data amount for each interpretation. As described above, if the amount of data per interpretation target is 200, the number of samples per round after the initial sampling is 10. After several rounds of sampling, the number of samples for each interpretation will be very close. When the total target sample number, such as 1000, is reached, the sampling ends.
For example, the English word "school" has two typical interpretations: the first is common, while the second is rarely used. For each interpretation, our target number of samples is 200 contexts. In the first round we randomly sample, for example, 100 pieces of data; after manual labeling, 81 pieces correspond to the first interpretation and 19 to the second. In the second round we therefore sample 10 pieces of data for the second interpretation. After these 10 pieces are labeled, perhaps 1 belongs to the first interpretation and 9 to the second, so the distribution across the two interpretations becomes 82:28; the amount of data for the second interpretation is still smaller than for the first, so the next round again samples 10 pieces for the second interpretation. Proceeding in this way, after a few rounds the amount of data for the second interpretation may exceed the first, e.g., the distribution becomes 98:102, and the next round then samples 10 pieces for the first interpretation. Finally, when the total amount of data reaches 400, the adaptive sampling ends.
It follows that the adaptive sampling technique according to the present invention mines substantially the same number of contexts for each interpretation, whether common or uncommon, which ensures that the subsequently trained word translation model can also translate every interpretation effectively, including uncommon ones. It likewise avoids what would happen with plain random sampling instead of the adaptive sampling according to the invention: the number of contexts for an uncommon interpretation would be very small compared with the common one, and this insufficient data volume, i.e., too few contexts, would greatly reduce the accuracy of the later-trained translation model on the uncommon interpretation.
FIG. 2 shows a flow chart of training a translation model by performing cyclic transfer learning in accordance with the method of the present invention.
As shown in FIG. 2, multiple word models may be trained simultaneously when performing translation model training. Preferably, the words trained simultaneously are partially or fully related. For example, "school" and "university" are closely related, so they are suited to training in the same group. Typical example sentences are: "There are 1000 students in this school"; "There are 4000 students at this university." Both example sentences contain the word "students". When training the translation model of the word "school", information about the word "student" is hidden in the middle layer of the school model; during transfer learning it is migrated into the middle layer of the university model, which facilitates training that model. That is, words with strong relevance can better share training resources during the training process.
In the embodiment shown in FIG. 2, the training process for each word includes a preprocessing layer, a middle layer, a fully connected layer, and a logistic regression layer. The model for each particular word has its own independent preprocessing, fully connected, and logistic regression layers. The preprocessing layer implements word-embedding preprocessing of the training data: each word in a data item is converted by word embedding into a real column vector of specific length, and the column vectors are arranged, in the order of the corresponding words in the sentence, into a matrix. Several different word-embedding methods are used simultaneously, and the resulting matrices are handed to the next layer for processing. Each node of the fully connected layer is connected to each value in the output matrix of the middle layer, and every connection has its own weight; the weights are obtained by computation, more precisely by training on the data. The fully connected layer represents the important features finally extracted from the data that can affect the classifier result. The logistic regression layer is the last layer; it maps the values of the individual nodes of the fully connected layer to the classifier result, i.e., the category or label of the data. The middle layer is a hidden layer and may include, for example, a first convolution layer, a dynamic pooling layer, a second convolution layer, and a pooling layer; for the specific structure, see, for example, Yin and Schütze, "Multichannel variable-size convolution for sentence classification" (CoRR, abs/1603.04513, 2016). The result of the middle layer represents all features implicit in the data that can affect the classifier result. The training of the middle layer is shared among all the words and can be exploited across words, because the middle layer can be understood here as capturing marker words and fixed collocations that affect sentence meaning; since these words and collocations often appear in the contexts of different words, the middle layers of different words are similar.
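By way of illustration only, a compact PyTorch sketch of such a per-word classifier, simplifying the middle layer to a single convolution plus pooling rather than the full multichannel variable-size convolution structure of Yin and Schütze:

    import torch
    import torch.nn as nn

    class WordSenseClassifier(nn.Module):
        """Per-word model: preprocessing (embedding) layer, middle layer
        (convolution + pooling), fully connected layer, and logistic
        regression layer over the word's interpretations."""

        def __init__(self, vocab_size, n_interpretations, emb_dim=128, hidden=64):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, emb_dim)   # preprocessing layer
            self.middle = nn.Sequential(                         # middle (hidden) layer
                nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveMaxPool1d(1),
            )
            self.fc = nn.Linear(hidden, hidden)                  # fully connected layer
            self.out = nn.Linear(hidden, n_interpretations)      # logistic regression layer

        def forward(self, token_ids):                       # token_ids: (batch, seq_len)
            x = self.embedding(token_ids).transpose(1, 2)    # (batch, emb_dim, seq_len)
            x = self.middle(x).squeeze(-1)                   # (batch, hidden)
            x = torch.relu(self.fc(x))
            return self.out(x)                               # logits over interpretations

In such a design, only the parameters of self.middle would be shared and migrated between words, while the embedding, fully connected, and logistic regression layers remain independent per word, matching the description above.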
Specifically, word 1 may first be trained using a known text classifier training method. This involves passing the training data sampled for word 1 and stored in the label data pool sequentially through the preprocessing layer, middle layer, fully connected layer, and logistic regression layer. After word 1 has been trained to a certain extent, the middle layer of word 1 can be migrated for the training of other words, such as word 2, according to the method proposed in the present invention. "To a certain extent" here may mean after at least one pass of training on all the training data for word 1; empirically, training for 10 passes gives a better effect, at the cost of 10 times the training time.
Similarly, the middle layer from the training of word 2 can be migrated for training word 3, and so on. When the training of the last word N is completed, the first round of training ends.
The middle layer of the last word N can then be used again for the second round of training of word 1, in the same cyclic order. This loops back and forth until the translation models of all words 1 to N converge. Convergence means that after each round of training, the translation result of the current model is compared with the previously attached manual labels and the accuracy is calculated. At the beginning of learning the accuracy jumps markedly, while trending overall toward a better level; in the final stage the improvement flattens out, which means the model has learned its way to the optimal solution, i.e., converged, and the model training ends.
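The cyclic order can be sketched as follows; train_one_round(), migrate_middle_layer(), and converged() are placeholder names of ours for the per-word training pass, the weighted migration described next, and the convergence test:

    def cyclic_transfer_training(models, train_one_round, migrate_middle_layer,
                                 converged, max_rounds=100):
        """Round-robin transfer learning over the N word models: word 1 trains
        first; each later word receives the middle layer migrated from its
        predecessor, and from the second round on, word 1 receives word N's."""
        n = len(models)
        first_step = True
        for _ in range(max_rounds):
            for k in range(n):
                if not first_step:
                    migrate_middle_layer(models[(k - 1) % n], models[k])
                first_step = False
                train_one_round(models[k])
            if all(converged(m) for m in models):   # all N models converged
                break
        return models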
In addition, when the middle layer of a word is migrated for the next word, it is weighted-averaged with the middle layer of that next word. As shown in FIG. 2, the weight $w_{12}$ is the correlation coefficient of the word vectors $v_1$ and $v_2$ corresponding to the two words, word 1 and word 2:

$w_{12} = \dfrac{\operatorname{cov}(v_1, v_2)}{\sigma_{v_1} \, \sigma_{v_2}}$
that is, assume that a certain parameter of the middle layer of word 1 is h 1 The parameter of the middle layer corresponding to word 2 is h 2 And h 2 The value of the previous round is h' 2 Then the weighted migration of word 1 middle layer to word 2 middle layer is calculated according to the following formula:
the weighted average of any two other consecutive word middle layers is the same.
By the cyclic transfer learning method provided by the invention, the common part of the training process of the words, namely the middle layer, can be fully utilized. The training data of each word is indirectly extended into the training of the other words, so the average amount of training data required per word can be reduced. The training effect is improved and the training cost is greatly saved.
FIG. 3 illustrates a flow chart of sampling data from a static corpus and training a model according to one embodiment of the method of the present invention. Take again the word "school" with its two typical interpretations, the first common and the second rarely used. For each interpretation we obtain 200 example sentences from the static corpus by adaptive sampling, implemented for example by the flow of the adaptive sampling method shown in FIG. 1. During adaptive sampling, the 200 pieces of data per interpretation, i.e., 400 pieces in total for both interpretations, have already been labeled. These 400 pieces of labeled data are used to train the translation model of the word "school". In addition, among the words requiring model training, we find the other words most relevant to "school", such as "university", "college", and "institute", put them in the same group as "school", and perform cyclic migration training, specifically as shown in the flow chart of FIG. 2. Training ends when the models of all the simultaneously trained words converge.
The basic principle shown in FIG. 3 is that the combination of adaptive sampling of training data according to FIG. 1 and cyclic transfer learning according to FIG. 2 trains a text classifier for translating words, i.e., a word translation model, which can accurately determine the unique interpretation of the word the user looked up based on the context the user is reading. Furthermore, the high-accuracy deep learning text classifier established according to the present invention can substantially improve the accuracy of translating foreign-language words in context, especially rare words and ambiguous words. At the same time, on the premise of ensured accuracy, the amount of manually labeled training data can be reduced and the cost lowered.
FIG. 4 shows a flow chart of updating a translation model by means of a user's translation-query records and the corresponding feedback information according to an embodiment of the method of the invention. Take again the word "school", whose translation model has been trained using data sampled from a static corpus, for example as described with reference to FIG. 3. When a user translates the word "school" using translation software implemented in accordance with the method of the present invention, the software gives an interpretation of the word using this word's translation model, taking into account the context of the word the user is translating. If the user's feedback is that the translation is correct, or no feedback is given, the existing translation model is kept unchanged. If the user reports a translation error and translates manually, the resulting data is labeled and stored in the label data pool, so the amount of data in the pool grows, i.e., changes dynamically. Because there is a new data entry for the word "school", the translation model of this word must be retrained; only the model of "school" needs to be retrained separately, without training in combination with other related word groups. Alternatively, one may wait until the user feedback accumulates to a certain amount, such as 100 items, or re-label after a certain time, such as a week; this improves the reliability of the user feedback and avoids retraining on a sporadic or possibly erroneous piece of feedback, which would waste training resources.
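One possible shape of this deferred-retraining policy as code; the 100-item and one-week thresholds come from the text above, while the class and callback names are our own illustration:

    import time

    class FeedbackBuffer:
        """Collects user corrections for one word and triggers retraining only
        when enough feedback (e.g., 100 items) or enough time (e.g., a week)
        has accumulated, instead of retraining on every single correction."""

        def __init__(self, retrain, min_items=100, max_age_s=7 * 24 * 3600):
            self.retrain = retrain            # callback: retrain(labeled_items)
            self.min_items = min_items
            self.max_age_s = max_age_s
            self.items = []
            self.first_ts = None

        def add(self, sentence, corrected_label):
            """Store one manually corrected example and retrain if due."""
            if self.first_ts is None:
                self.first_ts = time.time()
            self.items.append((sentence, corrected_label))
            aged = time.time() - self.first_ts >= self.max_age_s
            if len(self.items) >= self.min_items or aged:
                self.retrain(self.items)      # only this word's model retrains
                self.items, self.first_ts = [], None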
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. It should be understood that, except where specifically indicated, the features disclosed in the above embodiments may be used alone or in combination. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Therefore, the invention disclosed herein is not intended to be limited to the particular embodiments disclosed, but is to cover modifications within the spirit and scope of the invention as defined by the appended claims.

Claims (9)

1. A training data sampling method for building word translation models, comprising:
firstly, in a first round of sampling, randomly sampling from the raw data a first proportion of a target number of example sentences for the word;
attaching corresponding labels to the example sentences obtained in the first round of sampling and storing them in a label data pool;
performing word embedding preprocessing on the label data in the label data pool and acquiring the data center points of the categories corresponding to the interpretations of the word;
heuristically clustering the raw data by using the center points of the different categories; and
performing data post-processing on the example sentences acquired in the first round of sampling and feeding the processing result back for the sampling of the next round, repeating in cycles until the total sampling number reaches the target sampling number,
wherein the data post-processing obtains, by statistics over the data sampled in the first round, the center point of the data of each category, the already-labeled data in each category, and all the raw data classified into each category,
and in the next round of sampling, the acquired data point $a^*$ is given by the formula:

$a^* = \underset{a \in A_i,\; |a - c_i| \le \min_{k \ne i} |c_i - c_k|}{\arg\max} \; \frac{1}{M_i} \sum_{j=0}^{M_i - 1} \left| a - s_j^i \right|$

wherein $c_k$, $k = 0, 1, 2, \ldots, N-1$ represents the center point of the labeled data of the kth category; $|a - b|$ represents the distance between two vectors; $A_i$, $i = 0, 1, 2, \ldots, N-1$ represents the set of raw data heuristically clustered into the ith category; $a_j^i$, $j = 0, 1, 2, \ldots, M_i - 1$ represents one piece of raw data heuristically clustered into the ith category; and $s_j^i$, $j = 0, 1, 2, \ldots, M_i - 1$ represents the labeled sample data of the ith category.
2. The method for data sampling according to claim 1, wherein,
and in each round of sampling after the first round, sampling in the current round a second proportion of the target number of example sentences, determined according to the result of the previous round of sampling.
3. The method for data sampling according to claim 2, wherein,
the first proportion is greater than the second proportion.
4. The method for data sampling according to claim 3, wherein,
the first proportion is greater than 5 times, 6 times, 8 times, or 10 times the second proportion.
5. The method for data sampling according to claim 1, wherein,
the example sentence is a context that contains only the particular word to be trained.
6. The method for data sampling according to claim 3, wherein,
the first proportion is 25%.
7. The method for data sampling according to claim 3, wherein,
the second proportion is 5%.
8. The method for data sampling according to claim 1, wherein,
the raw data comes from free network resources.
9. The method for data sampling according to claim 1, wherein,
the raw data is derived from feedback from the translation user.
CN201710678325.9A 2017-08-10 2017-08-10 Training data sampling method for establishing word translation model Active CN109388808B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710678325.9A CN109388808B (en) 2017-08-10 2017-08-10 Training data sampling method for establishing word translation model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710678325.9A CN109388808B (en) 2017-08-10 2017-08-10 Training data sampling method for establishing word translation model

Publications (2)

Publication Number Publication Date
CN109388808A CN109388808A (en) 2019-02-26
CN109388808B true CN109388808B (en) 2024-03-08

Family

ID=65414212

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710678325.9A Active CN109388808B (en) 2017-08-10 2017-08-10 Training data sampling method for establishing word translation model

Country Status (1)

Country Link
CN (1) CN109388808B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110428817A (en) * 2019-08-06 2019-11-08 上海上班族电子商务有限公司 A kind of garbage classification speech recognition system based on artificial intelligence
CN110543645B (en) * 2019-09-04 2023-04-07 网易有道信息技术(北京)有限公司 Machine learning model training method, medium, device and computing equipment
CN111158640B (en) * 2019-12-24 2021-06-01 中国科学院软件研究所 One-to-many demand analysis and identification method based on deep learning


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9606988B2 (en) * 2014-11-04 2017-03-28 Xerox Corporation Predicting the quality of automatic translation of an entire document
US10354182B2 (en) * 2015-10-29 2019-07-16 Microsoft Technology Licensing, Llc Identifying relevant content items using a deep-structured neural network

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003006193A (en) * 2001-06-20 2003-01-10 Atr Onsei Gengo Tsushin Kenkyusho:Kk Device and method for machine translation
CN101807259A (en) * 2010-03-25 2010-08-18 复旦大学 Invariance recognition method based on visual vocabulary book collection
CN106126507A (en) * 2016-06-22 2016-11-16 哈尔滨工业大学深圳研究生院 A character-encoding-based deep neural machine translation method and system
CN106096004A (en) * 2016-06-23 2016-11-09 北京工业大学 A method for establishing a large-scale cross-domain text sentiment orientation analysis framework
CN106776713A (en) * 2016-11-03 2017-05-31 中山大学 A massive short-text clustering method based on word-vector semantic analysis
CN106844352A (en) * 2016-12-23 2017-06-13 中国科学院自动化研究所 Word prediction method and system based on neural machine translation system
CN106611052A (en) * 2016-12-26 2017-05-03 东软集团股份有限公司 Text label determination method and device
CN106844356A (en) * 2017-01-17 2017-06-13 中译语通科技(北京)有限公司 A method for improving English-Chinese machine translation quality based on data selection

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Tian Xiaoyan. A big-data text clustering algorithm based on word embedding and a density-peak strategy. Technology Innovation and Application, 2017, full text. *
Huang Dong et al. Short-text clustering based on word vectors and EMD distance. Journal of Shandong University (Natural Science), 2017, Vol. 52, No. 7, full text. *

Also Published As

Publication number Publication date
CN109388808A (en) 2019-02-26

Similar Documents

Publication Publication Date Title
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN110516247B (en) Named entity recognition method based on neural network and computer storage medium
CN111241814B (en) Error correction method and device for voice recognition text, electronic equipment and storage medium
CN110597961B (en) Text category labeling method and device, electronic equipment and storage medium
CN110347857B (en) Semantic annotation method of remote sensing image based on reinforcement learning
CN112733866A (en) Network construction method for improving text description correctness of controllable image
CN112163429B (en) Sentence correlation obtaining method, system and medium combining cyclic network and BERT
CN109388808B (en) Training data sampling method for establishing word translation model
CN112711652B (en) Term standardization method and device
CN113672718B (en) Dialogue intention recognition method and system based on feature matching and field self-adaption
CN111026845B (en) Text classification method for acquiring multilevel context semantics
CN110008467A (en) A kind of interdependent syntactic analysis method of Burmese based on transfer learning
CN112800769B (en) Named entity recognition method, named entity recognition device, named entity recognition computer equipment and named entity recognition storage medium
WO2023284808A1 (en) Model training method and apparatus, text processing method and apparatus, electronic device, and medium
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN111222318A (en) Trigger word recognition method based on two-channel bidirectional LSTM-CRF network
CN110941958A (en) Text category labeling method and device, electronic equipment and storage medium
CN112732872A (en) Biomedical text-oriented multi-label classification method based on subject attention mechanism
CN103324632A (en) Concept identification method and device based on collaborative learning
CN111833848B (en) Method, apparatus, electronic device and storage medium for recognizing voice
CN107797986B (en) LSTM-CNN-based mixed corpus word segmentation method
CN114036950A (en) Medical text named entity recognition method and system
Romero et al. Modern vs diplomatic transcripts for historical handwritten text recognition
CN114266252A (en) Named entity recognition method, device, equipment and storage medium
CN114586038B (en) Method and device for event extraction and extraction model training, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant