CN109086814B

CN109086814B - Data processing method and device and network equipment

Info

Publication number: CN109086814B
Application number: CN201810813137.7A
Authority: CN
Inventors: 李俊岑
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2018-07-23
Filing date: 2018-07-23
Publication date: 2021-05-14
Anticipated expiration: 2038-07-23
Also published as: CN109086814A

Abstract

The invention discloses a data processing method, a data processing device, and a network device. The data processing method comprises the following steps: acquiring a first annotation data set; traversing the annotation data in the first annotation data set and, during the traversal, determining conflicting annotation data by using an annotation prediction model; acquiring a second annotation data set, wherein the second annotation data set is annotation data obtained by re-annotating, according to a preset annotation rule, the conflicting annotation data obtained during the traversal; determining a third annotation data set according to the first annotation data set and the second annotation data set; and when the evaluation result of the third annotation data set does not meet a preset evaluation condition, taking the third annotation data set as the first annotation data set and executing the traversal step again, until the evaluation result of the third annotation data set meets the preset evaluation condition. The invention improves the quality of the annotation data and saves labor and time costs.

Description

Data processing method and device and network equipment

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a data processing method, an apparatus, and a network device.

Background

As computer technology has developed, machine learning techniques are applied to more and more fields. Machine learning generally requires a large amount of labeled data to train a learning model, and therefore, the labeling quality of the data is an important factor affecting the accuracy of the learning model.

To improve the annotation quality of data, a common approach is to have several annotators annotate the same data and then take the result given by the majority of the annotators as the final annotation result; alternatively, each batch of annotation results is sampled and evaluated, and if the accuracy of the sampled evaluation is below a preset threshold, the annotators annotate the data again until the accuracy reaches the preset threshold.

In the process of implementing the invention, the inventor finds that the prior art has at least the following problems:

in the related art, improving the annotation quality of data, especially when the annotation task is complicated, depends mainly on manual participation, which consumes considerable human resources and time, and the accuracy of the annotation data still needs further improvement.

Therefore, it is desirable to provide a more reliable or efficient solution to effectively reduce the consumption of time and human resources while ensuring the quality of the annotation data.

Disclosure of Invention

In order to solve the problems in the prior art, embodiments of the present invention provide a data processing method, an apparatus, and a network device. The technical scheme is as follows:

in one aspect, a data processing method is provided, and the method includes:

acquiring a first labeling data set, wherein the first labeling data set is labeling data obtained by labeling data to be labeled according to a preset labeling rule;

traversing the labeled data in the first labeled data set, and determining conflict labeled data by using a label prediction model when traversing the labeled data in the first labeled data set;

acquiring a second labeled data set, wherein the second labeled data set is labeled data obtained by re-labeling the conflict labeled data obtained in the traversal process according to the preset labeling rule;

determining a third annotation data set according to the first annotation data set and the second annotation data set;

and when the evaluation result of the third labeled data set does not meet the preset evaluation condition, taking the third labeled data set as the first labeled data set, and executing the traversal step until the evaluation result of the third labeled data set meets the preset evaluation condition.

In another aspect, there is provided a data processing apparatus, the apparatus comprising:

a first acquisition module, configured to acquire a first annotation data set, wherein the first annotation data set is annotation data obtained by annotating data to be annotated according to a preset annotation rule;

the traversing module is used for traversing the labeled data in the first labeled data set and determining the conflicting labeled data by utilizing a labeled prediction model when traversing the labeled data in the first labeled data set;

a second obtaining module, configured to obtain a second labeled data set, where the second labeled data set is labeled data obtained by labeling conflict labeled data obtained in a traversal process according to the preset labeling rule again;

the first determining module is used for determining a third annotation data set according to the first annotation data set and the second annotation data set;

and the cyclic processing module is used for taking the third labeled data set as the first labeled data set when the evaluation result of the third labeled data set does not meet the preset evaluation condition, and executing the traversal step until the evaluation result of the third labeled data set meets the preset evaluation condition.

In another aspect, a network device is provided, including:

a processor adapted to implement one or more instructions; and

a memory storing one or more instructions adapted to be loaded by the processor and to perform the data processing method described above.

The technical scheme provided by the embodiment of the invention has the following beneficial effects:

according to the method, after the labeled data set is obtained, the labeled data in the labeled data set is traversed, the label prediction model is combined to determine the conflict labeled data in the traversing process, the conflict labeled data obtained in the traversing process are labeled again, the newly labeled data and the previous labeled data set are fused to obtain a new labeled data set, then the new labeled data set is evaluated, and when the evaluation result does not meet the preset evaluation condition, the above-mentioned cyclic iteration operation is carried out until the new labeled data set meets the preset evaluation condition, so that the labeling quality of the data is greatly improved, and the model and the manpower are effectively combined, so that the manpower and the time cost are saved.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flow chart of a data processing method according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of determining conflicting annotation data by using an annotation prediction model when traversing the annotation data in the first annotation data set according to the embodiment of the present invention;

FIG. 3 is a schematic flow chart of obtaining the evaluation result of the third annotation data set according to the embodiment of the present invention;

FIG. 4 is a flow chart illustrating another data processing method according to an embodiment of the present invention;

FIG. 5 is another schematic flow chart of determining conflicting annotation data by using an annotation prediction model when traversing the annotation data in the first annotation data set according to the embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a traversal module according to an embodiment of the present invention;

FIG. 8 is a block diagram of another data processing apparatus according to an embodiment of the present invention;

FIG. 9 is a schematic structural diagram of a first determining module according to an embodiment of the present invention;

FIG. 10 is a block diagram of another data processing apparatus according to an embodiment of the present invention;

FIG. 11 is a schematic structural diagram of a network device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Referring to fig. 1, which is a flow chart illustrating a data processing method according to an embodiment of the present invention, the present specification provides the method operation steps as described in the embodiment or the flow chart, but more or less operation steps may be included based on conventional or non-inventive labor. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. In actual system or product execution, sequential execution or parallel execution (e.g., parallel processor or multi-threaded environment) may be possible according to the embodiments or methods shown in the figures. Specifically, as shown in fig. 1, the method includes:

S102, acquiring a first annotation data set, wherein the first annotation data set is annotation data obtained by annotating data to be annotated according to a preset annotation rule.

In the embodiment of the present specification, the data to be annotated refers to an object that needs to be annotated by an annotation person, and the data to be annotated may include, but is not limited to, characters, images, audio, statistical data, and the like.

In the embodiment of the present specification, the preset labeling rule is information indicating how to label the data to be labeled by a labeling person. The annotation data comprises an annotation result corresponding to the data to be annotated, and the annotation result is data obtained after annotation personnel annotate the data to be annotated based on a preset annotation rule.

The preset annotation rule may include an annotation format, label categories, and the like. For example, the annotation format may be "slot name=slot value##slot name=slot value", where a slot is an entity word with a specific attribute in the data to be annotated; the label categories may be "song name -> song; singer -> singer; category, style -> tag; movie, program, work -> tv; language -> language; location information -> place". According to this preset annotation rule, the annotation result for the data to be annotated "play 'Cowboy Is Busy' for me" may be "song=Cowboy Is Busy"; the annotation result for the data to be annotated "what good-listening songs does Zhou Jielun have recently" may be "singer=Zhou Jielun##tag=good-listening"; and the annotation result for the data to be annotated "play Love Is Just A Dream on demand" may be "song=Love Is Just A Dream".
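The annotation format described above lends itself to a small parser. The following is a minimal sketch in Python, assuming that a plain `=` separates a slot name from its slot value and that `##` separates slot pairs. The names `parse_annotation`, `validate_labels`, and `ALLOWED_LABELS` are hypothetical and only mirror the example label categories listed above.

```python
# Label names taken from the example label categories; the set itself is illustrative.
ALLOWED_LABELS = {"song", "singer", "tag", "tv", "language", "place"}

def parse_annotation(result: str) -> dict:
    """Parse 'slot=value##slot=value' into a {slot name: slot value} mapping."""
    slots = {}
    for pair in result.split("##"):
        name, sep, value = pair.partition("=")
        if not sep or not name.strip():
            raise ValueError(f"malformed slot pair: {pair!r}")
        slots[name.strip()] = value.strip()
    return slots

def validate_labels(slots: dict) -> bool:
    """Check that every slot name is one of the allowed label categories."""
    return all(name in ALLOWED_LABELS for name in slots)

parsed = parse_annotation("song=Love Is Just A Dream")
print(parsed, validate_labels(parsed))
```

A check of this kind could be applied to every annotation result before it enters the first annotation data set, so that purely formal errors are caught without involving a model.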

It should be noted that the above is only an example of the preset annotation rule. In practical applications, a corresponding preset annotation rule may be set as needed; for example, when intentions are to be annotated, the preset annotation rule may specify that the label category of intention A is 1, the label category of intention B is 2, and so on.

In practical application, a marking person can initiate a request to a marking data system through some interactive devices, the marking data system can select data to be marked from a data set to be marked, and the part of data to be marked and a preset marking rule are packaged into a data packet to be sent to the marking person. Subsequently, the annotation data of the annotation completed by the annotation personnel can be obtained.

S104, traversing the labeled data in the first labeled data set, and determining the conflicting labeled data by using a labeled prediction model when traversing the labeled data in the first labeled data set.

In order to control the annotation quality of the annotation data, in the embodiment of the present specification, the annotation data in the first annotation data set is traversed, and when traversing the annotation data of the first annotation data set, the annotation prediction model is combined to determine the conflicting annotation data.

In an embodiment of the present specification, data to be annotated is input to an annotation prediction model, so that corresponding prediction annotation data can be output, and when annotation data corresponding to the data to be annotated in a first annotation data set is inconsistent with the prediction annotation data, annotation data corresponding to the data to be annotated in the first annotation data set is determined as conflict annotation data.

In this embodiment, when traversing the annotation data in the first annotation data set, the method shown in fig. 2 may be adopted to determine the conflicting annotation data by using the annotation prediction model. Fig. 2 is a schematic flow chart of determining conflicting annotation data by using an annotation prediction model when traversing the annotation data in the first annotation data set according to an embodiment of the present invention, as shown in fig. 2, the flow chart may include:

S202, selecting at least one piece of annotation data from the first annotation data set as annotation data to be screened, and using the annotation data in the first annotation data set other than the annotation data to be screened as training annotation data.

In the embodiment of the present specification, the annotation data in the first annotation data set is split into annotation data to be screened and training annotation data. The training annotation data is the data set used to build the model by fitting its parameters; that is, the training annotation data is used for training the machine learning model to determine the parameters of the machine learning model. The annotation data to be screened is the data set from which the conflicting annotation data is screened out.

In this embodiment of the present specification, the annotation data to be filtered may be one annotation data in the first annotation data set, or may be a set of several annotation data. The annotation data to be screened may be selected from the first annotation data set in a random selection manner.
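The split described in step S202 can be sketched as follows. This is a minimal illustration, assuming the first annotation data set is held in a Python list and that the random selection is uniform; the function name `split_for_screening` and its parameters are hypothetical.

```python
import random

def split_for_screening(first_set, n_screen=1, seed=None):
    """Randomly pick n_screen items to screen; the rest become training data."""
    rng = random.Random(seed)
    picked = set(rng.sample(range(len(first_set)), n_screen))
    to_screen = [item for i, item in enumerate(first_set) if i in picked]
    training = [item for i, item in enumerate(first_set) if i not in picked]
    return to_screen, training

to_screen, training = split_for_screening(["a", "b", "c", "d", "e"], n_screen=2, seed=1)
print(to_screen, training)
```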

S204, performing machine learning on the training annotation data to generate an annotation prediction model.

In this embodiment, the model category for performing machine learning may be determined according to the content corresponding to the annotation data, for example, if the content of the annotation data is data related to an intention classification, a model of the intention classification (e.g., a classifier such as a support vector machine) may be selected for performing machine learning, and if the content of the annotation data is content related to a slot annotation, a model of a sequence annotation (e.g., an LSTM model, a CRF model) may be selected for performing machine learning.

In embodiments of the present specification, the annotation prediction model may be generated by maximizing a likelihood function of the data set, where x represents the training annotation data input and y represents the class label output of the training annotation data. In the machine learning process, the training annotation data x is first converted into a vector c, and the vector c is then converted into the corresponding output y.

In this embodiment, when converting the vector c into the corresponding output y, the vector c may be input into a Softmax multi-class classifier to calculate the probability of each class label; specifically, the generation probability of the i-th class label is normalized over all class labels j = 1, …, K.

The likelihood of all class labels is then formed from these probabilities. The parameters of the machine learning model can be determined by maximizing the likelihood over all class labels, whereby the annotation prediction model is generated.
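The sketch below is a hedged illustration rather than the original formulas: it gives one standard formulation that is consistent with the description (a Softmax over K class labels and a maximum-likelihood objective). The weight vectors $w_j$, the parameter set $\theta$, and the sample index $n = 1, \ldots, M$ are assumptions introduced for this sketch only.

```latex
% Softmax probability of the i-th class label given the vector c (w_j is assumed)
p(y = i \mid c) = \frac{\exp(w_i^{\top} c)}{\sum_{j=1}^{K} \exp(w_j^{\top} c)},
\qquad i = 1, \ldots, K

% Likelihood of the training annotation data and its maximizer (theta and M are assumed)
L(\theta) = \prod_{n=1}^{M} p\left(y^{(n)} \mid x^{(n)}; \theta\right),
\qquad
\theta^{*} = \arg\max_{\theta} \sum_{n=1}^{M} \log p\left(y^{(n)} \mid x^{(n)}; \theta\right)
```

Maximizing the logarithm of the likelihood yields the same parameters as maximizing the likelihood itself and is the form usually optimized in practice.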

In the embodiment of the present specification, the transformation of the training annotation data x into the vector c may adopt at least two ways:

In a first mode, the TF-IDF value of each word in the training annotation data is computed, and the whole sentence is converted into a TF-IDF vector, which can be expressed as:

$$V_d = [w_{1,d}, w_{2,d}, \ldots, w_{N,d}]^T$$

wherein

$$w_{t,d} = tf_{t,d} \cdot \log\frac{|D|}{|\{d' \in D \mid t \in d'\}|}$$

$tf_{t,d}$ is the frequency of appearance of the term t in the input text d;

$\log\frac{|D|}{|\{d' \in D \mid t \in d'\}|}$ is the inverse document frequency;

$|D|$ is the total number of files in the file set;

$|\{d' \in D \mid t \in d'\}|$ is the number of files containing the term t.
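The TF-IDF weighting of the first mode can be illustrated with a short computation. This is a minimal sketch, assuming whitespace tokenization and the raw term count as the term frequency; the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with w[t,d] = tf[t,d] * log(|D| / df[t])."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for doc in tokenized for term in doc})
    # df[t]: number of documents that contain the term t at least once
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    total_docs = len(tokenized)
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)  # tf[t,d] as a raw count (an assumption)
        vectors.append([
            counts[term] * math.log(total_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors

vocab, vecs = tfidf_vectors(["play a song", "play a movie"])
print(vocab)
print(vecs)
```

Terms that occur in every document receive a weight of zero, so only the distinguishing terms ("song" and "movie" in the usage above) contribute to the vectors.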

In the second mode, x is encoded as a vector c with a width K by a recurrent neural network encoder. Given an ordered sequence of features of arbitrary length $x_{1:k} = x_1, x_2, \ldots, x_k$, the recurrent neural network encoder returns a fixed-length feature vector $c_k \in \mathbb{R}^{out}$ (where $x_i$ may be a one-hot representation or a dense low-dimensional feature).

In the embodiment of the present specification, the recurrent neural network encoder adopts a recursive definition. Specifically, in describing the sequence information, when the partial sequence consisting of the first i elements is described, a hidden state $s_i$ is introduced as the output of a function R of the previous hidden state $s_{i-1}$ and the current element $x_i$, i.e., $s_i = R(s_{i-1}, x_i)$; the final fixed-length feature vector is obtained by mapping the final state $s_k$ to $c_k$ through the mapping $O(\cdot)$, specifically as follows:

$$RNN(x_{1:k}; s_0) = c_{1:k}$$

$$c_i = O(s_i)$$

$$s_i = R(s_{i-1}, x_i)$$
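The recursive definition can be made concrete with a toy encoder. This is only a sketch: the particular state update (an element-wise tanh with two scalar weights) and the identity output mapping are assumptions chosen for brevity, and they require the input elements to have the same width as the state. The names `make_rnn_encoder` and `rnn_encode` are hypothetical.

```python
import math

def make_rnn_encoder(w_state, w_input):
    """Build toy R and O functions (illustrative only, not a trained model)."""
    def R(prev_state, x):
        # s_i = tanh(w_state * s_{i-1} + w_input * x_i), applied element-wise
        return [math.tanh(w_state * s + w_input * xi) for s, xi in zip(prev_state, x)]

    def O(state):
        # identity output mapping: c_i = s_i
        return list(state)

    return R, O

def rnn_encode(sequence, s0, R, O):
    """Apply s_i = R(s_{i-1}, x_i) over the sequence and return c_k = O(s_k)."""
    state = s0
    for x in sequence:
        state = R(state, x)
    return O(state)

R, O = make_rnn_encoder(w_state=0.5, w_input=1.0)
short = rnn_encode([[1.0, 0.0]], [0.0, 0.0], R, O)
longer = rnn_encode([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0], R, O)
print(len(short), len(longer))
```

Whatever the length of the input sequence, the returned vector has the width of the state, which is the property the encoder is used for here.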

S206, inputting the data to be labeled corresponding to the labeled data to be screened into the labeling prediction model for labeling prediction to obtain the predicted labeled data corresponding to the data to be labeled.

In the embodiment of the present specification, after the training is finished to obtain the annotation prediction model, the data to be annotated corresponding to the annotation data to be screened may be input to the annotation prediction model for annotation prediction, so as to obtain the prediction annotation data corresponding to the data to be annotated.

S208, determining conflict annotation data according to the annotation data to be screened and the prediction annotation data.

In the embodiment of the present specification, after obtaining the predicted annotation data, the annotation result of the predicted annotation data may be compared with the annotation result of the annotation data to be screened, and when the annotation result of the annotation data to be screened is inconsistent with the annotation result of the predicted annotation data, it is indicated that the annotation data to be screened may have problems such as an annotation error, and at this time, the annotation data to be screened is determined to be conflicting annotation data.

For example, in the annotation for intention shown in table 1, since the annotation result of the annotation data to be filtered corresponding to the sequence number 2 is different from the annotation result of the prediction annotation data, the annotation data to be filtered corresponding to the sequence number 2 is determined as the conflict annotation data.

TABLE 1

In this embodiment, the conflict annotation data determined in each traversal may be placed in a data set.

In the embodiment of the present specification, since the conflicting annotation data is data that may have problems such as annotation errors, in order to improve the reliability of the annotation prediction model obtained by machine learning in the traversal process, after determining that the annotation data to be screened is the conflicting annotation data, the conflicting annotation data may be removed from the first annotation data set, so that the conflicting annotation data does not exist in the training annotation data in the next traversal process, so that the annotation prediction model obtained by machine learning using the training annotation data may be more reliable, and the accuracy of the screened conflicting annotation data may be improved.
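Steps S204 to S208 can be sketched end to end as follows. This is an illustration under stated assumptions: annotation data is represented as `(text, label)` pairs, and a simple keyword-voting classifier stands in for the annotation prediction model (the specification mentions models such as a support vector machine, LSTM, or CRF instead). All function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_keyword_model(training):
    """Stand-in for S204: learn label votes per token from (text, label) pairs."""
    votes = defaultdict(Counter)
    for text, label in training:
        for token in text.lower().split():
            votes[token][label] += 1
    fallback = Counter(label for _, label in training).most_common(1)[0][0]

    def predict(text):
        tally = Counter()
        for token in text.lower().split():
            tally.update(votes.get(token, {}))
        return tally.most_common(1)[0][0] if tally else fallback

    return predict

def find_conflicts(to_screen, training, train_fn=train_keyword_model):
    """S204-S208: train, predict, and keep items whose label disagrees."""
    model = train_fn(training)
    return [(text, label) for text, label in to_screen if model(text) != label]

training = [("play a song", "music"), ("play the song", "music"),
            ("weather today", "weather"), ("weather tomorrow", "weather")]
to_screen = [("play this song", "weather"), ("weather now", "weather")]
conflicts = find_conflicts(to_screen, training)
print(conflicts)

# Remove the conflicts so that they are not used as training data in the next pass.
remaining = [item for item in training + to_screen if item not in conflicts]
```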

S106, acquiring a second annotation data set, wherein the second annotation data set is obtained by re-annotating the conflict annotation data obtained in the traversal process according to the preset annotation rule.

In an embodiment of the present specification, after determining the conflicting annotation data from the first annotation data set, the conflicting annotation data may be sent to the annotating staff again, so that the annotating staff performs the re-annotation on the data to be annotated corresponding to the conflicting annotation data according to the preset annotation rule to obtain an annotation result. The re-annotated annotation data can then be retrieved as the second annotation data set.

S108, determining a third annotation data set according to the first annotation data set and the second annotation data set.

In this embodiment, the third annotation data set can be obtained by replacing the conflicting annotation data in the first annotation data set with the annotation data in the second annotation data set.
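The replacement in step S108 amounts to a keyed merge. The following minimal sketch assumes that each annotation data set is a mapping from the data to be annotated (or its identifier) to its annotation result; the function name `build_third_set` is hypothetical.

```python
def build_third_set(first_set, second_set):
    """Replace conflicting annotations in first_set by their re-annotated versions."""
    third_set = dict(first_set)   # copy so that the first set is left untouched
    third_set.update(second_set)  # re-annotated results override the old ones
    return third_set

first = {"q1": "intention_1", "q2": "intention_1", "q3": "intention_2"}
second = {"q2": "intention_2"}  # q2 was flagged as conflicting and re-annotated
print(build_third_set(first, second))
```

Because the merge is keyed, it also works when the conflicting annotation data has already been removed from the first annotation data set during the traversal: the re-annotated items are then simply added back.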

S110, when the evaluation result of the third labeled data set does not meet the preset evaluation condition, taking the third labeled data set as the first labeled data set, and executing the traversing step until the evaluation result of the third labeled data set meets the preset evaluation condition.

In this embodiment of the present specification, after the third annotation data set is determined, quality evaluation may be performed on the third annotation data set, that is, whether an evaluation result of the third annotation data set meets a preset evaluation condition is determined. This quality assessment process may be performed manually to reduce the effect of inaccuracies in the model predictions that may exist in the preceding steps.

Specifically, when the evaluation result of the third labeled data set cannot meet the preset evaluation condition, the third labeled data set may be used as the first labeled data set, and step S104 is executed until the obtained evaluation result of the third labeled data set can meet the preset evaluation condition, which indicates that the labeling quality of the third labeled data set at this time is qualified, meets the requirement, and the execution may be finished.
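The overall iteration of steps S104 to S110 can be sketched as a loop. This is a hedged outline only: `find_conflicts`, `relabel`, and `evaluate` are hypothetical callables standing in for the model-based screening, the manual re-annotation, and the quality evaluation, and the `max_rounds` safety limit is an assumption that the specification does not mention.

```python
def refine_until_acceptable(first_set, find_conflicts, relabel, evaluate,
                            threshold, max_rounds=10):
    """Repeat screening, re-annotation and merging until the evaluation passes."""
    current = dict(first_set)
    for _ in range(max_rounds):
        conflicts = find_conflicts(current)   # S104: keys of conflicting items
        second_set = relabel(conflicts)       # S106: re-annotated results
        current = {**current, **second_set}   # S108: third annotation data set
        if evaluate(current) >= threshold:    # S110: preset evaluation condition
            break
    return current
```

In a real system `relabel` would dispatch the conflicting items to annotators and wait for their results, so each pass of the loop corresponds to one round of manual work on a small subset of the data.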

The preset evaluation condition may be set according to the evaluation method of the third annotation data set. In this embodiment of the present specification, the evaluation manner of the third annotation data set may adopt sampling evaluation, and the method shown in fig. 3 may be adopted to obtain the evaluation result of the third annotation data set. Fig. 3 is a schematic flow chart of obtaining an evaluation result of a third annotation data set according to an embodiment of the present invention, and as shown in fig. 3, the flow chart may include:

S302, extracting a first amount of marking data from the third marking data set as sample marking data.

In the embodiment of the specification, the extraction may be performed in a random manner, the first number of the extractions may be set according to actual requirements, and generally, the larger the first number is, the larger the number of the sample label data is, the higher the reliability of the evaluation result is; conversely, the smaller the first number, the smaller the number of sample labeling data, and the lower the reliability of the evaluation result.

S304, counting a second quantity of the labeling data meeting the preset labeling rule in the sample labeling data.

In the embodiment of the present specification, the sample annotation data can be detected one by one according to the preset annotation rule for indicating how the annotating personnel annotate the to-be-annotated data, and when the detection result of a certain sample annotation data meets the preset annotation rule, the sample annotation data can be considered to be accurately annotated, and the number of the annotation data meeting the preset annotation rule in the sample annotation data is counted as the second number.

S306, calculating the ratio of the second quantity to the first quantity, and taking the ratio as the evaluation result of the third labeling data set.

In the embodiment of the present specification, a ratio of the second quantity to the first quantity may be calculated and used as the evaluation result of the third annotation data set.

It should be noted that, when the evaluation result of the third annotation data set is the above ratio, the preset evaluation condition may be set to be the preset ratio, for example, the preset evaluation condition may be set to be a value such as 95% or 90%. When the evaluation result is smaller than the preset ratio, the evaluation result of the third annotation data set is not satisfied with the preset evaluation condition; on the contrary, when the evaluation result is greater than or equal to the preset ratio, it is indicated that the evaluation result of the third annotation data set meets the preset evaluation condition.
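The sampling evaluation of steps S302 to S306 and the comparison with the preset ratio can be sketched as follows. The check `complies` is a hypothetical callable that decides whether one piece of sample annotation data meets the preset annotation rule (in the specification this inspection may be performed manually); the function names are likewise hypothetical.

```python
import random

def sample_accuracy(third_set, first_number, complies, seed=None):
    """S302-S306: draw a random sample and return second_number / first_number."""
    rng = random.Random(seed)
    sample = rng.sample(list(third_set), first_number)            # S302
    second_number = sum(1 for item in sample if complies(item))   # S304
    return second_number / first_number                           # S306

def meets_condition(ratio, preset_ratio=0.95):
    """The evaluation passes when the ratio reaches the preset ratio."""
    return ratio >= preset_ratio

data = [("text %d" % i, i != 0) for i in range(10)]  # one non-compliant item
ratio = sample_accuracy(data, 10, complies=lambda item: item[1], seed=0)
print(ratio, meets_condition(ratio), meets_condition(ratio, preset_ratio=0.90))
```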

It should be noted that the above is only an example of obtaining the evaluation result of the third annotation data set, in practical applications, the annotation quality of the third annotation data set may also be evaluated in other manners to obtain the evaluation result, for example, the evaluation may also be performed according to the distribution of the annotation data in the third annotation data set, and the present invention is not limited to this.

In summary, after the labeled data set is obtained, the embodiment of the invention traverses the labeled data in the labeled data set, determining conflict annotation data in the traversal process by combining the annotation prediction model, re-annotating the conflict annotation data obtained in the traversal process, and the newly labeled data is fused with the previous labeled data set to obtain a new labeled data set, and then the new labeled data set is evaluated, when the evaluation result does not meet the preset evaluation condition, the above-mentioned loop iteration operation is carried out until the new labeled data set meets the preset evaluation condition, thereby greatly improving the labeling quality of the data, and because in the whole control process of labeling the data quality, the model and the manpower are effectively combined, the manpower and time cost are reduced, and the efficiency of quality inspection of the labeled data is improved.

Referring to fig. 4, which is a flow chart illustrating another data processing method according to an embodiment of the present invention, the present specification provides the method operation steps as described in the embodiment or the flow chart, but more or less operation steps can be included based on conventional or non-inventive labor. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. In actual system or product execution, sequential execution or parallel execution (e.g., parallel processor or multi-threaded environment) may be possible according to the embodiments or methods shown in the figures. Specifically, as shown in fig. 4, the method includes:

S402, acquiring a first annotation data set, wherein the first annotation data set is annotation data obtained by annotating data to be annotated according to a preset annotation rule.

S404, acquiring the data characteristics of the labeled data in the first labeled data set.

In the embodiment of the present specification, the data feature of the annotation data may be a feature obtained by analyzing the annotation result, for example, a slot feature of the annotation data, an intention feature of the annotation data, or the like.

S406, the first annotation data set is divided into N annotation data subsets, the data characteristics of the annotation data contained in the annotation data subsets meet a preset distribution rule, and N is larger than or equal to 2.

In the embodiment of the present specification, in order to improve the efficiency of data processing, the first annotation data set may be split into N (N is greater than or equal to 2) annotation data subsets, and the data characteristics of the annotation data included in each annotation data subset need to meet a preset distribution rule, so as to ensure the reliability of subsequently screened conflicting annotation data.

In one embodiment, the preset distribution rule may be that the data features of each annotation data subset satisfy a Poisson distribution. If the annotation data subset includes m pieces of annotation data, the probability P(x) of occurrence of the data feature x can be represented by the following formulas:

$$P(x) = \frac{m^x}{x!} e^{-m}$$

$$P(0) = e^{-m}$$

when the data characteristics of the labeled data subsets meet the Poisson distribution, the consistency of the data of each labeled data subset can be guaranteed to the greatest extent, and then, when a labeled prediction model obtained by machine learning is used for screening the conflicting labeled data from the labeled data to be screened, the conflicting labeled data can be screened as far as possible, so that the screening accuracy and reliability can be greatly improved, the data processing efficiency can be further improved, and the quality of the finally obtained labeled data is favorably improved.
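For reference, the Poisson probabilities can be evaluated numerically. This sketch assumes the standard Poisson probability mass function with parameter m, which agrees with the stated special case P(0) = e^-m; the function name `poisson_pmf` is hypothetical.

```python
import math

def poisson_pmf(x, m):
    """Standard Poisson probability P(x) = m**x * exp(-m) / x! for x = 0, 1, 2, ..."""
    return (m ** x) * math.exp(-m) / math.factorial(x)

m = 2.0
print(poisson_pmf(0, m), math.exp(-m))           # the two values coincide
print(sum(poisson_pmf(x, m) for x in range(50)))  # close to 1
```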

In this embodiment of the present specification, the preset distribution rule may also be set according to the label category and actual requirements. For example, the preset distribution rule may be a preset proportion of each data feature in the annotation data subset; when the annotation type of the annotation data is an intention, the preset distribution rule may be that, in the annotation data subset, intention feature 1 : intention feature 2 ≥ 9:1, that is, the data features of the annotation data in each annotation data subset obtained after splitting need to satisfy intention feature 1 : intention feature 2 ≥ 9:1. The present invention is not limited in this respect.
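One way to obtain subsets that respect a proportion rule of this kind is to distribute the items of each data feature evenly over the subsets and then verify the ratio. The sketch below is an assumption about how this could be done and not a procedure stated in the specification; `split_preserving_proportions`, `satisfies_ratio`, and `feature_of` are hypothetical names.

```python
from collections import defaultdict

def split_preserving_proportions(annotations, n_subsets, feature_of):
    """Deal the items of each data feature round-robin into n_subsets subsets."""
    if n_subsets < 2:
        raise ValueError("N must be at least 2")
    by_feature = defaultdict(list)
    for item in annotations:
        by_feature[feature_of(item)].append(item)
    subsets = [[] for _ in range(n_subsets)]
    for items in by_feature.values():
        for position, item in enumerate(items):
            subsets[position % n_subsets].append(item)
    return subsets

def satisfies_ratio(subset, feature_of, major, minor, min_ratio=9.0):
    """Check major : minor >= min_ratio within one subset."""
    major_count = sum(1 for item in subset if feature_of(item) == major)
    minor_count = sum(1 for item in subset if feature_of(item) == minor)
    return minor_count == 0 or major_count / minor_count >= min_ratio

data = [("q%d" % i, 1) for i in range(18)] + [("r%d" % i, 2) for i in range(2)]
subsets = split_preserving_proportions(data, 2, feature_of=lambda item: item[1])
print([satisfies_ratio(s, lambda item: item[1], major=1, minor=2) for s in subsets])
```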

S408, traversing the labeled data in the first labeled data set, and determining the conflicting labeled data by using a label prediction model when traversing the labeled data in the first labeled data set.

In this embodiment, when traversing the annotation data in the first annotation data set, the method shown in fig. 5 can be adopted to determine the conflicting annotation data by using the annotation prediction model. Fig. 5 is another flow chart illustrating that a label prediction model is used to determine conflicting label data when traversing the label data in the first label data set according to the embodiment of the present invention, as shown in fig. 5, the flow chart may include:

S502, selecting K labeled data subsets from the N labeled data subsets as labeled data to be screened, and taking the remaining (N-K) labeled data subsets as training labeled data, wherein 1 ≤ K ≤ N/2.

In the embodiment of the present specification, K labeled data subsets may be randomly selected from the N labeled data subsets obtained by splitting, where 1 ≤ K ≤ N/2. For example, 1 labeled data subset may be selected from the N labeled data subsets as the labeled data to be screened, and the remaining (N-1) labeled data subsets are then used as the training labeled data.
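Step S502 can be sketched as follows. The function name `split_for_screening` and the `seed` parameter are hypothetical and only serve the illustration; the specification states only that K subsets are selected (possibly at random) with 1 ≤ K ≤ N/2.

```python
import random


def split_for_screening(subsets, k, seed=None):
    """Pick k of the N subsets for screening; the rest become training data."""
    n = len(subsets)
    if not (1 <= k <= n / 2):
        raise ValueError("k must satisfy 1 <= k <= N/2")
    rng = random.Random(seed)
    # Randomly choose the indices of the subsets to be screened.
    screen_idx = set(rng.sample(range(n), k))
    to_screen = [s for i, s in enumerate(subsets) if i in screen_idx]
    training = [s for i, s in enumerate(subsets) if i not in screen_idx]
    return to_screen, training


# Hypothetical example with N = 6 subsets and K = 2.
subsets = [[f"record_{i}"] for i in range(6)]
to_screen, training = split_for_screening(subsets, k=2, seed=0)
print(len(to_screen), len(training))
```

Keeping K at or below N/2 means the prediction model is always trained on at least half of the labeled data.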

S504, performing machine learning on the training annotation data to generate an annotation prediction model.

S506, inputting the to-be-labeled data corresponding to the to-be-screened labeled data into the label prediction model for label prediction to obtain predicted labeled data corresponding to the to-be-labeled data.

For steps S504 to S506, reference may be made to the method embodiment shown in fig. 2; details are not repeated here.

S508, determining conflict annotation data according to the annotation data to be screened and the prediction annotation data.

In this embodiment, since the annotation data to be filtered is one or more annotation data subsets, the determined conflicting annotation data may be one or more annotation data in the annotation data subsets. Specifically, when the labeling result of the labeling data in the labeling data subset as the to-be-screened labeling data is inconsistent with the labeling result of the predicted labeling data, the corresponding labeling data in the labeling data subset may be determined as the conflict labeling data. For example, as shown in table 2, the annotation data to be screened includes an annotation data subset 1 and an annotation data subset 2, where an annotation result of annotation data corresponding to a sequence number 2 in the annotation data subset 1 is inconsistent with a predicted annotation result of predicted annotation data, and therefore, the annotation data corresponding to the sequence number 2 in the annotation data subset 1 can be determined as conflict annotation data; the annotation result of the annotation data corresponding to the sequence number 3 in the annotation data subset 2 is inconsistent with the prediction annotation result of the prediction annotation data, so the annotation data corresponding to the sequence number 3 in the annotation data subset 2 can be determined as the conflict annotation data.

TABLE 2

In the embodiment of the present specification, since the conflict annotation data is data that may have problems such as annotation errors, in order to improve the reliability of the annotation prediction model obtained by machine learning in the traversal process, after determining the conflict annotation data by one traversal, the conflict annotation data may be removed from the first annotation data set, so that the conflict annotation data does not exist in the training annotation data in the next traversal process, and thus the annotation prediction model obtained by machine learning using the training annotation data may be more reliable.
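Step S508 and the subsequent removal of conflict annotation data can be sketched as below. The representation of annotation data as `(item, label)` pairs, the function names, and the keyword-based `toy_predict` stand-in for a trained annotation prediction model are all assumptions for illustration.

```python
def find_conflicts(to_screen, predict):
    """Return the (item, label) pairs whose label differs from the prediction."""
    return [(item, label) for item, label in to_screen if predict(item) != label]


def remove_conflicts(dataset, conflicts):
    """Drop the conflicting items so they are not used for training next time."""
    flagged = {item for item, _ in conflicts}
    return [(item, label) for item, label in dataset if item not in flagged]


def toy_predict(text):
    # Hypothetical stand-in for the annotation prediction model.
    return "weather" if "rain" in text else "music"


first_set = [
    ("will it rain", "weather"),
    ("play a song", "weather"),   # label disagrees with the prediction
    ("rain tomorrow", "weather"),
]
conflicts = find_conflicts(first_set, toy_predict)
cleaned = remove_conflicts(first_set, conflicts)
print(conflicts)
print(cleaned)
```

In the described method the model would be trained on the other subsets rather than hard-coded, and the flagged items would be passed on for re-annotation.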

S410, acquiring a second annotation data set, wherein the second annotation data set is annotation data obtained by re-annotating the conflict annotation data obtained in the traversal process according to the preset annotation rule.

S412, determining a third annotation data set according to the first annotation data set and the second annotation data set.

S414, when the evaluation result of the third labeled data set does not meet the preset evaluation condition, taking the third labeled data set as the first labeled data set, and executing the traversal step until the evaluation result of the third labeled data set meets the preset evaluation condition.

In this embodiment of the present specification, when the evaluation result of the third annotation data set meets the preset evaluation condition, it indicates that the annotation quality of the third annotation data set meets the requirement, and the execution may be ended.

For details of the steps S412 to S414, reference may be made to the method embodiment shown in fig. 1, which is not described herein again.

In summary, after the labeled data set is obtained, the embodiment of the invention traverses the labeled data in the labeled data set, determines conflict annotation data during the traversal by using the annotation prediction model, and has the conflict annotation data obtained during the traversal re-annotated. The re-annotated data is merged with the previous labeled data set to obtain a new labeled data set, which is then evaluated. When the evaluation result does not meet the preset evaluation condition, the above operations are repeated until the new labeled data set meets the preset evaluation condition, which improves the labeling quality of the data. Throughout this quality control process, the model and manual work are combined, which reduces labor and time costs and improves the efficiency of quality inspection of the labeled data.
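The overall iteration of steps S408 to S414 can be sketched as a loop. Everything named here is hypothetical: the three callables stand in for the traversal, the re-annotation, and the evaluation; the numeric `threshold` is one possible form of the preset evaluation condition; and `max_rounds` is a safety bound added for the sketch, not part of the described method.

```python
def refine_annotations(first_set, detect, relabel, evaluate,
                       threshold, max_rounds=10):
    """Iterate traverse -> re-annotate -> merge -> evaluate.

    first_set: dict mapping item -> label.
    detect(dataset)  -> list of items flagged as conflict annotation data.
    relabel(items)   -> dict item -> new label (the second data set).
    evaluate(dataset)-> score compared against `threshold`.
    """
    current = dict(first_set)
    for _ in range(max_rounds):  # assumption: bound the number of rounds
        conflicts = detect(current)
        second_set = relabel(conflicts)
        # Third data set: conflicts replaced by their re-annotated versions.
        current = {**current, **second_set}
        if evaluate(current) >= threshold:
            break
    return current


# Toy example in which the "truth" plays the role of the human annotator.
truth = {"a": 1, "b": 0, "c": 1}
noisy = {"a": 1, "b": 1, "c": 1}
result = refine_annotations(
    noisy,
    detect=lambda d: [k for k, v in d.items() if v != truth[k]],
    relabel=lambda items: {k: truth[k] for k in items},
    evaluate=lambda d: sum(d[k] == truth[k] for k in d) / len(d),
    threshold=1.0,
)
print(result)
```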

Corresponding to the data processing methods provided by the above embodiments, embodiments of the present invention further provide a data processing apparatus, and since the data processing apparatus provided by the embodiments of the present invention corresponds to the data processing methods provided by the above embodiments, the implementation of the foregoing data processing method is also applicable to the data processing apparatus provided by the embodiments, and is not described in detail in the embodiments.

Referring to fig. 6, a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention is shown, and as shown in fig. 6, the apparatus may include: a first acquisition module 610, a traversal module 620, a second acquisition module 630, a first determination module 640, and a loop processing module 650.

The first obtaining module 610 may be configured to obtain a first labeled data set, where the first labeled data set is labeled data obtained by labeling to-be-labeled data according to a preset labeling rule;

a traversing module 620, configured to traverse the labeled data in the first labeled data set, and determine conflicting labeled data by using a label prediction model when traversing the labeled data in the first labeled data set;

a second obtaining module 630, configured to obtain a second labeled data set, where the second labeled data set is labeled data obtained by labeling the conflict label data obtained in the traversal process according to the preset label rule;

a first determining module 640, configured to determine a third annotation data set according to the first annotation data set and the second annotation data set;

the loop processing module 650 is configured to, when the evaluation result of the third labeled data set does not satisfy a preset evaluation condition, use the third labeled data set as the first labeled data set, and execute the traversal step until the evaluation result of the third labeled data set satisfies the preset evaluation condition.

In one example, as shown in fig. 7, traversal module 620 can include: a selection module 6210, a generation module 6220, a prediction module 6230 and a second determination module 6240.

A selecting module 6210, configured to select at least one piece of annotation data from the first annotation data set as annotation data to be screened, and use the annotation data of the first annotation data set from which the annotation data to be screened is removed as training annotation data;

a generating module 6220, configured to perform machine learning on the training annotation data to generate an annotation prediction model;

the prediction module 6230 is configured to input the to-be-labeled data corresponding to the to-be-screened labeled data into the label prediction model for label prediction, so as to obtain predicted labeled data corresponding to the to-be-labeled data;

the second determining module 6240 may be configured to determine the conflicting annotation data according to the annotation data to be filtered and the predictive annotation data.

In a specific example, the second determining module 6240 is specifically configured to determine the annotation data to be filtered as conflict annotation data when the annotation data to be filtered is inconsistent with the prediction annotation data.

In another example, as shown in fig. 8, the apparatus may include: a first obtaining module 610, a traversing module 620, a second obtaining module 630, a first determining module 640, a loop processing module 650, a third obtaining module 660, and a splitting module 670.

A third obtaining module 660, configured to obtain data characteristics of the labeled data in the first labeled data set;

the splitting module 670 is configured to split the first labeled data set into N labeled data subsets, where data characteristics of labeled data included in the labeled data subsets satisfy a preset distribution rule, and N is greater than or equal to 2.

In this example, the first obtaining module 610, the traversing module 620, the second obtaining module 630, the first determining module 640 and the loop processing module 650 may refer to the embodiment of the apparatus shown in fig. 6. The traversal module 620 may have a structure as shown in fig. 7, where the selection module 6210 is specifically configured to select K labeled data subsets from the N labeled data subsets as labeled data to be screened, and use the remaining (N-K) labeled data subsets as training labeled data, where 1 ≤ K ≤ N/2.
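One possible way for a splitting module to produce N subsets with similar feature proportions is a round-robin split grouped by feature. This is only a sketch of one such strategy; the specification does not state how the split is performed, and the function name `stratified_split` and the `key` parameter are hypothetical.

```python
from collections import defaultdict


def stratified_split(records, n, key):
    """Split records into n subsets with roughly equal feature proportions."""
    if n < 2:
        raise ValueError("n must be at least 2")
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    subsets = [[] for _ in range(n)]
    position = 0
    # Deal the records of each feature group out to the subsets in turn.
    for group in groups.values():
        for record in group:
            subsets[position % n].append(record)
            position += 1
    return subsets


# Hypothetical data: 18 records of feature "x" and 2 of feature "y".
records = [("x", i) for i in range(18)] + [("y", i) for i in range(2)]
parts = stratified_split(records, n=2, key=lambda r: r[0])
print([len(p) for p in parts])
```

With this input each subset receives 9 records of feature "x" and 1 of feature "y", so both subsets satisfy a 9:1 proportion rule.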

Optionally, as shown in fig. 7, the traversing module 620 may further include:

a first culling module 6250 may be configured to cull the conflicting annotation data from the first annotation data set.

In a specific example, as shown in fig. 9, the first determining module 640 may include:

a replacing module 6410, configured to replace the conflicting annotation data in the first annotation data set with the annotation data in the second annotation data set to obtain a third annotation data set.

Alternatively, as shown in fig. 10, the apparatus may include: a first obtaining module 610, a traversing module 620, a second obtaining module 630, a first determining module 640, a loop processing module 650, an extracting module 680, a counting module 690, and a calculating module 6010.

An extracting module 680, configured to extract a first amount of labeled data from the third labeled data set as sample labeled data;

the counting module 690 may be configured to count a second number of the labeled data satisfying the preset labeling rule in the sample labeled data;

the calculating module 6010 may be configured to calculate a ratio of the second quantity to the first quantity, and use the ratio as an evaluation result of the third annotation data set.

In this example, the first obtaining module 610, the traversing module 620, the second obtaining module 630, the first determining module 640 and the loop processing module 650 may refer to the embodiment of the apparatus shown in fig. 6.
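The extraction, counting and calculating modules together compute a sampled pass ratio, which can be sketched as follows. The function name, the `satisfies_rule` callable (standing in for a check against the preset labeling rule, which may in practice be a manual inspection), and the `seed` parameter are assumptions for illustration.

```python
import random


def evaluate_by_sampling(third_set, first_quantity, satisfies_rule, seed=None):
    """Return second_quantity / first_quantity for a random sample."""
    rng = random.Random(seed)
    # Extract a first quantity of labeled data as sample labeled data.
    size = min(first_quantity, len(third_set))
    if size == 0:
        raise ValueError("cannot evaluate an empty data set")
    sample = rng.sample(third_set, size)
    # Count the second quantity: sample records that satisfy the rule.
    second_quantity = sum(1 for record in sample if satisfies_rule(record))
    return second_quantity / size


# Hypothetical data: each record carries a flag saying whether it passes.
data = [("ok", True)] * 8 + [("bad", False)] * 2
print(evaluate_by_sampling(data, 10, lambda r: r[1], seed=1))
```

The returned ratio would then be compared with the preset evaluation condition to decide whether another traversal is needed.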

To sum up, after obtaining the labeled data set, the data processing apparatus provided in the embodiment of the present invention traverses the labeled data in the labeled data set, determines conflict labeled data during the traversal by using a label prediction model, and has the conflict labeled data obtained during the traversal re-labeled. The re-labeled data is merged with the previous labeled data set to obtain a new labeled data set, which is then evaluated. When the evaluation result does not meet the preset evaluation condition, the above operations are repeated until the new labeled data set meets the preset evaluation condition, which improves the labeling quality of the data. Throughout this quality control process, the model and manual work are combined, which reduces labor and time costs and improves the efficiency of quality inspection of the labeled data.

It should be noted that, when the apparatus provided in the foregoing embodiment implements the functions thereof, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above.

Fig. 11 is a schematic structural diagram of a network device according to an embodiment of the present invention, where the network device is configured to implement the data processing method provided in the foregoing embodiment. The network device may be a terminal device such as a PC (personal computer), a mobile phone, a PDA, or a tablet computer, or a service device such as an application server or a cluster server. Referring to fig. 11, the internal structure of the network device may include, but is not limited to: a processor, a network interface, and a memory. The processor, the network interface, and the memory in the network device may be connected by a bus or in other manners; fig. 11 shows connection by a bus as an example.

The processor (or CPU) is the computing core and control core of the network device. The network interface may optionally include a standard wired interface and a wireless interface (e.g., Wi-Fi, a mobile communication interface, etc.). The memory is a storage device in the network device for storing programs and data. It is understood that the memory herein may be a high-speed RAM storage device, or may be a non-volatile storage device (non-volatile memory), such as at least one magnetic disk storage device; optionally, it may also be at least one storage device located remotely from the processor. The memory provides storage space that stores the operating system of the network device, which may include, but is not limited to: a Windows system, a Linux system, an Android system, an iOS system, etc., which is not limited in the present invention. In addition, one or more instructions, which may be one or more computer programs (including program code), are stored in the storage space and are adapted to be loaded and executed by the processor. In this embodiment of the present specification, the processor loads and executes one or more instructions stored in the memory to implement the data processing method provided by the foregoing method embodiment.

Embodiments of the present invention also provide a storage medium that can be disposed in a network device to store at least one instruction, at least one program, a code set, or a set of instructions related to implementing a data processing method in the method embodiments, where the at least one instruction, the at least one program, the code set, or the set of instructions can be loaded and executed by a processor of the network device to implement the data processing method provided by the above method embodiments.

Optionally, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and various other media capable of storing program code.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A method of data processing, the method comprising:

when the evaluation result of the third annotation data set does not meet a preset evaluation condition, taking the third annotation data set as the first annotation data set, and executing the traversal step until the evaluation result of the third annotation data set meets the preset evaluation condition;

the traversing the labeled data in the first labeled data set, and when traversing the labeled data in the first labeled data set, determining the conflicting labeled data by using a label prediction model includes:

selecting at least one piece of annotation data from the first annotation data set as annotation data to be screened, and taking the annotation data of the first annotation data set without the annotation data to be screened as training annotation data;

performing machine learning on the training annotation data to generate an annotation prediction model;

inputting the data to be labeled corresponding to the labeled data to be screened into the labeling prediction model for labeling prediction to obtain predicted labeled data corresponding to the data to be labeled;

determining conflict annotation data according to the annotation data to be screened and the prediction annotation data;

and removing the conflict annotation data from the first annotation data set.

2. The data processing method of claim 1, wherein prior to traversing the annotation data in the first set of annotation data, the method further comprises:

acquiring data characteristics of the labeled data in the first labeled data set;

splitting the first labeling data set into N labeling data subsets, wherein the data characteristics of labeling data contained in the labeling data subsets meet a preset distribution rule, and N is more than or equal to 2;

the selecting at least one piece of annotation data from the first annotation data set as annotation data to be screened, and using the annotation data of the first annotation data set without the annotation data to be screened as training annotation data comprises:

and selecting K labeling data subsets from the N labeling data subsets as annotation data to be screened, and taking the remaining (N-K) labeling data subsets as training annotation data, wherein 1 ≤ K ≤ N/2.

3. The data processing method according to claim 1, wherein the determining conflicting annotation data according to the annotation data to be filtered and the predictive annotation data comprises:

and when the annotation data to be screened is inconsistent with the prediction annotation data, determining the annotation data to be screened as conflict annotation data.

4. The data processing method of any of claims 1 to 3, wherein determining a third annotated data set from the first and second annotated data sets comprises:

and replacing the conflict annotation data in the first annotation data set with the annotation data in the second annotation data set to obtain a third annotation data set.

5. The data processing method according to claim 4, wherein when the evaluation result of the third annotation data set does not satisfy a preset evaluation condition, the method further comprises, before the third annotation data set is taken as the first annotation data set:

extracting a first amount of marking data from the third marking data set as sample marking data;

counting a second quantity of the labeling data meeting the preset labeling rule in the sample labeling data;

and calculating the ratio of the second quantity to the first quantity, and taking the ratio as the evaluation result of the third annotation data set.

6. A data processing apparatus, characterized in that the apparatus comprises:

the loop processing module is used for taking the third labeled data set as the first labeled data set when the evaluation result of the third labeled data set does not meet the preset evaluation condition, and executing the traversal step until the evaluation result of the third labeled data set meets the preset evaluation condition;

the traversal module comprises:

the selecting module is used for selecting at least one piece of marking data from the first marking data set as marking data to be screened, and taking the marking data of the first marking data set without the marking data to be screened as training marking data;

the generating module is used for performing machine learning on the training annotation data to generate an annotation prediction model;

the prediction module is used for inputting the data to be labeled corresponding to the labeled data to be screened into the labeling prediction model for labeling prediction to obtain predicted labeled data corresponding to the data to be labeled;

and the second determining module is used for determining conflict annotation data according to the annotation data to be screened and the prediction annotation data and removing the conflict annotation data from the first annotation data set.

7. The data processing apparatus of claim 6, wherein the apparatus further comprises:

a third obtaining module, configured to obtain data characteristics of the labeled data in the first labeled data set;

the splitting module is used for splitting the first labeling data set into N labeling data subsets, the data characteristics of the labeling data contained in the labeling data subsets meet a preset distribution rule, and N is more than or equal to 2;

the selection module is specifically used for selecting K labeled data subsets from the N labeled data subsets as labeled data to be screened, and taking the remaining (N-K) labeled data subsets as training labeled data, wherein 1 ≤ K ≤ N/2.

8. The data processing apparatus according to claim 6, wherein the second determining module is specifically configured to determine the annotation data to be filtered as conflicting annotation data when the annotation data to be filtered is inconsistent with the predictive annotation data.

9. The data processing apparatus according to any of claims 6 to 8, wherein the first determining module comprises:

and the replacing module is used for replacing the conflict annotation data in the first annotation data set with the annotation data in the second annotation data set to obtain a third annotation data set.

10. The data processing apparatus of claim 9, wherein the apparatus further comprises:

the extraction module is used for extracting a first amount of labeled data from the third labeled data set as sample labeled data;

the counting module is used for counting a second quantity of the labeling data meeting the preset labeling rule in the sample labeling data;

and the calculating module is used for calculating the ratio of the second quantity to the first quantity, and taking the ratio as the evaluation result of the third labeling data set.

11. A network device, comprising:

memory storing one or more instructions adapted to be loaded by the processor and to perform the data processing method of any of claims 1-5.

12. A computer storage medium having stored therein at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by a processor to implement the data processing method of any one of claims 1 to 5.