CN109993315B - Data processing method and device and electronic equipment - Google Patents

Info

Publication number
CN109993315B
Authority
CN
China
Prior art keywords
data
task
target user
labeling
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910251585.7A
Other languages
Chinese (zh)
Other versions
CN109993315A (en)
Inventor
何向宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd
Priority to CN201910251585.7A
Publication of CN109993315A
Application granted
Publication of CN109993315B
Legal status: Active
Anticipated expiration

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • User Interface Of Digital Computer (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The application discloses a data processing method, a data processing device and electronic equipment. The method comprises: acquiring a first type of task data having a preset labeling result; forming a task set from the first type of task data and a second type of task data, the task data in the task set being labeled by a target user; obtaining a marked data set comprising at least one piece of marked data; and, if first annotation data in the marked data set meets a first condition, dividing the first annotation data into the first type of task data and generating a preset labeling result for the first annotation data from the user annotation information in the first annotation data. The first condition comprises: the first annotation data is generated by labeling the second type of task data; and a first target user and a second target user meet a second condition, wherein the first target user is the target user corresponding to the first annotation data, and the second target user is a target user corresponding to second annotation data generated by labeling the first type of task data.

Description

Data processing method and device and electronic equipment
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a data processing method and apparatus, and an electronic device.
Background
In artificial intelligence, it is necessary to collect annotation data output by task users who perform labeling in a crowdsourcing mode, and to provide the annotation data to subsequent applications such as machine learning. Annotation data refers to data labeled with an option answer or a judgment answer, for example, choice question data whose answer is labeled as option A.
Meanwhile, the annotation data output by task users usually needs to be checked and verified to determine whether it is labeled correctly, for example, whether the labeling result "option A" of choice question data is the correct answer, or whether the labeling result "error" of judgment question data is the correct answer.
However, existing manual checking and auditing methods are generally inefficient.
Disclosure of Invention
In view of the above, the present application provides a data processing method, including:
acquiring task data of a first type, wherein the task data of the first type has a preset labeling result;
forming a task set by the first type of task data and the second type of task data, and labeling the task data in the task set by at least one target user;
obtaining a marked data set, wherein the marked data set comprises at least one piece of marked data, and the marked data is data obtained by marking task data in a task set by the target user;
if the first labeling data in the labeling data set meets a first condition, dividing the first labeling data into the first type of task data, and generating a preset labeling result of the first labeling data by using user labeling information in the first labeling data;
wherein the first condition comprises:
the first annotation data is generated by labeling the second type of task data;
and the first target user and the second target user meet a second condition, the first target user is a target user corresponding to the first annotation data, and the second target user is a target user corresponding to second annotation data generated by the first type of task data.
In the above method, preferably, the second condition includes:
the first target user is a target user, among the second target users, whose labeling accuracy is higher than a first threshold.
In the above method, preferably, the second condition includes:
the first target user is a target user, among the second target users, whose first annotation data has the same user annotation information as that of other target users, and the proportion of the first target users among the second target users is higher than a second threshold.
In the above method, preferably, the labeling accuracy of the second target user is obtained by:
comparing the user annotation information in the second annotation data with the corresponding preset labeling result to obtain a comparison result, wherein the comparison result indicates whether the second target user labeled accurately;
and generating the labeling accuracy of the second target user based on the comparison result.
In the above method, preferably, generating the labeling accuracy of the second target user based on the comparison result includes:
obtaining, based on the comparison result, the number of items of the first type of task data that the second target user labeled accurately;
and generating the labeling accuracy of the second target user based on that number and the total number of items of the first type of task data.
The above method, preferably, further comprises:
and outputting the user identification of the first target user.
In the above method, preferably, in the task set, the number of the first type of task data and the number of the second type of task data are in a preset ratio.
The present application also provides a data processing apparatus, including:
the labeling unit is used for obtaining a first type of task data, and the first type of task data has a preset labeling result; forming a task set by the first type of task data and the second type of task data, and labeling the task data in the task set by at least one target user;
the acquiring unit is used for acquiring the marking data set, wherein the marking data set comprises at least one piece of marking data, and the marking data is data obtained by marking the task data in the task set by the target user;
the dividing unit is used for dividing the first annotation data in the marking data set into the first type of task data if the first annotation data meets a first condition, and generating a preset annotation result of the first annotation data by using user annotation information in the first annotation data;
wherein the first condition comprises:
the first annotation data is generated by labeling the second type of task data;
and a first target user and a second target user meet a second condition, the first target user is a target user corresponding to the first annotation data, and the second target user is a target user corresponding to second annotation data generated by the first type of task data.
The present application further provides an electronic device, including:
the memory is used for storing the application program and data generated by the running of the application program;
a processor for executing the application to implement: acquiring task data of a first type, wherein the task data of the first type has a preset labeling result; forming a task set by the first type of task data and the second type of task data, and labeling the task data in the task set by at least one target user; obtaining the marking data set, wherein the marking data set comprises at least one piece of marking data, and the marking data is data obtained by marking the task data in the task set by the target user; if the first labeling data in the labeling data set meets a first condition, dividing the first labeling data into the first type of task data, and generating a preset labeling result of the first labeling data by using user labeling information in the first labeling data;
wherein the first condition comprises:
the first annotation data is generated by labeling the second type of task data;
and the first target user and the second target user meet a second condition, the first target user is a target user corresponding to the first annotation data, and the second target user is a target user corresponding to second annotation data generated by the first type of task data.
The present application also provides another electronic device, comprising:
the input and output device is used for obtaining a first type of task data, and the first type of task data has a preset labeling result; forming a task set by the first type of task data and the second type of task data, and labeling the task data in the task set by at least one target user;
the processor is used for obtaining the marking data set, wherein the marking data set comprises at least one piece of marking data, and the marking data is data obtained by marking the task data in the task set by the target user; if the first labeling data in the labeling data set meets a first condition, dividing the first labeling data into the first type of task data, and generating a preset labeling result of the first labeling data by using user labeling information in the first labeling data;
wherein the first condition comprises:
the first annotation data is generated by labeling the second type of task data;
and the first target user and the second target user meet a second condition, the first target user is a target user corresponding to the first annotation data, and the second target user is a target user corresponding to second annotation data generated by the first type of task data.
According to the technical scheme, the task data with the preset labeling result is added to the task set needing to be labeled, so that after labeling of at least one target user, other task data without the preset labeling result are verified and divided based on the task data with the preset labeling result, and the task data with the preset labeling result is further supplemented. Therefore, the labeling verification is achieved by setting the preset labeling result in the task data, manual verification is avoided, and the verification efficiency is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 and fig. 2 are flow charts of a data processing method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a data processing apparatus according to a second embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to a third embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present application;
fig. 6 is a diagram illustrating an application example of an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As shown in fig. 1, which is a flowchart of a data processing method disclosed in an embodiment of the present application, the method in this embodiment may be applied to an electronic device with data processing capability, such as a computer or a server, and is mainly used for processing task data labeled by a user.
Specifically, the method in this embodiment may include the following steps:
Step 101: a first type of task data is obtained.
The first type of task data has a preset labeling result. The preset labeling result refers to data that is regarded as the correct labeling result. In this embodiment, task data that has been determined to have a correct labeling result may be read from a database as the first type of task data.
In this embodiment, a plurality of first types of task data may be provided.
Step 102: a task set is formed from the first type of task data and the second type of task data, and the task data in the task set is labeled by at least one target user.
In this embodiment, the first type of task data with preset labeling results is added to the second type of task data that needs to be labeled to form a task set.
It should be noted that, when at least one target user labels the task data in the task set, the task set does not reveal to the target user whether a given piece of task data has a preset labeling result. From the target user's point of view, all task data are to be labeled, and the target user does not know which task data has a preset labeling result or what that result is.
In this embodiment, the first type of task data and the second type of task data may be obtained from a database to form the task set, and the task set is transmitted to a labeling terminal of the target user so that the target user can label the task data in the task set.
It should be noted that, in this embodiment, a crowdsourcing mode may be adopted to distribute task data in a task set to a target user, and the target user labels the distributed task data.
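As an illustration of Step 102, the following is a minimal Python sketch of how a task set might be assembled so that the target user cannot tell which items carry a preset labeling result. The names `Task`, `build_task_set` and `to_annotator_view`, the field names, and the use of shuffling are all assumptions made for this sketch; the embodiment does not prescribe any particular data layout or mixing strategy.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_id: str
    content: str
    # Preset labeling result; None for second-type (unlabeled) task data.
    preset_label: Optional[str] = None


def build_task_set(first_type, second_type, seed=None):
    """Combine first-type and second-type task data into one shuffled set."""
    tasks = list(first_type) + list(second_type)
    random.Random(seed).shuffle(tasks)  # assumption: order is randomized
    return tasks


def to_annotator_view(tasks):
    """Return only what the target user sees: no preset result, no type."""
    return [{"task_id": t.task_id, "content": t.content} for t in tasks]


first_type = [Task("g1", "question 1", preset_label="A")]
second_type = [Task("u1", "question 2"), Task("u2", "question 3")]
task_set = build_task_set(first_type, second_type, seed=0)
view = to_annotator_view(task_set)
```

The preset labeling results stay on the server side in `task_set`, while `view` is what would be sent to the labeling terminal.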
Step 103: a marker data set is obtained.
The mark data set comprises at least one piece of mark data, and the mark data are data obtained by marking the task data in the task set by the target user, namely: the annotation data is data for setting user annotation information of a target user on the task data, for example, the task data: selecting question data comprising a question stem and 4 options, and after at least one target user marks the question stem, obtaining marking data comprising: the question stem and 4 options and the user with answer A are labeled with information.
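One possible in-memory shape for a piece of annotation data, matching the choice question example above, is sketched below in Python. The class name `Annotation` and its field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Annotation:
    task_id: str               # which task data was labeled
    user_id: str               # which target user labeled it
    stem: str                  # question stem of the choice question
    options: Tuple[str, ...]   # the 4 options
    user_label: str            # user annotation information, e.g. "A"


record = Annotation(
    task_id="q1",
    user_id="a",
    stem="example question stem",
    options=("A", "B", "C", "D"),
    user_label="A",
)
marked_data_set = [record]  # the set holds at least one such record
```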
Step 104: first annotation data is obtained in the marked data set that satisfies a first condition.
Wherein the first condition comprises:
the first annotation data is generated by the task data mark of the second type;
and the first target user and the second target user meet a second condition, the first target user is a target user corresponding to the first annotation data, and the second target user is a target user corresponding to the second annotation data generated by the first type of task data.
That is to say, the first annotation data obtained in this embodiment is, firstly, annotation data generated by the target user annotating the task data of the second type of task data set without the preset annotation result, and secondly, for the target user corresponding to the first annotation data, the target user meets a second condition in the target user participating in the annotation operation on the task data with the preset annotation result, where the second condition is related to the user annotation information of the annotation data by the target user.
As can be seen, in this embodiment, based on the preset labeling result, the labeling data generated by the task data without the preset labeling result is verified, and the corresponding first labeling data is obtained.
Step 105: the first annotation data is divided into the first type of task data, and a preset labeling result of the first annotation data is generated from the user annotation information in the first annotation data.
After the first annotation data is obtained, it is regarded as annotation data whose user annotation information is correct. This embodiment therefore divides the first annotation data into the first type of task data and treats its user annotation information as the correct labeling result. For first annotation data divided into the first type of task data, the user annotation information is set as the preset labeling result of that data, thereby enriching and updating the first type of task data.
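Step 105 can be sketched in Python as follows, assuming the set of users who satisfy the second condition has already been determined. The function name `promote_to_first_type`, the dictionary layout, and the rule of keeping the first qualified label seen for a task are assumptions of this sketch; the embodiment does not state how conflicting labels from several qualified users would be resolved.

```python
def promote_to_first_type(annotations, first_type, qualified_users):
    """Return an updated mapping task_id -> preset labeling result.

    annotations: list of dicts with keys task_id, user_id, user_label.
    first_type: dict of task_id -> preset labeling result (not modified).
    qualified_users: user ids that satisfy the second condition.
    """
    updated = dict(first_type)
    for ann in annotations:
        is_second_type = ann["task_id"] not in first_type
        if is_second_type and ann["user_id"] in qualified_users:
            # Assumption: the first qualified label seen for a task wins.
            updated.setdefault(ann["task_id"], ann["user_label"])
    return updated


first_type = {"g1": "A"}
annotations = [
    {"task_id": "g1", "user_id": "a", "user_label": "A"},
    {"task_id": "u1", "user_id": "a", "user_label": "B"},
    {"task_id": "u1", "user_id": "c", "user_label": "D"},
    {"task_id": "u2", "user_id": "c", "user_label": "C"},
]
updated = promote_to_first_type(annotations, first_type, {"a"})
```

Here only the label that user `a` gave to `u1` is promoted; `u2` stays in the second type because it was labeled only by an unqualified user.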
According to the above scheme, in the data processing method provided in the first embodiment of the present application, the task data with the preset labeling result is added to the task set that needs to be labeled, so that after the labeling of at least one target user, other task data without the preset labeling result is verified and divided based on the task data with the preset labeling result, and the task data with the preset labeling result is further supplemented. Therefore, in the embodiment, the marking verification is realized by setting the preset marking result in the task data, so that the manual verification is avoided, and the verification efficiency is improved.
In one implementation, the second condition satisfied by the first target user and the second target user may be:
the first target user is a target user with the annotation accuracy higher than a first threshold value in the second target users.
The labeling accuracy can be understood as the proportion, among all annotation data generated by a target user, of the annotation data whose user annotation information is regarded as a correct labeling result.
Specifically, in this embodiment, the labeling accuracy of the second target user may be obtained through the following steps:
First, the user annotation information in the second annotation data is compared with the corresponding preset labeling result to obtain a comparison result indicating whether the second target user labeled accurately; the labeling accuracy of the second target user is then generated based on the comparison result.
Specifically, when generating the labeling accuracy of the second target user based on the comparison result, this embodiment may obtain from the comparison result the number of items of the first type of task data that the second target user labeled accurately, that is, the number of accurately labeled second annotation data, and then generate the labeling accuracy from that number and the total number of items of the first type of task data, that is, the total number of second annotation data.
That is, the second annotation data generated when the second target user labels the first type of task data corresponds to a preset labeling result. In this embodiment, the user annotation information in the second annotation data is therefore compared with the preset labeling result to determine whether the two are consistent. If they are consistent, the comparison result indicates that the second target user labeled the second annotation data accurately; if not, it indicates that the labeling was inaccurate. After the comparison result is obtained, the labeling accuracy of the second target user can be generated from the number of accurately labeled second annotation data and the total number of second annotation data; for example, the labeling accuracy is the number of accurately labeled second annotation data divided by the total number of second annotation data.
It should be noted that the labeling accuracy calculated above is the accuracy with which the second target user labeled the first type of task data, that is, the accuracy of the second annotation data. This accuracy can also be regarded as the second target user's labeling accuracy over all annotation data and, correspondingly, as the labeling accuracy of that user's first annotation data.
The first threshold may be a preset value or a dynamically changing value, for example 50% or 80%. A target user among the second target users whose labeling accuracy is higher than the first threshold, for example above 80%, is a first target user and is regarded as a high-quality labeling user. The annotation data generated when the first target user labels the second type of task data is the first annotation data, which at this point satisfies the first condition. In this embodiment, the first annotation data is divided into the first type of task data, and the user annotation information in the first annotation data is set as its preset labeling result.
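The accuracy calculation and the first-threshold test described above can be sketched in Python as follows. The function names, the tuple layout `(user_id, task_id, label)`, and the sample data are assumptions for illustration only; the comparison uses "higher than" as stated in the second condition.

```python
from collections import defaultdict


def labeling_accuracy(annotations, preset):
    """Per-user accuracy on first-type task data.

    annotations: iterable of (user_id, task_id, label).
    preset: dict of task_id -> preset labeling result (first type only).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for user_id, task_id, label in annotations:
        if task_id not in preset:
            continue  # second-type task data has no preset result
        total[user_id] += 1
        if label == preset[task_id]:
            correct[user_id] += 1
    return {u: correct[u] / total[u] for u in total}


def users_above_threshold(accuracy, first_threshold):
    """Users whose labeling accuracy is higher than the first threshold."""
    return {u for u, acc in accuracy.items() if acc > first_threshold}


preset = {"g1": "A", "g2": "B", "g3": "C", "g4": "D", "g5": "A"}
annotations = [
    ("a", "g1", "A"), ("a", "g2", "B"), ("a", "g3", "C"),
    ("a", "g4", "D"), ("a", "g5", "A"),          # user a: 5 of 5 correct
    ("c", "g1", "A"), ("c", "g2", "C"), ("c", "g3", "C"),
    ("c", "g4", "A"), ("c", "g5", "B"),          # user c: 2 of 5 correct
    ("a", "u1", "B"),                            # second type, ignored
]
accuracy = labeling_accuracy(annotations, preset)
qualified = users_above_threshold(accuracy, 0.8)
```

With a first threshold of 80%, only user `a` qualifies as a first target user in this sample.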
In another implementation, the second condition satisfied by the first target user and the second target user may be:
the first target user is a target user, among the second target users, whose first annotation data has the same user annotation information as that of other target users, and the proportion of the first target users among the second target users is higher than a second threshold.
A target user among the second target users whose first annotation data has the same user annotation information refers to a target user who, for one or more pieces of task data of the second type, produced the same user annotation information as other second target users. The proportion of the first target users among the second target users being higher than the second threshold means that, among the second target users, the proportion of target users who produced that same user annotation information is higher than the second threshold. For example, if the target users who labeled a certain choice question with option A account for 70% of all target users who labeled that choice question, and 70% is higher than the second threshold, these target users are the first target users.
That is to say, when selecting high-quality labeling users in this embodiment, the second target users who labeled the first type of task data are considered first. Among them, if the target users who gave the same user annotation information for a piece of task data account for a proportion of all target users labeling that task data that exceeds the second threshold, these target users can be identified as high-quality labeling users. The first annotation data that these high-quality labeling users generated by labeling the second type of task data is then regarded as correctly labeled data, that is, it is divided into the first type of task data, and its user annotation information is set as the preset labeling result.
The second threshold may be a preset value or a dynamically changing value, for example 60% or 75%. That is, the target users among the second target users who gave the same user annotation information for one or more pieces of task data, and whose proportion exceeds the second threshold, are the first target users. The first annotation data generated when the first target users label the second type of task data is then determined to be correctly labeled data, that is, it is divided into the first type of task data, and its user annotation information is set as the preset labeling result.
For example, target users a, b, c, d, and e each label a choice question, and their labeling results are option A, option A, option C, option D, and option A, respectively. The target users having the same user annotation information are a, b, and e; these 3 target users account for 60% of the total and exceed a second threshold of 50%. The user annotation information option A is therefore regarded as the correct labeling result, and target users a, b, and e are regarded as high-quality labeling users. The first annotation data generated by target users a, b, and e labeling the second type of task data is divided into the first type of task data, and the user annotation information option A is set as the preset labeling result.
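The agreement-based condition in this example can be sketched in Python as follows. The function name `majority_users` and its return shape are assumptions of this sketch; it picks the most common user annotation information for one piece of task data and checks whether the share of users who gave it is higher than the second threshold.

```python
from collections import Counter


def majority_users(labels_by_user, second_threshold):
    """Return (label, users, share) for the most common label.

    labels_by_user: dict of user_id -> label for one piece of task data.
    If the share does not exceed the threshold, label is None and the
    user set is empty.
    """
    if not labels_by_user:
        return None, set(), 0.0
    label, votes = Counter(labels_by_user.values()).most_common(1)[0]
    share = votes / len(labels_by_user)
    if share > second_threshold:
        users = {u for u, l in labels_by_user.items() if l == label}
        return label, users, share
    return None, set(), share


# Users a, b and e chose option A; c chose C; d chose D.
labels = {"a": "A", "b": "A", "c": "C", "d": "D", "e": "A"}
label, users, share = majority_users(labels, 0.5)
```

With a second threshold of 50%, option A (3 of 5 users, 60%) passes, so a, b and e are selected. With a threshold of 50% or more, at most one label can exceed it, so tie handling does not matter in that range.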
Based on the above, the two implementations may be combined in the second condition, that is, the second condition includes:
the first target user is a target user, among the second target users, whose labeling accuracy is higher than a first threshold;
and the first target user is a target user, among the second target users, whose first annotation data has the same user annotation information as that of other target users, and the proportion of the first target users among the second target users is higher than a second threshold.
That is to say, when a high-quality first target user is selected in this embodiment, the labeling accuracy of the first target user must be higher than the first threshold, and the proportion, among the second target users, of target users sharing the first target user's user annotation information must also be higher than the second threshold.
For example, based on how target users a, b, c, d, and e labeled the first type of task data, the labeling accuracy of target users a, b, and d is determined to be higher than 80%. Meanwhile, target users a, b, c, d, and e each label a choice question, with labeling results option A, option A, option C, option D, and option A, respectively, so target users a, b, and e satisfy the condition on shared user annotation information. Target users a and b therefore satisfy both the condition that the labeling accuracy is higher than the first threshold and the condition that the proportion of target users with the same user annotation information among the second target users is higher than the second threshold. The first annotation data generated by target users a and b labeling the second type of task data is divided into the first type of task data, and the user annotation information option A is set as the preset labeling result.
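The combined condition can be sketched in Python as an intersection of the two user sets. The function name `first_target_users` is hypothetical, and the specific accuracy values below are invented for illustration; the text only states that a, b and d are above 80%.

```python
def first_target_users(accuracy, labels_by_user,
                       first_threshold, second_threshold):
    """Users passing both the accuracy test and the agreement test."""
    accurate = {u for u, acc in accuracy.items() if acc > first_threshold}

    # Count how many users gave each label for this piece of task data.
    counts = {}
    for label in labels_by_user.values():
        counts[label] = counts.get(label, 0) + 1

    total = len(labels_by_user)
    agreeing = {
        u for u, label in labels_by_user.items()
        if counts[label] / total > second_threshold
    }
    return accurate & agreeing


# Assumed accuracies: a, b and d are above 80%; c and e are not.
accuracy = {"a": 0.9, "b": 0.85, "c": 0.6, "d": 0.9, "e": 0.7}
labels = {"a": "A", "b": "A", "c": "C", "d": "D", "e": "A"}
selected = first_target_users(accuracy, labels, 0.8, 0.5)
```

User d is accurate but disagrees with the majority, and user e agrees with the majority but is not accurate enough, so only a and b are selected.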
In one implementation, after step 105, the following steps may be further included in this embodiment, as shown in fig. 2:
Step 106: the user identifier of the first target user is output.
Based on the above, the first target user is regarded as a high-quality labeling user. The user identifier of the first target user may therefore be output to prompt staff to perform subsequent processing, such as further processing the first target user's annotation data or rewarding the first target user.
Specifically, in this embodiment, the user identifier of the first target user may be output to a display screen or to a user terminal of the first target user.
In a specific implementation, the number of items of the first type of task data and the number of items of the second type of task data in the task set are in a preset ratio. The first type of task data should not be too numerous, so that the efficiency of labeling unlabeled data is not affected, and should not be too few, so that the accuracy of subsequent processing such as the labeling accuracy calculation is not affected. For example, the preset ratio may be 1:4 or 3:7.
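As a small worked example of the preset ratio, the following Python sketch computes how many items of the first type would be mixed in for a given number of second-type items. The function name `first_type_count` and the choice to round up are assumptions of this sketch; the embodiment does not specify how fractional counts are handled.

```python
def first_type_count(second_type_count, ratio_first, ratio_second):
    """Items of the first type needed for a ratio_first:ratio_second mix.

    Uses integer ceiling division, so the result is rounded up
    (an assumption; rounding is not specified in the text).
    """
    return (second_type_count * ratio_first + ratio_second - 1) // ratio_second


needed_1_to_4 = first_type_count(400, 1, 4)   # ratio 1:4
needed_3_to_7 = first_type_count(700, 3, 7)   # ratio 3:7
```

For 400 second-type items at 1:4 this gives 100 first-type items, and for 700 second-type items at 3:7 it gives 300.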
Referring to fig. 3, a schematic structural diagram of a data processing apparatus provided in the second embodiment of the present application is shown, where the apparatus may be applied to a device such as a computer or a server with data processing capability, and in this embodiment, the apparatus may include the following structure:
a labeling unit 301, configured to obtain a first type of task data, where the first type of task data has a preset labeling result; forming a task set by the first type of task data and the second type of task data, and labeling the task data in the task set by at least one target user;
In this embodiment, a plurality of first types of task data may be provided.
In this embodiment, the labeling unit 301 may obtain the first type of task data and the second type of task data from the database to form a task set, and transmit the task set to the labeling terminal of the target user so that the target user can label the task data in the task set.
It should be noted that, in this embodiment, a crowdsourcing mode may be adopted to distribute task data in a task set to a target user, and the target user labels the distributed task data.
The task data of the first type and the task data of the second type are in a preset proportion in quantity.
An obtaining unit 302, configured to obtain the tagged data set, where the tagged data set includes at least one piece of tagged data, and the tagged data is data obtained by tagging, by the target user, task data in a task set;
a dividing unit 303, configured to divide the first annotation data in the markup data set into the first type of task data if the first annotation data meets a first condition, and generate a preset annotation result of the first annotation data by using user annotation information in the first annotation data;
wherein the first condition comprises:
the first annotation data is generated by labeling the second type of task data;
and the first target user and the second target user meet a second condition, the first target user is a target user corresponding to the first annotation data, and the second target user is a target user corresponding to second annotation data generated by the first type of task data.
That is to say, the first annotation data obtained in this embodiment is, first, annotation data generated by a target user labeling the second type of task data, which has no preset labeling result. Second, the target user corresponding to the first annotation data satisfies a second condition among the target users who labeled the task data having preset labeling results, where the second condition relates to the user annotation information that the target user produced.
As can be seen, in this embodiment, based on the preset labeling result, the labeling data generated by the task data without the preset labeling result is verified, and the corresponding first labeling data is obtained.
Once obtained, the first annotation data is regarded as annotation data whose user annotation information is correct. This embodiment therefore divides the first annotation data into the first type of task data and sets its user annotation information as its preset annotation result, thereby enriching and updating the first type of task data.
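The dividing step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the dictionary layout and the names `promote`, `task_id`, `label`, and `preset` are hypothetical.

```python
def promote(annotation, first_type, second_type):
    """Move a verified second-type item into the first-type pool.

    The user's annotation information becomes the item's preset result.
    """
    task_id = annotation["task_id"]
    task = second_type.pop(task_id)       # KeyError if it is not a second-type item
    task["preset"] = annotation["label"]  # user annotation becomes the preset result
    first_type[task_id] = task
    return task


# Hypothetical pools keyed by task id.
first_type = {"g1": {"content": "img_001", "preset": "cat"}}
second_type = {"u1": {"content": "img_002", "preset": None}}

promote({"task_id": "u1", "user": "alice", "label": "dog"}, first_type, second_type)
```

After the call, `u1` is no longer in the second-type pool and carries a preset result, so it can serve as verification data in later task sets.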
According to the above scheme, in the data processing device provided in the second embodiment of the present application, the task data with the preset labeling result is added to the task set that needs to be labeled, so that after the labeling of at least one target user, other task data without the preset labeling result is verified and divided based on the task data with the preset labeling result, and the task data with the preset labeling result is further supplemented. Therefore, in the embodiment, the marking verification is realized by setting the preset marking result in the task data, so that the manual verification is avoided, and the verification efficiency is improved.
In one implementation, the second condition may be:
the first target user is a target user, among the second target users, whose annotation accuracy is higher than a first threshold.
And/or, the second condition may be:
the first target users are those target users, among the second target users, whose first annotation data has the same user annotation information, and the proportion of the first target users among the second target users is higher than a second threshold.
The annotation accuracy of the second target user is obtained as follows:
comparing the user annotation information in the second annotation data with the corresponding preset annotation result to obtain a comparison result, where the comparison result indicates whether the second target user's annotation is accurate;
and generating the annotation accuracy of the second target user based on the comparison result.
Specifically, generating the annotation accuracy of the second target user based on the comparison result includes:
obtaining, based on the comparison result, the number of first-type task data items that the second target user annotated accurately; and generating the annotation accuracy of the second target user based on that number and the number of first-type task data items.
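The accuracy computation and the first-threshold check can be illustrated with a short sketch. It assumes the accuracy is the simple ratio of accurately annotated first-type items to the first-type items the user annotated; the text only says the accuracy is generated "based on" these two numbers, so the ratio, the function name `annotation_accuracy`, and the threshold value are assumptions.

```python
def annotation_accuracy(user_labels, presets):
    """Share of first-type items the user labeled the same as the preset result.

    user_labels: {task_id: label} for the first-type items this user annotated.
    presets:     {task_id: preset annotation result}.
    """
    if not user_labels:
        return 0.0
    correct = sum(
        1 for task_id, label in user_labels.items() if presets[task_id] == label
    )
    return correct / len(user_labels)


presets = {"g1": "cat", "g2": "dog", "g3": "cat", "g4": "bird"}
user_labels = {"g1": "cat", "g2": "dog", "g3": "dog", "g4": "bird"}

accuracy = annotation_accuracy(user_labels, presets)  # 3 of 4 match
FIRST_THRESHOLD = 0.7                                 # assumed value
meets_second_condition = accuracy > FIRST_THRESHOLD
```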
In addition, the dividing unit 303 may further output a user identifier of the first target user.
It should be noted that, for the specific implementation of each unit of the apparatus in this embodiment, reference may be made to the corresponding content in the foregoing embodiments, and details are not described here.
Referring to fig. 4, a schematic structural diagram of an electronic device provided in the third embodiment of the present application is shown, where the electronic device may be a computer or a server with data processing capability, and specifically, the electronic device may include the following structure:
a memory 401 for storing an application program and data generated by the application program;
a processor 402 for executing the application to implement: acquiring task data of a first type, wherein the task data of the first type has a preset labeling result; forming a task set by the first type of task data and the second type of task data, and labeling the task data in the task set by at least one target user; obtaining the marking data set, wherein the marking data set comprises at least one piece of marking data, and the marking data is data obtained by marking the task data in the task set by the target user; if the first labeling data in the labeling data set meets a first condition, dividing the first labeling data into the first type of task data, and generating a preset labeling result of the first labeling data by using user labeling information in the first labeling data;
wherein the first condition comprises:
the first annotation data is generated by annotating task data of the second type;
and the first target user and the second target user meet a second condition, the first target user is a target user corresponding to the first annotation data, and the second target user is a target user corresponding to second annotation data generated by the first type of task data.
In the task set, the quantities of the first type of task data and the second type of task data are in a preset proportion.
According to the above scheme, in the electronic device provided by the third embodiment of the present application, the task data with the preset labeling result is added to the task set that needs to be labeled, so that after the labeling of at least one target user, other task data without the preset labeling result is verified and divided based on the task data with the preset labeling result, and the task data with the preset labeling result is further supplemented. Therefore, in the embodiment, the marking verification is realized by setting the preset marking result in the task data, so that the manual verification is avoided, and the verification efficiency is improved.
In one implementation, the second condition may be:
the first target user is a target user, among the second target users, whose annotation accuracy is higher than a first threshold.
And/or, the second condition may be:
the first target users are those target users, among the second target users, whose first annotation data has the same user annotation information, and the proportion of the first target users among the second target users is higher than a second threshold.
The annotation accuracy of the second target user is obtained as follows:
comparing the user annotation information in the second annotation data with the corresponding preset annotation result to obtain a comparison result, where the comparison result indicates whether the second target user's annotation is accurate;
and generating the annotation accuracy of the second target user based on the comparison result.
Specifically, generating the annotation accuracy of the second target user based on the comparison result includes:
obtaining, based on the comparison result, the number of first-type task data items that the second target user annotated accurately; and generating the annotation accuracy of the second target user based on that number and the number of first-type task data items.
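The ratio-based form of the second condition can be read as an agreement check on a single second-type item: users who gave the same annotation form a group, and the group qualifies when its share of the annotating users exceeds the second threshold. The sketch below follows that reading. It assumes every user who annotated the item also annotated first-type data (so the annotators are the second target users); the name `agreeing_users` and the threshold value are hypothetical.

```python
from collections import Counter


def agreeing_users(labels_by_user, second_threshold=0.6):
    """Return (label, users) when one label's share exceeds the threshold.

    labels_by_user: {user_id: label} for one second-type item.
    Returns (None, []) when no label is shared by a large enough group.
    """
    if not labels_by_user:
        return None, []
    label, count = Counter(labels_by_user.values()).most_common(1)[0]
    if count / len(labels_by_user) > second_threshold:
        users = sorted(u for u, l in labels_by_user.items() if l == label)
        return label, users
    return None, []


agreed = agreeing_users({"a": "cat", "b": "cat", "c": "cat", "d": "dog"})
split = agreeing_users({"a": "cat", "b": "dog"})
```

In the first call three of four users agree, so their shared annotation could become the preset result; in the second call no group exceeds the threshold and the item stays in the second type.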
Additionally, processor 402 may also output a user identification of the first target user.
It should be noted that, for a specific implementation manner of the processor of the electronic device in this embodiment, reference may be made to the corresponding contents in the foregoing embodiments, and details are not described here.
Referring to fig. 5, a schematic structural diagram of an electronic device according to a fourth embodiment of the present disclosure is provided, where the electronic device may be a computer or a server with data processing capability, and specifically, the electronic device may include the following structure:
an input/output device 501, configured to obtain task data of a first type, where the first type of task data has a preset annotation result; and to form a task set from the first type of task data and task data of a second type, the task data in the task set being annotated by at least one target user.
The input/output device 501 may implement acquisition and transmission of task data through a wired or wireless transmission interface, for example, read task data from a database, and send a task set to a terminal of a target user, so that the target user labels the task data.
In this embodiment, the task data in the task set may be distributed to the terminal of the target user through the input/output device 501 in a crowdsourcing mode, and the target user marks the distributed task data on the terminal.
A processor 502, configured to obtain an annotation data set, where the annotation data set includes at least one piece of annotation data, and the annotation data is data obtained by the target user annotating task data in the task set; and, if first annotation data in the annotation data set meets a first condition, to divide the first annotation data into the first type of task data and generate a preset annotation result of the first annotation data by using the user annotation information in the first annotation data;
wherein the first condition comprises:
the first annotation data is generated by annotating task data of the second type;
and the first target user and the second target user meet a second condition, the first target user is a target user corresponding to the first annotation data, and the second target user is a target user corresponding to second annotation data generated by the first type of task data.
In the task set, the quantities of the first type of task data and the second type of task data are in a preset proportion.
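Forming a task set with a preset proportion of the two types might look like the sketch below. The function name `build_task_set`, the `gold_ratio` parameter, the item layout, and the shuffle (so that annotators cannot tell which items carry a preset result) are all assumptions made for illustration.

```python
import random


def build_task_set(first_type, second_type, gold_ratio=0.2, size=10, seed=0):
    """Mix first-type (preset result) and second-type items at a fixed ratio."""
    rng = random.Random(seed)
    n_first = round(size * gold_ratio)
    n_second = size - n_first
    if n_first > len(first_type) or n_second > len(second_type):
        raise ValueError("not enough task data for the requested mix")
    tasks = rng.sample(first_type, n_first) + rng.sample(second_type, n_second)
    rng.shuffle(tasks)  # interleave the two types
    return tasks


# Hypothetical pools: 5 items with a preset result, 20 without.
first_type = [{"id": f"g{i}", "preset": "cat"} for i in range(5)]
second_type = [{"id": f"u{i}", "preset": None} for i in range(20)]

task_set = build_task_set(first_type, second_type)
```

With `size=10` and `gold_ratio=0.2`, each task set holds two items with a preset result and eight without.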
According to the above scheme, in the electronic device provided in the fourth embodiment of the present application, the task data with the preset labeling result is added to the task set that needs to be labeled, so that after the labeling of at least one target user, other task data without the preset labeling result is verified and divided based on the task data with the preset labeling result, and the task data with the preset labeling result is further supplemented. Therefore, in the embodiment, the marking verification is realized by setting the preset marking result in the task data, so that the manual verification is avoided, and the verification efficiency is improved.
In one implementation, the second condition may be:
the first target user is a target user, among the second target users, whose annotation accuracy is higher than a first threshold.
And/or, the second condition may be:
the first target users are those target users, among the second target users, whose first annotation data has the same user annotation information, and the proportion of the first target users among the second target users is higher than a second threshold.
The annotation accuracy of the second target user is obtained as follows:
comparing the user annotation information in the second annotation data with the corresponding preset annotation result to obtain a comparison result, where the comparison result indicates whether the second target user's annotation is accurate;
and generating the annotation accuracy of the second target user based on the comparison result.
Specifically, generating the annotation accuracy of the second target user based on the comparison result includes:
obtaining, based on the comparison result, the number of first-type task data items that the second target user annotated accurately; and generating the annotation accuracy of the second target user based on that number and the number of first-type task data items.
Additionally, the processor 502 may also output the user identification of the first target user through the input-output device 501.
It should be noted that, for the implementation of the processor 502 of the electronic device in this embodiment, reference may be made to the corresponding contents in the foregoing embodiments, and details are not described here.
Based on the above implementation, the following describes the technical solution in this embodiment:
In this embodiment, a portion of annotated data (IDS: Identification Data Set) may be prepared and mixed into un-annotated data (UDS: Un-annotated Data Set). The mixed task data is then combined into a task set and distributed to users in a crowdsourcing mode, where the tasks retrieved by each user must include both types of data in a uniform proportion.
After each user finishes annotating and submits the data, the annotation data corresponding to the UDS and the annotation data corresponding to the IDS are identified separately, the IDS annotations are compared with the prepared results, and whether the user's annotation result meets the requirement is judged. The annotation data set and the user's coincidence status are recorded.
After each check, the user's overall coincidence rate is computed. If the coincidence rate reaches the expected standard, the user can be regarded as a High Quality User (HQU). The data annotated by such a user can be considered highly reliable, and the UDS the user annotated can be added to the existing IDS to keep the IDS fresh, thereby ensuring that the IDS remains continuously available.
As shown in fig. 6, when labeling task data, the method may specifically include the following steps:
1. Prepare a portion of IDS (which may be provided by experts or an annotation team).
2. Mix the IDS with the UDS in proportion as the data source of the crowdsourcing task; users pick up tasks and annotate the data.
3. After the annotated data is obtained, find the IDS portion annotated by each user, calculate the proportion of that portion annotated correctly, and record the user's coincidence status.
4. Set the number of reference rounds n and the expected coincidence rate threshold p. After a user has annotated n times, calculate the coincidence rate pu over those n rounds; if pu > p, mark the user as an HQU (the UDS annotated by that user can be regarded as IDS).
5. Add the UDS annotated by HQUs to the original IDS.
Repeating steps 2 to 5 for other task data implements the process of automatically checking crowdsourced data quality.
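Steps 3 to 5 can be sketched as follows, using the symbols n, p, and pu from the steps above. The sketch assumes pu is pooled over the user's last n submissions (total correct IDS items divided by total IDS items); the text does not say whether pu is pooled or averaged per round. The names `coincidence_rate` and `expand_ids` and the data layout are hypothetical.

```python
def coincidence_rate(rounds):
    """Pooled rate over rounds given as (correct_ids_items, total_ids_items)."""
    correct = sum(c for c, _ in rounds)
    total = sum(t for _, t in rounds)
    return correct / total if total else 0.0


def expand_ids(ids, history, uds_labels, n=3, p=0.8):
    """Mark users with at least n rounds and pu > p as HQU, then add the UDS
    items they annotated to the IDS. Returns the sorted list of HQU ids."""
    hqu = []
    for user, rounds in history.items():
        if len(rounds) >= n and coincidence_rate(rounds[-n:]) > p:
            hqu.append(user)
            for task_id, label in uds_labels.get(user, {}).items():
                ids.setdefault(task_id, label)  # keep an existing entry on conflict
    return sorted(hqu)


ids = {"g1": "cat"}
history = {
    "alice": [(4, 5), (5, 5), (5, 5)],  # pu = 14/15
    "bob": [(2, 5), (3, 5), (5, 5)],    # pu = 10/15
    "carol": [(5, 5)],                  # fewer than n rounds so far
}
uds_labels = {"alice": {"u1": "dog"}, "bob": {"u2": "cat"}}

hqu = expand_ids(ids, history, uds_labels)
```

Only `alice` clears both the round count and the threshold, so only her UDS annotation is added to the IDS.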
Therefore, this embodiment automatically verifies the accuracy of crowdsourced annotation results: users' annotation results are verified against a portion of preset data, and the verification set is expanded with data annotated by high-quality users.
Furthermore, the verification process in this embodiment can be automated, which saves the time and labor cost of expert review. In addition, each user's annotations are analyzed over multiple rounds, so the check does not depend on the variance in ability among users in the same batch.
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief, and the relevant points can be found in the description of the method.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of data processing, comprising:
acquiring task data of a first type, wherein the task data of the first type has a preset labeling result;
forming a task set by the first type of task data and the second type of task data, and marking the task data in the task set by at least one target user, wherein the second type of task data refers to the task data without a preset marking result;
obtaining a marked data set, wherein the marked data set comprises at least one piece of marked data, and the marked data is data obtained by marking task data in a task set by the target user;
if the first labeling data in the labeling data set meets a first condition, dividing the first labeling data into the first type of task data, and generating a preset labeling result of the first labeling data by using user labeling information in the first labeling data;
wherein the first condition comprises:
the first annotation data is generated by annotating task data of the second type;
and the first target user and the second target user meet a second condition, the first target user is a target user corresponding to the first annotation data, and the second target user is a target user corresponding to second annotation data generated by the first type of task data.
2. The method of claim 1, wherein the second condition comprises:
the first target user is a target user, among the second target users, whose annotation accuracy is higher than a first threshold.
3. The method of claim 1 or 2, wherein the second condition comprises:
the first target users are those target users, among the second target users, whose first annotation data has the same user annotation information, and the proportion of the first target users among the second target users is higher than a second threshold.
4. The method of claim 2, wherein the annotation accuracy of the second target user is obtained by:
comparing the user annotation information in the second annotation data with the corresponding preset annotation result to obtain a comparison result, wherein the comparison result indicates whether the second target user's annotation is accurate;
and generating the annotation accuracy of the second target user based on the comparison result.
5. The method of claim 4, wherein generating the annotation accuracy of the second target user based on the comparison result comprises:
obtaining, based on the comparison result, the number of first-type task data items that the second target user annotated accurately;
and generating the annotation accuracy of the second target user based on that number and the number of first-type task data items.
6. The method of claim 1, further comprising:
and outputting the user identification of the first target user.
7. The method of claim 1, wherein, in the task set, the quantities of the first type of task data and the second type of task data are in a preset proportion.
8. A data processing apparatus comprising:
an annotation unit, configured to obtain task data of a first type, wherein the first type of task data has a preset annotation result; and to form a task set from the first type of task data and task data of a second type, the task data in the task set being annotated by at least one target user, wherein the second type of task data refers to task data without a preset annotation result;
an obtaining unit, configured to obtain an annotation data set, wherein the annotation data set comprises at least one piece of annotation data, and the annotation data is data obtained by the target user annotating task data in the task set;
a dividing unit, configured to, if first annotation data in the annotation data set meets a first condition, divide the first annotation data into the first type of task data, and generate a preset annotation result of the first annotation data by using the user annotation information in the first annotation data;
wherein the first condition comprises:
the first annotation data is generated by annotating task data of the second type;
and the first target user and the second target user meet a second condition, the first target user is a target user corresponding to the first annotation data, and the second target user is a target user corresponding to second annotation data generated by the first type of task data.
9. An electronic device, comprising:
a memory, configured to store an application program and data generated by running the application program;
a processor for executing the application to implement: acquiring task data of a first type, wherein the task data of the first type has a preset labeling result; forming a task set by the first type of task data and the second type of task data, and marking the task data in the task set by at least one target user, wherein the second type of task data refers to the task data without a preset marking result; obtaining a marked data set, wherein the marked data set comprises at least one piece of marked data, and the marked data is data obtained by marking task data in a task set by the target user; if the first labeling data in the labeling data set meets a first condition, dividing the first labeling data into the first type of task data, and generating a preset labeling result of the first labeling data by using user labeling information in the first labeling data;
wherein the first condition comprises:
the first annotation data is generated by annotating task data of the second type;
and the first target user and the second target user meet a second condition, the first target user is a target user corresponding to the first annotation data, and the second target user is a target user corresponding to second annotation data generated by the first type of task data.
10. An electronic device, comprising:
an input/output device, configured to obtain task data of a first type, wherein the first type of task data has a preset annotation result; and to form a task set from the first type of task data and task data of a second type, the task data in the task set being annotated by at least one target user;
a processor, configured to obtain an annotation data set, wherein the annotation data set comprises at least one piece of annotation data, and the annotation data is data obtained by the target user annotating task data in the task set; and, if first annotation data in the annotation data set meets a first condition, to divide the first annotation data into the first type of task data and generate a preset annotation result of the first annotation data by using the user annotation information in the first annotation data;
wherein the first condition comprises:
the first annotation data is generated by annotating task data of the second type;
and the first target user and the second target user meet a second condition, the first target user is a target user corresponding to the first annotation data, and the second target user is a target user corresponding to second annotation data generated by the first type of task data.
CN201910251585.7A 2019-03-29 2019-03-29 Data processing method and device and electronic equipment Active CN109993315B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910251585.7A CN109993315B (en) 2019-03-29 2019-03-29 Data processing method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN109993315A CN109993315A (en) 2019-07-09
CN109993315B true CN109993315B (en) 2021-05-18

Family

ID=67131988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910251585.7A Active CN109993315B (en) 2019-03-29 2019-03-29 Data processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN109993315B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant