CN111522854B - Data labeling method and device, storage medium and computer equipment - Google Patents

Info

Publication number
CN111522854B
Authority
CN
China
Prior art keywords
data
marked
labeling
similarity
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010190591.9A
Other languages
Chinese (zh)
Other versions
CN111522854A (en
Inventor
刘一鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dazhu Hangzhou Technology Co ltd
Original Assignee
Dazhu Hangzhou Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dazhu Hangzhou Technology Co ltd filed Critical Dazhu Hangzhou Technology Co ltd
Priority to CN202010190591.9A priority Critical patent/CN111522854B/en
Publication of CN111522854A publication Critical patent/CN111522854A/en
Application granted granted Critical
Publication of CN111522854B publication Critical patent/CN111522854B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2457Query processing with adaptation to user needs
    • G06F16/24573Query processing with adaptation to user needs using data annotations, e.g. user-defined metadata
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data labeling method, a device, a storage medium and computer equipment. The method comprises: receiving target data; querying whether a labeled data set contains reference data whose similarity to the target data satisfies a preset reference data similarity condition; and, if so, outputting the reference data that corresponds to the target data and satisfies the condition. Compared with the prior art, the target data is managed in a unified manner, so labeling personnel no longer lose time asking questions and waiting for replies. Reference data for labeling the target data is retrieved from the labeled data, so the labeled data is fully used as a reference basis, and the loss of labeling efficiency caused by repeatedly processing similar data is avoided.

Description

Data labeling method and device, storage medium and computer equipment
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data labeling method, a data labeling device, a storage medium, and a computer device.
Background
Modeling artificial intelligence algorithms requires a large amount of labeled supervised data. Because the data volume is large and the data is highly diverse, the labeling process produces many problem samples that are difficult to label and that labeling personnel cannot identify reliably. Labeling personnel then lose considerable time asking questions and waiting for replies, which delays the labeling progress. In addition, different annotators may encounter similar problem samples, and the same annotator may encounter similar problem samples repeatedly, so similar problems are discussed again and again and overall efficiency suffers severely. Known data labeling systems, platforms and methods cover only the labeling and acceptance of data and lack unified management of problem samples, so the existence of problem samples seriously reduces the overall efficiency of labeling tasks.
If the labeling efficiency of sample data can be improved, the development and progress of artificial intelligence algorithm modeling will benefit.
Disclosure of Invention
In view of the foregoing, the present application provides a data labeling method, a data labeling device, a storage medium and a computer device.
According to one aspect of the present application, there is provided a data labeling method, the method comprising:
receiving target data;
inquiring whether reference data with similarity meeting a preset reference data similarity condition exists in the marked data set or not;
and if so, outputting the reference data which corresponds to the target data and meets the preset reference data similarity condition.
Specifically, the reference data includes a data tag, and after the outputting the reference data corresponding to the target data and meeting the preset reference data similarity condition, the method further includes:
receiving output feedback information corresponding to the reference data, wherein the output feedback information is used for indicating whether a data tag of the reference data is suitable for labeling the target data;
if the data tag of the reference data is suitable for marking the target data, marking the target data based on the data tag of the reference data, and adding the obtained marked data into the marked data set;
and if the data tag of the reference data is not suitable for marking the target data, adding the target data into a data list to be marked.
Specifically, after the querying whether the reference data with the similarity satisfying the preset reference data similarity condition exists in the marked data set, the method further includes:
and if not, adding the target data into the data list to be marked.
Specifically, the method further comprises:
receiving marking information, wherein the marking information comprises data to be marked and corresponding data labels, which are contained in the list to be marked;
and marking the corresponding data to be marked according to the data tag of the marking information, and adding the marked data obtained after marking into the marked data set.
Specifically, the querying whether the reference data with similarity satisfying the preset reference data similarity condition exists in the marked data set or not specifically includes:
respectively calculating the similarity between the target data and any marked data in the marked data set;
and if the highest similarity is larger than a first preset similarity threshold, taking the marked data corresponding to the highest similarity as the reference data.
Specifically, after the target data is added to the data list to be marked, the method further includes:
respectively calculating the similarity between the target data and each piece of data to be marked in the data to be marked list, and recording, as related data, the data to be marked whose similarity to the target data is larger than a second preset similarity threshold value;
after the marking of the corresponding data to be marked according to the data tag of the marking information and the adding of the marked data obtained after marking into the marked data set, the method further comprises:
when any data to be marked in the data to be marked list is marked, outputting related data marking inquiry information, wherein the related data marking inquiry information is used for inquiring whether the related data should be marked with the same data label as the data that has just been marked;
and if related data marking feedback information is received, marking the related data with the data label.
Specifically, the method further comprises:
when a data label is marked for the target data and/or for any data to be marked in the data list to be marked, outputting data label marking prompt information.
According to another aspect of the present application, there is provided a data tagging device, the device comprising:
the target data receiving module is used for receiving target data;
the reference data query module is used for querying whether reference data with similarity meeting a preset reference data similarity condition exists in the marked data set;
and the reference data output module is used for outputting the reference data which corresponds to the target data and meets the preset reference data similarity condition if the reference data exists.
Specifically, the device further comprises:
the feedback information receiving module is used for receiving output feedback information corresponding to the reference data after the reference data which corresponds to the target data and meets the preset reference data similarity condition is output, wherein the output feedback information is used for indicating whether the data label of the reference data is suitable for labeling the target data;
the target data labeling module is used for labeling the target data based on the data label of the reference data and adding the obtained labeled data into the labeled data set if the data label of the reference data is suitable for labeling the target data;
and the first list building module is used for adding the target data into a data list to be marked if the data tag of the reference data is not suitable for marking the target data.
Specifically, the device further comprises:
and the second list building module is used for adding the target data into the data list to be marked if, after the marked data set is queried, no reference data with similarity meeting the preset reference data similarity condition exists.
Specifically, the device further comprises:
the marking information receiving module is used for receiving marking information, wherein the marking information comprises data to be marked and corresponding data labels, which are contained in the list to be marked;
the marked set building module is used for marking the corresponding data to be marked according to the data tag of the marking information, and adding marked data obtained after marking into the marked data set.
Specifically, the reference data query module specifically includes:
the similarity calculation unit is used for calculating the similarity between the target data and any marked data in the marked data set respectively;
and the reference data determining unit is used for taking the marked data corresponding to the highest similarity as the reference data if the highest similarity is larger than a first preset similarity threshold value.
Specifically, the device further comprises:
the related data determining module is used for respectively calculating the similarity between the target data and any one of the data to be marked in the data to be marked list after the target data is added into the data to be marked list, and recording related data with the similarity between the target data and any one of the data to be marked being greater than a second preset similarity threshold value;
the data labeling inquiry module is used for outputting related data labeling inquiry information when any one of the data to be labeled in the data list to be labeled is labeled, wherein the related data labeling inquiry information is used for inquiring whether the related data should be labeled with the same data label as the data that has just been labeled;
and the related data labeling module is used for labeling the related data with a data label if the related data labeling feedback information is received.
Specifically, the device further comprises:
the marking prompt module is used for outputting data label marking prompt information when a data label is marked for the target data and/or for any data to be marked in the data list to be marked.
According to yet another aspect of the present application, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described data tagging method.
According to still another aspect of the present application, there is provided a computer device including a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, the processor implementing the data labeling method described above when executing the program.
By means of the above technical scheme, the data labeling method, device, storage medium and computer equipment provided by the present application manage the target data to be labeled in a unified manner. When target data is received, the labeled data set is queried for reference data whose similarity satisfies the preset reference data similarity condition, and when reference data related to the target data is found, it is output. Compared with the prior art, unified management of the target data means that labeling personnel no longer lose time asking questions and waiting for replies. Reference data for labeling the target data is retrieved from the labeled data, so the labeled data is fully used as a reference basis, and the loss of labeling efficiency caused by repeatedly processing similar data is avoided.
The foregoing is only an overview of the technical solutions of the present application. So that the technical means of the present application can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the present application become more apparent, specific embodiments of the present application are set out below.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 shows a flow chart of a data labeling method according to an embodiment of the present application;
FIG. 2 shows a schematic flow chart of another data labeling method according to an embodiment of the present application;
FIG. 3 shows a schematic flow chart of another data labeling method according to an embodiment of the present application;
FIG. 4 shows a schematic structural diagram of a data labeling device according to an embodiment of the present application;
FIG. 5 shows a schematic structural diagram of another data labeling device according to an embodiment of the present application.
Detailed Description
The present application will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other.
In this embodiment, a data labeling method is provided, as shown in fig. 1, and the method includes:
step 101, receiving target data;
step 102, inquiring whether reference data with similarity meeting a preset reference data similarity condition exists in the marked data set or not;
step 103, if yes, outputting the reference data which corresponds to the target data and meets the preset reference data similarity condition.
The method can be applied to sample data labeling scenarios. For example, labeling personnel may encounter problem sample data that they cannot label, or a labeling model may encounter problem sample data that it cannot identify. In the prior art, such problem sample data is generally reported to expert personnel, and the submitter waits for the experts to label it. The data labeling method provided by the embodiments of the present application is applied to a data labeling management system, which may include a target data submitting function module, a similar data searching function module and a similar data output module.
First, to improve data labeling efficiency, problem sample data can be submitted to the data labeling management system through the target data submitting function module, so that the system handles the labeling of the target data (namely, the problem sample data). After labeling personnel submit data they cannot label, the system labels and manages the target data in a unified manner, and the labeling personnel can continue labeling other sample data without waiting for the result, which improves the labeling efficiency of the sample data.
Second, after receiving the target data, the system queries whether the labeled data set stored in the system contains labeled data similar to the target data, so that such data can provide a reference basis for labeling the target data. Specifically, labeled data whose similarity to the target data satisfies a certain condition can be screened out of the labeled data set and used as reference data. For example, the condition may require that the similarity between the reference data and the target data is greater than the similarity between any other labeled data in the labeled data set and the target data, and that it is also greater than a preset threshold. The similarity can be determined from the cosine distance between the target data and the labeled data: the larger the distance, the smaller the similarity, and the smaller the distance, the larger the similarity. The similarity can also be determined from the minimum edit distance between the target data and the labeled data, that is, the number of editing operations required to convert the target data into the labeled data: the larger the minimum edit distance, the smaller the similarity, and the smaller the minimum edit distance, the larger the similarity.
Finally, if the labeled data set contains reference data satisfying the preset reference data similarity condition, the reference data is output to expert personnel, who can then judge and label the target data quickly based on the original labeling information of the similar reference data. Using the reference data as the basis for labeling the target data makes full use of the labeled data and improves the labeling efficiency of the target data. The system can receive target data submitted by multiple labeling personnel, so when the same person or different persons encounter similar target data, the management system prevents repeated processing of similar data from reducing labeling efficiency. The reference data may also be output to the labeling personnel so that they learn the labeling progress of the target data in time.
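The two similarity measures named above (cosine distance and minimum edit distance) can be illustrated with a minimal Python sketch. The function names, the treatment of zero vectors, and the normalization of edit distance to a score between 0 and 1 are assumptions made for illustration; the application does not specify them.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors.

    Cosine distance is 1 minus this value, so a larger distance
    means a smaller similarity.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def edit_distance(s, t):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to convert s into t (Levenshtein distance)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(s, t):
    """Map edit distance to [0, 1]; larger distance gives smaller similarity.

    Dividing by the longer length is an illustrative assumption.
    """
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(s, t) / longest
```

For instance, `edit_distance("kitten", "sitting")` is 3, so under the assumed normalization the similarity of the two strings is 1 - 3/7.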
By applying the technical scheme of this embodiment, the target data to be labeled is managed in a unified manner. When target data is received, the labeled data set is queried for reference data whose similarity satisfies the preset reference data similarity condition, and when reference data related to the target data is found, it is output. Compared with the prior art, unified management of the target data means that labeling personnel no longer lose time asking questions and waiting for replies. Reference data for labeling the target data is retrieved from the labeled data, so the labeled data is fully used as a reference basis, and the loss of labeling efficiency caused by repeatedly processing similar data is avoided.
Further, as a refinement and extension of the foregoing embodiment, in order to fully describe a specific implementation procedure of the embodiment, another data labeling method is provided, as shown in fig. 2, where the method includes:
step 201, target data is received.
In the embodiment of the application, while labeling data, labeling personnel can mark a piece of data as a problem sample and move directly to the next piece without waiting for the problem to be handled; the data marked as a problem sample is submitted to the data labeling management system.
Step 202, calculating the similarity between the target data and any marked data in the marked data set.
Step 203, if the highest similarity is greater than the first preset similarity threshold, using the labeled data corresponding to the highest similarity as the reference data.
In steps 202 and 203, after receiving the target data, the system needs to search the labeled data set for data that is similar to the target data and can provide a reference basis for labeling it. Specifically, the similarity between the target data and each piece of labeled data in the labeled data set is calculated first, and the system then judges whether the maximum similarity is greater than the first preset similarity threshold. If it is, the labeled data corresponding to the maximum similarity is taken as the reference data for the target data. Because its similarity to the target data is higher than that of any other labeled data, the reference data is the most reliable reference basis available for labeling the target data, and because that similarity also exceeds the first preset similarity threshold, the reference data can be expected to provide an effective reference.
It should be noted that other screening conditions may be set when screening the reference data by similarity in step 203. For example, the preset reference data similarity condition may instead designate all labeled data whose similarity is greater than a specific threshold as reference data, which provides more reference bases for labeling the target data and helps improve the accuracy of data labeling.
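Both screening conditions described for steps 202 and 203 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names, the representation of the labeled data set as (data, tag) pairs, and the token-overlap similarity used as a stand-in measure are all assumptions.

```python
def token_overlap(a, b):
    """Stand-in similarity (assumption): Jaccard overlap of whitespace tokens."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def find_reference(target, labeled_set, similarity, threshold):
    """Steps 202-203: score every labeled item, keep the best one,
    and return it only if its score exceeds the first threshold."""
    best = None
    for item, tag in labeled_set:
        score = similarity(target, item)
        if best is None or score > best[2]:
            best = (item, tag, score)
    if best is not None and best[2] > threshold:
        return best
    return None  # no reference data: the target goes to the to-be-labeled list


def find_all_references(target, labeled_set, similarity, threshold):
    """Alternative condition: every labeled item above the threshold,
    ordered from most to least similar."""
    scored = [(item, tag, similarity(target, item)) for item, tag in labeled_set]
    hits = [hit for hit in scored if hit[2] > threshold]
    return sorted(hits, key=lambda hit: hit[2], reverse=True)
```

Returning `None` from `find_reference` corresponds to the branch in which no reference data exists; the threshold value itself is a tuning choice that the application leaves open.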
Step 204, if no such reference data exists, adding the target data into a data list to be marked.
In the above embodiment, if the marked data set does not have the reference data satisfying the preset reference data similarity condition, the target data is added to the to-be-marked data list, so that the technical expert can intensively mark the data in the list.
Step 205, if yes, outputting the reference data corresponding to the target data and meeting the preset reference data similarity condition.
In the above embodiment, if the reference data corresponding to the target data can be found, the reference data and its corresponding data tag are output, so that the technical expert or the intelligent terminal can further determine whether the reference data has reference value for labeling the target data.
Step 206, receiving output feedback information corresponding to the reference data, wherein the output feedback information is used for indicating whether the data tag of the reference data is suitable for labeling the target data.
Step 207, if the data tag of the reference data is suitable for labeling the target data, labeling the target data based on the data tag of the reference data, and adding the obtained labeled data to the labeled data set.
Step 208, outputting data tag labeling prompt information when the data tag is labeled for the target data.
Step 209, if the data tag of the reference data is not suitable for labeling the target data, adding the target data to the data list to be labeled.
In steps 206 to 209, after the reference basis corresponding to the target data is output, the technical expert or the intelligent terminal returns feedback information to the system. The feedback information indicates whether the data tag of the reference data can be used to label the target data, that is, whether the data tag appropriate for the target data is consistent with the data tag of the reference data. If the feedback information indicates that the data tag of the reference data is suitable, the target data is labeled directly with that data tag. The labeled target data is then added to the labeled data set, so that when the system later receives other target data similar to it, the data can be found in the labeled data set and serve as a labeling basis, which improves labeling efficiency. Meanwhile, after the target data is labeled, data tag labeling prompt information, including the corresponding data tag, should be sent to the submitter of the target data to inform the submitter that the target data has been labeled, so that the submitter can follow the labeling progress. In addition, if the feedback information indicates that the data tag of the reference data is not suitable for labeling the target data, the target data is added to the data list to be labeled to await labeling by a relevant technical expert.
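The routing performed in steps 206 to 209 can be sketched as below. The class `LabelingState`, the function `handle_feedback`, and the use of plain lists for the labeled data set, the to-be-labeled list and the submitter prompts are hypothetical names and structures chosen for illustration only.

```python
class LabelingState:
    """Hypothetical in-memory state of the data labeling management system."""

    def __init__(self):
        self.labeled_set = []  # (data, tag) pairs that have been labeled
        self.pending = []      # data list to be labeled by an expert
        self.notices = []      # prompt information destined for submitters


def handle_feedback(state, target, reference_tag, tag_is_suitable):
    """Route the target data according to the output feedback information.

    tag_is_suitable is the feedback of step 206: True if the data tag of
    the reference data fits the target data, False otherwise.
    """
    if tag_is_suitable:
        # Step 207: label the target with the reference tag and store it.
        state.labeled_set.append((target, reference_tag))
        # Step 208: record a prompt so the submitter learns the result.
        state.notices.append((target, reference_tag))
        return reference_tag
    # Step 209: the reference tag does not fit; queue the target for an expert.
    state.pending.append(target)
    return None
```

Adding the newly labeled target to `labeled_set` is what allows later, similar target data to find it as reference data.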
FIG. 3 is a schematic flow chart of another data labeling method provided in an embodiment of the present application, which handles the data list to be labeled. As shown in FIG. 3, the method includes:
Step 301, calculating the similarity between the target data and each piece of data to be marked in the data to be marked list, and recording, as related data, the data whose similarity is greater than the second preset similarity threshold.
Step 302, receiving labeling information, wherein the labeling information comprises data to be labeled contained in the list to be labeled and the corresponding data labels.
Step 303, marking the corresponding data to be marked according to the data tag of the marking information, and adding the marked data obtained after marking into the marked data set.
Step 304, outputting related data labeling inquiry information when any data to be labeled in the data list to be labeled is labeled, wherein the related data labeling inquiry information is used for inquiring whether the related data should be labeled with the same data label as the data that has just been labeled.
Step 305, if related data labeling feedback information is received, labeling the related data with the data label.
Step 306, outputting data tag labeling prompt information when a data tag is labeled for any data to be labeled in the data list to be labeled.
In the above embodiment, for data newly added to the data list to be labeled, the data in the list that is similar to it can be found first. When either the new data or one of the highly similar pieces of data is later labeled, the assigned data tag can serve as a labeling reference for the other highly similar data. For example, suppose newly added data A is highly similar to data B and data C already in the list. When data A is labeled, the data tag of data A may also apply to data B and data C because of that high similarity, so the labeling party can be asked whether data B and data C should be labeled with the same data tag as data A, which helps improve the efficiency of data labeling.
In addition, after any data in the data list to be labeled is labeled, labeling prompt information should be fed back to the submitter of that data to inform the submitter that the data has been labeled and which data tag was assigned.
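The handling of related data in steps 301, 304 and 305 can be sketched as follows. The function names, the `similarity` and `confirm` callables, and the list-based data structures are assumptions for illustration; `confirm` stands in for the inquiry sent to the labeling party and the feedback that comes back.

```python
def record_related(new_item, pending, similarity, threshold):
    """Step 301: items already in the to-be-labeled list whose similarity
    to the newly added item exceeds the second threshold."""
    return [item for item in pending
            if item != new_item and similarity(new_item, item) > threshold]


def propagate_tag(tag, related, confirm, pending, labeled_set):
    """Steps 304-305: after one item receives `tag`, ask about each related
    item and label those for which confirming feedback is received."""
    newly_labeled = []
    for item in related:
        # confirm(item, tag) models the inquiry and its feedback.
        if item in pending and confirm(item, tag):
            pending.remove(item)
            labeled_set.append((item, tag))
            newly_labeled.append(item)
    return newly_labeled
```

In the example from the text, labeling data A would trigger `propagate_tag` with data B and data C as `related`, and each is labeled only if the labeling party confirms the shared tag.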
Further, as a specific implementation of the method of fig. 1, an embodiment of the present application provides a data labeling device, as shown in fig. 4, where the device includes: a target data receiving module 41, a reference data inquiring module 42, and a reference data outputting module 43.
A target data receiving module 41 for receiving target data;
a reference data query module 42, configured to query whether reference data with similarity to the target data satisfying a preset reference data similarity condition exists in the labeled data set;
the reference data output module 43 is configured to output, if any, reference data corresponding to the target data that satisfies a preset reference data similarity condition.
In a specific application scenario, as shown in fig. 5, the apparatus further includes: a feedback information receiving module 44, a target data labeling module 45, and a first list establishing module 46.
The feedback information receiving module 44 is configured to receive output feedback information corresponding to the reference data after the reference data corresponding to the target data and meeting a preset reference data similarity condition is output, where the output feedback information is used to indicate whether the data tag of the reference data is suitable for labeling the target data;
the target data labeling module 45 is configured to label the target data based on the data label of the reference data if the data label of the reference data is suitable for labeling the target data, and add the obtained labeled data to the labeled data set;
the first list creation module 46 is configured to add the target data to the to-be-marked data list if the data tag of the reference data is not suitable for marking the target data.
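The branching performed by modules 45 and 46 on the output feedback information can be sketched as below. The function name `handle_feedback`, the boolean `label_fits`, and the use of a dict and a list as the labeled data set and to-be-marked data list are assumptions made for illustration only.

```python
def handle_feedback(target, reference_label, label_fits, labeled, pending):
    """Apply the output feedback for one target data item.

    label_fits: True if the feedback says the reference data's label
                is suitable for labeling the target data.
    labeled:    dict mapping labeled data -> data label (labeled data set).
    pending:    list of data waiting for manual labeling (to-be-marked list).
    """
    if label_fits:
        # Label the target with the reference label and add it to the labeled set.
        labeled[target] = reference_label
    elif target not in pending:
        # Otherwise queue the target for manual labeling.
        pending.append(target)


labeled, pending = {}, []
handle_feedback("order refund missing", "refund", True, labeled, pending)
handle_feedback("app crashes on start", "refund", False, labeled, pending)
```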
In a specific application scenario, as shown in fig. 5, the apparatus further includes: the second list creation module 47.
The second list creation module 47 is configured to add the target data to the to-be-marked data list if the query of the marked data set finds no reference data whose similarity with the target data satisfies the preset reference data similarity condition.
In a specific application scenario, as shown in fig. 5, the apparatus further includes: the labeling information receiving module 48 and the labeled set establishing module 49.
The marking information receiving module 48 is configured to receive marking information, where the marking information includes data to be marked that is contained in the to-be-marked data list and the data tag corresponding to that data;
the marked set creating module 49 is configured to mark the corresponding data to be marked according to the data tag of the marking information, and add the marked data obtained after marking to the marked data set.
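How received marking information moves items from the to-be-marked data list into the labeled data set can be sketched as follows. The name `apply_marking_info` and the representation of marking information as a dict from data to label are assumptions of this sketch, not details given in the text.

```python
def apply_marking_info(marking_info, labeled, pending):
    """Label pending items according to received marking information.

    marking_info: dict mapping to-be-marked data -> data label (assumed shape).
    Returns the list of items that were actually labeled.
    """
    applied = []
    for data, label in marking_info.items():
        if data in pending:
            pending.remove(data)   # no longer waiting for a label
            labeled[data] = label  # now part of the labeled data set
            applied.append(data)
    return applied


labeled = {}
pending = ["app crashes on start", "cannot log in"]
done = apply_marking_info({"cannot log in": "login"}, labeled, pending)
```

Items added to the labeled data set this way become candidates for reference data in later queries.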
In a specific application scenario, as shown in fig. 5, the reference data query module 42 specifically includes: a similarity calculation unit 421, and a reference data determination unit 422.
A similarity calculating unit 421 for calculating the similarity between the target data and each labeled data item in the labeled data set;
the reference data determining unit 422 is configured to take, as the reference data, the labeled data corresponding to the highest similarity if the highest similarity is greater than the first preset similarity threshold.
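The two units above can be sketched as a single lookup: compute the similarity of the target to every labeled item, then accept the best match only if it exceeds the first threshold. The name `find_reference`, the use of `difflib.SequenceMatcher` as the similarity measure, and the threshold value 0.8 are assumptions for illustration; the text does not prescribe them.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (an assumed stand-in measure)."""
    return SequenceMatcher(None, a, b).ratio()


def find_reference(target, labeled, first_threshold=0.8, sim=text_similarity):
    """Return (data, label, score) for the best-matching labeled item, or None.

    labeled: dict mapping labeled data -> data label.
    first_threshold: hypothetical value for the first preset similarity threshold.
    """
    best = None
    for data, label in labeled.items():
        score = sim(target, data)
        if best is None or score > best[2]:
            best = (data, label, score)
    # Only the highest-similarity item is considered, and only if it
    # exceeds the first preset similarity threshold.
    if best is not None and best[2] > first_threshold:
        return best
    return None
```

When `find_reference` returns `None`, the target would be added to the to-be-marked data list instead of being shown a reference.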
In a specific application scenario, as shown in fig. 5, the apparatus further includes: a related data determining module 50, a data labeling inquiry module 51 and a related data labeling module 52.
The related data determining module 50 is configured to, after the target data is added to the to-be-marked data list, calculate the similarity between the target data and each data item to be marked in the list, and record as related data the items whose similarity exceeds a second preset similarity threshold;
the data labeling query module 51 is configured to output related data labeling query information when any data to be marked in the to-be-marked data list is labeled, where the related data labeling query information is used to ask whether the related data should be labeled with the same data tag as the data that has just been labeled;
the related data labeling module 52 is configured to label the related data with that data tag if related data labeling feedback information is received.
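The query-and-label step of modules 51 and 52 can be sketched as below. The function `propagate_label` and the `confirm` callback, which stands in for outputting the query and receiving positive feedback, are hypothetical names introduced for this sketch.

```python
def propagate_label(labeled_item, label, related, pending, labeled, confirm):
    """After labeled_item receives a label, ask about each related pending item.

    related: dict mapping an item to the set of items recorded as related to it.
    confirm: callable (item, label) -> bool; models the labeling query and
             the feedback that the same label should be applied (assumed).
    Returns the related items that received the label.
    """
    applied = []
    for item in sorted(related.get(labeled_item, ())):
        if item in pending and confirm(item, label):
            pending.remove(item)
            labeled[item] = label
            applied.append(item)
    return applied


related = {"A": {"B", "C"}}
pending = ["B", "C"]
labeled = {"A": "refund"}
# The labeling party confirms the label for B but not for C.
applied = propagate_label("A", "refund", related, pending, labeled,
                          confirm=lambda item, label: item == "B")
```

Items for which no confirmation arrives simply stay in the to-be-marked data list and are labeled later in the usual way.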
In a specific application scenario, as shown in fig. 5, the apparatus further includes: the annotation prompt module 53.
The labeling prompt module 53 is configured to output data label labeling prompt information when labeling a data label for any data to be labeled in the target data and/or the data set to be labeled.
It should be noted that, for other descriptions of the functional units of the data labeling device provided in the embodiment of the present application, reference may be made to the corresponding descriptions of fig. 1 and fig. 2, which are not repeated here.
Based on the above-mentioned methods shown in fig. 1 to 3, correspondingly, the embodiments of the present application further provide a storage medium, on which a computer program is stored, which when executed by a processor, implements the above-mentioned data labeling method shown in fig. 1 to 3.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and includes several instructions for causing a computer device (such as a personal computer, a server, or a network device) to perform the methods described in the various implementation scenarios of the present application.
Based on the methods shown in fig. 1 to 3 and the virtual device embodiments shown in fig. 4 and 5, in order to achieve the above objects, the embodiments of the present application further provide a computer device, which may specifically be a personal computer, a server, a network device, etc., where the computer device includes a storage medium and a processor; a storage medium storing a computer program; a processor for executing a computer program to implement the data labeling method as described above and shown in fig. 1 to 3.
Optionally, the computer device may also include a user interface, a network interface, a camera, Radio Frequency (RF) circuitry, sensors, audio circuitry, a Wi-Fi module, and the like. The user interface may include a display screen and an input unit such as a keyboard, and may optionally also include a USB interface, a card reader interface, and so on. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a Bluetooth interface or a Wi-Fi interface), etc.
Those skilled in the art will appreciate that the computer device structure provided in this embodiment does not limit the computer device, which may include more or fewer components, combine certain components, or use a different arrangement of components.
The storage medium may also include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the computer device and supports the running of the information processing program and other software and/or programs. The network communication module is used to implement communication among the components inside the storage medium, as well as communication with other hardware and software in the physical device.
From the description of the above embodiments, those skilled in the art will clearly understand that the present application may be implemented by software plus a necessary general-purpose hardware platform, or by hardware. Target data to be labeled is managed in a unified way: when target data is received, the labeled data set is queried for reference data whose similarity to the target data satisfies the preset reference data similarity condition, and the reference data is output when the query finds it. Compared with the prior art, unified management of the target data means that labeling personnel do not waste time asking questions and waiting for replies. The reference data for labeling the target data is retrieved from the already labeled data, so the labeled data is fully used as a reference basis for labeling the target data, which addresses the problem that repeated processing of similar data reduces labeling efficiency.
Those skilled in the art will appreciate that the drawings are merely schematic illustrations of one preferred implementation scenario, and that the modules or flows in the drawings are not necessarily required to practice the present application. Those skilled in the art will also appreciate that the modules of an apparatus in an implementation scenario may be distributed in that apparatus as described for the implementation scenario, or may, with corresponding changes, be located in one or more apparatuses different from that of the implementation scenario. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The above serial numbers of the present application are for description only and do not represent the relative merits of the implementation scenarios. The foregoing discloses only a few specific implementation scenarios of the present application; the present application is not limited thereto, and any variation that a person skilled in the art can conceive of shall fall within the protection scope of the present application.

Claims (9)

1. A method for labeling data, the method comprising:
receiving target data, wherein the target data is problem sample data which cannot be identified when a labeling model labels the sample data;
inquiring whether reference data with similarity meeting a preset reference data similarity condition exists in the marked data set or not;
if so, outputting the reference data which corresponds to the target data and meets the preset reference data similarity condition;
receiving output feedback information corresponding to the reference data, wherein the output feedback information is used for indicating whether a data tag of the reference data is suitable for labeling the target data;
if the data tag of the reference data is not suitable for marking the target data, adding the target data into a data list to be marked;
respectively calculating the similarity between the target data and any one of the data to be marked in the data to be marked list, and recording related data of which the similarity between the target data and any one of the data to be marked is larger than a second preset similarity threshold value;
when any data to be marked in the data to be marked list is marked, outputting related data marking inquiry information, wherein the related data marking inquiry information is used for inquiring whether the related data is marked with the same data label as any marked data to be marked;
and if the relevant data labeling feedback information is received, labeling a data label for the relevant data.
2. The method of claim 1, wherein after receiving the output feedback information corresponding to the reference data, the method further comprises:
if the data tag of the reference data is suitable for marking the target data, marking the target data based on the data tag of the reference data, and adding the obtained marked data into the marked data set.
3. The method of claim 1, wherein after querying whether reference data satisfying a preset reference data similarity condition is present in the annotated data set, the method further comprises:
and if not, adding the target data into the data list to be marked.
4. A method according to claim 2 or 3, characterized in that the method further comprises:
receiving marking information, wherein the marking information comprises data to be marked and corresponding data labels, which are contained in the data to be marked list;
and marking the corresponding data to be marked according to the data tag of the marking information, and adding the marked data obtained after marking into the marked data set.
5. The method of claim 4, wherein the querying whether the reference data with the similarity with the target data satisfying the preset reference data similarity condition exists in the marked data set comprises:
respectively calculating the similarity between the target data and any marked data in the marked data set;
and if the highest similarity is larger than a first preset similarity threshold, taking the marked data corresponding to the highest similarity as the reference data.
6. The method of claim 5, wherein the method further comprises:
when the data label is marked for any one of the target data and/or the data to be marked in the data set to be marked, outputting data label marking prompt information.
7. A data tagging device, the device comprising:
the target data receiving module is used for receiving target data, wherein the target data are problem sample data which cannot be identified when the sample data are marked by the marking model;
the reference data query module is used for querying whether reference data with similarity meeting a preset reference data similarity condition exists in the marked data set;
the reference data output module is used for outputting the reference data which corresponds to the target data and meets the preset reference data similarity condition if the reference data exists;
the feedback information receiving module is used for receiving output feedback information corresponding to the reference data, wherein the output feedback information is used for indicating whether a data tag of the reference data is suitable for labeling the target data;
the first list building module is used for adding the target data into a data list to be marked if the data tag of the reference data is not suitable for marking the target data;
the related data determining module is used for respectively calculating the similarity between the target data and any one of the data to be marked in the data to be marked list, and recording related data with the similarity between the target data and any one of the data to be marked being greater than a second preset similarity threshold;
the data labeling inquiry module is used for outputting related data labeling inquiry information when any one of the data to be labeled in the data list to be labeled is labeled, wherein the related data labeling inquiry information is used for inquiring whether the related data is labeled with the same data label as the labeled data to be labeled;
and the related data labeling module is used for labeling the related data with a data label if the related data labeling feedback information is received.
8. A storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the data annotation method of any of claims 1 to 6.
9. A computer device comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, characterized in that the processor implements the data annotation method according to any of claims 1 to 6 when the program is executed by the processor.
CN202010190591.9A 2020-03-18 2020-03-18 Data labeling method and device, storage medium and computer equipment Active CN111522854B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010190591.9A CN111522854B (en) 2020-03-18 2020-03-18 Data labeling method and device, storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010190591.9A CN111522854B (en) 2020-03-18 2020-03-18 Data labeling method and device, storage medium and computer equipment

Publications (2)

Publication Number Publication Date
CN111522854A CN111522854A (en) 2020-08-11
CN111522854B true CN111522854B (en) 2023-08-01

Family

ID=71910611

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010190591.9A Active CN111522854B (en) 2020-03-18 2020-03-18 Data labeling method and device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN111522854B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287911B (en) * 2020-12-25 2021-05-28 长沙海信智能系统研究院有限公司 Data labeling method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018171288A1 (en) * 2017-03-22 2018-09-27 广州优视网络科技有限公司 Method and apparatus for tagging information stream, terminal device, and storage medium
CN109783697A (en) * 2018-12-14 2019-05-21 北京海数宝科技有限公司 Data processing method, device, computer equipment and storage medium
WO2019133729A1 (en) * 2017-12-28 2019-07-04 Alibaba Group Holding Limited Data processing method and apparatus
CN110737706A (en) * 2019-09-06 2020-01-31 平安城市建设科技(深圳)有限公司 Data management method, device, equipment and computer readable storage medium
CN110807086A (en) * 2019-10-08 2020-02-18 腾讯科技(深圳)有限公司 Text data labeling method and device, storage medium and electronic equipment
WO2020048377A1 (en) * 2018-09-05 2020-03-12 腾讯科技(深圳)有限公司 Neural network training method and apparatus, and computer device and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522424B (en) * 2018-10-16 2020-04-24 北京达佳互联信息技术有限公司 Data processing method and device, electronic equipment and storage medium
CN109697468B (en) * 2018-12-24 2021-08-06 苏州科达科技股份有限公司 Sample image labeling method and device and storage medium
CN110532345A (en) * 2019-07-15 2019-12-03 北京小米智能科技有限公司 A kind of processing method of unlabeled data, device and storage medium
CN110472055B (en) * 2019-08-21 2021-09-14 北京百度网讯科技有限公司 Method and device for marking data
CN110647886A (en) * 2019-09-19 2020-01-03 北京百度网讯科技有限公司 Interest point marking method and device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
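Research on an automatic metadata quality evaluation method for data repositories based on semantic annotation; 郭晓明; 马良荔; 苏凯; 孙煜飞; Computer Applications and Software (计算机应用与软件), No. 06; full text *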

Also Published As

Publication number Publication date
CN111522854A (en) 2020-08-11

Similar Documents

Publication Publication Date Title
CN108920659B (en) Data processing system, data processing method thereof, and computer-readable storage medium
CN109308681B (en) Image processing method and device
WO2018188378A1 (en) Method and device for tagging label for application, terminal and computer readable storage medium
CN107943877B (en) Method and device for generating multimedia content to be played
CN111191012B (en) Knowledge graph generation device and method and computer readable storage medium thereof
CN109947989B (en) Method and apparatus for processing video
CN111259663B (en) Information processing method and device
CN108108342A (en) Generation method, search method and the device of structured text
CN108121814B (en) Search result ranking model generation method and device
CN111104479A (en) Data labeling method and device
CN111522854B (en) Data labeling method and device, storage medium and computer equipment
CN115801980A (en) Video generation method and device
US11250080B2 (en) Method, apparatus, storage medium and electronic device for establishing question and answer system
CN110059172B (en) Method and device for recommending answers based on natural language understanding
CN114328632A (en) User data analysis method and device based on bitmap and computer equipment
CN113987300A (en) Label generation method and device
CN109710634B (en) Method and device for generating information
CN111930891A (en) Retrieval text expansion method based on knowledge graph and related device
US11599544B2 (en) Primary tagging in a data stream
CN113742485A (en) Method and device for processing text
CN109857838B (en) Method and apparatus for generating information
CN112184027A (en) Task progress updating method and device and storage medium
CN111262727A (en) Service capacity expansion method, device, equipment and storage medium
CN111460269B (en) Information pushing method and device
CN116681408B (en) System management method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant