CN111522854B

CN111522854B - Data labeling method and device, storage medium and computer equipment

Info

Publication number: CN111522854B
Application number: CN202010190591.9A
Authority: CN
Inventors: 刘一鹏
Original assignee: Dazhu Hangzhou Technology Co ltd
Current assignee: Dazhu Hangzhou Technology Co ltd
Priority date: 2020-03-18
Filing date: 2020-03-18
Publication date: 2023-08-01
Anticipated expiration: 2040-03-18
Also published as: CN111522854A

Abstract

The application discloses a data labeling method and device, a storage medium and computer equipment. The method comprises the following steps: receiving target data; querying whether the labeled data set contains reference data whose similarity to the target data satisfies a preset reference data similarity condition; and if so, outputting the reference data that corresponds to the target data and satisfies the preset reference data similarity condition. Compared with the prior art, the target data is managed in a unified manner, so labeling personnel no longer waste time asking questions and waiting for replies. Reference data for labeling the target data is queried from the labeled data, so the labeled data is fully used as a reference basis for labeling the target data, and the loss of labeling efficiency caused by repeated processing of similar data is avoided.

Description

Data labeling method and device, storage medium and computer equipment

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a data labeling method, a data labeling device, a storage medium, and a computer device.

Background

Modeling with artificial intelligence algorithms requires a large amount of labeled supervised data. Because the data volume is large and the data is highly diverse, the labeling process produces many problem samples that are difficult to label and that labeling personnel cannot identify reliably. Labeling personnel then lose a large amount of time asking questions and waiting for replies, which delays the labeling progress. In addition, different annotators may encounter similar problem samples, and the same annotator may encounter similar problem samples repeatedly, so similar problems are discussed many times and overall efficiency suffers seriously. Known data labeling systems, platforms and methods cover only the labeling and acceptance of data and lack unified management of problem samples, so problem samples seriously reduce the overall efficiency of labeling tasks.

If the labeling efficiency of the sample data can be improved, the development and progress of the modeling of the artificial intelligence algorithm can be facilitated.

Disclosure of Invention

In view of the foregoing, the present application provides a data labeling method, a data labeling device, a storage medium and a computer device.

According to one aspect of the present application, there is provided a data labeling method, the method comprising:

receiving target data;

inquiring whether reference data with similarity meeting a preset reference data similarity condition exists in the marked data set or not;

and if so, outputting the reference data which corresponds to the target data and meets the preset reference data similarity condition.

Specifically, the reference data includes a data tag, and after the outputting the reference data corresponding to the target data and meeting the preset reference data similarity condition, the method further includes:

receiving output feedback information corresponding to the reference data, wherein the output feedback information is used for indicating whether a data tag of the reference data is suitable for labeling the target data;

if the data tag of the reference data is suitable for marking the target data, marking the target data based on the data tag of the reference data, and adding the obtained marked data into the marked data set;

and if the data tag of the reference data is not suitable for marking the target data, adding the target data into a data list to be marked.

Specifically, after the querying whether the reference data with the similarity satisfying the preset reference data similarity condition exists in the marked data set, the method further includes:

and if not, adding the target data into the data list to be marked.

Specifically, the method further comprises:

receiving marking information, wherein the marking information comprises data to be marked and corresponding data labels, which are contained in the list to be marked;

and marking the corresponding data to be marked according to the data tag of the marking information, and adding the marked data obtained after marking into the marked data set.

Specifically, the querying whether the reference data with similarity satisfying the preset reference data similarity condition exists in the marked data set or not specifically includes:

respectively calculating the similarity between the target data and any marked data in the marked data set;

and if the highest similarity is larger than a first preset similarity threshold, taking the marked data corresponding to the highest similarity as the reference data.

Specifically, after the target data is added to the data list to be marked, the method further includes:

respectively calculating the similarity between the target data and any one of the data to be marked in the data to be marked list, and recording related data of which the similarity between the target data and any one of the data to be marked is larger than a second preset similarity threshold value;

after the labeling of the corresponding data to be labeled according to the data tag of the labeling information and the adding of the labeled data obtained after labeling into the labeled data set, the method further comprises:

when any data to be labeled in the data list to be labeled is labeled, outputting related data labeling inquiry information, wherein the related data labeling inquiry information is used for inquiring whether the related data should be labeled with the same data tag as the data that has just been labeled;

and if related data labeling feedback information is received, labeling the related data with the data tag.

Specifically, the method further comprises:

when the data label is marked for any one of the target data and/or the data to be marked in the data set to be marked, outputting data label marking prompt information.

According to another aspect of the present application, there is provided a data tagging device, the device comprising:

the target data receiving module is used for receiving target data;

the reference data query module is used for querying whether reference data with similarity meeting a preset reference data similarity condition exists in the marked data set;

and the reference data output module is used for outputting the reference data which corresponds to the target data and meets the preset reference data similarity condition if the reference data exists.

Specifically, the device further comprises:

the feedback information receiving module is used for receiving output feedback information corresponding to the reference data after the reference data which corresponds to the target data and meets the preset reference data similarity condition is output, wherein the output feedback information is used for indicating whether the data label of the reference data is suitable for labeling the target data;

the target data labeling module is used for labeling the target data based on the data label of the reference data and adding the obtained labeled data into the labeled data set if the data label of the reference data is suitable for labeling the target data;

and the first list building module is used for adding the target data into a data list to be marked if the data tag of the reference data is not suitable for marking the target data.

Specifically, the device further comprises:

and the second list building module is used for inquiring whether the reference data with the similarity meeting the preset reference data similarity condition exists in the marked data set or not, and if not, adding the target data into the data list to be marked.

Specifically, the device further comprises:

the marking information receiving module is used for receiving marking information, wherein the marking information comprises data to be marked and corresponding data labels, which are contained in the list to be marked;

the marked set building module is used for marking the corresponding data to be marked according to the data tag of the marking information, and adding marked data obtained after marking into the marked data set.

Specifically, the reference data query module specifically includes:

the similarity calculation unit is used for calculating the similarity between the target data and any marked data in the marked data set respectively;

and the reference data determining unit is used for taking the marked data corresponding to the highest similarity as the reference data if the highest similarity is larger than a first preset similarity threshold value.

Specifically, the device further comprises:

the related data determining module is used for respectively calculating the similarity between the target data and any one of the data to be marked in the data to be marked list after the target data is added into the data to be marked list, and recording related data with the similarity between the target data and any one of the data to be marked being greater than a second preset similarity threshold value;

the data labeling inquiry module is used for outputting related data labeling inquiry information when any one of the data to be labeled in the data list to be labeled is labeled, wherein the related data labeling inquiry information is used for inquiring whether the related data should be labeled with the same data tag as the data that has just been labeled;

and the related data labeling module is used for labeling the related data with a data label if the related data labeling feedback information is received.

Specifically, the device further comprises:

the marking prompt module is used for outputting data label marking prompt information when any one of the target data and/or the data to be marked in the data set to be marked is marked with a data label.

According to yet another aspect of the present application, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described data tagging method.

According to still another aspect of the present application, there is provided a computer device including a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, the processor implementing the data labeling method described above when executing the program.

By means of the above technical scheme, the data labeling method and device, storage medium and computer equipment provided by the present application manage the target data to be labeled in a unified manner. When target data is received, the labeled data set is queried for reference data whose similarity to the target data satisfies a preset reference data similarity condition, and the reference data is output when it is found. Compared with the prior art, the target data is managed in a unified manner, so labeling personnel no longer waste time asking questions and waiting for replies. Reference data for labeling the target data is queried from the labeled data, so the labeled data is fully used as a reference basis for labeling the target data, and the loss of labeling efficiency caused by repeated processing of similar data is avoided.

The foregoing is only an overview of the technical solutions of the present application. So that the technical means of the present application can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the present application are more readily apparent, specific embodiments of the present application are set out below.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:

FIG. 1 shows a flow chart of a data labeling method according to an embodiment of the present application;

FIG. 2 is a schematic flow chart of another method for labeling data according to an embodiment of the present application;

FIG. 3 is a schematic flow chart of another method for labeling data according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a data labeling device according to an embodiment of the present application;

FIG. 5 shows a schematic structural diagram of another data labeling device according to an embodiment of the present application.

Detailed Description

The present application will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other.

In this embodiment, a data labeling method is provided, as shown in fig. 1, and the method includes:

step 101, receiving target data;

step 102, inquiring whether reference data with similarity meeting a preset reference data similarity condition exists in the marked data set or not;

step 103, if yes, outputting the reference data which corresponds to the target data and meets the preset reference data similarity condition.

The method and the device can be applied to scenarios of sample data labeling. For example, labeling personnel may encounter problem sample data that they cannot label while assigning data tags to sample data, or a labeling model may encounter problem sample data that it cannot identify while labeling sample data. In the prior art, such problem sample data is generally reported to expert personnel, and the submitter waits for the expert personnel to label it. The data labeling method provided by the embodiment of the application is applied to a data labeling management system, which may comprise a target data submitting function module, a similar data searching function module and a similar data output module.

Firstly, in order to improve data labeling efficiency, when problem sample data exists, it can be submitted to the data labeling management system through the target data submitting function module, so that the target data (namely the problem sample data) is labeled by means of the data labeling management system. After labeling personnel submit data that they cannot label to the data labeling management system, the system labels and manages the target data in a unified manner, and the labeling personnel can continue to label other sample data without waiting for the labeling result of the submitted data, which improves the labeling efficiency of the sample data;

secondly, after the system receives the target data, it queries whether the labeled data set stored in the system contains labeled data similar to the target data, so that such labeled data can provide a reference basis for labeling the target data. Specifically, data whose similarity to the target data satisfies a certain condition can be screened out of the labeled data set and used as reference data, which serves as a reference for labeling the target data. For example, the reference data may be the labeled data whose similarity to the target data is higher than that of any other labeled data in the labeled data set and is also greater than a certain preset threshold. The similarity can be determined according to the cosine distance between the target data and the labeled data: the larger the distance, the smaller the similarity, and the smaller the distance, the larger the similarity. The similarity can also be determined according to the minimum edit distance between the target data and the labeled data, namely the number of editing operations required to convert the target data into the labeled data: the larger the minimum edit distance, the smaller the similarity, and the smaller the minimum edit distance, the larger the similarity;

finally, if the labeled data set contains reference data satisfying the preset reference data similarity condition, the reference data is output to expert personnel, so that they can quickly judge and label the target data based on the original labeling information of the reference data that is similar to the target data. Using the reference data as the basis for labeling the target data makes full use of the labeled data and improves the labeling efficiency of the target data. The system can receive target data submitted by multiple labeling personnel, so when the same labeling person encounters similar target data repeatedly, or different labeling personnel encounter similar target data, the management system prevents repeated processing of similar data from reducing labeling efficiency. The reference data can also be output to the labeling personnel, so that they learn of the labeling progress of the target data in time.
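The two similarity measures mentioned above can be sketched as follows. This is a minimal illustration only: it assumes the data is either a numeric feature vector (cosine case) or a string (edit-distance case), and the function names, the handling of zero vectors, and the normalization of the edit distance into the range 0 to 1 are assumptions that do not come from the application.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; the cosine distance is 1 minus this value."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def edit_distance(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn s into t (Levenshtein distance)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def edit_similarity(s: str, t: str) -> float:
    """Assumed normalization: a larger edit distance gives a smaller similarity."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(s, t) / longest
```

Either function returns a larger value for more similar inputs, so both can be compared against a preset similarity threshold in the same way.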

By applying the technical scheme of this embodiment, the target data to be labeled is managed in a unified manner. When target data is received, the labeled data set is queried for reference data whose similarity to the target data satisfies the preset reference data similarity condition, and the reference data is output when it is found. Compared with the prior art, the target data is managed in a unified manner, so labeling personnel no longer waste time asking questions and waiting for replies. Reference data for labeling the target data is queried from the labeled data, so the labeled data is fully used as a reference basis for labeling the target data, and the loss of labeling efficiency caused by repeated processing of similar data is avoided.

Further, as a refinement and extension of the foregoing embodiment, in order to fully describe a specific implementation procedure of the embodiment, another data labeling method is provided, as shown in fig. 2, where the method includes:

step 201, target data is received.

In the embodiment of the application, when labeling personnel encounter data that they cannot label, they can flag the data as a problem sample, skip the wait for the problem to be resolved and move directly to the next piece of data; the data flagged as a problem sample is submitted to the data labeling management system.

Step 202, calculating the similarity between the target data and any marked data in the marked data set.

Step 203, if the highest similarity is greater than the first preset similarity threshold, using the labeled data corresponding to the highest similarity as the reference data.

In steps 202 and 203, after the system receives the target data, it needs to search the labeled data set for data that is similar to the target data and can provide a reference basis for labeling it. Specifically, the similarity between the target data and each piece of labeled data in the labeled data set is calculated first, and then it is judged whether the maximum similarity is greater than a specific value. If so, the labeled data corresponding to the maximum similarity is taken as the reference data corresponding to the target data, so that it can provide a reference basis for labeling the target data. Because its similarity to the target data is higher than that of any other labeled data, the reference data is the most reliable reference basis available for labeling the target data, and because that similarity also exceeds the first preset similarity threshold, the reference data can be expected to provide an effective reference for labeling the target data.
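Steps 202 and 203 can be sketched roughly as below. The function name `find_reference`, the representation of labeled data as `(data, tag)` pairs, and the pluggable `similarity` callable are illustrative assumptions; the application does not prescribe a data structure or a particular similarity function.

```python
from typing import Callable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def find_reference(
    target: T,
    labeled_set: List[Tuple[T, str]],
    similarity: Callable[[T, T], float],
    first_threshold: float,
) -> Optional[Tuple[T, str, float]]:
    """Return (data, tag, score) for the most similar labeled item,
    or None when the best score does not exceed the threshold."""
    best: Optional[Tuple[T, str]] = None
    best_score = float("-inf")
    for data, tag in labeled_set:
        # Step 202: similarity between the target and each labeled item.
        score = similarity(target, data)
        if score > best_score:
            best, best_score = (data, tag), score
    # Step 203: accept the best match only if it is strictly above the threshold.
    if best is not None and best_score > first_threshold:
        return best[0], best[1], best_score
    return None
```

A return value of `None` corresponds to the case in which no reference data exists and the target data would go to the data list to be labeled.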

It should be noted that in the process of screening the reference data according to the similarity in step 203, other screening conditions may be set, for example, the preset reference data similarity condition may also be set so that all marked data with similarity greater than a specific threshold are determined as the reference data, so that more reference bases may be provided for marking the target data, which is helpful for improving accuracy of data marking.
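The alternative screening condition described above, in which every labeled item above a threshold counts as reference data, might look like the following sketch. The name `find_all_references` and the choice to sort the matches from most to least similar are assumptions made for illustration.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def find_all_references(
    target: T,
    labeled_set: List[Tuple[T, str]],
    similarity: Callable[[T, T], float],
    threshold: float,
) -> List[Tuple[T, str, float]]:
    """Return every (data, tag, score) whose score exceeds the threshold."""
    scored = [(data, tag, similarity(target, data)) for data, tag in labeled_set]
    matches = [item for item in scored if item[2] > threshold]
    # Assumption: present the most similar reference first.
    matches.sort(key=lambda item: item[2], reverse=True)
    return matches
```

An empty list plays the same role here as the absence of reference data in the single-best-match variant.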

Step 204, if no such reference data exists, adding the target data to the data list to be labeled.

In the above embodiment, if the marked data set does not have the reference data satisfying the preset reference data similarity condition, the target data is added to the to-be-marked data list, so that the technical expert can intensively mark the data in the list.

Step 205, if yes, outputting the reference data corresponding to the target data and meeting the preset reference data similarity condition.

In the above embodiment, if the reference data corresponding to the target data can be found, the reference data and its corresponding data tag are output, so that the technical expert or the intelligent terminal can further determine whether the reference data has reference value for the labeling of the target data.

Step 206, receiving output feedback information corresponding to the reference data, wherein the output feedback information is used for indicating whether the data tag of the reference data is suitable for labeling the target data.

Step 207, if the data tag of the reference data is suitable for labeling the target data, labeling the target data based on the data tag of the reference data, and adding the obtained labeled data to the labeled data set.

Step 208, outputting data tag labeling prompt information when the target data is labeled with a data tag.

Step 209, if the data tag of the reference data is not suitable for labeling the target data, adding the target data to the data list to be labeled.

In steps 206 to 209, after the reference basis corresponding to the target data is output, the technical expert or the intelligent terminal returns corresponding feedback information to the system. The feedback information indicates whether the data tag of the reference data can be used to label the target data, that is, whether the data tag that the target data should receive is consistent with the data tag of the reference data. If the feedback information indicates that the data tag of the reference data is suitable for labeling the target data, the target data can be labeled directly with the data tag of the reference data. Further, after the target data is labeled, it can be added to the labeled data set, so that when the system later receives other target data similar to it, this data can be found in the labeled data set and provide a labeling basis for the similar target data, which helps improve labeling efficiency. Meanwhile, after the target data is labeled, data tag labeling prompt information, which should include the target data and the corresponding data tag, is sent to the submitter of the target data to inform the submitter that the target data has been labeled, so that the submitter learns of the labeling progress and the assigned data tag. In addition, if the feedback information indicates that the data tag of the reference data is not suitable for labeling the target data, the target data is added to the data list to be labeled to await labeling by the relevant technical expert.
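The branching in steps 206 to 209 can be sketched as follows. The `LabelingStore` container, the `handle_feedback` function and the plain-string prompt messages are hypothetical names and simplifications introduced here; the application does not specify how the labeled data set, the list to be labeled or the prompt information are stored.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class LabelingStore:
    """Hypothetical in-memory state of the labeling management system."""
    labeled: List[Tuple[str, str]] = field(default_factory=list)  # (data, tag)
    pending: List[str] = field(default_factory=list)              # data list to be labeled
    prompts: List[str] = field(default_factory=list)              # messages for submitters


def handle_feedback(store: LabelingStore, target: str,
                    reference_tag: str, tag_is_suitable: bool) -> None:
    """Apply the feedback received in step 206 to one piece of target data."""
    if tag_is_suitable:
        # Step 207: reuse the reference tag and grow the labeled data set.
        store.labeled.append((target, reference_tag))
        # Step 208: tell the submitter which tag was assigned.
        store.prompts.append(f"{target!r} labeled as {reference_tag!r}")
    else:
        # Step 209: leave the target for an expert to label later.
        store.pending.append(target)
```

Because accepted targets are appended to `labeled`, they become candidates for reference data the next time similar target data is submitted.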

Fig. 3 is a schematic flow chart of another data labeling method provided in an embodiment of the present application, which is used for a data list to be labeled, and as shown in fig. 3, the method includes:

Step 301, calculating the similarity between the target data and each piece of data to be labeled in the data list to be labeled, and recording the related data whose similarity is greater than the second preset similarity threshold.

Step 302, receiving labeling information, wherein the labeling information comprises data to be labeled contained in the list to be labeled and the corresponding data tags.

Step 303, labeling the corresponding data to be labeled according to the data tag of the labeling information, and adding the labeled data obtained after labeling to the labeled data set.

Step 304, outputting related data labeling inquiry information when any data to be labeled in the data list to be labeled is labeled, wherein the related data labeling inquiry information is used for inquiring whether the related data should be labeled with the same data tag as the data that has just been labeled.

Step 305, if related data labeling feedback information is received, labeling the related data with the data tag.

Step 306, outputting data tag labeling prompt information when any data to be labeled in the data list to be labeled is labeled with a data tag.

In the above embodiment, for data newly added to the data list to be labeled, the data to be labeled in the list that is similar to it can be identified first. When the new data, or one of the pieces of data that is highly similar to it, is later labeled, the assigned data tag can serve as a labeling reference for the other highly similar data. For example, suppose newly added data A is highly similar to data B and data C already in the list. When data A is labeled, because data A is highly similar to both data B and data C, the data tag assigned to data A is likely to suit data B and data C as well, so an inquiry can be issued as to whether data B and data C can be labeled with the same data tag as data A, which helps improve the efficiency of data labeling.
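The handling of related data in the list to be labeled can be sketched as below. The `PendingList` class, its method names, and the `confirm` callback standing in for the inquiry and feedback exchange are hypothetical. The sketch also assumes that the relation is recorded symmetrically and that items are unique, orderable strings, neither of which is stated in the application.

```python
from typing import Callable, Dict, List, Set, Tuple


class PendingList:
    """Hypothetical data list to be labeled that tracks related items."""

    def __init__(self, similarity: Callable[[str, str], float],
                 second_threshold: float) -> None:
        self.similarity = similarity
        self.second_threshold = second_threshold
        self.items: List[str] = []
        self.related: Dict[str, Set[str]] = {}

    def add(self, target: str) -> None:
        # Record every pending item whose similarity to the new target
        # exceeds the second preset threshold (assumed symmetric).
        self.related[target] = set()
        for other in self.items:
            if self.similarity(target, other) > self.second_threshold:
                self.related[target].add(other)
                self.related[other].add(target)
        self.items.append(target)

    def _remove(self, item: str) -> Set[str]:
        self.items.remove(item)
        neighbours = self.related.pop(item)
        for neighbour in neighbours:
            self.related[neighbour].discard(item)
        return neighbours

    def label(self, item: str, tag: str,
              confirm: Callable[[str, str], bool]) -> List[Tuple[str, str]]:
        """Label one pending item, then ask via confirm(related_item, tag)
        whether each related item should receive the same tag."""
        newly_labeled = [(item, tag)]
        for other in sorted(self._remove(item)):
            if confirm(other, tag):
                self._remove(other)
                newly_labeled.append((other, tag))
        return newly_labeled
```

The returned pairs are the items that would be added to the labeled data set; related items for which `confirm` returns `False` stay in the list to be labeled.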

In addition, after any data in the data list to be marked is marked, marking prompt information should be fed back to the submitter corresponding to the data so as to inform the submitter that the data is marked and the corresponding data label.

Further, as a specific implementation of the method of fig. 1, an embodiment of the present application provides a data labeling device, as shown in fig. 4, where the device includes: a target data receiving module 41, a reference data inquiring module 42, and a reference data outputting module 43.

A target data receiving module 41 for receiving target data;

a reference data query module 42, configured to query whether reference data with similarity to the target data satisfying a preset reference data similarity condition exists in the labeled data set;

the reference data output module 43 is configured to output, if any, reference data corresponding to the target data that satisfies a preset reference data similarity condition.
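One way to picture how modules 41 to 43 cooperate is the sketch below. The class names mirror the module names in the text, but the constructor arguments, the `sink` callback and the `submit` method are assumptions for illustration, not the structure claimed in the application.

```python
from typing import Callable, List, Optional, Tuple

Labeled = Tuple[str, str]  # (data, tag)


class TargetDataReceivingModule:
    def __init__(self) -> None:
        self.received: List[str] = []

    def receive(self, target: str) -> str:
        self.received.append(target)
        return target


class ReferenceDataQueryModule:
    def __init__(self, labeled_set: List[Labeled],
                 similarity: Callable[[str, str], float],
                 threshold: float) -> None:
        self.labeled_set = labeled_set
        self.similarity = similarity
        self.threshold = threshold

    def query(self, target: str) -> Optional[Labeled]:
        if not self.labeled_set:
            return None
        best = max(self.labeled_set,
                   key=lambda item: self.similarity(target, item[0]))
        if self.similarity(target, best[0]) > self.threshold:
            return best
        return None


class ReferenceDataOutputModule:
    def __init__(self, sink: Callable[[str, Labeled], None]) -> None:
        self.sink = sink  # assumed delivery channel, e.g. an expert's queue

    def output(self, target: str, reference: Labeled) -> None:
        self.sink(target, reference)


class DataLabelingDevice:
    def __init__(self, receiver: TargetDataReceivingModule,
                 query: ReferenceDataQueryModule,
                 output: ReferenceDataOutputModule) -> None:
        self.receiver = receiver
        self.query = query
        self.output = output

    def submit(self, target: str) -> bool:
        """Return True when reference data was found and output."""
        target = self.receiver.receive(target)
        reference = self.query.query(target)
        if reference is None:
            return False
        self.output.output(target, reference)
        return True
```

Keeping the similarity function and the output sink as constructor arguments lets the same device skeleton work with different data types and delivery channels.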

In a specific application scenario, as shown in fig. 5, the apparatus further includes: a feedback information receiving module 44, a target data labeling module 45, and a first list establishing module 46.

The feedback information receiving module 44 is configured to receive output feedback information corresponding to the reference data after the reference data corresponding to the target data and meeting a preset reference data similarity condition is output, where the output feedback information is used to indicate whether the data tag of the reference data is suitable for labeling the target data;

the target data labeling module 45 is configured to label the target data based on the data label of the reference data if the data label of the reference data is suitable for labeling the target data, and add the obtained labeled data to the labeled data set;

the first list creation module 46 is configured to add the target data to the to-be-marked data list if the data tag of the reference data is not suitable for marking the target data.

In a specific application scenario, as shown in fig. 5, the apparatus further includes: the second list creation module 47.

The second list creation module 47 is configured to query whether there is reference data in the marked data set, whose similarity with the target data satisfies a preset reference data similarity condition, and if not, add the target data to the to-be-marked data list.

In a specific application scenario, as shown in fig. 5, the apparatus further includes: the labeling information receiving module 48 and the labeled set establishing module 49.

The marking information receiving module 48 is configured to receive marking information, where the marking information includes data to be marked and a data tag corresponding to the data to be marked included in the list to be marked;

the marked set creating module 49 is configured to mark the corresponding data to be marked according to the data tag of the marking information, and add the marked data obtained after marking to the marked data set.

In a specific application scenario, as shown in fig. 5, the reference data query module 42 specifically includes: a similarity calculation unit 421, and a reference data determination unit 422.

A similarity calculating unit 421 for calculating the similarity between the target data and each piece of labeled data in the labeled data set;

the reference data determining unit 422 is configured to take, as the reference data, the labeled data corresponding to the highest similarity if the highest similarity is greater than the first preset similarity threshold.

In a specific application scenario, as shown in fig. 5, the apparatus further includes: a related data determining module 50, a data labeling inquiry module 51 and a related data labeling module 52.

The related data determining module 50 is configured to calculate the similarity between the target data and any one of the data to be marked in the data to be marked list after adding the target data to the data to be marked list, and record related data having a similarity with any one of the data to be marked greater than a second preset similarity threshold;

the data labeling query module 51 is configured to output related data labeling inquiry information when any data to be labeled in the data list to be labeled is labeled, where the related data labeling inquiry information is used to inquire whether the related data should be labeled with the same data tag as the data that has just been labeled;

the related data labeling module 52 is configured to label the related data with a data tag if the related data labeling feedback information is received.

In a specific application scenario, as shown in fig. 5, the apparatus further includes: the annotation prompt module 53.

The labeling prompt module 53 is configured to output data tag labeling prompt information when a data tag is being labeled for the target data and/or for any data to be labeled in the data list to be labeled.

It should be noted that, for other descriptions of the functional units of the data labeling device provided in the embodiment of the present application, reference may be made to the corresponding descriptions of fig. 1 and fig. 2, which are not repeated herein.

Based on the above-mentioned methods shown in fig. 1 to 3, correspondingly, the embodiments of the present application further provide a storage medium, on which a computer program is stored, which when executed by a processor, implements the above-mentioned data labeling method shown in fig. 1 to 3.

Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and which includes several instructions for causing a computer device (such as a personal computer, a server, or a network device) to perform the methods described in the various implementation scenarios of the present application.

Based on the methods shown in fig. 1 to 3 and the virtual device embodiments shown in fig. 4 and 5, in order to achieve the above objects, the embodiments of the present application further provide a computer device, which may specifically be a personal computer, a server, a network device, etc. The computer device includes a storage medium and a processor: the storage medium is used to store a computer program, and the processor is used to execute the computer program to implement the data labeling method described above and shown in fig. 1 to 3.

Optionally, the computer device may also include a user interface, a network interface, a camera, radio frequency (RF) circuitry, sensors, audio circuitry, a Wi-Fi module, and the like. The user interface may include a display screen and an input unit such as a keyboard; optionally, the user interface may also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a Bluetooth interface or a Wi-Fi interface), etc.

It will be appreciated by those skilled in the art that the computer device structure provided in the present embodiment does not limit the computer device, which may include more or fewer components, may combine certain components, or may use a different arrangement of components.

The storage medium may also include an operating system and a network communication module. The operating system is a program that manages and maintains the hardware and software resources of the computer device and supports the execution of information processing programs and other software and/or programs. The network communication module is used to implement communication among the components inside the storage medium, as well as communication with other hardware and software in the physical device.

From the description of the above embodiments, those skilled in the art will clearly understand that the present application may be implemented by software plus a necessary general-purpose hardware platform, or by hardware. Target data to be labeled is managed in a unified way: when the target data is received, the labeled data set is queried for reference data whose similarity to the target data satisfies a preset reference data similarity condition, and the reference data is output when the query finds it. Compared with the prior art, unified management of the target data means that labeling personnel no longer waste time asking questions and waiting for replies; the reference data corresponding to the target data is retrieved from the labeled data, so that the labeled data is fully used as a reference basis for labeling the target data, and the problem of labeling efficiency being reduced by repeated processing of similar data is solved.

Those skilled in the art will appreciate that the drawings are merely schematic illustrations of one preferred implementation scenario, and that the modules or flows in the drawings are not necessarily required to practice the present application. Those skilled in the art will also appreciate that the modules of an apparatus in an implementation scenario may be distributed in that apparatus as described for the implementation scenario, or may be located, with corresponding changes, in one or more apparatuses different from that of the implementation scenario. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.

The foregoing serial numbers of the present application are for description only and do not indicate the relative merits of the implementation scenarios. The foregoing discloses only a few specific implementation scenarios of the present application; the present application is not limited thereto, and any variation conceivable by a person skilled in the art shall fall within the protection scope of the present application.

Claims

1. A method for labeling data, the method comprising:

receiving target data, wherein the target data is problem sample data which cannot be identified when a labeling model labels the sample data;

if so, outputting the reference data which corresponds to the target data and meets the preset reference data similarity condition;

if the data tag of the reference data is not suitable for marking the target data, adding the target data into a data list to be marked;

2. The method of claim 1, wherein after receiving the output feedback information corresponding to the reference data, the method further comprises:

if the data tag of the reference data is suitable for marking the target data, marking the target data based on the data tag of the reference data, and adding the obtained marked data into the marked data set.

3. The method of claim 1, wherein after querying whether reference data satisfying a preset reference data similarity condition is present in the annotated data set, the method further comprises:

and if not, adding the target data into the data list to be marked.

4. A method according to claim 2 or 3, characterized in that the method further comprises:

receiving marking information, wherein the marking information comprises data to be marked and corresponding data labels, which are contained in the data to be marked list;

5. The method of claim 4, wherein the querying whether the reference data with the similarity with the target data satisfying the preset reference data similarity condition exists in the marked data set comprises:

6. The method of claim 5, wherein the method further comprises:

7. A data labeling device, the device comprising:

the target data receiving module is used for receiving target data, wherein the target data are problem sample data which cannot be identified when the sample data are marked by the marking model;

the reference data output module is used for outputting the reference data which corresponds to the target data and meets the preset reference data similarity condition if the reference data exists;

the feedback information receiving module is used for receiving output feedback information corresponding to the reference data, wherein the output feedback information is used for indicating whether a data tag of the reference data is suitable for labeling the target data;

the first list building module is used for adding the target data into a data list to be marked if the data tag of the reference data is not suitable for marking the target data;

the related data determining module is used for respectively calculating the similarity between the target data and any one of the data to be marked in the data to be marked list, and recording related data with the similarity between the target data and any one of the data to be marked being greater than a second preset similarity threshold;

8. A storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the data annotation method of any of claims 1 to 6.

9. A computer device comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, characterized in that the processor implements the data annotation method according to any of claims 1 to 6 when executing the program.