CN115496036A - Intelligent text data labeling method and device, computer equipment and storage medium - Google Patents

Intelligent text data labeling method and device, computer equipment and storage medium

Info

Publication number
CN115496036A
Authority
CN
China
Prior art keywords
labeling
data
distribution
sample set
labeled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211140589.6A
Other languages
Chinese (zh)
Inventor
段炼
周忠诚
黄九鸣
张圣栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Xinghan Shuzhi Technology Co ltd
Original Assignee
Hunan Xinghan Shuzhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Xinghan Shuzhi Technology Co ltd filed Critical Hunan Xinghan Shuzhi Technology Co ltd
Priority claimed from CN202211140589.6A
Publication of CN115496036A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents
    • G06F40/117Tagging; Marking up; Designating a block; Setting of attributes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of natural language processing, and provides an intelligent text data labeling method, apparatus, computer device and storage medium. The method comprises: acquiring first text data to be annotated, which is retrieved from a data source corresponding to annotation task information; performing machine labeling and manual labeling on the first text data to be annotated, respectively, to obtain a machine labeling sample set and a manual labeling sample set; determining the current distribution of the labeling task from the machine labeling sample set and the manual labeling sample set; and judging whether the current distribution of the labeling task is aligned with the target distribution of the labeling task, and if not, performing data deletion and/or data compensation on the current distribution until it is aligned, thereby obtaining the annotation data. The method improves labeling efficiency.

Description

Text data intelligent labeling method and device, computer equipment and storage medium
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to an intelligent text data labeling method and device, a computer device and a storage medium.
Background
With the rise and development of artificial intelligence, it has been widely applied across many fields, and the demand for labeled data keeps growing. Existing text data labeling generally works by manually retrieving or importing a set of texts into a labeling system and notifying annotators to label them by hand. If the labeled data then proves unsatisfactory when used for model training or other purposes, new manually labeled data is obtained through a similar process until a satisfactory data set is reached. The existing labeling process is therefore cumbersome, requires substantial manual involvement, and involves multiple manual iterations, so the manual labeling workload is heavy and labeling efficiency suffers.
Disclosure of Invention
In view of the foregoing, it is necessary to provide an intelligent text data labeling method, apparatus, computer device and storage medium that can improve labeling efficiency.
The invention provides an intelligent text data labeling method, which comprises the following steps:
acquiring first text data to be annotated, wherein the first text data to be annotated is obtained by retrieval according to a data source corresponding to annotation task information;
respectively performing machine labeling and manual labeling on the first text data to be annotated to obtain a machine labeling sample set and a manual labeling sample set;
determining the current distribution of the labeling tasks according to the machine labeling sample set and the manual labeling sample set;
and judging whether the current distribution of the labeling task is aligned with the target distribution of the labeling task, and if not, performing data deletion and/or data compensation on the current distribution until it is aligned, thereby obtaining the labeled data.
In one embodiment, determining the current distribution of the labeling task according to the machine labeling sample set and the manual labeling sample set includes:
acquiring the union of the machine labeling sample set and the manual labeling sample set to obtain a union sample set;
and counting the number of samples of each category in the union sample set to obtain the current distribution of the labeling tasks.
In one embodiment, judging whether the current distribution of the labeling task is aligned with the target distribution of the labeling task and, if not, performing data deletion and/or data compensation on the current distribution until it is aligned to obtain the annotation data includes:
the quantity of samples of the same category sample in the current distribution of the labeling task and the target distribution of the labeling task is differentiated to obtain the difference distribution of each category sample;
when the difference distribution meets a preset first error requirement, determining that the current distribution of the labeling tasks is aligned with the target distribution of the labeling tasks, and taking the current distribution of the labeling tasks as labeling data;
and when the difference distribution does not meet the preset first error requirement, performing data deletion and/or data compensation on the category samples corresponding to the difference amounts until alignment is reached, thereby obtaining the labeled data.
In one embodiment, performing data compensation on the current distribution of the labeling task includes:
acquiring difference distribution between the current distribution of the labeling tasks and the target distribution of the labeling tasks;
acquiring a preset number of second text data to be labeled from the data source, performing machine labeling on the second text data to be labeled to obtain a compensation machine labeling sample set, and counting its compensation distribution;
when the compensation distribution and the difference distribution do not meet a preset second error requirement, returning to the step of acquiring a preset number of second text data to be labeled from the data source for iterative compensation, until the compensation distribution and the difference distribution meet the error requirement;
and supplementing the compensation distribution meeting the preset second error requirement into the current distribution of the labeling task.
In one embodiment, performing machine labeling and manual labeling on the first data to be labeled respectively to obtain a machine labeling sample set and a manual labeling sample set includes:
inputting the first data to be labeled into a trained labeling model, and labeling the first data to be labeled by the labeling model to obtain a machine labeling sample set;
and visually displaying the first data to be labeled and/or the machine labeling sample set, receiving a manually input labeling operation instruction, labeling the first data to be labeled and/or the machine labeling sample set according to the labeling operation instruction, and obtaining the manual labeling sample set.
In one embodiment, before performing machine labeling and manual labeling on the first data to be labeled respectively to obtain a machine labeling sample set and a manual labeling sample set, the method further includes: performing data processing on the first data to be annotated, where the data processing includes any one or more of data cleaning and conversion, deduplication, and translation.
In one embodiment, an intelligent labeling device for text data is provided, which includes:
the data retrieval module is used for acquiring first text data to be annotated, and the first text data to be annotated is retrieved according to a data source corresponding to the annotation task information;
the data labeling module is used for respectively performing machine labeling and manual labeling on the first data to be labeled to obtain a machine labeling sample set and a manual labeling sample set;
the distribution statistical module is used for determining the current distribution of the labeling tasks according to the machine labeling sample set and the manual labeling sample set;
and the distribution alignment module is used for judging whether the current distribution of the labeling task is aligned with the target distribution of the labeling task, and if not, performing data deletion and/or data compensation until the current distribution is aligned to obtain the annotation data.
The invention also provides computer equipment which comprises a processor and a memory, wherein the memory stores a computer program, and the processor realizes the steps of the intelligent text data annotation method when executing the computer program.
The invention also provides a computer readable storage medium, on which a computer program is stored, and the computer program realizes the steps of the intelligent text data labeling method when being executed by a processor.
According to the intelligent text data labeling method, apparatus, computer device and storage medium, text data to be labeled is automatically retrieved from a data source; a machine labeling sample set and a manual labeling sample set are obtained through machine and manual labeling respectively; the distribution of the currently labeled data is determined from the two sample sets; the current distribution is then compared with the target distribution of the labeling task to judge whether they are aligned, and if not, data deletion and/or data compensation is performed until they are aligned, thereby obtaining the labeled data. The method can automatically access data sources, combines machine and manual labeling to improve labeling efficiency, and adjusts the data by dynamically perceiving and reconciling the difference between the labeled data distribution and the labeling task target distribution, enabling efficient labeling.
Drawings
Fig. 1 is an application environment diagram of an intelligent text data labeling method in an embodiment.
Fig. 2 is a schematic flowchart of an intelligent text data labeling method in an embodiment.
FIG. 3 is a block diagram illustrating an exemplary embodiment of an intelligent labeling apparatus for text data.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The intelligent text data annotation method provided by the application can be applied to the application environment shown in fig. 1, wherein the application environment relates to the terminal 102 and the server 104. Wherein the terminal 102 communicates with the server 104 via a network. The terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices, and the server 104 may be implemented by an independent server or a server cluster composed of a plurality of servers.
When the terminal 102 receives the annotation task, the intelligent annotation method for the text data can be implemented by the terminal 102 alone. The terminal 102 may also send the annotation task to the server 104 for communication, and the server 104 may implement the intelligent annotation method for text data. Taking the server 104 as an example, specifically, the server 104 obtains first text data to be annotated, where the first text data to be annotated is obtained by retrieving according to a data source corresponding to the annotation task information; the server 104 respectively carries out machine labeling and manual labeling on the first data to be labeled to obtain a machine labeling sample set and a manual labeling sample set; the server 104 determines the current distribution of the labeling tasks according to the machine labeling sample set and the manual labeling sample set; the server 104 judges whether the current distribution of the annotation task is aligned with the target distribution of the annotation task, and if not, the current distribution of the annotation task is subjected to data deletion or/and data compensation until the current distribution of the annotation task is aligned to obtain annotation data.
In one embodiment, as shown in fig. 2, an intelligent annotation method for text data is provided, which is described by taking an example that the method is applied to a server, and includes the following steps:
step S201, obtaining first text data to be annotated, wherein the first text data to be annotated is obtained by retrieving according to a data source corresponding to annotation task information.
The first text data to be annotated is the text data currently acquired for labeling. The annotation task information is predefined task information comprising a data source and an annotation task target distribution. The data source can be accessed, with or without authentication, through identification information and a corresponding data source adapter program, yielding a number of structured data records; each structured record contains at least one text message and may also contain label description information. The data source adapter program receives data source identification information and, optionally, authentication information, and provides a program interface for outputting a number of samples; each output sample comprises a text field and a label field, and the interface may optionally accept an integer parameter specifying the expected number of output samples. The annotation task target distribution specifies, for each of several categories, the number of samples the sample set finally formed by the labeling task is expected to contain.
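The adapter interface described above can be sketched as follows. This is a minimal illustration: the class and method names (`DataSourceAdapter`, `output_samples`, `InMemoryAdapter`) are assumptions for the sketch, not the patent's actual implementation.

```python
from typing import List, Optional, TypedDict


class Sample(TypedDict, total=False):
    text: str      # text field: the character string to be labeled
    label: object  # optional label description information


class DataSourceAdapter:
    """Receives data source identification (and optionally authentication)
    information and exposes an interface that outputs samples."""

    def __init__(self, source_id: str, auth: Optional[str] = None):
        self.source_id = source_id
        self.auth = auth

    def output_samples(self, n: Optional[int] = None) -> List[Sample]:
        """Output samples; `n` is the optional expected sample count."""
        raise NotImplementedError


class InMemoryAdapter(DataSourceAdapter):
    """Toy adapter backed by a Python list, standing in for a real
    file- or API-backed data source."""

    def __init__(self, samples: List[Sample]):
        super().__init__(source_id="memory")
        self._samples = samples

    def output_samples(self, n: Optional[int] = None) -> List[Sample]:
        return list(self._samples) if n is None else list(self._samples[:n])
```

A concrete adapter for a json file would implement `output_samples` by parsing the file; the in-memory variant above only illustrates the interface contract.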
Specifically, when the server receives an annotation task, it obtains the annotation task information corresponding to the task, namely the data source and the annotation task target distribution. It then calls the sample-output program interface of the data source's adapter to obtain data samples as the first text data to be annotated. Each data sample output by the interface comprises a text field and a label field: the text field is a piece of text, i.e. a character string, and the label field holds the label description information for that text, which may be a category label for the text as a whole, category labels for text segments, or relation category labels between text segments. It should be understood that the data from the data source may also contain only the text field and no label field, that is, no annotation description information is output.
For example, when the data source corresponding to the annotation task is a json file with a corresponding json parser adapter, the first text data to be annotated is represented in json format, of the form { "text": "the puppy is lovely", "label": "positive", "label2": [[[4,5], "affective word"]], "label3": [[[1,2], [4,5], "rating"]] }, indicating that the "text" field of the sample is its text content, the category label of the whole text is "positive", the text segment between positions 4 and 5 (i.e. "lovely", the positions referring to characters of the original Chinese text) carries the label "affective word", and the segment between positions 1 and 2 ("puppy") and the segment between positions 4 and 5 ("lovely") are related by the label "rating".
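Assuming the segment positions in such samples are 1-based, inclusive character offsets (the patent does not state the indexing convention), extracting a labeled segment could look like the sketch below; the English text and the offsets are illustrative only:

```python
import json

# Hypothetical sample in the same shape as the patent's json example.
sample = json.loads(
    '{"text": "puppy is lovely", "label": "positive",'
    ' "label2": [[[10, 15], "affective word"]]}'
)


def span_text(text: str, start: int, end: int) -> str:
    """Return the segment at a 1-based, inclusive character span."""
    return text[start - 1:end]


(start, end), tag = sample["label2"][0]
print(span_text(sample["text"], start, end), tag)  # prints: lovely affective word
```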
And S202, respectively carrying out machine labeling and manual labeling on the first data to be labeled to obtain a machine labeling sample set and a manual labeling sample set.
Specifically, after acquiring the first text data to be annotated, the server labels it in two ways: machine labeling and manual labeling. The annotation data obtained by machine labeling forms the machine labeling sample set, and the annotation data obtained by manual labeling forms the manual labeling sample set.
In one embodiment, step S202 includes: inputting first data to be labeled into a trained labeling model, and labeling the first data to be labeled by using a labeling model to obtain a machine labeling sample set; and visually displaying the first data to be labeled and/or the machine labeling sample set, receiving a manually input labeling operation instruction, and labeling the first data to be labeled and/or the machine labeling sample set according to the labeling operation instruction to obtain a manual labeling sample set.
Specifically, machine labeling in this embodiment refers to the process of running inference with an algorithm model on input text to output annotation description information; it may be applied to text data without annotation description information, or to text data whose existing annotation is inferior to what the currently trained labeling model produces. When the server performs machine labeling on the data to be labeled, it calls a trained labeling model, obtained through supervised learning on manually labeled data. The data to be labeled is then fed into the labeling model: the model can label text data that has no annotation description information, and can also re-label text that already has annotation description information.
In this embodiment, manual labeling means that the text field of the data to be labeled, its translation field, and the corresponding label description information field are visually presented together; a manually input labeling operation instruction for the presented data is then received and converted into a modification of the label description information. Marking refers to adding new annotation description information: before the operation the object carries no related annotation description information, and after the manual labeling operation it does. Proofreading refers to changing existing annotation information, for example deleting it, through a manually input labeling operation. Confirmation means that, after inspecting the sample, the annotator approves the existing annotation description information, and no modification is needed. After manual labeling is completed, the processed text data records a manually-labeled flag.
In addition, the manual labeling process in this embodiment may run synchronously with machine labeling, each labeling the first data to be labeled separately; alternatively, after machine labeling is completed, the machine labeling sample set may be manually labeled, alone or together with the first data to be labeled. Any modification that manual labeling makes to the machine labeling sample set is regarded as a manual labeling result, i.e. it is output as part of the manual labeling sample set, so that manual work provides a secondary calibration of the machine labeling and improves labeling accuracy.
And step S203, determining the current distribution of the labeling tasks according to the machine labeling sample set and the manual labeling sample set.
The annotation task target distribution specifies, for several categories, the number of samples the finally formed sample set is expected to contain, so the current distribution of the labeling task can be understood as the per-category sample counts of the sample set currently formed after machine and manual labeling. The data format of the task distribution is consistent with the format of the predefined data source; for example, with a json file as the data source, the task distribution is a json object. For instance, the annotation task target distribution { "label": ["animal", "non-animal"], "num": [100, 110] } indicates that the label categories are "animal" and "non-animal", and that the finally formed sample set is expected to contain 100 "animal" samples and 110 "non-animal" samples.
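The parallel-array form above can be converted into a per-category mapping for later comparison against the current distribution; a small sketch (the field names follow the example above, not a fixed schema):

```python
# Target distribution in the json-object form shown in the example.
target_distribution = {"label": ["animal", "non-animal"], "num": [100, 110]}

# Pair each category with its expected sample count.
target_counts = dict(zip(target_distribution["label"],
                         target_distribution["num"]))
print(target_counts)  # prints: {'animal': 100, 'non-animal': 110}
```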
Specifically, after a machine labeling sample set and a manual labeling sample set are obtained through machine and manual labeling, the server performs statistics on the number of samples of each category based on the two sample sets to obtain the current distribution of the labeling tasks.
In one embodiment, step S203 comprises: acquiring the union of the machine labeling sample set and the manual labeling sample set to obtain a union sample set; and counting the number of samples of each category in the union sample set to obtain the current distribution of the labeling task.
Specifically, when counting the current distribution of the labeling task, in order to avoid counting the same labeling result in both the machine labeling sample set and the manual labeling sample set, the union of the two sample sets is computed first; this union is called the union sample set. The number of samples of each category in the union sample set is then counted to obtain the current distribution of the labeling task.
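One way to realize this step is sketched below, assuming each sample's text string identifies it uniquely (consistent with the deduplication rule described later) and that, where both sets contain the same sample, the manual label takes precedence; both assumptions go beyond what the patent states.

```python
from collections import Counter
from typing import Dict, List


def current_distribution(machine_set: List[Dict],
                         manual_set: List[Dict]) -> Counter:
    """Union the two sample sets keyed by text string, then count
    samples per category to get the task's current distribution."""
    merged = {s["text"]: s for s in machine_set}
    merged.update({s["text"]: s for s in manual_set})  # manual label wins
    return Counter(s["label"] for s in merged.values())
```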
And step S204, judging whether the current distribution of the annotation task is aligned with the target distribution of the annotation task, and if not, performing data deletion and/or data compensation on the current distribution until it is aligned to obtain the annotation data.
Alignment means judging whether, for each category, the difference in sample count between the current distribution of the labeling task and its target distribution lies within a certain error range. If so, the current distribution satisfies the target distribution, the labeling task is complete, and the data corresponding to the current distribution is taken as the final annotation data. If not, additional data deletion and/or data compensation is required to obtain the final labeled data.
In one embodiment, step S204 includes: subtracting the per-category sample counts of the annotation task target distribution from those of the current distribution to obtain the difference distribution of the category samples; when the difference distribution meets a preset first error requirement, determining that the current distribution of the labeling task is aligned with the target distribution and taking the current distribution as the labeled data; and when the difference distribution does not meet the preset first error requirement, performing data deletion and/or data compensation on the category samples corresponding to the difference amounts until alignment is reached, thereby obtaining the labeled data.
Specifically, the current distribution of the labeling task is based on the union of the manual labeling sample set and the machine labeling sample set: the number of samples of each category in the union sample set is counted to obtain the current distribution, and the target distribution is then subtracted from it category by category to obtain the per-category difference counts as the difference distribution. If the difference distribution meets the preset first error requirement (the first error requirement can be set according to actual needs, for example requiring that the ratio of the difference distribution to the annotation task target distribution lie within a preset range), the current distribution is determined to be aligned with the target distribution, the labeling task has succeeded, and it can be terminated at this point. Otherwise, if the difference distribution does not meet the preset first error requirement, data deletion or data compensation is performed according to the specific situation. For example, when the sample count of some category in the current distribution exceeds its count in the target distribution, samples of that category may be deleted, the amount deleted being determined by the actual situation. When the sample count of some category in the current distribution falls short of its count in the target distribution, a data compensation process is entered to top up the samples of that category in the current distribution.
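The difference distribution and the alignment check can be sketched as below. The proportional tolerance used in `is_aligned` is one possible form of the first error requirement, assumed here for illustration; the patent leaves the exact form open.

```python
from typing import Dict


def difference_distribution(current: Dict[str, int],
                            target: Dict[str, int]) -> Dict[str, int]:
    """Per-category current count minus target count; positive values
    mean surplus (candidates for deletion), negative values mean
    deficit (candidates for data compensation)."""
    return {c: current.get(c, 0) - target[c] for c in target}


def is_aligned(diff: Dict[str, int], target: Dict[str, int],
               tolerance: float = 0.05) -> bool:
    """Assumed first error requirement: each category's absolute
    difference stays within `tolerance` of its target count."""
    return all(abs(d) <= tolerance * target[c] for c, d in diff.items())
```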
In one embodiment, the data compensation is performed on the current distribution of the labeling task, and the data compensation comprises the following steps: acquiring difference distribution of current distribution of the labeling tasks and target distribution of the labeling tasks; acquiring a preset number of second text data to be annotated from a data source; performing machine labeling on the second text data to be labeled to obtain a compensation machine labeling sample set and counting compensation distribution; when the difference compensation distribution and the difference distribution do not meet the preset second error requirement, returning to obtain a preset number of second text data to be labeled from the data source for iterative difference compensation until the difference compensation distribution and the difference distribution meet the error requirement; and supplementing the compensation distribution meeting the preset second error requirement into the current distribution of the labeling task.
Specifically, after entering the data compensation process, the server first obtains another batch of text data from the data source as second text data to be labeled, i.e. the text data to be labeled acquired during the compensation process. The machine labeling model is then called to produce annotation description information for the second text data to be labeled; the resulting machine labeling sample set is called the compensation machine labeling sample set. Next, the number of samples of each category in the compensation machine labeling sample set is counted to obtain the corresponding compensation distribution. Finally, it is judged whether the compensation distribution and the difference distribution meet a preset second error requirement, for example whether they lie within an error range of a preset proportion. If so, the compensation has succeeded, and the data corresponding to the compensation distribution is added to the data corresponding to the current distribution of the labeling task to obtain the final labeled data. Otherwise, an iterative compensation process is entered: the method returns to the step of acquiring a preset number of second text data to be labeled from the data source and iterates until the compensation distribution and the difference distribution meet the error requirement. In addition, an iteration limit can be preset; when the number of iterations reaches the upper limit, the data compensation process can be stopped even if the error requirement is not yet met.
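The compensation loop above, including the preset iteration upper limit, can be sketched as follows. `fetch_batch` stands in for the data source adapter and `machine_label` for the trained labeling model; both names, and the strategy of keeping only samples whose category is still in deficit, are illustrative assumptions rather than the patent's exact procedure.

```python
from typing import Callable, Dict, List, Tuple


def iterative_compensation(deficit: Dict[str, int],
                           fetch_batch: Callable[[int], List[Dict]],
                           machine_label: Callable[[List[Dict]], List[Dict]],
                           batch_size: int = 100,
                           max_iters: int = 10) -> Tuple[List[Dict], Dict[str, int]]:
    """Fetch and machine-label batches until every category's deficit
    is covered or the preset iteration limit is reached."""
    collected: List[Dict] = []
    need = dict(deficit)  # category -> samples still needed
    for _ in range(max_iters):  # preset iteration upper limit
        if all(n <= 0 for n in need.values()):
            break  # compensation complete
        batch = machine_label(fetch_batch(batch_size))
        for sample in batch:
            cat = sample.get("label")
            if need.get(cat, 0) > 0:  # keep only categories in deficit
                collected.append(sample)
                need[cat] -= 1
    return collected, need
```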
According to the intelligent text data labeling method, text data to be labeled is automatically retrieved from a data source; after a machine labeling sample set and a manual labeling sample set are obtained through machine and manual labeling respectively, the distribution of the currently labeled data is determined from the two sample sets and compared with the target distribution of the labeling task to judge whether they are aligned; if not, data deletion and/or data compensation is performed until they are aligned, thereby obtaining the labeled data. The method can automatically access data sources, combines machine and manual labeling to improve labeling efficiency, and adjusts the data by dynamically perceiving and reconciling the difference between the labeled data distribution and the labeling task target distribution, enabling efficient labeling.
In one embodiment, before step S202, the method further includes: performing data processing on the first data to be labeled, the data processing including any one or more of data cleaning conversion, deduplication, and translation.
Specifically, the data processing in this embodiment mainly includes three operations: data cleaning conversion, deduplication, and translation. Cleaning conversion deletes or replaces characters in the text field content that match preset character rules to generate a new text. Every cleaning conversion sub-process maintains the position information in the label description information: when text characters are added or deleted, the position information of each annotated text segment in the original labeling information is adjusted so that the segment it points to keeps the same meaning. Deduplication deletes repeated samples from all text sample sets, using the text character string as the unique identifier of a text sample. Translation obtains a new text from a preset translation service provider by translating an existing text.
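As one way to keep annotation positions valid during cleaning conversion, character deletions can be tracked with an old-to-new index mapping and each labeled span shifted accordingly. This is a hedged sketch; the function name, the `(start, end)` span format, and the set of removed characters are assumptions, not the patent's interface.

```python
def clean_with_offsets(text, spans, remove_chars="\u200b\xa0"):
    """Delete characters matching a preset rule (here: a set of unwanted
    characters) and shift each (start, end) annotation span so that it
    still points at the same text segment in the cleaned text."""
    kept_chars = []
    mapping = {}                      # original index -> cleaned index
    for i, ch in enumerate(text):
        if ch in remove_chars:
            continue                  # deleted character: no mapping entry
        mapping[i] = len(kept_chars)
        kept_chars.append(ch)
    cleaned = "".join(kept_chars)
    new_spans = []
    for start, end in spans:
        kept = [mapping[i] for i in range(start, end) if i in mapping]
        if kept:                      # drop spans that were fully deleted
            new_spans.append((kept[0], kept[-1] + 1))
    return cleaned, new_spans
```

For example, removing a zero-width space from inside an annotated segment shrinks that span by one position while later spans shift left, so every span still denotes the same text.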
In this embodiment, cleaning conversion removes noise information and improves text readability; deduplication reduces repeated samples and avoids manual or machine processing of the same text, improving efficiency; and translation mainly provides a translated native-language text for a non-native-language text as reference information for manual labeling, improving labeling efficiency and quality.
It should be understood that, although the steps in the flowchart of fig. 2 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 2 may include multiple sub-steps or stages, which are not necessarily completed at the same time but may be performed at different times, and which are not necessarily executed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 3, there is provided an intelligent labeling device for text data, including:
the data retrieval module 301 is configured to obtain first text data to be labeled, the first text data to be labeled being retrieved from a data source corresponding to the labeling task information;
the data labeling module 302 is configured to perform machine labeling and manual labeling respectively on the first data to be labeled to obtain a machine labeling sample set and a manual labeling sample set;
the distribution statistics module 303 is configured to determine the current distribution of the labeling task according to the machine labeling sample set and the manual labeling sample set;
and the distribution alignment module 304 is configured to judge whether the current distribution of the labeling task is aligned with the target distribution of the labeling task, and if not, perform data deletion and/or data compensation on the current distribution of the labeling task until it is aligned, obtaining the labeling data.
In one embodiment, the distribution statistics module 303 is further configured to obtain a union of the machine labeling sample set and the manual labeling sample set to obtain a union sample set, and to count the number of samples of each category in the union sample set to obtain the current distribution of the labeling task.
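A minimal sketch of this statistic, assuming each sample is a dict with `text` (the unique identifier, as in the deduplication step) and `category` keys. That a manual label overrides a machine label for the same text is an assumption of this sketch; the patent does not specify the tie-break.

```python
from collections import Counter

def current_distribution(machine_set, manual_set):
    """Union the two sample sets by text string, then count the number of
    samples per category to get the labeling task's current distribution."""
    union = {}
    for sample in machine_set + manual_set:   # manual entries come last and win
        union[sample["text"]] = sample
    return Counter(s["category"] for s in union.values())
```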
In one embodiment, the distribution alignment module 304 is further configured to obtain the difference distribution of same-category samples between the current distribution of the labeling task and the target distribution of the labeling task; when the difference distribution meets a preset first error requirement, determine that the current distribution of the labeling task is aligned with the target distribution and take the current distribution as the labeling data; and when the difference distribution does not meet the preset first error requirement, perform data deletion and/or data compensation on the category samples corresponding to the difference quantities until alignment is reached, obtaining the labeling data.
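The per-category alignment check can be sketched as follows. This is a hedged illustration: the relative `tolerance` stands in for the patent's unspecified first error requirement, and the return shape is an assumption of this sketch.

```python
def alignment_gap(current, target, tolerance=0.05):
    """Compare the current distribution with the target distribution per
    category. Returns (aligned, to_delete, to_add): categories with a
    surplus beyond the tolerance need data deletion, categories with a
    shortfall need data compensation."""
    to_add, to_delete = {}, {}
    for cat, want in target.items():
        have = current.get(cat, 0)
        diff = want - have
        if abs(diff) <= tolerance * want:     # first error requirement met
            continue
        if diff > 0:
            to_add[cat] = diff                # shortfall: compensate
        else:
            to_delete[cat] = -diff            # surplus: delete
    aligned = not to_add and not to_delete
    return aligned, to_delete, to_add
```

With a 5% tolerance, a category 5 samples short of a 100-sample target is treated as aligned, while one 20 samples over is flagged for deletion.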
In one embodiment, the distribution alignment module 304 is further configured to obtain the difference distribution between the current distribution of the labeling task and the target distribution of the labeling task; acquire a preset number of second text data to be labeled from the data source; perform machine labeling on the second text data to be labeled to obtain a compensation machine labeling sample set and count its compensation distribution; when the compensation distribution and the difference distribution do not meet a preset second error requirement, return to acquiring a preset number of second text data to be labeled from the data source for iterative compensation until the compensation distribution and the difference distribution meet the error requirement; and supplement the compensation distribution meeting the preset second error requirement into the current distribution of the labeling task.
In one embodiment, the data labeling module 302 is further configured to input first data to be labeled into a trained labeling model, and label the first data to be labeled by the labeling model to obtain a machine labeling sample set; and visually displaying the first data to be labeled and/or the machine labeling sample set, receiving a manually input labeling operation instruction, and labeling the first data to be labeled and/or the machine labeling sample set according to the labeling operation instruction to obtain a manual labeling sample set.
In one embodiment, the intelligent labeling device for text data further includes a data processing module configured to perform data processing on the first data to be labeled, the data processing including any one or more of data cleaning conversion, deduplication, and translation.
For specific limitations of the intelligent text data labeling device, reference may be made to the above limitations of the intelligent text data labeling method, which are not repeated here. All or part of the modules in the intelligent text data labeling device may be implemented by software, by hardware, or by a combination thereof. The modules may be embedded in or independent of a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the modules. Based on this understanding, all or part of the flow of the methods of the above embodiments may also be implemented by a computer program instructing related hardware; the computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the above embodiments of the intelligent text data labeling method. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like.
In one embodiment, a computer device, which may be a server, is provided that includes a processor, a memory, and a network interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the intelligent text data labeling method. Illustratively, the computer program may be partitioned into one or more modules, which are stored in the memory and executed by the processor to implement the present invention. Each of the one or more modules may be a sequence of computer program instruction segments capable of performing a particular function, the instruction segments describing the execution of the computer program in the computer device.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the computer device, and various interfaces and lines connect the parts of the overall computer device.
The memory may be used to store the computer program and/or modules, and the processor implements the various functions of the computer device by running or executing the computer program and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function or an image playing function), and the like, and the data storage area may store data created according to use of the device (such as audio data or a phone book), and the like. In addition, the memory may include a high-speed random access memory and may also include a non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
Those skilled in the art will understand that the computer device structure shown in this embodiment is only a partial structure related to the solution of the present invention and does not limit the computer device to which the present invention is applied; a specific computer device may include more or fewer components, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory having a computer program stored therein and a processor that when executing the computer program performs the steps of:
acquiring first text data to be labeled, wherein the first text data to be labeled is retrieved from a data source corresponding to the labeling task information;
respectively carrying out machine labeling and manual labeling on first data to be labeled to obtain a machine labeling sample set and a manual labeling sample set;
determining the current distribution of the labeling tasks according to the machine labeling sample set and the manual labeling sample set;
and judging whether the current distribution of the labeling task is aligned with the target distribution of the labeling task, and if not, performing data deletion and/or data compensation on the current distribution of the labeling task until it is aligned to obtain labeling data.
In one embodiment, the processor, when executing the computer program, further performs the steps of: obtaining a union of the machine labeling sample set and the manual labeling sample set to obtain a union sample set; and counting the number of samples of each category in the union sample set to obtain the current distribution of the labeling task.
In one embodiment, the processor, when executing the computer program, further performs the steps of: obtaining the difference distribution of same-category samples by subtracting the numbers of same-category samples in the current distribution of the labeling task and the target distribution of the labeling task; when the difference distribution meets a preset first error requirement, determining that the current distribution of the labeling task is aligned with the target distribution and taking the current distribution as the labeling data; and when the difference distribution does not meet the preset first error requirement, performing data deletion and/or data compensation on the category samples corresponding to the difference quantities until alignment is reached to obtain the labeling data.
In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring the difference distribution between the current distribution of the labeling task and the target distribution of the labeling task; acquiring a preset number of second text data to be labeled from the data source; performing machine labeling on the second text data to be labeled to obtain a compensation machine labeling sample set and counting its compensation distribution; when the compensation distribution and the difference distribution do not meet a preset second error requirement, returning to acquiring a preset number of second text data to be labeled from the data source for iterative compensation until the compensation distribution and the difference distribution meet the error requirement; and supplementing the compensation distribution meeting the preset second error requirement into the current distribution of the labeling task.
In one embodiment, the processor, when executing the computer program, further performs the steps of: inputting first data to be labeled into a trained labeling model, and labeling the first data to be labeled by the labeling model to obtain a machine labeling sample set; and visually displaying the first data to be labeled and/or the machine labeling sample set, receiving a manually input labeling operation instruction, and labeling the first data to be labeled and/or the machine labeling sample set according to the labeling operation instruction to obtain a manual labeling sample set.
In one embodiment, the processor, when executing the computer program, further performs the steps of: performing data processing on the first data to be labeled, the data processing including any one or more of data cleaning conversion, deduplication, and translation.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring first text data to be labeled, wherein the first text data to be labeled is retrieved from a data source corresponding to the labeling task information;
respectively carrying out machine labeling and manual labeling on first data to be labeled to obtain a machine labeling sample set and a manual labeling sample set;
determining the current distribution of the labeling tasks according to the machine labeling sample set and the manual labeling sample set;
and judging whether the current distribution of the labeling task is aligned with the target distribution of the labeling task, and if not, performing data deletion and/or data compensation on the current distribution of the labeling task until it is aligned to obtain labeling data.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: obtaining a union of the machine labeling sample set and the manual labeling sample set to obtain a union sample set; and counting the number of samples of each category in the union sample set to obtain the current distribution of the labeling task.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: obtaining the difference distribution of same-category samples by subtracting the numbers of same-category samples in the current distribution of the labeling task and the target distribution of the labeling task; when the difference distribution meets a preset first error requirement, determining that the current distribution of the labeling task is aligned with the target distribution and taking the current distribution as the labeling data; and when the difference distribution does not meet the preset first error requirement, performing data deletion and/or data compensation on the category samples corresponding to the difference quantities until alignment is reached to obtain the labeling data.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: acquiring the difference distribution between the current distribution of the labeling task and the target distribution of the labeling task; acquiring a preset number of second text data to be labeled from the data source; performing machine labeling on the second text data to be labeled to obtain a compensation machine labeling sample set and counting its compensation distribution; when the compensation distribution and the difference distribution do not meet a preset second error requirement, returning to acquiring a preset number of second text data to be labeled from the data source for iterative compensation until the compensation distribution and the difference distribution meet the error requirement; and supplementing the compensation distribution meeting the preset second error requirement into the current distribution of the labeling task.
In one embodiment, the computer program when executed by the processor further performs the steps of: inputting first data to be labeled into a trained labeling model, and labeling the first data to be labeled by using a labeling model to obtain a machine labeling sample set; and visually displaying the first data to be labeled and/or the machine labeling sample set, receiving a manually input labeling operation instruction, and labeling the first data to be labeled and/or the machine labeling sample set according to the labeling operation instruction to obtain a manual labeling sample set.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: performing data processing on the first data to be labeled, the data processing including any one or more of data cleaning conversion, deduplication, and translation.
Those skilled in the art will understand that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing related hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to the memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, or the like. Volatile memory may include Random Access Memory (RAM) or an external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
For the sake of brevity, not all possible combinations of the technical features of the above embodiments are described; however, as long as there is no contradiction between them, any combination of these technical features should be considered within the scope of this disclosure.
The above embodiments express only several implementations of the present application, and their description is specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (9)

1. An intelligent text data labeling method is characterized by comprising the following steps:
acquiring first text data to be labeled, wherein the first text data to be labeled is retrieved from a data source corresponding to labeling task information;
respectively carrying out machine labeling and manual labeling on the first data to be labeled to obtain a machine labeling sample set and a manual labeling sample set;
determining the current distribution of the labeling tasks according to the machine labeling sample set and the manual labeling sample set;
and judging whether the current distribution of the labeling task is aligned with the target distribution of the labeling task, and if not, performing data deletion and/or data compensation on the current distribution of the labeling task until it is aligned to obtain labeling data.
2. The method of claim 1, wherein the determining the current distribution of the labeling task according to the machine labeling sample set and the manual labeling sample set comprises:
acquiring a union set of the machine labeling sample set and the manual labeling sample set to obtain a union set sample set;
and counting the number of samples of each category in the union sample set to obtain the current distribution of the labeling tasks.
3. The method according to claim 1, wherein the judging whether the current distribution of the labeling task is aligned with the target distribution of the labeling task, and if not, performing data deletion and/or data compensation on the current distribution of the labeling task until it is aligned to obtain labeling data comprises:
subtracting the numbers of same-category samples in the current distribution of the labeling task and the target distribution of the labeling task to obtain the difference distribution of each category of samples;
when the difference distribution meets a preset first error requirement, determining that the current distribution of the labeling tasks is aligned with the target distribution of the labeling tasks, and taking the current distribution of the labeling tasks as labeling data;
and when the difference distribution does not meet the preset first error requirement, performing data deletion and/or data compensation on the category samples corresponding to the difference quantities until alignment is reached to obtain labeling data.
4. The method according to claim 1 or 3, wherein performing data compensation on the current distribution of the labeling task comprises:
acquiring difference distribution between the current distribution of the labeling tasks and the target distribution of the labeling tasks;
acquiring a preset number of second text data to be labeled from the data source; performing machine labeling on the second text data to be labeled to obtain a compensation machine labeling sample set and counting compensation distribution;
when the compensation distribution and the difference distribution do not meet a preset second error requirement, returning to acquiring a preset number of second text data to be labeled from the data source for iterative compensation until the compensation distribution and the difference distribution meet the error requirement;
and supplementing the compensation distribution meeting the preset second error requirement into the current distribution of the labeling task.
5. The method of claim 1, wherein the performing machine labeling and manual labeling on the first data to be labeled respectively to obtain a machine labeling sample set and a manual labeling sample set comprises:
inputting the first data to be labeled into a trained labeling model, and labeling the first data to be labeled by the labeling model to obtain a machine labeling sample set;
and visually displaying the first data to be labeled and/or the machine labeling sample set, receiving a manually input labeling operation instruction, labeling the first data to be labeled and/or the machine labeling sample set according to the labeling operation instruction, and obtaining the manual labeling sample set.
6. The method of claim 1, wherein before performing machine labeling and manual labeling on the first data to be labeled respectively to obtain a machine-labeled sample set and a manual-labeled sample set, the method further comprises: and carrying out data processing on the first data to be annotated, wherein the data processing comprises any one or more of data cleaning conversion, deduplication and translation.
7. An intelligent labeling device for text data, comprising:
the data retrieval module is used for acquiring first text data to be labeled, the first text data to be labeled being retrieved from a data source corresponding to the labeling task information;
the data labeling module is used for respectively performing machine labeling and manual labeling on the first data to be labeled to obtain a machine labeling sample set and a manual labeling sample set;
the distribution statistical module is used for determining the current distribution of the labeling tasks according to the machine labeling sample set and the manual labeling sample set;
and the distribution alignment module is used for judging whether the current distribution of the labeling task is aligned with the target distribution of the labeling task, and if not, performing data deletion and/or data compensation until the current distribution of the labeling task is aligned to obtain labeling data.
8. A computer device comprising a processor and a memory, wherein the memory stores a computer program, and the processor is configured to implement the intelligent text data annotation method according to any one of claims 1 to 6 when executing the computer program.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the intelligent labeling method for text data according to any one of claims 1 to 6.
CN202211140589.6A 2022-09-20 2022-09-20 Intelligent text data labeling method and device, computer equipment and storage medium Pending CN115496036A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211140589.6A CN115496036A (en) 2022-09-20 2022-09-20 Intelligent text data labeling method and device, computer equipment and storage medium


Publications (1)

Publication Number Publication Date
CN115496036A (en) 2022-12-20

Family

ID=84469698


Similar Documents

Publication Publication Date Title
EP3540612A1 (en) Cluster processing method and device for questions in automatic question and answering system
CN107861954B (en) Information output method and device based on artificial intelligence
CN109977014B (en) Block chain-based code error identification method, device, equipment and storage medium
CN111144210B (en) Image structuring processing method and device, storage medium and electronic equipment
CN113626468B (en) SQL sentence generation method, device and equipment based on artificial intelligence and storage medium
CN114780701A (en) Automatic question-answer matching method, device, computer equipment and storage medium
CN117609475A (en) Question-answer reply method, system, terminal and storage medium based on large model
CN116628163A (en) Customer service processing method, customer service processing device, customer service processing equipment and storage medium
CN116860747A (en) Training sample generation method and device, electronic equipment and storage medium
CN117112727A (en) Large language model fine tuning instruction set construction method suitable for cloud computing service
CN114842982B (en) Knowledge expression method, device and system for medical information system
CN110956043A (en) Domain professional vocabulary word embedding vector training method, system and medium based on alias standardization
CN116186223A (en) Financial text processing method, device, equipment and storage medium
US20220198153A1 (en) Model training
CN116860921A (en) Dialog pre-labeling method, system, computer device and storage medium
CN115496036A (en) Intelligent text data labeling method and device, computer equipment and storage medium
CN115712722A (en) Clustering system, method, electronic device and storage medium for multi-language short message text
CN112732423B (en) Process migration method, device, equipment and medium
CN112131379A (en) Method, device, electronic equipment and storage medium for identifying problem category
CN110232328A (en) A kind of reference report analytic method, device and computer readable storage medium
CN113868419B (en) Text classification method, device, equipment and medium based on artificial intelligence
CN113688246B (en) Historical problem recall method and device based on artificial intelligence and related equipment
CN111144066B (en) Adjusting method, device and equipment for font of font library and storage medium
CN117874211A (en) Intelligent question-answering method, system, medium and electronic equipment based on SAAS software
CN117012321A (en) Object matching method, device, apparatus, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination