CN112749563A - Named entity identification data labeling quality evaluation and control method and system - Google Patents
- Publication number
- CN112749563A (application CN202110081491.7A)
- Authority
- CN
- China
- Prior art keywords
- labeling
- voting
- result
- error
- named entity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a named entity identification data labeling quality evaluation and control method and system, wherein the method comprises the following steps: dividing data into N sub-training sets; training N first named entity recognition models using the N sub-training sets respectively; predicting a given labeled corpus with the N first named entity recognition models to obtain N results; voting on each entity prediction result among the N results; and determining the error type according to the voting result and the annotator's labeling result, and prompting suspected errors. The invention can intelligently analyze real-time labeling results during annotation, guide annotators to label reliably, reduce as much as possible problems such as inconsistent labeling rules across an annotator's work and missed labels, increase the controllability of the labeling process, control labeling quality during annotation, feed back errors in real time, and reduce wasted annotator time.
Description
Technical Field
The invention relates to the technical field of Named Entity Recognition (NER), in particular to a method and a system for evaluating and controlling the labeling quality of named entity recognition data.
Background
In recent years, with the continuous development and application of basic natural language processing techniques and knowledge-graph techniques, the demand for text labeling has also increased substantially. For many supervised tasks, high quality data is the basis for good performance of the model.
As is well known, data annotation consumes substantial manpower and financial resources, and different annotators produce different results even under the same annotation rules. Poor data quality not only costs the business side time and money but also degrades model performance. For the labor-intensive work of data labeling, the industry currently relies on crowdsourced solutions that combine a labeling platform, model pre-labeling, and manual labeling to improve efficiency as much as possible, and applies uniform quality assessment indexes to the labeled data.
Manually annotated data produced with such typical solutions still contains many errors, and these errors inevitably degrade model performance.
Disclosure of Invention
In view of the above technical problems, the present invention provides a method and a system for evaluating and controlling labeling quality of named entity identification data.
The technical scheme for solving the technical problems is as follows:
according to a first aspect of the embodiments of the present invention, there is provided a method for evaluating and controlling labeling quality of named entity identification data, including:
dividing data into N sub-training sets;
training N first named entity recognition models by using N sub-training sets respectively;
predicting a given labeled corpus by using the N first named entity recognition models to obtain N results;
voting on each entity prediction result among the N results;
and determining the error type according to the voting result and the annotator's labeling result, and prompting suspected errors.
On the basis of the technical scheme, the invention can be further improved as follows.
Further, the dividing of the data into N sub-training sets specifically includes:
and obtaining the N sub-training sets by means of an N-fold cross split.
Further, determining the error type according to the voting result and the annotator's labeling result specifically comprises:
if the annotator's labeling result is empty and the model voting result is not empty, judging the error type as suspected label missing; and/or
if the annotator's labeled position information completely matches the voted position information but the labels do not completely match, judging the error type as a suspected category error; and/or
if neither of the above two cases applies and the annotator's labeling result differs from the model voting result, judging the error type as a suspected other error.
Further, still include:
after all annotators finish labeling, voting on the whole-corpus labeling results as a whole, and taking this first voting result over the whole corpus as the quasi-answer;
performing N-fold training on the whole corpus to obtain N second named entity recognition models;
predicting the whole corpus with the N second named entity recognition models, and voting to obtain a reference answer;
and generating a data quality index containing error sources by comparing the quasi-answer with the reference answer.
According to a second aspect of the embodiments of the present invention, there is provided a system for evaluating and controlling labeling quality of named entity identification data, including:
the dividing module is used for dividing the data into N parts of sub-training sets;
the first training module is used for training N first named entity recognition models by using N sub-training sets respectively;
the first prediction and voting module is used for predicting a given labeled corpus using the N first named entity recognition models to obtain N results, and voting on each entity prediction result among the N results;
and the error prompt module is used for determining the error type according to the voting result and the annotator's labeling result, and prompting suspected errors.
Further, the dividing module is specifically configured to obtain the N sub-training sets by means of an N-fold cross split.
Further, the error prompt module specifically includes:
the first judging unit is used for judging the error type as suspected label missing if the annotator's labeling result is empty and the model voting result is not empty;
the second judging unit is used for judging the error type as a suspected category error if the annotator's labeled position information completely matches the voted position information but the labels do not completely match;
and the third judging unit is used for judging the error type as a suspected other error if neither of the above two cases applies and the annotator's labeling result differs from the model voting result.
Further, still include:
the voting module is used for voting on the whole-corpus labeling results as a whole after all annotators finish labeling, and taking this first voting result over the whole corpus as the quasi-answer;
the second training module is used for performing N-fold training on the whole corpus to obtain N second named entity recognition models;
the second prediction and voting module is used for predicting the whole corpus with the N second named entity recognition models and voting to obtain a reference answer;
and the comparison module is used for generating a data quality index containing error sources by comparing the quasi-answer with the reference answer.
According to a third aspect of the embodiments of the present invention, there is provided a terminal device, including:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method as described above.
According to a fourth aspect of embodiments of the present invention, there is provided a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method as described above.
The technical scheme provided by the embodiment of the invention can have the following beneficial effects:
the real-time labeling result in the labeling process of the label maker can be intelligently analyzed, the label maker is guided to perform reliable labeling, the problems that the front and back labeling rules of the label maker are not uniform, labels are missed and the like are reduced as much as possible, the controllability of the labeling process is increased, the labeling quality of the label maker is controlled in the labeling process, instant error feedback is achieved, and the time waste of the label maker is reduced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
Fig. 1 is a flowchart of a method for evaluating and controlling labeling quality of named entity identification data according to an embodiment of the present invention;
fig. 2 is a flowchart of another method for evaluating and controlling the labeling quality of the named entity identification data according to an embodiment of the present invention;
FIG. 3 is a visual overview interface for annotation quality.
Detailed Description
Preferred embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that, although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
Fig. 1 is a flowchart of a method for evaluating and controlling annotation quality of named entity identification data according to an embodiment of the present invention, as shown in fig. 1, the method includes:
110. dividing data into N sub-training sets;
120. training N first named entity recognition models by using N sub-training sets respectively;
130. predicting a given labeled corpus by using the N first named entity recognition models to obtain N results;
140. voting on each entity prediction result among the N results;
150. and determining the error type according to the voting result and the annotator's labeling result, and prompting suspected errors.
Specifically, N sub-data sets (sub-training sets) can be obtained via an N-fold cross split, and N BERT-CRF models are trained using the N sub-training sets respectively. The given labeled corpus is then predicted with the N models, and each entity prediction result among the N results is voted on. Suspected-error prompts are issued according to the voting result and the annotator's labeling result, giving annotators prompts about suspected labeling errors and guiding them to become familiar with the labeling rules quickly and form good labeling habits.
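As an illustrative sketch (not the patent's actual implementation), the token-level majority vote over the N models' BIO predictions could look like the following; the function name and the tie-breaking behavior are assumptions:

```python
from collections import Counter

def vote_token_labels(model_predictions):
    # model_predictions: N BIO tag sequences for the same sentence,
    # one per model. For each token, take the tag predicted by the
    # most models. Ties fall back to Counter's first-encountered
    # order; a real system might instead prefer "O" on ties.
    n_tokens = len(model_predictions[0])
    voted = []
    for i in range(n_tokens):
        tags = Counter(pred[i] for pred in model_predictions)
        voted.append(tags.most_common(1)[0][0])
    return voted

# Three of five models agree the first token opens a LOC entity.
preds = [
    ["B-LOC", "I-LOC", "O"],
    ["B-LOC", "I-LOC", "O"],
    ["B-LOC", "O",     "O"],
    ["B-ORG", "I-ORG", "O"],
    ["O",     "O",     "O"],
]
print(vote_token_labels(preds))  # ['B-LOC', 'I-LOC', 'O']
```

The voted sequence then plays the role of the "model voting result" that is compared against the annotator's labels below.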
The labeling error types are classified into three types, "suspected label missing", "suspected category error", and "suspected other error", explained as follows:
Suspected label missing judgment standard: the annotator's labeling result is empty, but the model voting result is not empty. Taking the BIO labeling scheme as an example, the annotator's label is "O" while the model voting result is not "O".
Suspected category error: the annotator's labeled position information completely matches the voted position information, but the labels do not completely match. Taking the BIO scheme as an example, for a sentence containing "Shangri-La", the annotator's result may be "B-BRAND I-BRAND I-BRAND I-BRAND O O O" while the model voting result is "B-LOC I-LOC I-LOC I-LOC O O O"; the annotator should then be prompted that the segment "Shangri-La" represents a place ("LOC") rather than a brand ("BRAND").
Suspected other error: if neither of the above two cases applies and the annotator's labeling result differs from the model voting result, the annotator is prompted with a suspected other error.
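A minimal sketch of this three-way error typing, assuming token-aligned BIO sequences (the function and helper names are illustrative, not from the patent):

```python
def entity_spans(tags):
    # Extract (start, end, type) entity spans from a BIO tag
    # sequence; end is exclusive. A sentinel "O" closes any span
    # still open at the end of the sentence.
    spans, start, etype = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):
        if start is not None and (tag == "O" or tag.startswith("B-")):
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

def classify_error(annotator_tags, voted_tags):
    # Returns None on agreement, otherwise one of the three
    # suspected error types defined in the text above.
    if annotator_tags == voted_tags:
        return None
    ann, vote = entity_spans(annotator_tags), entity_spans(voted_tags)
    if not ann and vote:
        return "suspected label missing"
    # Positions (spans) match exactly but entity types differ.
    if {(s, e) for s, e, _ in ann} == {(s, e) for s, e, _ in vote}:
        return "suspected category error"
    return "suspected other error"
```

For the "Shangri-La" example, the BRAND and LOC spans cover the same tokens, so `classify_error` returns "suspected category error".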
The method for evaluating and controlling the labeling quality of named entity identification data provided by the embodiment of the invention can intelligently analyze real-time labeling results during annotation, guide annotators to label reliably, reduce as much as possible problems such as inconsistent labeling rules across an annotator's work and missed labels, increase the controllability of the labeling process, control labeling quality during annotation, feed back errors in real time, and reduce wasted annotator time.
Optionally, in this embodiment, the method further includes:
160. after all annotators finish labeling, voting on the whole-corpus labeling results as a whole, and taking this first voting result over the whole corpus as the quasi-answer;
170. performing N-fold training on the whole corpus to obtain N second named entity recognition models;
180. predicting the whole corpus with the N second named entity recognition models, and voting to obtain a reference answer;
190. and generating a data quality index containing error sources by comparing the quasi-answer with the reference answer.
Specifically, after labeling is preliminarily completed (at the individual or team level), N sub-data sets (sub-training sets) are obtained again via an N-fold cross split, and N models are trained from them. The whole data set is then predicted with the N models to obtain N prediction results, which are voted on to produce the reference answer. Comparing the reference answer with the actual labeling results yields a labeled-data quality evaluation index that includes error sources and their distribution.
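One way the quality index might be aggregated is sketched below. For simplicity the "position match" check here uses a token-level O/non-O mask rather than exact entity spans; that simplification, and all names, are assumptions rather than details fixed by the patent:

```python
from collections import Counter

def quality_report(quasi_answers, reference_answers):
    # quasi_answers / reference_answers: one BIO tag sequence per
    # sentence -- the annotators' voted labels and the models'
    # voted labels, respectively.
    def error_type(ann, ref):
        if ann == ref:
            return None
        if all(t == "O" for t in ann) and any(t != "O" for t in ref):
            return "suspected label missing"
        # Same tokens covered by entities, different entity types.
        if [t != "O" for t in ann] == [t != "O" for t in ref]:
            return "suspected category error"
        return "suspected other error"

    counts = Counter()
    for ann, ref in zip(quasi_answers, reference_answers):
        err = error_type(ann, ref)
        if err:
            counts[err] += 1
    total = len(quasi_answers)
    return {
        "sentences": total,
        "suspected_error_rate": sum(counts.values()) / total if total else 0.0,
        "error_distribution": dict(counts),
    }
```

The returned `error_distribution` is the kind of per-source breakdown an administrator overview (fig. 3) could display.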
In the embodiment of the invention, using error sources and their distribution as reference indexes of data quality allows administrators and annotators to quickly locate labeling problems and obtain corresponding assessment suggestions.
Generally speaking, the method mainly comprises real-time prompting of annotators' labeling errors, analysis of overall labeling quality in the administrator interface, and analysis of error sources. The overall process of the technical solution is shown in fig. 2, which mainly illustrates adding a labeling-quality analysis device to an existing named entity labeling platform.
The invention mainly focuses on immediate prompting of data labeling errors and on error source analysis. The technical solution mainly comprises the following design steps:
1. and in the annotation process of the annotator, starting training of the instant feedback device for annotation errors when the annotation sentences of the annotator reach K threshold values preset by an administrator in advance. The training step mainly comprises the steps of making N-fold cross validation data (the scale number of the corpus is K, N in the N-fold cross validation is a preset parameter and the default is 5), training N named entity recognition models, saving N models, loading N models, starting N model voting programs and generating model labeling results of unlabeled items by using the voting programs.
2. Immediate error feedback and modification: after the immediate error feedback device has been trained, each time the annotator finishes labeling a corpus item, the annotator's labeling result is compared with the model labeling result generated by the voting program, and the type of labeling error is prompted for any segment whose labels are inconsistent.
3. When the administrator issues a task, the progress of each annotator's labeling task is sent to the administrator in real time so that the administrator can query labeling progress. After an annotator has labeled the preset K sentences (i.e., has triggered the immediate labeling-error feedback device), the suspected errors and their categories for each labeled sentence are sent to the administrator for labeling quality control. If one annotator's suspected labeling errors are markedly more frequent than other annotators', the administrator can assist and correct that annotator's labeling standards during the labeling process.
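Step 3's check that one annotator's suspected-error rate is "markedly more frequent" than the others' is not formalized in the text; one possible sketch flags annotators by z-score, where the 1.5 cutoff and all names are purely assumptions:

```python
from statistics import mean, stdev

def flag_outlier_annotators(error_rates, z_cutoff=1.5):
    # error_rates: annotator id -> suspected-error rate over that
    # annotator's labeled sentences. Returns annotators whose rate
    # sits more than z_cutoff sample standard deviations above the
    # team mean.
    rates = list(error_rates.values())
    if len(rates) < 2:
        return []
    mu, sigma = mean(rates), stdev(rates)
    if sigma == 0:
        return []
    return [a for a, r in error_rates.items()
            if (r - mu) / sigma > z_cutoff]

rates = {"ann1": 0.02, "ann2": 0.03, "ann3": 0.02,
         "ann4": 0.025, "ann5": 0.30}
print(flag_outlier_annotators(rates))  # ['ann5']
```

A flagged annotator would be the one the administrator assists and corrects during the labeling process.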
4. Labeling quality overview: after all annotators finish labeling, the whole-corpus labeling results are voted on as a whole, and this first voting result over the whole corpus is taken as the quasi-answer. N models are then trained on the whole corpus, the whole corpus is predicted again with the N models, and a second vote yields the reference answer. According to the judgment standards in step 2, a data quality index containing error sources can be generated by comparing the quasi-answer with the reference answer, where the "quasi-answer" corresponds to the "annotator labeling result" in step 2 and the "reference answer" corresponds to the "model voting result" in step 2. The administrator background displays a visual overview interface as shown in fig. 3.
An embodiment of the invention provides a named entity identification data labeling quality evaluation and control system, comprising:
the dividing module is used for dividing the data into N parts of sub-training sets;
the first training module is used for training N first named entity recognition models by using N sub-training sets respectively;
the first prediction and voting module is used for predicting a given labeled corpus using the N first named entity recognition models to obtain N results, and voting on each entity prediction result among the N results;
and the error prompt module is used for determining the error type according to the voting result and the annotator's labeling result, and prompting suspected errors.
Optionally, in this embodiment, the dividing module is specifically configured to obtain the N sub-training sets by means of an N-fold cross split.
Optionally, in this embodiment, the error prompt module specifically includes:
the first judging unit is used for judging the error type as suspected label missing if the annotator's labeling result is empty and the model voting result is not empty;
the second judging unit is used for judging the error type as a suspected category error if the annotator's labeled position information completely matches the voted position information but the labels do not completely match;
and the third judging unit is used for judging the error type as a suspected other error if neither of the above two cases applies and the annotator's labeling result differs from the model voting result.
Optionally, in this embodiment, the system further includes:
the voting module is used for voting on the whole-corpus labeling results as a whole after all annotators finish labeling, and taking this first voting result over the whole corpus as the quasi-answer;
the second training module is used for performing N-fold training on the whole corpus to obtain N second named entity recognition models;
the second prediction and voting module is used for predicting the whole corpus with the N second named entity recognition models and voting to obtain a reference answer;
and the comparison module is used for generating a data quality index containing error sources by comparing the quasi-answer with the reference answer.
An embodiment of the present invention provides a terminal device, including:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method as described above.
Embodiments of the present invention provide a non-transitory machine-readable storage medium having stored thereon executable code, which, when executed by a processor of an electronic device, causes the processor to perform the method as described above.
The reader should understand that in the description of this specification, reference to the description of the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the modules and units in the above described system embodiment may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A named entity identification data annotation quality evaluation and control method is characterized by comprising the following steps:
dividing data into N sub-training sets;
training N first named entity recognition models by using N sub-training sets respectively;
predicting a given labeled corpus by using the N first named entity recognition models to obtain N results;
voting on each entity prediction result among the N results;
and determining the error type according to the voting result and the annotator's labeling result, and prompting suspected errors.
2. The method of claim 1, wherein the dividing the data into N sub-training sets specifically comprises:
and obtaining the N sub-training sets by means of an N-fold cross split.
3. The method of claim 1, wherein determining the error type according to the voting result and the annotator's labeling result comprises:
if the annotator's labeling result is empty and the model voting result is not empty, judging the error type as suspected label missing; and/or
if the annotator's labeled position information completely matches the voted position information but the labels do not completely match, judging the error type as a suspected category error; and/or
if neither of the above two cases applies and the annotator's labeling result differs from the model voting result, judging the error type as a suspected other error.
4. The method of any of claims 1 to 3, further comprising:
after all annotators finish labeling, voting on the whole-corpus labeling results as a whole, and taking this first voting result over the whole corpus as the quasi-answer;
performing N-fold training on the whole corpus to obtain N second named entity recognition models;
predicting the whole corpus with the N second named entity recognition models, and voting to obtain a reference answer;
and generating a data quality index containing error sources by comparing the quasi-answer with the reference answer.
5. A named entity identification data annotation quality assessment and control system is characterized by comprising:
the dividing module is used for dividing the data into N parts of sub-training sets;
the first training module is used for training N first named entity recognition models by using N sub-training sets respectively;
the first prediction and voting module is used for predicting the given labeled corpus by using the N first named entity recognition models to obtain N results and voting for each entity prediction result in the N results;
and the error prompt module is used for judging the error type according to the voting result and the labeling result of the labeling person and carrying out suspected error prompt.
6. The system according to claim 5, wherein the dividing module is configured to obtain the N sub-training sets by means of an N-fold cross split.
7. The system according to claim 5, wherein the error prompt module specifically includes:
the first judging unit, used for judging the error type as suspected label missing if the annotator's labeling result is empty and the model voting result is not empty;
the second judging unit, used for judging the error type as a suspected category error if the annotator's labeled position information completely matches the voted position information but the labels do not completely match;
and the third judging unit, used for judging the error type as a suspected other error if neither of the above two cases applies and the annotator's labeling result differs from the model voting result.
8. The system of any one of claims 5 to 7, further comprising:
the voting module is used for voting on the whole-corpus labeling results as a whole after all annotators finish labeling, and taking this first voting result over the whole corpus as the quasi-answer;
the second training module is used for performing N-fold training on the whole corpus to obtain N second named entity recognition models;
the second prediction and voting module is used for predicting the whole corpus with the N second named entity recognition models and voting to obtain a reference answer;
and the comparison module is used for generating a data quality index containing error sources by comparing the quasi-answer with the reference answer.
9. A terminal device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any of claims 1 to 4.
10. A non-transitory machine-readable storage medium having executable code stored thereon, wherein the executable code, when executed by a processor of an electronic device, causes the processor to perform the method of any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110081491.7A CN112749563A (en) | 2021-01-21 | 2021-01-21 | Named entity identification data labeling quality evaluation and control method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112749563A | 2021-05-04 |
Family
ID=75652764
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110081491.7A | Named entity identification data labeling quality evaluation and control method and system | 2021-01-21 | 2021-01-21 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112749563A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113255357A (en) * | 2021-06-24 | 2021-08-13 | 北京金山数字娱乐科技有限公司 | Data processing method, target recognition model training method, target recognition method and device |
CN113723104A (en) * | 2021-09-15 | 2021-11-30 | 云知声智能科技股份有限公司 | Method and device for entity extraction under noisy data |
CN114090757A (en) * | 2022-01-14 | 2022-02-25 | 阿里巴巴达摩院(杭州)科技有限公司 | Data processing method of dialogue system, electronic device and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112749563A (en) | Named entity identification data labeling quality evaluation and control method and system | |
CN102915321B (en) | System and method for handling data | |
CN113076133B (en) | Deep learning-based Java program internal annotation generation method and system | |
US9208139B2 (en) | System and method for identifying organizational elements in argumentative or persuasive discourse | |
Staron et al. | Machine learning to support code reviews in continuous integration | |
Pudlitz et al. | Extraction of system states from natural language requirements | |
Kortum et al. | Dissection of AI job advertisements: A text mining-based analysis of employee skills in the disciplines computer vision and natural language processing | |
CN110909147B (en) | Method and system for training sorting result selection model output standard question method | |
CN112528028A (en) | Investment and financing information mining method and device, electronic equipment and storage medium | |
US20060095841A1 (en) | Methods and apparatus for document management | |
Gonzalez et al. | Adaptive employee profile classification for resource planning tool | |
Zichler et al. | R2BC: Tool-Based Requirements Preparation for Delta Analyses by Conversion into Boilerplates. | |
Zitnik | Using sentiment analysis to improve business operations | |
Liu et al. | Hiring now: A skill-aware multi-attention model for job posting generation | |
Mukelabai et al. | FeatRacer: Locating Features Through Assisted Traceability | |
CN113435213A (en) | Method and device for returning answers aiming at user questions and knowledge base | |
Osada et al. | The role of domain knowledge representation in requirements elicitation | |
CN116484841B (en) | Information verification system and method based on automatic auditing | |
CN108959588A (en) | Text customer service robot intelligence learning method based on big data | |
US11763072B2 (en) | System and method for implementing a document quality analysis and review tool | |
Hotomski | Supporting requirements and acceptance tests alignment during software evolution | |
Rahman et al. | Recommendation of Move Method Refactoring to Optimize Modularization Using Conceptual Similarity | |
Oğuz et al. | Ontology-Based Semantic Analysis of Software Requirements: A Systematic Mapping | |
CN110955433B (en) | Automatic deployment script generation method and device | |
Čović | Usage of Fuzzy Logic and Machine Learning in Web Development |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | | |
SE01 | Entry into force of request for substantive examination | | |