CN113673232A

CN113673232A - Text labeling method, device, equipment and medium

Info

Publication number: CN113673232A
Application number: CN202110975379.8A
Authority: CN
Inventors: 甘丽婷; 徐介夫
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2021-08-24
Filing date: 2021-08-24
Publication date: 2021-11-19
Anticipated expiration: 2041-08-24
Also published as: CN113673232B

Abstract

The invention relates to the technical field of artificial intelligence and provides a text labeling method, device, equipment and medium. Machine pre-labeling is performed before a data annotator labels the text manually, and the current correction labeling result produced by the machine pre-labeling is displayed together with the current text to be labeled. The data annotator only needs to modify and supplement the existing labels, which improves labeling efficiency, lowers labeling cost, reduces labeling workload and repeated work, and improves user experience.

Description

Text labeling method, device, equipment and medium

Technical Field

The invention relates to the technical field of artificial intelligence, and provides a text labeling method, a text labeling device, text labeling equipment and a text labeling medium.

Background

NLP (Natural Language Processing) is a sub-field of Artificial Intelligence (AI). Natural language is a crystallization of human intelligence, and natural language processing is one of the most difficult problems in artificial intelligence. An NLP processing flow requires collecting a corpus and labeling it, for example with part-of-speech tagging, and data annotators carry out this text labeling with the help of various text labeling tools.

In the related art, manual labeling is usually performed by data annotators who read the text sentence by sentence and label the words to be labeled one after another, so the labeling process has low efficiency and high cost.

Disclosure of Invention

The invention provides a text labeling method, device, equipment and medium. The main aim is to perform preliminary labeling (pre-labeling) with a preset initial auxiliary labeling model, and then to perform a first correction according to a historical correction information set to obtain and display a current correction labeling result. Current correction information generated from the instructions of a data annotator can then be obtained and used to perform a second correction on the current correction labeling result, thereby completing the labeling of the current text to be labeled.

In order to achieve the above object, the present invention provides a text labeling method, including:

acquiring historical correction information, wherein the historical correction information comprises historical correction words and historical correction labeling information, the historical correction information comprises modification information of a historical auxiliary labeling result, and the historical auxiliary labeling result is obtained by labeling a historical text to be labeled through a preset initial auxiliary labeling model;

acquiring history related words of the history corrected words, and generating a history corrected information set according to the history corrected words, the history related words and history corrected labeling information, wherein the word meanings of the history related words are similar to or the same as the word meanings of the history corrected words;

acquiring a current auxiliary labeling result, and performing first correction on the current auxiliary labeling result according to the historical correction information set to obtain a current correction labeling result, wherein the current auxiliary labeling result is obtained by inputting a current text to be labeled into the preset initial auxiliary labeling model;

and displaying the current correction marking result, acquiring current correction information, and performing secondary correction on the current correction marking result to finish marking of the current text to be marked.

Optionally, the current corrected labeling result includes a current labeling word and current labeling information, and the obtaining mode of the current corrected information includes:

distributing the current correction marking result to a corresponding modification execution object according to the current marking information;

and acquiring object correction information of each modification execution object to generate current correction information.

Optionally, after the current revision labeling result is assigned to the corresponding revision execution object according to the labeling category, and before the object revision information of each revision execution object is acquired, the method further includes:

and if at least two modification execution objects perform the second correction on the current correction labeling result at the same time, displaying the current labeling state of each current labeling word, wherein the current labeling state comprises at least one of: labeled, unlabeled, and information on the modification execution object.

Optionally, the current correction information includes a current correction word and current correction annotation information, and the method further includes:

acquiring a current related word of the current correction word, and generating a current correction information set according to the current correction word, the current related word and current correction labeling information, wherein the labeling information of the current related word is the same as the current correction labeling information, and the word meaning of the current related word is similar to or the same as the word meaning of the current correction word;

generating a correction training set according to the historical correction information set and the current correction information set;

and training the preset initial auxiliary labeling model according to the correction training set.

Optionally, before the current correction information is obtained, the method further includes:

acquiring the number ratio of history correction words corresponding to each history correction marking information in the history correction information;

and if the quantity ratio is higher than a preset ratio threshold, taking the historical correction marking information as high-risk marking information, and prompting.

Optionally, the method further includes: inputting a second text to be labeled into the trained preset initial auxiliary labeling model to obtain a training auxiliary labeling result, wherein the training auxiliary labeling result comprises a labeling word and training labeling information;

comparing the training auxiliary labeling result with the current corrected labeling result after the second correction to obtain difference information, wherein the current corrected labeling result after the second correction comprises a label word and second correction labeling information, the difference word is the label word with the second correction labeling information different from the training labeling information, and the difference information comprises the difference word and second correction labeling information of the difference word;

acquiring quality inspection labeling information of the difference words from a third party;

and determining the marking qualified rate of the modification execution object according to the quality inspection marking information and the second modification marking information.

Optionally, if the quality inspection labeling information is different from the second correction labeling information, the method further includes at least one of the following:

performing third correction on the current corrected marking result after the second correction according to the quality inspection marking information;

and adding the current correction information, the quality inspection labeling information and the difference words to the historical correction information.

In addition, to achieve the above object, the present invention also provides a text labeling apparatus, including:

the history correction information acquisition module is used for acquiring history correction information, wherein the history correction information comprises history correction words and history correction marking information, the history correction information comprises the correction information of a history auxiliary marking result, and the history auxiliary marking result is obtained by marking a history text to be marked through a preset initial auxiliary marking model;

a history related word obtaining module, configured to obtain a history related word of the history corrected word, and generate a history corrected information set according to the history corrected word, the history related word, and history corrected labeling information, where the labeling information of the history related word is the same as the history corrected labeling information, and the word meaning of the history related word is similar to or the same as the word meaning of the history corrected word;

the first correction module is used for acquiring a current auxiliary labeling result, and performing first correction on the current auxiliary labeling result according to the historical correction information set to obtain a current correction labeling result, wherein the current auxiliary labeling result is obtained by inputting a current text to be labeled into the preset initial auxiliary labeling model;

and the second correction module is used for displaying the current correction marking result, acquiring current correction information, and performing second correction on the current correction marking result to finish marking of the current text to be marked.

Furthermore, to achieve the above object, the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and running on the processor, wherein the processor executes the computer program to implement the steps of the method according to any one of the above embodiments.

Furthermore, to achieve the above object, the present invention further provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the method according to any one of the above embodiments.

The invention provides a text labeling method, device, equipment and medium. History correction information and history related words are obtained, and a history correction information set is generated according to the history correction words, the history related words and the history correction labeling information. A current auxiliary labeling result is obtained and corrected for the first time according to the history correction information set to obtain a current correction labeling result. The current correction labeling result is displayed, current correction information is obtained, and the current correction labeling result is corrected for the second time, which completes the labeling of the current text to be labeled. Because machine pre-labeling is performed before manual labeling by the data annotator, and the machine pre-labeled current correction labeling result is displayed together with the current text to be labeled, the data annotator only needs to modify and supplement the existing labels. This improves labeling efficiency, lowers labeling cost, reduces labeling workload and repeated work, and improves user experience.

Drawings

Fig. 1 is a schematic flow chart of a text labeling method according to an embodiment of the present invention;

Fig. 2 is another schematic flow chart of a text labeling method according to an embodiment of the present invention;

Fig. 3 is another schematic flow chart of a text labeling method according to an embodiment of the present invention;

Fig. 4 is another schematic flow chart of a text labeling method according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a text labeling apparatus according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a computer device provided in an embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In one embodiment, a text labeling method is provided. As shown in Fig. 1, the method comprises the following steps:

S101: acquiring historical correction information.

Optionally, the historical correction information includes a historical correction word and historical correction labeling information of the historical correction word, the historical correction information includes modification information of a historical auxiliary labeling result, and the historical auxiliary labeling result is obtained by labeling a historical text to be labeled through a preset initial auxiliary labeling model.

In other words, the history text to be labeled is input into a preset initial auxiliary labeling model for labeling to obtain a history auxiliary labeling result, the history auxiliary labeling result is corrected manually or in other ways, a word modified manually or in other ways is used as a history correction word, and the final labeling information of the history correction word is used as history correction labeling information to form history correction information. The historical correction information can be obtained by modifying a historical auxiliary labeling result obtained by inputting one or more historical texts to be labeled into a preset initial auxiliary labeling model.

Optionally, the historical text to be annotated and the subsequently mentioned current text to be annotated may be one text or a plurality of texts.

In one embodiment, the preset initial auxiliary annotation model obtaining method includes:

acquiring a plurality of sample words and labeling information thereof which are manually labeled in advance to form a sample set;

and training a preset initial Bert model according to the sample set to obtain a preset initial auxiliary labeling model.

Optionally, the sample words and their labeling information may be collected and used for training with the TensorFlow framework to obtain a Bert model (the preset initial auxiliary labeling model). The sample words may come from data sources on the internet, such as news, or from other data sources. The sample words are labeled manually to obtain their labeling information, and a sample set is formed from the sample words and the labeling information. The sample set is divided into a training set and a verification set: the training set is used to train a preset initial Bert model, and the verification set is used to verify the labeling effect of the trained model. When the labeling accuracy of the trained preset initial Bert model reaches a preset labeling accuracy, training is complete and the preset initial auxiliary labeling model is obtained. The text is then preliminarily labeled based on the preset initial auxiliary labeling model, which reduces the workload of the data annotator and improves working efficiency. The text automatically labeled by the preset initial auxiliary labeling model is provided to the data annotator and other staff for further labeling. Although the accuracy of the preliminary labeling may not be high, at least part of the corpus is already labeled correctly, so the labeling workload of the data annotator and related staff is effectively reduced and labeling efficiency is improved.
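The split-train-verify procedure described above can be sketched as follows. This is a minimal illustration only: the `MajorityLabelModel` class is a hypothetical stand-in for the Bert model (it simply remembers the most frequent label per word), and the sample data, the 0.2 verification fraction and the 0.9 accuracy threshold are assumed values that do not come from the source.

```python
import random
from collections import Counter, defaultdict


def split_sample_set(samples, validation_fraction=0.2, seed=0):
    """Shuffle (word, label) samples and split them into training and verification sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


class MajorityLabelModel:
    """Hypothetical stand-in for the Bert model: most frequent label per word."""

    def __init__(self):
        self.table = {}

    def fit(self, training_set):
        counts = defaultdict(Counter)
        for word, label in training_set:
            counts[word][label] += 1
        self.table = {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def predict(self, word):
        return self.table.get(word)  # None means "not labeled"


def labeling_accuracy(model, validation_set):
    """Fraction of verification samples whose predicted label matches the manual label."""
    if not validation_set:
        return 0.0
    hits = sum(1 for word, label in validation_set if model.predict(word) == label)
    return hits / len(validation_set)


# Illustrative manually labeled sample set (assumed data).
samples = [("Alice", "person"), ("Paris", "place"), ("Acme", "company")] * 10
train_set, val_set = split_sample_set(samples)

model = MajorityLabelModel()
model.fit(train_set)
accuracy = labeling_accuracy(model, val_set)

PRESET_LABELING_ACCURACY = 0.9  # assumed threshold
training_complete = accuracy >= PRESET_LABELING_ACCURACY
```

In a real system the stand-in model would be replaced by a fine-tuned Bert model, and training would continue until the verification accuracy reaches the preset value.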

S102: acquiring history related words of the history correction words, and generating a history correction information set according to the history correction words, the history related words and the history correction labeling information.

Optionally, the word senses of the history related words and the history correction words are similar or identical. In other words, the history related words and the history correction words have the same labeling information and similar or identical word senses. Because the preset initial auxiliary labeling model labeled the history correction word incorrectly, it is likely to make the same labeling error on synonyms or near-synonyms of that word. The history related words of the history correction word are therefore acquired as a supplement, so that the labeling errors of the preset initial auxiliary labeling model are covered more completely.

The labeling information of the history related words is the same as the history correction labeling information. For example, two near-synonymous words that both translate as "defeat" have similar meanings; if one of them, as used in "defeat Brazil", has been labeled as a verb, then the other, as used in "defeat Brazil", is also labeled as a verb.

In one embodiment, obtaining the history related words of the history corrected words comprises:

and inputting the historical correction words into a preset text similarity model to obtain a plurality of historical related words which are the same as or similar to the historical correction words.

Optionally, the preset text similarity model includes, but is not limited to, a preset synonym and/or a near synonym dictionary.

The acquisition of the history related words can be realized by a method known by those skilled in the art.
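As a minimal sketch of step S102, the history correction information set can be represented as a mapping from words to labeling information, expanded through a near-synonym dictionary. The `SYNONYMS` dictionary, the function name `build_correction_set` and the English example words are all hypothetical; the source only states that a preset synonym and/or near-synonym dictionary may be used.

```python
# Hypothetical near-synonym dictionary standing in for the preset text similarity model.
SYNONYMS = {"beat": ["defeat", "trounce"]}


def build_correction_set(corrections, synonyms):
    """Map each corrected word and its related words to the corrected labeling information.

    corrections: {history correction word: history correction labeling information}
    synonyms:    {word: [words with the same or a similar meaning]}
    """
    correction_set = {}
    for word, label in corrections.items():
        correction_set[word] = label
        for related in synonyms.get(word, []):
            # A related word inherits the label of the corrected word;
            # an explicit correction for the related word takes precedence.
            correction_set.setdefault(related, label)
    return correction_set


history_corrections = {"beat": "verb"}
history_correction_set = build_correction_set(history_corrections, SYNONYMS)
```

Here `history_correction_set` contains "beat", "defeat" and "trounce", all mapped to "verb". Letting explicit corrections override inherited labels is a design assumption, not something the source specifies.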

S103: acquiring a current auxiliary labeling result, and performing a first correction on the current auxiliary labeling result according to the historical correction information set to obtain a current correction labeling result.

Optionally, the current auxiliary labeling result is obtained by inputting the current text to be labeled into a preset initial auxiliary labeling model, and the current auxiliary labeling result includes a plurality of current labeling words and current labeling information of the labeling words.

When the data annotator needs to label the current text to be labeled, the preset initial auxiliary labeling model first labels the text roughly to obtain the current auxiliary labeling result. The history correction information set contains the history correction words, history related words and history correction labeling information for which the preset initial auxiliary labeling model is known to label inaccurately. The current auxiliary labeling result can therefore be corrected for the first time using the history correction information set, which resolves known labeling errors that may exist in the current auxiliary labeling result. This prevents errors that the data annotator has already corrected from reappearing in subsequent work and degrading the experience, and so improves the data annotator's satisfaction with the method. For example, a data annotator labels a batch of texts, some of which are similar or related, but the initial auxiliary labeling model outputs the same wrong result each time, so the data annotator has to correct the same error each time.
In order to avoid repeated work as far as possible, previously labeled content can be used as a reference: the labeling actions and data (the historical correction information) of one or more data annotators in the batch are recorded, and the current auxiliary labeling result is corrected for the first time using the historical correction information set. Labeling errors that occurred before are then not repeated when the current text to be labeled is labeled, and the previously corrected errors no longer appear in the current correction labeling result. A more accurate current text to be labeled, including the current correction labeling result, is presented to the data annotators, which further improves their labeling efficiency and satisfaction.
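The first correction of step S103 can be sketched as a lookup that overrides the model's label whenever a word appears in the history correction information set. The data layout (a list of word and label pairs, with `None` for unlabeled words) and the context-free, word-level lookup are simplifying assumptions made for illustration.

```python
def first_correction(auxiliary_result, correction_set):
    """Override model labels with known corrections.

    auxiliary_result: list of (word, label or None) produced by the auxiliary model
    correction_set:   {word: corrected label} from the history correction information set
    """
    return [(word, correction_set.get(word, label)) for word, label in auxiliary_result]


# Illustrative data (assumed): the model mislabels "defeat" as a noun.
current_auxiliary_result = [("Brazil", "place"), ("defeat", "noun"), ("won", None)]
known_corrections = {"defeat": "verb"}

current_corrected_result = first_correction(current_auxiliary_result, known_corrections)
```

Words that do not appear in the correction set keep the label assigned by the model, including an empty label for words the model did not label.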

S104: displaying the current correction labeling result, acquiring the current correction information, and performing a second correction on the current correction labeling result to complete the labeling of the current text to be labeled.

The current correction labeling result includes the labeling information of each word in the current text to be labeled, and the labeling information of a word that has not been labeled is empty. The current correction labeling result and the current text to be labeled are displayed to the data annotator, who evaluates the current correction labeling result manually or in another way. The evaluation includes, but is not limited to, adding labels to unlabeled words and/or assigning new labeling information to words that were labeled incorrectly. The current correction information is generated from these modifications, and the current correction labeling result is corrected for the second time according to the current correction information, which completes the labeling of the current text to be labeled.

As can be seen from the above, the current text to be labeled is preliminarily labeled by the preset initial auxiliary labeling model to obtain the current auxiliary labeling result, and the current auxiliary labeling result is then corrected for the first time according to the historical correction information set to obtain the current correction labeling result. The data annotator is thus presented with a version of the current text to be labeled that already contains some correct labeling information, and only needs to correct the wrongly labeled parts and add labels to the words that were missed. This effectively reduces meaningless repetitive work and the workload of the data annotator, improves working efficiency, lowers labeling cost, and improves user experience.
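The second correction of step S104 can be sketched in the same style. The representation of the annotator's edits as a mapping from word position to new label, and the function name `second_correction`, are assumptions for illustration; the source does not prescribe a data format.

```python
def second_correction(corrected_result, annotator_edits):
    """Apply the annotator's edits and record which words were actually changed.

    corrected_result: list of (word, label or None) after the first correction
    annotator_edits:  {position in the text: new label} entered by the data annotator
    Returns the final labeling result and the current correction information,
    i.e. the (word, label) pairs that were added or changed.
    """
    final_result = list(corrected_result)
    current_correction_info = []
    for index, new_label in sorted(annotator_edits.items()):
        word, old_label = final_result[index]
        if new_label != old_label:
            final_result[index] = (word, new_label)
            current_correction_info.append((word, new_label))
    return final_result, current_correction_info


# Illustrative data (assumed): the annotator labels the previously unlabeled "won"
# and confirms the existing label of "Brazil".
shown_result = [("Brazil", "place"), ("defeat", "verb"), ("won", None)]
edits = {2: "verb", 0: "place"}

final_result, current_correction_info = second_correction(shown_result, edits)
```

Only the newly labeled word "won" ends up in `current_correction_info`, because the edit to "Brazil" leaves its label unchanged.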

Optionally, the current text to be labeled and the historical text to be labeled include, but are not limited to, medical cases, medical news, and the like.

In an embodiment, the current modified annotation result includes a current annotation word and current annotation information, and referring to fig. 2, the obtaining method of the current modification information includes:

S201: distributing the current correction labeling result to the corresponding modification execution object according to the current labeling information;

S203: acquiring object correction information of each modification execution object, and generating current correction information.

Optionally, the modification execution object includes one or more data annotators.

Optionally, the current labeling result may be pre-allocated to a plurality of groups according to the labeling information, and then the corresponding group is correspondingly allocated to each modification execution object according to the labeling information corresponding to the modification execution object.

Optionally, the current tagging information may be part-of-speech tags, such as nouns, verbs, adjectives, and the like; the current label information can also be entity labels, such as name of person, place, name of product, organization, company, etc.; the current tagging information may also include both part-of-speech tags and entity tags, etc. The current labeling information may also be set according to a rule required by a person skilled in the art, and is not limited herein.

The current correction labeling result, which includes different current labeling information, is then distributed to the corresponding modification execution objects according to a mapping relationship between labeling information and modification execution objects. For example, the current labeling information includes entity labels, and the mapping relationship assigns person names to modification execution object 1, companies to modification execution object 2, places to modification execution object 3, and organizations to modification execution object 2. The current text to be labeled is displayed to modification execution objects 1 to 3; the part of the current correction labeling result whose labeling information is a person name is displayed to modification execution object 1, the parts labeled as company or organization to modification execution object 2, and the part labeled as place to modification execution object 3. That is, each modification execution object sees, on the current text to be labeled, the current labeling information assigned to it by the mapping relationship. Each data annotator can thus concentrate on one or more specified categories of labeling information, which makes the work simpler and, as the annotator becomes practiced, faster and more accurate.

If the modification execution object adds the label information of some words or changes the label information of the words already labeled currently, the modified or added label words and the label information of the label words are used for generating the object correction information of the modification execution object, and the object correction information of each modification execution object is integrated to generate the current correction information.
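Steps S201 and S203 can be sketched as a grouping by labeling category followed by a merge of the per-object edits. The category names, the object identifiers 1 to 3, and the choice to leave unlabeled words out of the assignment are illustrative assumptions; the source does not say how unlabeled words are assigned.

```python
from collections import defaultdict

# Mapping relationship between labeling category and modification execution object,
# modeled on the example in the text (category names are illustrative).
CATEGORY_TO_OBJECT = {"person": 1, "company": 2, "place": 3, "organization": 2}


def distribute(corrected_result, mapping):
    """Group (position, word, label) triples by the responsible execution object."""
    assignments = defaultdict(list)
    for index, (word, label) in enumerate(corrected_result):
        if label in mapping:  # assumption: unlabeled or unmapped words are not assigned
            assignments[mapping[label]].append((index, word, label))
    return dict(assignments)


def merge_object_corrections(object_corrections):
    """Integrate the object correction information of every execution object.

    object_corrections: {object id: {position: new label}}
    """
    merged = {}
    for edits in object_corrections.values():
        merged.update(edits)
    return merged


result = [("Li Lei", "person"), ("Acme", "company"), ("Shenzhen", "place"),
          ("WHO", "organization"), ("runs", None)]
assignments = distribute(result, CATEGORY_TO_OBJECT)
current_edits = merge_object_corrections({1: {0: "person"}, 2: {3: "company"}})
```

In this example object 2 receives both the company and the organization entries, matching the mapping relationship, and the merged edits form the current correction information.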

In an embodiment, with continued reference to fig. 2, after assigning the current revision annotation result to the corresponding revision execution object according to the annotation category, before acquiring the object revision information of each revision execution object, the method further includes:

S202: if at least two modification execution objects perform the second correction on the current correction labeling result at the same time, displaying the current labeling state of each current labeling word.

The current labeling state includes at least one of: labeled, unlabeled, and information on the modification execution object. The current labeling state may be displayed by, but is not limited to, color distinction, font distinction, or any other manner that a person skilled in the art may need.

By the method, a plurality of modification execution objects (data annotators) can simultaneously execute annotation operation on one current text to be annotated, multi-person cooperative work can be realized, and the working efficiency is further improved. Meanwhile, the method can also realize grouping to label a current text to be labeled, and improve the labeling accuracy.
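A minimal sketch of the labeling state shown during collaborative work is given below. The record layout with the keys `word`, `status` and `editor`, and the `editors` mapping from word position to execution object, are hypothetical; how the state is rendered (color, font) is left to the display layer.

```python
def labeling_states(corrected_result, editors):
    """Build the current labeling state of every word.

    corrected_result: list of (word, label or None)
    editors:          {position: id of the execution object working on that word}
    """
    states = []
    for index, (word, label) in enumerate(corrected_result):
        states.append({
            "word": word,
            "status": "labeled" if label is not None else "unlabeled",
            "editor": editors.get(index),  # None when nobody has claimed the word
        })
    return states


shared_result = [("Li Lei", "person"), ("runs", None)]
states = labeling_states(shared_result, {1: 2})
```

Each execution object can then see which words are already labeled and which colleague is handling a given word.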

In one embodiment, the current correction information includes a current correction word and current correction annotation information, and referring to fig. 3, the method further includes:

S301: acquiring a current related word of the current correction word, and generating a current correction information set according to the current correction word, the current related word and the current correction labeling information;

S302: generating a correction training set according to the historical correction information set and the current correction information set;

S303: training the preset initial auxiliary labeling model according to the correction training set.

Optionally, the tagging information of the current related word is the same as the current correction tagging information, and the word senses of the current related word are similar to or the same as the word senses of the current correction word.

The manner of determining the current related word according to the current corrected word is similar to the manner of determining the history related word according to the history corrected word, and is not described herein again. Similarly, the generation mode of the current correction information set is similar to that of the historical correction information set, and is not described herein again.

The method for training the preset initial auxiliary labeling model according to the modified training set is similar to the method for training the preset initial Bert model according to the sample set to obtain the preset initial auxiliary labeling model, and is not repeated here.

Through training the preset initial auxiliary labeling model according to the correction training set, the preset initial auxiliary labeling model can be continuously perfected according to the current requirements, the accuracy of the initial labeling executed by the model is further improved, the workload of a data annotator is reduced, and the satisfaction degree of the data annotator on auxiliary labeling is improved.
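Step S302 can be sketched as a merge of the two correction information sets into a list of training samples. Representing both sets as word-to-label mappings and letting the current corrections override older ones for the same word are assumptions; the source does not say how conflicts are resolved.

```python
def build_correction_training_set(history_set, current_set):
    """Combine both correction information sets into (word, label) training samples."""
    merged = dict(history_set)
    merged.update(current_set)  # assumption: the newer correction wins on conflict
    return sorted(merged.items())


history_set = {"beat": "verb", "defeat": "verb"}
current_set = {"won": "verb", "defeat": "noun"}
correction_training_set = build_correction_training_set(history_set, current_set)
```

The resulting samples can then be fed to the same training procedure that produced the preset initial auxiliary labeling model.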

In one embodiment, before obtaining the current correction information, the method further comprises:

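acquiring the number ratio of history correction words corresponding to each piece of history correction labeling information in the history correction information;

and if the number ratio is higher than a preset ratio threshold, taking the history correction labeling information as high-risk labeling information and issuing a prompt.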

The preset ratio threshold can be set by those skilled in the art as needed.

In other words, the proportion of manual modifications attributable to each type of labeling information can be counted in advance. If a certain type of labeling information is manually modified more often, that is, its number ratio is higher, the preset initial auxiliary labeling model labels that type poorly, and the corresponding data annotator can be reminded with a prompt. For example, the history correction information includes 300 history correction words, of which 300 are labeled as person names, which indicates that the preset initial auxiliary labeling model labels person names poorly. In this case, when the current correction labeling result is displayed, the corresponding data annotator is prompted that the labeling of this type is unreliable and needs attention, so that the data annotator pays closer attention to the words labeled as person names and to the words not yet labeled in the current text to be labeled, which reduces the possibility of labeling errors.

Optionally, the prompting mode may be implemented by using a rolling caption, a voice broadcast, or other modes required by those skilled in the art, which is not limited herein.
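The high-risk check can be sketched as a per-label share of the history correction words compared against the preset ratio threshold. The sample counts and the 0.5 threshold below are assumed values chosen for illustration.

```python
from collections import Counter


def high_risk_labels(history_corrections, ratio_threshold):
    """Return {label: number ratio} for labels whose share exceeds the threshold.

    history_corrections: list of (history correction word, history correction label)
    """
    counts = Counter(label for _, label in history_corrections)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items() if n / total > ratio_threshold}


# Illustrative history correction information (assumed data): 6 person names,
# 3 places and 1 company among 10 corrected words.
history = ([(f"name{i}", "person") for i in range(6)]
           + [(f"city{i}", "place") for i in range(3)]
           + [("Acme", "company")])

risky = high_risk_labels(history, ratio_threshold=0.5)
for label, ratio in risky.items():
    print(f"High-risk labeling information: {label} ({ratio:.0%} of corrections)")
```

With these values only the person-name label exceeds the threshold, so only that label is prompted to the data annotator.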

In an embodiment, the method further includes determining the labeling pass rate of the modification execution object. Specifically, referring to Fig. 4, the method further includes:

S401: inputting a second text to be labeled into the trained preset initial auxiliary labeling model to obtain a training auxiliary labeling result;

S402: comparing the training auxiliary labeling result with the current correction labeling result after the second correction to obtain difference information;

S403: acquiring quality inspection labeling information of the difference words from a third party;

S404: determining the labeling pass rate of the modification execution object according to the quality inspection labeling information and the second correction labeling information.

Optionally, the training auxiliary labeling result includes labeled words and training labeling information, and the current correction labeling result after the second correction includes the labeled words and second correction labeling information. A difference word is a labeled word whose second correction labeling information differs from its training labeling information, and the difference information includes the difference words and the second correction labeling information of the difference words.

In other words, two results are obtained for the second text to be labeled: the result produced by the trained preset initial auxiliary labeling model (the training auxiliary labeling result) and the final labeling result after the second correction (the current correction labeling result). The two results are compared, and the labeling information of a given word may be inconsistent between them. For example, a word M may carry labeling information S in the training auxiliary labeling result but carry no labeling information, or labeling information X rather than S, in the current correction labeling result. In that case the word M is a difference word, and the difference information includes the difference word M and the labeling information X. Similarly, a word N may have no labeling information (null) or labeling information O in the training auxiliary labeling result, but labeling information P rather than O in the current correction labeling result. In that case the word N is a difference word, and the difference information includes the difference word N and the labeling information P.
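The comparison described above can be sketched as follows. This is only an illustration under assumptions: each labeling result is represented as a mapping from word to labeling information, a missing entry or `None` means the word is unlabeled, and the name `find_difference_info` is hypothetical rather than part of the method.

```python
from typing import Dict, Optional

# word -> labeling information (None means the word is unlabeled); assumed layout
Labels = Dict[str, Optional[str]]


def find_difference_info(training_result: Labels, corrected_result: Labels) -> Labels:
    """Return each difference word mapped to its second correction labeling information."""
    difference_info: Labels = {}
    # Consider every word that appears in either result.
    for word in training_result.keys() | corrected_result.keys():
        training_label = training_result.get(word)
        corrected_label = corrected_result.get(word)
        if training_label != corrected_label:
            difference_info[word] = corrected_label
    return difference_info


# Word M: S in the model output, X after the second correction.
# Word N: unlabeled in the model output, P after the second correction.
training = {"M": "S", "N": None, "K": "A"}
corrected = {"M": "X", "N": "P", "K": "A"}
difference_info = find_difference_info(training, corrected)
```

Here `difference_info` contains M and N with their second correction labeling information, while K is omitted because both results agree on it.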

Optionally, the third party may be a quality inspector, or may be a plurality of pre-trained single-item labeling models, each dedicated to one piece of labeling information. When the third party is a quality inspector, the quality inspector evaluates each difference word and judges whether the second correction labeling information corresponding to the difference word is correct. If not, the quality inspector provides the correct labeling information as the quality inspection labeling information; if so, the second correction labeling information is used as the quality inspection labeling information. There may be one or more quality inspectors; when there are several, their evaluations are combined to obtain the final quality inspection result. When the third party is a single-item labeling model, because a single-item labeling model is often more accurate than a model that labels all items at once, each piece of labeling information can be labeled separately by its single-item labeling model to obtain the quality inspection labeling information, against which the second correction labeling information is evaluated. A single-item labeling model can be trained on the data set of the corresponding labeling information that was used to train the preset initial auxiliary labeling model, or in other ways known to those skilled in the art.

If the quality inspection labeling information is the same as the second correction labeling information, the labeling work of the modification execution object is qualified, otherwise, the labeling of the modification execution object is unqualified.

One way of determining the labeling qualification rate is as follows:

labeling qualification rate = (the number of difference words whose quality inspection labeling information is the same as the second correction labeling information / the total number of difference words) × 100%.
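The calculation can be illustrated with a short sketch. The formula as translated is ambiguous about which count forms the numerator; this sketch assumes the rate is the share of difference words whose second correction labeling information matches the quality inspection labeling information, in line with the rule above that matching labels count as qualified. The function name and the dictionary layout are hypothetical.

```python
from typing import Dict, Optional


def labeling_qualification_rate(
    second_correction: Dict[str, Optional[str]],
    quality_inspection: Dict[str, Optional[str]],
) -> float:
    """Percentage of difference words whose second correction label passed inspection."""
    if not second_correction:
        # Assumption: with no difference words, nothing failed inspection.
        return 100.0
    passed = sum(
        1
        for word, label in second_correction.items()
        if quality_inspection.get(word) == label
    )
    return passed / len(second_correction) * 100.0


# Four difference words; the inspector disagrees with the annotator on word R only.
second = {"M": "X", "N": "P", "Q": "A", "R": "B"}
inspected = {"M": "X", "N": "P", "Q": "A", "R": "C"}
rate = labeling_qualification_rate(second, inspected)
```

With three of the four difference words confirmed by quality inspection, `rate` is 75.0.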

Optionally, the annotation reliability of the modification execution object may be evaluated according to the annotation qualification rate, and performance evaluation may also be performed on the modification execution object. Through the determination of the marking qualification rate of each modification execution object, the data marking personnel can be promoted to work more cautiously and seriously, and the accuracy, the rigor and the qualification rate are improved.

Optionally, if the modification execution object needs to label at least two pieces of labeling information, a per-class labeling qualification rate of the modification execution object can be determined for each piece of labeling information, and the work content of the modification execution object can be adjusted according to the per-class labeling qualification rates. For example, if the per-class labeling qualification rate of a certain modification execution object is as high as 100% for labeling information A but only 50% for labeling information B, it may be suggested that the modification execution object label only labeling information A.
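A per-class rate can be sketched as below. The source does not say which label defines a word's class; this sketch assumes each difference word is grouped by its quality inspection labeling information, and the function name and record layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def per_class_qualification_rate(
    records: Iterable[Tuple[str, str]],
) -> Dict[str, float]:
    """records: (second correction label, quality inspection label) per difference word."""
    totals: Dict[str, int] = defaultdict(int)
    passed: Dict[str, int] = defaultdict(int)
    for corrected_label, inspected_label in records:
        # Group by the inspected label (assumed to be the reference class).
        totals[inspected_label] += 1
        if corrected_label == inspected_label:
            passed[inspected_label] += 1
    return {label: passed[label] / totals[label] * 100.0 for label in totals}


# Two words of class A, both labeled correctly; two of class B, one labeled as A.
records = [("A", "A"), ("A", "A"), ("B", "B"), ("A", "B")]
rates = per_class_qualification_rate(records)
```

This reproduces the situation in the example: 100% for labeling information A and 50% for labeling information B.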

In one embodiment, if the quality inspection labeling information is different from the second correction labeling information, the method further comprises at least one of:

performing a third correction on the current corrected labeling result after the second correction according to the quality inspection labeling information;

adding the current correction information, the quality inspection labeling information and the difference words to the historical correction information.

If the quality inspection labeling information is different from the second correction labeling information, an error still exists in the current correction labeling result after the second correction. The error needs to be corrected in time, that is, corrected for the third time, to ensure that the labeling of the current text to be labeled is accurate.

Errors found in the quality inspection process, as well as the errors and omissions of the preset initial auxiliary labeling model and of the historical correction information that have been found so far, also need to be added to the historical correction information in time, so that the same or similar errors are avoided and the user experience is improved.

In some embodiments, the method further includes retraining the preset initial auxiliary labeling model according to the historical correction information including the current correction information, the quality inspection labeling information, and the difference words, so as to improve the labeling accuracy of the model.

In some embodiments, the same current text to be labeled can be labeled through multiple groups of modification execution objects, so as to obtain labeling results of the groups, compare the labeling results to find difference information, and deliver the difference information to a third party for quality inspection.
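The multi-group comparison can be sketched as follows, assuming each group's labeling result is a mapping from word to labeling information and that a word missing from a group's result counts as unlabeled. The function name is hypothetical.

```python
from typing import Dict, List, Optional, Set

# word -> labeling information (None means unlabeled); assumed layout
Labels = Dict[str, Optional[str]]


def find_group_disagreements(group_results: List[Labels]) -> Dict[str, Set[Optional[str]]]:
    """Return the words on which the groups' labels differ, with the labels seen."""
    all_words = set().union(*group_results)
    disagreements: Dict[str, Set[Optional[str]]] = {}
    for word in all_words:
        labels = {result.get(word) for result in group_results}
        if len(labels) > 1:
            disagreements[word] = labels
    return disagreements


group_a = {"Lisan": "person", "XX securities": "company"}
group_b = {"Lisan": "person", "XX securities": "region"}
group_c = {"Lisan": "person"}
to_inspect = find_group_disagreements([group_a, group_b, group_c])
```

Only "XX securities" is returned here, since all three groups agree on "Lisan"; the returned words are the ones that would be handed to the third party for quality inspection.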

By performing quality inspection on the labeling result of the modified execution object, accurate and rigorous labeling work can be promoted.

The text labeling method is exemplarily described below by a specific embodiment, and the specific text labeling method includes:

Step one: acquiring a preset initial auxiliary labeling model, and pre-labeling the historical text to be labeled to obtain a historical auxiliary labeling result.

The preset initial auxiliary labeling model can be obtained as follows: sample words are collected and labeled to form a sample set, and a preset initial model is trained on the sample set using TensorFlow to obtain a BERT model (the preset initial auxiliary labeling model).

The preset initial auxiliary labeling model predicts labels for the historical text to be labeled, and the prediction results are applied directly as initial recommended labels, yielding the historical auxiliary labeling result.

Step two: recording and analyzing the data annotator's correction operations on the history auxiliary labeling result to determine which words in the history auxiliary labeling result were corrected by the data annotator (the history correction words) and which labeling information the data annotator assigned to them (the history correction labeling information), automatically generating the history correction information from the history correction words and the history correction labeling information, and using the history correction information as a correction rule for the subsequent pre-labeling of other texts.

Optionally, the history correction words may be expanded in advance. For example, synonyms or near-synonyms of the history correction words may be obtained according to a preset text similarity model and added to the history correction information to generate a history correction information set. The preset text similarity model may be a BERT-based text similarity model. For example, the preset text similarity model determines a history related word for the history correction word "defeat". If "defeat" in "defeat Brazil" is labeled as a verb in the history correction information, the history related word is automatically labeled as a verb as well when it appears in the same position in subsequent texts.
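The expansion of the history correction information into a history correction information set can be sketched as below. The text similarity model is replaced by a hypothetical lookup function, and the related word "beat" is an invented stand-in used only for illustration; neither comes from the source.

```python
from typing import Callable, Dict, Iterable


def build_correction_set(
    history_corrections: Dict[str, str],
    related_words: Callable[[str], Iterable[str]],
) -> Dict[str, str]:
    """Add each related word with the same label as its history correction word."""
    correction_set = dict(history_corrections)
    for word, label in history_corrections.items():
        for related in related_words(word):
            # Keep an explicit human correction if the related word already has one.
            correction_set.setdefault(related, label)
    return correction_set


# Hypothetical stand-in for a text similarity model.
toy_related = {"defeat": ["beat"]}
correction_set = build_correction_set(
    {"defeat": "verb"}, lambda word: toy_related.get(word, [])
)
```

The related word inherits the labeling information of the history correction word, which matches the requirement that both carry the same labeling information.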

Step three: pre-labeling the current text to be labeled.

The current text to be labeled can be labeled by the preset initial auxiliary labeling model to obtain a current auxiliary labeling result. Then, at the data annotator's own choice, if a first correction instruction from the data annotator is obtained, the current auxiliary labeling result is corrected for the first time according to the previously generated history correction information set, yielding the current correction labeling result.

Optionally, the first correction instruction may include an instruction to correct the current auxiliary labeling result using only the historical correction information, or an instruction to correct it using the historical correction information set, which is not limited herein.

Step four: displaying the current correction labeling result to the data annotator, acquiring the current correction information from the data annotator, and performing a second correction on the current correction labeling result to complete the labeling of the current text to be labeled.

At this time, the data annotator sees the current text to be labeled with its pre-labeling, that is, the words labeled in the current correction labeling result are displayed to the data annotator together with their labeling information. The data annotator's modification instructions are obtained to generate the current correction information, and the current correction labeling result is corrected for the second time to complete the manual labeling of the current text to be labeled.

Optionally, in order to improve labeling accuracy and efficiency, the labeling work may be divided among groups. For example, it is preset that the data annotators of group 1 label entity tag 1 (e.g. region), the data annotators of group 2 label entity tag 2 (person name), and the data annotators of group 3 label entity tag 3 (company). In a possible embodiment, the full text of the current text to be labeled is displayed to group 3 together with the labeling information whose current labeling information is company, the full text and the labeling information whose current labeling information is person name are displayed to group 2, and so on.

Optionally, in order to improve efficiency, multiple people can label online at the same time and label several texts in a batch. The current labeling state of each displayed word, such as unlabeled or labeled, is shown synchronously, together with the name, code or other identification information of the data annotator working on it online.
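The distribution of pre-labeled words to groups by labeling information can be sketched as follows. The group names, the `label_to_group` mapping and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def distribute_by_label(
    corrected_result: List[Tuple[str, str]],
    label_to_group: Dict[str, str],
) -> Dict[str, List[Tuple[str, str]]]:
    """Route each pre-labeled word to the group responsible for its label."""
    assignments: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for word, label in corrected_result:
        group = label_to_group.get(label)
        if group is not None:  # labels without a responsible group are skipped
            assignments[group].append((word, label))
    return dict(assignments)


label_to_group = {"region": "group 1", "person name": "group 2", "company": "group 3"}
current_corrected = [("Lisan", "person name"), ("XX securities", "company")]
assignments = distribute_by_label(current_corrected, label_to_group)
```

Each group would then be shown the full text together with only the entries assigned to it.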

Because the data annotator may label texts in batches, some of the texts are similar or related. If the data annotator had to label each of them manually from scratch, a large amount of work would be required and resources would be wasted. To avoid repeated work, previously labeled content (the history correction information) can be reused: a client (web) or the like records the data annotator's behavior and data while labeling the historical text to be labeled as the history correction information, and when the next text (the current text to be labeled) is labeled, the previously recorded rules (the history correction information) are automatically applied to it. For example, if the previous text labeled "Lisan" as a person and "XX securities" as a company, then before the next text is processed, its data (the current auxiliary labeling information) is checked. If "Lisan" or "XX securities" appears but "Lisan" is not labeled as a person or "XX securities" is not labeled as a company (the labeling information may be missing or inconsistent), "Lisan" is automatically labeled as a person and "XX securities" as a company.
Optionally, before the current auxiliary labeling information of the current text to be labeled is corrected, the history related words of the history correction words may be obtained. For example, if the above-mentioned "Lisan" has the nickname "lovely", then "lovely" is added to the history correction information to generate the history correction information set. When the current auxiliary labeling information of the current text to be labeled is then corrected, the current auxiliary labeling information is checked first, and "lovely" is checked in addition to "Lisan" and "XX securities". If "lovely" is found but is unlabeled, or is labeled as something other than a person, its labeling information is corrected to person. This greatly reduces the data annotator's workload and improves working efficiency.
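The first correction described in this example can be sketched as a lookup against the history correction information set. The list-of-pairs layout and the function name are assumptions; the word "joined" is an invented filler word.

```python
from typing import Dict, List, Optional, Tuple

# (word, labeling information or None when unlabeled); assumed layout
Annotation = Tuple[str, Optional[str]]


def first_correction(
    auxiliary_result: List[Annotation],
    correction_set: Dict[str, str],
) -> List[Annotation]:
    """Override or fill in labels for words found in the correction set."""
    corrected: List[Annotation] = []
    for word, label in auxiliary_result:
        # Words absent from the correction set keep the model's label.
        corrected.append((word, correction_set.get(word, label)))
    return corrected


correction_set = {"Lisan": "person", "lovely": "person", "XX securities": "company"}
auxiliary = [("Lisan", None), ("joined", None), ("XX securities", "region")]
current_corrected = first_correction(auxiliary, correction_set)
```

Here the unlabeled "Lisan" becomes a person and the inconsistently labeled "XX securities" becomes a company, while "joined" is left untouched for the data annotator to review.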

Step five: training the preset initial auxiliary labeling model.

The preceding steps accumulate cases that the preset initial auxiliary labeling model labeled inaccurately, together with their correction information. The preset initial auxiliary labeling model can now be trained using the current correction words, the current related words, the current correction labeling information and the history correction information set as a training set, so as to improve the model.

It should be noted that, for the training of the preset initial auxiliary labeling model, the training may be performed on the basis of the training set in a manner known to those skilled in the art.

Optionally, each time a group of current texts to be labeled is completed, the corrected words and their labeling information are collected and the preset initial auxiliary labeling model is trained once. Alternatively, the preset initial auxiliary labeling model is trained once after a certain number of corrected words has been collected or a certain period of time has elapsed. This saves resources by avoiding the computing cost of training too frequently.
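One possible way to decide when a training run is due is sketched below. The class name, the two thresholds and the use of a monotonic clock are assumptions; the source only states that training may be triggered by a number of collected corrections or by elapsed time.

```python
import time
from typing import List, Optional, Tuple


class RetrainScheduler:
    """Buffers corrections and signals when a retraining run is due."""

    def __init__(self, min_corrections: int, max_interval_s: float) -> None:
        self.min_corrections = min_corrections
        self.max_interval_s = max_interval_s
        self.pending: List[Tuple[str, str]] = []
        self.last_trained = time.monotonic()

    def add(self, word: str, label: str) -> None:
        self.pending.append((word, label))

    def due(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        enough = len(self.pending) >= self.min_corrections
        # Time-based trigger only fires if there is something to train on.
        stale = bool(self.pending) and now - self.last_trained >= self.max_interval_s
        return enough or stale

    def drain(self, now: Optional[float] = None) -> List[Tuple[str, str]]:
        """Return the buffered corrections as one training batch and reset."""
        batch, self.pending = self.pending, []
        self.last_trained = time.monotonic() if now is None else now
        return batch


scheduler = RetrainScheduler(min_corrections=2, max_interval_s=3600.0)
scheduler.add("Lisan", "person")
first_check = scheduler.due()   # one correction, just started: not due
scheduler.add("XX securities", "company")
second_check = scheduler.due()  # two corrections reach the count threshold
batch = scheduler.drain()
```

The batch returned by `drain` would be passed to whatever training routine is used; that routine is outside the scope of this sketch.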

Step six: determining the labeling qualification rate of the data annotator.

The current text to be labeled is labeled by the trained preset initial auxiliary labeling model to obtain a training auxiliary labeling result, which is compared with the current corrected labeling result obtained in step four after the data annotator's second correction, to detect difference information. The difference words in the difference information are sent to the corresponding auditors, who label them to obtain quality inspection labeling information. The labeling qualification rate of the data annotator is obtained from the consistency between the quality inspection labeling information of each difference word and its labeling information in the current corrected labeling result after the second correction. The labeling qualification rate can be determined from the total number of labeled words and the number of difference words whose labeling information is inconsistent with the quality inspection labeling information, or from the total number of words in the current text to be labeled and that number of inconsistent difference words.

Optionally, the training auxiliary labeling result may also be directly used as a standard, and words with inconsistent labeling results of the data annotator and the training auxiliary labeling result are all used as difference words, so as to determine the labeling qualification rate of the data annotator.

The accuracy, the rigor and the qualification rate of a certain data annotator can be improved by automatically detecting the marking qualification rate of the data annotator.

Optionally, if the labeling information of a certain difference word in the current corrected labeling result after the second correction is inconsistent with the quality inspection labeling information, that labeling information needs to be changed to the quality inspection labeling information, and the quality inspection labeling information and the difference word are added to the historical correction information or the historical correction information set, so that the quality inspection labeling information can be applied directly to similar data later.

Optionally, the labeling qualification rate of the data annotator can be displayed in real time on the data annotator's work interface, together with the current average and the current highest labeling qualification rates, so that the data annotator knows how accurate his or her current work is. If the data annotator's labeling qualification rate is at a low level within the team, the data annotator is alerted in time and can improve his or her working ability either independently or by seeking help from a colleague with a higher labeling qualification rate. This also lets the corresponding manager learn the working state of the data annotators he or she manages in time. For a data annotator with a low labeling qualification rate, the manager can intervene in time according to how long that state has lasted, so as to improve the credibility of the team's labeling work.

Optionally, the erroneous difference words detected in the quality inspection process can be fed back to the corresponding data annotator. On the one hand, if the quality inspection itself is wrong, the data annotator can raise an objection in time so that the problem is resolved. On the other hand, the data annotator learns of and can analyze his or her earlier labeling errors: if an error was caused by carelessness, the data annotator can pay more attention next time; if it was caused by a gap in the data annotator's own knowledge, the gap can be filled in time, avoiding repeated labeling errors in subsequent work.

Optionally, the labeling information of the erroneous difference words detected in the quality inspection process can be counted. If the difference words of a certain piece of labeling information occur too frequently, a prompt can be sent to the relevant staff so that they receive corresponding training, or the preset initial auxiliary labeling model can be given targeted additional training, so as to improve the labeling accuracy for words with that labeling information.

Step seven: prompting high-risk labeling information.

Optionally, the high-risk annotation information may be determined by the number ratio of the history correction words corresponding to each history correction labeling information in the history correction information, and the high-risk annotation information is prompted. For example, the labeling information with low labeling qualification rate can be automatically detected, and a special prompt is displayed on the labeling interface of the data labeling personnel.

For example, suppose the history correction information contains 300 history correction words, 280 of which have the history correction labeling information "company". Since this proportion is higher than the preset ratio threshold, "company" can be taken as high-risk labeling information. When the current text to be labeled and its pre-labeling information are displayed, a prompt that the labeling of "company" has low reliability is shown on the display interface to attract the data annotator's attention, so that the data annotator pays closer attention to words labeled as "company" and to descriptions elsewhere in the text that may refer to companies, and labels the current text to be labeled more accurately.
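The ratio check in this example can be sketched as follows. The threshold value of 0.5, the function name and the "person name" filler label are assumptions for illustration; the source only requires comparison against a preset ratio threshold.

```python
from collections import Counter
from typing import Dict, List


def high_risk_labels(history_labels: List[str], ratio_threshold: float) -> Dict[str, float]:
    """Return labels whose share of history correction words exceeds the threshold."""
    total = len(history_labels)
    if total == 0:
        return {}
    counts = Counter(history_labels)
    return {
        label: count / total
        for label, count in counts.items()
        if count / total > ratio_threshold
    }


# 300 history correction words, 280 of them corrected to "company".
history = ["company"] * 280 + ["person name"] * 20
risky = high_risk_labels(history, ratio_threshold=0.5)
```

Only "company" is returned, with a share of 280/300, so it is the label that would be flagged on the display interface.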

Optionally, the history correction information further includes the auxiliary labeling information that each history correction word had in the history auxiliary labeling result. In this case, the correction reason for a certain piece of labeling information can also be determined from the history correction information. For example, if a word M was labeled A by the preset initial auxiliary labeling model but its history correction labeling information after manual correction is B, the correction reason is a labeling error. If a word N was left empty by the preset initial auxiliary labeling model, that is, the word N was not labeled, but its history correction labeling information after manual correction is C, the correction reason is a missing label. The correction reasons of the high-risk labeling information can then be determined, and the most frequent correction reason displayed to the data annotator, which further facilitates the data annotator's work. For example, if the correction reason is a missing label, the data annotator mainly needs to attend to the currently unlabeled words and label them, and only secondarily to check whether the other labeled words are labeled accurately. If the correction reason is a labeling error, the data annotator mainly needs to check whether the currently labeled words are labeled accurately and manually correct the wrongly labeled ones, and only secondarily to attend to the words that are not yet labeled.
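The classification of correction reasons can be sketched as below, assuming each history correction record stores the word, the model's auxiliary label (`None` when the word was unlabeled) and the manually corrected label. The record layout, the reason strings and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple

# (word, auxiliary label or None, history correction label); assumed layout
Record = Tuple[str, Optional[str], str]


def correction_reason(auxiliary_label: Optional[str]) -> str:
    """A word the model left unlabeled is a missing label; otherwise a wrong label."""
    return "missing label" if auxiliary_label is None else "wrong label"


def dominant_reason(records: List[Record], target_label: str) -> Optional[str]:
    """Most frequent correction reason among words corrected to target_label."""
    reasons = Counter(
        correction_reason(auxiliary)
        for _, auxiliary, corrected in records
        if corrected == target_label and auxiliary != corrected
    )
    return reasons.most_common(1)[0][0] if reasons else None


records = [("M", "A", "B"), ("N", None, "C"), ("P", None, "C"), ("Q", "A", "C")]
reason_for_c = dominant_reason(records, "C")
reason_for_b = dominant_reason(records, "B")
```

For label C, two of the three corrections filled in a missing label, so that reason would be shown to the data annotator; for label B the only correction was a wrong label.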

Optionally, different labeling information may be distinguished by different fonts, colors, and the like, so as to facilitate work of a data labeling operator.

Optionally, the prompt may be shown as a rolling caption at the top or bottom of the display interface, as a text bubble, or in other ways known to those skilled in the art.

It should be noted that the execution sequence of step six and step seven is not limited herein.

This embodiment provides a text labeling method. History correction information and history related words are obtained, and a history correction information set is generated from the history correction words, the history related words and the history correction labeling information. A current auxiliary labeling result is obtained and corrected for the first time according to the history correction information set to obtain a current correction labeling result. The current correction labeling result is displayed, current correction information is obtained, and the current correction labeling result is corrected for the second time, completing the labeling of the current text to be labeled. Because machine pre-labeling is performed before the data annotator labels manually, and the machine pre-labeled current correction labeling result is displayed together with the current text to be labeled, the data annotator only needs to modify and supplement the existing labels. This improves labeling efficiency, lowers labeling cost, reduces labeling workload and repetitive work, and improves user experience.

In one embodiment, the present invention further provides a text annotation apparatus 500, see fig. 5, comprising:

a history correction information obtaining module 501, configured to obtain history correction information, where the history correction information includes history correction words and history correction labeling information, the history correction information includes modification information of a history auxiliary labeling result, and the history auxiliary labeling result is obtained by labeling a history text to be labeled through a preset initial auxiliary labeling model;

a history related word obtaining module 502, configured to obtain a history related word of the history corrected word, and generate a history corrected information set according to the history corrected word, the history related word, and history corrected labeling information, where the labeling information of the history related word is the same as the history corrected labeling information, and the word meaning of the history related word is similar to or the same as the word meaning of the history corrected word;

the first correction module 503 is configured to obtain a current auxiliary annotation result, and perform first correction on the current auxiliary annotation result according to the historical correction information set to obtain a current correction annotation result, where the current auxiliary annotation result is obtained by inputting a current text to be annotated into a preset initial auxiliary annotation model;

the second modification module 504 is configured to display the current modification labeling result, obtain the current modification information, and perform second modification on the current modification labeling result to complete labeling of the current text to be labeled.

In this embodiment, the current revised labeling result includes a current labeling word and current labeling information, and the obtaining manner of the current revised information includes:

distributing the current correction marking result to the corresponding modification execution object according to the current marking information;

and acquiring object correction information of each modification execution object, and generating current correction information.

In this embodiment, after allocating the current correction labeling result to the corresponding modification execution object according to the current labeling information and before acquiring the object correction information of each modification execution object, the method further includes:

if at least two modification execution objects perform the second correction on the current correction labeling result at the same time, displaying the current labeling state of each current labeling word, wherein the current labeling state comprises at least one of: labeled, unlabeled, and information of the modification execution object.

In this embodiment, the current correction information includes a current correction word and current correction tagging information, and the apparatus further includes a training module, where the training module is configured to:

acquiring a current related word of a current correction word, and generating a current correction information set according to the current correction word, the current related word and current correction labeling information, wherein the labeling information of the current related word is the same as the current correction labeling information, and the word meaning of the current related word is similar to or the same as the word meaning of the current correction word;

and training the preset initial auxiliary labeling model according to the corrected training set.

In this embodiment, the apparatus further includes a prompt module, where the prompt module is configured to obtain a ratio of the number of historical correction words corresponding to each historical correction labeling information in the historical correction information before obtaining the current correction information; and if the quantity ratio is higher than a preset ratio threshold value, taking the historical correction marking information as high-risk marking information, and prompting.

In this embodiment, the apparatus further includes a quality inspection module, and the quality inspection module is configured to:

inputting a second text to be labeled into a trained preset initial auxiliary labeling model to obtain a training auxiliary labeling result, wherein the training auxiliary labeling result comprises a labeling word and training labeling information;

comparing the training auxiliary labeling result with the current corrected labeling result after the second correction to obtain difference information, wherein the current corrected labeling result after the second correction comprises a labeling word and second correction labeling information, the difference word is a labeling word with different second correction labeling information and training labeling information, and the difference information comprises the difference word and second correction labeling information of the difference word;

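acquiring quality inspection labeling information of the difference words from a third party;

and determining the labeling qualification rate of the modification execution object according to the quality inspection labeling information and the second correction labeling information.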

In this embodiment, if the quality inspection labeling information is different from the second correction labeling information, the apparatus further includes a third correction module and/or an addition module, wherein,

the third correction module is used for performing third correction on the current corrected marking result after the second correction according to the quality inspection marking information;

the adding module is used for adding the current correction information, the quality inspection labeling information and the difference words to the historical correction information.

This embodiment provides a text labeling device. History correction information and history related words are obtained, and a history correction information set is generated from the history correction words, the history related words and the history correction labeling information. A current auxiliary labeling result is obtained and corrected for the first time according to the history correction information set to obtain a current correction labeling result. The current correction labeling result is displayed, current correction information is obtained, and the current correction labeling result is corrected for the second time, completing the labeling of the current text to be labeled. Because machine pre-labeling is performed before the data annotator labels manually, and the machine pre-labeled current correction labeling result is displayed together with the current text to be labeled, the data annotator only needs to modify and supplement the existing labels. This improves labeling efficiency, lowers labeling cost, reduces labeling workload and repetitive work, and improves user experience.

It should be understood that the text labeling device essentially comprises a plurality of modules for executing the text labeling method of any of the above embodiments. For its specific functions and technical effects, reference may be made to those embodiments, which are not described herein again.

In an embodiment, referring to fig. 6, the embodiment further provides a computer device 600, which includes a memory 601, a processor 602, and a computer program stored on the memory and executable on the processor, and when the processor 602 executes the computer program, the steps of the method according to any one of the above embodiments are implemented.

In an embodiment, a computer-readable storage medium is also provided, on which a computer program is stored; when executed by a processor, the computer program carries out the steps of the method according to any of the above embodiments.

The embodiments of the present application can acquire and process related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, methods, techniques and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best result.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, apparatus, article, or method that includes the element.

The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments. From the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the preferable implementation. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product that is stored in a storage medium as described above (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.

The above description covers only preferred embodiments of the present invention and is not intended to limit its scope. All equivalent structures or equivalent processes derived from the contents of this specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of protection of the present invention.

Claims

1. A text labeling method, the method comprising:

2. The text annotation method of claim 1, wherein the current revised annotation result includes a current annotation word and current annotation information, and the current revised annotation information is obtained in a manner that includes:

3. The text annotation method of claim 2, wherein after assigning the current revision annotation result to the corresponding revision execution object according to the annotation category, and before acquiring the object revision information of each of the revision execution objects, the method further comprises:

4. The text annotation method of any one of claims 1-3, wherein the current correction information includes a current correction word and current correction annotation information, the method further comprising:

5. The text annotation method of any one of claims 1-3, wherein prior to obtaining the current revision information, the method further comprises:

6. The text annotation method of claim 4, further comprising:

inputting a second text to be labeled into the trained preset initial auxiliary labeling model to obtain a training auxiliary labeling result, wherein the training auxiliary labeling result comprises a labeling word and training labeling information;

7. The method of claim 6, wherein if the quality control annotation information is different from the second revision annotation information, the method further comprises at least one of:

8. A text labeling apparatus, the apparatus comprising:

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented by the processor when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.