CN108446695B - Method and device for data annotation and electronic equipment - Google Patents


Info

Publication number
CN108446695B (application CN201810115780.2A)
Authority: CN (China)
Prior art keywords: result, data, recognition, character, identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number: CN201810115780.2A
Other languages: Chinese (zh)
Other versions: CN108446695A (en)
Inventor
张兰渝 (Zhang Lanyu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Original Assignee
Advanced New Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Advanced New Technologies Co Ltd
Priority to CN202210197077.7A (published as CN114677681A)
Priority to CN201810115780.2A (CN108446695B)
Publication of CN108446695A
Application granted
Publication of CN108446695B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting


Abstract

The embodiments of this application disclose a method, an apparatus, and an electronic device for data annotation. The method includes: acquiring a plurality of recognition results obtained by performing multiple rounds of recognition on data to be labeled, where the data to be labeled comprises at least one character; when it is determined that different recognition results exist among the plurality of recognition results, determining a candidate result and an interference result among them; and judging the recognition status of the data to be labeled according to the candidate result and the interference result, where the recognition status is either recognition success or recognition failure.

Description

Method and device for data annotation and electronic equipment
Technical Field
The present application relates to the field of computer data processing, and more particularly, to a method, an apparatus, and an electronic device for data annotation.
Background
At present, most channels crawled for data display that data as pictures in order to protect it, which raises the threshold for acquiring the data. Being able to parse such picture-borne (image) data therefore improves a crawler channel's competitiveness in the market. Scenarios in which image data is recognized (such as optical character recognition, verification-code (CAPTCHA) recognition, and handwritten character recognition) currently require manual labeling: the image data is recognized by people, and the label set formed from the resulting recognition results has very important applications.
However, the existing manual labeling process consumes substantial manpower and material resources, and human error is unavoidable, so the effective utilization rate of the recognition results (i.e., the labeling results) cannot be guaranteed.
Therefore, a method for data annotation is needed to overcome the above technical problems.
Disclosure of Invention
The application aims to provide a method, a device and electronic equipment for data annotation, which can ensure the effective utilization rate of an identification result.
In order to solve the above technical problem, the embodiment of the present application is implemented as follows:
in a first aspect, a method for data annotation is provided, including:
acquiring a plurality of recognition results after multi-round recognition is carried out on data to be marked, wherein the data to be marked comprises at least one character;
determining a candidate result and an interference result of the plurality of recognition results when it is determined that different recognition results exist among the plurality of recognition results;
and judging the identification condition of the data to be marked according to the candidate result and the interference result, wherein the identification condition comprises successful identification and failed identification.
In a second aspect, an apparatus for data annotation is provided, including:
an acquisition unit, configured to acquire a plurality of recognition results after multiple rounds of recognition are performed on data to be labeled, wherein the data to be labeled comprises at least one character;
a determination unit that determines a candidate result and an interference result among the plurality of recognition results when it is determined that different recognition results exist among the plurality of recognition results;
and the judging unit is used for judging the identification condition of the data to be marked according to the candidate result and the interference result, wherein the identification condition comprises successful identification and failed identification.
In a third aspect, an electronic device is provided, including:
a processor; and
a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the following operations:
acquiring a plurality of recognition results after multi-round recognition is carried out on data to be marked, wherein the data to be marked comprises at least one character;
determining a candidate result and an interference result of the plurality of recognition results when it is determined that different recognition results exist among the plurality of recognition results;
and judging the identification condition of the data to be marked according to the candidate result and the interference result, wherein the identification condition comprises successful identification and failed identification.
In a fourth aspect, a computer-readable medium is provided that stores one or more programs that, when executed by an electronic device that includes a plurality of application programs, cause the electronic device to:
acquiring a plurality of recognition results after multi-round recognition is carried out on data to be marked, wherein the data to be marked comprises at least one character;
determining a candidate result and an interference result of the plurality of recognition results when it is determined that different recognition results exist among the plurality of recognition results;
and judging the identification condition of the data to be marked according to the candidate result and the interference result, wherein the identification condition comprises successful identification and failed identification.
According to the technical solutions provided by the embodiments of this application, when different recognition results exist among the plurality of recognition results for data to be labeled, the recognition status of that data can still be judged from the candidate result and the interference result among those recognition results. This overcomes the limitation that the recognition status can be judged from the recognition results only when all of them are identical, a limitation under which the effective utilization rate of the recognition results cannot be guaranteed, and thereby guarantees that utilization rate.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in describing them are briefly introduced below. The drawings described below are obviously only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flow chart diagram of a method for data annotation according to one embodiment of the present application.
FIG. 2 is a schematic flow chart diagram of a method for data annotation according to a specific embodiment of the present application.
FIG. 3 is a schematic structural diagram of an electronic device according to one embodiment of the present application.
Fig. 4 is a schematic structural diagram of an apparatus for data annotation according to an embodiment of the present application.
Detailed Description
In order to help those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only a part of the embodiments of the present application, not all of them; all other embodiments that a person skilled in the art can derive from them without creative effort fall within the protection scope of the present application.
FIG. 1 is a flow diagram of a method for data annotation according to one embodiment of the present application. The method of fig. 1 is performed by an apparatus for data annotation. It should be understood that the scheme of the embodiment of the application can be applied to the annotation of the text image data. Of course, the application of the method of the embodiments of the present application to the labeling of other data is not excluded.
As shown in fig. 1, at S102, a plurality of recognition results after performing multiple rounds of recognition on data to be annotated, which includes at least one character, are obtained.
It is understood that in each round of recognition the data to be labeled can be recognized by a plurality of persons; for example, each round can involve 3 persons. The participants in any two rounds of recognition may also differ: for instance, A, B, and C may participate in the first round and D, E, and F in the second. Each person recognizes the data to be labeled and produces one recognition result.
Optionally, in some embodiments, the total number of recognition rounds for the data to be labeled and/or the number of participants in each round is determined according to the timeliness requirement, for example following the rule that the higher the timeliness requirement, the smaller the total number of rounds and/or the fewer the participants per round.
At S104, when it is determined that there is a different recognition result among the plurality of recognition results, a candidate result and an interference result among the plurality of recognition results are determined.
Optionally, in some embodiments, if the multiple recognition results are the same, the accuracy of the recognition result may be directly calculated, and the recognition condition of the data to be labeled is determined according to the accuracy of the recognition result and the accuracy requirement.
Specifically, in some embodiments, the candidate result and the interference results are determined by number of occurrences: the candidate result is the recognition result that occurs most often among the plurality of recognition results, and the interference results are the recognition results other than the candidate result. The number of occurrences can also be understood as the number of votes, i.e., the number of persons who produced that same recognition result for the data to be labeled.
For example, suppose the data to be labeled comprises 4 characters (positions 0 to 3) and the recognition results obtained by 6 persons are 1234, 1324, 1234, 1234, 1234, and 123: 1234 occurs 4 times while 1324 and 123 each occur once, so 1234 is determined as the candidate result and 1324 and 123 as the interference results.
Further, in some embodiments, an interference result must contain the same number of characters as the candidate result. That is, when determining the candidate result and the interference results among the plurality of recognition results, the result with the most occurrences is taken as the candidate result, and the other recognition results containing the same number of characters as the candidate result are taken as the interference results; any recognition result whose character count differs from the candidate result's is ignored, which can also be understood as discarding it as misaligned.
For example, with the data to be labeled comprising 4 characters (positions 0 to 3) and the recognition results of 6 persons being 1234, 1324, 1234, 1234, 1234, and 123, 1234 occurs 4 times while 1324 and 123 each occur once; 1234 is then determined as the candidate result, 1324 as an interference result, and 123 is discarded as misaligned.
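The candidate/interference selection described above can be sketched in Python (function and variable names are illustrative, not from the patent):

```python
from collections import Counter

def split_results(recognitions):
    """Pick the candidate result (the most frequent recognition result)
    and the interference results with the same character count; results
    whose character count differs from the candidate's are discarded as
    misaligned."""
    counts = Counter(recognitions)
    candidate, candidate_votes = counts.most_common(1)[0]
    interference = {r: n for r, n in counts.items()
                    if r != candidate and len(r) == len(candidate)}
    return candidate, candidate_votes, interference
```

With the six results of the example, `split_results(["1234", "1324", "1234", "1234", "1234", "123"])` returns `("1234", 4, {"1324": 1})`: 1234 is the candidate with 4 votes, 1324 the interference result, and 123 is discarded.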
At S106, according to the candidate result and the interference result, determining an identification condition of the data to be labeled, where the identification condition includes success in identification and failure in identification.
Specifically, in some embodiments, the type of the candidate result is judged according to a distribution algorithm model, the number of occurrences of the candidate result, and the number of occurrences of the interference result. The type of the candidate result is one of a determined valid result, a determined invalid result, and a to-be-determined valid result, and the distribution algorithm model is obtained by training based on the accuracy requirement and the recognition results of the training data to be labeled. The recognition status of the data to be labeled is then judged according to the type of the candidate result.
Optionally, as an example, in the process of training the distribution algorithm model, the accuracy requirement is determined; the total number of recognition rounds and the number of distinct participants per round are determined according to the timeliness requirement; the cumulative total number of participants is then counted, all possible distributions of recognition results are analyzed, the estimated accuracy of each distribution is determined, and the distribution algorithm model is determined based on each distribution's estimated accuracy and the accuracy requirement.
For example, assume the training data used to train the distribution algorithm model comprises the ten characters 0 to 9, the recognition accuracy of a character is a (understood here as a lower bound on the per-character recognition accuracy), the initial probability of a character being misrecognized as any particular one of the other nine characters is (1-a)/9, and a = 0.9. The estimated accuracy of each distribution can then be calculated.
For example, writing (x, y, z, ...) for the distribution in which x persons give one identical recognition result, y persons a second, z persons a third, and so on, the estimated accuracies are:

- (2), 2 persons all with the same result: 99.86301369863013%;
- (3): 99.99830651989839%;
- (4): 99.99997909248857%;
- (5): 99.99999974188252%;
- (6): 99.99999999681336%;
- (2,1), 3 persons with 2 agreeing and 1 differing: 98.66165413533837%;
- (3,1): 99.98325588395764%;
- (3,2): 98.77901897734243%;
- (4,1): 99.99979324832664%;
- (4,2): 99.9847421648845%;
- (5,1): 99.999997447505%;
- (4,1,1), 6 persons with 4 agreeing while the other 2 give two further distinct results: 99.98994724384438%;
- (3,1,1): 98.96391181294628%;
- (2,1,1): 45.67351200835363%;
- (3,1,1,1): 81.24386157490324%;
- (3,2,1), 6 persons with 3 agreeing, 2 agreeing on a second result, and 1 giving a third: 79.73229373797001%.
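Under the uniform-misrecognition assumption stated above (a character is read correctly with probability a and misread as each particular one of the other nine characters with probability (1-a)/9), the estimated accuracy of a vote distribution can be computed with Bayes' rule. The sketch below (names illustrative) reproduces the figures listed above for distributions with one or two distinct results, such as (2), (3,1), and (3,2); the listed figures for distributions with three or more distinct results appear to come from a different calculation and are not reproduced by this sketch:

```python
def estimated_accuracy(dist, a=0.9, n_chars=10):
    """Posterior probability that the most frequent answer is correct,
    given a vote distribution such as (3, 1): 3 votes for the candidate
    and 1 vote for one interfering answer."""
    b = (1 - a) / (n_chars - 1)  # prob. of misreading as one specific wrong char

    def likelihood(true_idx):
        # Probability of the observed votes if the answer receiving
        # dist[true_idx] votes is the correct one.
        p = 1.0
        for idx, votes in enumerate(dist):
            p *= (a if idx == true_idx else b) ** votes
        return p

    unseen = n_chars - len(dist)          # characters that received no votes
    denom = sum(likelihood(i) for i in range(len(dist)))
    denom += unseen * b ** sum(dist)      # the true char may have received no votes
    return likelihood(0) / denom
```

For example, `estimated_accuracy((2,))` gives about 0.99863 and `estimated_accuracy((3, 2))` about 0.98779, matching the listed values.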
Assuming the accuracy requirement is higher than 99.99% and combining it with the estimated accuracy of each distribution, the distribution algorithm model can be obtained as follows: (a) when x ≤ y, the type of the candidate result is a determined invalid result; (b) when x = 2, the type of the candidate result is a determined invalid result; (c) when x - y > 2, the type of the candidate result is a determined valid result; (d) when x - y = 2, the type of the candidate result is a to-be-determined valid result. Here x denotes the number of occurrences of the candidate result and y denotes the number of occurrences of the interference result.
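Rules (a) through (d) can be transcribed directly. Note that as stated they do not cover every case (for example x - y = 1 with x > 2); treating uncovered cases as invalid below is an assumption of this sketch, not something the patent states:

```python
def classify_candidate(x, y):
    """x: occurrences of the candidate result; y: occurrences of the
    interference result.  Returns the candidate result's type."""
    if x <= y:
        return "determined invalid"      # rule (a)
    if x == 2:
        return "determined invalid"      # rule (b)
    if x - y > 2:
        return "determined valid"        # rule (c)
    if x - y == 2:
        return "to-be-determined valid"  # rule (d)
    return "determined invalid"          # uncovered case (assumption)
```

For instance, a (4,1) distribution is classified as determined valid, while (3,1) falls under rule (d) and triggers the character-by-character accuracy check described later.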
Optionally, as an embodiment, when the type of the candidate result is judged to be a determined valid result, it is determined that the data to be labeled has been successfully recognized, and the candidate result is determined to be the labeling result of the data to be labeled.
Further, when the candidate result is judged to be a determined valid result, the counts of correct and incorrect recognitions of the at least one character included in the data to be labeled are updated according to the numbers of occurrences of the candidate result and of the interference result.
For example, if 6 persons recognize the data to be labeled, the candidate result occurs 5 times, the interference result occurs once, and the interference result recognizes the character 6 in the data as the character 8, then the correct-recognition count of each character in the data is increased by 5, and the count of character 6 being recognized as character 8 is increased by 1. This makes it convenient to compute the probability of each character being recognized as each other character, to later determine accurately the accuracy corresponding to each distribution of recognition results, to improve the effective utilization rate of the recognition results, to reduce cost, and to enrich the sample set.
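The bookkeeping in this example can be sketched as follows; the nested-dict layout C[i][j] (times character j was recognized as character i) and the function name are illustrative assumptions:

```python
from collections import defaultdict

def update_counts(labeling, candidate_votes, interference, C):
    """labeling: the accepted annotation string.
    interference: list of (wrong_string, votes) pairs, same length as labeling.
    C: nested dict, C[i][j] = times character j was recognized as character i."""
    for ch in labeling:
        C[ch][ch] += candidate_votes              # correct recognitions
    for wrong, votes in interference:
        for right_ch, wrong_ch in zip(labeling, wrong):
            if wrong_ch != right_ch:
                C[wrong_ch][right_ch] += votes    # j misrecognized as i

C = defaultdict(lambda: defaultdict(int))
update_counts("6", 5, [("8", 1)], C)   # the 6-person example above
```

After the call, `C['6']['6']` is 5 (character 6 correctly recognized five times) and `C['8']['6']` is 1 (character 6 recognized as character 8 once), as in the example.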
Optionally, as another embodiment, when the type of the candidate result is a determined invalid result, the number of recognition rounds already performed on the data to be labeled is compared with a preset number of recognition rounds. When fewer rounds than the preset number have been performed, another round of recognition is performed on the data to be labeled; the plurality of recognition results is updated with the results of the new round, and the recognition status of the data is judged again from the candidate result and the interference result among the updated results. In other words, when the number of rounds performed is less than the preset number, the steps of the method shown in fig. 1 are repeated. When the number of rounds performed equals the preset number, it is determined that recognition of the data to be labeled has failed.
Optionally, as another embodiment, when the type of the candidate result is judged to be a to-be-determined valid result, it is judged whether the candidate result meets the accuracy requirement; when it does, it is determined that the data to be labeled has been successfully recognized, and the candidate result is determined to be the labeling result of the data to be labeled.
Specifically, judging whether the candidate result meets the accuracy requirement includes: determining the recognition accuracy of the at least one character; and judging whether the candidate result meets the accuracy requirement according to that recognition accuracy and the accuracy requirement. In effect, the data to be labeled is divided into individual characters (the candidate result is actually a recognition result for each of those characters), and the accuracy requirement is confirmed character by character: the candidate result meets the requirement when the recognition accuracy of every character meets it, and otherwise does not.
In some embodiments, determining the recognition accuracy of the at least one character comprises: judging the type of the at least one character according to the distribution algorithm model, the number of occurrences of the candidate result, and the number of occurrences of the interference result, where the type of the at least one character is one of a determined valid character, a determined invalid character, and a to-be-determined valid character; and, when the at least one character is judged to be a to-be-determined valid character, determining its recognition accuracy according to the candidate result.
Equivalently, when the type of the candidate result is a to-be-determined valid result, it must be judged whether the recognition accuracy of each character in the data to be labeled meets the accuracy requirement. The recognition accuracy of each character can be determined as follows: count the vote distribution of the current character from the occurrences of the candidate result and the interference results, and judge the character's type with the distribution algorithm model. If the current character is a determined valid character, count the vote distribution of the next character and proceed in the same way. If it is a determined invalid character, the candidate result does not meet the accuracy requirement, and the method shown in fig. 1 is executed again from the start. If it is a to-be-determined valid character, determine its recognition accuracy, then count the vote distribution of the next character and proceed in the same way.
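Counting the vote distribution of a single character position, as the character-by-character procedure above requires, might look like this (an illustrative sketch; all results are assumed to have the candidate's length, per the misalignment rule):

```python
from collections import Counter

def char_vote_distribution(results, pos):
    """Vote distribution for the character at position `pos`, given all
    same-length recognition results (one list entry per vote).  Returns
    (char, votes) pairs, most frequent first."""
    return Counter(r[pos] for r in results).most_common()
```

For six results where five read position 2 as '3' and one reads it as '8', `char_vote_distribution(results, 2)` returns `[('3', 5), ('8', 1)]`, a per-character distribution of (5, 1) that can then be fed to the distribution algorithm model.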
Optionally, as an example, determining the recognition accuracy of the at least one character includes: determining a conditional probability formula corresponding to the candidate result; and determining the recognition accuracy of the at least one character according to that conditional probability formula and a recognition probability matrix, whose elements describe the probability that one of the characters is recognized as another of the characters.
For example, the elements of the recognition probability matrix W can be written as $W_{ij}$, where $W_{ij}$ denotes the probability that character j is recognized as character i. $W_{ij}$ can be determined by formula (1):

$$W_{ij} = \frac{C_{ij}}{\sum_{i} C_{ij}} \tag{1}$$

where $C_{ij}$ denotes the number of times character j is recognized as character i, and $\sum_{i} C_{ij}$ denotes the total number of times character j was recognized.
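Formula (1) translates directly into code. A minimal sketch, assuming C is a square list-of-lists matrix with C[i][j] counting how often character j was recognized as character i:

```python
def recognition_probability_matrix(C):
    """W[i][j] = C[i][j] / sum_i C[i][j]  (formula (1)): the estimated
    probability that character j is recognized as character i.
    Each column of W sums to 1."""
    n = len(C)
    col_totals = [sum(C[i][j] for i in range(n)) for j in range(n)]
    return [[C[i][j] / col_totals[j] for j in range(n)] for i in range(n)]
```

For a two-character toy count matrix `[[8, 1], [2, 9]]`, the first column of W becomes (0.8, 0.2): character 0 was read correctly 8 times out of 10.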
The conditional probability formula of the candidate result according to the embodiment of the present application will be described below with reference to specific examples.
Taking the case where each round of recognition involves 3 persons as an example: if one round of recognition is performed on the data to be labeled and the recognition results of the 3 persons are the same, the conditional probability in that case can be expressed as formula (2):

$$P(A=j \mid X_1=j, X_2=j, X_3=j) = \frac{P(X_1=j, X_2=j, X_3=j \mid A=j)\,P(A=j)}{\sum_{k} P(X_1=j, X_2=j, X_3=j \mid A=k)\,P(A=k)} \tag{2}$$

Since the three recognitions are independent, $P(X_1=j, X_2=j, X_3=j \mid A=j)$ can be expressed as $P(X_1=j \mid A=j)\cdot P(X_2=j \mid A=j)\cdot P(X_3=j \mid A=j)$, and (assuming a uniform prior over the characters, so that the $P(A=k)$ terms cancel) formula (2) transforms into formula (3):

$$P(A=j \mid X_1=j, X_2=j, X_3=j) = \frac{W_{jj}^{3}}{\sum_{k} W_{jk}^{3}} \tag{3}$$
by substituting the elements in the recognition probability matrix W into formula (3), the recognition accuracy of each character can be determined under the condition that the recognition results of 3 persons are the same. And if it is assumed that the probability W that the character j is correctly recognizedjjA, probability that character j is not correctly recognized
Figure BDA0001570599740000103
Thereby obtaining
Figure BDA0001570599740000104
And then to
Figure BDA0001570599740000105
Assuming that $a$ is 90% and the accuracy requirement is greater than 98%, it may be determined that the accuracy of each character meets the accuracy requirement when the recognition results of the 3 persons are the same. The recognition results of the 3 persons therefore meet the accuracy requirement, and they may be determined as the labeling result of the data to be recognized.
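A minimal numeric sketch of this worst-case check, assuming $W_{jj} = a$, the whole error mass $(1-a)$ concentrated on a single competing character, and uniform priors (all assumptions made for illustration):

```python
def posterior_three_identical(a):
    """Accuracy of a character when all 3 recognizers agree, in the
    worst case where the whole misrecognition probability (1 - a) falls
    on one competing character; uniform priors cancel in equation (3)."""
    return a ** 3 / (a ** 3 + (1 - a) ** 3)

p = posterior_three_identical(0.90)
print(p > 0.98)  # True: three identical results meet a 98% accuracy requirement
```

With $a = 0.9$ the value is $0.729 / 0.730 \approx 99.86\%$, comfortably above the 98% requirement.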
Further, if the recognition results of the 3 persons are inconsistent and the data to be labeled is subjected to the next round of recognition, then, without considering the case in which one character is simultaneously recognized as a plurality of different characters and after removing the recognition results with lower accuracy, the following 8 events can occur for one character (√ denotes a correct recognition and × denotes an erroneous recognition; the first three symbols are the first round and the last three the second round): Event 1: √×××××; Event 2: √××√××; Event 3: √××√√×; Event 4: √××√√√; Event 5: √√××××; Event 6: √√×√××; Event 7: √√×√√×; Event 8: √√×√√√. Analyzing the above 8 events yields 4 scenarios. Scenario one: two recognition results exist in event 1 and event 8, and their occurrence frequencies are 5:1. Scenario two: two recognition results exist in event 2 and event 7, and their occurrence frequencies are 4:2. Scenario three: two recognition results exist in event 4 and event 5, and their occurrence frequencies are 4:2. Scenario four: two recognition results exist in event 3 and event 6, and their occurrence frequencies are 3:3.
Specifically, the conditional probability formula of event 8 in scenario one is formula (4):

$$P(A=j \mid E_8) = \frac{W_{jj}^{5}\,W_{ij}\,P(A=j)}{\sum_{k} W_{jk}^{5}\,W_{ik}\,P(A=k)} \tag{4}$$

where five of the six recognitions return character $j$ and one returns character $i$. If it is assumed that the probability that character $j$ is correctly recognized is $W_{jj} = a$, the probability that character $j$ is not correctly recognized is $(1-a)$. Further assume that character $j$ being recognized as character $i$ is a common error, i.e. apart from being recognized as itself, character $j$ is only ever recognized as character $i$, so that $W_{ij} = 1-a$; when it is a rare error, i.e. character $j$ is almost never recognized as character $i$, $W_{ij} \approx 0$. Taking $a = 90\%$, it can be determined that the accuracy of the recognition result of event 8 is 99.97% in the common-error case and approximately 0 in the rare-error case.
Alternatively, the conditional probability formula of event 7 in scenario two is formula (5):

$$P(A=j \mid E_7) = \frac{W_{jj}^{4}\,W_{ij}^{2}\,P(A=j)}{\sum_{k} W_{jk}^{4}\,W_{ik}^{2}\,P(A=k)} \tag{5}$$

where four of the six recognitions return character $j$ and two return character $i$. Again assume that the probability that character $j$ is correctly recognized is $W_{jj} = a$, so the probability that character $j$ is not correctly recognized is $(1-a)$; that $W_{ij} = 1-a$ when recognizing $j$ as $i$ is a common error; and that $W_{ij} \approx 0$ when it is a rare error. Taking $a = 90\%$, it can be determined that the accuracy of the recognition result of event 7 is 98.662% in the common-error case and approximately 0 in the rare-error case.
Still alternatively, the conditional probability formula of event 3 in scenario four is formula (6):

$$P(A=j \mid E_3) = \frac{W_{jj}^{3}\,W_{ij}^{3}\,P(A=j)}{\sum_{k} W_{jk}^{3}\,W_{ik}^{3}\,P(A=k)} \tag{6}$$

where three of the six recognitions return character $j$ and three return character $i$. Under the same assumptions ($W_{jj} = a$; $W_{ij} = 1-a$ for a common error, $W_{ij} \approx 0$ for a rare error; $a = 90\%$), it can be determined that the accuracy of the recognition result of event 3 is 49.727% in the common-error case and approximately 0 in the rare-error case.
Therefore, without considering the case in which a character is simultaneously recognized as a plurality of different characters: when the error that occurs is a common error, scenarios one, two, and three are scenarios in which the accuracy of the recognition result meets the requirement, i.e. acceptable scenarios; when a rare error occurs, they are unacceptable scenarios. Scenario four is an unacceptable scenario regardless of whether the error that occurs is a common error.
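The scenario analysis above can be approximated with a deliberately simplified two-hypothesis model (the true character is either $j$ or a single competitor $i$, with uniform priors). The exact figures in the embodiment (99.97%, 98.662%, 49.727%) come from the full conditional probability formulas, so this sketch only reproduces their qualitative behavior:

```python
def split_vote_posterior(a, m, n, common=True):
    """Posterior probability that the majority result is correct when it
    receives m votes against n, under a two-hypothesis simplification:
    the true character is either j or the single competitor i.  For a
    common error W_ij = 1 - a; for a rare error W_ij ~ 0, which drives
    the posterior for the observed split to approximately 0."""
    if not common:
        return 0.0
    p_j = a ** m * (1 - a) ** n   # likelihood of the split if the truth is j
    p_i = (1 - a) ** m * a ** n   # likelihood of the split if the truth is i
    return p_j / (p_j + p_i)

for m, n in [(5, 1), (4, 2), (3, 3)]:
    print((m, n), round(split_vote_posterior(0.9, m, n), 3))
# A 5:1 split is near-certain, 4:2 is still acceptable, 3:3 is uninformative.
```

The 3:3 split yields exactly 0.5 in this model, matching the intuition that scenario four cannot be accepted under either error assumption.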
Further, consider the case in which one character is simultaneously recognized as a plurality of different characters. For example, suppose 4 recognitions are correct and 2 are wrong, with the two wrong recognitions differing from each other, so that 3 recognition results appear for the same character with occurrence counts $(4,1,1)$, and the recognition result with the largest occurrence count is determined as the candidate result. The conditional probability formula corresponding to the candidate result is formula (7):

$$P(A=j \mid E) = \frac{W_{jj}^{4}\,W_{i_1 j}\,W_{i_2 j}\,P(A=j)}{\sum_{k} W_{jk}^{4}\,W_{i_1 k}\,W_{i_2 k}\,P(A=k)} \tag{7}$$

where $j$ is the candidate result for the character and $i_1, i_2$ are the two erroneous recognition results. The denominator in formula (7), keeping only the dominant hypotheses $A=j$, $A=i_1$ and $A=i_2$ under a uniform prior, can be represented by equation (8):

$$\sum_{k} W_{jk}^{4}\,W_{i_1 k}\,W_{i_2 k} \approx W_{jj}^{4}\,W_{i_1 j}\,W_{i_2 j} + W_{j i_1}^{4}\,W_{i_1 i_1}\,W_{i_2 i_1} + W_{j i_2}^{4}\,W_{i_1 i_2}\,W_{i_2 i_2} \tag{8}$$
The latter two terms in equation (8) are analyzed using the mean inequality, which bounds them from above; substituting the bound back into formula (7) yields a lower bound on the recognition accuracy of the candidate result. When $i_1, i_2$ are both common errors, so that the error mass satisfies $W_{i_1 j} + W_{i_2 j} \leq 1-a$, and assuming that $a = 0.9$, the accuracy of the candidate result is greater than 99.485%. When $i_1, i_2$ are both rare errors, $W_{i_1 j} \approx W_{i_2 j} \approx 0$ and the accuracy of the candidate result is approximately 0.
It will be appreciated that the extreme error cases described above are considered only for illustrating the conditional probability formula of the candidate result and for determining the accuracy from it. In the actual process, the elements in the recognition probability matrix $W$ are substituted into the specific conditional probability formula, so that the specific recognition accuracy of each character in the data to be labeled can be obtained, and whether the candidate result is valid can be determined according to the obtained specific recognition accuracy of each character.
In the embodiment of the application, in order to obtain an accurate recognition accuracy for the at least one character, before the recognition accuracy of the at least one character is determined according to the conditional probability formula and the recognition probability matrix, it is determined that the total number of times the at least one character has been recognized is greater than or equal to a preset recognition count. For example, if the preset recognition count is 100 and the total number of times the at least one character has been recognized is greater than 100, the elements in the recognition probability matrix may be substituted into the conditional probability formula corresponding to the candidate result to obtain the recognition accuracy of the at least one character. If the recognition accuracy of the at least one character meets the accuracy requirement, it is determined that the candidate result meets the accuracy requirement. In this case, the number of times that at least one character included in the data to be labeled is correctly recognized and the number of times it is erroneously recognized are not updated according to the number of occurrences of the candidate result and the number of occurrences of the interference result.
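A sketch of the count threshold that gates the probability-based check (the function name and data layout are illustrative assumptions):

```python
def accuracy_check_allowed(total_counts, chars, preset_count=100):
    """Only substitute the recognition probability matrix into the
    conditional probability formula once every character involved has
    been recognized at least preset_count times; before that, the
    matrix estimate is treated as too noisy to use."""
    return all(total_counts.get(c, 0) >= preset_count for c in chars)

totals = {'8': 250, '6': 40}
print(accuracy_check_allowed(totals, '86'))  # False: '6' seen only 40 times
print(accuracy_check_allowed(totals, '8'))   # True
```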
FIG. 2 is a schematic flow chart of a method for data annotation according to a specific embodiment of the present application. The method of FIG. 2 may be performed by an apparatus for data annotation. As shown in FIG. 2, the method includes two main processes: distribution algorithm model training and recognition condition judgment. The distribution algorithm model training process comprises S202 to S210, and the recognition condition judgment process comprises S212 to S234; wherein:
at S202, the accuracy requirement is determined.
At S204, the total number of rounds of manual identification and the number of participants in each round (which may differ from round to round) are determined.

Specifically, the total number of rounds and the number of participants in each round can be determined according to timeliness requirements.
At S206, the cumulative number of participants is counted, and all possible result distributions are analyzed.

A result distribution can here be understood in terms of the numbers of occurrences referred to in the method shown in FIG. 1.

It will be appreciated that at S206, as different numbers of manual identifications are performed, the cumulative number of participants counted differs, and the distribution of all possible results differs accordingly; in other words, a variety of result distributions can be obtained at S206.
At S208, an estimated accuracy for each outcome profile is determined.
Specifically, the estimated accuracy of each outcome distribution may be determined according to a conditional probability formula for each outcome distribution.
At S210, a distribution algorithm model is determined in combination with the accuracy requirement and the estimated accuracy of each result distribution.
At S212, it is determined whether the number of rounds of the performed manual recognition satisfies the total round number requirement.
At S214, if the number of rounds of manual identification already performed does not satisfy the total round number requirement, one further round of manual identification is performed and the identification results of this round are counted, obtaining a plurality of identification results.
Alternatively, if the total number of rounds of the manual recognition that has been performed at S212 satisfies the total number of rounds requirement, it is determined at S214 that the recognition fails.
At S216, the number of occurrences of each recognition result is counted, and a candidate result and an interference result are determined.
Optionally, the recognition result with the largest number of occurrences is determined as the candidate result, the recognition results among the plurality of recognition results whose length is consistent with that of the candidate result are determined as interference results, and the other recognition results, apart from the candidate result and the interference results, are discarded as erroneous.
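S216 might be sketched as follows (the function and variable names are illustrative assumptions):

```python
from collections import Counter

def split_results(recognition_results):
    """S216 sketch: the most frequent recognition result becomes the
    candidate result; results of the same length are kept as
    interference results; the remaining results are discarded."""
    occurrences = Counter(recognition_results)
    candidate, _ = occurrences.most_common(1)[0]
    interference = {r: n for r, n in occurrences.items()
                    if r != candidate and len(r) == len(candidate)}
    return candidate, interference

candidate, interference = split_results(['1024', '1024', '1034', '1024', '124'])
print(candidate)     # 1024
print(interference)  # {'1034': 1} -- '124' is dropped: its length differs
```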
At S218, it is judged according to the distribution algorithm model whether the candidate result is a determined valid result.

At S220, if the candidate result is a determined valid result, character statistics are performed on the candidate result and the interference result, and the recognition probability matrix between characters is accumulated.
At S220, the elements in the recognition probability matrix between characters are used to describe the probability of one character being recognized as another character. The per-character statistics of the candidate result and the interference result may be understood as counting, according to the number of occurrences of the candidate result and the number of occurrences of the interference result, the number of times each character is correctly recognized and the number of times each character is erroneously recognized; here, the number of times a character is erroneously recognized specifically refers to the number of times it is recognized as another character, for example the number of times 8 is recognized as 6, or the number of times 8 is recognized as 3.
Further, at S220, if the candidate result is a determined valid result, it is determined that the recognition succeeds, and the recognition result is returned. It is understood that the recognition result returned here is the candidate result.
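The per-character statistics at S220 could be sketched as below, treating the candidate result as ground truth (the mutable-dictionary interface is an assumption for illustration):

```python
from collections import defaultdict

def accumulate_confusions(candidate, interference, candidate_count, counts):
    """S220 sketch: every occurrence of the candidate result counts each
    of its characters as correctly recognized; every occurrence of an
    interference result is compared position by position against the
    candidate, accumulating either a correct recognition or a confusion
    such as '8' recognized as '3'."""
    for ch in candidate:
        counts[ch][ch] += candidate_count
    for result, n in interference.items():
        for truth, seen in zip(candidate, result):
            counts[truth][seen] += n  # seen == truth counts as correct

counts = defaultdict(lambda: defaultdict(int))
accumulate_confusions('86', {'36': 1}, candidate_count=4, counts=counts)
print(dict(counts['8']))  # {'8': 4, '3': 1}
print(dict(counts['6']))  # {'6': 5}
```

Normalizing these accumulated counts column by column yields the recognition probability matrix of equation (1).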
At S222, if the candidate result is not a determined valid result, it is judged according to the distribution algorithm model whether the candidate result is a determined invalid result.

At S224, if the candidate result is not a determined invalid result, the candidate result is looped over character by character, and it is judged whether the loop is completed.

If instead the candidate result is a determined invalid result, S212 and the subsequent steps are performed.

At S224, if the loop is completed, it is determined that the recognition is successful, and the recognition result is returned. It is understood that the recognition result returned here is the candidate result.
At S226, if the loop is not completed, the voting distribution of the current character is counted according to the candidate result and the interference result.
At S228, it is determined whether the current character is a determined valid character according to the distribution algorithm model.
At S230, if the current character is not a determined valid character, it is determined whether the current character is a determined invalid character according to the distribution algorithm model.
Alternatively, at S230, if the current character is a determined valid character, S224 and its subsequent steps are performed.
At S232, if the current character is not a determined invalid character, the recognition accuracy of the current character is determined according to the recognition probability matrix between characters and the conditional probability formula corresponding to the candidate result.
Alternatively, at S232, if the current character is a determined invalid character, S212 and the following steps are executed.
At S234, it is determined whether the recognition accuracy of the current character meets the accuracy requirement, and if so, S224 and subsequent steps are performed, otherwise, S212 and subsequent steps are performed.
It should be noted that although the above embodiment places S222 after S218 for convenience of description, in practice the execution order of these two steps may be switched: first judge whether the candidate result is a determined invalid result; if it is not, judge whether it is a determined valid result; if it is a determined valid result, execute S220, and if not, execute S224. Similarly, although S230 is placed after S228, the execution order of these two steps may also be switched: first judge whether the current character is a determined invalid character; if it is not, judge whether it is a determined valid character; if it is a determined valid character, execute S224 and its subsequent steps, and if not, determine the recognition accuracy of the current character according to the recognition probability matrix between characters and the conditional probability formula corresponding to the candidate result.
The method for data annotation according to the embodiments of the present application has been described in detail above in conjunction with FIG. 1 and FIG. 2. An electronic device according to an embodiment of the present application will be described in detail below with reference to FIG. 3. Referring to FIG. 3, at the hardware level, the electronic device includes a processor and, optionally, an internal bus, a network interface, and a memory. The memory may include an internal memory, such as a Random-Access Memory (RAM), and may further include a non-volatile memory, such as at least one disk storage. Of course, the electronic device may also include hardware required for other services.
The processor, the network interface, and the memory may be interconnected by an internal bus, which may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 3, but this does not indicate only one bus or one type of bus.
And the memory is used for storing programs. In particular, the program may include program code comprising computer operating instructions. The memory may include both memory and non-volatile storage and provides instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile memory into the memory and then runs the computer program, and the device for marking data is formed on the logic level. The processor is used for executing the program stored in the memory and is specifically used for executing the following operations:
acquiring a plurality of recognition results after multi-round recognition is carried out on data to be marked, wherein the data to be marked comprises at least one character;
determining a candidate result and an interference result of the plurality of recognition results when it is determined that different recognition results exist among the plurality of recognition results;
and judging the identification condition of the data to be marked according to the candidate result and the interference result, wherein the identification condition comprises successful identification and failed identification.
The method performed by the apparatus for data annotation disclosed in the embodiments of FIG. 1 and FIG. 2 of the present application may be applied to, or implemented by, a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may thus be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM or EPROM, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
The electronic device may also execute the method of fig. 2 and implement the functions of the apparatus for data annotation in the embodiment shown in fig. 2, which are not described herein again in this embodiment of the present application.
Of course, besides the software implementation, the electronic device of the present application does not exclude other implementations, such as a logic device or a combination of software and hardware, and the like, that is, the execution subject of the following processing flow is not limited to each logic unit, and may also be hardware or a logic device.
Embodiments of the present application also propose a computer-readable storage medium storing one or more programs, the one or more programs including instructions, which when executed by an electronic device including a plurality of application programs, enable the electronic device to perform the method of the embodiment shown in fig. 1 and 2, and in particular to perform the following method:
acquiring a plurality of recognition results after multi-round recognition is carried out on data to be marked, wherein the data to be marked comprises at least one character;
determining a candidate result and an interference result of the plurality of recognition results when it is determined that different recognition results exist among the plurality of recognition results;
and judging the identification condition of the data to be marked according to the candidate result and the interference result, wherein the identification condition comprises successful identification and failed identification.
Fig. 4 is a schematic structural diagram of an apparatus for data annotation according to an embodiment of the present application. Referring to fig. 4, in a software implementation, an apparatus 400 for data annotation may include: an acquiring unit 401, a determining unit 402, and a judging unit 403, wherein:
the acquiring unit 401 is configured to acquire a plurality of recognition results obtained by performing multiple rounds of recognition on data to be annotated, where the data to be annotated includes at least one character;
a determining unit 402 configured to determine a candidate result and an interference result of the plurality of recognition results when it is determined that different recognition results exist among the plurality of recognition results;
the judging unit 403 is configured to judge, according to the candidate result and the interference result, the identification condition of the data to be labeled, where the identification condition includes success in identification and failure in identification.
According to the apparatus for data labeling provided by the embodiment of the present application, when different recognition results exist among a plurality of recognition results of data to be labeled, the recognition condition of the data to be labeled can be judged according to the candidate result and the interference result among the plurality of recognition results. This solves the problem that, when the recognition condition can only be judged once all recognition results are identical, the effective utilization rate of the recognition results cannot be guaranteed, and thereby guarantees the effective utilization rate of the recognition results.
Optionally, as an embodiment, the determining unit 402:
and determining the candidate result and the interference result according to the occurrence frequency, wherein the candidate result is the recognition result with the largest occurrence frequency among the plurality of recognition results, and the interference result is a recognition result other than the candidate result among the plurality of recognition results.
Optionally, as an embodiment, the number of characters included in the interference result is the same as the number of characters included in the candidate result.
Optionally, as an embodiment, the determining unit 403:
judging the type of the candidate result according to a distribution algorithm model, the occurrence frequency of the candidate result and the occurrence frequency of the interference result, wherein the type of the candidate result comprises a confirmed effective result, a confirmed invalid result and a to-be-confirmed effective result, and the distribution algorithm model is obtained based on the accuracy requirement and the recognition result of the to-be-labeled training data through training;
and judging the identification condition of the data to be labeled according to the type of the candidate result.
Optionally, as an embodiment, the judging unit 403:

when the type of the candidate result is judged to be a determined effective result, determines that the data to be labeled is successfully identified, and determines the candidate result as the labeling result of the data to be labeled.
Optionally, as an embodiment, when determining that the type of the candidate result is a determination valid result, the determining unit 403:
and updating the times of correctly recognizing and the times of incorrectly recognizing at least one character in the data to be marked according to the times of the candidate results and the times of the interference results.
Optionally, as an embodiment, the determining unit 403:
when the type of the candidate result is judged to be an invalid result, judging the relationship between the number of rounds of identification operation executed on the data to be marked and a preset identification number of rounds;
when the number of rounds of executing the identification operation on the data to be marked is judged to be less than the preset identification number of rounds, executing the identification operation on the data to be marked again;
and updating the multiple identification results according to the result of executing the identification operation again aiming at the data to be marked, and judging the identification condition of the data to be marked according to the candidate result and the interference result in the updated multiple identification results.
Optionally, as an embodiment, the determining unit 403:
and determining that the identification of the data to be marked fails when judging that the number of rounds of identification operation performed on the data to be marked is equal to the preset identification number of rounds.
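A sketch of this retry rule, assuming illustrative names for the candidate types and actions:

```python
def next_action(candidate_type, rounds_done, preset_rounds):
    """Retry rule sketch: a candidate judged invalid triggers another
    recognition round while rounds remain; once the preset number of
    rounds has been reached, the data item is reported as a
    recognition failure."""
    if candidate_type != 'invalid':
        return 'judge'  # valid / to-be-confirmed candidates go on to judgment
    if rounds_done < preset_rounds:
        return 'recognize_again'
    return 'failed'

print(next_action('invalid', rounds_done=2, preset_rounds=3))  # recognize_again
print(next_action('invalid', rounds_done=3, preset_rounds=3))  # failed
```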
Optionally, as an embodiment, the determining unit 403:
when the type of the candidate result is judged to be an effective result to be determined, judging whether the candidate result meets the accuracy requirement;
and when the candidate result meets the accuracy requirement, determining that the data to be labeled is successfully identified, and determining the candidate result as the labeling result of the data to be labeled.
Optionally, as an embodiment, the determining unit 403:
determining a recognition accuracy rate of the at least one character;
and judging whether the candidate result meets the accuracy requirement or not according to the recognition accuracy of the at least one character and the accuracy requirement.
Optionally, as an embodiment, the determining unit 403:
judging the type of the at least one character according to the distribution algorithm model, the occurrence frequency of the candidate result and the occurrence frequency of the interference result, wherein the type of the at least one character comprises a confirmed effective character, a confirmed invalid character and a to-be-confirmed effective character;
and when the at least one character is judged to be the valid character to be determined, determining the recognition accuracy of the at least one character.
Optionally, as an embodiment, the at least one character is at least two characters;
wherein the judging unit 403:
determining a conditional probability formula corresponding to the candidate result;
and determining the recognition accuracy rate of the at least one character according to the conditional probability formula and a recognition probability matrix, wherein elements in the recognition probability matrix are used for describing the probability that one character in the at least one character is recognized as other characters in the at least one character.
Optionally, as an embodiment, before determining the recognition accuracy of the at least one character according to the conditional probability formula and the recognition probability matrix, the judging unit 403:
determining that the total number of times the at least one character is recognized is greater than or equal to a preset recognition number.
The apparatus 400 for data annotation can also perform the method of the embodiment shown in fig. 1 and fig. 2, and implement the functions of the apparatus for data annotation in the embodiment shown in fig. 1 and fig. 2, which are not described herein again in this embodiment of the present application.
In short, the above description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

Claims (15)

1. A method for data annotation, comprising:
acquiring a plurality of recognition results obtained by performing multiple rounds of recognition on data to be annotated, wherein the data to be annotated comprises at least one character;
determining a candidate result and interference results among the plurality of recognition results when it is determined that the plurality of recognition results are not all identical;
determining a recognition status of the data to be annotated according to the candidate result and the interference results, wherein the recognition status is either recognition success or recognition failure, and wherein the determining comprises:
determining a type of the candidate result according to a distribution algorithm model, the number of occurrences of the candidate result, and the number of occurrences of the interference results, wherein the type of the candidate result is one of a determined valid result, a determined invalid result, and a to-be-determined valid result, and the distribution algorithm model is trained based on an accuracy requirement and recognition results of training data to be annotated; and
determining the recognition status of the data to be annotated according to the type of the candidate result.
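The classification step in claim 1 can be illustrated with a minimal sketch. The trained distribution algorithm model is replaced here by two hypothetical thresholds (`valid_threshold`, `invalid_threshold`); all names are illustrative and not part of the patented implementation:

```python
def classify_candidate(candidate_count, interference_counts,
                       valid_threshold, invalid_threshold):
    """Illustrative stand-in for the trained distribution algorithm model:
    classify the candidate result from occurrence counts alone.

    In the claim, the decision boundaries would come from a model trained
    on the accuracy requirement and recognition results of training data;
    here they are simple thresholds on the candidate's share of all rounds.
    """
    total = candidate_count + sum(interference_counts)
    share = candidate_count / total
    if share >= valid_threshold:
        return "determined valid"
    if share <= invalid_threshold:
        return "determined invalid"
    return "to-be-determined valid"
```

For example, a candidate seen in 9 of 10 rounds against one interference result would be classified as a determined valid result under thresholds of 0.8 and 0.4.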
2. The method of claim 1, wherein determining the candidate result and the interference results among the plurality of recognition results comprises:
determining the candidate result and the interference results according to the number of occurrences, wherein the candidate result is the recognition result that occurs most frequently among the plurality of recognition results, and the interference results are the recognition results other than the candidate result.
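The frequency-based split of claim 2 can be sketched as follows; `split_results` and its return shape are illustrative names, not taken from the patent:

```python
from collections import Counter

def split_results(recognition_results):
    """Split multi-round recognition results into the candidate result
    (the most frequent result) and the interference results (all others).

    Returns (candidate, candidate_count, interference), where interference
    maps each remaining result to its occurrence count.
    """
    counts = Counter(recognition_results)
    candidate, candidate_count = counts.most_common(1)[0]
    interference = {r: n for r, n in counts.items() if r != candidate}
    return candidate, candidate_count, interference
```

With four rounds yielding `["ab12", "ab12", "ab12", "ab17"]`, the candidate is `"ab12"` (3 occurrences) and `"ab17"` is the single interference result.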
3. The method of claim 2, wherein each interference result comprises the same number of characters as the candidate result.
4. The method of claim 3, wherein determining the recognition status of the data to be annotated according to the type of the candidate result comprises:
when the type of the candidate result is determined to be a determined valid result, determining that recognition of the data to be annotated succeeds, and taking the candidate result as the annotation result of the data to be annotated.
5. The method of claim 4, further comprising, when the type of the candidate result is determined to be a determined valid result:
updating the number of times each of the at least one character in the data to be annotated has been recognized correctly and the number of times it has been recognized incorrectly, according to the number of occurrences of the candidate result and the number of occurrences of the interference results.
6. The method of claim 4, further comprising:
when the type of the candidate result is determined to be a determined invalid result, comparing the number of recognition rounds already performed on the data to be annotated with a preset number of recognition rounds;
when the number of recognition rounds already performed is less than the preset number of recognition rounds, performing a further recognition operation on the data to be annotated; and
updating the plurality of recognition results with the result of the further recognition operation, and determining the recognition status of the data to be annotated according to the candidate result and the interference results among the updated plurality of recognition results.
7. The method of claim 6, further comprising:
determining that recognition of the data to be annotated fails when the number of recognition rounds already performed on the data to be annotated equals the preset number of recognition rounds.
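The retry logic of claims 6 and 7 amounts to a bounded loop. The sketch below assumes two hypothetical callables, `recognize` (one recognition round) and `classify` (a stand-in for the candidate-type judgment); it is an illustration under those assumptions, not the patented implementation:

```python
def annotate(sample, recognize, max_rounds, classify):
    """Run recognition rounds until the candidate result is judged valid,
    or the preset number of rounds is exhausted (annotation then fails).

    recognize(sample) -> one recognition result per round.
    classify(results) -> ("valid", candidate) or ("invalid", None).
    """
    results = []
    for _ in range(max_rounds):
        results.append(recognize(sample))
        status, candidate = classify(results)
        if status == "valid":
            return ("success", candidate)
    return ("failure", None)
```

A toy `classify` that accepts a result once it has been seen twice would stop the loop as soon as two rounds agree, and report failure if `max_rounds` is reached first.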
8. The method of claim 3, further comprising:
when the type of the candidate result is determined to be a to-be-determined valid result, determining whether the candidate result meets the accuracy requirement; and
when the candidate result meets the accuracy requirement, determining that recognition of the data to be annotated succeeds, and taking the candidate result as the annotation result of the data to be annotated.
9. The method of claim 8, wherein determining whether the candidate result meets the accuracy requirement comprises:
determining a recognition accuracy rate of the at least one character; and
determining whether the candidate result meets the accuracy requirement according to the recognition accuracy rate of the at least one character and the accuracy requirement.
10. The method of claim 9, wherein determining the recognition accuracy rate of the at least one character comprises:
determining a type of the at least one character according to the distribution algorithm model, the number of occurrences of the candidate result, and the number of occurrences of the interference results, wherein the type of the at least one character is one of a determined valid character, a determined invalid character, and a to-be-determined valid character; and
when the at least one character is determined to be a to-be-determined valid character, determining the recognition accuracy rate of the at least one character.
11. The method of claim 10, wherein the at least one character is at least two characters, and wherein determining the recognition accuracy rate of the at least one character comprises:
determining a conditional probability formula corresponding to the candidate result; and
determining the recognition accuracy rate of the at least one character according to the conditional probability formula and a recognition probability matrix, wherein each element of the recognition probability matrix describes the probability that one of the at least one character is recognized as another of the at least one character.
12. The method of claim 11, further comprising, before determining the recognition accuracy rate of the at least one character according to the conditional probability formula and the recognition probability matrix:
determining that the total number of times the at least one character has been recognized is greater than or equal to a preset number of recognitions.
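One plausible reading of claims 11 and 12, under an assumption that characters are recognized independently: the accuracy of a multi-character candidate is the product of the per-character correct-recognition probabilities, taken from the diagonal of the recognition probability matrix. The names `candidate_accuracy`, `alphabet`, and `P` are illustrative; the patent's actual conditional probability formula is not specified in the claims:

```python
def candidate_accuracy(candidate, alphabet, P):
    """Estimate the recognition accuracy of a multi-character candidate
    from a recognition probability matrix P, where P[i][j] is the
    probability that character alphabet[i] is recognized as alphabet[j].

    Assuming independent per-character recognition, the probability that
    the whole candidate is correct is the product of the diagonal entries
    P[i][i] for each character in the candidate.
    """
    prob = 1.0
    for ch in candidate:
        i = alphabet.index(ch)
        prob *= P[i][i]
    return prob
```

For a two-symbol alphabet "01" with P = [[0.9, 0.1], [0.2, 0.8]], the candidate "01" would score 0.9 × 0.8 = 0.72 under this reading.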
13. An apparatus for data annotation, comprising:
an acquisition unit configured to acquire a plurality of recognition results obtained by performing multiple rounds of recognition on data to be annotated, wherein the data to be annotated comprises at least one character;
a determination unit configured to determine a candidate result and interference results among the plurality of recognition results when it is determined that the plurality of recognition results are not all identical; and
a judgment unit configured to determine a recognition status of the data to be annotated according to the candidate result and the interference results, wherein the recognition status is either recognition success or recognition failure, by:
determining a type of the candidate result according to a distribution algorithm model, the number of occurrences of the candidate result, and the number of occurrences of the interference results, wherein the type of the candidate result is one of a determined valid result, a determined invalid result, and a to-be-determined valid result, and the distribution algorithm model is trained based on an accuracy requirement and recognition results of training data to be annotated; and
determining the recognition status of the data to be annotated according to the type of the candidate result.
14. An electronic device, comprising:
a processor; and
a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the following operations:
acquiring a plurality of recognition results obtained by performing multiple rounds of recognition on data to be annotated, wherein the data to be annotated comprises at least one character;
determining a candidate result and interference results among the plurality of recognition results when it is determined that the plurality of recognition results are not all identical;
determining a recognition status of the data to be annotated according to the candidate result and the interference results, wherein the recognition status is either recognition success or recognition failure, and wherein the determining comprises:
determining a type of the candidate result according to a distribution algorithm model, the number of occurrences of the candidate result, and the number of occurrences of the interference results, wherein the type of the candidate result is one of a determined valid result, a determined invalid result, and a to-be-determined valid result, and the distribution algorithm model is trained based on an accuracy requirement and recognition results of training data to be annotated; and
determining the recognition status of the data to be annotated according to the type of the candidate result.
15. A computer-readable medium storing one or more programs that, when executed by an electronic device including a plurality of application programs, cause the electronic device to perform the following operations:
acquiring a plurality of recognition results obtained by performing multiple rounds of recognition on data to be annotated, wherein the data to be annotated comprises at least one character;
determining a candidate result and interference results among the plurality of recognition results when it is determined that the plurality of recognition results are not all identical;
determining a recognition status of the data to be annotated according to the candidate result and the interference results, wherein the recognition status is either recognition success or recognition failure, and wherein the determining comprises:
determining a type of the candidate result according to a distribution algorithm model, the number of occurrences of the candidate result, and the number of occurrences of the interference results, wherein the type of the candidate result is one of a determined valid result, a determined invalid result, and a to-be-determined valid result, and the distribution algorithm model is trained based on an accuracy requirement and recognition results of training data to be annotated; and
determining the recognition status of the data to be annotated according to the type of the candidate result.
CN201810115780.2A 2018-02-06 2018-02-06 Method and device for data annotation and electronic equipment Active CN108446695B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210197077.7A CN114677681A (en) 2018-02-06 2018-02-06 Method and device for data annotation and electronic equipment
CN201810115780.2A CN108446695B (en) 2018-02-06 2018-02-06 Method and device for data annotation and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810115780.2A CN108446695B (en) 2018-02-06 2018-02-06 Method and device for data annotation and electronic equipment

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202210197077.7A Division CN114677681A (en) 2018-02-06 2018-02-06 Method and device for data annotation and electronic equipment

Publications (2)

Publication Number Publication Date
CN108446695A CN108446695A (en) 2018-08-24
CN108446695B true CN108446695B (en) 2022-02-11

Family

ID=63191916

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201810115780.2A Active CN108446695B (en) 2018-02-06 2018-02-06 Method and device for data annotation and electronic equipment
CN202210197077.7A Pending CN114677681A (en) 2018-02-06 2018-02-06 Method and device for data annotation and electronic equipment

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202210197077.7A Pending CN114677681A (en) 2018-02-06 2018-02-06 Method and device for data annotation and electronic equipment

Country Status (1)

Country Link
CN (2) CN108446695B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110147852A (en) 2019-05-29 2019-08-20 北京达佳互联信息技术有限公司 Method, apparatus, equipment and the storage medium of image recognition

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164471A (en) * 2011-12-15 2013-06-19 盛乐信息技术(上海)有限公司 Recommendation method and system of video text labels
CN104795077A (en) * 2015-03-17 2015-07-22 北京航空航天大学 Voice annotation quality consistency detection method
CN105404896A (en) * 2015-11-03 2016-03-16 北京旷视科技有限公司 Annotation data processing method and annotation data processing system
CN105426826A (en) * 2015-11-09 2016-03-23 张静 Tag noise correction based crowd-sourced tagging data quality improvement method
CN105975980A (en) * 2016-04-27 2016-09-28 百度在线网络技术(北京)有限公司 Method of monitoring image mark quality and apparatus thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10282356B2 (en) * 2016-03-07 2019-05-07 International Business Machines Corporation Evaluating quality of annotation


Also Published As

Publication number Publication date
CN114677681A (en) 2022-06-28
CN108446695A (en) 2018-08-24

Similar Documents

Publication Publication Date Title
WO2020253636A1 (en) Sample label information verification method and device
CN112214418B (en) Application compliance detection method and device and electronic equipment
CN113383362B (en) User identification method and related product
CN112184143B (en) Model training method, device and equipment in compliance audit rule
CN109271611B (en) Data verification method and device and electronic equipment
CN113496208B (en) Video scene classification method and device, storage medium and terminal
CN112215230B (en) Information auditing method and device and electronic equipment
CN111104438A (en) Method and device for determining periodicity of time sequence and electronic equipment
CN108446695B (en) Method and device for data annotation and electronic equipment
CN106294765A (en) Process the method and device of news data
CN112949290A (en) Text error correction method and device and communication equipment
CN110110252B (en) Audio-visual program identification method, device and storage medium
CN111784246A (en) Logistics path estimation method
CN109120509B (en) Information collection method and device
CN111046232A (en) Video classification method, device and system
CN116030798A (en) Pre-training method of audio encoder, audio detection method and device
WO2018077059A1 (en) Barcode identification method and apparatus
CN110018844B (en) Management method and device of decision triggering scheme and electronic equipment
CN111612157B (en) Training method, character recognition device, storage medium and electronic equipment
CN112801130B (en) Image clustering quality evaluation method, system, medium, and apparatus
CN111708988A (en) Infringement video identification method and device, electronic equipment and storage medium
CN112232337A (en) Matching method of special language characters and information verification method and device
CN111966674A (en) Method and device for judging qualification of labeled data and electronic equipment
CN110598799A (en) Target detection result evaluation method, device, equipment and storage medium
CN111784248B (en) Logistics tracing method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20200922

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20200922

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: Fourth Floor, Capital Building, P.O. Box 847, Grand Cayman, Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

GR01 Patent grant
GR01 Patent grant