CN115659983A

CN115659983A - Data processing method and device, electronic equipment and storage medium

Info

Publication number: CN115659983A
Application number: CN202211433852.0A
Authority: CN
Inventors: 贾亚龙; 郭林海; 张琛; 万化
Original assignee: Shanghai Pudong Development Bank Co Ltd
Current assignee: Shanghai Pudong Development Bank Co Ltd
Priority date: 2022-11-16
Filing date: 2022-11-16
Publication date: 2023-01-31

Abstract

The invention discloses a data processing method, a data processing apparatus, an electronic device and a storage medium. The data processing method comprises: acquiring data to be corrected, the data to be corrected being text information whose entity categories have been labeled; inputting the data to be corrected into at least one pre-created target prediction model to obtain a prediction probability value of each character in the data to be corrected on each preset label, wherein the sum of the prediction probability values of each character over all preset labels is 1; and correcting the entity categories of the data to be corrected according to the prediction probability value of each character on each preset label and a preset probability threshold. Embodiments of the invention quickly correct wrongly labeled and missed entity categories that may exist in the data to be corrected, reduce the cost and the errors of manual correction, improve the accuracy of entity category correction, and improve the user experience.

Description

Data processing method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of computer application technologies, and in particular, to a data processing method and apparatus, an electronic device, and a storage medium.

Background

In a Named Entity Recognition (NER) task, the annotation data contain a large number of entities, so it is not realistic to expect annotators to label every entity correctly, and annotation errors are unavoidable. Annotation errors fall into two types: missed labels and wrong labels. In practice, missed labels reduce the number of labeled entities, and the unlabeled entities are treated as negative samples during training, which lowers NER metrics. Reducing missed labels in NER data can therefore effectively improve NER metrics.

In the prior art, there are two approaches to ensuring the accuracy of data annotation: manual data verification and model-based data verification. Manual data verification mainly rechecks the labeled data by hand; model-based data verification mainly trains a model on existing data, finds labeled data that may be problematic, and then either submits them for manual review or uses the model to correct them automatically.

However, labeled data sets are generally large, so manual data verification requires substantial labor cost, and correcting labels directly with a model may wrongly modify data that was labeled correctly.

Disclosure of Invention

The invention provides a data processing method, a data processing device, electronic equipment and a storage medium, which can reduce the cost of manually verifying data and improve the use experience of a user while realizing accurate data processing.

According to an aspect of the present invention, there is provided a data processing method including:

acquiring data to be corrected, the data to be corrected being text information whose entity categories have been labeled;

inputting the data to be corrected into at least one target prediction model which is created in advance, and obtaining the prediction probability value of each character in the data to be corrected on each preset label; the sum of the prediction probability values of each character on all preset labels is 1;

and correcting the entity category of the data to be corrected according to the prediction probability value of each character on each preset label and a preset probability threshold.

According to another aspect of the present invention, there is provided a data processing apparatus comprising:

the data acquisition module is used for acquiring data to be corrected, the data to be corrected being text information whose entity categories have been labeled;

the probability acquisition module is used for inputting the data to be corrected to at least one pre-established target prediction model to obtain the prediction probability value of each character in the data to be corrected on each preset label; the sum of the prediction probability values of each character on all preset labels is 1;

and the data correction module is used for correcting the entity type of the data to be corrected according to the prediction probability value of each character on each preset label and the preset probability threshold value.

According to another aspect of the present invention, there is provided an electronic apparatus including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the data processing method of any of the embodiments of the invention.

According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement a data processing method according to any one of the embodiments of the present invention when the computer instructions are executed.

According to the technical solution of the invention, the data to be corrected is acquired and input into at least one pre-created target prediction model to obtain the prediction probability value of each character in the data to be corrected on each preset label, and the entity category of the data to be corrected is corrected according to the prediction probability value of each character on each preset label and the preset probability threshold. In the embodiments, the data to be corrected is input into a plurality of target prediction models, which cross-validate each character to obtain the preset label to which the character belongs; the entity categories actually contained in the data to be corrected are determined from the preset labels of the characters, and the labeled entity categories are corrected accordingly. Wrong labels and missed labels that may exist in the data to be corrected are thereby corrected quickly, errors of manual correction are reduced, the accuracy of entity category correction is improved, and the user experience is improved.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

FIG. 1 is a flow chart of a data processing method provided according to an embodiment of the invention;

FIG. 2 is a flow chart of another data processing method provided in accordance with an embodiment of the present invention;

FIG. 3 is a flow chart of yet another data processing method provided in accordance with an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an electronic device implementing a data processing method according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

In an embodiment, FIG. 1 is a flowchart of a data processing method according to an embodiment of the present invention. This embodiment is applicable to correcting the entity categories of labeled data in a quality inspection scenario. The method may be executed by a data processing apparatus, which may be implemented in hardware and/or software and may be configured in an electronic device. As shown in FIG. 1, the method includes:

S110, acquiring data to be corrected; the data to be corrected is text information whose entity categories have been labeled.

The data to be corrected may be text information containing one or more labeled entity categories, and may consist of a plurality of characters. An entity category is a class of entity contained in the data to be corrected, and it may be labeled in advance. For example, entity categories may be labeled manually according to the entity characteristics of the data to be corrected, or labeled by a neural network model. There may be several entity categories, including, by way of example and not limitation, product name, interest rate, amount, term, and bank card number. The data to be corrected may include characters that belong to a labeled entity category and characters that do not. For example, a piece of data to be corrected may read "hello, the interest rate of XX product 112th issue is 0.05", where "XX product 112th issue" is the product name and "0.05" is the interest rate; these characters carry labeled labels that identify their entity category, whereas "hello" and "the interest rate of" do not belong to any labeled entity category and may carry no label. In an embodiment, the data to be corrected may be stored in a data set or a file; the data set or file may be obtained and the data to be corrected extracted from it. One piece of data to be corrected may contain text corresponding to several entity categories. In an embodiment, the labeled label of each character may be obtained from the labeled entity categories, and each character within a labeled entity category may correspond to one labeled label.
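As a minimal sketch of how one labeled label per character might be derived from labeled entity categories, the example below assumes a hypothetical annotation format of `(start, end_exclusive, category)` spans. The span format, the function name `spans_to_char_labels`, and the category name `interest_rate` are illustrative assumptions, not part of the described method.

```python
from typing import List, Optional, Tuple

# Hypothetical annotation format: (start, end_exclusive, entity_category).
Span = Tuple[int, int, str]


def spans_to_char_labels(text: str, spans: List[Span]) -> List[Optional[str]]:
    """Return one entry per character: the entity category, or None if unlabeled."""
    labels: List[Optional[str]] = [None] * len(text)
    for start, end, category in spans:
        for i in range(start, end):
            labels[i] = category
    return labels


text = "rate 0.05"
spans = [(5, 9, "interest_rate")]  # the characters "0.05"
char_labels = spans_to_char_labels(text, spans)
for ch, label in zip(text, char_labels):
    print(repr(ch), label)
```

Characters outside every span keep `None`, mirroring the statement that characters unrelated to a labeled entity category may carry no label.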

S120, inputting the data to be corrected into at least one target prediction model which is created in advance, and obtaining the prediction probability value of each character in the data to be corrected on each preset label; wherein the sum of the prediction probability values of each character on all preset labels is 1.

The target prediction model may be a model trained in advance that can predict the probability value of each character in the data to be corrected on each preset label. In practice, to improve the accuracy of the preset label assigned to each character, a plurality of target prediction models may be used to predict each character in the data to be corrected. Each piece of data to be corrected may include a plurality of characters, and the prediction probability values of a character on different preset labels may differ. The preset labels may be defined in advance; for example, they may be defined for the different entity categories using the BIO tagging scheme or the BIOES tagging scheme. With the BIO scheme, B indicates that the character is the beginning of an entity; I indicates that the character is in the middle or at the end of an entity; O indicates that the character is not part of any entity. With the BIOES scheme, B indicates that the character is the beginning of an entity; I indicates that the character is in the middle of an entity; O indicates that the character is not part of any entity; E indicates that the character is the end of an entity; S indicates that the character by itself is an entity. In an embodiment using the BIOES scheme, taking the amount entity category as an example, since no amount consists of a single character, the labels for amount data may include B-amount, I-amount, and E-amount. When the entity categories comprise the five categories product name, interest rate, amount, term, and bank card number and the BIOES scheme is used, no single-character entity exists, so each category has the three labels B, I, and E, and together with the non-entity label O there are 5 × 3 + 1 = 16 preset labels.
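The label count described above can be checked with a short sketch. The category identifiers and the helper `build_preset_labels` are hypothetical names chosen for illustration; only the B/I/E/O prefixes and the five categories come from the text.

```python
from typing import List, Sequence

# Illustrative identifiers for the five entity categories named in the text.
ENTITY_CATEGORIES = [
    "product_name", "interest_rate", "amount", "term", "bank_card_number",
]


def build_preset_labels(categories: Sequence[str],
                        prefixes: Sequence[str] = ("B", "I", "E")) -> List[str]:
    """Build the preset label set: the non-entity label O plus prefix-category labels.

    The S prefix is omitted by default because, as stated above, no
    single-character entity is assumed to exist.
    """
    labels = ["O"]
    for category in categories:
        labels.extend(f"{prefix}-{category}" for prefix in prefixes)
    return labels


PRESET_LABELS = build_preset_labels(ENTITY_CATEGORIES)
print(len(PRESET_LABELS))  # 5 categories x 3 prefixes + O = 16
```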

In one embodiment, the creation process of the target prediction model includes:

obtaining at least two original prediction models;

and training each original prediction model using a training data set and a bootstrap aggregation (Bagging) algorithm to obtain a corresponding target prediction model.

The Bootstrap aggregation (Bagging) algorithm may be an algorithm that sub-samples a training data set to form a sub-training data set required by each original prediction model, and synthesizes results predicted by all target prediction models to generate a final prediction result. The original prediction model may refer to a model created in advance for training. The training data set may be composed of text information of labeled entity classes, and the original prediction model may be trained as the target prediction model through the training data set.

In an embodiment, two original prediction models may be obtained, and the data in the training data set are input to each original prediction model separately, so that each original prediction model is trained on the training data set. Because the original prediction models may differ, their training results may differ. The bootstrap aggregation algorithm is then applied to the trained models, and the aggregated result is taken as the final result, yielding the corresponding target prediction models.
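The sub-sampling step of bootstrap aggregation can be sketched as follows. This is a generic illustration of bagging-style resampling (sampling with replacement, one subset per model); the function name, the fixed seed, and the choice of subset size equal to the data set size are assumptions, and the text does not specify the exact sampling procedure used.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def bootstrap_subsets(dataset: Sequence[T], n_models: int, seed: int = 0) -> List[List[T]]:
    """Draw one sub-training set per model by sampling with replacement."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    size = len(dataset)
    return [
        [dataset[rng.randrange(size)] for _ in range(size)]
        for _ in range(n_models)
    ]


training_set = [f"sample-{i}" for i in range(10)]
subsets = bootstrap_subsets(training_set, n_models=2)
for index, subset in enumerate(subsets):
    print(index, subset)
```

Each original prediction model would then be trained on its own subset, so the resulting target prediction models see slightly different data.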

In an embodiment, the data to be corrected is input into a pre-created target prediction model, which splits the data to be corrected into characters and predicts the prediction probability value of each character on each preset label. In an embodiment, there may be a plurality of target prediction models, and the prediction probability values that different target prediction models produce for a character on each preset label may differ. The values predicted by the different target prediction models may be averaged and the average used as the prediction probability value of each character on each preset label; alternatively, the bootstrap aggregation algorithm may be used to integrate the prediction probability values output by the target prediction models, and the integrated value used as the prediction probability value of each character in the data to be corrected on each preset label. In one embodiment, the sum of the prediction probability values of each character over all preset labels is 1. That is, when there are 16 preset labels, the prediction probability values of a character on the 16 labels sum to 1. For example, if a character belongs to an amount and its prediction probability value on the B-amount label is 80%, the sum of its prediction probability values on the other 15 labels is 20%.
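The averaging option can be illustrated with a small sketch. The dictionaries below stand in for the per-character outputs of two target prediction models and use only three labels for brevity; the values and the function name `average_distributions` are made up for illustration. Averaging distributions that each sum to 1 yields a distribution that also sums to 1.

```python
from typing import Dict, List

Distribution = Dict[str, float]  # preset label -> prediction probability value


def average_distributions(per_model: List[Distribution]) -> Distribution:
    """Average the per-label probabilities that several models give one character."""
    n_models = len(per_model)
    labels = per_model[0].keys()
    return {label: sum(d[label] for d in per_model) / n_models for label in labels}


# Hypothetical outputs of two models for the same character.
model_a = {"B-amount": 0.80, "I-amount": 0.15, "O": 0.05}
model_b = {"B-amount": 0.90, "I-amount": 0.05, "O": 0.05}

averaged = average_distributions([model_a, model_b])
print(averaged)
print(sum(averaged.values()))
```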

S130, correcting the entity type of the data to be corrected according to the prediction probability value of each character on each preset label and a preset probability threshold value.

The preset probability threshold is a threshold set in advance for deciding whether a character corresponds to a preset label. It may be set according to user requirements; for example, it may be, without limitation, 80%, 90%, or 95%. When the prediction probability value of a character on a preset label is greater than the preset probability threshold, the character is confirmed to belong to that preset label.

In an embodiment, after the prediction probability value of each character in the data to be corrected on each preset label is determined, each of these values may be compared with the preset probability threshold. Because the prediction probability values of a character over all preset labels sum to 1, at most one preset label of a character can have a prediction probability value greater than the preset probability threshold, provided the threshold is at least 50%. The preset label whose prediction probability value exceeds the preset probability threshold may be taken as the preset label of the character, and the entity category of the data to be corrected may be corrected according to it. In practice, the preset probability threshold may be set as required: the higher the required confidence, the larger the threshold. In an embodiment, when the predicted probability of the preset label B-amount for a character is greater than the preset probability threshold, the preset label of that character may be taken to be B-amount. After the preset label of each character is confirmed, it is compared with the character's labeled label, and when the two do not agree, the preset label may be used as the label of the character. After the label of each character is determined, the characters may be combined into entities according to their labels, and it is determined whether each labeled entity category is the same as the entity category obtained from the combined labels; if not, the data to be corrected is considered to be mislabeled, and its entity category is corrected.
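The comparison and recombination steps of S130 might look like the sketch below. It is a simplified illustration under stated assumptions: the function names, the example probabilities, and the 0.9 threshold are hypothetical, the distributions list only the non-zero labels, and the decoder does not check that B, I, and E labels of one entity share the same category.

```python
from typing import Dict, List, Tuple


def correct_labels(annotated: List[str],
                   predictions: List[Dict[str, float]],
                   threshold: float = 0.9) -> List[str]:
    """Replace a labeled label only when a different preset label exceeds the threshold."""
    corrected = []
    for labeled, distribution in zip(annotated, predictions):
        best = max(distribution, key=distribution.get)
        if distribution[best] > threshold and best != labeled:
            corrected.append(best)
        else:
            corrected.append(labeled)
    return corrected


def decode_entities(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Combine per-character BIOES labels into (start, end_exclusive, category) entities."""
    entities, start = [], None
    for i, label in enumerate(labels):
        prefix, _, category = label.partition("-")
        if prefix == "B":
            start = i
        elif prefix == "E" and start is not None:
            entities.append((start, i + 1, category))
            start = None
        elif prefix == "S":
            entities.append((i, i + 1, category))
            start = None
        elif prefix == "O":
            start = None
    return entities


# The first character was missed by the annotator (labeled O instead of B-amount).
annotated = ["O", "I-amount", "E-amount"]
predictions = [
    {"B-amount": 0.95, "O": 0.05},
    {"I-amount": 0.97, "O": 0.03},
    {"O": 0.60, "E-amount": 0.40},  # below the threshold: labeled label is kept
]
corrected = correct_labels(annotated, predictions)
print(corrected)
print(decode_entities(annotated), decode_entities(corrected))
```

Keeping the labeled label whenever no prediction exceeds the threshold reflects the concern raised in the Background that a model should not overwrite correct annotations on weak evidence.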

In this embodiment of the invention, the data to be corrected is acquired and input into a plurality of pre-created target prediction models to obtain the prediction probability value of each character in the data to be corrected on each preset label; the entity categories actually contained in the data to be corrected are determined from the preset label of each character, and the labeled entity categories are corrected accordingly. Wrong labels and missed labels that may exist among the entity categories of the data to be corrected are thereby corrected. Comparing the prediction probability value of each character on each preset label with the preset probability threshold improves the accuracy of entity category correction and the user experience.

In an embodiment, FIG. 2 is a flowchart of another data processing method according to an embodiment of the present invention. This embodiment describes, on the basis of the foregoing embodiment, an implementation for correcting the entity category of the data to be corrected. As shown in FIG. 2, the method includes:

S210, acquiring each labeled entity category contained in the data to be corrected.

In an embodiment, one piece of data to be corrected may include a plurality of labeled entity categories, and after the data to be corrected is acquired, the labeled entity categories included in the data to be corrected may be extracted, and each labeled entity category may have an entity feature that is distinguished from other entity categories, respectively.

S220, correcting the entity category of the data to be corrected according to the entity features of the labeled entity categories.

The entity features are the intrinsic characteristics of each entity category and serve to distinguish different entity categories. In one embodiment, when the entity categories are the five categories product name, interest rate, amount, term, and bank card number, their characteristics differ markedly. Illustratively, a product name may consist of a name plus a number, such as "XX product 1st issue"; an interest rate may be a number followed by a percent sign, such as 0.05% or 0.02%; an amount may be an integer multiple of 100, such as 1000 or 10000; a term may be a period or a number of years; and a bank card number may be a four-digit tail number or a string of a dozen or so digits.

In this embodiment, by extracting the entity features of the labeled entity categories and comparing them with each entity category contained in the data to be corrected, it can be determined whether the entity categories of the data to be corrected are correct, and wrongly labeled entity categories in the data to be corrected are corrected.

In some embodiments, S220 comprises:

S2201, obtaining the labeled text associated with each labeled entity category.

S2202, determining the matching condition of the text characteristic of each labeled text and the entity characteristic of the corresponding labeled entity category.

S2203, correcting the entity type of the data to be corrected according to the matching condition.

The labeled text is the text in a labeled entity category that has certain characteristics. When the entity categories are the five categories product name, interest rate, amount, term, and bank card number, each entity category has associated labeled text, and each labeled text has text features that distinguish it from other text. Illustratively, a product name may consist of a name plus a number, such as "XX product 1st issue"; an interest rate may be a number followed by a percent sign, such as 0.05% or 0.02%; an amount may be an integer multiple of 100, such as 1000 or 10000; a term may be a period or a number of years; and a bank card number may be a four-digit tail number or a string of a dozen or so digits. The labeled text associated with each labeled entity category may be determined from the text features of the different entity categories.

In this embodiment, the labeled texts associated with the labeled entity categories may be obtained and the entity features of the labeled entity categories extracted. The text features of each labeled text are then matched against the entity features of its labeled entity category. When they match, the entity category of the data to be corrected is considered correct; when they do not match, the entity category of the data to be corrected may be corrected according to the entity features of the labeled entity categories, thereby correcting the entity category of the data to be corrected.
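The matching in S2202 can be sketched with simple rules. The patterns below are hypothetical readings of the illustrative features given above (a number followed by a percent sign, a multiple of 100, four digits or roughly a dozen digits); the exact rules, the 12 to 19 digit range, and the function name are assumptions rather than rules stated in the text.

```python
import re


def matches_entity_features(text: str, category: str) -> bool:
    """Check whether a labeled text fits the assumed features of its labeled category."""
    if category == "interest_rate":
        # A number followed by a percent sign, e.g. 0.05%.
        return re.fullmatch(r"[0-9]+(\.[0-9]+)?%", text) is not None
    if category == "amount":
        # An integer multiple of 100, e.g. 1000.
        return re.fullmatch(r"[0-9]+", text) is not None and int(text) % 100 == 0
    if category == "bank_card_number":
        # A four-digit tail number or a longer digit string (length range assumed).
        return re.fullmatch(r"[0-9]{4}|[0-9]{12,19}", text) is not None
    # No rule defined for this category in the sketch: do not flag it.
    return True


labeled_texts = [("0.05%", "interest_rate"), ("1050", "amount"), ("6222", "bank_card_number")]
for text, category in labeled_texts:
    print(text, category, matches_entity_features(text, category))
```

A labeled text that fails its category's rule, such as `1050` labeled as an amount here, would be a candidate for correction in S2203.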

S230, acquiring data to be corrected.

S240, inputting the data to be corrected into at least one target prediction model which is created in advance, and obtaining the prediction probability value of each character in the data to be corrected on each preset label.

S250, correcting the entity category of the data to be corrected according to the prediction probability value of each character on each preset label and a preset probability threshold.

On the basis of the foregoing embodiment, this embodiment obtains each labeled entity category contained in the data to be corrected, performs a simple screening of the labeled entity categories according to the entity features of each category to make a preliminary correction, and then inputs the data to be corrected into the model for a further, more complex screening that corrects the labeled entity categories again. The labeled entity categories are thus checked twice, which prevents wrong labels and missed labels from being overlooked during correction and further improves the accuracy of entity category correction.

In an embodiment, FIG. 3 is a flowchart of yet another data processing method according to an embodiment of the present invention. On the basis of the foregoing embodiments, this embodiment describes how the preset probability threshold and the prediction probability values are determined and how the entity category of the data to be corrected is corrected. As shown in FIG. 3, the method includes:

S310, inputting a pre-acquired training data set into a pre-established target prediction model to obtain a prediction probability value of each character on each preset label.

The training data set may refer to a data set for training the target model, and the training data set may be composed of data of labeled entity classes. Each piece of data can be composed of a plurality of characters, each character can correspond to one preset label, and the prediction probability value of each character on each preset label can be obtained through a pre-established target prediction model.

In an embodiment, after the training data set is input into the pre-created target prediction model, the target prediction model may extract the characters of each piece of data in the training data set and predict the prediction probability value of each character on each preset label. In practice, the number of preset labels may be determined by the entity categories: with 5 entity categories and the BIOES tagging scheme, and since no single-character entity exists, there are 16 labels in total, that is, each character has a prediction probability value on each of the 16 labels.

S320, obtaining the probability distribution of each preset label according to the prediction probability value of each character on each preset label.

In the embodiment, after the prediction probability value of each character on each preset label is determined, each character has one prediction probability value on any preset label, and the probability distribution condition of the corresponding preset label can be obtained according to the prediction probability value on the same preset label. In actual operation, each preset tag may correspond to a probability distribution.

S330, configuring a preset probability threshold corresponding to the preset label according to the probability distribution condition and the preset confidence.

The preset confidence level can be the reliability of a preset label set by a user according to requirements, and the higher the preset confidence level is, the larger the corresponding preset probability threshold value can be set.

In an embodiment, the preset probability threshold of a preset label may be determined from the probability distribution and the preset confidence: a mean is obtained from the probability distribution, and the preset probability threshold of the preset label is determined from the mean and the preset confidence. In one embodiment, the mean may be obtained by fitting the probability distribution. When the preset confidence is high, a relatively large preset probability threshold may be chosen relative to the mean; when the preset confidence is low, a relatively small one may be chosen.
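The text does not give a formula for combining the mean with the preset confidence, so the sketch below shows only one possible reading: fit a normal distribution to the prediction probability values collected for a preset label and place the threshold at the quantile given by the confidence. The normal fit, the quantile rule, the clamping to [0, 1], and the function name `configure_threshold` are all assumptions made for illustration.

```python
from statistics import NormalDist, mean, pstdev
from typing import Sequence


def configure_threshold(probabilities: Sequence[float], confidence: float) -> float:
    """Derive a per-label threshold from the mean of the distribution and a confidence.

    Assumed rule: threshold = mean + z * std, where z is the standard normal
    quantile of the confidence (0 < confidence < 1). Higher confidence gives a
    larger threshold; the result is clamped to the range [0, 1].
    """
    mu = mean(probabilities)
    sigma = pstdev(probabilities)
    z = NormalDist().inv_cdf(confidence)
    return min(1.0, max(0.0, mu + z * sigma))


# Hypothetical prediction probability values observed for one preset label.
observed = [0.70, 0.80, 0.90]
threshold_low = configure_threshold(observed, confidence=0.50)
threshold_high = configure_threshold(observed, confidence=0.95)
print(threshold_low, threshold_high)
```

With a confidence of 0.5 the threshold equals the mean, and it moves above the mean as the confidence increases, which matches the qualitative behavior described above.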

S340, acquiring data to be corrected.

S350, inputting the data to be corrected into at least one pre-established target prediction model to obtain the prediction probability value of each character output by each target prediction model on each preset label.

In an embodiment, the data to be corrected may be input into a pre-created target prediction model, and the target prediction model may split each character in the data to be corrected and predict a prediction probability value of each character on each preset tag. In an embodiment, the number of the target prediction models may be multiple, the data to be corrected is input to the multiple target prediction models, and the prediction probability value of each character output by each target prediction model on each preset label may be obtained. Wherein the predicted probability value of each character output by each target model on each preset label can be different.

S360, integrating the prediction probability values output by the target prediction models using the bootstrap aggregation (Bagging) algorithm, and taking the result as the prediction probability value of each character in the data to be corrected on each preset label.

In an embodiment, because the prediction probability values output by different target prediction models for each character on each preset label may differ, they need to be integrated. One way to integrate them is the bootstrap aggregation algorithm: the value it produces is taken as the prediction probability value of each character in the data to be corrected on each preset label. In an embodiment, the prediction probability value of each character on each preset label is the value that occurs most often across all the models. For example, with 3 target prediction models, if two of them give a character a prediction probability value of 80% on a preset label and the third gives 70%, the prediction probability value of that character on that preset label is taken to be 80%.
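The "most frequent value" integration in the example above can be sketched as follows. Two details are assumptions added for the sketch and are not stated in the text: values are rounded to two decimals before counting, since raw model outputs rarely coincide exactly, and the average is used as a fallback when no value repeats.

```python
from collections import Counter
from statistics import mean
from typing import Sequence


def integrate_by_vote(values: Sequence[float], ndigits: int = 2) -> float:
    """Return the prediction probability value that occurs most often across models."""
    counts = Counter(round(value, ndigits) for value in values)
    value, count = counts.most_common(1)[0]
    if count == 1:
        # No value repeats: fall back to the average (assumed tie-breaking rule).
        return mean(values)
    return value


# Three models score one character on one preset label.
integrated = integrate_by_vote([0.80, 0.80, 0.70])
print(integrated)
```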

And S370, determining a probability comparison result between the prediction probability value of each character on each preset label and a preset probability threshold value.

In an embodiment, after determining the prediction probability value of each character on each preset tag, a preset probability threshold may be extracted, the prediction probability value of each character on each preset tag is compared with the preset probability threshold, and a probability comparison result between the prediction probability value of each character on each preset tag and the preset probability threshold is determined. The probability comparison result may include that the predicted probability value is greater than a preset probability threshold, that the predicted probability value is equal to the preset probability threshold, and that the predicted probability value is smaller than the preset probability threshold.

And S380, determining the prediction confidence of each character on the corresponding preset label according to the probability comparison result.

In an embodiment, after the probability comparison result is determined, the prediction confidence of each character on the corresponding preset label may be determined according to the probability comparison result. When the prediction probability value of a character on a preset label is greater than or equal to the preset probability threshold, the character can be considered to have a relatively high prediction confidence on that label; when the prediction probability value is smaller than the preset probability threshold, the prediction confidence of the character on that label can be considered relatively low. In one embodiment, when the prediction probability value of a character on a preset label is greater than the preset probability threshold, the confidence of the character on that label is higher.
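As a minimal sketch of steps S370 and S380, the comparison and the resulting confidence can be expressed as below. The names `prediction_confidence` and `confidence_per_label`, the "high"/"low" strings, and the threshold values are illustrative assumptions.

```python
def prediction_confidence(prob, threshold):
    """Return "high" when prob >= threshold, otherwise "low"."""
    return "high" if prob >= threshold else "low"

def confidence_per_label(char_probs, thresholds):
    """Map each preset label of one character to its prediction confidence."""
    return {
        label: prediction_confidence(prob, thresholds[label])
        for label, prob in char_probs.items()
    }

# Hypothetical probabilities of one character and per-label thresholds.
char_probs = {"B-amount": 0.91, "I-amount": 0.05, "O": 0.04}
thresholds = {"B-amount": 0.85, "I-amount": 0.85, "O": 0.85}
result = confidence_per_label(char_probs, thresholds)
```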

And S390, correcting the entity type of the data to be corrected according to the prediction confidence coefficient and the labeled label of the corresponding character.

In an embodiment, after the prediction confidence of each character on the corresponding preset label is determined, the entity category of the data to be corrected may be corrected according to the prediction confidence and the labeled tag of the corresponding character. In an embodiment, when the confidence of a character on B-amount is high but its labeled tag is O or another tag, the labeled tag of that character can be considered missed or wrong, and the tag of the character is corrected to B-amount. After all labeled tags are corrected, the entity category of the data to be corrected can be corrected according to the corrected tags. Because the tags are generated from the entity category at labeling time, the entity boundary can be confirmed from the corrected tags. In one embodiment, when the entity category is amount, the tags may include B-amount, I-amount and E-amount, so a consecutive run of B-amount, I-amount and E-amount can be regarded as one entity; when the entity category labeled in the data to be corrected differs from the entity category confirmed from the tags, the labeled entity category may be corrected to the one confirmed from the tags.
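The tag correction and the grouping of consecutive B/I/E tags into one entity can be sketched as follows. This is a simplified illustration under assumed names (`correct_tags`, `extract_entities`) and an assumed input layout (one annotated tag, one predicted tag and one high-confidence flag per character).

```python
def correct_tags(annotated, predicted, confident):
    """Replace an annotated tag with the predicted tag when the
    prediction is high-confidence and the two tags differ."""
    return [
        pred if is_conf and pred != ann else ann
        for ann, pred, is_conf in zip(annotated, predicted, confident)
    ]

def extract_entities(tags):
    """Group consecutive B-x, I-x, ..., E-x tags into (start, end, type)
    spans, with end inclusive. Malformed runs are discarded."""
    entities, start, kind = [], None, None
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        if prefix == "B":
            start, kind = i, name
        elif prefix == "I" and start is not None and name == kind:
            continue
        elif prefix == "E" and start is not None and name == kind:
            entities.append((start, i, kind))
            start, kind = None, None
        else:
            start, kind = None, None
    return entities

# The annotator missed the first character of an amount entity.
annotated = ["O", "O", "I-amount", "E-amount", "O"]
predicted = ["O", "B-amount", "I-amount", "E-amount", "O"]
confident = [True, True, True, True, True]
corrected = correct_tags(annotated, predicted, confident)
entities = extract_entities(corrected)
```

In this example the annotated tags alone yield no complete entity, while the corrected tags yield one amount entity covering characters 1 to 3.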

According to the embodiment of the invention, the pre-acquired training data set is input into the pre-established target prediction model to obtain the prediction probability value of each character on each preset label, the probability distribution condition of the corresponding preset label is obtained according to the prediction probability value of each character on each preset label, the preset probability threshold value of the corresponding preset label is determined, the determination of the preset threshold value is realized, and the rationality of the preset probability threshold value is improved. The data to be corrected is input into at least one pre-established target prediction model, the prediction probability value of each character output by each target prediction model on each preset label is obtained, the prediction probability values output by each target prediction model are integrated by adopting a guide aggregation algorithm, the prediction probability value of each character in the data to be corrected on each preset label is determined, and the accuracy of the prediction probability value is improved. The probability comparison result between the prediction probability value of each character on each preset label and the preset probability threshold is determined, the prediction confidence coefficient of each character on the corresponding label is further determined, and the entity category of the data to be corrected is corrected according to the prediction confidence coefficient and the labeled label of the corresponding character, so that the accurate correction of the entity category of the data to be corrected is realized, the errors of manual correction are reduced, and the use experience of a user is improved.

In an implementation, this embodiment gives a specific description of the data processing method, taking quality inspection of a telemarketing scenario for XXX products as the scenario and a test data set as the data to be corrected. In this quality inspection scenario, the identified entity categories may include product name, interest rate, amount, term, and bank card number.

For the mislabeling problem in the data to be corrected, the data can be corrected by using the characteristics of the entities. Entity categories such as product name, interest rate, amount, term, and bank card number are clearly distinguishable from one another; in actual operation, entity category mislabeling is mainly caused by annotator fatigue or mis-clicks. Because the categories differ so much, category mislabeling can be resolved effectively: statistics are gathered for each entity category, entities that obviously do not belong to the category are searched for, and the texts containing those entities are then located for manual correction. Illustratively, both the amount and the card number are numbers, but they can still be told apart. The amount may be an integral multiple of 100, whereas the bank card number is a four-digit tail number or a number of more than ten digits, and these entity characteristics can be used to find entities whose category may have been mislabeled. Similarly, correcting the entity category of the data to be corrected through entity characteristics can also effectively resolve boundary labeling errors. In one embodiment, entity labeling follows a labeling specification; for "amount" entities, for example, the specification states that only the digits are labeled and that units such as "yuan" or "kuai" are not. Out of daily habit, an annotator may label the unit as part of the entity, and correction based on entity characteristics can fix this part of the labeled data.
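The entity-characteristic check described above can be sketched with simple rules. The rules here are assumptions chosen to mirror the example in the text (amounts as digit-only multiples of 100; card numbers as a four-digit tail or 11 to 19 digits); the category keys and function names are hypothetical, and real rules would be tuned per scenario.

```python
import re

def looks_like_amount(text):
    """Assumed rule: ASCII digits only and a positive multiple of 100."""
    return re.fullmatch(r"[0-9]+", text) is not None and int(text) > 0 and int(text) % 100 == 0

def looks_like_card_number(text):
    """Assumed rule: a four-digit tail number, or 11 to 19 digits."""
    return re.fullmatch(r"[0-9]{4}|[0-9]{11,19}", text) is not None

CHECKS = {"amount": looks_like_amount, "card_number": looks_like_card_number}

def find_suspects(entities):
    """entities: list of (text, category) pairs taken from the annotations.
    Returns the pairs whose text fails the rule of its annotated category."""
    return [
        (text, category)
        for text, category in entities
        if category in CHECKS and not CHECKS[category](text)
    ]

labeled = [
    ("5000", "amount"),
    ("6222021234567890", "amount"),   # likely a card number labeled as amount
    ("8812", "card_number"),
    ("300元", "amount"),               # unit included: boundary error
]
suspects = find_suspects(labeled)
```

The flagged pairs are only candidates; as in the text, the containing texts would then be reviewed manually.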

For the missed-label problem in the data to be corrected, preset labels can be set according to the entity categories. The preset labels may be set using, but are not limited to, the BIO labeling method or the BIOES labeling method, and the choice of labeling scheme has no influence on the solution. When the BIOES labeling method is used, B indicates that the character is the beginning of an entity, E indicates that the character is the end of an entity, I indicates that the character is in the middle of an entity, S indicates that the character alone forms an entity, and O indicates that the character is not part of any entity. There are no single-character entities in the actual data of this scenario, so the scheme in effect is BIOE. Illustratively, the amount category comprises the three labels "B-amount", "I-amount" and "E-amount". The entity categories may include five types: product name, amount, interest rate, term, and bank card number. When the BIOES labeling method is adopted, since there are no single-character entities, there are 16 labels in total (three labels for each entity type plus the O label), and each character in a piece of data to be corrected may be predicted as any one of the 16 labels. For data to be corrected whose entity categories have been labeled, K-fold cross validation can be used to divide the data set into a training data set and a test data set. A plurality of original prediction models can be trained with the training data set to determine the target prediction models. The original prediction models may be the same or different. When the training data set is input to the original prediction model, each character in the text has a prediction probability value on each label, and the prediction probability values of a single character over the 16 labels sum to 1.
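The label count above can be checked with a short sketch. The entity type identifiers are hypothetical spellings of the five categories, and the softmax normalization is an assumption: the text only requires that the probabilities of one character sum to 1, without naming how the model achieves this.

```python
import math

ENTITY_TYPES = ["product_name", "amount", "interest_rate", "term", "card_number"]

def build_labels(entity_types, prefixes=("B", "I", "E")):
    """BIOE label set: the O label plus one label per prefix and entity type."""
    return ["O"] + [f"{p}-{e}" for e in entity_types for p in prefixes]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1 (assumed normalization)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

LABELS = build_labels(ENTITY_TYPES)
# Hypothetical raw scores of one character over all labels.
probs = softmax([0.1 * i for i in range(len(LABELS))])
```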
Combined with the labeled tags, a probability distribution can be obtained for each label. For example, every character labeled "B-ProductName" (the beginning of a product name entity) in the data has a prediction probability value on the "B-ProductName" label, and these values form a probability distribution. The mean u of the distribution can be computed directly or fitted from the distribution. If a high preset confidence is required, a relatively large preset probability threshold can be chosen according to the distribution; if a high preset confidence is not required, a relatively small one can be chosen.
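One way to turn the per-label distribution into a threshold is sketched below. The text only says that the threshold is chosen larger or smaller according to the distribution and the required confidence; using the mean u for a strict setting and the mean minus one standard deviation for a looser one is an assumption made for illustration, as are the function names.

```python
from collections import defaultdict
from statistics import mean, pstdev

def label_distributions(annotated_tags, char_probs):
    """For each label, collect the probability the model assigns to that
    label on the characters annotated with it."""
    dist = defaultdict(list)
    for tag, probs in zip(annotated_tags, char_probs):
        dist[tag].append(probs[tag])
    return dist

def choose_threshold(values, strict=True):
    """Assumed rule: the mean u when strict, else u minus one standard deviation."""
    u = mean(values)
    return u if strict else max(0.0, u - pstdev(values))

# Hypothetical training characters with a reduced label set.
annotated_tags = ["B-ProductName", "I-ProductName", "B-ProductName", "O"]
char_probs = [
    {"B-ProductName": 0.90, "I-ProductName": 0.05, "O": 0.05},
    {"B-ProductName": 0.10, "I-ProductName": 0.80, "O": 0.10},
    {"B-ProductName": 0.70, "I-ProductName": 0.20, "O": 0.10},
    {"B-ProductName": 0.02, "I-ProductName": 0.03, "O": 0.95},
]
dist = label_distributions(annotated_tags, char_probs)
strict_threshold = choose_threshold(dist["B-ProductName"])
loose_threshold = choose_threshold(dist["B-ProductName"], strict=False)
```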

For the test data set, the data to be corrected can be input into a target prediction model to obtain the prediction probability value of each character on each label. Illustratively, when a character is predicted as "B-ProductName", its prediction confidence may be considered relatively low if the probability value is less than the preset probability threshold and relatively high if the probability value is greater than the preset probability threshold. If a character is predicted as "B-ProductName" with high prediction confidence but its labeled tag is "O", the character may have been missed during labeling.
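The missed-label check on the test data can be sketched as follows. The function name `flag_missed_labels`, the reduced label set, and the sample probabilities are assumptions; the sketch treats a probability equal to the threshold as high confidence.

```python
def flag_missed_labels(chars, annotated, char_probs, thresholds):
    """Return (index, character, predicted label) for characters annotated
    as O whose top predicted label is an entity label with high confidence."""
    flagged = []
    for i, (ch, tag, probs) in enumerate(zip(chars, annotated, char_probs)):
        best = max(probs, key=probs.get)
        if tag == "O" and best != "O" and probs[best] >= thresholds[best]:
            flagged.append((i, ch, best))
    return flagged

chars = ["存", "5", "0", "0", "元"]
annotated = ["O", "O", "I-amount", "E-amount", "O"]
char_probs = [
    {"O": 0.98, "B-amount": 0.02},
    {"O": 0.07, "B-amount": 0.93},
    {"O": 0.05, "I-amount": 0.95},
    {"O": 0.04, "E-amount": 0.96},
    {"O": 0.60, "E-amount": 0.40},
]
thresholds = {"B-amount": 0.85, "I-amount": 0.85, "E-amount": 0.85}
flagged = flag_missed_labels(chars, annotated, char_probs, thresholds)
```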

Similarly, if the boundary of the labeled label and the text of the predicted entity category is inconsistent or the type of the labeled label and the text of the predicted entity category are inconsistent, the situation that the entity of the text is possibly labeled with a boundary error or a label type error exists is shown. In an embodiment, in order to find the text of the missing label as accurately as possible, a bagging algorithm operation may be performed on a plurality of target prediction models, and a result obtained after the bagging algorithm is used as a final result to determine whether the data to be corrected has the condition of the missing label.

In an embodiment, the preset probability threshold may be adjusted according to actual conditions. To improve precision, that is, to screen out only the entities most likely to have been missed, the preset probability threshold may be set larger; fewer items are then corrected in each round, and multiple iterations are required. To select more samples that may have been missed, the preset probability threshold may be set smaller; this reduces the number of iterations, but the selected samples are not always wrong, and correctly labeled samples are more likely to be mistaken for missed-label samples.

In an embodiment, fig. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention. As shown in fig. 4, the apparatus includes: a data acquisition module 41, a probability acquisition module 42 and a data correction module 43.

The data obtaining module 41 is configured to obtain data to be corrected; the data to be corrected is text information of the entity type marked.

The probability obtaining module 42 is configured to input data to be corrected to at least one pre-created target prediction model, so as to obtain a prediction probability value of each character in the data to be corrected on each preset tag; wherein the sum of the prediction probability values of each character on all preset labels is 1.

And the data correcting module 43 is configured to correct the entity category of the data to be corrected according to the predicted probability value of each character on each preset tag and the preset probability threshold.

In the embodiment of the invention, the data acquisition module acquires the data to be corrected, the probability acquisition module inputs the data to be corrected into the pre-established target prediction models to obtain the prediction probability value of each character in the data to be corrected on each preset label, the data correction module determines the entity class actually contained in the data to be corrected according to each preset label of each character, and corrects the labeled entity class, so that the correction of the entity classes of wrong labels and missed labels possibly existing in the entity classes of the data to be corrected is realized, the accuracy of correcting the entity classes is improved by comparing the prediction probability value of each character on each preset label with the preset probability threshold, and the use experience of a user is improved.

In some embodiments, a data processing apparatus further comprises:

and the entity type acquisition module is used for acquiring each labeled entity type contained in the data to be corrected.

And the entity type correction module is used for correcting the entity type of the data to be corrected according to the entity characteristics of the labeled entity type.

In some embodiments, a data processing apparatus further comprises:

and the model creating module is used for inputting the pre-acquired training data set into a pre-created target prediction model to obtain the prediction probability value of each character on each preset label.

And the probability prediction module is used for obtaining the probability distribution condition of the corresponding preset label according to the prediction probability value of each character on each preset label.

And the probability threshold presetting module is used for configuring a preset probability threshold corresponding to the preset label according to the probability distribution condition and the preset confidence coefficient.

In some embodiments, the entity class correction module comprises:

and the text acquisition unit is used for acquiring the labeled text associated with each labeled entity type.

And the matching condition determining unit is used for determining the matching condition of the text characteristic of each labeled text and the entity characteristic of the corresponding labeled entity type.

And the entity type correcting unit is used for correcting the entity type of the data to be corrected according to the matching condition.

In some embodiments, the probability acquisition module 42 includes:

the first probability prediction unit is used for inputting the data to be corrected into at least one pre-established target prediction model to obtain the prediction probability value of each character output by each target prediction model on each preset label.

And the second probability prediction unit is used for integrating the prediction probability value output by each target prediction model by adopting a guide aggregation algorithm and taking the prediction probability value as the prediction probability value of each character in the data to be corrected on each preset label.

In some embodiments, the data modification module 43 includes:

and the comparison result determining unit is used for determining a probability comparison result between the prediction probability value of each character on each preset label and a preset probability threshold value.

And the confidence coefficient determining unit is used for determining the prediction confidence coefficient of each character on the corresponding preset label according to the probability comparison result.

And the data correction unit is used for correcting the entity type of the data to be corrected according to the prediction confidence coefficient and the labeled label of the corresponding character.

In some embodiments, the creation process of the target prediction model in the probability acquisition module 42 includes:

and the original model acquisition unit is used for acquiring at least two original prediction models.

And the target model acquisition unit is used for training each original prediction model by adopting a training data set and a guide aggregation algorithm to obtain a corresponding target prediction model.
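The text does not detail how the training data set and the guide aggregation (bagging) algorithm are combined. In standard bagging, each model is trained on a bootstrap sample drawn with replacement from the training set; the sketch below shows only that sampling step, under the assumption that the standard procedure applies. The function name and the fixed seed are illustrative.

```python
import random

def bootstrap_samples(dataset, n_models, seed=0):
    """Draw one bootstrap sample (same size, with replacement) per model."""
    rng = random.Random(seed)
    return [rng.choices(dataset, k=len(dataset)) for _ in range(n_models)]

# Hypothetical annotated sentences identified by index.
training_set = list(range(10))
samples = bootstrap_samples(training_set, n_models=3)
```

Each original prediction model would then be trained on its own sample to obtain the corresponding target prediction model.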

The data processing device provided by the embodiment of the invention can execute the data processing method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.

In an embodiment, fig. 5 is a schematic structural diagram of an electronic device 10 implementing a data processing method according to an embodiment of the present invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.

As shown in fig. 5, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from a storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.

A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 11 performs the various methods and processes described above, such as a data processing method.

In some embodiments, a data processing method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of a data processing method as described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform a data processing method by any other suitable means (e.g. by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.

The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.

The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

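1. A data processing method, comprising:

acquiring data to be corrected; the data to be corrected is text information of the entity type marked;

inputting the data to be corrected to at least one target prediction model which is created in advance to obtain the prediction probability value of each character in the data to be corrected on each preset label; the sum of the prediction probability values of each character on all preset labels is 1;

and correcting the entity type of the data to be corrected according to the prediction probability value of each character on each preset label and a preset probability threshold.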

2. The method of claim 1, further comprising:

acquiring each labeled entity type contained in the data to be corrected;

and correcting the entity type of the data to be corrected according to the entity characteristics of the labeled entity type.

3. The method according to claim 1, further comprising, before said obtaining data to be corrected:

inputting a pre-acquired training data set into a pre-established target prediction model to obtain a prediction probability value of each character on each preset label;

obtaining the probability distribution condition of the corresponding preset label according to the predicted probability value of each character on each preset label;

and configuring a preset probability threshold corresponding to a preset label according to the probability distribution condition and a preset confidence level.

4. The method according to claim 2, wherein the correcting the entity type of the data to be corrected according to the entity characteristics of the labeled entity type comprises:

acquiring a labeled text associated with each labeled entity type;

determining the matching condition of the text feature of each labeled text and the entity feature of the corresponding labeled entity category;

and correcting the entity type of the data to be corrected according to the matching condition.

5. The method according to any one of claims 1 to 3, wherein the inputting the data to be corrected into at least one target prediction model created in advance to obtain the prediction probability value of each character in the data to be corrected on each preset label comprises:

inputting the data to be corrected into at least one pre-established target prediction model to obtain the prediction probability value of each character output by each target prediction model on each preset label;

and integrating the prediction probability value output by each target prediction model by adopting a guide aggregation algorithm to be used as the prediction probability value of each character in the data to be corrected on each preset label.

6. The method according to any one of claims 1 to 3, wherein the correcting the entity type of the data to be corrected according to the prediction probability value of each character on each preset label and a preset probability threshold comprises:

determining a probability comparison result between the prediction probability value of each character on each preset label and a preset probability threshold value;

determining the prediction confidence of each character on the corresponding preset label according to the probability comparison result;

and correcting the entity type of the data to be corrected according to the prediction confidence coefficient and the labeled label of the corresponding character.

7. The method according to any one of claims 1 to 3, wherein the creation of the target prediction model comprises:

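obtaining at least two original prediction models;

and training each original prediction model by adopting a training data set and a guide aggregation algorithm to obtain a corresponding target prediction model.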

8. A data processing apparatus, comprising:

the data acquisition module is used for acquiring data to be corrected; the data to be corrected is text information of the entity type marked;

the probability obtaining module is used for inputting the data to be corrected to at least one target prediction model which is created in advance to obtain the prediction probability value of each character in the data to be corrected on each preset label; the sum of the prediction probability values of each character on all preset labels is 1;

and the data correction module is used for correcting the entity type of the data to be corrected according to the prediction probability value of each character on each preset label and a preset probability threshold.

9. An electronic device, characterized in that the electronic device comprises:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein

the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the data processing method of any one of claims 1-7.

10. A computer-readable storage medium, characterized in that it stores computer instructions for causing a processor to implement the data processing method of any of claims 1-7 when executed.