CN113807096A - Text data processing method and device, computer equipment and storage medium

Publication number: CN113807096A
Authority: CN (China)
Legal status: Pending
Application number: CN202110381793.6A (filed in Chinese (zh))
Inventors: 付振宇, 郑宇宇, 赵英普, 顾松庠
Assignee (original and current): Jingdong Technology Holding Co Ltd
Priority: CN202110381793.6A


Classifications

    • G06F40/295 Named entity recognition (under G06F40/20 Natural language analysis; G06F40/279 Recognition of textual entities; G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking)
    • G06F16/3344 Query execution using natural language analysis (under G06F16/33 Querying of unstructured textual data)
    • G06F16/3346 Query execution using probabilistic model
    • G06F16/35 Clustering; Classification (of unstructured textual data)
    • G06F40/216 Parsing using statistical methods (under G06F40/205 Parsing)

Abstract

The application provides a text data processing method, a text data processing device, computer equipment and a storage medium. The method includes: obtaining a text data set, where the text data set includes a plurality of text data and label information corresponding to each text data; processing the text data set to obtain N training sets and N corresponding test sets, where the N training sets are different from each other, the N test sets are different from each other, the N test sets together form the text data set, and N is an integer greater than 1; training N recognition models with the N training sets respectively; recognizing the text data in the corresponding test set with each recognition model to determine a prediction label corresponding to each text data in the text data set; and processing the text data in the text data set according to the degree of difference between the prediction label corresponding to each text data and its label information. In this way, wrongly labeled samples in the training text data can be quickly screened out, improving the speed and efficiency of quality inspection and cleaning of the training text data.

Description

Text data processing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method and an apparatus for processing text data, a computer device, and a storage medium.
Background
In the field of natural language processing, most tasks, such as Named Entity Recognition (NER), require training a model with a corresponding corpus. The quality of the corpus is an important factor affecting model performance, so in practical applications the original corpus needs to be quality-checked and cleaned to improve the model.
At present, corpus cleaning is mainly realized in two ways: rule-based cleaning and manual quality inspection cleaning. Rule-based cleaning manually sets cleaning rules, according to the characteristics of the corpus, to remove or correct low-quality corpus entries; cleaning in this way is of low quality and can lead to excessive cleaning, resulting in insufficient generalization capability of the model. Manual cleaning inspects the corpus entries one by one; this way is time-consuming and its cleaning efficiency is low.
Disclosure of Invention
The present application is directed to solving, at least to some extent, one of the technical problems in the related art.
The application provides a text data processing method, a text data processing device, a computer device and a storage medium, so as to realize automatic cleaning of labeled text data. A plurality of training sets and a plurality of corresponding test sets are obtained by processing a text data set, a plurality of recognition models are trained, and the prediction labels produced by each recognition model on its corresponding test set are obtained; the text data is then processed according to the degree of difference between the prediction labels and the label information. Wrongly labeled samples in the training text data can thus be rapidly screened out, which improves the speed and efficiency of quality inspection and cleaning of the training text data, ensures the cleaning quality of the text data, avoids as much as possible the problem of inaccurate models caused by wrongly labeled text data, and provides a guarantee for the effect of subsequent model training. This solves the problems in the prior art of low cleaning quality and excessive cleaning caused by the rule-based cleaning mode, and of low cleaning efficiency caused by the manual cleaning mode.
An embodiment of a first aspect of the present application provides a text data processing method, including:
acquiring a text data set, wherein the text data set comprises a plurality of text data and label information corresponding to each text data;
processing the text data set to obtain N training sets and N corresponding test sets, wherein the N training sets and the N test sets are different from each other, the N test sets form the text data set, and N is an integer greater than 1;
respectively training N recognition models by utilizing the N training sets;
respectively identifying the text data in the corresponding test set by using each identification model to determine a prediction label corresponding to each text data in the text data set;
and processing the text data in the text data set according to the difference degree between the prediction label corresponding to each text data and the label information.
An embodiment of a second aspect of the present application provides a text data processing apparatus, including:
an acquisition module, configured to acquire a text data set, where the text data set includes a plurality of text data and label information corresponding to each text data;
the splitting module is used for processing the text data set to obtain N training sets and N corresponding test sets, wherein the N training sets and the N test sets are different from each other, the N test sets form the text data set, and N is an integer greater than 1;
the training module is used for respectively training N recognition models by utilizing the N training sets;
the recognition module is used for recognizing the text data in the corresponding test set by using each recognition model respectively, so as to determine a prediction label corresponding to each text data in the text data set;
and the cleaning module is used for processing the text data in the text data set according to the difference degree between the predicted label corresponding to each text data and the label information.
An embodiment of a third aspect of the present application provides a computer device, including: a memory, a processor, and a computer program stored on the memory and capable of running on the processor, where the processor, when executing the program, implements the text data processing method provided by the embodiment of the first aspect of the present application.
An embodiment of a fourth aspect of the present application provides a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the text data processing method set forth in the embodiment of the first aspect of the present application.
An embodiment of a fifth aspect of the present application proposes a computer program product, wherein when the instructions in the computer program product are executed by a processor, the method for processing text data as proposed in the embodiment of the first aspect of the present application is performed.
The text data processing method, the text data processing device, the computer equipment and the storage medium have the following beneficial effects that:
the text data set is processed to obtain N training sets and N corresponding test sets by obtaining the text data set, wherein the text data set comprises a plurality of text data and label information corresponding to each text data, the N training sets and the N test sets are different from each other, the N test sets form the text data set, N is an integer greater than 1, the N training sets are used for respectively training N recognition models, each recognition model is used for respectively recognizing the text data in the corresponding test set to determine a prediction label corresponding to each text data in the text data set, and then the text data in the text data set is processed according to the difference degree between the prediction label corresponding to each text data and the label information. Therefore, the text data set is processed to obtain a plurality of training sets and a plurality of corresponding test sets, a plurality of recognition models are trained, the prediction label for recognizing the corresponding test set by each training model is obtained, and then the text data is processed according to the difference degree of the prediction label and the label information, so that the error label in the training text data can be rapidly screened out, the quality inspection cleaning speed and efficiency of the training text data are improved, the cleaning quality of the text data can be ensured, the problem of inaccurate model caused by the error label text data is avoided as much as possible, and the guarantee is provided for the effect of subsequent model training.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a text data processing method according to an embodiment of the present application;
FIG. 2 is an exemplary diagram of a split result of multiple splits of an original text data set;
fig. 3 is a schematic flowchart of a text data processing method according to a second embodiment of the present application;
fig. 4 is a schematic flowchart of a text data processing method according to a third embodiment of the present application;
fig. 5 is a schematic flowchart of a text data processing method according to a fourth embodiment of the present application;
fig. 6 is a schematic structural diagram of a text data processing apparatus according to a fifth embodiment of the present application;
fig. 7 is a schematic structural diagram of a text data processing apparatus according to a sixth embodiment of the present application;
fig. 8 is a schematic structural diagram of a text data processing apparatus according to a seventh embodiment of the present application;
fig. 9 is a schematic structural diagram of a text data processing apparatus according to an eighth embodiment of the present application;
FIG. 10 illustrates a block diagram of an exemplary computer device suitable for implementing embodiments of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
A method, an apparatus, a device, and a storage medium for processing text data according to an embodiment of the present application are described below with reference to the drawings.
Fig. 1 is a flowchart illustrating a text data processing method according to an embodiment of the present application.
In the embodiments of the present application, the text data processing method is described as being configured in a text data processing apparatus, and the apparatus can be applied to any computer device, so that the computer device can perform the text data processing function.
The computer device may be a Personal Computer (PC), a server, a cloud device, a mobile device, and the like; the mobile device may be, for example, a mobile phone, a tablet computer, a personal digital assistant, or a wearable device.
As shown in fig. 1, the processing method of the text data may include the steps of:
step 101, a text data set is obtained, wherein the text data set comprises a plurality of text data and label information corresponding to each text data.
The label information corresponding to each text data can be manually marked when the text data is acquired, so that the accuracy of the label information is ensured.
In this embodiment, a plurality of text data may be obtained in a plurality of ways to form a text data set; for example, sentences containing named entities are extracted from web articles and news content to form the text data set, in which case the label information contained in the text data set identifies named entities. A named entity can be one or more types of entity name, such as a person name, a place name, an institution name, a company name or an organization name.
And 102, processing the text data set to obtain N training sets and N corresponding test sets, wherein the N training sets and the N test sets are different from each other, the N test sets form the text data set, and N is an integer greater than 1.
In the embodiment of the application, the acquired text data set can be copied into multiple copies, a splitting rule is adopted for splitting each text data set, and the text data set is split into a training set and a test set, wherein the splitting rules adopted by the text data sets are different, so that the multiple training sets obtained by splitting are different from each other, and the multiple test sets obtained by splitting are also different from each other. In addition, in this embodiment, the adopted splitting rule should ensure that a plurality of test sets obtained by splitting the text data set for a plurality of times do not intersect with each other, and the sum of all test sets is equal to the original text data set.
As an example, assuming that a text data set includes ten pieces of text data, when splitting the text data set, the original text data set may be split five times, and the original text data set may be split in the order shown in fig. 2 to obtain a training set, a verification set, and a test set. In fig. 2, each rectangular box represents a piece of text data, wherein a white rectangular box represents training text data, all training text data belonging to the same text data set form a training set corresponding to the text data set, and the training set is used for training to obtain a recognition model; the black rectangular box represents verification text data, all the verification text data belonging to the same text data set form a verification set corresponding to the text data set, and the verification set is used for verifying the identification model obtained by training; the gray rectangular boxes represent test text data, all the test text data belonging to the same text data set form a test set corresponding to the text data set, and the test set is used for obtaining the prediction labels. As can be seen from FIG. 2, all the test sets obtained by multiple splitting are not intersected with each other, and the sum of all the test sets is equal to the original text data set.
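To make the splitting rule concrete, the following sketch (a minimal illustration in Python; the function name and the use of random shuffling are assumptions, and the verification set shown in fig. 2 is omitted for brevity) splits a data set into N pairwise disjoint test sets whose union equals the original set, each paired with the complementary training set:

    import random

    def split_dataset(dataset, n_splits=5, seed=42):
        # Produce n_splits (train_set, test_set) pairs such that the test
        # sets are pairwise disjoint and their union equals the data set.
        indices = list(range(len(dataset)))
        random.Random(seed).shuffle(indices)
        # Deal the shuffled indices into n_splits disjoint folds.
        folds = [indices[i::n_splits] for i in range(n_splits)]
        splits = []
        for fold in folds:
            test_idx = set(fold)
            train_set = [dataset[j] for j in indices if j not in test_idx]
            test_set = [dataset[j] for j in fold]
            splits.append((train_set, test_set))
        return splits

Because every piece of text data falls into exactly one test set, each piece of data both participates in training (for the other N-1 splits) and receives a prediction from the one model that did not see it.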
And 103, respectively training N recognition models by utilizing N training sets.
In the embodiment of the application, after the text data set is split for multiple times according to different splitting rules to obtain the corresponding N training sets and N test sets, each training set can be utilized to train to obtain a corresponding recognition model, so that the N training sets can be utilized to train to obtain N recognition models, and each training set corresponds to one recognition model.
And 104, respectively identifying the text data in the corresponding test set by using each identification model so as to determine a prediction label corresponding to each text data in the text data set.
In the embodiment of the application, after the N corresponding recognition models are obtained by training with the N training sets, the text data in the corresponding test set can be recognized with each recognition model, so as to obtain the prediction label given by that model to each text data in its test set. The prediction label of each text data can be represented by probabilities: the recognition model outputs the probability that the text data belongs to each label, and the higher the probability, the more likely the text data is to belong to that label.
Because a training set and a corresponding test set can be obtained after the text data set is split once, the N test sets obtained by splitting for multiple times are different from each other, and the N test sets form the text data set, the corresponding test set is identified by using each identification model, and a prediction label corresponding to each text data in the original text data set can be obtained. In the embodiment of the application, the text data set is split for multiple times to obtain N training sets and N test sets, and a plurality of recognition models are trained, so that each text data in the text data set can be used as the training text data to participate in the training of the model, and each text data can be used as the test text data to obtain a prediction result, thereby avoiding the problem of inaccurate model caused by mistakenly marking the sample as much as possible.
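Continuing the sketch above (train_model and predict_probs are assumed stand-ins for whatever sequence-labeling trainer and probability-emitting predictor are actually used; the patent does not prescribe a particular recognition model), training one recognition model per training set and collecting every text's label distribution probability matrix could look like:

    def predict_all(splits, train_model, predict_probs):
        # Maps each text id to its label distribution probability matrix,
        # of shape (number of labels) x (number of characters).
        predictions = {}
        for train_set, test_set in splits:
            model = train_model(train_set)  # one recognition model per training set
            for text in test_set:
                predictions[text["id"]] = predict_probs(model, text["tokens"])
        return predictions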
And 105, processing the text data in the text data set according to the difference degree between the prediction label corresponding to each text data and the label information.
In this embodiment, after the prediction tag corresponding to each text data in the text data set is identified, the difference degree between the prediction tag and the tag information of the same text data can be determined according to the prediction tag corresponding to each text data and the tag information corresponding to the text data, and then the text data in the text data set is processed according to the difference degree, for example, the text data in the text data set is cleaned.
As an example, whether the predicted label and the label information are consistent may be compared, and text data for which they are inconsistent may be determined to be mislabeled text data, thereby completing the cleaning of the text data in the text data set. Furthermore, the determined mislabeled text data can be verified and corrected manually; because the amount of mislabeled text data is much smaller than the total amount of text data, manual verification of only the mislabeled data improves the verification and cleaning speed while ensuring the cleaning quality of the text data, providing a powerful guarantee for the effect of subsequent model training.
In the text data processing method of the embodiment of the application, a text data set is obtained, where the text data set includes a plurality of text data and label information corresponding to each text data. The text data set is processed to obtain N training sets and N corresponding test sets, where the N training sets are different from each other, the N test sets are different from each other, the N test sets together form the text data set, and N is an integer greater than 1. The N training sets are used to train N recognition models respectively, each recognition model is used to recognize the text data in its corresponding test set so as to determine a prediction label for each text data, and the text data in the text data set is then processed according to the degree of difference between each prediction label and the corresponding label information. In this way, wrongly labeled samples in the training text data can be rapidly screened out, which improves the speed and efficiency of quality inspection and cleaning of the training text data, ensures the cleaning quality of the text data, avoids as much as possible the problem of inaccurate models caused by wrongly labeled text data, and provides a guarantee for the effect of subsequent model training.
In order to more clearly illustrate the foregoing embodiment, a specific implementation process of processing the text data in the text data set according to the degree of difference between the prediction result and the labeled result corresponding to each text data is described in detail below with reference to fig. 3.
Fig. 3 is a flowchart illustrating a text data processing method according to a second embodiment of the present application. As shown in fig. 3, step 105 may include the following steps based on the embodiment shown in fig. 1:
step 201, determining a predicted label corresponding to each character in each text data according to a label distribution probability matrix corresponding to each text data.
The label distribution probability matrix corresponding to each piece of text data is obtained by having a recognition model identify and predict that piece of text data, and the dimensions of the matrix are: the number of labels multiplied by the number of characters of the piece of text data. For convenience of description, the label distribution probability matrix may be represented as PS[i][j], where i represents the ID of the label and j represents the serial number of the character in the text data (starting from 0).
For ease of understanding, the following illustration uses named entities as labels. In this example, the named entities contain only one type, namely company, organization or institution names, collectively referred to here as COMPANY. The common NER labeling schemes are B-I-E and B-I; this example uses B-I-E. Since only one entity type is contained, the number of labels is 4: B-COMPANY, I-COMPANY, E-COMPANY and O, where B-, I- and E- represent the start, middle and end positions of an NER entity respectively, and O represents OTHER. The correspondence between labels and IDs is set as follows:
O = 0, B-COMPANY = 1, I-COMPANY = 2, E-COMPANY = 3
For the text data "Up to now, the Guangdong Pearl Group has no relevant valve business.", with a single character as the segmentation (tokenization) unit, its token list, label IDs and corresponding NER entities are shown in Table 1:
TABLE 1
[Table 1 is an image in the original publication: the token list with each character's label ID and corresponding NER entity.]
The text data is predicted by the recognition model, and the obtained label distribution probability matrix is shown in table 2:
TABLE 2
[Table 2 is an image in the original publication: the label distribution probability matrix PS[i][j] predicted for the example text.]
In this embodiment, according to the label distribution probability matrix corresponding to each text data, the predicted label corresponding to each character in each text data can be determined. For example, the label with the highest probability may be determined as the predicted label of the corresponding character in the text data.
In the above example, as shown in Table 2, PS[i=0][j=1]=0.72 indicates that the probability of predicting the 1st token ("to") to be O is 0.72, PS[i=1][j=1]=0.08 indicates that the probability of predicting it to be B-COMPANY is 0.08, PS[i=2][j=1]=0.1 indicates that the probability of predicting it to be I-COMPANY is 0.1, and PS[i=3][j=1]=0.1 indicates that the probability of predicting it to be E-COMPANY is 0.1; the predicted label of the character "to" in the text data is therefore O.
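Determining the per-character predicted label then amounts to an argmax over the label axis of the probability matrix. A minimal sketch (the three columns reuse the probability values quoted above for illustration; they are not the full Table 2):

    import numpy as np

    # PS[i][j]: probability that the j-th character carries the i-th label
    PS = np.array([
        [0.80, 0.72, 0.40],  # i=0: O
        [0.00, 0.08, 0.20],  # i=1: B-COMPANY
        [0.10, 0.10, 0.30],  # i=2: I-COMPANY
        [0.10, 0.10, 0.10],  # i=3: E-COMPANY
    ])
    predicted_ids = PS.argmax(axis=0)  # highest-probability label per character
    print(predicted_ids)               # [0 0 0]: all three characters predicted O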
Step 202, determining the confidence of each text data according to the number of characters, of which the predicted labels are different from the label information, contained in each text data.
In this embodiment, after the prediction tag corresponding to each character in each text data is determined, for each character, the prediction tag of the character in the text data may be compared with the tag information (i.e., the artificial tag in table 2) corresponding to the character, the number of characters in the text data, in which the prediction tags are different from the tag information, may be counted, and the confidence of each text data may be determined.
The smaller the proportion of characters whose predicted labels differ from the label information to the total number of characters in the text data, the higher the confidence of the text data.
As an example, the confidence of each text data may be determined as: one minus the ratio of the number of characters in the text data whose predicted label differs from the label information to the total number of characters in the text data.
Step 203, determining the text data with the confidence coefficient smaller than the threshold value as the error labeling text data.
The threshold can be set manually in advance; the larger the threshold, the more text data in the text data set will be determined to be mislabeled text data, and the greater the cleaning intensity.
In this embodiment, after the confidence level of each text data in the text data set is determined, the confidence level of each text data may be compared with a preset threshold, whether the confidence level of each text data is smaller than the threshold is determined, and the text data with the confidence level smaller than the threshold is determined as the incorrectly labeled text data.
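A minimal sketch of steps 201 to 203 under the assumptions above (the threshold value is illustrative, and each text is assumed to carry its manually annotated label IDs in a "label_ids" field):

    def text_confidence(pred_ids, gold_ids):
        # Confidence falls as the share of characters whose predicted
        # label differs from the label information rises.
        mismatches = sum(p != g for p, g in zip(pred_ids, gold_ids))
        return 1.0 - mismatches / len(gold_ids)

    def flag_mislabeled(texts, predictions, threshold=0.9):
        flagged = []
        for text in texts:
            pred_ids = predictions[text["id"]].argmax(axis=0)
            if text_confidence(pred_ids, text["label_ids"]) < threshold:
                flagged.append(text)  # send to manual verification and correction
        return flagged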
According to the text data processing method of this embodiment, the predicted label corresponding to each character in each text data is determined from the label distribution probability matrix corresponding to that text data, the confidence of each text data is determined from the number of characters whose predicted labels differ from the label information, and the text data with confidence below the threshold is then determined to be mislabeled text data.
In a possible implementation manner of the embodiment of the application, the prediction label corresponding to each text data is the label distribution probability matrix corresponding to that text data, where the element in the ith row and jth column of the matrix represents the probability that the jth character of the text data belongs to the ith class of label, i and j being natural numbers. In this embodiment, as shown in fig. 4, on the basis of the embodiment shown in fig. 1, processing the text data in the text data set according to the degree of difference between the prediction result and the labeling result corresponding to each text data may include the following steps:
step 301, determining a confidence threshold corresponding to each type of label according to the label distribution probability matrix corresponding to each text data and the label information.
In order to conveniently obtain the confidence threshold corresponding to each type of label, in this embodiment, after the label distribution probability matrix of each text data is obtained, the label distribution probability matrices of all text data may be arranged and summarized in sequence according to the length of each text data's token list, the summarized probability matrix being denoted P[i][j]; correspondingly, the label information of all text data is also arranged and summarized in sequence according to the length of each text data's token list, the summarized label information being denoted a[j]. Together these form an overall combined matrix.
For example, Table 3 is an overall combined matrix obtained by summarizing the label distribution probability matrices and the label information of all text data. Due to page limitations, Table 3 only shows the text data "Up to now, the Guangdong Pearl Group has no relevant valve business.", "LeEco responded that the 12.2 billion was a promotional behavior" and "Jingdong Health was successfully listed yesterday", with their corresponding label distribution probabilities.
TABLE 3
[Table 3 is an image in the original publication: the combined probability matrix P[i][j] and the artificial label IDs a[j] for the three example texts.]
In the above example, the token list length of each text may be recorded as L[k], where k denotes the index of each text (starting from 0); L[k=0]=19 indicates that the first text, "Up to now, the Guangdong Pearl Group has no relevant valve business.", has a token list length of 19. Then P[i=1][j=19]=0.8 indicates that the probability that the first token ("le") of the second text is predicted to be B-COMPANY is 0.8, and its corresponding artificial label ID is a[j=19]=0.
In this embodiment, the confidence threshold corresponding to each type of label can be determined according to the label distribution probability matrix and the label information corresponding to each text data. The confidence threshold for any type of label is calculated as follows: determine each candidate character corresponding to that type of label according to the label distribution probability matrix corresponding to each text data; in response to the label information corresponding to a candidate character being that type of label, determine that candidate character to be a target character corresponding to that type of label; and determine the confidence threshold corresponding to that type of label according to its target characters and the prediction probability value of each target character under that label.
When determining the confidence threshold corresponding to any type of label from its target characters and their prediction probability values under that label, the average of the target characters' prediction probability values may be used as the confidence threshold; alternatively, the average may be taken after removing the maximum and minimum probability values, or the median may be used. The present application does not limit this.
Taking the average value of the prediction probability values of the calculated target characters as the confidence threshold as an example, combining the aggregated overall combined matrix, the calculation process of the confidence threshold of each type of label can be summarized as follows:
(1) Select from the overall combined matrix the set of all elements whose predicted label ID is i, namely the elements of the ith row of P[i][j]; (2) from the result of step (1), select the set of elements whose label information (namely the artificial label in Table 3) is also i; (3) average the element set from step (2) to obtain the average probability t[i] for artificial label ID i, and take it as the confidence threshold of the ith class of label.
For ease of understanding, the overall combined matrix shown in Table 3 is used to illustrate the calculation of each type of label's confidence threshold. For i=2, which indicates that the artificial label ID is 2 and the corresponding label is I-COMPANY, t[i=2] is calculated as follows:
(1) Select the set with i=2 from the prediction probabilities P[i][j], namely P[i=2][j]. For quick calculation and ease of understanding, the parts marked with ellipses are omitted; the same applies to the following examples in the application.
(2) The parts corresponding to a[j]=2 are selected from P[i=2][j], corresponding to 0.8 and 0.7.
(3) Averaging the elements selected in step (2) gives t[i=2] = (0.8+0.7)/2 = 0.75; that is, the confidence threshold of label i=2 is 0.75. Similarly, one can calculate:
t[i=0] = (0.8+0.72+0.4+0.1+0.1+0.6+0.6+0.7+0.9)/9 ≈ 0.55
t[i=1] = 0.8
t[i=3] = 0.7
It should be noted that summarizing the label distribution probability matrices of all text data into an overall combined matrix, as described above, is only for convenience in explaining how the confidence threshold corresponding to each type of label is determined, and is not a necessary processing step.
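Taking the mean variant as an example, the per-class confidence thresholds can be computed directly from the combined matrix. A sketch under the conventions above (P is the combined probability matrix and a the combined artificial label IDs, as numpy arrays):

    import numpy as np

    def class_confidence_thresholds(P, a):
        # t[i]: mean predicted probability of label i over the characters
        # whose artificial label is also i (cf. t[i=2] = (0.8+0.7)/2 = 0.75).
        num_labels = P.shape[0]
        t = np.zeros(num_labels)
        for i in range(num_labels):
            mask = (a == i)
            if mask.any():
                t[i] = P[i, mask].mean()
        return t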
Step 302, determining a predicted label corresponding to each character in each text data according to the confidence threshold corresponding to each type of label and the label distribution probability matrix corresponding to each text data.
As an example, for each character in each piece of text data, the label whose label distribution probability is the largest and is not less than the confidence threshold of the class of labels may be determined as the predicted label of the character.
For example, as shown in Table 3, the probabilities that the label ID of the character "jing" is predicted to be 0, 1, 2 and 3 are 0.1, 0.8, 0.05 and 0.05 respectively, and the confidence thresholds of label IDs 0 to 3 are 0.55, 0.8, 0.75 and 0.7 respectively, as obtained from the above calculation. By comparing the confidence threshold of each type of label with the probability that the character "jing" is predicted to be that type of label, the predicted label ID of the character "jing" is determined to be 1; that is, the predicted label of the character "jing" is B-COMPANY.
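A sketch of this thresholded per-character prediction (P_col holds one character's probabilities under every label class, t the per-class thresholds from the previous sketch; None marks a character for which no sufficiently confident label exists):

    import numpy as np

    def predicted_label(P_col, t):
        i = int(np.argmax(P_col))
        # Keep the top label only if its probability is not less than
        # the confidence threshold of that label class.
        return i if P_col[i] >= t[i] else None

    # The character "jing": argmax is label 1, and 0.8 >= t[1] = 0.8,
    # so the predicted label ID is 1 (B-COMPANY).
    print(predicted_label(np.array([0.1, 0.8, 0.05, 0.05]),
                          np.array([0.55, 0.8, 0.75, 0.7])))  # 1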
Step 303, determining the confidence of each text data according to the number of characters, of which the predicted tag is different from the tag information, included in each text data.
In this embodiment, a character's predicted label differing from its label information covers both the case where the predicted label of the character is inconsistent with the label information and the case where no predicted label can be determined for the character according to the confidence threshold corresponding to each type of label and the label distribution probability matrix corresponding to the text data.
As an example, the confidence of each text data may be determined as: one minus the ratio of the number of characters in the text data whose predicted label differs from the label information to the total number of characters in the text data. The smaller the proportion of such characters, the higher the confidence of the text data.
And step 304, determining the text data with the confidence coefficient smaller than the threshold value as the error labeling text data.
The threshold can be set manually in advance; the larger the threshold, the more text data in the text data set will be determined to be mislabeled text data, and the greater the cleaning intensity.
In this embodiment, after the confidence level of each text data in the text data set is determined, the confidence level of each text data may be compared with a preset threshold, whether the confidence level of each text data is smaller than the threshold is determined, and the text data with the confidence level smaller than the threshold is determined as the incorrectly labeled text data.
In the text data processing method of this embodiment, the confidence threshold corresponding to each type of label is determined according to the label distribution probability matrix and label information corresponding to each text data; the predicted label corresponding to each character in each text data is determined according to the confidence thresholds and the label distribution probability matrices; the confidence of each text data is determined according to the number of characters whose predicted labels differ from the label information; and the text data with confidence below the threshold is determined to be mislabeled text data. Mislabeled text data in the text data set can thus be screened out automatically, and determining the predicted label of each character through the per-class confidence thresholds improves the accuracy of the prediction results and hence the quality of text data cleaning.
In a possible implementation manner of the embodiment of the application, the prediction label corresponding to each text data is the label distribution probability matrix corresponding to that text data, where the element in the ith row and jth column of the matrix represents the probability that the jth character of the text data belongs to the ith class of label, i and j being natural numbers. In this embodiment, as shown in fig. 5, on the basis of the embodiment shown in fig. 1, processing the text data in the text data set according to the degree of difference between the prediction result and the labeling result corresponding to each text data may include the following steps:
step 401, determining a confidence threshold corresponding to each type of label according to the label distribution probability matrix corresponding to each text data and the label information.
In this embodiment, reference may be made to the description of step 301 in the foregoing embodiment for the description of step 401, and details are not repeated here to avoid repetition.
Step 402, determining a count matrix corresponding to the text data set according to the confidence threshold corresponding to each type of label, the label distribution probability matrix corresponding to each text data and the label information, where the count value in the kth row and vth column of the count matrix represents the number of characters whose predicted label is k and whose label information is v, k and v being natural numbers.
In this embodiment, the rows and columns of the count matrix correspond to the predicted labels and the label information respectively. The count value of the element in the kth row and vth column is obtained by counting, over all characters contained in the text data set, the number of characters whose label information is v and whose predicted label is k.
Here, the predicted label corresponding to a character is the category label with the maximum probability value among the label categories.
Further, in a possible implementation manner of the embodiment of the present application, the count value of each element in the count matrix may also be determined in combination with the confidence threshold corresponding to each type of label. Specifically, an initial count matrix may be generated according to the number of labels contained in the label information and the number of labels contained in the predicted labels, with the value of every element in the initial count matrix being zero. Then, the label information, the predicted label and the probability value corresponding to the predicted label are counted for each character in the text data set; in response to the label information corresponding to a character being v, its predicted label being k, and the probability value of predicted label k being greater than the confidence threshold corresponding to class-k labels, the count value in the kth row and vth column of the initial count matrix is incremented by 1. That is, the count value in the kth row and vth column of the finally determined count matrix is the number of characters, among all characters contained in the text data set, whose label information is v, whose predicted label is k, and whose probability value for predicted label k exceeds the confidence threshold corresponding to class-k labels. If the probability value of predicted label k for a character does not exceed the confidence threshold corresponding to class-k labels, the character is discarded and does not participate in the calculation of the count matrix.
Taking the text data set and the corresponding label distribution probability matrix shown in Table 3 as an example, since the number of label information types and the number of predicted label types are both 4, the dimension of the count matrix (denoted C[k][v]) is 4 × 4, and the initialized count matrix is shown in Table 4:
TABLE 4
C[k][v] v=0 v=1 v=2 v=3
k=0 0 0 0 0
k=1 0 0 0 0
k=2 0 0 0 0
k=3 0 0 0 0
As shown in Table 3, for j=0, i.e., the 0th token ("truncation"), the probability values predicted by the model are 0.8, 0, 0.1 and 0.1; the maximum value is 0.8 with the corresponding label i=0, i.e., P[i=0][j=0], and since this value is greater than t[i=0]=0.55, its predicted label ID is 0. For j=2, i.e., the second token ("mesh"), the probability values predicted by the model are 0.4, 0.2, 0.3 and 0.1; the highest value is 0.4 with the corresponding label i=0, i.e., P[i=0][j=2], and since this value is less than t[i=0]=0.55, the token is discarded, does not participate in the calculation of the subsequent count matrix element values, and may be marked with an ×. Finally, the predicted label IDs determined from the probability values and the confidence threshold of each class of label are shown in black bold in Table 5.
TABLE 5
[Table 5 is an image in the original publication: the label distribution probabilities with the finally determined predicted label IDs shown in bold.]
Statistics are then performed on the basis of Table 5 to obtain the count value of each element in the count matrix. For example, C[k=3][v=0] represents the number of tokens whose artificial label ID is 0 and whose model-predicted label ID is 3; its value is 1. Since the second token ("mesh") was discarded and does not participate in the count calculation, only 12 tokens remain to participate, and the finally determined count matrix is shown in Table 6:
TABLE 6
C[k][v] v=0 v=1 v=2 v=3
k=0 6 0 0 0
k=1 1 1 0 0
k=2 0 0 2 0
k=3 1 0 0 1
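A sketch of this count matrix construction under the same conventions (P: combined probability matrix; a: artificial label IDs; t: per-class confidence thresholds; the comparison here keeps probabilities equal to the threshold, matching the worked example above):

    import numpy as np

    def build_count_matrix(P, a, t):
        M = P.shape[0]               # number of label classes
        C = np.zeros((M, M))
        for j in range(P.shape[1]):  # over all characters in the data set
            k = int(P[:, j].argmax())
            if P[k, j] >= t[k]:      # confident prediction: count it
                C[k, a[j]] += 1      # row k: predicted label; column: label information
        return C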
In step 403, the counting matrix is normalized to generate a processed matrix.
Since in some embodiments of the present application, when the confidence threshold participates in the determination of the count matrix, some tokens are discarded and do not participate in the calculation of the count matrix, which makes the number of samples included in the count matrix not equal to the number of actual tokens, it is necessary to normalize the count matrix so that the total number of counts is the same as the total number of tokens manually labeled.
In a possible implementation manner of the embodiment of the application, when the counting matrix is standardized, the number of predicted characters corresponding to each type label may be determined according to each counting value in the counting matrix, the number of labeled characters corresponding to each type label may be determined according to label information corresponding to each text data, and then each counting value in the counting matrix is standardized according to the number of predicted characters and the number of labeled characters to generate a standardized matrix. The normalization process formula can be described as the following formula (1):
C[k][v] = AL[v] × C[k][v] / (C[0][v] + C[1][v] + ... + C[M-1][v])  (1)
where AL[v] represents the number of labeled characters whose label information ID is v, and M represents the number of predicted label types.
For example, for a token with v equal to 0 in the counting matrix, the number of actual manual labels is 9, but only 8 tokens participate in the calculation of the counting matrix, so that the part with v equal to 0 needs to be normalized, and the above formula (1) can be used to obtain:
C[k=0][v=0]=(C[k=0][v=0]/8)×9=(6/8)×9=6.75
the final normalized results are shown in table 7.
TABLE 7
C[k][v] v=0 v=1 v=2 v=3
K=0 6.75 0 0 0
K=1 1.125 1 0 0
K=2 0 0 2 0
K=3 1.125 0 0 1
Total (participating) 8 1 2 1
Actual (labeled) 9 1 2 1
Then, for the count matrix after the standardization processing, the total number of characters corresponding to the count matrix can be determined from the count values in the matrix, and the standardized matrix is normalized according to this total number of characters to generate the processed matrix. The formula of the normalization processing is shown in formula (2).
Q[k][v] = C[k][v] / T  (2)
where T is the total number of characters corresponding to the count matrix, i.e., the sum of all count values (13 in this example).
For example, Q[k=0][v=0] = C[k=0][v=0]/13 = 6.75/13 ≈ 0.519, and so on; the resulting matrix is shown in Table 8:
TABLE 8
Q[k][v] v=0 v=1 v=2 v=3
K=0 0.519 0 0 0
K=1 0.086 0.077 0 0
K=2 0 0 0.154 0
K=3 0.086 0 0 0.077
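A sketch of formulas (1) and (2) (AL is assumed to hold, per label class, the number of manually labeled characters, computed from the label information):

    import numpy as np

    def standardize_counts(C, AL):
        # Formula (1): rescale each column v so that it sums to AL[v].
        C = C.astype(float).copy()
        col_sums = C.sum(axis=0)
        for v in range(C.shape[1]):
            if col_sums[v] > 0:
                C[:, v] = C[:, v] / col_sums[v] * AL[v]
        return C

    def normalize_counts(C):
        # Formula (2): divide by the total count to obtain Q[k][v].
        return C / C.sum()

    # Worked example: column v=0 sums to 8 while AL[0] = 9, so C[0][0]
    # becomes (6/8) * 9 = 6.75, and Q[0][0] = 6.75/13 ≈ 0.519.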
In step 404, in response to that the count value of the kth row and the vth column in the processed matrix is non-zero and that k is different from v, each candidate character corresponding to the count value of the kth row and the vth column is obtained.
In this embodiment, after the processed count matrix is generated, the candidate characters corresponding to the non-zero off-diagonal elements are found from the count matrix.
As shown in Table 8, the non-zero off-diagonal elements in the count matrix are C[k=1][v=0] and C[k=3][v=0]. For C[k=1][v=0], the token set ["le"], i.e., the tokens whose label information ID is 0 and whose model-predicted label ID is 1, is taken as the candidate characters for mislabeled tokens.
Step 405, determining a difference value between a first probability value with a prediction label k and a second probability value with a prediction label v corresponding to each candidate character in the candidate characters.
Still taking the candidate character "le" of the mislabeled token above as an example: if the token index corresponding to "le" is 19, the difference between the first probability value that the prediction label is k=1 and the second probability value that the prediction label is v=0 is calculated as P[i=1][j=19] − P[i=0][j=19] = 0.8 − 0.1 = 0.7.
Step 406, selecting a target character from the candidate characters according to the difference corresponding to each candidate character.
In a possible implementation manner of the embodiment of the application, when selecting the target characters, the target number of target characters to be selected may be determined as the product of the count value in the kth row and vth column and the total number of characters corresponding to the count matrix; that number of target characters is then selected from the candidate characters in descending order of their difference values.
Continuing the above example, the target number of target characters to be selected is 13 × Q[k=1][v=0] = 13 × 0.086 ≈ 1, and the candidates are finally arranged from large to small by probability difference as follows:
Candidate token set: ["le"]
Probability difference set: [0.7]
The first token in the candidate set, namely "le", is selected as the mislabeled token.
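A sketch of steps 404 to 406 (Q: processed matrix; P, a, t as before; total_chars is the total number of characters behind the count matrix, 13 in the example; candidate and target selection follow the description above):

    import numpy as np

    def select_mislabeled_tokens(Q, P, a, t, total_chars):
        suspects = []
        M = Q.shape[0]
        for k in range(M):
            for v in range(M):
                if k == v or Q[k, v] == 0:
                    continue  # only non-zero off-diagonal cells
                # Candidates: label information v, confident predicted label k.
                cand = [j for j in range(P.shape[1])
                        if a[j] == v
                        and int(P[:, j].argmax()) == k
                        and P[k, j] >= t[k]]
                # Rank by the difference between the two probability values.
                cand.sort(key=lambda j: P[k, j] - P[v, j], reverse=True)
                target_n = int(round(total_chars * Q[k, v]))  # e.g. 13 x 0.086 ≈ 1
                suspects.extend(cand[:target_n])
        return suspects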
Step 407, determining whether each text data is the error labeling text data according to the number of the target characters contained in each text data.
For NER text data, one piece of text data contains a plurality of tokens. In this embodiment, the number of erroneous tokens contained in the text data is normalized to obtain the error labeling probability of each piece of NER text data, and an appropriate threshold is set (the threshold can be set manually). When the error labeling probability is greater than the threshold, the piece of text data is considered mislabeled text data and is sent for manual verification and cleaning. The error labeling probability of any text data can be calculated by formula (3).
ERR_P[w] = ERR_T[w] / L[w]  (3)
where ERR_T[w] represents the number of target characters contained in the w-th text data, namely the number of mislabeled tokens; ERR_P[w] represents the error labeling probability of the w-th text data; and L[w] represents the total number of characters contained in the w-th text data.
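A sketch of formula (3) and the final screening (the threshold value is illustrative and would be set manually, as noted above):

    def mislabeled_text_indices(err_counts, lengths, threshold=0.2):
        # ERR_P[w] = ERR_T[w] / L[w]; texts whose error labeling probability
        # exceeds the threshold are sent for manual verification and cleaning.
        return [w for w, (err_t, length) in enumerate(zip(err_counts, lengths))
                if err_t / length > threshold]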
The text data processing method of this embodiment determines the confidence threshold for each type of label according to the label distribution probability matrix and label information corresponding to each text data, and determines the count matrix corresponding to the text data set according to the confidence thresholds, the label distribution probability matrices and the label information. The count matrix is standardized and normalized to generate a processed matrix. In response to the count value in the kth row and vth column of the processed matrix being non-zero with k different from v, the candidate characters corresponding to that count value are obtained; for each candidate character, the difference between the first probability value that its predicted label is k and the second probability value that its predicted label is v is determined, and target characters are selected from the candidate characters according to these differences. Whether each text data is mislabeled text data is then determined according to the number of target characters it contains. Automatic identification of mislabeled text data is thus realized; the text data no longer needs to be verified manually piece by piece, which saves manpower and improves the text data cleaning efficiency.
Corresponding to the processing method of the text data provided in the embodiments of fig. 1 to 5, the present application also provides a processing apparatus of the text data, and since the processing apparatus of the text data provided in the embodiments of the present application corresponds to the processing method of the text data provided in the embodiments of fig. 1 to 5, the embodiment of the processing method of the text data is also applicable to the processing apparatus of the text data provided in the embodiments of the present application, and will not be described in detail in the embodiments of the present application.
Fig. 6 is a schematic structural diagram of a text data processing apparatus according to a fifth embodiment of the present application.
As shown in fig. 6, the text data processing apparatus 100 may include: an acquisition module 110, a splitting module 120, a training module 130, a recognition module 140, and a cleaning module 150.
The obtaining module 110 is configured to obtain a text data set, where the text data set includes a plurality of text data and tag information corresponding to each text data.
The splitting module 120 is configured to process the text data set to obtain N training sets and N corresponding test sets, where the N training sets and the N test sets are different from each other, the N test sets form the text data set, and N is an integer greater than 1.
A training module 130, configured to train N recognition models respectively using the N training sets.
The recognition module 140 is configured to recognize the text data in the corresponding test set using each recognition model respectively, so as to determine a prediction tag corresponding to each text data in the text data set.
And a cleaning module 150, configured to process the text data in the text data set according to the difference between the predicted tag corresponding to each text data and the tag information.
Further, in a possible implementation manner of the embodiment of the present application, as shown in fig. 7, on the basis of the embodiment shown in fig. 6, the cleaning module 150 includes:
the first determining unit 151 is configured to determine, according to a label distribution probability matrix corresponding to each text data, a predicted label corresponding to each character in each text data.
A second determining unit 152, configured to determine a confidence of each text data according to the number of characters, of which the predicted tag is different from the tag information, included in each text data.
The third determining unit 153 is configured to determine the text data with the confidence level smaller than the threshold as the error marked text data.
Further, in a possible implementation manner of the embodiment of the present application, the predicted tag corresponding to each text data is the tag distribution probability matrix corresponding to that text data, where the element in the ith row and jth column of the matrix represents the probability that the jth character of the text data belongs to the ith class of tag, i and j being natural numbers. As shown in fig. 8, on the basis of the embodiment shown in fig. 6, the cleaning module 150 includes:
a fourth determining unit 154, configured to determine a confidence threshold corresponding to each type of label according to the label distribution probability matrix and the label information corresponding to each type of text data.
In a possible implementation manner of the embodiment of the present application, the fourth determining unit 154 is specifically configured to: determining each candidate character corresponding to any type of label according to the label distribution probability matrix corresponding to each text data; in response to that the label information corresponding to any candidate character is any type of label, determining that the any candidate character is a target character corresponding to any type of label; and determining a confidence threshold corresponding to any type of label according to each target character corresponding to any type of label and the prediction probability value of each target character under any type of label.
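One way to realize the fourth determining unit 154, consistent with confident-learning practice, is to take the threshold for a tag class c as the mean predicted probability of c over the target characters of c, that is, the characters the model assigns to c whose annotation is also c. The averaging itself is an assumption; the embodiment only requires that the threshold be derived from the target characters and their prediction probability values:

import numpy as np

def class_confidence_thresholds(proba, labels, num_labels):
    # Threshold for class c: mean predicted probability of c over characters
    # predicted as c (candidate characters) whose annotation is also c
    # (target characters).
    sums = np.zeros(num_labels)
    counts = np.zeros(num_labels)
    for p, y in zip(proba, labels):
        pred = p.argmax(axis=0)
        for j, (c_pred, c_true) in enumerate(zip(pred, y)):
            if c_pred == c_true:
                sums[c_pred] += p[c_pred, j]
                counts[c_pred] += 1
    return np.divide(sums, counts, out=np.zeros(num_labels), where=counts > 0)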
A fifth determining unit 155, configured to determine, according to the confidence threshold corresponding to each type of tag and the tag distribution probability matrix corresponding to each text data, a predicted tag corresponding to each character in each text data.
A sixth determining unit 156, configured to determine the confidence of each text data according to the number of characters in the text data whose predicted tag differs from the tag information.

A seventh determining unit 157, configured to determine text data with a confidence smaller than the threshold as error-labeled text data.
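Units 155 to 157 differ from units 151 to 153 only in how the predicted tag is chosen: a tag competes for a character only when its probability clears the class threshold. A minimal sketch of that thresholded prediction, with a plain argmax fallback chosen here by assumption for characters where no tag qualifies:

import numpy as np

def thresholded_predictions(p, thresholds):
    # p: (num_tags, text_len) tag distribution probability matrix of one text.
    masked = np.where(p >= thresholds[:, None], p, -np.inf)
    pred = masked.argmax(axis=0)                     # best tag among qualifying ones
    no_winner = np.isneginf(masked.max(axis=0))      # characters where nothing qualifies
    pred[no_winner] = p[:, no_winner].argmax(axis=0) # fallback: plain argmax (assumption)
    return pred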
Further, in a possible implementation manner of the embodiment of the present application, the predicted tag corresponding to each text data is a tag distribution probability matrix corresponding to that text data, where the ith row and jth column of the tag distribution probability matrix represent the probability value that the jth character in the text data belongs to the ith class of tag, i and j being natural numbers. As shown in fig. 9, on the basis of the embodiment shown in fig. 6, the cleaning module 150 includes:
a threshold determining unit 1501, configured to determine a confidence threshold corresponding to each type of tag according to the tag distribution probability matrix and the tag information corresponding to each text data.
A count matrix determining unit 1502, configured to determine a count matrix corresponding to the text data set according to the confidence threshold corresponding to each type of tag, the tag distribution probability matrix corresponding to each text data, and the tag information, where the count value in the kth row and vth column of the count matrix represents the number of characters in the text data set whose predicted tag is k and whose tag information is v, k and v being natural numbers.
In a possible implementation manner of the embodiment of the present application, the count matrix determining unit 1502 is specifically configured to: generate an initial count matrix according to the number of tag types contained in the tag information and the number of tag types contained in the predicted tags, the value of each element in the initial count matrix being zero; and count, for each character in the text data set, the tag information, the predicted tag, and the probability value corresponding to the predicted tag, adding 1 to the count value in the kth row and vth column of the initial count matrix in response to the tag information corresponding to any character being v, its predicted tag being k, and the probability value of the predicted tag k being greater than the confidence threshold corresponding to the class-k tag.
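As a non-limiting illustration, the construction performed by the count matrix determining unit 1502 may be sketched as follows, under the conventions above (rows index the predicted tag k, columns the annotated tag v); the helper names are illustrative only:

import numpy as np

def build_count_matrix(proba, labels, thresholds, num_labels):
    # counts[k, v] = number of characters whose predicted tag is k (with the
    # probability of k above the class-k confidence threshold) and whose
    # annotated tag is v.
    counts = np.zeros((num_labels, num_labels), dtype=np.int64)  # initial count matrix, all zeros
    for p, y in zip(proba, labels):
        pred = p.argmax(axis=0)                  # predicted tag per character
        for j, (k, v) in enumerate(zip(pred, y)):
            if p[k, j] > thresholds[k]:          # only confident predictions are counted
                counts[k, v] += 1
    return counts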
The processing unit 1503 is configured to perform standardization and normalization processing on the count matrix to generate a processed matrix.
In a possible implementation manner of the embodiment of the present application, the processing unit 1503 is specifically configured to: determine the number of predicted characters corresponding to each type of label according to the count values in the count matrix; determine the number of labeled characters corresponding to each type of label according to the label information corresponding to each text data; standardize each count value in the count matrix according to the number of predicted characters and the number of labeled characters, so as to generate a standardized matrix; determine the total number of characters corresponding to the count matrix according to the count values in the count matrix; and normalize the standardized matrix according to the total number of characters, so as to generate the processed matrix.
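The embodiment fixes the inputs of this step (the per-class predicted character counts, the per-class labeled character counts, and the total character count) but not the exact formula. The sketch below therefore adopts, purely as an assumption, a confident-learning-style calibration: each annotated-tag column is rescaled to the true number of characters carrying that annotation, and the whole matrix is then divided by the total character count so its entries sum to one:

import numpy as np

def standardize_and_normalize(counts, labeled_counts):
    # labeled_counts[v] = number of characters annotated with tag v in the data set.
    # Column calibration is an assumed reading of the "standardization" step.
    labeled = np.asarray(labeled_counts, dtype=float)
    col_sums = counts.sum(axis=0).astype(float)
    scale = np.divide(labeled, col_sums, out=np.zeros_like(labeled), where=col_sums > 0)
    standardized = counts * scale[None, :]       # standardized matrix
    total = standardized.sum()
    return standardized / total if total > 0 else standardized  # normalized (joint) matrix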
An obtaining unit 1504, configured to acquire, in response to the count value in the kth row and vth column of the processed matrix being nonzero and k being different from v, each candidate character corresponding to the count value in the kth row and vth column.
A difference determining unit 1505, configured to determine, for each of the candidate characters, the difference between a first probability value that the predicted tag is k and a second probability value that the predicted tag is v.
The selecting unit 1506 is configured to select target characters from the candidate characters according to the difference values corresponding to the candidate characters, respectively.
In a possible implementation manner of the embodiment of the present application, the selecting unit 1506 is specifically configured to: determine the target number of target characters to be selected according to the product of the count value in the kth row and vth column and the total number of characters corresponding to the count matrix; and select the target number of target characters from the candidate characters in descending order of the difference values corresponding to the candidate characters.
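Units 1504 to 1506 can be illustrated together by the following sketch, which, for each off-diagonal nonzero cell (k, v) of the processed matrix, gathers the characters counted into that cell, ranks them by the margin between the probability of the predicted tag k and the probability of the annotated tag v, and keeps the top n of them, n being the cell value multiplied by the total character count; the rounding of n is an assumption:

import numpy as np

def select_target_characters(processed, proba, labels, thresholds, total_chars):
    # Returns a set of (text_index, character_index) pairs deemed likely mislabeled.
    targets = set()
    num_labels = processed.shape[0]
    for k in range(num_labels):
        for v in range(num_labels):
            if k == v or processed[k, v] <= 0:
                continue
            cands = []                           # candidate characters counted into cell (k, v)
            for i, (p, y) in enumerate(zip(proba, labels)):
                pred = p.argmax(axis=0)
                for j, (pk, yv) in enumerate(zip(pred, y)):
                    if pk == k and yv == v and p[k, j] > thresholds[k]:
                        cands.append((p[k, j] - p[v, j], i, j))   # margin p(k) - p(v)
            n = int(round(processed[k, v] * total_chars))         # target number (rounding assumed)
            cands.sort(reverse=True)                              # largest margin first
            targets.update((i, j) for _, i, j in cands[:n])
    return targets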
A determining unit 1507, configured to determine whether each text data is error-labeled text data according to the number of the target characters contained in the text data.
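The determination performed by unit 1507 then reduces to counting, per text, the target characters it contains and comparing against a cutoff; the cutoff of one target character in this sketch is an illustrative assumption:

def flag_error_labeled_texts(targets, num_texts, min_targets=1):
    # A text is treated as error-labeled once it contains at least
    # min_targets target characters (the default of 1 is an assumption).
    per_text = [0] * num_texts
    for text_idx, _ in targets:
        per_text[text_idx] += 1
    return [i for i, n in enumerate(per_text) if n >= min_targets]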
With the text data processing apparatus of the embodiment of the present application, a text data set is acquired, the text data set including a plurality of text data and tag information corresponding to each text data; the text data set is processed to obtain N training sets and N corresponding test sets, where the N training sets and the N test sets are different from each other, the N test sets together form the text data set, and N is an integer greater than 1; N recognition models are respectively trained with the N training sets; and the text data in the corresponding test sets are respectively recognized with the N recognition models to determine the prediction tag corresponding to each text data in the text data set, so that the text data in the text data set are processed according to the degree of difference between the prediction tag corresponding to each text data and the tag information. By processing the text data set into a plurality of training sets and corresponding test sets, training a plurality of recognition models, and obtaining each model's prediction tags on its corresponding test set, erroneous tags in the training text data can be quickly screened out according to the degree of difference between the prediction tags and the tag information. This improves the speed and efficiency of quality inspection and cleaning of the training text data while ensuring cleaning quality, avoids as far as possible model inaccuracy caused by error-labeled text data, and provides a guarantee for the effect of subsequent model training.
In order to implement the foregoing embodiments, the present application also provides a computer device, including: the device comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the text data processing method as set forth in any one of the previous embodiments of the application.
In order to achieve the above embodiments, the present application also proposes a non-transitory computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements a method of processing text data as proposed in any of the foregoing embodiments of the present application.
In order to implement the foregoing embodiments, the present application also proposes a computer program product, wherein when the instructions in the computer program product are executed by a processor, the processing method of text data as proposed in any of the foregoing embodiments of the present application is executed.
FIG. 10 illustrates a block diagram of an exemplary computer device suitable for use in implementing embodiments of the present application. The computer device 12 shown in fig. 10 is only an example and should not impose any limitation on the function and scope of use of the embodiments of the present application.
As shown in FIG. 10, computer device 12 is embodied in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. Computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 10, and commonly referred to as a "hard drive"). Although not shown in FIG. 10, a disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact Disc Read-Only Memory (CD-ROM), a Digital Versatile Disc Read-Only Memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally perform the functions and/or methodologies of the embodiments described herein.
Computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with computer device 12, and/or with any devices (e.g., network card, modem, etc.) that enable computer device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Moreover, computer device 12 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public Network such as the Internet) via Network adapter 20. As shown in FIG. 10, the network adapter 20 communicates with the other modules of the computer device 12 via the bus 18. It should be appreciated that although not shown in FIG. 10, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes programs stored in the system memory 28 so as to perform various functional applications and data processing, for example, implementing the text data processing method mentioned in the foregoing embodiments.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.

Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (19)

1. A method for processing text data, comprising:
acquiring a text data set, wherein the text data set comprises a plurality of text data and label information corresponding to each text data;
processing the text data set to obtain N training sets and N corresponding test sets, wherein the N training sets and the N test sets are different from each other, the N test sets form the text data set, and N is an integer greater than 1;
respectively training N recognition models by utilizing the N training sets;
respectively identifying the text data in the corresponding test set by using each recognition model to determine a prediction label corresponding to each text data in the text data set;
and processing the text data in the text data set according to the difference degree between the prediction label corresponding to each text data and the label information.
2. The method of claim 1, wherein the processing the text data in the text data set according to the degree of difference between the prediction label corresponding to each text data and the label information comprises:
determining a prediction label corresponding to each character in each text data according to a label distribution probability matrix corresponding to each text data;
determining the confidence of each text data according to the number of characters in each text data whose prediction label differs from the label information; and
determining text data with a confidence smaller than a threshold as error-labeled text data.
3. The method of claim 1, wherein the predicted label corresponding to each text data is a label distribution probability matrix corresponding to the text data, wherein the ith row and the jth column of the label distribution probability matrix represent the probability value that the jth character in the text data belongs to the ith class of label, i and j are natural numbers respectively, and the processing the text data in the text data set according to the degree of difference between the predicted label corresponding to each text data and the label information comprises:
determining a confidence threshold corresponding to each type of label according to the label distribution probability matrix corresponding to each text data and the label information;
determining a predicted label corresponding to each character in each text data according to the confidence threshold corresponding to each type of label and the label distribution probability matrix corresponding to each text data;
determining the confidence of each text data according to the number of characters in each text data whose predicted label differs from the label information; and
determining text data with a confidence smaller than a threshold as error-labeled text data.
4. The method of claim 3, wherein determining the confidence threshold corresponding to each type of label according to the label distribution probability matrix and the label information corresponding to each text data comprises:
determining each candidate character corresponding to any type of label according to the label distribution probability matrix corresponding to each text data;
in response to the label information corresponding to any candidate character being that type of label, determining that the candidate character is a target character corresponding to that type of label; and
and determining a confidence threshold corresponding to any type of label according to each target character corresponding to any type of label and the prediction probability value of each target character under any type of label.
5. The method of claim 1, wherein the predicted label corresponding to each text data is a label distribution probability matrix corresponding to the text data, wherein the ith row and the jth column of the label distribution probability matrix represent the probability value that the jth character in the text data belongs to the ith class of label, i and j are natural numbers respectively, and the processing the text data in the text data set according to the degree of difference between the predicted label corresponding to each text data and the label information comprises:
determining a confidence threshold corresponding to each type of label according to the label distribution probability matrix corresponding to each text data and the label information;
determining a count matrix corresponding to the text data set according to the confidence threshold corresponding to each type of label, the label distribution probability matrix corresponding to each text data, and the label information, wherein the count value in the kth row and vth column of the count matrix represents the number of characters in the text data set whose predicted label is k and whose label information is v, k and v being natural numbers;
standardizing and normalizing the count matrix to generate a processed matrix;
in response to the count value in the kth row and vth column of the processed matrix being nonzero and k being different from v, acquiring each candidate character corresponding to the count value in the kth row and vth column;
determining, for each candidate character in the candidate characters, a difference value between a first probability value that the predicted label is k and a second probability value that the predicted label is v;
selecting target characters from the candidate characters according to the difference values corresponding to the candidate characters; and
determining whether each text data is error-labeled text data according to the number of the target characters contained in the text data.
6. The method of claim 5, wherein determining a count matrix corresponding to the text data set according to the confidence threshold corresponding to each type of tag, the tag distribution probability matrix corresponding to each text data and the tag information comprises:
generating an initial count matrix according to the number of label types contained in the label information and the number of label types contained in the predicted labels, wherein the value of each element in the initial count matrix is zero; and
counting, for each character in the text data set, the label information, the predicted label, and the probability value corresponding to the predicted label, and adding 1 to the count value in the kth row and vth column of the initial count matrix in response to the label information corresponding to any character being v, the predicted label being k, and the probability value of the predicted label k being greater than the confidence threshold corresponding to the class-k label.
7. The method of claim 6, wherein the standardizing and normalizing the count matrix to generate a processed matrix comprises:
determining the number of predicted characters corresponding to each type of label according to each count value in the count matrix;
determining the number of labeled characters corresponding to each type of label according to the label information corresponding to each text data;
standardizing each count value in the count matrix according to the number of predicted characters and the number of labeled characters, so as to generate a standardized matrix;
determining the total number of characters corresponding to the count matrix according to each count value in the count matrix; and
normalizing the standardized matrix according to the total number of characters, so as to generate the processed matrix.
8. The method according to any one of claims 5-7, wherein the selecting target characters from the candidate characters according to the difference values corresponding to the candidate characters comprises:
determining the target number of target characters to be selected according to the product of the count value in the kth row and vth column and the total number of characters corresponding to the count matrix; and
selecting the target number of target characters from the candidate characters in descending order of the difference values corresponding to the candidate characters.
9. A processing apparatus of text data, comprising:
the system comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring a text data set, and the text data set comprises a plurality of text data and label information corresponding to each text data;
the splitting module is used for processing the text data set to obtain N training sets and N corresponding test sets, wherein the N training sets and the N test sets are different from each other, the N test sets form the text data set, and N is an integer greater than 1;
the training module is used for respectively training N recognition models by utilizing the N training sets;
the recognition module is used for recognizing the text data in the corresponding test set by respectively utilizing each recognition model, so as to determine a prediction label corresponding to each text data in the text data set; and
and the cleaning module is used for processing the text data in the text data set according to the difference degree between the predicted label corresponding to each text data and the label information.
10. The apparatus of claim 9, wherein the cleaning module comprises:
the first determining unit is used for determining a prediction label corresponding to each character in each text data according to a label distribution probability matrix corresponding to each text data;
a second determining unit, configured to determine the confidence of each text data according to the number of characters in the text data whose prediction label differs from the label information; and
a third determining unit, configured to determine text data with a confidence smaller than a threshold as error-labeled text data.
11. The apparatus of claim 9, wherein the predicted tag corresponding to each text data is a tag distribution probability matrix corresponding to the text data, wherein the ith row and jth column of the tag distribution probability matrix represent the probability value that the jth character in the text data belongs to the ith class of tag, i and j are natural numbers respectively, and the cleaning module comprises:
a fourth determining unit, configured to determine a confidence threshold corresponding to each type of tag according to the tag distribution probability matrix corresponding to each text data and the tag information;
a fifth determining unit, configured to determine, according to the confidence threshold corresponding to each type of tag and the tag distribution probability matrix corresponding to each text data, the predicted tag corresponding to each character in each text data;
a sixth determining unit, configured to determine the confidence of each text data according to the number of characters in the text data whose predicted tag differs from the tag information; and
a seventh determining unit, configured to determine text data with a confidence smaller than a threshold as error-labeled text data.
12. The apparatus of claim 11, wherein the fourth determining unit is specifically configured to:
determining each candidate character corresponding to any type of label according to the label distribution probability matrix corresponding to each text data;
in response to the label information corresponding to any candidate character being that type of label, determining that the candidate character is a target character corresponding to that type of label; and
and determining a confidence threshold corresponding to any type of label according to each target character corresponding to any type of label and the prediction probability value of each target character under any type of label.
13. The apparatus of claim 9, wherein the predicted tag corresponding to each text data is a tag distribution probability matrix corresponding to the text data, wherein the ith row and jth column of the tag distribution probability matrix represent the probability value that the jth character in the text data belongs to the ith class of tag, i and j are natural numbers respectively, and the cleaning module comprises:
a threshold determining unit, configured to determine a confidence threshold corresponding to each type of tag according to the tag distribution probability matrix corresponding to each text data and the tag information;
a count matrix determining unit, configured to determine a count matrix corresponding to the text data set according to the confidence threshold corresponding to each type of tag, the tag distribution probability matrix corresponding to each text data, and the tag information, wherein the count value in the kth row and vth column of the count matrix represents the number of characters in the text data set whose predicted tag is k and whose tag information is v, k and v being natural numbers;
a processing unit, configured to standardize and normalize the count matrix to generate a processed matrix;
an obtaining unit, configured to acquire, in response to the count value in the kth row and vth column of the processed matrix being nonzero and k being different from v, each candidate character corresponding to the count value in the kth row and vth column;
a difference determining unit, configured to determine, for each of the candidate characters, a difference value between a first probability value that the predicted tag is k and a second probability value that the predicted tag is v;
a selecting unit, configured to select target characters from the candidate characters according to the difference values corresponding to the candidate characters; and
a determining unit, configured to determine whether each text data is error-labeled text data according to the number of the target characters contained in the text data.
14. The apparatus of claim 13, wherein the count matrix determination unit is specifically configured to:
generating an initial count matrix according to the number of tag types contained in the tag information and the number of tag types contained in the predicted tags, wherein the value of each element in the initial count matrix is zero; and
counting, for each character in the text data set, the tag information, the predicted tag, and the probability value corresponding to the predicted tag, and adding 1 to the count value in the kth row and vth column of the initial count matrix in response to the tag information corresponding to any character being v, the predicted tag being k, and the probability value of the predicted tag k being greater than the confidence threshold corresponding to the class-k tag.
15. The apparatus as claimed in claim 14, wherein said processing unit is specifically configured to:
determining the number of predicted characters corresponding to each type label according to each count value in the count matrix;
determining the number of the labeled characters corresponding to each type label according to the label information corresponding to each text data;
standardizing each count value in the count matrix according to the number of predicted characters and the number of labeled characters, so as to generate a standardized matrix;
determining the total number of characters corresponding to the count matrix according to each count value in the count matrix; and
normalizing the standardized matrix according to the total number of characters, so as to generate the processed matrix.
16. The apparatus according to any of claims 13 to 15, wherein the selection unit is specifically configured to:
determining the target number of target characters to be selected according to the product of the count value in the kth row and vth column and the total number of characters corresponding to the count matrix; and
selecting the target number of target characters from the candidate characters in descending order of the difference values corresponding to the candidate characters.
17. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a method of processing text data as claimed in any one of claims 1 to 8 when executing the program.
18. A non-transitory computer-readable storage medium on which a computer program is stored, the program, when being executed by a processor, implementing a method for processing text data according to any one of claims 1 to 8.
19. A computer program product, characterized in that instructions in the computer program product, when executed by a processor, perform a method of processing text data according to any one of claims 1-8.
Priority Applications (1)

Application number: CN202110381793.6A; priority date: 2021-04-09; filing date: 2021-04-09; title: Text data processing method and device, computer equipment and storage medium; legal status: Pending

Publications (1)

Publication number: CN113807096A

Family ID: 78892985

Family Applications (1): CN202110381793.6A, filed 2021-04-09, pending

Country Status (1): CN



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination