CN114792085B

CN114792085B - Data processing system for error correction of label text

Info

Publication number: CN114792085B
Application number: CN202210710576.1A
Authority: CN
Inventors: 张正义; 林方; 刘宸; 傅晓航
Original assignee: Zhongke Yuchen Technology Co Ltd
Current assignee: Zhongke Yuchen Technology Co Ltd
Priority date: 2022-06-22
Filing date: 2022-06-22
Publication date: 2022-09-16
Anticipated expiration: 2042-06-22
Also published as: CN114792085A

Abstract

The invention relates to a data processing system for error correction of annotated text, comprising a database, a processor, and a memory storing a computer program which, when executed by the processor, performs the following steps: when the number of annotated texts is smaller than a text number threshold, taking any one annotated text as a test set and the text set corresponding to that annotated text as a training set; when the number of annotated texts is not smaller than the text number threshold, dividing the annotated text list into a plurality of intermediate annotated text lists, and taking any one intermediate annotated text list as a test set and the text set corresponding to that list as a training set; and training a preset model on the training set, so that abnormal texts and all abnormal annotations corresponding to them are determined based on the trained preset model and the test set. Abnormal texts can thus be determined quickly and accurately, and proofreaders only need to proofread the abnormal texts, which reduces the workload and improves the efficiency of text proofreading.

Description

Data processing system for error correction of label text

Technical Field

The invention relates to the technical field of text error correction, and in particular to a data processing system for error correction of annotated text.

Background

Currently, the process of labeling text includes annotators labeling the text and proofreaders proofreading the labeled text. When the number of texts is large, both annotators and proofreaders must do a large amount of work, so working efficiency is low and personnel cost is high.

In the prior art, a text error correction model is adopted to correct errors of marked texts, but the error correction accuracy of the text error correction model is low, and meanwhile, each marked text needs to be corrected, which results in low working efficiency.

Meanwhile, for errors that often occur in text, such as missing letters in English words or wrong characters in the names of people and places, an annotator may not realize that the annotation is wrong, which increases the proofreader's workload and lowers working efficiency.

Disclosure of Invention

In order to solve the above technical problems, the technical solution adopted by the present invention is a data processing system for error correction of annotated text, the system comprising: a database, a processor, and a memory storing a computer program, wherein the database comprises an annotated text list $A = \{A_1, \ldots, A_i, \ldots, A_m\}$, where $A_i$ is the $i$-th annotated text, $i = 1, \ldots, m$, and $m$ is the number of annotated texts; when the computer program is executed by the processor, the following steps are implemented:

S100, when $m$ is less than a preset text number threshold $m_0$, obtaining a first specified text set $G = \{G_1, \ldots, G_i, \ldots, G_m\}$ corresponding to $A$, where the $i$-th first specified text set is $G_i = \{A_i, B_i\}$, the first text list corresponding to $A_i$ is $B_i = \{B_{i1}, \ldots, B_{ir}, \ldots, B_{is}\}$, $B_{ir}$ is the $r$-th first text, $r = 1, \ldots, s$, and $s$ is the number of first texts; $A_i$ is taken as the $i$-th first target test set in $G$ and $B_i$ as the $i$-th first target training set in $G$;

S200, when $m \geq m_0$, obtaining from $A$ an intermediate text set $D = \{D_1, \ldots, D_j, \ldots, D_n\}$, where $D_j = \{D_{j1}, \ldots, D_{jt}, \ldots, D_{jk}\}$, $D_{jt}$ is the $t$-th intermediate text in the $j$-th intermediate text list, $j = 1, \ldots, n$, $n$ is the number of intermediate text lists, $t = 1, \ldots, k$, and $k$ is the number of intermediate texts in any intermediate text list, wherein $n$ satisfies the following condition:

；

S300, obtaining a second specified text set $G' = \{G'_1, \ldots, G'_j, \ldots, G'_n\}$ corresponding to $A$, where the $j$-th second specified text set is $G'_j = \{D_j, C_j\}$, the second text set corresponding to $D_j$ is $C_j = \{C_{j1}, \ldots, C_{jq}, \ldots, C_{jp}\}$, $C_{jq}$ is the $q$-th second text list, $q = 1, \ldots, p$, and $p$ is the number of second text lists; $D_j$ is taken as the $j$-th second target test set in $G'$ and $C_j$ as the $j$-th second target training set in $G'$;

S400, obtaining a target training set, training a preset text error correction model on the target training set to obtain a target text error correction model, and inputting a target test set into the target text error correction model to obtain the abnormal texts corresponding to $A$, wherein the target training set is a first target training set or a second target training set, the target test set is a first target test set or a second target test set, and the target test set corresponds to the target training set;

S500, obtaining an abnormal text list $H = \{H_1, \ldots, H_g, \ldots, H_z\}$ corresponding to $A$, where $H_g$ is the $g$-th abnormal text, $g = 1, \ldots, z$, and $z$ is the number of abnormal texts, and performing text error correction on $H_g$ to obtain all abnormal annotations corresponding to $H_g$.

Compared with the prior art, the invention has obvious advantages and beneficial effects. By the technical scheme, the data processing system for correcting the error of the label text can achieve considerable technical progress and practicability, has wide industrial utilization value and at least has the following advantages:

The data processing system for error correction of annotated text comprises a database, a processor, and a memory storing a computer program, wherein the database comprises an annotated text list, and the computer program, when executed by the processor, implements the following steps: when the number of annotated texts in the annotated text list is smaller than a preset text number threshold, taking any one annotated text as a test set and the text set corresponding to that annotated text as a training set; when the number of annotated texts in the annotated text list is not smaller than the preset text number threshold, dividing the annotated text list into a plurality of intermediate annotated text lists, each containing the same number of annotated texts, and taking any one intermediate annotated text list as a test set and the text set corresponding to that list as a training set; training the preset model on the training set, obtaining abnormal texts based on the trained preset model and the test set, and performing text error correction on the abnormal texts to obtain all abnormal annotations corresponding to them. Abnormal texts can thus be determined quickly and accurately, and proofreaders only need to proofread the abnormal texts, which reduces the workload and improves the efficiency of text proofreading.

In addition, different methods are used to obtain the similarity depending on whether the entity in the abnormal text is a Chinese entity or an English entity, so that the similarity is determined accurately and the erroneous annotation in the abnormal text can be flagged.

The foregoing description is only an overview of the technical solutions of the present invention, and in order to make the technical means of the present invention more clearly understood, the present invention may be implemented in accordance with the content of the description, and in order to make the above and other objects, features, and advantages of the present invention more clearly understood, the following preferred embodiments are described in detail with reference to the accompanying drawings.

Drawings

FIG. 1 is a flowchart illustrating steps executed by a data processing system for error correction of annotated texts according to an embodiment of the present invention.

Detailed Description

To further illustrate the technical means adopted by the present invention to achieve its intended objects, and their effects, a data processing system for error correction of annotated text and its effects are described in detail below with reference to the accompanying drawings and preferred embodiments.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in other sequences than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Examples

The embodiment provides a data processing system for error correction of annotated text, which comprises: a database, a processor, and a memory storing a computer program, wherein the database comprises an annotated text list $A = \{A_1, \ldots, A_i, \ldots, A_m\}$, where $A_i$ is the $i$-th annotated text, $i = 1, \ldots, m$, and $m$ is the number of annotated texts; when the computer program is executed by the processor, the following steps are implemented, as shown in FIG. 1:

S100, when $m$ is less than a preset text number threshold $m_0$, obtaining a first specified text set $G = \{G_1, \ldots, G_i, \ldots, G_m\}$ corresponding to $A$, where the $i$-th first specified text set is $G_i = \{A_i, B_i\}$, the first text list corresponding to $A_i$ is $B_i = \{B_{i1}, \ldots, B_{ir}, \ldots, B_{is}\}$, $B_{ir}$ is the $r$-th first text, $r = 1, \ldots, s$, and $s$ is the number of first texts; $A_i$ is taken as the $i$-th first target test set in $G$ and $B_i$ as the $i$-th first target training set in $G$.

Specifically, the annotation text refers to an annotated text.

Specifically, $m_0$ ranges from 10 to 50; preferably, $m_0$ is 30.

Specifically, a first text in $B_i$ is any annotated text in $A$ other than $A_i$.

Specifically, s satisfies the following condition: s = m-1.
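The split in step S100 amounts to a leave-one-out scheme: each annotated text in turn becomes a one-item test set, and the remaining m-1 texts form its training list. A minimal Python sketch follows, assuming annotated texts can be represented as plain strings; the function name `leave_one_out_splits` is hypothetical and does not come from the patent.

```python
from typing import List, Tuple


def leave_one_out_splits(annotated_texts: List[str]) -> List[Tuple[str, List[str]]]:
    """Build G = {G_1, ..., G_m}, where G_i pairs test text A_i with training list B_i."""
    splits = []
    for i, test_text in enumerate(annotated_texts):
        # B_i holds every annotated text in A except A_i, so len(B_i) == m - 1.
        training = annotated_texts[:i] + annotated_texts[i + 1:]
        splits.append((test_text, training))
    return splits


# Illustrative data only; real entries would be annotated texts from the database.
A = ["text-1", "text-2", "text-3"]
G = leave_one_out_splits(A)
```

Because every text is held out exactly once, the number of training runs equals m, which may be why the scheme is reserved for small lists (m below the threshold).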

。

Specifically, in step S200, $A$ is divided to generate the intermediate text lists, which can be understood as follows: $k$ annotated texts are randomly selected from $A$ to construct each intermediate text list, and no annotated text appears in more than one intermediate text list.

S300, obtaining a second specified text set $G' = \{G'_1, \ldots, G'_j, \ldots, G'_n\}$ corresponding to $A$, where the $j$-th second specified text set is $G'_j = \{D_j, C_j\}$, the second text set corresponding to $D_j$ is $C_j = \{C_{j1}, \ldots, C_{jq}, \ldots, C_{jp}\}$, $C_{jq}$ is the $q$-th second text list, $q = 1, \ldots, p$, and $p$ is the number of second text lists; $D_j$ is taken as the $j$-th second target test set in $G'$ and $C_j$ as the $j$-th second target training set in $G'$.

Specifically, a second text list in $C_j$ is any intermediate text list in $D$ other than $D_j$.

Specifically, p satisfies the following condition: p = n-1.
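Steps S200 and S300 together resemble a fold-based split: A is randomly partitioned into disjoint lists of k texts, and each list in turn serves as the test set while the other n-1 lists form the training set. The sketch below is illustrative only. The source's condition on n is not reproduced in this text, so the code assumes m is a multiple of k and n = m / k; the function names and the fixed random seed are likewise hypothetical.

```python
import random
from typing import List, Tuple


def partition_into_lists(texts: List[str], k: int, seed: int = 0) -> List[List[str]]:
    """Randomly split A into disjoint intermediate text lists D_1..D_n of k texts each."""
    if k <= 0 or len(texts) % k != 0:
        # Assumption: equal-sized lists, so m must be a multiple of k.
        raise ValueError("number of texts must be a positive multiple of k")
    shuffled = texts[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[j:j + k] for j in range(0, len(shuffled), k)]


def fold_splits(D: List[List[str]]) -> List[Tuple[List[str], List[List[str]]]]:
    """Build G' = {G'_1, ..., G'_n}, where G'_j pairs test list D_j with C_j (all other lists)."""
    return [(D[j], D[:j] + D[j + 1:]) for j in range(len(D))]


A = [f"text-{i}" for i in range(1, 13)]  # m = 12 illustrative texts
D = partition_into_lists(A, k=4)         # n = 3 lists of k = 4 texts
G_prime = fold_splits(D)
```

Each C_j here contains p = n-1 lists, matching the condition stated above.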

S400, obtaining a target training set, training a preset text error correction model on the target training set to obtain a target text error correction model, and inputting a target test set into the target text error correction model to obtain the abnormal texts corresponding to $A$, wherein the target training set is a first target training set or a second target training set, the target test set is a first target test set or a second target test set, and the target test set corresponds to the target training set. That is, when training is based on a first target training set, only the corresponding first target test set is tested, and when training is based on a second target training set, only the corresponding second target test set is tested; this keeps the training set and the test set consistent and helps determine abnormal texts accurately.

Specifically, a person skilled in the art may adopt any text error correction model as the preset text error correction model. Preferably, the preset text error correction model is a neural network model, which is any one of CNN, LSTM, AlexNet, ZFNet, VGGNet, GoogLeNet, ResNet, UNet, SRCNN and BiLSTM-CRF.

Further, the training process of a neural network model is known to those skilled in the art and is not described here.

Specifically, the step S400 further includes the steps of:

S401, when $m < m_0$, traversing $G$ and, according to $G_i$ and the target text error correction model, judging whether each $A_i$ meets a preset text abnormality condition; the condition can be set by a person skilled in the art as required and is not described in detail here.

S403, when $A_i$ meets the preset text abnormality condition, determining that $A_i$ is an abnormal text.

S405, when $m \geq m_0$, traversing $G'$ and, according to $G'_j$ and the target text error correction model, judging whether each $D_{jt}$ in $D_j$ meets the preset text abnormality condition; the condition can be set by a person skilled in the art as required and is not described again here.

S407, when $D_{jt}$ meets the preset text abnormality condition, determining that $D_{jt}$ is an abnormal text.

In this way, one annotated text or one group of annotated texts in the annotated text list is selected as the test set and the other annotated texts are used as the training set to train the text error correction model, so that abnormal texts can be determined quickly and accurately and proofreaders only need to proofread the abnormal texts, which reduces the workload and improves the efficiency of text proofreading.
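The traversal in steps S401 to S407 can be sketched as a loop over (test set, training set) pairs. The patent leaves both the model and the abnormality condition open, so everything concrete below is an assumption made for illustration: annotated texts are modeled as (text, label) pairs, the "model" is a toy majority-vote lookup rather than a neural network, and a text counts as abnormal when the model knows the text and predicts a different label. Training sets are passed as flat lists, so for the case where m is not less than the threshold the lists in C_j would first be concatenated.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Annotated = Tuple[str, str]  # (text, label): hypothetical stand-in for one annotated text
Model = Dict[str, str]       # toy model: text -> most frequent label seen in training


def train_majority_model(training: Sequence[Annotated]) -> Model:
    """Toy replacement for the preset model: remember the majority label per text."""
    votes: Dict[str, Counter] = defaultdict(Counter)
    for text, label in training:
        votes[text][label] += 1
    return {text: counts.most_common(1)[0][0] for text, counts in votes.items()}


def disagrees(model: Model, item: Annotated) -> bool:
    """Assumed abnormality condition: the model predicts a different label."""
    text, label = item
    return text in model and model[text] != label


def find_abnormal_texts(
    splits: Sequence[Tuple[Sequence[Annotated], Sequence[Annotated]]],
    train: Callable[[Sequence[Annotated]], Model],
    is_abnormal: Callable[[Model, Annotated], bool],
) -> List[Annotated]:
    """Train on each training set and test only the matching test set."""
    abnormal: List[Annotated] = []
    for test_set, training_set in splits:
        model = train(training_set)
        abnormal.extend(item for item in test_set if is_abnormal(model, item))
    return abnormal


A = [("Paris", "PLACE"), ("Paris", "PLACE"), ("Paris", "PLACE"),
     ("Paris", "PERSON"), ("Lin Fang", "PERSON")]
# Leave-one-out splits, as in the case where m is below the threshold.
splits = [([A[i]], A[:i] + A[i + 1:]) for i in range(len(A))]
H = find_abnormal_texts(splits, train_majority_model, disagrees)
```

With this toy data, only the "Paris" item labeled PERSON is flagged, because the other three copies outvote it whenever it is held out.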

S500, obtaining an abnormal text list $H = \{H_1, \ldots, H_g, \ldots, H_z\}$ corresponding to $A$, where $H_g$ is the $g$-th abnormal text corresponding to $A$, $g = 1, \ldots, z$, and $z$ is the number of abnormal texts corresponding to $A$, and performing text error correction on $H_g$ to obtain all abnormal annotations corresponding to $H_g$.

Specifically, the database further includes an entity type set $L = \{L^1, \ldots, L^y, \ldots, L^w\}$, where $L^y$ is the entity list corresponding to the $y$-th entity type, $y = 1, \ldots, w$, and $w$ is the number of entity types corresponding to the texts. An entity type can be understood as an ontology, and ontologies are of multiple types, such as persons, place names, toys and the like.

Specifically, the step S500 further includes the steps of:

S501, obtaining the annotated entity list $U^g = \{U^g_1, \ldots, U^g_x, \ldots, U^g_{\beta_g}\}$ corresponding to $H_g$, where $U^g_x$ is the $x$-th annotated entity, $x = 1, \ldots, \beta_g$, and $\beta_g$ is the number of entities annotated in the $g$-th abnormal text.

S503, according to the entity type corresponding to $U^g_x$, obtaining from $L$ the entity list $L^y = \{L^y_1, \ldots, L^y_e, \ldots, L^y_{v_y}\}$ corresponding to $U^g_x$, where $L^y_e$ is the $e$-th entity in $L^y$, $e = 1, \ldots, v_y$, and $v_y$ is the number of entities in $L^y$.

S505, according to $U^g_x$ and $L^y_e$, obtaining the target similarity $F^g_x$ of $U^g_x$.

Specifically, the step S505 further includes the steps of:

S5051, when $U^g_x$ is a Chinese entity, obtaining from $L^y$ the Chinese entity list $T^y = \{T^y_1, \ldots, T^y_a, \ldots, T^y_{b_y}\}$ corresponding to $L^y$, where $T^y_a$ is the $a$-th Chinese entity in $L^y$, $a = 1, \ldots, b_y$, and $b_y$ is the number of Chinese entities in $L^y$.

S5053, according to $U^g_x$ and $T^y_a$, obtaining the similarity list $E^{gy}_x = \{E^{gy}_{x1}, \ldots, E^{gy}_{xa}, \ldots, E^{gy}_{xb_y}\}$ between $U^g_x$ and $T^y$, and taking the maximum similarity in $E^{gy}_x$ as $F^g_x$, where $E^{gy}_{xa}$ is the similarity between $U^g_x$ and $T^y_a$ and satisfies the following condition:

wherein $MK^{gx}_\gamma$ is the $\gamma$-th value of the vector $MK^{gx}$ corresponding to $U^g_x$, and $NK^{ya}_\gamma$ is the $\gamma$-th value of the vector $NK^{ya}$ corresponding to $T^y_a$. Preferably, $MK^{gx}$ and $NK^{ya}$ are both 768-dimensional vectors, i.e., $\Phi = 768$. Methods for obtaining the vector corresponding to an entity are known to those skilled in the art and are not described here.
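The formula for the Chinese-entity similarity is not reproduced in this text. Since it is defined over the components of two equal-length entity vectors, one common choice would be cosine similarity, which equals 1 for vectors pointing in the same direction. The sketch below uses cosine similarity purely as an assumption; it should not be read as the patent's formula, and the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(mk: Sequence[float], nk: Sequence[float]) -> float:
    """Cosine similarity between two entity vectors (assumed, not taken from the source)."""
    if len(mk) != len(nk):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(mk, nk))
    norm = math.sqrt(sum(a * a for a in mk)) * math.sqrt(sum(b * b for b in nk))
    return dot / norm if norm else 0.0


def max_similarity(entity_vec: Sequence[float],
                   candidate_vecs: Sequence[Sequence[float]]) -> float:
    """Take the maximum over the similarity list, as step S5053 does to obtain F^g_x."""
    return max(cosine_similarity(entity_vec, v) for v in candidate_vecs)


# Two-dimensional toy vectors; the source prefers 768-dimensional ones.
best = max_similarity([1.0, 0.0], [[0.0, 1.0], [1.0, 1.0]])
```

In practice the vectors would come from an embedding model of the implementer's choice, which the source leaves to the skilled person.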

S5055, when $U^g_x$ is a non-Chinese entity, obtaining from $L^y$ the non-Chinese entity list $R^y = \{R^y_1, \ldots, R^y_c, \ldots, R^y_{d_y}\}$ corresponding to $L^y$, where $R^y_c$ is the $c$-th non-Chinese entity in $L^y$, $c = 1, \ldots, d_y$, and $d_y$ is the number of non-Chinese entities in $L^y$.

S5057, according to $U^g_x$ and $R^y$, obtaining the similarity list $F^{gy}_x = \{F^{gy}_{x1}, \ldots, F^{gy}_{xc}, \ldots, F^{gy}_{xd_y}\}$ between $U^g_x$ and $R^y$, and taking the maximum similarity in $F^{gy}_x$ as $F^g_x$, where $F^{gy}_{xc}$ is the similarity between $U^g_x$ and $R^y_c$ and satisfies the following condition:

wherein $\lambda^{gy}_{xc}$ is the edit distance between $U^g_x$ and $R^y_c$, and $\eta^{gy}_{xc}$ is the larger of the number of characters in $U^g_x$ and the number of characters in $R^y_c$.
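The formula for the non-Chinese similarity is also missing from this text. Given that it is defined in terms of an edit distance λ and a maximum character count η, a natural candidate is the normalized form 1 - λ/η, which yields 1 for identical strings. The sketch below assumes that form and assumes Levenshtein distance as the edit distance; neither is confirmed by the source.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(u: str, r: str) -> float:
    """Assumed similarity 1 - lambda / eta, with eta the longer string's length."""
    eta = max(len(u), len(r))
    if eta == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(u, r) / eta


# A dropped letter, the kind of English-word error the background section mentions.
score = edit_similarity("Londn", "London")
```

Under this assumed form, a single missing letter in a six-letter name gives a similarity of 5/6, below the threshold of 1 that marks an exact match.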

Specifically, through step S505, different methods are used to obtain the similarity depending on whether the entity in the abnormal text is a Chinese entity or an English entity, so that the similarity is determined accurately and the erroneous annotation in the abnormal text can be flagged.
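Steps S5051 and S5055 each draw a sub-list (Chinese or non-Chinese) from the entity list of the matching type. The source does not say how an entity is recognized as Chinese; the sketch below assumes an entity is Chinese if it contains at least one character in the CJK Unified Ideographs block (U+4E00 to U+9FFF). The helper names are hypothetical.

```python
import re
from typing import List, Tuple

# Assumption: "Chinese entity" means the string contains a CJK Unified Ideograph.
_CJK = re.compile(r"[\u4e00-\u9fff]")


def is_chinese_entity(entity: str) -> bool:
    return bool(_CJK.search(entity))


def split_entity_list(entity_list: List[str]) -> Tuple[List[str], List[str]]:
    """Split L^y into the Chinese list T^y and the non-Chinese list R^y."""
    chinese = [e for e in entity_list if is_chinese_entity(e)]
    non_chinese = [e for e in entity_list if not is_chinese_entity(e)]
    return chinese, non_chinese


L_y = ["北京", "Beijing", "上海", "Shanghai"]  # illustrative place-name entities
T_y, R_y = split_entity_list(L_y)
```

An annotated entity would then be compared against T_y with the vector-based similarity or against R_y with the edit-distance-based similarity, according to the same test.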

S507, when $F^g_x = F_0$, determining that $U^g_x$ is a non-abnormal annotation, where $F_0$ is a preset first similarity threshold and $F_0 = 1$.

Specifically, a non-abnormal annotation is a correct annotation in the text.

S509, when $F^g_x \neq F_0$, marking $U^g_x$ to indicate that $U^g_x$ is an abnormal annotation.

Specifically, an abnormal annotation is an erroneous annotation in the text.

Specifically, the step S509 further includes the steps of:

S5091, when $F^g_x > F'_0$, marking $U^g_x$ to indicate that $U^g_x$ is an abnormal annotation, and marking the entity in $L^y$ that corresponds to $F^g_x$ in the text as a reference entity, so that the abnormal annotation is flagged and a relatively correct reference entity is provided, which makes it easier for the annotator to correct the error.

S5093, when $F^g_x \leq F'_0$, marking $U^g_x$ to indicate that $U^g_x$ is an abnormal annotation.
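The decision logic of steps S507 to S5093 can be summarized as a three-way branch on the target similarity. In the sketch below, the first threshold is fixed at 1 as stated in the source, while the value 0.8 for the second threshold is purely a placeholder: the source gives no value for it. The function name and the string labels are also hypothetical.

```python
import math
from typing import Optional, Tuple

F0 = 1.0        # first similarity threshold; the source fixes it at 1
F0_PRIME = 0.8  # placeholder for the second threshold; no value is given in the source


def classify_annotation(similarity: float,
                        best_match: str,
                        f0: float = F0,
                        f0_prime: float = F0_PRIME) -> Tuple[str, Optional[str]]:
    """Return (status, reference_entity) for one annotated entity."""
    if math.isclose(similarity, f0):
        # S507: an exact match in the entity list means the annotation is correct.
        return ("non-abnormal", None)
    if similarity > f0_prime:
        # S5091: close to a known entity, so flag it and suggest that entity.
        return ("abnormal", best_match)
    # S5093: too far from any known entity to suggest a reference.
    return ("abnormal", None)


exact = classify_annotation(1.0, "London")
near = classify_annotation(5 / 6, "London")
far = classify_annotation(0.4, "London")
```

The comparison with the first threshold uses `math.isclose` rather than `==` because a floating-point similarity computed from vectors may land a hair below 1 for identical entities.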

The embodiment provides a data processing system for error correction of annotated text, which comprises a database, a processor, and a memory storing a computer program, wherein the database comprises an annotated text list, and the computer program, when executed by the processor, implements the following steps: when the number of annotated texts in the annotated text list is smaller than a preset text number threshold, taking any one annotated text as a test set and the text set corresponding to that annotated text as a training set; when the number of annotated texts in the annotated text list is not smaller than the preset text number threshold, dividing the annotated text list into a plurality of intermediate annotated text lists, each containing the same number of annotated texts, and taking any one intermediate annotated text list as a test set and the text set corresponding to that list as a training set; training the preset model on the training set, obtaining abnormal texts based on the trained preset model and the test set, and performing text error correction on the abnormal texts to obtain all abnormal annotations corresponding to them. Abnormal texts can thus be determined quickly and accurately, and proofreaders only need to proofread the abnormal texts, which reduces the workload and improves the efficiency of text proofreading.

Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A data processing system for error correction of annotated text, the system comprising: a database, a processor, and a memory storing a computer program, wherein the database comprises an annotated text list $A = \{A_1, \ldots, A_i, \ldots, A_m\}$, where $A_i$ is the $i$-th annotated text, $i = 1, \ldots, m$, and $m$ is the number of annotated texts; when the computer program is executed by the processor, the following steps are implemented:

S100, when $m$ is less than a preset text number threshold $m_0$, obtaining a first specified text set $G = \{G_1, \ldots, G_i, \ldots, G_m\}$ corresponding to $A$, where the $i$-th first specified text set is $G_i = \{A_i, B_i\}$, the first text list corresponding to $A_i$ is $B_i = \{B_{i1}, \ldots, B_{ir}, \ldots, B_{is}\}$, $B_{ir}$ is the $r$-th first text, $r = 1, \ldots, s$, and $s$ is the number of first texts; $A_i$ is taken as the $i$-th first target test set in $G$ and $B_i$ as the $i$-th first target training set in $G$;

S200, when $m \geq m_0$, obtaining from $A$ an intermediate text set $D = \{D_1, \ldots, D_j, \ldots, D_n\}$, where $D_j = \{D_{j1}, \ldots, D_{jt}, \ldots, D_{jk}\}$, $D_{jt}$ is the $t$-th intermediate text in the $j$-th intermediate text list, $j = 1, \ldots, n$, $n$ is the number of intermediate text lists, $t = 1, \ldots, k$, and $k$ is the number of intermediate texts in any intermediate text list, wherein $n$ satisfies the following condition:

；

S500, obtaining an abnormal text list $H = \{H_1, \ldots, H_g, \ldots, H_z\}$ corresponding to $A$, where $H_g$ is the $g$-th abnormal text, $g = 1, \ldots, z$, and $z$ is the number of abnormal texts, and performing text error correction on $H_g$ to obtain all abnormal annotations corresponding to $H_g$;

wherein the database further comprises an entity type set $L = \{L^1, \ldots, L^y, \ldots, L^w\}$, where $L^y$ is the entity list corresponding to the $y$-th entity type, $y = 1, \ldots, w$, and $w$ is the number of entity types corresponding to the texts, and the computer program, when executed by the processor, further implements the following steps in step S500:

S501, obtaining the annotated entity list $U^g = \{U^g_1, \ldots, U^g_x, \ldots, U^g_{\beta_g}\}$ corresponding to $H_g$, where $U^g_x$ is the $x$-th annotated entity, $x = 1, \ldots, \beta_g$, and $\beta_g$ is the number of entities annotated in the $g$-th abnormal text;

S503, according to the entity type corresponding to $U^g_x$, obtaining from $L$ the entity list $L^y = \{L^y_1, \ldots, L^y_e, \ldots, L^y_{v_y}\}$ corresponding to $U^g_x$, where $L^y_e$ is the $e$-th entity in $L^y$, $e = 1, \ldots, v_y$, and $v_y$ is the number of entities in $L^y$;

S505, according to $U^g_x$ and $L^y_e$, obtaining the target similarity $F^g_x$ of $U^g_x$; wherein step S505 further comprises the following steps:

S5051, when $U^g_x$ is a Chinese entity, obtaining from $L^y$ the Chinese entity list $T^y = \{T^y_1, \ldots, T^y_a, \ldots, T^y_{b_y}\}$ corresponding to $L^y$, where $T^y_a$ is the $a$-th Chinese entity in $L^y$, $a = 1, \ldots, b_y$, and $b_y$ is the number of Chinese entities in $L^y$;

S5053, according to $U^g_x$ and $T^y_a$, obtaining the similarity list $E^{gy}_x = \{E^{gy}_{x1}, \ldots, E^{gy}_{xa}, \ldots, E^{gy}_{xb_y}\}$ between $U^g_x$ and $T^y$, and taking the maximum similarity in $E^{gy}_x$ as $F^g_x$, where $E^{gy}_{xa}$ is the similarity between $U^g_x$ and $T^y_a$ and satisfies the following condition:

wherein $MK^{gx}_\gamma$ is the $\gamma$-th value of the vector $MK^{gx}$ corresponding to $U^g_x$, and $NK^{ya}_\gamma$ is the $\gamma$-th value of the vector $NK^{ya}$ corresponding to $T^y_a$;

S5055, when $U^g_x$ is a non-Chinese entity, obtaining from $L^y$ the non-Chinese entity list $R^y = \{R^y_1, \ldots, R^y_c, \ldots, R^y_{d_y}\}$ corresponding to $L^y$, where $R^y_c$ is the $c$-th non-Chinese entity in $L^y$, $c = 1, \ldots, d_y$, and $d_y$ is the number of non-Chinese entities in $L^y$;

wherein $\lambda^{gy}_{xc}$ is the edit distance between $U^g_x$ and $R^y_c$, and $\eta^{gy}_{xc}$ is the larger of the number of characters in $U^g_x$ and the number of characters in $R^y_c$;

S507, when $F^g_x = F_0$, determining that $U^g_x$ is a non-abnormal annotation, where $F_0$ is a preset first similarity threshold and $F_0 = 1$;

S509, when $F^g_x \neq F_0$, marking $U^g_x$ to indicate that $U^g_x$ is an abnormal annotation.

2. The data processing system for error correction of annotated text as claimed in claim 1, wherein the annotated text refers to a text that has been annotated.

3. The data processing system for error correction of annotated text as claimed in claim 1, wherein $m_0$ ranges from 10 to 50.

4. The data processing system for error correction of annotated text as claimed in claim 1, wherein a first text in $B_i$ is any annotated text in $A$ other than $A_i$.

5. The data processing system for error correction of annotation text of claim 1, wherein the intermediate text refers to any annotation text in the intermediate text list divided based on A.

6. The data processing system for error correction of annotated text as claimed in claim 1, wherein a second text list in $C_j$ is any intermediate text list in $D$ other than $D_j$.

7. The data processing system for error correction of markup text according to claim 1, wherein s satisfies the following condition: s = m-1.

8. The data processing system for error correction of annotated text as claimed in claim 1, wherein p satisfies the condition: p = n-1.