CN116227495A - Entity classification data processing system - Google Patents

Entity classification data processing system

Info

Publication number
CN116227495A
Authority
CN
China
Prior art keywords
text
neural network
target
network model
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310497381.8A
Other languages
Chinese (zh)
Other versions
CN116227495B (en)
Inventor
张炜琛
倪培峰
王全修
赵洲洋
石江枫
靳雯
于伟
王明超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Rich Information Technology Co ltd
Information And Communication Center Of Ministry Of Public Security
Original Assignee
Beijing Rich Information Technology Co ltd
Information And Communication Center Of Ministry Of Public Security
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Rich Information Technology Co ltd and Information And Communication Center Of Ministry Of Public Security
Priority to CN202310497381.8A
Publication of CN116227495A
Application granted
Publication of CN116227495B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295: Named entity recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to the field of electronic digital data processing, and in particular to a data processing system for entity classification. The system includes a processor and a memory having stored thereon computer-readable instructions which, when executed by the processor, perform the following steps: S100, acquire a target text Text; S200, obtain the encoding vector of Text; S300, perform inference on the encoding vector of Text to obtain the encoding vector of each entity in each sub-text of Text; S400, unify the dimensions of and concatenate the encoding vectors of the entities in each sub-text of Text to obtain the target encoding tensor corresponding to Text; S500, perform inference on the target encoding tensor corresponding to Text using the trained third neural network model to obtain the type of each entity in each sub-text of Text. The invention realizes fine-grained classification of entity types in text.

Description

Entity classification data processing system
Technical Field
The invention relates to the technical field of electric digital data processing, in particular to a data processing system for entity classification.
Background
In the prior art, a named entity recognition (NER) model can be used to recognize entities such as person names, place names, organization names, dates and times, and proper nouns in text. However, the entity types recognized by an NER model tend to be broad, and in some application scenarios the user needs to know the specific type of a recognized entity. For example, an NER model can recognize a place name in text, but the user may further need to know whether it is the departure place or the destination; or an NER model can recognize a date and time, but the user may further need to know whether it is the departure time or the arrival time. How to realize fine-grained classification of entity types in text is a problem to be solved.
Disclosure of Invention
The invention aims to provide a data processing system for entity classification that realizes fine-grained classification of entity types in text, so that a user can obtain the specific type of each entity in the text.
According to the present invention, there is provided a data processing system for entity classification comprising a processor and a memory, the memory having stored thereon computer-readable instructions which, when executed by the processor, implement the following steps:
S100, acquire the target text Text = {text_1, text_2, …, text_n, …, text_N}, where text_n is the nth sub-text constituting the target text, n ranges from 1 to N, and N is the number of sub-texts constituting the target text.
S200, obtain the encoding vector of Text using the trained first neural network model.
S300, perform inference on the encoding vector of Text using a trained second neural network model to obtain the encoding vector of each entity in each sub-text of Text; the second neural network model is used for entity recognition.
S400, unify the dimensions of and concatenate the encoding vectors of the entities in each sub-text of Text to obtain the target encoding tensor corresponding to Text.
S500, perform inference on the target encoding tensor corresponding to Text using a trained third neural network model to obtain the type of each entity in each sub-text of Text; the third neural network model is used for entity classification.
Compared with the prior art, the invention has at least the following beneficial effects:
On the basis of the entities recognized by the second neural network model in each sub-text of the target text, the method obtains the target encoding tensor that is input into the third neural network model from the encoding vectors of the recognized entities. The third neural network model can therefore further classify the entities recognized by the second neural network model and obtain the specific type of each entity. The invention realizes fine-grained classification of entity types in text, so that a user can obtain the specific type of each entity in the text.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments are briefly introduced below. It is apparent that the drawings described below show only some embodiments of the invention; other drawings may be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a flowchart of a method for classifying entities according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention is made clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
According to the present invention there is provided a data processing system for entity classification comprising a processor and a memory having stored thereon computer readable instructions which when executed by the processor perform a method of entity classification. As shown in fig. 1, the method for classifying entities includes the steps of:
S100, acquire the target text Text = {text_1, text_2, …, text_n, …, text_N}, where text_n is the nth sub-text constituting the target text, n ranges from 1 to N, and N is the number of sub-texts constituting the target text.
S200, obtain the encoding vector of Text using the trained first neural network model.
Optionally, the first neural network model is a BERT model. Those skilled in the art will appreciate that any prior-art neural network model that can be used to obtain the encoding vector of text falls within the scope of the present invention.
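For illustration, the following sketch shows how the per-word encoding vectors of S200 might be obtained when the first neural network model is a BERT model; the Hugging Face checkpoint name and the example sentence are assumptions, not part of the patent.

```python
# A minimal sketch of S200, assuming the first neural network model is a
# BERT checkpoint from Hugging Face; the checkpoint name and example
# sentence are illustrative assumptions, not specified by the patent.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

text = "张三上午8点打电话报警说手机被偷了。"  # example text (see the embodiment below)
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-word encoding vectors; the code dimension A is 768 for bert-base models.
encoding = outputs.last_hidden_state  # shape: (1, sequence_length, 768)
```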
S300, perform inference on the encoding vector of Text using a trained second neural network model to obtain the encoding vector of each entity in each sub-text of Text; the second neural network model is used for entity recognition.
According to the invention, the second neural network model is an NER model. Those skilled in the art will appreciate that any NER model in the prior art falls within the scope of the present invention.
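For illustration only, one plausible form of such a second neural network model is sketched below: a token-classification (BIO tagging) head applied to the encoding vectors produced in S200. The tag set and layer sizes are assumptions rather than the patent's specification.

```python
# One assumed form of the second neural network model: a token-classification
# head that turns the per-word encoding vectors from S200 into BIO entity tags.
import torch
import torch.nn as nn

TAGS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-TIME", "I-TIME"]  # assumed

class EntityRecognizer(nn.Module):
    def __init__(self, hidden_size: int = 768, num_tags: int = len(TAGS)):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_tags)

    def forward(self, encoding: torch.Tensor) -> torch.Tensor:
        # encoding: (batch, seq_len, hidden_size) from the first model
        return self.classifier(encoding)  # per-word tag logits

recognizer = EntityRecognizer()
encoding = torch.randn(1, 16, 768)  # stand-in for the S200 output above
tag_logits = recognizer(encoding)
tags = tag_logits.argmax(dim=-1)    # predicted BIO tag index per word
```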
S400, unify the dimensions of and concatenate the encoding vectors of the entities in each sub-text of Text to obtain the target encoding tensor corresponding to Text.
In the invention, unifying the dimensions of and concatenating the encoding vectors of the entities in each sub-text of Text to obtain the target encoding tensor corresponding to Text comprises the following steps:
S410, based on the target text, acquire E = (e_1, e_2, …, e_n, …, e_N), where e_n is the set of encoding vectors of the entities in text_n output by the second neural network model, e_n = (e_{n,1}, e_{n,2}, …, e_{n,m}, …, e_{n,Mn}), e_{n,m} is the encoding vector of the mth entity in text_n, m ranges from 1 to Mn, and Mn is the number of entities in text_n.
S420, acquire the first entity number S = max(M1, M2, …, Mn, …, MN), where max(·) takes the maximum value.
As one embodiment, the target text includes 4 sub-texts, i.e., N = 4; the numbers of entities recognized in the first, second, third and fourth sub-texts by the trained second neural network model are 3, 4, 2 and 3 respectively; then S = max(3, 4, 2, 3) = 4.
S430, obtain the average length L of all entities of all sub-texts in E: L = (Σ_{n=1}^{N} Σ_{m=1}^{Mn} l_{n,m}) / (Σ_{n=1}^{N} Mn), where l_{n,m} is the length of the mth entity recognized in text_n from front to back.
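As a hedged illustration of S420 and S430, the sketch below computes S and L from entity spans; representing each entity as a (start, end) character span is an assumption made for the example, since the patent only needs entity counts and lengths.

```python
# A sketch of S420-S430 over N = 4 sub-texts with assumed entity spans.
entity_spans = [
    [(0, 2), (5, 9), (12, 14)],           # M1 = 3
    [(1, 3), (4, 6), (8, 12), (15, 17)],  # M2 = 4
    [(2, 5), (7, 9)],                     # M3 = 2
    [(0, 4), (6, 8), (10, 13)],           # M4 = 3
]

# S420: first entity number S = max(M1, ..., MN) = max(3, 4, 2, 3) = 4
S = max(len(spans) for spans in entity_spans)

# S430: average length L over all entities of all sub-texts
lengths = [end - start for spans in entity_spans for (start, end) in spans]
L = sum(lengths) / len(lengths)
print(S, L)
```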
S440, if L ≤ L0, proceed to S450; otherwise, proceed to S460; L0 is a preset length threshold.
Optionally, L0 is an empirical value. Preferably, the invention obtains L0 by the following method:
S441, obtain an entity sample set B = {b_1, b_2, …, b_q, …, b_Q}, where b_q is the qth entity sample in B, q ranges from 1 to Q, and Q is the number of entity samples in B; set a first coefficient i = 1.
S442, traverse B; if the length of b_q is less than or equal to (d0 + i×Δd), obtain the first vector corresponding to b_q and obtain the type of b_q according to that first vector; otherwise, obtain the second vector corresponding to b_q and obtain the type of b_q according to that second vector; d0 is a preset initial length and Δd is a preset length interval.
As an example, d0 and Δd are set to empirical values; optionally, d0 = 2 and Δd = 1.
S443, traverse B; if the obtained type of b_q is correct, add b_q to the preset ith set G_i; G_i is initialized to Null.
S445, obtain G i The number of entities in the system.
S446, if the number of entities in G_i is greater than the number of entities in G_{i-1}, set i = i+1 and repeat S442-S445 until the number of entities in G_i is less than or equal to the number of entities in G_{i-1}; the value of i at that point is recorded as H; G_{i-1} is the (i-1)th set obtained by the same method as G_i.
According to the invention, G_0 results from executing S442-S445 with i = 0.
S447, obtain L0 = d0 + (H-1)×Δd.
The L0 obtained according to S441-S447 is more accurate; using it as the preset length threshold can improve the accuracy of the final entity classification result.
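For concreteness, the following is a minimal sketch of the L0 search in S441-S447; the entity samples, gold-standard types, and the two classification helpers (classify_first, classify_second) are hypothetical stand-ins for components the patent does not specify.

```python
# A minimal sketch of the L0 search in S441-S447, under the assumptions above.
def find_L0(samples, gold_types, classify_first, classify_second,
            d0=2, delta_d=1):
    prev_correct = None  # |G_{i-1}|
    i = 0                # G_0 is built with i = 0, per the description
    while True:
        threshold = d0 + i * delta_d
        correct = 0  # |G_i|: samples whose predicted type is correct
        for sample, gold in zip(samples, gold_types):
            if len(sample) <= threshold:
                predicted = classify_first(sample)   # via the first vector
            else:
                predicted = classify_second(sample)  # via the second vector
            if predicted == gold:
                correct += 1
        if prev_correct is not None and correct <= prev_correct:
            H = i  # first i with |G_i| <= |G_{i-1}|
            return d0 + (H - 1) * delta_d  # S447
        prev_correct = correct
        i += 1
```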
S450, traverse E to obtain the first vector f1_{n,m} corresponding to e_{n,m}; f1_{n,m} is obtained by concatenating the code of the first word and the code of the last word of e_{n,m}. If Mn < S, perform a filling operation on the combined encoding vector F1_n corresponding to text_n to obtain the first target encoding vector corresponding to text_n, and then proceed to S470; F1_n is obtained by concatenating, from front to back, the first vectors corresponding to all entities recognized in text_n; the dimension of the first target encoding vector is S×2×A, where A is the dimension of the code corresponding to each word in the encoding vector output by the first neural network model. If Mn = S, take the combined encoding vector F1_n corresponding to text_n as the first target encoding vector corresponding to text_n and proceed to S470.
As an embodiment, the dimension of the code corresponding to each word in the encoding vector output by the first neural network model is 768, i.e., A = 768; the dimension of the first target encoding vector is then S×2×768.
It should be appreciated that the filling operation appends 0s after F1_n until the first target encoding vector corresponding to text_n reaches dimension S×2×A. For example, if Mn = 3 and S = 4, the filling operation appends (4-3)×2×A zeros after F1_n.
S460, traversing E to obtain E n Corresponding second vector f 2 n,m ,f 2 n,m E is n,m The average value of codes corresponding to all words in the database; if Mn is<S, then to text n Corresponding combined code vector F 2 n Filling operation is carried out to obtain text n A corresponding second target encoding vector, and enter S480; f (F) 2 n From text n The second target coding vector is obtained by splicing the second vectors corresponding to all the entities identified from the front to the back, and the dimension of the second target coding vector is S multiplied by A; if Mn=S, text is to be added n Corresponding combined code vector F 2 n As text n The corresponding second target encoding vector is entered into S480.
S470, inputting a first target coding tensor corresponding to the target Text to the trainedThe third neural network model performs reasoning to obtain the types of the entities in the Text of the target Text; the first target encoding tensor corresponding to the target Text is formed by each Text n The corresponding first target coding vector is formed;
S480, input the second target encoding tensor corresponding to the target text Text into the trained third neural network model for inference to obtain the types of all entities in the target text; the second target encoding tensor corresponding to the target text Text is composed of the second target encoding vectors corresponding to each text_n.
When acquiring the target encoding tensor that is input into the third neural network model, the invention also distinguishes the acquisition method according to the average length L of all entities in all sub-texts of the target text: when L is relatively short, the vector corresponding to each entity is obtained by concatenating the head and tail codes, so as to retain more entity information; when L is relatively long, the vector corresponding to each entity is obtained by averaging, which likewise retains more entity information and improves the accuracy of the final entity classification.
S500, perform inference on the target encoding tensor corresponding to Text using a trained third neural network model to obtain the type of each entity in each sub-text of Text; the third neural network model is used for entity classification.
Those skilled in the art will appreciate that any entity classification model and neural network training method in the prior art falls within the scope of the present invention. The invention provides a preferred training method: a joint training mechanism is adopted for the second neural network model and the third neural network model, with the total training loss set to Loss = (Σ_{j=1}^{Z} (α_j × loss_{1,j} + β_j × loss_{2,j})) / Z, where α_j = 1.5 - 1/(1 + e^(-Pj/4)) is the weight of the second neural network model for the jth sub-text training sample, β_j = 1/(1 + e^(-Pj/4)) - 0.5 is the weight of the third neural network model for the jth sub-text training sample, loss_{1,j} is the loss of the second neural network model on the jth sub-text training sample, loss_{2,j} is the loss of the third neural network model on the jth sub-text training sample, j ranges from 1 to Z, Z is the number of sub-text training samples, and Pj is the number of entities recognized in the jth sub-text training sample.
The α_j and β_j set by the invention ensure that, for the jth sub-text training sample, the weight of the second neural network model is always greater than or equal to the weight of the third neural network model, while the weight of the third neural network model increases with the number of recognized entities. Entity recognition is therefore the main task during joint training of the two models, which improves the fit on the entity recognition task and avoids a poor fit caused by the difficulty of that task. At the same time, the weight of the entity classification task grows as more entities are recognized, so the model pays more attention to classification when there are more entities to classify (the more classifications are made, the greater the probability of classification errors and the larger the corresponding loss is likely to be), improving the fit on the classification task.
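To make the weighting scheme concrete, the sketch below evaluates α_j, β_j and the total Loss for placeholder loss values; it also shows numerically that α_j ≥ β_j always holds and that β_j grows with the entity count Pj.

```python
# A sketch of the joint-training weights and total loss; the per-sample
# loss values below are placeholder numbers, not real training output.
import math

def alpha(P_j):
    # Weight of the second (entity recognition) model for sample j
    return 1.5 - 1.0 / (1.0 + math.exp(-P_j / 4))

def beta(P_j):
    # Weight of the third (entity classification) model for sample j
    return 1.0 / (1.0 + math.exp(-P_j / 4)) - 0.5

def total_loss(loss1, loss2, P):
    # Loss = (sum_j (alpha_j*loss_{1,j} + beta_j*loss_{2,j})) / Z
    Z = len(P)
    return sum(alpha(p) * l1 + beta(p) * l2
               for l1, l2, p in zip(loss1, loss2, P)) / Z

# alpha_j >= beta_j for every Pj, and beta_j grows with Pj:
for P_j in (0, 2, 8):
    print(P_j, round(alpha(P_j), 3), round(beta(P_j), 3))

print(total_loss([0.7, 0.4], [1.1, 0.9], [3, 6]))  # placeholder losses
```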
Optionally, the loss corresponding to the second neural network model and the loss corresponding to the third neural network model are both cross entropy losses. Those skilled in the art will appreciate that any type of loss in the prior art falls within the scope of the present invention.
According to the present invention, when inference is performed with the trained third neural network model, the target encoding tensor input into the third neural network model depends on the inference result of the second neural network model. Preferably, however, during the training of the second and third neural network models, the target encoding tensor input into the third neural network model is obtained from the manually labeled entities in the sub-text training samples rather than from entities inferred by the second neural network model. This improves the accuracy of the target encoding tensor input into the third neural network model and avoids a large loss in the third neural network model caused by inaccurate inference results from the second neural network model.
As a specific implementation, the target text is a police incident report; the second neural network model is used to recognize person names, place names and times in the target text, and the third neural network model is used to obtain the types of those person names, place names and times, where the types of person names include suspect, reporter or victim, the types of place names include incident location or reporting location, and the types of times include incident time, reporting time or dispatch time.
For example, the target text is: "Zhang San called the police at 8 a.m. to report that his mobile phone had been stolen." Using the second neural network model, "Zhang San" can be recognized as a person name and "8 a.m." as a time; on the basis of the second neural network recognizing "Zhang San" as a person name, the third neural network model can further classify "Zhang San" into the reporter type, and on the basis of "8 a.m." being recognized as a time, it can further classify "8 a.m." into the reporting time type.
On the basis of the entities recognized by the second neural network model in each sub-text of the target text, the method obtains the target encoding tensor that is input into the third neural network model from the encoding vectors of the recognized entities, so that the third neural network model can further classify the entities recognized by the second neural network model and obtain the specific type of each entity. The invention thus realizes fine-grained classification of entity types in text, so that a user can obtain the specific type of each entity in the text.
While certain specific embodiments of the invention have been described in detail by way of example, it will be appreciated by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the invention. Those skilled in the art will also appreciate that many modifications may be made to the embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (7)

1. A data processing system for entity classification, comprising a processor and a memory, said memory having stored thereon computer readable instructions, wherein said computer readable instructions when executed by said processor perform the steps of:
S100, acquire the target text Text = {text_1, text_2, …, text_n, …, text_N}, where text_n is the nth sub-text constituting the target text, n ranges from 1 to N, and N is the number of sub-texts constituting the target text;
S200, obtain the encoding vector of Text using the trained first neural network model;
S300, perform inference on the encoding vector of Text using a trained second neural network model to obtain the encoding vector of each entity in each sub-text of Text; the second neural network model is used for entity recognition;
S400, unify the dimensions of and concatenate the encoding vectors of the entities in each sub-text of Text to obtain the target encoding tensor corresponding to Text;
S500, perform inference on the target encoding tensor corresponding to Text using a trained third neural network model to obtain the type of each entity in each sub-text of Text; the third neural network model is used for entity classification.
2. The data processing system of claim 1, wherein S400 comprises:
S410, based on the target text, acquire E = (e_1, e_2, …, e_n, …, e_N), where e_n is the set of encoding vectors of the entities in text_n output by the second neural network model, e_n = (e_{n,1}, e_{n,2}, …, e_{n,m}, …, e_{n,Mn}), e_{n,m} is the encoding vector of the mth entity in text_n, m ranges from 1 to Mn, and Mn is the number of entities in text_n;
S420, acquire the first entity number S = max(M1, M2, …, Mn, …, MN), where max(·) takes the maximum value;
S430, obtain the average length L of all entities of all sub-texts in E: L = (Σ_{n=1}^{N} Σ_{m=1}^{Mn} l_{n,m}) / (Σ_{n=1}^{N} Mn), where l_{n,m} is the length of the mth entity recognized in text_n from front to back;
S440, if L ≤ L0, proceed to S450; otherwise, proceed to S460; L0 is a preset length threshold;
S450, traverse E to obtain the first vector f1_{n,m} corresponding to e_{n,m}; f1_{n,m} is obtained by concatenating the code of the first word and the code of the last word of e_{n,m}. If Mn < S, perform a filling operation on the combined encoding vector F1_n corresponding to text_n to obtain the first target encoding vector corresponding to text_n, and then proceed to S470; F1_n is obtained by concatenating, from front to back, the first vectors corresponding to all entities recognized in text_n; the dimension of the first target encoding vector is S×2×A, where A is the dimension of the code corresponding to each word in the encoding vector output by the first neural network model. If Mn = S, take the combined encoding vector F1_n corresponding to text_n as the first target encoding vector corresponding to text_n and proceed to S470;
S460, traverse E to obtain the second vector f2_{n,m} corresponding to e_{n,m}; f2_{n,m} is the average of the codes corresponding to all words in e_{n,m}. If Mn < S, perform a filling operation on the combined encoding vector F2_n corresponding to text_n to obtain the second target encoding vector corresponding to text_n, and then proceed to S480; F2_n is obtained by concatenating, from front to back, the second vectors corresponding to all entities recognized in text_n; the dimension of the second target encoding vector is S×A. If Mn = S, take the combined encoding vector F2_n corresponding to text_n as the second target encoding vector corresponding to text_n and proceed to S480;
S470, input the first target encoding tensor corresponding to the target text Text into the trained third neural network model for inference to obtain the types of all entities in the target text; the first target encoding tensor corresponding to the target text Text is composed of the first target encoding vectors corresponding to each text_n;
S480, input the second target encoding tensor corresponding to the target text Text into the trained third neural network model for inference to obtain the types of all entities in the target text; the second target encoding tensor corresponding to the target text Text is composed of the second target encoding vectors corresponding to each text_n.
3. The data processing system of claim 2, wherein in S440 the method of obtaining L0 comprises the following steps:
S441, obtain an entity sample set B = {b_1, b_2, …, b_q, …, b_Q}, where b_q is the qth entity sample in B, q ranges from 1 to Q, and Q is the number of entity samples in B; set a first coefficient i = 1;
S442, traverse B; if the length of b_q is less than or equal to (d0 + i×Δd), obtain the first vector corresponding to b_q and obtain the type of b_q according to that first vector; otherwise, obtain the second vector corresponding to b_q and obtain the type of b_q according to that second vector; d0 is a preset initial length and Δd is a preset length interval;
S443, traverse B; if the obtained type of b_q is correct, add b_q to the preset ith set G_i; G_i is initialized to Null;
S445, obtain the number of entities in G_i;
S446, if the number of entities in G_i is greater than the number of entities in G_{i-1}, set i = i+1 and repeat S442-S445 until the number of entities in G_i is less than or equal to the number of entities in G_{i-1}; the value of i at that point is recorded as H; G_{i-1} is the (i-1)th set obtained by the same method as G_i;
S447, obtain L0 = d0 + (H-1)×Δd.
4. A data processing system for entity classification as claimed in claim 3, wherein d0 = 2 and Δd = 1.
5. The data processing system of entity classification of claim 1, wherein the training of the second neural network model and of the third neural network model employs a joint training mechanism, with the total training loss set to Loss = (Σ_{j=1}^{Z} (α_j × loss_{1,j} + β_j × loss_{2,j})) / Z, where α_j = 1.5 - 1/(1 + e^(-Pj/4)) is the weight of the second neural network model for the jth sub-text training sample, β_j = 1/(1 + e^(-Pj/4)) - 0.5 is the weight of the third neural network model for the jth sub-text training sample, loss_{1,j} is the loss of the second neural network model on the jth sub-text training sample, loss_{2,j} is the loss of the third neural network model on the jth sub-text training sample, j ranges from 1 to Z, Z is the number of sub-text training samples, and Pj is the number of entities recognized in the jth sub-text training sample.
6. The data processing system of entity classification of claim 5, wherein the loss corresponding to the second neural network model and the loss corresponding to the third neural network model are both cross entropy losses.
7. The data processing system of entity classification of claim 1, wherein the first neural network model is a BERT model.
CN202310497381.8A 2023-05-05 2023-05-05 Entity classification data processing system Active CN116227495B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310497381.8A CN116227495B (en) 2023-05-05 2023-05-05 Entity classification data processing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310497381.8A CN116227495B (en) 2023-05-05 2023-05-05 Entity classification data processing system

Publications (2)

Publication Number Publication Date
CN116227495A 2023-06-06
CN116227495B 2023-07-21

Family

ID=86580870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310497381.8A Active CN116227495B (en) 2023-05-05 2023-05-05 Entity classification data processing system

Country Status (1)

Country Link
CN (1) CN116227495B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020244066A1 (en) * 2019-06-04 2020-12-10 平安科技(深圳)有限公司 Text classification method, apparatus, device, and storage medium
CN112699682A (en) * 2020-12-11 2021-04-23 山东大学 Named entity identification method and device based on combinable weak authenticator
CN115203434A (en) * 2022-07-07 2022-10-18 辽宁大学 Entity relationship extraction method fusing BERT network and position characteristic information and application thereof
CN115329766A (en) * 2022-08-23 2022-11-11 中国人民解放军国防科技大学 Named entity identification method based on dynamic word information fusion
CN115965026A (en) * 2022-12-14 2023-04-14 江苏徐工国重实验室科技有限公司 Model pre-training method and device, text analysis method and device and storage medium

Also Published As

Publication number Publication date
CN116227495B (en) 2023-07-21

Similar Documents

Publication Publication Date Title
US20190317955A1 (en) Determining missing content in a database
Hakkani-Tür et al. Beyond ASR 1-best: Using word confusion networks in spoken language understanding
EP3582150A1 (en) Method of knowledge transferring, information processing apparatus and storage medium
CN110852755B (en) User identity identification method and device for transaction scene
CN108550065B (en) Comment data processing method, device and equipment
CN112613308A (en) User intention identification method and device, terminal equipment and storage medium
CN112257449B (en) Named entity recognition method and device, computer equipment and storage medium
CN110532398B (en) Automatic family map construction method based on multi-task joint neural network model
CN112036168B (en) Event main body recognition model optimization method, device, equipment and readable storage medium
CN111538809A (en) Voice service quality detection method, model training method and device
CN113239702A (en) Intention recognition method and device and electronic equipment
CN111554275B (en) Speech recognition method, device, equipment and computer readable storage medium
CN113704396A (en) Short text classification method, device, equipment and storage medium
WO2019081776A1 (en) A computer implemented determination method and system
CN114218945A (en) Entity identification method, device, server and storage medium
CN113761192B (en) Text processing method, text processing device and text processing equipment
CN114428860A (en) Pre-hospital emergency case text recognition method and device, terminal and storage medium
CN116227495B (en) Entity classification data processing system
CN111694936B (en) Method, device, computer equipment and storage medium for identification of AI intelligent interview
CN113626717A (en) Public opinion monitoring method and device, electronic equipment and storage medium
CN111581957B (en) Nested entity detection method based on pyramid hierarchical network
CN113536784A (en) Text processing method and device, computer equipment and storage medium
CN115358817A (en) Intelligent product recommendation method, device, equipment and medium based on social data
CN115455939A (en) Chapter-level event extraction method, device, equipment and storage medium
CN111444319B (en) Text matching method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant