CN111008271B - Neural network-based key information extraction method and system - Google Patents


Info

Publication number
CN111008271B
CN111008271B
Authority
CN
China
Prior art keywords
vector
elements
text
feature vector
key information
Prior art date
Legal status
Active
Application number
CN201911138210.6A
Other languages
Chinese (zh)
Other versions
CN111008271A (en)
Inventor
姜磊
杨钊
赖招展
欧阳滨滨
陈南山
朱振航
何慧
沈广盈
屈吕杰
Current Assignee
Brilliant Data Analytics Inc
Original Assignee
Brilliant Data Analytics Inc
Priority date
Filing date
Publication date
Application filed by Brilliant Data Analytics Inc
Priority to CN201911138210.6A
Publication of CN111008271A
Application granted
Publication of CN111008271B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to information extraction technology, in particular to a neural-network-based key information extraction method and system, comprising the following steps: generating a label vector: with the article length set to n, the position of the first character of the key information in the article set to s, and the position of the last character set to e, take s × n + e as the element positions of the label vector, initialize all elements to 0, and reset the elements at positions s × n + e to 1; performing text tensorization on the article to obtain a text tensor C and then generating a text feature vector; replacing the elements of the text feature vector that obviously cannot be the largest with a minimum value and multiplying the remaining elements by a weight to generate an output vector; calculating the cross entropy of the output vector and the label vector as the loss and iteratively training the neural network until convergence to obtain a model; and inputting text data into the model to obtain an output vector and the key information. This solves the prior-art problems of easy overfitting on small data sets and the inability to make full use of prior information.

Description

Neural network-based key information extraction method and system
Technical Field
The invention belongs to the technical field of information extraction, and particularly relates to a key information extraction method and system based on a neural network.
Background
A neural network is a mathematical model formed by connected nodes; during training, the values of its parameters are generally updated by the back-propagation algorithm so that the model as a whole better approximates the true mapping from input space to output space. In theory, a two-layer neural network that is wide enough can fit any function, but in practice such a network is likely simply to memorize the training set rather than learn the deeper connections between the data, so it performs well on the training set but poorly on the test set. Because of this problem, instead of a shallow but sufficiently wide network, one tries a deep network of a certain width, in the hope that the deeper layers can learn higher-level features on top of those learned by the shallow layers. But deep networks bring the problems of gradient explosion and gradient vanishing, and networks using sigmoid as the activation function were at the time generally limited to about five layers. The ReLU activation function proposed later alleviated gradient explosion and gradient vanishing.
After residual connections were proposed in 2015, the problems of gradient explosion and gradient vanishing were essentially solved, and neural networks hundreds of layers deep could easily be built with them. With such deep networks, fitting capacity is naturally no longer a problem, but an overfitting problem arises: the model's learning capacity is so strong that it often learns random phenomena as rules. This phenomenon is less severe on large data sets; by the law of large numbers, if the data set is large enough it is difficult for the neural network to learn a conspicuous random phenomenon. On small data sets, however, overfitting is especially pronounced, and existing models that perform well on large data sets perform poorly on small ones, sometimes even worse than simple models.
In summary, using neural network technology for key information extraction at the present stage has two problems. One is the weakness of the standard models; for example, BERT + CRF is sometimes used to extract key information (CRF here generally denoting a subclass of CRF, the linear-chain conditional random field). The linear-chain CRF is in essence a rule constraint on the neural network, but the constraint acts only between adjacent words; there is no constraint between words further apart, so good prior information naturally cannot be exploited. The other is that existing models are complex: without added prior information or special optimization they overfit easily, and their effect can even be worse than that of a simple model.
Disclosure of Invention
To solve the problems in the prior art, the invention provides a neural-network-based key information extraction method and system that customize a neural network model for key information extraction and build strict and effective rule constraints into the model according to the characteristics of key information, thereby improving the model's performance on small data sets and solving the problems that existing key information extraction technology easily learns randomly occurring features on small data sets, causing overfitting, and cannot make full use of prior information.
The extraction method is realized with the following technical scheme. The neural-network-based key information extraction method comprises the following steps:
S1, generating label vectors: let the length of an article be n, the position of the first character of the key information in the article be s, and the position of the last character be e; take s × n + e as the element positions of the label vector, and initialize an n × n-dimensional label vector for each article; initialize all elements of the label vector to 0, and reset the elements at positions s × n + e to 1 to obtain the final label vector;
S2, performing text tensorization on a given article to obtain a text tensor C;
S3, generating a text feature vector: obtain a first-character feature vector CS and a last-character feature vector CE from the text tensor C, and take the Cartesian product of CS and CE as the text feature vector;
S4, replacing the elements of the text feature vector that obviously cannot be the largest with a minimum value;
S5, sharing parameters: multiply the elements of the text feature vector that are not obviously unable to be the largest by a weight to generate a new output vector;
S6, calculating the loss function: compute the cross entropy of the output vector and the label vector as the loss for iterative training of the neural network;
S7, minimizing the loss function by gradient descent, iterating until convergence, and saving the neural network model;
S8, inputting text data into the saved neural network model to obtain a final output vector and the extracted key information.
In a preferred embodiment, in step S1 the elements of the initialized label vector are shuffled, copied several times, and spliced end to end into an element string; a fixed number of leading elements of the string are then taken to form the final label vector.
In a preferred embodiment, in step S5 all elements of the text feature vector that are not obviously unable to be the largest and that share the same difference e - s are multiplied by the same weight, generating a new text feature vector as the output vector.
In a preferred embodiment, step S8 comprises:
obtaining m elements of the output vector, each of which is greater than or equal to every element of the output vector other than those m elements;
calculating the s and e corresponding to the m elements through the one-to-one correspondence between the positions of the m elements in the output vector and the combinations of s and e;
obtaining the text fields corresponding to the m elements from their s and e; and adding together the elements corresponding to identical text fields as the new element for each such field, and selecting the text field corresponding to the largest of the new elements as the final output.
The neural-network-based key information extraction system of the invention comprises:
a label vector generation module, for letting the length of an article be n, the position of the first character of the key information in the article be s, and the position of the last character be e, taking s × n + e as the element positions of the label vector, and initializing an n × n-dimensional label vector for each article; all elements of the label vector are initialized to 0, and the elements at positions s × n + e are reset to 1 to obtain the final label vector;
a text tensorization module, for performing text tensorization on a given article to obtain a text tensor C;
a text feature vector generation module, for obtaining a first-character feature vector CS and a last-character feature vector CE from the text tensor C and taking the Cartesian product of CS and CE as the text feature vector;
an assignment module, for replacing the elements of the text feature vector that obviously cannot be the largest with a minimum value;
a parameter sharing module, for multiplying the elements of the text feature vector that are not obviously unable to be the largest by a weight to generate a new output vector;
a loss function calculation module, for computing the cross entropy of the output vector and the label vector as the loss for iterative training of the neural network;
an iterative convergence module, for minimizing the loss function by gradient descent, iterating until convergence, and saving the neural network model;
and a prediction module, for inputting text data into the saved neural network model to obtain a final output vector and the extracted key information.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
1. For the specific scenario of key information extraction, the invention provides a set of neural network models dedicated to the key information extraction problem. Two strict and effective rule constraints are built into the models according to the characteristics of key information, and shared parameters raise the parameter utilization rate of the models while reducing their parameter count. The performance of the models on small data sets is markedly improved, solving the problems that existing key information extraction techniques easily learn randomly occurring features on small data sets, causing overfitting, and cannot make full use of prior information.
2. The method has a pronounced effect on small data sets; on large data sets it can still improve model performance to some extent and accelerate model convergence. Text feature vectors are extracted by the customized neural network; by masking certain elements of the feature vector and sharing parameters, the neural network can exploit prior information, which speeds up convergence and reduces the degree of overfitting.
3. The invention provides a neural network model for the field of key information extraction whose notable advances comprise the following three aspects:
In the first aspect, compared with common neural network models such as BERT + CRF, LSTM + CRF, or BERT + LSTM + CRF, the invention adds customized rule constraints to the neural network tailored to the characteristics of key information extraction. There are two such constraints. The first rule constraint: with the position of the first character of the key information in the text denoted s and the position of the last character denoted e, s must be smaller than e. The second rule constraint: e - s cannot exceed a certain value x, where x is a set threshold. The character at the start of the key information (the first character) obviously cannot come after the character at its end (the last character), and because key information is generally not very long, the head and tail characters are never far apart. These two rule constraints achieve the significant technical effect of greatly reducing the effective dimension of the neural network's output vector from the original n × n (n being the article length) down to roughly n × x, since only spans with s < e and e - s ≤ x remain effective.
In the second aspect, the invention further has corresponding elements with the same difference e - s share parameters within the feature vector, removing x × (x - 1) parameters and effectively alleviating overfitting of the neural network model.
In the third aspect, the invention proposes a scheme that considers the top several (e.g., m) candidate key information fields. Compared with directly taking the single most likely candidate as the key information, this scheme is more stable.
In general, through strong and effective rule constraints and parameter sharing, the invention performs better in the field of key information extraction than general neural network models. The improvement of the invention's model over a generic model grows as the amount of data shrinks, i.e., when it is applied to small data sets.
Drawings
FIG. 1 is a flow chart of key information extraction according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
In the neural-network-based key information extraction method, label vectors are on the one hand generated from the articles and used for training the neural network; on the other hand, for an article of length n, a text feature vector of dimension n × n is extracted by the neural network. The index of the largest element of the text feature vector is then extracted and resolved, through the one-to-one correspondence between the Cartesian combinations of the position s of the first character of the key information and the position e of the last character and the elements of the text feature vector, into the position of the first character of the key information in the article and the position of the last character, thereby extracting the key information.
In a bidding document, the tenderer is a relatively critical piece of information, and existing bidding documents generally state the corresponding tenderer. The following describes in detail how the invention is implemented, taking the extraction of the tenderer as key information from bidding documents as the example. As shown in fig. 1, the key information extraction method of this embodiment comprises the following steps:
step 1, data cleaning.
And cleaning the data, and removing the repeated data and the abnormal data.
Step 2, generating a label vector.
The tenderer may appear in the bidding document many times; instead of directly using the tenderer as the label, this embodiment uses a label vector generated from the position information of the tenderer within the article.
Specifically, let the length of the article (also called the text length) be n, the position of the first character of the tenderer key information in the bidding document be s, and the position of the last character be e; take (s × n + e) as the element positions of the label vector. A label vector of dimension n × n is initialized for each article (bidding document) with all elements set to 0; for each element of the label vector whose position corresponds to some combination of s and e, that element is reset to 1, i.e., the elements at positions s × n + e are reset to 1 to obtain the final label vector for the subsequent loss computation.
Because the number of key-information occurrences, and hence of label elements, may differ between bidding documents, which is inconvenient for the implementation of the model, the number of label elements is preferably made equal across documents, and the invention holds that every label element should be given equal status. Therefore the elements of the initialized label vector are shuffled (put in random order), copied a number of times, and spliced end to end into an element string; a fixed number of leading elements of the string are then taken to form the final label vector used as the label.
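As a concrete illustration, a minimal Python sketch of this labeling scheme follows; the fixed length, the seeded random generator, and the toy sizes are assumptions, since the patent only specifies shuffling, end-to-end splicing, and fixed-length truncation:

```python
import random

def make_label_indices(n, spans, fixed_len=8, seed=0):
    # Build the index-list label for one article. fixed_len and the seeded
    # RNG are assumptions; the patent only specifies shuffling, splicing
    # end to end, and truncating to a fixed length.
    rng = random.Random(seed)
    idx = [s * n + e for (s, e) in spans]   # one index s*n + e per occurrence
    rng.shuffle(idx)                        # out-of-order processing
    reps = fixed_len // len(idx) + 1
    return (idx * reps)[:fixed_len]         # splice copies, keep leading elements

def make_dense_label(n, spans):
    # Dense n*n binary label vector: 1 at every position s*n + e, else 0.
    y = [0.0] * (n * n)
    for s, e in spans:
        y[s * n + e] = 1.0
    return y

# Toy usage: article of length 6 whose key info occurs at (1, 3) and (4, 5).
print(make_label_indices(6, [(1, 3), (4, 5)], fixed_len=4))
print(make_dense_label(6, [(1, 3)])[:12])
```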
Step 3, text tensorization.
This embodiment selects Google's open-source Chinese BERT model as the tensorization method and serializes the results. Specifically, given an article, it is converted into a vector of 512 word ids according to the preprocessing method of Google's Chinese BERT model, and this vector is used as the model's input, yielding a text tensor C (the specific shape of the tensor C is [512, 768]). The text tensor C is then serialized for convenient later reuse.
It should be noted that Google's Chinese BERT model is only one text tensorization method; others are usable, such as a BERT model trained from scratch on business data, or the more cost-effective fastText model.
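For illustration, here is a sketch of this step using the HuggingFace port of a Chinese BERT checkpoint ("bert-base-chinese") as a stand-in for the open-source model the patent references; the sample text is an arbitrary placeholder:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()

def text_to_tensor(article: str) -> torch.Tensor:
    # Convert the article to a vector of 512 word ids, then run BERT.
    inputs = tokenizer(article, max_length=512, truncation=True,
                       padding="max_length", return_tensors="pt")
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state.squeeze(0)  # text tensor C, shape [512, 768]

C = text_to_tensor("招标人：某某公司")   # any bidding-document text
print(C.shape)                            # torch.Size([512, 768])
torch.save(C, "text_tensor_C.pt")         # serialize for later reuse
```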
Step 4, generating a text feature vector.
Let the shape of the text tensor C be [n, d] (n being the text length, i.e., the article length, and d the dimension of a character). A query vector S of shape [d] is randomly initialized, and the value of CS (CS = C × S) is used as the first-character feature vector. Similarly, a query vector E of shape [d] is randomly initialized, and CE (CE = C × E) is used as the last-character feature vector. The Cartesian product of the first-character feature vector CS and the last-character feature vector CE is taken as the text feature vector, whose dimension is n × n.
A vector is a special case of a tensor, and shape is an attribute of a tensor: e.g., the shape of the vector [1, 2, 3] is [3], and the shape of the tensor [[1, 2], [3, 4]] is [2, 2]. The shape of the text tensor C is [n, d] and the shape of the query vector S is [d], so the shape of the first-character feature vector CS is [n]; the dimension of CS equals the number of characters, and each element of CS represents the probability that the character at the corresponding position is the first character. Correspondingly, each element of the last-character feature vector CE represents the probability that the character at the corresponding position is the last character.
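A sketch of this step follows. Realizing the "Cartesian product" as an outer sum, where the score of span (s, e) is CS[s] + CE[e], is this sketch's assumption (a common choice in span extraction); an outer product CS[s] * CE[e] would be an equally literal reading:

```python
import torch

n, d = 512, 768
C = torch.randn(n, d)              # text tensor C (stand-in for the BERT output)

# Randomly initialized query vectors S and E of shape [d].
S = torch.randn(d, requires_grad=True)
E = torch.randn(d, requires_grad=True)

CS = C @ S                         # first-character feature vector, shape [n]
CE = C @ E                         # last-character feature vector,  shape [n]

# Outer sum over all (s, e) pairs, then flatten so that the element for
# span (s, e) sits at position l = s*n + e.
F = (CS.unsqueeze(1) + CE.unsqueeze(0)).reshape(-1)
print(F.shape)                     # torch.Size([262144]) for n = 512
```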
Step 5, replacing the elements of the text feature vector that obviously cannot be the largest with a minimum value.
Since the first and last characters of the key information are never far apart, the elements of the feature vector that obviously cannot be the largest may be identified as follows: if the s and e corresponding to an element of the text feature vector satisfy s ≥ e, the element obviously cannot be the largest; or, if the s and e corresponding to an element have e - s greater than a set threshold x, the element obviously cannot be the largest. For example, if an element's e - s value is greater than 40, that element is clearly unlikely to be the largest.
According to this prior knowledge, if an element of the n × n-dimensional text feature vector obviously cannot be the largest, it is reset to a number that is extremely small compared with the other element values, i.e., it is assigned a minimum value.
In this embodiment, the position l of an element in the text feature vector relates to s and e by s × n + e = l, so s = l // n (// denotes integer division, rounding down) and e = l % n (% denotes the remainder). Therefore elements with l % n ≤ l // n are replaced with -1000 to realize the first rule constraint, that s must be less than e; and elements with l % n - l // n > 40 are replaced with -1000 to realize the second rule constraint, that e minus s cannot exceed a certain value.
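A vectorized sketch of these two masks follows; the threshold x = 40 comes from the embodiment, while processing the whole flattened n × n vector at once is this sketch's choice:

```python
import torch

n, x = 512, 40
l = torch.arange(n * n)          # flattened positions l = s*n + e
s = l // n                       # integer division, rounding down
e = l % n                        # remainder

# First rule constraint: s must be less than e  -> mask elements with e <= s.
# Second rule constraint: e - s cannot exceed x -> mask elements with e - s > x.
impossible = (e <= s) | (e - s > x)

F = torch.randn(n * n)           # text feature vector (stand-in)
F = torch.where(impossible, torch.full_like(F, -1000.0), F)
```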
Step 6, sharing parameters: multiply each element of the text feature vector that is not obviously unable to be the largest by a weight to generate a new output vector.
The weights (parameters) multiplying these elements of the n × n-dimensional text feature vector are shared according to the relation of s and e: all such elements with the same difference e - s are multiplied by the same (i.e., shared) weight, thereby generating a new n × n text feature vector as the output vector.
Specifically, for the elements of the feature vector that obviously cannot be the largest, the weight is 1; for the other elements the weights are trainable parameters, and the weights of all elements with the same value of l % n - l // n are set to be shared (i.e., multiplied by the same parameter). A sketch of this sharing scheme follows.
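This is a minimal sketch of the sharing scheme; holding one trainable parameter per value of e - s is this sketch's reading of the description:

```python
import torch

n, x = 512, 40
l = torch.arange(n * n)
s_pos, e_pos = l // n, l % n
diff = e_pos - s_pos

# One trainable weight per value of e - s, shared across every element on
# the same "diagonal"; elements that obviously cannot be the largest keep
# an implicit weight of 1, as the description states.
shared_w = torch.nn.Parameter(torch.ones(x + 1))
valid = (diff >= 1) & (diff <= x)          # not obviously impossible

F_masked = torch.randn(n * n)              # output of step 5's masking
weights = torch.where(valid, shared_w[diff.clamp(0, x)], torch.ones(n * n))
output = F_masked * weights                # new output vector
```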
Because of the processing in steps 5 and 6, the elements of the output vector that obviously cannot be the largest have already been assigned the minimum value, i.e., -1000, which essentially guarantees that those elements, and the text fields corresponding to them, will not be selected; this is the intuitive effect of the constraints.
Step 7, calculating the loss function. A softmax operation is applied to the new output vector generated in step 6, producing another new output vector; the cross entropy between this vector and the label vector is then computed as the loss and participates in the iterative training of the neural network.
In the invention, the position information of the keyword within the article serves as the label, so the label vector reflects that position information. In this embodiment, for each training sample, all keywords to be extracted from an article are matched by regular expressions, the position information of each match, i.e., the positions s and e, is parsed out, and the parsed position information is then used as the label of the corresponding article in the iterative training of the neural network. Because the iterative training makes full use of the label vector reflecting the keywords' positions, and because masking certain elements of the feature vector and sharing parameters let the neural network exploit prior information, the effective dimension of the network's output vector is effectively reduced and overfitting of the model is effectively alleviated.
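A minimal sketch of the loss computation, assuming the dense n × n binary label vector of step 2; summing over multiple occurrences is an assumption:

```python
import torch
import torch.nn.functional as F

def span_loss(output_vec: torch.Tensor, label_vec: torch.Tensor) -> torch.Tensor:
    log_p = F.log_softmax(output_vec, dim=-1)   # softmax, then log
    return -(label_vec * log_p).sum()           # cross entropy with the label

n = 6
output = torch.randn(n * n, requires_grad=True)
label = torch.zeros(n * n)
label[1 * n + 3] = 1.0           # key info occupies positions s = 1 .. e = 3
loss = span_loss(output, label)
loss.backward()                  # gradients for the iterative training
print(float(loss))
```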
Step 8, iterative convergence: minimize the loss function by gradient descent, iterate until convergence, and save the neural network model.
From the loss, the neural network computes the gradients of the variables (e.g., the query vector S and the query vector E), updates S and E according to those gradients, and repeats. The iterative process optimizes the query vectors S and E so that they correctly determine which characters should receive high scores (i.e., which characters are likely to be the first or last character). When the loss is small enough, the output vector is sufficiently close to the label vector; the neural network model is then considered to have predictive ability and iteration stops. A self-contained toy run of the whole training procedure is sketched below.
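For concreteness, here is a self-contained toy run of steps 4 through 8; all sizes, the SGD optimizer, and the stopping threshold are illustrative assumptions:

```python
import torch

# Toy article of length n = 8 with key info at s = 2, e = 5.
torch.manual_seed(0)
n, d, x = 8, 16, 4
C = torch.randn(n, d)                        # stands in for the BERT tensor
S = torch.randn(d, requires_grad=True)       # query vector S
E = torch.randn(d, requires_grad=True)       # query vector E
w = torch.nn.Parameter(torch.ones(x + 1))    # shared weights, one per e - s

l = torch.arange(n * n)
s_idx, e_idx = l // n, l % n
diff = e_idx - s_idx
valid = (diff >= 1) & (diff <= x)

label = torch.zeros(n * n)
label[2 * n + 5] = 1.0                       # position s*n + e

opt = torch.optim.SGD([S, E, w], lr=0.1)
for step in range(500):
    F_vec = ((C @ S).unsqueeze(1) + (C @ E).unsqueeze(0)).reshape(-1)
    F_vec = torch.where(valid, F_vec * w[diff.clamp(0, x)],
                        torch.full_like(F_vec, -1000.0))
    loss = -(label * torch.log_softmax(F_vec, -1)).sum()  # cross entropy
    opt.zero_grad()
    loss.backward()
    opt.step()
    if loss.item() < 1e-3:       # loss small enough -> stop iterating
        break

m = int(F_vec.argmax())
print(m // n, m % n)             # should recover the span (2, 5)
```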
Step 9, prediction: input text data into the neural network model saved in step 8 to obtain the final output vector and the extracted key information.
Specifically: find m elements of the output vector, each greater than or equal to every other element of the output vector. As described above, the position of an element in the vector is s × n + e, so the s and e corresponding to the m elements are computed through the one-to-one correspondence between the positions of the m elements in the output vector and the combinations of s and e. The text fields corresponding to the m elements are then obtained from their s and e; the elements corresponding to identical text fields are added together as that field's new element, and the text field corresponding to the largest of the new elements is selected as the final output, giving the extracted key information, namely the tenderer.
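A sketch of this decoding procedure follows; the value of m and the toy scores are illustrative:

```python
import torch

def extract_key_info(output_vec, text, n, m=2):
    # Decode the top-m candidate spans of step 9 and merge duplicate fields.
    scores, positions = torch.topk(output_vec, m)       # m largest elements
    fields = {}
    for score, l in zip(scores.tolist(), positions.tolist()):
        s, e = l // n, l % n                             # invert l = s*n + e
        field = text[s:e + 1]                            # corresponding text field
        fields[field] = fields.get(field, 0.0) + score   # add duplicate fields
    return max(fields, key=fields.get)                   # largest summed element

n = 8
text = "招标人ABC公司"                   # toy article of length 8
vec = torch.full((n * n,), -1000.0)
vec[3 * n + 5] = 5.0                    # span (3, 5) -> "ABC"
vec[0 * n + 2] = 4.0                    # span (0, 2) -> "招标人"
print(extract_key_info(vec, text, n))   # -> ABC
```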
The neural-network-based key information extraction system of the invention comprises: a label vector generation module implementing step 2; a text tensorization module implementing step 3; a text feature vector generation module implementing step 4; an assignment module implementing step 5; a parameter sharing module implementing step 6; a loss function calculation module implementing step 7; an iterative convergence module implementing step 8; and a prediction module implementing step 9.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (7)

1. A neural-network-based key information extraction method, characterized by comprising the following steps:
S1, generating label vectors: let the length of an article be n, the position of the first character of the key information in the article be s, and the position of the last character be e; take s × n + e as the element positions of the label vector, and initialize an n × n-dimensional label vector for each article; initialize all elements of the label vector to 0, and reset the elements at positions s × n + e to 1 to obtain the final label vector;
S2, performing text tensorization on a given article to obtain a text tensor C;
S3, generating a text feature vector: obtain a first-character feature vector CS and a last-character feature vector CE from the text tensor C, and take the Cartesian product of CS and CE as the text feature vector;
S4, replacing the elements of the text feature vector that obviously cannot be the largest with a minimum value;
S5, sharing parameters: multiply the elements of the text feature vector that are not obviously unable to be the largest by a weight to generate a new output vector;
S6, calculating the loss function: compute the cross entropy of the output vector and the label vector as the loss for iterative training of the neural network;
S7, minimizing the loss function by gradient descent, iterating until convergence, and saving the neural network model;
S8, inputting text data into the saved neural network model to obtain a final output vector and the extracted key information;
wherein in step S4 the elements of the text feature vector that obviously cannot be the largest are determined as follows:
if the s and e corresponding to an element of the text feature vector satisfy s ≥ e, the element obviously cannot be the largest; or, if the s and e corresponding to an element have e - s greater than the set threshold x, the element obviously cannot be the largest;
and in step S5, for the elements of the n × n-dimensional text feature vector that are not obviously unable to be the largest, all elements whose corresponding differences e - s are the same are multiplied by the same weight parameter, generating a new n × n text feature vector as the output vector.
2. The key information extraction method of claim 1, wherein in step S1 the elements of the initialized label vector are shuffled, copied several times, and spliced end to end into an element string, and a fixed number of leading elements of the string are then taken to form the final label vector.
3. The key information extraction method of claim 1, wherein in step S3, letting the shape of the text tensor C be [n, d], where d is the dimension of a character, a query vector S of shape [d] is randomly initialized and the value of CS = C × S is taken as the first-character feature vector; a query vector E of shape [d] is randomly initialized and the value of CE = C × E is taken as the last-character feature vector; and the Cartesian product of the first-character feature vector CS and the last-character feature vector CE is taken as the text feature vector, whose dimension is n × n.
4. The key information extraction method of claim 1, wherein step S8 comprises:
obtaining m elements of the output vector, each of which is greater than or equal to every element of the output vector other than those m elements;
calculating the s and e corresponding to the m elements through the one-to-one correspondence between the positions of the m elements in the output vector and the combinations of s and e;
determining the text fields corresponding to the m elements from their s and e; and adding together the elements corresponding to identical text fields as the new element for each such field, and selecting the text field corresponding to the largest of the new elements as the final output.
5. A neural-network-based key information extraction system, characterized by comprising:
a label vector generation module, for letting the length of an article be n, the position of the first character of the key information in the article be s, and the position of the last character be e, taking s × n + e as the element positions of the label vector, and initializing an n × n-dimensional label vector for each article; all elements of the label vector are initialized to 0, and the elements at positions s × n + e are reset to 1 to obtain the final label vector;
a text tensorization module, for performing text tensorization on a given article to obtain a text tensor C;
a text feature vector generation module, for obtaining a first-character feature vector CS and a last-character feature vector CE from the text tensor C and taking the Cartesian product of CS and CE as the text feature vector;
an assignment module, for replacing the elements of the text feature vector that obviously cannot be the largest with a minimum value;
a parameter sharing module, for multiplying the elements of the text feature vector that are not obviously unable to be the largest by a weight to generate a new output vector;
a loss function calculation module, for computing the cross entropy of the output vector and the label vector as the loss for iterative training of the neural network;
an iterative convergence module, for minimizing the loss function by gradient descent, iterating until convergence, and saving the neural network model;
and a prediction module, for inputting text data into the saved neural network model to obtain a final output vector and the extracted key information;
wherein in the assignment module the elements of the text feature vector that obviously cannot be the largest are determined as follows:
if the s and e corresponding to an element of the text feature vector satisfy s ≥ e, the element obviously cannot be the largest; or, if the s and e corresponding to an element have e - s greater than the set threshold x, the element obviously cannot be the largest;
and in the parameter sharing module, for the elements of the n × n-dimensional text feature vector that are not obviously unable to be the largest, all elements whose corresponding differences e - s are the same are multiplied by the same weight parameter, generating a new n × n text feature vector as the output vector.
6. The key information extraction system of claim 5, wherein the label vector generation module shuffles the elements of the initialized label vector, copies them several times, splices them end to end into an element string, and then takes a fixed number of leading elements of the string to form the final label vector.
7. The key information extraction system of claim 5, wherein the process by which the prediction module obtains the extracted key information comprises:
obtaining m elements of the output vector, each of which is greater than or equal to every element of the output vector other than those m elements;
calculating the s and e corresponding to the m elements through the one-to-one correspondence between the positions of the m elements in the output vector and the combinations of s and e;
determining the text fields corresponding to the m elements from their s and e; and adding together the elements corresponding to identical text fields as the new element for each such field, and selecting the text field corresponding to the largest of the new elements as the final output.
CN201911138210.6A 2019-11-20 2019-11-20 Neural network-based key information extraction method and system Active CN111008271B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911138210.6A CN111008271B (en) 2019-11-20 2019-11-20 Neural network-based key information extraction method and system


Publications (2)

Publication Number Publication Date
CN111008271A CN111008271A (en) 2020-04-14
CN111008271B (en) 2022-06-24

Family

ID=70113762

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911138210.6A Active CN111008271B (en) 2019-11-20 2019-11-20 Neural network-based key information extraction method and system

Country Status (1)

Country Link
CN (1) CN111008271B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018218705A1 (en) * 2017-05-27 2018-12-06 中国矿业大学 Method for recognizing network text named entity based on neural network probability disambiguation
CN107832289A (en) * 2017-10-12 2018-03-23 北京知道未来信息技术有限公司 A kind of name entity recognition method based on LSTM CNN
CN108733792A (en) * 2018-05-14 2018-11-02 北京大学深圳研究生院 A kind of entity relation extraction method
CN110263325A (en) * 2019-05-17 2019-09-20 交通银行股份有限公司太平洋信用卡中心 Chinese automatic word-cut
CN110196980A (en) * 2019-06-05 2019-09-03 北京邮电大学 A kind of field migration based on convolutional network in Chinese word segmentation task

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Distant Supervision for Relation Extraction via Piecewise Convolutional Neural Networks; Daojian Zeng et al.; Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing; 2015-09-21; pp. 1753-1762 *
Research on Short-Text Classification and Information Extraction Based on Deep Learning; Li Chao; China Master's Theses Full-text Database (Information Science and Technology, monthly); 2017-12-15; No. 12; pp. I138-471 *

Also Published As

Publication number Publication date
CN111008271A (en) 2020-04-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant