CN116956927A - Method and system for identifying named entities of bankruptcy document - Google Patents
- Publication number: CN116956927A (application number CN202310949107.XA)
- Authority
- CN
- China
- Prior art keywords
- text
- bankruptcy
- sequence data
- model
- optimal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F40/295 — Handling natural language data; Natural language analysis; Recognition of textual entities; Named entity recognition
- G06F40/126 — Handling natural language data; Text processing; Use of codes for handling textual entities; Character encoding
- G06N3/0442 — Neural networks; Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
- G06N3/088 — Neural networks; Learning methods; Non-supervised learning, e.g. competitive learning
Abstract
The application relates to the technical field of natural language processing, and discloses a method and a system for identifying a named entity of a bankruptcy document, wherein the method comprises the following steps: performing word coding on the bankruptcy document through the BERT language model obtained through pre-training, and extracting text features to generate word vectors; performing bidirectional coding on the generated word vector to obtain text tag sequence data; performing optimal decoding on the text label sequence data to obtain an optimal text label sequence; and determining the label category to which each character belongs according to the optimal text label sequence. According to the application, by adding the BERT pre-training language model as the feature expression layer, text semantic information is stored more completely, the context bidirectional feature extraction capacity of the model is improved, the semantic information is utilized more fully, the problem of boundary division of named entities is solved better, and the recognition rate of the model to the entities is improved.
Description
Technical Field
The application relates to the technical field of natural language processing, in particular to a method and a system for identifying a named entity of a bankruptcy document.
Background
The earliest named entity recognition work was based primarily on dictionary- and rule-based methods. These rely on rule templates constructed manually by linguists, are error-prone, and can handle only simple text data, not complex unstructured data. Machine-learning-based methods therefore became increasingly popular, chiefly the hidden Markov model (Hidden Markov Model, HMM), the maximum entropy model (Maximum Entropy model, ME), the support vector machine (Support Vector Machine, SVM), and the conditional random field (Conditional Random Field, CRF).
In recent years, with the growth of hardware capability and the advent of distributed word representations, neural networks have become effective models for many natural language processing (Natural Language Processing, NLP) tasks. Bengio et al. first proposed building a language model with a neural network; the distributed representation of words neatly mitigates the effect of data sparsity on statistical modeling while also overcoming the curse of dimensionality in the model parameters.
Because polysemy (one word, many meanings) and synonymy (many words, one meaning) are common among Chinese words, many researchers have used Word2Vec and similar word-embedding models to train distributed word-vector representations in order to improve the accuracy of Chinese entity recognition. However, Word2Vec and similar pre-trained models focus mainly on features between words or characters and neglect the context in which a word appears, which limits their recognition ability and still fails to represent polysemous words.
Research on named entity recognition at home and abroad is indeed fairly mature, but entity recognition in the field of bankruptcy documents differs from the general domain and has its own domain-specific characteristics. On the one hand, bankruptcy-document web data is mostly semi-structured text presented in bulletin form; there is no fixed standard format specification, although the organization of the content shows similar features. On the other hand, the texts vary greatly in length, contain a large amount of company information and many kinds of abbreviations, use highly specialized terminology, and exhibit polysemy and synonymy, so traditional methods achieve unsatisfactory accuracy and coverage on these entities. Research specific to the field of bankruptcy documents remains scarce, so the task of named entity recognition in this field still offers considerable research value and room for improvement.
Disclosure of Invention
The application provides a method and a system for identifying a named entity of a bankruptcy document, which are used for solving the technical problems in the prior art.
According to a first aspect of the application, a method for identifying named entities of bankruptcy documents is provided.
The method for identifying the named entities of the bankruptcy document comprises the following steps:
performing word coding on the bankruptcy document through the BERT language model obtained through pre-training, and extracting text features to generate word vectors;
performing bidirectional coding on the generated word vector to obtain text tag sequence data; performing optimal decoding on the text label sequence data to obtain an optimal text label sequence;
and determining the label category to which each character belongs according to the optimal text label sequence.
In addition, the method for identifying the named entities of the bankruptcy documents further comprises the following steps: performing masking language training on the BERT model to obtain a BERT language model; and when the BERT model is subjected to masking language training, 15% of words in text sentences are randomly masked, and then the words in masking positions are predicted by adopting an unsupervised learning method.
The BERT model adopts a Transformer structure.
In addition, bi-directionally encoding the generated word vector to obtain text tag sequence data includes: and taking the generated word vector as an input vector, inputting the input vector into a two-way long-short-term memory network layer for two-way coding, and obtaining text tag sequence data.
In addition, performing optimal decoding on the text tag sequence data to obtain an optimal text tag sequence includes: and decoding the text label sequence data through a CRF neural network model to obtain an optimal text label sequence.
According to a second aspect of the present application, there is provided a bankruptcy document named entity recognition system.
The bankruptcy document named entity recognition system comprises:
the text feature extraction module is used for carrying out word coding on the bankruptcy document through the BERT language model obtained through pre-training, extracting text features and generating word vectors;
the text label determining module is used for carrying out bidirectional coding on the generated word vector to obtain text label sequence data; performing optimal decoding on the text label sequence data to obtain an optimal text label sequence;
and the text character recognition module is used for determining the label category to which each character belongs according to the optimal text label sequence.
In addition, the bankruptcy document named entity recognition system further comprises: the model training module is used for carrying out masking language training on the BERT model to obtain a BERT language model; and when the BERT model is subjected to masking language training, 15% of words in text sentences are randomly masked, and then the words in masking positions are predicted by adopting an unsupervised learning method.
The BERT model adopts a Transformer structure.
In addition, when the text label determining module carries out bidirectional coding on the generated word vector to obtain text label sequence data, the generated word vector is used as an input vector and is input into a bidirectional long-short-term memory network layer to carry out bidirectional coding to obtain the text label sequence data.
In addition, when the text label determining module optimally decodes the text label sequence data to obtain an optimal text label sequence, the text label sequence data is decoded through a CRF neural network model to obtain an optimal text label sequence.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
according to the application, by adding the BERT pre-training language model as the feature expression layer, text semantic information is stored more completely, the context bidirectional feature extraction capability of the model is improved, the semantic information is utilized more fully, the problem of boundary division of named entities is solved better, the recognition rate of the model to the entities is improved, and the overall named entity recognition accuracy of the model reaches 92.45%.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flow diagram illustrating a method of identifying named entities of a bankruptcy document according to an exemplary embodiment;
FIG. 2 is a block diagram illustrating a system for identifying named entities of bankruptcy documents in accordance with an exemplary embodiment;
FIG. 3 is a diagram of an overall model framework of BERT-BiLSTM+CRF, shown in accordance with an exemplary embodiment;
FIG. 4 is an input vector representation of a BERT shown in accordance with an exemplary embodiment;
FIG. 5 is a diagram of a Transformer structure shown in accordance with an exemplary embodiment;
FIG. 6 is a block diagram of an LSTM cell shown in accordance with an exemplary embodiment;
fig. 7 is a schematic diagram of a computer device, according to an example embodiment.
Detailed Description
The following description and the drawings sufficiently illustrate specific embodiments herein to enable those skilled in the art to practice them. Portions and features of some embodiments may be included in, or substituted for, those of others. The scope of the embodiments herein includes the full scope of the claims, as well as all available equivalents of the claims. The terms "first," "second," and the like herein are used merely to distinguish one element from another and do not require or imply any actual relationship or order between the elements; indeed, the first element could also be termed the second element and vice versa. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a structure, apparatus, or device that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such structure, apparatus, or device. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of other like elements in a structure, apparatus, or device comprising that element. The various embodiments herein are described in a progressive manner, each embodiment focusing on its differences from the others; for identical and similar parts, the embodiments may be referred to one another.
The terms "longitudinal," "transverse," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like herein refer to an orientation or positional relationship based on that shown in the drawings, merely for ease of description herein and to simplify the description, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be constructed and operate in a particular orientation, and thus are not to be construed as limiting the application. In the description herein, unless otherwise specified and limited, the terms "mounted," "connected," and "coupled" are to be construed broadly, and may be, for example, mechanically or electrically coupled, may be in communication with each other within two elements, may be directly coupled, or may be indirectly coupled through an intermediary, as would be apparent to one of ordinary skill in the art.
Herein, unless otherwise indicated, the term "plurality" means two or more.
Herein, the character "/" indicates that the front and rear objects are an or relationship. For example, A/B represents: a or B.
Herein, the term "and/or" is an association relation describing an object, meaning that three relations may exist. For example, a and/or B, represent: a or B, or, A and B.
It should be understood that, although the steps in the flowchart are shown in sequence as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the figures may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; nor must these sub-steps or stages be executed sequentially — they may be performed in turn or alternately with at least a portion of other steps, or of the sub-steps or stages of other steps.
The various modules in the apparatus or system of the present application may be implemented in whole or in part in software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
Embodiments of the application and features of the embodiments may be combined with each other without conflict.
FIG. 1 illustrates one embodiment of a method of the present application for identifying named entities of bankruptcy documents.
In this alternative embodiment, the method for identifying named entities of the bankruptcy document includes:
step S101, performing word coding on a bankruptcy document through a BERT language model obtained through pre-training, and extracting text features to generate a word vector;
step S103, performing bidirectional coding on the generated word vector to obtain text tag sequence data; performing optimal decoding on the text label sequence data to obtain an optimal text label sequence;
step S105, determining the label category to which each character belongs according to the optimal text label sequence.
FIG. 2 illustrates one embodiment of a bankruptcy document named entity recognition system of the present application.
In this alternative embodiment, the bankruptcy document named entity recognition system includes:
the text feature extraction module 201 is configured to perform word encoding on the bankruptcy document through the BERT language model obtained through pre-training, and extract text features to generate a word vector;
the text tag determining module 203 is configured to perform bidirectional encoding on the generated word vector to obtain text tag sequence data; performing optimal decoding on the text label sequence data to obtain an optimal text label sequence;
the text character recognition module 205 is configured to determine, according to the optimal text label sequence, a label class to which each character belongs.
In practical application, as shown in fig. 3-5, the BERT language model is obtained by the following training method, specifically:
the input part of Bert is a linear sequence, two sentences are split by a separator, and the forefront and last two marks are added. Each word has Position Embeddings (position embedding), token embedded and Segment Embeddings (separation embedding), and three embedding corresponding to the word are overlapped to form the Bert input;
the Bert pre-training task adopts a Masked Language Model (mask language model) pre-training method, randomly masks 15% of words in one sentence, and then adopts an unsupervised learning method to predict the words at the mask position so as to achieve the training of the bidirectional features;
the BERT model architecture is based on multi-layer bi-directional transform decoding, and adopts a transducer structure (gesture-based motion recognition model). A transducer is an encoder-decoder structure formed by a stack of several encoders and decoders. The model structure is as follows:
the left part is the encoder, composed of Multi-Head Attention and a fully connected layer, which converts the input corpus into feature vectors;
the right part is the decoder, whose inputs are the encoder output and the previously predicted results; it is composed of Masked Multi-Head Attention, Multi-Head Attention, and a fully connected layer, and outputs the conditional probability of the final result.
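The Multi-Head Attention in both halves is built from scaled dot-product attention. A minimal single-head sketch in pure Python — toy 2-dimensional vectors with illustrative values; a real multi-head layer runs several such heads in parallel over linearly projected inputs:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    Q, K, V are lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(scores)
        out.append([sum(wj * v[i] for wj, v in zip(w, V))
                    for i in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]                      # one query
K = [[1.0, 0.0], [0.0, 1.0]]          # two keys
V = [[1.0, 2.0], [3.0, 4.0]]          # two values
ctx = attention(Q, K, V)
print(ctx)
```

The query attends more strongly to the first key (larger dot product), so the output leans toward the first value row.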
When the generated word vector is bidirectionally encoded to obtain text label sequence data, the generated word vector is used as an input vector and is input into a bidirectional long-short-term memory network layer to be bidirectionally encoded to obtain the text label sequence data.
As shown in fig. 2 and 6, the specific steps are as follows:
Firstly, the forget gate decides which part of the previous information to discard; it selectively forgets information in the cell state. It takes the hidden-layer output $h_{t-1}$ at time $t-1$ and the current input $x_t$ as inputs, and produces $f_t$ through the sigmoid activation function. $f_t$ ranges over $[0, 1]$ and indicates how much of the cell state $c_{t-1}$ at time $t-1$ is kept: 0 means discard entirely, 1 means keep entirely. The formula for $f_t$ is:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

where $W_f$ and $b_f$ are the weights and bias connecting the two layers.

The input gate stores the information to be updated, selectively recording new information into the cell state. A sigmoid layer decides which values to update, and a tanh layer creates a candidate vector $\tilde{c}_t$ that will be added to the cell state. $i_t$, $\tilde{c}_t$, and $c_t$ are given by:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$
$$c_t = f_t * c_{t-1} + i_t * \tilde{c}_t$$

where $i_t$, the output of the input gate, determines the values to be updated; $f_t$, the probability of forgetting the previous cell state, ranges over $[0, 1]$ (0 means discard entirely, 1 means keep entirely); $\tilde{c}_t$ denotes the temporary state containing the new candidate values; $W_i$ is the weight coefficient of $h_{t-1}$ in the input gate; $W_c$ is the weight coefficient of $h_{t-1}$ in feature extraction; $b_i$ is the bias of the input gate; $b_c$ is the bias in feature extraction; $h$ denotes the hidden layer; and $\sigma$ denotes the sigmoid activation function.

Finally, the output gate decides which values to output; it controls the final output and how much of the cell state in this layer needs to be filtered. Specifically: the output factor $o_t$ is first computed with the sigmoid function; the current cell state is then normalized through the tanh function and multiplied by $o_t$ to obtain the hidden-layer output $h_t$ at the current time:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t * \tanh(c_t)$$

where $W_o$ is the weight coefficient of $h_{t-1}$ in the output gate; $\sigma$ is the sigmoid activation function; $b_o$ is the bias of the output gate; and $h$ denotes the hidden layer.
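The gate equations can be traced end to end with a scalar toy cell — a hedged sketch, not the trained network: the weights and inputs are illustrative scalars, whereas a real LSTM layer uses weight matrices over vectors:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step over scalars. W maps each gate name
    ('f', 'i', 'c', 'o') to (w_h, w_x); b maps it to a bias."""
    def gate(name, act):
        w_h, w_x = W[name]
        return act(w_h * h_prev + w_x * x_t + b[name])
    f_t = gate("f", sigmoid)            # forget gate
    i_t = gate("i", sigmoid)            # input gate
    c_hat = gate("c", math.tanh)        # candidate state
    c_t = f_t * c_prev + i_t * c_hat    # new cell state
    o_t = gate("o", sigmoid)            # output gate
    h_t = o_t * math.tanh(c_t)          # new hidden state
    return h_t, c_t

W = {k: (0.5, 0.5) for k in "fico"}  # illustrative shared weights
b = {k: 0.0 for k in "fico"}
h, c = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.0, W=W, b=b)
print(h, c)
```

With zero bias and zero previous state, the sigmoid gates sit near 0.5 and the cell state moves only part-way toward the tanh candidate, which is exactly the gated-update behavior the equations describe.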
In an actual natural-language sentence, key information may appear at the beginning or at the end, so named entity recognition also needs a reverse LSTM (Long Short-Term Memory network) for learning, i.e., a BiLSTM (bidirectional long short-term memory network). One LSTM network computes forward hidden features, another computes backward hidden features, and splicing the two outputs yields a bidirectional LSTM network that better captures longer-distance bidirectional semantic dependencies.
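The forward/backward splicing can be sketched as follows; `toy_step` is a stand-in recurrence (a real BiLSTM runs two separate LSTM cells, one per direction, over vectors):

```python
def run_rnn(xs, step, h0=0.0):
    """Run a recurrent step over a sequence, returning the hidden
    state at every position."""
    h, hs = h0, []
    for x in xs:
        h = step(x, h)
        hs.append(h)
    return hs

def bilstm_encode(xs, step):
    """Forward pass plus a backward pass over the reversed sequence,
    then concatenate the two hidden states per position."""
    fwd = run_rnn(xs, step)
    bwd = run_rnn(list(reversed(xs)), step)[::-1]  # re-align to positions
    return list(zip(fwd, bwd))

def toy_step(x, h):
    return 0.5 * h + x  # placeholder recurrence, not an LSTM

print(bilstm_encode([1.0, 2.0, 3.0], toy_step))
```

Each position now carries both a left-context and a right-context summary, which is what lets the tagger use information from either end of the sentence.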
And when the text label sequence data is optimally decoded to obtain an optimal text label sequence, decoding the text label sequence data through a CRF neural network model to obtain the optimal text label sequence.
Specifically, the CRF neural network model can learn the forward and backward dependencies of a sentence and add constraints that ensure the final prediction is valid. For each input sequence $x = (x_1, x_2, \ldots, x_n)$, the algorithm first obtains a predicted tag sequence $y = (y_1, y_2, \ldots, y_n)$ and defines the score of the sequence as:

$$s(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

The first term is determined by the CRF transition matrix $A$, where $A_{y_i, y_{i+1}}$ is the transition score from the $y_i$-th tag to the $y_{i+1}$-th tag; the second term is determined by the matrix $P$ output by the LSTM, where $P_{i, y_i}$ is the probability that the softmax at position $i$ outputs tag $y_i$. Because start and end positions are added, the dimension of the transition probability matrix is $n+2$.

Given a training sample sequence $x$, the score of the correct sequence $\bar{y}$ is normalized into a probability:

$$P(\bar{y} \mid x) = \frac{e^{s(x, \bar{y})}}{\sum_{\tilde{y} \in Y} e^{s(x, \tilde{y})}}$$

where $\bar{y}$ denotes the true tag values, $Y$ denotes the set of all possible tag sequences, the numerator corresponds to the correct tag sequence, and the denominator to all possible annotation sequences. The loss function is then defined as:

$$L = -\log P(\bar{y} \mid x) = \log \sum_{\tilde{y} \in Y} e^{s(x, \tilde{y})} - s(x, \bar{y})$$

Suitable parameters are found with algorithms such as gradient descent to minimize the loss function; after training, the model predicts the optimal solution:

$$y^{*} = \arg\max_{\tilde{y} \in Y} s(x, \tilde{y})$$
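The score function and the arg-max decoding can be sketched in pure Python — a hedged illustration with made-up scores, omitting the start/end states and the training-time normalization over all sequences; `viterbi` recovers the optimal tag sequence by dynamic programming:

```python
def sequence_score(emissions, transitions, tags):
    """s(x, y): emission scores P[i][y_i] plus transition scores
    A[y_i][y_{i+1}] along the tag path."""
    s = sum(emissions[i][t] for i, t in enumerate(tags))
    s += sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return s

def viterbi(emissions, transitions):
    """Return (best tag path, best score) by dynamic programming."""
    n_tags = len(emissions[0])
    score = list(emissions[0])  # best score ending in each tag so far
    backptr = []
    for emit in emissions[1:]:
        new_score, ptr = [], []
        for j in range(n_tags):
            prev = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[prev] + transitions[prev][j] + emit[j])
            ptr.append(prev)
        score = new_score
        backptr.append(ptr)
    best_last = max(range(n_tags), key=lambda j: score[j])
    path = [best_last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return path[::-1], score[best_last]

# Two tags (0, 1); the transition matrix discourages 1 -> 1,
# the kind of constraint a CRF layer learns (numbers illustrative).
emissions = [[2.0, 0.5], [0.5, 2.0], [2.0, 1.9]]
transitions = [[0.5, 0.5], [0.5, -2.0]]
path, best = viterbi(emissions, transitions)
print(path, best)  # -> [0, 1, 0] 7.0
```

Even though position 2 slightly prefers tag 1 on its emission alone, the penalized 1 → 1 transition steers the decoder to a globally higher-scoring path — the benefit of CRF decoding over per-position arg max.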
FIG. 7 illustrates one embodiment of a computer device of the present application. The computer device may be a server comprising a processor, a memory, and a network interface connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store static and dynamic information data. The network interface of the computer device communicates with an external terminal through a network connection. The computer program, when executed by the processor, implements the steps of the above method embodiments.
It will be appreciated by those skilled in the art that the structure shown in FIG. 7 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment, a computer device is also provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor performing the steps of the above-described method embodiments when the computer program is executed.
In one embodiment, a computer readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, or the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like.
The present application is not limited to the structure that has been described above and shown in the drawings, and various modifications and changes can be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.
Claims (10)
1. The method for identifying the named entity of the bankruptcy document is characterized by comprising the following steps of:
performing word coding on the bankruptcy document through the BERT language model obtained through pre-training, and extracting text features to generate word vectors;
performing bidirectional coding on the generated word vector to obtain text tag sequence data; performing optimal decoding on the text label sequence data to obtain an optimal text label sequence;
and determining the label category to which each character belongs according to the optimal text label sequence.
2. The method for identifying named entities of bankruptcy documents according to claim 1, further comprising:
performing masking language training on the BERT model to obtain a BERT language model;
and when the BERT model is subjected to masking language training, 15% of words in text sentences are randomly masked, and then the words in masking positions are predicted by adopting an unsupervised learning method.
3. The method for identifying the named entities of the bankruptcy document according to claim 2, wherein the BERT model adopts a Transformer structure.
4. The method for identifying named entities of bankruptcy documents according to claim 1, wherein bi-directionally encoding the generated word vectors to obtain text tag sequence data comprises:
and taking the generated word vector as an input vector, inputting the input vector into a two-way long-short-term memory network layer for two-way coding, and obtaining text tag sequence data.
5. The method for identifying named entities in a bankruptcy document according to claim 4, wherein performing optimal decoding on the text tag sequence data to obtain an optimal text tag sequence comprises:
decoding the text tag sequence data through a conditional random field (CRF) neural network model to obtain the optimal text tag sequence.
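The "optimal decoding" of claim 5 is, in a linear-chain CRF, conventionally carried out with the Viterbi algorithm over the per-position tag scores and the learned tag-transition scores. The sketch below is a minimal pure-Python illustration under that assumption; the toy scores in the test are invented, not from a trained model:

```python
def viterbi_decode(emissions, transitions, tags):
    """Find the highest-scoring tag path (the 'optimal text tag sequence').

    `emissions[t][tag]`  : score of `tag` at position `t` (from the BiLSTM layer)
    `transitions[a][b]`  : score of moving from tag `a` to tag `b` (from the CRF)
    """
    score = {tag: emissions[0][tag] for tag in tags}   # best score ending in each tag
    back = []                                          # back-pointers per position
    for t in range(1, len(emissions)):
        new_score, pointers = {}, {}
        for cur in tags:
            # best previous tag to arrive at `cur`
            prev = max(tags, key=lambda p: score[p] + transitions[p][cur])
            new_score[cur] = score[prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        score, back = new_score, back + [pointers]
    # backtrack from the best final tag to recover the full path
    best = max(tags, key=lambda tag: score[tag])
    path = [best]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))
```

The transition scores are what make the output a coherent sequence rather than independent per-character choices; for example, a trained CRF typically learns a strongly negative score for an I- tag following O.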
6. A system for identifying named entities in a bankruptcy document, comprising:
a text feature extraction module, configured to perform word encoding on the bankruptcy document using a pre-trained BERT language model and extract text features to generate word vectors;
a text tag determination module, configured to perform bidirectional encoding on the generated word vectors to obtain text tag sequence data, and to perform optimal decoding on the text tag sequence data to obtain an optimal text tag sequence; and
a text character recognition module, configured to determine the tag category to which each character belongs according to the optimal text tag sequence.
7. The system for identifying named entities in a bankruptcy document according to claim 6, further comprising a model training module configured to perform masked-language training on a BERT model to obtain the BERT language model, wherein, during the masked-language training, 15% of the words in each text sentence are randomly masked, and the words at the masked positions are then predicted using an unsupervised learning method.
8. The system of claim 7, wherein the BERT model has a Transformer structure.
9. The system of claim 6, wherein, when performing bidirectional encoding on the generated word vectors to obtain text tag sequence data, the text tag determination module inputs the generated word vectors, as input vectors, into a bidirectional long short-term memory (BiLSTM) network layer for bidirectional encoding to obtain the text tag sequence data.
10. The system of claim 9, wherein, when performing optimal decoding on the text tag sequence data to obtain an optimal text tag sequence, the text tag determination module decodes the text tag sequence data through a conditional random field (CRF) neural network model to obtain the optimal text tag sequence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310949107.XA CN116956927A (en) | 2023-07-31 | 2023-07-31 | Method and system for identifying named entities of bankruptcy document |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310949107.XA CN116956927A (en) | 2023-07-31 | 2023-07-31 | Method and system for identifying named entities of bankruptcy document |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116956927A true CN116956927A (en) | 2023-10-27 |
Family
ID=88461634
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310949107.XA Pending CN116956927A (en) | 2023-07-31 | 2023-07-31 | Method and system for identifying named entities of bankruptcy document |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116956927A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110516256A (en) * | 2019-08-30 | 2019-11-29 | 的卢技术有限公司 | Chinese named entity extraction method and system |
CN114048745A (en) * | 2021-11-05 | 2022-02-15 | 新智道枢(上海)科技有限公司 | Method and system for recognizing named entities of digital police service warning situation addresses |
CN114564959A (en) * | 2022-01-14 | 2022-05-31 | 北京交通大学 | Method and system for identifying fine-grained named entities of Chinese clinical phenotype |
CN115238697A (en) * | 2022-07-26 | 2022-10-25 | 贵州数联铭品科技有限公司 | Judicial named entity recognition method based on natural language processing |
2023
- 2023-07-31 CN CN202310949107.XA patent/CN116956927A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111783462B (en) | Chinese named entity recognition model and method based on double neural network fusion | |
CN110135457B (en) | Event trigger word extraction method and system based on self-encoder fusion document information | |
CN111480197B (en) | Speech recognition system | |
CN112712804B (en) | Speech recognition method, system, medium, computer device, terminal and application | |
CN111310471B (en) | Travel named entity identification method based on BBLC model | |
US20190189111A1 (en) | Method and Apparatus for Multi-Lingual End-to-End Speech Recognition | |
CN110866401A (en) | Chinese electronic medical record named entity identification method and system based on attention mechanism | |
CN110263325B (en) | Chinese word segmentation system | |
CN111062217B (en) | Language information processing method and device, storage medium and electronic equipment | |
Hori et al. | Dialog state tracking with attention-based sequence-to-sequence learning | |
CN111079432B (en) | Text detection method and device, electronic equipment and storage medium | |
CN113177412A (en) | Named entity identification method and system based on bert, electronic equipment and storage medium | |
CN116127953B (en) | Chinese spelling error correction method, device and medium based on contrast learning | |
CN115859164A (en) | Method and system for identifying and classifying building entities based on prompt | |
CN114036950A (en) | Medical text named entity recognition method and system | |
CN115688784A (en) | Chinese named entity recognition method fusing character and word characteristics | |
Ye et al. | Chinese named entity recognition based on character-word vector fusion | |
CN115658898A (en) | Chinese and English book entity relation extraction method, system and equipment | |
CN113191150B (en) | Multi-feature fusion Chinese medical text named entity identification method | |
CN115809666B (en) | Named entity recognition method integrating dictionary information and attention mechanism | |
CN116362242A (en) | Small sample slot value extraction method, device, equipment and storage medium | |
CN115600597A (en) | Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium | |
CN115240712A (en) | Multi-mode-based emotion classification method, device, equipment and storage medium | |
CN116956927A (en) | Method and system for identifying named entities of bankruptcy document | |
CN113012685B (en) | Audio recognition method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |