CN111191452A - Railway text named entity recognition method and device - Google Patents


Info

Publication number
CN111191452A
Authority
CN
China
Prior art keywords
railway
text
information
sample
preset
Prior art date
Legal status
Pending
Application number
CN201911350774.6A
Other languages
Chinese (zh)
Inventor
杨连报
王同军
李新琴
董兴芝
薛蕊
李平
朱建生
马小宁
马志强
刘军
吴艳华
邹丹
王喆
代明睿
张晓栋
程智博
赵冰
Current Assignee
China Academy of Railway Sciences Corp Ltd CARS
Institute of Computing Technologies of CARS
Beijing Jingwei Information Technology Co Ltd
Original Assignee
China Academy of Railway Sciences Corp Ltd CARS
Institute of Computing Technologies of CARS
Beijing Jingwei Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Academy of Railway Sciences Corp Ltd CARS, Institute of Computing Technologies of CARS, Beijing Jingwei Information Technology Co Ltd filed Critical China Academy of Railway Sciences Corp Ltd CARS
Priority application: CN201911350774.6A
Publication: CN111191452A
Legal status: Pending


Classifications

    • G06F16/35 — Information retrieval of unstructured textual data: clustering; classification
    • G06F18/241 — Pattern recognition: classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/044 — Neural network architectures: recurrent networks, e.g. Hopfield networks
    • G06N3/045 — Neural network architectures: combinations of networks
    • G06N3/048 — Neural network architectures: activation functions
    • G06N3/08 — Neural networks: learning methods

Abstract

The embodiment of the invention provides a method and a device for recognizing named entities in railway text. The method comprises: preprocessing original railway text data to obtain preprocessed railway text data information; inputting the preprocessed railway text data information into a preset BERT model to obtain railway text vector information; and inputting the railway text vector information into a preset BiLSTM-CRF model to obtain the railway text named-entity recognition result information. The preset BERT model learns the textual context of the railway text feature vectors and produces vector representations of railway accident and fault texts. The preset BERT model enhances the vector and semantic representation of the keywords of named entities in fault texts, while the BiLSTM-CRF model computes over and recognizes the fault-text vectors, yielding the railway text named-entity recognition result information.

Description

Railway text named entity recognition method and device
Technical Field
The invention relates to the technical field of information processing, in particular to a method and a device for identifying named entities of railway texts.
Background
Railways are characterized by high technical complexity, high running speeds, large passenger volumes, short train headways, difficult rescue, and strict safety requirements, and local faults can easily trigger chain reactions and amplification effects. This places new, higher demands on safety early warning and on the rapid handling of sudden faults during railway operation.
Entity names in the railway safety field have complex structures, numerous abbreviations, and highly specialized terminology, and safety text data are mostly stored as Word or Excel files or archived on paper. Constrained by traditional techniques, it is difficult to efficiently extract the effective information in these data from the original databases. Extracting the potential effective information in railway safety text data would greatly increase the utilization value of railway safety data and further support decision-making for intelligent railway operation and maintenance.
Therefore, how to effectively implement named entity identification in railway security texts has become an urgent problem to be solved in the industry.
Disclosure of Invention
Embodiments of the present invention provide a method and an apparatus for recognizing named entities in railway text, so as to solve, or at least partially solve, the technical problems mentioned in the background art above.
In a first aspect, an embodiment of the present invention provides a method for identifying a named entity in a railway text, including:
preprocessing original railway text data to obtain preprocessed railway text data information;
inputting the preprocessed railway text data information into a preset BERT model to obtain railway text vector information;
inputting the railway text vector information into a preset BiLSTM-CRF model to obtain the railway text named-entity recognition result information;
the preset BERT model is obtained by training on sample preprocessed railway text data information with entity marks, and the preset BiLSTM-CRF model is obtained by training on sample railway text vector information with entity marks and word-order marks.
More specifically, before the step of inputting the preprocessed railway text data information into a preset BERT model to obtain the railway text vector information, the method further includes:
obtaining sample preprocessed railway text data information, carrying out named entity marking on the sample preprocessed railway text data information through BIO marks, and respectively inserting CLS marks and SEP marks into sentence heads and sentence tails of the sample preprocessed railway text data information to obtain sample preprocessed railway text vector information with entity marks;
coding the word-order information of the sample preprocessed railway text vector information with entity marks to obtain sample railway text vector information with entity marks and word-order marks;
and training the BERT model on the sample railway text vector information with entity marks and word-order marks, obtaining the preset BERT model when the loss function of the BERT model converges stably.
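As a minimal sketch of the markup step above (the token strings follow the patent; the specific tag names are taken from the Fig. 3 example and are otherwise illustrative):

```python
def add_special_tokens(chars, tags):
    # Insert a [CLS] mark at the sentence head and a [SEP] mark at the
    # sentence tail; the special positions receive the "O" (outside) tag.
    return ["[CLS]"] + chars + ["[SEP]"], ["O"] + tags + ["O"]

# BIO marking: "B-" opens a named entity, "I-" continues it, "O" is outside.
chars = list("道岔冻结")                      # "turnout frozen", as in Fig. 3
tags = ["B-Fau", "I-Fau", "I-Fau", "I-Fau"]  # a fault-type entity
tokens, labels = add_special_tokens(chars, tags)
print(tokens)  # ['[CLS]', '道', '岔', '冻', '结', '[SEP]']
```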
More specifically, before the step of inputting the railway text vector information into a preset BiLSTM-CRF model to obtain the railway text named-entity recognition result information, the method further comprises:
acquiring sample railway text vector information with entity marks and word-order marks, training a BiLSTM model on this sample railway text vector information, outputting sample railway text context information with word-order labels, and obtaining the trained BiLSTM model when a first preset training condition is met;
and training the CRF model on the sample railway text context information with word-order labels, obtaining the trained CRF model when a second preset training condition is met, and obtaining the preset BiLSTM-CRF model from the trained BiLSTM model and the trained CRF model.
More specifically, the step of training the BERT model on the sample railway text vector information with entity marks and word-order marks and obtaining the preset BERT model when the BERT model loss function converges stably includes:
masking some characters in the sample railway text vector, leaving the remaining characters unmasked;
predicting the masked characters from the contextual meaning of the unmasked characters;
obtaining sample random vector information without word-order labels, and performing next-sentence prediction training on this sample random vector information;
and obtaining the preset BERT model when the loss function of the masked-character prediction converges stably and the next-sentence prediction loss also converges stably.
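The masked-character construction above can be sketched as follows (a hypothetical helper; the 15% default ratio is the published BERT recipe, which the patent does not state):

```python
import random

def mask_tokens(tokens, mask_ratio=0.15, seed=0):
    # Sketch of the Masked LM input: a fraction of ordinary characters
    # is replaced by [MASK]; the model must recover the originals from
    # the unmasked context. Special marks are never masked.
    rng = random.Random(seed)
    masked, targets = [], []
    for t in tokens:
        if t not in ("[CLS]", "[SEP]") and rng.random() < mask_ratio:
            masked.append("[MASK]")
            targets.append(t)      # character the model must predict
        else:
            masked.append(t)
            targets.append(None)   # position not predicted
    return masked, targets
```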
More specifically, the step of preprocessing the original railway text data specifically includes:
removing equipment signal information, equipment number information and date information in original railway text data to obtain first railway text data information;
and splitting the first railway text data information into single Chinese characters, and removing repeated Chinese characters to obtain the preprocessed railway text data information.
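A minimal preprocessing sketch following the two steps above (the regular expressions for dates and equipment numbers are illustrative assumptions; the patent does not publish its exact patterns):

```python
import re

def preprocess(text):
    # Assumed patterns: the patent removes equipment signal/number and
    # date information but does not give the exact rules.
    text = re.sub(r"\d{4}年\d{1,2}月\d{1,2}日", "", text)  # dates such as 2019年12月25日
    text = re.sub(r"[A-Za-z0-9#\-]+", "", text)            # device models/numbers, e.g. CTC-3
    text = re.sub(r"\s+", "", text)                        # stray whitespace
    seen, chars = set(), []
    for ch in text:          # split into single Chinese characters
        if ch not in seen:   # and drop repeated characters
            seen.add(ch)
            chars.append(ch)
    return chars

print(preprocess("2019年12月25日 CTC-3型道岔冰雪冻结"))  # ['型', '道', '岔', '冰', '雪', '冻', '结']
```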
More specifically, the BiLSTM model includes a forward LSTM model and a backward LSTM model.
In a second aspect, an embodiment of the present invention provides a device for identifying a named entity in a railway text, including:
the preprocessing module is used for preprocessing original railway text data to obtain preprocessed railway text data information;
the vector conversion module is used for inputting the preprocessed railway text data information into a preset BERT model to obtain railway text vector information;
the recognition module is used for inputting the railway text vector information into a preset BiLSTM-CRF model to obtain the railway text named-entity recognition result information;
the preset BERT model is obtained by training on sample preprocessed railway text data information with entity marks, and the preset BiLSTM-CRF model is obtained by training on sample railway text vector information with entity marks and word-order marks.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the railway text named entity recognition method according to the first aspect when executing the program.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements the steps of the railway text named entity recognition method according to the first aspect.
According to the method and the device for recognizing named entities in railway text, the vector and semantic representation of named-entity keywords in fault texts is enhanced with the preset BERT model, the computation over and recognition of fault-text vectors is realized with the preset BiLSTM-CRF model, the ordering relation of the prediction information is adjusted, feature extraction for railway-domain text is realized, and labor cost is reduced in obtaining the railway text named-entity recognition result information.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a method for identifying named entities in railroad texts according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a BiLSTM neural network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a process for identifying named entities in railroad text according to an embodiment of the invention;
FIG. 4 is a schematic structural diagram of a railroad text named entity recognition apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a method for identifying a named entity in a railway text described in an embodiment of the present invention, as shown in fig. 1, including:
step S1, preprocessing the original railway text data to obtain preprocessed railway text data information;
step S2, inputting the preprocessed railway text data information into a preset BERT model to obtain railway text vector information;
step S3, inputting the railway text vector information into a preset BiLSTM-CRF model to obtain the railway text named-entity recognition result information;
the preset BERT model is obtained by training on sample preprocessed railway text data information with entity marks, and the preset BiLSTM-CRF model is obtained by training on sample railway text vector information with entity marks and word-order marks.
Specifically, the original railway text data described in the embodiment of the present invention refers to text data in a railway safety text.
The preprocessing described in the embodiment of the invention removes incomplete and irregular data from the original railway text data: information such as railway equipment models, equipment numbers, and dates is removed, common stop words and punctuation marks are stripped, the railway text is split into a sequence of single Chinese characters, and repeated Chinese characters are removed.
The preset BERT model described in the embodiment of the invention is used for processing the preprocessed railway text data information to form the railway text vector information with named entities.
In the embodiment of the invention, the preset BERT model is mainly obtained by jointly training on the Masked LM task and the Next Sentence Prediction task.
The preset BiLSTM-CRF model described in the embodiment of the invention is used to predict the context information of the railway text vector information with named entities, adjust the ordering relation of the prediction information, and automatically extract target entities from documents of daily railway operation such as safety inspection lists and fault reports.
The preset BiLSTM-CRF model described in the embodiment of the invention is composed of a trained BiLSTM model and a trained CRF model. The BiLSTM described here is a bidirectional LSTM neural network comprising a forward and a backward LSTM model; it learns both the forward hidden-layer features $\overrightarrow{h_t}$ and the backward hidden-layer features $\overleftarrow{h_t}$.
The preset BiLSTM-CRF model described in the embodiment of the invention is obtained by inputting railway text samples with entity marks and word-order marks into the BiLSTM model for training, outputting sample railway text context information with word-order labels, and obtaining the trained BiLSTM model when a first preset training condition is met;
and training the CRF model on the sample railway text context information with word-order labels, obtaining the trained CRF model when a second preset training condition is met, and obtaining the preset BiLSTM-CRF model from the trained BiLSTM model and the trained CRF model.
According to the embodiment of the invention, the vector and semantic representation of the keywords of railway text named entities is enhanced with the preset BERT model, the context of railway text named entities is predicted with the preset BiLSTM-CRF model, feature extraction for railway-domain text is completed by adjusting the ordering relation of the prediction information, and the railway text named-entity recognition result information is finally obtained while reducing labor cost.
Based on the above embodiment, before the step of inputting the preprocessed railway text data information into the preset BERT model to obtain the railway text vector information, the method further includes:
obtaining sample preprocessed railway text data information, marking named entities in it with BIO marks, and inserting CLS and SEP marks at the sentence head and sentence tail respectively to obtain sample preprocessed railway text vector information with entity marks;
coding the word order of the sample preprocessed railway text vector information with entity marks to obtain sample railway text vector information with entity marks and word-order marks;
and training the BERT model on the sample railway text vector information with entity marks and word-order marks, finishing model training when the loss function of the BERT model converges stably, to obtain the preset BERT model.
Specifically, the BIO flag described in the embodiment of the present invention means that "B" indicates that the chinese character is a starting character of a named entity; "I" indicates that the Chinese character is a middle character and an end character of a named entity; "O" indicates that the Chinese character is not in a named entity.
A special classification mark [CLS] is embedded at the beginning of each sentence and a [SEP] mark is inserted at the end of each sentence; at the same time, the contextual sentence relation of the railway text sequence is encoded: if a word-order relation exists with the next sentence, the beginning of that next sentence is coded as 1, otherwise as 0. This coding is the word-order mark described in the embodiment of the invention.
Part of the text information in the sample railway text vector information described in the embodiment of the invention is masked, and the masked text content is predicted from the context of the other, unmasked text; the Masked LM task is finished when the loss function of the predicted values converges stably.
Sample random data information without word-order labels, randomly composed from the corpus, is acquired, and next-sentence prediction training is performed on it; the Next Sentence Prediction task is finished when the next-sentence prediction loss converges stably.
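The next-sentence training pairs can be sketched as follows (a hypothetical helper; the 50/50 split between true and random successors follows the published BERT recipe, which the patent does not spell out):

```python
import random

def make_nsp_pairs(sentences, seed=0):
    # For each sentence A, pair it either with its true successor
    # (label 1) or with a randomly drawn sentence (label 0).
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))
        else:
            pairs.append((sentences[i], rng.choice(sentences), 0))
    return pairs
```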
Character statistics are performed on the sample preprocessed railway text data information, characters are coded in descending order of their counts, the text is represented by these character codes, and rows that are too short are padded with 0 to obtain the sample preprocessed railway text coding information. A list, indexed by position, indicates which characters are to be randomly masked and can be initialized to 1, and the defined BIO entity set together with [CLS] and [SEP] is encoded. The text is then tag-coded according to the named-entity codes, with lists of insufficient length again padded with 0.
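The frequency-ordered character coding and zero padding can be sketched as follows (function names are illustrative):

```python
from collections import Counter

def build_vocab(samples):
    # Assign codes in descending order of character frequency; ids start
    # at 1 so that 0 remains free as the padding code.
    counts = Counter(ch for s in samples for ch in s)
    return {ch: i + 1 for i, (ch, _) in enumerate(counts.most_common())}

def encode(sample, vocab, max_len):
    # Represent the text by its character codes, padding short rows with 0.
    ids = [vocab[ch] for ch in sample]
    return ids + [0] * (max_len - len(ids))

vocab = build_vocab(["道岔冻结", "道岔故障"])
print(encode("道岔冻结", vocab, 6))
```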
The embodiment of the invention obtains the preset BERT model by training the BERT model, thereby realizing the generation of railway text vector information and facilitating the subsequent steps.
On the basis of the above embodiment, before the step of inputting the railway text vector information into a preset BiLSTM-CRF model to obtain the railway text named-entity recognition result information, the method further includes:
acquiring sample railway text vector information with entity marks and word-order marks, training a BiLSTM model on it, outputting sample railway text context information with word-order labels, and obtaining the trained BiLSTM model when a first preset training condition is met;
and training the CRF model on the sample railway text context information with word-order labels, obtaining the trained CRF model when a second preset training condition is met, and obtaining the preset BiLSTM-CRF model from the trained BiLSTM model and the trained CRF model.
Specifically, fig. 2 is a schematic structural diagram of the BiLSTM neural network described in an embodiment of the present invention. As shown in fig. 2, the BiLSTM is a bidirectional LSTM neural network that learns both the forward hidden-layer features $\overrightarrow{h_t}$ and the backward hidden-layer features $\overleftarrow{h_t}$.
The input of a neuron depends not only on the current input $x_t$ but also on the output $h_{t-1}$ of the previous hidden layer; the LSTM neuron input is $X_t = [h_{t-1}, x_t]$. Each layer of the neural network uses the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ to normalize outputs to the interval $(0, 1)$. An LSTM memory neuron comprises an input gate, an output gate, and a forget gate; these three gate structures control which learned information is kept or discarded. The three gates are computed as follows:
The input gate formula at time t: $i_t = \sigma(W_i \cdot X_t + b_i)$
The output gate formula at time t: $o_t = \sigma(W_o \cdot X_t + b_o)$
The forget gate formula at time t: $f_t = \sigma(W_f \cdot X_t + b_f)$
The candidate neuron uses the hyperbolic tangent function to normalize its output to $[-1, 1]$; the candidate neuron formula is $\tilde{C}_t = \tanh(W_c \cdot X_t + b_c)$.
The learning information formula of the whole BiLSTM neuron is: $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$
The output of the hidden layer at time t is: $h_t = o_t * \tanh(C_t)$.
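The gate equations above can be checked with a small NumPy sketch (weight shapes and names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # One LSTM time step; W = (W_i, W_o, W_f, W_c), b = (b_i, b_o, b_f, b_c).
    X = np.concatenate([h_prev, x_t])      # X_t = [h_{t-1}, x_t]
    W_i, W_o, W_f, W_c = W
    b_i, b_o, b_f, b_c = b
    i_t = sigmoid(W_i @ X + b_i)           # input gate
    o_t = sigmoid(W_o @ X + b_o)           # output gate
    f_t = sigmoid(W_f @ X + b_f)           # forget gate
    c_tilde = np.tanh(W_c @ X + b_c)       # candidate state
    c_t = f_t * c_prev + i_t * c_tilde     # cell state update
    h_t = o_t * np.tanh(c_t)               # hidden-layer output
    return h_t, c_t
```

A bidirectional LSTM runs this step once left-to-right and once right-to-left over the sequence and concatenates the two hidden states at each position.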
The learning result of the neural network is evaluated by a loss function: the smaller the loss, the better the learning effect. The loss function is computed by maximum likelihood estimation, i.e. the negative log-likelihood $L = -\sum_{i=1}^{N} \log p(y_i \mid x_i)$.
In the embodiment of the present invention, the CRF takes the output sequence $X = (x_1, x_2, \ldots, x_n)$ of the BiLSTM and, after training, outputs the label sequence $Y = (y_1, y_2, \ldots, y_n)$. The CRF output is a probability: given the BiLSTM output sequence $x$ as input, the conditional probability of the output $y$ is
$$P_w(y \mid x) = \frac{\exp(w \cdot F(y, x))}{Z_w(x)}$$
where $Z_w(x) = \sum_{y} \exp(w \cdot F(y, x))$ is the normalization term, $F(y, x) = (f_1(y, x), f_2(y, x), \ldots, f_K(y, x))^T$ is the global feature vector, and $w = (w_1, w_2, \ldots, w_K)^T$ is the weight vector.
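The conditional probability above can be verified on toy sizes with a brute-force CRF, where per-position emission scores and label-transition scores stand in for $w \cdot F(y, x)$ (a sketch, not the patent's implementation):

```python
import math
from itertools import product

def crf_log_prob(emit, trans, y):
    # score(y) = sum of emission scores emit[t][y_t] plus transition
    # scores trans[y_{t-1}][y_t]; P(y|x) = exp(score(y)) / Z, with the
    # partition function Z summed over ALL label sequences (toy sizes).
    def score(seq):
        s = sum(emit[t][seq[t]] for t in range(len(seq)))
        s += sum(trans[seq[t - 1]][seq[t]] for t in range(1, len(seq)))
        return s
    n, k = len(emit), len(emit[0])
    Z = sum(math.exp(score(seq)) for seq in product(range(k), repeat=n))
    return score(y) - math.log(Z)
```

In practice the partition function is computed with the forward algorithm and decoding with Viterbi; the brute force above only demonstrates the definition.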
The first preset training condition described in the embodiment of the present invention may be that a preset number of training rounds is satisfied, or that a preset training time is satisfied.
The second preset training condition described in the embodiment of the present invention may be that a preset number of training rounds is met or a preset training time is met.
On the basis of the above embodiment, the step of preprocessing the original railway text data specifically includes:
removing equipment signal information, equipment number information and date information in original railway text data to obtain first railway text data information;
and splitting the first railway text data information into single Chinese characters, and removing repeated Chinese characters to obtain the preprocessed railway text data information.
Fig. 3 is a schematic diagram of a process for identifying named entities in railway texts according to an embodiment of the present invention, where as shown in fig. 3, original railway text data is "frozen turnout due to freezing of ice and snow", and is preprocessed to obtain "[ CLS ] O B-Cau I-Cau [ SEP ] O B-Fau I-Fau [ SEP ]", and then the preprocessed railway text data is input to a BERT model to obtain railway text vector information, and the railway text vector information is input to a preset BiLSTM-CRF model to obtain "OO B-Cau I-Cau O B-Fau I-Fau B-Fau O".
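Decoding the model's BIO tag sequence (such as the Fig. 3 output) back into entity spans can be sketched as (a helper assumed for illustration):

```python
def bio_to_spans(tags):
    # Convert a BIO tag sequence into (entity_type, start, end) spans,
    # where end is exclusive; "B-" opens a span, matching "I-" extends it.
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:
                spans.append((etype, start, i))
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue
        else:
            if start is not None:
                spans.append((etype, start, i))
            start, etype = None, None
    if start is not None:
        spans.append((etype, start, len(tags)))
    return spans

print(bio_to_spans(["O", "B-Cau", "I-Cau", "O", "B-Fau", "I-Fau"]))
# [('Cau', 1, 3), ('Fau', 4, 6)]
```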
Fig. 4 is a schematic structural diagram of a railroad text named entity recognition apparatus according to an embodiment of the present invention, as shown in fig. 4, including: a preprocessing module 410, a vector conversion module 420, and an identification module 430; the preprocessing module 410 is configured to preprocess original railway text data to obtain preprocessed railway text data information; the vector conversion module 420 is configured to input the preprocessed railway text data information into a preset BERT model to obtain railway text vector information; the identification module 430 is configured to input the railway text vector information into a preset BiLSTM-CRF model to obtain railway text named entity identification result information; the preset BERT model is obtained by training sample preprocessing railway text data information with entity marks, and the BiLSTM-CRF model is obtained by training sample railway text vector information with entity marks and word order marks.
The apparatus provided in the embodiment of the present invention is used for executing the above method embodiments, and for details of the process and the details, reference is made to the above embodiments, which are not described herein again.
According to the embodiment of the invention, the semantic representation of the keywords of fault-text named entities is enhanced with the preset BERT model, the context of fault-text named entities is predicted with the preset BiLSTM-CRF model, the ordering relation of the prediction information is adjusted, feature extraction for railway-domain text is realized, the railway text named-entity recognition result information is obtained, and labor cost is reduced.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 5, the electronic device may include: a processor (processor) 510, a communication interface (Communications Interface) 520, a memory (memory) 530 and a communication bus 540, wherein the processor 510, the communication interface 520 and the memory 530 communicate with each other via the communication bus 540. The processor 510 may call logic instructions in the memory 530 to perform the following method: preprocessing original railway text data to obtain preprocessed railway text data information; inputting the preprocessed railway text data information into a preset BERT model to obtain railway text vector information; inputting the railway text vector information into a preset BiLSTM-CRF model to obtain the railway text named-entity recognition result information; the preset BERT model is obtained by training on sample preprocessed railway text data information with entity marks, and the preset BiLSTM-CRF model is obtained by training on sample railway text vector information with entity marks and word-order marks.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
An embodiment of the present invention discloses a computer program product, which includes a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions which, when executed by a computer, enable the computer to execute the methods provided by the above method embodiments, for example, a method including: preprocessing original railway text data to obtain preprocessed railway text data information; inputting the preprocessed railway text data information into a preset BERT model to obtain railway text vector information; inputting the railway text vector information into a preset BiLSTM-CRF model to obtain the railway text named-entity recognition result information; the preset BERT model is obtained by training on sample preprocessed railway text data information with entity marks, and the preset BiLSTM-CRF model is obtained by training on sample railway text vector information with entity marks and word-order marks.
Embodiments of the present invention provide a non-transitory computer-readable storage medium storing server instructions that cause a computer to execute the method provided in the foregoing embodiments, for example: preprocessing original railway text data to obtain preprocessed railway text data information; inputting the preprocessed railway text data information into a preset BERT model to obtain railway text vector information; and inputting the railway text vector information into a preset BiLSTM-CRF model to obtain railway text named-entity recognition result information; wherein the preset BERT model is obtained by training on sample preprocessed railway text data information with entity labels, and the preset BiLSTM-CRF model is obtained by training on sample railway text vector information with entity labels and word-order labels.
The above-described apparatus embodiments are merely illustrative; the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement the embodiments without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, or by hardware. Based on this understanding, the above technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as a ROM/RAM, magnetic disk, or optical disk, and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments or in parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A railway text named entity recognition method is characterized by comprising the following steps:
preprocessing original railway text data to obtain preprocessed railway text data information;
inputting the preprocessed railway text data information into a preset BERT model to obtain railway text vector information;
inputting the railway text vector information into a preset BiLSTM-CRF model to obtain railway text named-entity recognition result information;
wherein the preset BERT model is obtained by training on sample preprocessed railway text data information with entity labels, and the preset BiLSTM-CRF model is obtained by training on sample railway text vector information with entity labels and word-order labels.
2. The method of claim 1, wherein prior to the step of inputting the preprocessed railway text data information into a preset BERT model to obtain railway text vector information, the method further comprises:
obtaining sample preprocessed railway text data information, carrying out named-entity labeling on the sample preprocessed railway text data information with BIO tags, and inserting a [CLS] tag and a [SEP] tag at the head and tail, respectively, of each sentence of the sample preprocessed railway text data information, to obtain sample preprocessed railway text vector information with entity labels;
encoding the word-order information of the sample preprocessed railway text vector information with entity labels to obtain sample railway text vector information with entity labels and word-order labels;
and training the BERT model according to the sample railway text vector information with entity labels and word-order labels, the preset BERT model being obtained when the loss function of the BERT model converges stably.
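For illustration (the entity type name `STA` and the `(start, end, type)` span format are assumptions, not taken from the patent), the BIO labeling and [CLS]/[SEP] insertion of claim 2 can be sketched as:

```python
from typing import List, Tuple

def bio_tag(chars: List[str], entities: List[Tuple[int, int, str]]) -> List[str]:
    """BIO labels over a character sequence; entities are (start, end, type) spans."""
    labels = ["O"] * len(chars)
    for start, end, etype in entities:
        labels[start] = "B-" + etype          # B- marks the entity's first character
        for i in range(start + 1, end):
            labels[i] = "I-" + etype          # I- marks the continuation characters
    return labels

def add_special_tokens(chars: List[str], labels: List[str]):
    """Insert [CLS] at the sentence head and [SEP] at the sentence tail, as in claim 2."""
    return ["[CLS]"] + chars + ["[SEP]"], ["O"] + labels + ["O"]
```

For example, `bio_tag(list("北京南站故障"), [(0, 4, "STA")])` yields `["B-STA", "I-STA", "I-STA", "I-STA", "O", "O"]`.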
3. The method according to claim 1, wherein before the step of inputting the railway text vector information into a preset BiLSTM-CRF model to obtain railway text named-entity recognition result information, the method further comprises:
acquiring sample railway text vector information with entity labels and word-order labels, training a BiLSTM model according to the sample railway text vector information with entity labels and word-order labels, outputting sample railway text context information with word-order labels, and obtaining the trained BiLSTM model when a first preset training condition is met;
and training the CRF model according to the sample railway text context information with word-order labels, obtaining a trained CRF model when a second preset training condition is met, and obtaining the preset BiLSTM-CRF model from the trained BiLSTM model and the trained CRF model.
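As a sketch of how the CRF layer of claim 3 turns per-character BiLSTM scores into a single best label sequence (the tag set and score values below are invented for illustration), Viterbi decoding can be written as:

```python
from typing import List

def viterbi_decode(emissions: List[List[float]],
                   transitions: List[List[float]],
                   tags: List[str]) -> List[str]:
    """emissions[t][k]: BiLSTM score for tag k at position t;
    transitions[i][j]: CRF score for moving from tag i to tag j."""
    k = len(tags)
    score = list(emissions[0])          # best score ending in each tag so far
    back = []                           # backpointers for path recovery
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(k):
            best_i = max(range(k), key=lambda i: score[i] + transitions[i][j])
            ptr.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
        back.append(ptr)
        score = new_score
    best = max(range(k), key=lambda j: score[j])
    path = [best]
    for ptr in reversed(back):          # follow backpointers from the end
        path.append(ptr[path[-1]])
    path.reverse()
    return [tags[i] for i in path]
```

With zero transition scores the decoder reduces to a per-position argmax; a trained CRF's transition matrix is what penalizes illegal sequences such as `O` followed by `I-`.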
4. The railway text named entity recognition method according to claim 2, wherein the step of training the BERT model according to the sample railway text vector information with entity labels and word-order labels and obtaining the preset BERT model when the BERT model loss function converges stably specifically comprises:
masking part of the characters in the sample railway text vectors to obtain masked character information;
predicting the masked characters from the context of the unmasked characters;
obtaining sample random vector information without word-order labels, and performing next-sentence prediction training according to the sample railway text vector information with word-order labels and the sample random vector information without word-order labels;
and obtaining the preset BERT model when the masked-character prediction loss function converges stably and the next-sentence prediction loss function converges stably.
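The masked-character step of claim 4 follows the masked-language-model recipe popularized by BERT; a minimal sketch (the 15% masking rate is the original BERT default and an assumption here, as is the fixed seed):

```python
import random
from typing import List, Set, Tuple

def mask_characters(chars: List[str], mask_rate: float = 0.15,
                    seed: int = 0) -> Tuple[List[str], Set[int]]:
    """Replace a random fraction of characters with [MASK]; a BERT-style model
    is then trained to predict the originals from the surrounding context."""
    rng = random.Random(seed)
    n = max(1, int(len(chars) * mask_rate))      # always mask at least one character
    positions = set(rng.sample(range(len(chars)), n))
    masked = ["[MASK]" if i in positions else c for i, c in enumerate(chars)]
    return masked, positions
```

The returned `positions` set is what the training loss is computed over: only the masked slots contribute to the masked-character prediction loss of claim 4.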
5. The railway text named entity recognition method according to claim 1, wherein the step of preprocessing the original railway text data specifically comprises:
removing equipment signal information, equipment number information and date information in original railway text data to obtain first railway text data information;
and splitting the first railway text data information into single Chinese characters, and removing repeated Chinese characters to obtain the preprocessed railway text data information.
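A minimal sketch of claim 5's preprocessing, assuming hypothetical regular expressions for the date and equipment-number strings (the patent does not specify their exact form):

```python
import re
from typing import List

DATE = re.compile(r"\d{4}年\d{1,2}月\d{1,2}日")   # assumed date format
DEVICE_NO = re.compile(r"[A-Z]+-?\d+")            # assumed equipment-number shape

def preprocess(text: str) -> List[str]:
    """Remove date/equipment-number strings, split into single characters,
    and drop repeated characters (keeping first occurrences), per claim 5."""
    text = DATE.sub("", text)
    text = DEVICE_NO.sub("", text)
    seen, unique = set(), []
    for ch in text:
        if not ch.isspace() and ch not in seen:
            seen.add(ch)
            unique.append(ch)
    return unique
```

For example, `preprocess("2019年12月24日 设备K-103故障故障")` returns `["设", "备", "故", "障"]`: the date and equipment number are stripped, and the repeated 故障 is kept only once.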
6. The railway text named entity recognition method according to claim 3, wherein the BiLSTM model comprises a forward LSTM model and a backward LSTM model.
7. A railway text named entity recognition device, comprising:
the preprocessing module is used for preprocessing original railway text data to obtain preprocessed railway text data information;
the vector conversion module is used for inputting the preprocessed railway text data information into a preset BERT model to obtain railway text vector information;
the recognition module is used for inputting the railway text vector information into a preset BiLSTM-CRF model to obtain railway text named-entity recognition result information;
wherein the preset BERT model is obtained by training on sample preprocessed railway text data information with entity labels, and the preset BiLSTM-CRF model is obtained by training on sample railway text vector information with entity labels and word-order labels.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the steps of the railway text named entity recognition method according to any one of claims 1 to 6.
9. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the railway text named entity recognition method according to any one of claims 1 to 6.
CN201911350774.6A 2019-12-24 2019-12-24 Railway text named entity recognition method and device Pending CN111191452A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911350774.6A CN111191452A (en) 2019-12-24 2019-12-24 Railway text named entity recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911350774.6A CN111191452A (en) 2019-12-24 2019-12-24 Railway text named entity recognition method and device

Publications (1)

Publication Number Publication Date
CN111191452A true CN111191452A (en) 2020-05-22

Family

ID=70707613

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911350774.6A Pending CN111191452A (en) 2019-12-24 2019-12-24 Railway text named entity recognition method and device

Country Status (1)

Country Link
CN (1) CN111191452A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN110083831A (en) * 2019-04-16 2019-08-02 武汉大学 A kind of Chinese name entity recognition method based on BERT-BiGRU-CRF
CN110297913A (en) * 2019-06-12 2019-10-01 中电科大数据研究院有限公司 A kind of electronic government documents entity abstracting method
CN110516256A (en) * 2019-08-30 2019-11-29 的卢技术有限公司 A kind of Chinese name entity extraction method and its system

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859916A (en) * 2020-07-28 2020-10-30 中国平安人寿保险股份有限公司 Ancient poetry keyword extraction and poetry sentence generation method, device, equipment and medium
CN111859916B (en) * 2020-07-28 2023-07-21 中国平安人寿保险股份有限公司 Method, device, equipment and medium for extracting key words of ancient poems and generating poems
CN112084783A (en) * 2020-09-24 2020-12-15 中国民航大学 Entity identification method and system based on civil aviation non-civilized passengers
CN112084783B (en) * 2020-09-24 2022-04-12 中国民航大学 Entity identification method and system based on civil aviation non-civilized passengers
CN113032582A (en) * 2021-04-20 2021-06-25 杭州叙简科技股份有限公司 Knowledge graph based entity unified model establishment and entity unified method
CN115221882A (en) * 2022-07-28 2022-10-21 平安科技(深圳)有限公司 Named entity identification method, device, equipment and medium
CN115221882B (en) * 2022-07-28 2023-06-20 平安科技(深圳)有限公司 Named entity identification method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN107608956B (en) Reader emotion distribution prediction algorithm based on CNN-GRNN
CN111191452A (en) Railway text named entity recognition method and device
CN110232114A (en) Sentence intension recognizing method, device and computer readable storage medium
CN111694924A (en) Event extraction method and system
CN107797987B (en) Bi-LSTM-CNN-based mixed corpus named entity identification method
CN111339305B (en) Text classification method and device, electronic equipment and storage medium
CN111177382B (en) Intelligent legal system recommendation auxiliary system based on FastText algorithm
CN107977353A (en) A kind of mixing language material name entity recognition method based on LSTM-CNN
CN114282527A (en) Multi-language text detection and correction method, system, electronic device and storage medium
CN113392209A (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN112966525B (en) Law field event extraction method based on pre-training model and convolutional neural network algorithm
CN112560486A (en) Power entity identification method based on multilayer neural network, storage medium and equipment
CN112084336A (en) Entity extraction and event classification method and device for expressway emergency
CN113157916A (en) Civil aviation emergency extraction method based on deep learning
CN111753058A (en) Text viewpoint mining method and system
CN113505200A (en) Sentence-level Chinese event detection method combining document key information
CN107797988A (en) A kind of mixing language material name entity recognition method based on Bi LSTM
CN114239574A (en) Miner violation knowledge extraction method based on entity and relationship joint learning
CN114529903A (en) Text refinement network
CN115292568B (en) Civil news event extraction method based on joint model
CN113987183A (en) Power grid fault handling plan auxiliary decision-making method based on data driving
CN107992468A (en) A kind of mixing language material name entity recognition method based on LSTM
CN114417785A (en) Knowledge point annotation method, model training method, computer device, and storage medium
CN113051904A (en) Link prediction method for small-scale knowledge graph
CN110795531B (en) Intention identification method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination