CN111191452A - Railway text named entity recognition method and device - Google Patents


Info

Publication number
CN111191452A
Authority
CN
China
Prior art keywords
railway
text
information
sample
preset
Prior art date
Legal status
Pending
Application number
CN201911350774.6A
Other languages
Chinese (zh)
Inventor
杨连报
王同军
李新琴
董兴芝
薛蕊
李平
朱建生
马小宁
马志强
刘军
吴艳华
邹丹
王喆
代明睿
张晓栋
程智博
赵冰
Current Assignee
China Academy of Railway Sciences Corp Ltd CARS
Institute of Computing Technologies of CARS
Beijing Jingwei Information Technology Co Ltd
Original Assignee
China Academy of Railway Sciences Corp Ltd CARS
Institute of Computing Technologies of CARS
Beijing Jingwei Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Academy of Railway Sciences Corp Ltd CARS, Institute of Computing Technologies of CARS, Beijing Jingwei Information Technology Co Ltd filed Critical China Academy of Railway Sciences Corp Ltd CARS
Priority application: CN201911350774.6A
Publication: CN111191452A
Legal status: Pending


Classifications

    • G06F16/35 — Information retrieval of unstructured textual data: clustering; classification
    • G06F18/241 — Pattern recognition: classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/044 — Neural network architectures: recurrent networks, e.g. Hopfield networks
    • G06N3/045 — Neural network architectures: combinations of networks
    • G06N3/048 — Neural network architectures: activation functions
    • G06N3/08 — Neural networks: learning methods

Abstract

The embodiment of the invention provides a method and a device for recognizing named entities in railway text. The method comprises: preprocessing original railway text data to obtain preprocessed railway text data information; inputting the preprocessed railway text data information into a preset BERT model to obtain railway text vector information; and inputting the railway text vector information into a preset BiLSTM-CRF model to obtain the railway text named-entity recognition result information. The preset BERT model learns the textual context of the railway text feature vectors and produces vector representations of railway accident and fault texts. The preset BERT model enhances the vector and semantic representation of the keywords of named entities in fault texts, while the BiLSTM-CRF model computes over and recognizes the fault-text vectors, yielding the railway text named-entity recognition result information.

Description

Railway text named entity recognition method and device
Technical Field
The invention relates to the technical field of information processing, in particular to a method and a device for identifying named entities of railway texts.
Background
Railways are characterized by high technical complexity, high running speeds, large passenger volumes, short train headways, difficult rescue, and strict safety requirements, and local faults can easily trigger chain reactions and amplification effects. This places new, higher demands on safety early warning and on the rapid handling of sudden faults during railway operation.
Entity names in the railway safety field have complex structures, numerous abbreviations, and highly specialized terminology, and safety text data are mostly stored as Word or Excel files or archived on paper. Constrained by traditional techniques, it is difficult to efficiently extract the effective information in these data from the original databases. Extracting the potential effective information in railway safety text data would greatly increase the utilization value of railway safety data and further support decision-making for intelligent railway operation and maintenance.
Therefore, how to effectively implement named entity identification in railway security texts has become an urgent problem to be solved in the industry.
Disclosure of Invention
Embodiments of the present invention provide a method and an apparatus for recognizing named entities in railway text, so as to solve, or at least partially solve, the technical problems mentioned in the background art above.
In a first aspect, an embodiment of the present invention provides a method for identifying a named entity in a railway text, including:
preprocessing original railway text data to obtain preprocessed railway text data information;
inputting the preprocessed railway text data information into a preset BERT model to obtain railway text vector information;
inputting the railway text vector information into a preset BiLSTM-CRF model to obtain the railway text named-entity recognition result information;
the preset BERT model is obtained by training on sample preprocessed railway text data information with entity marks, and the preset BiLSTM-CRF model is obtained by training on sample railway text vector information with entity marks and word-order marks.
More specifically, before the step of inputting the preprocessed railway text data information into a preset BERT model to obtain the railway text vector information, the method further includes:
obtaining sample preprocessed railway text data information, carrying out named entity marking on the sample preprocessed railway text data information through BIO marks, and respectively inserting CLS marks and SEP marks into sentence heads and sentence tails of the sample preprocessed railway text data information to obtain sample preprocessed railway text vector information with entity marks;
coding the word-order information of the sample preprocessed railway text vector information with entity marks to obtain sample railway text vector information with entity marks and word-order marks;
and training the BERT model on the sample railway text vector information with entity marks and word-order marks, obtaining the preset BERT model when the loss function of the BERT model converges stably.
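As a minimal sketch of the markup step above (the token strings follow the patent; the specific tag names are taken from the Fig. 3 example and are otherwise illustrative):

```python
def add_special_tokens(chars, tags):
    # Insert a [CLS] mark at the sentence head and a [SEP] mark at the
    # sentence tail; the special positions receive the "O" (outside) tag.
    return ["[CLS]"] + chars + ["[SEP]"], ["O"] + tags + ["O"]

# BIO marking: "B-" opens a named entity, "I-" continues it, "O" is outside.
chars = list("道岔冻结")                      # "turnout frozen", as in Fig. 3
tags = ["B-Fau", "I-Fau", "I-Fau", "I-Fau"]  # a fault-type entity
tokens, labels = add_special_tokens(chars, tags)
print(tokens)  # ['[CLS]', '道', '岔', '冻', '结', '[SEP]']
```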
More specifically, before the step of inputting the railway text vector information into a preset BiLSTM-CRF model to obtain the railway text named-entity recognition result information, the method further comprises:
acquiring sample railway text vector information with entity marks and word-order marks, training a BiLSTM model on this sample railway text vector information, outputting sample railway text context information with word-order labels, and obtaining the trained BiLSTM model when a first preset training condition is met;
and training the CRF model on the sample railway text context information with word-order labels, obtaining the trained CRF model when a second preset training condition is met, and obtaining the preset BiLSTM-CRF model from the trained BiLSTM model and the trained CRF model.
More specifically, the step of training the BERT model on the sample railway text vector information with entity marks and word-order marks and obtaining the preset BERT model when the BERT model loss function converges stably includes:
masking some characters in the sample railway text vector, leaving the remaining characters unmasked;
predicting the masked characters from the contextual meaning of the unmasked characters;
obtaining sample random vector information without word-order labels, and performing next-sentence prediction training on this sample random vector information;
and obtaining the preset BERT model when the loss function of the masked-character prediction converges stably and the next-sentence prediction loss also converges stably.
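The masked-character construction above can be sketched as follows (a hypothetical helper; the 15% default ratio is the published BERT recipe, which the patent does not state):

```python
import random

def mask_tokens(tokens, mask_ratio=0.15, seed=0):
    # Sketch of the Masked LM input: a fraction of ordinary characters
    # is replaced by [MASK]; the model must recover the originals from
    # the unmasked context. Special marks are never masked.
    rng = random.Random(seed)
    masked, targets = [], []
    for t in tokens:
        if t not in ("[CLS]", "[SEP]") and rng.random() < mask_ratio:
            masked.append("[MASK]")
            targets.append(t)      # character the model must predict
        else:
            masked.append(t)
            targets.append(None)   # position not predicted
    return masked, targets
```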
More specifically, the step of preprocessing the original railway text data specifically includes:
removing equipment signal information, equipment number information and date information in original railway text data to obtain first railway text data information;
and splitting the first railway text data information into single Chinese characters, and removing repeated Chinese characters to obtain the preprocessed railway text data information.
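A minimal preprocessing sketch following the two steps above (the regular expressions for dates and equipment numbers are illustrative assumptions; the patent does not publish its exact patterns):

```python
import re

def preprocess(text):
    # Assumed patterns: the patent removes equipment signal/number and
    # date information but does not give the exact rules.
    text = re.sub(r"\d{4}年\d{1,2}月\d{1,2}日", "", text)  # dates such as 2019年12月25日
    text = re.sub(r"[A-Za-z0-9#\-]+", "", text)            # device models/numbers, e.g. CTC-3
    text = re.sub(r"\s+", "", text)                        # stray whitespace
    seen, chars = set(), []
    for ch in text:          # split into single Chinese characters
        if ch not in seen:   # and drop repeated characters
            seen.add(ch)
            chars.append(ch)
    return chars

print(preprocess("2019年12月25日 CTC-3型道岔冰雪冻结"))  # ['型', '道', '岔', '冰', '雪', '冻', '结']
```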
More specifically, the BiLSTM model includes a forward LSTM model and a backward LSTM model.
In a second aspect, an embodiment of the present invention provides a device for identifying a named entity in a railway text, including:
the preprocessing module is used for preprocessing original railway text data to obtain preprocessed railway text data information;
the vector conversion module is used for inputting the preprocessed railway text data information into a preset BERT model to obtain railway text vector information;
the recognition module is used for inputting the railway text vector information into a preset BiLSTM-CRF model to obtain the railway text named-entity recognition result information;
the preset BERT model is obtained by training on sample preprocessed railway text data information with entity marks, and the preset BiLSTM-CRF model is obtained by training on sample railway text vector information with entity marks and word-order marks.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the railway text named entity recognition method according to the first aspect when executing the program.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements the steps of the railway text named entity recognition method according to the first aspect.
According to the method and the device for recognizing named entities in railway text, the vector and semantic representation of named-entity keywords in fault texts is enhanced with the preset BERT model, the computation over and recognition of fault-text vectors is realized with the preset BiLSTM-CRF model, the ordering relation of the prediction information is adjusted, feature extraction for railway-domain text is realized, and labor cost is reduced in obtaining the railway text named-entity recognition result information.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a method for identifying named entities in railroad texts according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a BiLSTM neural network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a process for identifying named entities in railroad text according to an embodiment of the invention;
FIG. 4 is a schematic structural diagram of a railroad text named entity recognition apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a method for identifying a named entity in a railway text described in an embodiment of the present invention, as shown in fig. 1, including:
step S1, preprocessing the original railway text data to obtain preprocessed railway text data information;
step S2, inputting the preprocessed railway text data information into a preset BERT model to obtain railway text vector information;
step S3, inputting the railway text vector information into a preset BiLSTM-CRF model to obtain the railway text named-entity recognition result information;
the preset BERT model is obtained by training on sample preprocessed railway text data information with entity marks, and the preset BiLSTM-CRF model is obtained by training on sample railway text vector information with entity marks and word-order marks.
Specifically, the original railway text data described in the embodiment of the present invention refers to text data in a railway safety text.
The preprocessing described in the embodiment of the invention removes incomplete and irregular data from the original railway text data: information such as railway equipment models, equipment numbers, and dates is removed, common stop words and punctuation marks are stripped, the railway text is split into a sequence of single Chinese characters, and repeated Chinese characters are removed.
The preset BERT model described in the embodiment of the invention is used for processing the preprocessed railway text data information to form the railway text vector information with named entities.
In the embodiment of the invention, the preset BERT model is mainly obtained by jointly training on the Masked LM task and the Next Sentence Prediction task.
The preset BiLSTM-CRF model described in the embodiment of the invention is used to predict the context information of the railway text vector information with named entities, adjust the ordering relation of the prediction information, and automatically extract target entities from documents of daily railway operation such as safety inspection lists and fault reports.
The preset BiLSTM-CRF model described in the embodiment of the invention is composed of a trained BiLSTM model and a trained CRF model. The BiLSTM described here is a bidirectional LSTM neural network comprising a forward and a backward LSTM model; it learns both the forward hidden-layer features $\overrightarrow{h_t}$ and the backward hidden-layer features $\overleftarrow{h_t}$.
The preset BiLSTM-CRF model described in the embodiment of the invention is obtained by inputting railway text samples with entity marks and word-order marks into the BiLSTM model for training, outputting sample railway text context information with word-order labels, and obtaining the trained BiLSTM model when a first preset training condition is met;
and training the CRF model on the sample railway text context information with word-order labels, obtaining the trained CRF model when a second preset training condition is met, and obtaining the preset BiLSTM-CRF model from the trained BiLSTM model and the trained CRF model.
According to the embodiment of the invention, the vector and semantic representation of the keywords of railway text named entities is enhanced with the preset BERT model, the context of railway text named entities is predicted with the preset BiLSTM-CRF model, feature extraction for railway-domain text is completed by adjusting the ordering relation of the prediction information, and the railway text named-entity recognition result information is finally obtained while reducing labor cost.
Based on the above embodiment, before the step of inputting the preprocessed railway text data information into the preset BERT model to obtain the railway text vector information, the method further includes:
obtaining sample preprocessed railway text data information, marking named entities in it with BIO marks, and inserting CLS and SEP marks at the sentence head and sentence tail respectively to obtain sample preprocessed railway text vector information with entity marks;
coding the word order of the sample preprocessed railway text vector information with entity marks to obtain sample railway text vector information with entity marks and word-order marks;
and training the BERT model on the sample railway text vector information with entity marks and word-order marks, finishing model training when the loss function of the BERT model converges stably, to obtain the preset BERT model.
Specifically, the BIO flag described in the embodiment of the present invention means that "B" indicates that the chinese character is a starting character of a named entity; "I" indicates that the Chinese character is a middle character and an end character of a named entity; "O" indicates that the Chinese character is not in a named entity.
A special classification mark [CLS] is embedded at the beginning of each sentence and a [SEP] mark is inserted at the end of each sentence; at the same time, the contextual sentence relation of the railway text sequence is encoded: if a word-order relation exists with the next sentence, the beginning of that next sentence is coded as 1, otherwise as 0. This coding is the word-order mark described in the embodiment of the invention.
Part of the text information in the sample railway text vector information described in the embodiment of the invention is masked, and the masked text content is predicted from the context of the other, unmasked text; the Masked LM task is finished when the loss function of the predicted values converges stably.
Sample random data information without word-order labels, randomly composed from the corpus, is acquired, and next-sentence prediction training is performed on it; the Next Sentence Prediction task is finished when the next-sentence prediction loss converges stably.
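The next-sentence training pairs can be sketched as follows (a hypothetical helper; the 50/50 split between true and random successors follows the published BERT recipe, which the patent does not spell out):

```python
import random

def make_nsp_pairs(sentences, seed=0):
    # For each sentence A, pair it either with its true successor
    # (label 1) or with a randomly drawn sentence (label 0).
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))
        else:
            pairs.append((sentences[i], rng.choice(sentences), 0))
    return pairs
```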
Character statistics are performed on the sample preprocessed railway text data information, characters are coded in descending order of their counts, the text is represented by these character codes, and rows that are too short are padded with 0 to obtain the sample preprocessed railway text coding information. A list, indexed by position, indicates which characters are to be randomly masked and can be initialized to 1, and the defined BIO entity set together with [CLS] and [SEP] is encoded. The text is then tag-coded according to the named-entity codes, with lists of insufficient length again padded with 0.
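The frequency-ordered character coding and zero padding can be sketched as follows (function names are illustrative):

```python
from collections import Counter

def build_vocab(samples):
    # Assign codes in descending order of character frequency; ids start
    # at 1 so that 0 remains free as the padding code.
    counts = Counter(ch for s in samples for ch in s)
    return {ch: i + 1 for i, (ch, _) in enumerate(counts.most_common())}

def encode(sample, vocab, max_len):
    # Represent the text by its character codes, padding short rows with 0.
    ids = [vocab[ch] for ch in sample]
    return ids + [0] * (max_len - len(ids))

vocab = build_vocab(["道岔冻结", "道岔故障"])
print(encode("道岔冻结", vocab, 6))
```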
The embodiment of the invention obtains the preset BERT model by training the BERT model, thereby realizing the generation of railway text vector information and facilitating the subsequent steps.
On the basis of the above embodiment, before the step of inputting the railway text vector information into a preset BiLSTM-CRF model to obtain the railway text named-entity recognition result information, the method further includes:
acquiring sample railway text vector information with entity marks and word-order marks, training a BiLSTM model on it, outputting sample railway text context information with word-order labels, and obtaining the trained BiLSTM model when a first preset training condition is met;
and training the CRF model on the sample railway text context information with word-order labels, obtaining the trained CRF model when a second preset training condition is met, and obtaining the preset BiLSTM-CRF model from the trained BiLSTM model and the trained CRF model.
Specifically, fig. 2 is a schematic structural diagram of the BiLSTM neural network described in an embodiment of the present invention. As shown in fig. 2, the BiLSTM is a bidirectional LSTM neural network that learns both the forward hidden-layer features $\overrightarrow{h_t}$ and the backward hidden-layer features $\overleftarrow{h_t}$.
The input of a neuron depends not only on the current input $x_t$ but also on the output $h_{t-1}$ of the previous hidden layer; the LSTM neuron input is $X_t = [h_{t-1}, x_t]$. Each layer of the neural network uses the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ to normalize outputs to the interval $(0, 1)$. An LSTM memory neuron comprises an input gate, an output gate, and a forget gate; these three gate structures control which learned information is kept or discarded. The three gates are computed as follows:
The input gate formula at time t: $i_t = \sigma(W_i \cdot X_t + b_i)$
The output gate formula at time t: $o_t = \sigma(W_o \cdot X_t + b_o)$
The forget gate formula at time t: $f_t = \sigma(W_f \cdot X_t + b_f)$
The candidate neuron uses the hyperbolic tangent function to normalize its output to $[-1, 1]$; the candidate neuron formula is $\tilde{C}_t = \tanh(W_c \cdot X_t + b_c)$.
The learning information formula of the whole BiLSTM neuron is: $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$
The output of the hidden layer at time t is: $h_t = o_t * \tanh(C_t)$.
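The gate equations above can be checked with a small NumPy sketch (weight shapes and names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # One LSTM time step; W = (W_i, W_o, W_f, W_c), b = (b_i, b_o, b_f, b_c).
    X = np.concatenate([h_prev, x_t])      # X_t = [h_{t-1}, x_t]
    W_i, W_o, W_f, W_c = W
    b_i, b_o, b_f, b_c = b
    i_t = sigmoid(W_i @ X + b_i)           # input gate
    o_t = sigmoid(W_o @ X + b_o)           # output gate
    f_t = sigmoid(W_f @ X + b_f)           # forget gate
    c_tilde = np.tanh(W_c @ X + b_c)       # candidate state
    c_t = f_t * c_prev + i_t * c_tilde     # cell state update
    h_t = o_t * np.tanh(c_t)               # hidden-layer output
    return h_t, c_t
```

A bidirectional LSTM runs this step once left-to-right and once right-to-left over the sequence and concatenates the two hidden states at each position.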
The learning result of the neural network is evaluated by a loss function: the smaller the loss, the better the learning effect. The loss function is computed by maximum likelihood estimation, i.e. the negative log-likelihood $L = -\sum_{i=1}^{N} \log p(y_i \mid x_i)$.
In the embodiment of the present invention, the CRF takes the output sequence $X = (x_1, x_2, \ldots, x_n)$ of the BiLSTM and, after training, outputs the label sequence $Y = (y_1, y_2, \ldots, y_n)$. The CRF output is a probability: given the BiLSTM output sequence $x$ as input, the conditional probability of the output $y$ is
$$P_w(y \mid x) = \frac{\exp(w \cdot F(y, x))}{Z_w(x)}$$
where $Z_w(x) = \sum_{y} \exp(w \cdot F(y, x))$ is the normalization term, $F(y, x) = (f_1(y, x), f_2(y, x), \ldots, f_K(y, x))^T$ is the global feature vector, and $w = (w_1, w_2, \ldots, w_K)^T$ is the weight vector.
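The conditional probability above can be verified on toy sizes with a brute-force CRF, where per-position emission scores and label-transition scores stand in for $w \cdot F(y, x)$ (a sketch, not the patent's implementation):

```python
import math
from itertools import product

def crf_log_prob(emit, trans, y):
    # score(y) = sum of emission scores emit[t][y_t] plus transition
    # scores trans[y_{t-1}][y_t]; P(y|x) = exp(score(y)) / Z, with the
    # partition function Z summed over ALL label sequences (toy sizes).
    def score(seq):
        s = sum(emit[t][seq[t]] for t in range(len(seq)))
        s += sum(trans[seq[t - 1]][seq[t]] for t in range(1, len(seq)))
        return s
    n, k = len(emit), len(emit[0])
    Z = sum(math.exp(score(seq)) for seq in product(range(k), repeat=n))
    return score(y) - math.log(Z)
```

In practice the partition function is computed with the forward algorithm and decoding with Viterbi; the brute force above only demonstrates the definition.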
The first preset training condition described in the embodiment of the present invention may be that a preset number of training rounds is satisfied, or that a preset training time is satisfied.
The second preset training condition described in the embodiment of the present invention may be that a preset number of training rounds is met or a preset training time is met.
On the basis of the above embodiment, the step of preprocessing the original railway text data specifically includes:
removing equipment signal information, equipment number information and date information in original railway text data to obtain first railway text data information;
and splitting the first railway text data information into single Chinese characters, and removing repeated Chinese characters to obtain the preprocessed railway text data information.
Fig. 3 is a schematic diagram of a process for identifying named entities in railway texts according to an embodiment of the present invention, where as shown in fig. 3, original railway text data is "frozen turnout due to freezing of ice and snow", and is preprocessed to obtain "[ CLS ] O B-Cau I-Cau [ SEP ] O B-Fau I-Fau [ SEP ]", and then the preprocessed railway text data is input to a BERT model to obtain railway text vector information, and the railway text vector information is input to a preset BiLSTM-CRF model to obtain "OO B-Cau I-Cau O B-Fau I-Fau B-Fau O".
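Decoding the model's BIO tag sequence (such as the Fig. 3 output) back into entity spans can be sketched as (a helper assumed for illustration):

```python
def bio_to_spans(tags):
    # Convert a BIO tag sequence into (entity_type, start, end) spans,
    # where end is exclusive; "B-" opens a span, matching "I-" extends it.
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:
                spans.append((etype, start, i))
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue
        else:
            if start is not None:
                spans.append((etype, start, i))
            start, etype = None, None
    if start is not None:
        spans.append((etype, start, len(tags)))
    return spans

print(bio_to_spans(["O", "B-Cau", "I-Cau", "O", "B-Fau", "I-Fau"]))
# [('Cau', 1, 3), ('Fau', 4, 6)]
```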
Fig. 4 is a schematic structural diagram of a railroad text named entity recognition apparatus according to an embodiment of the present invention, as shown in fig. 4, including: a preprocessing module 410, a vector conversion module 420, and an identification module 430; the preprocessing module 410 is configured to preprocess original railway text data to obtain preprocessed railway text data information; the vector conversion module 420 is configured to input the preprocessed railway text data information into a preset BERT model to obtain railway text vector information; the identification module 430 is configured to input the railway text vector information into a preset BiLSTM-CRF model to obtain railway text named entity identification result information; the preset BERT model is obtained by training sample preprocessing railway text data information with entity marks, and the BiLSTM-CRF model is obtained by training sample railway text vector information with entity marks and word order marks.
The apparatus provided in the embodiment of the present invention is used for executing the above method embodiments, and for details of the process and the details, reference is made to the above embodiments, which are not described herein again.
According to the embodiment of the invention, the semantic representation of the keywords of fault-text named entities is enhanced with the preset BERT model, the context of fault-text named entities is predicted with the preset BiLSTM-CRF model, the ordering relation of the prediction information is adjusted, feature extraction for railway-domain text is realized, the railway text named-entity recognition result information is obtained, and labor cost is reduced.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 5, the electronic device may include: a processor (processor) 510, a communication interface (Communications Interface) 520, a memory (memory) 530 and a communication bus 540, wherein the processor 510, the communication interface 520 and the memory 530 communicate with each other via the communication bus 540. The processor 510 may call logic instructions in the memory 530 to perform the following method: preprocessing original railway text data to obtain preprocessed railway text data information; inputting the preprocessed railway text data information into a preset BERT model to obtain railway text vector information; inputting the railway text vector information into a preset BiLSTM-CRF model to obtain the railway text named-entity recognition result information; the preset BERT model is obtained by training on sample preprocessed railway text data information with entity marks, and the preset BiLSTM-CRF model is obtained by training on sample railway text vector information with entity marks and word-order marks.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
An embodiment of the present invention discloses a computer program product, which includes a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions which, when executed by a computer, enable the computer to execute the methods provided by the above method embodiments, for example, a method including: preprocessing original railway text data to obtain preprocessed railway text data information; inputting the preprocessed railway text data information into a preset BERT model to obtain railway text vector information; inputting the railway text vector information into a preset BiLSTM-CRF model to obtain the railway text named-entity recognition result information; the preset BERT model is obtained by training on sample preprocessed railway text data information with entity marks, and the preset BiLSTM-CRF model is obtained by training on sample railway text vector information with entity marks and word-order marks.
Embodiments of the present invention provide a non-transitory computer-readable storage medium storing server instructions that cause a computer to execute the method provided in the foregoing embodiments, for example: preprocessing original railway text data to obtain preprocessed railway text data information; inputting the preprocessed railway text data information into a preset BERT model to obtain railway text vector information; and inputting the railway text vector information into a preset BiLSTM-CRF model to obtain railway text named-entity recognition result information; wherein the preset BERT model is obtained by training on sample preprocessed railway text data information with entity labels, and the preset BiLSTM-CRF model is obtained by training on sample railway text vector information with entity labels and word-order labels.
The above-described apparatus embodiments are merely illustrative; the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement the embodiments without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, or by hardware. Based on this understanding, the above technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as a ROM/RAM, magnetic disk, or optical disk, and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments or in parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A railway text named entity recognition method is characterized by comprising the following steps:
preprocessing original railway text data to obtain preprocessed railway text data information;
inputting the preprocessed railway text data information into a preset BERT model to obtain railway text vector information;
inputting the railway text vector information into a preset BiLSTM-CRF model to obtain railway text named-entity recognition result information;
wherein the preset BERT model is obtained by training on sample preprocessed railway text data information with entity labels, and the preset BiLSTM-CRF model is obtained by training on sample railway text vector information with entity labels and word-order labels.
2. The method of claim 1, wherein prior to the step of inputting the preprocessed railway text data information into a preset BERT model to obtain railway text vector information, the method further comprises:
obtaining sample preprocessed railway text data information, carrying out named-entity labeling on the sample preprocessed railway text data information with BIO tags, and inserting a [CLS] tag and a [SEP] tag at the head and tail, respectively, of each sentence of the sample preprocessed railway text data information, to obtain sample preprocessed railway text vector information with entity labels;
encoding the word-order information of the sample preprocessed railway text vector information with entity labels to obtain sample railway text vector information with entity labels and word-order labels;
and training the BERT model according to the sample railway text vector information with entity labels and word-order labels, the preset BERT model being obtained when the loss function of the BERT model converges stably.
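For illustration (the entity type name `STA` and the `(start, end, type)` span format are assumptions, not taken from the patent), the BIO labeling and [CLS]/[SEP] insertion of claim 2 can be sketched as:

```python
from typing import List, Tuple

def bio_tag(chars: List[str], entities: List[Tuple[int, int, str]]) -> List[str]:
    """BIO labels over a character sequence; entities are (start, end, type) spans."""
    labels = ["O"] * len(chars)
    for start, end, etype in entities:
        labels[start] = "B-" + etype          # B- marks the entity's first character
        for i in range(start + 1, end):
            labels[i] = "I-" + etype          # I- marks the continuation characters
    return labels

def add_special_tokens(chars: List[str], labels: List[str]):
    """Insert [CLS] at the sentence head and [SEP] at the sentence tail, as in claim 2."""
    return ["[CLS]"] + chars + ["[SEP]"], ["O"] + labels + ["O"]
```

For example, `bio_tag(list("北京南站故障"), [(0, 4, "STA")])` yields `["B-STA", "I-STA", "I-STA", "I-STA", "O", "O"]`.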
3. The method according to claim 1, wherein before the step of inputting the railway text vector information into a preset BiLSTM-CRF model to obtain railway text named-entity recognition result information, the method further comprises:
acquiring sample railway text vector information with entity labels and word-order labels, training a BiLSTM model according to the sample railway text vector information with entity labels and word-order labels, outputting sample railway text context information with word-order labels, and obtaining the trained BiLSTM model when a first preset training condition is met;
and training the CRF model according to the sample railway text context information with word-order labels, obtaining a trained CRF model when a second preset training condition is met, and obtaining the preset BiLSTM-CRF model from the trained BiLSTM model and the trained CRF model.
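As a sketch of how the CRF layer of claim 3 turns per-character BiLSTM scores into a single best label sequence (the tag set and score values below are invented for illustration), Viterbi decoding can be written as:

```python
from typing import List

def viterbi_decode(emissions: List[List[float]],
                   transitions: List[List[float]],
                   tags: List[str]) -> List[str]:
    """emissions[t][k]: BiLSTM score for tag k at position t;
    transitions[i][j]: CRF score for moving from tag i to tag j."""
    k = len(tags)
    score = list(emissions[0])          # best score ending in each tag so far
    back = []                           # backpointers for path recovery
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(k):
            best_i = max(range(k), key=lambda i: score[i] + transitions[i][j])
            ptr.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
        back.append(ptr)
        score = new_score
    best = max(range(k), key=lambda j: score[j])
    path = [best]
    for ptr in reversed(back):          # follow backpointers from the end
        path.append(ptr[path[-1]])
    path.reverse()
    return [tags[i] for i in path]
```

With zero transition scores the decoder reduces to a per-position argmax; a trained CRF's transition matrix is what penalizes illegal sequences such as `O` followed by `I-`.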
4. The railway text named entity recognition method according to claim 2, wherein the step of training the BERT model according to the sample railway text vector information with entity labels and word-order labels and obtaining the preset BERT model when the BERT model loss function converges stably specifically comprises:
masking part of the characters in the sample railway text vectors to obtain masked character information;
predicting the masked characters from the context of the unmasked characters;
obtaining sample random vector information without word-order labels, and performing next-sentence prediction training according to the sample railway text vector information with word-order labels and the sample random vector information without word-order labels;
and obtaining the preset BERT model when the masked-character prediction loss function converges stably and the next-sentence prediction loss function converges stably.
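The masked-character step of claim 4 follows the masked-language-model recipe popularized by BERT; a minimal sketch (the 15% masking rate is the original BERT default and an assumption here, as is the fixed seed):

```python
import random
from typing import List, Set, Tuple

def mask_characters(chars: List[str], mask_rate: float = 0.15,
                    seed: int = 0) -> Tuple[List[str], Set[int]]:
    """Replace a random fraction of characters with [MASK]; a BERT-style model
    is then trained to predict the originals from the surrounding context."""
    rng = random.Random(seed)
    n = max(1, int(len(chars) * mask_rate))      # always mask at least one character
    positions = set(rng.sample(range(len(chars)), n))
    masked = ["[MASK]" if i in positions else c for i, c in enumerate(chars)]
    return masked, positions
```

The returned `positions` set is what the training loss is computed over: only the masked slots contribute to the masked-character prediction loss of claim 4.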
5. The railway text named entity recognition method according to claim 1, wherein the step of preprocessing the original railway text data specifically comprises:
removing equipment signal information, equipment number information and date information in original railway text data to obtain first railway text data information;
and splitting the first railway text data information into single Chinese characters, and removing repeated Chinese characters to obtain the preprocessed railway text data information.
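A minimal sketch of claim 5's preprocessing, assuming hypothetical regular expressions for the date and equipment-number strings (the patent does not specify their exact form):

```python
import re
from typing import List

DATE = re.compile(r"\d{4}年\d{1,2}月\d{1,2}日")   # assumed date format
DEVICE_NO = re.compile(r"[A-Z]+-?\d+")            # assumed equipment-number shape

def preprocess(text: str) -> List[str]:
    """Remove date/equipment-number strings, split into single characters,
    and drop repeated characters (keeping first occurrences), per claim 5."""
    text = DATE.sub("", text)
    text = DEVICE_NO.sub("", text)
    seen, unique = set(), []
    for ch in text:
        if not ch.isspace() and ch not in seen:
            seen.add(ch)
            unique.append(ch)
    return unique
```

For example, `preprocess("2019年12月24日 设备K-103故障故障")` returns `["设", "备", "故", "障"]`: the date and equipment number are stripped, and the repeated 故障 is kept only once.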
6. The railway text named entity recognition method according to claim 3, wherein the BiLSTM model comprises a forward LSTM model and a backward LSTM model.
7. A railway text named entity recognition device, comprising:
the preprocessing module is used for preprocessing original railway text data to obtain preprocessed railway text data information;
the vector conversion module is used for inputting the preprocessed railway text data information into a preset BERT model to obtain railway text vector information;
the recognition module is used for inputting the railway text vector information into a preset BiLSTM-CRF model to obtain railway text named-entity recognition result information;
wherein the preset BERT model is obtained by training on sample preprocessed railway text data information with entity labels, and the preset BiLSTM-CRF model is obtained by training on sample railway text vector information with entity labels and word-order labels.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the steps of the railway text named entity recognition method according to any one of claims 1 to 6.
9. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the railway text named entity recognition method according to any one of claims 1 to 6.
CN201911350774.6A 2019-12-24 2019-12-24 Railway text named entity recognition method and device Pending CN111191452A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911350774.6A CN111191452A (en) 2019-12-24 2019-12-24 Railway text named entity recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911350774.6A CN111191452A (en) 2019-12-24 2019-12-24 Railway text named entity recognition method and device

Publications (1)

Publication Number Publication Date
CN111191452A true CN111191452A (en) 2020-05-22

Family

ID=70707613

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911350774.6A Pending CN111191452A (en) 2019-12-24 2019-12-24 Railway text named entity recognition method and device

Country Status (1)

Country Link
CN (1) CN111191452A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN110083831A (en) * 2019-04-16 2019-08-02 武汉大学 A kind of Chinese name entity recognition method based on BERT-BiGRU-CRF
CN110297913A (en) * 2019-06-12 2019-10-01 中电科大数据研究院有限公司 A kind of electronic government documents entity abstracting method
CN110516256A (en) * 2019-08-30 2019-11-29 的卢技术有限公司 A kind of Chinese name entity extraction method and its system

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859916A (en) * 2020-07-28 2020-10-30 中国平安人寿保险股份有限公司 Ancient poetry keyword extraction and poetry sentence generation method, device, equipment and medium
CN111859916B (en) * 2020-07-28 2023-07-21 中国平安人寿保险股份有限公司 Method, device, equipment and medium for extracting key words of ancient poems and generating poems
CN112084783A (en) * 2020-09-24 2020-12-15 中国民航大学 Entity identification method and system based on civil aviation non-civilized passengers
CN112084783B (en) * 2020-09-24 2022-04-12 中国民航大学 Entity identification method and system based on civil aviation non-civilized passengers
CN113032582A (en) * 2021-04-20 2021-06-25 杭州叙简科技股份有限公司 Knowledge graph based entity unified model establishment and entity unified method
CN115221882A (en) * 2022-07-28 2022-10-21 平安科技(深圳)有限公司 Named entity identification method, device, equipment and medium
CN115221882B (en) * 2022-07-28 2023-06-20 平安科技(深圳)有限公司 Named entity identification method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN107608956B (en) Reader emotion distribution prediction algorithm based on CNN-GRNN
CN111191452A (en) Railway text named entity recognition method and device
CN110232114A (en) Sentence intension recognizing method, device and computer readable storage medium
CN111694924A (en) Event extraction method and system
CN107797987B (en) Bi-LSTM-CNN-based mixed corpus named entity identification method
CN111339305B (en) Text classification method and device, electronic equipment and storage medium
CN111177382B (en) Intelligent legal system recommendation auxiliary system based on FastText algorithm
CN107977353A (en) A kind of mixing language material name entity recognition method based on LSTM-CNN
CN114282527A (en) Multi-language text detection and correction method, system, electronic device and storage medium
CN113392209A (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN112966525B (en) Law field event extraction method based on pre-training model and convolutional neural network algorithm
CN112560486A (en) Power entity identification method based on multilayer neural network, storage medium and equipment
CN112084336A (en) Entity extraction and event classification method and device for expressway emergency
CN113157916A (en) Civil aviation emergency extraction method based on deep learning
CN111753058A (en) Text viewpoint mining method and system
CN113505200A (en) Sentence-level Chinese event detection method combining document key information
CN107797988A (en) A kind of mixing language material name entity recognition method based on Bi LSTM
CN114239574A (en) Miner violation knowledge extraction method based on entity and relationship joint learning
CN114529903A (en) Text refinement network
CN115292568B (en) Civil news event extraction method based on joint model
CN113987183A (en) Power grid fault handling plan auxiliary decision-making method based on data driving
CN107992468A (en) A kind of mixing language material name entity recognition method based on LSTM
CN114417785A (en) Knowledge point annotation method, model training method, computer device, and storage medium
CN113051904A (en) Link prediction method for small-scale knowledge graph
CN110795531B (en) Intention identification method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination