WO2021042516A1 - Named-entity recognition method and device, and computer readable storage medium - Google Patents
- Publication number
- WO2021042516A1 (PCT/CN2019/116935)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- text
- named entity
- data
- entity
- inference engine
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Definitions
- This application relates to the field of artificial intelligence technology, and in particular to a method, device and computer-readable storage medium for identifying named entities in a text data set.
- This application provides a named entity recognition method, device, and computer-readable storage medium, whose main purpose is to apply deep learning to an original text data set to obtain named entities.
- a named entity identification method provided by this application includes:
- the text vector data and the named entity set are input to the inference engine in the neural entity inference engine recognition model to perform inference to obtain the named entity.
- the present application also provides a named entity recognition device, which includes a memory and a processor, the memory stores a named entity recognition program that can run on the processor, and when the named entity recognition program is executed by the processor, the following steps are implemented:
- the text vector data and the named entity set are input to the inference engine in the neural entity inference engine recognition model to perform inference to obtain the named entity.
- the present application also provides a computer-readable storage medium having a named entity recognition program stored thereon, and the named entity recognition program can be executed by one or more processors to realize the steps of the named entity recognition method described above.
- the named entity recognition method, device, and computer-readable storage medium described in this application apply deep learning technology.
- the neural entity inference engine recognition model includes a multi-layer structure, and each layer can independently complete one pass of named entity recognition, with each layer's recognition result serving as a reference for the next layer; the optimal recognition result can then be obtained through the inference engine. The named entity recognition of each layer can share parameters in most cases. Therefore, the named entity recognition method, device, and computer-readable storage medium proposed in this application can achieve precise, efficient, and consistent named entity recognition.
- FIG. 1 is a schematic flowchart of a named entity recognition method provided by an embodiment of this application;
- FIG. 2 is a schematic diagram of the internal structure of a named entity recognition device provided by an embodiment of this application;
- FIG. 3 is a schematic diagram of the modules of a named entity recognition program provided by an embodiment of this application.
- This application provides a named entity recognition method.
- Referring to FIG. 1, which is a schematic flowchart of a named entity recognition method provided by an embodiment of this application.
- the method can be executed by a device, and the device can be implemented by software and/or hardware.
- the named entity recognition method includes:
- S1 Receive first text data composed of original sentences to be recognized, and preprocess the first text data to obtain text vector data.
- the preprocessing includes operations such as word segmentation, stop word removal, and duplicate removal on the first text data.
- this application performs a word segmentation operation on the first text data to obtain second text data, performs a stop word removal operation on the second text data to obtain third text data, performs a deduplication operation on the third text data to obtain fourth text data, and converts the fourth text data into word vector form using the TF-IDF algorithm, so as to obtain the text vector data after preprocessing is completed.
- the word segmentation is to segment each sentence in the original sentence to obtain a single word. Because there is no clear separation mark between words in Chinese representation, word segmentation is indispensable.
- for Chinese text, words have the ability to truly reflect the content of the document, so words are usually used as text feature words in the vector space model. However, Chinese text does not separate words with spaces the way English text does, so the Chinese text needs to be segmented first.
- the word segmentation described in the present application can adopt a dictionary-based word segmentation method, and the Chinese character string to be segmented and the entry in the preset dictionary are matched according to a certain strategy, such as a traversal operation, to obtain the final word segmentation result.
- the dictionary may include a statistical dictionary.
- the statistical dictionary is a dictionary constructed by all possible word segmentation obtained by statistical methods.
- the dictionary may also include a prefix dictionary.
- the prefix dictionary includes the prefixes of each word segment in the statistical dictionary. For example, the prefixes of the word "北京大学" (Peking University) in the statistical dictionary are "北", "北京", and "北京大"; the prefix of the word "大学" (university) is "大", and so on.
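- As an illustration of the dictionary matching described above, the following is a minimal sketch of forward maximum matching against a statistical dictionary. The dictionary contents and the forward-maximum-matching strategy are illustrative assumptions; the source only requires matching "according to a certain strategy, such as a traversal operation".

```python
# Minimal sketch of dictionary-based forward maximum matching; the
# dictionary and strategy below are illustrative assumptions.

STATISTICAL_DICT = {"北京大学", "北京", "大学", "的", "学生"}
MAX_WORD_LEN = max(len(w) for w in STATISTICAL_DICT)

def segment(sentence: str) -> list[str]:
    """Greedily match the longest dictionary entry at each position."""
    tokens, i = [], 0
    while i < len(sentence):
        for j in range(min(len(sentence), i + MAX_WORD_LEN), i, -1):
            # Fall back to a single character when nothing matches.
            if sentence[i:j] in STATISTICAL_DICT or j == i + 1:
                tokens.append(sentence[i:j])
                i = j
                break
    return tokens

print(segment("北京大学的学生"))  # ['北京大学', '的', '学生']
```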
- stop word removal deletes function words in the text data that have no actual meaning and no effect on the classification of the text but appear frequently, including commonly used pronouns, prepositions, and the like.
- the selected method for removing stop words is stop word list filtering, that is, a pre-built stop word list is matched one by one against the words in the text data; if the matching succeeds, the word is a stop word and needs to be deleted.
- furthermore, since the collected first text data comes from intricate sources, it may contain many duplicates that affect classification accuracy; the embodiment of this application uses the Euclidean distance method to perform the deduplication operation, with the following formula:

  d = sqrt( Σ_j (w_1j − w_2j)² )

- where w_1j and w_2j are the feature weights of the two first text data being compared, and d is the Euclidean distance; the smaller the distance, the more similar the two texts, and when the distance between two first text data is less than a preset threshold, one of the two is deleted.
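- The following is a minimal sketch of the deduplication step under the formula above, assuming the texts have already been converted to feature-weight vectors; the threshold value and the keep-first policy are assumptions not specified in the source.

```python
# Hedged sketch of Euclidean-distance deduplication over weight vectors.
import math

THRESHOLD = 0.1  # assumed preset threshold

def euclidean(v1: list[float], v2: list[float]) -> float:
    # d = sqrt(sum_j (w_1j - w_2j)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def deduplicate(vectors: list[list[float]]) -> list[list[float]]:
    """Keep a vector only if it is not too close to an already-kept one."""
    kept: list[list[float]] = []
    for v in vectors:
        if all(euclidean(v, k) >= THRESHOLD for k in kept):
            kept.append(v)
    return kept

docs = [[0.9, 0.1, 0.0], [0.9, 0.1, 0.01], [0.0, 0.8, 0.2]]
print(len(deduplicate(docs)))  # 2 — the near-duplicate second vector is dropped
```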
- the text is then represented by a series of feature words (keywords), but data in this textual form cannot be directly processed by a classification algorithm and should be converted into numerical form.
- the weight calculation of these characteristic words is used to characterize the importance of the characteristic words in the text.
- the TF-IDF algorithm is used to compute the feature words, and the data obtained after word segmentation, stop word removal, and deduplication is processed to obtain text vector data.
- the TF-IDF algorithm uses statistical information, word vector information, and dependency syntax information between words, builds a dependency graph to calculate the correlation strength between words, and uses the TextRank algorithm to iteratively calculate the importance score of each word.
- specifically, when computing feature word weights, this application first calculates the dependency relatedness Dep(W_i, W_j) of any two words (keywords) W_i and W_j, where len(W_i, W_j) represents the length of the dependency path between the words W_i and W_j, and b is a hyperparameter.
- it then computes the attraction between the two words by analogy with universal gravitation, treating TF-IDF values as masses and the Euclidean distance between word vectors as the distance:

  f_grav(W_i, W_j) = tfidf(W_i) · tfidf(W_j) / d²

- where tfidf(W) is the TF-IDF value of word W, and d is the Euclidean distance between the word vectors of W_i and W_j. The overall relatedness of the two words is weight(W_i, W_j) = Dep(W_i, W_j) · f_grav(W_i, W_j), over which the TextRank algorithm iterates to score each word.
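- To make the weighting concrete, here is a hedged sketch of the graph weights and the TextRank iteration. The exact form of Dep() is an assumption (inverse dependency-path length raised to the hyperparameter b), since the source does not show its formula; f_grav and the TextRank update follow the formulas described above.

```python
# Sketch of the graph-based feature-word weighting; Dep() is an assumed form.

def dep(path_len: int, b: float = 0.9) -> float:
    # Assumption: relatedness decays with dependency-path length.
    return 1.0 / (path_len ** b)

def f_grav(tfidf_i: float, tfidf_j: float, d: float) -> float:
    # Gravity analogy: TF-IDF values as masses, word-vector distance as distance.
    return tfidf_i * tfidf_j / (d ** 2)

def textrank(weights: dict[tuple[str, str], float], words: list[str],
             eta: float = 0.85, iters: int = 20) -> dict[str, float]:
    """Iteratively score words over the weighted dependency graph."""
    ws = {w: 1.0 for w in words}
    for _ in range(iters):
        for wi in words:
            total = 0.0
            for wj in words:
                if (wj, wi) not in weights:
                    continue
                out_sum = sum(v for (src, _), v in weights.items() if src == wj)
                total += weights[(wj, wi)] / out_sum * ws[wj]
            ws[wi] = (1 - eta) + eta * total
    return ws

# Tiny usage example with assumed TF-IDF values and distances.
edges = {("John", "met"): dep(1) * f_grav(0.4, 0.2, 1.5),
         ("met", "John"): dep(1) * f_grav(0.2, 0.4, 1.5)}
print(textrank(edges, ["John", "met"]))
```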
- the neural entity inference engine recognition model shown in this application is a multi-layer architecture, and each layer is an encoding-decoding Bi-LSTM model.
- each layer independently completes one pass of named entity neural reasoning, and the named entity neural reasoning result of each layer is stored in a symbolic cache as a reference for the next layer. This reference is realized through an interactive pooled neural network, which is in essence based on multiple real-time reasoning models.
- this application uses the demonstration text "Dong met Tao and Wiener John met the family of Tao" as an example to analyze the structure of the named entity neural reasoning model.
- the sentence actually contains four named entities: "John", "Tao", "Dong", and "Wiener".
- when the named entity neural inference model of this application has not yet been trained, the candidate pool in the first layer of the model is empty, because the model has not been trained to identify any initial named entities.
- at this point the identified named entity result is "John", because "John" is a common person name.
- a common person name appears frequently in training data, so it is easily matched and recognized as a named entity.
- "Tao" may be omitted.
- “Tao" is not an ordinary person name, so it does not appear frequently as a person name in the training model.
- the specific principle of the reasoning is that, based on the information about "John", the model knows that the word before "met" is a person's name and that "Tao" is a person's name, so the inference engine can infer that "John" and the first "Tao" are consistent in sentence logic and grammatical position, and then update the candidate pool to store "Tao" in it as an initial named entity.
- in the same way, in the third layer the inference engine of the neural entity inference engine recognition model can recognize that "Wiener", like the aforementioned "Tao", is a person name in terms of sentence logic and grammatical position, and recognize it as a named entity. Through multi-layer training, all word units in the text to be recognized are processed, all named entities contained in the text are finally recognized, and the named entity recognition process of the entire neural entity inference engine is completed.
- the preprocessed text vector data is encoded into an encoded representation sequence, and the decoder of each layer can independently give predicted labels based on word expressions and the information generated from their context. Since the predicted labels indicate which words are entities, this application can find the entity representations from the predicted labels. At the same time, the model of this application always records the entire neural entity inference engine recognition process, including the identified entity information, so that the model established in this application can "see" all past decisions; each layer can then cite from this record through the inference engine and update the candidate pool, so that the predicted results help the next layer maintain global consistency and obtain better results.
- inputting the text vector data into the neural entity inference engine recognition model for training to obtain a named entity set includes the following steps:
- Step S301 Use the Bi-LSTM model to encode the text vector data to obtain an encoded representation sequence.
- one layer of the neural entity inference engine recognition model can be regarded as a standard encoder-decoder framework that can receive additional information from the inference engine.
- the model of this application uses the Bi-LSTM model as the encoder and the LSTM model as the decoder.
- the candidate pool is a simple list consisting of a sequence of coded representations of named entities, which can contain all named entities identified in the entire text or in the entire result.
- the decoders and encoders of each layer can share parameters to avoid parameter growth and make the model easy to train as an end-to-end model. Therefore, the only difference between each layer is the candidate pool and the different named entities.
- the LSTM model is designed to solve the problems of vanishing gradients and learning long-term dependencies.
- formally, at time t, the memory c_t and hidden state h_t of the basic LSTM unit are updated as follows:

  i_t = σ(W_i·x_t + U_i·h_(t−1) + b_i)
  f_t = σ(W_f·x_t + U_f·h_(t−1) + b_f)
  o_t = σ(W_o·x_t + U_o·h_(t−1) + b_o)
  g_t = tanh(W_g·x_t + U_g·h_(t−1) + b_g)
  c_t = f_t ⊙ c_(t−1) + i_t ⊙ g_t
  h_t = o_t ⊙ tanh(c_t)

- where ⊙ represents the element-wise product, σ is the sigmoid function, x_t represents the vector input at time t, h_t is the hidden state, and i_t, f_t, and o_t represent the updates of the input gate, forget gate, and output gate at step t, respectively.
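- A minimal numpy sketch of one basic LSTM step matching the update equations above; the stacked weight layout and the randomly initialized parameters are illustrative placeholders, not trained values.

```python
# One basic LSTM step; weights are random placeholders for illustration.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b each stack the input/forget/output/candidate blocks.
    z = W @ x_t + U @ h_prev + b
    H = h_prev.size
    i_t = sigmoid(z[0:H])            # input gate
    f_t = sigmoid(z[H:2*H])          # forget gate
    o_t = sigmoid(z[2*H:3*H])        # output gate
    g_t = np.tanh(z[3*H:4*H])        # candidate memory
    c_t = f_t * c_prev + i_t * g_t   # memory update
    h_t = o_t * np.tanh(c_t)         # hidden state update
    return h_t, c_t

D, H = 8, 4
rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(4*H, D)), rng.normal(size=(4*H, H)), np.zeros(4*H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, U, b)
print(h.shape, c.shape)  # (4,) (4,)
```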
- Step S302 Input the coded representation sequence and the initial named entities in the candidate pool into the inference engine for processing to obtain reference information.
- the inference engine is a set of programs used to control and coordinate the entire system.
- in an expert system, the inference engine solves the problem according to the problem information (the information exchanged between the user and the expert system) and the knowledge in the knowledge base. That is, after the target object is set, the engine takes external information as input and applies logical operations such as deduction and induction to the target object, generating a conclusion based on established pattern matching.
- the reasoning engine in this embodiment is actually a multi-fact reasoning model.
- the current coded representation sequence serves as the query,
- and the initial named entity information in the candidate pool serves as the facts.
- Step S303 Input the coded representation sequence and the reference information into a decoder to obtain a predicted label; update the candidate pool according to the predicted label to obtain the named entity set.
- since the Bi-LSTM model is used in the embodiment of the present application, a good predicted label y_i can be obtained.
- this application adopts the BMEOS (Begin, Middle, End, Other, Single) tagging scheme, so that from the predicted label y_i one can know where each named entity starts and ends, forming boundary information, which is then used to organize and form a cache of the document. Since this model relies on local language characteristics to make decisions, this application considers, on this basis, how to store named entity information more reasonably and efficiently.
- a named entity is regarded as an independent and indivisible object composed of several words, so the appearance pattern of an entity can be described as: [forward context][entity][after context]. Therefore, this application stores entities in this pattern.
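- The following hedged sketch shows how BMEOS labels can be decoded into boundary information and stored in the [forward context][entity][after context] pattern; the context window size and tag-handling details are illustrative assumptions.

```python
# Decode BMEOS tags into entity triples; window size is an assumption.

def bmeos_to_entities(tokens: list[str], tags: list[str], ctx: int = 2):
    """Yield (forward_context, entity, after_context) triples."""
    i = 0
    while i < len(tags):
        if tags[i] == "S":                      # single-token entity
            start, end = i, i + 1
        elif tags[i] == "B":                    # multi-token entity: B M* E
            start, end = i, i + 1
            while end < len(tags) and tags[end] == "M":
                end += 1
            if end < len(tags) and tags[end] == "E":
                end += 1
        else:
            i += 1
            continue
        yield (tokens[max(0, start - ctx):start],   # forward context
               tokens[start:end],                   # entity
               tokens[end:end + ctx])               # after context
        i = end

tokens = "Dong met Tao and Wiener".split()
tags = ["S", "O", "S", "O", "S"]
print(list(bmeos_to_entities(tokens, tags)))
```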
- the coded representation sequence of each entity contains information to determine its predicted label.
- the encoder in the coding layer is a combination of a forward and a backward LSTM. Therefore, this application stores the obtained predicted labels in the candidate pool to provide decisive information for the inference engine to give inference results.
- based on the candidate pool, this application actually stores an entity as an object, and this object has three descriptions. Therefore, for each word to be predicted, this application can use the similarity between the current word and the candidate pool as a reference from these three aspects to make better decisions.
- Each matrix in the candidate pool is actually a vector representation list, which also contains some entity information facts. Based on this, this application can use a special multi-fact reasoning model to obtain suggestions from it.
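- As a sketch of how such a multi-fact lookup might work, the snippet below compares an encoded query word against the three stored descriptions of each candidate-pool entry; cosine similarity and max pooling are assumptions, not operations stated in the source.

```python
# Illustrative multi-fact reference lookup over the candidate pool.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def reference_score(query: np.ndarray, pool: list[dict]) -> float:
    """pool entries: {'fwd': vec, 'entity': vec, 'aft': vec}."""
    scores = []
    for fact in pool:
        # Compare the query against all three descriptions of the entity.
        s = (cosine(query, fact["fwd"]) +
             cosine(query, fact["entity"]) +
             cosine(query, fact["aft"])) / 3.0
        scores.append(s)
    return max(scores) if scores else 0.0

rng = np.random.default_rng(1)
pool = [{"fwd": rng.normal(size=4), "entity": rng.normal(size=4),
         "aft": rng.normal(size=4)}]
print(round(reference_score(rng.normal(size=4), pool), 3))
```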
- the decoder is defined in terms of the following quantities:
- X represents the preprocessed text vector data;
- y_i represents the predicted label of the i-th layer in the neural entity inference engine recognition model;
- x_t represents the value of the text vector x at time t.
- the neural entity inference-based named entity recognition model of each layer can share parameters in most cases, which makes the model of this application truly end-to-end.
- the candidate pool is updated in real time according to the predicted label to obtain the named entity set.
- a stable named entity neural inference engine is obtained by inputting text vector data into the neural entity inference engine recognition model for training.
- by inputting the text data of the original sentences to be recognized into the multi-layer neural entity inference engine recognition model, the corresponding initial named entities are obtained, and these initial named entities form a named entity set.
- this application then uses the inference engine of the trained neural entity inference engine recognition model to perform inference on the text vector data and the named entity set to obtain the named entities.
- the present application also provides a named entity recognition device.
- Referring to FIG. 2, which is a schematic diagram of the internal structure of a named entity recognition device provided by an embodiment of this application.
- the named entity recognition device 1 may be a PC (Personal Computer, personal computer), or a terminal device such as a smart phone, a tablet computer, or a portable computer, or a server.
- the named entity recognition device 1 at least includes a memory 11, a processor 12, a communication bus 13, and a network interface 14.
- the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, and the like.
- the memory 11 may be an internal storage unit of the named entity recognition device 1 in some embodiments, for example, the hard disk of the named entity recognition device 1.
- the memory 11 may also be an external storage device of the named entity recognition device 1, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the named entity recognition device 1.
- the memory 11 may also include both an internal storage unit of the named entity recognition apparatus 1 and an external storage device.
- the memory 11 can be used not only to store application software and various data installed in the named entity recognition device 1, such as the code of the named entity recognition program 01, etc., but also to temporarily store data that has been output or will be output.
- the processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, and is used to run program code stored in the memory 11 or process data, such as executing the named entity recognition program 01.
- the communication bus 13 is used to realize the connection and communication between these components.
- the network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is usually used to establish a communication connection between the apparatus 1 and other electronic devices.
- the device 1 may also include a user interface.
- the user interface may include a display (Display) and an input unit such as a keyboard (Keyboard).
- the optional user interface may also include a standard wired interface and a wireless interface.
- the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light-emitting diode) touch device, etc.
- the display can also be appropriately called a display screen or a display unit, which is used to display the information processed in the named entity recognition device 1 based on the neural entity inference engine and to display a visualized user interface.
- Figure 2 only shows a named entity recognition device 1 with components 11-14 and a named entity recognition program 01 based on a neural entity inference engine.
- Those skilled in the art will understand that the structure shown in Figure 2 does not constitute a limitation on the named entity recognition device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
- the memory 11 stores the named entity recognition program 01; when the processor 12 executes the named entity recognition program 01 stored in the memory 11, the following steps are implemented:
- Step 1 Receive first text data composed of original sentences to be recognized, and preprocess the first text data to obtain text vector data.
- the preprocessing includes operations such as word segmentation, stop word removal, and duplicate removal on the first text data.
- this application performs a word segmentation operation on the first text data to obtain second text data, performs a stop word removal operation on the second text data to obtain third text data, performs a deduplication operation on the third text data to obtain fourth text data, and converts the fourth text data into word vector form using the TF-IDF algorithm, so as to obtain the text vector data after preprocessing is completed.
- the word segmentation is to segment each sentence in the original sentence to obtain a single word. Because there is no clear separation mark between words in Chinese representation, word segmentation is indispensable.
- for Chinese text, words have the ability to truly reflect the content of the document, so words are usually used as text feature words in the vector space model. However, Chinese text does not separate words with spaces the way English text does, so the Chinese text needs to be segmented first.
- the word segmentation described in the present application can adopt a dictionary-based word segmentation method, and the Chinese character string to be segmented and the entry in the preset dictionary are matched according to a certain strategy, such as a traversal operation, to obtain the final word segmentation result.
- the dictionary may include a statistical dictionary.
- the statistical dictionary is a dictionary constructed by all possible word segmentation obtained by statistical methods.
- the dictionary may also include a prefix dictionary.
- the prefix dictionary includes the prefixes of each word segment in the statistical dictionary. For example, the prefixes of the word "北京大学" (Peking University) in the statistical dictionary are "北", "北京", and "北京大"; the prefix of the word "大学" (university) is "大", and so on.
- stop word removal deletes function words in the text data that have no actual meaning and no effect on the classification of the text but appear frequently, including commonly used pronouns, prepositions, and the like.
- the selected method for removing stop words is stop word list filtering, that is, a pre-built stop word list is matched one by one against the words in the text data; if the matching succeeds, the word is a stop word and needs to be deleted.
- furthermore, since the collected first text data comes from intricate sources, it may contain many duplicates that affect classification accuracy; the embodiment of this application uses the Euclidean distance method to perform the deduplication operation, with the following formula:

  d = sqrt( Σ_j (w_1j − w_2j)² )

- where w_1j and w_2j are the feature weights of the two first text data being compared, and d is the Euclidean distance; the smaller the distance, the more similar the two texts, and when the distance between two first text data is less than a preset threshold, one of the two is deleted.
- the text is then represented by a series of feature words (keywords), but data in this textual form cannot be directly processed by a classification algorithm and should be converted into numerical form.
- the weight calculation of these characteristic words is used to characterize the importance of the characteristic words in the text.
- the TF-IDF algorithm is used to compute the feature words, and the data obtained after word segmentation, stop word removal, and deduplication is processed to obtain text vector data.
- the TF-IDF algorithm uses statistical information, word vector information, and dependency syntax information between words, builds a dependency graph to calculate the correlation strength between words, and uses the TextRank algorithm to iteratively calculate the importance score of each word.
- specifically, when computing feature word weights, this application first calculates the dependency relatedness Dep(W_i, W_j) of any two words (keywords) W_i and W_j, where len(W_i, W_j) represents the length of the dependency path between the words W_i and W_j, and b is a hyperparameter.
- it then computes the attraction between the two words by analogy with universal gravitation, treating TF-IDF values as masses and the Euclidean distance between word vectors as the distance:

  f_grav(W_i, W_j) = tfidf(W_i) · tfidf(W_j) / d²

- where tfidf(W) is the TF-IDF value of word W, and d is the Euclidean distance between the word vectors of W_i and W_j. The overall relatedness of the two words is weight(W_i, W_j) = Dep(W_i, W_j) · f_grav(W_i, W_j), over which the TextRank algorithm iterates to score each word.
- Step 2 Obtain a neural entity inference engine recognition model with a multi-layer structure.
- the neural entity inference engine recognition model shown in this application is a multi-layer architecture, and each layer is an encoding-decoding Bi-LSTM model.
- each layer independently completes one pass of named entity neural reasoning, and the named entity neural reasoning result of each layer is stored in a symbolic cache as a reference for the next layer. This reference is realized through an interactive pooled neural network, which is in essence based on multiple real-time reasoning models.
- this application uses the demonstration text "Dong met Tao and Wiener John met the family of Tao" as an example to analyze the structure of the named entity neural reasoning model.
- the sentence actually contains four named entities: "John", "Tao", "Dong", and "Wiener".
- when the named entity neural inference model of this application has not yet been trained, the candidate pool in the first layer of the model is empty, because the model has not been trained to identify any initial named entities.
- at this point the identified named entity result is "John", because "John" is a common person name.
- a common person name appears frequently in training data, so it is easily matched and recognized as a named entity.
- "Tao" may be omitted.
- “Tao" is not an ordinary person name, so it does not appear frequently as a person name in the training model.
- the specific principle of the reasoning is that, based on the information about "John", the model knows that the word before "met" is a person's name and that "Tao" is a person's name, so the inference engine can infer that "John" and the first "Tao" are consistent in sentence logic and grammatical position, and then update the candidate pool to store "Tao" in it as an initial named entity.
- in the same way, in the third layer the inference engine of the neural entity inference engine recognition model can recognize that "Wiener", like the aforementioned "Tao", is a person name in terms of sentence logic and grammatical position, and recognize it as a named entity. Through multi-layer training, all word units in the text to be recognized are processed, all named entities contained in the text are finally recognized, and the named entity recognition process of the entire neural entity inference engine is completed.
- the preprocessed text vector data is encoded into an encoded representation sequence, and the decoder of each layer can independently give predicted labels based on word expressions and the information generated from their context. Since the predicted labels indicate which words are entities, this application can find the entity representations from the predicted labels. At the same time, the model of this application always records the entire neural entity inference engine recognition process, including the identified entity information, so that the model established in this application can "see" all past decisions; each layer can then cite from this record through the inference engine and update the candidate pool, so that the predicted results help the next layer maintain global consistency and obtain better results.
- Step 3 Input the text vector data into the neural entity inference engine recognition model for training to obtain a named entity set.
- inputting the text vector data into the neural entity inference engine recognition model for training to obtain a named entity set includes the following steps:
- the first step is to use the Bi-LSTM model to encode the text vector data to obtain an encoded representation sequence.
- one layer of the neural entity inference engine recognition model can be regarded as a standard encoder-decoder framework that can receive additional information from the inference engine.
- the model of this application uses the Bi-LSTM model as the encoder and the LSTM model as the decoder.
- the candidate pool is a simple list consisting of a sequence of coded representations of named entities, which can contain all named entities identified in the entire text or in the entire result.
- the decoders and encoders of each layer can share parameters to avoid parameter growth and make the model easy to train as an end-to-end model. Therefore, the only difference between each layer is the candidate pool and the different named entities.
- the LSTM model is designed to solve the problems of vanishing gradients and learning long-term dependencies.
- formally, at time t, the memory c_t and hidden state h_t of the basic LSTM unit are updated as follows:

  i_t = σ(W_i·x_t + U_i·h_(t−1) + b_i)
  f_t = σ(W_f·x_t + U_f·h_(t−1) + b_f)
  o_t = σ(W_o·x_t + U_o·h_(t−1) + b_o)
  g_t = tanh(W_g·x_t + U_g·h_(t−1) + b_g)
  c_t = f_t ⊙ c_(t−1) + i_t ⊙ g_t
  h_t = o_t ⊙ tanh(c_t)

- where ⊙ represents the element-wise product, σ is the sigmoid function, x_t represents the vector input at time t, h_t is the hidden state, and i_t, f_t, and o_t represent the updates of the input gate, forget gate, and output gate at step t, respectively.
- the second step is to input the coded representation sequence and the initial named entities in the candidate pool into the inference engine for processing to obtain reference information.
- the inference engine is a set of programs used to control and coordinate the entire system.
- in an expert system, the inference engine solves the problem according to the problem information (the information exchanged between the user and the expert system) and the knowledge in the knowledge base. That is, after the target object is set, the engine takes external information as input and applies logical operations such as deduction and induction to the target object, generating a conclusion based on established pattern matching.
- the reasoning engine in this embodiment is actually a multi-fact reasoning model.
- the current coded representation sequence serves as the query,
- and the initial named entity information in the candidate pool serves as the facts.
- the third step is to input the coded representation sequence and the reference information into a decoder to obtain a predicted label; update the candidate pool according to the predicted label to obtain the named entity set.
- since the Bi-LSTM model is used in the embodiment of the present application, a good predicted label y_i can be obtained.
- this application adopts the BMEOS (Begin, Middle, End, Other, Single) tagging scheme, so that from the predicted label y_i one can know where each named entity starts and ends, forming boundary information, which is then used to organize and form a cache of the document. Since this model relies on local language characteristics to make decisions, this application considers, on this basis, how to store named entity information more reasonably and efficiently.
- a named entity is regarded as an independent and indivisible object composed of several words, so the appearance pattern of an entity can be described as: [forward context][entity][after context]. Therefore, this application stores entities in this pattern.
- the coded representation sequence of each entity contains information to determine its predicted label.
- the encoder in the coding layer is a combination of a forward and a backward LSTM. Therefore, this application stores the obtained predicted labels in the candidate pool to provide decisive information for the inference engine to give inference results.
- based on the candidate pool, this application actually stores an entity as an object, and this object has three descriptions. Therefore, for each word to be predicted, this application can use the similarity between the current word and the candidate pool as a reference from these three aspects to make better decisions.
- Each matrix in the candidate pool is actually a vector representation list, which also contains some entity information facts. Based on this, this application can use a special multi-fact reasoning model to obtain suggestions from it.
- the decoder is defined in terms of the following quantities:
- X represents the preprocessed text vector data;
- y_i represents the predicted label of the i-th layer in the neural entity inference engine recognition model;
- x_t represents the value of the text vector x at time t.
- the neural entity inference-based named entity recognition model of each layer can share parameters in most cases, which makes the model of this application truly end-to-end.
- the candidate pool is updated in real time according to the predicted label to obtain the named entity set.
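- Putting the steps together, the following is a high-level sketch of the layered flow: a shared encoder and decoder, with only the candidate pool changing between layers. All component functions are hypothetical stand-ins for the modules described above.

```python
# High-level sketch of the layered inference flow; encode, infer, decode,
# and extract_entities are hypothetical stand-ins, not the patent's API.

def recognize(text_vectors, n_layers, encode, infer, decode, extract_entities):
    pool = []                                # candidate pool: empty at layer 1
    encoded = encode(text_vectors)           # shared Bi-LSTM encoder
    for _ in range(n_layers):
        reference = infer(encoded, pool)     # inference engine: query vs. facts
        labels = decode(encoded, reference)  # shared LSTM decoder -> BMEOS tags
        pool = extract_entities(labels)      # update the pool for the next layer
    return pool                              # final named entity set
```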
- Step 4 Input the text vector data and the named entity set into the inference engine in the neural entity inference engine recognition model to perform inference to obtain a named entity.
- a stable named entity neural inference engine is obtained by inputting text vector data into the neural entity inference engine recognition model for training.
- by inputting the text data of the original sentences to be recognized into the multi-layer neural entity inference engine recognition model, the corresponding initial named entities are obtained, and these initial named entities form a named entity set.
- this application then uses the inference engine of the trained neural entity inference engine recognition model to perform inference on the text vector data and the named entity set to obtain the named entities.
- the named entity recognition program may also be divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to complete this application.
- the module referred to in the application refers to a series of computer program instruction segments capable of completing specific functions, and is used to describe the execution process of the named entity recognition program in the named entity recognition device.
- Referring to FIG. 3, which is a schematic diagram of the modules of the named entity recognition program in an embodiment of the named entity recognition device of this application.
- exemplarily, the named entity recognition program can be divided into a data receiving and processing module 10, a word vector conversion module 20, a model training module 30, and a named entity output module 40:
- the data receiving and processing module 10 is configured to receive first text data composed of original sentences to be recognized, and to perform operations such as word segmentation, stop word removal, and deduplication on the first text data.
- the word vector conversion module 20 is configured to use the TF-IDF algorithm to perform word vector conversion on the first text data after operations such as word segmentation, stop word removal, and deduplication, so as to obtain text vector data.
- the model training module 30 is used to obtain a neural entity inference engine recognition model with a multi-layer structure, where each layer is an encoding-decoding Bi-LSTM model, each layer independently completes one pass of named entity neural reasoning, and the named entity neural reasoning result of each layer is stored in a symbolic cache as a reference for the next layer.
- the named entity output module 40 is configured to input the text vector data into the neural entity inference engine recognition model for training to obtain a named entity set, and to input the text vector data and the named entity set into the inference engine in the neural entity inference engine recognition model for inference to obtain a named entity.
- the above-mentioned data receiving and processing module 10, word vector conversion module 20, model training module 30, named entity output module 40 and other program modules implement functions or operation steps that are substantially the same as those in the above-mentioned embodiment, and will not be repeated here.
- an embodiment of the present application also proposes a computer-readable storage medium having a named entity recognition program stored thereon, and the named entity recognition program can be executed by one or more processors to implement the following operations:
- the text vector data and the named entity set are input to the inference engine in the neural entity inference engine recognition model to perform inference to obtain the named entity.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
A named-entity recognition method and device, and a computer readable storage medium, relating to the technical field of artificial intelligence. The method comprises: receiving first text data consisting of original statements to be recognized, and preprocessing the first text data to obtain text vector data (S1); obtaining a neural entity inference engine recognition model having a multi-layer structure, and training the neural entity inference engine recognition model (S2); inputting the text vector data into the trained neural entity inference engine recognition model for training to obtain a named entity set (S3); and inputting the text vector data and the named entity set into an inference engine in the neural entity inference engine recognition model for inference to obtain a named entity (S4). The method can realize accurate and efficient named-entity recognition.
Description
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 02, 2019, with application number 201910825074.1 and invention title "Named Entity Recognition Method, Device and Computer-readable Storage Medium", the entire content of which is incorporated into this application by reference.
This application relates to the field of artificial intelligence technology, and in particular to a method, device, and computer-readable storage medium for identifying named entities in a text data set.
With the development of today's Internet, the amount of information in people's lives keeps growing, and most of it is text. Therefore, how to process text information and identify the named entities in it, such as person names, organization names, and place names, so as to simplify the extraction of text information, is a major problem. However, current entity recognition is mainly based on traditional neural entity reasoning methods; because such methods rely too heavily on local and low-level language features, they often run into difficulty when ambiguous phrasings or rare person names appear.
Summary of the invention
This application provides a named entity recognition method, device, and computer-readable storage medium, whose main purpose is to apply deep learning to an original text data set to obtain named entities.
To achieve the above purpose, the named entity recognition method provided by this application includes:
receiving first text data composed of original sentences to be recognized, and preprocessing the first text data to obtain text vector data;
obtaining a neural entity inference engine recognition model with a multi-layer structure;
inputting the text vector data into the neural entity inference engine recognition model for training to obtain a named entity set;
inputting the text vector data and the named entity set into the inference engine in the neural entity inference engine recognition model for inference to obtain a named entity.
In addition, to achieve the above purpose, this application also provides a named entity recognition device, which includes a memory and a processor, the memory stores a named entity recognition program that can run on the processor, and when the named entity recognition program is executed by the processor, the following steps are implemented:
receiving first text data composed of original sentences to be recognized, and preprocessing the first text data to obtain text vector data;
obtaining a neural entity inference engine recognition model with a multi-layer structure;
inputting the text vector data into the neural entity inference engine recognition model for training to obtain a named entity set;
inputting the text vector data and the named entity set into the inference engine in the neural entity inference engine recognition model for inference to obtain a named entity.
In addition, to achieve the above purpose, this application also provides a computer-readable storage medium having a named entity recognition program stored thereon, and the named entity recognition program can be executed by one or more processors to realize the steps of the named entity recognition method described above.
The named entity recognition method, device, and computer-readable storage medium described in this application apply deep learning technology. The neural entity inference engine recognition model includes a multi-layer structure; each layer can independently complete one pass of named entity recognition, and each layer's recognition result serves as a reference for the next layer, so that the optimal recognition result can be obtained through the inference engine. The named entity recognition of each layer can share parameters in most cases. Therefore, the named entity recognition method, device, and computer-readable storage medium proposed in this application can achieve precise, efficient, and consistent named entity recognition.
FIG. 1 is a schematic flowchart of a named entity recognition method provided by an embodiment of this application;
FIG. 2 is a schematic diagram of the internal structure of a named entity recognition device provided by an embodiment of this application;
FIG. 3 is a schematic diagram of the modules of a named entity recognition program provided by an embodiment of this application.
The realization of the purpose, functional characteristics, and advantages of this application will be further described in conjunction with the embodiments and with reference to the accompanying drawings.
It should be understood that the specific embodiments described here are only used to explain this application, not to limit it.
This application provides a named entity recognition method. Referring to FIG. 1, which is a schematic flowchart of a named entity recognition method provided by an embodiment of this application. The method can be executed by a device, and the device can be implemented by software and/or hardware.
In this embodiment, the named entity recognition method includes:
S1. Receive first text data composed of original sentences to be recognized, and preprocess the first text data to obtain text vector data.
In a preferred embodiment of this application, the preprocessing includes operations such as word segmentation, stop word removal, and deduplication on the first text data.
Specifically, this application performs a word segmentation operation on the first text data to obtain second text data, performs a stop word removal operation on the second text data to obtain third text data, performs a deduplication operation on the third text data to obtain fourth text data, and converts the fourth text data into word vector form using the TF-IDF algorithm, so as to obtain the text vector data after preprocessing is completed.
This application collects a large number of original sentences to be recognized to form the first text data. Text data is unstructured or semi-structured data and cannot be directly recognized by a classification algorithm; the purpose of preprocessing is to convert the text data into a vector space model:

  d_i = (w_1, w_2, …, w_n)

where w_j is the weight of the j-th feature word.
The word segmentation is to segment each sentence in the original sentences to obtain single words; because there is no clear separation mark between words in Chinese, word segmentation is indispensable. For Chinese text, words have the ability to truly reflect the content of the document, so words are usually used as text feature words in the vector space model. However, Chinese text does not separate words with spaces the way English text does, so the Chinese text needs to be segmented first.
Preferably, the word segmentation described in this application can adopt a dictionary-based word segmentation method: the Chinese character string to be segmented is matched against the entries in a preset dictionary according to a certain strategy, such as a traversal operation, to obtain the final word segmentation result.
Specifically, the dictionary may include a statistical dictionary, which is a dictionary constructed from all possible word segments obtained by statistical methods. Further, the dictionary may also include a prefix dictionary, which includes the prefixes of each word segment in the statistical dictionary. For example, the prefixes of the word "北京大学" in the statistical dictionary are "北", "北京", and "北京大"; the prefix of the word "大学" is "大", and so on.
The removal of stop words deletes function words in the text data that have no actual meaning and no effect on the classification of the text but appear frequently, including commonly used pronouns, prepositions, and the like. In the embodiment of this application, the selected method for removing stop words is stop word list filtering, that is, a pre-built stop word list is matched one by one against the words in the text data; if the matching succeeds, the word is a stop word and needs to be deleted.
Furthermore, since the collected first text data comes from intricate sources, it may contain many duplicate text data, and a large amount of repeated data affects classification accuracy, so a deduplication operation needs to be performed. The embodiment of this application uses the Euclidean distance method for deduplication, with the following formula:

  d = sqrt( Σ_j (w_1j − w_2j)² )

where w_1j and w_2j are the feature weights of the two first text data being compared, and d is the Euclidean distance. After the Euclidean distance of every pair of first text data is calculated, the smaller the Euclidean distance, the more similar the text data; one of any two first text data whose Euclidean distance is less than a preset threshold is deleted.
After word segmentation, stop word removal, and deduplication, the text is represented by a series of feature words (keywords). Data in this textual form cannot be directly processed by a classification algorithm and should be converted into numerical form, so weights need to be calculated for these feature words to characterize their importance in the text.
In some embodiments of this application, the TF-IDF algorithm is used to compute the feature words, and the data obtained after word segmentation, stop word removal, and deduplication is processed to obtain the text vector data. The TF-IDF algorithm uses statistical information, word vector information, and dependency syntax information between words, builds a dependency graph to calculate the correlation strength between words, and uses the TextRank algorithm to iteratively calculate the importance score of each word.
Specifically, when computing feature word weights, this application first calculates the dependency relatedness Dep(W_i, W_j) of any two words (keywords) W_i and W_j, where len(W_i, W_j) represents the length of the dependency path between the words W_i and W_j, and b is a hyperparameter.
This application considers that the semantic similarity between two words cannot accurately measure their importance; only when at least one of the two words appears with high frequency in the text can the two words be shown to be important. Following the concept of universal gravitation, word frequency is regarded as mass and the Euclidean distance between the word vectors of the two words as distance, and the attraction between the two words is calculated by the gravitation formula. However, in the current text environment, using word frequency alone to measure the importance of a word in a text is too one-sided, so this application introduces the IDF value and replaces word frequency with the TF-IDF value, thereby taking more global information into account and obtaining a new word-gravity formula. The gravity between text words W_i and W_j is:

  f_grav(W_i, W_j) = tfidf(W_i) · tfidf(W_j) / d²

where tfidf(W) is the TF-IDF value of word W, and d is the Euclidean distance between the word vectors of W_i and W_j.
Therefore, the degree of relatedness between the two words is:

  weight(W_i, W_j) = Dep(W_i, W_j) · f_grav(W_i, W_j)

Finally, this application uses the TextRank algorithm to build an undirected graph G = (V, E), where V is the set of vertices and E is the set of edges, and calculates the score of word W_i according to:

  WS(W_i) = (1 − η) + η · Σ_{W_j ∈ C(W_i)} [ weight(W_j, W_i) / Σ_{W_k ∈ Out(W_j)} weight(W_j, W_k) ] · WS(W_j)

where C(W_i) is the set of vertices related to vertex W_i and η is the damping coefficient. This yields the feature weight WS(W_i), and each word is thereby expressed in numeric vector form, that is, the text vector data is obtained.
S2、获取具有多层结构的神经实体推理机识别模型。S2. Obtain a neural entity inference engine recognition model with a multi-layer structure.
Preferably, the neural entity inference engine recognition model in this application is a multi-layer architecture in which each layer is an encoding-decoding Bi-LSTM model. Each layer independently completes one pass of named entity neural reasoning, and the reasoning result of each layer is stored in a symbolic cache as a reference for the next layer; this reference is realized through an interactive pooling neural network and is in essence a reasoning model based on multiple real-time facts. To better illustrate the model workflow, this application analyzes the structure of the named entity neural reasoning model using the demonstration text "Dong met Tao and Wiener John met the family of Tao". The named entities actually contained in this sentence are the four words "John", "Tao", "Dong" and "Wiener". Before the named entity neural reasoning model of this application is trained, the candidate pool in the first layer of the model is empty, because no initial named entity has yet been recognized. At this point the model recognizes "John" as a named entity, because "John" is a common person name: it appears frequently as a person name in conventional training corpora, so it is easy to match and recognize as a named entity. In this first recognition pass, "Tao" may be missed. First, "Tao" is not a common person name, so it does not appear frequently as a person name in the training corpus; second, the context "met the family" is not sufficient to characterize "Tao" as a person name, so the model lacks a strong enough signal to recognize "Tao" correctly. After this pass, the model stores the information of "John" in the candidate pool as initial named entity information. In the second layer, the model can then reason with the inference engine. The principle of the reasoning is as follows: from the stored information about "John", the model knows that the word before "met" can be a person name; the inference engine can therefore infer that "John" and the first "Tao" are consistent in sentence logic and grammatical position, conclude that "Tao" is also a person name, and update the candidate pool by storing "Tao" as an initial named entity. In the same way, in the third layer the inference engine of the neural entity inference engine recognition model recognizes that "Wiener" occupies the same sentence logic and grammatical position as the aforementioned "Tao" and is likewise a person name, and recognizes it as a named entity. Through multi-layer training, all word units in the text to be recognized are processed, all named entities contained in the text are finally recognized, and the named entity recognition process of the entire neural entity inference engine is completed.
Preferably, in the embodiment of this application, the preprocessed text vector data is encoded into a sequence of encoded representations, and the decoder of each layer can then independently produce predicted labels based on the word representations and the context information they generate. Since the predicted labels indicate which words are entities, the entity representations can be found from the predicted labels. At the same time, the model of this application keeps a record of the entire recognition process of the neural entity inference engine, including the information of the entities already recognized, so that the model can "see" all past decisions; each layer can then cite this record through the inference engine and update the candidate pool, so that the prediction results help the next layer maintain global consistency and obtain better results.
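The layered control flow described above can be summarized in a short sketch. The following is a minimal illustration, not the patented implementation itself; the helpers `encode`, `decode_layer` and `extract_entities` are hypothetical stand-ins for the Bi-LSTM encoder, the per-layer decoder and the BMEOS boundary reader:

```python
# Minimal sketch of the layered recognition loop with a shared candidate pool.
# The three callables are hypothetical stand-ins; the patent does not fix
# their signatures.

def recognize(tokens, encode, decode_layer, extract_entities, num_layers=3):
    """Run several recognition layers, each seeing the entities found so far."""
    encoded = encode(tokens)          # one shared encoded representation sequence
    candidate_pool = []               # empty before the first layer
    for _ in range(num_layers):
        labels = decode_layer(encoded, candidate_pool)   # e.g. BMEOS labels
        for entity in extract_entities(tokens, labels):
            if entity not in candidate_pool:             # cache new decisions
                candidate_pool.append(entity)            # reference for next layer
    return candidate_pool
```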
S3. Input the text vector data into the neural entity inference engine recognition model for training to obtain a named entity set.
Preferably, inputting the text vector data into the neural entity inference engine recognition model for training to obtain the named entity set includes the following steps:
Step S301: Encode the text vector data with the Bi-LSTM model to obtain a sequence of encoded representations.
In the embodiment of this application, one layer of the neural entity inference engine recognition model can be regarded as a regular encoder-decoder framework that can receive additional information from the inference engine. In this work, the model uses a Bi-LSTM model as the encoder and an LSTM model as the decoder. The candidate pool is a simple list consisting of the encoded representation sequences of named entities; it can contain all named entities recognized in the whole text or in the results so far. The decoders and encoders of all layers can share parameters, which avoids parameter growth and makes the model easy to train end to end; the only difference between layers is therefore the contents of the candidate pool and the recognized named entities.
The LSTM model is designed to solve the problems of vanishing gradients and learning long-term dependencies. Formally, at time step t, the memory c_t and hidden state h_t of a basic LSTM unit are updated as follows:

i_t = σ(W_i·x_t + U_i·h_{t−1} + b_i)
f_t = σ(W_f·x_t + U_f·h_{t−1} + b_f)
o_t = σ(W_o·x_t + U_o·h_{t−1} + b_o)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c·x_t + U_c·h_{t−1} + b_c)
h_t = o_t ⊙ tanh(c_t)

where ⊙ denotes the element-wise product, σ is the sigmoid function, x_t is the vector input at time t, and i_t, f_t, o_t denote the input gate, forget gate and output gate at step t, respectively. An LSTM only receives the information before the current input word, but in sequential tagging tasks the following context is also important. To capture context information from both the past and the future, this application encodes the input with a Bi-LSTM model according to the following rule, from which the sequence of encoded representations is obtained:

→h_t = LSTM(x_t, →h_{t−1})
←h_t = LSTM(x_t, ←h_{t+1})
h_t = [→h_t ; ←h_t]

where →h_t denotes the forward hidden state of the LSTM model and ←h_t denotes the backward hidden state.
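As a concrete illustration of this encoder, the following PyTorch sketch builds the bidirectional encoding h_t = [→h_t ; ←h_t]; the layer sizes are illustrative assumptions rather than values fixed by this application:

```python
# A minimal PyTorch sketch of the Bi-LSTM encoder described above.
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, input_dim=100, hidden_dim=128):
        super().__init__()
        # bidirectional=True runs a forward and a backward LSTM and
        # concatenates their hidden states, i.e. h_t = [->h_t ; <-h_t]
        self.lstm = nn.LSTM(input_dim, hidden_dim,
                            bidirectional=True, batch_first=True)

    def forward(self, x):              # x: (batch, seq_len, input_dim)
        encoded, _ = self.lstm(x)      # encoded: (batch, seq_len, 2*hidden_dim)
        return encoded                 # the encoded representation sequence

# Example: encode a batch of 2 sentences of 10 word vectors each.
vectors = torch.randn(2, 10, 100)
sequence = BiLSTMEncoder()(vectors)    # shape (2, 10, 256)
```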
Step S302: Input the sequence of encoded representations and the initial named entities in the candidate pool into the inference engine for processing to obtain reference information.
The inference engine is a set of programs used to control and coordinate the whole system. Under a certain control strategy, an expert system uses it to solve a problem according to the problem information (the information exchanged between the user and the expert system) and the knowledge in the knowledge base. That is, after the target engine sets a target object, the inference engine takes external information as input and, using logical operations such as deduction and induction, performs calculations on the target object according to established pattern matching to generate a conclusion.
Preferably, the inference engine in this embodiment is actually a multi-fact reasoning model, in which the current encoded representation sequence information is the query and the initial named entity information in the candidate pool constitutes the facts. This embodiment uses a kernel K(query, fact) to calculate the relationship between the current encoded representation sequence information and each word whose initial named entity information is in the candidate pool; the calculation result s = {s_1, s_2, s_3, ..., s_n} represents a suggestion for each piece of initial named entity information, from which the inference engine then derives the reference information.
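The suggestion computation can be sketched as follows. This application only names the kernel K(query, fact); the concrete form used here, a normalized dot product, is an assumption for illustration:

```python
# Sketch of the multi-fact suggestion step with an assumed dot-product kernel.
import torch
import torch.nn.functional as F

def suggestions(query, facts):
    """query: (dim,) encoding of the current word; facts: (n, dim) cached
    entity encodings. Returns s = {s_1, ..., s_n}, one score per entity."""
    scores = facts @ query                  # K(query, fact) as a dot product
    return F.softmax(scores, dim=0)         # normalized suggestion weights

def reference_information(query, facts):
    """Aggregate the facts, weighted by their suggestions, into one reference
    vector that can be fed to the decoder alongside the encoding."""
    s = suggestions(query, facts)
    return s @ facts                        # (dim,) reference information
```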
Step S303: Input the sequence of encoded representations and the reference information into the decoder to obtain predicted labels, and update the candidate pool according to the predicted labels to obtain the named entity set.
Preferably, since the embodiment of this application uses the Bi-LSTM model, a good predicted label y_i can be obtained. This application also adopts the BMEOS (Begin, Middle, End, Other, Single) tagging scheme, so that the start and end of each named entity can be read from the predicted labels y_i to form boundary information, which is then used to organize and build the document cache. Since the model relies on local language features to make decisions, this application considers how to store named entity information more reasonably and effectively on this basis. In the embodiment of this application, a named entity is regarded as an independent, indivisible object composed of several words, so the pattern in which an entity appears can be described as: [forward context][entity][backward context]. This application therefore stores entities in this pattern.
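A minimal sketch of reading entity boundaries out of a BMEOS label sequence and forming the [forward context][entity][backward context] records might look as follows; the context window size is an illustrative assumption:

```python
# Sketch of turning BMEOS labels into boundary information and context records.
def bmeos_spans(tokens, labels, window=2):
    """Yield (forward context, entity words, backward context) triples."""
    i = 0
    while i < len(labels):
        if labels[i] == "S":                       # single-word entity
            start, end = i, i + 1
        elif labels[i] == "B":                     # multi-word entity: B M* E
            start = i
            while i < len(labels) and labels[i] != "E":
                i += 1
            end = i + 1
        else:                                      # "O", or a stray "M"/"E"
            i += 1
            continue
        yield (tokens[max(0, start - window):start],   # forward context
               tokens[start:end],                      # the entity itself
               tokens[end:end + window])               # backward context
        i = end

tokens = ["met", "New", "York", "with", "Tao"]
labels = ["O", "B", "E", "O", "S"]
print(list(bmeos_spans(tokens, labels)))
# [(['met'], ['New', 'York'], ['with', 'Tao']), (['York', 'with'], ['Tao'], [])]
```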
Further, the encoded representation sequence of each entity contains the information that determines its predicted label, and the encoder in the encoding layer is the combination of the forward hidden state →h_t and the backward hidden state ←h_t. This application therefore stores the obtained predicted labels in the candidate pool to provide decisive information for the inference engine to produce its inference results. Based on the candidate pool, this application in effect stores each entity as an object with three descriptions, so for each word to be predicted, the similarity between the current word and the candidate pool can be used as a reference from three aspects to make better decisions. Each matrix in the candidate pool is actually a list of vector representations that also contains facts about partial entity information, from which this application can obtain suggestions using a special multi-fact reasoning model.
Further, in the decoder, X represents the preprocessed text vector data, y_i represents the predicted label of the i-th layer of the neural entity inference engine recognition model, and x_t represents the value of the text vector x at time t.
Further, in this embodiment, the named entity recognition model based on neural entity reasoning of each layer can share parameters in most cases, which makes the model of this application truly end to end.
Therefore, the candidate pool is updated in real time according to the predicted labels to obtain the named entity set.
S4. Input the text vector data and the named entity set into the inference engine of the neural entity inference engine recognition model to perform reasoning and obtain the named entities.
In this embodiment, a stable named entity neural inference engine is obtained by inputting the text vector data into the neural entity inference engine recognition model for training.
At the same time, according to this neural entity inference engine recognition model, the text data of the original sentences to be recognized is input and passed through the multi-layer neural entity inference engine recognition model to obtain the corresponding initial named entities, which form the named entity set.
This application uses the inference engine of the trained neural entity inference engine recognition model to reason over the text vector data and the named entity set to obtain the named entities.
The present application also provides a named entity recognition device. FIG. 2 is a schematic diagram of the internal structure of a named entity recognition device provided by an embodiment of this application.
In this embodiment, the named entity recognition device 1 may be a PC (Personal Computer), a terminal device such as a smartphone, tablet computer or portable computer, or a server. The named entity recognition device 1 includes at least a memory 11, a processor 12, a communication bus 13 and a network interface 14.
The memory 11 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disk, optical disk, and the like. In some embodiments, the memory 11 may be an internal storage unit of the named entity recognition device 1, for example a hard disk of the named entity recognition device 1. In other embodiments, the memory 11 may also be an external storage device of the named entity recognition device 1, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card equipped on the named entity recognition device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the named entity recognition device 1. The memory 11 can be used not only to store application software installed in the named entity recognition device 1 and various kinds of data, such as the code of the named entity recognition program 01, but also to temporarily store data that has been or will be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor or other data processing chip, used to run program code stored in the memory 11 or process data, for example to execute the named entity recognition program 01.
The communication bus 13 is used to realize connection and communication between these components.
The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is usually used to establish a communication connection between the device 1 and other electronic devices.
Optionally, the device 1 may also include a user interface. The user interface may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be appropriately called a display screen or display unit, and is used to display the information processed in the named entity recognition device 1 based on the neural entity inference engine and to display a visualized user interface.
FIG. 2 only shows the named entity recognition device 1 with the components 11-14 and the named entity recognition program 01 based on the neural entity inference engine. Those skilled in the art will understand that the structure shown in FIG. 2 does not constitute a limitation of the named entity recognition device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
In the embodiment of the device 1 shown in FIG. 2, the named entity recognition program 01 is stored in the memory 11; when the processor 12 executes the named entity recognition program 01 stored in the memory 11, the following steps are implemented:
Step 1: Receive first text data composed of the original sentences to be recognized, and preprocess the first text data to obtain text vector data.
In a preferred embodiment of this application, the preprocessing includes operations such as word segmentation, stop-word removal and deduplication on the first text data.
Specifically, this application performs a word segmentation operation on the first text data to obtain second text data, performs a stop-word removal operation on the second text data to obtain third text data, and performs a deduplication operation on the third text data to obtain fourth text data; the fourth text data is then converted into word vector form using the TF-IDF algorithm, thereby obtaining the preprocessed text vector data.
This application collects a large number of original sentences to be recognized to form the first text data. Text data is unstructured or semi-structured data that cannot be directly processed by a classification algorithm; the purpose of the preprocessing is to convert the text data into a vector space model: d_i = (w_1, w_2, ..., w_n), where w_j is the weight of the j-th feature word.
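As an illustration of this vector space model, the following sketch converts a few pre-segmented sentences into TF-IDF weight vectors d_i = (w_1, ..., w_n) using scikit-learn; the sample sentences are hypothetical:

```python
# Sketch of building the vector space model d_i = (w_1, ..., w_n) with TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["张三 参观 北京大学", "李四 毕业 北京大学"]     # pre-segmented sentences
vectorizer = TfidfVectorizer()                          # w_j = tf-idf of word j
vectors = vectorizer.fit_transform(docs)                # one row per document
print(vectorizer.get_feature_names_out())               # the feature words
print(vectors.toarray())                                # the d_i weight vectors
```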
Word segmentation splits each sentence of the original sentences into individual words. Because there is no explicit separator between words in written Chinese, word segmentation is indispensable. For Chinese text, words have the ability to truly reflect the content of a document, so words are usually used as the text feature terms in the vector space model. However, unlike English text, Chinese text does not separate words with spaces, so the Chinese text must first be segmented.
Preferably, the word segmentation in this application may adopt a dictionary-based segmentation method, in which the Chinese character string to be segmented is matched against the entries of a preset dictionary according to a certain strategy, such as a traversal operation, to obtain the final segmentation result.
Specifically, the dictionary may include a statistical dictionary, i.e., a dictionary constructed from all possible word segments obtained by statistical methods. Further, the dictionary may also include a prefix dictionary, which contains the prefixes of every word segment in the statistical dictionary; for example, the prefixes of the word "北京大学" (Peking University) in the statistical dictionary are "北", "北京" and "北京大", the prefix of the word "大学" (university) is "大", and so on.
Stop-word removal deletes function words in the text data that carry no actual meaning and have no effect on the classification of the text but occur with high frequency, including common pronouns, prepositions, and the like. In the embodiment of this application, the selected stop-word removal method is stop-word list filtering: the words in the text data are matched one by one against a pre-built stop-word list, and if a match succeeds, the word is a stop word and is deleted.
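The segmentation and stop-word filtering steps can be sketched together as follows. The greedy longest-match strategy, the toy dictionary and the toy stop-word list are illustrative assumptions, since this application leaves the exact matching strategy open:

```python
# Sketch of dictionary-based segmentation (greedy longest match over a prefix
# dictionary) followed by stop-word list filtering.
PREFIX_DICT = {"北", "北京", "北京大", "北京大学", "大", "大学", "的"}
WORDS = {"北京大学", "大学", "的"}          # complete entries in the dictionary
STOP_WORDS = {"的"}                        # common function words to drop

def segment(text):
    """Greedily match the longest dictionary word starting at each position."""
    tokens, i = [], 0
    while i < len(text):
        j = i + 1                              # fall back to a single character
        for k in range(i + 1, len(text) + 1):
            if text[i:k] not in PREFIX_DICT:   # no entry extends this prefix
                break
            if text[i:k] in WORDS:
                j = k                          # longest complete word so far
        tokens.append(text[i:j])
        i = j
    return tokens

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(segment("北京大学的大学")))   # ['北京大学', '大学']
```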
Further, since the sources of the collected first text data are intricate, the data may contain many duplicate text items. A large amount of duplicate data affects classification accuracy, so a deduplication operation must be performed. The embodiment of this application uses the Euclidean distance method for deduplication, with the following formula:

d = √( Σ_{j=1}^{n} (w_{1j} − w_{2j})² )

where w_{1j} and w_{2j} are two items of first text data and d is the Euclidean distance. After the Euclidean distance of every pair of first text data items is calculated, a smaller Euclidean distance indicates more similar text data, and one of the two items whose Euclidean distance is below a preset threshold is deleted.
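A minimal sketch of this deduplication step, with the distance threshold chosen arbitrarily for illustration:

```python
# Sketch of Euclidean-distance deduplication over the text vectors.
import numpy as np

def deduplicate(vectors, texts, threshold=0.1):
    """Drop one of every pair of texts whose vectors are closer than threshold."""
    keep = []
    for i, v in enumerate(vectors):
        # compare only against the items we have already decided to keep
        duplicate = any(np.linalg.norm(v - vectors[j]) < threshold for j in keep)
        if not duplicate:
            keep.append(i)
    return [texts[i] for i in keep]
```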
After word segmentation, stop-word removal and deduplication, the text is represented by a series of feature words (keywords). Data in this textual form cannot be directly processed by a classification algorithm and must be converted into numerical form; a weight is therefore computed for each feature word to characterize its importance in the text.
In some embodiments of this application, the TF-IDF algorithm is used to compute the feature words, and the data obtained after the word segmentation, stop-word removal and deduplication operations is preprocessed into the text vector data. The TF-IDF algorithm uses statistical information, word vector information and the dependency syntax information between words; it builds a dependency graph to calculate the strength of association between words and uses the TextRank algorithm to iteratively compute the importance score of each word.
Specifically, when computing the weights of the feature words, this application first computes the dependency relatedness Dep(W_i, W_j) of any two words (keywords) W_i and W_j, where len(W_i, W_j) represents the length of the dependency path between W_i and W_j, and b is a hyperparameter.
This application holds that the semantic similarity between two words alone cannot accurately measure their importance; two words can only be shown to be important if at least one of them appears with high frequency in the text. Following the concept of universal gravitation, the word frequency is treated as mass and the Euclidean distance between the word vectors of the two words as distance, and the attraction between the two words is computed according to the law of gravitation. However, in the present text setting, using word frequency alone to measure the importance of a word is too one-sided, so this application introduces the IDF value and replaces the word frequency with the TF-IDF value, thereby taking more global information into account and obtaining a new word-gravity formula. The gravity between the text words W_i and W_j is:

f_grav(W_i, W_j) = tfidf(W_i) · tfidf(W_j) / d²

where tfidf(W) is the TF-IDF value of the word W, and d is the Euclidean distance between the word vectors of W_i and W_j.
Therefore, the degree of association between the two words is:

weight(W_i, W_j) = Dep(W_i, W_j) · f_grav(W_i, W_j)
Finally, this application uses the TextRank algorithm to build an undirected graph G = (V, E), where V is the set of vertices and E is the set of edges, and computes the score of a word W_i according to:

WS(W_i) = (1 − η) + η · Σ_{W_j ∈ C(W_i)} [ weight(W_j, W_i) / Σ_{W_k ∈ C(W_j)} weight(W_j, W_k) ] · WS(W_j)

where C(W_i) is the set of vertices connected to the vertex W_i and η is the damping coefficient. This yields the feature weight WS(W_i), and each word is therefore expressed as a numeric vector, i.e., the text vector data is obtained.
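The weighting and scoring pipeline above can be sketched as follows. Dep(W_i, W_j) enters only through the precomputed weight matrix because its closed form is not restated here, and the damping coefficient and iteration count are illustrative assumptions:

```python
# Sketch of the combined word-gravity weighting and TextRank iteration.
import numpy as np

def gravity(tfidf_i, tfidf_j, vec_i, vec_j):
    d = np.linalg.norm(vec_i - vec_j)             # distance between word vectors
    return tfidf_i * tfidf_j / (d ** 2 + 1e-8)    # f_grav, epsilon for stability

def textrank(words, weight, eta=0.85, iters=30):
    """weight[i][j] = Dep(W_i, W_j) * f_grav(W_i, W_j) on the undirected graph;
    returns the converged scores WS(W_i)."""
    n = len(words)
    ws = np.ones(n)                               # initial scores WS(W_i)
    out_sum = weight.sum(axis=1) + 1e-8           # each vertex's total edge weight
    for _ in range(iters):
        # WS(W_i) = (1 - eta) + eta * sum_j weight[j][i]/out_sum[j] * WS(W_j)
        ws = (1 - eta) + eta * weight.T @ (ws / out_sum)
    return dict(zip(words, ws))
```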
Step 2: Obtain a neural entity inference engine recognition model with a multi-layer structure.
Preferably, the neural entity inference engine recognition model in this application is a multi-layer architecture in which each layer is an encoding-decoding Bi-LSTM model. Each layer independently completes one pass of named entity neural reasoning, and the reasoning result of each layer is stored in a symbolic cache as a reference for the next layer; this reference is realized through an interactive pooling neural network and is in essence a reasoning model based on multiple real-time facts. To better illustrate the model workflow, this application analyzes the structure of the named entity neural reasoning model using the demonstration text "Dong met Tao and Wiener John met the family of Tao". The named entities actually contained in this sentence are the four words "John", "Tao", "Dong" and "Wiener". Before the named entity neural reasoning model of this application is trained, the candidate pool in the first layer of the model is empty, because no initial named entity has yet been recognized. At this point the model recognizes "John" as a named entity, because "John" is a common person name: it appears frequently as a person name in conventional training corpora, so it is easy to match and recognize as a named entity. In this first recognition pass, "Tao" may be missed. First, "Tao" is not a common person name, so it does not appear frequently as a person name in the training corpus; second, the context "met the family" is not sufficient to characterize "Tao" as a person name, so the model lacks a strong enough signal to recognize "Tao" correctly. After this pass, the model stores the information of "John" in the candidate pool as initial named entity information. In the second layer, the model can then reason with the inference engine. The principle of the reasoning is as follows: from the stored information about "John", the model knows that the word before "met" can be a person name; the inference engine can therefore infer that "John" and the first "Tao" are consistent in sentence logic and grammatical position, conclude that "Tao" is also a person name, and update the candidate pool by storing "Tao" as an initial named entity. In the same way, in the third layer the inference engine of the neural entity inference engine recognition model recognizes that "Wiener" occupies the same sentence logic and grammatical position as the aforementioned "Tao" and is likewise a person name, and recognizes it as a named entity. Through multi-layer training, all word units in the text to be recognized are processed, all named entities contained in the text are finally recognized, and the named entity recognition process of the entire neural entity inference engine is completed.
Preferably, in the embodiment of this application, the preprocessed text vector data is encoded into a sequence of encoded representations, and the decoder of each layer can then independently produce predicted labels based on the word representations and the context information they generate. Since the predicted labels indicate which words are entities, the entity representations can be found from the predicted labels. At the same time, the model of this application keeps a record of the entire recognition process of the neural entity inference engine, including the information of the entities already recognized, so that the model can "see" all past decisions; each layer can then cite this record through the inference engine and update the candidate pool, so that the prediction results help the next layer maintain global consistency and obtain better results.
Step 3: Input the text vector data into the neural entity inference engine recognition model for training to obtain a named entity set.
Preferably, inputting the text vector data into the neural entity inference engine recognition model for training to obtain the named entity set includes the following steps:
In the first step, the text vector data is encoded with the Bi-LSTM model to obtain a sequence of encoded representations.
In the embodiment of this application, one layer of the neural entity inference engine recognition model can be regarded as a regular encoder-decoder framework that can receive additional information from the inference engine. In this work, the model uses a Bi-LSTM model as the encoder and an LSTM model as the decoder. The candidate pool is a simple list consisting of the encoded representation sequences of named entities; it can contain all named entities recognized in the whole text or in the results so far. The decoders and encoders of all layers can share parameters, which avoids parameter growth and makes the model easy to train end to end; the only difference between layers is therefore the contents of the candidate pool and the recognized named entities.
The LSTM model is designed to solve the problems of vanishing gradients and learning long-term dependencies. Formally, at time step t, the memory c_t and hidden state h_t of a basic LSTM unit are updated as follows:

i_t = σ(W_i·x_t + U_i·h_{t−1} + b_i)
f_t = σ(W_f·x_t + U_f·h_{t−1} + b_f)
o_t = σ(W_o·x_t + U_o·h_{t−1} + b_o)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c·x_t + U_c·h_{t−1} + b_c)
h_t = o_t ⊙ tanh(c_t)

where ⊙ denotes the element-wise product, σ is the sigmoid function, x_t is the vector input at time t, and i_t, f_t, o_t denote the input gate, forget gate and output gate at step t, respectively. An LSTM only receives the information before the current input word, but in sequential tagging tasks the following context is also important. To capture context information from both the past and the future, this application encodes the input with a Bi-LSTM model according to the following rule, from which the sequence of encoded representations is obtained:

→h_t = LSTM(x_t, →h_{t−1})
←h_t = LSTM(x_t, ←h_{t+1})
h_t = [→h_t ; ←h_t]

where →h_t denotes the forward hidden state of the LSTM model and ←h_t denotes the backward hidden state.
In the second step, the sequence of encoded representations and the initial named entities in the candidate pool are input into the inference engine for processing to obtain reference information.
The inference engine is a set of programs used to control and coordinate the whole system. Under a certain control strategy, an expert system uses it to solve a problem according to the problem information (the information exchanged between the user and the expert system) and the knowledge in the knowledge base. That is, after the target engine sets a target object, the inference engine takes external information as input and, using logical operations such as deduction and induction, performs calculations on the target object according to established pattern matching to generate a conclusion.
Preferably, the inference engine in this embodiment is actually a multi-fact reasoning model, in which the current encoded representation sequence information is the query and the initial named entity information in the candidate pool constitutes the facts. This embodiment uses a kernel K(query, fact) to calculate the relationship between the current encoded representation sequence information and each word whose initial named entity information is in the candidate pool; the calculation result s = {s_1, s_2, s_3, ..., s_n} represents a suggestion for each piece of initial named entity information, from which the inference engine then derives the reference information.
In the third step, the sequence of encoded representations and the reference information are input into the decoder to obtain predicted labels, and the candidate pool is updated according to the predicted labels to obtain the named entity set.
Preferably, since the embodiment of this application uses the Bi-LSTM model, a good predicted label y_i can be obtained. This application also adopts the BMEOS (Begin, Middle, End, Other, Single) tagging scheme, so that the start and end of each named entity can be read from the predicted labels y_i to form boundary information, which is then used to organize and build the document cache. Since the model relies on local language features to make decisions, this application considers how to store named entity information more reasonably and effectively on this basis. In the embodiment of this application, a named entity is regarded as an independent, indivisible object composed of several words, so the pattern in which an entity appears can be described as: [forward context][entity][backward context]. This application therefore stores entities in this pattern.
Further, the encoded representation sequence of each entity contains the information that determines its predicted label, and the encoder in the encoding layer is the combination of the forward hidden state →h_t and the backward hidden state ←h_t. This application therefore stores the obtained predicted labels in the candidate pool to provide decisive information for the inference engine to produce its inference results. Based on the candidate pool, this application in effect stores each entity as an object with three descriptions, so for each word to be predicted, the similarity between the current word and the candidate pool can be used as a reference from three aspects to make better decisions. Each matrix in the candidate pool is actually a list of vector representations that also contains facts about partial entity information, from which this application can obtain suggestions using a special multi-fact reasoning model.
Further, in the decoder, X represents the preprocessed text vector data, y_i represents the predicted label of the i-th layer of the neural entity inference engine recognition model, and x_t represents the value of the text vector x at time t.
Further, in this embodiment, the named entity recognition model based on neural entity reasoning of each layer can share parameters in most cases, which makes the model of this application truly end to end.
Therefore, the candidate pool is updated in real time according to the predicted labels to obtain the named entity set.
Step 4: Input the text vector data and the named entity set into the inference engine of the neural entity inference engine recognition model to perform reasoning and obtain the named entities.
In this embodiment, a stable named entity neural inference engine is obtained by inputting the text vector data into the neural entity inference engine recognition model for training.
At the same time, according to this neural entity inference engine recognition model, the text data of the original sentences to be recognized is input and passed through the multi-layer neural entity inference engine recognition model to obtain the corresponding initial named entities, which form the named entity set.
This application uses the inference engine of the trained neural entity inference engine recognition model to reason over the text vector data and the named entity set to obtain the named entities.
Optionally, in other embodiments, the named entity recognition program may also be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to complete this application. The module referred to in this application is a series of computer program instruction segments capable of completing a specific function, and is used to describe the execution process of the named entity recognition program in the named entity recognition device.
For example, FIG. 3 is a schematic diagram of the modules of the named entity recognition program in an embodiment of the named entity recognition device of this application. In this embodiment, the named entity recognition program can be divided into a data receiving and processing module 10, a word vector conversion module 20, a model training module 30 and a named entity output module 40. Exemplarily:
The data receiving and processing module 10 is configured to receive the first text data composed of the original sentences to be recognized and to perform operations such as word segmentation, stop-word removal and deduplication on the first text data.
The word vector conversion module 20 is configured to use the TF-IDF algorithm to convert the first text data, after the word segmentation, stop-word removal and deduplication operations, into word vector form, thereby obtaining the text vector data.
The model training module 30 is configured to obtain a neural entity inference engine recognition model with a multi-layer structure, in which each layer is an encoding-decoding Bi-LSTM model and each layer independently completes one pass of named entity neural reasoning, the reasoning result of each layer being stored in a symbolic cache as a reference for the next layer.
The named entity output module 40 is configured to input the text vector data into the neural entity inference engine recognition model for training to obtain a named entity set, and to input the text vector data and the named entity set into the inference engine of the neural entity inference engine recognition model for reasoning to obtain the named entities.
The functions or operation steps implemented when the program modules such as the data receiving and processing module 10, the word vector conversion module 20, the model training module 30 and the named entity output module 40 are executed are substantially the same as those of the above embodiments and are not repeated here.
In addition, an embodiment of this application also proposes a computer-readable storage medium on which a named entity recognition program is stored; the named entity recognition program can be executed by one or more processors to implement the following operations:
receiving first text data composed of original sentences to be recognized, and preprocessing the first text data to obtain text vector data;
obtaining a neural entity inference engine recognition model with a multi-layer structure;
inputting the text vector data into the neural entity inference engine recognition model for training to obtain a named entity set;
inputting the text vector data and the named entity set into the inference engine of the neural entity inference engine recognition model to perform reasoning and obtain the named entities.
It should be noted that the serial numbers of the above embodiments of this application are for description only and do not represent the relative merits of the embodiments. In this document, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, device, article or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, device, article or method. Without further limitation, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, device, article or method that includes that element.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as a ROM/RAM, magnetic disk or optical disk) and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, network device, or the like) to execute the methods described in the embodiments of this application.
The above are only preferred embodiments of this application and do not limit the patent scope of this application. Any equivalent structure or equivalent process transformation made using the content of the specification and drawings of this application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of this application.
Claims (20)
- A named entity recognition method, characterized in that the method comprises: receiving first text data composed of original sentences to be recognized, and preprocessing the first text data to obtain text vector data; obtaining a neural entity inference engine recognition model with a multi-layer structure; inputting the text vector data into the neural entity inference engine recognition model for training to obtain a named entity set; and inputting the text vector data and the named entity set into the inference engine of the neural entity inference engine recognition model to perform reasoning and obtain named entities.
- The named entity recognition method according to claim 1, characterized in that each layer structure of the neural entity inference engine recognition model is encoded by a Bi-LSTM model and decoded by a decoder, and the decoded data enters the next layer structure to be encoded and decoded again.
- The named entity recognition method according to claim 2, characterized in that inputting the text vector data into the neural entity inference engine recognition model for training to obtain the named entity set comprises: inputting the preprocessed text vector data; encoding the text vector data with the Bi-LSTM model to obtain a sequence of encoded representations and initial named entities, and adding the initial named entities to a candidate pool; inputting the sequence of encoded representations and the initial named entities in the candidate pool into the inference engine for processing to obtain reference information; and inputting the sequence of encoded representations and the reference information into a decoder to obtain predicted labels, and updating the candidate pool according to the predicted labels to obtain the named entity set.
- The named entity recognition method according to claim 3, characterized in that the decoder is defined such that X represents the preprocessed text vector data, y represents the predicted label obtained after the training of the neural entity inference engine recognition model, y_i represents the predicted label of the i-th layer of the neural entity inference engine recognition model, and x_t represents the value of the text vector x at time t.
- The named entity recognition method according to any one of claims 1 to 4, characterized in that preprocessing the first text data to obtain the text vector data comprises: performing a word segmentation operation on the first text data to obtain second text data, performing a stop-word removal operation on the second text data to obtain third text data, and performing a deduplication operation on the third text data to obtain fourth text data; and converting the fourth text data into word vector form using the TF-IDF algorithm to obtain the text vector data.
- The named entity recognition method according to claim 1, characterized in that preprocessing the first text data to obtain the text vector data further comprises: performing a deduplication operation on the first text data using the Euclidean distance method, with the following formula: d = √( Σ_{j=1}^{n} (w_{1j} − w_{2j})² ), where w_{1j} and w_{2j} are two items of first text data and d is the Euclidean distance.
- A named entity recognition device, characterized in that the device comprises a memory and a processor, the memory storing a named entity recognition program executable on the processor, and the named entity recognition program, when executed by the processor, implementing the following steps: receiving first text data composed of original sentences to be recognized, and preprocessing the first text data to obtain text vector data; obtaining a neural entity inference engine recognition model with a multi-layer structure; inputting the text vector data into the neural entity inference engine recognition model for training to obtain a named entity set; and inputting the text vector data and the named entity set into the inference engine of the neural entity inference engine recognition model to perform reasoning and obtain named entities.
- The named entity recognition device according to claim 8, characterized in that each layer structure of the neural entity inference engine recognition model is encoded by a Bi-LSTM model and decoded by a decoder, and the decoded data enters the next layer structure to be encoded and decoded again.
- The named entity recognition device according to claim 9, characterized in that inputting the text vector data into the neural entity inference engine recognition model for training to obtain the named entity set comprises: inputting the preprocessed text vector data; encoding the text vector data with the Bi-LSTM model to obtain a sequence of encoded representations and initial named entities, and adding the initial named entities to a candidate pool; inputting the sequence of encoded representations and the initial named entities in the candidate pool into the inference engine for processing to obtain reference information; and inputting the sequence of encoded representations and the reference information into a decoder to obtain predicted labels, and updating the candidate pool according to the predicted labels to obtain the named entity set.
- The named entity recognition device according to claim 10, characterized in that the decoder is defined such that X represents the preprocessed text vector data, y represents the predicted label obtained after the training of the neural entity inference engine recognition model, y_i represents the predicted label of the i-th layer of the neural entity inference engine recognition model, and x_t represents the value of the text vector x at time t.
- The named entity recognition device according to any one of claims 8 to 11, characterized in that preprocessing the first text data to obtain the text vector data comprises: performing a word segmentation operation on the first text data to obtain second text data, performing a stop-word removal operation on the second text data to obtain third text data, and performing a deduplication operation on the third text data to obtain fourth text data; and converting the fourth text data into word vector form using the TF-IDF algorithm to obtain the text vector data.
- The named entity recognition device according to claim 8, characterized in that preprocessing the first text data to obtain the text vector data further comprises: performing a deduplication operation on the first text data using the Euclidean distance method, with the following formula: d = √( Σ_{j=1}^{n} (w_{1j} − w_{2j})² ), where w_{1j} and w_{2j} are two items of first text data and d is the Euclidean distance.
- A computer-readable storage medium, characterized in that a named entity recognition program is stored on the computer-readable storage medium, and the named entity recognition program can be executed by one or more processors to implement the following steps: receiving first text data composed of original sentences to be recognized, and preprocessing the first text data to obtain text vector data; obtaining a neural entity inference engine recognition model with a multi-layer structure; inputting the text vector data into the neural entity inference engine recognition model for training to obtain a named entity set; and inputting the text vector data and the named entity set into the inference engine of the neural entity inference engine recognition model to perform reasoning and obtain named entities.
- The computer-readable storage medium according to claim 15, characterized in that each layer structure of the neural entity inference engine recognition model is encoded by a Bi-LSTM model and decoded by a decoder, and the decoded data enters the next layer structure to be encoded and decoded again.
- The computer-readable storage medium according to claim 16, characterized in that inputting the text vector data into the neural entity inference engine recognition model for training to obtain the named entity set comprises: inputting the preprocessed text vector data; encoding the text vector data with the Bi-LSTM model to obtain a sequence of encoded representations and initial named entities, and adding the initial named entities to a candidate pool; inputting the sequence of encoded representations and the initial named entities in the candidate pool into the inference engine for processing to obtain reference information; and inputting the sequence of encoded representations and the reference information into a decoder to obtain predicted labels, and updating the candidate pool according to the predicted labels to obtain the named entity set.
- The computer-readable storage medium according to claim 17, characterized in that the decoder is defined such that X represents the preprocessed text vector data, y represents the predicted label obtained after the training of the neural entity inference engine recognition model, y_i represents the predicted label of the i-th layer of the neural entity inference engine recognition model, and x_t represents the value of the text vector x at time t.
- The computer-readable storage medium according to any one of claims 15 to 18, wherein preprocessing the first text data to obtain text vector data comprises: performing a word segmentation operation on the first text data to obtain second text data; performing a stop-word removal operation on the second text data to obtain third text data; performing a deduplication operation on the third text data to obtain fourth text data; and converting the fourth text data into word-vector form using the TF-IDF algorithm to obtain the text vector data (a sketch of this preprocessing chain follows the claims).
- The computer-readable storage medium according to claim 15, wherein preprocessing the first text data to obtain text vector data further comprises: performing a deduplication operation on the first text data using the Euclidean distance method, according to the formula $d = \sqrt{\sum_{j}(w_{1j} - w_{2j})^{2}}$, where w_1j and w_2j are two items of first text data and d is the Euclidean distance (a deduplication sketch follows the claims).
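The preprocessing chain claimed above (segmentation → stop-word removal → deduplication → TF-IDF) can be illustrated with a short sketch. This is a minimal illustration only, not the patented implementation: jieba is assumed for Chinese word segmentation, scikit-learn's TfidfVectorizer stands in for the TF-IDF step, and the stop-word list is a placeholder.

```python
# Sketch of the claimed preprocessing chain (assumed libraries: jieba for
# Chinese word segmentation, scikit-learn for TF-IDF vectorization).
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

STOP_WORDS = {"的", "了", "是"}  # placeholder stop-word list

def preprocess(first_text_data):
    # Word segmentation on the first text data -> second text data
    second = [jieba.lcut(sentence) for sentence in first_text_data]
    # Stop-word removal on the second text data -> third text data
    third = [[w for w in sent if w not in STOP_WORDS] for sent in second]
    # Deduplication on the third text data -> fourth text data
    seen, fourth = set(), []
    for sent in third:
        key = tuple(sent)
        if key not in seen:
            seen.add(key)
            fourth.append(sent)
    # TF-IDF conversion of the fourth text data -> text vector data
    vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)  # input is pre-tokenized
    return vectorizer.fit_transform(fourth)
```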
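The Euclidean-distance deduplication claimed above reduces to a pairwise distance test under the stated formula. A minimal sketch follows; the near-zero threshold is an assumption, since the claims do not state a cutoff.

```python
# Sketch of Euclidean-distance deduplication over vectorized text data.
import numpy as np

def euclidean_distance(w1, w2):
    # d = sqrt(sum_j (w_1j - w_2j)^2), as in the claimed formula
    return float(np.sqrt(np.sum((np.asarray(w1) - np.asarray(w2)) ** 2)))

def deduplicate(vectors, threshold=1e-6):
    # Keep a vector only if it is farther than `threshold` (assumed value)
    # from every vector already kept.
    kept = []
    for v in vectors:
        if all(euclidean_distance(v, k) > threshold for k in kept):
            kept.append(v)
    return kept
```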
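Finally, the multi-layer structure of the claims — Bi-LSTM encoding per layer, an inference engine that turns the candidate pool into reference information, and a decoder whose predicted labels update the pool for the next layer — can be sketched schematically. Everything beyond that outline is an assumption: the claims do not disclose the decoder formula or the exact form of the reference information, so a linear layer with argmax and a mean over pooled entity vectors stand in here.

```python
# Schematic sketch of one layer of the claimed multi-layer model (PyTorch).
# Layer widths, the label set, and the pooling of candidate entities into
# "reference information" are assumptions, not the patented design.
import torch
import torch.nn as nn

class NeuralEntityReasonerLayer(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_labels):
        super().__init__()
        self.input_dim = input_dim
        self.encoder = nn.LSTM(input_dim, hidden_dim,
                               bidirectional=True, batch_first=True)
        self.decoder = nn.Linear(2 * hidden_dim + input_dim, num_labels)

    def reference_info(self, pool, seq_len):
        # Assumed form of the reference information: mean of the candidate
        # pool's entity vectors, broadcast over the sequence.
        if not pool:
            return torch.zeros(seq_len, self.input_dim)
        return torch.stack(pool).mean(dim=0).expand(seq_len, -1)

    def forward(self, x, pool):
        encoded, _ = self.encoder(x)                # encoded representation sequence
        ref = self.reference_info(pool, x.size(1))  # reference information
        logits = self.decoder(torch.cat([encoded[0], ref], dim=-1))
        labels = logits.argmax(dim=-1)              # predicted labels
        for t, lab in enumerate(labels):            # update the candidate pool
            if lab.item() != 0:                     # label 0 assumed to mean "not an entity"
                pool.append(x[0, t])
        return labels, pool

# Each layer's recognized entities guide the next layer via the shared pool:
layers = [NeuralEntityReasonerLayer(100, 64, 5) for _ in range(3)]
x, pool = torch.randn(1, 12, 100), []               # one sentence, 12 token vectors
for layer in layers:
    labels, pool = layer(x, pool)
```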
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910825074.1A CN110688854B (en) | 2019-09-02 | 2019-09-02 | Named entity recognition method, device and computer readable storage medium |
CN201910825074.1 | 2019-09-02 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021042516A1 true WO2021042516A1 (en) | 2021-03-11 |
Family
ID=69108711
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/116935 WO2021042516A1 (en) | 2019-09-02 | 2019-11-10 | Named-entity recognition method and device, and computer readable storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110688854B (en) |
WO (1) | WO2021042516A1 (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111353310B (en) * | 2020-02-28 | 2023-08-11 | 腾讯科技(深圳)有限公司 | Named entity identification method and device based on artificial intelligence and electronic equipment |
CN111709052B (en) * | 2020-06-01 | 2021-05-25 | 支付宝(杭州)信息技术有限公司 | Private data identification and processing method, device, equipment and readable medium |
CN112256828B (en) * | 2020-10-20 | 2023-08-08 | 平安科技(深圳)有限公司 | Medical entity relation extraction method, device, computer equipment and readable storage medium |
CN112434532B (en) * | 2020-11-05 | 2024-05-28 | 西安交通大学 | Power grid environment model supporting man-machine bidirectional understanding and modeling method |
CN113051921B (en) * | 2021-03-17 | 2024-02-20 | 北京智慧星光信息技术有限公司 | Internet text entity identification method, system, electronic equipment and storage medium |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106569998A (en) * | 2016-10-27 | 2017-04-19 | 浙江大学 | Text named entity recognition method based on Bi-LSTM, CNN and CRF |
CN108536679B (en) * | 2018-04-13 | 2022-05-20 | 腾讯科技(成都)有限公司 | Named entity recognition method, device, equipment and computer readable storage medium |
CN109359291A (en) * | 2018-08-28 | 2019-02-19 | 昆明理工大学 | A kind of name entity recognition method |
CN109635279B (en) * | 2018-11-22 | 2022-07-26 | 桂林电子科技大学 | Chinese named entity recognition method based on neural network |
CN109885824B (en) * | 2019-01-04 | 2024-02-20 | 北京捷通华声科技股份有限公司 | Hierarchical Chinese named entity recognition method, hierarchical Chinese named entity recognition device and readable storage medium |
CN109933792B (en) * | 2019-03-11 | 2020-03-24 | 海南中智信信息技术有限公司 | Viewpoint type problem reading and understanding method based on multilayer bidirectional LSTM and verification model |
CN110008469B (en) * | 2019-03-19 | 2022-06-07 | 桂林电子科技大学 | Multilevel named entity recognition method |
CA3061432A1 (en) * | 2019-04-25 | 2019-07-18 | Alibaba Group Holding Limited | Identifying entities in electronic medical records |
CN110110335B (en) * | 2019-05-09 | 2023-01-06 | 南京大学 | Named entity identification method based on stack model |
- 2019-09-02: CN application CN201910825074.1A filed; granted as patent CN110688854B (active)
- 2019-11-10: PCT application PCT/CN2019/116935 filed, published as WO2021042516A1 (application filing)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103902570A (en) * | 2012-12-27 | 2014-07-02 | 腾讯科技(深圳)有限公司 | Text classification feature extraction method, classification method and device |
CN110192204A (en) * | 2016-11-03 | 2019-08-30 | 易享信息技术有限公司 | The deep neural network model of data is handled by multiple language task levels |
CN107832400A (en) * | 2017-11-01 | 2018-03-23 | 山东大学 | A kind of method that location-based LSTM and CNN conjunctive models carry out relation classification |
KR101846824B1 (en) * | 2017-12-11 | 2018-04-09 | 가천대학교 산학협력단 | Automated Named-entity Recognizing Systems, Methods, and Computer-Readable Mediums |
CN110110330A (en) * | 2019-04-30 | 2019-08-09 | 腾讯科技(深圳)有限公司 | Text based keyword extracting method and computer equipment |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113254581A (en) * | 2021-05-25 | 2021-08-13 | 深圳市图灵机器人有限公司 | Financial text formula extraction method and device based on neural semantic analysis |
CN113343702A (en) * | 2021-08-03 | 2021-09-03 | 杭州费尔斯通科技有限公司 | Entity matching method and system based on unmarked corpus |
CN113609860A (en) * | 2021-08-05 | 2021-11-05 | 湖南特能博世科技有限公司 | Text segmentation method and device and computer equipment |
CN113609860B (en) * | 2021-08-05 | 2023-09-19 | 湖南特能博世科技有限公司 | Text segmentation method and device and computer equipment |
CN113505598A (en) * | 2021-08-06 | 2021-10-15 | 贵州江南航天信息网络通信有限公司 | Network text entity relation extraction algorithm based on hybrid neural network |
CN115688777A (en) * | 2022-09-28 | 2023-02-03 | 北京邮电大学 | Named entity recognition system for nested and discontinuous entities of Chinese financial text |
CN115688777B (en) * | 2022-09-28 | 2023-05-05 | 北京邮电大学 | Named entity recognition system for nested and discontinuous entities of Chinese financial text |
Also Published As
Publication number | Publication date |
---|---|
CN110688854B (en) | 2022-03-25 |
CN110688854A (en) | 2020-01-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021042516A1 (en) | Named-entity recognition method and device, and computer readable storage medium | |
WO2021212682A1 (en) | Knowledge extraction method, apparatus, electronic device, and storage medium | |
WO2021147726A1 (en) | Information extraction method and apparatus, electronic device and storage medium | |
CN112101041B (en) | Entity relationship extraction method, device, equipment and medium based on semantic similarity | |
US20220050967A1 (en) | Extracting definitions from documents utilizing definition-labeling-dependent machine learning background | |
CN117076653B (en) | Knowledge base question-answering method based on thinking chain and visual lifting context learning | |
CN110737758A (en) | Method and apparatus for generating a model | |
CN113392209B (en) | Text clustering method based on artificial intelligence, related equipment and storage medium | |
CN111159485B (en) | Tail entity linking method, device, server and storage medium | |
CN112100332A (en) | Word embedding expression learning method and device and text recall method and device | |
CN113177412A (en) | Named entity identification method and system based on bert, electronic equipment and storage medium | |
WO2021012485A1 (en) | Text topic extraction method and device, and computer readable storage medium | |
CN113378970B (en) | Sentence similarity detection method and device, electronic equipment and storage medium | |
CN112101031B (en) | Entity identification method, terminal equipment and storage medium | |
CN111881256B (en) | Text entity relation extraction method and device and computer readable storage medium equipment | |
CN114416995A (en) | Information recommendation method, device and equipment | |
CN116204674B (en) | Image description method based on visual concept word association structural modeling | |
CN114358201A (en) | Text-based emotion classification method and device, computer equipment and storage medium | |
CN113807512B (en) | Training method and device for machine reading understanding model and readable storage medium | |
CN114912450B (en) | Information generation method and device, training method, electronic device and storage medium | |
CN116450829A (en) | Medical text classification method, device, equipment and medium | |
CN116628186A (en) | Text abstract generation method and system | |
CN114417016A (en) | Knowledge graph-based text information matching method and device and related equipment | |
CN114492661A (en) | Text data classification method and device, computer equipment and storage medium | |
Zheng et al. | Distantly supervised named entity recognition with Spy-PU algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19944541; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 19944541; Country of ref document: EP; Kind code of ref document: A1 |