CN110705258A

CN110705258A - Text entity identification method and device

Info

Publication number: CN110705258A
Application number: CN201910882003.5A
Authority: CN
Inventors: 聂俊丰
Original assignee: Beijing Mininglamp Software System Co ltd
Current assignee: Beijing Mininglamp Software System Co ltd
Priority date: 2019-09-18
Filing date: 2019-09-18
Publication date: 2020-01-17

Abstract

The invention provides a text entity recognition method and device. The method comprises: matching entities in a text based on an entity dictionary; verifying whether the boundary of each matched entity coincides with a word segmentation boundary; and, for matched entities that satisfy the segmentation boundary, calculating the probability that the match is an entity from statistics on its context. By combining a large-scale entity dictionary, performing entity matching with a multi-pattern matching algorithm, verifying that entity boundaries agree with word segmentation boundaries, and computing the entity probability statistically, the invention greatly improves the accuracy of entity recognition.

Description

Text entity identification method and device

Technical Field

The invention relates to the field of named entity recognition, in particular to a text entity recognition method and device.

Background

Entities come in many types, and the types of interest differ from field to field. Common entity types include person names, place names, organization names, dates, times, and currencies. In the financial field, the entity types of greater interest may include company names and stock codes. In the biological field, they may include gene names, protein names, and cell names. Entity recognition is a basic task in natural language processing: tasks such as semantic analysis, coreference resolution, information retrieval, entity relation recognition, and knowledge graph construction all depend on its results. Accurately identifying entities of specific types in text is therefore of great significance to subsequent natural language processing.

In engineering practice, the CRF (Conditional Random Field) model, a machine learning algorithm that is theoretically well founded, efficient, and highly interpretable, is often used for entity recognition. However, recognition accuracy is not very high when the CRF model alone is used, particularly for certain specific types of entities.

Disclosure of Invention

The embodiments of the invention provide a text entity recognition method and device, which at least solve the problem in the related art that recognition accuracy is not very high when entity recognition is based only on a CRF (Conditional Random Field) model.

According to an embodiment of the present invention, a text entity recognition method is provided, including: matching entities in a text based on an entity dictionary; verifying whether the boundary of each matched entity coincides with a word segmentation boundary; and, for matched entities that satisfy the segmentation boundary, calculating the probability that the matched entity is an entity by collecting statistics on its context.

Preferably, matching entities in the text based on the entity dictionary includes: matching the entities in the text with a multi-pattern matching algorithm based on the entity dictionary.

Preferably, calculating, for matched entities that satisfy the segmentation boundary, the probability that the matched entity is an entity by collecting statistics on its context includes: collecting statistics on the contexts of matched entities and verifying the context of the current matched entity; and judging the credibility of the current matched entity according to the verification result and labeling the entity.

Preferably, the method further includes: adding a user-defined entity dictionary to extend special entity objects.

Preferably, the method further includes: for unknown words in the text, performing entity recognition with a trained CRF model.

According to another embodiment of the present invention, a text entity recognition apparatus is provided, including: a matching module, configured to match entities in a text based on an entity dictionary; a verification module, configured to verify whether the boundary of each matched entity coincides with a word segmentation boundary; and a statistic module, configured to calculate, for matched entities that satisfy the segmentation boundary, the probability that the matched entity is an entity by collecting statistics on its context.

Preferably, the statistic module includes: a statistic unit, configured to collect statistics on the contexts of matched entities and verify the context of the current matched entity; and a judging unit, configured to judge the credibility of the current matched entity according to the verification result and label the entity.

Preferably, the apparatus further includes: an adding module, configured to add a user-defined entity dictionary to extend special entity objects.

Preferably, the apparatus further includes: a CRF module, configured to perform entity recognition on unknown words in the text with a trained CRF model.

According to a further embodiment of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.

According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.

In the embodiments of the invention, the CRF model is not relied on alone. Instead, a large-scale entity dictionary is incorporated, entity matching is performed with a multi-pattern matching algorithm, and each entity boundary is verified against the word segmentation boundaries. For matched entities that satisfy the segmentation boundary, the probability that the match is an entity is calculated statistically, which greatly improves the accuracy of entity recognition.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a schematic diagram of a computer terminal operating in accordance with a method of an embodiment of the present invention;

FIG. 2 is a flow chart of a text entity recognition method according to an embodiment of the present invention;

FIG. 3 is a block diagram of the structure of a text entity recognition apparatus according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a text entity recognition apparatus according to an alternative embodiment of the present invention.

Detailed Description

The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

The method provided by the first embodiment of the application can be executed in a computer terminal, a single chip microcomputer, a server or a similar operation device. Taking the operation on a computer terminal as an example, fig. 1 is a hardware structure block diagram of the computer terminal operated by the method of the embodiment of the present invention. As shown in fig. 1, the computer terminal 100 may include one or more (only one shown in fig. 1) processors 102 (the processor 102 may include but is not limited to a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data, and optionally may also include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and is not intended to limit the structure of the computer terminal. For example, computer terminal 100 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.

The memory 104 can be used for storing computer programs, for example, software programs and modules of application software, such as computer programs corresponding to the methods in the embodiments of the present invention, and the processor 102 executes the computer programs stored in the memory 104 to execute various functional applications and data processing, i.e., to implement the methods described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 100 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network may include a wireless network provided by a communication provider of the computer terminal 100. In one example, the transmission device 106 includes a Network Interface Controller (NIC) that can communicate with the internet.

This embodiment provides a text entity recognition method that can run on the computer terminal described above. FIG. 2 is a flowchart of the text entity recognition method according to an embodiment of the present invention. As shown in FIG. 2, the flow includes the following steps:

Step S202: matching entities in the text based on the entity dictionary;

Step S204: verifying whether the boundary of each matched entity coincides with a word segmentation boundary;

Step S206: for matched entities that satisfy the segmentation boundary, calculating the probability that the matched entity is an entity by collecting statistics on its context.

In step S202 of this embodiment, the entities in the text may be matched with a multi-pattern matching algorithm based on the entity dictionary.
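As a rough illustration of dictionary-based matching in step S202, the following minimal sketch uses a plain nested-dictionary trie rather than any particular production data structure. The function names, the `END` marker, and the entity type labels are hypothetical and chosen only for this example.

```python
from typing import Dict, List, Tuple

# Multi-character marker key; it can never collide with a single-character edge.
END = "__end__"


def build_trie(dictionary: Dict[str, str]) -> dict:
    """Build a nested-dict trie from {surface form: entity type}."""
    root: dict = {}
    for word, entity_type in dictionary.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = entity_type
    return root


def match_entities(text: str, trie: dict) -> List[Tuple[int, int, str, str]]:
    """Return every dictionary hit as (start, end, surface, type); end is exclusive."""
    hits = []
    for start in range(len(text)):
        node = trie
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break
            if END in node:
                hits.append((start, end + 1, text[start:end + 1], node[END]))
    return hits


trie = build_trie({"北京": "LOC", "北京大学": "ORG"})
print(match_entities("我在北京大学读书", trie))
```

The sketch reports all overlapping hits, so a later stage still has to decide which candidates to keep.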

In step S206 of this embodiment, statistics are collected on the contexts of matched entities and the context of the current matched entity is verified; the credibility of the current matched entity is then judged according to the verification result, and the entity is labeled.

In this embodiment, a user-defined entity dictionary may also be added to extend a particular entity object.

In this embodiment, for the unknown words in the text, the recognition of the entity is performed through a trained CRF model.

In order to facilitate the understanding of the technical scheme provided by the invention, the following detailed description of specific application examples is provided.

The existing CRF-based entity recognition is first introduced briefly. A CRF model is trained on a large annotated corpus. The text is segmented character by character to obtain an observation sequence, a feature sequence, and a label sequence, where the observation sequence is the character sequence of each sentence in the text. The CRF model uses the characters, character types, and context in the text as features. Specifically, these include the character type (digit, punctuation, English letter, and so on), the current character, the previous character, the next character, and the joint feature of the previous and next characters. During training, these features allow the model to learn from the current observation and its context.
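The character-level features listed above might be produced along the following lines. This is only a sketch: the feature names, the `<BOS>`/`<EOS>` padding symbols, and the character-type categories are assumptions made for illustration, not details taken from the patent.

```python
from typing import Dict, List


def char_type(ch: str) -> str:
    """Coarse character type: digit, English letter, punctuation, or other."""
    if ch.isdigit():
        return "digit"
    if ch.isascii() and ch.isalpha():
        return "english"
    if not ch.isalnum() and not ch.isspace():
        return "punct"
    return "other"


def char_features(sentence: str, i: int) -> Dict[str, str]:
    """Features for the character at position i, including its neighbours."""
    prev_ch = sentence[i - 1] if i > 0 else "<BOS>"
    next_ch = sentence[i + 1] if i < len(sentence) - 1 else "<EOS>"
    return {
        "char": sentence[i],
        "type": char_type(sentence[i]),
        "prev": prev_ch,
        "next": next_ch,
        "prev+next": prev_ch + "|" + next_ch,  # joint feature of both neighbours
    }


def sentence_features(sentence: str) -> List[Dict[str, str]]:
    """Feature sequence aligned with the character observation sequence."""
    return [char_features(sentence, i) for i in range(len(sentence))]
```

A list of such dictionaries per sentence is the input shape accepted by several common CRF toolkits, although the patent does not name one.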

During training, the model parameters are obtained by maximum likelihood estimation or regularized maximum likelihood estimation, that is, by maximizing the log-likelihood of the training data. Training ultimately yields a Markov random field (MRF) distribution conditioned on the given observation sequence.
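For reference, the training objective can be written out in the standard linear-chain CRF form. The patent does not give formulas, so the notation here (feature functions f_k, weights λ_k, regularization strength σ², N training pairs) is the usual textbook convention and should be read as an assumption.

```latex
% Conditional distribution of a label sequence y given an observation sequence x
p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)}
  \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z_\lambda(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)

% Regularized log-likelihood maximized during training
L(\lambda) = \sum_{i=1}^{N} \log p_\lambda\big(y^{(i)} \mid x^{(i)}\big)
  - \frac{1}{2\sigma^2} \sum_{k} \lambda_k^2
```

Dropping the last term gives plain maximum likelihood estimation; keeping it gives the regularized variant.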

At prediction time, the text is again segmented character by character to obtain the observation sequence and feature sequence. The optimal label sequence is then computed with the Viterbi algorithm based on the trained model, and the entities in the text are labeled accordingly.
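The decoding step can be sketched as a generic Viterbi search over log-scores. The score tables below are hypothetical inputs that a trained model would supply; how the patent's model computes them is not specified.

```python
from typing import Dict, List


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    labels: List[str],
) -> List[str]:
    """Return the highest-scoring label sequence.

    emissions[t][y] is the log-score of label y at position t;
    transitions[a][b] is the log-score of moving from label a to label b.
    """
    if not emissions:
        return []
    score = {y: emissions[0][y] for y in labels}
    backpointers: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in labels:
            # Best previous label for ending at y in position t.
            best_prev = max(labels, key=lambda p: score[p] + transitions[p][y])
            new_score[y] = score[best_prev] + transitions[best_prev][y] + emissions[t][y]
            pointers[y] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(labels, key=lambda y: score[y])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

The search costs O(T·|labels|²) time, which is why it stays practical for character-level tagging with a small label set.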

In this embodiment, instead of relying on the CRF model alone, a large-scale entity dictionary is incorporated. A DAT (double-array trie) multi-pattern matching algorithm is used to build the dictionary and perform matching, and a word segmentation component is used to verify whether each entity boundary coincides with a word segmentation boundary. For a matched entity that satisfies the segmentation boundary, a context-based natural language processing (NLP) algorithm calculates the probability that the match is an entity.
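One way to picture the boundary check is to compare the character offsets of a dictionary hit with the token boundaries produced by a word segmenter. In the sketch below the token list is assumed to come from some external segmentation component, and the helper names are hypothetical.

```python
from typing import List, Set


def segmentation_boundaries(tokens: List[str]) -> Set[int]:
    """Character offsets at which a token starts or ends."""
    boundaries = {0}
    offset = 0
    for token in tokens:
        offset += len(token)
        boundaries.add(offset)
    return boundaries


def respects_boundaries(start: int, end: int, boundaries: Set[int]) -> bool:
    """A match [start, end) is kept only if both ends fall on token boundaries."""
    return start in boundaries and end in boundaries


# "南京市长江大桥" segmented as ["南京市", "长江大桥"]
boundaries = segmentation_boundaries(["南京市", "长江大桥"])
print(respects_boundaries(0, 3, boundaries))  # "南京市" lies on token boundaries
print(respects_boundaries(2, 4, boundaries))  # "市长" cuts across two tokens
```

In this example a dictionary hit on "市长" (mayor) would be discarded because it straddles the boundary between the two tokens.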

The context-based NLP algorithm combines context statistics gathered from an annotated corpus with the context in the current text. Taking place names as an example, the specific steps are as follows:

Count the number of times a context co-occurs with a place name and the total number of times the context occurs. Using these context statistics, verify the context of the current entity and judge its credibility, and label the entity accordingly.
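The two counts described above suggest a simple ratio as the credibility estimate. The sketch below assumes that the probability is the co-occurrence count divided by the total count of the context, and that a "context" is a single neighbouring word; both are illustrative assumptions, as are the class and method names.

```python
from collections import Counter


class ContextStats:
    """Counts how often a context appears, and how often it appears next to an entity."""

    def __init__(self) -> None:
        self.context_total: Counter = Counter()
        self.context_with_entity: Counter = Counter()

    def observe(self, context: str, is_entity: bool) -> None:
        """Record one occurrence of a context from the annotated corpus."""
        self.context_total[context] += 1
        if is_entity:
            self.context_with_entity[context] += 1

    def probability(self, context: str) -> float:
        """Estimated probability that a candidate in this context is an entity."""
        total = self.context_total[context]
        if total == 0:
            return 0.0  # unseen context: no evidence either way in this sketch
        return self.context_with_entity[context] / total


stats = ContextStats()
for is_place in (True, True, True, False):
    stats.observe("位于", is_place)  # "位于" = "located in"
print(stats.probability("位于"))
```

A candidate would then be labeled as an entity when this estimate exceeds some threshold; the patent does not state a threshold value.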

In addition, a user-defined entity dictionary can be added in this embodiment to extend some special entity objects. For entities such as ID card numbers and mobile phone numbers, a simple rule engine together with preset patterns for those entities can be used to extract them from unstructured text. High accuracy can be achieved on general text.
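A rule engine of this kind can be as small as a table of regular expressions. The two patterns below (an 18-character ID number ending in a digit or X, and an 11-digit mobile number starting with 1) are illustrative assumptions and not the patterns used in the patent; the names `RULES` and `extract_by_rules` are likewise hypothetical.

```python
import re
from typing import List, Tuple

# Hypothetical preset patterns; lookarounds prevent matching inside longer digit runs.
RULES = {
    "ID_NUMBER": re.compile(r"(?<![0-9])[0-9]{17}[0-9Xx](?![0-9])"),
    "MOBILE": re.compile(r"(?<![0-9])1[3-9][0-9]{9}(?![0-9])"),
}


def extract_by_rules(text: str) -> List[Tuple[int, int, str, str]]:
    """Return rule hits as (start, end, surface, type), sorted by position."""
    hits = []
    for entity_type, pattern in RULES.items():
        for m in pattern.finditer(text):
            hits.append((m.start(), m.end(), m.group(), entity_type))
    return sorted(hits)


print(extract_by_rules("Call 13812345678 or see ID 11010519900101123X."))
```

A production rule would normally also validate the ID checksum and date field, which this sketch omits.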

In this embodiment, by combining machine learning with a rule engine and using NLP algorithms such as context-based entity recognition, 18 types of entities, including person names, place names, organization names, ID card numbers, and mobile phone numbers, can be accurately identified. The details are shown in Table 1 below:

TABLE 1

Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

This embodiment further provides a text entity recognition apparatus, which is used to implement the foregoing embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" or "unit" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

Fig. 3 is a block diagram showing the construction of a text entity recognition apparatus according to an embodiment of the present invention, which includes a matching module 10, a verification module 20, and a statistic module 30, as shown in fig. 3.

The matching module 10 is used for matching entities in the text based on the entity dictionary.

The verification module 20 is configured to verify whether the boundary of the matching entity meets the segmentation boundary.

The statistic module 30 calculates the probability of the matching entity meeting the segmentation boundary as an entity by counting the context of the matching entity.

Fig. 4 is a block diagram showing the construction of a text entity recognition apparatus according to an alternative embodiment of the present invention. As shown in fig. 4, the apparatus includes an adding module 40 and a CRF module 50 in addition to the matching module 10, the verification module 20, and the statistic module 30 shown in fig. 3.

The adding module 40 is used to add a user-defined entity dictionary to expand a particular entity object.

The CRF module 50 is used to perform entity recognition on unknown words in the text with a trained CRF model.

In this embodiment, the statistic module 30 may further include a statistic unit 31 and a judging unit 32.

The statistic unit 31 is configured to collect statistics on the contexts of matched entities and verify the context of the current matched entity.

The judging unit 32 is configured to judge the reliability of the currently matched entity according to the verification result, and perform entity marking.

It should be noted that the above modules may be implemented in software or hardware. For the latter, this may be achieved in, but is not limited to, the following ways: the modules are all located in the same processor; or the modules are located in different processors in any combination.

Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.

Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.

Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device. They may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases, the steps shown or described may be performed in an order different from that described herein. The modules or steps may also be fabricated as individual integrated circuit modules, or several of them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for text entity recognition, comprising:

matching entities in a text based on an entity dictionary;

verifying whether the boundary of each matched entity coincides with a word segmentation boundary; and

for matched entities that satisfy the segmentation boundary, calculating the probability that the matched entity is an entity by collecting statistics on the context of the matched entity.

2. The method of claim 1, wherein matching entities in text based on an entity dictionary comprises:

matching the entities in the text with a multi-pattern matching algorithm based on the entity dictionary.

3. The method of claim 1, wherein calculating, for matched entities that satisfy the segmentation boundary, the probability that the matched entity is an entity by collecting statistics on the context of the matched entity comprises:

collecting statistics on the contexts of matched entities, and verifying the context of the current matched entity; and

judging the credibility of the current matched entity according to the verification result, and labeling the entity.

4. The method of claim 1, further comprising:

adding a user-defined entity dictionary to extend special entity objects.

5. The method of claim 1, further comprising:

for unknown words in the text, performing entity recognition with a trained CRF model.

6. A text entity recognition apparatus, comprising:

a matching module, configured to match entities in a text based on an entity dictionary;

a verification module, configured to verify whether the boundary of each matched entity coincides with a word segmentation boundary; and

a statistics module, configured to calculate, for matched entities that satisfy the segmentation boundary, the probability that the matched entity is an entity by collecting statistics on the context of the matched entity.

7. The apparatus of claim 6, wherein the statistics module comprises:

a statistics unit, configured to collect statistics on the contexts of matched entities and verify the context of the current matched entity; and

a judging unit, configured to judge the credibility of the current matched entity according to the verification result and label the entity.

8. The apparatus of claim 6, further comprising:

an adding module, configured to add a user-defined entity dictionary to extend special entity objects.

9. The apparatus of claim 6, further comprising:

a CRF module, configured to perform entity recognition on unknown words in the text with a trained CRF model.

10. A storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any of claims 1 to 5 when executed.

11. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 5.