CN110705258A - Text entity identification method and device - Google Patents

Text entity identification method and device

Publication number
CN110705258A
Authority
CN
China
Prior art keywords
entity
matching
entities
text
boundary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910882003.5A
Other languages
Chinese (zh)
Inventor
聂俊丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd filed Critical Beijing Mininglamp Software System Co ltd
Priority to CN201910882003.5A priority Critical patent/CN110705258A/en
Publication of CN110705258A publication Critical patent/CN110705258A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

The invention provides a text entity recognition method and device. The method comprises: matching entities in the text based on an entity dictionary; verifying whether the boundary of each matched entity coincides with a word segmentation boundary; and, for the matched entities that coincide with the segmentation boundary, calculating the probability that each matched entity is an entity by collecting statistics on its context. By incorporating a large-scale entity dictionary, performing entity matching with a multi-pattern matching algorithm, verifying whether each entity boundary coincides with a word segmentation boundary, and statistically calculating the probability that a match is an entity, the invention greatly improves the accuracy of entity recognition.

Description

Text entity identification method and device
Technical Field
The invention relates to the field of named entity recognition, in particular to a text entity recognition method and device.
Background
Entities come in many types, and the types of interest differ from field to field. Common entity types include person names, place names, organization names, dates, times, and currencies. The financial field may care more about company names, stock codes, and the like, while the biological field may care more about gene names, protein names, cell names, and the like. Entity recognition is a basic task in natural language processing: semantic analysis, coreference resolution, information retrieval, entity relation recognition, knowledge graph construction, and similar tasks all depend on its results. Accurately recognizing entities of specific types in text is therefore of great importance for subsequent natural language processing.
In engineering practice, the CRF (Conditional Random Field) model, a machine learning algorithm that is theoretically well founded, efficient, and highly interpretable, is often used for entity recognition. However, recognition accuracy is not very high when entities are recognized with the CRF model alone, particularly for certain specific types of entities.
Disclosure of Invention
The embodiments of the invention provide a text entity recognition method and device, which at least solve the problem in the related art that recognition accuracy is not very high when entity recognition relies on the CRF (Conditional Random Field) model alone.
According to an embodiment of the present invention, there is provided a text entity recognition method including: matching entities in the text based on an entity dictionary; verifying whether the boundary of each matched entity coincides with a word segmentation boundary; and, for the matched entities that coincide with the segmentation boundary, calculating the probability that each matched entity is an entity by collecting statistics on its context.
Preferably, matching entities in the text based on the entity dictionary comprises: matching the entities in the text using a multi-pattern matching algorithm based on the entity dictionary.
Preferably, for the matched entities that coincide with the segmentation boundary, calculating the probability that each matched entity is an entity by collecting statistics on its context includes: collecting statistics on the contexts of matched entities, and verifying the context of the current matched entity; and judging the credibility of the current matched entity according to the verification result, and labelling the entity.
Preferably, the method further comprises: adding a user-defined entity dictionary to extend coverage to particular entity objects.
Preferably, the method further comprises: for unknown words in the text, performing entity recognition with a trained CRF model.
According to another embodiment of the present invention, there is provided a text entity recognition apparatus including: a matching module, configured to match entities in the text based on an entity dictionary; a verification module, configured to verify whether the boundary of each matched entity coincides with a word segmentation boundary; and a statistics module, configured to calculate, for the matched entities that coincide with the segmentation boundary, the probability that each matched entity is an entity by collecting statistics on its context.
Preferably, the statistics module comprises: a statistics unit, configured to collect statistics on the contexts of matched entities and verify the context of the current matched entity; and a judging unit, configured to judge the credibility of the current matched entity according to the verification result and label the entity.
Preferably, the apparatus further comprises: an adding module, configured to add a user-defined entity dictionary to extend coverage to particular entity objects.
Preferably, the apparatus further comprises: a CRF module, configured to perform entity recognition on unknown words in the text with a trained CRF model.
According to a further embodiment of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
In the embodiments of the invention, rather than relying on the CRF model alone, a large-scale entity dictionary is incorporated, entity matching is performed with a multi-pattern matching algorithm, and each entity boundary is checked against the word segmentation boundaries. For the matched entities that coincide with the segmentation boundaries, the probability that the match is an entity is calculated statistically, which greatly improves the accuracy of entity recognition.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic diagram of a computer terminal operating in accordance with a method of an embodiment of the present invention;
FIG. 2 is a flow chart of a text entity recognition method according to an embodiment of the present invention;
FIG. 3 is a block diagram of the structure of a text entity recognition apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a text entity recognition apparatus according to an alternative embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The method provided by the first embodiment of the application can be executed in a computer terminal, a single chip microcomputer, a server or a similar operation device. Taking the operation on a computer terminal as an example, fig. 1 is a hardware structure block diagram of the computer terminal operated by the method of the embodiment of the present invention. As shown in fig. 1, the computer terminal 100 may include one or more (only one shown in fig. 1) processors 102 (the processor 102 may include but is not limited to a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data, and optionally may also include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and is not intended to limit the structure of the computer terminal. For example, computer terminal 100 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 can be used for storing computer programs, for example, software programs and modules of application software, such as computer programs corresponding to the methods in the embodiments of the present invention, and the processor 102 executes the computer programs stored in the memory 104 to execute various functional applications and data processing, i.e., to implement the methods described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 100 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 100. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC) that can communicate with the internet.
In this embodiment, a text entity recognition method operable on the above computer terminal is provided. FIG. 2 is a flowchart of the text entity recognition method according to the embodiment of the present invention. As shown in FIG. 2, the flow includes the following steps:
Step S202, matching entities in the text based on an entity dictionary;
Step S204, verifying whether the boundary of each matched entity coincides with a word segmentation boundary;
Step S206, for the matched entities that coincide with the segmentation boundary, calculating the probability that each matched entity is an entity by collecting statistics on its context.
In step S202 of the present embodiment, entities in the text may be matched using a multi-pattern matching algorithm based on the entity dictionary.
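Matching many dictionary entries against a text, as in step S202, can be sketched as follows. This is a minimal illustration and not the patent's implementation: it uses a plain nested-dict trie scanned from every start position in place of a compact trie structure, and the function names, the `"__type__"` marker key, and the sample dictionary are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical marker key storing the entity type at the end of an entry.
# It is longer than one character, so it can never collide with a text character.
END = "__type__"

def build_trie(entity_dict: Dict[str, str]) -> dict:
    """Build a nested-dict trie from {surface form: entity type}."""
    root: dict = {}
    for surface, etype in entity_dict.items():
        node = root
        for ch in surface:
            node = node.setdefault(ch, {})
        node[END] = etype
    return root

def match_entities(text: str, trie: dict) -> List[Tuple[int, int, str, str]]:
    """Return every dictionary hit as (start, end, surface, type); end is exclusive."""
    hits = []
    for start in range(len(text)):
        node = trie
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break  # no dictionary entry continues with this character
            if END in node:
                hits.append((start, end + 1, text[start:end + 1], node[END]))
    return hits

# Sample data (assumed): "北京" = Beijing, "北京大学" = Peking University.
trie = build_trie({"北京": "LOC", "北京大学": "ORG", "南京": "LOC"})
hits = match_entities("我在北京大学读书", trie)
```

All overlapping hits are returned on purpose, so that later steps can decide which candidates to keep.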
In step S206 of this embodiment, statistics are collected on the contexts of matched entities and the context of the current matched entity is verified; the credibility of the current matched entity is then judged according to the verification result, and the entity is labelled.
In this embodiment, a user-defined entity dictionary may also be added to extend a particular entity object.
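A minimal sketch of extending a base dictionary with user-defined entries is given below. The `surface<TAB>type` line format, the function names, and the rule that user entries override base entries are assumptions made for illustration; the source does not specify any of them.

```python
from typing import Dict, Iterable

def parse_user_entries(lines: Iterable[str]) -> Dict[str, str]:
    """Parse 'surface<TAB>type' lines (hypothetical format); skip blanks and '#' comments."""
    entries: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        surface, _, etype = line.partition("\t")
        if surface and etype:
            entries[surface] = etype
    return entries

def extend_dictionary(base: Dict[str, str], user: Dict[str, str]) -> Dict[str, str]:
    """Return a new dictionary; user entries take precedence (an assumed policy)."""
    merged = dict(base)
    merged.update(user)
    return merged

base = {"北京": "LOC"}
user = parse_user_entries(["# custom entries", "", "明略科技\tORG", "北京\tCITY"])
merged = extend_dictionary(base, user)
```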
In this embodiment, for unknown words in the text, entity recognition is performed with a trained CRF model.
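One way the dictionary-based results and the CRF results could be combined is sketched below. The merge policy shown (dictionary spans are always kept, and a CRF span is kept only where it overlaps no dictionary span) is an assumption; the source states only that the CRF model handles unknown words. The names `overlaps` and `merge_with_fallback` are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, entity type), end exclusive

def overlaps(a: Span, b: Span) -> bool:
    """True if the two half-open character ranges share at least one position."""
    return a[0] < b[1] and b[0] < a[1]

def merge_with_fallback(dict_spans: List[Span], crf_spans: List[Span]) -> List[Span]:
    """Keep all dictionary spans; add CRF spans only where the dictionary found nothing."""
    merged = list(dict_spans)
    for span in crf_spans:
        if not any(overlaps(span, kept) for kept in dict_spans):
            merged.append(span)
    return sorted(merged)

result = merge_with_fallback([(0, 3, "LOC")], [(1, 3, "LOC"), (5, 8, "PER")])
```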
In order to facilitate the understanding of the technical scheme provided by the invention, the following detailed description of specific application examples is provided.
First, a brief introduction to existing CRF-based entity recognition. The CRF is trained on a large labelled corpus; the text is split into characters to obtain an observation sequence, a feature sequence, and a label sequence. The observation sequence is the character sequence of each sentence in the text. The CRF model uses the characters, character types, and context in the text as features, specifically: the character type (digit, punctuation, English letter, and so on), the current character, the previous character, the next character, and the joint feature of the previous and next characters. These features let the model learn from the current observation and its context during training.
During training, the model parameters are obtained by maximizing the log-likelihood of the training data, using maximum likelihood estimation or regularized maximum likelihood estimation. Training thus yields a Markov random field (MRF) distribution conditioned on the given observation sequence.
At prediction time, the text is likewise split into characters to obtain the observation sequence and the feature sequence, and the optimal label sequence is then computed with the Viterbi algorithm based on the trained model, thereby labelling the entities in the text.
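The per-character features listed above can be sketched as a small feature function. The feature keys, the type labels, and the `<BOS>`/`<EOS>` padding symbols are illustrative assumptions, not names taken from the source.

```python
from typing import Dict, List

def char_type(ch: str) -> str:
    """Coarse character class; the label names are assumptions."""
    if ch.isdigit():
        return "DIGIT"
    if ch.isascii() and ch.isalpha():
        return "LATIN"
    if not ch.isalnum() and not ch.isspace():
        return "PUNCT"
    return "OTHER"

def char_features(chars: List[str], i: int) -> Dict[str, str]:
    """Features for position i: current, previous, next character and their joint feature."""
    prev_ch = chars[i - 1] if i > 0 else "<BOS>"
    next_ch = chars[i + 1] if i < len(chars) - 1 else "<EOS>"
    return {
        "char": chars[i],
        "type": char_type(chars[i]),
        "prev": prev_ch,
        "next": next_ch,
        "prev+next": prev_ch + "|" + next_ch,
    }

chars = list("A股2019")
features = [char_features(chars, i) for i in range(len(chars))]
```

A feature sequence of this shape is what a CRF toolkit would typically consume alongside the label sequence during training.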
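For reference, the training objective can be written in the standard textbook form of a linear-chain CRF. The notation below (feature functions f_k, weights λ_k, N training pairs, a Gaussian prior with variance σ²) is not taken from the source and is only one common way to state regularized maximum likelihood; dropping the last term gives the unregularized estimate.

```latex
p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)}
  \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z_\lambda(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y'_{t-1}, y'_t, x, t) \Big)

\mathcal{L}(\lambda) = \sum_{i=1}^{N} \log p_\lambda\big(y^{(i)} \mid x^{(i)}\big)
  - \frac{\lVert \lambda \rVert^{2}}{2\sigma^{2}}
```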
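Viterbi decoding over per-position label scores can be sketched as follows. This is a generic dynamic-programming sketch under assumed inputs: `emissions` holds one score dictionary per character and `transitions` holds scores for label pairs, both in log space, with missing entries treated as 0. A trained CRF would supply these scores; the numbers in the example are made up.

```python
from typing import Dict, List, Tuple

def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    labels: List[str],
) -> List[str]:
    """Return the highest-scoring label sequence (scores are additive log scores)."""
    if not emissions:
        return []
    # best[t][y]: best score of any path that ends with label y at position t
    best = [{y: emissions[0].get(y, 0.0) for y in labels}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions.get((p, y), 0.0))
            scores[y] = best[-1][prev] + transitions.get((prev, y), 0.0) + emissions[t].get(y, 0.0)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

labels = ["B", "I", "O"]  # begin, inside, outside (a common tagging scheme, assumed here)
emissions = [{"B": 2.0, "O": 1.0}, {"I": 1.5, "O": 1.0}, {"O": 2.0}]
transitions = {("O", "I"): -10.0, ("B", "I"): 1.0}
path = viterbi(emissions, transitions, labels)
```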
In this embodiment, rather than relying on the CRF model alone, a large-scale entity dictionary is incorporated: a DAT (double-array trie) multi-pattern matching algorithm is used to build the dictionary and perform matching, and a word segmentation component is used to verify whether each entity boundary coincides with a segmentation boundary. For a matched entity that coincides with the segmentation boundaries, a context-based natural language processing (NLP) algorithm calculates the probability that the match is an entity.
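The boundary check can be sketched as follows, assuming the segmentation component returns the text as a list of tokens whose concatenation equals the original text. The function names and the sample segmentation are hypothetical; any word segmenter could supply the tokens.

```python
from typing import Iterable, List, Set, Tuple

Match = Tuple[int, int, str]  # (start, end, surface), end exclusive

def token_boundaries(tokens: Iterable[str]) -> Set[int]:
    """Character offsets at which a token starts or ends."""
    bounds = {0}
    pos = 0
    for tok in tokens:
        pos += len(tok)
        bounds.add(pos)
    return bounds

def filter_by_boundary(matches: List[Match], tokens: List[str]) -> List[Match]:
    """Keep only matches whose start and end both fall on token boundaries."""
    bounds = token_boundaries(tokens)
    return [m for m in matches if m[0] in bounds and m[1] in bounds]

# Sample text "南京市长江大桥" (Nanjing Yangtze River Bridge), assumed to be segmented as below.
tokens = ["南京市", "长江大桥"]
matches = [(0, 3, "南京市"), (2, 4, "市长"), (3, 7, "长江大桥")]
kept = filter_by_boundary(matches, tokens)
```

Here the dictionary hit "市长" (mayor) straddles two tokens, so it is discarded as a spurious match.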
The context-based NLP algorithm combines context statistics gathered from a labelled corpus with the context found in the current text. The specific steps are as follows:
Count the number of times a context co-occurs with an entity (for example, a place name) and the total number of times that context occurs. By collecting these context statistics and checking the context of the current entity against them, the credibility of the current entity is judged and the entity is labelled accordingly.
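The two counts described above can be turned into a credibility score as sketched below. Using the ratio of co-occurrence count to total count as the probability is an assumption (the source names the two counts but not the formula), and the class name, the single-word context, and the sample data are hypothetical.

```python
from collections import Counter
from typing import Optional

class ContextStats:
    """Counts how often a context word appears and how often it accompanies an entity type."""

    def __init__(self) -> None:
        self.total: Counter = Counter()
        self.with_entity: Counter = Counter()

    def observe(self, context: str, entity_type: Optional[str]) -> None:
        """Record one occurrence of a context from the labelled corpus."""
        self.total[context] += 1
        if entity_type is not None:
            self.with_entity[(context, entity_type)] += 1

    def probability(self, context: str, entity_type: str) -> float:
        """Assumed estimate: co-occurrence count divided by total count of the context."""
        total = self.total[context]
        if total == 0:
            return 0.0  # unseen context: no evidence either way
        return self.with_entity[(context, entity_type)] / total

stats = ContextStats()
# "位于" (located in) precedes a place name in 3 of its 4 occurrences (made-up counts).
for _ in range(3):
    stats.observe("位于", "LOC")
stats.observe("位于", None)
p = stats.probability("位于", "LOC")
```

A candidate could then be accepted when this score exceeds a chosen threshold; the threshold value would have to be tuned and is not given in the source.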
In addition, a user-defined entity dictionary can be added in this embodiment to cover certain special entity objects. For entities such as identity card numbers and mobile phone numbers, the corresponding entities in unstructured text can be extracted with a simple rule engine together with the fixed patterns of those entities. High accuracy can be achieved on general texts.
In this embodiment, by combining machine learning with a rule engine and using NLP algorithms such as context-based entity recognition, 18 types of entities, such as person names, place names, organization names, identity card numbers, and mobile phone numbers, can be accurately recognized. The details are shown in Table 1 below:
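Pattern-based extraction of such fixed-format entities can be sketched with regular expressions. The two patterns below are simplified assumptions (an 18-character identity card number ending in a digit or X, and an 11-digit mobile number starting with 1); they check format only, not checksums or valid region codes, and the rule names are hypothetical.

```python
import re
from typing import List, Tuple

# Simplified, assumed patterns; the lookarounds stop a match inside a longer digit run.
RULES = {
    "MOBILE": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    "ID_CARD": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
}

def extract_by_rules(text: str) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, surface, type) for every rule hit, ordered by position."""
    hits = []
    for etype, pattern in RULES.items():
        for m in pattern.finditer(text):
            hits.append((m.start(), m.end(), m.group(), etype))
    return sorted(hits)

# Sample sentence: "Contact phone ..., identity card number ..." (made-up numbers).
hits = extract_by_rules("联系电话13812345678,身份证号11010519491231002X。")
```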
TABLE 1
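[Table 1 appears as an image in the original publication (figure reference BDA0002206142610000061); its contents are not recoverable from this text.]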
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
This embodiment further provides a text entity recognition apparatus, which is used to implement the foregoing embodiments and preferred implementations; what has already been described is not repeated here. As used below, the term "module" or "unit" refers to a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 3 is a block diagram showing the construction of a text entity recognition apparatus according to an embodiment of the present invention, which includes a matching module 10, a verification module 20, and a statistic module 30, as shown in fig. 3.
The matching module 10 is used for matching entities in the text based on the entity dictionary.
The verification module 20 is configured to verify whether the boundary of the matching entity meets the segmentation boundary.
The statistic module 30 calculates the probability of the matching entity meeting the segmentation boundary as an entity by counting the context of the matching entity.
Fig. 4 is a block diagram showing the construction of a text entity recognition apparatus according to an alternative embodiment of the present invention, which includes an adding module 40 and a CRF module 50 in addition to the matching module 10, the verifying module 20 and the counting module 30 shown in fig. 3, as shown in fig. 4.
The adding module 40 is used to add a user-defined entity dictionary to expand a particular entity object.
The CRF module 50 is used to perform entity recognition on unknown words in the text with a trained CRF model.
In this embodiment, the statistic module 30 may further include a statistic unit 31 and a judging unit 32.
The counting unit 31 is configured to count the context of the matching entity and verify the context of the current matching entity.
The judging unit 32 is configured to judge the reliability of the currently matched entity according to the verification result, and perform entity marking.
It should be noted that the above modules may be implemented in software or in hardware. In the latter case, this may be done in, but is not limited to, the following ways: the modules are all located in the same processor; or the modules are located in different processors in any combination.
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented on a general-purpose computing device. They may be centralized on a single computing device or distributed across a network of multiple computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from that described herein. Alternatively, they may each be fabricated as a separate integrated circuit module, or several of them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (11)

1. A method for text entity recognition, comprising:
matching entities in the text based on an entity dictionary;
verifying whether the boundary of each matched entity coincides with a word segmentation boundary;
and, for the matched entities that coincide with the segmentation boundary, calculating the probability that each matched entity is an entity by collecting statistics on its context.
2. The method of claim 1, wherein matching entities in the text based on the entity dictionary comprises:
matching the entities in the text using a multi-pattern matching algorithm based on the entity dictionary.
3. The method of claim 1, wherein, for the matched entities that coincide with the segmentation boundary, calculating the probability that each matched entity is an entity by collecting statistics on its context comprises:
collecting statistics on the contexts of matched entities, and verifying the context of the current matched entity;
and judging the credibility of the current matched entity according to the verification result, and labelling the entity.
4. The method of claim 1, further comprising:
adding a user-defined entity dictionary to extend coverage to particular entity objects.
5. The method of claim 1, further comprising:
for unknown words in the text, performing entity recognition with a trained CRF model.
6. A text entity recognition apparatus, comprising:
a matching module, configured to match entities in the text based on an entity dictionary;
a verification module, configured to verify whether the boundary of each matched entity coincides with a word segmentation boundary;
and a statistics module, configured to calculate, for the matched entities that coincide with the segmentation boundary, the probability that each matched entity is an entity by collecting statistics on its context.
7. The apparatus of claim 6, wherein the statistics module comprises:
a statistics unit, configured to collect statistics on the contexts of matched entities and verify the context of the current matched entity;
and a judging unit, configured to judge the credibility of the current matched entity according to the verification result and label the entity.
8. The apparatus of claim 6, further comprising:
an adding module, configured to add a user-defined entity dictionary to extend coverage to particular entity objects.
9. The apparatus of claim 6, further comprising:
a CRF module, configured to perform entity recognition on unknown words in the text with a trained CRF model.
10. A storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any of claims 1 to 5 when executed.
11. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 5.
CN201910882003.5A 2019-09-18 2019-09-18 Text entity identification method and device Withdrawn CN110705258A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910882003.5A CN110705258A (en) 2019-09-18 2019-09-18 Text entity identification method and device

Publications (1)

Publication Number Publication Date
CN110705258A true CN110705258A (en) 2020-01-17

Family

ID=69196258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910882003.5A Withdrawn CN110705258A (en) 2019-09-18 2019-09-18 Text entity identification method and device

Country Status (1)

Country Link
CN (1) CN110705258A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106239517A (en) * 2016-08-23 2016-12-21 北京小米移动软件有限公司 Robot and the method for the autonomous manipulation of realization, device
CN106503192A (en) * 2016-10-31 2017-03-15 北京百度网讯科技有限公司 Name entity recognition method and device based on artificial intelligence
CN108491373A (en) * 2018-02-01 2018-09-04 北京百度网讯科技有限公司 A kind of entity recognition method and system
US20180365211A1 (en) * 2015-12-11 2018-12-20 Beijing Gridsum Technology Co., Ltd. Method and Device for Recognizing Domain Named Entity
CN109871541A (en) * 2019-03-06 2019-06-11 电子科技大学 It is a kind of suitable for multilingual multi-field name entity recognition method
CN109977402A (en) * 2019-03-11 2019-07-05 北京明略软件系统有限公司 A kind of name entity recognition method and system

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914561A (en) * 2020-07-31 2020-11-10 中国建设银行股份有限公司 Entity recognition model training method, entity recognition device and terminal equipment
CN111914561B (en) * 2020-07-31 2023-06-30 建信金融科技有限责任公司 Entity recognition model training method, entity recognition device and terminal equipment
WO2022111083A1 (en) * 2020-11-30 2022-06-02 京东方科技集团股份有限公司 Entity recognition method, entity recognition apparatus, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200117
