CN107943786B

CN107943786B - Chinese named entity recognition method and system

Info

Publication number: CN107943786B
Application number: CN201711137581.3A
Authority: CN
Inventors: 吴远辉
Original assignee: Guangzhou Wanlong Securities Consulting Co ltd
Current assignee: Guangzhou Wanlong Securities Consulting Co ltd
Priority date: 2017-11-16
Filing date: 2017-11-16
Publication date: 2021-12-07
Anticipated expiration: 2037-11-16
Also published as: CN107943786A

Abstract

The invention discloses a Chinese named entity recognition method and system. The method comprises the following steps: S1, performing rule-matching-based entity recognition on the target text to obtain a first named entity set; S2, performing entity recognition on the target text using a statistical algorithm to obtain a second named entity set; and S3, cleaning the first named entity set and the second named entity set to obtain a recognition result. Because the target text is recognized separately by rule matching and by a statistical algorithm, and the two sets of results are then cleaned to produce the final recognition result, the method maintains the accuracy of Chinese entity recognition while greatly improving its recall. Recognition is automatic and fast, and the method can be widely applied in the field of text information processing.

Description

Chinese named entity recognition method and system

Technical Field

The invention relates to the field of computer application and information processing, in particular to a method and a system for identifying a Chinese named entity.

Background

A named entity is a basic information element of a target text and is the basis for understanding that text correctly. Chinese named entity recognition is an important basic tool in application fields such as information extraction, syntactic analysis, and machine learning, and plays an important role in bringing natural language processing technology into practical use. Chinese named entity recognition determines whether a character string represents a named entity. Within information extraction research, it is currently the technology with the most practical value. The common approach is recognition based purely on hidden Markov models and maximum entropy models.

At present, the naming of Chinese companies follows only weak wording rules, names are used loosely, and they often appear in abbreviated form. For example, "Bank of China Co., Ltd." often appears as "Bank of China" or "Zhongxing", which makes such names difficult to recognize and use. In general, recognizing the named entities of Chinese companies (hereinafter, Chinese named entities) faces the following difficulties: 1. The extension of an abbreviated name differs across fields and scenarios. 2. Some types of entity names change frequently and follow no strict rule. 3. The forms of expression are varied. 4. The number of names is huge; they cannot be enumerated and are difficult to record completely in a dictionary. In summary, when Chinese target text is processed, the recognition of Chinese named entities depends heavily on the quality of Chinese word segmentation, which in turn affects the analysis and processing of the target text, so that recall is low and recognition is slow.

Disclosure of Invention

In order to solve the above technical problems, the present invention provides a Chinese named entity recognition method and system.

The technical scheme adopted by the invention for solving the technical problems is as follows:

A Chinese named entity recognition method comprises the following steps:

S1, performing rule-matching-based entity recognition on the target text to obtain a first named entity set;

S2, performing entity recognition on the target text using a statistical algorithm to obtain a second named entity set;

and S3, cleaning the first named entity set and the second named entity set to obtain a recognition result.

Further, step S1 specifically includes:

S11, splitting the content of the target text into sentences;

S12, extracting content from the split target text based on punctuation rules;

S13, extracting content from the split target text based on syntactic template rules;

S14, extracting content from the split target text based on table features;

and S15, generating the first named entity set from all the named entities obtained by extraction.

Further, step S2 specifically includes:

S21, performing word segmentation on the target text;

S22, performing part-of-speech tagging on the word segmentation result based on a preset part-of-speech database;

and S23, performing statistical analysis on the part-of-speech tagging result using a statistical learning method based on a hidden Markov model, and generating the second named entity set from the named entities obtained by the analysis.
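Steps S21 and S22 might be sketched as follows. The patent does not name a segmentation algorithm, so forward maximum matching is used here purely as a stand-in. `POS_LEXICON` is a hypothetical toy version of the "preset part-of-speech database", and the tag names (`nt`, `v`, `a`, `n`, `x`) are illustrative assumptions.

```python
# Hypothetical toy "part-of-speech database": word -> tag (all entries assumed).
POS_LEXICON = {
    "万隆证券": "nt",  # organisation name
    "宣布": "v",
    "推出": "v",
    "新": "a",
    "产品": "n",
}
MAX_WORD_LEN = max(len(word) for word in POS_LEXICON)


def segment(text):
    """S21: forward maximum matching against the lexicon (an assumed algorithm)."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in POS_LEXICON:
                words.append(candidate)
                i += size
                break
    return words


def tag(words):
    """S22: look up each word's tag; unknown words get the placeholder tag 'x'."""
    return [(word, POS_LEXICON.get(word, "x")) for word in words]


tagged = tag(segment("万隆证券宣布推出新产品"))
```

The tagged sequence is what a statistical model in step S23 would consume; a production system would replace the toy lexicon with a full dictionary and a more robust segmenter.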

Further, step S3 specifically includes:

S31, cleaning the first named entity set and the second named entity set separately against a preset noise vocabulary library to eliminate noise words;

and S32, merging the cleaned first named entity set and the cleaned second named entity set to obtain the named entity recognition result.
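A minimal sketch of steps S31 and S32, assuming the two entity sets are plain Python sets of strings. `NOISE_WORDS` is a hypothetical noise vocabulary; the patent does not list its contents.

```python
# Hypothetical noise vocabulary; a real library would be curated per domain.
NOISE_WORDS = {"公司", "本公司", "有关部门"}


def clean(entities, noise_words=NOISE_WORDS):
    """S31: strip whitespace and drop candidates found in the noise vocabulary."""
    stripped = (entity.strip() for entity in entities)
    return {entity for entity in stripped if entity and entity not in noise_words}


def merge(first_set, second_set, noise_words=NOISE_WORDS):
    """S32: clean each set separately, then take the union of the two."""
    return clean(first_set, noise_words) | clean(second_set, noise_words)


result = merge({"万隆证券", "公司"}, {"中国银行", "万隆证券", " 本公司 "})
```

Taking the union rather than the intersection is what lets the combined result recover entities that only one of the two recognizers found, which matches the recall argument made in the text.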

The other technical scheme adopted by the invention for solving the technical problem is as follows:

A Chinese named entity recognition system comprises the following modules:

the first identification module is used for performing rule-matching-based entity recognition on the target text to obtain a first named entity set;

the second identification module is used for performing entity recognition on the target text using a statistical algorithm to obtain a second named entity set;

and the cleaning module is used for cleaning the first named entity set and the second named entity set to obtain a recognition result.

Further, the first identification module specifically includes:

a separation unit for separating the contents of the target text by sentences;

the first extraction unit is used for extracting the content of the separated target text based on punctuation rules;

the second extraction unit is used for extracting the content of the separated target text based on the syntactic template rule;

the third extraction unit is used for extracting the contents of the separated target text based on the table characteristics;

and the generating unit is used for generating a first named entity set from all the named entities obtained by extraction.

Further, the second identification module specifically includes:

the word segmentation processing unit is used for carrying out word segmentation processing on the target text;

the part-of-speech tagging unit is used for tagging the part of speech of the word segmentation result based on a preset part-of-speech database;

and the statistical analysis unit is used for performing statistical analysis on the part of speech tagging results based on a hidden Markov model statistical learning method, and generating a second named entity set from the named entities obtained by analysis.

Further, the cleaning module specifically includes:

the data cleaning unit is used for respectively cleaning the data of the first named entity set and the second named entity set according to a preset noise vocabulary library and eliminating noise vocabularies;

and the computing unit is used for taking the union of the cleaned first named entity set and the cleaned second named entity set as the named entity recognition result.

The beneficial effects of the method and system are as follows: the target text is recognized separately by rule matching and by a statistical algorithm, and the two sets of recognition results are then cleaned to obtain the final Chinese entity recognition result. This maintains the accuracy of Chinese entity recognition while greatly improving its recall, and because recognition is automatic, it is also fast.

Drawings

FIG. 1 is a flow chart of a Chinese named entity recognition method of the present invention;

FIG. 2 is a block diagram of the structure of the Chinese named entity recognition system of the present invention.

Detailed Description

Referring to FIG. 1, the invention provides a Chinese named entity recognition method comprising steps S1 to S3 described above.

The target text is the text on which Chinese named entity recognition is to be performed.

The method recognizes entities in the target text separately by rule matching and by a statistical algorithm, then cleans the two sets of results to obtain the final Chinese entity recognition result. This maintains recognition accuracy while greatly improving recall, and because recognition is automatic, it is also fast.

Further as a preferred embodiment, the step S1 specifically includes:

S11, splitting the content of the target text into sentences;

S12, extracting content from the split target text based on punctuation rules. For example, in some documents an entity name is customarily enclosed in double quotation marks or book-title marks, and in that case the name inside the double quotation marks or book-title marks is extracted. Corresponding punctuation rules can therefore be created according to people's usage habits. The punctuation rules record the punctuation marks associated with Chinese entity names and the corresponding extraction rules, and the extracted content serves as candidate Chinese entity names.

S13, extracting content from the split target text based on syntactic template rules. For example, since the subject preceding a verb such as "announce", "call", or "speak" is generally an entity name, corresponding syntactic template rules are created according to language habits. The syntactic template rules record the wording associated with Chinese entity names and the corresponding extraction rules, so that content can be extracted from the target text according to them.
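Steps S11 to S13 could be sketched with regular expressions as shown here. The sentence delimiters, the length limits, and the Chinese verbs (宣布, 表示, 称) are assumptions chosen for illustration; the patent gives no concrete rule set. Table-feature extraction (S14) is not covered.

```python
import re

# S11: split after Chinese sentence-final punctuation (an assumed delimiter set).
SENTENCE_END = re.compile(r"(?<=[。！？；])")

# S12: text enclosed in Chinese double quotation marks or book-title marks.
PUNCTUATION_RULES = [
    re.compile(r"“([^“”]{2,30})”"),
    re.compile(r"《([^《》]{2,30})》"),
]

# S13: sentence-initial subject before a reporting verb (hypothetical verb list).
TEMPLATE_RULES = [
    re.compile(r"^([^，。；：、“”《》\s]{2,20}?)(?:宣布|表示|称)"),
]


def split_sentences(text):
    """Return the non-empty sentences of the text, keeping end punctuation."""
    return [part.strip() for part in SENTENCE_END.split(text) if part.strip()]


def extract_by_rules(text):
    """Collect candidate entity names from every sentence using both rule types."""
    candidates = set()
    for sentence in split_sentences(text):
        for rule in PUNCTUATION_RULES:
            candidates.update(rule.findall(sentence))
        for rule in TEMPLATE_RULES:
            match = rule.match(sentence)
            if match:
                candidates.add(match.group(1))
    return candidates


sample = "万隆证券宣布推出《投资指南》。记者采访了“中国银行”的代表。"
candidates = extract_by_rules(sample)
```

The output is deliberately a set of candidates rather than confirmed entities: a quoted title such as 投资指南 is also captured, which is why a later cleaning step against a noise vocabulary is needed.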

Further as a preferred embodiment, the step S2 specifically includes:

S21, performing word segmentation on the target text;

and S23, performing statistical analysis on the part-of-speech tagging result using a statistical learning method based on a hidden Markov model, and generating the second named entity set from the named entities obtained by the analysis. In this step, the probabilities of the keywords that precede entity names are first counted from known, correct entity names, and entity names are then inferred from the keywords with high probability. In this way, recall is greatly improved without affecting the accuracy of the recognized Chinese entity names, the Chinese entity names in the text are recognized more completely, and because recognition is automatic, it is also fast.
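The patent describes the hidden Markov model step only loosely, so the following is one possible reading rather than the patented procedure. It assumes hidden states B/I/O (begin, inside, outside an entity name), part-of-speech tags as observations, parameters estimated by counting over labelled examples with add-one smoothing, and Viterbi decoding. All function names and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict

STATES = ("B", "I", "O")  # begin / inside / outside an entity name (assumed)


def train(tagged_sentences):
    """Estimate HMM log-probabilities by counting (pos_tag, state) sequences."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    vocab = set()
    for sentence in tagged_sentences:
        prev = None
        for pos, state in sentence:
            vocab.add(pos)
            emit[state][pos] += 1
            if prev is None:
                start[state] += 1
            else:
                trans[prev][state] += 1
            prev = state

    def log_prob(counter, key, size):
        # Add-one smoothing keeps unseen events from getting zero probability.
        return math.log((counter[key] + 1) / (sum(counter.values()) + size))

    n = len(STATES)
    return {
        "start": {s: log_prob(start, s, n) for s in STATES},
        "trans": {p: {s: log_prob(trans[p], s, n) for s in STATES} for p in STATES},
        "emit": lambda s, pos: log_prob(emit[s], pos, len(vocab) + 1),
    }


def viterbi(model, pos_tags):
    """Return the most probable state sequence for a sequence of POS tags."""
    if not pos_tags:
        return []
    scores = [{s: model["start"][s] + model["emit"](s, pos_tags[0]) for s in STATES}]
    back = []
    for pos in pos_tags[1:]:
        row, pointers = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: scores[-1][p] + model["trans"][p][s])
            row[s] = scores[-1][best] + model["trans"][best][s] + model["emit"](s, pos)
            pointers[s] = best
        scores.append(row)
        back.append(pointers)
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def extract_entities(words, states):
    """Join consecutive B/I words into entity names."""
    entities, current = [], ""
    for word, state in zip(words, states):
        if state == "I" and current:
            current += word
            continue
        if current:
            entities.append(current)
        current = word if state == "B" else ""
    if current:
        entities.append(current)
    return entities
```

A model trained on a handful of labelled sequences can then be applied to a new tagged sentence by calling `viterbi` on its tag sequence and passing the result, together with the words, to `extract_entities`. Decoding in log space avoids numeric underflow on long sentences.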

Further as a preferred embodiment, step S3 specifically includes steps S31 and S32 described above.

Referring to FIG. 2, the invention provides a Chinese named entity recognition system comprising the following modules:

a first identification module 100, configured to perform rule-matching-based entity recognition on the target text to obtain a first named entity set;

a second identification module 200, configured to perform entity recognition on the target text using a statistical algorithm to obtain a second named entity set;

and a cleaning module 300, configured to clean the first named entity set and the second named entity set to obtain the recognition result.

Further as a preferred embodiment, the first identification module 100 specifically includes:

a separation unit for separating the contents of the target text by sentences;

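Further as a preferred embodiment, the second identification module 200 specifically includes the word segmentation processing unit, the part-of-speech tagging unit, and the statistical analysis unit described above.

Further as a preferred embodiment, the cleaning module 300 specifically includes the data cleaning unit and the computing unit described above.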

The Chinese named entity recognition system can execute the Chinese named entity recognition method provided by the invention, can execute any combination of the implementation steps of the method embodiments, and has the corresponding functions and beneficial effects of the method.

While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A Chinese named entity recognition method is characterized by comprising the following steps:

S3, cleaning the first named entity set and the second named entity set to obtain a recognition result;

the step S1 specifically includes:

S11, splitting the content of the target text into sentences;

S15, generating the first named entity set from all the named entities obtained by extraction;

the step S3 specifically includes:

S32, merging the cleaned first named entity set and the cleaned second named entity set to serve as the named entity recognition result;

the punctuation rules are created according to people's usage habits;

the syntactic template rules are created according to language habits.

2. The Chinese named entity recognition method according to claim 1, wherein step S2 specifically includes:

S21, performing word segmentation on the target text;

3. A Chinese named entity recognition system is characterized by comprising the following modules:

the cleaning module is used for cleaning the first named entity set and the second named entity set to obtain an identification result;

the first identification module specifically includes:

a separation unit for separating the contents of the target text by sentences;

the generating unit is used for generating a first named entity set from all the named entities obtained by extraction;

the cleaning module specifically comprises:

4. The Chinese named entity recognition system according to claim 3, wherein the second identification module specifically includes: