CN107943786B - Chinese named entity recognition method and system - Google Patents
- Publication number
- CN107943786B (application CN201711137581.3A)
- Authority
- CN
- China
- Prior art keywords
- named entity
- target text
- entity set
- named
- chinese
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
- Character Discrimination (AREA)
Abstract
The invention discloses a Chinese named entity recognition method and system. The method comprises the following steps: S1, performing rule-matching-based entity recognition on a target text to obtain a first named entity set; S2, performing entity recognition on the target text using a statistical algorithm to obtain a second named entity set; and S3, cleaning the first named entity set and the second named entity set to obtain a recognition result. Because the target text is recognized separately by rule matching and by a statistical algorithm, and the two sets of results are then cleaned to produce the final Chinese entity recognition result, the method maintains recognition accuracy while greatly improving recall. Recognition is automatic and fast, and the method can be widely applied in the field of text information processing.
Description
Technical Field
The invention relates to the field of computer applications and information processing, and in particular to a Chinese named entity recognition method and system.
Background
Named entities are basic information elements of a target text and the basis for understanding it correctly. Chinese named entity recognition is an important basic tool in application fields such as information extraction, syntactic analysis, and machine learning, and plays an important role in bringing natural language processing technology into practical use. Chinese named entity recognition determines whether a character string represents a named entity. In information extraction research, it is currently the technology of greatest practical value. The common approach is recognition based purely on hidden Markov models and maximum entropy models.
At present, Chinese company names follow no strong wording rules. They are used loosely and often appear in abbreviated form; for example, "Chinese Bank Stock Limited Company" often appears as "Chinese Bank" or "Zhongxing", which makes such names difficult to recognize and apply. In general, recognizing the named entities of Chinese companies (hereinafter, Chinese named entities) faces the following difficulties: 1. The extension of an abbreviated name differs across fields and scenarios. 2. Some types of entity names change frequently and follow no strict rules. 3. The forms of expression are varied. 4. The names are so numerous that they cannot be enumerated and are difficult to record completely in a dictionary. In summary, when processing Chinese target text, the recognition of Chinese named entities is strongly affected by the quality of Chinese word segmentation, which in turn affects the analysis and processing of the target text, resulting in low recall and slow recognition.
Disclosure of Invention
In order to solve the above technical problems, the present invention provides a Chinese named entity recognition method and system.
The technical solution adopted by the invention to solve the technical problem is as follows:
A Chinese named entity recognition method comprises the following steps:
S1, performing rule-matching-based entity recognition on the target text to obtain a first named entity set;
S2, performing entity recognition on the target text using a statistical algorithm to obtain a second named entity set;
and S3, cleaning the first named entity set and the second named entity set to obtain a recognition result.
Further, step S1 specifically includes:
S11, splitting the content of the target text into sentences;
S12, extracting content from the split target text based on punctuation rules;
S13, extracting content from the split target text based on syntactic template rules;
S14, extracting content from the split target text based on table features;
and S15, generating the first named entity set from all the named entities obtained by extraction.
Further, step S2 specifically includes:
S21, performing word segmentation on the target text;
S22, performing part-of-speech tagging on the word segmentation result based on a preset part-of-speech database;
and S23, performing statistical analysis on the part-of-speech tagging result using a statistical learning method based on a hidden Markov model, and generating the second named entity set from the named entities obtained by the analysis.
Further, step S3 specifically includes:
S31, cleaning the first named entity set and the second named entity set separately against a preset noise vocabulary library to remove noise words;
and S32, taking the union of the cleaned first named entity set and the cleaned second named entity set as the named entity recognition result.
Another technical solution adopted by the invention to solve the technical problem is as follows:
A Chinese named entity recognition system comprises the following modules:
the first identification module is used for carrying out entity identification based on rule matching on the target text to obtain a first named entity set;
the second identification module is used for carrying out entity identification on the target text by adopting a statistical algorithm to obtain a second named entity set;
and the cleaning module is used for cleaning the first named entity set and the second named entity set to obtain an identification result.
Further, the first identification module specifically includes:
a separation unit for separating the contents of the target text by sentences;
the first extraction unit is used for extracting the content of the separated target text based on punctuation rules;
the second extraction unit is used for extracting the content of the separated target text based on the syntactic template rule;
the third extraction unit is used for extracting the contents of the separated target text based on the table characteristics;
and the generating unit is used for generating a first named entity set from all the named entities obtained by extraction.
Further, the second identification module specifically includes:
the word segmentation processing unit is used for carrying out word segmentation processing on the target text;
the part-of-speech tagging unit is used for performing part-of-speech tagging on the word segmentation result based on a preset part-of-speech database;
and the statistical analysis unit is used for performing statistical analysis on the part of speech tagging results based on a hidden Markov model statistical learning method, and generating a second named entity set from the named entities obtained by analysis.
Further, the cleaning module specifically includes:
the data cleaning unit is used for respectively cleaning the data of the first named entity set and the second named entity set according to a preset noise vocabulary library and eliminating noise vocabularies;
and the computing unit is used for taking the union of the cleaned first named entity set and the cleaned second named entity set as the named entity recognition result.
The beneficial effects of the method and system are as follows: the target text is recognized separately by rule matching and by a statistical algorithm, and the two sets of recognition results are then cleaned to obtain the final Chinese entity recognition result. This maintains recognition accuracy while greatly improving recall, and because recognition is automatic, it is also fast.
Drawings
FIG. 1 is a flow chart of a Chinese named entity recognition method of the present invention;
FIG. 2 is a block diagram of the structure of the Chinese named entity recognition system of the present invention.
Detailed Description
Referring to FIG. 1, the invention provides a Chinese named entity recognition method comprising the following steps:
S1, performing rule-matching-based entity recognition on the target text to obtain a first named entity set;
S2, performing entity recognition on the target text using a statistical algorithm to obtain a second named entity set;
and S3, cleaning the first named entity set and the second named entity set to obtain a recognition result.
The target text is the text on which Chinese named entity recognition is to be performed.
The method recognizes entities in the target text separately by rule matching and by a statistical algorithm, and then cleans the two sets of recognition results to obtain the final Chinese entity recognition result. This maintains recognition accuracy while greatly improving recall, and because recognition is automatic, it is also fast.
Further as a preferred embodiment, the step S1 specifically includes:
S11, splitting the content of the target text into sentences;
S12, extracting content from the split target text based on punctuation rules. For example, in some documents it is customary to enclose an entity name in double quotation marks or book title marks; in that case the name inside the double quotation marks or book title marks is extracted. Corresponding punctuation rules can therefore be created according to people's usage habits. The punctuation rules record the punctuation marks associated with Chinese entity names and the corresponding extraction rules, and the extracted content serves as candidate Chinese entity names.
S13, extracting content from the split target text based on syntactic template rules. For example, the subject preceding a verb such as "announce", "call", or "speak" is generally an entity name, so corresponding syntactic template rules are created according to language habits. The syntactic template rules record the wording associated with Chinese entity names and the corresponding extraction rules, so that content can be extracted from the target text according to them.
S14, extracting content from the split target text based on table features;
and S15, generating the first named entity set from all the named entities obtained by extraction.
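Steps S11 to S13 and S15 can be illustrated with a minimal Python sketch. The sentence terminators, the cue verbs (宣布, 表示, 指出), the 2-to-12-character limit on a subject, and all function names are assumptions chosen for illustration; the patent does not specify its actual rules, and step S14 (table features) is omitted because the text gives no detail on it.

```python
import re

# S11: a sentence is a run of characters up to an (assumed) terminator.
SENTENCE = re.compile(r"[^。！？；]+[。！？；]?")

# S12: names wrapped in double quotation marks or book title marks.
PUNCTUATION_RULES = [
    re.compile(r"“([^“”]+)”"),
    re.compile(r"《([^《》]+)》"),
]

# S13: hypothetical cue verbs; the short subject before them is taken as a name.
CUE_VERBS = ("宣布", "表示", "指出")
SYNTAX_RULE = re.compile(
    r"([^，。！？；、“”《》\s]{2,12}?)(?:" + "|".join(CUE_VERBS) + ")"
)


def split_sentences(text):
    """S11: split the target text into sentences."""
    return [s.strip() for s in SENTENCE.findall(text) if s.strip()]


def extract_by_punctuation(sentence):
    """S12: collect the content enclosed by the configured punctuation pairs."""
    return {m for rule in PUNCTUATION_RULES for m in rule.findall(sentence)}


def extract_by_syntax(sentence):
    """S13: collect the subject that directly precedes a cue verb."""
    return set(SYNTAX_RULE.findall(sentence))


def rule_based_entities(text):
    """S15: gather every extracted candidate into the first named entity set."""
    entities = set()
    for sentence in split_sentences(text):
        entities |= extract_by_punctuation(sentence)
        entities |= extract_by_syntax(sentence)
    return entities


# Made-up sample text, not taken from the patent.
sample = "“中国银行股份有限公司”发布公告。中国银行宣布下调利率。"
print(sorted(rule_based_entities(sample)))
```

The syntactic rule here is deliberately naive: it treats any short run of non-punctuation characters before a cue verb as the subject, so a real rule set would need word segmentation or tighter templates to avoid capturing leading adverbs.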
Further as a preferred embodiment, the step S2 specifically includes:
S21, performing word segmentation on the target text;
S22, performing part-of-speech tagging on the word segmentation result based on a preset part-of-speech database;
and S23, performing statistical analysis on the part-of-speech tagging result using a statistical learning method based on a hidden Markov model, and generating the second named entity set from the named entities obtained by the analysis. This step uses a statistical learning method based on a hidden Markov model: first, from known, correct entity names, the probability with which each preceding keyword occurs is counted; then entity names are inferred from the high-probability keywords. In this way recall is greatly improved without affecting the accuracy of the recognized Chinese entity names, the Chinese entity names in the text are identified more comprehensively, and because the names are obtained by automatic recognition, recognition is fast.
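The preceding-keyword statistic described in step S23 can be sketched as follows. This is a simplification and not a full hidden Markov model: it only estimates how often a word is directly followed by a known entity and then proposes the word after any high-probability cue word as a candidate. The input is assumed to be already segmented (step S21); the function names, the 0.5 threshold, and the toy training data are hypothetical.

```python
from collections import Counter


def learn_cue_words(tagged_sentences, min_prob=0.5):
    """Estimate P(next token is an entity | previous word) from labelled data.

    Each sentence is a list of (word, is_entity) pairs; the labels stand in
    for the known, correct entity names mentioned in the text.
    """
    seen = Counter()      # how often each word is followed by any token
    followed = Counter()  # how often it is followed by an entity
    for sentence in tagged_sentences:
        for (prev_word, _), (_, is_entity) in zip(sentence, sentence[1:]):
            seen[prev_word] += 1
            if is_entity:
                followed[prev_word] += 1
    probs = {word: followed[word] / seen[word] for word in seen}
    # Keep only the high-probability cue words.
    return {word: p for word, p in probs.items() if p >= min_prob}


def propose_entities(words, cue_words):
    """Return the words that directly follow a high-probability cue word."""
    return {nxt for prev, nxt in zip(words, words[1:]) if prev in cue_words}


# Toy training data (made up for illustration).
training = [
    [("投资", False), ("中国银行", True), ("股票", False)],
    [("收购", False), ("招商银行", True)],
    [("投资", False), ("风险", False)],
]
cues = learn_cue_words(training)
print(propose_entities(["公司", "收购", "平安银行"], cues))
```

A full hidden Markov model would additionally estimate tag transition and emission probabilities over the part-of-speech sequence from step S22 and decode the most likely tag sequence, for example with the Viterbi algorithm.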
Further as a preferred embodiment, the step S3 specifically includes:
S31, cleaning the first named entity set and the second named entity set separately against a preset noise vocabulary library to remove noise words;
and S32, taking the union of the cleaned first named entity set and the cleaned second named entity set as the named entity recognition result.
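Steps S31 and S32 amount to a set difference followed by a set union. The sketch below assumes that the noise vocabulary library is a plain set of strings and that cleaning means exact-match removal; the function names and sample values are illustrative only.

```python
def clean(entities, noise_vocabulary):
    """S31: drop every candidate that appears in the noise vocabulary."""
    return {entity for entity in entities if entity not in noise_vocabulary}


def merge(first_set, second_set, noise_vocabulary):
    """S31 and S32: clean each set separately, then take the union."""
    return clean(first_set, noise_vocabulary) | clean(second_set, noise_vocabulary)


# Made-up candidate sets and noise words.
first = {"中国银行", "公告"}
second = {"中国银行", "平安银行", "公司"}
noise = {"公告", "公司"}
print(sorted(merge(first, second, noise)))
```

Taking the union rather than the intersection is what lets the combined result recover entities that only one of the two recognizers found, which matches the recall improvement the text describes.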
Referring to FIG. 2, the invention provides a Chinese named entity recognition system comprising the following modules:
a first identification module 100, configured to perform rule-matching-based entity identification on the target text to obtain a first named entity set;
a second identification module 200, configured to perform entity identification on the target text using a statistical algorithm to obtain a second named entity set;
and a cleaning module 300, configured to clean the first named entity set and the second named entity set to obtain the recognition result.
Further as a preferred embodiment, the first identification module 100 specifically includes:
a separation unit for separating the contents of the target text by sentences;
the first extraction unit is used for extracting the content of the separated target text based on punctuation rules;
the second extraction unit is used for extracting the content of the separated target text based on the syntactic template rule;
the third extraction unit is used for extracting the contents of the separated target text based on the table characteristics;
and the generating unit is used for generating a first named entity set from all the named entities obtained by extraction.
Further as a preferred embodiment, the second identification module 200 specifically includes:
the word segmentation processing unit is used for carrying out word segmentation processing on the target text;
the part-of-speech tagging unit is used for performing part-of-speech tagging on the word segmentation result based on a preset part-of-speech database;
and the statistical analysis unit is used for performing statistical analysis on the part of speech tagging results based on a hidden Markov model statistical learning method, and generating a second named entity set from the named entities obtained by analysis.
Further as a preferred embodiment, the cleaning module 300 specifically includes:
the data cleaning unit is used for respectively cleaning the data of the first named entity set and the second named entity set according to a preset noise vocabulary library and eliminating noise vocabularies;
and the computing unit is used for taking the union of the cleaned first named entity set and the cleaned second named entity set as the named entity recognition result.
The Chinese named entity recognition system can execute the Chinese named entity recognition method provided by the invention, including any combination of the implementation steps of the method embodiments, and has the corresponding functions and beneficial effects of the method.
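One way to picture how modules 100, 200, and 300 fit together is a small container that holds three pluggable callables. The class name, type aliases, and stub modules below are assumptions for illustration and are not part of the patent.

```python
from dataclasses import dataclass
from typing import Callable, Set

Recognizer = Callable[[str], Set[str]]
Cleaner = Callable[[Set[str], Set[str]], Set[str]]


@dataclass
class ChineseNERSystem:
    first_module: Recognizer    # module 100: rule matching
    second_module: Recognizer   # module 200: statistical algorithm
    cleaning_module: Cleaner    # module 300: cleaning and union

    def recognize(self, target_text: str) -> Set[str]:
        # Both recognizers read the same target text independently.
        first_set = self.first_module(target_text)
        second_set = self.second_module(target_text)
        return self.cleaning_module(first_set, second_set)


# Stub modules standing in for real implementations.
NOISE = {"公告"}
system = ChineseNERSystem(
    first_module=lambda text: {"中国银行", "公告"},
    second_module=lambda text: {"平安银行"},
    cleaning_module=lambda a, b: (a - NOISE) | (b - NOISE),
)
print(sorted(system.recognize("任意目标文本")))
```

Keeping the modules as plain callables means either recognizer can be swapped or tested in isolation without touching the cleaning logic.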
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (4)
1. A Chinese named entity recognition method is characterized by comprising the following steps:
S1, performing rule-matching-based entity recognition on the target text to obtain a first named entity set;
S2, performing entity recognition on the target text using a statistical algorithm to obtain a second named entity set;
S3, cleaning the first named entity set and the second named entity set to obtain a recognition result;
wherein step S1 specifically includes:
S11, splitting the content of the target text into sentences;
S12, extracting content from the split target text based on punctuation rules;
S13, extracting content from the split target text based on syntactic template rules;
S14, extracting content from the split target text based on table features;
S15, generating the first named entity set from all the named entities obtained by extraction;
step S3 specifically includes:
S31, cleaning the first named entity set and the second named entity set separately against a preset noise vocabulary library to remove noise words;
S32, taking the union of the cleaned first named entity set and the cleaned second named entity set as the named entity recognition result;
the punctuation rules are created according to people's usage habits;
the syntactic template rules are created according to language habits.
2. The Chinese named entity recognition method according to claim 1, wherein step S2 specifically includes:
S21, performing word segmentation on the target text;
S22, performing part-of-speech tagging on the word segmentation result based on a preset part-of-speech database;
and S23, performing statistical analysis on the part-of-speech tagging result using a statistical learning method based on a hidden Markov model, and generating the second named entity set from the named entities obtained by the analysis.
3. A Chinese named entity recognition system is characterized by comprising the following modules:
the first identification module is used for carrying out entity identification based on rule matching on the target text to obtain a first named entity set;
the second identification module is used for carrying out entity identification on the target text by adopting a statistical algorithm to obtain a second named entity set;
the cleaning module is used for cleaning the first named entity set and the second named entity set to obtain an identification result;
the first identification module specifically includes:
a separation unit for separating the contents of the target text by sentences;
the first extraction unit is used for extracting the content of the separated target text based on punctuation rules;
the second extraction unit is used for extracting the content of the separated target text based on the syntactic template rule;
the third extraction unit is used for extracting the contents of the separated target text based on the table characteristics;
the generating unit is used for generating a first named entity set from all the named entities obtained by extraction;
the cleaning module specifically comprises:
the data cleaning unit is used for respectively cleaning the data of the first named entity set and the second named entity set according to a preset noise vocabulary library and eliminating noise vocabularies;
and the computing unit is used for taking the union of the cleaned first named entity set and the cleaned second named entity set as the named entity recognition result.
4. The Chinese named entity recognition system according to claim 3, wherein the second identification module specifically includes:
the word segmentation processing unit is used for carrying out word segmentation processing on the target text;
the part-of-speech tagging unit is used for performing part-of-speech tagging on the word segmentation result based on a preset part-of-speech database;
and the statistical analysis unit is used for performing statistical analysis on the part of speech tagging results based on a hidden Markov model statistical learning method, and generating a second named entity set from the named entities obtained by analysis.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711137581.3A CN107943786B (en) | 2017-11-16 | 2017-11-16 | Chinese named entity recognition method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711137581.3A CN107943786B (en) | 2017-11-16 | 2017-11-16 | Chinese named entity recognition method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107943786A (en) | 2018-04-20 |
CN107943786B (en) | 2021-12-07 |
Family
ID=61931531
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711137581.3A Active CN107943786B (en) | 2017-11-16 | 2017-11-16 | Chinese named entity recognition method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107943786B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108647194B (en) * | 2018-04-28 | 2022-04-19 | 北京神州泰岳软件股份有限公司 | Information extraction method and device |
WO2020133291A1 (en) * | 2018-12-28 | 2020-07-02 | 深圳市优必选科技有限公司 | Text entity recognition method and apparatus, computer device, and storage medium |
CN111382570B (en) * | 2018-12-28 | 2024-05-03 | 深圳市优必选科技有限公司 | Text entity recognition method, device, computer equipment and storage medium |
CN110008307B (en) * | 2019-01-18 | 2021-12-28 | 中国科学院信息工程研究所 | Method and device for identifying deformed entity based on rules and statistical learning |
CN110750991B (en) * | 2019-09-18 | 2022-04-15 | 平安科技(深圳)有限公司 | Entity identification method, device, equipment and computer readable storage medium |
CN111488467B (en) * | 2020-04-30 | 2022-04-05 | 北京建筑大学 | Construction method and device of geographical knowledge graph, storage medium and computer equipment |
CN112926333A (en) * | 2021-04-09 | 2021-06-08 | 平安科技(深圳)有限公司 | Entity identification method and device, electronic equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1910573A (en) * | 2003-12-31 | 2007-02-07 | 新加坡科技研究局 | System for identifying and classifying denomination entity |
EP1783744A1 (en) * | 2005-11-03 | 2007-05-09 | Robert Bosch Corporation | Unified treatment of data-sparseness and data-overfitting in maximum entropy modeling |
CN102314417A (en) * | 2011-09-22 | 2012-01-11 | 西安电子科技大学 | Method for identifying Web named entity based on statistical model |
CN103942347A (en) * | 2014-05-19 | 2014-07-23 | 焦点科技股份有限公司 | Word separating method based on multi-dimensional comprehensive lexicon |
CN105302794A (en) * | 2015-10-30 | 2016-02-03 | 苏州大学 | Chinese homodigital event recognition method and system |
CN106055545A (en) * | 2015-04-10 | 2016-10-26 | 穆西格马交易方案私人有限公司 | Text mining system and tool |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060047500A1 (en) * | 2004-08-31 | 2006-03-02 | Microsoft Corporation | Named entity recognition using compiler methods |
CN102103594A (en) * | 2009-12-22 | 2011-06-22 | 北京大学 | Character data recognition and processing method and device |
CN103268348B (en) * | 2013-05-28 | 2016-08-10 | 中国科学院计算技术研究所 | A kind of user's query intention recognition methods |
CN103995885B (en) * | 2014-05-29 | 2017-11-17 | 百度在线网络技术(北京)有限公司 | The recognition methods of physical name and device |
CN105808523A (en) * | 2016-03-08 | 2016-07-27 | 浪潮软件股份有限公司 | Method and apparatus for identifying document |
CN105843875B (en) * | 2016-03-18 | 2019-09-13 | 北京光年无限科技有限公司 | A kind of question and answer data processing method and device towards intelligent robot |
Also Published As
Publication number | Publication date |
---|---|
CN107943786A (en) | 2018-04-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107943786B (en) | Chinese named entity recognition method and system | |
US10176804B2 (en) | Analyzing textual data | |
CN106328147B (en) | Speech recognition method and device | |
CN110727880B (en) | Sensitive corpus detection method based on word bank and word vector model | |
CN111191022B (en) | Commodity short header generation method and device | |
CN104408078A (en) | Construction method for key word-based Chinese-English bilingual parallel corpora | |
CN110502738A (en) | Chinese name entity recognition method, device, equipment and inquiry system | |
CN109637537B (en) | Method for automatically acquiring annotated data to optimize user-defined awakening model | |
US9589563B2 (en) | Speech recognition of partial proper names by natural language processing | |
WO2015149533A1 (en) | Method and device for word segmentation processing on basis of webpage content classification | |
WO2003010754A1 (en) | Speech input search system | |
Favre et al. | Robust named entity extraction from large spoken archives | |
CN101952824A (en) | Method and information retrieval system that the document in the database is carried out index and retrieval that computing machine is carried out | |
CN106570180A (en) | Artificial intelligence based voice searching method and device | |
CN114556328A (en) | Data processing method and device, electronic equipment and storage medium | |
WO2014117553A1 (en) | Method and system of adding punctuation and establishing language model | |
CN113626598B (en) | Video text generation method, device, equipment and storage medium | |
CN108052630B (en) | Method for extracting expansion words based on Chinese education videos | |
Ali et al. | Advances in dialectal arabic speech recognition: A study using twitter to improve egyptian asr | |
KR20180092733A (en) | Generating method of relation extraction training data | |
CN111881297A (en) | Method and device for correcting voice recognition text | |
CN108345694B (en) | Document retrieval method and system based on theme database | |
CN103744837B (en) | Many texts contrast method based on keyword abstraction | |
CN112069312A (en) | Text classification method based on entity recognition and electronic device | |
Bigot et al. | Person name recognition in ASR outputs using continuous context models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||