CN105320645B - The recognition methods of Chinese enterprise name - Google Patents
- Publication number
- CN105320645B (application number CN201510614480.5A)
- Authority
- CN
- China
- Prior art keywords
- name
- enterprise
- vocabulary
- word
- proper
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Character Discrimination (AREA)
- Machine Translation (AREA)
Abstract
A method for recognizing Chinese enterprise names, comprising the following steps: establishing an enterprise name knowledge base and an enterprise name probability knowledge base, where the enterprise name knowledge base contains a place-name word set, an enterprise general-term word set, an industry modifier word set, and an enterprise proper-name forbidden word set, and the enterprise name probability knowledge base contains probability knowledge of individual Chinese characters forming an enterprise proper name together with left-adjacent word probability knowledge for enterprise names; scanning the text and segmenting it into words; and separately recognizing enterprise names that begin with a place-name modifier and enterprise names that do not. The method can speed up recognition within documents and improve the accuracy of enterprise name recognition.
Description
Technical field
The present invention relates to the technical field of the Internet, and specifically to a method for recognizing Chinese enterprise names by determining their left and right boundaries.
Background art
Unknown word recognition is a key technology in natural language processing and is widely used in fields such as information extraction, information retrieval, automatic question answering, and machine translation. Collecting information from the Internet requires extracting Chinese enterprise names. Chinese enterprise names are a kind of unknown (out-of-vocabulary) word: their composition is complex, they are enormous in number, they are constantly changing and being updated, and they cannot be exhaustively enumerated. They are considered the hardest proper nouns to recognize and cause considerable difficulty for natural language processing, especially machine translation and machine understanding.
Domestic research on Chinese enterprise name recognition mainly includes: using a hidden Markov model combined with probability estimation formulas to evaluate how likely a string in real text is to constitute an enterprise name; an automatic Chinese organization name recognition algorithm based on cascaded conditional random field models; and an automatic Chinese organization name recognition method based on class-based language models.
A Chinese enterprise name often consists of several different words or phrases, and its composition is relatively rich. The arbitrariness of the characters and words used in enterprise names, together with the uncertainty of name length, makes Chinese enterprise names difficult to recognize, and recognition rates remain low.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for recognizing Chinese enterprise names by determining their left and right boundaries.
The technical solution adopted by the present invention to solve the technical problems in the known art is as follows:
The method for recognizing Chinese enterprise names of the present invention comprises the following steps:
A. Establish an enterprise name knowledge base comprising a place-name word set, an enterprise general-term word set, an industry modifier word set, and an enterprise proper-name forbidden word set, which contain place-name words, enterprise general-term words, industry modifier words, and enterprise proper-name forbidden words, respectively;
B. Establish an enterprise name probability knowledge base, which includes probability knowledge of individual Chinese characters forming an enterprise proper name;
C. Scan the text and perform Chinese word segmentation on it;
D. When a place-name word appears during the scan, continue scanning the words that follow; if, after 2 to 5 Chinese characters, an industry modifier word appears and is immediately followed by an enterprise general-term word, trigger enterprise name recognition;
E. Determine whether the Chinese characters between the place-name word and the industry modifier word contain an enterprise proper-name forbidden word; if so, terminate recognition; if not, aggregate the probabilities of these characters forming an enterprise proper name into a weighted proper-name probability result;
F. Determine whether the weighted proper-name probability result is greater than a threshold; if it is, identify the entire Chinese segment from the place name through the final enterprise general term as a Chinese enterprise name; otherwise, terminate recognition;
G. Organize and output the recognition result as an "enterprise name beginning with a place-name modifier".
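Steps C to G above can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the text is assumed to be already segmented into a list of words, the word sets and per-character probabilities are tiny hypothetical samples, `INDUSTRY_MODIFIERS` stands in for the industry modifier (decorative name) word set, and the "weighted" aggregation, which the source does not specify, is assumed here to be an arithmetic mean.

```python
from typing import Dict, List, Set

# Hypothetical toy knowledge base; real word sets would be far larger.
PLACE_NAMES: Set[str] = {"北京", "天津"}
INDUSTRY_MODIFIERS: Set[str] = {"科技", "信息技术"}
GENERAL_TERMS: Set[str] = {"有限公司", "集团"}
FORBIDDEN_WORDS: Set[str] = {"的", "和"}
# Hypothetical probabilities that a character appears in a proper name.
CHAR_PROB: Dict[str, float] = {"华": 0.8, "兴": 0.7, "一": 0.1, "家": 0.1}
DEFAULT_PROB = 0.05  # assumed back-off value for unseen characters


def proper_name_score(chars: str) -> float:
    """Assumed weighting: arithmetic mean of per-character probabilities."""
    return sum(CHAR_PROB.get(c, DEFAULT_PROB) for c in chars) / len(chars)


def recognize_place_prefixed(tokens: List[str], threshold: float = 0.5) -> List[str]:
    """Return enterprise names that begin with a place name (steps D to G)."""
    results = []
    for i, tok in enumerate(tokens):
        if tok not in PLACE_NAMES:
            continue
        # Step D: gather words after the place name until an industry modifier.
        j = i + 1
        proper_tokens: List[str] = []
        while j < len(tokens) and tokens[j] not in INDUSTRY_MODIFIERS:
            proper_tokens.append(tokens[j])
            j += 1
            if sum(len(t) for t in proper_tokens) > 5:
                break
        proper = "".join(proper_tokens)
        if not 2 <= len(proper) <= 5:
            continue
        # The modifier must be immediately followed by a general term.
        if (j + 1 >= len(tokens)
                or tokens[j] not in INDUSTRY_MODIFIERS
                or tokens[j + 1] not in GENERAL_TERMS):
            continue
        # Step E: a forbidden word inside the candidate proper name aborts.
        if any(word in proper for word in FORBIDDEN_WORDS):
            continue
        # Step F: accept only if the aggregated probability exceeds the threshold.
        if proper_name_score(proper) > threshold:
            results.append("".join(tokens[i:j + 2]))
    return results


if __name__ == "__main__":
    words = ["位于", "北京", "华兴", "科技", "有限公司", "今天", "上市"]
    print(recognize_place_prefixed(words))
```

The threshold of 0.5 is an arbitrary placeholder; the source only states that a threshold is applied.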
For enterprise names that do not begin with a place-name modifier, the method for recognizing Chinese enterprise names of the present invention comprises the following steps:
A. Establish an enterprise name knowledge base comprising a place-name word set, an enterprise general-term word set, an industry modifier word set, and an enterprise proper-name forbidden word set, which contain place-name words, enterprise general-term words, industry modifier words, and enterprise proper-name forbidden words, respectively;
B. Collect statistics over news and information data to obtain left-adjacent word probability knowledge for enterprise names; establish an enterprise name probability knowledge base, which includes probability knowledge of individual Chinese characters forming an enterprise proper name and the left-adjacent word probability knowledge for enterprise names;
C. Scan the text and perform Chinese word segmentation on it;
D. When an industry modifier word is encountered during the scan, continue scanning to check whether an enterprise general-term word immediately follows; if it does, and the current word has not already been recognized as part of an "enterprise name beginning with a place-name modifier", trigger enterprise name recognition;
E. Starting from the industry modifier word, scan leftward one word at a time and determine whether the word on the left is in the enterprise proper-name forbidden word set; if so, terminate recognition;
F. Take the Chinese characters of the left word from step E and compute the aggregated, weighted probability that they form an enterprise proper name; at the same time, obtain the "enterprise name left-adjacent word probability" of the word further to the left, and, according to a hidden Markov probability model, compute the recognition probability of the entire enterprise name with the current left word as its proper name;
G. Continue scanning one more word to the left, merge it with the left word from step E, treat the merged result as the enterprise proper name, and repeat step F, stopping once the proper name exceeds 5 Chinese characters;
H. From the multiple recognition probabilities obtained in step G, discard results whose probability is below the threshold and select the result with the highest probability as the final recognition result;
I. Organize and output the final recognition result.
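The leftward scan in steps D to H can be sketched as below. The source does not give the hidden Markov formula, so this sketch substitutes a deliberately simplified score: the mean per-character probability of the candidate proper name multiplied by the left-adjacent word probability of the word to its left. That product, the `<BOS>` sentence-start marker, the default probabilities, the threshold, and all sample data are assumptions made for illustration. A forbidden word is treated here as ending the leftward scan, which is one possible reading of "terminate recognition".

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy knowledge base.
INDUSTRY_MODIFIERS: Set[str] = {"科技"}
GENERAL_TERMS: Set[str] = {"有限公司"}
FORBIDDEN_WORDS: Set[str] = {"的", "和"}
CHAR_PROB: Dict[str, float] = {"华": 0.8, "兴": 0.7}
# Hypothetical probabilities that a word directly precedes an enterprise name.
LEFT_NEIGHBOR_PROB: Dict[str, float] = {"收购": 0.7, "<BOS>": 0.5}
DEFAULT_CHAR_PROB = 0.05  # assumed back-off values
DEFAULT_LEFT_PROB = 0.1


def char_score(proper: str) -> float:
    """Assumed weighting: arithmetic mean of per-character probabilities."""
    return sum(CHAR_PROB.get(c, DEFAULT_CHAR_PROB) for c in proper) / len(proper)


def recognize_without_place(tokens: List[str], threshold: float = 0.3) -> Optional[str]:
    """Return the best-scoring enterprise name found by scanning leftward."""
    for m in range(len(tokens) - 1):
        # Step D: industry modifier immediately followed by a general term.
        if tokens[m] not in INDUSTRY_MODIFIERS or tokens[m + 1] not in GENERAL_TERMS:
            continue
        candidates: List[Tuple[float, str]] = []
        proper_tokens: List[str] = []
        k = m - 1
        while k >= 0:
            # Step E: stop at a forbidden word.
            if tokens[k] in FORBIDDEN_WORDS:
                break
            # Step G: extend the candidate proper name one word to the left.
            proper_tokens.insert(0, tokens[k])
            proper = "".join(proper_tokens)
            if len(proper) > 5:
                break
            # Step F: combine character evidence with left-neighbor evidence
            # (simplified stand-in for the hidden Markov computation).
            left = tokens[k - 1] if k > 0 else "<BOS>"
            score = char_score(proper) * LEFT_NEIGHBOR_PROB.get(left, DEFAULT_LEFT_PROB)
            candidates.append((score, proper + tokens[m] + tokens[m + 1]))
            k -= 1
        # Step H: drop low-probability candidates and keep the best one.
        kept = [c for c in candidates if c[0] >= threshold]
        if kept:
            return max(kept)[1]
    return None


if __name__ == "__main__":
    print(recognize_without_place(["昨日", "收购", "华兴", "科技", "有限公司"]))
```

In this toy run the two-character candidate wins because extending it leftward pulls in characters with low proper-name probability and a weaker left neighbor.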
The advantages and positive effects of the present invention are as follows:
The method for recognizing Chinese enterprise names of the present invention comprises the following steps: establishing an enterprise name knowledge base and an enterprise name probability knowledge base, where the enterprise name knowledge base contains a place-name word set, an enterprise general-term word set, an industry modifier word set, and an enterprise proper-name forbidden word set, and the enterprise name probability knowledge base contains probability knowledge of individual Chinese characters forming an enterprise proper name together with left-adjacent word probability knowledge for enterprise names; scanning the text and segmenting it into words; and separately recognizing enterprise names that begin with a place-name modifier and enterprise names that do not. The method can speed up recognition within documents and improve the accuracy of enterprise name recognition.
Specific embodiments
The present invention is described in detail below with reference to embodiments:
The method for recognizing Chinese enterprise names of the present invention comprises the following steps:
A. Establish an enterprise name knowledge base comprising a place-name word set, an enterprise general-term word set, an industry modifier word set, and an enterprise proper-name forbidden word set, which contain place-name words, enterprise general-term words, industry modifier words, and enterprise proper-name forbidden words, respectively;
B. Establish an enterprise name probability knowledge base, which includes probability knowledge of individual Chinese characters forming an enterprise proper name. This knowledge covers more than 3,600 common Chinese characters; the probability of each character forming an enterprise proper name is obtained by statistics over a directory of more than 10 million enterprises;
C. Scan the text and perform Chinese word segmentation on it;
D. When a place-name word appears during the scan, continue scanning the words that follow; if, after 2 to 5 Chinese characters (an enterprise proper name is usually 2 to 5 characters long), an industry modifier word appears and is immediately followed by an enterprise general-term word, trigger enterprise name recognition;
E. Determine whether the Chinese characters between the place-name word and the industry modifier word contain an enterprise proper-name forbidden word; if so, terminate recognition; if not, aggregate the probabilities of these characters forming an enterprise proper name into a weighted proper-name probability result;
F. Determine whether the weighted proper-name probability result is greater than a threshold; if it is, identify the entire Chinese segment from the place name through the final enterprise general term as a Chinese enterprise name; otherwise, terminate recognition;
G. Organize and output the recognition result as an "enterprise name beginning with a place-name modifier".
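Step B above says only that the per-character probabilities are obtained by statistics over a large enterprise directory; the estimator itself is not given. As one plausible sketch, the probability of a character can be taken as the fraction of directory proper names that contain it. The function name and the four sample proper names below are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def estimate_char_probs(proper_names: Iterable[str]) -> Dict[str, float]:
    """Estimate, for each character, the share of proper names containing it.

    This relative-frequency estimator is an assumption; the source only says
    the probabilities come from statistics over an enterprise directory.
    """
    counts: Counter = Counter()
    total = 0
    for name in proper_names:
        total += 1
        counts.update(set(name))  # count each character once per name
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical directory of enterprise proper names.
    directory = ["华兴", "华达", "盛兴", "永泰"]
    probs = estimate_char_probs(directory)
    print(probs["华"], probs["永"])
```

With a real directory, characters outside the covered set would still need a back-off value at recognition time.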
For enterprise names that do not begin with a place-name modifier, the method for recognizing Chinese enterprise names of the present invention comprises the following steps:
A. Establish an enterprise name knowledge base comprising a place-name word set, an enterprise general-term word set, an industry modifier word set, and an enterprise proper-name forbidden word set, which contain place-name words, enterprise general-term words, industry modifier words, and enterprise proper-name forbidden words, respectively;
B. Collect statistics over news and information data to obtain left-adjacent word probability knowledge for enterprise names; establish an enterprise name probability knowledge base, which includes probability knowledge of individual Chinese characters forming an enterprise proper name and the left-adjacent word probability knowledge for enterprise names;
C. Scan the text and perform Chinese word segmentation on it;
D. When an industry modifier word is encountered during the scan (several industry modifiers may appear together, as in "the great industry premises in day source industry discipline Co., Ltd", and a place-name modifier may also appear inside the name, as in "letter and wealth management of investment (Beijing) Co., Ltd"), continue scanning to check whether an enterprise general-term word immediately follows; if it does, and the current word has not already been recognized as part of an "enterprise name beginning with a place-name modifier", trigger enterprise name recognition;
E. Starting from the industry modifier word, scan leftward one word at a time and determine whether the word on the left is in the enterprise proper-name forbidden word set; if so, terminate recognition;
F. Take the Chinese characters of the left word from step E and compute the aggregated, weighted probability that they form an enterprise proper name; at the same time, obtain the "enterprise name left-adjacent word probability" of the word further to the left, and, according to a hidden Markov probability model, compute the recognition probability of the entire enterprise name with the current left word as its proper name;
G. Continue scanning one more word to the left, merge it with the left word from step E, treat the merged result as the enterprise proper name, and repeat step F, stopping once the proper name exceeds 5 Chinese characters;
H. From the multiple recognition probabilities obtained in step G, discard results whose probability is below the threshold and select the result with the highest probability as the final recognition result;
I. Organize and output the final recognition result.
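Step B above obtains the left-adjacent word probability knowledge from news data, but the counting procedure is not spelled out. A minimal sketch, assuming each training sample is a segmented sentence plus the index at which a recognized enterprise name starts, is to count the word immediately to the left of each name and normalize by the number of names. The `<BOS>` marker for a name at the start of a sentence, the function name, and the sample sentences are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

BOS = "<BOS>"  # hypothetical marker for "no word to the left"


def estimate_left_neighbor_probs(
    samples: Iterable[Tuple[List[str], int]]
) -> Dict[str, float]:
    """Estimate how often each word directly precedes an enterprise name.

    Each sample is (segmented sentence, start index of the enterprise name).
    Normalizing by the number of names is an assumed estimator.
    """
    counts: Counter = Counter()
    for tokens, start in samples:
        left = tokens[start - 1] if start > 0 else BOS
        counts[left] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical news sentences with the start index of each recognized name.
    news = [
        (["昨日", "收购", "北京", "华兴", "科技", "有限公司"], 2),
        (["收购", "天津", "盛兴", "科技", "有限公司"], 1),
        (["北京", "华达", "科技", "有限公司", "上市"], 0),
        (["投资", "北京", "永泰", "科技", "有限公司"], 1),
    ]
    print(estimate_left_neighbor_probs(news))
```

The start indices could come from names already recognized by the place-name-prefixed procedure, which is how the claims describe collecting this knowledge.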
The above is only a preferred embodiment of the present invention and does not limit the invention in any form. Although the invention has been disclosed by way of a preferred embodiment, this is not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the invention, use the technical content disclosed above to make minor changes or modifications that amount to equivalent embodiments. Any simple modification, equivalent change, or refinement made to the above embodiments in accordance with the technical essence of the invention, without departing from the content of its technical solution, falls within the scope of the technical solution of the present invention.
Claims (1)
1. A method for recognizing Chinese enterprise names, comprising the following steps:
A. Establish an enterprise name knowledge base comprising a place-name word set, an enterprise general-term word set, an industry modifier word set, and an enterprise proper-name forbidden word set, which contain place-name words, enterprise general-term words, industry modifier words, and enterprise proper-name forbidden words, respectively;
B. Obtain left-adjacent word probability knowledge for enterprise names by collecting statistics over news and information data through the following steps:
(1) Establish an enterprise name knowledge base comprising a place-name word set, an enterprise general-term word set, an industry modifier word set, and an enterprise proper-name forbidden word set, which contain place-name words, enterprise general-term words, industry modifier words, and enterprise proper-name forbidden words, respectively;
(2) Establish an enterprise name probability knowledge base, which includes probability knowledge of individual Chinese characters forming an enterprise proper name;
(3) Scan the text and perform Chinese word segmentation on it;
(4) When a place-name word appears during the scan, continue scanning the words that follow; if, after 2 to 5 Chinese characters, an industry modifier word appears and is immediately followed by an enterprise general-term word, trigger enterprise name recognition;
(5) Determine whether the Chinese characters between the place-name word and the industry modifier word contain an enterprise proper-name forbidden word; if so, terminate recognition; if not, aggregate the probabilities of these characters forming an enterprise proper name into a weighted proper-name probability result;
(6) Determine whether the weighted proper-name probability result is greater than a threshold; if it is, identify the entire Chinese segment from the place name through the final enterprise general term as a Chinese enterprise name; otherwise, terminate recognition;
(7) Organize and output the recognition result as an "enterprise name beginning with a place-name modifier";
Establish an enterprise name probability knowledge base, which includes probability knowledge of individual Chinese characters forming an enterprise proper name and the left-adjacent word probability knowledge for enterprise names;
C. Scan the text and perform Chinese word segmentation on it;
D. When an industry modifier word is encountered during the scan, continue scanning to check whether an enterprise general-term word immediately follows; if it does, and the current word has not already been recognized as part of an "enterprise name beginning with a place-name modifier", trigger enterprise name recognition;
E. Starting from the industry modifier word, scan leftward one word at a time and determine whether the word on the left is in the enterprise proper-name forbidden word set; if so, terminate recognition;
F. Take the Chinese characters of the left word from step E and compute the aggregated, weighted probability that they form an enterprise proper name; at the same time, obtain the "enterprise name left-adjacent word probability" of the word further to the left, and, according to a hidden Markov probability model, compute the recognition probability of the entire enterprise name with the current left word as its proper name;
G. Continue scanning one more word to the left, merge it with the left word from step E, treat the merged result as the enterprise proper name, and repeat step F, stopping once the proper name exceeds 5 Chinese characters;
H. From the multiple recognition probabilities obtained in step G, discard results whose probability is below the threshold and select the result with the highest probability as the final recognition result;
I. Organize and output the final recognition result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510614480.5A CN105320645B (en) | 2015-09-24 | 2015-09-24 | The recognition methods of Chinese enterprise name |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510614480.5A CN105320645B (en) | 2015-09-24 | 2015-09-24 | The recognition methods of Chinese enterprise name |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105320645A CN105320645A (en) | 2016-02-10 |
CN105320645B true CN105320645B (en) | 2019-07-12 |
Family
ID=55248050
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510614480.5A Active CN105320645B (en) | 2015-09-24 | 2015-09-24 | The recognition methods of Chinese enterprise name |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105320645B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105955954A (en) * | 2016-05-03 | 2016-09-21 | 成都数联铭品科技有限公司 | New enterprise name discovery method based on bidirectional recurrent neural network |
CN105975555A (en) * | 2016-05-03 | 2016-09-28 | 成都数联铭品科技有限公司 | Enterprise abbreviation extraction method based on bidirectional recurrent neural network |
CN106570170A (en) * | 2016-11-09 | 2017-04-19 | 武汉泰迪智慧科技有限公司 | Text classification and naming entity recognition integrated method and system based on depth cyclic neural network |
CN107688564A (en) * | 2017-08-31 | 2018-02-13 | 平安科技(深圳)有限公司 | Subject of news Corporate Identity method, electronic equipment and computer-readable recording medium |
CN107748745B (en) * | 2017-11-08 | 2021-08-03 | 厦门美亚商鼎信息科技有限公司 | Enterprise name keyword extraction method |
CN108460016A (en) * | 2018-02-09 | 2018-08-28 | 中云开源数据技术(上海)有限公司 | A kind of entity name analysis recognition method |
CN108595435B (en) * | 2018-05-03 | 2020-09-01 | 鹏元征信有限公司 | Organization name recognition processing method, intelligent terminal and storage medium |
CN109101480B (en) * | 2018-06-14 | 2022-09-06 | 华东理工大学 | Enterprise name segmentation method and device and computer readable storage medium |
CN111401083B (en) * | 2019-01-02 | 2023-05-02 | 阿里巴巴集团控股有限公司 | Name identification method and device, storage medium and processor |
CN110413764B (en) * | 2019-06-18 | 2023-09-01 | 杭州熊猫智云企业服务有限公司 | Long text enterprise name recognition method based on pre-built word stock |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1573923A (en) * | 2003-05-27 | 2005-02-02 | 微软公司 | System and method for user modeling to enhance named entity recognition |
CN101093478A (en) * | 2007-07-25 | 2007-12-26 | 中国科学院计算技术研究所 | Method and system for identifying Chinese full name based on Chinese shortened form of entity |
CN104615589A (en) * | 2015-02-15 | 2015-05-13 | 百度在线网络技术(北京)有限公司 | Named-entity recognition model training method and named-entity recognition method and device |
CN104933152A (en) * | 2015-06-24 | 2015-09-23 | 北京京东尚科信息技术有限公司 | Named entity recognition method and device |
Non-Patent Citations (2)
Title |
---|
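Recognition of company names in Chinese financial news (中文金融新闻中公司名的识别); 王宁, 葛瑞芳, 苑春法, 黄锦辉, 李文捷; Journal of Chinese Information Processing (《中文信息学报》); 2002-03-25 (2002, No. 02); pp. 1-4 |
Research on Chinese named entity recognition based on hidden Markov models (基于隐马尔科夫模型的中文命名实体识别研究); 赵琳瑛; China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》); 2009-01-15 (2009, No. 01); pp. I138-1305 |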
Also Published As
Publication number | Publication date |
---|---|
CN105320645A (en) | 2016-02-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: 300020 Tianjin Heping District, South Road, No. 11 International Building 23 purchase of Wheat; Applicant after: Tianjin mass information technology Limited by Share Ltd; Address before: 300000 Tianjin Heping District, South Road, No. 11 International Building 23 purchase of Wheat; Applicant before: Tianjin Hylanda Information Technology Co.,Ltd. |
COR | Change of bibliographic data | ||
GR01 | Patent grant | ||