CN105320645B - Method for recognizing Chinese enterprise names - Google Patents

Method for recognizing Chinese enterprise names

Info

Publication number
CN105320645B
Authority
CN
China
Prior art keywords
name
enterprise
vocabulary
word
proper
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510614480.5A
Other languages
Chinese (zh)
Other versions
CN105320645A (en)
Inventor
宋传宝
史墨轩
郝静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Mass Information Technology Ltd By Share Ltd
Original Assignee
Tianjin Mass Information Technology Ltd By Share Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Mass Information Technology Ltd By Share Ltd
Priority to CN201510614480.5A
Publication of CN105320645A
Application granted
Publication of CN105320645B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Character Discrimination (AREA)
  • Machine Translation (AREA)

Abstract

A method for recognizing Chinese enterprise names, comprising the following steps: establishing an enterprise name knowledge base and an enterprise name probability knowledge base, where the knowledge base contains a place-name word set, an enterprise generic-term word set, an industry-modifier word set, and an enterprise proper-name forbidden word set, and the probability knowledge base contains left-neighbor word probability knowledge for enterprise names; scanning the text and segmenting it into words; and separately recognizing enterprise names that begin with a place-name modifier and enterprise names that do not. The method speeds up recognition in documents and improves the accuracy of enterprise name recognition.

Description

Method for recognizing Chinese enterprise names
Technical field
The present invention relates to the technical field of the Internet, and specifically to a method for recognizing Chinese enterprise names by determining their left and right boundaries.
Background art
Unknown-word recognition is a key technology in natural language processing and is widely used in information extraction, information retrieval, automatic question answering, machine translation, and other fields. Collecting information from the Internet requires extracting Chinese enterprise names. Chinese enterprise names are a kind of unknown (out-of-vocabulary) word: their composition is complex, they are enormous in number, they are constantly changing and being updated, and they cannot be exhaustively enumerated. They are considered among the hardest proper nouns to recognize and cause considerable difficulty for natural language processing, especially for translation and machine understanding.
For the recognition of Chinese enterprise names, domestic research mainly includes: using a hidden Markov model combined with a probability estimation formula to evaluate how likely a string in real text is to constitute an enterprise name; an automatic Chinese organization name recognition algorithm based on stacked conditional random field models; and an automatic Chinese organization name recognition method based on class-based language models.
A Chinese enterprise name often contains several different words or phrases, and its composition is comparatively rich. The arbitrariness of the characters and words used in enterprise names, together with the uncertainty of name length, makes recognizing Chinese enterprise names difficult, and recognition rates remain low.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for recognizing Chinese enterprise names by determining their left and right boundaries.
To solve the technical problems existing in the known art, the present invention adopts the following technical solution:
For enterprise names that begin with a place-name modifier, the recognition method of the present invention comprises the following steps:
A. Establish an enterprise name knowledge base comprising a place-name word set, an enterprise generic-term word set, an industry-modifier word set, and an enterprise proper-name forbidden word set, which contain place-name words, enterprise generic-term words, industry-modifier words, and words forbidden in enterprise proper names, respectively;
B. Establish an enterprise name probability knowledge base containing the probability of each single Chinese character forming part of an enterprise proper name;
C. Scan the text and perform Chinese word segmentation on it;
D. When a place-name word appears during the scan, continue scanning the words that follow it; if, after 2-5 Chinese characters, an industry-modifier word appears and is immediately followed by an enterprise generic-term word, trigger enterprise name recognition;
E. Determine whether the Chinese characters between the place-name word and the industry-modifier word contain an enterprise proper-name forbidden word; if so, terminate recognition; if not, aggregate the probabilities of these characters forming an enterprise proper name into a weighted proper-name probability;
F. Determine whether the weighted proper-name probability exceeds a threshold; if it does, the entire Chinese segment from the place name through the final enterprise generic term is recognized as a Chinese enterprise name; otherwise, terminate recognition;
G. Organize and output the recognition result as an "enterprise name beginning with a place-name modifier".
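The place-name-led steps above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the toy lexicons, the per-character probabilities, the threshold of 0.3, and the use of a plain arithmetic mean as the "weighted" score are all assumptions, and the input is assumed to be already segmented into words.

```python
from typing import Dict, List, Set

# Toy lexicons; every entry is illustrative and not taken from the patent.
PLACE_NAMES: Set[str] = {"天津", "北京"}
INDUSTRY_MODIFIERS: Set[str] = {"信息技术", "科技"}
GENERIC_TERMS: Set[str] = {"有限公司", "股份有限公司"}
FORBIDDEN_WORDS: Set[str] = {"的", "和", "在"}

# Hypothetical probabilities of a character appearing in a proper name.
CHAR_PROB: Dict[str, float] = {"海": 0.6, "量": 0.5, "一": 0.05, "家": 0.05}
DEFAULT_PROB = 0.01  # assumed fallback for characters missing from the table


def proper_name_score(chars: str) -> float:
    """Mean per-character probability (an assumed weighting scheme)."""
    return sum(CHAR_PROB.get(c, DEFAULT_PROB) for c in chars) / len(chars)


def recognize_place_led(tokens: List[str], threshold: float = 0.3) -> List[str]:
    """Return enterprise names that start with a place-name word."""
    names: List[str] = []
    for i, tok in enumerate(tokens):
        if tok not in PLACE_NAMES:
            continue
        middle = ""  # characters between the place name and the modifier
        for j in range(i + 1, len(tokens) - 1):
            # Step D: modifier after 2-5 characters, generic term right after.
            is_trigger = (tokens[j] in INDUSTRY_MODIFIERS
                          and tokens[j + 1] in GENERIC_TERMS
                          and 2 <= len(middle) <= 5)
            if is_trigger:
                # Step E: reject candidates containing a forbidden word.
                has_forbidden = any(w in middle for w in FORBIDDEN_WORDS)
                # Step F: accept only if the score exceeds the threshold.
                if not has_forbidden and proper_name_score(middle) > threshold:
                    names.append("".join(tokens[i:j + 2]))
                break
            middle += tokens[j]
            if len(middle) > 5:  # proper name would be too long; give up
                break
    return names
```

Calling `recognize_place_led` on a segmented sentence returns the full span from the place name through the generic term for each accepted candidate; a real system would replace the toy sets with the knowledge bases built in steps A and B.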
For enterprise names that do not begin with a place-name modifier, the recognition method of the present invention comprises the following steps:
A. Establish an enterprise name knowledge base comprising a place-name word set, an enterprise generic-term word set, an industry-modifier word set, and an enterprise proper-name forbidden word set, which contain place-name words, enterprise generic-term words, industry-modifier words, and words forbidden in enterprise proper names, respectively;
B. Compile statistics on news and information data to obtain left-neighbor word probability knowledge for enterprise names; establish an enterprise name probability knowledge base containing both the probability of each single Chinese character forming part of an enterprise proper name and the left-neighbor word probabilities of enterprise names;
C. Scan the text and perform Chinese word segmentation on it;
D. When an industry-modifier word is encountered during the scan, check whether an enterprise generic-term word immediately follows it; if so, and the current word has not already been recognized as part of an "enterprise name beginning with a place-name modifier", trigger enterprise name recognition;
E. Starting from the industry-modifier word, scan leftward one word at a time and determine whether the word on the left is an enterprise proper-name forbidden word; if it is, terminate recognition;
F. Take the Chinese characters of the left-side word from step E and compute the aggregated, weighted probability that they form an enterprise proper name; at the same time, obtain the "enterprise name left-neighbor word probability" of the word further to the left, and use a hidden Markov probability model to calculate the recognition probability of the entire enterprise name with the current left-side word as its proper name;
G. Continue scanning one more word to the left, merge that word with the word from step E and treat the merged string as the enterprise proper name, and repeat step F; stop once the proper name exceeds 5 Chinese characters;
H. From the multiple recognition probabilities obtained in step G, discard results whose probability is below the threshold and select the result with the highest probability as the final recognition result;
I. Organize and output the final recognition result.
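The leftward scan can be sketched as below. The text does not give the exact hidden Markov formula, so this sketch swaps in a simple stand-in: the mean per-character proper-name probability multiplied by the left-neighbor word probability. The lexicons, probability tables, `<BOS>` sentence-start marker, and threshold are hypothetical, and stopping the scan at a forbidden word while keeping earlier candidates is likewise an assumption.

```python
from typing import List, Optional, Tuple

# Toy knowledge bases; all entries and numbers are illustrative assumptions.
INDUSTRY_MODIFIERS = {"科技", "投资"}
GENERIC_TERMS = {"有限公司"}
FORBIDDEN_WORDS = {"的", "了"}
CHAR_PROB = {"华": 0.6, "信": 0.5, "创": 0.5, "新": 0.4}
LEFT_NEIGHBOR_PROB = {"收购": 0.6, "与": 0.5, "<BOS>": 0.3}
DEFAULT_PROB = 0.01  # assumed fallback for unseen characters and words


def candidate_score(proper: str, left_word: str) -> float:
    """Stand-in for the HMM score: mean char probability x left-neighbor probability."""
    char_part = sum(CHAR_PROB.get(c, DEFAULT_PROB) for c in proper) / len(proper)
    return char_part * LEFT_NEIGHBOR_PROB.get(left_word, DEFAULT_PROB)


def recognize_leftward(tokens: List[str], threshold: float = 0.1) -> Optional[str]:
    """Return the best-scoring enterprise name found by scanning leftward."""
    for j, tok in enumerate(tokens[:-1]):
        # Step D: industry modifier immediately followed by a generic term.
        if tok not in INDUSTRY_MODIFIERS or tokens[j + 1] not in GENERIC_TERMS:
            continue
        best: Optional[Tuple[float, str]] = None
        k = j - 1
        while k >= 0:
            if tokens[k] in FORBIDDEN_WORDS:  # step E: stop at a forbidden word
                break
            proper = "".join(tokens[k:j])  # step G: merged proper-name candidate
            if len(proper) > 5:  # stop once the proper name exceeds 5 characters
                break
            left_word = tokens[k - 1] if k > 0 else "<BOS>"
            score = candidate_score(proper, left_word)  # step F
            # Step H: drop candidates below the threshold, keep the maximum.
            if score >= threshold and (best is None or score > best[0]):
                best = (score, "".join(tokens[k:j + 2]))
            k -= 1
        if best is not None:
            return best[1]
    return None
```

Each extension to the left produces one candidate left boundary, so picking the highest-scoring candidate is what fixes the left boundary of the name; the right boundary is fixed by the generic term.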
The advantages and positive effects of the present invention are as follows:
The method establishes an enterprise name knowledge base and an enterprise name probability knowledge base. The knowledge base contains a place-name word set, an enterprise generic-term word set, an industry-modifier word set, and an enterprise proper-name forbidden word set, and the probability knowledge base contains left-neighbor word probability knowledge for enterprise names. The text is scanned and segmented into words, and enterprise names that begin with a place-name modifier and enterprise names that do not are recognized separately. The method speeds up recognition in documents and improves the accuracy of enterprise name recognition.
Specific embodiments
The present invention is described in detail below with reference to embodiments:
For enterprise names that begin with a place-name modifier, the recognition method of the present invention comprises the following steps:
A. Establish an enterprise name knowledge base comprising a place-name word set, an enterprise generic-term word set, an industry-modifier word set, and an enterprise proper-name forbidden word set, which contain place-name words, enterprise generic-term words, industry-modifier words, and words forbidden in enterprise proper names, respectively;
B. Establish an enterprise name probability knowledge base containing the probability of each single Chinese character forming part of an enterprise proper name; this knowledge covers more than 3,600 commonly used Chinese characters, and the probability of each character forming part of an enterprise proper name is obtained statistically from a business directory of more than 10 million enterprises;
C. Scan the text and perform Chinese word segmentation on it;
D. When a place-name word appears during the scan, continue scanning the words that follow it; if, after 2-5 Chinese characters (an enterprise proper name is usually 2-5 characters long), an industry-modifier word appears and is immediately followed by an enterprise generic-term word, trigger enterprise name recognition;
E. Determine whether the Chinese characters between the place-name word and the industry-modifier word contain an enterprise proper-name forbidden word; if so, terminate recognition; if not, aggregate the probabilities of these characters forming an enterprise proper name into a weighted proper-name probability;
F. Determine whether the weighted proper-name probability exceeds a threshold; if it does, the entire Chinese segment from the place name through the final enterprise generic term is recognized as a Chinese enterprise name; otherwise, terminate recognition;
G. Organize and output the recognition result as an "enterprise name beginning with a place-name modifier".
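The per-character probability table described in step B could be estimated along the following lines. The text only says the probabilities are obtained statistically from a large business directory, so the definition used here (relative frequency of each character among all proper-name characters) is an assumption, as are the function name and the sample names.

```python
from collections import Counter
from typing import Dict, Iterable


def estimate_char_probs(proper_names: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each character across all proper names.

    Assumes the proper-name part has already been extracted from each
    directory entry; this definition of "probability" is an assumption.
    """
    counts = Counter(ch for name in proper_names for ch in name)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


# Hypothetical proper names standing in for a real business directory.
sample_probs = estimate_char_probs(["海量", "海天", "天源"])
```

With a real directory of millions of entries the same counting pass applies unchanged; only the extraction of the proper-name part from each registered name needs additional logic.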
For enterprise names that do not begin with a place-name modifier, the recognition method of the present invention comprises the following steps:
A. Establish an enterprise name knowledge base comprising a place-name word set, an enterprise generic-term word set, an industry-modifier word set, and an enterprise proper-name forbidden word set, which contain place-name words, enterprise generic-term words, industry-modifier words, and words forbidden in enterprise proper names, respectively;
B. Compile statistics on news and information data to obtain left-neighbor word probability knowledge for enterprise names; establish an enterprise name probability knowledge base containing both the probability of each single Chinese character forming part of an enterprise proper name and the left-neighbor word probabilities of enterprise names;
C. Scan the text and perform Chinese word segmentation on it;
D. When an industry-modifier word is encountered during the scan (several industry modifiers may occur together, as in "the great industry premises in day source Industry discipline Co., Ltd", and a place-name modifier may also appear, as in "letter and wealth management of investment (Beijing) Co., Ltd"), check whether an enterprise generic-term word immediately follows it; if so, and the current word has not already been recognized as part of an "enterprise name beginning with a place-name modifier", trigger enterprise name recognition;
E. Starting from the industry-modifier word, scan leftward one word at a time and determine whether the word on the left is an enterprise proper-name forbidden word; if it is, terminate recognition;
F. Take the Chinese characters of the left-side word from step E and compute the aggregated, weighted probability that they form an enterprise proper name; at the same time, obtain the "enterprise name left-neighbor word probability" of the word further to the left, and use a hidden Markov probability model to calculate the recognition probability of the entire enterprise name with the current left-side word as its proper name;
G. Continue scanning one more word to the left, merge that word with the word from step E and treat the merged string as the enterprise proper name, and repeat step F; stop once the proper name exceeds 5 Chinese characters;
H. From the multiple recognition probabilities obtained in step G, discard results whose probability is below the threshold and select the result with the highest probability as the final recognition result;
I. Organize and output the final recognition result.
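The left-neighbor word probabilities used in step B could be gathered from segmented news text roughly as sketched here. The definition is an assumption: for each word, the fraction of its occurrences that sit immediately to the left of a recognized enterprise name. The input format (token lists paired with the indices where an enterprise name starts, with each name kept as a single token) and the `<BOS>` marker for sentence-initial names are also hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Sample = Tuple[List[str], List[int]]  # (tokens, start indices of enterprise names)


def estimate_left_neighbor_probs(samples: List[Sample]) -> Dict[str, float]:
    """Fraction of a word's occurrences that directly precede an enterprise name."""
    before_name: Counter = Counter()
    overall: Counter = Counter()
    for tokens, name_starts in samples:
        overall.update(tokens)
        overall["<BOS>"] += 1  # one sentence-start marker per sentence
        for start in name_starts:
            left = tokens[start - 1] if start > 0 else "<BOS>"
            before_name[left] += 1
    return {w: before_name[w] / overall[w] for w in before_name}


# Hypothetical annotated sentences; the enterprise name is a single token here.
news = [
    (["收购", "华信科技有限公司"], [1]),
    (["收购", "完成"], []),
    (["华信科技有限公司", "上市"], [0]),
]
left_probs = estimate_left_neighbor_probs(news)
```

Names recognized by the place-name-led procedure can serve as the annotations, which avoids manual labeling of the news data.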
The above is only a preferred embodiment of the present invention and does not limit the invention in any form. Although the invention has been described by way of the preferred embodiment, the embodiment is not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the technical content disclosed above to make minor changes or modifications that result in equivalent embodiments. Any simple modification, equivalent change, or refinement made to the above embodiment according to the technical essence of the invention, without departing from the content of its technical solution, still falls within the scope of the technical solution of the present invention.

Claims (1)

1. A method for recognizing Chinese enterprise names, comprising the following steps:
A. establishing an enterprise name knowledge base comprising a place-name word set, an enterprise generic-term word set, an industry-modifier word set, and an enterprise proper-name forbidden word set, which contain place-name words, enterprise generic-term words, industry-modifier words, and words forbidden in enterprise proper names, respectively;
B. obtaining left-neighbor word probability knowledge for enterprise names by compiling statistics on news and information data through the following steps:
(1) establishing an enterprise name knowledge base comprising a place-name word set, an enterprise generic-term word set, an industry-modifier word set, and an enterprise proper-name forbidden word set, which contain place-name words, enterprise generic-term words, industry-modifier words, and words forbidden in enterprise proper names, respectively;
(2) establishing an enterprise name probability knowledge base containing the probability of each single Chinese character forming part of an enterprise proper name;
(3) scanning the text and performing Chinese word segmentation on it;
(4) when a place-name word appears during the scan, continuing to scan the words that follow it, and, if an industry-modifier word appears after 2-5 Chinese characters and is immediately followed by an enterprise generic-term word, triggering enterprise name recognition;
(5) determining whether the Chinese characters between the place-name word and the industry-modifier word contain an enterprise proper-name forbidden word; if so, terminating recognition; if not, aggregating the probabilities of these characters forming an enterprise proper name into a weighted proper-name probability;
(6) determining whether the weighted proper-name probability exceeds a threshold; if it does, recognizing the entire Chinese segment from the place name through the final enterprise generic term as a Chinese enterprise name; otherwise, terminating recognition;
(7) organizing and outputting the recognition result as an "enterprise name beginning with a place-name modifier";
and establishing an enterprise name probability knowledge base containing both the probability of each single Chinese character forming part of an enterprise proper name and the left-neighbor word probabilities of enterprise names;
C. scanning the text and performing Chinese word segmentation on it;
D. when an industry-modifier word is encountered during the scan, checking whether an enterprise generic-term word immediately follows it; if so, and the current word has not already been recognized as part of an "enterprise name beginning with a place-name modifier", triggering enterprise name recognition;
E. starting from the industry-modifier word, scanning leftward one word at a time and determining whether the word on the left is an enterprise proper-name forbidden word; if it is, terminating recognition;
F. taking the Chinese characters of the left-side word from step E and computing the aggregated, weighted probability that they form an enterprise proper name; at the same time, obtaining the "enterprise name left-neighbor word probability" of the word further to the left, and using a hidden Markov probability model to calculate the recognition probability of the entire enterprise name with the current left-side word as its proper name;
G. continuing to scan one more word to the left, merging that word with the word from step E and treating the merged string as the enterprise proper name, and repeating step F, stopping once the proper name exceeds 5 Chinese characters;
H. from the multiple recognition probabilities obtained in step G, discarding results whose probability is below the threshold and selecting the result with the highest probability as the final recognition result;
I. organizing and outputting the final recognition result.
CN201510614480.5A 2015-09-24 2015-09-24 Method for recognizing Chinese enterprise names Active CN105320645B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510614480.5A CN105320645B (en) 2015-09-24 2015-09-24 Method for recognizing Chinese enterprise names

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510614480.5A CN105320645B (en) 2015-09-24 2015-09-24 Method for recognizing Chinese enterprise names

Publications (2)

Publication Number Publication Date
CN105320645A CN105320645A (en) 2016-02-10
CN105320645B true CN105320645B (en) 2019-07-12

Family

ID=55248050

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510614480.5A Active CN105320645B (en) 2015-09-24 2015-09-24 Method for recognizing Chinese enterprise names

Country Status (1)

Country Link
CN (1) CN105320645B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105955954A (en) * 2016-05-03 2016-09-21 成都数联铭品科技有限公司 New enterprise name discovery method based on bidirectional recurrent neural network
CN105975555A (en) * 2016-05-03 2016-09-28 成都数联铭品科技有限公司 Enterprise abbreviation extraction method based on bidirectional recurrent neural network
CN106570170A (en) * 2016-11-09 2017-04-19 武汉泰迪智慧科技有限公司 Text classification and naming entity recognition integrated method and system based on depth cyclic neural network
CN107688564A (en) * 2017-08-31 2018-02-13 平安科技(深圳)有限公司 Subject of news Corporate Identity method, electronic equipment and computer-readable recording medium
CN107748745B (en) * 2017-11-08 2021-08-03 厦门美亚商鼎信息科技有限公司 Enterprise name keyword extraction method
CN108460016A (en) * 2018-02-09 2018-08-28 中云开源数据技术(上海)有限公司 A kind of entity name analysis recognition method
CN108595435B (en) * 2018-05-03 2020-09-01 鹏元征信有限公司 Organization name recognition processing method, intelligent terminal and storage medium
CN109101480B (en) * 2018-06-14 2022-09-06 华东理工大学 Enterprise name segmentation method and device and computer readable storage medium
CN111401083B (en) * 2019-01-02 2023-05-02 阿里巴巴集团控股有限公司 Name identification method and device, storage medium and processor
CN110413764B (en) * 2019-06-18 2023-09-01 杭州熊猫智云企业服务有限公司 Long text enterprise name recognition method based on pre-built word stock

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1573923A (en) * 2003-05-27 2005-02-02 微软公司 System and method for user modeling to enhance named entity recognition
CN101093478A (en) * 2007-07-25 2007-12-26 中国科学院计算技术研究所 Method and system for identifying Chinese full name based on Chinese shortened form of entity
CN104615589A (en) * 2015-02-15 2015-05-13 百度在线网络技术(北京)有限公司 Named-entity recognition model training method and named-entity recognition method and device
CN104933152A (en) * 2015-06-24 2015-09-23 北京京东尚科信息技术有限公司 Named entity recognition method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
中文金融新闻中公司名的识别;王宁,葛瑞芳,苑春法,黄锦辉, 李文捷;《中文信息学报》;20020325(第2002年第02期);第1-4页
基于隐马尔科夫模型的中文命名实体识别研究;赵琳瑛;《中国优秀硕士学位论文全文数据库 信息科技辑》;20090115(第2009年第01期);第I138-1305页

Also Published As

Publication number Publication date
CN105320645A (en) 2016-02-10

Similar Documents

Publication Publication Date Title
CN105320645B (en) Method for recognizing Chinese enterprise names
CN110675288B (en) Intelligent auxiliary judgment method, device, computer equipment and storage medium
Rodriguez et al. Automatic detection of hate speech on facebook using sentiment and emotion analysis
CN101599071B (en) Automatic extraction method of conversation text topic
CN107423444B (en) Hot word phrase extraction method and system
CN109684646A (en) A kind of microblog topic sentiment analysis method based on topic influence
Layton et al. Authorship attribution for twitter in 140 characters or less
Kluever et al. Balancing usability and security in a video CAPTCHA
CN110457404B (en) Social media account classification method based on complex heterogeneous network
US7555523B1 (en) Spam discrimination by generalized Ngram analysis of small header fields
Baecher et al. Breaking reCAPTCHA: a holistic approach via shape recognition
CN108363701B (en) Named entity identification method and system
CN101477544A (en) Rubbish text recognition method and system
Hong et al. An extended keyword extraction method
CN108009297B (en) Text emotion analysis method and system based on natural language processing
CN113505200A (en) Sentence-level Chinese event detection method combining document key information
CN105488098B (en) A kind of new words extraction method based on field otherness
CN104331523B (en) A kind of question sentence search method based on conceptual object model
CN109783623A (en) The data analysing method of user and customer service dialogue under a kind of real scene
CN108416286A (en) A kind of robot interactive approach based on real-time video chat scenario
WO2015062377A1 (en) Device and method for detecting similar text, and application
CN106126495B (en) One kind being based on large-scale corpus prompter method and apparatus
CN114969294A (en) Expansion method of sound-proximity sensitive words
Imam et al. Detecting spam images with embedded arabic text in twitter
Prasad Micro-blogging sentiment analysis using bayesian classification methods

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 300020 Tianjin Heping District, South Road, No. 11 International Building 23 purchase of Wheat

Applicant after: Tianjin mass information technology Limited by Share Ltd

Address before: 300000 Tianjin Heping District, South Road, No. 11 International Building 23 purchase of Wheat

Applicant before: Tianjin Hylanda Information Technology Co.,Ltd.

COR Change of bibliographic data
GR01 Patent grant