CN109522417A - A method for extracting the trade name from a company name - Google Patents

A method for extracting the trade name from a company name

Info

Publication number
CN109522417A
Authority
CN
China
Prior art keywords
dictionary
industry
administrative division
company
trading company
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811258104.7A
Other languages
Chinese (zh)
Inventor
王本强
谢超
周庆勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Software Co Ltd
Original Assignee
Inspur Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Software Co Ltd
Priority to CN201811258104.7A
Publication of CN109522417A

Abstract

The invention discloses a method for extracting the trade name from a company name. The method first preprocesses the text, segmenting it into the smallest meaningful word units. It then builds an administrative division dictionary, an organizational form dictionary, and an industry dictionary in the required format, loads all three into the word segmenter as custom dictionaries, and performs accurate segmentation. Next it obtains the positions of the administrative division and the industry within the string, computes the position of the trade name from those two positions, and extracts the trade name substring at that position. Compared with the prior art, the method reduces tedious manual annotation work and thereby lowers labor and time costs.

Description

A method for extracting the trade name from a company name
Technical field
The present invention relates to the field of natural language processing, and specifically to a method for extracting the trade name from a company name.
Background art
Extracting the trade name from a company name has applications in many fields, such as input-box completion in search engines and matching algorithms in company name entity linking. A company name currently consists mainly of four parts: the name of the administrative division where the company is located, the trade name (also called the business name), the industry, and the organizational form. Because of the particular way companies are named, word segmenters from the natural language processing field generally cannot separate out the trade name. Current machine learning (including deep learning) approaches do have a certain advantage in accuracy. For example, an existing deep-learning-based system and method for extracting company name components (application number 201710024098.8) collects company names and manually annotates each component; converts the text and annotation information of the company names into vectors that serve as input to a long short-term memory (LSTM) model; trains the LSTM model on the annotation vectors; feeds company name vectors into the trained LSTM model, which outputs annotation results; and converts those annotation results into the components of the company name for output. The drawback is that a large amount of manual annotation is required, and manual annotation is relatively costly.
Summary of the invention
The technical task of the invention is to address the above deficiency by providing a method for extracting the trade name from a company name.
The technical solution adopted by the invention to solve this technical problem is a method for extracting the trade name from a company name, as follows:
First, preprocess the text, obtaining the smallest meaningful word units by applying word segmentation to the text;
Second, build an administrative division dictionary, an organizational form dictionary, and an industry dictionary in the required format, load the three dictionaries into the segmenter as custom dictionaries, and perform accurate segmentation;
Obtain the positions of the administrative division and the industry within the string;
From the positions of the administrative division and the industry, compute the position of the trade name;
According to the position of the trade name, extract the trade name string.
Further, preferably, building the industry dictionary comprises: after word segmentation of the text, building the administrative division dictionary and the organizational form dictionary in the required format; performing frequency statistics on the segmentation results; removing low-frequency items; removing items contained in the administrative division dictionary and the organizational form dictionary; and taking what remains as the extracted industry information, thereby forming the industry dictionary.
Further, preferably, building the industry dictionary also includes a manual verification step and an industry dictionary completion step;
The manual verification step corrects the extracted industry information;
The industry dictionary completion step merges the extracted industry information with an existing industry dictionary.
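The verification and completion steps can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `finalize_industry_dictionary`, the `corrections` mapping (standing in for a reviewer's decisions, where `None` means "reject"), and all sample terms are assumptions.

```python
from typing import Dict, Iterable, Optional, Set


def finalize_industry_dictionary(
    extracted: Iterable[str],
    corrections: Dict[str, Optional[str]],
    existing: Iterable[str],
) -> Set[str]:
    """Apply manual corrections to extracted terms, then merge with an existing dictionary."""
    verified = set()
    for term in extracted:
        # A reviewer may replace a term (string value) or reject it (None value).
        fixed = corrections.get(term, term)
        if fixed:
            verified.add(fixed)
    # Completion step: union with an existing industry dictionary.
    return verified | set(existing)


extracted_terms = ["软件", "科技有", "餐饮"]   # hypothetical frequency-based candidates
reviewer_fixes = {"科技有": "科技"}            # hypothetical manual correction
existing_terms = ["科技", "贸易", "物流"]      # hypothetical existing dictionary
industry_dictionary = finalize_industry_dictionary(
    extracted_terms, reviewer_fixes, existing_terms
)
print(sorted(industry_dictionary))
```

Using a set keeps the merge idempotent, so re-running the completion step against the same existing dictionary adds nothing new.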
Further, preferably:
The text preprocessing includes removing redundant information by regular expression or dictionary matching, where the redundant information includes one or more of punctuation marks, spaces, and blank lines.
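A regular-expression version of this preprocessing might look like the sketch below. The exact character set treated as redundant is an assumption (whitespace plus common ASCII and full-width punctuation), as are the function name `clean_company_names` and the sample names; brackets are deliberately left alone because they can be part of a registered name.

```python
import re
from typing import Iterable, List

# Assumption: "redundant information" means whitespace plus common ASCII and
# full-width punctuation. Brackets are not included in this pattern.
_REDUNDANT = re.compile(r"[\s,.;:!?'\"，。；：！？、“”‘’·*#]+")


def clean_company_names(lines: Iterable[str]) -> List[str]:
    """Strip redundant characters from each name and drop names that end up empty."""
    cleaned = []
    for line in lines:
        name = _REDUNDANT.sub("", line)
        if name:  # blank lines and empty names are skipped
            cleaned.append(name)
    return cleaned


# Hypothetical raw records with stray spaces, punctuation, and a blank line.
raw = [" 济南 星辰软件，有限公司。\n", "\n", "北京·四方科技有限公司 "]
cleaned_names = clean_company_names(raw)
print(cleaned_names)
```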
A system for extracting the trade name from a company name comprises a text preprocessing module, a segmenter dictionary loading and segmentation module, an administrative division and industry position acquisition module, a trade name position acquisition module, and a trade name extraction module;
The text preprocessing module obtains the smallest meaningful word units by applying word segmentation to the text;
The segmenter dictionary loading and segmentation module builds the administrative division dictionary, organizational form dictionary, and industry dictionary in the required format, loads the three dictionaries into the segmenter as custom dictionaries, and performs accurate segmentation;
The administrative division and industry position acquisition module obtains the positions of the administrative division and the industry within the string;
The trade name position acquisition module computes the position of the trade name from the positions of the administrative division and the industry;
The trade name extraction module extracts the trade name string according to the position of the trade name.
Further, preferably, the segmenter dictionary loading and segmentation module comprises a dictionary creation unit, a dictionary loading unit, and a segmentation unit;
The dictionary creation unit builds the administrative division dictionary, organizational form dictionary, and industry dictionary in the required format;
The dictionary loading unit loads the administrative division dictionary, organizational form dictionary, and industry dictionary into the segmenter as custom dictionaries;
The segmentation unit performs accurate segmentation.
Further, preferably, the system also comprises a manual verification unit and an industry dictionary merging unit;
After word segmentation of the text, the administrative division dictionary and organizational form dictionary are built in the required format; frequency statistics are performed on the segmentation results; low-frequency items are removed; items contained in the administrative division dictionary and the organizational form dictionary are removed; and what remains is taken as the extracted industry information;
The manual verification unit corrects the extracted industry information;
The industry dictionary merging unit merges the extracted industry information with an existing industry dictionary.
Further, a preferred structure is a server comprising:
One or more processors;
A storage device for storing one or more programs;
When the one or more programs are executed by the one or more processors, the one or more processors implement the method of any one of claims 1-4.
Compared with the prior art, the method of the invention for extracting the trade name from a company name has the following beneficial effects:
It can be applied to query completion in vertical search, so that the information the user is looking for can be retrieved quickly;
It can be applied to matching algorithms in company name entity linking, where the similarity of company names is computed by assigning different weights;
It can be applied to scenarios in which machine learning (deep learning) training needs a large annotated corpus, by annotating company names automatically.
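The weighted-similarity use case is not specified further in the text. One possible reading, sketched here purely as an assumption, is to compare two names component by component and weight the trade name most heavily. The weights, the component keys, and the use of `difflib.SequenceMatcher` as the string similarity are all illustrative choices, not part of the patent.

```python
from difflib import SequenceMatcher
from typing import Dict

# Hypothetical weights: the trade name is treated as the most distinctive part.
WEIGHTS = {"region": 0.1, "trade_name": 0.6, "industry": 0.2, "org_form": 0.1}


def component_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; two missing components count as equal."""
    if not a and not b:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def weighted_name_similarity(x: Dict[str, str], y: Dict[str, str]) -> float:
    """Weighted sum of per-component similarities between two parsed company names."""
    return sum(
        weight * component_similarity(x.get(key, ""), y.get(key, ""))
        for key, weight in WEIGHTS.items()
    )


# Hypothetical parsed names.
base = {"region": "济南", "trade_name": "星辰", "industry": "软件", "org_form": "有限公司"}
other_region = dict(base, region="青岛")
other_trade_name = dict(base, trade_name="四方")

print(weighted_name_similarity(base, other_region))
print(weighted_name_similarity(base, other_trade_name))
```

With these weights, two names that differ only in region score much higher than two names that differ only in trade name, which is the behavior an entity-linking matcher would usually want.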
Description of the drawings
The invention is further described below with reference to the drawings.
Figure 1 is a flow diagram of the method for extracting the trade name from a company name.
Figure 2 is a functional block diagram of the system for extracting the trade name from a company name.
Specific embodiments
The invention is further explained below with reference to the drawings and specific embodiments.
The invention is a method for extracting the trade name from a company name, which aims to extract trade names by means of statistics and dictionary matching.
The method first preprocesses the text data, removing special characters and performing word segmentation. Second, it builds the administrative division dictionary by organizing the national administrative divisions into the format the method requires. It then builds the industry dictionary: it performs frequency statistics on the segmentation results, extracts industry names according to their frequency characteristics, proofreads them manually, and merges them with an existing industry dictionary. Finally, through accurate segmentation, it extracts the trade name according to the positions of the administrative division and the industry.
Embodiment 1:
The specific implementation is as follows:
First, data preprocessing: the data extracted from the enterprise directory database contains two fields, the enterprise's unique identification number and the enterprise name. The data can be seen to contain many punctuation marks, spaces, and blank lines.
This information results from input errors during data entry, so the redundant information must be removed first; regular expressions remove it effectively. Enterprise names that are empty strings are not considered.
Next, build the administrative division dictionary: publicly available administrative division websites provide complete division codes and names at every level, from country, province (municipality directly under the central government), city, and county (district) down to township (neighbourhood committee) and village (community). By organizing this data we obtain a dictionary in the required format (the organizational form dictionary is built in the same way).
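The patent does not state the dictionary format. As a sketch, the code below assumes a jieba-style user dictionary, one entry per line with the word, an optional frequency, and an optional part-of-speech tag separated by spaces. The sample records, the frequency value, the `ns` (place name) tag, and the idea of also emitting suffix-free short forms such as 济南 for 济南市 are assumptions for illustration.

```python
from typing import Iterable, List, Tuple

# Suffixes stripped to produce short forms; the longest suffix is checked first.
SUFFIXES = ("自治区", "省", "市", "区", "县")


def to_dictionary_lines(
    records: Iterable[Tuple[str, str]], freq: int = 100000, tag: str = "ns"
) -> List[str]:
    """Turn (division code, division name) records into 'word freq tag' lines."""
    words: List[str] = []
    seen = set()
    for _code, name in records:
        name = name.strip()
        if not name:
            continue
        variants = [name]
        for suffix in SUFFIXES:
            # Keep a short form only if at least two characters remain.
            if name.endswith(suffix) and len(name) - len(suffix) >= 2:
                variants.append(name[: -len(suffix)])
                break
        for word in variants:
            if word not in seen:
                seen.add(word)
                words.append(word)
    return [f"{word} {freq} {tag}" for word in words]


# Illustrative records in (code, name) form.
sample_records = [("370000", "山东省"), ("370100", "济南市"), ("370102", "历下区")]
dictionary_lines = to_dictionary_lines(sample_records)
print("\n".join(dictionary_lines))
```

A high frequency value is a common way to make a segmenter prefer the dictionary entry over competing splits; the right value depends on the segmenter in use.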
Then, build the industry dictionary: each enterprise name is segmented with a word segmenter in order to obtain the smallest meaningful units. Analysis of the segmentation results shows that the vast majority of trade names are segmented inaccurately and that trade names usually do not repeat, so the trade name parts are low-frequency, whereas administrative divisions and organizational forms are relatively fixed and therefore easy to distinguish.
Therefore, after frequency statistics on the segmentation results, the low-frequency items are removed, then the items contained in the administrative division dictionary, and then those in the organizational form dictionary; what finally remains is the industry information.
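The filtering described above can be sketched with a frequency counter. The threshold `min_count`, the function name `industry_candidates`, and the sample segmentations are assumptions; the patent gives no concrete cut-off for "low frequency".

```python
from collections import Counter
from typing import Iterable, List, Set


def industry_candidates(
    segmented_names: Iterable[List[str]],
    admin_terms: Set[str],
    org_terms: Set[str],
    min_count: int = 2,
) -> Set[str]:
    """Keep frequent tokens that are neither administrative divisions nor organizational forms."""
    counts = Counter(token for tokens in segmented_names for token in tokens)
    return {
        token
        for token, count in counts.items()
        if count >= min_count          # drop low-frequency tokens (mostly trade names)
        and token not in admin_terms   # drop administrative divisions
        and token not in org_terms     # drop organizational forms
    }


# Hypothetical segmentation results for four company names.
segmented = [
    ["济南", "星辰", "软件", "有限公司"],
    ["济南", "四方", "软件", "有限公司"],
    ["青岛", "蓝海", "贸易", "有限公司"],
    ["青岛", "云帆", "贸易", "有限公司"],
]
candidates = industry_candidates(segmented, {"济南", "青岛"}, {"有限公司"})
print(sorted(candidates))
```

On a real corpus the threshold would need tuning, since a common trade name can exceed it and a rare industry term can fall below it, which is why the text follows this step with manual verification.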
This remainder still contains errors, which are corrected through manual verification. For completeness, some existing industry dictionaries are crawled and merged with it, finally forming the industry dictionary.
The administrative division dictionary and industry dictionary compiled above are loaded into the segmenter as custom dictionaries, which ensures that these information units are not split incorrectly during segmentation.
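The patent does not name a particular segmenter. To show why loading the dictionaries keeps their entries intact, the sketch below substitutes a simple forward maximum matching segmenter: at each position it takes the longest dictionary entry, and it groups consecutive characters that match nothing into one `unknown` token. The labels, the function name `segment`, and the sample dictionary are assumptions.

```python
from typing import Dict, List, Tuple


def segment(text: str, dictionary: Dict[str, str]) -> List[Tuple[str, str]]:
    """Forward maximum matching over a {word: label} dictionary."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens: List[Tuple[str, str]] = []
    unknown = ""
    i = 0
    while i < len(text):
        match = None
        # Try the longest possible piece first so long entries are never split.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in dictionary:
                match = piece
                break
        if match:
            if unknown:
                tokens.append((unknown, "unknown"))
                unknown = ""
            tokens.append((match, dictionary[match]))
            i += len(match)
        else:
            unknown += text[i]
            i += 1
    if unknown:
        tokens.append((unknown, "unknown"))
    return tokens


# Hypothetical custom dictionary combining the three dictionaries.
custom = {
    "济南": "region",
    "软件": "industry",
    "有限公司": "org_form",
    "股份有限公司": "org_form",
}
result = segment("济南星辰软件股份有限公司", custom)
print(result)
```

Because the longest entry wins, 股份有限公司 is kept whole rather than being cut into 股份 and 有限公司, which is the "not split incorrectly" property the text relies on.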
Next, locate the administrative division and industry positions: the positions of the administrative division and the industry within the string can be obtained from the segmentation result;
Finally, from the positions of the administrative division and the industry within the string, the position of the trade name is obtained by subtracting one position from the other, and the trade name string is then taken out.
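The position arithmetic can be sketched as below, assuming the segmentation result is a list of `(word, label)` pairs and that the name follows the order administrative division, trade name, industry, organizational form. The labels and the function name `extract_trade_name` are assumptions, and names that place the division elsewhere (for example in brackets after the trade name) are not handled.

```python
from typing import List, Optional, Tuple


def extract_trade_name(tokens: List[Tuple[str, str]]) -> Optional[str]:
    """Return the substring between the leading region tokens and the first industry token."""
    text = "".join(word for word, _ in tokens)
    offset = 0
    index = 0
    # The trade name starts where the leading administrative divisions end.
    while index < len(tokens) and tokens[index][1] == "region":
        offset += len(tokens[index][0])
        index += 1
    region_end = offset
    for word, label in tokens[index:]:
        if label == "industry":
            # offset is the industry start; its difference from region_end
            # is the length of the trade name.
            return text[region_end:offset] or None
        offset += len(word)
    return None  # no industry token found, so the span cannot be computed


# Hypothetical segmentation result.
sample_tokens = [
    ("山东", "region"),
    ("济南", "region"),
    ("星辰", "unknown"),
    ("软件", "industry"),
    ("有限公司", "org_form"),
]
trade_name = extract_trade_name(sample_tokens)
print(trade_name)
```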
The method of the invention can be applied to query completion in vertical search, so that the information the user is looking for can be retrieved quickly. It can be applied to matching algorithms in company name entity linking, where the similarity of company names is computed by assigning different weights. It is also suitable for scenarios in which machine learning (deep learning) training needs a large annotated corpus, by annotating company names automatically.
Those skilled in the art can readily implement the invention from the above specific embodiments. It should be understood, however, that the invention is not limited to these embodiments. On the basis of the disclosed embodiments, those skilled in the art may combine different technical features in any way to realize different technical solutions.

Claims (8)

1. A method for extracting the trade name from a company name, characterized in that the method is as follows:
First, preprocess the text, obtaining the smallest meaningful word units by applying word segmentation to the text;
Second, build an administrative division dictionary, an organizational form dictionary, and an industry dictionary in the required format, load the three dictionaries into the segmenter as custom dictionaries, and perform accurate segmentation;
Obtain the positions of the administrative division and the industry within the string;
From the positions of the administrative division and the industry, compute the position of the trade name;
According to the position of the trade name, extract the trade name string.
2. The method for extracting the trade name from a company name according to claim 1, characterized in that building the industry dictionary comprises: after word segmentation of the text, building the administrative division dictionary and the organizational form dictionary in the required format; performing frequency statistics on the segmentation results; removing low-frequency items; removing items contained in the administrative division dictionary and the organizational form dictionary; and taking what remains as the extracted industry information, thereby forming the industry dictionary.
3. The method for extracting the trade name from a company name according to claim 2, characterized in that building the industry dictionary also includes a manual verification step and an industry dictionary completion step;
The manual verification step corrects the extracted industry information;
The industry dictionary completion step merges the extracted industry information with an existing industry dictionary.
4. The method for extracting the trade name from a company name according to claim 1, characterized in that
The text preprocessing includes removing redundant information by regular expression or dictionary matching, where the redundant information includes one or more of punctuation marks, spaces, and blank lines.
5. A system for extracting the trade name from a company name, characterized in that it comprises a text preprocessing module, a segmenter dictionary loading and segmentation module, an administrative division and industry position acquisition module, a trade name position acquisition module, and a trade name extraction module;
The text preprocessing module obtains the smallest meaningful word units by applying word segmentation to the text;
The segmenter dictionary loading and segmentation module builds the administrative division dictionary, organizational form dictionary, and industry dictionary in the required format, loads the three dictionaries into the segmenter as custom dictionaries, and performs accurate segmentation;
The administrative division and industry position acquisition module obtains the positions of the administrative division and the industry within the string;
The trade name position acquisition module computes the position of the trade name from the positions of the administrative division and the industry;
The trade name extraction module extracts the trade name string according to the position of the trade name.
6. The system for extracting the trade name from a company name according to claim 5, characterized in that the segmenter dictionary loading and segmentation module comprises a dictionary creation unit, a dictionary loading unit, and a segmentation unit;
The dictionary creation unit builds the administrative division dictionary, organizational form dictionary, and industry dictionary in the required format;
The dictionary loading unit loads the administrative division dictionary, organizational form dictionary, and industry dictionary into the segmenter as custom dictionaries;
The segmentation unit performs accurate segmentation.
7. The system for extracting the trade name from a company name according to claim 6, characterized in that it further comprises a manual verification unit and an industry dictionary merging unit;
After word segmentation of the text, the administrative division dictionary and organizational form dictionary are built in the required format; frequency statistics are performed on the segmentation results; low-frequency items are removed; items contained in the administrative division dictionary and the organizational form dictionary are removed; and what remains is taken as the extracted industry information;
The manual verification unit corrects the extracted industry information;
The industry dictionary merging unit merges the extracted industry information with an existing industry dictionary.
8. A server for extracting the trade name from a company name, characterized in that the server comprises:
One or more processors;
A storage device for storing one or more programs;
When the one or more programs are executed by the one or more processors, the one or more processors implement the method of any one of claims 1-4.
CN201811258104.7A 2018-10-26 2018-10-26 A method for extracting the trade name from a company name Pending CN109522417A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811258104.7A CN109522417A (en) 2018-10-26 2018-10-26 A method for extracting the trade name from a company name

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811258104.7A CN109522417A (en) 2018-10-26 2018-10-26 A method for extracting the trade name from a company name

Publications (1)

Publication Number Publication Date
CN109522417A (en) 2019-03-26

Family

ID=65773161

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811258104.7A Pending CN109522417A (en) 2018-10-26 2018-10-26 A method for extracting the trade name from a company name

Country Status (1)

Country Link
CN (1) CN109522417A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015149533A1 (en) * 2014-03-31 2015-10-08 北京奇虎科技有限公司 Method and device for word segmentation processing on basis of webpage content classification
CN106777336A (en) * 2017-01-13 2017-05-31 深圳爱拼信息科技有限公司 A kind of exabyte composition extraction system and method based on deep learning
CN108595435A (en) * 2018-05-03 2018-09-28 鹏元征信有限公司 A kind of organization names identifying processing method, intelligent terminal and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王宁等: "中文金融新闻中公司名的识别", 《中文信息学报》 *
胡万亭等: "一种基于词频统计的组织机构名识别方法", 《计算机应用研究》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110134779A (en) * 2019-05-13 2019-08-16 极智(上海)企业管理咨询有限公司 A kind of method of enterprise name processing
CN110381115A (en) * 2019-06-14 2019-10-25 平安科技(深圳)有限公司 Information-pushing method, device, computer readable storage medium and computer equipment
CN110381115B (en) * 2019-06-14 2022-03-11 平安科技(深圳)有限公司 Information pushing method and device, computer readable storage medium and computer equipment
CN110704719A (en) * 2019-09-29 2020-01-17 北京金堤科技有限公司 Enterprise search text word segmentation method and device
CN110704719B (en) * 2019-09-29 2022-03-08 北京金堤科技有限公司 Enterprise search text word segmentation method and device
CN111079428A (en) * 2019-12-27 2020-04-28 出门问问信息科技有限公司 Word segmentation and industry dictionary construction method and device and readable storage medium
CN111079428B (en) * 2019-12-27 2023-09-19 北京羽扇智信息科技有限公司 Word segmentation and industry dictionary construction method and device and readable storage medium
CN112784015A (en) * 2021-01-25 2021-05-11 北京金堤科技有限公司 Information recognition method and apparatus, device, medium, and program
CN112784015B (en) * 2021-01-25 2024-03-12 北京金堤科技有限公司 Information identification method and device, apparatus, medium, and program
CN115270800A (en) * 2022-09-28 2022-11-01 广州市玄武无线科技股份有限公司 Method, device and equipment for extracting terminal store names and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190326