CN109522417A - A method for extracting the trade name from a company name - Google Patents
- Publication number
- CN109522417A
- Authority
- CN
- China
- Prior art keywords
- dictionary
- industry
- administrative division
- company
- trading company
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a method for extracting the trade name from a company name. The method first preprocesses the text and segments it to obtain the smallest semantic units. It then builds an administrative division dictionary, an organizational form dictionary, and an industry dictionary in the required format, loads the three dictionaries into the segmenter as custom dictionaries, and performs precise segmentation. The positions of the administrative division and the industry term in the character string are obtained, the position of the trade name is calculated from them, and the trade name string is extracted at that position. Compared with the prior art, the method removes the tedious work of manual annotation and reduces labor and time costs.
Description
Technical field
The present invention relates to the field of natural language processing, and specifically to a method for extracting the trade name from a company name.
Background
Extracting the trade name from a company name has applications in many fields, such as input box completion in search engines and matching algorithms in company name entity linking. Currently, a company name mainly consists of the following four parts: the name of the administrative division where the company is located, the trade name, the industry, and the organizational form. Because of the particularities of company naming, general-purpose segmenters from the natural language processing field usually cannot separate out the trade name.
Current machine learning methods (including deep learning) have a certain advantage in accuracy. One example is an existing deep-learning-based system and method for extracting the components of a company name (application number: 201710024098.8). That method collects company names and manually annotates each component, converts the text and annotation information of each company name into vectors as input to a long short-term memory (LSTM) model, trains the LSTM model on the annotated vectors, feeds company name vectors to the trained model to obtain annotation results, and converts those results into the individual components of the company name for output. Its drawback is that it requires a large amount of manual annotation, and manual annotation is relatively costly.
Summary of the invention
The technical task of the present invention is to address the above deficiencies by providing a method for extracting the trade name from a company name.
The technical solution adopted by the present invention to solve this technical problem is a method for extracting the trade name from a company name, as follows:
First, the text is preprocessed: the text is segmented to obtain the smallest semantic units;
Second, an administrative division dictionary, an organizational form dictionary, and an industry dictionary are built in the required format and loaded into the segmenter as custom dictionaries for precise segmentation;
The positions of the administrative division and the industry term in the character string are obtained;
The position of the trade name is calculated from the positions of the administrative division and the industry term;
The trade name string is extracted according to the position of the trade name.
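The position calculation in the last two steps can be read as simple index arithmetic. The notation below is an assumption introduced for illustration and does not appear in the original text: p_div and l_div are the start index and length of the administrative division in the character string, and p_ind is the start index of the industry term. The sketch also assumes the trade name sits directly between the two.

```latex
\[
  \mathrm{start}_{\mathrm{trade}} = p_{\mathrm{div}} + \ell_{\mathrm{div}}, \qquad
  \mathrm{end}_{\mathrm{trade}} = p_{\mathrm{ind}}, \qquad
  \ell_{\mathrm{trade}} = p_{\mathrm{ind}} - \left(p_{\mathrm{div}} + \ell_{\mathrm{div}}\right)
\]
```

For a hypothetical name such as 北京市华夏科技有限公司, with division 北京市 (p_div = 0, l_div = 3) and industry term 科技 (p_ind = 5), this gives start 3, end 5, and length 2, that is, the substring 华夏.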
Further, preferably, building the industry dictionary includes: after segmenting the text, building the administrative division dictionary and the organizational form dictionary in the required format; computing frequency statistics over the segmentation results; removing low-frequency entries; removing entries found in the administrative division dictionary and the organizational form dictionary; and obtaining the industry information by crawling, thereby forming the industry dictionary.
Further, preferably, building the industry dictionary also includes a manual verification step and an industry dictionary completion step;
The manual verification step corrects the crawled industry information;
The industry dictionary completion step merges the crawled industry information with the existing corresponding industry dictionary.
Further, preferably, the text preprocessing includes removing redundant information by regular expression or dictionary matching, where the redundant information includes one or more of punctuation marks, spaces, and blank lines.
A system for extracting the trade name from a company name, comprising a text preprocessing module, a segmenter dictionary loading and segmentation module, an administrative division and industry position acquisition module, a trade name position acquisition module, and a trade name extraction module;
The text preprocessing module obtains the smallest semantic units by segmenting the text;
The segmenter dictionary loading and segmentation module builds the administrative division dictionary, organizational form dictionary, and industry dictionary in the required format, and loads them into the segmenter as custom dictionaries for precise segmentation;
The administrative division and industry position acquisition module obtains the positions of the administrative division and the industry term in the character string;
The trade name position acquisition module calculates the position of the trade name from the positions of the administrative division and the industry term;
The trade name extraction module extracts the trade name string according to the position of the trade name.
Further, preferably, the segmenter dictionary loading and segmentation module includes a dictionary creation unit, a dictionary loading unit, and a segmentation unit;
The dictionary creation unit builds the administrative division dictionary, organizational form dictionary, and industry dictionary in the required format;
The dictionary loading unit loads the administrative division dictionary, organizational form dictionary, and industry dictionary into the segmenter as custom dictionaries;
The segmentation unit performs precise segmentation.
Further, preferably, the system also includes a manual verification unit and an industry dictionary merging unit;
After the text is segmented, the administrative division dictionary and the organizational form dictionary are built in the required format, frequency statistics are computed over the segmentation results, low-frequency entries are removed, entries found in the administrative division dictionary and the organizational form dictionary are removed, and the industry information is obtained by crawling;
The manual verification unit corrects the crawled industry information;
The industry dictionary merging unit merges the crawled industry information with the existing corresponding industry dictionary.
Further, preferably, a server is provided, the server comprising:
One or more processors;
A storage device for storing one or more programs;
When the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1-4.
Compared with the prior art, the method of the present invention for extracting the trade name from a company name has the following beneficial effects:
The present invention can be applied to query completion in vertical search, so that the information a user is querying can be obtained quickly;
The present invention can be applied to matching algorithms in company name entity linking, where the similarity of company names is calculated by assigning different weights to their parts;
The present invention can be applied to scenarios in which company names must be annotated automatically because machine learning (deep learning) training requires a large annotated corpus.
Brief description of the drawings
The present invention is further described below with reference to the drawings.
Figure 1 is a flow diagram of the method for extracting the trade name from a company name.
Figure 2 is a functional block diagram of the system for extracting the trade name from a company name.
Detailed description of the embodiments
The present invention is further explained below with reference to the drawings and specific embodiments.
The present invention is a method for extracting the trade name from a company name, which aims to extract the trade name by means of statistics and dictionary matching.
The method first preprocesses the text data: special characters are removed and the text is segmented. Second, the administrative division dictionary is built by organizing the national administrative divisions into the format the method requires. Then the industry dictionary is built: frequency statistics are computed over the segmentation results, industry names are extracted according to their frequency characteristics, proofread manually, and merged with an existing industry dictionary. Finally, through precise segmentation, the trade name is extracted according to the positions of the administrative division and the industry term.
Embodiment 1:
The specific implementation is as follows:
Data preprocessing is carried out first: the data extracted from the enterprise directory database has two fields, the unique identification number of the enterprise and the enterprise name. The data can be seen to contain many punctuation marks, spaces, and blank lines. This information results from mistakes made during data entry, so the redundant information must be removed first; a regular expression removes it effectively. Records whose enterprise name is an empty string are not considered.
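The preprocessing step can be sketched as follows. This is a minimal illustration, not the pattern used in the original: the punctuation set in `_PUNCT`, the function names, and the `(enterprise_id, name)` record layout are all assumptions.

```python
import re

# Assumed set of "redundant" punctuation (ASCII and full-width). The original
# text does not specify the exact characters, so this list is illustrative.
_PUNCT = ",.;:!?'\"()[]{}<>*#/\\|-" + ",。;:!?、“”‘’()【】《》·"

# One character class covering all whitespace (spaces, tabs, line breaks)
# plus the punctuation above.
_REDUNDANT = re.compile("[\\s" + re.escape(_PUNCT) + "]+")


def clean_company_name(raw):
    """Remove punctuation, spaces, and line breaks from a raw company name."""
    return _REDUNDANT.sub("", raw)


def clean_records(records):
    """Clean (enterprise_id, name) pairs; skip records whose name is empty."""
    cleaned = []
    for enterprise_id, name in records:
        name = clean_company_name(name or "")
        if name:  # empty enterprise names are not considered
            cleaned.append((enterprise_id, name))
    return cleaned
```

Note that this sketch also strips parentheses, which some real company names use legitimately; a production pattern would need to decide how to treat them.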
Next, the administrative division dictionary is built: publicly available administrative division websites provide complete division codes and names at the levels of country, province (municipality directly under the central government), city, county (district), township (neighborhood committee), and village (community). These are organized into a dictionary in the required format (the organizational form dictionary is built in the same way).
Then the industry dictionary is built: a segmenter is used to segment every enterprise name in order to obtain the smallest semantic units. Analysis of the segmentation results shows that the vast majority of trade names are segmented inaccurately and that trade names are usually not repeated across companies, so the trade name part is low-frequency, whereas administrative divisions and organizational forms are relatively fixed and easy to distinguish.
Therefore, after frequency statistics are computed over the segmentation results, low-frequency words are removed, then words contained in the administrative division dictionary are removed, then words in the organizational form dictionary are removed; what finally remains is the industry information.
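The frequency-based filtering described above can be sketched as follows. The function name, the `min_count` threshold, and its default value are assumptions for illustration; the original does not say where the low-frequency cutoff lies.

```python
from collections import Counter


def build_industry_candidates(segmented_names, admin_dict, org_dict, min_count=2):
    """Return candidate industry words from already-segmented company names.

    segmented_names: iterable of token lists, one list per company name.
    admin_dict, org_dict: sets of administrative division and organizational
    form words.
    min_count: assumed cutoff; words seen fewer times count as low-frequency.
    """
    # Count how often each token occurs across all company names.
    counts = Counter(tok for tokens in segmented_names for tok in tokens)
    # Keep frequent tokens that are neither divisions nor organizational forms.
    return {
        word
        for word, n in counts.items()
        if n >= min_count and word not in admin_dict and word not in org_dict
    }
```

The returned set is only a candidate list; as the text goes on to say, it still needs manual verification before it is used as the industry dictionary.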
The result still contains errors, which are corrected through manual verification. For completeness, some industry dictionaries are also crawled and merged with this industry dictionary. This finally forms the industry dictionary.
The administrative division dictionary and industry dictionary compiled above are loaded into the segmenter as custom dictionaries, which ensures that these information units are not split by mistake during segmentation.
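The original does not name the segmenter it uses. As a stand-in, the sketch below uses plain forward maximum matching to show how a custom dictionary keeps known units intact; the function name and the single-character fallback are assumptions, not the patent's segmenter.

```python
def segment(text, dictionary):
    """Forward maximum matching against a custom dictionary.

    At each position the longest dictionary word is taken as one token.
    Characters not covered by any dictionary word become single-character
    tokens, so dictionary entries are never split.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens
```

With a dictionary containing 北京市, 科技, and 有限公司, the hypothetical name 北京市华夏科技有限公司 is split so that those three entries stay whole, while the unknown trade name falls out as leftover single characters between them.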
Next, the administrative division and industry position markers are located: the positions of the administrative division and the industry term in the character string can be obtained from the segmentation results;
Finally, from the positions of the administrative division and the industry term in the character string, subtracting the two positions gives the position of the trade name, from which the trade name string is extracted.
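The position lookup and extraction can be sketched as follows. This is an illustration under stated assumptions: the administrative division is taken to be a run of leading tokens, the trade name is taken to end at the first industry token, and the function name and `None` return value are hypothetical.

```python
def extract_trade_name(tokens, admin_dict, industry_dict):
    """Extract the trade name from a segmented company name.

    Returns the substring between the end of the leading administrative
    division tokens and the start of the first industry token, or None if
    no industry token is found or the span is empty.
    """
    text = "".join(tokens)
    offset = 0
    idx = 0
    # Skip leading division tokens (for example a province, then a city).
    while idx < len(tokens) and tokens[idx] in admin_dict:
        offset += len(tokens[idx])
        idx += 1
    admin_end = offset  # character offset where the division prefix ends
    # Walk forward until the first industry token; offset is its start index.
    for tok in tokens[idx:]:
        if tok in industry_dict:
            return text[admin_end:offset] or None
        offset += len(tok)
    return None
```

Names in which the division appears in the middle, or which have no industry term at all, are outside this sketch and would need extra rules.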
The method of the present invention can be applied to query completion in vertical search, so that the information a user is querying can be obtained quickly. It can be applied to matching algorithms in company name entity linking, where the similarity of company names is calculated by assigning different weights to their parts. It can also be applied to scenarios in which company names must be annotated automatically because machine learning (deep learning) training requires a large annotated corpus.
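The original does not say how the weights are chosen or how per-part similarity is measured. The sketch below is one possible reading: each name is assumed to be already split into four parts, the weights in `ASSUMED_WEIGHTS` are invented for illustration, and `difflib.SequenceMatcher` stands in for whatever string similarity a real system would use.

```python
from difflib import SequenceMatcher

# Hypothetical weights: the trade name is assumed to matter most.
ASSUMED_WEIGHTS = {
    "division": 0.2,
    "trade_name": 0.5,
    "industry": 0.2,
    "org_form": 0.1,
}


def component_similarity(a, b):
    """Similarity of two strings in [0, 1]; two empty strings count as equal."""
    if not a and not b:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def company_name_similarity(left, right, weights=ASSUMED_WEIGHTS):
    """Weighted average of per-part similarities of two parsed company names.

    left, right: dicts mapping part names (the keys of `weights`) to strings.
    """
    total = sum(weights.values())
    score = sum(
        w * component_similarity(left.get(key, ""), right.get(key, ""))
        for key, w in weights.items()
    )
    return score / total
```

Under these assumed weights, two names that differ only in trade name score lower than two names that differ only in organizational form, which is the behavior a weighted scheme is meant to give.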
A person skilled in the art can readily implement the present invention from the above specific embodiments. It should be understood, however, that the present invention is not limited to these specific embodiments. On the basis of the disclosed embodiments, a person skilled in the art may freely combine different technical features to realize different technical solutions.
Claims (8)
1. A method for extracting the trade name from a company name, characterized in that the method is as follows:
First, the text is preprocessed: the text is segmented to obtain the smallest semantic units;
Second, an administrative division dictionary, an organizational form dictionary, and an industry dictionary are built in the required format and loaded into the segmenter as custom dictionaries for precise segmentation;
The positions of the administrative division and the industry term in the character string are obtained;
The position of the trade name is calculated from the positions of the administrative division and the industry term;
The trade name string is extracted according to the position of the trade name.
2. The method for extracting the trade name from a company name according to claim 1, characterized in that building the industry dictionary includes: after segmenting the text, building the administrative division dictionary and the organizational form dictionary in the required format; computing frequency statistics over the segmentation results; removing low-frequency entries; removing entries found in the administrative division dictionary and the organizational form dictionary; and obtaining the industry information by crawling, thereby forming the industry dictionary.
3. The method for extracting the trade name from a company name according to claim 2, characterized in that building the industry dictionary also includes a manual verification step and an industry dictionary completion step;
The manual verification step corrects the crawled industry information;
The industry dictionary completion step merges the crawled industry information with the existing corresponding industry dictionary.
4. The method for extracting the trade name from a company name according to claim 1, characterized in that the text preprocessing includes removing redundant information by regular expression or dictionary matching, where the redundant information includes one or more of punctuation marks, spaces, and blank lines.
5. A system for extracting the trade name from a company name, characterized in that it comprises a text preprocessing module, a segmenter dictionary loading and segmentation module, an administrative division and industry position acquisition module, a trade name position acquisition module, and a trade name extraction module;
The text preprocessing module obtains the smallest semantic units by segmenting the text;
The segmenter dictionary loading and segmentation module builds the administrative division dictionary, organizational form dictionary, and industry dictionary in the required format, and loads them into the segmenter as custom dictionaries for precise segmentation;
The administrative division and industry position acquisition module obtains the positions of the administrative division and the industry term in the character string;
The trade name position acquisition module calculates the position of the trade name from the positions of the administrative division and the industry term;
The trade name extraction module extracts the trade name string according to the position of the trade name.
6. The system for extracting the trade name from a company name according to claim 5, characterized in that the segmenter dictionary loading and segmentation module includes a dictionary creation unit, a dictionary loading unit, and a segmentation unit;
The dictionary creation unit builds the administrative division dictionary, organizational form dictionary, and industry dictionary in the required format;
The dictionary loading unit loads the administrative division dictionary, organizational form dictionary, and industry dictionary into the segmenter as custom dictionaries;
The segmentation unit performs precise segmentation.
7. The system for extracting the trade name from a company name according to claim 6, characterized in that it also includes a manual verification unit and an industry dictionary merging unit;
After the text is segmented, the administrative division dictionary and the organizational form dictionary are built in the required format, frequency statistics are computed over the segmentation results, low-frequency entries are removed, entries found in the administrative division dictionary and the organizational form dictionary are removed, and the industry information is obtained by crawling;
The manual verification unit corrects the crawled industry information;
The industry dictionary merging unit merges the crawled industry information with the existing corresponding industry dictionary.
8. A server for extracting the trade name from a company name, characterized in that the server comprises:
One or more processors;
A storage device for storing one or more programs;
When the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811258104.7A CN109522417A (en) | 2018-10-26 | 2018-10-26 | A kind of trading company's abstracting method of company name |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811258104.7A CN109522417A (en) | 2018-10-26 | 2018-10-26 | A kind of trading company's abstracting method of company name |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109522417A (en) | 2019-03-26 |
Family
ID=65773161
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811258104.7A Pending CN109522417A (en) | 2018-10-26 | 2018-10-26 | A kind of trading company's abstracting method of company name |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109522417A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110134779A (en) * | 2019-05-13 | 2019-08-16 | 极智(上海)企业管理咨询有限公司 | A kind of method of enterprise name processing |
CN110381115A (en) * | 2019-06-14 | 2019-10-25 | 平安科技(深圳)有限公司 | Information-pushing method, device, computer readable storage medium and computer equipment |
CN110704719A (en) * | 2019-09-29 | 2020-01-17 | 北京金堤科技有限公司 | Enterprise search text word segmentation method and device |
CN111079428A (en) * | 2019-12-27 | 2020-04-28 | 出门问问信息科技有限公司 | Word segmentation and industry dictionary construction method and device and readable storage medium |
CN112784015A (en) * | 2021-01-25 | 2021-05-11 | 北京金堤科技有限公司 | Information recognition method and apparatus, device, medium, and program |
CN115270800A (en) * | 2022-09-28 | 2022-11-01 | 广州市玄武无线科技股份有限公司 | Method, device and equipment for extracting terminal store names and computer storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015149533A1 (en) * | 2014-03-31 | 2015-10-08 | 北京奇虎科技有限公司 | Method and device for word segmentation processing on basis of webpage content classification |
CN106777336A (en) * | 2017-01-13 | 2017-05-31 | 深圳爱拼信息科技有限公司 | Company name component extraction system and method based on deep learning |
CN108595435A (en) * | 2018-05-03 | 2018-09-28 | 鹏元征信有限公司 | A kind of organization names identifying processing method, intelligent terminal and storage medium |
Non-Patent Citations (2)
Title |
---|
王宁等: "中文金融新闻中公司名的识别", 《中文信息学报》 * |
胡万亭等: "一种基于词频统计的组织机构名识别方法", 《计算机应用研究》 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110134779A (en) * | 2019-05-13 | 2019-08-16 | 极智(上海)企业管理咨询有限公司 | A kind of method of enterprise name processing |
CN110381115A (en) * | 2019-06-14 | 2019-10-25 | 平安科技(深圳)有限公司 | Information-pushing method, device, computer readable storage medium and computer equipment |
CN110381115B (en) * | 2019-06-14 | 2022-03-11 | 平安科技(深圳)有限公司 | Information pushing method and device, computer readable storage medium and computer equipment |
CN110704719A (en) * | 2019-09-29 | 2020-01-17 | 北京金堤科技有限公司 | Enterprise search text word segmentation method and device |
CN110704719B (en) * | 2019-09-29 | 2022-03-08 | 北京金堤科技有限公司 | Enterprise search text word segmentation method and device |
CN111079428A (en) * | 2019-12-27 | 2020-04-28 | 出门问问信息科技有限公司 | Word segmentation and industry dictionary construction method and device and readable storage medium |
CN111079428B (en) * | 2019-12-27 | 2023-09-19 | 北京羽扇智信息科技有限公司 | Word segmentation and industry dictionary construction method and device and readable storage medium |
CN112784015A (en) * | 2021-01-25 | 2021-05-11 | 北京金堤科技有限公司 | Information recognition method and apparatus, device, medium, and program |
CN112784015B (en) * | 2021-01-25 | 2024-03-12 | 北京金堤科技有限公司 | Information identification method and device, apparatus, medium, and program |
CN115270800A (en) * | 2022-09-28 | 2022-11-01 | 广州市玄武无线科技股份有限公司 | Method, device and equipment for extracting terminal store names and computer storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190326 |