CN109710765A - A kind of company's trade classification calculation method based on natural language processing - Google Patents
- Publication number
- CN109710765A (application CN201811624587.8A)
- Authority
- CN
- China
- Prior art keywords
- company
- data
- text
- classification
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for computing a company's industry classification based on natural language processing. Text data for the companies to be classified is obtained with a web crawler; features are extracted from the text data, the data is denoised, and word vectors are trained; a classification model is then pre-trained using language models and transfer learning, and the text data is classified hierarchically to determine the industry of each target company. The process is simple and efficient and saves manpower and material resources. The hierarchical classification system yields about 30 first-level classes and about 300 second-level classes, which substantially improves classification precision. The model accepts text input of varying length and form without any adjustment to the model, giving it a wider range of application and greater practicality.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a method for computing a company's industry classification based on natural language processing.
Background technique
In data search, accurate industry positioning helps users quickly judge whether a target company meets their needs. Existing industry classification mainly relies on manually labeling a company's industry category, on formulating industry classification rules to determine a company's industry, or on conventional classification methods (such as support vector machines or decision trees). These approaches have the following problems:
(1) Manual methods: knowledge barriers exist between industries, so labeling can only be completed effectively with the participation of a large number of industry experts, which consumes substantial manpower and material resources;
(2) Rule-based methods: the number of companies is huge, and it is difficult to formulate industry classification rules that cover all markets; new companies also emerge constantly, so the rules are hard to keep up to date; and formulating the rules requires many people, which makes the approach difficult to implement;
(3) Conventional classification methods: feature extraction is required, and documents lose information after processing, which tends to reduce classification accuracy.
In view of this, the inventor studied in depth the shortcomings and inconveniences that remain unresolved in the above approaches and, through active research, improvement, and experimentation, developed and designed the present invention.
Summary of the invention
The purpose of the present invention is to provide a method for computing a company's industry classification based on natural language processing that offers high classification precision and wide applicability, requires little manual labeling, and saves manpower and material resources.
To achieve the above purpose, the solution of the invention is:
1. A method for computing a company's industry classification based on natural language processing, comprising the following steps:
Step 1, data acquisition
Web page data is collected with a crawler to obtain text data containing textual descriptions of the products or services of the companies to be classified;
Step 2, data analysis
2.1 Feature extraction: the text data of all companies to be classified together forms the corpus, and the text data of each company to be classified is treated as one document; features are extracted from each company's text data, and the features include the company's products, the data source, TF-IDF statistics, and bag-of-words (BOW) statistics; data is labeled by means of active learning; the web page URL is segmented and used as a feature, and it is processed by a noisy channel layer to quantify the noise of the data source;
2.2 Data cleaning: the text data is cleaned by removing purely numeric text, lowercasing, removing common (stop) words, removing low-frequency words, and lemmatization;
2.3 Word-vector training: word vectors are trained on the cleaned text data with GloVe and word2vec;
Step 3, deep learning framework
The features extracted in step 2.1 are combined with the word vectors from step 2.3, and a deep learning model is trained using the ELMo and ULMFiT models and a wide-and-deep model;
Step 4, hierarchical classification
The trained deep learning model computes a first-level class for the text data; for each first-level class, a separate model, chosen according to the characteristics of the data, is trained to obtain a second-level classifier; the second-level classifier to apply is selected according to the first-level class that is output, which yields the company's industry classification.
The web page data comes from the official website home page, first-level pages, social network home page, or enterprise yellow pages of the companies to be classified.
With the above method, the process of the invention is simple and efficient. Pre-training the classification model with language models and transfer learning greatly improves accuracy and saves manpower and material resources. The hierarchical classification system yields about 30 first-level classes and about 300 second-level classes, which substantially improves classification precision. The model accepts text input of varying length and form without any adjustment to the model, giving it a wider range of application and greater practicality.
In addition, active learning is used during data labeling to keep the model updated in real time, which improves timeliness and reduces repeated labor.
Specific embodiment
To further explain the technical solution of the invention, it is described in detail below through a specific embodiment.
The products a company makes or the services it provides reflect that company's own characteristics, so a company's industry classification can be analyzed by computing the similarity of these characteristics.
The present invention is a method for computing a company's industry classification based on natural language processing, comprising the following steps:
Step 1, data acquisition
Web page data is collected with a crawler to obtain text data containing textual descriptions of the products or services of the companies to be classified.
The web page data comes from platforms on which the companies to be classified can publish information about themselves, such as the official website home page, first-level pages, social network home page, or enterprise yellow pages. The semantic information contained in this text data can be used for text understanding and accurate industry classification.
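The text-extraction part of step 1 can be sketched as follows. This is a minimal illustration using only the Python standard library; the class name `VisibleTextExtractor`, the helper `extract_text`, and the sample page are hypothetical, and the patent does not say which crawler or parser is used. Downloading the page is left out so the sketch runs without network access.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect visible text from an HTML page, skipping script/style content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the visible text of one HTML document as a single string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical page content; a real crawler would download it first.
sample_html = """
<html><head><title>Acme Robotics</title><style>body{color:red}</style></head>
<body><h1>Industrial robots</h1><p>We build welding robots.</p>
<script>var x = 1;</script></body></html>
"""
page_text = extract_text(sample_html)
print(page_text)
```

The resulting string is what the later steps would treat as one company's document.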
Step 2, data analysis
2.1 Feature extraction: the text data of all companies to be classified together forms the corpus, and the text data of each company to be classified is treated as one document. Features are extracted from each company's text data; the features include the company's products, the data source, TF-IDF statistics, BOW (bag-of-words) statistics, and so on. Data is labeled through an active learning labeling-recommendation system. The web page URL is segmented and used as a feature, and it is processed by a noisy channel layer to quantify the noise of the data source and thereby increase the accuracy of the model in subsequent steps. For example, the URL www.google.com is segmented into www | google | com.
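The URL segmentation and the BOW and TF-IDF statistics of step 2.1 might look like the sketch below. The function names and the toy corpus are hypothetical, and the TF-IDF variant (relative term frequency times the logarithm of the inverse document frequency) is an assumption, since the patent does not state which weighting it uses. The noisy channel layer is not shown.

```python
import math
import re
from collections import Counter


def segment_url(url):
    """Split a URL into lowercase tokens on non-alphanumeric characters."""
    without_scheme = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)
    return [tok for tok in re.split(r"[^0-9A-Za-z]+", without_scheme.lower()) if tok]


def bag_of_words(tokens):
    """Raw term counts for one document."""
    return Counter(tokens)


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))          # count each term once per document
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus: one token list per company (hypothetical data).
corpus = [
    ["welding", "robots", "factory"],
    ["cloud", "software", "factory"],
]
weights = tf_idf(corpus)
print(segment_url("www.google.com"))
print(bag_of_words(corpus[0]))
print(weights[0])
```

A term that appears in every document, such as "factory" here, gets weight zero under this variant, which is why TF-IDF highlights the words that distinguish one company from another.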
2.2 Data cleaning: the text data is cleaned with a series of natural language processing steps, such as removing purely numeric text, lowercasing, removing common (stop) words, removing low-frequency words, and lemmatization.
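A minimal sketch of the cleaning steps in 2.2 is given below. The stop-word set and the lemma lookup table are tiny, hypothetical stand-ins for real linguistic resources, and the `min_count` threshold is an assumed parameter. Lemmatization is applied before the low-frequency filter so that inflected forms are counted together; the patent does not fix this order.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"the", "a", "and", "of", "we", "for"}
LEMMAS = {"robots": "robot", "builds": "build", "services": "service"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def clean_corpus(texts, min_count=2):
    """Clean every document, then drop words rarer than min_count in the corpus."""
    tokenized = []
    for text in texts:
        tokens = tokenize(text)                              # lowercasing
        tokens = [t for t in tokens if not t.isdigit()]      # purely numeric text
        tokens = [t for t in tokens if t not in STOP_WORDS]  # common words
        tokens = [LEMMAS.get(t, t) for t in tokens]          # lemmatization
        tokenized.append(tokens)
    counts = Counter(t for tokens in tokenized for t in tokens)
    return [[t for t in tokens if counts[t] >= min_count] for tokens in tokenized]


texts = [
    "We build 2000 Robots for the factory",
    "Robot services and factory software 24",
]
cleaned = clean_corpus(texts)
print(cleaned)
```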
2.3 Word-vector training: word vectors are trained on the cleaned text data with GloVe and word2vec and serve as input to the deep learning model.
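To make the inputs of the two word-vector methods concrete, the sketch below builds the statistics each one is trained on: distance-weighted co-occurrence counts, which GloVe fits its vectors to, and (center, context) pairs, which the skip-gram variant of word2vec learns from. It does not train any vectors. The function names and the window size are assumptions for illustration.

```python
from collections import defaultdict


def cooccurrence_counts(documents, window=2):
    """Symmetric co-occurrence counts; a pair d tokens apart contributes 1/d."""
    counts = defaultdict(float)
    for tokens in documents:
        for i, center in enumerate(tokens):
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)
                counts[(center, tokens[j])] += weight
                counts[(tokens[j], center)] += weight
    return dict(counts)


def skipgram_pairs(tokens, window=2):
    """All (center, context) pairs within the window, in document order."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


documents = [["industrial", "welding", "robot"]]
counts = cooccurrence_counts(documents)
pairs = skipgram_pairs(["a", "b", "c"], window=1)
print(counts)
print(pairs)
```

In practice an existing GloVe or word2vec implementation would consume the cleaned token lists directly and return one vector per word.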
Step 3, deep learning framework
The features extracted in step 2.1 are combined with the word vectors from step 2.3, and a deep learning model is trained using the ELMo and ULMFiT models and a wide-and-deep model.
The ELMo and ULMFiT models are applied in this step; their accuracy can reach the highest level achieved on standard data sets.
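The wide-and-deep part of step 3 can be illustrated with a forward pass in plain Python. In a wide-and-deep model, a linear "wide" part over sparse features (for example TF-IDF or URL tokens) and a "deep" network over dense features (for example averaged word vectors) each produce class logits, and the logits are summed before the output activation. All weights, dimensions, and names below are made up for illustration, training is omitted, and the ELMo and ULMFiT components are not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, inputs):
    """Affine layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def wide_and_deep_forward(wide_x, deep_x, params):
    """Sum the wide (linear) and deep (one hidden layer) logits, then softmax."""
    wide_logits = linear(params["wide_w"], params["wide_b"], wide_x)
    hidden = relu(linear(params["hidden_w"], params["hidden_b"], deep_x))
    deep_logits = linear(params["out_w"], params["out_b"], hidden)
    return softmax([w + d for w, d in zip(wide_logits, deep_logits)])


# Hypothetical parameters for 2 classes, 3 wide features, 2 dense features.
params = {
    "wide_w": [[0.5, 0.0, -0.5], [-0.5, 0.0, 0.5]],
    "wide_b": [0.0, 0.0],
    "hidden_w": [[1.0, 0.0], [0.0, 1.0]],
    "hidden_b": [0.0, 0.0],
    "out_w": [[1.0, -1.0], [-1.0, 1.0]],
    "out_b": [0.0, 0.0],
}
probs = wide_and_deep_forward([1.0, 0.0, 0.0], [0.2, -0.1], params)
print(probs)
```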
Step 4, hierarchical classification
The trained deep learning model computes a first-level class for the text data. For each first-level class, a separate model, chosen according to the characteristics of the data, is trained to obtain a second-level classifier. The second-level classifier to apply is selected according to the first-level class that is output, which yields the company's industry classification.
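The routing logic of step 4 can be sketched as a two-stage predictor: the first-level prediction selects which second-level classifier receives the same input. The `KeywordClassifier` below is a toy stand-in for the trained models, and all labels and keywords are hypothetical.

```python
class KeywordClassifier:
    """Toy stand-in for a trained model: picks the label with the most keyword hits."""

    def __init__(self, keywords_by_label):
        self.keywords_by_label = keywords_by_label

    def predict(self, tokens):
        token_set = set(tokens)
        return max(
            self.keywords_by_label,
            key=lambda label: len(token_set & self.keywords_by_label[label]),
        )


class HierarchicalClassifier:
    """Route each input to the second-level classifier of its first-level class."""

    def __init__(self, first_level, second_level_by_class):
        self.first_level = first_level
        self.second_level_by_class = second_level_by_class

    def predict(self, tokens):
        first = self.first_level.predict(tokens)
        second = self.second_level_by_class[first].predict(tokens)
        return first, second


# Hypothetical labels and keywords for illustration only.
first_level = KeywordClassifier({
    "manufacturing": {"robot", "factory", "welding"},
    "software": {"cloud", "software", "app"},
})
second_level = {
    "manufacturing": KeywordClassifier({
        "industrial robots": {"robot", "welding"},
        "automotive": {"car", "engine"},
    }),
    "software": KeywordClassifier({
        "cloud services": {"cloud", "hosting"},
        "mobile apps": {"app", "mobile"},
    }),
}
model = HierarchicalClassifier(first_level, second_level)
print(model.predict(["welding", "robot", "factory"]))
print(model.predict(["cloud", "hosting", "software"]))
```

Because each second-level classifier only has to separate the subclasses of one first-level class, each can use whatever model suits that class's data, which is the design the description relies on.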
The process of the invention is simple and efficient. Pre-training the classification model with language models and transfer learning greatly improves accuracy and saves manpower and material resources. The hierarchical classification system yields about 30 first-level classes and about 300 second-level classes, which substantially improves classification precision. The model accepts text input of varying length and form without any adjustment to the model, giving it a wider range of application and greater practicality.
In addition, active learning is used during data labeling to keep the model updated in real time, which improves timeliness and reduces repeated labor.
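One common way to realize active learning in data labeling is uncertainty sampling: the companies whose predicted class distribution has the highest entropy are sent to human annotators first. The patent does not name a query strategy, so the choice of entropy, the function names, and the sample predictions below are all assumptions.

```python
import math


def entropy(probabilities):
    """Shannon entropy (natural log) of a class probability distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def select_for_labeling(predictions, budget):
    """predictions: {company_id: class probabilities}. Returns the most uncertain ids."""
    ranked = sorted(predictions, key=lambda cid: entropy(predictions[cid]), reverse=True)
    return ranked[:budget]


# Hypothetical model outputs for three unlabeled companies.
predictions = {
    "company_a": [0.98, 0.01, 0.01],
    "company_b": [0.40, 0.35, 0.25],
    "company_c": [0.70, 0.20, 0.10],
}
to_label = select_for_labeling(predictions, budget=2)
print(to_label)
```

After the selected companies are labeled, the model is retrained and the selection is repeated, so labeling effort goes where the current model is least certain.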
The above embodiment does not limit the product form or style of the invention; any appropriate change or modification made to it by a person of ordinary skill in the art shall be regarded as not departing from the patent scope of the invention.
Claims (2)
1. A method for computing a company's industry classification based on natural language processing, characterized in that it comprises the following steps:
Step 1, data acquisition
Web page data is collected with a crawler to obtain text data containing textual descriptions of the products or services of the companies to be classified;
Step 2, data analysis
2.1 Feature extraction: the text data of all companies to be classified together forms the corpus, and the text data of each company to be classified is treated as one document; features are extracted from each company's text data, and the features include the company's products, the data source, TF-IDF statistics, and bag-of-words (BOW) statistics; data is labeled by means of active learning; the web page URL is segmented and used as a feature, and it is processed by a noisy channel layer to quantify the noise of the data source;
2.2 Data cleaning: the text data is cleaned by removing purely numeric text, lowercasing, removing common (stop) words, removing low-frequency words, and lemmatization;
2.3 Word-vector training: word vectors are trained on the cleaned text data with GloVe and word2vec;
Step 3, deep learning framework
The features extracted in step 2.1 are combined with the word vectors from step 2.3, and a deep learning model is trained using the ELMo and ULMFiT models and a wide-and-deep model;
Step 4, hierarchical classification
The trained deep learning model computes a first-level class for the text data; for each first-level class, a separate model, chosen according to the characteristics of the data, is trained to obtain a second-level classifier; the second-level classifier to apply is selected according to the first-level class that is output, which yields the company's industry classification.
2. The method for computing a company's industry classification based on natural language processing according to claim 1, characterized in that the web page data comes from the official website home page, first-level pages, social network home page, or enterprise yellow pages of the companies to be classified.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811624587.8A CN109710765A (en) | 2018-12-28 | 2018-12-28 | A kind of company's trade classification calculation method based on natural language processing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109710765A true CN109710765A (en) | 2019-05-03 |
Family
ID=66257975
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811624587.8A Withdrawn CN109710765A (en) | 2018-12-28 | 2018-12-28 | A kind of company's trade classification calculation method based on natural language processing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109710765A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111860981A (en) * | 2020-07-03 | 2020-10-30 | 航天信息(山东)科技有限公司 | Enterprise national industry category prediction method and system based on LSTM deep learning |
CN113139066A (en) * | 2021-04-28 | 2021-07-20 | 安徽智侒信信息技术有限公司 | Company industry link point matching method based on natural language processing technology |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2006111953A1 (en) * | 2005-04-17 | 2006-10-26 | Shlomo Brach | Method and system for conducting internet websites search |
CN103324628A (en) * | 2012-03-21 | 2013-09-25 | 腾讯科技(深圳)有限公司 | Industry classification method and system for text publishing |
CN104199851A (en) * | 2014-08-11 | 2014-12-10 | 北京奇虎科技有限公司 | Method for extracting telephone numbers according to yellow page information and cloud server |
CN105975457A (en) * | 2016-05-03 | 2016-09-28 | 成都数联铭品科技有限公司 | Information classification prediction system based on full-automatic learning |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111860981A (en) * | 2020-07-03 | 2020-10-30 | 航天信息(山东)科技有限公司 | Enterprise national industry category prediction method and system based on LSTM deep learning |
CN111860981B (en) * | 2020-07-03 | 2024-01-19 | 航天信息(山东)科技有限公司 | Enterprise national industry category prediction method and system based on LSTM deep learning |
CN113139066A (en) * | 2021-04-28 | 2021-07-20 | 安徽智侒信信息技术有限公司 | Company industry link point matching method based on natural language processing technology |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110175325B (en) | Comment analysis method based on word vector and syntactic characteristics and visual interaction interface | |
CN106055673B (en) | A kind of Chinese short text sensibility classification method based on text feature insertion | |
CN104573046B (en) | A kind of comment and analysis method and system based on term vector | |
CN107451126B (en) | Method and system for screening similar meaning words | |
Anastasia et al. | Twitter sentiment analysis of online transportation service providers | |
CN106709754A (en) | Power user grouping method based on text mining | |
CN107038480A (en) | A kind of text sentiment classification method based on convolutional neural networks | |
CN106708966A (en) | Similarity calculation-based junk comment detection method | |
CN106055675B (en) | A kind of Relation extraction method based on convolutional neural networks and apart from supervision | |
CN110362678A (en) | A kind of method and apparatus automatically extracting Chinese text keyword | |
CN105740229B (en) | The method and device of keyword extraction | |
US20170091318A1 (en) | Apparatus and method for extracting keywords from a single document | |
CN108388554B (en) | Text emotion recognition system based on collaborative filtering attention mechanism | |
CN109165294A (en) | Short text classification method based on Bayesian classification | |
CN106126502A (en) | A kind of emotional semantic classification system and method based on support vector machine | |
CN107818173B (en) | Vector space model-based Chinese false comment filtering method | |
CN107038249A (en) | Network public sentiment information sensibility classification method based on dictionary | |
CN113360647B (en) | 5G mobile service complaint source-tracing analysis method based on clustering | |
CN106547864A (en) | A kind of Personalized search based on query expansion | |
CN109558587A (en) | A kind of classification method for the unbalanced public opinion orientation identification of category distribution | |
CN107133212A (en) | It is a kind of that recognition methods is contained based on integrated study and the text of words and phrases integrated information | |
Malandrakis et al. | SAIL: A hybrid approach to sentiment analysis | |
CN106055633A (en) | Chinese microblog subjective and objective sentence classification method | |
CN109710765A (en) | A kind of company's trade classification calculation method based on natural language processing | |
CN110110087A (en) | A kind of Feature Engineering method for Law Text classification based on two classifiers |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WW01 | Invention patent application withdrawn after publication | Application publication date: 20190503 |