CN109710765A - A company industry classification calculation method based on natural language processing

Info

Publication number
CN109710765A
Authority
CN
China
Prior art keywords
company
data
text
classification
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201811624587.8A
Other languages
Chinese (zh)
Inventor
王凯锋
吴承霖
金立达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Benniao Agel Ecommerce Ltd
Original Assignee
Xiamen Benniao Agel Ecommerce Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-12-28
Filing date
2018-12-28
Publication date
2019-05-03
Application filed by Xiamen Benniao Agel Ecommerce Ltd
Priority to CN201811624587.8A
Publication of CN109710765A
Legal status: Withdrawn

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a company industry classification calculation method based on natural language processing. Text data for the companies to be classified is obtained with a web crawler; features are extracted from the text data, the data is denoised, and word vectors are trained. After a classification model has been pre-trained using language models and transfer learning, the text data is classified hierarchically to obtain the classification of the target company. The process is simple and efficient and saves manpower and material resources. The hierarchical classification scheme yields about 30 first-level categories and about 300 second-level categories, which greatly improves classification precision. The model accepts text input of different lengths and forms without any adjustment to the model, so its range of application is wider and its practicality higher.

Description

A company industry classification calculation method based on natural language processing
Technical field
The present invention relates to the field of Internet technology, and in particular to a company industry classification calculation method based on natural language processing.
Background art
In data search, accurate industry positioning helps users quickly judge whether a target company meets their needs. Existing industry classification is mainly implemented by manually labeling each company's industry category, by formulating industry classification rules to determine a company's industry, or by conventional classification methods (such as support vector machines or decision trees). These approaches have the following problems:
(1) Manual methods: knowledge barriers exist between industries, so labeling can only be completed effectively with the participation of a large number of industry experts, which consumes a great deal of manpower and material resources;
(2) Rule-based methods: the number of companies is huge, and it is difficult to formulate industry classification rules that cover every market segment; new companies keep emerging, so the rules are hard to keep up to date; and formulating rules requires many people, which makes the approach difficult to implement;
(3) Conventional classification methods: these require feature extraction, and documents lose information during that processing, which tends to reduce classification accuracy.
In view of this, the inventors, addressing the shortcomings and inconveniences that these unresolved problems cause, conceived and actively researched improvements, and through trial developed and designed the present invention.
Summary of the invention
The purpose of the present invention is to provide a company industry classification calculation method based on natural language processing that offers high classification precision and wide applicability, and that requires little manual labeling, saving manpower and material resources.
To achieve the above purpose, the solution of the present invention is:
A company industry classification calculation method based on natural language processing, comprising the following steps:
Step 1, data acquisition
Web page data is crawled to obtain text data containing textual descriptions of the products or services of the companies to be classified;
Step 2, data analysis
2.1 Feature extraction: the text data of all companies to be classified together forms the corpus, and the text data of each company to be classified is treated as one document. Features are extracted from the text data of the companies to be classified; the features include the company's products, the data source, TF-IDF statistics and bag-of-words (BOW) statistics. Data labeling is carried out through active learning. Web page URLs are split into tokens and used as features, and are processed by a noisy channel layer to quantify the noise of each data source;
2.2 Data cleaning: the text data is cleaned by removing purely numeric text, lowercasing, removing common words (stop words), removing low-frequency words and lemmatization;
2.3 Word vector training: word vectors are obtained by training GloVe and word2vec on the cleaned text data;
Step 3, deep learning framework
Combining the features extracted in step 2.1 with the word vectors from step 2.3, a deep learning model is trained using the ELMo and ULMFiT models and a Wide & Deep model;
Step 4, hierarchical classification
Using the trained deep learning model, the first-level category of the text data is computed. For each first-level category, a separate model, chosen according to the characteristics of that category's data, is trained to obtain a second-level classifier. The second-level classifier to apply is selected according to the first-level category that is output, thereby achieving the industry classification of the company.
The web page data comes from the official website home page, first-level pages, social network home page or enterprise yellow pages of the companies to be classified.
With the above method, the process of the present invention is simple and efficient. Pre-training the classification model with language models and transfer learning greatly improves accuracy while saving manpower and material resources. The hierarchical classification scheme yields about 30 first-level categories and about 300 second-level categories, which greatly improves classification precision. The model accepts text input of different lengths and forms without any adjustment to the model, so its range of application is wider and its practicality higher.
In addition, active learning is used during data labeling, which keeps the model updated in real time, improves timeliness and reduces repeated labor.
Specific embodiment
To further explain the technical solution of the present invention, the invention is described in detail below through a specific embodiment.
The products a company makes or the services it provides reflect that company's own characteristics, so a company's industry classification can be analyzed by computing the similarity between these characteristics.
The present invention is a company industry classification calculation method based on natural language processing, comprising the following steps:
Step 1, data acquisition
Web page data is crawled to obtain text data containing textual descriptions of the products or services of the companies to be classified.
The web page data comes from platforms on which a company to be classified can publish information about itself, such as its official website home page, first-level pages, social network home page or enterprise yellow pages. The semantic information contained in this text data can be used for text understanding and accurate industry classification.
Step 2, data analysis
2.1 Feature extraction: the text data of all companies to be classified together forms the corpus, and the text data of each company to be classified is treated as one document. Features are extracted from the text data of the companies to be classified; the features include the company's products, the data source, TF-IDF statistics, bag-of-words (BOW) statistics and so on. Data labeling is carried out through an active-recommendation labeling system (active learning). Web page URLs are split into tokens and used as features, and are processed by a noisy channel layer to quantify the noise of each data source and thereby improve the accuracy of the models in later steps. For example, the URL www.google.com is split into www | google | com.
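The statistical features named in step 2.1 can be illustrated with a minimal sketch. The patent does not give formulas or tokenization rules, so the following are assumptions: the helper names `split_url`, `bow` and `tfidf` are hypothetical, the TF-IDF variant is the plain tf = count / document length, idf = log(N / df) form, and the toy corpus is invented for illustration.

```python
import math
import re
from collections import Counter


def split_url(url):
    """Split a URL into lowercase tokens on any non-alphanumeric character."""
    return [tok for tok in re.split(r"[^0-9A-Za-z]+", url.lower()) if tok]


def bow(tokens):
    """Bag-of-words: raw term counts for one document."""
    return Counter(tokens)


def tfidf(docs):
    """TF-IDF per document, with tf = count / len(doc) and idf = log(N / df)."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency: count each term once per doc
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus: one token list per company (one "document" each).
corpus = [
    ["cloud", "storage", "software"],
    ["steel", "pipe", "manufacturing"],
    ["cloud", "hosting", "software"],
]
url_tokens = split_url("www.google.com")
weights = tfidf(corpus)
```

Terms that occur in only one company's text (such as "steel") receive a higher weight than terms shared by several companies (such as "cloud"), which is the property that makes TF-IDF useful as an industry signal.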
2.2 Data cleaning: the text data is cleaned through a series of natural language processing operations such as removing purely numeric text, lowercasing, removing common words (stop words), removing low-frequency words and lemmatization.
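The cleaning operations in step 2.2 could be chained roughly as follows. This is a sketch under stated assumptions, not the patent's implementation: the stop-word list and the lemma lookup table are tiny hypothetical stand-ins for a full stop-word list and a real lemmatizer, and the `min_count` threshold for low-frequency words is an invented parameter.

```python
import re
from collections import Counter

# Assumed, deliberately tiny resources; a real pipeline would use a full
# stop-word list and a proper lemmatizer instead of a lookup table.
STOP_WORDS = {"the", "a", "an", "and", "of", "for", "we", "our", "in", "to"}
LEMMAS = {"services": "service", "products": "product", "provides": "provide"}


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters and digits."""
    return re.findall(r"[0-9A-Za-z]+", text.lower())


def clean_corpus(texts, min_count=2):
    """Apply the cleaning operations, then drop corpus-wide low-frequency words."""
    docs = []
    for text in texts:
        tokens = tokenize(text)                            # lowercasing
        tokens = [t for t in tokens if not t.isdigit()]     # purely numeric text
        tokens = [t for t in tokens if t not in STOP_WORDS]  # common words
        tokens = [LEMMAS.get(t, t) for t in tokens]         # lemmatization
        docs.append(tokens)
    # Low-frequency filtering needs counts over the whole corpus, so it runs last.
    freq = Counter(t for doc in docs for t in doc)
    return [[t for t in doc if freq[t] >= min_count] for doc in docs]


cleaned = clean_corpus([
    "We provide Cloud Services since 2010",
    "Our cloud service provides storage",
])
```

Low-frequency filtering is applied after lemmatization here so that inflected variants of the same word are counted together before the threshold is checked.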
2.3 Word vector training: word vectors are obtained by training GloVe and word2vec on the cleaned text data, and serve as the input to the deep learning model.
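The patent gives no training details for step 2.3. As a small illustration of what these methods consume, the sketch below builds the distance-weighted word co-occurrence counts that GloVe is fitted to; word2vec instead trains on (center word, context word) pairs drawn from the same kind of sliding window. The function name `cooccurrence_counts` and the window size are assumptions, and the actual vector fitting is left to a dedicated library.

```python
from collections import defaultdict


def cooccurrence_counts(docs, window=2):
    """Symmetric co-occurrence counts; a pair d tokens apart contributes 1/d."""
    counts = defaultdict(float)
    for tokens in docs:
        for i, word in enumerate(tokens):
            # Look back at most `window` tokens and update both directions.
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return counts


# Hypothetical cleaned document for one company.
counts = cooccurrence_counts([["cloud", "storage", "software"]], window=2)
```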
Step 3, deep learning framework
Combining the features extracted in step 2.1 with the word vectors from step 2.3, a deep learning model is trained using the ELMo and ULMFiT models and a Wide & Deep model.
The ELMo and ULMFiT models applied in this step can achieve the highest accuracy on standard datasets.
Step 4, hierarchical classification
Using the trained deep learning model, the first-level category of the text data is computed. For each first-level category, a separate model, chosen according to the characteristics of that category's data, is trained to obtain a second-level classifier. The second-level classifier to apply is selected according to the first-level category that is output, thereby achieving the industry classification of the company.
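The routing logic of step 4 can be sketched as below. Only the two-stage dispatch reflects the text; everything else is a hypothetical stand-in: `keyword_classifier` replaces the trained deep learning models with simple keyword matching, and the category names and keyword sets are invented for illustration.

```python
from typing import Callable, Dict, List, Set, Tuple

Classifier = Callable[[List[str]], str]


def keyword_classifier(rules: Dict[str, Set[str]], default: str) -> Classifier:
    """Hypothetical stand-in for a trained model: label with the most keyword hits."""
    def classify(tokens: List[str]) -> str:
        scores = {label: sum(t in kws for t in tokens) for label, kws in rules.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else default
    return classify


class HierarchicalClassifier:
    """First-level prediction selects which second-level classifier to run."""

    def __init__(self, level1: Classifier, level2: Dict[str, Classifier]):
        self.level1 = level1
        self.level2 = level2  # one separately trained classifier per first-level category

    def predict(self, tokens: List[str]) -> Tuple[str, str]:
        first = self.level1(tokens)
        second = self.level2[first](tokens)
        return first, second


# Invented example categories; the patent reports about 30 and about 300.
model = HierarchicalClassifier(
    level1=keyword_classifier(
        {"Technology": {"software", "cloud"}, "Manufacturing": {"steel", "pipe"}},
        default="Other",
    ),
    level2={
        "Technology": keyword_classifier(
            {"Cloud services": {"cloud", "hosting"}, "Enterprise software": {"erp", "crm"}},
            default="Other technology",
        ),
        "Manufacturing": keyword_classifier(
            {"Metal products": {"steel", "pipe"}}, default="Other manufacturing"
        ),
        "Other": lambda tokens: "Unclassified",
    },
)
```

Keeping one second-level classifier per first-level category means each one only has to separate roughly ten sibling categories rather than all of them at once, and each can use a different model type.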
The process of the present invention is simple and efficient. Pre-training the classification model with language models and transfer learning greatly improves accuracy while saving manpower and material resources. The hierarchical classification scheme yields about 30 first-level categories and about 300 second-level categories, which greatly improves classification precision. The model accepts text input of different lengths and forms without any adjustment to the model, so its range of application is wider and its practicality higher.
In addition, active learning is used during data labeling, which keeps the model updated in real time, improves timeliness and reduces repeated labor.
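The patent does not say which active learning strategy is used for labeling. One common choice, shown here purely as an assumption, is uncertainty sampling: the companies whose predicted class distribution has the highest entropy are sent to human annotators first. The function names, company ids and probabilities below are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a class probability distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Return the `budget` company ids whose predictions are most uncertain.

    predictions: dict mapping company id -> list of class probabilities.
    """
    ranked = sorted(predictions, key=lambda cid: entropy(predictions[cid]), reverse=True)
    return ranked[:budget]


# Hypothetical model outputs for three unlabeled companies.
predictions = {
    "company_a": [0.98, 0.01, 0.01],  # confident, low labeling value
    "company_b": [0.40, 0.35, 0.25],  # very uncertain
    "company_c": [0.70, 0.20, 0.10],
}
to_label = select_for_labeling(predictions, budget=2)
```

After the selected companies are labeled, the model would be retrained and the selection repeated, so annotators spend their time on the examples the current model handles worst.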
The above embodiment does not limit the product form or style of the present invention; any appropriate change or modification made to it by a person of ordinary skill in the art shall be regarded as not departing from the patent scope of the present invention.

Claims (2)

1. A company industry classification calculation method based on natural language processing, characterized by comprising the following steps:
Step 1, data acquisition
Web page data is crawled to obtain text data containing textual descriptions of the products or services of the companies to be classified;
Step 2, data analysis
2.1 Feature extraction: the text data of all companies to be classified together forms the corpus, and the text data of each company to be classified is treated as one document. Features are extracted from the text data of the companies to be classified; the features include the company's products, the data source, TF-IDF statistics and bag-of-words (BOW) statistics. Data labeling is carried out through active learning. Web page URLs are split into tokens and used as features, and are processed by a noisy channel layer to quantify the noise of each data source;
2.2 Data cleaning: the text data is cleaned by removing purely numeric text, lowercasing, removing common words (stop words), removing low-frequency words and lemmatization;
2.3 Word vector training: word vectors are obtained by training GloVe and word2vec on the cleaned text data;
Step 3, deep learning framework
Combining the features extracted in step 2.1 with the word vectors from step 2.3, a deep learning model is trained using the ELMo and ULMFiT models and a Wide & Deep model;
Step 4, hierarchical classification
Using the trained deep learning model, the first-level category of the text data is computed. For each first-level category, a separate model, chosen according to the characteristics of that category's data, is trained to obtain a second-level classifier. The second-level classifier to apply is selected according to the first-level category that is output, thereby achieving the industry classification of the company.
2. The company industry classification calculation method based on natural language processing according to claim 1, characterized in that the web page data comes from the official website home page, first-level pages, social network home page or enterprise yellow pages of the companies to be classified.
CN201811624587.8A, priority date 2018-12-28, filing date 2018-12-28: A company industry classification calculation method based on natural language processing. Status: Withdrawn. Published as CN109710765A (en).

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811624587.8A CN109710765A (en) 2018-12-28 2018-12-28 A kind of company's trade classification calculation method based on natural language processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811624587.8A CN109710765A (en) 2018-12-28 2018-12-28 A kind of company's trade classification calculation method based on natural language processing

Publications (1)

Publication Number Publication Date
CN109710765A true CN109710765A (en) 2019-05-03

Family

ID=66257975

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811624587.8A Withdrawn CN109710765A (en) 2018-12-28 2018-12-28 A kind of company's trade classification calculation method based on natural language processing

Country Status (1)

Country Link
CN (1) CN109710765A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006111953A1 (en) * 2005-04-17 2006-10-26 Shlomo Brach Method and system for conducting internet websites search
CN103324628A (en) * 2012-03-21 2013-09-25 腾讯科技(深圳)有限公司 Industry classification method and system for text publishing
CN104199851A (en) * 2014-08-11 2014-12-10 北京奇虎科技有限公司 Method for extracting telephone numbers according to yellow page information and cloud server
CN105975457A (en) * 2016-05-03 2016-09-28 成都数联铭品科技有限公司 Information classification prediction system based on full-automatic learning

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860981A (en) * 2020-07-03 2020-10-30 航天信息(山东)科技有限公司 Enterprise national industry category prediction method and system based on LSTM deep learning
CN111860981B (en) * 2020-07-03 2024-01-19 航天信息(山东)科技有限公司 Enterprise national industry category prediction method and system based on LSTM deep learning
CN113139066A (en) * 2021-04-28 2021-07-20 安徽智侒信信息技术有限公司 Company industry link point matching method based on natural language processing technology

Similar Documents

Publication Publication Date Title
CN110175325B (en) Comment analysis method based on word vector and syntactic characteristics and visual interaction interface
CN106055673B (en) A kind of Chinese short text sensibility classification method based on text feature insertion
CN104573046B (en) A kind of comment and analysis method and system based on term vector
CN107451126B (en) Method and system for screening similar meaning words
Anastasia et al. Twitter sentiment analysis of online transportation service providers
CN106709754A (en) Power user grouping method based on text mining
CN107038480A (en) A kind of text sentiment classification method based on convolutional neural networks
CN106708966A (en) Similarity calculation-based junk comment detection method
CN106055675B (en) A kind of Relation extraction method based on convolutional neural networks and apart from supervision
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
CN105740229B (en) The method and device of keyword extraction
US20170091318A1 (en) Apparatus and method for extracting keywords from a single document
CN108388554B (en) Text emotion recognition system based on collaborative filtering attention mechanism
CN109165294A (en) Short text classification method based on Bayesian classification
CN106126502A (en) A kind of emotional semantic classification system and method based on support vector machine
CN107818173B (en) Vector space model-based Chinese false comment filtering method
CN107038249A (en) Network public sentiment information sensibility classification method based on dictionary
CN113360647B (en) 5G mobile service complaint source-tracing analysis method based on clustering
CN106547864A (en) A kind of Personalized search based on query expansion
CN109558587A (en) A kind of classification method for the unbalanced public opinion orientation identification of category distribution
CN107133212A (en) It is a kind of that recognition methods is contained based on integrated study and the text of words and phrases integrated information
Malandrakis et al. SAIL: A hybrid approach to sentiment analysis
CN106055633A (en) Chinese microblog subjective and objective sentence classification method
CN109710765A (en) A kind of company's trade classification calculation method based on natural language processing
CN110110087A (en) A kind of Feature Engineering method for Law Text classification based on two classifiers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20190503
