CN109710765A - A company industry classification calculation method based on natural language processing

Info

Publication number: CN109710765A
Application number: CN201811624587.8A
Authority: CN
Inventors: 王凯锋; 吴承霖; 金立达
Original assignee: Xiamen Benniao Agel Ecommerce Ltd
Current assignee: Xiamen Benniao Agel Ecommerce Ltd
Priority date: 2018-12-28
Filing date: 2018-12-28
Publication date: 2019-05-03

Abstract

The invention discloses a company industry classification calculation method based on natural language processing. Text data for each company to be classified is obtained by a web crawler; features are extracted from the text data, the data is denoised, and word vectors are trained; a classification model is pre-trained using language models and transfer learning; and the text data is then classified hierarchically, yielding the classification of the target company. The process is simple and efficient and saves manpower and material resources. The hierarchical classification system provides about 30 first-level classes and about 300 second-level classes, which greatly improves classification precision. The model accepts text input of varying length and form without any adjustment to the model, giving it a wider range of application and greater practicality.

Description

A company industry classification calculation method based on natural language processing

Technical field

The present invention relates to the field of Internet technology, and in particular to a company industry classification calculation method based on natural language processing.

Background technique

In data search, accurate industry positioning helps users quickly judge whether a target company meets their needs. Existing industry classification is mainly achieved by manually labelling each company's industry category, by formulating industry classification rules to determine a company's industry, or by conventional classification methods (such as support vector machines or decision trees). These approaches have the following problems:

(1) Manual method: knowledge barriers exist between industries, so a large number of industry experts must take part before labelling can be completed effectively, which consumes a great deal of manpower and material resources;

(2) Rule-based method: the number of companies is huge, so it is difficult to formulate industry classification rules that take every market into account; new companies emerge constantly, so the rules are hard to keep up to date; and formulating the rules requires a large number of personnel, making the method difficult to implement;

(3) Conventional classification methods: feature extraction is required, and the documents lose information after processing, which tends to reduce classification accuracy.

In view of this, the inventors, having considered the many deficiencies and inconveniences caused by the above unresolved problems, conceived and actively researched improvements, and developed and designed the present invention.

Summary of the invention

The purpose of the present invention is to provide a company industry classification calculation method based on natural language processing that offers high classification precision and wide applicability, and that requires little manual labelling, saving manpower and material resources.

To achieve the above objective, the solution of the present invention is:

1. A company industry classification calculation method based on natural language processing, comprising the following steps:

Step 1, data acquisition

Web data is crawled to obtain text data containing a textual description of the products or services of each company to be classified;

Step 2, data analysis

2.1 Feature extraction: the text data of all companies to be classified together forms the corpus, and the text data of each company to be classified is treated as one document. Features are extracted from each company's text data, including the company's products, the data source, TF-IDF statistics, and bag-of-words (BOW) statistics; data labelling is carried out by active learning; each web page URL is segmented and used as a feature, which is processed by a noisy channel layer to quantify the noise of the data source;

2.2 Data cleaning: the text data is cleaned by removing purely numeric text, converting to lowercase, removing common words, removing low-frequency words, and lemmatization;

2.3 Word vector training: word vectors are trained on the cleaned text data using GloVe and word2vec;

Step 3, deep learning framework

The features extracted in step 2.1 are combined with the word vectors from step 2.3, and the deep learning model is trained using the ELMo and ULMFiT models and a wide and deep model;

Step 4, hierarchical classification

The trained deep learning model is applied to the text data to compute the first-level classification. For each first-level class, a separate model, chosen according to the characteristics of that class's data, is trained to obtain a second-level classifier. The second-level classifier to apply is selected according to the first-level class that is output, yielding the industry classification of the company.
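The two-level routing in step 4 can be sketched as follows. This is a minimal illustration, not the patented models: the keyword-counting `keyword_classifier` is a hypothetical stand-in for the trained first-level and second-level classifiers, and every class name and keyword is invented for the example.

```python
from typing import Callable, Dict, List, Set, Tuple

# A classifier maps a company's tokens to a class label.
Classifier = Callable[[List[str]], str]


def keyword_classifier(keywords: Dict[str, Set[str]], default: str) -> Classifier:
    """Hypothetical stand-in for a trained model: picks the label whose
    keyword set matches the most tokens, or `default` if nothing matches."""
    def classify(tokens: List[str]) -> str:
        scores = {
            label: sum(token in words for token in tokens)
            for label, words in keywords.items()
        }
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else default
    return classify


def classify_company(
    tokens: List[str],
    first_level: Classifier,
    second_level: Dict[str, Classifier],
) -> Tuple[str, str]:
    """Run the first-level classifier, then the second-level classifier
    that was trained for the predicted first-level class."""
    level1 = first_level(tokens)
    level2 = second_level[level1](tokens)
    return level1, level2


# Illustrative taxonomy (assumed, not taken from the patent).
first = keyword_classifier(
    {"Manufacturing": {"steel", "pipe"}, "Software": {"cloud", "app"}},
    default="Other",
)
second = {
    "Manufacturing": keyword_classifier(
        {"Metal products": {"steel"}, "Plastics": {"pvc"}},
        default="Other manufacturing",
    ),
    "Software": keyword_classifier(
        {"SaaS": {"cloud"}, "Mobile": {"app"}},
        default="Other software",
    ),
    "Other": lambda tokens: "Unclassified",
}

print(classify_company(["steel", "pipe", "supplier"], first, second))
```

Keeping one second-level classifier per first-level class means each can use a different model type, as the step describes, without changing the routing code.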

The web data comes from the official website home page, first-level pages, social network home page, or enterprise yellow pages of the company to be classified.

With the above method, the process of the present invention is simple and efficient. Pre-training the classification model with language models and transfer learning greatly improves accuracy while saving manpower and material resources. The hierarchical classification system provides about 30 first-level classes and about 300 second-level classes, which greatly improves classification precision. The model accepts text input of varying length and form without any adjustment to the model, giving it a wider range of application and greater practicality.

In addition, active learning is used during data labelling so that the model is updated in real time, which improves timeliness and reduces duplicated labour.
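The text does not say which active learning strategy is used. As one common possibility, the sketch below uses uncertainty sampling: the companies whose predicted class distribution has the highest entropy are sent to human annotators first. The function names and the example probabilities are assumptions made for illustration.

```python
import math
from typing import Dict, List


def entropy(probabilities: List[float]) -> float:
    """Shannon entropy of a predicted class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def select_for_labelling(predictions: Dict[str, List[float]], budget: int) -> List[str]:
    """Return the `budget` company ids the current model is least sure about.

    `predictions` maps a company id to the model's class probabilities.
    """
    ranked = sorted(predictions, key=lambda cid: entropy(predictions[cid]), reverse=True)
    return ranked[:budget]


# Hypothetical model outputs for three unlabelled companies.
predictions = {
    "company_a": [0.98, 0.01, 0.01],  # confident prediction
    "company_b": [0.40, 0.35, 0.25],  # very uncertain prediction
    "company_c": [0.70, 0.20, 0.10],  # moderately uncertain prediction
}
print(select_for_labelling(predictions, budget=2))
```

After the selected companies are labelled, the model would be retrained and the selection repeated, which is what keeps the labelling effort focused on informative examples.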

Specific embodiment

To further explain the technical solution of the present invention, the invention is described in detail below by way of a specific embodiment.

The products a company makes or the services it provides embody that company's own characteristics, so a company's industry classification can be analysed by computing the degree of similarity between these characteristics.

The present invention is a company industry classification calculation method based on natural language processing, comprising the following steps:

Step 1, data acquisition

Web data is crawled to obtain text data containing a textual description of the products or services of each company to be classified.

The above web data comes from platforms on which the company to be classified can publish information about itself, such as its official website home page, first-level pages, social network home page, or enterprise yellow pages. The semantic information contained in this text data can be used for text understanding and accurate industry classification.
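Once a page has been downloaded, the crawler still has to turn HTML into plain text. The sketch below shows one way to do that with Python's standard `html.parser` module; the class name, the choice to skip `script` and `style` content, and the sample page are assumptions, and fetching the page over the network is deliberately left out.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the visible text of an HTML page as one space-joined string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical company home page.
page = (
    "<html><head><title>Acme</title><style>p{color:red}</style></head>"
    "<body><h1>Acme Steel</h1><p>We make steel pipes.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(extract_text(page))
```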

Step 2, data analysis

2.1 Feature extraction: the text data of all companies to be classified together forms the corpus, and the text data of each company to be classified is treated as one document. Features are extracted from each company's text data, including the company's products, the data source, TF-IDF statistics, and bag-of-words (BOW) statistics. Data labelling is carried out by an active recommendation labelling system (active learning). Each web page URL is segmented and used as a feature, which is processed by a noisy channel layer to quantify the noise of the data source and thereby increase the accuracy of the model in subsequent steps. For example, the URL www.google.com is segmented into www | google | com.
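The URL segmentation and the BOW and TF-IDF statistics from step 2.1 can be sketched in plain Python as below. The function names are illustrative, and the TF-IDF variant shown (relative term frequency multiplied by the unsmoothed logarithm of inverse document frequency) is only one of several common definitions; the text does not specify which is used.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def segment_url(url: str) -> List[str]:
    """Split a URL into lowercase tokens on any non-alphanumeric character."""
    return [token for token in re.split(r"[^0-9A-Za-z]+", url.lower()) if token]


def bag_of_words(tokens: List[str]) -> Counter:
    """BOW statistics: raw count of each token in one company's document."""
    return Counter(tokens)


def tf_idf(corpus: List[List[str]]) -> List[Dict[str, float]]:
    """TF-IDF weights per document; each company's text is one document."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in corpus:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


print(segment_url("www.google.com"))

# Two hypothetical, already tokenised company documents.
corpus = [["steel", "pipe"], ["steel", "software"]]
print(tf_idf(corpus))
```

A term such as "steel" that appears in every document receives a weight of zero under this definition, which is why TF-IDF highlights the terms that distinguish one company from another.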

2.2 Data cleaning: the text data is cleaned by a series of natural language processing steps, such as removing purely numeric text, converting to lowercase, removing common words, removing low-frequency words, and lemmatization.
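A minimal sketch of the cleaning steps in 2.2 follows. The stop-word set and the lemma lookup table are tiny hypothetical stand-ins; a real pipeline would take both from an NLP library. Lemmatization is applied before the low-frequency filter here so that inflected forms are counted together, which is a design choice of this sketch rather than something the text prescribes.

```python
import re
from collections import Counter
from typing import List

# Illustrative stand-ins (assumed); real lists would come from an NLP library.
STOP_WORDS = {"the", "and", "of", "a", "we", "for"}
LEMMAS = {"pipes": "pipe", "valves": "valve", "sells": "sell"}


def clean_corpus(documents: List[str], min_count: int = 2) -> List[List[str]]:
    """Lowercase, drop numeric tokens and stop words, lemmatize, then drop
    tokens that occur fewer than `min_count` times in the whole corpus."""
    tokenised = []
    for text in documents:
        tokens = re.findall(r"[a-z0-9]+", text.lower())       # lowercase and split
        tokens = [t for t in tokens if not t.isdigit()]       # purely numeric text
        tokens = [t for t in tokens if t not in STOP_WORDS]   # common words
        tokens = [LEMMAS.get(t, t) for t in tokens]           # lemmatization
        tokenised.append(tokens)
    counts = Counter(token for doc in tokenised for token in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenised]


documents = [
    "We sell steel Pipes and valves since 1998",
    "Steel pipe and valve supplier 2020",
]
print(clean_corpus(documents))
```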

2.3 Word vector training: word vectors are trained on the cleaned text data using GloVe and word2vec, and serve as the input to the deep learning model.
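Training the vectors themselves would normally be delegated to a GloVe or word2vec implementation. The sketch below shows only the first stage of GloVe-style training, building windowed co-occurrence counts in which a pair of words d positions apart contributes 1/d. The function name and the window size are assumptions, and the subsequent fitting of vectors to these counts is not shown.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def cooccurrence_counts(
    documents: List[List[str]], window: int = 2
) -> Dict[Tuple[str, str], float]:
    """Symmetric word co-occurrence counts with 1/distance weighting."""
    counts = defaultdict(float)
    for tokens in documents:
        for i, center in enumerate(tokens):
            # Look back at most `window` positions; each pair is added in
            # both directions so the resulting table is symmetric.
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)
                counts[(center, tokens[j])] += weight
                counts[(tokens[j], center)] += weight
    return dict(counts)


# One hypothetical cleaned company document.
counts = cooccurrence_counts([["steel", "pipe", "valve"]], window=2)
print(counts[("steel", "pipe")], counts[("steel", "valve")])
```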

Step 3, deep learning framework

The features extracted in step 2.1 are combined with the word vectors from step 2.3, and the deep learning model is trained using the ELMo and ULMFiT models and a wide and deep model.

With the ELMo and ULMFiT models applied in this step, accuracy can reach the highest level achieved on standard datasets.
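How the wide and deep parts are wired together is not spelled out in the text. The sketch below shows only a forward pass of a generic wide and deep classifier in plain Python: the wide part is a linear layer over the hand-crafted features from step 2.1, the deep part mean-pools the word vectors and passes them through one hidden layer, and the two sets of logits are summed before the softmax. The layer sizes, the random untrained weights, and the mean pooling are all assumptions; ELMo and ULMFiT pre-training and the training loop are not shown.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix: Matrix, vector: List[float]) -> List[float]:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rng: random.Random, rows: int, cols: int) -> Matrix:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class WideAndDeep:
    """Untrained wide and deep classifier (forward pass only)."""

    def __init__(self, n_wide, embed_dim, n_hidden, n_classes, seed=0):
        rng = random.Random(seed)
        self.wide = random_matrix(rng, n_classes, n_wide)        # wide: linear
        self.hidden = random_matrix(rng, n_hidden, embed_dim)    # deep: hidden layer
        self.deep_out = random_matrix(rng, n_classes, n_hidden)  # deep: output layer

    def predict_proba(self, wide_features, word_vectors):
        # Mean pooling maps a non-empty text of any length to a fixed-size vector.
        dim = len(word_vectors[0])
        pooled = [sum(v[k] for v in word_vectors) / len(word_vectors) for k in range(dim)]
        hidden = [max(0.0, h) for h in matvec(self.hidden, pooled)]  # ReLU
        wide_logits = matvec(self.wide, wide_features)
        deep_logits = matvec(self.deep_out, hidden)
        return softmax([w + d for w, d in zip(wide_logits, deep_logits)])


model = WideAndDeep(n_wide=3, embed_dim=4, n_hidden=5, n_classes=2)
short_text = [[0.1, 0.2, 0.3, 0.4]]
long_text = [[0.1, 0.2, 0.3, 0.4], [0.0, 0.1, 0.0, 0.1], [0.5, 0.5, 0.5, 0.5]]
print(model.predict_proba([1.0, 0.0, 2.0], short_text))
print(model.predict_proba([1.0, 0.0, 2.0], long_text))
```

Pooling before the hidden layer is one simple way for a model to accept texts of different lengths without changing its parameters.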

Step 4, hierarchical classification

The process of the present invention is simple and efficient. Pre-training the classification model with language models and transfer learning greatly improves accuracy while saving manpower and material resources. The hierarchical classification system provides about 30 first-level classes and about 300 second-level classes, which greatly improves classification precision. The model accepts text input of varying length and form without any adjustment to the model, giving it a wider range of application and greater practicality.

The above embodiment does not limit the product form or style of the present invention; any appropriate changes or modifications made to it by a person of ordinary skill in the art shall be regarded as not departing from the patent scope of the present invention.

Claims

1. A company industry classification calculation method based on natural language processing, characterized by comprising the following steps:

Step 1, data acquisition

Step 2, data analysis

Step 3, deep learning framework

Step 4, hierarchical classification

2. The company industry classification calculation method based on natural language processing according to claim 1, characterized in that: the web data comes from the official website home page, first-level pages, social network home page, or enterprise yellow pages of the company to be classified.