CN114090736A

CN114090736A - Enterprise industry identification system and method based on text similarity

Info

Publication number: CN114090736A
Application number: CN202111372067.4A
Authority: CN
Inventors: 张晖; 冯海; 杨弋; 王铮; 张鹏; 魏兵兵; 姚晗
Original assignee: Sichuan Institute Of Standardization; Southwest University of Science and Technology
Current assignee: Sichuan Institute Of Standardization; Southwest University of Science and Technology
Priority date: 2021-11-18
Filing date: 2021-11-18
Publication date: 2022-02-25

Abstract

The invention discloses an enterprise industry identification system and method based on text similarity. The system comprises a data preprocessing module, a data sampling module, a synonym expansion module, a vector space conversion module, a data labeling module and an industry identification module. The data preprocessing module preprocesses text and generates a bag of words containing verbs and nouns; the data sampling module samples and reads part of the data in a unified social credit code database; and the synonym expansion module performs synonym expansion on the sampled data and on the national economic industry classification data. Synonym expansion improves the accuracy of the similarity comparison. Random sampling extracts a small amount of data from the unified social credit code database for similarity comparison with the national economic industry classification standard data; because the amount of sampled data is smaller than the amount of non-sampled data, the overall efficiency of industry identification is improved.

Description

Enterprise industry identification system and method based on text similarity

Technical Field

The invention relates to the technical field of data processing, in particular to an enterprise industry identification system and an enterprise industry identification method based on text similarity.

Background

The unified social credit code database contains basic information on legal persons such as companies. However, the "business scope" field is entered by each enterprise itself and is therefore not standardized, so the industry to which an enterprise belongs cannot be obtained from it directly, and subsequent per-industry analysis and statistics are difficult to carry out. The state has published a national economic industry classification standard, so the industry of an enterprise can be determined by comparing the "business scope" text of the enterprise in the unified social credit code database with the business scope descriptions in that standard.

Many text similarity calculation methods exist, chiefly methods based on word distance, on bags of words and on ontologies. Enterprise business scope data, however, has particular characteristics: the texts are short and carry little information, so similarity calculation on them has low accuracy; the volume of enterprise data is very large, so comparing every record with the national standard data one by one is slow; and the business scope wording used by enterprises in the unified social credit code database is not standardized, which makes identification difficult or impossible. The invention therefore provides an enterprise industry identification system and method based on text similarity to solve these problems in the prior art.

Disclosure of Invention

In view of the above problems, the invention aims to provide an enterprise industry identification system and method based on text similarity. The system and method perform synonym expansion on the data, which improves the accuracy of the similarity comparison, and use random sampling to extract a small amount of data from the unified social credit code database for similarity comparison with the national economic industry classification standard data. Because the amount of sampled data is smaller than the amount of non-sampled data, the overall efficiency of industry identification is improved.

To achieve this purpose, the invention adopts the following technical scheme: an enterprise industry identification system based on text similarity comprises a data preprocessing module, a data sampling module, a synonym expansion module, a vector space conversion module, a data labeling module and an industry identification module. The data preprocessing module preprocesses text and generates a bag of words containing verbs and nouns. The data sampling module samples and reads part of the data in a unified social credit code database. The synonym expansion module performs synonym expansion on the sampled data and on the national economic industry classification data. The vector space conversion module converts the synonym-expanded data and the non-sampled data into a vector space through word embedding. The data labeling module calculates the similarity between the enterprise business scope field in the sampled data and the business scope descriptions of the national economic industries, and labels the sampled data accordingly. The industry identification module trains a machine learning algorithm on the labeled data and uses the resulting classification model to obtain the industry category of the unlabeled unified social credit code data.

A further improvement is that: during text preprocessing, the data preprocessing module removes punctuation and stop words from the text data and performs word segmentation; it then tags the words by part of speech and keeps only the verbs and nouns, which form the bag of words.
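For illustration only, the preprocessing described above might be sketched as follows. The function `preprocess`, the `segment_and_tag` callable and the toy English lexicon are hypothetical stand-ins introduced here: a real system would plug in a trained Chinese word segmenter and part-of-speech tagger, and the sketch merges segmentation and tagging into one call for brevity.

```python
import re

# Any character that is neither a word character nor whitespace.
# For Python str patterns \w is Unicode-aware, so CJK punctuation is matched too.
PUNCTUATION = re.compile(r"[^\w\s]")


def preprocess(text, segment_and_tag, stop_words, keep_tags=("n", "v")):
    """Return the verb/noun bag of words for one business scope text.

    segment_and_tag is an assumed callable mapping a string to a list of
    (word, pos_tag) pairs; it stands in for the segmenter and tagger.
    """
    cleaned = PUNCTUATION.sub(" ", text)          # 1. remove punctuation
    tagged = segment_and_tag(cleaned)             # 2. segment and tag
    return [
        word
        for word, tag in tagged
        if word not in stop_words                 # 3. drop stop words
        and tag in keep_tags                      # 4. keep verbs and nouns only
    ]


# Toy lexicon and tagger, for demonstration only (not a real tagger).
TOY_LEXICON = {
    "sell": "v", "develop": "v",
    "software": "n", "hardware": "n",
    "and": "c", "new": "a",
}


def toy_segment_and_tag(text):
    """Whitespace tokenizer with dictionary lookup; unknown words get tag 'x'."""
    return [(tok, TOY_LEXICON.get(tok, "x")) for tok in text.lower().split()]


bag = preprocess(
    "Develop software; sell new hardware, and services.",
    toy_segment_and_tag,
    stop_words={"and"},
)
```

Passing the tagger in as a parameter keeps the filtering logic independent of whichever segmentation and tagging tool is actually used.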

A further improvement is that: the data sampling module randomly extracts part of the data in the unified social credit code database according to a sampling ratio set by the user. With random sampling, the amount of sampled data is smaller than the amount of non-sampled data, which improves the overall efficiency of industry identification.
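A minimal sketch of ratio-based random sampling is given below, assuming the records fit in memory as a Python list. The function name `split_by_sampling` and the `seed` parameter are illustrative choices, not part of the disclosure.

```python
import random


def split_by_sampling(records, ratio, seed=None):
    """Randomly split records into (sampled, non_sampled) by a user-set ratio."""
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be strictly between 0 and 1")
    rng = random.Random(seed)
    # Sample at least one record, but never more than are available.
    k = min(len(records), max(1, round(len(records) * ratio)))
    sampled_idx = set(rng.sample(range(len(records)), k))
    sampled = [r for i, r in enumerate(records) if i in sampled_idx]
    non_sampled = [r for i, r in enumerate(records) if i not in sampled_idx]
    return sampled, non_sampled


records = list(range(100))
sampled, non_sampled = split_by_sampling(records, ratio=0.1, seed=0)
```

The sampled part is the one that goes on to similarity-based labeling; the non-sampled part is left for the trained classifier.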

A further improvement is that: for each word in the sampled data and in the national economic industry classification data, the synonym expansion module looks up the most similar words in the synonym forest (Tongyici Cilin) database, up to the number set by the user, and adds them to the database.

The vector space conversion module converts the data into a vector space using the word2vec word embedding algorithm.
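The synonym expansion step can be illustrated with a small sketch. The thesaurus here is a hypothetical in-memory dictionary mapping a word to (synonym, similarity) pairs; it merely stands in for the synonym database, whose actual format and lookup interface are not specified in the text.

```python
def expand_with_synonyms(words, thesaurus, top_n):
    """Append up to top_n most similar synonyms after each word.

    thesaurus: assumed dict of word -> list of (synonym, similarity) pairs.
    Order is preserved and duplicates are dropped.
    """
    expanded = []
    seen = set()
    for word in words:
        candidates = sorted(
            thesaurus.get(word, []), key=lambda pair: pair[1], reverse=True
        )
        for term in [word] + [syn for syn, _ in candidates[:top_n]]:
            if term not in seen:
                seen.add(term)
                expanded.append(term)
    return expanded


# Toy thesaurus with made-up similarity scores, for demonstration only.
TOY_THESAURUS = {
    "sell": [("trade", 0.6), ("retail", 0.9), ("vend", 0.8)],
    "software": [("program", 0.85), ("application", 0.8)],
}

expanded = expand_with_synonyms(["sell", "software"], TOY_THESAURUS, top_n=2)
```

Applying the same expansion to both the enterprise text and the standard's industry descriptions is what lets non-standard wording still overlap with the standard vocabulary.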

A further improvement is that: the data labeling module calculates, one by one, the cosine similarity between the business scope field of each sampled record and the description of each national economic industry;

if one or more national economic industries with a similarity above the preset threshold are found, the enterprise is labeled as belonging to that industry;

and if no national economic industry above the preset threshold is found, the record is labeled manually.
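The threshold-based labeling rule above might look like the following sketch. Two points are assumptions made here rather than statements of the disclosure: when several industries exceed the threshold, the sketch keeps the one with the highest similarity, and a return value of `None` is used to signal that the record should go to manual labeling. The industry codes and vectors are toy values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def label_by_similarity(enterprise_vec, industry_vecs, threshold):
    """Return the most similar industry code above threshold, else None.

    industry_vecs: assumed dict of industry code -> vector.
    None means "no automatic label; send to manual labeling".
    """
    best_code, best_sim = None, threshold
    for code, vec in industry_vecs.items():
        sim = cosine_similarity(enterprise_vec, vec)
        if sim > best_sim:
            best_code, best_sim = code, sim
    return best_code


industry_vecs = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
auto_label = label_by_similarity([0.9, 0.1], industry_vecs, threshold=0.8)
needs_manual = label_by_similarity([1.0, 1.0], industry_vecs, threshold=0.8)
```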

A further improvement is that: the industry identification module uses the XGBoost classification algorithm to train on the word-embedded, labeled unified social credit code business scope data and national economic industry classification data, and then uses the trained XGBoost model to identify the industry category of the word-embedded non-sampled records.
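The train-then-predict flow of the industry identification module can be illustrated with a dependency-free sketch. Note that the sketch deliberately substitutes a simple nearest-centroid classifier for the XGBoost model named above, so it shows only the data flow (fit on labeled sampled vectors, predict on non-sampled vectors) and not the disclosed algorithm. The class name and the toy vectors are hypothetical.

```python
import math
from collections import defaultdict


class NearestCentroidClassifier:
    """Stand-in classifier: assigns each vector to the closest class centroid."""

    def fit(self, vectors, labels):
        sums = {}
        counts = defaultdict(int)
        for vec, label in zip(vectors, labels):
            if label not in sums:
                sums[label] = [0.0] * len(vec)
            sums[label] = [s + v for s, v in zip(sums[label], vec)]
            counts[label] += 1
        # Centroid = component-wise mean of the vectors of each class.
        self.centroids_ = {
            label: [s / counts[label] for s in total]
            for label, total in sums.items()
        }
        return self

    def predict(self, vectors):
        return [
            min(
                self.centroids_,
                key=lambda label: math.dist(vec, self.centroids_[label]),
            )
            for vec in vectors
        ]


# Labeled vectors from the sampled data (toy values).
train_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["A", "A", "B", "B"]

model = NearestCentroidClassifier().fit(train_vectors, train_labels)

# Vectors from the non-sampled data (toy values).
predicted = model.predict([[0.8, 0.2], [0.2, 0.8]])
```

In a real deployment the stand-in would be replaced by a gradient-boosted tree model trained on the same inputs; the surrounding fit and predict calls keep the same shape.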

An identification method using the enterprise industry identification system based on text similarity comprises the following steps:

step one, inputting the unified social credit code database and the national economic industry classification database into the data preprocessing module, which removes punctuation and stop words and performs word segmentation, and then performing part-of-speech tagging and keeping only the verbs and nouns;

step two, setting a sampling ratio, using the data sampling module to randomly sample the unified social credit code data according to that ratio, and extracting a small amount of sampled data to form a training set;

step three, bringing the synonyms and near-synonyms of the verbs and nouns in the sampled data set and in the national economic industry classification data into the calculation: using downloaded synonym forest data, for each verb and noun obtained by the data preprocessing module, selecting the most similar synonyms, up to the number set by the user, and storing them in the database;

step four, converting the word data into a vector space using the word2vec word embedding algorithm;

step five, labeling the sampled unified social credit code data automatically and manually using the data sampling module;

and step six, training a machine learning algorithm on the training set from step two, using the trained model to automatically identify the industry category of the non-sampled unified social credit code data, and outputting the identification result.

A further improvement is that: in step five, the records that have been converted into the vector space are taken in turn from the sampled unified social credit code data, and the cosine similarity between each record and each entry of the national economic industry classification data is calculated one by one; when the similarity is above the threshold set by the user, the record is labeled as belonging to that industry, and when its similarity to all industry data is below the threshold, it is labeled manually.

The beneficial effects of the invention are as follows: the invention uses random sampling to extract a small amount of data from the unified social credit code database for similarity comparison with the national economic industry classification standard data; because the amount of sampled data is smaller than the amount of non-sampled data, the overall efficiency of industry identification is improved;

the invention performs synonym expansion on the verbs and nouns in the bag of words and adds words with the same or similar meaning as the original words to the database, so that a semantically similar industry can still be found even when the wording of the unified social credit code data is not standardized;

the invention splits automatic industry identification into semi-automatic labeling of a small amount of data combined with machine learning on a large amount of data, which improves the efficiency of industry identification while maintaining its accuracy.

Drawings

Fig. 1 is a system structure diagram according to embodiment one of the invention.

Fig. 2 is a flowchart of the method according to embodiment two of the invention.

Fig. 3 is a flowchart of the data preprocessing according to embodiment two of the invention.

Fig. 4 is a flowchart of step five of the method according to embodiment two of the invention.

Detailed Description

To aid understanding, the invention is described in further detail below with reference to the following embodiments, which are provided for illustration only and do not limit the scope of the invention.

Embodiment one

As shown in Fig. 1, this embodiment provides an enterprise industry identification system based on text similarity, which comprises a data preprocessing module, a data sampling module, a synonym expansion module, a vector space conversion module, a data labeling module and an industry identification module. The data preprocessing module preprocesses text and generates a bag of words containing verbs and nouns. The data sampling module samples and reads part of the data in a unified social credit code database. The synonym expansion module performs synonym expansion on the sampled data and on the national economic industry classification data. The vector space conversion module converts the synonym-expanded data and the non-sampled data into a vector space through word embedding. The data labeling module calculates the similarity between the enterprise business scope field in the sampled data and the business scope descriptions of the national economic industries, and labels the sampled data accordingly. The industry identification module trains a machine learning algorithm on the labeled data and uses the resulting classification model to obtain the industry category of the unlabeled unified social credit code data.

During text preprocessing, the data preprocessing module removes punctuation and stop words from the text data and performs word segmentation; it then tags the words by part of speech and keeps only the verbs and nouns, which form the bag of words.

The data sampling module randomly extracts part of the data in the unified social credit code database according to a sampling ratio set by the user. With random sampling, the amount of sampled data is smaller than the amount of non-sampled data, which improves the overall efficiency of industry identification.

For each word in the sampled data and in the national economic industry classification data, the synonym expansion module looks up the most similar words in the synonym forest database, up to the number set by the user, and adds them to the database; this expansion improves the accuracy of the similarity comparison.

The data labeling module calculates, one by one, the cosine similarity between the business scope field of each sampled record and the description of each national economic industry.

The industry identification module uses the XGBoost classification algorithm to train on the word-embedded, labeled unified social credit code business scope data and national economic industry classification data, and then uses the trained XGBoost model to identify the industry category of the word-embedded non-sampled records.

Embodiment two

As shown in Figs. 2, 3 and 4, this embodiment provides an identification method using the enterprise industry identification system based on text similarity, comprising the following steps:

step one, inputting the unified social credit code database and the national economic industry classification database into the data preprocessing module, which removes punctuation, performs word segmentation and removes stop words, and then performs part-of-speech tagging, finally obtaining a bag of words containing only verbs and nouns and storing it in the database; punctuation removal is implemented with regular expressions, and all punctuation is deleted;

word segmentation uses a conditional random field (CRF) algorithm with the Chinese word segmentation corpus of Microsoft Research Asia; stop words are removed by comparing the segmented words against a stop word list and deleting those that match;

part-of-speech tagging uses a conditional random field algorithm with the People's Daily part-of-speech tagged corpus; all verbs and nouns are selected and stored, in order, in the corresponding records;

step three, bringing the synonyms of the verbs and nouns in the sampled data set and in the national economic industry classification data into the calculation: using downloaded synonym forest data, for each verb and noun obtained by the data preprocessing module, selecting the most similar synonyms, up to the number set by the user, and storing them in the database;

step four, converting the word data into a vector space using the word2vec word embedding algorithm, which lays the foundation for the similarity calculation and industry identification in the following steps;

step five, labeling the sampled unified social credit code data automatically and manually using the data sampling module, according to the similarity between the business scope description in the sampled unified social credit code data and the business scope descriptions in the national standard for national economic industry classification;

the records converted into the vector space are taken in turn from the sampled unified social credit code data, and the cosine similarity between each record and each entry of the national economic industry classification data is calculated one by one; if the similarity is above the threshold set by the user, the record is labeled as belonging to that industry, and if its similarity to all industry data is below the threshold, it is labeled manually;

and step six, training a machine learning algorithm on the training set from step two, using the trained model to automatically identify the industry category of the non-sampled unified social credit code data, and outputting the identification result.
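The steps above compare whole business scope texts, while word2vec produces one vector per word. The text does not say how word vectors are combined into a single vector per record; one common choice, assumed here purely for illustration, is to average them. The small `embeddings` dictionary below is a hypothetical stand-in for a trained word2vec model.

```python
def text_vector(words, embeddings, dim):
    """Average the embeddings of the known words; zero vector if none are known.

    embeddings: assumed dict of word -> list of floats of length dim.
    Averaging is an assumption of this sketch, not a detail from the text.
    """
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


# Toy two-dimensional embeddings, for demonstration only.
embeddings = {
    "sell": [1.0, 0.0],
    "software": [0.0, 1.0],
}

record_vec = text_vector(["sell", "software", "unknown"], embeddings, dim=2)
empty_vec = text_vector([], embeddings, dim=2)
```

Words missing from the embedding table are skipped, so an expanded bag of words with a few out-of-vocabulary terms still yields a usable record vector.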

The foregoing illustrates and describes the principles, main features and advantages of the invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the embodiments and the description merely illustrate the principle of the invention, and various changes and modifications may be made without departing from its spirit and scope, all of which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.

Claims

1. An enterprise industry identification system based on text similarity, characterized in that: the system comprises a data preprocessing module, a data sampling module, a synonym expansion module, a vector space conversion module, a data labeling module and an industry identification module, wherein the data preprocessing module is used for preprocessing text and generating a bag of words containing verbs and nouns; the data sampling module is used for sampling and reading part of the data in a unified social credit code database; the synonym expansion module is used for performing synonym expansion on the sampled data and on the national economic industry classification data; the vector space conversion module is used for converting the synonym-expanded data and the non-sampled data into a vector space through word embedding; the data labeling module is used for calculating the similarity between the enterprise business scope field in the sampled data and the business scope descriptions of the national economic industries and labeling the sampled data accordingly; and the industry identification module is used for training a machine learning algorithm on the labeled data and obtaining the industry category of the unlabeled unified social credit code data using the classification model obtained by training.

2. The enterprise industry identification system based on text similarity according to claim 1, wherein: during text preprocessing, the data preprocessing module removes punctuation and stop words from the text data and performs word segmentation; it then tags the words by part of speech and keeps only the verbs and nouns, which form the bag of words.

3. The enterprise industry identification system based on text similarity according to claim 1, wherein: the data sampling module randomly extracts part of the data in the unified social credit code database according to a sampling ratio set by the user.

4. The enterprise industry identification system based on text similarity according to claim 1, wherein: for each word in the sampled data and in the national economic industry classification data, the synonym expansion module looks up the most similar words in the synonym forest database, up to the number set by the user, and adds them to the database.

5. The enterprise industry identification system based on text similarity according to claim 1, wherein: the data labeling module calculates, one by one, the cosine similarity between the business scope field of each sampled record and the description of each national economic industry.

6. The enterprise industry identification system based on text similarity according to claim 1, wherein: the industry identification module uses the XGBoost classification algorithm to train on the word-embedded, labeled unified social credit code business scope data and national economic industry classification data, and then uses the trained XGBoost model to identify the industry category of the word-embedded non-sampled records.

7. An identification method using the enterprise industry identification system based on text similarity according to claim 1, characterized by comprising the following steps:

8. The identification method using the enterprise industry identification system based on text similarity according to claim 7, wherein: in step five, the records that have been converted into the vector space are taken in turn from the sampled unified social credit code data, and the cosine similarity between each record and each entry of the national economic industry classification data is calculated one by one; when the similarity is above the threshold set by the user, the record is labeled as belonging to that industry, and when its similarity to all industry data is below the threshold, it is labeled manually.