CN115827871A - Internet enterprise classification method, device and system - Google Patents

Internet enterprise classification method, device and system

Info

Publication number
CN115827871A
Authority
CN
China
Prior art keywords
enterprise
classification
data
internet
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211690035.3A
Other languages
Chinese (zh)
Inventor
李美燕
吴震
王秀文
李娅强
刘纯艳
王峰
刘鑫
李政达
陈鹏云
杨菁林
赵磊
秦恺
曾宣玮
刘志丞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Great Wall Computer Software & Systems Inc
National Computer Network and Information Security Management Center
Original Assignee
Great Wall Computer Software & Systems Inc
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Great Wall Computer Software & Systems Inc, National Computer Network and Information Security Management Center filed Critical Great Wall Computer Software & Systems Inc
Priority to CN202211690035.3A priority Critical patent/CN115827871A/en
Publication of CN115827871A publication Critical patent/CN115827871A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a device for classifying Internet enterprises, wherein the method comprises the following steps: S1: obtaining multi-dimensional data of an Internet enterprise and preprocessing the multi-dimensional data to generate long text data; S2: inputting the long text data into a Bert network model based on a Transformer encoder for processing; S3: sending the processed data into a classifier to classify the Internet enterprise. The scheme of the invention performs automatic feature-combination learning in a deep neural network based on the Transformer architecture, can accurately classify Internet enterprises by industry, and can greatly improve the accuracy of industry classification of Internet enterprises. The scheme of the invention can rapidly identify the multi-dimensional information of massive enterprises without manual intervention. Based on a large-corpus pre-training model and downstream-task fine-tuning, the scheme of the invention can be flexibly applied to the rapid classification of massive enterprises in different scenarios.

Description

Internet enterprise classification method, device and system
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a method, a device and a system for classifying Internet enterprises.
Background
Enterprise classification is a technology for classifying enterprises by industry using enterprise-related information, and in China it generally follows the current national economic industry classification standard. The classification standard divides industries into 97 major classes, 473 middle classes and 1380 minor classes. The number of enterprises in China is large, and Internet enterprises are a driving force for the transformation and upgrading of China's economic structure. Effectively classifying these enterprises by industry is therefore very necessary: it provides an effective basis for supervision by the corresponding regulatory departments and helps explain the development of specific industries and their position in the national economy. The traditional manual identification mode has low efficiency and strong subjectivity and cannot be applied on a large scale, so an automatic classification technology for rapid and efficient enterprise classification is urgently needed. With the development of artificial intelligence technology, a large number of enterprise classification algorithms have appeared.
Existing enterprise classification methods can be roughly divided into two categories: a classification method based on rule matching and a classification method based on machine learning.
Rule-matching-based methods generally collect relevant enterprise information in advance, compute the similarity between the enterprise information text and the industry classification labels, rank the labels by the similarity scores, and match the top-ranked industry label to the enterprise. Such methods are simple to construct: enterprises can be classified merely by calculating similarity scores between the enterprise information and the labels. However, they are limited by the richness of the collected enterprise information, and different similarity calculation methods affect classification accuracy. In addition, when the industry labels are updated, the matching results may no longer conform to the latest industry standard.
Machine-learning-based methods collect text information of the enterprises to be classified, first clean and segment the text into words, then extract character- and word-based feature vectors from the segmentation results, and finally use the extracted feature vectors to train a classifier that classifies the enterprises. Compared with rule-matching-based methods, this is a considerable improvement, and using the word-vector features of the enterprise information improves classification accuracy. However, classification accuracy is still affected by the quality of word segmentation, and collecting and maintaining the dictionary requires considerable effort. In addition, selecting and combining word-vector features is itself labor-intensive, and the quality of the feature combination has a great influence on the final classification effect.
Therefore, there is a need in the art for a solution that can effectively categorize internet enterprises.
The above information disclosed in this background section is only for enhancing the understanding of the background of the invention, and therefore it may contain information that does not constitute prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
The invention relates to a method, a device and a system for classifying Internet enterprises. Aiming at the defects of existing enterprise classification, such as limited classification dimensions and low accuracy, the invention aims to provide an Internet enterprise classification method and device based on multi-dimensional enterprise information. The scheme of the invention can accurately classify Internet enterprises by industry and, compared with enterprise classification algorithms based on matching or machine learning, can greatly improve the accuracy of industry classification of Internet enterprises.
A first aspect of the present invention provides a method for Internet enterprise classification, wherein the method comprises: s1: the method comprises the steps of obtaining multi-dimensional data of an internet enterprise, and preprocessing the multi-dimensional data to generate long text data; s2: inputting the long text data into a Bert network model based on a Transformer encoder for processing; s3: and sending the processed data into a classifier to classify the Internet enterprises, wherein the classifier is a Softmax classifier.
According to an embodiment of the invention, the multiple dimensions comprise: data related to the enterprise name, main products and businesses, enterprise profile and business scope, and the preprocessing concatenates the data related to the enterprise name, main products and businesses, enterprise profile and business scope, and then carries out text cleaning.
According to an embodiment of the invention, in the step S2, adding an auxiliary classification special mark symbol CLS in front of the long text data, wherein the Bert network model learns the feature vector marked by the CLS; and in the step S3, the feature vector of the CLS mark at the corresponding position in the processed data is input to the classifier.
According to an embodiment of the present invention, wherein in said step S2, said adding a secondary class specific mark symbol CLS comprises: s21: segmenting the long text according to characters, acquiring a sequence number corresponding to the characters in a dictionary, setting the sequence number as the character text Token, and setting a position code Token and a text type Token of the characters in the text, S22: adding the character text Token, the position coding Token and the text type Token according to positions, inputting the added character text Token, the position coding Token and the text type Token into an Embedding layer of the Bert network model, and inputting the obtained vector into a multi-layer self-attention layer of the Bert network model for feature learning.
According to an embodiment of the present invention, the step S2 further includes training a Bert network model, where the training includes: S31: dividing the sorted enterprise data set into a training set and a test set according to a preset proportion, training the Bert network model on the training set, and adjusting the hyper-parameters in the Bert network model; S32: calculating the accuracy and the recall rate of each enterprise category on the test set, and evaluating the Bert network model; S33: if the accuracy and the recall rate meet the preset service standard, deploying the Bert network model meeting the preset service standard; S34: if the accuracy and the recall rate do not meet the preset service standard, screening out samples misjudged by the model, re-labeling them after correction, adding them to the training set, and returning to the step S32.
According to an embodiment of the invention, the hyper-parameters comprise batch size, learning rate, maximum length of input text.
According to an embodiment of the present invention, in the step S3, the classification output by the classifier is a second-level classification of national economic classifications of listed enterprises published in China.
According to an embodiment of the present invention, the step S1 further includes: obtaining multi-dimensional data of an internet enterprise from an internet enterprise information base, and marking the total data in the internet enterprise information base; and said step S3 further comprises: the Softmax classifier outputs classification data for each Internet enterprise and outputs a confidence level of the classification data.
According to an embodiment of the present invention, in the step S2, ensemble learning is performed on the Bert network model by using an ensemble learning strategy, wherein in the ensemble learning, a Bagging algorithm is used to obtain the classification labels of the enterprise data.
According to one embodiment of the invention, in the Bagging algorithm, T sample sets are generated from a data set containing m enterprise samples by bootstrap random sampling, and T base learners are trained independently on the respective sample sets, wherein T < m.
According to an embodiment of the invention, in the Bagging algorithm, different training sets obtained by different samplings are used to train models, yielding homogeneous weak classifiers, and when the same sample is tested, the different prediction results output by the multiple Bert-based network models are voted on to obtain the final classification prediction result.
A second aspect of the present invention provides an apparatus for Internet enterprise classification, characterized in that the apparatus comprises a memory and a processor; the memory is used for storing a computer program; and the processor is configured to, when executing the computer program, implement the method of classifying an Internet enterprise according to the above.
A third aspect of the present invention provides an internet enterprise categorization system, comprising: a data acquisition and pre-processing module configured to: enterprise data are collected and preprocessed to form a long text of the enterprise data; a training and testing module of a classification model configured to: dividing the enterprise data into a training set and a test set, training an enterprise classified Bert network model according to the training set, and evaluating the classification effect of the enterprise Bert network model according to the test set; an iteration and boosting module of the classification model configured to: marking all enterprise data, outputting the confidence information of the enterprises while outputting the enterprise categories by using a Bert network model classifier, re-marking the enterprise data with the confidence lower than a preset threshold value, and adding a training set to train the Bert network model again.
According to an embodiment of the invention, the iteration and promotion module of the classification model learns the Bert network model by using an ensemble learning strategy, and obtains the classification labels of the enterprise data by adopting a Bagging algorithm.
The scheme of the invention can acquire information of four dimensions of an Internet enterprise: the enterprise profile, the business scope, the main product and service information, and the enterprise name. The multi-dimensional information is preprocessed: the enterprise name, business scope, product information and enterprise profile are spliced together, and then special characters such as web page tags are removed. The preprocessed text is input into a multi-layer deep neural network based on the Transformer architecture for automatic feature-combination learning, and the internal text features learned by the network are input into a SoftMax classifier to classify Internet enterprises by industry. Compared with enterprise classification algorithms based on matching or machine learning, the invention can greatly improve the accuracy of industry classification of Internet enterprises. In addition, the invention can rapidly identify the multi-dimensional information of massive enterprises without manual intervention. Based on a large-corpus pre-training model and downstream-task fine-tuning, the method can be flexibly applied to the rapid classification of massive enterprises in different scenarios.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram of a Bert network model according to an exemplary embodiment of the present invention.
Fig. 2 is a schematic diagram of an Embedding layer according to an exemplary embodiment of the present invention.
FIG. 3 is a diagram of a Multi-Head Self-Attention mechanism (Multi-Head Self-Attention) in the Bert network model according to an exemplary embodiment of the invention.
Fig. 4 is a flowchart of a method of internet enterprise categorization in accordance with an exemplary embodiment of the present invention.
FIG. 5 is a graph of a Bert network classification model input output structure according to an exemplary embodiment of the present invention.
Fig. 6 is a block diagram of a framework of an internet enterprise categorization system in accordance with an exemplary embodiment of the present invention.
Fig. 7 is a schematic diagram illustrating a self-sampling method in the Bagging algorithm according to an exemplary embodiment of the present invention.
Fig. 8 illustrates a schematic diagram of a Bagging algorithm according to an exemplary embodiment of the present invention.
FIG. 9 illustrates an iterative flow diagram of model training in an Internet enterprise classification system in accordance with an exemplary embodiment of the present invention.
Fig. 10 illustrates a block diagram of an apparatus for internet enterprise categorization according to an exemplary embodiment of the present invention.
DETAILED DESCRIPTION OF EMBODIMENT (S) OF INVENTION
As used herein, the terms "first," "second," and the like may be used to describe elements of exemplary embodiments of the invention. These terms are only used to distinguish one element from another element, and the inherent features or order of the corresponding elements, etc. are not limited by the terms. Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their context in the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Those skilled in the art will understand that the devices and methods of the present invention described herein and illustrated in the accompanying drawings are non-limiting exemplary embodiments and that the scope of the present invention is defined solely by the claims. Features illustrated or described in connection with one exemplary embodiment may be combined with features of other embodiments. Such modifications and variations are intended to be included within the scope of the present invention.
Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the drawings, a detailed description of related known functions or configurations is omitted to avoid unnecessarily obscuring the technical points of the present invention. In addition, the same reference numerals refer to the same circuits, modules or units throughout the description, and repeated descriptions of the same circuits, modules or units are omitted for brevity.
Further, it should be understood that one or more of the following methods or aspects thereof may be performed by at least one control unit or controller. The terms "control unit," "controller," "control module," or "master module" may refer to a hardware device that includes a memory and a processor. The memory or computer-readable storage medium is configured to store program instructions, while the processor is specifically configured to execute the program instructions to perform one or more processes that will be further described below. Moreover, it is to be appreciated that the following methods may be performed by including a processor in conjunction with one or more other components, as will be appreciated by one of ordinary skill in the art.
The invention provides an Internet enterprise classification method and device based on multi-dimensional enterprise information. Through a cleaning and preprocessing process for Internet enterprise text information, the text information is spliced and unnecessary special characters, web page tags and the like are removed. The invention builds a multi-layer deep neural network classification model on multi-dimensional Internet enterprise information, combining the multi-dimensional text information of an enterprise into one long text that is used as training data to train the enterprise classification model. The invention further uses an active-learning-based algorithm to improve classification accuracy: labels and confidence scores are predicted on the data set, and data with lower confidence are re-labeled and used for retraining, which improves the classification accuracy of the model.
An Internet enterprise refers to an enterprise that registers a domain name on the Internet, establishes a website, and conducts various business activities using the Internet. With the rise of the Internet, enterprises built around Internet concepts have appeared. Based on related Internet services, such enterprises engage in electronic commerce, artificial intelligence and Internet finance, forming various types of Internet platform economy. Classifying Internet enterprises and scientifically and effectively defining enterprise industry categories helps clarify the responsibility of enterprise entities, prevent the disorderly expansion of capital and industry monopoly, enforce Internet platform policy accurately, and promote the healthy and orderly development of the Internet economy.
The information related to an Internet enterprise includes the enterprise name, enterprise profile, enterprise product information, business scope, and the like. The Internet enterprise name refers to the full contents of the enterprise name field on the business license issued to the enterprise after approval and registration by the administration for industry and commerce; the enterprise profile is a written introduction of the basic situation and business strategy of the enterprise to the public; the product information mainly comprises the product structure and the information, data and knowledge related to the product; the enterprise business scope is the business content approved by the business registration department.
With the increase of internet enterprises, the boundaries between the internet enterprises are blurred, and accurate industry classification of the internet enterprises becomes more and more difficult. Since the business names of many internet enterprises have a great difference from the actual business activities, it is impossible to accurately classify the internet enterprises by the business names alone. Most of the existing enterprise classification techniques are based on rule matching methods, similarity is calculated after keyword matching is carried out on enterprise names and industry categories in an internet industry classification dictionary, and enterprise types of enterprises to be classified are recommended. The subsequent improvement is that a machine learning algorithm is adopted, firstly, the enterprise name and the multi-dimensional information of the enterprise are segmented, after a keyword is extracted, a word vector technology is used for carrying out feature extraction on the segmentation result, and then a classifier is trained to realize classification of enterprise categories. The machine learning-based algorithm is a Pipeline-based multi-stage classification algorithm, the effect of each stage influences the algorithm of the next stage, and the prediction error generated by the model of the previous stage is propagated to the model prediction of the next stage. In addition, in the feature extraction stage, the type of the feature to be extracted needs to be designed manually, and the selection and combination of the features have higher requirements on the feature engineering technology.
The neural network-based algorithm is an end-to-end training and predicting process, and no combination of multiple stages exists. Aiming at the input data, the neural network algorithm can automatically perform feature extraction and feature selection on the data, automatically select the optimal feature combination and automatically learn the corresponding relation between the features and the real label. The deep learning algorithm avoids complicated feature engineering, and has higher accuracy compared with an algorithm based on machine learning.
According to one or more embodiments of the invention, the invention is based on a large-scale corpus pre-training technique of the Transformer encoder-decoder model architecture. Compared with traditional pre-training techniques, the Transformer model adopted by the invention is trained on massive Internet and Wikipedia corpora, with two pre-training tasks added: Next Sentence Prediction and Masked Token Prediction.
According to one or more embodiments of the invention, the Transformer model drives progress on downstream tasks, and the Bert network model based on the Transformer Encoder structure used in the invention achieves the best results on 14 natural language processing tasks, far exceeding all previous machine learning or deep neural network models represented by recurrent neural networks (RNN/LSTM). The GPT-1, GPT-2 and GPT-3 models, improved by OpenAI on the basis of the Transformer Decoder structure, have likewise been verified to achieve remarkable results.
According to one or more embodiments of the invention, the Transformer network model used in the invention is mainly composed of a multi-head attention mechanism, a feed-forward neural network, and masked multi-head attention. The input text is processed character by character, and with the addition of position coding and type coding, latent semantic information in the text can be learned from a massive corpus. The Transformer structure computes attention between every pair of input tokens, so it extracts features well from long text and captures the mutual information between central words and distant words. Deep neural networks with the Transformer structure are widely applied in the field of natural language processing and also achieve good results in fields such as computer vision and speech recognition. Therefore, the Transformer-based deep neural network used in the invention gradually replaces traditional machine learning and the previous generation of deep neural network architectures represented by RNN/CNN.
According to one or more embodiments of the invention, the industry classification of the internet enterprises is a difficult process, and the accuracy of judging only from the enterprise name or the enterprise operation range is low. When the business range of an enterprise is different from the actual business, and the name of the enterprise is different from the actual business, the traditional classification algorithm of enterprise classification based on matching keywords cannot accurately classify the enterprise, and manual classification by manual intervention has high subjectivity. Enterprise classification using multi-dimensional information of the enterprise is a commonly adopted improvement. However, the length of the enterprise information text is multiplied by combining the multi-dimensional enterprise information, the traditional classification algorithm based on matching cannot effectively utilize all information, and meanwhile, the calculation complexity is greatly increased. The deep learning network structure based on the Transformer can effectively extract the characteristics of long-text enterprise information, deeply and semantically fuse the enterprise name, the enterprise brief introduction, the main product service and the operation range of the enterprise, effectively utilize the multi-dimensional enterprise information of the enterprise and accurately classify the industry of the Internet enterprise.
According to one or more embodiments of the present invention, the Internet enterprise classification scheme of the present invention employs a Bert network model based on the Transformer encoder. Bert stands for Bidirectional Encoder Representations from Transformers and is a pre-trained language representation model built on the Transformer encoder. Instead of the conventional unidirectional language models or the shallow concatenation of two unidirectional language models used in the past, it uses a Masked Language Model (MLM) so that deep bidirectional language representations can be generated. The approach is to train the Bert model on a large-scale unlabeled corpus to obtain text representations containing rich semantic information, namely the semantic representation of the text, then fine-tune this semantic representation on a specific NLP (natural language processing) task, and finally apply it to that NLP task.
FIG. 1 is a schematic diagram of a Bert network model according to an exemplary embodiment of the present invention.
As shown in fig. 1, an input text (which may be a long text) is first converted into a vector representation, then fed into a Bert network model composed of multiple Transformer layers; the processing result is output and passed to a Softmax classifier.
According to one or more embodiments of the invention, the method pre-trains the Transformer-based Bert model on a massive corpus and learns the inherent semantic expression capability of the language. When training the downstream classification task of the Bert model, an auxiliary classification special mark symbol [CLS] is added in front of the input text, and the model automatically learns the features of this special mark; at test time, the feature vector at the position of the classification mark [CLS] is extracted from the output and input to a SoftMax classifier to classify the enterprise.
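As a non-limiting illustration of this step, the following sketch shows how the [CLS] feature vector of a Bert encoder could be extracted and passed through a Softmax classifier, assuming the Hugging Face transformers and PyTorch APIs; the model name, the 12-class head and the placeholder text are assumptions, not values fixed by the patent.

```python
# A minimal sketch, assuming the Hugging Face transformers / PyTorch APIs;
# the model name, 12-class head and placeholder text are illustrative.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")
classifier_head = torch.nn.Linear(encoder.config.hidden_size, 12)  # assumed 12 industry classes

text = "企业名称[SEP]主营产品[SEP]企业简介[SEP]经营范围"   # concatenated long text (placeholder)
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs)
# The [CLS] feature vector is the hidden state at position 0 of the sequence.
cls_vector = outputs.last_hidden_state[:, 0, :]
probs = torch.softmax(classifier_head(cls_vector), dim=-1)   # Softmax classification
predicted_class = probs.argmax(dim=-1)
print(predicted_class)
```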
Fig. 2 is a schematic diagram of an Embedding layer according to an exemplary embodiment of the present invention.
As shown in fig. 2, according to one or more embodiments of the present invention, the specific processing of the Bert network model input is as follows: after the input text is segmented into characters, the word table is looked up to find the serial number corresponding to each character in the dictionary, which serves as the character's label and is called the text Token of the character. In addition, the position code Token and the text type Token of the character are added; the three Tokens are summed position-wise and input into the Embedding layer to obtain 768 × 512-dimensional vectors, which are then input into the multi-layer self-attention layers for feature learning. The position Token represents the position information of each character, which helps the model learn the information of the character in the context of the text. The type Token indicates whether the corresponding character belongs to the preceding sentence or the following sentence of the text.
In accordance with one or more embodiments of the present invention, the input to Bert is a linear sequence: two sentences are divided by a separator, and two identification symbols are added at the very front and at the end. According to another embodiment of the invention, each word has three embeddings: the position embedding, because word order is an important feature in natural language processing and the position information needs to be encoded; the word embedding itself; and the sentence embedding, because the training data mentioned above consist of two sentences, and each sentence has a whole-sentence embedding item corresponding to each of its words. The three embeddings corresponding to each word are superposed to form the input of Bert.
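The following is an illustrative sketch of the three summed embeddings described above, written against the PyTorch API; the vocabulary size, maximum length and hidden size are assumptions chosen for demonstration, not values dictated by the patent.

```python
# Illustrative sketch of summing word, position and sentence-type embeddings;
# vocab size, max length and hidden size are assumptions, not patent values.
import torch
import torch.nn as nn

class BertLikeEmbedding(nn.Module):
    def __init__(self, vocab_size=21128, max_len=512, hidden=768, n_types=2):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)   # character/word Token
        self.pos_emb = nn.Embedding(max_len, hidden)        # position Token
        self.type_emb = nn.Embedding(n_types, hidden)       # sentence/type Token
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, type_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        positions = positions.unsqueeze(0).expand_as(token_ids)
        # The three embeddings are superposed (summed) to form the Bert input.
        x = self.word_emb(token_ids) + self.pos_emb(positions) + self.type_emb(type_ids)
        return self.norm(x)

emb = BertLikeEmbedding()
token_ids = torch.randint(0, 21128, (1, 16))
type_ids = torch.zeros(1, 16, dtype=torch.long)
print(emb(token_ids, type_ids).shape)   # (1, 16, 768)
```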
FIG. 3 is a diagram of a Multi-Head Self-Attention mechanism (Multi-Head Self-Attention) in the Bert network model according to an exemplary embodiment of the invention.
As shown in fig. 3, in order to enhance the diversity of the Attention, the present invention further uses different Self-Attention modules to obtain enhanced semantic vectors of each word in the text in different semantic spaces, and linearly combines a plurality of enhanced semantic vectors of each word, thereby obtaining a final enhanced semantic vector with the same length as the original word vector.
According to one or more embodiments of the present invention, the multi-head self-attention mechanism plays the most important role in the Bert network model of the present invention. Because of long-range gradient vanishing in the conventional RNN mechanism, it is difficult for an RNN to convert an input sequence into a fixed-length vector while preserving enough effective information for longer sentences, so the effect of models built on an RNN drops significantly as sentence length increases. Attention was introduced to address this bottleneck of information loss caused by converting a long sequence into a fixed-length vector. The Attention mechanism is similar to the way humans translate articles, i.e., focusing attention on the context corresponding to the part being translated. Similarly, in the Attention model, when translating the current word, the corresponding words in the source sentence are found and the translation is made in conjunction with the parts already translated. The invention adopts the multi-head self-attention mechanism to handle the very long texts produced by combining the related information of an enterprise; it can better learn the latent relations among the multi-dimensional enterprise information and truly mine key information useful for enterprise industry classification.
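A minimal self-attention sketch follows, using PyTorch's nn.MultiheadAttention; the head count (12) and hidden size (768) follow common Bert-base settings and are assumptions here rather than requirements of the invention.

```python
# Minimal multi-head self-attention sketch (PyTorch); head count and hidden
# size follow common Bert-base settings and are assumptions here.
import torch
import torch.nn as nn

hidden, heads, seq_len = 768, 12, 512
self_attention = nn.MultiheadAttention(embed_dim=hidden, num_heads=heads, batch_first=True)

x = torch.randn(1, seq_len, hidden)    # embedded long-text sequence
# Self-attention: queries, keys and values all come from the same sequence,
# so every token attends to every other token, including distant ones.
attended, weights = self_attention(x, x, x)
print(attended.shape, weights.shape)   # (1, 512, 768), (1, 512, 512)
```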
Fig. 4 is a flowchart of a method of internet enterprise categorization in accordance with an exemplary embodiment of the present invention.
As shown in fig. 4, wherein in step S1: the method comprises the steps of obtaining multi-dimensional data of an internet enterprise, and preprocessing the multi-dimensional data to generate long text data;
in step S2: inputting the long text data into a Bert network model for processing;
in step S3: and sending the processed data into a Softmax classifier to classify the internet enterprises.
Wherein the multi-dimensional enterprise data include: data related to the enterprise name, main products and businesses, enterprise profile and business scope, and the preprocessing concatenates the data related to the enterprise name, main products and businesses, enterprise profile and business scope, and then carries out text cleaning.
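An illustrative preprocessing sketch of this splicing and cleaning step is given below; the field names, the cleaning rules (removing web page tags and collapsing whitespace) and the example record are assumptions used only for demonstration.

```python
# Illustrative preprocessing sketch; field names, cleaning rules and the
# example record are assumptions for demonstration.
import re

def build_long_text(enterprise: dict) -> str:
    parts = [
        enterprise.get("name", ""),
        enterprise.get("main_products", ""),
        enterprise.get("profile", ""),
        enterprise.get("business_scope", ""),
    ]
    text = "[SEP]".join(p for p in parts if p)   # splice the four dimensions
    text = re.sub(r"<[^>]+>", " ", text)          # remove web page tags
    text = re.sub(r"\s+", " ", text)              # collapse whitespace
    return text.strip()

record = {"name": "某网络科技有限公司", "profile": "<p>从事互联网平台服务</p>",
          "main_products": "云计算服务", "business_scope": "技术开发、技术服务"}
print(build_long_text(record))
```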
In the data preprocessing of step S1, the multi-dimensional data of Internet enterprises are acquired from an Internet enterprise information base, and the full data in the Internet enterprise information base are labeled; in the data classification of step S3, the Softmax classifier outputs the classification data of each Internet enterprise according to the label information and outputs a confidence for the classification data. The method adopts an ensemble learning strategy to perform ensemble learning on the Bert network model, and adopts the Bagging algorithm to obtain the classification labels of the data.
FIG. 5 is a graph of a Bert network classification model input output structure according to an exemplary embodiment of the present invention.
As shown in fig. 5, wherein:
[CLS]: the auxiliary classification special mark; E_[CLS] is the embedded representation vector obtained after the [CLS] special mark passes through the Embedding layer.
E_1 - E_N: the embedded representation vectors obtained after the Tokens looked up from the word table for the character-segmented enterprise name, summed with the position Tokens and type Tokens, pass through the Embedding layer.
E_[SEP]: the special separator between the enterprise name and the enterprise profile or product information.
E'_1 - E'_M: the embedded representation vectors obtained after the Tokens looked up from the word table for the character-segmented enterprise profile or product information, summed with the position Tokens and type Tokens, pass through the Embedding layer.
Output symbol interpretation:
C: the vector representation obtained from the [CLS] auxiliary classification special mark after passing through the multi-layer self-attention network.
T_1 - T_N: the final vector representations obtained after the Embedding vectors of the enterprise name pass through the multi-layer self-attention network.
T'_1 - T'_M: the final vector representations obtained after the Embedding vectors of the enterprise profile pass through the multi-layer self-attention network.
According to one or more embodiments of the invention, for the input of the Bert network model, the scheme of the invention splices the enterprise name, the enterprise profile, the product information and the business scope into one long text, divided in the middle by the special separator [SEP]. An auxiliary classification special mark symbol [CLS] is spliced at the front of the input text for auxiliary classification. During feature learning of the classification model, the [CLS] special character interacts with the other characters in the text, and the correspondence between the internal features of the text and the text labels is learned. Here the Bert-base-chinese Chinese pre-training model is used. The model has 12 hidden layers in total, the input and output dimensions of the hidden layers are 768, the number of heads in the multi-head attention mechanism is 12, and the maximum input text length is 512. Over-long Internet enterprise information texts are truncated, and only the first 512 characters are kept. In fig. 5, the final enterprise category is derived from the vector C.
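A hedged sketch of this input construction (splicing with [SEP], automatic prepending of [CLS], truncation to 512) is shown below, assuming the Hugging Face BertTokenizer; the example field values are placeholders and not taken from the patent.

```python
# Sketch of the input construction, assuming the Hugging Face tokenizer API;
# the example field values are placeholders.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

name, profile, products, scope = "某网络科技有限公司", "企业简介…", "主营产品…", "经营范围…"
long_text = "[SEP]".join([name, profile, products, scope])

encoded = tokenizer(
    long_text,
    truncation=True,     # keep only the first 512 tokens for over-long texts
    max_length=512,
    return_tensors="pt",
)
# The tokenizer prepends the auxiliary [CLS] mark automatically.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0][:5].tolist()))
```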
Fig. 6 is a block diagram of a framework of an internet enterprise categorization system in accordance with an exemplary embodiment of the present invention.
As shown in FIG. 6, the overall process of the Internet Enterprise Classification System includes data acquisition and preprocessing; the training and testing of the classification model and the iterative promotion of the classification model are respectively realized by different units or modules.
According to one or more embodiments of the invention, the data acquisition and preprocessing module is used for acquiring or acquiring multidimensional data of an enterprise, such as the name, the brief introduction, the product, the service and the operation range of the enterprise, splicing and preprocessing the information, and filtering useless texts to form long texts of the enterprise information.
According to one or more embodiments of the invention, the training and testing module of the classification model comprises: and dividing the sorted enterprise information data set into a training set and a testing set, and training and testing according to the Bert network model shown in fig. 3 and 5 to obtain the Bert network model with optimal parameters. Wherein the training and testing of the Bert network model comprises:
(1) And dividing the sorted data set into a training set and a testing set according to the proportion of 80% to 20%, and training the enterprise classification model on the training set.
According to one or more embodiments of the invention, the training process of the deep learning model requires that the data set is proportionally and randomly divided into a training set and a testing set. In order to achieve the optimal effect on the test set, the training of the model and the adjustment of the hyper-parameters need to be performed on the training set. The adjustment of the hyper-parameters involves the setting of the learning rate and the setting of the size of the BatchSize, which parameters influence the final achieved effect of the model. For example, after many experiments, the invention determines that the size of BatchSize is 6, the learning rate is 1e-4, and the maximum length of the input text is 512, so that the model can achieve the optimal effect on the test set. The setting of the number of structural layers and the size of a hidden layer of the Transformer is fixed, special adjustment is not needed, and the only thing needing to be changed is the number of output categories of the last-layer classifier of the Bert network. The invention adopts the second-level classification of national economic classification in listed enterprises published in China, and the total number of the classifications is 12.
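The following is a hedged configuration sketch reflecting the 80%/20% split, batch size 6, learning rate 1e-4, maximum text length 512 and 12 output categories described above; the scikit-learn and transformers calls and the placeholder data are assumptions, not part of the patent.

```python
# Hedged training-configuration sketch; the split ratio, batch size, learning
# rate, max length and 12 classes follow the text, library calls are assumptions.
from sklearn.model_selection import train_test_split
from transformers import BertForSequenceClassification

texts = ["示例企业长文本 ..."] * 10            # placeholder long texts
labels = [i % 12 for i in range(10)]            # placeholder category ids (12 classes)

train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=12)
hyper_params = {"batch_size": 6, "learning_rate": 1e-4, "max_length": 512}
```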
(2) And testing the classification effect of the classification model on the test set, and evaluating the effect of the model.
To verify the robustness of the model, the accuracy and recall of the trained model needs to be tested on the test set. The deep learning model is sensitive to data, and the trained model does not necessarily have the same test effect on other data sets when the trained model has a good effect on a certain data set. In order to ensure that the model has better generalization capability as much as possible, the hyper-parameters of the model need to be adjusted by testing the effect on different test sets, so as to achieve the optimal effect. For example, through testing, the trained model achieves good effects on three different test sets.
According to one or more embodiments of the invention, the accuracy of a class is calculated according to the following formula:
acc=TP/(TP+FP)
wherein TP represents the number of samples of the class correctly predicted in the test set, and FP represents the number of samples incorrectly predicted as the class in the test set.
Wherein, the recall rate of a certain category is calculated according to the following formula:
recall=TP/(TP+FN)
TP represents the number of samples of the class correctly predicted in the test set, and FN represents the number of samples of the class incorrectly predicted as other classes. Table 1 shows the test effect of the classification model on three different test sets; a sketch of this computation is given after Table 1.
TABLE 1
Test set | Data volume | Accuracy (acc) | Recall rate (recall)
Test set 1 | 1200 entries / 12 classes | 92.4% | 90.1%
Test set 2 | 500 entries / 12 classes | 93.8% | 89.0%
Test set 3 | 700 entries / 12 classes | 90.2% | 92.3%
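A minimal sketch of the per-class accuracy and recall computation defined by the formulas above follows; the label lists are illustrative only.

```python
# Minimal sketch of per-class accuracy (acc = TP / (TP + FP)) and
# recall (recall = TP / (TP + FN)); label lists are illustrative.
def per_class_metrics(y_true, y_pred, target):
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == target and t == target)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == target and t != target)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != target and t == target)
    acc = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return acc, recall

y_true = [0, 0, 1, 2, 1, 0]
y_pred = [0, 1, 1, 2, 0, 0]
print(per_class_metrics(y_true, y_pred, target=0))
```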
According to one or more embodiments of the invention, the Internet enterprise classification system further comprises an iteration and promotion module of the classification model, whose specific functions are as follows: label the full data in the Internet enterprise information base, output the category of each Internet enterprise, and have the classifier output the confidence of each piece of data; re-label the data whose confidence is lower than a minimum threshold and add them to the training set to train the classification model again; repeat this iterative process to train the model until the required classification accuracy is reached.
Specifically, the iteration and lifting module of the classification model comprises:
(1) Only partial data in the database is used in the training model, so in order to verify the robustness of the model, marking is carried out on the total data in the internet enterprise information base, the category of each internet enterprise is output, and the classifier outputs the confidence coefficient of each piece of data.
The trained model only obtains good effect on a data set within a certain range, but the effect on mass data is uncontrollable. In order to ensure that the model can achieve a good effect on massive data, marking is carried out on a full database, and the output result of the SoftMax classifier is used as the confidence coefficient of each piece of data.
(2) A preset minimum threshold is set, data whose confidence is lower than the minimum threshold are re-labeled, and the training set is extended to train the classification model again. For data with higher output confidence, the model is sufficiently confident that the data belong to the predicted category, but for data with lower output confidence, the model is clearly not confident in its prediction. Such data are re-labeled (e.g., manually) and then added to the training set for re-training of the model, as illustrated in the sketch after this list.
(3) Based on the idea of active learning, the process training model is iterated repeatedly to achieve the required classification accuracy.
(4) And adopting an ensemble learning strategy to carry out ensemble learning on the model, and adopting a Bagging method to obtain a classification label of the data.
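The following sketch illustrates the confidence-based re-labeling of step (2) above; the threshold value 0.8, the predict_with_confidence helper and the dummy predictor are hypothetical and not part of the patent.

```python
# Sketch of confidence-based re-labeling; the threshold and the
# predict_with_confidence helper are hypothetical.
CONFIDENCE_THRESHOLD = 0.8   # assumed preset minimum threshold

def select_low_confidence(samples, predict_with_confidence):
    """Return samples whose Softmax confidence falls below the threshold."""
    to_relabel = []
    for sample in samples:
        label, confidence = predict_with_confidence(sample)   # Softmax max probability
        if confidence < CONFIDENCE_THRESHOLD:
            to_relabel.append((sample, label))
    return to_relabel

# Dummy predictor standing in for the trained Bert classifier:
fake_predict = lambda s: (3, 0.65 if "平台" in s else 0.95)
print(select_low_confidence(["互联网平台企业简介", "软件开发企业简介"], fake_predict))
```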
In accordance with one or more embodiments of the present invention, ensemble learning is a common paradigm in machine learning algorithms. In ensemble learning, the scheme of the invention trains a plurality of Internet enterprise classification models with weak classification capability to solve the same problem, and combines the output results of the Internet enterprise classification models to achieve a better effect. The invention trains a plurality of basic version models improved based on Bert, and finally, the basic version models are subjected to ensemble learning. The most important assumptions of the present invention in ensemble learning are: when different weak classifiers are combined, a more accurate or robust model can be obtained than with a single weak classifier.
According to one or more embodiments of the invention, the base models to be aggregated are selected when building the ensemble, and in most cases the respective base models need to be trained separately to obtain homogeneous weak classifiers trained in different ways. The weak classifiers then need to be combined using an appropriate strategy. Common meta-algorithms for combining classifiers are: Bagging, which considers homogeneous weak classifiers learned in parallel and independently of each other, and combines them according to some deterministic averaging process; Boosting, which considers homogeneous weak classifiers learned sequentially in a highly adaptive way and combined according to some deterministic strategy; and Stacking, which considers heterogeneous weak classifiers, learns them in parallel, and combines them by training a meta-model that outputs a final prediction based on the predictions of the different weak models.
According to one or more embodiments of the invention, the invention adopts Bagging to obtain an integrated model which is more robust than a single model.
The basic process of the self-sampling method in Bagging is shown in fig. 7.
Fig. 7 is a schematic diagram illustrating a self-sampling method in the Bagging algorithm according to an exemplary embodiment of the present invention.
As shown in fig. 7, the self-sampling method is defined as: given a data set containing m samples, randomly taking one sample and putting the sample into a sampling set, and then putting the sample back into an initial data set, so that the sample is still possibly selected in the next sampling.
As shown in fig. 7, in Bagging, T sample sets are generated from a data set containing m samples by bootstrap sampling (sampling with replacement), wherein each sample set contains about 63.2% of the samples of the original data set, and T base learners are then trained independently on the respective sample sets.
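A small sketch of this bootstrap (sampling-with-replacement) procedure is given below; the data set size and the number of sample sets are illustrative values, not taken from the patent.

```python
# Bootstrap (sampling with replacement) sketch; m and T are illustrative.
import random

def bootstrap_sample(dataset, rng):
    return [rng.choice(dataset) for _ in range(len(dataset))]

rng = random.Random(0)
data = list(range(1000))                                        # m = 1000 enterprise samples
sample_sets = [bootstrap_sample(data, rng) for _ in range(5)]   # T = 5 sample sets

# Each bootstrap set contains roughly 63.2% of the distinct original samples.
unique_ratio = len(set(sample_sets[0])) / len(data)
print(f"unique fraction ≈ {unique_ratio:.3f}")
```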
If two classes receive the same number of votes during class prediction, the simplest method is to select one at random; the confidence of the learners' votes can further be considered to determine the final winner. When outputting predictions, a simple voting method is adopted for classification problems, and a simple averaging method is adopted for regression problems.
According to one or more embodiments of the invention, these sample sets have very good statistical properties under certain assumptions: to a good approximation, they can be regarded as drawn directly from the true underlying (and often unknown) data distribution, and independently of each other. They can therefore be considered representative and independent samples of the true data distribution, satisfying the basic independent-and-identically-distributed requirement of machine learning algorithms. Models are trained with the different data sets obtained by different samplings to obtain homogeneous weak classifiers, and when the same sample is tested, the prediction results output by the different models are voted on to obtain the final prediction result.
Fig. 8 illustrates a schematic diagram of a Bagging algorithm according to an exemplary embodiment of the present invention.
As shown in fig. 8, T sample sets are generated from a data set containing m enterprise samples by bootstrap random sampling, T base learners (weak classifiers) are trained independently on the respective sample sets, and the T trained base learners are combined into a strong classifier.
According to one or more embodiments of the present invention, according to the Bagging algorithm, the final output of the strong classifier is:
y_com(x) = (1/M) · Σ_{m=1}^{M} y_m(x)
where y_com is the strong classifier obtained by ensemble learning over the weak classifiers y_m (i.e., the multiple Bert-based base classification models), x represents an input sample, and M represents the total number of weak classifiers combined.
According to one or more embodiments of the invention, bootstrap random sampling is applied to the training set to obtain multiple subsets of the training set, and these sub-training sets are used to independently train T Internet enterprise classification models with weak classification capability. The classification accuracy of the weak classification models alone does not reach a usable state; only after the T Internet enterprise classification models are integrated through the ensemble learning mechanism is an Internet enterprise classification model with strong classification capability obtained, whose classification effect is stronger than that of any one of the T weak classification models and which reaches production availability.
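A minimal majority-voting sketch of this Bagging combination follows; the dummy weak models stand in for the T Bert-based classifiers trained on different bootstrap sample sets and are purely illustrative.

```python
# Bagging combination sketch: T weak classifiers combined by majority vote;
# the dummy weak models are illustrative stand-ins for Bert-based classifiers.
from collections import Counter

def bagging_predict(weak_models, sample):
    votes = [model(sample) for model in weak_models]   # each weak model returns a class label
    label, _ = Counter(votes).most_common(1)[0]        # simple majority vote
    return label

weak_models = [lambda s: 2, lambda s: 2, lambda s: 5]
print(bagging_predict(weak_models, "企业长文本"))       # -> 2
```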
According to one or more embodiments of the invention, in particular, the flow of model iteration comprises:
s31: dividing the sorted enterprise data set into a training set and a testing set according to a preset proportion, training the Bert network model on the training set, and adjusting the hyper-parameters in the Bert network model;
s32: calculating the accuracy and recall rate of each enterprise category on the test set, and evaluating a Bert network model;
s33: if the accuracy rate and the recall rate meet the preset service standard, deploying a Bert network model meeting the preset service standard;
s34: if the accuracy and the recall rate do not meet the preset service standard, screening out samples with errors judged by the model, adding the samples into the training set after re-labeling the corrected samples, and returning to the step S32.
FIG. 9 illustrates an iterative flow diagram of model training in an Internet enterprise classification system in accordance with an exemplary embodiment of the present invention.
As shown in fig. 9, the whole iteration flow is divided into five parts, which are:
(1) Training an internet enterprise classification model on a training set;
(2) The accuracy and recall of each category are calculated on the test set to evaluate the model.
(3) If the standard of the business is met (the accuracy rate is higher than 90%, and the recall rate is higher than 90%), a classification model is deployed in production.
(4) If the set standard is not met, screening (manually screening) samples with wrong judgment of the model, re-labeling, adding the samples to a training set after labels are corrected, and repeating the step (2).
(5) Until the model meets the set business criteria, or until all available data is exhausted.
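A hedged sketch of this iterative flow is given below; all callables (train, evaluate, relabel_errors) and the sample record fields are hypothetical placeholders, and the 90% thresholds follow the business criterion stated in part (3).

```python
# Sketch of the active-learning iteration loop of fig. 9; train, evaluate,
# relabel_errors and the record fields are hypothetical placeholders.
def iterate_until_ready(train_set, test_set, train, evaluate, relabel_errors,
                        acc_target=0.90, recall_target=0.90, max_rounds=10):
    model = None
    for _ in range(max_rounds):
        model = train(train_set)                         # (1) train on the training set
        acc, recall = evaluate(model, test_set)          # (2) per-class accuracy / recall
        if acc > acc_target and recall > recall_target:  # (3) business criterion met
            return model                                 #     deploy this classifier
        errors = [s for s in test_set
                  if model.predict(s["text"]) != s["label"]]
        train_set = train_set + relabel_errors(errors)   # (4) correct labels, extend training set
    return model                                         # (5) rounds or data exhausted
```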
Fig. 10 illustrates a block diagram of an apparatus for internet enterprise categorization according to an exemplary embodiment of the present invention.
As shown in fig. 10, the present invention further provides an apparatus for Internet enterprise classification, wherein the apparatus comprises a memory and a processor; the memory is used for storing a computer program; and the processor, when executing the computer program, implements the method for classifying Internet enterprises described above.
In the embodiments provided in the present application, it should be understood that the disclosed system, apparatus, device or method may be implemented in other ways. The above-described system embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and there may be other divisions in actual implementation, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed coupling or direct coupling or communication connection between each other may be through some communication interfaces, indirect coupling or communication connection of systems or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments provided in the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application.
In accordance with one or more embodiments of the present invention, the methods of the present invention may implement processes such as the flows in the above systems of the present invention using encoded instructions (e.g., computer and/or machine readable instructions) stored on a non-transitory computer and/or machine readable medium (e.g., hard disk drive, flash memory, read-only memory, optical disk, digital versatile disk, cache, random access memory, and/or any other storage device or storage disk) in which information is stored for any period of time (e.g., for extended periods of time, for permanent, for transient instances, for temporary caching, and/or for information caching). As used herein, the term "non-transitory computer-readable medium" is expressly defined to include any type of computer-readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media.
The drawings referred to above and the detailed description of the invention, which are exemplary of the invention, serve to explain the invention without limiting the meaning or scope of the invention as described in the claims. Thus, modifications may be readily made by those skilled in the art from the foregoing description. Further, those skilled in the art may delete some of the constituent elements described herein without deteriorating the performance, or may add other constituent elements to improve the performance. Further, the order of the steps of the methods described herein may be varied by one skilled in the art depending on the environment of the process or apparatus. Therefore, the scope of the present invention should be determined not by the embodiments described above but by the claims and their equivalents.
While the invention has been described in connection with what is presently considered to be practical embodiments, it is to be understood that the invention is not to be limited to the disclosed embodiments, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (16)

1. A method for Internet enterprise classification, the method comprising:
s1: obtaining multi-dimensional data of an internet enterprise, and preprocessing the multi-dimensional data to generate long text data;
s2: inputting the long text data into a Bert network model for processing;
s3: and sending the processed data into a classifier to classify the industry of the Internet enterprises.
2. The method of claim 1, wherein the classifier is a Softmax classifier.
3. The method of claim 1,
wherein the multi-dimensional data includes data relating to enterprise name, main products and businesses, enterprise profile, and business scope, and
the preprocessing processes the data relating to the enterprise name, main products and businesses, enterprise profile, and business scope, and then performs text cleaning.
4. The method of claim 1,
wherein in the step S2, an auxiliary classification special mark CLS is added in front of the long text data, and the Bert network model learns the feature vector of the CLS mark; and
in the step S3, the feature vector at the position corresponding to the CLS mark in the processed data is input to the classifier.
5. The method according to claim 4, wherein in the step S2, the adding of the auxiliary classification special mark CLS comprises:
S21: segmenting the long text character by character, obtaining the serial number corresponding to each character in a classification dictionary, setting that serial number as the character text Token, and setting the position code Token and the text type Token of each character in the text;
S22: summing the character text Token, the position code Token and the text type Token position by position, inputting the sum into the Embedding layer of the Bert network model, and inputting the obtained vector into the multi-layer self-attention layers of the Bert network model for feature learning.
6. The method according to claim 1, wherein step S2 further comprises iterative training of the Bert network model, the iterative training comprising:
S31: dividing the sorted enterprise data set into a training set and a test set according to a preset proportion, training the Bert network model on the training set, and adjusting the hyper-parameters of the Bert network model;
S32: calculating the accuracy rate and recall rate of each enterprise category on the test set, and evaluating the Bert network model;
S33: if the accuracy rate and the recall rate meet the preset service standard, deploying the Bert network model that meets the preset service standard;
S34: if the accuracy rate and the recall rate do not meet the preset service standard, screening out the samples misjudged by the model, correcting and re-labeling those samples, adding them to the training set, and returning to the step S32.
7. The method of claim 6, wherein the hyper-parameters comprise batch size, learning rate, and maximum length of the input text.
8. The method according to claim 1, wherein in the step S3, the classification output by the classifier is a second-level category of the national economic industry classification for listed enterprises published in China.
9. The method of claim 1 or 6,
wherein the step S1 further includes: obtaining the multi-dimensional data of the Internet enterprise from an Internet enterprise information base, and labeling the full set of data in the Internet enterprise information base; and
the step S3 further includes: the Softmax classifier outputs classification data for each Internet enterprise together with a confidence level of the classification data.
10. The method of claim 9, wherein the classification data with the confidence level less than the predetermined minimum threshold is re-labeled and added to the training set to re-train the classification model.
11. The method according to claim 6, wherein in the step S2, ensemble learning is performed on the Bert network model by using an ensemble learning strategy, wherein in the ensemble learning, a Bagging algorithm is used to obtain the classification labels of the enterprise data.
12. The method of claim 10, wherein in the Bagging algorithm, a bootstrap random sampling method is used to generate T sampled sets from a data set containing m enterprise samples, and T base learners are trained independently, one on each sampled set, where T < m.
13. The method of claim 12, wherein, in the Bagging algorithm,
the model is trained on the different training sets obtained by the different samplings to obtain homogeneous weak classifiers, and, when the same sample is tested, the plurality of different prediction results output by the Bert-network-model-based classifiers are voted on to obtain the final classification prediction result.
14. An apparatus for Internet enterprise classification, the apparatus comprising a memory and a processor, the memory being configured to store a computer program, characterized in that the processor, when executing the computer program, implements the method of Internet enterprise classification according to any one of claims 1 to 13.
15. An Internet enterprise classification system, the system comprising:
a data acquisition and preprocessing module configured to: acquire enterprise data and preprocess the enterprise data to form a long text of the enterprise data;
a classification model training and testing module configured to: divide the enterprise data into a training set and a test set, train a Bert network model for enterprise classification on the training set, and evaluate the classification effect of the Bert network model on the test set;
a classification model iteration and boosting module configured to: label all of the enterprise data, use the Bert network model classifier to output the enterprise category together with confidence information for each enterprise, re-label the enterprise data whose confidence is lower than a preset threshold value, and add the re-labeled data to the training set to train the Bert network model again.
16. The system of claim 15, wherein the classification model iteration and boosting module performs ensemble learning on the Bert network model using an ensemble learning strategy and employs a Bagging algorithm to obtain the classification labels of the enterprise data.
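Claims 1 to 5 describe concatenating the multi-dimensional enterprise fields into one long text, prepending the special CLS mark, encoding the text with a Bert network model, and feeding the feature vector at the CLS position to a Softmax classifier. The Python sketch below only illustrates that flow and is not the patented implementation: the field names, the bert-base-chinese checkpoint, the cleaning rule, and the number of categories are all assumptions.

```python
# Illustrative sketch only (assumptions: Hugging Face transformers, bert-base-chinese,
# made-up field names, and an assumed number of second-level industry categories).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

NUM_CLASSES = 90  # assumed category count
# Linear head over the CLS feature vector; untrained here, fine-tuned in practice.
classifier = torch.nn.Linear(encoder.config.hidden_size, NUM_CLASSES)

def preprocess(record: dict) -> str:
    """Concatenate multi-dimensional fields into one long text, then clean it (crudely)."""
    fields = [record.get(k, "") for k in ("name", "products", "profile", "business_scope")]
    text = "。".join(f for f in fields if f)
    return "".join(ch for ch in text if ch.isprintable())

def classify(record: dict) -> torch.Tensor:
    text = preprocess(record)
    # The tokenizer prepends [CLS] and appends [SEP]; BERT accepts at most 512 tokens.
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state      # (1, seq_len, hidden_size)
    cls_vector = hidden[:, 0, :]                          # feature vector at the [CLS] position
    return torch.softmax(classifier(cls_vector), dim=-1)  # per-category probabilities
```

Inside the encoder, the Embedding layer combines character, position and text-type information before the multi-layer self-attention stack, which plays the role of the position-wise summation described in claim 5.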
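Claims 6 and 7 describe the iterative training loop: split the labeled enterprise data by a preset proportion, fine-tune with hyper-parameters such as batch size, learning rate and maximum input length, evaluate each enterprise category against a service standard, and send misjudged samples back for correction. The sketch below approximates that loop under assumptions: train_bert(), predict() and relabel() are hypothetical helpers, the 0.2 split, the hyper-parameter values and the 0.90/0.85 targets are invented, and scikit-learn's per-class precision and recall stand in for the per-category metrics.

```python
# Sketch of the claim 6 iteration; all concrete values and helper functions are assumptions.
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support

HYPERPARAMS = {"batch_size": 32, "learning_rate": 2e-5, "max_length": 512}  # assumed values
PRECISION_TARGET, RECALL_TARGET = 0.90, 0.85                                # assumed service standard

def iterate(texts, labels, train_bert, predict, relabel):
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=42)
    while True:
        model = train_bert(x_train, y_train, **HYPERPARAMS)   # hypothetical fine-tuning helper
        y_pred = predict(model, x_test)                       # hypothetical inference helper
        prec, rec, _, _ = precision_recall_fscore_support(
            y_test, y_pred, average=None, zero_division=0)    # one value per enterprise category
        if prec.min() >= PRECISION_TARGET and rec.min() >= RECALL_TARGET:
            return model                                      # meets the standard: deploy
        # Screen out misjudged samples, have them corrected and re-labeled,
        # then add them to the training set and train again.
        wrong = [(x, t) for x, p, t in zip(x_test, y_pred, y_test) if p != t]
        corrected = relabel(wrong)                            # hypothetical manual re-labeling step
        x_train = list(x_train) + [x for x, _ in corrected]
        y_train = list(y_train) + [y for _, y in corrected]
```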
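Claims 9, 10 and 15 add a confidence-driven iteration: the Softmax classifier labels the full enterprise information base and reports a confidence for every prediction, and records below a minimum threshold are re-labeled manually and returned to the training set. A minimal sketch of that filtering step, with an assumed threshold of 0.6 and reusing the hypothetical classify() helper from the first sketch:

```python
# Sketch of confidence-based filtering; the threshold and classify() helper are assumptions.
import torch

CONFIDENCE_THRESHOLD = 0.6  # assumed minimum confidence

def label_full_database(records, classify):
    accepted, to_relabel = [], []
    for record in records:
        probs = classify(record)                      # shape (1, NUM_CLASSES) softmax output
        confidence, category = torch.max(probs, dim=-1)
        if confidence.item() >= CONFIDENCE_THRESHOLD:
            accepted.append((record, category.item()))
        else:
            to_relabel.append(record)                 # queue for manual re-labeling and retraining
    return accepted, to_relabel
```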
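Claims 11 to 13 and 16 describe a Bagging ensemble: T training sets drawn by bootstrap sampling from m enterprise samples, T homogeneous Bert-based weak classifiers trained independently, and majority voting over their predictions for the same sample. The sketch below shows only the sampling and voting mechanics; T=5 and the train_bert()/predict_one() helpers are assumptions.

```python
# Sketch of Bagging with bootstrap sampling and majority voting; helpers are hypothetical.
import random
from collections import Counter

def bagging_train(texts, labels, train_bert, T=5):
    """Train T homogeneous weak classifiers, each on a bootstrap sample of the m records."""
    m = len(texts)
    models = []
    for _ in range(T):
        idx = [random.randrange(m) for _ in range(m)]   # sample m records with replacement
        models.append(train_bert([texts[i] for i in idx], [labels[i] for i in idx]))
    return models

def bagging_predict(models, predict_one, text):
    """Vote the T individual predictions for one enterprise text into a final label."""
    votes = [predict_one(model, text) for model in models]
    return Counter(votes).most_common(1)[0][0]          # majority-voted classification result
```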
CN202211690035.3A 2022-12-27 2022-12-27 Internet enterprise classification method, device and system Pending CN115827871A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211690035.3A CN115827871A (en) 2022-12-27 2022-12-27 Internet enterprise classification method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211690035.3A CN115827871A (en) 2022-12-27 2022-12-27 Internet enterprise classification method, device and system

Publications (1)

Publication Number Publication Date
CN115827871A true CN115827871A (en) 2023-03-21

Family

ID=85518713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211690035.3A Pending CN115827871A (en) 2022-12-27 2022-12-27 Internet enterprise classification method, device and system

Country Status (1)

Country Link
CN (1) CN115827871A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116304058A (en) * 2023-04-27 2023-06-23 云账户技术(天津)有限公司 Method and device for identifying negative information of enterprise, electronic equipment and storage medium
CN116304058B (en) * 2023-04-27 2023-08-08 云账户技术(天津)有限公司 Method and device for identifying negative information of enterprise, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109783818B (en) Enterprise industry classification method
CN110020438B (en) Sequence identification based enterprise or organization Chinese name entity disambiguation method and device
CN112732934B (en) Power grid equipment word segmentation dictionary and fault case library construction method
CN113591483A (en) Document-level event argument extraction method based on sequence labeling
CN110532386A (en) Text sentiment classification method, device, electronic equipment and storage medium
Kaur Incorporating sentimental analysis into development of a hybrid classification model: A comprehensive study
CN113420145B (en) Semi-supervised learning-based bid-bidding text classification method and system
CN114896388A (en) Hierarchical multi-label text classification method based on mixed attention
CN109446423B (en) System and method for judging sentiment of news and texts
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN114491024B (en) Specific field multi-label text classification method based on small sample
CN113434688B (en) Data processing method and device for public opinion classification model training
CN111191442A (en) Similar problem generation method, device, equipment and medium
CN113806547B (en) Deep learning multi-label text classification method based on graph model
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN113515632A (en) Text classification method based on graph path knowledge extraction
CN114564563A (en) End-to-end entity relationship joint extraction method and system based on relationship decomposition
CN115203507A (en) Event extraction method based on pre-training model and oriented to document field
CN114676346A (en) News event processing method and device, computer equipment and storage medium
CN115827871A (en) Internet enterprise classification method, device and system
CN114201583A (en) Chinese financial event automatic extraction method and system based on graph attention network
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN112528653A (en) Short text entity identification method and system
CN112231476A (en) Improved graph neural network scientific and technical literature big data classification method
CN115934936A (en) Intelligent traffic text analysis method based on natural language processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination