CN115718889A - Industry classification method and device for company profile

Industry classification method and device for company profile

Info

Publication number: CN115718889A
Application number: CN202211613376.0A
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: sample, training, classification model, company, model
Inventors: 蔡凡华, 胡万利
Applicant and current assignee: Ping An Bank Co Ltd
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides an industry classification method and device for company profiles, relating to the field of computer technology. The method comprises: determining a training sample; optimizing an XLNet pre-training language model based on the training sample to obtain an optimized XLNet language model; processing a company profile text sample based on the optimized XLNet language model to obtain high-order text vector samples; training an initial classification model based on the high-order text vector samples and the sample labels to obtain a trained classification model; and performing industry classification on company profiles based on the trained classification model. The network layer can further learn word vectors and strengthen the weights of industry-specific term vectors, so that the model learns more high-order text feature vectors from contextual semantic information, obtains richer word-vector semantics, and ultimately yields more accurate text classification results.

Description

Industry classification method and device for company profile
Technical Field
The application relates to the field of computers, in particular to an industry classification method and device for company profiles.
Background
With the rapid development of information technology, data is growing explosively; social networks, mobile networks, and various intelligent and service tools are all sources of data. The commodity transaction data generated every day by the nearly 400 million members of a certain e-commerce platform amounts to about 20TB; the log data generated per day by about 10 million Facebook users exceeds 300TB. The wide range of data sources gives rise to diverse data forms, and data in any form can be put to use; for example, the recommendation system of an e-commerce platform analyzes a user's log data and then recommends items the user is interested in. In the financial field, a company's profile and main lines of business can generally be obtained from internet platforms (for example, the information column of the company's homepage) as a passage of text description. Performing industry classification analysis on this text can provide bank account managers with useful reference information when prospecting for business opportunities and marketing financial products, so that marketing work can be carried out in a more targeted way.
Because company profile texts are generally within 300 words, the text data suffers from sparse text features, and when a traditional text classification model is used for classification, the problem of how to capture more semantic information from the limited context arises.
Disclosure of Invention
The embodiments of the present application provide an industry classification method and apparatus for company profiles, so as to alleviate the technical problem in the prior art that the classification for company profiles is inaccurate.
In a first aspect, the present invention provides a method for industry classification of company profiles, comprising:
determining a training sample, wherein the training sample comprises a company profile text sample and a sample label;
optimizing an XLNet pre-training language model based on the training sample to obtain an optimized XLNet language model;
processing the company profile text sample based on the optimized XLNet language model to obtain a high-order text vector sample;
training an initial classification model based on the high-order text vector sample and the sample label to obtain a trained classification model;
and performing industry classification on company profiles based on the trained classification model.
In an alternative embodiment, the determining of the training sample comprises:
determining a data source and labeling it to obtain company profile source texts and sample labels;
and segmenting the company profile source text with a word segmentation tool and preprocessing it to obtain a company profile text sample, wherein the preprocessing comprises cleaning.
In an alternative embodiment, the XLNet pre-training language model is implemented based on the PyTorch framework and comprises a config.json file, a vocab.txt file and a pytorch_model.bin file.
In an alternative embodiment, the classification model comprises a BiLSTM + Attention network layer and a SoftMax layer.
In an optional embodiment, training an initial classification model based on the high-order text vector sample and the sample label to obtain a trained classification model comprises:
inputting the high-order text vector sample into the BiLSTM + Attention network layer to obtain an output result;
and inputting the output result into the SoftMax layer to obtain the probability of each category, and training the initial classification model to obtain a trained classification model.
In an alternative embodiment, the method further comprises:
acquiring a test set;
and verifying the trained classification model based on the test set, and after the verification is passed, performing industry classification on the company profile based on the trained classification model.
In an optional embodiment, training an initial classification model based on the high-order text vector sample and the sample label to obtain a trained classification model, includes:
and training an initial classification model according to the binary cross entropy loss, the high-order text vector sample and the sample label to obtain a trained classification model.
In a second aspect, the present invention provides an industry classification apparatus for company profiles, comprising:
a determining module, configured to determine a training sample, wherein the training sample comprises a company profile text sample and a sample label;
an optimizing module, configured to optimize an XLNet pre-training language model based on the training sample to obtain an optimized XLNet language model;
a processing module, configured to process the company profile text sample based on the optimized XLNet language model to obtain a high-order text vector sample;
a training module, configured to train an initial classification model based on the high-order text vector sample and the sample label to obtain a trained classification model;
and a classification module, configured to perform industry classification on company profiles based on the trained classification model.
In an optional embodiment, the determining module is specifically configured to:
determine a data source and label it to obtain company profile source texts and sample labels;
and segment the company profile source text with a word segmentation tool and preprocess it to obtain a company profile text sample, wherein the preprocessing comprises cleaning.
In an alternative embodiment, the XLNet pre-training language model is implemented based on the PyTorch framework and comprises a config.json file, a vocab.txt file and a pytorch_model.bin file.
In an alternative embodiment, the classification model comprises a BiLSTM + Attention network layer and a SoftMax layer.
In an alternative embodiment, the training module is specifically configured to:
input the high-order text vector sample into the BiLSTM + Attention network layer to obtain an output result;
and input the output result into the SoftMax layer to obtain the probability of each category, and train the initial classification model to obtain a trained classification model.
In an optional embodiment, the system further comprises a test module, configured to:
acquiring a test set;
and verifying the trained classification model based on the test set, and after the verification is passed, performing industry classification on the company profile based on the trained classification model.
In an alternative embodiment, the training module is specifically configured to:
and training an initial classification model according to the binary cross entropy loss, the high-order text vector sample and the sample label to obtain a trained classification model.
In a third aspect, the present invention provides an electronic device, including a processor, a communication interface, a memory and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of the preceding embodiments when executing the program stored in the memory.
In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method steps of any of the preceding embodiments.
The invention provides an industry classification method and device for company profiles: determining a training sample, wherein the training sample comprises a company profile text sample and a sample label; optimizing an XLNet pre-training language model based on the training sample to obtain an optimized XLNet language model; processing the company profile text sample based on the optimized XLNet language model to obtain a high-order text vector sample; training an initial classification model based on the high-order text vector sample and the sample label to obtain a trained classification model; and performing industry classification on company profiles based on the trained classification model. The network layer can further learn word vectors and strengthen the weights of industry-specific term vectors, so that the model learns more high-order text feature vectors from contextual semantic information, obtains richer word-vector semantics, and ultimately yields more accurate text classification results.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered as limiting the scope; those skilled in the art can also obtain other related drawings from these drawings without inventive effort.
FIG. 1 is a schematic flow chart of an industry classification method for company profiles according to an embodiment of the present application;
FIG. 2 is a model structure example provided in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an industry classification device for company profiles according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
Some embodiments of the invention are described in detail below with reference to the accompanying drawings. The embodiments and features of the embodiments described below can be combined with each other without conflict.
Fig. 1 is a schematic flow chart of an industry classification method for company profiles according to an embodiment of the present application. As shown in fig. 1, the method may include the steps of:
and S110, determining a training sample, wherein the training sample comprises a company brief introduction text sample and a sample label.
Determining a data source, marking to obtain a company brief introduction source text and a sample label; and segmenting the company brief introduction source text based on a segmentation tool, and preprocessing the segmented company brief introduction source text to obtain a company brief introduction text sample, wherein the preprocessing comprises cleaning.
As an example, after a data source is taken out of a database, an industry mark is marked on a text description by adopting a manual marking mode, and the text description is used as a training set of a subsequent model. In the method, only text information of four industries (labor industry, beauty industry, pet industry and financial industry) is extracted in an experiment, and 5000 pieces of text information are extracted for each category and used as model training.
The sample label can be determined according to actual needs, and may be, for example, an industry label, an attribute label, such as a labor industry, a beauty industry, and the like, or a company-sized label, such as a large-sized enterprise, a medium-sized enterprise, or a small-sized enterprise.
The cleaning can be carried out by utilizing a jieba word segmentation tool, and the non-conventional characters such as English characters, numbers, messy codes, special symbols and the like can be removed through cleaning, namely the characters which have no significance for classification.
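As an illustrative sketch (not part of the original disclosure), the cleaning and segmentation step could look roughly as follows in Python; the regular expression, the clean_and_segment helper and the sample sentence are assumptions chosen for illustration rather than details specified in the application.

```python
import re
import jieba

def clean_and_segment(text: str) -> list[str]:
    """Remove characters that carry no signal for classification, then segment with jieba."""
    # Illustrative cleaning rule (an assumption): keep only CJK characters,
    # which drops English letters, digits, garbled text and special symbols.
    cleaned = re.sub(r"[^\u4e00-\u9fa5]", "", text)
    # jieba.lcut returns the segmentation result as a list of words.
    return jieba.lcut(cleaned)

profile = "XX有限公司成立于2010年, 主要从事宠物食品的研发、生产与销售。"
print(clean_and_segment(profile))
```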
And S120, optimizing the XLNet pre-training language model based on the training samples to obtain the optimized XLNet language model.
The XLNet pre-training language model is implemented on the PyTorch framework and can comprise a config.json file, a vocab.txt file and a pytorch_model.bin file.
The config.json file is the main configuration file of the model, and all relevant configuration information is placed in this file. The file uses the JSON format: each configuration item consists of an attribute and a value; attributes appear in no particular order and at most once; values are made up of the basic JSON data types.
The vocab.txt file is the vocabulary of the model.
The pytorch_model.bin file is the binary file of the trained XLNet pre-training language model.
The XLNet pre-training language model may include a Chinese pre-training model file (bert-base-chinese).
The XLNet pre-training language model parameters can be fine-tuned during training to obtain feature representations of high-order text vectors.
Fine-tuning can be realized by continuing to train the XLNet pre-training language model on the training samples with a preset loss function.
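A minimal sketch of how the pre-trained files (config.json, vocab and pytorch_model.bin) could be loaded and prepared for fine-tuning with the Hugging Face transformers library; the checkpoint name hfl/chinese-xlnet-base, the learning rate and the sample sentence are assumptions for illustration, not values given in the application.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed Chinese XLNet checkpoint; a local directory containing
# config.json / the vocabulary file / pytorch_model.bin works as well.
MODEL_NAME = "hfl/chinese-xlnet-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
xlnet = AutoModel.from_pretrained(MODEL_NAME)

# Fine-tuning here simply means the XLNet parameters remain trainable and are
# updated by the optimizer together with the downstream classifier.
optimizer = torch.optim.AdamW(xlnet.parameters(), lr=2e-5)  # assumed learning rate

encoded = tokenizer("某公司主要从事宠物食品的研发与销售。",
                    return_tensors="pt", truncation=True, max_length=128)
outputs = xlnet(**encoded)
# last_hidden_state has shape (batch, seq_len, hidden); these are the
# "high-order text vectors" fed into the downstream BiLSTM + Attention classifier.
print(outputs.last_hidden_state.shape)
```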
The XLNet pre-training language model is an improved version of the BERT model. The optimization mainly covers three aspects: an AR (autoregressive) model is adopted in place of the AE (autoencoding) model, avoiding the negative effects brought by the [MASK] tokens; a two-stream attention mechanism is adopted; and Transformer-XL is introduced.
The main task of an AR model is to estimate the probability distribution of the corpus; an AE model instead reconstructs the original text from corrupted input and can therefore use context from both directions.
XLNet ultimately adopts the AR formulation, and how to let an AR model exploit bidirectional context is the key problem addressed in the paper.
To this end, a permutation language model is introduced: instead of modeling the sequence strictly in order as a traditional AR model does, it maximizes the expected log-likelihood over all possible permutation orders of the sequence. This preserves the contextual information of the sequence while avoiding the [MASK] marker bits, thereby remedying the respective shortcomings of BERT and the traditional AR model.
Although the permutation language model satisfies this objective, it is problematic for the ordinary Transformer structure: if the position information of the prediction target is not taken into account, different permutations can produce identical Transformer outputs, so the model cannot be expressed correctly. To solve this problem, the paper proposes a new distribution calculation method that makes the model aware of the target position, namely two-stream self-attention.
The model also integrates relative positional encoding and the segment-level recurrence mechanism from Transformer-XL.
And S130, processing the company profile text sample based on the optimized XLNet language model to obtain high-order text vector samples.
Processing can be done in the tokenize + convert_tokens_to_ids manner to obtain high-order text vector samples. The specific process is as follows:
When a neural network is used for a natural language processing task, the data must first be preprocessed, converting it from strings into a format the network can accept. This generally includes the following steps:
(1) Word segmentation: segment the text data (into characters or words) with a tokenizer;
(2) Dictionary construction: build a dictionary mapping from the segmentation results of the dataset (this step is not absolute; if pre-trained word vectors are used, the mapping must follow the word-vector file);
(3) Data conversion: map the segmented data through the constructed dictionary, converting the text sequence into a numeric sequence;
(4) Padding and truncation: when feeding the model in batches, short data must be padded and over-long data truncated, so that the data length falls within the range the model accepts and all samples in a batch have the same dimensions.
In past work, one might use different tokenizers and implement dictionary construction and conversion by hand. With the transformers toolkit, none of this complexity is needed: the Tokenizer module alone quickly performs all of the above operations, converting text into data the neural network can process. The Tokenizer does not require separate installation; it is installed together with transformers.
Load the Tokenizer. Since the Tokenizer is generated together with the pre-trained model, the pre-trained model to be used must be specified when loading it. Taking Chinese data as an example, we choose to load the bert-base-chinese pre-trained model; part of the files are downloaded on first loading.
The tokenize method is used for word segmentation; bert-base-chinese segments at the character level.
After segmentation a dictionary would normally be constructed, but as mentioned above the Tokenizer is generated together with the pre-trained model, so the dictionary is already built and does not need to be constructed again. Its contents can be viewed through the vocab attribute.
The segmentation result is converted into a numeric sequence through the dictionary by calling the convert_tokens_to_ids method directly. By calling the Tokenizer's tokenize and convert_tokens_to_ids methods, the data is successfully converted from a string into a sequence of numbers. The Tokenizer also provides a more convenient encode method that achieves the same effect directly.
The encode method also makes padding and truncation of the data convenient. For the data to enter a pre-trained model provided by transformers, two additional inputs, attention_mask and token_type_ids, must be constructed to mark the real input positions and the segment type respectively.
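The tokenization steps just described (tokenize, convert_tokens_to_ids, and the combined encode-style call with padding, truncation, attention_mask and token_type_ids) could be exercised roughly as follows; this is a sketch against the transformers Tokenizer API using bert-base-chinese as in the text, with the max_length value and sample sentences chosen arbitrarily.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

text = "本公司主要经营美容美发服务。"

# Steps 1-3: tokenize, then map tokens to ids through the built-in vocabulary.
tokens = tokenizer.tokenize(text)              # bert-base-chinese splits per character
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens, ids)

# The vocabulary ships with the pre-trained model; it can be inspected directly.
print(len(tokenizer.get_vocab()))

# Step 4: the callable/encode interface does padding and truncation in one go,
# and also builds attention_mask (real vs. padded positions) and
# token_type_ids (segment type) required by the pre-trained model.
batch = tokenizer(
    [text, "本公司从事劳务派遣业务。"],
    padding="max_length",
    truncation=True,
    max_length=32,          # illustrative length, not specified in the source
    return_tensors="pt",
)
print(batch["input_ids"].shape, batch["attention_mask"].shape, batch["token_type_ids"].shape)
```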
And S140, training the initial classification model based on the high-order text vector sample and the sample label to obtain the trained classification model.
The classification model may comprise a BiLSTM + Attention network layer and a SoftMax layer. The high-order text vector sample can be input into the BiLSTM + Attention network layer to obtain an output result; the output result is then input into the SoftMax layer to obtain the probability of each category, and the initial classification model is trained to obtain the trained classification model.
Specifically, the initial classification model can be trained according to the binary cross entropy loss, the high-order text vector sample and the sample label, so as to obtain a trained classification model.
The trained model file can be saved for convenient later use on the test set. On this basis, a test set can be acquired; the trained classification model is validated on the test set, and once validation passes, industry classification of company profiles is performed with the trained classification model.
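A sketch of how the saved model could be validated on a test set before being put to use; the file name, the DataLoader, and the acceptance threshold are illustrative assumptions rather than details given in the application.

```python
import torch
from torch.utils.data import DataLoader

@torch.no_grad()
def evaluate(classifier: torch.nn.Module, test_loader: DataLoader) -> float:
    """Return accuracy of the trained classifier on a held-out test set."""
    classifier.eval()
    correct, total = 0, 0
    for features, labels in test_loader:   # features: high-order text vectors
        logits = classifier(features)
        preds = logits.argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

# Illustrative usage: reload the stored weights, then verify before deployment.
# classifier.load_state_dict(torch.load("industry_classifier.pt"))   # assumed file name
# acc = evaluate(classifier, test_loader)
# if acc >= 0.8:   # assumed acceptance criterion
#     ...          # proceed to classify company profiles with the trained model
```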
Bi-LSTM + Attention adds an Attention layer on top of a Bi-LSTM model. In a plain Bi-LSTM, the output vector at the last time step is used as the feature vector, followed by softmax classification. With Attention, a weight is computed for each time step, the weighted sum of all time-step vectors is taken as the feature vector, and softmax classification is then performed. In the experiments, adding Attention indeed improved the results. The model structure is shown in Fig. 2.
For the Bi-LSTM + Attention model, the model input and the loss function are defined, a word embedding matrix is initialized with pre-trained word vectors, and words in the input data are converted into word vectors through this embedding matrix.
A two-layer bidirectional LSTM structure is defined, comprising a forward LSTM and a backward LSTM. A dynamic RNN can be used so that the sequence length can be supplied dynamically; if none is supplied, the full sequence length is taken. The forward (fw) and backward (bw) results are concatenated and passed into the next Bi-LSTM layer; the output of the last Bi-LSTM layer is split into forward and backward parts, which are added together and fed into the Attention layer.
The binary cross-entropy loss is computed, and the vector representation of the sentence is obtained with the Attention mechanism.
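To make the structure concrete, here is a minimal PyTorch sketch of a two-layer BiLSTM with an additive Attention layer and a SoftMax-style output head. The hidden size, number of classes and the use of nn.CrossEntropyLoss (the multi-class counterpart of the cross-entropy loss mentioned above) are illustrative assumptions, and the forward and backward outputs are kept concatenated rather than summed, a simplifying choice.

```python
import torch
import torch.nn as nn

class BiLSTMAttentionClassifier(nn.Module):
    """BiLSTM + Attention classifier over XLNet high-order text vectors."""

    def __init__(self, input_dim: int = 768, hidden_dim: int = 256, num_classes: int = 4):
        super().__init__()
        # Two stacked bidirectional LSTM layers (forward + backward structures).
        self.bilstm = nn.LSTM(input_dim, hidden_dim, num_layers=2,
                              bidirectional=True, batch_first=True)
        # Additive attention: one scalar weight per time step.
        self.attn = nn.Linear(2 * hidden_dim, 1)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_dim), the high-order text vectors from XLNet
        outputs, _ = self.bilstm(x)                         # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.attn(outputs), dim=1)  # (batch, seq_len, 1)
        sentence_vec = (weights * outputs).sum(dim=1)       # weighted sum over time steps
        return self.classifier(sentence_vec)                # logits per industry class

model = BiLSTMAttentionClassifier()
criterion = nn.CrossEntropyLoss()   # assumed multi-class cross-entropy loss
features = torch.randn(8, 64, 768)  # dummy batch of XLNet feature vectors
labels = torch.randint(0, 4, (8,))
loss = criterion(model(features), labels)
loss.backward()
# At inference time, per-category probabilities (the SoftMax output described in the
# text) can be obtained with torch.softmax(model(features), dim=-1).
```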
And S150, performing industry classification on company profiles based on the trained classification model.
It should be noted that an XLNet pre-training model is adopted when preprocessing the text; other Transformer-type pre-training models, such as BERT, XLM, RoBERTa and DistilBERT, can also be chosen in practice, with the preferred model selected according to the actual working scenario and the observed effect.
The model first uses the XLNet pre-training model to generate a feature representation of the text and then feeds this representation into the Attention-based bidirectional long short-term memory network BiLSTM + Attention. This network layer further learns the word vectors and strengthens the weights of industry-specific term vectors, so that the model learns more high-order text feature vectors from contextual semantic information and obtains richer word-vector semantics. Finally, text classification is realized through a SoftMax layer, which outputs the probability of each category.
In the experiments, the test set contained 200,000 samples across 4 industry categories: labor services, beauty, pets and finance.
Experiment 1: text classification using the XLNet pre-training model alone; the model accuracy reaches about 73.5%.
Experiment 2: when the Attention-based bidirectional long short-term memory network BiLSTM + Attention is added afterwards, the recognition accuracy of the model improves to about 80%, an improvement of close to 7 percentage points, which shows that the scheme proposed here is effective.
Fig. 3 is a schematic structural diagram of an industry classification device for company profiles according to an embodiment of the present application. As shown in fig. 3, the apparatus includes:
a determining module 301, configured to determine a training sample, where the training sample includes a company profile text sample and a sample label;
an optimizing module 302, configured to optimize an XLNet pre-training language model based on a training sample, to obtain an optimized XLNet language model;
a processing module 303, configured to process the company profile text sample based on the optimized XLNet language model to obtain a high-order text vector sample;
a training module 304, configured to train the initial classification model based on the high-order text vector sample and the sample label to obtain a trained classification model;
and a classification module 305, configured to perform industry classification on company profiles based on the trained classification model.
In some embodiments, the determining module 301 is specifically configured to:
determine a data source and label it to obtain company profile source texts and sample labels;
and segment the company profile source text with a word segmentation tool and preprocess the segmented text to obtain a company profile text sample, wherein the preprocessing includes cleaning.
In some embodiments, the XLNet pre-training language model is implemented based on the PyTorch framework and includes a config.json file, a vocab.txt file and a pytorch_model.bin file.
In some embodiments, the classification model includes a BiLSTM + Attention network layer and a SoftMax layer.
In some embodiments, training module 304 is specifically configured to:
input the high-order text vector sample into the BiLSTM + Attention network layer to obtain an output result;
and input the output result into the SoftMax layer to obtain the probability of each category, and train the initial classification model to obtain a trained classification model.
In some embodiments, further comprising a testing module to:
acquiring a test set;
and verify the trained classification model based on the test set, and after the verification is passed, perform industry classification on company profiles based on the trained classification model.
In some embodiments, training module 304 is specifically configured to:
and training the initial classification model according to the binary cross entropy loss, the high-order text vector sample and the sample label to obtain the trained classification model.
An electronic device is further provided in the embodiment of the present application, as shown in fig. 4, and includes a processor 410, a communication interface 420, a memory 430, and a communication bus 440, where the processor 410, the communication interface 420, and the memory 430 complete communication with each other through the communication bus 440.
A memory 430 for storing computer programs;
the processor 410, when executing the program stored in the memory 430, implements the method steps of any of the embodiments described above.
The communication bus mentioned above may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Alternatively, the memory may be at least one memory device located remotely from the processor.
The Processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Since the implementation manner and the beneficial effects of the electronic device in the foregoing embodiment for solving the problems can be implemented by referring to the steps in the embodiment shown in fig. 1, detailed working processes and beneficial effects of the electronic device provided in the embodiment of the present application are not described herein again.
In yet another embodiment provided by the present application, there is also provided a computer-readable storage medium having stored therein instructions that, when executed on a computer, cause the computer to perform any of the above-described method for industry classification for a company profile.
In yet another embodiment provided by the present application, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the industry classification method for company profiles of any of the above embodiments.
As will be appreciated by one of skill in the art, the embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all changes and modifications that fall within the scope of the embodiments of the present application.
It is apparent that those skilled in the art can make various changes and modifications to the embodiments of the present application without departing from the spirit and scope of the embodiments of the present application. Thus, if such modifications and variations of the embodiments of the present application fall within the scope of the claims of the embodiments of the present application and their equivalents, the embodiments of the present application are also intended to include such modifications and variations.

Claims (10)

1. An industry classification method for company profiles, comprising:
determining a training sample, wherein the training sample comprises a company profile text sample and a sample label;
optimizing an XLNet pre-training language model based on the training sample to obtain an optimized XLNet language model;
processing the company profile text sample based on the optimized XLNet language model to obtain a high-order text vector sample;
training an initial classification model based on the high-order text vector sample and the sample label to obtain a trained classification model;
and performing industry classification on company profiles based on the trained classification model.
2. The method of claim 1, wherein the determining of the training sample comprises:
determining a data source and labeling it to obtain a company profile source text and a sample label;
and segmenting the company profile source text with a word segmentation tool and preprocessing it to obtain a company profile text sample, wherein the preprocessing comprises cleaning.
3. The method of claim 1, wherein the XLNet pre-training language model is implemented based on the PyTorch framework and comprises a config.json file, a vocab.txt file and a pytorch_model.bin file.
4. The method of claim 1, wherein the classification model comprises a BiLSTM + Attention network layer and a SoftMax layer.
5. The method of claim 4, wherein training an initial classification model based on the high-order text vector sample and the sample label to obtain a trained classification model comprises:
inputting the high-order text vector sample into the BiLSTM + Attention network layer to obtain an output result;
and inputting the output result into the SoftMax layer to obtain the probability of each category, and training the initial classification model to obtain a trained classification model.
6. The method of claim 1, further comprising:
acquiring a test set;
and verifying the trained classification model based on the test set, and after the verification is passed, performing industry classification on the company profile based on the trained classification model.
7. The method of claim 5, wherein training an initial classification model based on the higher order text vector samples and the sample labels to obtain a trained classification model comprises:
and training an initial classification model according to the binary cross entropy loss, the high-order text vector sample and the sample label to obtain a trained classification model.
8. An industry classification apparatus for company profiles, comprising:
a determining module, configured to determine a training sample, wherein the training sample comprises a company profile text sample and a sample label;
an optimizing module, configured to optimize an XLNet pre-training language model based on the training sample to obtain an optimized XLNet language model;
a processing module, configured to process the company profile text sample based on the optimized XLNet language model to obtain a high-order text vector sample;
a training module, configured to train an initial classification model based on the high-order text vector sample and the sample label to obtain a trained classification model;
and a classification module, configured to perform industry classification on company profiles based on the trained classification model.
9. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1-7 when executing a program stored on a memory.
10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of the claims 1-7.
CN202211613376.0A (priority date 2022-12-15, filing date 2022-12-15): Industry classification method and device for company profile. Status: Pending. Publication: CN115718889A (en).

Priority Applications (1)

Application Number: CN202211613376.0A | Priority Date: 2022-12-15 | Filing Date: 2022-12-15 | Title: Industry classification method and device for company profile


Publications (1)

Publication Number: CN115718889A (en) | Publication Date: 2023-02-28

Family

ID=85257741

Family Applications (1)

CN202211613376.0A (Pending): CN115718889A (en), Industry classification method and device for company profile

Country Status (1)

CN: CN115718889A (en)

Cited By (1)

* Cited by examiner, † Cited by third party

CN116680619A * (priority date 2023-07-28, publication date 2023-09-01, 江西中医药大学): Method and device for predicting decoction time classification, electronic equipment and storage medium


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination