CN115718889A - Industry classification method and device for company profile

Industry classification method and device for company profile

Info

Publication number: CN115718889A
Application number: CN202211613376.0A
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: sample, training, classification model, company, model
Inventors: 蔡凡华, 胡万利
Applicant and current assignee: Ping An Bank Co Ltd
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides an industry classification method and device for company profiles, relating to the field of computer technology. The method comprises: determining a training sample; optimizing an XLNet pre-training language model based on the training sample to obtain an optimized XLNet language model; processing a company profile text sample based on the optimized XLNet language model to obtain high-order text vector samples; training an initial classification model based on the high-order text vector samples and the sample labels to obtain a trained classification model; and performing industry classification on company profiles based on the trained classification model. The network layer can further learn word vectors and strengthen the weights of industry-specific term vectors, so that the model learns more high-order text feature vectors from contextual semantic information, obtains richer word-vector semantics, and ultimately yields more accurate text classification results.

Description

Industry classification method and device for company profile
Technical Field
The application relates to the field of computers, in particular to an industry classification method and device for company profiles.
Background
With the rapid development of information technology, data is growing explosively; social networks, mobile networks, and various intelligent and service tools are all sources of data. The commodity transaction data generated every day by the nearly 400 million members of a certain e-commerce platform amounts to about 20TB; the log data generated per day by about 10 million Facebook users exceeds 300TB. The wide range of data sources gives rise to diverse data forms, and data in any form can be put to use; for example, the recommendation system of an e-commerce platform analyzes a user's log data and then recommends items the user is interested in. In the financial field, a company's profile and main lines of business can generally be obtained from internet platforms (for example, the information column of the company's homepage) as a passage of text description. Performing industry classification analysis on this text can provide bank account managers with useful reference information when prospecting for business opportunities and marketing financial products, so that marketing work can be carried out in a more targeted way.
Because company profile texts are generally within 300 words, the text data suffers from sparse text features, and when a traditional text classification model is used for classification, the problem of how to capture more semantic information from the limited context arises.
Disclosure of Invention
The embodiments of the present application provide an industry classification method and apparatus for company profiles, so as to alleviate the technical problem in the prior art that the classification for company profiles is inaccurate.
In a first aspect, the present invention provides a method for industry classification of company profiles, comprising:
determining a training sample, wherein the training sample comprises a company profile text sample and a sample label;
optimizing an XLNet pre-training language model based on the training sample to obtain an optimized XLNet language model;
processing the company profile text sample based on the optimized XLNet language model to obtain a high-order text vector sample;
training an initial classification model based on the high-order text vector sample and the sample label to obtain a trained classification model;
and performing industry classification on company profiles based on the trained classification model.
In an alternative embodiment, the determining of the training sample comprises:
determining a data source and labeling it to obtain company profile source texts and sample labels;
and segmenting the company profile source text with a word segmentation tool and preprocessing it to obtain a company profile text sample, wherein the preprocessing comprises cleaning.
In an alternative embodiment, the XLNet pre-training language model is implemented based on the PyTorch framework and comprises a config.json file, a vocab.txt file and a pytorch_model.bin file.
In an alternative embodiment, the classification model comprises a BiLSTM + Attention network layer and a SoftMax layer.
In an optional embodiment, training an initial classification model based on the high-order text vector sample and the sample label to obtain a trained classification model comprises:
inputting the high-order text vector sample into the BiLSTM + Attention network layer to obtain an output result;
and inputting the output result into the SoftMax layer to obtain the probability of each category, and training the initial classification model to obtain a trained classification model.
In an alternative embodiment, the method further comprises:
acquiring a test set;
and verifying the trained classification model based on the test set, and after the verification is passed, performing industry classification on the company profile based on the trained classification model.
In an optional embodiment, training an initial classification model based on the high-order text vector sample and the sample label to obtain a trained classification model, includes:
and training an initial classification model according to the binary cross entropy loss, the high-order text vector sample and the sample label to obtain a trained classification model.
In a second aspect, the present invention provides an industry classification apparatus for company profiles, comprising:
a determining module, configured to determine a training sample, wherein the training sample comprises a company profile text sample and a sample label;
an optimizing module, configured to optimize an XLNet pre-training language model based on the training sample to obtain an optimized XLNet language model;
a processing module, configured to process the company profile text sample based on the optimized XLNet language model to obtain a high-order text vector sample;
a training module, configured to train an initial classification model based on the high-order text vector sample and the sample label to obtain a trained classification model;
and a classification module, configured to perform industry classification on company profiles based on the trained classification model.
In an optional embodiment, the determining module is specifically configured to:
determine a data source and label it to obtain company profile source texts and sample labels;
and segment the company profile source text with a word segmentation tool and preprocess it to obtain a company profile text sample, wherein the preprocessing comprises cleaning.
In an alternative embodiment, the XLNet pre-training language model is implemented based on the PyTorch framework and comprises a config.json file, a vocab.txt file and a pytorch_model.bin file.
In an alternative embodiment, the classification model comprises a BiLSTM + Attention network layer and a SoftMax layer.
In an alternative embodiment, the training module is specifically configured to:
input the high-order text vector sample into the BiLSTM + Attention network layer to obtain an output result;
and input the output result into the SoftMax layer to obtain the probability of each category, and train the initial classification model to obtain a trained classification model.
In an optional embodiment, the system further comprises a test module, configured to:
acquiring a test set;
and verifying the trained classification model based on the test set, and after the verification is passed, performing industry classification on the company profile based on the trained classification model.
In an alternative embodiment, the training module is specifically configured to:
and training an initial classification model according to the binary cross entropy loss, the high-order text vector sample and the sample label to obtain a trained classification model.
In a third aspect, the present invention provides an electronic device, including a processor, a communication interface, a memory and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of the preceding embodiments when executing the program stored in the memory.
In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method steps of any of the preceding embodiments.
The invention provides an industry classification method and device for company profiles: determining a training sample, wherein the training sample comprises a company profile text sample and a sample label; optimizing an XLNet pre-training language model based on the training sample to obtain an optimized XLNet language model; processing the company profile text sample based on the optimized XLNet language model to obtain a high-order text vector sample; training an initial classification model based on the high-order text vector sample and the sample label to obtain a trained classification model; and performing industry classification on company profiles based on the trained classification model. The network layer can further learn word vectors and strengthen the weights of industry-specific term vectors, so that the model learns more high-order text feature vectors from contextual semantic information, obtains richer word-vector semantics, and ultimately yields more accurate text classification results.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered as limiting the scope; those skilled in the art can also obtain other related drawings from these drawings without inventive effort.
FIG. 1 is a schematic flow chart of an industry classification method for company profiles according to an embodiment of the present application;
FIG. 2 is a model structure example provided in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an industry classification device for company profiles according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
Some embodiments of the invention are described in detail below with reference to the accompanying drawings. The embodiments and features of the embodiments described below can be combined with each other without conflict.
Fig. 1 is a schematic flow chart of an industry classification method for company profiles according to an embodiment of the present application. As shown in fig. 1, the method may include the steps of:
and S110, determining a training sample, wherein the training sample comprises a company brief introduction text sample and a sample label.
Determining a data source, marking to obtain a company brief introduction source text and a sample label; and segmenting the company brief introduction source text based on a segmentation tool, and preprocessing the segmented company brief introduction source text to obtain a company brief introduction text sample, wherein the preprocessing comprises cleaning.
As an example, after a data source is taken out of a database, an industry mark is marked on a text description by adopting a manual marking mode, and the text description is used as a training set of a subsequent model. In the method, only text information of four industries (labor industry, beauty industry, pet industry and financial industry) is extracted in an experiment, and 5000 pieces of text information are extracted for each category and used as model training.
The sample label can be determined according to actual needs, and may be, for example, an industry label, an attribute label, such as a labor industry, a beauty industry, and the like, or a company-sized label, such as a large-sized enterprise, a medium-sized enterprise, or a small-sized enterprise.
The cleaning can be carried out by utilizing a jieba word segmentation tool, and the non-conventional characters such as English characters, numbers, messy codes, special symbols and the like can be removed through cleaning, namely the characters which have no significance for classification.
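As an illustrative sketch (not part of the original disclosure), the cleaning and segmentation step could look roughly as follows in Python; the regular expression, the clean_and_segment helper and the sample sentence are assumptions chosen for illustration rather than details specified in the application.

```python
import re
import jieba

def clean_and_segment(text: str) -> list[str]:
    """Remove characters that carry no signal for classification, then segment with jieba."""
    # Illustrative cleaning rule (an assumption): keep only CJK characters,
    # which drops English letters, digits, garbled text and special symbols.
    cleaned = re.sub(r"[^\u4e00-\u9fa5]", "", text)
    # jieba.lcut returns the segmentation result as a list of words.
    return jieba.lcut(cleaned)

profile = "XX有限公司成立于2010年, 主要从事宠物食品的研发、生产与销售。"
print(clean_and_segment(profile))
```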
And S120, optimizing the XLNet pre-training language model based on the training samples to obtain the optimized XLNet language model.
The XLNet pre-training language model is implemented on the PyTorch framework and can comprise a config.json file, a vocab.txt file and a pytorch_model.bin file.
The config.json file is the main configuration file of the model, and all relevant configuration information is placed in this file. The file uses the JSON format: each configuration item consists of an attribute and a value; attributes appear in no particular order and at most once; values are made up of the basic JSON data types.
The vocab.txt file is the vocabulary of the model.
The pytorch_model.bin file is the binary file of the trained XLNet pre-training language model.
The XLNet pre-training language model may include a Chinese pre-training model file (bert-base-chinese).
The XLNet pre-training language model parameters can be fine-tuned during training to obtain feature representations of high-order text vectors.
Fine-tuning can be realized by continuing to train the XLNet pre-training language model on the training samples with a preset loss function.
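A minimal sketch of how the pre-trained files (config.json, vocab and pytorch_model.bin) could be loaded and prepared for fine-tuning with the Hugging Face transformers library; the checkpoint name hfl/chinese-xlnet-base, the learning rate and the sample sentence are assumptions for illustration, not values given in the application.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed Chinese XLNet checkpoint; a local directory containing
# config.json / the vocabulary file / pytorch_model.bin works as well.
MODEL_NAME = "hfl/chinese-xlnet-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
xlnet = AutoModel.from_pretrained(MODEL_NAME)

# Fine-tuning here simply means the XLNet parameters remain trainable and are
# updated by the optimizer together with the downstream classifier.
optimizer = torch.optim.AdamW(xlnet.parameters(), lr=2e-5)  # assumed learning rate

encoded = tokenizer("某公司主要从事宠物食品的研发与销售。",
                    return_tensors="pt", truncation=True, max_length=128)
outputs = xlnet(**encoded)
# last_hidden_state has shape (batch, seq_len, hidden); these are the
# "high-order text vectors" fed into the downstream BiLSTM + Attention classifier.
print(outputs.last_hidden_state.shape)
```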
The XLNet pre-training language model is an improved version of the BERT model. The optimization mainly covers three aspects: an AR (autoregressive) model is adopted in place of the AE (autoencoding) model, avoiding the negative effects brought by the [MASK] tokens; a two-stream attention mechanism is adopted; and Transformer-XL is introduced.
The main task of an AR model is to estimate the probability distribution of the corpus; an AE model instead reconstructs the original text from corrupted input and can therefore use context from both directions.
XLNet ultimately adopts the AR formulation, and how to let an AR model exploit bidirectional context is the key problem addressed in the paper.
To this end, a permutation language model is introduced: instead of modeling the sequence strictly in order as a traditional AR model does, it maximizes the expected log-likelihood over all possible permutation orders of the sequence. This preserves the contextual information of the sequence while avoiding the [MASK] marker bits, thereby remedying the respective shortcomings of BERT and the traditional AR model.
Although the permutation language model satisfies this objective, it is problematic for the ordinary Transformer structure: if the position information of the prediction target is not taken into account, different permutations can produce identical Transformer outputs, so the model cannot be expressed correctly. To solve this problem, the paper proposes a new distribution calculation method that makes the model aware of the target position, namely two-stream self-attention.
The model also integrates relative positional encoding and the segment-level recurrence mechanism from Transformer-XL.
And S130, processing the company profile text sample based on the optimized XLNet language model to obtain high-order text vector samples.
Processing can be done in the tokenize + convert_tokens_to_ids manner to obtain high-order text vector samples. The specific process is as follows:
When a neural network is used for a natural language processing task, the data must first be preprocessed, converting it from strings into a format the network can accept. This generally includes the following steps:
(1) Word segmentation: segment the text data (into characters or words) with a tokenizer;
(2) Dictionary construction: build a dictionary mapping from the segmentation results of the dataset (this step is not absolute; if pre-trained word vectors are used, the mapping must follow the word-vector file);
(3) Data conversion: map the segmented data through the constructed dictionary, converting the text sequence into a numeric sequence;
(4) Padding and truncation: when feeding the model in batches, short data must be padded and over-long data truncated, so that the data length falls within the range the model accepts and all samples in a batch have the same dimensions.
In past work, one might use different tokenizers and implement dictionary construction and conversion by hand. With the transformers toolkit, none of this complexity is needed: the Tokenizer module alone quickly performs all of the above operations, converting text into data the neural network can process. The Tokenizer does not require separate installation; it is installed together with transformers.
Load the Tokenizer. Since the Tokenizer is generated together with the pre-trained model, the pre-trained model to be used must be specified when loading it. Taking Chinese data as an example, we choose to load the bert-base-chinese pre-trained model; part of the files are downloaded on first loading.
The tokenize method is used for word segmentation; bert-base-chinese segments at the character level.
After segmentation a dictionary would normally be constructed, but as mentioned above the Tokenizer is generated together with the pre-trained model, so the dictionary is already built and does not need to be constructed again. Its contents can be viewed through the vocab attribute.
The segmentation result is converted into a numeric sequence through the dictionary by calling the convert_tokens_to_ids method directly. By calling the Tokenizer's tokenize and convert_tokens_to_ids methods, the data is successfully converted from a string into a sequence of numbers. The Tokenizer also provides a more convenient encode method that achieves the same effect directly.
The encode method also makes padding and truncation of the data convenient. For the data to enter a pre-trained model provided by transformers, two additional inputs, attention_mask and token_type_ids, must be constructed to mark the real input positions and the segment type respectively.
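The tokenization steps just described (tokenize, convert_tokens_to_ids, and the combined encode-style call with padding, truncation, attention_mask and token_type_ids) could be exercised roughly as follows; this is a sketch against the transformers Tokenizer API using bert-base-chinese as in the text, with the max_length value and sample sentences chosen arbitrarily.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

text = "本公司主要经营美容美发服务。"

# Steps 1-3: tokenize, then map tokens to ids through the built-in vocabulary.
tokens = tokenizer.tokenize(text)              # bert-base-chinese splits per character
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens, ids)

# The vocabulary ships with the pre-trained model; it can be inspected directly.
print(len(tokenizer.get_vocab()))

# Step 4: the callable/encode interface does padding and truncation in one go,
# and also builds attention_mask (real vs. padded positions) and
# token_type_ids (segment type) required by the pre-trained model.
batch = tokenizer(
    [text, "本公司从事劳务派遣业务。"],
    padding="max_length",
    truncation=True,
    max_length=32,          # illustrative length, not specified in the source
    return_tensors="pt",
)
print(batch["input_ids"].shape, batch["attention_mask"].shape, batch["token_type_ids"].shape)
```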
And S140, training the initial classification model based on the high-order text vector sample and the sample label to obtain the trained classification model.
The classification model may comprise a BiLSTM + Attention network layer and a SoftMax layer. The high-order text vector sample can be input into the BiLSTM + Attention network layer to obtain an output result; the output result is then input into the SoftMax layer to obtain the probability of each category, and the initial classification model is trained to obtain the trained classification model.
Specifically, the initial classification model can be trained according to the binary cross entropy loss, the high-order text vector sample and the sample label, so as to obtain a trained classification model.
The trained model file can be saved for convenient later use on the test set. On this basis, a test set can be acquired; the trained classification model is validated on the test set, and once validation passes, industry classification of company profiles is performed with the trained classification model.
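A sketch of how the saved model could be validated on a test set before being put to use; the file name, the DataLoader, and the acceptance threshold are illustrative assumptions rather than details given in the application.

```python
import torch
from torch.utils.data import DataLoader

@torch.no_grad()
def evaluate(classifier: torch.nn.Module, test_loader: DataLoader) -> float:
    """Return accuracy of the trained classifier on a held-out test set."""
    classifier.eval()
    correct, total = 0, 0
    for features, labels in test_loader:   # features: high-order text vectors
        logits = classifier(features)
        preds = logits.argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

# Illustrative usage: reload the stored weights, then verify before deployment.
# classifier.load_state_dict(torch.load("industry_classifier.pt"))   # assumed file name
# acc = evaluate(classifier, test_loader)
# if acc >= 0.8:   # assumed acceptance criterion
#     ...          # proceed to classify company profiles with the trained model
```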
Bi-LSTM + Attention adds an Attention layer on top of a Bi-LSTM model. In a plain Bi-LSTM, the output vector at the last time step is used as the feature vector, followed by softmax classification. With Attention, a weight is computed for each time step, the weighted sum of all time-step vectors is taken as the feature vector, and softmax classification is then performed. In the experiments, adding Attention indeed improved the results. The model structure is shown in Fig. 2.
For the Bi-LSTM + Attention model, the model input and the loss function are defined, a word embedding matrix is initialized with pre-trained word vectors, and words in the input data are converted into word vectors through this embedding matrix.
A two-layer bidirectional LSTM structure is defined, comprising a forward LSTM and a backward LSTM. A dynamic RNN can be used so that the sequence length can be supplied dynamically; if none is supplied, the full sequence length is taken. The forward (fw) and backward (bw) results are concatenated and passed into the next Bi-LSTM layer; the output of the last Bi-LSTM layer is split into forward and backward parts, which are added together and fed into the Attention layer.
The binary cross-entropy loss is computed, and the vector representation of the sentence is obtained with the Attention mechanism.
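To make the structure concrete, here is a minimal PyTorch sketch of a two-layer BiLSTM with an additive Attention layer and a SoftMax-style output head. The hidden size, number of classes and the use of nn.CrossEntropyLoss (the multi-class counterpart of the cross-entropy loss mentioned above) are illustrative assumptions, and the forward and backward outputs are kept concatenated rather than summed, a simplifying choice.

```python
import torch
import torch.nn as nn

class BiLSTMAttentionClassifier(nn.Module):
    """BiLSTM + Attention classifier over XLNet high-order text vectors."""

    def __init__(self, input_dim: int = 768, hidden_dim: int = 256, num_classes: int = 4):
        super().__init__()
        # Two stacked bidirectional LSTM layers (forward + backward structures).
        self.bilstm = nn.LSTM(input_dim, hidden_dim, num_layers=2,
                              bidirectional=True, batch_first=True)
        # Additive attention: one scalar weight per time step.
        self.attn = nn.Linear(2 * hidden_dim, 1)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_dim), the high-order text vectors from XLNet
        outputs, _ = self.bilstm(x)                         # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.attn(outputs), dim=1)  # (batch, seq_len, 1)
        sentence_vec = (weights * outputs).sum(dim=1)       # weighted sum over time steps
        return self.classifier(sentence_vec)                # logits per industry class

model = BiLSTMAttentionClassifier()
criterion = nn.CrossEntropyLoss()   # assumed multi-class cross-entropy loss
features = torch.randn(8, 64, 768)  # dummy batch of XLNet feature vectors
labels = torch.randint(0, 4, (8,))
loss = criterion(model(features), labels)
loss.backward()
# At inference time, per-category probabilities (the SoftMax output described in the
# text) can be obtained with torch.softmax(model(features), dim=-1).
```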
And S150, performing industry classification on company profiles based on the trained classification model.
It should be noted that an XLNet pre-training model is adopted when preprocessing the text; other Transformer-type pre-training models, such as BERT, XLM, RoBERTa and DistilBERT, can also be chosen in practice, with the preferred model selected according to the actual working scenario and the observed effect.
The model first uses the XLNet pre-training model to generate a feature representation of the text and then feeds this representation into the Attention-based bidirectional long short-term memory network BiLSTM + Attention. This network layer further learns the word vectors and strengthens the weights of industry-specific term vectors, so that the model learns more high-order text feature vectors from contextual semantic information and obtains richer word-vector semantics. Finally, text classification is realized through a SoftMax layer, which outputs the probability of each category.
In the experiments, the test set contained 200,000 samples across 4 industry categories: labor services, beauty, pets and finance.
Experiment 1: text classification using the XLNet pre-training model alone; the model accuracy reaches about 73.5%.
Experiment 2: when the Attention-based bidirectional long short-term memory network BiLSTM + Attention is added afterwards, the recognition accuracy of the model improves to about 80%, an improvement of close to 7 percentage points, which shows that the scheme proposed here is effective.
Fig. 3 is a schematic structural diagram of an industry classification device for company profiles according to an embodiment of the present application. As shown in fig. 3, the apparatus includes:
a determining module 301, configured to determine a training sample, where the training sample includes a company profile text sample and a sample label;
an optimizing module 302, configured to optimize an XLNet pre-training language model based on a training sample, to obtain an optimized XLNet language model;
a processing module 303, configured to process the company profile text sample based on the optimized XLNet language model to obtain a high-order text vector sample;
a training module 304, configured to train the initial classification model based on the high-order text vector sample and the sample label to obtain a trained classification model;
and a classification module 305, configured to perform industry classification on company profiles based on the trained classification model.
In some embodiments, the determining module 301 is specifically configured to:
determine a data source and label it to obtain company profile source texts and sample labels;
and segment the company profile source text with a word segmentation tool and preprocess the segmented text to obtain a company profile text sample, wherein the preprocessing includes cleaning.
In some embodiments, the XLNet pre-training language model is implemented based on the PyTorch framework and includes a config.json file, a vocab.txt file and a pytorch_model.bin file.
In some embodiments, the classification model includes a BiLSTM + Attention network layer and a SoftMax layer.
In some embodiments, training module 304 is specifically configured to:
input the high-order text vector sample into the BiLSTM + Attention network layer to obtain an output result;
and input the output result into the SoftMax layer to obtain the probability of each category, and train the initial classification model to obtain a trained classification model.
In some embodiments, further comprising a testing module to:
acquiring a test set;
and verify the trained classification model based on the test set, and after the verification is passed, perform industry classification on company profiles based on the trained classification model.
In some embodiments, training module 304 is specifically configured to:
and training the initial classification model according to the binary cross entropy loss, the high-order text vector sample and the sample label to obtain the trained classification model.
An electronic device is further provided in the embodiment of the present application, as shown in fig. 4, and includes a processor 410, a communication interface 420, a memory 430, and a communication bus 440, where the processor 410, the communication interface 420, and the memory 430 complete communication with each other through the communication bus 440.
A memory 430 for storing computer programs;
the processor 410, when executing the program stored in the memory 430, implements the method steps of any of the embodiments described above.
The communication bus mentioned above may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Alternatively, the memory may be at least one memory device located remotely from the processor.
The Processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Since the implementation manner and the beneficial effects of the electronic device in the foregoing embodiment for solving the problems can be implemented by referring to the steps in the embodiment shown in fig. 1, detailed working processes and beneficial effects of the electronic device provided in the embodiment of the present application are not described herein again.
In yet another embodiment provided by the present application, there is also provided a computer-readable storage medium having stored therein instructions that, when executed on a computer, cause the computer to perform any of the above-described method for industry classification for a company profile.
In yet another embodiment provided by the present application, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the industry classification method for company profiles of any of the above embodiments.
As will be appreciated by one of skill in the art, the embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all changes and modifications that fall within the scope of the embodiments of the present application.
It is apparent that those skilled in the art can make various changes and modifications to the embodiments of the present application without departing from the spirit and scope of the embodiments of the present application. Thus, if such modifications and variations of the embodiments of the present application fall within the scope of the claims of the embodiments of the present application and their equivalents, the embodiments of the present application are also intended to include such modifications and variations.

Claims (10)

1. An industry classification method for company profiles, comprising:
determining a training sample, wherein the training sample comprises a company profile text sample and a sample label;
optimizing an XLNet pre-training language model based on the training sample to obtain an optimized XLNet language model;
processing the company profile text sample based on the optimized XLNet language model to obtain a high-order text vector sample;
training an initial classification model based on the high-order text vector sample and the sample label to obtain a trained classification model;
and performing industry classification on company profiles based on the trained classification model.
2. The method of claim 1, wherein the determining of the training sample comprises:
determining a data source and labeling it to obtain a company profile source text and a sample label;
and segmenting the company profile source text with a word segmentation tool and preprocessing it to obtain a company profile text sample, wherein the preprocessing comprises cleaning.
3. The method of claim 1, wherein the XLNet pre-training language model is implemented based on the PyTorch framework and comprises a config.json file, a vocab.txt file and a pytorch_model.bin file.
4. The method of claim 1, wherein the classification model comprises a BiLSTM + Attention network layer and a SoftMax layer.
5. The method of claim 4, wherein training an initial classification model based on the high-order text vector sample and the sample label to obtain a trained classification model comprises:
inputting the high-order text vector sample into the BiLSTM + Attention network layer to obtain an output result;
and inputting the output result into the SoftMax layer to obtain the probability of each category, and training the initial classification model to obtain a trained classification model.
6. The method of claim 1, further comprising:
acquiring a test set;
and verifying the trained classification model based on the test set, and after the verification is passed, performing industry classification on the company profile based on the trained classification model.
7. The method of claim 5, wherein training an initial classification model based on the higher order text vector samples and the sample labels to obtain a trained classification model comprises:
and training an initial classification model according to the binary cross entropy loss, the high-order text vector sample and the sample label to obtain a trained classification model.
8. An industry classification apparatus for company profiles, comprising:
a determining module, configured to determine a training sample, wherein the training sample comprises a company profile text sample and a sample label;
an optimizing module, configured to optimize an XLNet pre-training language model based on the training sample to obtain an optimized XLNet language model;
a processing module, configured to process the company profile text sample based on the optimized XLNet language model to obtain a high-order text vector sample;
a training module, configured to train an initial classification model based on the high-order text vector sample and the sample label to obtain a trained classification model;
and a classification module, configured to perform industry classification on company profiles based on the trained classification model.
9. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1-7 when executing a program stored on a memory.
10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of the claims 1-7.
CN202211613376.0A (priority date 2022-12-15, filing date 2022-12-15): Industry classification method and device for company profile. Status: Pending. Publication: CN115718889A (en).

Priority Applications (1)

Application Number: CN202211613376.0A | Priority Date: 2022-12-15 | Filing Date: 2022-12-15 | Title: Industry classification method and device for company profile


Publications (1)

Publication Number: CN115718889A (en) | Publication Date: 2023-02-28

Family

ID=85257741

Family Applications (1)

CN202211613376.0A (Pending): CN115718889A (en), Industry classification method and device for company profile

Country Status (1)

CN: CN115718889A (en)

Cited By (1)

* Cited by examiner, † Cited by third party

CN116680619A * (priority date 2023-07-28, publication date 2023-09-01, 江西中医药大学): Method and device for predicting decoction time classification, electronic equipment and storage medium


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination