CN111428028A - Information classification method based on deep learning and related equipment - Google Patents

Information classification method based on deep learning and related equipment

Info

Publication number
CN111428028A
Authority
CN
China
Prior art keywords
classification
information
data
identified
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010142300.9A
Other languages
Chinese (zh)
Inventor
金美芝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202010142300.9A
Publication of CN111428028A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of data analysis, in particular to an information classification method based on deep learning and related equipment, which comprises the following steps: acquiring the data quantity of information to be identified, determining the clustering mode of the information to be identified according to the data quantity, and preprocessing the information to be identified by applying the clustering mode to obtain pre-classified data; performing word vector conversion on the pre-classified data to obtain word vectors of the pre-classified data; inputting the word vectors of the pre-classified data into a deep learning model for text feature extraction to obtain a plurality of text features; classifying each text feature to obtain a classification result of the text features; and scoring the classification result by applying a voting mechanism, and determining the classification label of the information to be identified according to the scoring result. The method and the device effectively solve the problem that the content of the original information cannot be accurately reflected when text feature extraction is carried out with a deep learning model.

Description

Information classification method based on deep learning and related equipment
Technical Field
The application relates to the technical field of data analysis, in particular to an information classification method based on deep learning and related equipment.
Background
Usually, a person can clearly understand the multiple intents expressed in a text, but it is difficult for a robot to recognize all of them. As a result, the answers given by the robot are incomplete, the client cannot obtain a complete and satisfactory answer from the robot, and an incorrect answer may even be returned because the multiple intents expressed by the client were not understood. This brings an extremely poor experience and reduces client satisfaction, so recognizing multiple intents is an important problem to be solved in robot customer service.
At present, multi-intent recognition is mainly performed by text classification. However, data imbalance during text classification means that the content of the original information cannot be accurately reflected when text features are extracted with a deep learning model.
Disclosure of Invention
Based on the above, an information classification method based on deep learning and related equipment are provided for solving the problem that the content of original information cannot be accurately reflected when text feature extraction is performed by applying a deep learning model due to the problem of data imbalance during text classification at present.
An information classification method based on deep learning comprises the following steps:
acquiring the data quantity of information to be identified, determining the clustering mode of the information to be identified according to the data quantity, and preprocessing the information to be identified by applying the clustering mode to obtain pre-classified data;
performing word vector conversion on the pre-classified data to obtain word vectors of the pre-classified data;
inputting the word vectors of the pre-classified data into a preset deep learning model for text feature extraction to obtain a plurality of text features;
classifying the text features to obtain a classification result of the text features;
and scoring the classification result by using a preset voting mechanism to obtain a scoring result, and determining the classification label of the information to be identified according to the scoring result.
In one possible embodiment, the obtaining the data quantity of the information to be identified, determining a clustering mode of the information to be identified according to the data quantity, and preprocessing the information to be identified by applying the clustering mode to obtain pre-classified data includes:
comparing the data quantity with a preset data quantity threshold, if the data quantity is greater than the data quantity threshold, determining that the information to be identified is large sample data, otherwise, determining that the information to be identified is small sample data;
if the information to be identified is large sample data, clustering the large sample data by applying a clustering algorithm after removing noise points and isolated points in the large sample data to obtain pre-classified data;
if the information to be identified is small sample data, clustering similar samples in the small sample data by using a clustering algorithm to generate a plurality of clusters, and processing the data in each cluster by respectively adopting a genetic crossover algorithm to obtain the pre-classification data.
In one possible embodiment, the performing word vector conversion on the pre-classified data to obtain a word vector of the pre-classified data includes:
acquiring a preset word vector embedding model, and dividing the pre-classified data into a plurality of sentences according to the attribute of the word vector embedding model;
inputting the sentence into the word vector embedding model for mapping to obtain an initial text word vector;
and calculating the characteristic value of the initial text word vector, deleting the initial text word vector with the characteristic value of zero, summarizing the rest initial text word vectors, and obtaining the word vector of the pre-classified data.
In one possible embodiment, the inputting of the word vectors of the pre-classified data into a preset deep learning model for text feature extraction to obtain a plurality of text features includes:
inputting a preset standard word vector into an input layer in a preset recurrent neural network model, performing probability prediction on the word vector processed by the input layer through a hidden layer in the recurrent neural network model to obtain a probability prediction result, and converting the probability prediction result by using an output layer in the recurrent neural network model to obtain a predicted keyword;
and comparing the predicted keyword with the keyword corresponding to the standard word vector, if the predicted keyword is consistent with the keyword corresponding to the standard word vector, adding the word vectors of the pre-classified data into the recurrent neural network model for feature extraction, and otherwise, changing the parameters in the hidden layer for re-prediction until the predicted keyword is consistent with the keyword corresponding to the standard word vector.
In one possible embodiment, the classifying the text features to obtain a classification result of the text features includes:
obtaining classifiers of different categories, and establishing a classifier sub-tree according to the hierarchical relation among the classifiers;
inputting the text features into a root node of the classifier subtree, performing primary classification to obtain a primary classification result, and inputting the primary classification result into a next-level leaf node of the root node;
taking the next-level leaf node as a new root node to continue classifying until the next-level leaf node is the minimum leaf node;
and summarizing the classification result of the minimum leaf node to obtain the classification result of the text features.
In one possible embodiment, the scoring the classification result by using a preset voting mechanism to obtain a scoring result, and determining the classification label of the information to be identified according to the scoring result includes:
obtaining the classification accuracy of the end classifier corresponding to each minimum leaf node, and taking the classification accuracy as the weight of the end classifier;
voting and scoring the classification labels output by the end classifier by applying the voting mechanism by taking the weights as auxiliary parameters;
and extracting the classification label with the voting score larger than a score threshold value as the classification label of the information to be identified.
In one possible embodiment, the continuing the classification with the next-level leaf node as a new root node until the next-level leaf node is a minimum leaf node includes:
acquiring the similarity among the output results of all leaf nodes at any level, and extracting target leaf nodes corresponding to a plurality of output results with the similarity larger than a similarity threshold value as root nodes of next-level classification;
and acquiring a node label corresponding to the target leaf node, and inputting the node label into a next-level classifier for classification until a root node of the next-level classification is the minimum leaf node.
An information classification device based on deep learning comprises the following modules:
the pre-classification module is used for acquiring the data quantity of the information to be identified, determining the clustering mode of the information to be identified according to the data quantity, and preprocessing the information to be identified by applying the clustering mode to obtain pre-classification data;
the word vector module is used for carrying out word vector conversion on the pre-classified data to obtain word vectors of the pre-classified data;
the feature extraction module is used for inputting the word vectors of the pre-classified data into a preset deep learning model to extract text features so as to obtain a plurality of text features;
the result generation module is used for classifying the text features to obtain the classification result of the text features;
and the label generation module is set to score the classification result by applying a preset voting mechanism to obtain a scoring result, and determine the classification label of the information to be identified according to the scoring result.
A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the above deep learning based information classification method.
A storage medium having stored thereon computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the above-described deep learning-based information classification method.
Compared with existing mechanisms, the method and device acquire the data quantity of the information to be identified, determine the clustering mode of the information to be identified according to the data quantity, and preprocess the information to be identified by applying the clustering mode to obtain pre-classified data; perform word vector conversion on the pre-classified data to obtain word vectors of the pre-classified data; input the word vectors of the pre-classified data into a preset deep learning model for text feature extraction to obtain a plurality of text features; classify the text features to obtain a classification result of the text features; and score the classification result by using a preset voting mechanism to obtain a scoring result, and determine the classification label of the information to be identified according to the scoring result. This effectively solves the problem that, due to data imbalance, the content of the original information cannot be accurately reflected when text feature extraction is carried out with a deep learning model.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application.
FIG. 1 is a flowchart illustrating an overall method for deep learning based information classification according to an embodiment of the present application;
FIG. 2 is a diagram illustrating a pre-classification process in an information classification method based on deep learning according to an embodiment of the present application;
FIG. 3 is a diagram illustrating a result generation process in an information classification method based on deep learning according to an embodiment of the present application;
fig. 4 is a block diagram of an information classification apparatus based on deep learning according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Fig. 1 is an overall flowchart of an information classification method based on deep learning according to an embodiment of the present application, where the information classification method based on deep learning includes the following steps:
s1, acquiring the data quantity of the information to be identified, determining the clustering mode of the information to be identified according to the data quantity, and preprocessing the information to be identified by applying the clustering mode to obtain pre-classified data;
specifically, when information identification is performed, whether information to be identified is a simple graph or a multiple graph is to be determined, where a simple graph means that a text sentence only includes one graph, such as: "I want to listen to Zhou Jieren's song", this language intent can be attributed to a musical intent, while multiple intentions means that multiple intentions can be contained in the text utterance, such as: the intention of buying apple can be concluded as fruit intention, and the intention of buying fruit can also be concluded as electronic intention, i think of buying apple mobile phone, then the intention of the sentence in the conversation is judged according to the past search record or context information of the user, and the best answer is returned preferentially. And then, when the data amount of the information to be identified is counted, a natural language identification algorithm is needed to classify the single sentence intentions and the multiple sentence intention diagrams in the information to be identified, the single sentence corresponding to each single sentence intention is used as 1 data, and the multiple sentences corresponding to each multiple sentence intention diagram are used as 1 data together. When clustering is carried out, the used clustering algorithm is mainly a K-mean algorithm and a coacervation hierarchical clustering algorithm.
S2, performing word vector conversion on the pre-classified data to obtain word vectors of the pre-classified data;
specifically, word vector conversion commonly used in Wordvec2, which uses BERT model as word vector embedding model in this step, can convert word vectors of text by using BERT model, 1, installing BERT model, (1) installing BERT on server side, (2) installing BERT on client side, 2, starting service, executing the following codes of BERT-providing-start-model _ dir/tmp/engli sh _ L-12 _ H-768_ A-12-num _ worker ═ 4, where/tmp/englishh _ L-12 _ H-768_ A-12/path of downloaded model 3, text vectorization using python script, executing code of from BERT _ providing
S3, inputting the word vectors of the pre-classified data into a preset deep learning model for text feature extraction to obtain a plurality of text features;
Specifically, the deep learning model selected for text feature extraction is usually a convolutional neural network model or a recurrent neural network model. Before text features are extracted from the word vectors, the deep learning model is trained first; only when the feature extraction accuracy of the trained deep learning model is greater than a preset threshold can it be used to extract text features from the pre-classified word vectors. In this step, the term frequency-inverse document frequency (TF-IDF) algorithm can also be adopted for text feature extraction.
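A minimal TF-IDF sketch follows, assuming scikit-learn; the example texts are illustrative and not taken from the application.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["buy an apple phone", "listen to a song", "buy some apples"]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts)    # document-term TF-IDF matrix
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(3))
```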
S4, classifying the text features to obtain a classification result of the text features;
specifically, the text feature classification uses a computer to automatically classify and mark a text set (or other entities or objects) according to a certain classification system or standard. Commonly used text feature classification methods are a naive bayes classification method, a decision tree method, an SVM support vector machine and the like. When the text features are classified, a classifier needs to be trained, and only the result classified by the verified classifier can be used as a reliable result for application.
The decision tree method is to classify text features by applying a decision tree model, firstly input the text features into a root node of the decision tree model for first classification, then perform second classification by using first leaf nodes, and so on until the minimum leaf node of the decision tree model. The decision tree model can classify the text features from coarse to fine step by step, so that a more accurate classification result is obtained. If the text characteristics are: and (3) the root node of the orange in the decision tree model is a 'creature', the first leaf node is a 'plant', and the minimum leaf node is a 'fruit'.
And S5, scoring the classification result by using a preset voting mechanism to obtain a scoring result, and determining the classification label of the information to be identified according to the scoring result.
The voting mechanism (voting) is a combination strategy for classification problems in ensemble learning. Its basic idea is to select the class output most often among all the machine learning algorithms used. A machine learning classification algorithm produces one of two kinds of output: class labels directly, or class probabilities. Voting on class labels is called hard voting (majority voting), and voting on class probabilities is called soft voting. In this step, a soft voting mechanism is adopted, and weights are added to the vote as auxiliary parameters, so that the classification label can be obtained more reliably.
The soft voting mechanism may take the following steps: first, obtain the class probabilities output by each machine learning algorithm used, such as 50% for class A, 30% for class B and 20% for class C; then, after the weight of each classifier is obtained, calculate the weighted average value of each class, such as 0.3 for class A, 0.5 for class B and 0.2 for class C, and select the class with the largest value (here, class B).
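A minimal sketch of the weighted soft vote described above, assuming three hypothetical classifiers whose class probabilities and accuracy-derived weights are made-up values:

```python
import numpy as np

# Rows: one classifier each; columns: classes A, B, C (hypothetical probabilities).
class_probs = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.3, 0.5, 0.2],
])
# Weights taken from each classifier's validation accuracy (hypothetical values).
weights = np.array([0.8, 0.9, 0.7])

# Weighted average probability of each class; the largest one wins.
weighted = np.average(class_probs, axis=0, weights=weights)
labels = ["A", "B", "C"]
print(dict(zip(labels, weighted.round(3))), "->", labels[int(np.argmax(weighted))])
```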
In the embodiment, information needing to be subjected to intention identification is subjected to pre-classification processing, and different processing modes are adopted according to different types, so that the problem that the content of original information cannot be accurately reflected when text feature extraction is carried out by applying a deep learning model due to the unbalanced data can be effectively solved.
Fig. 2 is a schematic diagram of a pre-classification process in an information classification method based on deep learning in an embodiment of the present application, as shown in the drawing, in step S1, acquiring a data quantity of information to be identified, determining a clustering mode of the information to be identified according to the data quantity, and applying the clustering mode to pre-process the information to be identified to obtain pre-classification data, where the pre-classification process includes:
s11, comparing the data quantity with a preset data quantity threshold, if the data quantity is larger than the data quantity threshold, determining that the information to be identified is large sample data, otherwise, determining that the information to be identified is small sample data;
in general, indexes such as median, mode, average value, and the like are used as the data volume threshold, and if the sample data volume is larger than the threshold, the sample is determined as a large sample, otherwise, the sample is determined as a small sample.
S12, if the information to be identified is large sample data, clustering the large sample data by applying a clustering algorithm after removing noise points and isolated points in the large sample data to obtain pre-classified data;
specifically, the large sample data is data conforming to normal distribution, and when the large sample data is clustered, special symbols such as a special symbol in the large sample data are required; ",". And removing non-character noise points such as 'and the like', and if points which do not conform to normal distribution exist in the large sample data, taking the points as isolated points, removing the points in advance and then clustering the points.
And S13, if the information to be identified is small sample data, clustering similar samples in the small sample data by using a clustering algorithm to generate a plurality of clusters, and processing the data in each cluster by respectively using a genetic crossover algorithm to obtain the pre-classified data.
Specifically, small sample data contains less feature information than large sample data, the machine learns less feature information from a small sample, and the calculated joint probability is relatively smaller. Therefore, a clustering algorithm such as K-means cannot be directly adopted to classify and count the target objects, as this would seriously affect the classification accuracy. When the genetic crossover algorithm is applied, the crossover operator adopted is a uniform crossover operator: crossover points are randomly generated in two parent individuals A and B, and genes are then exchanged according to randomly generated integers 0, 1 and 2, thereby forming two new individuals and completing the crossover.
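The following is a minimal sketch of a uniform crossover operator for augmenting a small-sample cluster; encoding each individual as a token list and using a fixed swap probability are assumptions not specified in the application.

```python
import random

def uniform_crossover(parent_a, parent_b, swap_prob=0.5):
    """Swap corresponding genes of two parents with probability swap_prob."""
    child_a, child_b = list(parent_a), list(parent_b)
    for i in range(min(len(child_a), len(child_b))):
        if random.random() < swap_prob:
            child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b

a = ["buy", "an", "apple", "phone"]
b = ["buy", "some", "fresh", "apples"]
print(uniform_crossover(a, b))             # two new individuals for the cluster
```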
In the embodiment, the sample classification is performed on the data to be analyzed, and different clustering algorithms are adopted, so that the inaccuracy of information classification caused by data imbalance is avoided.
In one embodiment, the performing word vector conversion on the pre-classified data to obtain a word vector of the pre-classified data includes:
acquiring a preset word vector embedding model, and dividing the pre-classified data into a plurality of sentences according to the attribute of the word vector embedding model;
in particular, different word vector embedding models have different limitations on the length of the characters to be word vector converted.
Inputting the sentence into the word vector embedding model for mapping to obtain an initial text word vector;
and calculating the characteristic value of the initial text word vector, deleting the initial text word vector with the characteristic value of zero, summarizing the rest initial text word vectors, and obtaining the word vector of the pre-classified data.
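A minimal sketch of the zero-feature-value filtering step, assuming the "characteristic value" of a word vector is its L2 norm (the application does not define it precisely):

```python
import numpy as np

def filter_word_vectors(initial_vectors):
    vectors = np.asarray(initial_vectors, dtype=float)
    feature_values = np.linalg.norm(vectors, axis=1)   # one characteristic value per vector
    return vectors[feature_values > 0]                  # drop vectors whose value is zero

print(filter_word_vectors([[0.2, 0.1], [0.0, 0.0], [0.4, 0.3]]))
```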
In this embodiment, the word vectors are embedded by using BERT; the model adopts the Transformer sequence architecture, is bidirectional, can obtain semantic representations at the sentence level rather than only at the word level, and has the advantages of strong generality and good effect.
In one embodiment, the inputting of the word vectors of the pre-classified data into a preset deep learning model for text feature extraction to obtain a plurality of text features includes:
inputting a preset standard word vector into an input layer in a preset recurrent neural network model, performing probability prediction on the word vector processed by the input layer through a hidden layer in the recurrent neural network model to obtain a probability prediction result, and converting the probability prediction result by using an output layer in the recurrent neural network model to obtain a predicted keyword;
the recurrent neural network model comprises an input layer, a hidden layer and an output layer, wherein the input layer is used for receiving data, the hidden layer is used for processing the data, and the output layer is used for outputting the result. Wherein, a series of processing will be carried out to the data in the hidden layer, mainly including: gradient truncation, regularization, gating, etc. The data is effectively processed through the hidden layer.
And comparing the predicted keyword with the keyword corresponding to the standard word vector, if the predicted keyword is consistent with the keyword corresponding to the standard word vector, adding the word vectors of the pre-classified data into the recurrent neural network model for feature extraction, and otherwise, changing the parameters in the hidden layer for re-prediction until the predicted keyword is consistent with the keyword corresponding to the standard word vector.
In the embodiment, text feature extraction is performed on the word vectors through the deep learning model, so that the accuracy of the text features is ensured.
Fig. 3 is a schematic diagram of a result generation process in an information classification method based on deep learning according to an embodiment of the present application, where as shown in the drawing, the S4 classifies the text features to obtain a classification result of the text features, where the classification result includes:
s41, obtaining classifiers of different categories, and establishing a classifier sub-tree according to the hierarchical relationship among the classifiers;
specifically, a probability calculation method may be adopted in hierarchical classification, as follows:
P(n_k | d_i) = ∏_j p(n_j | d_i) × a_k
wherein P(n_k|d_i) represents the probability that document d_i finally arrives at node n_k after classification, p(n_j|d_i) represents the probability that document d_i passes through the ancestor node n_j of n_k on the way, and a_k represents a penalty factor.
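As a toy numeric check of this formula, with made-up path probabilities and penalty factor:

```python
ancestor_probs = [0.9, 0.8, 0.7]   # p(n_j | d_i) along the path to node n_k (hypothetical)
penalty = 0.95                      # a_k (hypothetical)

p_final = 1.0
for p in ancestor_probs:
    p_final *= p
p_final *= penalty
print(round(p_final, 4))            # 0.9 * 0.8 * 0.7 * 0.95 = 0.4788
```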
S42, inputting the text features to a root node of the classifier subtree, carrying out primary classification to obtain a primary classification result, and inputting the primary classification result to a next-level leaf node of the root node;
s43, continuing to classify by taking the next-level leaf node as a new root node until the next-level leaf node is the minimum leaf node;
specifically, the similarity between the output results of all leaf nodes at any level is obtained, and the target leaf nodes corresponding to a plurality of output results with the similarity greater than a similarity threshold are extracted as the root nodes of the next-level classification;
and acquiring a node label corresponding to the target leaf node, and inputting the node label into a next-level classifier for classification until a root node of the next-level classification is the minimum leaf node.
The similarity threshold is set according to the type of classifier. For example, a first-level classifier judges the probabilities that the text belongs to the task type, the chat type and the FAQ type; if the calculated probabilities are 0.8, 0.9 and 0.3 respectively and the set threshold is 0.5, the text data is judged to belong to the task type and the chat type. Since the chat type is already a leaf node, no further judgment is made for it. The next-level classifier then calculates the probabilities that the text belongs to the customer-service task type and the maintenance task type; if these are 0.88 and 0.21 respectively, the final result is the customer-service task type and the chat type.
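A minimal sketch of this threshold-based routing; the two-level hierarchy and the hard-coded probabilities stand in for real classifier outputs and are assumptions for illustration only.

```python
def route(probabilities, threshold=0.5):
    """Return the labels whose probability exceeds the threshold."""
    return [label for label, p in probabilities.items() if p > threshold]

level_one = {"task": 0.8, "chat": 0.9, "FAQ": 0.3}
results = []
for label in route(level_one):
    if label == "task":                                  # non-leaf: descend to the next level
        level_two = {"customer-service task": 0.88, "maintenance task": 0.21}
        results.extend(route(level_two))
    else:                                                # leaf node (e.g. chat): keep as-is
        results.append(label)
print(results)                                           # ['customer-service task', 'chat']
```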
And S44, summarizing the classification result of the minimum leaf node to obtain the classification result of the text feature.
In this embodiment, by adopting the improved hierarchical classification method, the problem that a classification error at an upper level makes all subsequent classifications wrong is avoided.
In an embodiment, the scoring the classification result by using a preset voting mechanism to obtain a scoring result, and determining the classification label of the information to be identified according to the scoring result includes:
obtaining the classification accuracy of the end classifier corresponding to each minimum leaf node, and taking the classification accuracy as the weight of the end classifier;
the calculation formula of the classification accuracy rate is as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
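A short sketch of this accuracy computation, with illustrative confusion counts:

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(tp=80, tn=90, fp=10, fn=20))   # 0.85, used as the end classifier's weight
```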
Voting and scoring the classification labels output by the end classifier by applying the voting mechanism by taking the weights as auxiliary parameters;
and extracting the classification label with the voting score larger than a score threshold value as the classification label of the information to be identified.
According to the embodiment, the classification result is effectively scored by using a voting mechanism, so that the accuracy of the classification label is greatly improved.
The technical features mentioned in any of the above corresponding embodiments or implementations are also applicable to the embodiment corresponding to fig. 4 in the present application, and the details of the subsequent similarities are not repeated.
In the above, the information classification method based on deep learning of the present application has been described; an information classification apparatus for performing the above deep learning-based classification is described below.
A structure diagram of an information classification apparatus based on deep learning, which is applicable to information classification based on deep learning, is shown in fig. 4. The deep learning-based information classification apparatus in the embodiment of the present application can implement the steps corresponding to the deep learning-based information classification method performed in the embodiment corresponding to fig. 1 described above. The functions realized by the deep learning-based information classification device can be realized by hardware, and can also be realized by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions, which may be software and/or hardware.
In one embodiment, an information classification apparatus based on deep learning is provided, as shown in fig. 4, including the following modules:
the pre-classification module 10 is configured to acquire the data quantity of the information to be identified, determine the clustering mode of the information to be identified according to the data quantity, and pre-process the information to be identified by applying the clustering mode to obtain pre-classification data;
a word vector module 20 configured to perform word vector conversion on the pre-classified data to obtain a word vector of the pre-classified data;
the feature extraction module 30 is configured to add the word vectors of the pre-classified data into a preset deep learning model to perform text feature extraction, so as to obtain a plurality of text features;
a result generation module 40 configured to classify the text features to obtain a classification result of the text features;
and the label generating module 50 is configured to score the classification result by applying a preset voting mechanism to obtain a scoring result, and determine the classification label of the information to be identified according to the scoring result.
In one embodiment, a computer device is provided, the computer device includes a memory and a processor, the memory stores computer readable instructions, and when executed by the processor, the processor executes the steps of the deep learning based information classification method in the above embodiments.
In one embodiment, a storage medium storing computer-readable instructions is provided, which when executed by one or more processors, cause the one or more processors to perform the steps of the deep learning based information classification method in the above embodiments. The storage medium may be a nonvolatile storage medium or a volatile storage medium, and the present application is not limited in particular.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic or optical disk, or the like.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-described embodiments merely express some embodiments of the present application and are described in relative detail, but they are not to be construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, and these fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. An information classification method based on deep learning is characterized by comprising the following steps:
acquiring the data quantity of information to be identified, determining the clustering mode of the information to be identified according to the data quantity, and preprocessing the information to be identified by applying the clustering mode to obtain pre-classified data;
performing word vector conversion on the pre-classified data to obtain word vectors of the pre-classified data;
inputting the word vectors of the pre-classified data into a preset deep learning model for text feature extraction to obtain a plurality of text features;
classifying the text features to obtain a classification result of the text features;
and scoring the classification result by using a preset voting mechanism to obtain a scoring result, and determining the classification label of the information to be identified according to the scoring result.
2. The information classification method based on deep learning of claim 1, wherein the obtaining of the data quantity of the information to be identified, the determining of the clustering mode of the information to be identified according to the data quantity, and the preprocessing of the information to be identified by applying the clustering mode to obtain pre-classification data comprises:
comparing the data quantity with a preset data quantity threshold, if the data quantity is greater than the data quantity threshold, determining that the information to be identified is large sample data, otherwise, determining that the information to be identified is small sample data;
if the information to be identified is large sample data, clustering the large sample data by applying a clustering algorithm after removing noise points and isolated points in the large sample data to obtain pre-classified data;
if the information to be identified is small sample data, clustering similar samples in the small sample data by using a clustering algorithm to generate a plurality of clusters, and processing the data in each cluster by respectively adopting a genetic crossover algorithm to obtain the pre-classification data.
3. The method for information classification based on deep learning of claim 1, wherein the performing word vector transformation on the pre-classified data to obtain a word vector of the pre-classified data comprises:
acquiring a preset word vector embedding model, and dividing the pre-classified data into a plurality of sentences according to the attribute of the word vector embedding model;
inputting the sentence into the word vector embedding model for mapping to obtain an initial text word vector;
and calculating the characteristic value of the initial text word vector, deleting the initial text word vector with the characteristic value of zero, summarizing the rest initial text word vectors, and obtaining the word vector of the pre-classified data.
4. The information classification method based on deep learning of claim 1, wherein the step of inputting the word vector of the pre-classification data into a preset deep learning model for text feature extraction to obtain a plurality of text features comprises:
inputting a preset standard word vector into an input layer in a preset recurrent neural network model, performing probability prediction on the word vector processed by the input layer through a hidden layer in the recurrent neural network model to obtain a probability prediction result, and converting the probability prediction result by using an output layer in the recurrent neural network model to obtain a predicted keyword;
and comparing the predicted keyword with the keyword corresponding to the standard word vector, if the predicted keyword is consistent with the keyword corresponding to the standard word vector, adding the word vectors of the pre-classified data into the recurrent neural network model for feature extraction, and otherwise, changing the parameters in the hidden layer for re-prediction until the predicted keyword is consistent with the keyword corresponding to the standard word vector.
5. The information classification method based on deep learning according to any one of claims 1 to 4, wherein the classifying the text features to obtain a classification result of the text features includes:
obtaining classifiers of different categories, and establishing a classifier sub-tree according to the hierarchical relation among the classifiers;
inputting the text features into a root node of the classifier subtree, performing primary classification to obtain a primary classification result, and inputting the primary classification result into a next-level leaf node of the root node;
taking the next-level leaf node as a new root node to continue classifying until the next-level leaf node is the minimum leaf node;
and summarizing the classification result of the minimum leaf node to obtain the classification result of the text features.
6. The information classification method based on deep learning of claim 5, wherein the applying a preset voting mechanism to score the classification result to obtain a scoring result, and determining the classification label of the information to be identified according to the scoring result comprises:
obtaining the classification accuracy of the end classifier corresponding to each minimum leaf node, and taking the classification accuracy as the weight of the end classifier;
voting and scoring the classification labels output by the end classifier by applying the voting mechanism by taking the weights as auxiliary parameters;
and extracting the classification label with the voting score larger than a score threshold value as the classification label of the information to be identified.
7. The method for classifying information based on deep learning according to claim 5, wherein the classifying continues with the next-level leaf node as a new root node until the next-level leaf node is a minimum leaf node, including:
acquiring the similarity among the output results of all leaf nodes at any level, and extracting target leaf nodes corresponding to a plurality of output results with the similarity larger than a similarity threshold value as root nodes of next-level classification;
and acquiring a node label corresponding to the target leaf node, and inputting the node label into a next-level classifier for classification until a root node of the next-level classification is the minimum leaf node.
8. An information classification device based on deep learning is characterized by comprising the following modules:
the pre-classification module is used for acquiring the data quantity of the information to be identified, determining the clustering mode of the information to be identified according to the data quantity, and preprocessing the information to be identified by applying the clustering mode to obtain pre-classification data;
the word vector module is used for carrying out word vector conversion on the pre-classified data to obtain word vectors of the pre-classified data;
the feature extraction module is used for inputting the word vectors of the pre-classified data into a preset deep learning model to extract text features so as to obtain a plurality of text features;
the result generation module is used for classifying the text features to obtain the classification result of the text features;
and the label generation module is set to score the classification result by applying a preset voting mechanism to obtain a scoring result, and determine the classification label of the information to be identified according to the scoring result.
9. A computer device comprising a memory and a processor, the memory having stored therein computer-readable instructions, which, when executed by the processor, cause the processor to carry out the steps of the deep learning based information classification method according to any one of claims 1 to 7.
10. A storage medium storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the deep learning based information classification method according to any one of claims 1 to 7.
CN202010142300.9A 2020-03-04 2020-03-04 Information classification method based on deep learning and related equipment Pending CN111428028A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010142300.9A CN111428028A (en) 2020-03-04 2020-03-04 Information classification method based on deep learning and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010142300.9A CN111428028A (en) 2020-03-04 2020-03-04 Information classification method based on deep learning and related equipment

Publications (1)

Publication Number Publication Date
CN111428028A true CN111428028A (en) 2020-07-17

Family

ID=71547408

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010142300.9A Pending CN111428028A (en) 2020-03-04 2020-03-04 Information classification method based on deep learning and related equipment

Country Status (1)

Country Link
CN (1) CN111428028A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112133306B (en) * 2020-08-03 2023-10-03 浙江百世技术有限公司 Response method and device based on express delivery user and computer equipment
CN112133306A (en) * 2020-08-03 2020-12-25 浙江百世技术有限公司 Response method and device based on express delivery user and computer equipment
CN112632274A (en) * 2020-10-29 2021-04-09 中科曙光南京研究院有限公司 Abnormal event classification method and system based on text processing
CN112632274B (en) * 2020-10-29 2024-04-26 中科曙光南京研究院有限公司 Abnormal event classification method and system based on text processing
CN112287084A (en) * 2020-10-30 2021-01-29 国网江苏省电力有限公司营销服务中心 Question-answering method and system based on ensemble learning
CN112329877A (en) * 2020-11-16 2021-02-05 山西三友和智慧信息技术股份有限公司 Voting mechanism-based web service classification method and system
CN112632222A (en) * 2020-12-25 2021-04-09 海信视像科技股份有限公司 Terminal equipment and method for determining data belonging field
CN112632222B (en) * 2020-12-25 2023-02-03 海信视像科技股份有限公司 Terminal equipment and method for determining data belonging field
CN112988954A (en) * 2021-05-17 2021-06-18 腾讯科技(深圳)有限公司 Text classification method and device, electronic equipment and computer-readable storage medium
CN116204645A (en) * 2023-03-02 2023-06-02 北京数美时代科技有限公司 Intelligent text classification method, system, storage medium and electronic equipment
CN116204645B (en) * 2023-03-02 2024-02-20 北京数美时代科技有限公司 Intelligent text classification method, system, storage medium and electronic equipment
CN116738343A (en) * 2023-08-08 2023-09-12 云筑信息科技(成都)有限公司 Material data identification method and device for construction industry and electronic equipment
CN116738343B (en) * 2023-08-08 2023-10-20 云筑信息科技(成都)有限公司 Material data identification method and device for construction industry and electronic equipment

Similar Documents

Publication Publication Date Title
CN111428028A (en) Information classification method based on deep learning and related equipment
CN108304372B (en) Entity extraction method and device, computer equipment and storage medium
CN106951422B (en) Webpage training method and device, and search intention identification method and device
CN108197282B (en) File data classification method and device, terminal, server and storage medium
CN112417863B (en) Chinese text classification method based on pre-training word vector model and random forest algorithm
CN112800170A (en) Question matching method and device and question reply method and device
CN110619051B (en) Question sentence classification method, device, electronic equipment and storage medium
CN107229627B (en) Text processing method and device and computing equipment
CN112732871B (en) Multi-label classification method for acquiring client intention labels through robot induction
CN108027814B (en) Stop word recognition method and device
CN112035599B (en) Query method and device based on vertical search, computer equipment and storage medium
CN112735383A (en) Voice signal processing method, device, equipment and storage medium
CN113254643B (en) Text classification method and device, electronic equipment and text classification program
CN112347223B (en) Document retrieval method, apparatus, and computer-readable storage medium
CN110287314B (en) Long text reliability assessment method and system based on unsupervised clustering
CN113821605B (en) Event extraction method
Bouguila A model-based approach for discrete data clustering and feature weighting using MAP and stochastic complexity
CN112131876A (en) Method and system for determining standard problem based on similarity
CN111985228A (en) Text keyword extraction method and device, computer equipment and storage medium
CN113505200A (en) Sentence-level Chinese event detection method combining document key information
CN115048464A (en) User operation behavior data detection method and device and electronic equipment
CN111325033B (en) Entity identification method, entity identification device, electronic equipment and computer readable storage medium
CN114491079A (en) Knowledge graph construction and query method, device, equipment and medium
CN112579781B (en) Text classification method, device, electronic equipment and medium
CN110413770B (en) Method and device for classifying group messages into group topics

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination