WO2021051518A1 - Text data classification method and apparatus based on neural network model, and storage medium - Google Patents

Text data classification method and apparatus based on neural network model, and storage medium Download PDF

Info

Publication number
WO2021051518A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
text data
neural network
classification
vector
Prior art date
Application number
PCT/CN2019/116931
Other languages
French (fr)
Chinese (zh)
Inventor
金戈
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021051518A1 publication Critical patent/WO2021051518A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a method, device and computer-readable storage medium for text data classification based on a neural network model.
  • the prior art mainly constructs a text classification model based on word-frequency features in the text, and then classifies the text to be classified based on the constructed text classification model.
  • because the word frequency in the text cannot effectively reflect the text category, the prior art usually has the problem of inaccurate text classification.
  • This application provides a method, device and computer-readable storage medium for text classification based on a neural network model, the main purpose of which is to provide an accurate text data classification scheme.
  • a text classification method based on a neural network model includes: collecting text data and performing preprocessing operations on the text data to obtain preprocessed text data; converting the preprocessed text data into a text vector; using a BP neural network classification model optimized by a decision tree to perform feature selection on the text vector to obtain initial text features; according to the initial text features obtained above, training the BP neural network classification model with the stochastic gradient descent algorithm and the fine-tuning method until the best text features are obtained; and, according to the best text features, using a classifier to classify the text data and outputting the classification result of the text data.
  • the present application also provides a text classification device based on a neural network model.
  • the device includes a memory and a processor.
  • the memory stores a text classification program based on a neural network model that can be run on the processor.
  • when the text classification program based on the neural network model is executed by the processor, the following steps are implemented: collecting text data and performing preprocessing operations on the text data to obtain preprocessed text data;
  • converting the preprocessed text data into text vectors;
  • using the BP neural network classification model optimized by a decision tree to perform feature selection on the text vectors to obtain initial text features;
  • according to the initial text features obtained above, training the BP neural network classification model with the stochastic gradient descent algorithm and the fine-tuning method until the best text features are obtained; and, according to the best text features, using a classifier to classify the text data and outputting the classification result of the text data.
  • the present application also provides a computer-readable storage medium on which a text classification program based on a neural network model is stored, and the text classification program based on a neural network model can be executed by one or more processors to implement the steps of the text classification method based on the neural network model as described above.
  • the text classification method, device, and computer-readable storage medium based on a neural network model proposed in this application use the BP neural network classification model optimized by decision trees to perform feature selection on text data to obtain initial text features, train the BP neural network classification model with the stochastic gradient descent algorithm and the fine-tuning method to obtain the best text features, and use a classifier to classify the text data according to the best text features.
  • This application obtains the most representative text features of the text data by training the BP neural network classification model. Classifying text based on these features overcomes shortcomings of traditional text classification methods such as low classification accuracy; therefore, this application can achieve rapid and accurate text classification.
  • FIG. 1 is a schematic flowchart of a text classification method based on a neural network model provided by an embodiment of the application;
  • FIG. 2 is a schematic diagram of the internal structure of a text classification device based on a neural network model provided by an embodiment of the application;
  • FIG. 3 is a schematic diagram of modules of a text classification program based on a neural network model in a text classification device based on a neural network model provided by an embodiment of the application.
  • This application provides a text classification method based on a neural network model.
  • FIG. 1 is a schematic flowchart of a text classification method based on a neural network model provided by an embodiment of this application.
  • the method can be executed by a device, and the device can be implemented by software and/or hardware.
  • the text classification method based on the neural network model includes:
  • S1: Collect text data, perform a preprocessing operation on the text data to obtain preprocessed text data, and convert the preprocessed text data into a text vector.
  • the preferred embodiment of the present application can collect the text data from the Internet, such as a news website, a shopping website, a paper database, or various forums.
  • the embodiment of the present application performs preprocessing operations including word segmentation, stop word removal, feature weight calculation, and deduplication on the text data.
  • the word segmentation method described in the embodiment of the present application includes matching the text data with entries in a pre-built dictionary according to a predetermined strategy to obtain the words in the text data.
  • the selected method for removing stop words is stop word list filtering, i.e. matching the words in the text data against a constructed stop word list; if the match succeeds, the word is a stop word and is deleted.
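For illustration, the following minimal sketch implements dictionary-based segmentation (here, a simple forward-maximum-matching strategy) followed by stop-word-list filtering. The toy dictionary, stop-word list, sample sentence, and the choice of forward maximum matching are assumptions; the application only requires matching against a pre-built dictionary under a predetermined strategy.

```python
# Hypothetical dictionary and stop-word list, standing in for the pre-built resources.
DICTIONARY = {"neural", "network", "neural network", "text", "classification", "model"}
STOP_WORDS = {"the", "a", "of", "and", "for"}
MAX_ENTRY_LEN = 2  # longest dictionary entry, counted in tokens

def segment(tokens, dictionary, max_len=MAX_ENTRY_LEN):
    """Forward maximum matching: greedily take the longest dictionary entry."""
    words, i = [], 0
    while i < len(tokens):
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size])
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Delete any word that matches an entry of the stop-word list."""
    return [w for w in words if w.lower() not in stop_words]

raw = "the neural network model for text classification".split()
print(remove_stop_words(segment(raw, DICTIONARY)))
# -> ['neural network', 'model', 'text', 'classification']
```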
  • the text data is represented by a series of feature words (keywords); however, data in this textual form cannot be processed directly by the classification algorithm and must be converted into numerical form, so the weight of each feature word is calculated to represent its importance in the text.
  • the embodiment of the application uses the TF-IDF algorithm to perform feature word calculation.
  • the TF-IDF algorithm here uses statistical information, word vector information, and inter-word dependency syntax information, builds a dependency graph to calculate the correlation strength between words, and uses the TextRank algorithm to iteratively calculate the importance score of each word.
  • len(Wi, Wj) represents the length of the dependency path between the words Wi and Wj;
  • b is a hyperparameter;
  • tfidf(W) is the TF-IDF value of the word W;
  • d is the Euclidean distance between the word vectors of the words Wi and Wj.
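A minimal sketch of the base TF-IDF weighting is given below; it covers only the basic term-frequency times inverse-document-frequency value and not the dependency-graph or TextRank extensions described above. The toy corpus is invented for illustration.

```python
import math
from collections import Counter

# Toy corpus standing in for the segmented, stop-word-filtered text data.
docs = [
    ["neural", "network", "text", "classification"],
    ["text", "data", "classification", "model"],
    ["random", "forest", "text", "model"],
]

def tfidf(docs):
    """Return one {word: tf-idf weight} dict per document."""
    n_docs = len(docs)
    df = Counter(word for doc in docs for word in set(doc))  # document frequency
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            w: (tf[w] / len(doc)) * math.log(n_docs / df[w])  # tf * idf
            for w in tf
        })
    return weights

for per_doc in tfidf(docs):
    print(per_doc)
```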
  • the Euclidean distance method is first used to de-duplicate the text before the text is classified.
  • the formula is as follows:
  • w1j and w2j are the j-th components of the two text data vectors, respectively.
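The deduplication step can be sketched as below: pairwise Euclidean distances between the numerical text vectors are compared against a preset threshold, and one member of each near-duplicate pair is dropped. The toy vectors and the threshold value are assumptions for illustration.

```python
import numpy as np

def deduplicate(vectors, threshold=0.5):
    """Keep one representative of every pair whose Euclidean distance is below threshold."""
    keep = []
    for i, v in enumerate(vectors):
        duplicate = any(np.linalg.norm(v - vectors[j]) < threshold for j in keep)
        if not duplicate:
            keep.append(i)
    return keep

texts = np.array([
    [0.9, 0.1, 0.0],
    [0.88, 0.12, 0.01],   # near-duplicate of the first text vector
    [0.1, 0.8, 0.3],
])
print(deduplicate(texts))   # -> [0, 2]
```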
  • a preferred embodiment of the present application further includes a text hierarchical encoder using a zoom neural network to perform encoding processing on the preprocessed text data to obtain an encoded text vector.
  • the text hierarchical encoder has three layers, namely a text embedding layer and two bi-LSTM layers, wherein the text embedding layer initializes the words with word2vec to obtain word vectors, the first bi-LSTM layer receives the word vectors as input and generates sentence vectors, and the second bi-LSTM layer receives the sentence vectors as input and generates paragraph vectors.
  • after the first bi-LSTM layer takes each word as input, it outputs a hidden state vector at each time step; a maximum pooling operation is then used to obtain a fixed-length sentence vector, and all sentence vectors are taken together as the sentence component of the hierarchical memory.
  • this application uses a similar approach, using the second bi-LSTM layer and the maximum pooling operation to convert sentence components into paragraph vectors.
  • through hierarchical encoding, this application assigns a vector representation (hierarchical distributed memory) to each language unit at each level and retains the boundary information of its sentence and segment divisions, from which text vectors including word vectors, sentence vectors, and paragraph vectors are obtained.
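A rough PyTorch sketch of the three-layer hierarchical encoder (embedding layer plus two bi-LSTM layers with max pooling) follows. The dimensions, the randomly initialised embedding standing in for word2vec, and the exact pooling arrangement are assumptions for illustration only, not the exact architecture of the application.

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Text embedding layer + two bi-LSTM layers, pooled into sentence/paragraph vectors."""
    def __init__(self, vocab_size=1000, emb_dim=100, hidden=64):
        super().__init__()
        # In the described scheme the embeddings would be initialised from word2vec.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.sent_lstm = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)

    def forward(self, doc_word_ids):
        # doc_word_ids: (num_sentences, words_per_sentence) tensor of word ids
        words = self.embed(doc_word_ids)                 # (S, W, E)
        word_states, _ = self.word_lstm(words)           # hidden state at each time step
        sent_vecs = word_states.max(dim=1).values        # max pooling -> sentence vectors
        sent_states, _ = self.sent_lstm(sent_vecs.unsqueeze(0))
        para_vec = sent_states.max(dim=1).values         # max pooling -> paragraph vector
        return sent_vecs, para_vec.squeeze(0)

doc = torch.randint(0, 1000, (4, 12))                   # 4 sentences, 12 word ids each
sentences, paragraph = HierarchicalEncoder()(doc)
print(sentences.shape, paragraph.shape)                 # torch.Size([4, 128]) torch.Size([128])
```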
  • this application uses a BP-neural-network-based method for feature selection and takes the sensitivity δ of the state Y to changes in the feature X as the measure for evaluating text features, namely the partial derivative of Y with respect to X.
  • the BP neural network is a multi-layer feedforward neural network.
  • the main characteristics of the network are signal forward transmission and error back propagation.
  • in forward transmission, the input signal is processed layer by layer from the input layer through the hidden layer to the output layer.
  • the neuron state of each layer only affects the neuron state of the next layer. If the output layer does not produce the expected output, the network switches to back propagation and adjusts the network weights and thresholds according to the prediction error, so that the predicted output of the network keeps approaching the expected output.
  • the BP neural network described in this application includes the following structure:
  • Input layer: the only data entry point of the entire neural network.
  • the number of neuron nodes in the input layer equals the dimension of the numerical text vector.
  • the value of each neuron corresponds to one component of that numerical vector;
  • Hidden layer: mainly used to perform non-linear processing on the data coming from the input layer. Non-linear fitting of the input data through the activation function effectively ensures the predictive ability of the model;
  • Output layer: following the hidden layer, it is the only output of the entire model. The number of neuron nodes in the output layer equals the number of text categories.
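A minimal NumPy sketch of the three-layer structure just described is given below: the input layer width matches the text-vector dimension, the output layer width matches the number of categories, and the hidden layer output O_q and output-layer values y_j are produced by a sigmoid activation. All sizes and the choice of sigmoid are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class BPNetwork:
    """Forward pass of a 3-layer BP network: input -> hidden -> output."""
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (n_in, n_hidden))   # input-to-hidden weights
        self.b1 = np.zeros(n_hidden)                     # hidden thresholds (theta_q)
        self.W2 = rng.normal(0, 0.1, (n_hidden, n_out))  # hidden-to-output weights
        self.b2 = np.zeros(n_out)

    def forward(self, x):
        o_hidden = sigmoid(x @ self.W1 + self.b1)        # hidden-layer outputs O_q
        y = sigmoid(o_hidden @ self.W2 + self.b2)        # output-layer outputs y_j
        return y

text_vector = np.random.rand(50)                        # numerical text vector (n = 50)
net = BPNetwork(n_in=50, n_hidden=8, n_out=3)           # 3 text categories (m = 3)
print(net.forward(text_vector))
```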
  • this application uses a decision tree to optimize the BP neural network.
  • the length of the longest rule chain of the decision tree is taken as the number of hidden layer nodes of the BP neural network to optimize the structure of the neural network, that is, the depth of the decision tree is taken as the number of hidden layer nodes of the BP neural network.
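The decision-tree optimisation can be sketched as follows: a tree is fitted to the text vectors and its depth (the length of the longest rule chain) is read off as the number of hidden-layer nodes. The use of scikit-learn's DecisionTreeClassifier and the random toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy text vectors (rows = documents) and category labels.
X = np.random.rand(200, 50)
y = np.random.randint(0, 3, size=200)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
n_hidden_nodes = tree.get_depth()    # depth of the longest rule chain -> hidden layer size
print("hidden layer nodes:", n_hidden_nodes)
```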
  • the preferred embodiment of this application constructs a 3-layer BP neural network, where n units in the input layer correspond to n feature parameters, and m units in the output layer correspond to m pattern classifications.
  • the number of units in the middle hidden layer is q; one set of weights represents the connections between input layer unit i and hidden layer unit q, another set represents the connections between hidden layer unit q and output layer unit j, and θq is the threshold of each hidden layer unit. The output Oq of the q-th hidden layer unit is obtained by passing the weighted sum of its inputs, offset by θq, through the activation function;
  • the output yj of the j-th unit of the output layer is obtained in the same way from the hidden layer outputs Oq.
  • the sensitivity δij, and the corresponding quantity δkj for the difference between the text features Xi and Xk, are determined by the chain rule for the partial derivatives of the composite function.
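One way to read the sensitivity criterion is as the gradient of the network output with respect to each input feature; the autograd-based sketch below scores features that way and keeps the highest-scoring ones. The toy network, the random data, and the use of a mean absolute gradient are assumptions, not the exact computation of the application.

```python
import torch
import torch.nn as nn

# A stand-in feed-forward classifier; in practice this would be the trained BP model.
net = nn.Sequential(nn.Linear(50, 16), nn.Sigmoid(), nn.Linear(16, 3))

x = torch.rand(200, 50, requires_grad=True)     # 200 text vectors with 50 features each
net(x).sum().backward()                         # back-propagate to the inputs

sensitivity = x.grad.abs().mean(dim=0)          # delta: average |dY/dX| per feature
top_features = torch.topk(sensitivity, k=10).indices
print("selected feature indices:", top_features.tolist())
```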
  • the fine-tuning method extracts the shallow features of an available neural network, modifies the parameters of the deeper layers, and builds a new neural network model, reducing the number of iterations so that the best BP neural network classification model is obtained more quickly.
  • the process of training the BP neural network classification model is as follows:
  • the loss function is used to evaluate the difference between the predicted value output by the network model and the true value Y. A non-negative real-valued function L(θ) is used to represent the loss; the smaller the loss value, the better the performance of the network model. In the loss expression:
  • m is the number of text data samples;
  • hθ(x(i)) is the predicted value for the i-th text sample;
  • y(i) is the true value of the i-th text sample.
  • for a neuron node, when the input value is below 0 the output is limited (suppressed); when the input rises above the threshold, the independent variable of the function has a linear relationship with the dependent variable.
  • here x represents the accumulated value of the back-propagated gradient and of the descending gradient.
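A small sketch of the two quantities just described is given below: a mean-squared-error style loss over the m text samples and a ReLU-style activation that is clipped below zero and linear above it. The exact loss expression is not reproduced in the text, so this form is an assumption consistent with the listed symbols.

```python
import numpy as np

def loss(h, y):
    """Non-negative loss over m samples; smaller means a better-performing model (assumed MSE form)."""
    m = len(y)
    return np.sum((h - y) ** 2) / (2 * m)

def relu(x):
    """Output limited to 0 below the threshold, linear in the input above it."""
    return np.maximum(0.0, x)

h = np.array([0.9, 0.2, 0.7])    # predicted values h_theta(x_i)
y = np.array([1.0, 0.0, 1.0])    # true values y_i
print(loss(h, y), relu(np.array([-1.5, 0.3, 2.0])))
```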
  • the gradient descent algorithm is the most commonly used optimization algorithm for training neural network models. To find the minimum of the loss function L, the variable y is updated in the direction opposite to the gradient, i.e. along -dL/dy, so that the loss decreases fastest until it converges to the minimum.
  • each time a batch of batch-size samples is input, the learning rate is reduced as the gradient decreases; and after each epoch, the decay rate is increased in accordance with the reduction of the learning rate.
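The stochastic gradient descent loop with a decaying learning rate can be sketched as below on a simple least-squares problem; the decay schedule, batch size, and synthetic data are illustrative assumptions rather than the application's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.01 * rng.normal(size=500)

w = np.zeros(10)
lr, decay, batch_size = 0.1, 0.05, 32

for epoch in range(20):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        grad = X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)   # dL/dw on the mini-batch
        w -= lr * grad                         # step opposite to the gradient
    lr *= 1.0 / (1.0 + decay * (epoch + 1))    # reduce the learning rate after each epoch

print("final squared error:", float(np.mean((X @ w - y) ** 2)))
```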
  • during fine-tuning, this application first adjusts the parameters in the network layers, deletes the FC (fully connected) layer and adjusts the learning rate; because the last layer is relearned, it needs a faster learning rate than the other layers.
  • the learning rates of the weights and biases of this layer are increased by a factor of 10, and the learning strategy is left unchanged.
  • the solver parameters are modified to match the reduced size of the text data: the step size is changed from 100,000 to 20,000 and the maximum number of iterations is reduced accordingly, so that the optimized BP neural network classification model can be obtained with fewer iterations, and the optimized model is then used to obtain the best text features.
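The per-layer learning-rate adjustment described for fine-tuning (a replaced last layer trained roughly 10x faster than the reused layers) might look like the PyTorch sketch below; the network shape, the replacement of the final fully connected layer, and the base learning rate are all assumptions for illustration.

```python
import torch.nn as nn
import torch.optim as optim

# A stand-in pretrained network: two reused layers plus a final FC layer.
net = nn.Sequential(
    nn.Linear(50, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 3),          # this last FC layer is removed and relearned
)
net[-1] = nn.Linear(16, 5)     # new output layer for the new category set

base_lr = 0.001
optimizer = optim.SGD(
    [
        {"params": [p for layer in net[:-1] for p in layer.parameters()], "lr": base_lr},
        {"params": net[-1].parameters(), "lr": 10 * base_lr},   # 10x faster last layer
    ],
    momentum=0.9,
)
print(optimizer)
```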
  • a preferred embodiment of the present application uses a random forest algorithm as a classifier to classify the collected text data according to the best text characteristics.
  • the random forest algorithm uses the bagging algorithm with replacement sampling to extract multiple sample subsets from the original samples and trains a decision tree model on each subset; during training, the random feature subspace method extracts a subset of features from the feature set for splitting each decision tree. Finally, the multiple decision trees are integrated into an ensemble classifier, and this ensemble classifier is called a random forest.
  • the algorithm process can be divided into three parts, the generation of the sub-sample set, the construction of the decision tree, and the voting results. The specific process is as follows:
  • Random forest is an ensemble classifier. For each base classifier, a certain sample subset needs to be generated as the input variable of the base classifier.
  • in this embodiment, the text data is divided by cross-validation.
  • cross-validation divides the original text into k sub-text data sets according to the number of pages; in each training round, one of the sub-text data sets is used as the test set, the remaining sub-text data sets are used as the training set, and k such rotations are performed.
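The k-fold rotation used to build per-tree training subsets can be sketched with scikit-learn's KFold as below; splitting "by the number of pages" is approximated here by splitting the document indices evenly, which is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

documents = np.arange(20)          # indices of the collected text data
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

for round_id, (train_idx, test_idx) in enumerate(kfold.split(documents)):
    # Each rotation: one fold is the test set, the remaining folds form the training set.
    print(f"round {round_id}: train={train_idx.tolist()} test={test_idx.tolist()}")
```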
  • each base classifier is an independent decision tree.
  • the most important thing in the construction of the decision tree is the split rule.
  • the split rule tries to find an optimal feature to divide the sample to improve the accuracy of the final classification.
  • the decision tree of the random forest is constructed in basically the same way as an ordinary decision tree; the difference is that when a decision tree in the random forest is split, it does not search the entire feature set but randomly selects k features for the division.
  • the sub-text features obtained above are used as the sub-nodes of the decision tree, and the lower nodes are the respective extracted features.
  • Voting produces results.
  • the classification result of the random forest is obtained by voting among the base classifiers, i.e. the decision trees. The random forest treats all base classifiers equally: each decision tree produces a classification result, the text classification results of all decision trees are collected and summed, and the result with the highest number of votes is the final text classification result, so that the text is effectively classified.
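A compact end-to-end sketch of this voting stage follows, using scikit-learn's RandomForestClassifier (bagging plus a random feature subset at each split, with the majority vote across trees as the prediction); the toy feature matrix and parameter values are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy "best text features" (rows = documents) and their category labels.
X = np.random.rand(300, 20)
y = np.random.randint(0, 3, size=300)

forest = RandomForestClassifier(
    n_estimators=50,      # number of decision-tree base classifiers
    max_features="sqrt",  # random feature subset considered at each split
    bootstrap=True,       # bagging: sample subsets drawn with replacement
    random_state=0,
).fit(X, y)

new_docs = np.random.rand(5, 20)
print(forest.predict(new_docs))   # majority vote over all trees, per document
```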
  • This application also provides a text classification device based on the neural network model.
  • FIG. 2 is a schematic diagram of the internal structure of a text classification device based on a neural network model provided by an embodiment of this application.
  • the text classification device 1 based on the neural network model may be a PC (personal computer), or a terminal device such as a smart phone, a tablet computer, or a portable computer.
  • the text classification device 1 based on the neural network model at least includes a memory 11, a processor 12, a communication bus 13, and a network interface 14.
  • the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, and the like.
  • the memory 11 may be an internal storage unit of the text classification device 1 based on the neural network model, for example, the hard disk of the text classification device 1 based on the neural network model.
  • the memory 11 may also be an external storage device of the text classification device 1 based on a neural network model, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card equipped on the text classification device 1.
  • the memory 11 may also include both an internal storage unit of the text classification apparatus 1 based on a neural network model and an external storage device.
  • the memory 11 can be used not only to store the application software installed in the text classification device 1 based on the neural network model and various data, such as the code of the text classification program 01 based on the neural network model, but also to temporarily store data that has been output or is to be output.
  • the processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, and is used to run the program code or process the data stored in the memory 11, for example, to execute the text classification program 01 based on the neural network model.
  • the communication bus 13 is used to realize the connection and communication between these components.
  • the network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is usually used to establish a communication connection between the apparatus 1 and other electronic devices.
  • the device 1 may also include a user interface.
  • the user interface may include a display (Display) and an input unit such as a keyboard (Keyboard).
  • the optional user interface may also include a standard wired interface and a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, etc.
  • the display can also be appropriately called a display screen or a display unit, which is used to display the information processed in the text classification device 1 based on the neural network model and to display a visualized user interface.
  • Fig. 2 only shows the text classification device 1 based on a neural network model with components 11-14 and the text classification program 01 based on a neural network model. Those skilled in the art can understand that the structure shown in Fig. 2 does not constitute a limitation of the text classification device 1 based on the neural network model, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
  • the memory 11 stores a text classification program 01 based on a neural network model; the processor 12 implements the following steps when executing the text classification program 01 based on a neural network model stored in the memory 11:
  • Step 1: Collect text data, perform pre-processing operations on the text data to obtain pre-processed text data, and convert the pre-processed text data into text vectors.
  • the preferred embodiment of the present application can collect the text data from the Internet, such as a news website, a shopping website, a paper database, or various forums.
  • the embodiment of the present application performs preprocessing operations including word segmentation, stop word removal, feature weight calculation, and deduplication on the text data.
  • the word segmentation method described in the embodiment of the present application includes matching the text data with entries in a pre-built dictionary according to a predetermined strategy to obtain the words in the text data.
  • the selected method for removing stop words is stop word list filtering, i.e. matching the words in the text data against a constructed stop word list; if the match succeeds, the word is a stop word and is deleted.
  • the text data is represented by a series of feature words (keywords); however, data in this textual form cannot be processed directly by the classification algorithm and must be converted into numerical form, so the weight of each feature word is calculated to represent its importance in the text.
  • the embodiment of the application uses the TF-IDF algorithm to perform feature word calculation.
  • the TF-IDF algorithm here uses statistical information, word vector information, and inter-word dependency syntax information, builds a dependency graph to calculate the correlation strength between words, and uses the TextRank algorithm to iteratively calculate the importance score of each word.
  • len(Wi, Wj) represents the length of the dependency path between the words Wi and Wj;
  • b is a hyperparameter;
  • tfidf(W) is the TF-IDF value of the word W;
  • d is the Euclidean distance between the word vectors of the words Wi and Wj.
  • the Euclidean distance method is first used to de-duplicate the text before the text is classified.
  • the formula is as follows:
  • w1j and w2j are the j-th components of the two text data vectors, respectively.
  • a preferred embodiment of the present application further includes a text hierarchical encoder using a zoom neural network to perform encoding processing on the preprocessed text data to obtain an encoded text vector.
  • the text hierarchical encoder has three layers, namely a text embedding layer and two bi-LSTM layers, wherein the text embedding layer initializes the words with word2vec to obtain word vectors, the first bi-LSTM layer receives the word vectors as input and generates sentence vectors, and the second bi-LSTM layer receives the sentence vectors as input and generates paragraph vectors.
  • after the first bi-LSTM layer takes each word as input, it outputs a hidden state vector at each time step; a maximum pooling operation is then used to obtain a fixed-length sentence vector, and all sentence vectors are taken together as the sentence component of the hierarchical memory.
  • this application uses a similar approach, using the second bi-LSTM layer and the maximum pooling operation to convert sentence components into paragraph vectors.
  • through hierarchical encoding, this application assigns a vector representation (hierarchical distributed memory) to each language unit at each level and retains the boundary information of its sentence and segment divisions, from which text vectors including word vectors, sentence vectors, and paragraph vectors are obtained.
  • Step 2: Use the BP neural network classification model optimized by the decision tree to perform feature selection on the text vector, so as to obtain the text features.
  • this application uses a BP-neural-network-based method for feature selection and takes the sensitivity δ of the state Y to changes in the feature X as the measure for evaluating text features, namely the partial derivative of Y with respect to X.
  • the BP neural network is a multi-layer feedforward neural network.
  • the main characteristics of the network are signal forward transmission and error back propagation.
  • in forward transmission, the input signal is processed layer by layer from the input layer through the hidden layer to the output layer.
  • the neuron state of each layer only affects the neuron state of the next layer. If the output layer does not produce the expected output, the network switches to back propagation and adjusts the network weights and thresholds according to the prediction error, so that the predicted output of the network keeps approaching the expected output.
  • the BP neural network described in this application includes the following structure:
  • Input layer: the only data entry point of the entire neural network.
  • the number of neuron nodes in the input layer equals the dimension of the numerical text vector.
  • the value of each neuron corresponds to one component of that numerical vector;
  • Hidden layer: mainly used to perform non-linear processing on the data coming from the input layer. Non-linear fitting of the input data through the activation function effectively ensures the predictive ability of the model;
  • Output layer: following the hidden layer, it is the only output of the entire model. The number of neuron nodes in the output layer equals the number of text categories.
  • this application uses a decision tree to optimize the BP neural network.
  • the length of the longest rule chain of the decision tree is taken as the number of hidden layer nodes of the BP neural network to optimize the structure of the neural network, that is, the depth of the decision tree is taken as the number of hidden layer nodes of the BP neural network.
  • the preferred embodiment of this application constructs a 3-layer BP neural network, where n units in the input layer correspond to n feature parameters, and m units in the output layer correspond to m pattern classifications.
  • the number of units in the middle hidden layer is q; one set of weights represents the connections between input layer unit i and hidden layer unit q, another set represents the connections between hidden layer unit q and output layer unit j, and θq is the threshold of each hidden layer unit. The output Oq of the q-th hidden layer unit is obtained by passing the weighted sum of its inputs, offset by θq, through the activation function;
  • the output yj of the j-th unit of the output layer is obtained in the same way from the hidden layer outputs Oq;
  • the sensitivity δij, and the corresponding quantity δkj for the difference between the text features Xi and Xk, are determined by the chain rule for the partial derivatives of the composite function.
  • Step 3: Use the stochastic gradient descent algorithm and the fine-tuning method to train the BP neural network classification model until the best text features are obtained; then, according to the best text features, use the classifier to classify the text data and output the classification result of the target text.
  • the fine-tuning method extracts the shallow features of an available neural network, modifies the parameters of the deeper layers, and builds a new neural network model, reducing the number of iterations so that the best BP neural network classification model is obtained more quickly.
  • the process of training the BP neural network classification model is as follows:
  • the loss function is used to evaluate the difference between the predicted value output by the network model and the true value Y. A non-negative real-valued function L(θ) is used to represent the loss; the smaller the loss value, the better the performance of the network model. In the loss expression:
  • m is the number of text data samples;
  • hθ(x(i)) is the predicted value for the i-th text sample;
  • y(i) is the true value of the i-th text sample.
  • for a neuron node, when the input value is below 0 the output is limited (suppressed); when the input rises above the threshold, the independent variable of the function has a linear relationship with the dependent variable.
  • here x represents the accumulated value of the back-propagated gradient and of the descending gradient.
  • the gradient descent algorithm is the most commonly used optimization algorithm for training neural network models. To find the minimum of the loss function L, the variable y is updated in the direction opposite to the gradient, i.e. along -dL/dy, so that the loss decreases fastest until it converges to the minimum.
  • each time a batch of batch-size samples is input, the learning rate is reduced as the gradient decreases; and after each epoch, the decay rate is increased in accordance with the reduction of the learning rate.
  • during fine-tuning, this application first adjusts the parameters in the network layers, deletes the FC (fully connected) layer and adjusts the learning rate; because the last layer is relearned, it needs a faster learning rate than the other layers.
  • the learning rates of the weights and biases of this layer are increased by a factor of 10, and the learning strategy is left unchanged.
  • the solver parameters are modified to match the reduced size of the text data: the step size is changed from 100,000 to 20,000 and the maximum number of iterations is reduced accordingly, so that the optimized BP neural network classification model can be obtained with fewer iterations, and the optimized model is then used to obtain the best text features.
  • a preferred embodiment of the present application uses a random forest algorithm as a classifier, and performs text classification on the collected text data according to the best text feature.
  • the random forest algorithm uses the bagging algorithm with replacement sampling to extract multiple sample subsets from the original samples and trains a decision tree model on each subset; during training, the random feature subspace method extracts a subset of features from the feature set for splitting each decision tree. Finally, the multiple decision trees are integrated into an ensemble classifier, and this ensemble classifier is called a random forest.
  • the algorithm process can be divided into three parts, the generation of the sub-sample set, the construction of the decision tree, and the voting results. The specific process is as follows:
  • Random forest is an ensemble classifier. For each base classifier, a certain sample subset needs to be generated as the input variable of the base classifier.
  • in this embodiment, the text data is divided by cross-validation.
  • cross-validation divides the original text into k sub-text data sets according to the number of pages; in each training round, one of the sub-text data sets is used as the test set, the remaining sub-text data sets are used as the training set, and k such rotations are performed.
  • each base classifier is an independent decision tree.
  • the most important thing in the construction of the decision tree is the split rule.
  • the split rule tries to find an optimal feature to divide the sample to improve the accuracy of the final classification.
  • the decision tree of the random forest is constructed in basically the same way as an ordinary decision tree; the difference is that when a decision tree in the random forest is split, it does not search the entire feature set but randomly selects k features for the division.
  • the sub-text features obtained above are used as the sub-nodes of the decision tree, and the lower nodes are the respective extracted features.
  • Voting produces results.
  • the classification result of the random forest is obtained by voting among the base classifiers, i.e. the decision trees. The random forest treats all base classifiers equally: each decision tree produces a classification result, the text classification results of all decision trees are collected and summed, and the result with the highest number of votes is the final text classification result, so that the text is effectively classified.
  • the text classification program based on the neural network model can also be divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (in this embodiment, by the processor 12) to complete this application.
  • a module referred to in this application is a series of computer program instruction segments that can complete specific functions and is used to describe the execution process of the text classification program based on the neural network model in the text classification device based on the neural network model.
  • FIG. 3 is a schematic diagram of the program modules of the text classification program based on a neural network model in an embodiment of the text classification device based on a neural network model of this application.
  • the text classification program based on the neural network model can be divided into a sample collection module 10, a feature extraction module 20, and a text classification module 30.
  • the sample collection module 10 is used to collect text data, perform preprocessing operations on the text data, obtain preprocessed text data, and convert the preprocessed text data into text vectors.
  • the preprocessing operation on the text data includes:
  • said converting the text data into a text vector includes:
  • the text hierarchical encoder of the zoom neural network is used to encode the preprocessed text data to obtain the encoded text vector, wherein the text hierarchical encoder includes a text embedding layer and two bi-LSTM layers.
  • the text embedding layer initializes the words with word2vec to obtain word vectors.
  • the first bi-LSTM layer receives word vectors as input and generates sentence vectors, and the second bi-LSTM layer receives sentence vectors as input and generates paragraph vectors.
  • the feature extraction module 20 is configured to: use a BP neural network classification model optimized based on a decision tree to perform feature selection on the text vector to obtain an initial text feature.
  • using the BP neural network classification model optimized based on a decision tree to perform feature selection on the text vector to obtain text features includes:
  • constructing a BP neural network whose n input layer units correspond to n feature parameters;
  • whose m output layer units correspond to m pattern classifications;
  • and computing the output yj of the j-th output layer unit from the hidden layer outputs.
  • the text classification module 30 is configured to train the BP neural network classification model with the stochastic gradient descent algorithm and the fine-tuning method according to the initial text features obtained above until the best text features are obtained, and, according to the best text features, to classify the text data with a classifier and output the classification result of the text data.
  • the classifier is a random forest classifier
  • the using a classifier to classify text data includes:
  • cross-validation divides the original text data into k sub-text data sets according to the number of pages;
  • in each round, one of the sub-text data sets is used as the test set, the rest of the sub-text data sets are used as the training set, and k rotations are performed;
  • the text classification results of all decision trees are collected for cumulative summation, and the result with the highest number of votes is the final text classification result.
  • an embodiment of the present application also proposes a computer-readable storage medium that stores a text classification program based on a neural network model, and the text classification program based on a neural network model can be executed by one or more processors to implement the following operations:
  • the text data is classified by a classifier, and the classification result of the text data is output.

Abstract

The present application relates to the technical field of artificial intelligence. Disclosed is a text classification method based on a neural network model. The method comprises: collecting text data, and performing a pre-processing operation on the text data to obtain pre-processed text data; converting the pre-processed text data into a text vector; using a BP neural network classification model based on decision tree optimization to perform feature selection on the text vector, so as to obtain an initial text feature; according to the obtained initial text feature, using a stochastic gradient descent algorithm and a fine-tuning method to train the BP neural network classification model until the best text feature is obtained; and according to the best text feature, using a classifier to classify the text data, and outputting a classification result of the text data. Further provided are a text classification apparatus based on a neural network model, and a computer-readable storage medium. The present application can realize the precise classification of text data.

Description

Text data classification method, apparatus and storage medium based on a neural network model
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 17, 2019, with application number 201910885586.7 and the invention title "Text data classification method, apparatus and storage medium based on a neural network model", the entire contents of which are incorporated into this application by reference.
Technical field
This application relates to the field of artificial intelligence technology, and in particular to a text data classification method, apparatus and computer-readable storage medium based on a neural network model.
Background
With the rapid development of network technology, the requirements for effectively organizing and managing electronic text information and for finding relevant information from it quickly, accurately and comprehensively are becoming ever higher. As a key technology for processing and organizing large amounts of text data, text classification solves the problem of information clutter to a large extent and helps users obtain the required information accurately; it is the technical foundation of fields such as information filtering, information retrieval, search engines and text databases.
The prior art mainly constructs a text classification model based on word-frequency features in the text and then classifies the text to be classified based on the constructed text classification model. However, because the word frequency in the text cannot effectively reflect the text category, the prior art usually has the problem of inaccurate text classification.
Summary of the invention
This application provides a text classification method, apparatus and computer-readable storage medium based on a neural network model, the main purpose of which is to provide an accurate classification scheme for text data.
To achieve the above purpose, a text classification method based on a neural network model provided by this application includes: collecting text data and performing preprocessing operations on the text data to obtain preprocessed text data; converting the preprocessed text data into text vectors; using a BP neural network classification model optimized by a decision tree to perform feature selection on the text vectors to obtain initial text features; training the BP neural network classification model with the stochastic gradient descent algorithm and the fine-tuning method according to the initial text features obtained above, until the best text features are obtained; and, according to the best text features, classifying the text data with a classifier and outputting the classification result of the text data.
In addition, to achieve the above purpose, this application also provides a text classification apparatus based on a neural network model. The apparatus includes a memory and a processor, and the memory stores a text classification program based on a neural network model that can be run on the processor. When the program is executed by the processor, the following steps are implemented: collecting text data and performing preprocessing operations on the text data to obtain preprocessed text data; converting the preprocessed text data into text vectors; using a BP neural network classification model optimized by a decision tree to perform feature selection on the text vectors to obtain initial text features; training the BP neural network classification model with the stochastic gradient descent algorithm and the fine-tuning method according to the initial text features obtained above, until the best text features are obtained; and, according to the best text features, classifying the text data with a classifier and outputting the classification result of the text data.
In addition, to achieve the above purpose, this application also provides a computer-readable storage medium on which a text classification program based on a neural network model is stored, and the program can be executed by one or more processors to implement the steps of the text classification method based on the neural network model as described above.
The text classification method, apparatus and computer-readable storage medium based on a neural network model proposed in this application use a BP neural network classification model optimized by decision trees to perform feature selection on text data to obtain initial text features, train the BP neural network classification model with the stochastic gradient descent algorithm and the fine-tuning method to obtain the best text features, and classify the text data with a classifier according to the best text features. By training the BP neural network classification model, this application obtains the most representative text features of the text data; classifying text based on these features overcomes shortcomings of traditional text classification methods such as low classification accuracy, so this application can achieve rapid and accurate text classification.
Description of the drawings
FIG. 1 is a schematic flowchart of a text classification method based on a neural network model provided by an embodiment of this application;
FIG. 2 is a schematic diagram of the internal structure of a text classification device based on a neural network model provided by an embodiment of this application;
FIG. 3 is a schematic diagram of the modules of the text classification program based on a neural network model in a text classification device based on a neural network model provided by an embodiment of this application.
The realization, functional characteristics and advantages of the purpose of this application will be further described with reference to the embodiments and the accompanying drawings.
Detailed description
In order to make the purpose, technical solutions and advantages of this application clearer, this application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative work fall within the protection scope of this application.
The terms "first", "second", "third", "fourth", etc. (if any) in the description, claims and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way can be interchanged under appropriate circumstances, so that the embodiments described here can be implemented in an order other than that illustrated or described. In addition, descriptions such as "first" and "second" are only for descriptive purposes and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features; thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature.
Further, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion: for example, a process, method, system, product or device that includes a series of steps or units is not necessarily limited to the steps or units clearly listed, but may include other steps or units that are not clearly listed or that are inherent to the process, method, product or device.
In addition, the technical solutions of the various embodiments can be combined with each other, but only on the basis that a person of ordinary skill in the art can realize them; when a combination of technical solutions is contradictory or cannot be realized, it should be considered that such a combination does not exist and is not within the protection scope claimed by this application.
This application provides a text classification method based on a neural network model. Referring to FIG. 1, which is a schematic flowchart of a text classification method based on a neural network model provided by an embodiment of this application, the method can be executed by a device, and the device can be implemented by software and/or hardware.
In this embodiment, the text classification method based on the neural network model includes:
S1. Collect text data, perform preprocessing operations on the text data to obtain preprocessed text data, and convert the preprocessed text data into a text vector.
A preferred embodiment of this application can collect the text data from the Internet, for example from news websites, shopping websites, paper databases or various forums.
The text data is unstructured or semi-structured data and cannot be directly recognized by a classification algorithm. Therefore, the purpose of the preprocessing operation in the preferred embodiment of this application is to convert the text data into a vector space model: d_i = (w_1, w_2, ..., w_n), where w_j is the weight of the j-th feature item.
The embodiment of this application performs preprocessing operations on the text data including word segmentation, stop word removal, feature weight calculation and deduplication.
The word segmentation method described in the embodiment of this application includes matching the text data against entries in a pre-built dictionary according to a predetermined strategy to obtain the words in the text data.
In the embodiment of this application, the selected method for removing stop words is stop word list filtering, i.e. matching the words in the text data against a constructed stop word list; if the match succeeds, the word is a stop word and is deleted.
After word segmentation and stop word removal, the text data is represented by a series of feature words (keywords); however, data in this textual form cannot be processed directly by a classification algorithm and must be converted into numerical form, so the weight of each feature word is calculated to represent its importance in the text.
The embodiment of this application uses the TF-IDF algorithm for the feature word calculation. The TF-IDF algorithm here uses statistical information, word vector information and inter-word dependency syntax information, builds a dependency graph to calculate the correlation strength between words, and uses the TextRank algorithm to iteratively calculate the importance score of each word.
In detail, when calculating the weights of the feature words, this application first calculates the dependency association degree Dep(Wi, Wj) of any two words Wi and Wj, where len(Wi, Wj) denotes the length of the dependency path between the words Wi and Wj and b is a hyperparameter.
This application considers that the semantic similarity between two words alone cannot accurately measure their importance: only when at least one of the two words appears with high frequency in the text can the two words be shown to be important. Following the concept of universal gravitation, the word frequency is regarded as mass, the Euclidean distance between the two words' word vectors is regarded as distance, and the attraction between the two words is computed according to the law of gravitation. However, in the present text setting, using word frequency alone to measure the importance of a word is too one-sided, so this application introduces the IDF value and replaces the word frequency with the TF-IDF value, thereby taking more global information into account. This yields a new word-gravitation formula; the attraction between the text words Wi and Wj is
f_grav(Wi, Wj) = tfidf(Wi) * tfidf(Wj) / d^2
where tfidf(W) is the TF-IDF value of the word W and d is the Euclidean distance between the word vectors of Wi and Wj.
Therefore, the degree of association between the words Wi and Wj is:
weight(Wi, Wj) = Dep(Wi, Wj) * f_grav(Wi, Wj)
Finally, this application uses the TextRank algorithm to build an undirected graph G = (V, E), where V is the set of vertices and E is the set of edges, and the score WS(Wi) of a word Wi is calculated iteratively from the association weights of the set of vertices related to the vertex Wi, with η as the damping coefficient. This yields the feature weight WS(Wi), and each word is thus represented in numerical vector form.
Furthermore, because the collected text data comes from intricate sources, it may contain many duplicated texts. A large amount of duplicated data affects the classification accuracy, so in the embodiment of this application the Euclidean distance method is first used to de-duplicate the texts before classification, with the formula
d = sqrt( Σ_j (w1j - w2j)^2 )
where w1j and w2j are the j-th components of the two text data vectors, respectively. After the Euclidean distance between every two text data is calculated, a smaller Euclidean distance indicates more similar text data, and one of any two text data whose Euclidean distance is smaller than a preset threshold is deleted.
Further, a preferred embodiment of this application also uses the hierarchical text encoder of a zoom neural network to encode the preprocessed text data and obtain the encoded text vector.

In this embodiment, the hierarchical text encoder has three layers: a word embedding layer and two bi-LSTM layers. The word embedding layer initializes the words with word2vec to obtain word vectors; the first bi-LSTM layer receives the word vectors as input and generates sentence vectors, and the second bi-LSTM layer receives the sentence vectors as input and generates paragraph vectors.

In detail, the first bi-LSTM layer takes each word as input and outputs a hidden state vector at each time step; a max-pooling operation is then applied to obtain a fixed-length sentence vector, and all sentence vectors are used as the sentence components of the hierarchical memory. The formulas used are:

[Formula images PCTCN2019116931-appb-000006 and -000007: the bi-LSTM hidden-state computation over the input words and the max-pooling that produces the sentence vector]

where the inputs are the words of the sentence, the max-pooled output is a fixed-length sentence vector whose length is related to j, and R_s denotes the sentence vectors of the hierarchical memory.

Next, this application proceeds in a similar way, using the second bi-LSTM layer and a max-pooling operation to convert the sentence components into paragraph vectors.

Through this hierarchical encoding, this application assigns each language unit at each level a vector representation (hierarchical distributed memory) and preserves the boundary information of its sentence and paragraph segmentation, thereby obtaining a text vector that comprises word vectors, sentence vectors, and paragraph vectors.
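A compact sketch of such a two-level bi-LSTM encoder in Python with PyTorch. This is an illustration under assumptions, not the exact network of the embodiment: the embedding here is initialized randomly rather than from word2vec, all dimensions are placeholders, and max-pooling is taken over the time dimension as described above.

import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Word -> sentence -> paragraph encoder with two bi-LSTM layers (sketch)."""

    def __init__(self, vocab_size, emb_dim=128, hid_dim=128):
        super().__init__()
        # In the described embodiment the embeddings would be initialized from word2vec.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_lstm = nn.LSTM(emb_dim, hid_dim, bidirectional=True, batch_first=True)
        self.sent_lstm = nn.LSTM(2 * hid_dim, hid_dim, bidirectional=True, batch_first=True)

    def forward(self, paragraph):
        # paragraph: LongTensor of shape (num_sentences, max_words_per_sentence)
        word_emb = self.embed(paragraph)                   # (S, W, emb_dim)
        word_states, _ = self.word_lstm(word_emb)          # (S, W, 2*hid_dim)
        sent_vecs = word_states.max(dim=1).values          # max-pool over words -> (S, 2*hid_dim)
        sent_states, _ = self.sent_lstm(sent_vecs.unsqueeze(0))     # (1, S, 2*hid_dim)
        para_vec = sent_states.max(dim=1).values.squeeze(0)         # max-pool over sentences
        return sent_vecs, para_vec

# Usage sketch: encode a paragraph of 3 sentences, each padded to 10 word ids.
enc = HierarchicalEncoder(vocab_size=10000)
dummy = torch.randint(0, 10000, (3, 10))
sentence_vectors, paragraph_vector = enc(dummy)

The word embeddings, the per-sentence vectors, and the paragraph vector together play the role of the word, sentence, and paragraph components of the text vector described above.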
S2. Use the BP neural network classification model optimized by a decision tree to perform feature selection on the text vector and obtain the text features.
Since in many cases the number of features in the text data far exceeds the number of training samples, to simplify model training this application performs feature selection with a BP-neural-network-based method, using as the measure for evaluating a text feature the sensitivity δ of the state Y to a change in the feature X, i.e.:

[Formula images PCTCN2019116931-appb-000010 and -000011: the definition of the sensitivity δ as a partial derivative of Y with respect to X]
The BP neural network is a multi-layer feedforward neural network whose main characteristics are forward signal transmission and error back-propagation. In the forward pass, the input signal is processed layer by layer from the input layer through the hidden layer to the output layer, and the neuron states of each layer only affect the neuron states of the next layer. If the output layer does not produce the desired output, the network switches to back-propagation and adjusts the network weights and thresholds according to the prediction error, so that the predicted output keeps approaching the desired output.

The BP neural network of this application has the following structure:

Input layer: the only data entry point of the whole network. The number of neuron nodes in the input layer equals the dimensionality of the text's numerical vector, and the value of each neuron corresponds to the value of one component of that vector.

Hidden layer: mainly performs a non-linear transformation of the data coming from the input layer; fitting the input non-linearly on the basis of an activation function effectively guarantees the predictive ability of the model.

Output layer: follows the hidden layer and is the only output of the whole model. The number of neuron nodes in the output layer equals the number of text categories.
Because the structure of the BP neural network has a great influence on the classification results, a poorly designed network suffers from slow convergence, low training speed, and low classification accuracy. This application therefore uses a decision tree to optimize the BP neural network. In this embodiment, the length of the longest rule chain of the decision tree is taken as the number of hidden-layer nodes of the BP neural network, i.e., the depth of the decision tree is used as the number of hidden-layer nodes.
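As a brief illustration of this step, one could fit a decision tree on the training vectors and read off its depth to size the hidden layer. The sketch below uses scikit-learn, and the random data stands in for the text vectors and labels.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Placeholder data standing in for the text vectors and labels (X_train, y_train).
rng = np.random.default_rng(0)
X_train = rng.random((200, 50))
y_train = rng.integers(0, 3, 200)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

# The depth of the fitted tree (its longest rule chain) is taken as the
# number of hidden-layer nodes of the BP neural network.
hidden_nodes = tree.get_depth()
print("hidden layer size:", hidden_nodes)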
The preferred embodiment of this application constructs a 3-layer BP neural network, in which the n units of the input layer correspond to the n feature parameters and the m units of the output layer correspond to the m pattern classes; the number of units in the middle hidden layer is q. Let w_iq denote the connection weight between input-layer unit i and hidden-layer unit q, let v_qj denote the connection weight between hidden-layer unit q and output-layer unit j, and let θ_q be the threshold of each hidden-layer unit. The output O_q of the q-th hidden-layer unit is then:

O_q = f( Σ_i w_iq * x_i - θ_q )

The output y_j of the j-th output-layer unit is:

y_j = f( Σ_q v_qj * O_q - δ_j )

In the above formulas, δ_j is the threshold of each output-layer unit, j = 1, 2, ..., m.
Using the chain rule for partial derivatives of composite functions, the difference between the sensitivity δ_ij of text feature X_i and the sensitivity δ_kj of text feature X_k is obtained:

[Formula images PCTCN2019116931-appb-000016 and -000017: the chain-rule expression for δ_ij - δ_kj and the intermediate quantity it uses]

If the condition given by formula image PCTCN2019116931-appb-000018 holds, then necessarily δ_ij > δ_kj, i.e., text feature X_i has a stronger ability to classify the j-th class of patterns than text feature X_k, and the text features are selected accordingly.
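The sensitivity-based selection can be illustrated with a small numerical sketch: train a 3-layer network, estimate the sensitivity of each output to each input feature by finite differences, and keep the features with the largest sensitivities. This is only an approximation of the chain-rule computation described above, and every network size, data shape, and cut-off below is a placeholder.

import numpy as np
from sklearn.neural_network import MLPClassifier

def feature_sensitivities(model, X, eps=1e-3):
    """Finite-difference estimate of |d P(class j) / d x_i|, averaged over samples."""
    base = model.predict_proba(X)
    n_feat, n_class = X.shape[1], base.shape[1]
    sens = np.zeros((n_feat, n_class))
    for i in range(n_feat):
        X_shift = X.copy()
        X_shift[:, i] += eps
        sens[i] = np.abs(model.predict_proba(X_shift) - base).mean(axis=0) / eps
    return sens

# Placeholder data standing in for the text vectors and labels.
rng = np.random.default_rng(0)
X = rng.random((300, 40))
y = rng.integers(0, 3, 300)

# The hidden-layer size would come from the decision-tree depth computed earlier.
net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500, random_state=0).fit(X, y)

sens = feature_sensitivities(net, X)
# Keep the features whose maximum per-class sensitivity is largest.
top_features = np.argsort(sens.max(axis=1))[::-1][:10]
print("selected feature indices:", top_features)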
S3. Based on the text features obtained above, train the BP neural network classification model with the stochastic gradient descent algorithm and the fine-tuning method until the best text features are obtained; then, based on the best text features, classify the text data with a classifier and output the classification result of the target text.

The fine-tuning method extracts the shallow features of an available neural network and modifies the parameters of the deep neural network to build a new neural network model, so as to reduce the number of iterations and thereby obtain the optimal BP neural network classification model more quickly.
In the preferred embodiment of this application, the procedure for training the BP neural network classification model is as follows:

I. Construct the loss function.

In a neural network, the loss function evaluates the difference between the predicted value Ŷ output by the network model and the true value Y. Here the loss function is written L(Y, Ŷ); it is a non-negative real-valued function, and the smaller the loss value, the better the performance of the network model. The input pattern vectors are A_k = (a_1, a_2, ..., a_8) (k = 1, 2, ..., 20) and the desired output vectors are Y_k (k = 1, 2, ..., 20). According to the basic neuron formula in deep learning, the input of each layer is given by formula image PCTCN2019116931-appb-000021 and its output is C_i = f(z_i).
This application selects the classification loss function:

[Formula image PCTCN2019116931-appb-000022: the classification loss expressed in terms of m, h_θ(x^(i)), and y^(i)]

where m is the number of samples of the text data, h_θ(x^(i)) is the predicted value for the text data, and y^(i) is the true value of the text data.

At the same time, to alleviate the gradient vanishing problem, this application selects the ReLU function relu(x) = max(0, x) as the activation function. This function satisfies the sparsity found in bionics: a neuron node is activated only when its input exceeds a certain value, the output is limited when the input is below 0, and once the input rises above the threshold the dependent variable is linear in the independent variable. Here x represents the accumulated reverse-gradient value and the accumulated descending-gradient value.
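A small sketch of these two ingredients. The exact form of the classification loss is not reproduced in the text, so the cross-entropy below, a common choice for the h_θ(x^(i)) / y^(i) notation, is an assumption; relu follows the definition given above.

import numpy as np

def relu(x):
    """ReLU activation: relu(x) = max(0, x)."""
    return np.maximum(0.0, x)

def classification_loss(h, y, eps=1e-12):
    """Assumed cross-entropy form of the classification loss over m samples.

    h -- predicted probabilities h_theta(x^(i)), shape (m,)
    y -- true labels y^(i) in {0, 1}, shape (m,)
    """
    h = np.clip(h, eps, 1.0 - eps)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

# Tiny usage example with placeholder predictions and labels.
print(relu(np.array([-1.0, 0.5])))                              # [0.  0.5]
print(classification_loss(np.array([0.9, 0.2]), np.array([1, 0])))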
II. Solve the loss function with the stochastic gradient descent algorithm, and use the fine-tuning method to reduce the number of model iterations.

The gradient descent algorithm is the most commonly used optimization algorithm for training neural network models. To find the minimum of the loss function L(Y, Ŷ), the variable y must be updated in the direction opposite to the gradient vector, -dL/dy, which decreases the gradient fastest until the loss converges to its minimum. In this embodiment, combined with the momentum method, the learning rate is lowered as the gradient descends for every batch of batch-size samples, and for every epoch the decay rate is raised according to the decrease in the learning rate; the parameter update formula is L = L - α·dL/dy, where α is the learning rate and dL/dy is the decay rate, from which the final BP neural network parameters are obtained. At the same time, when the fine-tuning method is used, this application first adjusts the parameters of the network layers, removing the FC layer and adjusting the learning rate: because the last layer is relearned, it needs a faster learning rate than the other layers, so the learning rates of weight and bias are sped up by a factor of 10 without changing the learning strategy. Finally, the solver parameters are modified: by reducing the size of the text data, the step size is changed from 100,000 to 20,000 and the maximum number of iterations is reduced accordingly, so that the optimized BP neural network classification model is obtained with fewer iterations and the best text features are obtained with this optimized BP neural network classifier.
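A schematic SGD-with-momentum training loop along these lines, with a per-epoch learning-rate decay written out. The model, data loader, schedule constants, and the .fc head replaced in the fine-tuning sketch are all placeholders; the update follows the generic θ ← θ - α·∇L rather than any exact formula from the text.

import torch
import torch.nn as nn

def train(model, loader, epochs=5, lr=0.1, momentum=0.9):
    """Sketch of SGD + momentum training with a stepwise learning-rate decay."""
    loss_fn = nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.9)   # placeholder schedule
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
        sched.step()
    return model

def finetune(pretrained, n_classes, loader, base_lr=0.01):
    """Fine-tuning sketch: keep shallow features, relearn only the final layer faster."""
    for p in pretrained.parameters():
        p.requires_grad = False                          # freeze shallow features
    pretrained.fc = nn.Linear(pretrained.fc.in_features, n_classes)   # assumes a .fc head
    opt = torch.optim.SGD(pretrained.fc.parameters(), lr=10 * base_lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for x, y in loader:
        opt.zero_grad()
        loss_fn(pretrained(x), y).backward()
        opt.step()
    return pretrained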
Further, the preferred embodiment of this application uses the random forest algorithm as the classifier and classifies the collected text data according to the best text features.

The random forest algorithm uses the bagging algorithm's sampling with replacement to draw several sample subsets from the original samples and trains multiple decision tree models on them; during training it borrows the random feature subspace method, drawing a subset of features from the feature set for the decision-tree splits. Finally, the multiple decision trees are combined into an ensemble classifier, and this ensemble classifier is called a random forest. The algorithm flow can be divided into three parts: generation of the sub-sample sets, construction of the decision trees, and voting to produce the result. The specific flow is as follows (a code sketch follows this list):
1) Generate the sub-sample sets: a random forest is an ensemble classifier, and a certain sample subset must be produced for each base classifier as its input. To accommodate model evaluation, the sample set can be divided in several ways; in this embodiment, cross-validation is used to divide the text data. The original text is divided into k sub-text datasets according to the number of pages; in each training round, one sub-text dataset is used as the test set and the remaining sub-text datasets as the training set, and this rotation is performed k times.

2) Construct the decision trees: in a random forest, each base classifier is an independent decision tree. The most important part of building a decision tree is the splitting rule, which tries to find an optimal feature to partition the samples and thereby improve the accuracy of the final classification. A random-forest decision tree is built in essentially the same way as an ordinary decision tree, except that when a split is made the candidate features are not searched over the whole feature set; instead, k features are randomly selected for the split. In this embodiment, the sub-text features obtained above are used as the child nodes of the decision tree, and the nodes below them are the respective extracted features.

3) Vote to produce the result: the classification result of the random forest is obtained by voting among the base classifiers, i.e., the decision trees. The random forest treats all base classifiers equally: each decision tree yields one classification result, the text classification results of all decision trees are collected and summed, and the result with the most votes is the final text classification result, i.e., the text is effectively classified.
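The flow above can be sketched as follows with scikit-learn; the k-fold split, the forest size, and the random data are placeholders standing in for the page-based split and the selected text features described above.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

# Placeholder feature matrix and labels standing in for the selected text features.
rng = np.random.default_rng(0)
X = rng.random((500, 30))
y = rng.integers(0, 4, 500)

kf = KFold(n_splits=5, shuffle=True, random_state=0)    # k-fold rotation of the text data
scores = []
for train_idx, test_idx in kf.split(X):
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
    forest.fit(X[train_idx], y[train_idx])
    # predict()/score() internally aggregate the votes of the individual decision trees.
    scores.append(forest.score(X[test_idx], y[test_idx]))

print("mean cross-validated accuracy:", np.mean(scores))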
This application also provides a text classification device based on a neural network model. FIG. 2 is a schematic diagram of the internal structure of the text classification device based on a neural network model provided by an embodiment of this application.

In this embodiment, the text classification device 1 based on the neural network model may be a PC (Personal Computer), or a terminal device such as a smartphone, a tablet computer, or a portable computer. The text classification device 1 based on the neural network model comprises at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.

The memory 11 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disk, optical disc, and the like. In some embodiments, the memory 11 may be an internal storage unit of the text classification device 1 based on the neural network model, for example the hard disk of the device 1. In other embodiments, the memory 11 may also be an external storage device of the device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the device 1. The memory 11 can be used not only to store application software installed in the device 1 and various kinds of data, such as the code of the text classification program 01 based on the neural network model, but also to temporarily store data that has been output or is to be output.

In some embodiments, the processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, and is used to run the program code stored in the memory 11 or to process data, for example to execute the text classification program 01 based on the neural network model.

The communication bus 13 is used to realize connection and communication between these components.

The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is usually used to establish a communication connection between the device 1 and other electronic devices.

Optionally, the device 1 may further include a user interface, which may comprise a display and an input unit such as a keyboard; the optional user interface may also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be appropriately called a display screen or a display unit, and is used to display the information processed in the text classification device 1 based on the neural network model and to display a visualized user interface.

FIG. 2 shows only the text classification device 1 based on the neural network model with components 11-14 and the text classification program 01 based on the neural network model. Those skilled in the art will understand that the structure shown in FIG. 2 does not limit the device 1, which may include fewer or more components than illustrated, or combine certain components, or have a different arrangement of components.
In the embodiment of the device 1 shown in FIG. 2, the memory 11 stores a text classification program 01 based on a neural network model, and the processor 12 implements the following steps when executing the program 01 stored in the memory 11:

Step 1: collect text data, perform a preprocessing operation on the text data to obtain preprocessed text data, and convert the preprocessed text data into a text vector.

The preferred embodiment of this application may collect the text data from the Internet, for example from news websites, shopping websites, paper databases, or various forums.

The text data is unstructured or semi-structured data and cannot be directly recognized by a classification algorithm. Therefore, the purpose of the preprocessing operation in the preferred embodiment of this application is to convert the text data into a vector space model: d_i = (w_1, w_2, ..., w_n), where w_j is the weight of the j-th feature item.

This embodiment performs preprocessing operations on the text data that include word segmentation, stop-word removal, feature weight calculation, and de-duplication.

The word segmentation method of this embodiment matches the text data against the entries of a pre-built dictionary according to a predetermined strategy to obtain the words in the text data.

In this embodiment, the chosen stop-word removal method is stop-word-list filtering: the words in the text data are matched against an already constructed stop-word list, and if a word matches it is a stop word and is deleted.

After word segmentation and stop-word removal, the text data is represented by a series of feature words (keywords). Data in this textual form cannot be processed directly by a classification algorithm and must be converted into numerical form, so the weights of these feature words must be calculated to characterize their importance in the text.

This embodiment uses the TF-IDF algorithm for the feature-word calculation. The TF-IDF algorithm uses statistical information, word vector information, and the dependency-syntax information between words; it builds a dependency graph to compute the association strength between words and iteratively computes the importance score of each word with the TextRank algorithm.
In detail, when computing the weights of feature words, this application first calculates the dependency relatedness of any two words W_i and W_j:

[Formula image PCTCN2019116931-appb-000024: Dep(W_i, W_j), the dependency relatedness, expressed in terms of the dependency path length len(W_i, W_j) and the hyperparameter b]

where len(W_i, W_j) denotes the length of the dependency path between words W_i and W_j, and b is a hyperparameter.

This application holds that the semantic similarity between two words alone cannot accurately measure their importance; the two words can only be shown to be important when at least one of them occurs with high frequency in the text. Following the concept of universal gravitation, word frequency is treated as mass, the Euclidean distance between the two words' word vectors is treated as distance, and the attraction between the two words is computed with the gravitation formula. In the present text setting, however, measuring the importance of a word by term frequency alone is too one-sided, so this application introduces the IDF value and replaces term frequency with the TF-IDF value, thereby taking more global information into account. This yields a new word-attraction formula. The attraction between text words W_i and W_j is:

f_grav(W_i, W_j) = tfidf(W_i) * tfidf(W_j) / d^2

where tfidf(W) is the TF-IDF value of word W and d is the Euclidean distance between the word vectors of W_i and W_j.

The degree of association between words W_i and W_j is therefore:

weight(W_i, W_j) = Dep(W_i, W_j) * f_grav(W_i, W_j)
Finally, this application uses the TextRank algorithm to build an undirected graph G = (V, E), where V is the set of vertices and E is the set of edges, and computes the score of word W_i with the weighted TextRank update:

WS(W_i) = (1 - η) + η * Σ_{W_j ∈ C(W_i)} [ weight(W_j, W_i) / Σ_{W_k ∈ C(W_j)} weight(W_j, W_k) ] * WS(W_j)

where C(W_i) is the set of vertices connected to vertex W_i and η is the damping coefficient. This yields the feature weight WS(W_i), by which each word is represented in numerical vector form.
Furthermore, because the collected text data comes from intricate and varied sources, it may contain many duplicate texts. A large amount of duplicate data degrades classification accuracy; therefore, in this embodiment the texts are first de-duplicated with the Euclidean distance method before classification, using the following formula:

d(w_1, w_2) = sqrt( Σ_j (w_1j - w_2j)^2 )

where w_1j and w_2j are the j-th components of the vectors of the two text data items. After the Euclidean distance is computed for every pair of texts, a smaller distance indicates more similar texts, and one of any two texts whose Euclidean distance is below a preset threshold is deleted.
Further, a preferred embodiment of this application also uses the hierarchical text encoder of a zoom neural network to encode the preprocessed text data and obtain the encoded text vector.

In this embodiment, the hierarchical text encoder has three layers: a word embedding layer and two bi-LSTM layers. The word embedding layer initializes the words with word2vec to obtain word vectors; the first bi-LSTM layer receives the word vectors as input and generates sentence vectors, and the second bi-LSTM layer receives the sentence vectors as input and generates paragraph vectors.

In detail, the first bi-LSTM layer takes each word as input and outputs a hidden state vector at each time step; a max-pooling operation is then applied to obtain a fixed-length sentence vector, and all sentence vectors are used as the sentence components of the hierarchical memory. The formulas used are:

[Formula images PCTCN2019116931-appb-000029 and -000030: the bi-LSTM hidden-state computation over the input words and the max-pooling that produces the sentence vector]

where the inputs are the words of the sentence, the max-pooled output is a fixed-length sentence vector whose length is related to j, and R_s denotes the sentence vectors of the hierarchical memory.

Next, this application proceeds in a similar way, using the second bi-LSTM layer and a max-pooling operation to convert the sentence components into paragraph vectors.

Through this hierarchical encoding, this application assigns each language unit at each level a vector representation (hierarchical distributed memory) and preserves the boundary information of its sentence and paragraph segmentation, thereby obtaining a text vector that comprises word vectors, sentence vectors, and paragraph vectors.
Step 2: use the BP neural network classification model optimized by a decision tree to perform feature selection on the text vector and thereby obtain the text features.
Since in many cases the number of features in the text data far exceeds the number of training samples, to simplify model training this application performs feature selection with a BP-neural-network-based method, using as the measure for evaluating a text feature the sensitivity δ of the state Y to a change in the feature X, i.e.:

[Formula images PCTCN2019116931-appb-000033 and -000034: the definition of the sensitivity δ as a partial derivative of Y with respect to X]
The BP neural network is a multi-layer feedforward neural network whose main characteristics are forward signal transmission and error back-propagation. In the forward pass, the input signal is processed layer by layer from the input layer through the hidden layer to the output layer, and the neuron states of each layer only affect the neuron states of the next layer. If the output layer does not produce the desired output, the network switches to back-propagation and adjusts the network weights and thresholds according to the prediction error, so that the predicted output keeps approaching the desired output.

The BP neural network of this application has the following structure:

Input layer: the only data entry point of the whole network. The number of neuron nodes in the input layer equals the dimensionality of the text's numerical vector, and the value of each neuron corresponds to the value of one component of that vector.

Hidden layer: mainly performs a non-linear transformation of the data coming from the input layer; fitting the input non-linearly on the basis of an activation function effectively guarantees the predictive ability of the model.

Output layer: follows the hidden layer and is the only output of the whole model. The number of neuron nodes in the output layer equals the number of text categories.
Because the structure of the BP neural network has a great influence on the classification results, a poorly designed network suffers from slow convergence, low training speed, and low classification accuracy. This application therefore uses a decision tree to optimize the BP neural network. In this embodiment, the length of the longest rule chain of the decision tree is taken as the number of hidden-layer nodes of the BP neural network, i.e., the depth of the decision tree is used as the number of hidden-layer nodes.
The preferred embodiment of this application constructs a 3-layer BP neural network, in which the n units of the input layer correspond to the n feature parameters and the m units of the output layer correspond to the m pattern classes; the number of units in the middle hidden layer is q. Let w_iq denote the connection weight between input-layer unit i and hidden-layer unit q, let v_qj denote the connection weight between hidden-layer unit q and output-layer unit j, and let θ_q be the threshold of each hidden-layer unit. The output O_q of the q-th hidden-layer unit is then:

O_q = f( Σ_i w_iq * x_i - θ_q )

The output y_j of the j-th output-layer unit is:

y_j = f( Σ_q v_qj * O_q - δ_j )

In the above formulas, δ_j is the threshold of each output-layer unit, j = 1, 2, ..., m.
Using the chain rule for partial derivatives of composite functions, the difference between the sensitivity δ_ij of text feature X_i and the sensitivity δ_kj of text feature X_k is obtained:

[Formula images PCTCN2019116931-appb-000039 and -000040: the chain-rule expression for δ_ij - δ_kj and the intermediate quantity it uses]

If the condition given by formula image PCTCN2019116931-appb-000041 holds, then necessarily δ_ij > δ_kj, i.e., text feature X_i has a stronger ability to classify the j-th class of patterns than text feature X_k, and the text features are selected accordingly.
Step 3: based on the text features obtained above, train the BP neural network classification model with the stochastic gradient descent algorithm and the fine-tuning method until the best text features are obtained; then, based on the best text features, classify the text data with a classifier and output the classification result of the target text.

The fine-tuning method extracts the shallow features of an available neural network and modifies the parameters of the deep neural network to build a new neural network model, so as to reduce the number of iterations and thereby obtain the optimal BP neural network classification model more quickly.
In the preferred embodiment of this application, the procedure for training the BP neural network classification model is as follows:

I. Construct the loss function.

In a neural network, the loss function evaluates the difference between the predicted value Ŷ output by the network model and the true value Y. Here the loss function is written L(Y, Ŷ); it is a non-negative real-valued function, and the smaller the loss value, the better the performance of the network model. The input pattern vectors are A_k = (a_1, a_2, ..., a_8) (k = 1, 2, ..., 20) and the desired output vectors are Y_k (k = 1, 2, ..., 20). According to the basic neuron formula in deep learning, the input of each layer is given by formula image PCTCN2019116931-appb-000044 and its output is C_i = f(z_i).
This application selects the classification loss function:

[Formula image PCTCN2019116931-appb-000045: the classification loss expressed in terms of m, h_θ(x^(i)), and y^(i)]

where m is the number of samples of the text data, h_θ(x^(i)) is the predicted value for the text data, and y^(i) is the true value of the text data.

At the same time, to alleviate the gradient vanishing problem, this application selects the ReLU function relu(x) = max(0, x) as the activation function. This function satisfies the sparsity found in bionics: a neuron node is activated only when its input exceeds a certain value, the output is limited when the input is below 0, and once the input rises above the threshold the dependent variable is linear in the independent variable. Here x represents the accumulated reverse-gradient value and the accumulated descending-gradient value.
II. Solve the loss function with the stochastic gradient descent algorithm, and use the fine-tuning method to reduce the number of model iterations.

The gradient descent algorithm is the most commonly used optimization algorithm for training neural network models. To find the minimum of the loss function L(Y, Ŷ), the variable y must be updated in the direction opposite to the gradient vector, -dL/dy, which decreases the gradient fastest until the loss converges to its minimum. In this embodiment, combined with the momentum method, the learning rate is lowered as the gradient descends for every batch of batch-size samples, and for every epoch the decay rate is raised according to the decrease in the learning rate; the parameter update formula is L = L - α·dL/dy, where α is the learning rate and dL/dy is the decay rate, from which the final BP neural network parameters are obtained. At the same time, when the fine-tuning method is used, this application first adjusts the parameters of the network layers, removing the FC layer and adjusting the learning rate: because the last layer is relearned, it needs a faster learning rate than the other layers, so the learning rates of weight and bias are sped up by a factor of 10 without changing the learning strategy. Finally, the solver parameters are modified: by reducing the size of the text data, the step size is changed from 100,000 to 20,000 and the maximum number of iterations is reduced accordingly, so that the optimized BP neural network classification model is obtained with fewer iterations and the best text features are obtained with this optimized BP neural network classifier.
Further, the preferred embodiment of this application uses the random forest algorithm as the classifier and performs text classification on the collected text data according to the best text features.

The random forest algorithm uses the bagging algorithm's sampling with replacement to draw several sample subsets from the original samples and trains multiple decision tree models on them; during training it borrows the random feature subspace method, drawing a subset of features from the feature set for the decision-tree splits. Finally, the multiple decision trees are combined into an ensemble classifier, and this ensemble classifier is called a random forest. The algorithm flow can be divided into three parts: generation of the sub-sample sets, construction of the decision trees, and voting to produce the result. The specific flow is as follows:
1) Generate the sub-sample sets: a random forest is an ensemble classifier, and a certain sample subset must be produced for each base classifier as its input. To accommodate model evaluation, the sample set can be divided in several ways; in this embodiment, cross-validation is used to divide the text data. The original text is divided into k sub-text datasets according to the number of pages; in each training round, one sub-text dataset is used as the test set and the remaining sub-text datasets as the training set, and this rotation is performed k times.

2) Construct the decision trees: in a random forest, each base classifier is an independent decision tree. The most important part of building a decision tree is the splitting rule, which tries to find an optimal feature to partition the samples and thereby improve the accuracy of the final classification. A random-forest decision tree is built in essentially the same way as an ordinary decision tree, except that when a split is made the candidate features are not searched over the whole feature set; instead, k features are randomly selected for the split. In this embodiment, the sub-text features obtained above are used as the child nodes of the decision tree, and the nodes below them are the respective extracted features.

3) Vote to produce the result: the classification result of the random forest is obtained by voting among the base classifiers, i.e., the decision trees. The random forest treats all base classifiers equally: each decision tree yields one classification result, the text classification results of all decision trees are collected and summed, and the result with the most votes is the final text classification result, i.e., the text is effectively classified.
Optionally, in other embodiments, the text classification program based on the neural network model may also be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to carry out this application. A module as referred to in this application means a series of computer program instruction segments capable of completing a specific function, and is used to describe the execution process of the text classification program based on the neural network model in the text classification device based on the neural network model.

For example, FIG. 3 is a schematic diagram of the program modules of the text classification program based on a neural network model in an embodiment of the text classification device based on a neural network model of this application. In this embodiment, the text classification program based on the neural network model may be divided into a sample collection module 10, a feature extraction module 20, and a text classification module 30. Illustratively:
The sample collection module 10 is configured to: collect text data, perform a preprocessing operation on the text data to obtain preprocessed text data, and convert the preprocessed text data into a text vector.

The preprocessing operation on the text data includes:

matching the text data against the entries of a pre-built dictionary according to a predetermined strategy to obtain the words in the text data;

matching the words in the text data against an already constructed stop-word list, and if a word matches, judging that the word is a stop word and deleting it;

building a dependency graph to compute the association strength between words, iteratively computing the importance score of each word with the TextRank algorithm, and representing each word in numerical vector form;

computing the Euclidean distance between every two items of the text data, and deleting one of any two text data items whose Euclidean distance is less than a preset threshold.

Converting the text data into a text vector includes:

using the hierarchical text encoder of a zoom neural network to encode the preprocessed text data and obtain the encoded text vector, wherein the hierarchical text encoder includes a word embedding layer and two bi-LSTM layers: the word embedding layer initializes the words with word2vec to obtain word vectors, the first bi-LSTM layer receives the word vectors as input and generates sentence vectors, and the second bi-LSTM layer receives the sentence vectors as input and generates paragraph vectors.
The feature extraction module 20 is configured to: use the BP neural network classification model optimized by a decision tree to perform feature selection on the text vector and obtain the initial text features.

Using the BP neural network classification model optimized by a decision tree to perform feature selection on the text vector and thereby obtain the text features includes:
constructing a 3-layer BP neural network, in which the n units of the input layer of the BP neural network correspond to the n feature parameters and the m units of the output layer correspond to the m pattern classes, with the number of units in the middle hidden layer taken as q. Let w_iq denote the connection weight between input-layer unit i and hidden-layer unit q, let v_qj denote the connection weight between hidden-layer unit q and output-layer unit j, and let θ_q be the threshold of each hidden-layer unit. The output O_q of the q-th hidden-layer unit is then:

O_q = f( Σ_i w_iq * x_i - θ_q )

The output y_j of the j-th output-layer unit is:

y_j = f( Σ_q v_qj * O_q - δ_j )

In the above formulas, δ_j is the threshold of each output-layer unit, j = 1, 2, ..., m;
using the chain rule for partial derivatives of composite functions to obtain the difference between the sensitivity δ_ij of text feature X_i and the sensitivity δ_kj of text feature X_k:

[Formula images PCTCN2019116931-appb-000051 and -000052: the chain-rule expression for δ_ij - δ_kj and the intermediate quantity it uses]

and, if the condition given by formula image PCTCN2019116931-appb-000053 holds, then δ_ij > δ_kj, i.e., text feature X_i has a stronger ability to classify the j-th class of patterns than text feature X_k, and the text features are selected accordingly.
The text classification module 30 is configured to: based on the initial text features obtained above, train the BP neural network classification model with the stochastic gradient descent algorithm and the fine-tuning method until the best text features are obtained, and, based on the best text features, classify the text data with a classifier and output the classification result of the text data.

The classifier is a random forest classifier; and

classifying the text data with the classifier includes:

dividing the text data by cross-validation, wherein the cross-validation divides the original text data into k sub-text datasets according to the number of pages, and in each training round one sub-text dataset is used as the test set and the remaining sub-text datasets as the training set, with k rotations performed;

using the sub-text features obtained above as the child nodes of decision trees to construct multiple decision trees;

collecting and summing the text classification results of all decision trees, the result with the most votes being the final text classification result.
The functions or operation steps implemented when the program modules such as the sample collection module 10, the feature extraction module 20, and the text classification module 30 are executed are substantially the same as those of the above embodiment and are not repeated here.
In addition, an embodiment of this application further proposes a computer-readable storage medium. The computer-readable storage medium stores a text classification program based on a neural network model, and the program can be executed by one or more processors to implement the following operations:

collecting text data and performing a preprocessing operation on the text data to obtain preprocessed text data;

converting the preprocessed text data into a text vector;

performing feature selection on the text vector with a BP neural network classification model optimized by a decision tree to obtain initial text features;

training the BP neural network classification model with the stochastic gradient descent algorithm and the fine-tuning method according to the initial text features obtained above, until the best text features are obtained;

classifying the text data with a classifier according to the best text features, and outputting the classification result of the text data.
The specific implementation of the computer-readable storage medium of this application is substantially the same as the embodiments of the text classification device and method based on the neural network model described above and is not repeated here.

It should be noted that the serial numbers of the above embodiments of this application are only for description and do not represent the relative merits of the embodiments. The terms "include", "comprise", or any other variant thereof herein are intended to cover non-exclusive inclusion, so that a process, device, article, or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article, or method. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, device, article, or method that includes that element.

Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, magnetic disk, or optical disc) and includes several instructions to cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of this application.

The above are only the preferred embodiments of this application and do not thereby limit the patent scope of this application. Any equivalent structural or equivalent process transformation made using the contents of the specification and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of this application.

Claims (20)

  1. A text classification method based on a neural network model, wherein the method comprises:
    collecting text data and performing a preprocessing operation on the text data to obtain preprocessed text data;
    converting the preprocessed text data into a text vector;
    performing feature selection on the text vector by using a BP neural network classification model optimized based on a decision tree, to obtain initial text features;
    training the BP neural network classification model by using a stochastic gradient descent algorithm and a fine-tuning method according to the initial text features obtained above, until optimal text features are obtained; and
    classifying the text data by using a classifier according to the optimal text features, and outputting a classification result of the text data.
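For illustration only, the following is a minimal sketch of the training step recited in claim 1: a 3-layer BP-style network trained with stochastic gradient descent, followed by a shorter lower-learning-rate pass standing in for the fine-tuning method. The framework (PyTorch), layer sizes, learning rates, and function name are assumptions and are not prescribed by the claim.

```python
# Illustrative sketch only: a 3-layer BP-style classifier trained with SGD,
# then a shorter low-learning-rate pass standing in for "fine-tuning".
# Framework, sizes and learning rates are assumptions, not part of the claim.
import torch
import torch.nn as nn

def train_bp_classifier(X, y, n_classes, hidden=64, epochs=50, finetune_epochs=10):
    """X: float tensor (n_samples, n_features); y: long tensor of class indices."""
    model = nn.Sequential(
        nn.Linear(X.shape[1], hidden), nn.Sigmoid(),   # hidden layer of the BP network
        nn.Linear(hidden, n_classes),                  # output layer, one unit per class
    )
    loss_fn = nn.CrossEntropyLoss()
    # first pass: ordinary SGD training; second pass: "fine-tuning" at a lower rate
    for lr, n_epochs in [(0.1, epochs), (0.01, finetune_epochs)]:
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(n_epochs):
            opt.zero_grad()
            loss = loss_fn(model(X), y)
            loss.backward()
            opt.step()
    return model
```

In practice the two passes would iterate over mini-batches of text vectors, which is what makes the gradient descent stochastic; the full-batch loop above is kept only to keep the sketch short.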
  2. The text classification method based on a neural network model according to claim 1, wherein the preprocessing operation on the text data comprises:
    matching the text data against entries in a pre-built dictionary according to a predetermined strategy to obtain words in the text data;
    matching the words in the text data against a pre-built stop-word list, and if a word matches, determining that the word is a stop word and deleting it;
    constructing a dependency graph to calculate the association strength between words, iteratively calculating an importance score for each word by using the TextRank algorithm, and representing each word as a numerical vector; and
    calculating the Euclidean distance between every two pieces of text data, and deleting one of the two pieces of text data when the Euclidean distance is less than a preset threshold.
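As an informal illustration of the preprocessing recited in claim 2, the sketch below covers stop-word removal, TextRank-style importance scoring, and Euclidean-distance de-duplication. It assumes the text is already segmented into tokens; the stop-word set, window size, and distance threshold are placeholders, and a word co-occurrence graph stands in for the dependency graph named in the claim, with networkx's PageRank used for the TextRank iteration.

```python
# Illustrative sketch of the claim-2 preprocessing, assuming pre-segmented tokens.
import numpy as np
import networkx as nx

STOP_WORDS = {"the", "a", "an", "of", "and"}   # hypothetical stop-word list

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def textrank_scores(tokens, window=3):
    """TextRank-style importance: PageRank over a co-occurrence graph of the words."""
    graph = nx.Graph()
    graph.add_nodes_from(set(tokens))
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:
            if v != w:
                graph.add_edge(w, v)
    return nx.pagerank(graph)                  # dict: word -> importance score

def deduplicate(text_vectors, threshold=0.5):
    """Keep only texts whose Euclidean distance to every kept text is >= threshold."""
    kept = []
    for i, vec in enumerate(text_vectors):
        if all(np.linalg.norm(vec - text_vectors[j]) >= threshold for j in kept):
            kept.append(i)
    return kept                                # indices of the retained texts
```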
  3. The text classification method based on a neural network model according to claim 2, wherein converting the preprocessed text data into a text vector comprises:
    encoding the preprocessed text data by using the hierarchical text encoder of a zoom neural network to obtain an encoded text vector, wherein the hierarchical text encoder comprises a word embedding layer and two bi-LSTM layers, the word embedding layer initializes the words with word2vec to obtain word vectors, the first bi-LSTM layer receives the word vectors as input and generates sentence vectors, the second bi-LSTM layer receives the sentence vectors as input and generates a paragraph vector, and the text vector comprising the word vectors, the sentence vectors, and the paragraph vector is thereby obtained.
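A minimal sketch of the hierarchical encoder described in claim 3, assuming PyTorch: a word embedding layer that can be initialized from a word2vec weight matrix, a first bi-LSTM that turns word vectors into sentence vectors, and a second bi-LSTM that turns sentence vectors into a paragraph vector. Mean pooling between levels is an assumption, since the claim does not specify how each level is collapsed.

```python
# Illustrative sketch of the word -> sentence -> paragraph encoder in claim 3.
# `pretrained` would be a word2vec weight matrix aligned with the vocabulary (assumption).
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden=128, pretrained=None):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)           # word embedding layer
        if pretrained is not None:
            self.embed.weight.data.copy_(torch.as_tensor(pretrained))
        self.word_lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.sent_lstm = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)

    def forward(self, doc):
        """doc: long tensor (n_sentences, n_words) of word ids for one paragraph."""
        word_vecs = self.embed(doc)                              # (S, W, E) word vectors
        word_out, _ = self.word_lstm(word_vecs)                  # first bi-LSTM over words
        sent_vecs = word_out.mean(dim=1)                         # (S, 2H) sentence vectors
        sent_out, _ = self.sent_lstm(sent_vecs.unsqueeze(0))     # second bi-LSTM over sentences
        para_vec = sent_out.mean(dim=1).squeeze(0)               # (2H,) paragraph vector
        return word_vecs, sent_vecs, para_vec
```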
  4. The text classification method based on a neural network model according to claim 1, wherein performing feature selection on the text vector by using the BP neural network classification model optimized based on a decision tree to obtain text features comprises:
    constructing a 3-layer BP neural network, wherein the n units of the input layer correspond to n feature parameters, the m units of the output layer correspond to m pattern classes, and the number of hidden-layer units is q; denoting the connection weight between input-layer unit i and hidden-layer unit q by w_iq, the connection weight between hidden-layer unit q and output-layer unit j by v_qj, and the threshold of each hidden-layer unit by θ_q, the output O_q of the q-th hidden-layer unit is O_q = f(Σ_{i=1}^{n} w_iq·x_i − θ_q), and the output y_j of the j-th output-layer unit is y_j = f(Σ_q v_qj·O_q − δ_j), where δ_j is the threshold of each output-layer unit and j = 1, 2, …, m;
    obtaining, according to the chain rule for partial derivatives of composite functions, the difference between the sensitivity δ_ij of text feature X_i and the sensitivity δ_kj of text feature X_k (formula PCTCN2019116931-appb-100005, with PCTCN2019116931-appb-100006); and
    when the condition of formula PCTCN2019116931-appb-100007 holds, concluding that δ_ij > δ_kj, that is, text feature X_i has a stronger classification ability for the j-th pattern class than text feature X_k, and selecting text features accordingly.
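The feature-selection rule in claim 4 compares sensitivities obtained by the chain rule. The sketch below computes ∂y_j/∂x_i for a sigmoid 3-layer network and ranks features by mean absolute sensitivity; since the claim's exact sensitivity expressions are only available as formula images, the standard chain-rule derivative of a sigmoid BP network is used here as an assumed stand-in, and the function names and top_k parameter are illustrative.

```python
# Illustrative sketch of sensitivity-based feature selection for a trained
# 3-layer sigmoid BP network (assumed stand-in for the claim's formula images).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_sensitivities(x, W, theta, V, delta):
    """x: (n,) input features; W: (n, q) input->hidden weights; theta: (q,) hidden thresholds;
    V: (q, m) hidden->output weights; delta: (m,) output thresholds.
    Returns the (n, m) matrix of sensitivities d y_j / d x_i."""
    O = sigmoid(x @ W - theta)                       # hidden outputs O_q
    y = sigmoid(O @ V - delta)                       # network outputs y_j
    # chain rule: d y_j / d x_i = y_j (1 - y_j) * sum_q V_qj * O_q (1 - O_q) * W_iq
    hidden_grad = (O * (1 - O))[:, None] * V         # (q, m)
    return (y * (1 - y)) * (W @ hidden_grad)         # (n, m)

def select_features(x, W, theta, V, delta, top_k=100):
    """Keep the top_k features with the largest mean |sensitivity| across the m classes."""
    scores = np.abs(output_sensitivities(x, W, theta, V, delta)).mean(axis=1)
    return np.argsort(scores)[::-1][:top_k]
```

Calling select_features on a trained network's weights keeps the features with the strongest per-class influence, mirroring the δ_ij > δ_kj comparison described in the claim.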
  5. The text classification method based on a neural network model according to claim 1, wherein the classifier is a random forest classifier; and
    classifying the text data by using the classifier comprises:
    dividing the text data by means of cross-validation, wherein the cross-validation divides the original text data into k sub-text data sets according to the number of pages, and in each round of training one of the sub-text data sets is used as the test set while the remaining sub-text data sets are used as the training set, rotated over k rounds;
    using the sub-text data sets obtained above as child nodes of decision trees to construct a plurality of decision trees; and
    aggregating and summing the text classification results of all the decision trees, and taking the result with the most votes as the final text classification result.
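To make the classification step in claim 5 concrete, here is a small scikit-learn sketch of a k-fold rotation in which each fold serves once as the test set. The page-based split described in the claim is replaced by a generic KFold split, and the per-tree majority vote is the aggregation RandomForestClassifier performs internally; the function name, k, and number of trees are illustrative assumptions.

```python
# Illustrative sketch of claim 5's classification step with scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

def cross_validated_forest(X, y, k=5, n_trees=100):
    scores = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
        clf.fit(X[train_idx], y[train_idx])                 # train on k-1 folds
        scores.append(clf.score(X[test_idx], y[test_idx]))  # evaluate on the held-out fold
    return float(np.mean(scores))                           # mean accuracy over k rotations
```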
  6. The text classification method based on a neural network model according to claim 2, wherein the classifier is a random forest classifier; and
    classifying the text data by using the classifier comprises:
    dividing the text data by means of cross-validation, wherein the cross-validation divides the original text data into k sub-text data sets according to the number of pages, and in each round of training one of the sub-text data sets is used as the test set while the remaining sub-text data sets are used as the training set, rotated over k rounds;
    using the sub-text data sets obtained above as child nodes of decision trees to construct a plurality of decision trees; and
    aggregating and summing the text classification results of all the decision trees, and taking the result with the most votes as the final text classification result.
  7. The text classification method based on a neural network model according to any one of claims 3 to 4, wherein the classifier is a random forest classifier; and
    classifying the text data by using the classifier comprises:
    dividing the text data by means of cross-validation, wherein the cross-validation divides the original text data into k sub-text data sets according to the number of pages, and in each round of training one of the sub-text data sets is used as the test set while the remaining sub-text data sets are used as the training set, rotated over k rounds;
    using the sub-text data sets obtained above as child nodes of decision trees to construct a plurality of decision trees; and
    aggregating and summing the text classification results of all the decision trees, and taking the result with the most votes as the final text classification result.
  8. A text classification apparatus based on a neural network model, wherein the apparatus comprises a memory and a processor, the memory stores a text classification program based on the neural network model that is executable on the processor, and the text classification program based on the neural network model, when executed by the processor, implements the following steps:
    collecting text data and performing a preprocessing operation on the text data to obtain preprocessed text data;
    converting the preprocessed text data into a text vector;
    performing feature selection on the text vector by using a BP neural network classification model optimized based on a decision tree, to obtain initial text features;
    training the BP neural network classification model by using a stochastic gradient descent algorithm and a fine-tuning method according to the initial text features obtained above, until optimal text features are obtained; and
    classifying the text data by using a classifier according to the optimal text features, and outputting a classification result of the text data.
  9. The text classification apparatus based on a neural network model according to claim 8, wherein the preprocessing operation on the text data comprises:
    matching the text data against entries in a pre-built dictionary according to a predetermined strategy to obtain words in the text data;
    matching the words in the text data against a pre-built stop-word list, and if a word matches, determining that the word is a stop word and deleting it;
    constructing a dependency graph to calculate the association strength between words, iteratively calculating an importance score for each word by using the TextRank algorithm, and representing each word as a numerical vector; and
    calculating the Euclidean distance between every two pieces of text data, and deleting one of the two pieces of text data when the Euclidean distance is less than a preset threshold.
  10. The text classification apparatus based on a neural network model according to claim 9, wherein converting the preprocessed text data into a text vector comprises:
    encoding the preprocessed text data by using the hierarchical text encoder of a zoom neural network to obtain an encoded text vector, wherein the hierarchical text encoder comprises a word embedding layer and two bi-LSTM layers, the word embedding layer initializes the words with word2vec to obtain word vectors, the first bi-LSTM layer receives the word vectors as input and generates sentence vectors, the second bi-LSTM layer receives the sentence vectors as input and generates a paragraph vector, and the text vector comprising the word vectors, the sentence vectors, and the paragraph vector is thereby obtained.
  11. The text classification apparatus based on a neural network model according to claim 8, wherein performing feature selection on the text vector by using the BP neural network classification model optimized based on a decision tree to obtain text features comprises:
    constructing a 3-layer BP neural network, wherein the n units of the input layer correspond to n feature parameters, the m units of the output layer correspond to m pattern classes, and the number of hidden-layer units is q; denoting the connection weight between input-layer unit i and hidden-layer unit q by w_iq, the connection weight between hidden-layer unit q and output-layer unit j by v_qj, and the threshold of each hidden-layer unit by θ_q, the output O_q of the q-th hidden-layer unit is O_q = f(Σ_{i=1}^{n} w_iq·x_i − θ_q), and the output y_j of the j-th output-layer unit is y_j = f(Σ_q v_qj·O_q − δ_j), where δ_j is the threshold of each output-layer unit and j = 1, 2, …, m;
    obtaining, according to the chain rule for partial derivatives of composite functions, the difference between the sensitivity δ_ij of text feature X_i and the sensitivity δ_kj of text feature X_k (formula PCTCN2019116931-appb-100012, with PCTCN2019116931-appb-100013); and
    when the condition of formula PCTCN2019116931-appb-100014 holds, concluding that δ_ij > δ_kj, that is, text feature X_i has a stronger classification ability for the j-th pattern class than text feature X_k, and selecting text features accordingly.
  12. The text classification apparatus based on a neural network model according to claim 8, wherein the classifier is a random forest classifier; and
    classifying the text data by using the classifier comprises:
    dividing the text data by means of cross-validation, wherein the cross-validation divides the original text data into k sub-text data sets according to the number of pages, and in each round of training one of the sub-text data sets is used as the test set while the remaining sub-text data sets are used as the training set, rotated over k rounds;
    using the sub-text data sets obtained above as child nodes of decision trees to construct a plurality of decision trees; and
    aggregating and summing the text classification results of all the decision trees, and taking the result with the most votes as the final text classification result.
  13. The text classification apparatus based on a neural network model according to claim 9, wherein the classifier is a random forest classifier; and
    classifying the text data by using the classifier comprises:
    dividing the text data by means of cross-validation, wherein the cross-validation divides the original text data into k sub-text data sets according to the number of pages, and in each round of training one of the sub-text data sets is used as the test set while the remaining sub-text data sets are used as the training set, rotated over k rounds;
    using the sub-text data sets obtained above as child nodes of decision trees to construct a plurality of decision trees; and
    aggregating and summing the text classification results of all the decision trees, and taking the result with the most votes as the final text classification result.
  14. The text classification apparatus based on a neural network model according to any one of claims 10 to 11, wherein the classifier is a random forest classifier; and
    classifying the text data by using the classifier comprises:
    dividing the text data by means of cross-validation, wherein the cross-validation divides the original text data into k sub-text data sets according to the number of pages, and in each round of training one of the sub-text data sets is used as the test set while the remaining sub-text data sets are used as the training set, rotated over k rounds;
    using the sub-text data sets obtained above as child nodes of decision trees to construct a plurality of decision trees; and
    aggregating and summing the text classification results of all the decision trees, and taking the result with the most votes as the final text classification result.
  15. A computer-readable storage medium, wherein a text classification program based on a neural network model is stored on the computer-readable storage medium, and the text classification program based on the neural network model is executable by one or more processors to implement the following steps:
    collecting text data and performing a preprocessing operation on the text data to obtain preprocessed text data;
    converting the preprocessed text data into a text vector;
    performing feature selection on the text vector by using a BP neural network classification model optimized based on a decision tree, to obtain initial text features;
    training the BP neural network classification model by using a stochastic gradient descent algorithm and a fine-tuning method according to the initial text features obtained above, until optimal text features are obtained; and
    classifying the text data by using a classifier according to the optimal text features, and outputting a classification result of the text data.
  16. The computer-readable storage medium according to claim 15, wherein the preprocessing operation on the text data comprises:
    matching the text data against entries in a pre-built dictionary according to a predetermined strategy to obtain words in the text data;
    matching the words in the text data against a pre-built stop-word list, and if a word matches, determining that the word is a stop word and deleting it;
    constructing a dependency graph to calculate the association strength between words, iteratively calculating an importance score for each word by using the TextRank algorithm, and representing each word as a numerical vector; and
    calculating the Euclidean distance between every two pieces of text data, and deleting one of the two pieces of text data when the Euclidean distance is less than a preset threshold.
  17. The computer-readable storage medium according to claim 16, wherein converting the preprocessed text data into a text vector comprises:
    encoding the preprocessed text data by using the hierarchical text encoder of a zoom neural network to obtain an encoded text vector, wherein the hierarchical text encoder comprises a word embedding layer and two bi-LSTM layers, the word embedding layer initializes the words with word2vec to obtain word vectors, the first bi-LSTM layer receives the word vectors as input and generates sentence vectors, the second bi-LSTM layer receives the sentence vectors as input and generates a paragraph vector, and the text vector comprising the word vectors, the sentence vectors, and the paragraph vector is thereby obtained.
  18. The computer-readable storage medium according to claim 15, wherein performing feature selection on the text vector by using the BP neural network classification model optimized based on a decision tree to obtain text features comprises:
    constructing a 3-layer BP neural network, wherein the n units of the input layer correspond to n feature parameters, the m units of the output layer correspond to m pattern classes, and the number of hidden-layer units is q; denoting the connection weight between input-layer unit i and hidden-layer unit q by w_iq, the connection weight between hidden-layer unit q and output-layer unit j by v_qj, and the threshold of each hidden-layer unit by θ_q, the output O_q of the q-th hidden-layer unit is O_q = f(Σ_{i=1}^{n} w_iq·x_i − θ_q), and the output y_j of the j-th output-layer unit is y_j = f(Σ_q v_qj·O_q − δ_j), where δ_j is the threshold of each output-layer unit and j = 1, 2, …, m;
    obtaining, according to the chain rule for partial derivatives of composite functions, the difference between the sensitivity δ_ij of text feature X_i and the sensitivity δ_kj of text feature X_k (formula PCTCN2019116931-appb-100019, with PCTCN2019116931-appb-100020); and
    when the condition of formula PCTCN2019116931-appb-100021 holds, concluding that δ_ij > δ_kj, that is, text feature X_i has a stronger classification ability for the j-th pattern class than text feature X_k, and selecting text features accordingly.
  19. The computer-readable storage medium according to claim 15, wherein the classifier is a random forest classifier; and
    classifying the text data by using the classifier comprises:
    dividing the text data by means of cross-validation, wherein the cross-validation divides the original text data into k sub-text data sets according to the number of pages, and in each round of training one of the sub-text data sets is used as the test set while the remaining sub-text data sets are used as the training set, rotated over k rounds;
    using the sub-text data sets obtained above as child nodes of decision trees to construct a plurality of decision trees; and
    aggregating and summing the text classification results of all the decision trees, and taking the result with the most votes as the final text classification result.
  20. The computer-readable storage medium according to any one of claims 16 to 18, wherein the classifier is a random forest classifier; and
    classifying the text data by using the classifier comprises:
    dividing the text data by means of cross-validation, wherein the cross-validation divides the original text data into k sub-text data sets according to the number of pages, and in each round of training one of the sub-text data sets is used as the test set while the remaining sub-text data sets are used as the training set, rotated over k rounds;
    using the sub-text data sets obtained above as child nodes of decision trees to construct a plurality of decision trees; and
    aggregating and summing the text classification results of all the decision trees, and taking the result with the most votes as the final text classification result.
PCT/CN2019/116931 2019-09-17 2019-11-10 Text data classification method and apparatus based on neural network model, and storage medium WO2021051518A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910885586.7A CN110750640B (en) 2019-09-17 2019-09-17 Text data classification method and device based on neural network model and storage medium
CN201910885586.7 2019-09-17

Publications (1)

Publication Number Publication Date
WO2021051518A1 true WO2021051518A1 (en) 2021-03-25

Family

ID=69276659

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/116931 WO2021051518A1 (en) 2019-09-17 2019-11-10 Text data classification method and apparatus based on neural network model, and storage medium

Country Status (2)

Country Link
CN (1) CN110750640B (en)
WO (1) WO2021051518A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113282711A (en) * 2021-06-03 2021-08-20 中国软件评测中心(工业和信息化部软件与集成电路促进中心) Internet of vehicles text matching method and device, electronic equipment and storage medium

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112085157B (en) * 2020-07-20 2024-02-27 西安电子科技大学 Disease prediction method and device based on neural network and tree model
CN111882416A (en) * 2020-07-24 2020-11-03 未鲲(上海)科技服务有限公司 Training method and related device of risk prediction model
CN112819072B (en) * 2021-02-01 2023-07-18 西南民族大学 Supervision type classification method and system
CN113033902B (en) * 2021-03-31 2024-03-19 中汽院智能网联科技有限公司 Automatic driving lane change track planning method based on improved deep learning
CN113269368B (en) * 2021-06-07 2023-06-30 上海航空工业(集团)有限公司 Civil aircraft safety trend prediction method based on data driving
CN113673229B (en) * 2021-08-23 2024-04-05 广东电网有限责任公司 Electric power marketing data interaction method, system and storage medium
CN114281992A (en) * 2021-12-22 2022-04-05 北京朗知网络传媒科技股份有限公司 Automobile article intelligent classification method and system based on media field
CN114896468B (en) * 2022-04-24 2024-02-02 北京月新时代科技股份有限公司 File type matching method and data intelligent input method based on neural network
CN115147225B (en) * 2022-07-28 2024-04-05 连连银通电子支付有限公司 Data transfer information identification method, device, equipment and storage medium
CN115328062B (en) 2022-08-31 2023-03-28 济南永信新材料科技有限公司 Intelligent control system for spunlace production line
CN116646078B (en) * 2023-07-19 2023-11-24 中国人民解放军总医院 Cardiovascular critical clinical decision support system and device based on artificial intelligence


Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156766B (en) * 2015-03-25 2020-02-18 阿里巴巴集团控股有限公司 Method and device for generating text line classifier
CN108268461A (en) * 2016-12-30 2018-07-10 广东精点数据科技股份有限公司 A kind of document sorting apparatus based on hybrid classifer
CN106919646B (en) * 2017-01-18 2020-06-09 南京云思创智信息科技有限公司 Chinese text abstract generating system and method
WO2019019199A1 (en) * 2017-07-28 2019-01-31 Shenzhen United Imaging Healthcare Co., Ltd. System and method for image conversion
CN107665248A (en) * 2017-09-22 2018-02-06 齐鲁工业大学 File classification method and device based on deep learning mixed model
US11100399B2 (en) * 2017-11-21 2021-08-24 International Business Machines Corporation Feature extraction using multi-task learning
CN109086654B (en) * 2018-06-04 2023-04-28 平安科技(深圳)有限公司 Handwriting model training method, text recognition method, device, equipment and medium
CN108829822B (en) * 2018-06-12 2023-10-27 腾讯科技(深圳)有限公司 Media content recommendation method and device, storage medium and electronic device
CN110138849A (en) * 2019-05-05 2019-08-16 哈尔滨英赛克信息技术有限公司 Agreement encryption algorithm type recognition methods based on random forest
CN110196893A (en) * 2019-05-05 2019-09-03 平安科技(深圳)有限公司 Non- subjective item method to go over files, device and storage medium based on text similarity

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107656990A (en) * 2017-09-14 2018-02-02 中山大学 A kind of file classification method based on two aspect characteristic informations of word and word
CN109376242A (en) * 2018-10-18 2019-02-22 西安工程大学 Text classification algorithm based on Recognition with Recurrent Neural Network variant and convolutional neural networks
CN109947940A (en) * 2019-02-15 2019-06-28 平安科技(深圳)有限公司 File classification method, device, terminal and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113282711A (en) * 2021-06-03 2021-08-20 中国软件评测中心(工业和信息化部软件与集成电路促进中心) Internet of vehicles text matching method and device, electronic equipment and storage medium
CN113282711B (en) * 2021-06-03 2023-09-22 中国软件评测中心(工业和信息化部软件与集成电路促进中心) Internet of vehicles text matching method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110750640A (en) 2020-02-04
CN110750640B (en) 2022-11-04


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19946196

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19946196

Country of ref document: EP

Kind code of ref document: A1