WO2021051518A1 - Text data classification method and apparatus based on neural network model, and storage medium - Google Patents

Text data classification method and apparatus based on neural network model, and storage medium Download PDF

Info

Publication number
WO2021051518A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
text data
neural network
classification
vector
Prior art date
Application number
PCT/CN2019/116931
Other languages
French (fr)
Chinese (zh)
Inventor
金戈
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021051518A1 publication Critical patent/WO2021051518A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a method, device and computer-readable storage medium for text data classification based on a neural network model.
  • the prior art mainly constructs a text classification model based on word-frequency features in the text, and then classifies the text to be classified based on the constructed text classification model.
  • because the word frequency in the text cannot effectively reflect the text category, the prior art usually has the problem of inaccurate text classification.
  • This application provides a method, device and computer-readable storage medium for text classification based on a neural network model, the main purpose of which is to provide an accurate text data classification scheme.
  • a text classification method based on a neural network model includes: collecting text data and performing preprocessing operations on the text data to obtain preprocessed text data; converting the preprocessed text data into a text vector; using a BP neural network classification model optimized by a decision tree to perform feature selection on the text vector to obtain initial text features; according to the initial text features obtained above, training the BP neural network classification model with the stochastic gradient descent algorithm and the fine-tuning method until the best text features are obtained; and, according to the best text features, using a classifier to classify the text data and outputting the classification result of the text data.
  • the present application also provides a text classification device based on a neural network model.
  • the device includes a memory and a processor.
  • the memory stores a text classification program based on a neural network model that can be run on the processor.
  • when the text classification program based on the neural network model is executed by the processor, the following steps are implemented: collecting text data and performing preprocessing operations on the text data to obtain preprocessed text data;
  • converting the preprocessed text data into text vectors;
  • using the BP neural network classification model optimized by a decision tree to perform feature selection on the text vectors to obtain initial text features;
  • according to the initial text features obtained above, training the BP neural network classification model with the stochastic gradient descent algorithm and the fine-tuning method until the best text features are obtained; and, according to the best text features, using a classifier to classify the text data and outputting the classification result of the text data.
  • the present application also provides a computer-readable storage medium on which a text classification program based on a neural network model is stored, and the text classification program based on a neural network model can be executed by one or more processors to implement the steps of the text classification method based on the neural network model as described above.
  • the text classification method, device, and computer-readable storage medium based on a neural network model proposed in this application use the BP neural network classification model optimized by decision trees to perform feature selection on text data to obtain initial text features, train the BP neural network classification model with the stochastic gradient descent algorithm and the fine-tuning method to obtain the best text features, and use a classifier to classify the text data according to the best text features.
  • This application obtains the most representative text features of the text data by training the BP neural network classification model. Classifying text based on these features overcomes shortcomings of traditional text classification methods such as low classification accuracy; therefore, this application can achieve rapid and accurate text classification.
  • FIG. 1 is a schematic flowchart of a text classification method based on a neural network model provided by an embodiment of the application;
  • FIG. 2 is a schematic diagram of the internal structure of a text classification device based on a neural network model provided by an embodiment of the application;
  • FIG. 3 is a schematic diagram of modules of a text classification program based on a neural network model in a text classification device based on a neural network model provided by an embodiment of the application.
  • This application provides a text classification method based on a neural network model.
  • FIG. 1 is a schematic flowchart of a text classification method based on a neural network model provided by an embodiment of this application.
  • the method can be executed by a device, and the device can be implemented by software and/or hardware.
  • the text classification method based on the neural network model includes:
  • S1: Collect text data, perform a preprocessing operation on the text data to obtain preprocessed text data, and convert the preprocessed text data into a text vector.
  • the preferred embodiment of the present application can collect the text data from the Internet, such as a news website, a shopping website, a paper database, or various forums.
  • the embodiment of the present application performs preprocessing operations including word segmentation, stop word removal, feature weight calculation, and deduplication on the text data.
  • the word segmentation method described in the embodiment of the present application includes matching the text data with entries in a pre-built dictionary according to a predetermined strategy to obtain the words in the text data.
  • the selected method for removing stop words is stop word list filtering, i.e. matching the words in the text data against a constructed stop word list; if the match succeeds, the word is a stop word and is deleted.
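For illustration, the following minimal sketch implements dictionary-based segmentation (here, a simple forward-maximum-matching strategy) followed by stop-word-list filtering. The toy dictionary, stop-word list, sample sentence, and the choice of forward maximum matching are assumptions; the application only requires matching against a pre-built dictionary under a predetermined strategy.

```python
# Hypothetical dictionary and stop-word list, standing in for the pre-built resources.
DICTIONARY = {"neural", "network", "neural network", "text", "classification", "model"}
STOP_WORDS = {"the", "a", "of", "and", "for"}
MAX_ENTRY_LEN = 2  # longest dictionary entry, counted in tokens

def segment(tokens, dictionary, max_len=MAX_ENTRY_LEN):
    """Forward maximum matching: greedily take the longest dictionary entry."""
    words, i = [], 0
    while i < len(tokens):
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size])
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Delete any word that matches an entry of the stop-word list."""
    return [w for w in words if w.lower() not in stop_words]

raw = "the neural network model for text classification".split()
print(remove_stop_words(segment(raw, DICTIONARY)))
# -> ['neural network', 'model', 'text', 'classification']
```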
  • the text data is represented by a series of feature words (keywords); however, data in this textual form cannot be processed directly by the classification algorithm and must be converted into numerical form, so the weight of each feature word is calculated to represent its importance in the text.
  • the embodiment of the application uses the TF-IDF algorithm to perform feature word calculation.
  • the TF-IDF algorithm here uses statistical information, word vector information, and inter-word dependency syntax information, builds a dependency graph to calculate the correlation strength between words, and uses the TextRank algorithm to iteratively calculate the importance score of each word.
  • len(Wi, Wj) represents the length of the dependency path between the words Wi and Wj;
  • b is a hyperparameter;
  • tfidf(W) is the TF-IDF value of the word W;
  • d is the Euclidean distance between the word vectors of the words Wi and Wj.
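A minimal sketch of the base TF-IDF weighting is given below; it covers only the basic term-frequency times inverse-document-frequency value and not the dependency-graph or TextRank extensions described above. The toy corpus is invented for illustration.

```python
import math
from collections import Counter

# Toy corpus standing in for the segmented, stop-word-filtered text data.
docs = [
    ["neural", "network", "text", "classification"],
    ["text", "data", "classification", "model"],
    ["random", "forest", "text", "model"],
]

def tfidf(docs):
    """Return one {word: tf-idf weight} dict per document."""
    n_docs = len(docs)
    df = Counter(word for doc in docs for word in set(doc))  # document frequency
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            w: (tf[w] / len(doc)) * math.log(n_docs / df[w])  # tf * idf
            for w in tf
        })
    return weights

for per_doc in tfidf(docs):
    print(per_doc)
```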
  • the Euclidean distance method is first used to de-duplicate the text before the text is classified.
  • the formula is as follows:
  • w1j and w2j are the j-th components of the two text data vectors, respectively.
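The deduplication step can be sketched as below: pairwise Euclidean distances between the numerical text vectors are compared against a preset threshold, and one member of each near-duplicate pair is dropped. The toy vectors and the threshold value are assumptions for illustration.

```python
import numpy as np

def deduplicate(vectors, threshold=0.5):
    """Keep one representative of every pair whose Euclidean distance is below threshold."""
    keep = []
    for i, v in enumerate(vectors):
        duplicate = any(np.linalg.norm(v - vectors[j]) < threshold for j in keep)
        if not duplicate:
            keep.append(i)
    return keep

texts = np.array([
    [0.9, 0.1, 0.0],
    [0.88, 0.12, 0.01],   # near-duplicate of the first text vector
    [0.1, 0.8, 0.3],
])
print(deduplicate(texts))   # -> [0, 2]
```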
  • a preferred embodiment of the present application further includes a text hierarchical encoder using a zoom neural network to perform encoding processing on the preprocessed text data to obtain an encoded text vector.
  • the text hierarchical encoder has three layers, namely a text embedding layer and two bi-LSTM layers, wherein the text embedding layer initializes the words with word2vec to obtain word vectors, the first bi-LSTM layer receives the word vectors as input and generates sentence vectors, and the second bi-LSTM layer receives the sentence vectors as input and generates paragraph vectors.
  • after the first bi-LSTM layer takes each word as input, it outputs a hidden state vector at each time step; a maximum pooling operation is then used to obtain a fixed-length sentence vector, and all sentence vectors are taken together as the sentence component of the hierarchical memory.
  • this application uses a similar approach, using the second bi-LSTM layer and the maximum pooling operation to convert sentence components into paragraph vectors.
  • through hierarchical encoding, this application assigns a vector representation (hierarchical distributed memory) to each language unit at each level and retains the boundary information of its sentence and segment divisions, from which text vectors including word vectors, sentence vectors, and paragraph vectors are obtained.
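A rough PyTorch sketch of the three-layer hierarchical encoder (embedding layer plus two bi-LSTM layers with max pooling) follows. The dimensions, the randomly initialised embedding standing in for word2vec, and the exact pooling arrangement are assumptions for illustration only, not the exact architecture of the application.

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Text embedding layer + two bi-LSTM layers, pooled into sentence/paragraph vectors."""
    def __init__(self, vocab_size=1000, emb_dim=100, hidden=64):
        super().__init__()
        # In the described scheme the embeddings would be initialised from word2vec.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.sent_lstm = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)

    def forward(self, doc_word_ids):
        # doc_word_ids: (num_sentences, words_per_sentence) tensor of word ids
        words = self.embed(doc_word_ids)                 # (S, W, E)
        word_states, _ = self.word_lstm(words)           # hidden state at each time step
        sent_vecs = word_states.max(dim=1).values        # max pooling -> sentence vectors
        sent_states, _ = self.sent_lstm(sent_vecs.unsqueeze(0))
        para_vec = sent_states.max(dim=1).values         # max pooling -> paragraph vector
        return sent_vecs, para_vec.squeeze(0)

doc = torch.randint(0, 1000, (4, 12))                   # 4 sentences, 12 word ids each
sentences, paragraph = HierarchicalEncoder()(doc)
print(sentences.shape, paragraph.shape)                 # torch.Size([4, 128]) torch.Size([128])
```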
  • this application uses a BP-neural-network-based method for feature selection and takes the sensitivity δ of the state Y to changes in the feature X as the measure for evaluating text features, namely the partial derivative of Y with respect to X.
  • the BP neural network is a multi-layer feedforward neural network.
  • the main characteristics of the network are signal forward transmission and error back propagation.
  • in forward transmission, the input signal is processed layer by layer from the input layer through the hidden layer to the output layer.
  • the neuron state of each layer only affects the neuron state of the next layer. If the output layer does not produce the expected output, the network switches to back propagation and adjusts the network weights and thresholds according to the prediction error, so that the predicted output of the network keeps approaching the expected output.
  • the BP neural network described in this application includes the following structure:
  • Input layer: the only data entry point of the entire neural network.
  • the number of neuron nodes in the input layer equals the dimension of the numerical text vector.
  • the value of each neuron corresponds to one component of that numerical vector;
  • Hidden layer: mainly used to perform non-linear processing on the data coming from the input layer. Non-linear fitting of the input data through the activation function effectively ensures the predictive ability of the model;
  • Output layer: following the hidden layer, it is the only output of the entire model. The number of neuron nodes in the output layer equals the number of text categories.
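A minimal NumPy sketch of the three-layer structure just described is given below: the input layer width matches the text-vector dimension, the output layer width matches the number of categories, and the hidden layer output O_q and output-layer values y_j are produced by a sigmoid activation. All sizes and the choice of sigmoid are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class BPNetwork:
    """Forward pass of a 3-layer BP network: input -> hidden -> output."""
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (n_in, n_hidden))   # input-to-hidden weights
        self.b1 = np.zeros(n_hidden)                     # hidden thresholds (theta_q)
        self.W2 = rng.normal(0, 0.1, (n_hidden, n_out))  # hidden-to-output weights
        self.b2 = np.zeros(n_out)

    def forward(self, x):
        o_hidden = sigmoid(x @ self.W1 + self.b1)        # hidden-layer outputs O_q
        y = sigmoid(o_hidden @ self.W2 + self.b2)        # output-layer outputs y_j
        return y

text_vector = np.random.rand(50)                        # numerical text vector (n = 50)
net = BPNetwork(n_in=50, n_hidden=8, n_out=3)           # 3 text categories (m = 3)
print(net.forward(text_vector))
```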
  • this application uses a decision tree to optimize the BP neural network.
  • the length of the longest rule chain of the decision tree is taken as the number of hidden layer nodes of the BP neural network to optimize the structure of the neural network, that is, the depth of the decision tree is taken as the number of hidden layer nodes of the BP neural network.
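The decision-tree optimisation can be sketched as follows: a tree is fitted to the text vectors and its depth (the length of the longest rule chain) is read off as the number of hidden-layer nodes. The use of scikit-learn's DecisionTreeClassifier and the random toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy text vectors (rows = documents) and category labels.
X = np.random.rand(200, 50)
y = np.random.randint(0, 3, size=200)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
n_hidden_nodes = tree.get_depth()    # depth of the longest rule chain -> hidden layer size
print("hidden layer nodes:", n_hidden_nodes)
```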
  • the preferred embodiment of this application constructs a 3-layer BP neural network, where n units in the input layer correspond to n feature parameters, and m units in the output layer correspond to m pattern classifications.
  • the number of units in the middle hidden layer is q; one set of weights represents the connections between input layer unit i and hidden layer unit q, another set represents the connections between hidden layer unit q and output layer unit j, and θq is the threshold of each hidden layer unit. The output Oq of the q-th hidden layer unit is obtained by passing the weighted sum of its inputs, offset by θq, through the activation function;
  • the output yj of the j-th unit of the output layer is obtained in the same way from the hidden layer outputs Oq.
  • the sensitivity δij, and the corresponding quantity δkj for the difference between the text features Xi and Xk, are determined by the chain rule for the partial derivatives of the composite function.
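One way to read the sensitivity criterion is as the gradient of the network output with respect to each input feature; the autograd-based sketch below scores features that way and keeps the highest-scoring ones. The toy network, the random data, and the use of a mean absolute gradient are assumptions, not the exact computation of the application.

```python
import torch
import torch.nn as nn

# A stand-in feed-forward classifier; in practice this would be the trained BP model.
net = nn.Sequential(nn.Linear(50, 16), nn.Sigmoid(), nn.Linear(16, 3))

x = torch.rand(200, 50, requires_grad=True)     # 200 text vectors with 50 features each
net(x).sum().backward()                         # back-propagate to the inputs

sensitivity = x.grad.abs().mean(dim=0)          # delta: average |dY/dX| per feature
top_features = torch.topk(sensitivity, k=10).indices
print("selected feature indices:", top_features.tolist())
```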
  • the fine-tuning method extracts the shallow features of an available neural network, modifies the parameters of the deeper layers, and builds a new neural network model, reducing the number of iterations so that the best BP neural network classification model is obtained more quickly.
  • the process of training the BP neural network classification model is as follows:
  • the loss function is used to evaluate the difference between the predicted value output by the network model and the true value Y. A non-negative real-valued function L(θ) is used to represent the loss; the smaller the loss value, the better the performance of the network model. In the loss expression:
  • m is the number of text data samples;
  • hθ(x(i)) is the predicted value for the i-th text sample;
  • y(i) is the true value of the i-th text sample.
  • for a neuron node, when the input value is below 0 the output is limited (suppressed); when the input rises above the threshold, the independent variable of the function has a linear relationship with the dependent variable.
  • here x represents the accumulated value of the back-propagated gradient and of the descending gradient.
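A small sketch of the two quantities just described is given below: a mean-squared-error style loss over the m text samples and a ReLU-style activation that is clipped below zero and linear above it. The exact loss expression is not reproduced in the text, so this form is an assumption consistent with the listed symbols.

```python
import numpy as np

def loss(h, y):
    """Non-negative loss over m samples; smaller means a better-performing model (assumed MSE form)."""
    m = len(y)
    return np.sum((h - y) ** 2) / (2 * m)

def relu(x):
    """Output limited to 0 below the threshold, linear in the input above it."""
    return np.maximum(0.0, x)

h = np.array([0.9, 0.2, 0.7])    # predicted values h_theta(x_i)
y = np.array([1.0, 0.0, 1.0])    # true values y_i
print(loss(h, y), relu(np.array([-1.5, 0.3, 2.0])))
```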
  • the gradient descent algorithm is the most commonly used optimization algorithm for training neural network models. To find the minimum of the loss function L, the variable y is updated in the direction opposite to the gradient, i.e. along -dL/dy, so that the loss decreases fastest until it converges to the minimum.
  • each time a batch of batch-size samples is input, the learning rate is reduced as the gradient decreases; and after each epoch, the decay rate is increased in accordance with the reduction of the learning rate.
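The stochastic gradient descent loop with a decaying learning rate can be sketched as below on a simple least-squares problem; the decay schedule, batch size, and synthetic data are illustrative assumptions rather than the application's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.01 * rng.normal(size=500)

w = np.zeros(10)
lr, decay, batch_size = 0.1, 0.05, 32

for epoch in range(20):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        grad = X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)   # dL/dw on the mini-batch
        w -= lr * grad                         # step opposite to the gradient
    lr *= 1.0 / (1.0 + decay * (epoch + 1))    # reduce the learning rate after each epoch

print("final squared error:", float(np.mean((X @ w - y) ** 2)))
```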
  • during fine-tuning, this application first adjusts the parameters in the network layers, deletes the FC (fully connected) layer and adjusts the learning rate; because the last layer is relearned, it needs a faster learning rate than the other layers.
  • the learning rates of the weights and biases of this layer are increased by a factor of 10, and the learning strategy is left unchanged.
  • the solver parameters are modified to match the reduced size of the text data: the step size is changed from 100,000 to 20,000 and the maximum number of iterations is reduced accordingly, so that the optimized BP neural network classification model can be obtained with fewer iterations, and the optimized model is then used to obtain the best text features.
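The per-layer learning-rate adjustment described for fine-tuning (a replaced last layer trained roughly 10x faster than the reused layers) might look like the PyTorch sketch below; the network shape, the replacement of the final fully connected layer, and the base learning rate are all assumptions for illustration.

```python
import torch.nn as nn
import torch.optim as optim

# A stand-in pretrained network: two reused layers plus a final FC layer.
net = nn.Sequential(
    nn.Linear(50, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 3),          # this last FC layer is removed and relearned
)
net[-1] = nn.Linear(16, 5)     # new output layer for the new category set

base_lr = 0.001
optimizer = optim.SGD(
    [
        {"params": [p for layer in net[:-1] for p in layer.parameters()], "lr": base_lr},
        {"params": net[-1].parameters(), "lr": 10 * base_lr},   # 10x faster last layer
    ],
    momentum=0.9,
)
print(optimizer)
```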
  • a preferred embodiment of the present application uses a random forest algorithm as a classifier to classify the collected text data according to the best text characteristics.
  • the random forest algorithm uses the bagging algorithm with replacement sampling to extract multiple sample subsets from the original samples and trains a decision tree model on each subset; during training, the random feature subspace method extracts a subset of features from the feature set for splitting each decision tree. Finally, the multiple decision trees are integrated into an ensemble classifier, and this ensemble classifier is called a random forest.
  • the algorithm process can be divided into three parts, the generation of the sub-sample set, the construction of the decision tree, and the voting results. The specific process is as follows:
  • Random forest is an ensemble classifier. For each base classifier, a certain sample subset needs to be generated as the input variable of the base classifier.
  • in this embodiment, the text data is divided by cross-validation.
  • cross-validation divides the original text into k sub-text data sets according to the number of pages; in each training round, one of the sub-text data sets is used as the test set, the remaining sub-text data sets are used as the training set, and k such rotations are performed.
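The k-fold rotation used to build per-tree training subsets can be sketched with scikit-learn's KFold as below; splitting "by the number of pages" is approximated here by splitting the document indices evenly, which is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

documents = np.arange(20)          # indices of the collected text data
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

for round_id, (train_idx, test_idx) in enumerate(kfold.split(documents)):
    # Each rotation: one fold is the test set, the remaining folds form the training set.
    print(f"round {round_id}: train={train_idx.tolist()} test={test_idx.tolist()}")
```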
  • each base classifier is an independent decision tree.
  • the most important thing in the construction of the decision tree is the split rule.
  • the split rule tries to find an optimal feature to divide the sample to improve the accuracy of the final classification.
  • the decision tree of the random forest is constructed in basically the same way as an ordinary decision tree; the difference is that when a decision tree in the random forest is split, it does not search the entire feature set but randomly selects k features for the division.
  • the sub-text features obtained above are used as the sub-nodes of the decision tree, and the lower nodes are the respective extracted features.
  • Voting produces results.
  • the classification result of the random forest is obtained by voting among the base classifiers, i.e. the decision trees. The random forest treats all base classifiers equally: each decision tree produces a classification result, the text classification results of all decision trees are collected and summed, and the result with the highest number of votes is the final text classification result, so that the text is effectively classified.
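A compact end-to-end sketch of this voting stage follows, using scikit-learn's RandomForestClassifier (bagging plus a random feature subset at each split, with the majority vote across trees as the prediction); the toy feature matrix and parameter values are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy "best text features" (rows = documents) and their category labels.
X = np.random.rand(300, 20)
y = np.random.randint(0, 3, size=300)

forest = RandomForestClassifier(
    n_estimators=50,      # number of decision-tree base classifiers
    max_features="sqrt",  # random feature subset considered at each split
    bootstrap=True,       # bagging: sample subsets drawn with replacement
    random_state=0,
).fit(X, y)

new_docs = np.random.rand(5, 20)
print(forest.predict(new_docs))   # majority vote over all trees, per document
```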
  • This application also provides a text classification device based on the neural network model.
  • FIG. 2 is a schematic diagram of the internal structure of a text classification device based on a neural network model provided by an embodiment of this application.
  • the text classification device 1 based on the neural network model may be a PC (personal computer), or a terminal device such as a smart phone, a tablet computer, or a portable computer.
  • the text classification device 1 based on the neural network model at least includes a memory 11, a processor 12, a communication bus 13, and a network interface 14.
  • the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, and the like.
  • the memory 11 may be an internal storage unit of the text classification device 1 based on the neural network model, for example, the hard disk of the text classification device 1 based on the neural network model.
  • the memory 11 may also be an external storage device of the text classification device 1 based on a neural network model, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card equipped on the text classification device 1.
  • the memory 11 may also include both an internal storage unit of the text classification apparatus 1 based on a neural network model and an external storage device.
  • the memory 11 can be used not only to store the application software installed in the text classification device 1 based on the neural network model and various data, such as the code of the text classification program 01 based on the neural network model, but also to temporarily store data that has been output or is to be output.
  • the processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, and is used to run the program code or process the data stored in the memory 11, for example, to execute the text classification program 01 based on the neural network model.
  • the communication bus 13 is used to realize the connection and communication between these components.
  • the network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is usually used to establish a communication connection between the apparatus 1 and other electronic devices.
  • the device 1 may also include a user interface.
  • the user interface may include a display (Display) and an input unit such as a keyboard (Keyboard).
  • the optional user interface may also include a standard wired interface and a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, etc.
  • the display can also be appropriately called a display screen or a display unit, which is used to display the information processed in the text classification device 1 based on the neural network model and to display a visualized user interface.
  • Fig. 2 only shows the text classification device 1 based on a neural network model with components 11-14 and the text classification program 01 based on a neural network model. Those skilled in the art can understand that the structure shown in Fig. 2 does not constitute a limitation of the text classification device 1 based on the neural network model, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
  • the memory 11 stores a text classification program 01 based on a neural network model; the processor 12 implements the following steps when executing the text classification program 01 based on a neural network model stored in the memory 11:
  • Step 1: Collect text data, perform pre-processing operations on the text data to obtain pre-processed text data, and convert the pre-processed text data into text vectors.
  • the preferred embodiment of the present application can collect the text data from the Internet, such as a news website, a shopping website, a paper database, or various forums.
  • the embodiment of the present application performs preprocessing operations including word segmentation, stop word removal, feature weight calculation, and deduplication on the text data.
  • the word segmentation method described in the embodiment of the present application includes matching the text data with entries in a pre-built dictionary according to a predetermined strategy to obtain the words in the text data.
  • the selected method for removing stop words is stop word list filtering, i.e. matching the words in the text data against a constructed stop word list; if the match succeeds, the word is a stop word and is deleted.
  • the text data is represented by a series of feature words (keywords); however, data in this textual form cannot be processed directly by the classification algorithm and must be converted into numerical form, so the weight of each feature word is calculated to represent its importance in the text.
  • the embodiment of the application uses the TF-IDF algorithm to perform feature word calculation.
  • the TF-IDF algorithm here uses statistical information, word vector information, and inter-word dependency syntax information, builds a dependency graph to calculate the correlation strength between words, and uses the TextRank algorithm to iteratively calculate the importance score of each word.
  • len(Wi, Wj) represents the length of the dependency path between the words Wi and Wj;
  • b is a hyperparameter;
  • tfidf(W) is the TF-IDF value of the word W;
  • d is the Euclidean distance between the word vectors of the words Wi and Wj.
  • the Euclidean distance method is first used to de-duplicate the text before the text is classified.
  • the formula is as follows:
  • w1j and w2j are the j-th components of the two text data vectors, respectively.
  • a preferred embodiment of the present application further includes a text hierarchical encoder using a zoom neural network to perform encoding processing on the preprocessed text data to obtain an encoded text vector.
  • the text hierarchical encoder has three layers, namely a text embedding layer and two bi-LSTM layers, wherein the text embedding layer initializes the words with word2vec to obtain word vectors, the first bi-LSTM layer receives the word vectors as input and generates sentence vectors, and the second bi-LSTM layer receives the sentence vectors as input and generates paragraph vectors.
  • after the first bi-LSTM layer takes each word as input, it outputs a hidden state vector at each time step; a maximum pooling operation is then used to obtain a fixed-length sentence vector, and all sentence vectors are taken together as the sentence component of the hierarchical memory.
  • this application uses a similar approach, using the second bi-LSTM layer and the maximum pooling operation to convert sentence components into paragraph vectors.
  • through hierarchical encoding, this application assigns a vector representation (hierarchical distributed memory) to each language unit at each level and retains the boundary information of its sentence and segment divisions, from which text vectors including word vectors, sentence vectors, and paragraph vectors are obtained.
  • Step 2: Use the BP neural network classification model optimized by the decision tree to perform feature selection on the text vector, so as to obtain the text features.
  • this application uses a BP-neural-network-based method for feature selection and takes the sensitivity δ of the state Y to changes in the feature X as the measure for evaluating text features, namely the partial derivative of Y with respect to X.
  • the BP neural network is a multi-layer feedforward neural network.
  • the main characteristics of the network are signal forward transmission and error back propagation.
  • in forward transmission, the input signal is processed layer by layer from the input layer through the hidden layer to the output layer.
  • the neuron state of each layer only affects the neuron state of the next layer. If the output layer does not produce the expected output, the network switches to back propagation and adjusts the network weights and thresholds according to the prediction error, so that the predicted output of the network keeps approaching the expected output.
  • the BP neural network described in this application includes the following structure:
  • Input layer: the only data entry point of the entire neural network.
  • the number of neuron nodes in the input layer equals the dimension of the numerical text vector.
  • the value of each neuron corresponds to one component of that numerical vector;
  • Hidden layer: mainly used to perform non-linear processing on the data coming from the input layer. Non-linear fitting of the input data through the activation function effectively ensures the predictive ability of the model;
  • Output layer: following the hidden layer, it is the only output of the entire model. The number of neuron nodes in the output layer equals the number of text categories.
  • this application uses a decision tree to optimize the BP neural network.
  • the length of the longest rule chain of the decision tree is taken as the number of hidden layer nodes of the BP neural network to optimize the structure of the neural network, that is, the depth of the decision tree is taken as the number of hidden layer nodes of the BP neural network.
  • the preferred embodiment of this application constructs a 3-layer BP neural network, where n units in the input layer correspond to n feature parameters, and m units in the output layer correspond to m pattern classifications.
  • the number of units in the middle hidden layer is q; one set of weights represents the connections between input layer unit i and hidden layer unit q, another set represents the connections between hidden layer unit q and output layer unit j, and θq is the threshold of each hidden layer unit. The output Oq of the q-th hidden layer unit is obtained by passing the weighted sum of its inputs, offset by θq, through the activation function;
  • the output yj of the j-th unit of the output layer is obtained in the same way from the hidden layer outputs Oq;
  • the sensitivity δij, and the corresponding quantity δkj for the difference between the text features Xi and Xk, are determined by the chain rule for the partial derivatives of the composite function.
  • Step 3: Use the stochastic gradient descent algorithm and the fine-tuning method to train the BP neural network classification model until the best text features are obtained; then, according to the best text features, use the classifier to classify the text data and output the classification result of the target text.
  • the fine-tuning method extracts the shallow features of an available neural network, modifies the parameters of the deeper layers, and builds a new neural network model, reducing the number of iterations so that the best BP neural network classification model is obtained more quickly.
  • the process of training the BP neural network classification model is as follows:
  • the loss function is used to evaluate the difference between the predicted value output by the network model and the true value Y. A non-negative real-valued function L(θ) is used to represent the loss; the smaller the loss value, the better the performance of the network model. In the loss expression:
  • m is the number of text data samples;
  • hθ(x(i)) is the predicted value for the i-th text sample;
  • y(i) is the true value of the i-th text sample.
  • for a neuron node, when the input value is below 0 the output is limited (suppressed); when the input rises above the threshold, the independent variable of the function has a linear relationship with the dependent variable.
  • here x represents the accumulated value of the back-propagated gradient and of the descending gradient.
  • the gradient descent algorithm is the most commonly used optimization algorithm for training neural network models. To find the minimum of the loss function L, the variable y is updated in the direction opposite to the gradient, i.e. along -dL/dy, so that the loss decreases fastest until it converges to the minimum.
  • each time a batch of batch-size samples is input, the learning rate is reduced as the gradient decreases; and after each epoch, the decay rate is increased in accordance with the reduction of the learning rate.
  • during fine-tuning, this application first adjusts the parameters in the network layers, deletes the FC (fully connected) layer and adjusts the learning rate; because the last layer is relearned, it needs a faster learning rate than the other layers.
  • the learning rates of the weights and biases of this layer are increased by a factor of 10, and the learning strategy is left unchanged.
  • the solver parameters are modified to match the reduced size of the text data: the step size is changed from 100,000 to 20,000 and the maximum number of iterations is reduced accordingly, so that the optimized BP neural network classification model can be obtained with fewer iterations, and the optimized model is then used to obtain the best text features.
  • a preferred embodiment of the present application uses a random forest algorithm as a classifier, and performs text classification on the collected text data according to the best text feature.
  • the random forest algorithm uses the bagging algorithm with replacement sampling to extract multiple sample subsets from the original samples and trains a decision tree model on each subset; during training, the random feature subspace method extracts a subset of features from the feature set for splitting each decision tree. Finally, the multiple decision trees are integrated into an ensemble classifier, and this ensemble classifier is called a random forest.
  • the algorithm process can be divided into three parts, the generation of the sub-sample set, the construction of the decision tree, and the voting results. The specific process is as follows:
  • Random forest is an ensemble classifier. For each base classifier, a certain sample subset needs to be generated as the input variable of the base classifier.
  • in this embodiment, the text data is divided by cross-validation.
  • cross-validation divides the original text into k sub-text data sets according to the number of pages; in each training round, one of the sub-text data sets is used as the test set, the remaining sub-text data sets are used as the training set, and k such rotations are performed.
  • each base classifier is an independent decision tree.
  • the most important thing in the construction of the decision tree is the split rule.
  • the split rule tries to find an optimal feature to divide the sample to improve the accuracy of the final classification.
  • the decision tree of the random forest is constructed in basically the same way as an ordinary decision tree; the difference is that when a decision tree in the random forest is split, it does not search the entire feature set but randomly selects k features for the division.
  • the sub-text features obtained above are used as the sub-nodes of the decision tree, and the lower nodes are the respective extracted features.
  • Voting produces results.
  • the classification result of the random forest is obtained by voting among the base classifiers, i.e. the decision trees. The random forest treats all base classifiers equally: each decision tree produces a classification result, the text classification results of all decision trees are collected and summed, and the result with the highest number of votes is the final text classification result, so that the text is effectively classified.
  • the text classification program based on the neural network model can also be divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (in this embodiment, by the processor 12) to complete this application.
  • a module referred to in this application is a series of computer program instruction segments that can complete specific functions and is used to describe the execution process of the text classification program based on the neural network model in the text classification device based on the neural network model.
  • FIG. 3 is a schematic diagram of the program modules of the text classification program based on a neural network model in an embodiment of the text classification device based on a neural network model of this application.
  • the text classification program based on the neural network model can be divided into a sample collection module 10, a feature extraction module 20, and a text classification module 30.
  • the sample collection module 10 is used to collect text data, perform preprocessing operations on the text data, obtain preprocessed text data, and convert the preprocessed text data into text vectors.
  • the preprocessing operation on the text data includes:
  • said converting the text data into a text vector includes:
  • the text hierarchical encoder of the zoom neural network is used to encode the preprocessed text data to obtain the encoded text vector, wherein the text hierarchical encoder includes a text embedding layer and two bi-LSTM layers.
  • the text embedding layer initializes the words with word2vec to obtain word vectors.
  • the first bi-LSTM layer receives word vectors as input and generates sentence vectors, and the second bi-LSTM layer receives sentence vectors as input and generates paragraph vectors.
  • the feature extraction module 20 is configured to: use a BP neural network classification model optimized based on a decision tree to perform feature selection on the text vector to obtain an initial text feature.
  • using the BP neural network classification model optimized based on a decision tree to perform feature selection on the text vector to obtain text features includes:
  • constructing a BP neural network whose n input layer units correspond to n feature parameters;
  • whose m output layer units correspond to m pattern classifications;
  • and computing the output yj of the j-th output layer unit from the hidden layer outputs.
  • the text classification module 30 is configured to train the BP neural network classification model with the stochastic gradient descent algorithm and the fine-tuning method according to the initial text features obtained above until the best text features are obtained, and, according to the best text features, to classify the text data with a classifier and output the classification result of the text data.
  • the classifier is a random forest classifier
  • the using a classifier to classify text data includes:
  • cross-validation divides the original text data into k sub-text data sets according to the number of pages;
  • in each round, one of the sub-text data sets is used as the test set, the rest of the sub-text data sets are used as the training set, and k rotations are performed;
  • the text classification results of all decision trees are collected for cumulative summation, and the result with the highest number of votes is the final text classification result.
  • an embodiment of the present application also proposes a computer-readable storage medium that stores a text classification program based on a neural network model, and the text classification program based on a neural network model can be executed by one or more processors to implement the following operations:
  • the text data is classified by a classifier, and the classification result of the text data is output.

Abstract

The present application relates to the technical field of artificial intelligence. Disclosed is a text classification method based on a neural network model. The method comprises: collecting text data, and performing a pre-processing operation on the text data to obtain pre-processed text data; converting the pre-processed text data into a text vector; using a BP neural network classification model based on decision tree optimization to perform feature selection on the text vector, so as to obtain an initial text feature; according to the obtained initial text feature, using a stochastic gradient descent algorithm and a fine-tuning method to train the BP neural network classification model until the best text feature is obtained; and according to the best text feature, using a classifier to classify the text data, and outputting a classification result of the text data. Further provided are a text classification apparatus based on a neural network model, and a computer-readable storage medium. The present application can realize the precise classification of text data.

Description

Text data classification method, apparatus and storage medium based on a neural network model
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 17, 2019, with application number 201910885586.7 and the invention title "Text data classification method, apparatus and storage medium based on a neural network model", the entire contents of which are incorporated into this application by reference.
Technical field
This application relates to the field of artificial intelligence technology, and in particular to a text data classification method, apparatus and computer-readable storage medium based on a neural network model.
Background
With the rapid development of network technology, the requirements for effectively organizing and managing electronic text information and for finding relevant information from it quickly, accurately and comprehensively are becoming ever higher. As a key technology for processing and organizing large amounts of text data, text classification solves the problem of information clutter to a large extent and helps users obtain the required information accurately; it is the technical foundation of fields such as information filtering, information retrieval, search engines and text databases.
The prior art mainly constructs a text classification model based on word-frequency features in the text and then classifies the text to be classified based on the constructed text classification model. However, because the word frequency in the text cannot effectively reflect the text category, the prior art usually has the problem of inaccurate text classification.
Summary of the invention
This application provides a text classification method, apparatus and computer-readable storage medium based on a neural network model, the main purpose of which is to provide an accurate classification scheme for text data.
To achieve the above purpose, a text classification method based on a neural network model provided by this application includes: collecting text data and performing preprocessing operations on the text data to obtain preprocessed text data; converting the preprocessed text data into text vectors; using a BP neural network classification model optimized by a decision tree to perform feature selection on the text vectors to obtain initial text features; training the BP neural network classification model with the stochastic gradient descent algorithm and the fine-tuning method according to the initial text features obtained above, until the best text features are obtained; and, according to the best text features, classifying the text data with a classifier and outputting the classification result of the text data.
In addition, to achieve the above purpose, this application also provides a text classification apparatus based on a neural network model. The apparatus includes a memory and a processor, and the memory stores a text classification program based on a neural network model that can be run on the processor. When the program is executed by the processor, the following steps are implemented: collecting text data and performing preprocessing operations on the text data to obtain preprocessed text data; converting the preprocessed text data into text vectors; using a BP neural network classification model optimized by a decision tree to perform feature selection on the text vectors to obtain initial text features; training the BP neural network classification model with the stochastic gradient descent algorithm and the fine-tuning method according to the initial text features obtained above, until the best text features are obtained; and, according to the best text features, classifying the text data with a classifier and outputting the classification result of the text data.
In addition, to achieve the above purpose, this application also provides a computer-readable storage medium on which a text classification program based on a neural network model is stored, and the program can be executed by one or more processors to implement the steps of the text classification method based on the neural network model as described above.
The text classification method, apparatus and computer-readable storage medium based on a neural network model proposed in this application use a BP neural network classification model optimized by decision trees to perform feature selection on text data to obtain initial text features, train the BP neural network classification model with the stochastic gradient descent algorithm and the fine-tuning method to obtain the best text features, and classify the text data with a classifier according to the best text features. By training the BP neural network classification model, this application obtains the most representative text features of the text data; classifying text based on these features overcomes shortcomings of traditional text classification methods such as low classification accuracy, so this application can achieve rapid and accurate text classification.
Description of the drawings
FIG. 1 is a schematic flowchart of a text classification method based on a neural network model provided by an embodiment of this application;
FIG. 2 is a schematic diagram of the internal structure of a text classification device based on a neural network model provided by an embodiment of this application;
FIG. 3 is a schematic diagram of the modules of the text classification program based on a neural network model in a text classification device based on a neural network model provided by an embodiment of this application.
The realization, functional characteristics and advantages of the purpose of this application will be further described with reference to the embodiments and the accompanying drawings.
Detailed description
In order to make the purpose, technical solutions and advantages of this application clearer, this application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative work fall within the protection scope of this application.
The terms "first", "second", "third", "fourth", etc. (if any) in the description, claims and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way can be interchanged under appropriate circumstances, so that the embodiments described here can be implemented in an order other than that illustrated or described. In addition, descriptions such as "first" and "second" are only for descriptive purposes and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features; thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature.
Further, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion: for example, a process, method, system, product or device that includes a series of steps or units is not necessarily limited to the steps or units clearly listed, but may include other steps or units that are not clearly listed or that are inherent to the process, method, product or device.
In addition, the technical solutions of the various embodiments can be combined with each other, but only on the basis that a person of ordinary skill in the art can realize them; when a combination of technical solutions is contradictory or cannot be realized, it should be considered that such a combination does not exist and is not within the protection scope claimed by this application.
This application provides a text classification method based on a neural network model. Referring to FIG. 1, which is a schematic flowchart of a text classification method based on a neural network model provided by an embodiment of this application, the method can be executed by a device, and the device can be implemented by software and/or hardware.
In this embodiment, the text classification method based on the neural network model includes:
S1. Collect text data, perform preprocessing operations on the text data to obtain preprocessed text data, and convert the preprocessed text data into a text vector.
A preferred embodiment of this application can collect the text data from the Internet, for example from news websites, shopping websites, paper databases or various forums.
The text data is unstructured or semi-structured data and cannot be directly recognized by a classification algorithm. Therefore, the purpose of the preprocessing operation in the preferred embodiment of this application is to convert the text data into a vector space model: d_i = (w_1, w_2, ..., w_n), where w_j is the weight of the j-th feature item.
The embodiment of this application performs preprocessing operations on the text data including word segmentation, stop word removal, feature weight calculation and deduplication.
The word segmentation method described in the embodiment of this application includes matching the text data against entries in a pre-built dictionary according to a predetermined strategy to obtain the words in the text data.
In the embodiment of this application, the selected method for removing stop words is stop word list filtering, i.e. matching the words in the text data against a constructed stop word list; if the match succeeds, the word is a stop word and is deleted.
After word segmentation and stop word removal, the text data is represented by a series of feature words (keywords); however, data in this textual form cannot be processed directly by a classification algorithm and must be converted into numerical form, so the weight of each feature word is calculated to represent its importance in the text.
The embodiment of this application uses the TF-IDF algorithm for the feature word calculation. The TF-IDF algorithm here uses statistical information, word vector information and inter-word dependency syntax information, builds a dependency graph to calculate the correlation strength between words, and uses the TextRank algorithm to iteratively calculate the importance score of each word.
In detail, when calculating the weights of the feature words, this application first calculates the dependency association degree Dep(Wi, Wj) of any two words Wi and Wj, where len(Wi, Wj) denotes the length of the dependency path between the words Wi and Wj and b is a hyperparameter.
This application considers that the semantic similarity between two words alone cannot accurately measure their importance: only when at least one of the two words appears with high frequency in the text can the two words be shown to be important. Following the concept of universal gravitation, the word frequency is regarded as mass, the Euclidean distance between the two words' word vectors is regarded as distance, and the attraction between the two words is computed according to the law of gravitation. However, in the present text setting, using word frequency alone to measure the importance of a word is too one-sided, so this application introduces the IDF value and replaces the word frequency with the TF-IDF value, thereby taking more global information into account. This yields a new word-gravitation formula; the attraction between the text words Wi and Wj is
f_grav(Wi, Wj) = tfidf(Wi) * tfidf(Wj) / d^2
where tfidf(W) is the TF-IDF value of the word W and d is the Euclidean distance between the word vectors of Wi and Wj.
Therefore, the degree of association between the words Wi and Wj is:
weight(Wi, Wj) = Dep(Wi, Wj) * f_grav(Wi, Wj)
Finally, this application uses the TextRank algorithm to build an undirected graph G = (V, E), where V is the set of vertices and E is the set of edges, and the score WS(Wi) of a word Wi is calculated iteratively from the association weights of the set of vertices related to the vertex Wi, with η as the damping coefficient. This yields the feature weight WS(Wi), and each word is thus represented in numerical vector form.
Furthermore, because the collected text data comes from intricate sources, it may contain many duplicated texts. A large amount of duplicated data affects the classification accuracy, so in the embodiment of this application the Euclidean distance method is first used to de-duplicate the texts before classification, with the formula
d = sqrt( Σ_j (w1j - w2j)^2 )
where w1j and w2j are the j-th components of the two text data vectors, respectively. After the Euclidean distance between every two text data is calculated, a smaller Euclidean distance indicates more similar text data, and one of any two text data whose Euclidean distance is smaller than a preset threshold is deleted.
Further, a preferred embodiment of this application also uses the hierarchical text encoder of a zoom neural network to encode the preprocessed text data and obtain the encoded text vector.

In this embodiment, the hierarchical text encoder has three layers: a word embedding layer and two bi-LSTM layers. The word embedding layer initializes the words with word2vec to obtain word vectors; the first bi-LSTM layer receives the word vectors as input and generates sentence vectors, and the second bi-LSTM layer receives the sentence vectors as input and generates paragraph vectors.

In detail, the first bi-LSTM layer takes each word as input and outputs a hidden state vector at each time step; a max-pooling operation is then applied to obtain a fixed-length sentence vector, and all sentence vectors are used as the sentence components of the hierarchical memory. The formulas used are:

[Formula images PCTCN2019116931-appb-000006 and -000007: the bi-LSTM hidden-state computation over the input words and the max-pooling that produces the sentence vector]

where the inputs are the words of the sentence, the max-pooled output is a fixed-length sentence vector whose length is related to j, and R_s denotes the sentence vectors of the hierarchical memory.

Next, this application proceeds in a similar way, using the second bi-LSTM layer and a max-pooling operation to convert the sentence components into paragraph vectors.

Through this hierarchical encoding, this application assigns each language unit at each level a vector representation (hierarchical distributed memory) and preserves the boundary information of its sentence and paragraph segmentation, thereby obtaining a text vector that comprises word vectors, sentence vectors, and paragraph vectors.
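A compact sketch of such a two-level bi-LSTM encoder in Python with PyTorch. This is an illustration under assumptions, not the exact network of the embodiment: the embedding here is initialized randomly rather than from word2vec, all dimensions are placeholders, and max-pooling is taken over the time dimension as described above.

import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Word -> sentence -> paragraph encoder with two bi-LSTM layers (sketch)."""

    def __init__(self, vocab_size, emb_dim=128, hid_dim=128):
        super().__init__()
        # In the described embodiment the embeddings would be initialized from word2vec.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_lstm = nn.LSTM(emb_dim, hid_dim, bidirectional=True, batch_first=True)
        self.sent_lstm = nn.LSTM(2 * hid_dim, hid_dim, bidirectional=True, batch_first=True)

    def forward(self, paragraph):
        # paragraph: LongTensor of shape (num_sentences, max_words_per_sentence)
        word_emb = self.embed(paragraph)                   # (S, W, emb_dim)
        word_states, _ = self.word_lstm(word_emb)          # (S, W, 2*hid_dim)
        sent_vecs = word_states.max(dim=1).values          # max-pool over words -> (S, 2*hid_dim)
        sent_states, _ = self.sent_lstm(sent_vecs.unsqueeze(0))     # (1, S, 2*hid_dim)
        para_vec = sent_states.max(dim=1).values.squeeze(0)         # max-pool over sentences
        return sent_vecs, para_vec

# Usage sketch: encode a paragraph of 3 sentences, each padded to 10 word ids.
enc = HierarchicalEncoder(vocab_size=10000)
dummy = torch.randint(0, 10000, (3, 10))
sentence_vectors, paragraph_vector = enc(dummy)

The word embeddings, the per-sentence vectors, and the paragraph vector together play the role of the word, sentence, and paragraph components of the text vector described above.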
S2. Use the BP neural network classification model optimized by a decision tree to perform feature selection on the text vector and obtain the text features.
Since in many cases the number of features in the text data far exceeds the number of training samples, to simplify model training this application performs feature selection with a BP-neural-network-based method, using as the measure for evaluating a text feature the sensitivity δ of the state Y to a change in the feature X, i.e.:

[Formula images PCTCN2019116931-appb-000010 and -000011: the definition of the sensitivity δ as a partial derivative of Y with respect to X]
The BP neural network is a multi-layer feedforward neural network whose main characteristics are forward signal transmission and error back-propagation. In the forward pass, the input signal is processed layer by layer from the input layer through the hidden layer to the output layer, and the neuron states of each layer only affect the neuron states of the next layer. If the output layer does not produce the desired output, the network switches to back-propagation and adjusts the network weights and thresholds according to the prediction error, so that the predicted output keeps approaching the desired output.

The BP neural network of this application has the following structure:

Input layer: the only data entry point of the whole network. The number of neuron nodes in the input layer equals the dimensionality of the text's numerical vector, and the value of each neuron corresponds to the value of one component of that vector.

Hidden layer: mainly performs a non-linear transformation of the data coming from the input layer; fitting the input non-linearly on the basis of an activation function effectively guarantees the predictive ability of the model.

Output layer: follows the hidden layer and is the only output of the whole model. The number of neuron nodes in the output layer equals the number of text categories.
Because the structure of the BP neural network has a great influence on the classification results, a poorly designed network suffers from slow convergence, low training speed, and low classification accuracy. This application therefore uses a decision tree to optimize the BP neural network. In this embodiment, the length of the longest rule chain of the decision tree is taken as the number of hidden-layer nodes of the BP neural network, i.e., the depth of the decision tree is used as the number of hidden-layer nodes.
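As a brief illustration of this step, one could fit a decision tree on the training vectors and read off its depth to size the hidden layer. The sketch below uses scikit-learn, and the random data stands in for the text vectors and labels.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Placeholder data standing in for the text vectors and labels (X_train, y_train).
rng = np.random.default_rng(0)
X_train = rng.random((200, 50))
y_train = rng.integers(0, 3, 200)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

# The depth of the fitted tree (its longest rule chain) is taken as the
# number of hidden-layer nodes of the BP neural network.
hidden_nodes = tree.get_depth()
print("hidden layer size:", hidden_nodes)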
The preferred embodiment of this application constructs a 3-layer BP neural network, in which the n units of the input layer correspond to the n feature parameters and the m units of the output layer correspond to the m pattern classes; the number of units in the middle hidden layer is q. Let w_iq denote the connection weight between input-layer unit i and hidden-layer unit q, let v_qj denote the connection weight between hidden-layer unit q and output-layer unit j, and let θ_q be the threshold of each hidden-layer unit. The output O_q of the q-th hidden-layer unit is then:

O_q = f( Σ_i w_iq * x_i - θ_q )

The output y_j of the j-th output-layer unit is:

y_j = f( Σ_q v_qj * O_q - δ_j )

In the above formulas, δ_j is the threshold of each output-layer unit, j = 1, 2, ..., m.
Using the chain rule for partial derivatives of composite functions, the difference between the sensitivity δ_ij of text feature X_i and the sensitivity δ_kj of text feature X_k is obtained:

[Formula images PCTCN2019116931-appb-000016 and -000017: the chain-rule expression for δ_ij - δ_kj and the intermediate quantity it uses]

If the condition given by formula image PCTCN2019116931-appb-000018 holds, then necessarily δ_ij > δ_kj, i.e., text feature X_i has a stronger ability to classify the j-th class of patterns than text feature X_k, and the text features are selected accordingly.
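The sensitivity-based selection can be illustrated with a small numerical sketch: train a 3-layer network, estimate the sensitivity of each output to each input feature by finite differences, and keep the features with the largest sensitivities. This is only an approximation of the chain-rule computation described above, and every network size, data shape, and cut-off below is a placeholder.

import numpy as np
from sklearn.neural_network import MLPClassifier

def feature_sensitivities(model, X, eps=1e-3):
    """Finite-difference estimate of |d P(class j) / d x_i|, averaged over samples."""
    base = model.predict_proba(X)
    n_feat, n_class = X.shape[1], base.shape[1]
    sens = np.zeros((n_feat, n_class))
    for i in range(n_feat):
        X_shift = X.copy()
        X_shift[:, i] += eps
        sens[i] = np.abs(model.predict_proba(X_shift) - base).mean(axis=0) / eps
    return sens

# Placeholder data standing in for the text vectors and labels.
rng = np.random.default_rng(0)
X = rng.random((300, 40))
y = rng.integers(0, 3, 300)

# The hidden-layer size would come from the decision-tree depth computed earlier.
net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500, random_state=0).fit(X, y)

sens = feature_sensitivities(net, X)
# Keep the features whose maximum per-class sensitivity is largest.
top_features = np.argsort(sens.max(axis=1))[::-1][:10]
print("selected feature indices:", top_features)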
S3. Based on the text features obtained above, train the BP neural network classification model with the stochastic gradient descent algorithm and the fine-tuning method until the best text features are obtained; then, based on the best text features, classify the text data with a classifier and output the classification result of the target text.

The fine-tuning method extracts the shallow features of an available neural network and modifies the parameters of the deep neural network to build a new neural network model, so as to reduce the number of iterations and thereby obtain the optimal BP neural network classification model more quickly.
In the preferred embodiment of this application, the procedure for training the BP neural network classification model is as follows:

I. Construct the loss function.

In a neural network, the loss function evaluates the difference between the predicted value Ŷ output by the network model and the true value Y. Here the loss function is written L(Y, Ŷ); it is a non-negative real-valued function, and the smaller the loss value, the better the performance of the network model. The input pattern vectors are A_k = (a_1, a_2, ..., a_8) (k = 1, 2, ..., 20) and the desired output vectors are Y_k (k = 1, 2, ..., 20). According to the basic neuron formula in deep learning, the input of each layer is given by formula image PCTCN2019116931-appb-000021 and its output is C_i = f(z_i).
This application selects the classification loss function:

[Formula image PCTCN2019116931-appb-000022: the classification loss expressed in terms of m, h_θ(x^(i)), and y^(i)]

where m is the number of samples of the text data, h_θ(x^(i)) is the predicted value for the text data, and y^(i) is the true value of the text data.

At the same time, to alleviate the gradient vanishing problem, this application selects the ReLU function relu(x) = max(0, x) as the activation function. This function satisfies the sparsity found in bionics: a neuron node is activated only when its input exceeds a certain value, the output is limited when the input is below 0, and once the input rises above the threshold the dependent variable is linear in the independent variable. Here x represents the accumulated reverse-gradient value and the accumulated descending-gradient value.
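A small sketch of these two ingredients. The exact form of the classification loss is not reproduced in the text, so the cross-entropy below, a common choice for the h_θ(x^(i)) / y^(i) notation, is an assumption; relu follows the definition given above.

import numpy as np

def relu(x):
    """ReLU activation: relu(x) = max(0, x)."""
    return np.maximum(0.0, x)

def classification_loss(h, y, eps=1e-12):
    """Assumed cross-entropy form of the classification loss over m samples.

    h -- predicted probabilities h_theta(x^(i)), shape (m,)
    y -- true labels y^(i) in {0, 1}, shape (m,)
    """
    h = np.clip(h, eps, 1.0 - eps)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

# Tiny usage example with placeholder predictions and labels.
print(relu(np.array([-1.0, 0.5])))                              # [0.  0.5]
print(classification_loss(np.array([0.9, 0.2]), np.array([1, 0])))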
II. Solve the loss function with the stochastic gradient descent algorithm, and use the fine-tuning method to reduce the number of model iterations.

The gradient descent algorithm is the most commonly used optimization algorithm for training neural network models. To find the minimum of the loss function L(Y, Ŷ), the variable y must be updated in the direction opposite to the gradient vector, -dL/dy, which decreases the gradient fastest until the loss converges to its minimum. In this embodiment, combined with the momentum method, the learning rate is lowered as the gradient descends for every batch of batch-size samples, and for every epoch the decay rate is raised according to the decrease in the learning rate; the parameter update formula is L = L - α·dL/dy, where α is the learning rate and dL/dy is the decay rate, from which the final BP neural network parameters are obtained. At the same time, when the fine-tuning method is used, this application first adjusts the parameters of the network layers, removing the FC layer and adjusting the learning rate: because the last layer is relearned, it needs a faster learning rate than the other layers, so the learning rates of weight and bias are sped up by a factor of 10 without changing the learning strategy. Finally, the solver parameters are modified: by reducing the size of the text data, the step size is changed from 100,000 to 20,000 and the maximum number of iterations is reduced accordingly, so that the optimized BP neural network classification model is obtained with fewer iterations and the best text features are obtained with this optimized BP neural network classifier.
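A schematic SGD-with-momentum training loop along these lines, with a per-epoch learning-rate decay written out. The model, data loader, schedule constants, and the .fc head replaced in the fine-tuning sketch are all placeholders; the update follows the generic θ ← θ - α·∇L rather than any exact formula from the text.

import torch
import torch.nn as nn

def train(model, loader, epochs=5, lr=0.1, momentum=0.9):
    """Sketch of SGD + momentum training with a stepwise learning-rate decay."""
    loss_fn = nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.9)   # placeholder schedule
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
        sched.step()
    return model

def finetune(pretrained, n_classes, loader, base_lr=0.01):
    """Fine-tuning sketch: keep shallow features, relearn only the final layer faster."""
    for p in pretrained.parameters():
        p.requires_grad = False                          # freeze shallow features
    pretrained.fc = nn.Linear(pretrained.fc.in_features, n_classes)   # assumes a .fc head
    opt = torch.optim.SGD(pretrained.fc.parameters(), lr=10 * base_lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for x, y in loader:
        opt.zero_grad()
        loss_fn(pretrained(x), y).backward()
        opt.step()
    return pretrained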
Further, the preferred embodiment of this application uses the random forest algorithm as the classifier and classifies the collected text data according to the best text features.

The random forest algorithm uses the bagging algorithm's sampling with replacement to draw several sample subsets from the original samples and trains multiple decision tree models on them; during training it borrows the random feature subspace method, drawing a subset of features from the feature set for the decision-tree splits. Finally, the multiple decision trees are combined into an ensemble classifier, and this ensemble classifier is called a random forest. The algorithm flow can be divided into three parts: generation of the sub-sample sets, construction of the decision trees, and voting to produce the result. The specific flow is as follows (a code sketch follows this list):
1) Generate the sub-sample sets: a random forest is an ensemble classifier, and a certain sample subset must be produced for each base classifier as its input. To accommodate model evaluation, the sample set can be divided in several ways; in this embodiment, cross-validation is used to divide the text data. The original text is divided into k sub-text datasets according to the number of pages; in each training round, one sub-text dataset is used as the test set and the remaining sub-text datasets as the training set, and this rotation is performed k times.

2) Construct the decision trees: in a random forest, each base classifier is an independent decision tree. The most important part of building a decision tree is the splitting rule, which tries to find an optimal feature to partition the samples and thereby improve the accuracy of the final classification. A random-forest decision tree is built in essentially the same way as an ordinary decision tree, except that when a split is made the candidate features are not searched over the whole feature set; instead, k features are randomly selected for the split. In this embodiment, the sub-text features obtained above are used as the child nodes of the decision tree, and the nodes below them are the respective extracted features.

3) Vote to produce the result: the classification result of the random forest is obtained by voting among the base classifiers, i.e., the decision trees. The random forest treats all base classifiers equally: each decision tree yields one classification result, the text classification results of all decision trees are collected and summed, and the result with the most votes is the final text classification result, i.e., the text is effectively classified.
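The flow above can be sketched as follows with scikit-learn; the k-fold split, the forest size, and the random data are placeholders standing in for the page-based split and the selected text features described above.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

# Placeholder feature matrix and labels standing in for the selected text features.
rng = np.random.default_rng(0)
X = rng.random((500, 30))
y = rng.integers(0, 4, 500)

kf = KFold(n_splits=5, shuffle=True, random_state=0)    # k-fold rotation of the text data
scores = []
for train_idx, test_idx in kf.split(X):
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
    forest.fit(X[train_idx], y[train_idx])
    # predict()/score() internally aggregate the votes of the individual decision trees.
    scores.append(forest.score(X[test_idx], y[test_idx]))

print("mean cross-validated accuracy:", np.mean(scores))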
This application also provides a text classification device based on a neural network model. FIG. 2 is a schematic diagram of the internal structure of the text classification device based on a neural network model provided by an embodiment of this application.

In this embodiment, the text classification device 1 based on the neural network model may be a PC (Personal Computer), or a terminal device such as a smartphone, a tablet computer, or a portable computer. The text classification device 1 based on the neural network model comprises at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.

The memory 11 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disk, optical disc, and the like. In some embodiments, the memory 11 may be an internal storage unit of the text classification device 1 based on the neural network model, for example the hard disk of the device 1. In other embodiments, the memory 11 may also be an external storage device of the device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the device 1. The memory 11 can be used not only to store application software installed in the device 1 and various kinds of data, such as the code of the text classification program 01 based on the neural network model, but also to temporarily store data that has been output or is to be output.

In some embodiments, the processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, and is used to run the program code stored in the memory 11 or to process data, for example to execute the text classification program 01 based on the neural network model.

The communication bus 13 is used to realize connection and communication between these components.

The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is usually used to establish a communication connection between the device 1 and other electronic devices.

Optionally, the device 1 may further include a user interface, which may comprise a display and an input unit such as a keyboard; the optional user interface may also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be appropriately called a display screen or a display unit, and is used to display the information processed in the text classification device 1 based on the neural network model and to display a visualized user interface.

FIG. 2 shows only the text classification device 1 based on the neural network model with components 11-14 and the text classification program 01 based on the neural network model. Those skilled in the art will understand that the structure shown in FIG. 2 does not limit the device 1, which may include fewer or more components than illustrated, or combine certain components, or have a different arrangement of components.
In the embodiment of the device 1 shown in FIG. 2, the memory 11 stores a text classification program 01 based on a neural network model, and the processor 12 implements the following steps when executing the program 01 stored in the memory 11:

Step 1: collect text data, perform a preprocessing operation on the text data to obtain preprocessed text data, and convert the preprocessed text data into a text vector.

The preferred embodiment of this application may collect the text data from the Internet, for example from news websites, shopping websites, paper databases, or various forums.

The text data is unstructured or semi-structured data and cannot be directly recognized by a classification algorithm. Therefore, the purpose of the preprocessing operation in the preferred embodiment of this application is to convert the text data into a vector space model: d_i = (w_1, w_2, ..., w_n), where w_j is the weight of the j-th feature item.

This embodiment performs preprocessing operations on the text data that include word segmentation, stop-word removal, feature weight calculation, and de-duplication.

The word segmentation method of this embodiment matches the text data against the entries of a pre-built dictionary according to a predetermined strategy to obtain the words in the text data.

In this embodiment, the chosen stop-word removal method is stop-word-list filtering: the words in the text data are matched against an already constructed stop-word list, and if a word matches it is a stop word and is deleted.

After word segmentation and stop-word removal, the text data is represented by a series of feature words (keywords). Data in this textual form cannot be processed directly by a classification algorithm and must be converted into numerical form, so the weights of these feature words must be calculated to characterize their importance in the text.

This embodiment uses the TF-IDF algorithm for the feature-word calculation. The TF-IDF algorithm uses statistical information, word vector information, and the dependency-syntax information between words; it builds a dependency graph to compute the association strength between words and iteratively computes the importance score of each word with the TextRank algorithm.
In detail, when computing the weights of feature words, this application first calculates the dependency relatedness of any two words W_i and W_j:

[Formula image PCTCN2019116931-appb-000024: Dep(W_i, W_j), the dependency relatedness, expressed in terms of the dependency path length len(W_i, W_j) and the hyperparameter b]

where len(W_i, W_j) denotes the length of the dependency path between words W_i and W_j, and b is a hyperparameter.

This application holds that the semantic similarity between two words alone cannot accurately measure their importance; the two words can only be shown to be important when at least one of them occurs with high frequency in the text. Following the concept of universal gravitation, word frequency is treated as mass, the Euclidean distance between the two words' word vectors is treated as distance, and the attraction between the two words is computed with the gravitation formula. In the present text setting, however, measuring the importance of a word by term frequency alone is too one-sided, so this application introduces the IDF value and replaces term frequency with the TF-IDF value, thereby taking more global information into account. This yields a new word-attraction formula. The attraction between text words W_i and W_j is:

f_grav(W_i, W_j) = tfidf(W_i) * tfidf(W_j) / d^2

where tfidf(W) is the TF-IDF value of word W and d is the Euclidean distance between the word vectors of W_i and W_j.

The degree of association between words W_i and W_j is therefore:

weight(W_i, W_j) = Dep(W_i, W_j) * f_grav(W_i, W_j)
Finally, this application uses the TextRank algorithm to build an undirected graph G = (V, E), where V is the set of vertices and E is the set of edges, and computes the score of word W_i with the weighted TextRank update:

WS(W_i) = (1 - η) + η * Σ_{W_j ∈ C(W_i)} [ weight(W_j, W_i) / Σ_{W_k ∈ C(W_j)} weight(W_j, W_k) ] * WS(W_j)

where C(W_i) is the set of vertices connected to vertex W_i and η is the damping coefficient. This yields the feature weight WS(W_i), by which each word is represented in numerical vector form.
Furthermore, because the collected text data comes from intricate and varied sources, it may contain many duplicate texts. A large amount of duplicate data degrades classification accuracy; therefore, in this embodiment the texts are first de-duplicated with the Euclidean distance method before classification, using the following formula:

d(w_1, w_2) = sqrt( Σ_j (w_1j - w_2j)^2 )

where w_1j and w_2j are the j-th components of the vectors of the two text data items. After the Euclidean distance is computed for every pair of texts, a smaller distance indicates more similar texts, and one of any two texts whose Euclidean distance is below a preset threshold is deleted.
Further, a preferred embodiment of this application also uses the hierarchical text encoder of a zoom neural network to encode the preprocessed text data and obtain the encoded text vector.

In this embodiment, the hierarchical text encoder has three layers: a word embedding layer and two bi-LSTM layers. The word embedding layer initializes the words with word2vec to obtain word vectors; the first bi-LSTM layer receives the word vectors as input and generates sentence vectors, and the second bi-LSTM layer receives the sentence vectors as input and generates paragraph vectors.

In detail, the first bi-LSTM layer takes each word as input and outputs a hidden state vector at each time step; a max-pooling operation is then applied to obtain a fixed-length sentence vector, and all sentence vectors are used as the sentence components of the hierarchical memory. The formulas used are:

[Formula images PCTCN2019116931-appb-000029 and -000030: the bi-LSTM hidden-state computation over the input words and the max-pooling that produces the sentence vector]

where the inputs are the words of the sentence, the max-pooled output is a fixed-length sentence vector whose length is related to j, and R_s denotes the sentence vectors of the hierarchical memory.

Next, this application proceeds in a similar way, using the second bi-LSTM layer and a max-pooling operation to convert the sentence components into paragraph vectors.

Through this hierarchical encoding, this application assigns each language unit at each level a vector representation (hierarchical distributed memory) and preserves the boundary information of its sentence and paragraph segmentation, thereby obtaining a text vector that comprises word vectors, sentence vectors, and paragraph vectors.
Step 2: use the BP neural network classification model optimized by a decision tree to perform feature selection on the text vector and thereby obtain the text features.
Since in many cases the number of features in the text data far exceeds the number of training samples, to simplify model training this application performs feature selection with a BP-neural-network-based method, using as the measure for evaluating a text feature the sensitivity δ of the state Y to a change in the feature X, i.e.:

[Formula images PCTCN2019116931-appb-000033 and -000034: the definition of the sensitivity δ as a partial derivative of Y with respect to X]
The BP neural network is a multi-layer feedforward neural network whose main characteristics are forward signal transmission and error back-propagation. In the forward pass, the input signal is processed layer by layer from the input layer through the hidden layer to the output layer, and the neuron states of each layer only affect the neuron states of the next layer. If the output layer does not produce the desired output, the network switches to back-propagation and adjusts the network weights and thresholds according to the prediction error, so that the predicted output keeps approaching the desired output.

The BP neural network of this application has the following structure:

Input layer: the only data entry point of the whole network. The number of neuron nodes in the input layer equals the dimensionality of the text's numerical vector, and the value of each neuron corresponds to the value of one component of that vector.

Hidden layer: mainly performs a non-linear transformation of the data coming from the input layer; fitting the input non-linearly on the basis of an activation function effectively guarantees the predictive ability of the model.

Output layer: follows the hidden layer and is the only output of the whole model. The number of neuron nodes in the output layer equals the number of text categories.
Because the structure of the BP neural network has a great influence on the classification results, a poorly designed network suffers from slow convergence, low training speed, and low classification accuracy. This application therefore uses a decision tree to optimize the BP neural network. In this embodiment, the length of the longest rule chain of the decision tree is taken as the number of hidden-layer nodes of the BP neural network, i.e., the depth of the decision tree is used as the number of hidden-layer nodes.
The preferred embodiment of this application constructs a 3-layer BP neural network, in which the n units of the input layer correspond to the n feature parameters and the m units of the output layer correspond to the m pattern classes; the number of units in the middle hidden layer is q. Let w_iq denote the connection weight between input-layer unit i and hidden-layer unit q, let v_qj denote the connection weight between hidden-layer unit q and output-layer unit j, and let θ_q be the threshold of each hidden-layer unit. The output O_q of the q-th hidden-layer unit is then:

O_q = f( Σ_i w_iq * x_i - θ_q )

The output y_j of the j-th output-layer unit is:

y_j = f( Σ_q v_qj * O_q - δ_j )

In the above formulas, δ_j is the threshold of each output-layer unit, j = 1, 2, ..., m.
Using the chain rule for partial derivatives of composite functions, the difference between the sensitivity δ_ij of text feature X_i and the sensitivity δ_kj of text feature X_k is obtained:

[Formula images PCTCN2019116931-appb-000039 and -000040: the chain-rule expression for δ_ij - δ_kj and the intermediate quantity it uses]

If the condition given by formula image PCTCN2019116931-appb-000041 holds, then necessarily δ_ij > δ_kj, i.e., text feature X_i has a stronger ability to classify the j-th class of patterns than text feature X_k, and the text features are selected accordingly.
Step 3: based on the text features obtained above, train the BP neural network classification model with the stochastic gradient descent algorithm and the fine-tuning method until the best text features are obtained; then, based on the best text features, classify the text data with a classifier and output the classification result of the target text.

The fine-tuning method extracts the shallow features of an available neural network and modifies the parameters of the deep neural network to build a new neural network model, so as to reduce the number of iterations and thereby obtain the optimal BP neural network classification model more quickly.
In the preferred embodiment of this application, the procedure for training the BP neural network classification model is as follows:

I. Construct the loss function.

In a neural network, the loss function evaluates the difference between the predicted value Ŷ output by the network model and the true value Y. Here the loss function is written L(Y, Ŷ); it is a non-negative real-valued function, and the smaller the loss value, the better the performance of the network model. The input pattern vectors are A_k = (a_1, a_2, ..., a_8) (k = 1, 2, ..., 20) and the desired output vectors are Y_k (k = 1, 2, ..., 20). According to the basic neuron formula in deep learning, the input of each layer is given by formula image PCTCN2019116931-appb-000044 and its output is C_i = f(z_i).
This application selects the classification loss function:

[Formula image PCTCN2019116931-appb-000045: the classification loss expressed in terms of m, h_θ(x^(i)), and y^(i)]

where m is the number of samples of the text data, h_θ(x^(i)) is the predicted value for the text data, and y^(i) is the true value of the text data.

At the same time, to alleviate the gradient vanishing problem, this application selects the ReLU function relu(x) = max(0, x) as the activation function. This function satisfies the sparsity found in bionics: a neuron node is activated only when its input exceeds a certain value, the output is limited when the input is below 0, and once the input rises above the threshold the dependent variable is linear in the independent variable. Here x represents the accumulated reverse-gradient value and the accumulated descending-gradient value.
II. Solve the loss function with the stochastic gradient descent algorithm, and use the fine-tuning method to reduce the number of model iterations.

The gradient descent algorithm is the most commonly used optimization algorithm for training neural network models. To find the minimum of the loss function L(Y, Ŷ), the variable y must be updated in the direction opposite to the gradient vector, -dL/dy, which decreases the gradient fastest until the loss converges to its minimum. In this embodiment, combined with the momentum method, the learning rate is lowered as the gradient descends for every batch of batch-size samples, and for every epoch the decay rate is raised according to the decrease in the learning rate; the parameter update formula is L = L - α·dL/dy, where α is the learning rate and dL/dy is the decay rate, from which the final BP neural network parameters are obtained. At the same time, when the fine-tuning method is used, this application first adjusts the parameters of the network layers, removing the FC layer and adjusting the learning rate: because the last layer is relearned, it needs a faster learning rate than the other layers, so the learning rates of weight and bias are sped up by a factor of 10 without changing the learning strategy. Finally, the solver parameters are modified: by reducing the size of the text data, the step size is changed from 100,000 to 20,000 and the maximum number of iterations is reduced accordingly, so that the optimized BP neural network classification model is obtained with fewer iterations and the best text features are obtained with this optimized BP neural network classifier.
Further, the preferred embodiment of this application uses the random forest algorithm as the classifier and performs text classification on the collected text data according to the best text features.

The random forest algorithm uses the bagging algorithm's sampling with replacement to draw several sample subsets from the original samples and trains multiple decision tree models on them; during training it borrows the random feature subspace method, drawing a subset of features from the feature set for the decision-tree splits. Finally, the multiple decision trees are combined into an ensemble classifier, and this ensemble classifier is called a random forest. The algorithm flow can be divided into three parts: generation of the sub-sample sets, construction of the decision trees, and voting to produce the result. The specific flow is as follows:
1) Generate the sub-sample sets: a random forest is an ensemble classifier, and a certain sample subset must be produced for each base classifier as its input. To accommodate model evaluation, the sample set can be divided in several ways; in this embodiment, cross-validation is used to divide the text data. The original text is divided into k sub-text datasets according to the number of pages; in each training round, one sub-text dataset is used as the test set and the remaining sub-text datasets as the training set, and this rotation is performed k times.

2) Construct the decision trees: in a random forest, each base classifier is an independent decision tree. The most important part of building a decision tree is the splitting rule, which tries to find an optimal feature to partition the samples and thereby improve the accuracy of the final classification. A random-forest decision tree is built in essentially the same way as an ordinary decision tree, except that when a split is made the candidate features are not searched over the whole feature set; instead, k features are randomly selected for the split. In this embodiment, the sub-text features obtained above are used as the child nodes of the decision tree, and the nodes below them are the respective extracted features.

3) Vote to produce the result: the classification result of the random forest is obtained by voting among the base classifiers, i.e., the decision trees. The random forest treats all base classifiers equally: each decision tree yields one classification result, the text classification results of all decision trees are collected and summed, and the result with the most votes is the final text classification result, i.e., the text is effectively classified.
Optionally, in other embodiments, the text classification program based on the neural network model may also be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to carry out this application. A module as referred to in this application means a series of computer program instruction segments capable of completing a specific function, and is used to describe the execution process of the text classification program based on the neural network model in the text classification device based on the neural network model.

For example, FIG. 3 is a schematic diagram of the program modules of the text classification program based on a neural network model in an embodiment of the text classification device based on a neural network model of this application. In this embodiment, the text classification program based on the neural network model may be divided into a sample collection module 10, a feature extraction module 20, and a text classification module 30. Illustratively:
The sample collection module 10 is configured to: collect text data, perform a preprocessing operation on the text data to obtain preprocessed text data, and convert the preprocessed text data into a text vector.

The preprocessing operation on the text data includes:

matching the text data against the entries of a pre-built dictionary according to a predetermined strategy to obtain the words in the text data;

matching the words in the text data against an already constructed stop-word list, and if a word matches, judging that the word is a stop word and deleting it;

building a dependency graph to compute the association strength between words, iteratively computing the importance score of each word with the TextRank algorithm, and representing each word in numerical vector form;

computing the Euclidean distance between every two items of the text data, and deleting one of any two text data items whose Euclidean distance is less than a preset threshold.

Converting the text data into a text vector includes:

using the hierarchical text encoder of a zoom neural network to encode the preprocessed text data and obtain the encoded text vector, wherein the hierarchical text encoder includes a word embedding layer and two bi-LSTM layers: the word embedding layer initializes the words with word2vec to obtain word vectors, the first bi-LSTM layer receives the word vectors as input and generates sentence vectors, and the second bi-LSTM layer receives the sentence vectors as input and generates paragraph vectors.
The feature extraction module 20 is configured to: use the BP neural network classification model optimized by a decision tree to perform feature selection on the text vector and obtain the initial text features.

Using the BP neural network classification model optimized by a decision tree to perform feature selection on the text vector and thereby obtain the text features includes:
constructing a 3-layer BP neural network, in which the n units of the input layer of the BP neural network correspond to the n feature parameters and the m units of the output layer correspond to the m pattern classes, with the number of units in the middle hidden layer taken as q. Let w_iq denote the connection weight between input-layer unit i and hidden-layer unit q, let v_qj denote the connection weight between hidden-layer unit q and output-layer unit j, and let θ_q be the threshold of each hidden-layer unit. The output O_q of the q-th hidden-layer unit is then:

O_q = f( Σ_i w_iq * x_i - θ_q )

The output y_j of the j-th output-layer unit is:

y_j = f( Σ_q v_qj * O_q - δ_j )

In the above formulas, δ_j is the threshold of each output-layer unit, j = 1, 2, ..., m;
using the chain rule for partial derivatives of composite functions to obtain the difference between the sensitivity δ_ij of text feature X_i and the sensitivity δ_kj of text feature X_k:

[Formula images PCTCN2019116931-appb-000051 and -000052: the chain-rule expression for δ_ij - δ_kj and the intermediate quantity it uses]

and, if the condition given by formula image PCTCN2019116931-appb-000053 holds, then δ_ij > δ_kj, i.e., text feature X_i has a stronger ability to classify the j-th class of patterns than text feature X_k, and the text features are selected accordingly.
The text classification module 30 is configured to: based on the initial text features obtained above, train the BP neural network classification model with the stochastic gradient descent algorithm and the fine-tuning method until the best text features are obtained, and, based on the best text features, classify the text data with a classifier and output the classification result of the text data.

The classifier is a random forest classifier; and

classifying the text data with the classifier includes:

dividing the text data by cross-validation, wherein the cross-validation divides the original text data into k sub-text datasets according to the number of pages, and in each training round one sub-text dataset is used as the test set and the remaining sub-text datasets as the training set, with k rotations performed;

using the sub-text features obtained above as the child nodes of decision trees to construct multiple decision trees;

collecting and summing the text classification results of all decision trees, the result with the most votes being the final text classification result.
The functions or operation steps implemented when the program modules such as the sample collection module 10, the feature extraction module 20, and the text classification module 30 are executed are substantially the same as those of the above embodiment and are not repeated here.
In addition, an embodiment of this application further proposes a computer-readable storage medium. The computer-readable storage medium stores a text classification program based on a neural network model, and the program can be executed by one or more processors to implement the following operations:

collecting text data and performing a preprocessing operation on the text data to obtain preprocessed text data;

converting the preprocessed text data into a text vector;

performing feature selection on the text vector with a BP neural network classification model optimized by a decision tree to obtain initial text features;

training the BP neural network classification model with the stochastic gradient descent algorithm and the fine-tuning method according to the initial text features obtained above, until the best text features are obtained;

classifying the text data with a classifier according to the best text features, and outputting the classification result of the text data.
The specific implementation of the computer-readable storage medium of this application is substantially the same as the embodiments of the text classification device and method based on the neural network model described above and is not repeated here.

It should be noted that the serial numbers of the above embodiments of this application are only for description and do not represent the relative merits of the embodiments. The terms "include", "comprise", or any other variant thereof herein are intended to cover non-exclusive inclusion, so that a process, device, article, or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article, or method. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, device, article, or method that includes that element.

Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, magnetic disk, or optical disc) and includes several instructions to cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of this application.

The above are only the preferred embodiments of this application and do not thereby limit the patent scope of this application. Any equivalent structural or equivalent process transformation made using the contents of the specification and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of this application.

Claims (20)

  1. A text classification method based on a neural network model, wherein the method comprises:
    collecting text data and performing a preprocessing operation on the text data to obtain preprocessed text data;
    converting the preprocessed text data into a text vector;
    performing feature selection on the text vector by using a BP neural network classification model optimized based on a decision tree, to obtain initial text features;
    training the BP neural network classification model by using a stochastic gradient descent algorithm and a fine-tuning method according to the initial text features obtained above, until optimal text features are obtained; and
    classifying the text data by using a classifier according to the optimal text features, and outputting a classification result of the text data.
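For illustration only, the following is a minimal sketch of the training step recited in claim 1: a 3-layer BP-style network trained with stochastic gradient descent, followed by a shorter lower-learning-rate pass standing in for the fine-tuning method. The framework (PyTorch), layer sizes, learning rates, and function name are assumptions and are not prescribed by the claim.

```python
# Illustrative sketch only: a 3-layer BP-style classifier trained with SGD,
# then a shorter low-learning-rate pass standing in for "fine-tuning".
# Framework, sizes and learning rates are assumptions, not part of the claim.
import torch
import torch.nn as nn

def train_bp_classifier(X, y, n_classes, hidden=64, epochs=50, finetune_epochs=10):
    """X: float tensor (n_samples, n_features); y: long tensor of class indices."""
    model = nn.Sequential(
        nn.Linear(X.shape[1], hidden), nn.Sigmoid(),   # hidden layer of the BP network
        nn.Linear(hidden, n_classes),                  # output layer, one unit per class
    )
    loss_fn = nn.CrossEntropyLoss()
    # first pass: ordinary SGD training; second pass: "fine-tuning" at a lower rate
    for lr, n_epochs in [(0.1, epochs), (0.01, finetune_epochs)]:
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(n_epochs):
            opt.zero_grad()
            loss = loss_fn(model(X), y)
            loss.backward()
            opt.step()
    return model
```

In practice the two passes would iterate over mini-batches of text vectors, which is what makes the gradient descent stochastic; the full-batch loop above is kept only to keep the sketch short.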
  2. The text classification method based on a neural network model according to claim 1, wherein the preprocessing operation on the text data comprises:
    matching the text data against entries in a pre-built dictionary according to a predetermined strategy to obtain words in the text data;
    matching the words in the text data against a pre-built stop-word list, and if a word matches, determining that the word is a stop word and deleting it;
    constructing a dependency graph to calculate the association strength between words, iteratively calculating an importance score for each word by using the TextRank algorithm, and representing each word as a numerical vector; and
    calculating the Euclidean distance between every two pieces of text data, and deleting one of the two pieces of text data when the Euclidean distance is less than a preset threshold.
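As an informal illustration of the preprocessing recited in claim 2, the sketch below covers stop-word removal, TextRank-style importance scoring, and Euclidean-distance de-duplication. It assumes the text is already segmented into tokens; the stop-word set, window size, and distance threshold are placeholders, and a word co-occurrence graph stands in for the dependency graph named in the claim, with networkx's PageRank used for the TextRank iteration.

```python
# Illustrative sketch of the claim-2 preprocessing, assuming pre-segmented tokens.
import numpy as np
import networkx as nx

STOP_WORDS = {"the", "a", "an", "of", "and"}   # hypothetical stop-word list

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def textrank_scores(tokens, window=3):
    """TextRank-style importance: PageRank over a co-occurrence graph of the words."""
    graph = nx.Graph()
    graph.add_nodes_from(set(tokens))
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:
            if v != w:
                graph.add_edge(w, v)
    return nx.pagerank(graph)                  # dict: word -> importance score

def deduplicate(text_vectors, threshold=0.5):
    """Keep only texts whose Euclidean distance to every kept text is >= threshold."""
    kept = []
    for i, vec in enumerate(text_vectors):
        if all(np.linalg.norm(vec - text_vectors[j]) >= threshold for j in kept):
            kept.append(i)
    return kept                                # indices of the retained texts
```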
  3. The text classification method based on a neural network model according to claim 2, wherein converting the preprocessed text data into a text vector comprises:
    encoding the preprocessed text data by using the hierarchical text encoder of a zoom neural network to obtain an encoded text vector, wherein the hierarchical text encoder comprises a word embedding layer and two bi-LSTM layers, the word embedding layer initializes the words with word2vec to obtain word vectors, the first bi-LSTM layer receives the word vectors as input and generates sentence vectors, the second bi-LSTM layer receives the sentence vectors as input and generates a paragraph vector, and the text vector comprising the word vectors, the sentence vectors, and the paragraph vector is thereby obtained.
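A minimal sketch of the hierarchical encoder described in claim 3, assuming PyTorch: a word embedding layer that can be initialized from a word2vec weight matrix, a first bi-LSTM that turns word vectors into sentence vectors, and a second bi-LSTM that turns sentence vectors into a paragraph vector. Mean pooling between levels is an assumption, since the claim does not specify how each level is collapsed.

```python
# Illustrative sketch of the word -> sentence -> paragraph encoder in claim 3.
# `pretrained` would be a word2vec weight matrix aligned with the vocabulary (assumption).
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden=128, pretrained=None):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)           # word embedding layer
        if pretrained is not None:
            self.embed.weight.data.copy_(torch.as_tensor(pretrained))
        self.word_lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.sent_lstm = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)

    def forward(self, doc):
        """doc: long tensor (n_sentences, n_words) of word ids for one paragraph."""
        word_vecs = self.embed(doc)                              # (S, W, E) word vectors
        word_out, _ = self.word_lstm(word_vecs)                  # first bi-LSTM over words
        sent_vecs = word_out.mean(dim=1)                         # (S, 2H) sentence vectors
        sent_out, _ = self.sent_lstm(sent_vecs.unsqueeze(0))     # second bi-LSTM over sentences
        para_vec = sent_out.mean(dim=1).squeeze(0)               # (2H,) paragraph vector
        return word_vecs, sent_vecs, para_vec
```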
  4. The text classification method based on a neural network model according to claim 1, wherein performing feature selection on the text vector by using the BP neural network classification model optimized based on a decision tree to obtain text features comprises:
    constructing a 3-layer BP neural network, wherein the n units of the input layer correspond to n feature parameters, the m units of the output layer correspond to m pattern classes, and the number of hidden-layer units is q; denoting the connection weight between input-layer unit i and hidden-layer unit q by w_iq, the connection weight between hidden-layer unit q and output-layer unit j by v_qj, and the threshold of each hidden-layer unit by θ_q, the output O_q of the q-th hidden-layer unit is O_q = f(Σ_{i=1}^{n} w_iq·x_i − θ_q), and the output y_j of the j-th output-layer unit is y_j = f(Σ_q v_qj·O_q − δ_j), where δ_j is the threshold of each output-layer unit and j = 1, 2, …, m;
    obtaining, according to the chain rule for partial derivatives of composite functions, the difference between the sensitivity δ_ij of text feature X_i and the sensitivity δ_kj of text feature X_k (formula PCTCN2019116931-appb-100005, with PCTCN2019116931-appb-100006); and
    when the condition of formula PCTCN2019116931-appb-100007 holds, concluding that δ_ij > δ_kj, that is, text feature X_i has a stronger classification ability for the j-th pattern class than text feature X_k, and selecting text features accordingly.
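The feature-selection rule in claim 4 compares sensitivities obtained by the chain rule. The sketch below computes ∂y_j/∂x_i for a sigmoid 3-layer network and ranks features by mean absolute sensitivity; since the claim's exact sensitivity expressions are only available as formula images, the standard chain-rule derivative of a sigmoid BP network is used here as an assumed stand-in, and the function names and top_k parameter are illustrative.

```python
# Illustrative sketch of sensitivity-based feature selection for a trained
# 3-layer sigmoid BP network (assumed stand-in for the claim's formula images).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_sensitivities(x, W, theta, V, delta):
    """x: (n,) input features; W: (n, q) input->hidden weights; theta: (q,) hidden thresholds;
    V: (q, m) hidden->output weights; delta: (m,) output thresholds.
    Returns the (n, m) matrix of sensitivities d y_j / d x_i."""
    O = sigmoid(x @ W - theta)                       # hidden outputs O_q
    y = sigmoid(O @ V - delta)                       # network outputs y_j
    # chain rule: d y_j / d x_i = y_j (1 - y_j) * sum_q V_qj * O_q (1 - O_q) * W_iq
    hidden_grad = (O * (1 - O))[:, None] * V         # (q, m)
    return (y * (1 - y)) * (W @ hidden_grad)         # (n, m)

def select_features(x, W, theta, V, delta, top_k=100):
    """Keep the top_k features with the largest mean |sensitivity| across the m classes."""
    scores = np.abs(output_sensitivities(x, W, theta, V, delta)).mean(axis=1)
    return np.argsort(scores)[::-1][:top_k]
```

Calling select_features on a trained network's weights keeps the features with the strongest per-class influence, mirroring the δ_ij > δ_kj comparison described in the claim.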
  5. The text classification method based on a neural network model according to claim 1, wherein the classifier is a random forest classifier; and
    classifying the text data by using the classifier comprises:
    dividing the text data by means of cross-validation, wherein the cross-validation divides the original text data into k sub-text data sets according to the number of pages, and in each round of training one of the sub-text data sets is used as the test set while the remaining sub-text data sets are used as the training set, rotated over k rounds;
    using the sub-text data sets obtained above as child nodes of decision trees to construct a plurality of decision trees; and
    aggregating and summing the text classification results of all the decision trees, and taking the result with the most votes as the final text classification result.
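To make the classification step in claim 5 concrete, here is a small scikit-learn sketch of a k-fold rotation in which each fold serves once as the test set. The page-based split described in the claim is replaced by a generic KFold split, and the per-tree majority vote is the aggregation RandomForestClassifier performs internally; the function name, k, and number of trees are illustrative assumptions.

```python
# Illustrative sketch of claim 5's classification step with scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

def cross_validated_forest(X, y, k=5, n_trees=100):
    scores = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
        clf.fit(X[train_idx], y[train_idx])                 # train on k-1 folds
        scores.append(clf.score(X[test_idx], y[test_idx]))  # evaluate on the held-out fold
    return float(np.mean(scores))                           # mean accuracy over k rotations
```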
  6. The text classification method based on a neural network model according to claim 2, wherein the classifier is a random forest classifier; and
    classifying the text data by using the classifier comprises:
    dividing the text data by means of cross-validation, wherein the cross-validation divides the original text data into k sub-text data sets according to the number of pages, and in each round of training one of the sub-text data sets is used as the test set while the remaining sub-text data sets are used as the training set, rotated over k rounds;
    using the sub-text data sets obtained above as child nodes of decision trees to construct a plurality of decision trees; and
    aggregating and summing the text classification results of all the decision trees, and taking the result with the most votes as the final text classification result.
  7. The text classification method based on a neural network model according to any one of claims 3 to 4, wherein the classifier is a random forest classifier; and
    classifying the text data by using the classifier comprises:
    dividing the text data by means of cross-validation, wherein the cross-validation divides the original text data into k sub-text data sets according to the number of pages, and in each round of training one of the sub-text data sets is used as the test set while the remaining sub-text data sets are used as the training set, rotated over k rounds;
    using the sub-text data sets obtained above as child nodes of decision trees to construct a plurality of decision trees; and
    aggregating and summing the text classification results of all the decision trees, and taking the result with the most votes as the final text classification result.
  8. A text classification apparatus based on a neural network model, wherein the apparatus comprises a memory and a processor, the memory stores a text classification program based on the neural network model that is executable on the processor, and the text classification program based on the neural network model, when executed by the processor, implements the following steps:
    collecting text data and performing a preprocessing operation on the text data to obtain preprocessed text data;
    converting the preprocessed text data into a text vector;
    performing feature selection on the text vector by using a BP neural network classification model optimized based on a decision tree, to obtain initial text features;
    training the BP neural network classification model by using a stochastic gradient descent algorithm and a fine-tuning method according to the initial text features obtained above, until optimal text features are obtained; and
    classifying the text data by using a classifier according to the optimal text features, and outputting a classification result of the text data.
  9. The text classification apparatus based on a neural network model according to claim 8, wherein the preprocessing operation on the text data comprises:
    matching the text data against entries in a pre-built dictionary according to a predetermined strategy to obtain words in the text data;
    matching the words in the text data against a pre-built stop-word list, and if a word matches, determining that the word is a stop word and deleting it;
    constructing a dependency graph to calculate the association strength between words, iteratively calculating an importance score for each word by using the TextRank algorithm, and representing each word as a numerical vector; and
    calculating the Euclidean distance between every two pieces of text data, and deleting one of the two pieces of text data when the Euclidean distance is less than a preset threshold.
  10. The text classification apparatus based on a neural network model according to claim 9, wherein converting the preprocessed text data into a text vector comprises:
    encoding the preprocessed text data by using the hierarchical text encoder of a zoom neural network to obtain an encoded text vector, wherein the hierarchical text encoder comprises a word embedding layer and two bi-LSTM layers, the word embedding layer initializes the words with word2vec to obtain word vectors, the first bi-LSTM layer receives the word vectors as input and generates sentence vectors, the second bi-LSTM layer receives the sentence vectors as input and generates a paragraph vector, and the text vector comprising the word vectors, the sentence vectors, and the paragraph vector is thereby obtained.
  11. The text classification apparatus based on a neural network model according to claim 8, wherein performing feature selection on the text vector by using the BP neural network classification model optimized based on a decision tree to obtain text features comprises:
    constructing a 3-layer BP neural network, wherein the n units of the input layer correspond to n feature parameters, the m units of the output layer correspond to m pattern classes, and the number of hidden-layer units is q; denoting the connection weight between input-layer unit i and hidden-layer unit q by w_iq, the connection weight between hidden-layer unit q and output-layer unit j by v_qj, and the threshold of each hidden-layer unit by θ_q, the output O_q of the q-th hidden-layer unit is O_q = f(Σ_{i=1}^{n} w_iq·x_i − θ_q), and the output y_j of the j-th output-layer unit is y_j = f(Σ_q v_qj·O_q − δ_j), where δ_j is the threshold of each output-layer unit and j = 1, 2, …, m;
    obtaining, according to the chain rule for partial derivatives of composite functions, the difference between the sensitivity δ_ij of text feature X_i and the sensitivity δ_kj of text feature X_k (formula PCTCN2019116931-appb-100012, with PCTCN2019116931-appb-100013); and
    when the condition of formula PCTCN2019116931-appb-100014 holds, concluding that δ_ij > δ_kj, that is, text feature X_i has a stronger classification ability for the j-th pattern class than text feature X_k, and selecting text features accordingly.
  12. The text classification apparatus based on a neural network model according to claim 8, wherein the classifier is a random forest classifier; and
    classifying the text data by using the classifier comprises:
    dividing the text data by means of cross-validation, wherein the cross-validation divides the original text data into k sub-text data sets according to the number of pages, and in each round of training one of the sub-text data sets is used as the test set while the remaining sub-text data sets are used as the training set, rotated over k rounds;
    using the sub-text data sets obtained above as child nodes of decision trees to construct a plurality of decision trees; and
    aggregating and summing the text classification results of all the decision trees, and taking the result with the most votes as the final text classification result.
  13. The text classification apparatus based on a neural network model according to claim 9, wherein the classifier is a random forest classifier; and
    classifying the text data by using the classifier comprises:
    dividing the text data by means of cross-validation, wherein the cross-validation divides the original text data into k sub-text data sets according to the number of pages, and in each round of training one of the sub-text data sets is used as the test set while the remaining sub-text data sets are used as the training set, rotated over k rounds;
    using the sub-text data sets obtained above as child nodes of decision trees to construct a plurality of decision trees; and
    aggregating and summing the text classification results of all the decision trees, and taking the result with the most votes as the final text classification result.
  14. The text classification apparatus based on a neural network model according to any one of claims 10 to 11, wherein the classifier is a random forest classifier; and
    classifying the text data by using the classifier comprises:
    dividing the text data by means of cross-validation, wherein the cross-validation divides the original text data into k sub-text data sets according to the number of pages, and in each round of training one of the sub-text data sets is used as the test set while the remaining sub-text data sets are used as the training set, rotated over k rounds;
    using the sub-text data sets obtained above as child nodes of decision trees to construct a plurality of decision trees; and
    aggregating and summing the text classification results of all the decision trees, and taking the result with the most votes as the final text classification result.
  15. A computer-readable storage medium, wherein a text classification program based on a neural network model is stored on the computer-readable storage medium, and the text classification program based on the neural network model is executable by one or more processors to implement the following steps:
    collecting text data and performing a preprocessing operation on the text data to obtain preprocessed text data;
    converting the preprocessed text data into a text vector;
    performing feature selection on the text vector by using a BP neural network classification model optimized based on a decision tree, to obtain initial text features;
    training the BP neural network classification model by using a stochastic gradient descent algorithm and a fine-tuning method according to the initial text features obtained above, until optimal text features are obtained; and
    classifying the text data by using a classifier according to the optimal text features, and outputting a classification result of the text data.
  16. The computer-readable storage medium according to claim 15, wherein the preprocessing operation on the text data comprises:
    matching the text data against entries in a pre-built dictionary according to a predetermined strategy to obtain words in the text data;
    matching the words in the text data against a pre-built stop-word list, and if a word matches, determining that the word is a stop word and deleting it;
    constructing a dependency graph to calculate the association strength between words, iteratively calculating an importance score for each word by using the TextRank algorithm, and representing each word as a numerical vector; and
    calculating the Euclidean distance between every two pieces of text data, and deleting one of the two pieces of text data when the Euclidean distance is less than a preset threshold.
  17. The computer-readable storage medium according to claim 16, wherein converting the preprocessed text data into a text vector comprises:
    encoding the preprocessed text data by using the hierarchical text encoder of a zoom neural network to obtain an encoded text vector, wherein the hierarchical text encoder comprises a word embedding layer and two bi-LSTM layers, the word embedding layer initializes the words with word2vec to obtain word vectors, the first bi-LSTM layer receives the word vectors as input and generates sentence vectors, the second bi-LSTM layer receives the sentence vectors as input and generates a paragraph vector, and the text vector comprising the word vectors, the sentence vectors, and the paragraph vector is thereby obtained.
  18. The computer-readable storage medium according to claim 15, wherein performing feature selection on the text vector by using the BP neural network classification model optimized based on a decision tree to obtain text features comprises:
    constructing a 3-layer BP neural network, wherein the n units of the input layer correspond to n feature parameters, the m units of the output layer correspond to m pattern classes, and the number of hidden-layer units is q; denoting the connection weight between input-layer unit i and hidden-layer unit q by w_iq, the connection weight between hidden-layer unit q and output-layer unit j by v_qj, and the threshold of each hidden-layer unit by θ_q, the output O_q of the q-th hidden-layer unit is O_q = f(Σ_{i=1}^{n} w_iq·x_i − θ_q), and the output y_j of the j-th output-layer unit is y_j = f(Σ_q v_qj·O_q − δ_j), where δ_j is the threshold of each output-layer unit and j = 1, 2, …, m;
    obtaining, according to the chain rule for partial derivatives of composite functions, the difference between the sensitivity δ_ij of text feature X_i and the sensitivity δ_kj of text feature X_k (formula PCTCN2019116931-appb-100019, with PCTCN2019116931-appb-100020); and
    when the condition of formula PCTCN2019116931-appb-100021 holds, concluding that δ_ij > δ_kj, that is, text feature X_i has a stronger classification ability for the j-th pattern class than text feature X_k, and selecting text features accordingly.
  19. The computer-readable storage medium according to claim 15, wherein the classifier is a random forest classifier; and
    classifying the text data by using the classifier comprises:
    dividing the text data by means of cross-validation, wherein the cross-validation divides the original text data into k sub-text data sets according to the number of pages, and in each round of training one of the sub-text data sets is used as the test set while the remaining sub-text data sets are used as the training set, rotated over k rounds;
    using the sub-text data sets obtained above as child nodes of decision trees to construct a plurality of decision trees; and
    aggregating and summing the text classification results of all the decision trees, and taking the result with the most votes as the final text classification result.
  20. The computer-readable storage medium according to any one of claims 16 to 18, wherein the classifier is a random forest classifier; and
    classifying the text data by using the classifier comprises:
    dividing the text data by means of cross-validation, wherein the cross-validation divides the original text data into k sub-text data sets according to the number of pages, and in each round of training one of the sub-text data sets is used as the test set while the remaining sub-text data sets are used as the training set, rotated over k rounds;
    using the sub-text data sets obtained above as child nodes of decision trees to construct a plurality of decision trees; and
    aggregating and summing the text classification results of all the decision trees, and taking the result with the most votes as the final text classification result.
PCT/CN2019/116931 2019-09-17 2019-11-10 Text data classification method and apparatus based on neural network model, and storage medium WO2021051518A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910885586.7A CN110750640B (en) 2019-09-17 2019-09-17 Text data classification method and device based on neural network model and storage medium
CN201910885586.7 2019-09-17

Publications (1)

Publication Number Publication Date
WO2021051518A1 true WO2021051518A1 (en) 2021-03-25

Family

ID=69276659

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/116931 WO2021051518A1 (en) 2019-09-17 2019-11-10 Text data classification method and apparatus based on neural network model, and storage medium

Country Status (2)

Country Link
CN (1) CN110750640B (en)
WO (1) WO2021051518A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113282711A (en) * 2021-06-03 2021-08-20 中国软件评测中心(工业和信息化部软件与集成电路促进中心) Internet of vehicles text matching method and device, electronic equipment and storage medium

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112085157B (en) * 2020-07-20 2024-02-27 西安电子科技大学 Disease prediction method and device based on neural network and tree model
CN111882416A (en) * 2020-07-24 2020-11-03 未鲲(上海)科技服务有限公司 Training method and related device of risk prediction model
CN112819072B (en) * 2021-02-01 2023-07-18 西南民族大学 Supervision type classification method and system
CN113033902B (en) * 2021-03-31 2024-03-19 中汽院智能网联科技有限公司 Automatic driving lane change track planning method based on improved deep learning
CN113269368B (en) * 2021-06-07 2023-06-30 上海航空工业(集团)有限公司 Civil aircraft safety trend prediction method based on data driving
CN113673229B (en) * 2021-08-23 2024-04-05 广东电网有限责任公司 Electric power marketing data interaction method, system and storage medium
CN114281992A (en) * 2021-12-22 2022-04-05 北京朗知网络传媒科技股份有限公司 Automobile article intelligent classification method and system based on media field
CN114896468B (en) * 2022-04-24 2024-02-02 北京月新时代科技股份有限公司 File type matching method and data intelligent input method based on neural network
CN115147225B (en) * 2022-07-28 2024-04-05 连连银通电子支付有限公司 Data transfer information identification method, device, equipment and storage medium
CN115328062B (en) 2022-08-31 2023-03-28 济南永信新材料科技有限公司 Intelligent control system for spunlace production line
CN116646078B (en) * 2023-07-19 2023-11-24 中国人民解放军总医院 Cardiovascular critical clinical decision support system and device based on artificial intelligence


Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156766B (en) * 2015-03-25 2020-02-18 阿里巴巴集团控股有限公司 Method and device for generating text line classifier
CN108268461A (en) * 2016-12-30 2018-07-10 广东精点数据科技股份有限公司 A kind of document sorting apparatus based on hybrid classifer
CN106919646B (en) * 2017-01-18 2020-06-09 南京云思创智信息科技有限公司 Chinese text abstract generating system and method
WO2019019199A1 (en) * 2017-07-28 2019-01-31 Shenzhen United Imaging Healthcare Co., Ltd. System and method for image conversion
CN107665248A (en) * 2017-09-22 2018-02-06 齐鲁工业大学 File classification method and device based on deep learning mixed model
US11100399B2 (en) * 2017-11-21 2021-08-24 International Business Machines Corporation Feature extraction using multi-task learning
CN109086654B (en) * 2018-06-04 2023-04-28 平安科技(深圳)有限公司 Handwriting model training method, text recognition method, device, equipment and medium
CN108829822B (en) * 2018-06-12 2023-10-27 腾讯科技(深圳)有限公司 Media content recommendation method and device, storage medium and electronic device
CN110138849A (en) * 2019-05-05 2019-08-16 哈尔滨英赛克信息技术有限公司 Agreement encryption algorithm type recognition methods based on random forest
CN110196893A (en) * 2019-05-05 2019-09-03 平安科技(深圳)有限公司 Non- subjective item method to go over files, device and storage medium based on text similarity

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107656990A (en) * 2017-09-14 2018-02-02 中山大学 A kind of file classification method based on two aspect characteristic informations of word and word
CN109376242A (en) * 2018-10-18 2019-02-22 西安工程大学 Text classification algorithm based on Recognition with Recurrent Neural Network variant and convolutional neural networks
CN109947940A (en) * 2019-02-15 2019-06-28 平安科技(深圳)有限公司 File classification method, device, terminal and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113282711A (en) * 2021-06-03 2021-08-20 中国软件评测中心(工业和信息化部软件与集成电路促进中心) Internet of vehicles text matching method and device, electronic equipment and storage medium
CN113282711B (en) * 2021-06-03 2023-09-22 中国软件评测中心(工业和信息化部软件与集成电路促进中心) Internet of vehicles text matching method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110750640A (en) 2020-02-04
CN110750640B (en) 2022-11-04


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19946196

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19946196

Country of ref document: EP

Kind code of ref document: A1