WO2020244066A1 - Text classification method, apparatus, device, and storage medium - Google Patents

Text classification method, apparatus, device, and storage medium

Info

Publication number
WO2020244066A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
word
long
text
model
Prior art date
Application number
PCT/CN2019/102464
Other languages
French (fr)
Chinese (zh)
Inventor
李坤 (Li Kun)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2020244066A1 publication Critical patent/WO2020244066A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Definitions

  • This application relates to the field of text classification, and in particular to a text classification method, device, equipment, and storage medium.
  • Text classification is a key task in natural language processing that helps users discover useful information in massive amounts of data. Text classification is mainly applied to spam recognition, sentiment analysis, question answering systems, translation, and so on.
  • A sentence model aims to learn text features that represent sentences; it is a key model for text classification.
  • In intrusion detection systems, WebShell detection is also a kind of text classification.
  • Current text classification is mostly based on statistics and machine learning.
  • The statistical approach splits sentences and, based on a corpus, counts the probability that adjacent characters form a word: the more often adjacent characters appear together, the higher the probability that they form a word, and the text is segmented according to these probability values, so a complete corpus is very important.
  • The machine learning approach computes text features with the TF-IDF algorithm and then classifies the text with classifiers such as logistic regression, SVM, or random forest. These methods, however, are time-consuming and labor-intensive, generalize poorly, and have a high false alarm rate.
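  • As an illustration of this prior-art pipeline (not code from the application itself), a minimal sketch using scikit-learn might look as follows; the toy corpus, labels, and n-gram settings are assumptions for demonstration only:

```python
# Prior-art baseline sketched above: TF-IDF features fed to a classical
# classifier. Corpus and labels are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["eval(base64_decode($_POST['x']));",   # WebShell-like snippet
         "<p>Welcome to our site</p>"]          # benign page content
labels = [1, 0]                                 # 1 = WebShell, 0 = benign

# TF-IDF turns each document into a sparse weighted term vector; logistic
# regression (or an SVM / random forest) then classifies that vector.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(),
)
model.fit(texts, labels)
print(model.predict(["system($_GET['cmd']);"]))
```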
  • This application provides a text classification method, device, equipment and storage medium, which can solve the problem of poor accuracy of text classification in the prior art.
  • In a first aspect, the present application provides a text classification method, which includes: obtaining training text; inputting the training text into the coding layer of a neural network model and performing word vectorization on the training text in the coding layer to obtain a feature vector corresponding to the training text; inputting the feature vector into an RNN model, modeling the sentences, and capturing the long-distance dependency feature of each sentence in the training text, where the long-distance dependency feature refers to the context vector of the text and the context vector has long-term dependencies in the time domain; inputting the feature vector that has captured the long-distance dependency information into the convolutional neural network (CNN) model in the neural network model; extracting local features from the feature vector in the CNN model to obtain a target feature vector, where a local feature refers to local correlation within the feature vector; and inputting the target feature vector into the classifier, which classifies the training text to obtain the classified text.
  • In one possible design, capturing the long-distance dependency features of each sentence in the training text includes: sequentially computing the long-distance dependency feature of each word in the sentence through the LSTM model, where the long-distance dependency feature of a specific word represents the dependency relationship between that word and other distant words in the sentence. The method further includes: sequentially computing the semantic structure feature of each word, where the semantic structure feature of a specific word characterizes the semantic structure of the partial sentence containing that word and the words before it; combining the long-distance dependency feature and the semantic structure feature of each word to obtain the word feature of each word in the sentence; and computing the probability of each word in the sentence based on each word feature.
  • In one possible design, when the training text is continuous data such as speech, lyrics, or an essay, sequentially computing the long-distance dependency features of each word in the sentence through the LSTM model includes: sequentially and cyclically computing the long-distance dependency information of each word in the sentence through the LSTM model, so as to capture the long-distance dependency features from the continuous data.
  • In one possible design, before the training text is classified by the classifier, the method further includes: inputting multiple sentences into the neural network model and performing word vectorization on each sentence to obtain multiple word vectors; inputting each word vector into an LSTM model or a GRU model to extract long-distance dependency features; inputting the long-distance dependency features into the CNN model to extract position-invariant local features, finally obtaining multiple feature vectors, each of which has both long-distance dependency features and position-invariant local features; inputting the multiple feature vectors into the pooling layer to reduce their dimensionality; and inputting the dimensionality-reduced feature vectors into the classifier.
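  • The following is a minimal sketch of this pipeline in tf.keras; all sizes (vocabulary, sequence length, embedding width, filter counts) are illustrative assumptions rather than values taken from the application:

```python
# Embedding -> LSTM (long-distance dependencies) -> Conv1D (local features)
# -> pooling (dimensionality reduction) -> classifier, as described above.
import tensorflow as tf

vocab_size, seq_len, embed_dim = 20000, 200, 128  # assumed sizes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(seq_len,)),
    tf.keras.layers.Embedding(vocab_size, embed_dim),   # word vectorization
    tf.keras.layers.LSTM(128, return_sequences=True),   # long-distance dependency features
    tf.keras.layers.Conv1D(64, 3, activation="relu"),   # position-invariant local features
    tf.keras.layers.GlobalMaxPooling1D(),               # pooling: dimensionality reduction
    tf.keras.layers.Dense(1, activation="sigmoid"),     # binary classifier (WebShell or not)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```

  • Replacing the LSTM layer with tf.keras.layers.GRU gives the GRU variant mentioned above.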
  • In one possible design, before the dimensionality-reduced feature vectors are input into the classifier, the method further includes: presetting a threshold for the classifier, where an output of the classifier greater than the threshold indicates a WebShell and an output less than the preset threshold indicates that the sample is not a WebShell. Classifying the training text by the classifier to obtain the classified text includes: setting the size N of the decision-tree ensemble in the classifier and performing Bootstrap sampling to obtain N data sets; learning the parameter θn of each of the N decision trees; and training the decision trees in parallel, where after the training of a single decision tree is completed, the votes cast on the training results are counted to determine the final output of the CNN-RF model. One representation of the final output of the CNN-RF model is: c^* = \arg\max_c \sum_{i=1}^{N} I(T_i(x) = c), where Ti(x) is the classification result of tree i on sample x, c* is the final category of the sample, and N is the number of decision trees in the classifier.
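  • A small sketch of this voting rule (an assumed illustration using numpy, not code from the application):

```python
# Majority vote over the trees: c* = argmax_c sum_i I(T_i(x) = c).
import numpy as np

def forest_vote(tree_votes: np.ndarray) -> int:
    """tree_votes holds T_i(x), the class predicted by each of the N trees."""
    classes, counts = np.unique(tree_votes, return_counts=True)
    return int(classes[np.argmax(counts)])   # the class with the most votes

print(forest_vote(np.array([1, 0, 1, 1, 0])))  # -> 1 (three of five trees vote 1)
```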
  • In one possible design, the training text is a Webshell, which is a command execution environment existing in the form of web-page files such as asp, php, jsp, or cgi.
  • Obtaining the training text includes one of the following implementations: using a search engine to find common vulnerabilities disclosed on the Internet and obtaining the WebShell if the target site has not been patched; performing a code audit on an open-source CMS through a code-audit strategy and mining code vulnerabilities from the CMS to obtain the WebShell; exploiting an upload vulnerability to obtain the WebShell; using a SQL injection attack to obtain the WebShell; or using a database backup to obtain the WebShell.
  • In a second aspect, the present application provides a text classification device with functions corresponding to the text classification method provided in the first aspect.
  • the function can be realized by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware.
  • Another aspect of the present application provides a computer device, which includes at least one processor, a memory, and a transceiver connected to one another, where the memory is used to store program code and the processor is used to call the program code in the memory to perform the method described in the first aspect.
  • Another aspect of the present application provides a computer storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium, and it stores instructions that, when run on a computer, cause the computer to execute the method described in the first aspect.
  • In this application, the RNN model's ability to process long-term information is first used to capture long-distance dependency features, which makes it possible to accurately identify strongly correlated context vectors and to avoid losing large amounts of information as the signal propagates; the CNN model's sensitivity to local features is then used to extract local features, and the output of the CNN model is finally fed into the classifier for classification. Because the feature vectors entering the classifier carry both long-distance dependency features and local features, the classification of sentences of different lengths is effectively improved, as is the accuracy with which the neural network model recognizes text.
  • FIG. 1 is a schematic flowchart of a text classification method in an embodiment of this application
  • Figure 2a is a schematic flowchart of a text classification method in an embodiment of this application.
  • FIG. 2b is a table comparing classification accuracy on the fudan, Weibo, and MR datasets in an embodiment of the application;
  • FIG. 2c is a schematic diagram of another flow chart of a text classification method according to an embodiment of this application.
  • FIG. 3 is a schematic diagram of a structure of text classification in an embodiment of this application.
  • Fig. 4 is a schematic diagram of a structure of a computer device in an embodiment of the application.
  • This application provides a text classification method, device, equipment, and storage medium, which can be used to classify texts such as news, papers, posts, and emails. This application does not limit the application scenarios of text classification.
  • this application mainly provides the following technical solutions:
  • the neural network model of the present application includes a CNN model and an RNN model, and a schematic diagram of the structure of the neural network model is shown in FIG. 1.
  • the coding layer of the neural network model includes an RNN model and a CNN model.
  • the input of the neural network model is the input of the RNN model
  • the output of the RNN model is the input of the CNN model.
  • the output of the CNN model is the output of the neural network model.
  • a text classification method in an embodiment of the present application is introduced below, and the method includes:
  • the training text includes multiple sentences, and each sentence includes multiple words.
  • the training text in this application is Webshell.
  • Webshell is a command execution environment in the form of web files such as asp, php, jsp, or cgi, which can also be called a web backdoor.
  • After hackers invade a website, they usually mix asp or php backdoor files with the normal web-page files in the WEB directory of the website server, and then use a browser to access the asp or php backdoor to obtain a command execution environment, thereby controlling the website server.
  • In some implementations, a content management system (Content Management System, CMS) may be used to obtain the Webshell, and one of the following implementations may be used to obtain the training text:
  • For example, a publicly disclosed vulnerability may be exploited: a search engine is used to find common vulnerabilities disclosed on the Internet, and if the target site has not been patched, the WebShell is obtained.
  • This application does not limit the method and source of obtaining training text.
  • The feature vector is a text representation in the vector space model.
  • Through word vectors, the text data is turned from a high-dimensional, highly sparse form that is difficult for neural networks to process into continuous, dense data similar to images and speech.
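  • A toy illustration of this step (assumed for demonstration; in practice the embedding matrix is learned during training):

```python
# Word vectorization: sparse vocabulary indices are mapped to dense,
# continuous vectors through an embedding matrix.
import numpy as np

vocab = {"<pad>": 0, "eval": 1, "base64_decode": 2, "echo": 3}  # toy vocabulary
embed_dim = 4
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(vocab), embed_dim))  # learned in practice; random here

sentence = ["eval", "base64_decode"]
ids = [vocab[w] for w in sentence]
features = embedding[ids]     # shape (2, 4): dense, continuous representation
print(features.shape)
```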
  • the long-distance dependence feature refers to the context vector of the text, and the context vector is long-term dependent in the time domain.
  • Capturing the long-distance dependency features of each sentence in the training text includes:
  • sequentially computing the long-distance dependency feature of each word in the sentence through the LSTM model, where the long-distance dependency feature of a specific word represents the dependency relationship between that word and other distant words in the sentence.
  • the RNN model may adopt a Long Short-Term Memory (LSTM) model, through which a wide range of context information can be used in text processing to determine the probability of the next word.
  • the LSTM model can use a wide range of context information in text processing to determine the probability of the next word, including the following steps:
  • The training text may be continuous data, such as speech, lyrics, or essays.
  • A loop operation can be used to capture long-distance dependency information from such continuous data, ensuring that the signal keeps propagating.
  • When the training text is continuous data such as speech, lyrics, or an essay,
  • the sequential calculation of the long-distance dependent features of each word in the sentence through the LSTM model includes:
  • the long-distance dependence information of each word in the sentence is sequentially and cyclically calculated by the LSTM model to capture the long-distance dependence feature from the continuous data.
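  • The sequential, cyclic computation can be pictured with the stripped-down recurrence below (an assumed illustration: a plain tanh recurrence is shown for brevity, whereas a real LSTM adds input, forget, and output gates so that long-distance signals survive):

```python
# Each step reuses the previous hidden state, so information from distant
# words can reach the current word through the recurrent loop.
import numpy as np

def rnn_scan(xs: np.ndarray, hidden_dim: int = 8, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    W_x = rng.normal(size=(xs.shape[1], hidden_dim))
    W_h = rng.normal(size=(hidden_dim, hidden_dim))
    h = np.zeros(hidden_dim)
    states = []
    for x in xs:                          # sequential, cyclic pass over the words
        h = np.tanh(x @ W_x + h @ W_h)    # h_t depends on x_t and on h_{t-1}
        states.append(h)
    return np.stack(states)

print(rnn_scan(np.ones((5, 4))).shape)    # (5, 8): one hidden state per word
```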
  • the local feature refers to the local correlation in the feature vector, and can also be referred to as key information similar to n-gram in the feature vector.
  • the CNN model may adopt the CNN-RF model.
  • The table in FIG. 2b compares the classification accuracy of the NB, CART, RF, CNN, and CNN-RF models on three text datasets (fudan, weibo, and MR).
  • the neural network model includes a classifier, and the input of the classifier is the output of the CNN model.
  • In the neural network model, the classifier is trained on the feature vectors until it converges.
  • A threshold may also be preset for the classifier: if the classifier's output is greater than the threshold, the sample is a WebShell; if the output of SoftMax is less than the threshold, it is not a WebShell.
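  • A sketch of this thresholded decision (the 0.5 value is an assumed placeholder, not a number from the application):

```python
THRESHOLD = 0.5  # preset for the classifier; value is illustrative

def is_webshell(classifier_output: float) -> bool:
    # Output above the threshold means WebShell; below means not a WebShell.
    return classifier_output > THRESHOLD

print(is_webshell(0.91))  # True: flagged as WebShell
print(is_webshell(0.12))  # False: not a WebShell
```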
  • Compared with existing mechanisms, the RNN model's ability to process long-term information is first used to capture long-distance dependency features, which makes it possible to accurately identify strongly correlated context vectors and to avoid losing large amounts of information as the signal propagates; the CNN model's sensitivity to local features is then used to extract local features, and the output of the CNN model is finally fed into the classifier for classification. Because the feature vectors entering the classifier carry both long-distance dependency features and local features, the classification of sentences of different lengths is effectively improved, as is the accuracy with which the neural network model recognizes text. In addition, combining the feature extraction ability of the CNN model with the generalization ability of the random forest, the generalization ability can be analyzed from the following three aspects:
  • First, from a statistical point of view, the hypothesis space of a learning task is often large, and multiple hypotheses may reach the same level of performance on the training set; if a single decision tree is used, mis-selection may lead to poor generalization.
  • Second, from the perspective of feature extraction, dual word vectors describe the meaning of words from two angles, enriching short-text information and expanding the feature information relative to a single word vector.
  • Third, from the perspective of representation, the true hypothesis of some learning tasks may not lie within the hypothesis space of the current decision tree algorithm; with a single classification method, the intended hypothesis space may never be searched. Moreover, Bootstrap sampling in the random forest reduces the machine learning model's dependence on the data and reduces the model's variance, so that the RNN model has better generalization ability.
  • Before the training text is classified by the classifier, the method further includes:
  • the feature vector obtained by the dimensionality reduction process is input to the classifier.
  • Before the dimensionality-reduced feature vectors are input into the classifier, the method further includes:
  • If the output of the classifier is greater than the threshold, the sample is a WebShell; if the output of the classifier is less than the preset threshold, it is not a WebShell.
  • Classifying the training text by the classifier to obtain the classified text includes: setting the size N of the decision-tree ensemble in the classifier and performing Bootstrap sampling to obtain N data sets; learning the parameter θn of each of the N decision trees; and training each decision tree in parallel, where after the training of a single decision tree is completed, the votes cast on the training results are counted to determine the final output of the CNN-RF model. One representation of the final output of the CNN-RF model is:
  • c^* = \arg\max_c \sum_{i=1}^{N} I(T_i(x) = c)
  • Ti(x) is the classification result of tree i on sample x, i.e., its vote;
  • c* is the final category corresponding to the sample
  • N is the number of decision trees in the classifier.
  • the classifier may adopt a random forest model or a Softmax model.
  • When the random forest model is adopted, the fully connected layer feature Cfinal may be sent to the random forest model for training.
  • Alternatively, the dimensionality-reduced feature vector is input into a classifier such as SoftMax, for which a threshold is set in advance.
  • If the output of SoftMax is greater than the threshold, the sample is a WebShell; if the output of SoftMax is less than the threshold, it is not a WebShell.
  • the text classification method in the present application is described above, and the device that executes the text classification method is described below.
  • a schematic structural diagram of a text classification device 30 shown in FIG. 3 can be applied to classify texts such as news, papers, posts, and mails.
  • the text classification device 30 in the embodiment of the present application can implement the steps corresponding to the text classification method executed in the embodiment corresponding to FIG. 1.
  • the functions implemented by the text classification device 30 can be implemented by hardware, or can be implemented by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware.
  • the text classification device 30 may include an input and output module 301, a processing module 302, and a collection module 303.
  • For the functional implementation of the input and output module 301, the processing module 302, and the collection module 303, reference may be made to the operations performed in the embodiment corresponding to FIG. 1, which will not be repeated here.
  • The processing module 302 can be used to control the input and output operations of the input and output module 301 and the collection operations of the collection module 303.
  • the input and output module 301 may be used to obtain training text
  • The processing module 302 may be configured to input the training text obtained by the input and output module 301 into the coding layer of the neural network model, perform word vectorization on the training text in the coding layer to obtain the feature vector corresponding to the training text, and input the feature vector into the RNN model to model the sentences.
  • the acquisition module 303 can be used to capture the long-distance dependent features of each sentence in the training text; wherein the long-distance dependent features refer to the context vector of the text, and the context vector is dependent on the time domain for a long time;
  • the input and output module 301 is further configured to input the feature vector of the long-distance dependence information captured by the acquisition module into the convolutional neural network CNN model in the neural network model;
  • The processing module 302 is also used to extract local features from the feature vector in the CNN model to obtain a target feature vector, where a local feature refers to local correlation within the feature vector; the target feature vector is input into the classifier through the input and output module, and the classifier classifies the training text to obtain the classified text.
  • The RNN model's ability to process long-term information is used to capture long-distance dependency features, which makes it possible to accurately identify strongly correlated context vectors and to avoid losing large amounts of information as the signal propagates; the CNN model's sensitivity to local features is then used to extract local features, and the output of the CNN model is finally fed into the classifier for classification. Because the feature vectors entering the classifier carry both long-distance dependency features and local features, the classification of sentences of different lengths is effectively improved, as is the accuracy with which the neural network model recognizes text.
  • In some implementations, the collection module 303 is specifically configured to:
  • sequentially compute the long-distance dependency feature of each word in the sentence through the LSTM model, where the long-distance dependency feature of a specific word represents the dependency relationship between that word and other distant words in the sentence; sequentially compute the semantic structure feature of each word, where the semantic structure feature of a specific word characterizes the semantic structure of the partial sentence containing that word and the words before it; combine the long-distance dependency feature and the semantic structure feature of each word to obtain the word feature of each word in the sentence; and compute the probability of each word in the sentence based on each word feature.
  • the processing module 302 is specifically configured to:
  • the long-distance dependence information of each word in the sentence is sequentially and cyclically calculated by the LSTM model to capture the long-distance dependence feature from the continuous data.
  • In some implementations, the processing module 302 is further configured to: input multiple sentences into the neural network model through the input and output module 301 and perform word vectorization on each sentence to obtain multiple word vectors; input each word vector into the LSTM model or the GRU model through the input and output module 301 to extract long-distance dependency features; input the long-distance dependency features into the CNN model through the input and output module 301 to extract position-invariant local features, finally obtaining multiple feature vectors, each of which has both long-distance dependency features and position-invariant local features; input the multiple feature vectors into the pooling layer through the input and output module 301 to reduce their dimensionality; and input the dimensionality-reduced feature vectors into the classifier through the input and output module 301.
  • In some implementations, before inputting the dimensionality-reduced feature vectors into the classifier, the processing module 302 is further configured to: preset a threshold for the classifier, where an output of the classifier greater than the threshold indicates a WebShell and an output less than the preset threshold indicates that the sample is not a WebShell; set the size N of the decision-tree ensemble in the classifier and perform Bootstrap sampling to obtain N data sets; learn the parameter θn of each of the N decision trees; and train each decision tree in parallel, where after the training of a single decision tree is completed, the votes cast on the training results are counted to determine the final output of the CNN-RF model. One representation of the final output of the CNN-RF model is:
  • c^* = \arg\max_c \sum_{i=1}^{N} I(T_i(x) = c)
  • Ti(x) is the classification result of tree i on sample x, i.e., its vote;
  • c* is the final category corresponding to the sample
  • N is the number of decision trees in the classifier.
  • In some implementations, the training text is a Webshell, which is a command execution environment existing in the form of web-page files such as asp, php, jsp, or cgi;
  • The input and output module 301 performs one of the following operations to obtain the WebShell: using a search engine to find common vulnerabilities disclosed on the Internet and obtaining the WebShell if the target site has not been patched; performing a code audit on an open-source CMS through a code-audit strategy and mining code vulnerabilities from the CMS to obtain the WebShell; exploiting an upload vulnerability to obtain the WebShell; using a SQL injection attack to obtain the WebShell; or using a database backup to obtain the WebShell.
  • The physical device corresponding to the input-output module 301 shown in FIG. 3 is the input-output unit shown in FIG. 4, which can implement part or all of the functions of the input-output module 301, or implement the same or similar functions as the input-output module 301.
  • the physical device corresponding to the collection module 303 shown in FIG. 3 is the collection device shown in FIG. 4.
  • the physical device corresponding to the processing module 302 shown in FIG. 3 is the processor shown in FIG. 4, and the processor can implement part or all of the functions of the processing module 302 or implement the same or similar functions as the processing module 302.
  • the text classification device 30 in the embodiment of the present application is described above from the perspective of modular functional entities.
  • The following describes a computer device from a hardware perspective, as shown in FIG. 4, which includes: a processor, a memory, an input and output unit (which may also be a transceiver; not labeled in FIG. 4), and a computer program stored in the memory and runnable on the processor.
  • the computer program may be a program corresponding to the text classification method in the embodiment corresponding to FIG. 1.
  • The processor executes the computer program to implement the text classification method executed by the text classification device 30 in the embodiment corresponding to FIG. 3, thereby realizing the functions of each module in the text classification device 30 of that embodiment.
  • The processor may be a central processing unit (Central Processing Unit, CPU), another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor, etc.
  • the processor is the control center of the computer equipment, and various interfaces and lines are used to connect various parts of the entire computer equipment.
  • The memory may be used to store the computer program and/or modules; the processor implements the various functions of the computer device by running or executing the computer programs and/or modules stored in the memory and by calling the data stored in the memory.
  • The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created according to the use of the device (such as audio data and video data).
  • The memory may include high-speed random access memory, and may also include non-volatile memory such as a hard disk, memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
  • the transceiver may also be replaced by a receiver and a transmitter, and may be the same or different physical entities. When they are the same physical entity, they can be collectively referred to as transceivers.
  • the transceiver can be an input and output unit.
  • the memory may be integrated in the processor, or may be provided separately from the processor.
  • the present application also provides a non-volatile computer-readable storage medium, including instructions, which when run on a computer, cause the computer to execute the following steps of the text classification method:
  • obtaining training text, where the training text includes multiple sentences and each sentence includes multiple words;
  • inputting the target feature vector into the classifier, and classifying the training text through the classifier to obtain the classified text.
  • The methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform; they can of course also be implemented by hardware, but in many cases the former is the better implementation.
  • The technical solution of this application, in essence or in the part contributing to the existing technology, can be embodied in the form of a software product.
  • The computer software product is stored in a storage medium (such as ROM/RAM) and includes several instructions for causing a terminal (which may be a mobile phone, a computer, a server, or a network device, etc.) to execute the methods described in the embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to the field of text classification, and provides a text classification method, an apparatus, a device, and a storage medium. The method comprises: acquiring a training text; inputting the training text into a coding layer of a neural network model and performing word vectorization on the training text in the coding layer to obtain a feature vector corresponding to the training text; inputting the feature vector into an RNN model, modeling the sentences, and capturing a long-distance dependency feature of each sentence in the training text; inputting the feature vector carrying the captured long-distance dependency information into a convolutional neural network (CNN) model in the neural network model; extracting a local feature from the feature vector in the CNN model to obtain a target feature vector, the local feature indicating local relevance within the feature vector; and inputting the target feature vector into a classifier and classifying the training text by means of the classifier to obtain a classified text.

Description

Text classification method, device, equipment and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on June 4, 2019, with application number 201910479226.7 and invention title "Text classification method, device, equipment and storage medium", the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of text classification, and in particular to a text classification method, device, equipment, and storage medium.
Background
Text classification is a key task in natural language processing that helps users discover useful information in massive amounts of data; it is mainly applied to spam recognition, sentiment analysis, question answering systems, translation, and so on. A sentence model learns text features that represent sentences and is a key model for text classification.
In intrusion detection systems, WebShell detection is also a kind of text classification. Current text classification is mostly based on statistics and machine learning. The statistical approach splits sentences and, based on a corpus, counts the probability that adjacent characters form a word: the more often adjacent characters appear together, the higher that probability, and the text is segmented according to these probability values, so a complete corpus is very important. The machine learning approach computes text features with the TF-IDF algorithm and then classifies the text with classifiers such as logistic regression, SVM, or random forest. The inventor realized, however, that these methods are time-consuming and labor-intensive, generalize poorly, and have a high false alarm rate.
Summary of the invention
This application provides a text classification method, device, equipment, and storage medium, which can solve the problem of poor text classification accuracy in the prior art.
In a first aspect, this application provides a text classification method, which includes: obtaining training text; inputting the training text into the coding layer of a neural network model and performing word vectorization on the training text in the coding layer to obtain a feature vector corresponding to the training text; inputting the feature vector into an RNN model, modeling the sentences, and capturing the long-distance dependency feature of each sentence in the training text, where the long-distance dependency feature refers to the context vector of the text and the context vector has long-term dependencies in the time domain; inputting the feature vector that has captured the long-distance dependency information into the convolutional neural network (CNN) model in the neural network model; extracting local features from the feature vector in the CNN model to obtain a target feature vector, where a local feature refers to local correlation within the feature vector; and inputting the target feature vector into the classifier, which classifies the training text to obtain the classified text.
In one possible design, capturing the long-distance dependency features of each sentence in the training text includes: sequentially computing the long-distance dependency feature of each word in the sentence through the LSTM model, where the long-distance dependency feature of a specific word represents the dependency relationship between that word and other distant words in the sentence. The method further includes: sequentially computing the semantic structure feature of each word, where the semantic structure feature of a specific word characterizes the semantic structure of the partial sentence containing that word and the words before it; combining the long-distance dependency feature and the semantic structure feature of each word to obtain the word feature of each word in the sentence; and computing the probability of each word in the sentence based on each word feature.
In one possible design, when the training text is continuous data such as speech, lyrics, or an essay, sequentially computing the long-distance dependency features of each word in the sentence through the LSTM model includes: sequentially and cyclically computing the long-distance dependency information of each word in the sentence through the LSTM model, so as to capture the long-distance dependency features from the continuous data.
In one possible design, before the training text is classified by the classifier, the method further includes: inputting multiple sentences into the neural network model and performing word vectorization on each sentence to obtain multiple word vectors; inputting each word vector into an LSTM model or a GRU model to extract long-distance dependency features; inputting the long-distance dependency features into the CNN model to extract position-invariant local features, finally obtaining multiple feature vectors, each of which has both long-distance dependency features and position-invariant local features; inputting the multiple feature vectors into the pooling layer to reduce their dimensionality; and inputting the dimensionality-reduced feature vectors into the classifier.
In one possible design, before the dimensionality-reduced feature vectors are input into the classifier, the method further includes: presetting a threshold for the classifier, where an output of the classifier greater than the threshold indicates a WebShell and an output less than the preset threshold indicates that the sample is not a WebShell. Classifying the training text by the classifier to obtain the classified text includes: setting the size N of the decision-tree ensemble in the classifier and performing Bootstrap sampling to obtain N data sets; learning the parameter θn of each of the N decision trees; and training each decision tree in parallel, where after the training of a single decision tree is completed, the votes cast on the training results are counted to determine the final output of the CNN-RF model. One representation of the final output of the CNN-RF model is:
c^* = \arg\max_c \sum_{i=1}^{N} I(T_i(x) = c)
where Ti(x) is the classification result of tree i on sample x, c* is the final category of the sample, and N is the number of decision trees in the classifier.
In one possible design, the training text is a Webshell, which is a command execution environment existing in the form of web-page files such as asp, php, jsp, or cgi. Obtaining the training text includes one of the following implementations: using a search engine to find common vulnerabilities disclosed on the Internet and obtaining the WebShell if the target site has not been patched; performing a code audit on an open-source CMS through a code-audit strategy and mining code vulnerabilities from the CMS to obtain the WebShell; exploiting an upload vulnerability to obtain the WebShell; using a SQL injection attack to obtain the WebShell; or using a database backup to obtain the WebShell.
In a second aspect, this application provides a text classification device that implements functions corresponding to the text classification method provided in the first aspect. The functions may be implemented by hardware or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware.
Another aspect of this application provides a computer device, which includes at least one processor, a memory, and a transceiver connected to one another, where the memory is used to store program code and the processor is used to call the program code in the memory to perform the method described in the first aspect.
Another aspect of this application provides a computer storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium and stores instructions that, when run on a computer, cause the computer to execute the method described in the first aspect.
In this application, the RNN model's ability to process long-term information is first used to capture long-distance dependency features, which makes it possible to accurately identify strongly correlated context vectors and to avoid losing large amounts of information as the signal propagates; the CNN model's sensitivity to local features is then used to extract local features, and the output of the CNN model is finally fed into the classifier for classification. Because the feature vectors entering the classifier carry both long-distance dependency features and local features, the classification of sentences of different lengths is effectively improved, as is the accuracy with which the neural network model recognizes text.
Description of the drawings
FIG. 1 is a schematic flowchart of a text classification method in an embodiment of this application;
FIG. 2a is a schematic flowchart of a text classification method in an embodiment of this application;
FIG. 2b is a table comparing classification accuracy on the fudan, Weibo, and MR datasets in an embodiment of this application;
FIG. 2c is another schematic flowchart of a text classification method in an embodiment of this application;
FIG. 3 is a schematic structural diagram of text classification in an embodiment of this application;
FIG. 4 is a schematic structural diagram of a computer device in an embodiment of this application.
Detailed description
This application provides a text classification method, device, equipment, and storage medium, which can be used to classify texts such as news, papers, posts, and emails; this application does not limit the application scenarios of text classification.
To solve the above technical problems, this application mainly provides the following technical solution:
The convolutional neural network (CNN) model in deep learning is good at extracting position-invariant local features, while the recurrent neural network (RNN) model is good at modeling whole sentences. Combining the CNN model and the RNN model makes it possible both to capture long-distance dependency information and to extract key phrase information well; as verified in practice on an intrusion detection system project, this achieves higher accuracy than using the CNN model or the RNN model alone. The neural network model of this application includes a CNN model and an RNN model, and a schematic structural diagram of the neural network model is shown in FIG. 1.
In FIG. 1, the coding layer of the neural network model includes the RNN model and the CNN model; the input of the neural network model is the input of the RNN model, the output of the RNN model is the input of the CNN model, and the output of the CNN model is the output of the neural network model.
Referring to FIG. 2a, the following introduces a text classification method in an embodiment of this application. The method includes:
201. Obtain training text.
The training text includes multiple sentences, and each sentence includes multiple words. The training text in this application is a Webshell: a command execution environment that exists in the form of web-page files such as asp, php, jsp, or cgi, which can also be called a web backdoor. After hackers invade a website, they usually mix asp or php backdoor files with the normal web-page files in the WEB directory of the website server, and then use a browser to access the asp or php backdoor to obtain a command execution environment, thereby controlling the website server.
In some implementations, a content management system (Content Management System, CMS) may be used to obtain the Webshell, and one of the following implementations may be used to obtain the training text:
(1) A publicly disclosed vulnerability may be exploited: a search engine is used to find common vulnerabilities disclosed on the Internet, and if the target site has not been patched, the WebShell is obtained.
(2) A code audit is performed on an open-source CMS through a code-audit strategy, and code vulnerabilities are mined from the CMS to obtain the WebShell.
(3) An upload vulnerability is exploited to obtain the WebShell.
(4) A SQL injection attack is used to obtain the WebShell.
(5) A database backup is used to obtain the WebShell.
This application does not limit the method and source of obtaining the training text.
202. Input the training text into the coding layer of the neural network model, and perform word vectorization on the training text in the coding layer to obtain the feature vector corresponding to the training text.
The feature vector is a text representation in the vector space model: through word vectors, the text data is turned from a high-dimensional, highly sparse form that is difficult for neural networks to process into continuous, dense data similar to images and speech.
203. Input the feature vector into the RNN model, model the sentences, and capture the long-distance dependency features of each sentence in the training text.
The long-distance dependency feature refers to the context vector of the text, and the context vector has long-term dependencies in the time domain.
In some implementations, capturing the long-distance dependency features of each sentence in the training text includes:
sequentially computing the long-distance dependency feature of each word in the sentence through the LSTM model, where the long-distance dependency feature of a specific word represents the dependency relationship between that word and other distant words in the sentence.
In some implementations, the RNN model may adopt a Long Short-Term Memory (LSTM) model, which can use a wide range of context information during text processing to determine the probability of the next word. Specifically, this includes the following steps:
sequentially computing the semantic structure feature of each word, where the semantic structure feature of a specific word characterizes the semantic structure of the partial sentence containing that word and the words before it;
combining the long-distance dependency feature and the semantic structure feature of each word to obtain the word feature of each word in the sentence;
computing the probability of each word in the sentence based on each word feature.
In some implementations, considering that the training text may be continuous data, such as speech, lyrics, or essays, a loop operation may be used to capture long-distance dependency information from such continuous data, ensuring that the signal keeps propagating. Specifically, when the training text is continuous data such as speech, lyrics, or an essay, sequentially computing the long-distance dependency features of each word in the sentence through the LSTM model includes:
sequentially and cyclically computing the long-distance dependency information of each word in the sentence through the LSTM model, so as to capture the long-distance dependency features from the continuous data.
204. Input the feature vector that has captured the long-distance dependency information into the convolutional neural network (CNN) model in the neural network model.
205. Extract local features from the feature vector in the CNN model to obtain the target feature vector.
A local feature refers to local correlation within the feature vector, and can also be described as key information in the feature vector similar to an n-gram.
In some implementations, to further improve the generalization ability of the classifier and the accuracy of text classification, the CNN model may adopt the CNN-RF model. The table in FIG. 2b compares the classification accuracy of the NB, CART, RF, CNN, and CNN-RF models on three text datasets (fudan, weibo, and MR).
206. Input the target feature vector into the classifier, and classify the training text through the classifier to obtain the classified text.
In the embodiment of this application, the neural network model includes a classifier, and the input of the classifier is the output of the CNN model. In the neural network model, the classifier is trained on the feature vectors until it converges.
In some implementations, a threshold may also be preset for the classifier: if the classifier's output is greater than the threshold, the sample is a WebShell; if the output of SoftMax is less than the threshold, it is not a WebShell.
与现有机制相比,本申请实施例中,先利用RNN模型处理长期信息的特点捕获长距离依赖特征,这样能够准确的判断相关性较强的上下文向量,以及避免信号在传递过程中损失大量信息,然后利用CNN模型对局部特征的感知特点提取局部特征,最后再将CNN模型的输出输入到分类其中进行分类处理,由于输入分类器中的特征向量同时具备长距离依赖特征和局部特征,所以能够有效的提升不同长度句子的分类效果,以及提高所述神经网络模型识别文本的准确性。此外,结合CNN模型的特征提取能力与随机森林的泛化能力,泛化能力可以从以下三个方面分析:Compared with the existing mechanism, in the embodiment of the present application, the characteristics of long-term information processing by the RNN model are used to capture long-distance dependent features, which can accurately determine the context vector with strong correlation and avoid a large amount of signal loss in the transmission process. Then use the CNN model to extract the local features from the perceptual characteristics of the local features, and finally input the output of the CNN model into the classification for classification processing. Since the feature vector in the input classifier has both long-distance dependent features and local features, The classification effect of sentences of different lengths can be effectively improved, and the accuracy of text recognition by the neural network model can be improved. In addition, combining the feature extraction ability of the CNN model and the generalization ability of the random forest, the generalization ability can be analyzed from the following three aspects:
First, from a statistical point of view, since the hypothesis space of a learning task is often large, multiple hypotheses may achieve the same level of performance on the training set; in this situation, using a single decision tree may lead to poor generalization because the wrong hypothesis is selected.
Second, from the perspective of feature extraction, dual word vectors describe the meaning of a word from two angles, enriching short-text information and expanding the feature information relative to a single word vector.
Third, from the perspective of representation, the true hypothesis of some learning tasks may lie outside the hypothesis space of the current decision tree algorithm; a single classification method would then fail to reach the intended hypothesis. Moreover, the Bootstrap sampling used by the random forest reduces the machine learning model's dependence on the data and lowers the model's variance, giving the model better generalization ability.
Optionally, in some embodiments of the present application, before the training text is classified through the classifier, the method further includes the following steps (a minimal illustrative sketch of this pipeline follows the list):
inputting multiple sentences into the neural network model, and performing word vectorization on each sentence to obtain multiple word vectors;
inputting each word vector into an LSTM model or a GRU model to extract long-distance dependency features;
inputting the long-distance dependency features into the CNN model and extracting position-invariant local features, finally obtaining multiple feature vectors, each of which has both long-distance dependency features and position-invariant local features;
inputting the multiple feature vectors into a pooling layer to perform dimensionality reduction on them;
inputting the feature vectors obtained by the dimensionality reduction into the classifier.
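As a rough, non-authoritative sketch of this pipeline (layer sizes, vocabulary size, and class count are all hypothetical assumptions, not taken from the disclosure), the stages might be composed as follows:

```python
import torch
import torch.nn as nn

class LstmCnnClassifier(nn.Module):
    """Word vectors -> LSTM (long-distance features) -> CNN (local features)
    -> max pooling (dimensionality reduction) -> classifier."""

    def __init__(self, vocab_size=10000, embed_dim=128,
                 hidden_dim=256, num_filters=100, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.conv = nn.Conv1d(hidden_dim, num_filters, kernel_size=3)
        self.fc = nn.Linear(num_filters, num_classes)

    def forward(self, token_ids):                     # (batch, seq)
        x = self.embed(token_ids)                     # word vectorization
        x, _ = self.lstm(x)                           # long-distance dependencies
        x = torch.relu(self.conv(x.transpose(1, 2)))  # local (n-gram-like) features
        x = torch.max(x, dim=2).values                # pooling / dimension reduction
        return self.fc(x)                             # scores for the classifier

model = LstmCnnClassifier()
logits = model(torch.randint(0, 10000, (4, 9)))       # 4 sentences of 9 tokens
print(logits.shape)                                   # torch.Size([4, 2])
```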
Optionally, in some embodiments of the present application, before the feature vectors obtained by the dimensionality reduction are input into the classifier, the method further includes:
presetting a threshold for the classifier;
where an output of the classifier greater than the threshold indicates a WebShell, and an output of the classifier less than the preset threshold indicates that the sample is not a WebShell.
Classifying the training text through the classifier to obtain the classified text includes:
setting the number N of decision trees in the classifier, and performing Bootstrap sampling to obtain N data sets;
learning the parameters θn of each of the N decision trees; and
training each decision tree in parallel; after the training of a single decision tree is completed, counting the votes over the training results of the trained decision trees to determine the final output of the CNN-RF model, where one representation of the final output of the CNN-RF model is:
c^{*} = \arg\max_{c} \sum_{i=1}^{N} I(T_i(x) = c)

where T_i(x) is the classification result of tree i on sample x (i.e., majority voting), I(·) is the indicator function, c* is the final category of the sample, and N is the number of decision trees in the classifier.
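A minimal sketch of this bootstrap-and-vote procedure (using scikit-learn's DecisionTreeClassifier purely as a stand-in; the data shapes and tree count are hypothetical):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))     # hypothetical pooled feature vectors
y = rng.integers(0, 2, size=200)    # hypothetical binary labels
N = 10                              # number of decision trees

trees = []
for _ in range(N):
    # Bootstrap sampling: each tree sees a resampled copy of the data set.
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def predict(x):
    """c* = argmax_c sum_i I(T_i(x) = c): majority vote over the N trees."""
    votes = [t.predict(x.reshape(1, -1))[0] for t in trees]
    return np.bincount(votes).argmax()

print(predict(X[0]))  # final category c* for one sample
```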
In this embodiment of the present application, the classifier may adopt a random forest model or a Softmax model. When the random forest model is adopted, the fully connected layer feature Cfinal may be fed into the random forest model for training.
Since the fully connected layer feature Cfinal is usually of low dimension (for typical data sets, m × s < 10³), the cost of building the random forest model is very small.
For ease of understanding, a specific application scenario is taken as an example below. As shown in Figure 2c, multiple sentences are input into the neural network model and word-vectorized, yielding multiple word vectors (for example, h1, h2, ..., h9). Each word vector is input into the LSTM model or the GRU model to extract long-distance dependency features (for example, y1, y2, ..., y9). The long-distance dependency features are input into the CNN model, position-invariant local features are extracted, and multiple feature vectors are finally obtained, each having both long-distance dependency features and position-invariant local features. The multiple feature vectors are then input into the pooling layer for dimensionality reduction. The feature vectors obtained by the dimensionality reduction are input into a classifier (for example, Softmax) for which a threshold is preset: when the Softmax output is greater than the threshold, the sample is a WebShell; when the Softmax output is less than the threshold, it is not a WebShell.
The technical features mentioned in any of the embodiments or implementations corresponding to Figures 1 to 2c above also apply to the embodiments corresponding to Figures 3 and 4 of this application; similar points are not repeated below.
A text classification method of this application has been described above; the apparatus that performs the text classification method is described below.
Figure 3 is a schematic structural diagram of a text classification apparatus 30, which can be applied to classifying texts such as news articles, papers, posts, and e-mails. The text classification apparatus 30 in this embodiment of the present application can implement the steps of the text classification method executed in the embodiment corresponding to Figure 1 above. The functions implemented by the text classification apparatus 30 can be realized by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware. The text classification apparatus 30 may include an input/output module 301, a processing module 302, and an acquisition module 303; for the functional implementation of these modules, reference may be made to the operations performed in the embodiment corresponding to Figure 1, which are not repeated here. The processing module 302 may be used to control the input/output operations of the input/output module 301 and the acquisition operations of the acquisition module 303.
In some embodiments, the input/output module 301 may be used to obtain training text.
The processing module 302 may be used to input the training text obtained by the input/output module 301 into the encoding layer of the neural network model, perform word vectorization on the training text in the encoding layer to obtain feature vectors corresponding to the training text, and input the feature vectors into the RNN model to model the sentences.
The acquisition module 303 may be used to capture the long-distance dependency features of each sentence in the training text, where the long-distance dependency features refer to the context vectors of the text, and the context vectors exhibit long-term dependence in the time domain.
The input/output module 301 is further used to input the feature vectors in which the acquisition module has captured the long-distance dependency information into the convolutional neural network (CNN) model of the neural network model.
The processing module 302 is further used to extract local features from the feature vectors in the CNN model to obtain target feature vectors, where a local feature refers to a local correlation within the feature vector; and to input, through the input/output module, the target feature vectors into the classifier and classify the training text through the classifier to obtain the classified text.
In this embodiment of the present application, the RNN model's ability to process long-term information is first used to capture long-distance dependency features, which makes it possible to accurately identify strongly correlated context vectors and to avoid losing a large amount of information as the signal propagates; the CNN model's sensitivity to local features is then used to extract local features; finally, the output of the CNN model is input into the classifier for classification. Because the feature vectors entering the classifier carry both long-distance dependency features and local features, the classification of sentences of different lengths is effectively improved, as is the accuracy with which the neural network model recognizes text.
In some embodiments, the acquisition module 303 is specifically used to:
calculate, through the LSTM model, the long-distance dependency features of each word in a sentence in sequence, where the long-distance dependency feature of a particular word characterizes the dependency between that word and other distant words in the sentence; calculate the semantic structure features of each word in sequence, where the semantic structure feature of a particular word characterizes the semantic structure of the partial sentence consisting of that word and the words before it; combine the long-distance dependency feature and the semantic structure feature of each word to obtain the word feature of each word in the sentence; and calculate the probability of each word in the sentence based on the word features.
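As an illustrative sketch only (the combination by concatenation and the projection to a vocabulary softmax are assumptions, not specified by the disclosure), the per-word feature combination and probability computation might look like:

```python
import torch
import torch.nn as nn

hidden_dim, vocab_size, seq_len = 256, 10000, 9   # hypothetical sizes

long_dist = torch.randn(1, seq_len, hidden_dim)   # long-distance dependency features
semantic = torch.randn(1, seq_len, hidden_dim)    # semantic structure features

# Combine the two features of each word (here: by concatenation).
word_features = torch.cat([long_dist, semantic], dim=-1)   # (1, seq, 2*hidden)

# Project each word feature to vocabulary scores and normalize to probabilities.
proj = nn.Linear(2 * hidden_dim, vocab_size)
probs = torch.softmax(proj(word_features), dim=-1)         # (1, seq, vocab)
print(probs.sum(dim=-1))                                   # each word's probabilities sum to 1
```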
In some embodiments, when the training text is continuous data such as a speech transcript, song lyrics, or an essay, the processing module 302 is specifically used to:
cyclically calculate, through the LSTM model, the long-distance dependency information of each word in the sentence in sequence, so as to capture the long-distance dependency features from the continuous data.
In some embodiments, before classifying the training text through the classifier, the processing module 302 is further used to: input multiple sentences into the neural network model through the input/output module 301 and perform word vectorization on each sentence to obtain multiple word vectors; input each word vector into the LSTM model or the GRU model through the input/output module 301 to extract long-distance dependency features; input the long-distance dependency features into the CNN model through the input/output module 301 and extract position-invariant local features, finally obtaining multiple feature vectors, each of which has both long-distance dependency features and position-invariant local features; input the multiple feature vectors into the pooling layer through the input/output module 301 for dimensionality reduction; and input the feature vectors obtained by the dimensionality reduction into the classifier through the input/output module 301.
In some embodiments, before inputting the feature vectors obtained by the dimensionality reduction into the classifier, the processing module 302 is further used to: preset a threshold for the classifier, where an output of the classifier greater than the threshold indicates a WebShell and an output less than the preset threshold indicates that the sample is not a WebShell; set the number N of decision trees in the classifier and perform Bootstrap sampling to obtain N data sets; learn the parameters θn of each of the N decision trees; and train each decision tree in parallel, where, after the training of a single decision tree is completed, the votes over the training results of the trained decision trees are counted to determine the final output of the CNN-RF model. One representation of the final output of the CNN-RF model is:

c^{*} = \arg\max_{c} \sum_{i=1}^{N} I(T_i(x) = c)

where T_i(x) is the classification result of tree i on sample x (i.e., majority voting), I(·) is the indicator function, c* is the final category of the sample, and N is the number of decision trees in the classifier.
In some embodiments, the training text is a WebShell, which is a command execution environment existing in the form of a web page file such as asp, php, jsp, or cgi. The input/output module 301 performs one of the following operations to obtain a WebShell: using a search engine to find common vulnerabilities disclosed on the Internet and obtaining a WebShell if the target site has not been patched; performing a code audit on an open-source CMS through a code audit strategy and mining code vulnerabilities from the CMS to obtain a WebShell; obtaining a WebShell through an upload vulnerability; obtaining a WebShell through a SQL injection attack; or obtaining a WebShell through a database backup.
The physical device corresponding to the input/output module 301 shown in Figure 3 is the input/output unit shown in Figure 4, which can implement part or all of the functions of the input/output module 301, or implement the same or similar functions as the input/output module 301. The physical device corresponding to the acquisition module 303 shown in Figure 3 is the acquisition device shown in Figure 4.
The physical device corresponding to the processing module 302 shown in Figure 3 is the processor shown in Figure 4, which can implement part or all of the functions of the processing module 302, or implement the same or similar functions as the processing module 302.
The text classification apparatus 30 in the embodiments of this application has been described above from the perspective of modular functional entities; a computer device is described below from a hardware perspective. As shown in Figure 4, it includes: a processor, a memory, an input/output unit (which may also be a transceiver, not identified in Figure 4), and a computer program stored in the memory and executable on the processor. For example, the computer program may be the program corresponding to the text classification method in the embodiment corresponding to Figure 1. When the computer device implements the functions of the text classification apparatus 30 shown in Figure 3, the processor, by executing the computer program, implements the steps of the text classification method executed by the text classification apparatus 30 in the embodiment corresponding to Figure 3, or implements the functions of the modules of the text classification apparatus 30 in the embodiment corresponding to Figure 3.
The processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor is the control center of the computer device and connects the parts of the entire computer device through various interfaces and lines.
The memory may be used to store the computer program and/or modules. The processor implements the various functions of the computer device by running or executing the computer program and/or modules stored in the memory and by invoking data stored in the memory. The memory may mainly include a program storage area and a data storage area: the program storage area may store the operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created according to the use of the mobile phone (such as audio data and video data). In addition, the memory may include a high-speed random access memory, and may also include a non-volatile memory such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
The transceiver may also be replaced by a receiver and a transmitter, which may be the same or different physical entities. When they are the same physical entity, they may be collectively referred to as a transceiver. The transceiver may be the input/output unit.
The memory may be integrated in the processor, or may be provided separately from the processor.
This application further provides a non-volatile computer-readable storage medium including instructions which, when run on a computer, cause the computer to execute the following steps of the text classification method:
obtaining training text, where the training text includes multiple sentences and each sentence includes multiple words;
inputting the training text into the encoding layer of the neural network model, and performing word vectorization on the training text in the encoding layer to obtain feature vectors corresponding to the training text;
inputting the feature vectors into the RNN model, modeling the sentences, and capturing the long-distance dependency features of each sentence in the training text, where the long-distance dependency features refer to the context vectors of the text and the context vectors exhibit long-term dependence in the time domain;
inputting the feature vectors that have captured the long-distance dependency information into the convolutional neural network (CNN) model of the neural network model;
extracting local features from the feature vectors in the CNN model to obtain target feature vectors, where a local feature refers to a local correlation within the feature vector; and
inputting the target feature vectors into the classifier, and classifying the training text through the classifier to obtain the classified text.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM) and includes several instructions for causing a terminal (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of this application.

Claims (20)

  1. A text classification method, the method comprising:
    obtaining training text, wherein the training text comprises a plurality of sentences and each sentence comprises a plurality of words;
    inputting the training text into an encoding layer of a neural network model, and performing word vectorization on the training text in the encoding layer to obtain feature vectors corresponding to the training text;
    inputting the feature vectors into an RNN model, modeling the sentences, and capturing long-distance dependency features of each sentence in the training text, wherein the long-distance dependency features refer to context vectors of the text, and the context vectors exhibit long-term dependence in the time domain;
    inputting the feature vectors that have captured the long-distance dependency information into a convolutional neural network (CNN) model of the neural network model;
    extracting local features from the feature vectors in the CNN model to obtain target feature vectors, wherein a local feature refers to a local correlation within the feature vector; and
    inputting the target feature vectors into a classifier, and classifying the training text through the classifier to obtain classified text.
  2. The text classification method according to claim 1, wherein capturing the long-distance dependency features of each sentence in the training text comprises:
    calculating, through an LSTM model, the long-distance dependency features of each word in a sentence in sequence, wherein the long-distance dependency feature of a particular word characterizes the dependency between the particular word and other distant words in the sentence;
    the method further comprising:
    calculating semantic structure features of each word in sequence, wherein the semantic structure feature of a particular word characterizes the semantic structure of the partial sentence comprising the particular word and the words before it;
    combining the long-distance dependency feature and the semantic structure feature of each word to obtain the word feature of each word in the sentence; and
    calculating the probability of each word in the sentence based on the word features.
  3. The text classification method according to claim 2, wherein, when the training text is continuous data such as a speech transcript, song lyrics, or an essay, calculating the long-distance dependency features of each word in the sentence in sequence through the LSTM model comprises:
    cyclically calculating, through the LSTM model, the long-distance dependency information of each word in the sentence in sequence, so as to capture the long-distance dependency features from the continuous data.
  4. The text classification method according to claim 3, wherein, before the training text is classified through the classifier, the method further comprises:
    inputting a plurality of sentences into the neural network model, and performing word vectorization on each sentence to obtain a plurality of word vectors;
    inputting each word vector into the LSTM model or a GRU model to extract long-distance dependency features;
    inputting the long-distance dependency features into the CNN model, and extracting position-invariant local features to finally obtain a plurality of feature vectors, each of which has both long-distance dependency features and position-invariant local features;
    inputting the plurality of feature vectors into a pooling layer to perform dimensionality reduction on the feature vectors; and
    inputting the feature vectors obtained by the dimensionality reduction into the classifier.
  5. The text classification method according to claim 4, wherein, before the feature vectors obtained by the dimensionality reduction are input into the classifier, the method further comprises:
    presetting a threshold for the classifier,
    wherein an output of the classifier greater than the threshold indicates a WebShell, and an output of the classifier less than the preset threshold indicates that the sample is not a WebShell;
    wherein classifying the training text through the classifier to obtain the classified text comprises:
    setting the number N of decision trees in the classifier, and performing Bootstrap sampling to obtain N data sets;
    learning the parameters θn of each of the N decision trees; and
    training each decision tree in parallel, wherein, after the training of a single decision tree is completed, the votes over the training results of the trained decision trees are counted to determine the final output of the CNN-RF model, one representation of the final output of the CNN-RF model being:

    c^{*} = \arg\max_{c} \sum_{i=1}^{N} I(T_i(x) = c)

    wherein T_i(x) is the classification result of tree i on sample x, I(·) is the indicator function, c* is the final category of the sample, and N is the number of decision trees in the classifier.
  6. The text classification method according to any one of claims 1 to 5, wherein the training text is a WebShell, the WebShell being a command execution environment existing in the form of a web page file such as asp, php, jsp, or cgi, and wherein obtaining the training text comprises one of the following implementations:
    using a search engine to find common vulnerabilities disclosed on the Internet, and obtaining a WebShell if a target site has not been patched;
    performing a code audit on an open-source CMS through a code audit strategy, and mining code vulnerabilities from the CMS to obtain a WebShell;
    obtaining a WebShell through an upload vulnerability;
    obtaining a WebShell through a SQL injection attack; or
    obtaining a WebShell through a database backup.
  7. A text classification apparatus, comprising:
    an input/output module configured to obtain training text, wherein the training text comprises a plurality of sentences and each sentence comprises a plurality of words;
    a processing module configured to input the training text into an encoding layer of a neural network model, perform word vectorization on the training text in the encoding layer to obtain feature vectors corresponding to the training text, and input the feature vectors into an RNN model to model the sentences; and
    an acquisition module configured to capture long-distance dependency features of each sentence in the training text, wherein the long-distance dependency features refer to context vectors of the text, and the context vectors exhibit long-term dependence in the time domain;
    wherein the input/output module is further configured to input the feature vectors in which the acquisition module has captured the long-distance dependency information into a convolutional neural network (CNN) model of the neural network model; and
    wherein the processing module is further configured to extract local features from the feature vectors in the CNN model to obtain target feature vectors, a local feature referring to a local correlation within the feature vector, and to input, through the input/output module, the target feature vectors into a classifier and classify the training text through the classifier to obtain classified text.
  8. The text classification apparatus according to claim 7, wherein the acquisition module is specifically configured to:
    calculate, through an LSTM model, the long-distance dependency features of each word in a sentence in sequence, wherein the long-distance dependency feature of a particular word characterizes the dependency between the particular word and other distant words in the sentence;
    calculate semantic structure features of each word in sequence, wherein the semantic structure feature of a particular word characterizes the semantic structure of the partial sentence comprising the particular word and the words before it;
    combine the long-distance dependency feature and the semantic structure feature of each word to obtain the word feature of each word in the sentence; and
    calculate the probability of each word in the sentence based on the word features.
  9. The text classification apparatus according to claim 8, wherein, when the training text is continuous data such as a speech transcript, song lyrics, or an essay, the processing module is specifically configured to:
    cyclically calculate, through the LSTM model, the long-distance dependency information of each word in the sentence in sequence, so as to capture the long-distance dependency features from the continuous data.
  10. The text classification apparatus according to claim 9, wherein, before the training text is classified through the classifier, the processing module is further configured to:
    input, through the input/output module, a plurality of sentences into the neural network model, and perform word vectorization on each sentence to obtain a plurality of word vectors;
    input, through the input/output module, each word vector into the LSTM model or a GRU model to extract long-distance dependency features;
    input, through the input/output module, the long-distance dependency features into the CNN model, and extract position-invariant local features to finally obtain a plurality of feature vectors, each of which has both long-distance dependency features and position-invariant local features;
    input, through the input/output module, the plurality of feature vectors into a pooling layer to perform dimensionality reduction on the feature vectors; and
    input, through the input/output module, the feature vectors obtained by the dimensionality reduction into the classifier.
  11. The text classification apparatus according to claim 10, wherein, before the feature vectors obtained by the dimensionality reduction are input into the classifier, the processing module is further configured to:
    preset a threshold for the classifier,
    wherein an output of the classifier greater than the threshold indicates a WebShell, and an output of the classifier less than the preset threshold indicates that the sample is not a WebShell;
    wherein classifying the training text through the classifier to obtain the classified text comprises:
    setting the number N of decision trees in the classifier, and performing Bootstrap sampling to obtain N data sets;
    learning the parameters θn of each of the N decision trees; and
    training each decision tree in parallel, wherein, after the training of a single decision tree is completed, the votes over the training results of the trained decision trees are counted to determine the final output of the CNN-RF model, one representation of the final output of the CNN-RF model being:

    c^{*} = \arg\max_{c} \sum_{i=1}^{N} I(T_i(x) = c)

    wherein T_i(x) is the classification result of tree i on sample x, I(·) is the indicator function, c* is the final category of the sample, and N is the number of decision trees in the classifier.
  12. The text classification apparatus according to any one of claims 7 to 11, wherein the training text is a WebShell, the WebShell being a command execution environment existing in the form of a web page file such as asp, php, jsp, or cgi, and the input/output module performs one of the following operations to obtain a WebShell:
    using a search engine to find common vulnerabilities disclosed on the Internet, and obtaining a WebShell if a target site has not been patched;
    performing a code audit on an open-source CMS through a code audit strategy, and mining code vulnerabilities from the CMS to obtain a WebShell;
    obtaining a WebShell through an upload vulnerability;
    obtaining a WebShell through a SQL injection attack; or
    obtaining a WebShell through a database backup.
  13. A computer device, comprising:
    at least one processor, a memory, and an input/output unit;
    wherein the memory is configured to store program code, and the processor is configured to invoke the program code stored in the memory to execute the following steps:
    obtaining training text, wherein the training text comprises a plurality of sentences and each sentence comprises a plurality of words;
    inputting the training text into an encoding layer of a neural network model, and performing word vectorization on the training text in the encoding layer to obtain feature vectors corresponding to the training text;
    inputting the feature vectors into an RNN model, modeling the sentences, and capturing long-distance dependency features of each sentence in the training text, wherein the long-distance dependency features refer to context vectors of the text, and the context vectors exhibit long-term dependence in the time domain;
    inputting the feature vectors that have captured the long-distance dependency information into a convolutional neural network (CNN) model of the neural network model;
    extracting local features from the feature vectors in the CNN model to obtain target feature vectors, wherein a local feature refers to a local correlation within the feature vector; and
    inputting the target feature vectors into a classifier, and classifying the training text through the classifier to obtain classified text.
  14. The computer device according to claim 13, wherein, when the processor executes the computer program to capture the long-distance dependency features of each sentence in the training text, the following steps are included:
    calculating, through an LSTM model, the long-distance dependency features of each word in a sentence in sequence, wherein the long-distance dependency feature of a particular word characterizes the dependency between the particular word and other distant words in the sentence;
    the method further comprising:
    calculating semantic structure features of each word in sequence, wherein the semantic structure feature of a particular word characterizes the semantic structure of the partial sentence comprising the particular word and the words before it;
    combining the long-distance dependency feature and the semantic structure feature of each word to obtain the word feature of each word in the sentence; and
    calculating the probability of each word in the sentence based on the word features.
  15. The computer device according to claim 14, wherein, when the training text is continuous data such as a speech transcript, song lyrics, or an essay, calculating the long-distance dependency features of each word in the sentence in sequence through the LSTM model comprises:
    cyclically calculating, through the LSTM model, the long-distance dependency information of each word in the sentence in sequence, so as to capture the long-distance dependency features from the continuous data.
  16. The computer device according to claim 15, wherein, before the processor executes the computer program to classify the training text through the classifier, the following steps are further included:
    inputting a plurality of sentences into the neural network model, and performing word vectorization on each sentence to obtain a plurality of word vectors;
    inputting each word vector into the LSTM model or a GRU model to extract long-distance dependency features;
    inputting the long-distance dependency features into the CNN model, and extracting position-invariant local features to finally obtain a plurality of feature vectors, each of which has both long-distance dependency features and position-invariant local features;
    inputting the plurality of feature vectors into a pooling layer to perform dimensionality reduction on the feature vectors; and
    inputting the feature vectors obtained by the dimensionality reduction into the classifier.
  17. The computer device according to claim 16, wherein, before the processor executes the computer program to input the feature vectors obtained by the dimensionality reduction into the classifier, the following steps are further included:
    presetting a threshold for the classifier,
    wherein an output of the classifier greater than the threshold indicates a WebShell, and an output of the classifier less than the preset threshold indicates that the sample is not a WebShell;
    wherein classifying the training text through the classifier to obtain the classified text comprises:
    setting the number N of decision trees in the classifier, and performing Bootstrap sampling to obtain N data sets;
    learning the parameters θn of each of the N decision trees; and
    training each decision tree in parallel, wherein, after the training of a single decision tree is completed, the votes over the training results of the trained decision trees are counted to determine the final output of the CNN-RF model, one representation of the final output of the CNN-RF model being:

    c^{*} = \arg\max_{c} \sum_{i=1}^{N} I(T_i(x) = c)

    wherein T_i(x) is the classification result of tree i on sample x, I(·) is the indicator function, c* is the final category of the sample, and N is the number of decision trees in the classifier.
  18. The computer device according to any one of claims 13 to 17, wherein the training text is a WebShell, the WebShell being a command execution environment existing in the form of a web page file such as asp, php, jsp, or cgi, and the processor performs one of the following operations to obtain a WebShell:
    using a search engine to find common vulnerabilities disclosed on the Internet, and obtaining a WebShell if a target site has not been patched;
    performing a code audit on an open-source CMS through a code audit strategy, and mining code vulnerabilities from the CMS to obtain a WebShell;
    obtaining a WebShell through an upload vulnerability;
    obtaining a WebShell through a SQL injection attack; or
    obtaining a WebShell through a database backup.
  19. A computer storage medium, wherein the computer-readable storage medium stores instructions which, when run on a computer, cause the computer to execute the following steps:
    obtaining training text, wherein the training text comprises a plurality of sentences and each sentence comprises a plurality of words;
    inputting the training text into an encoding layer of a neural network model, and performing word vectorization on the training text in the encoding layer to obtain feature vectors corresponding to the training text;
    inputting the feature vectors into an RNN model, modeling the sentences, and capturing long-distance dependency features of each sentence in the training text, wherein the long-distance dependency features refer to context vectors of the text, and the context vectors exhibit long-term dependence in the time domain;
    inputting the feature vectors that have captured the long-distance dependency information into a convolutional neural network (CNN) model of the neural network model;
    extracting local features from the feature vectors in the CNN model to obtain target feature vectors, wherein a local feature refers to a local correlation within the feature vector; and
    inputting the target feature vectors into a classifier, and classifying the training text through the classifier to obtain classified text.
  20. The computer-readable storage medium according to claim 19, wherein, when the computer-readable storage medium is executed by a processor, the following steps are further implemented:
    calculating, through an LSTM model, the long-distance dependency features of each word in a sentence in sequence, wherein the long-distance dependency feature of a particular word characterizes the dependency between the particular word and other distant words in the sentence;
    the method further comprising:
    calculating semantic structure features of each word in sequence, wherein the semantic structure feature of a particular word characterizes the semantic structure of the partial sentence comprising the particular word and the words before it; and
    combining the long-distance dependency feature and the semantic structure feature of each word to obtain the word feature of each word in the sentence, and calculating the probability of each word in the sentence based on the word features.
PCT/CN2019/102464 2019-06-04 2019-08-26 Text classification method, apparatus, device, and storage medium WO2020244066A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910479226.7A CN110309304A (en) 2019-06-04 2019-06-04 A kind of file classification method, device, equipment and storage medium
CN201910479226.7 2019-06-04

Publications (1)

Publication Number Publication Date
WO2020244066A1 true WO2020244066A1 (en) 2020-12-10

Family

ID=68075283

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/102464 WO2020244066A1 (en) 2019-06-04 2019-08-26 Text classification method, apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN110309304A (en)
WO (1) WO2020244066A1 (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699944A (en) * 2020-12-31 2021-04-23 中国银联股份有限公司 Order-returning processing model training method, processing method, device, equipment and medium
CN112784601A (en) * 2021-02-03 2021-05-11 中山大学孙逸仙纪念医院 Key information extraction method and device, electronic equipment and storage medium
CN112950313A (en) * 2021-02-25 2021-06-11 北京嘀嘀无限科技发展有限公司 Order processing method and device, electronic equipment and readable storage medium
CN113190154A (en) * 2021-04-29 2021-07-30 北京百度网讯科技有限公司 Model training method, entry classification method, device, apparatus, storage medium, and program
CN113221537A (en) * 2021-04-12 2021-08-06 湘潭大学 Aspect-level emotion analysis method based on truncated cyclic neural network and proximity weighted convolution
CN113239192A (en) * 2021-04-29 2021-08-10 湘潭大学 Text structuring technology based on sliding window and random discrete sampling
CN113468872A (en) * 2021-06-09 2021-10-01 大连理工大学 Biomedical relation extraction method and system based on sentence level graph convolution
CN113486347A (en) * 2021-06-30 2021-10-08 福州大学 Deep learning hardware Trojan horse detection method based on semantic understanding
CN114021651A (en) * 2021-11-04 2022-02-08 桂林电子科技大学 Block chain violation information perception method based on deep learning
CN114169443A (en) * 2021-12-08 2022-03-11 西安交通大学 Word-level text countermeasure sample detection method
CN114499944A (en) * 2021-12-22 2022-05-13 天翼云科技有限公司 Method, device and equipment for detecting WebShell
CN114510576A (en) * 2021-12-21 2022-05-17 一拓通信集团股份有限公司 Entity relationship extraction method based on BERT and BiGRU fusion attention mechanism
CN115249017A (en) * 2021-06-23 2022-10-28 马上消费金融股份有限公司 Text labeling method, intention recognition model training method and related equipment
CN116227495A (en) * 2023-05-05 2023-06-06 公安部信息通信中心 Entity classification data processing system
CN116453385A (en) * 2023-03-16 2023-07-18 中山市加乐美科技发展有限公司 Space-time disk learning machine
CN116958752A (en) * 2023-09-20 2023-10-27 国网湖北省电力有限公司经济技术研究院 Power grid infrastructure archiving method, device and equipment based on IPKCNN-SVM
CN117093996A (en) * 2023-10-18 2023-11-21 湖南惟储信息技术有限公司 Safety protection method and system for embedded operating system
CN117201733A (en) * 2023-08-22 2023-12-08 杭州中汇通航航空科技有限公司 Real-time unmanned aerial vehicle monitoring and sharing system
CN117668562A (en) * 2024-01-31 2024-03-08 腾讯科技(深圳)有限公司 Training and using method, device, equipment and medium of text classification model
CN117623735B (en) * 2023-12-01 2024-05-14 广东雅诚德实业有限公司 Production method of high-strength anti-pollution domestic ceramic

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113032534A (en) * 2019-12-24 2021-06-25 中国移动通信集团四川有限公司 Dialog text classification method and electronic equipment
CN111177392A (en) * 2019-12-31 2020-05-19 腾讯云计算(北京)有限责任公司 Data processing method and device
CN111538840B (en) * 2020-06-23 2023-04-28 基建通(三亚)国际科技有限公司 Text classification method and device
CN111930938A (en) * 2020-07-06 2020-11-13 武汉卓尔数字传媒科技有限公司 Text classification method and device, electronic equipment and storage medium
CN111865960A (en) * 2020-07-15 2020-10-30 北京市燃气集团有限责任公司 Network intrusion scene analysis processing method, system, terminal and storage medium
CN114050908B (en) * 2020-07-24 2023-07-21 中国移动通信集团浙江有限公司 Method, device, computing equipment and computer storage medium for automatically auditing firewall policy
CN112118225B (en) * 2020-08-13 2021-09-03 紫光云(南京)数字技术有限公司 Webshell detection method and device based on RNN
CN112148943A (en) * 2020-09-27 2020-12-29 北京天融信网络安全技术有限公司 Webpage classification method and device, electronic equipment and readable storage medium
CN112491891B (en) * 2020-11-27 2022-05-17 杭州电子科技大学 Network attack detection method based on hybrid deep learning in Internet of things environment
CN112686315A (en) * 2020-12-31 2021-04-20 山西三友和智慧信息技术股份有限公司 Deep learning-based unnatural earthquake classification method
CN112699964A (en) * 2021-01-13 2021-04-23 成都链安科技有限公司 Model construction method, system, device, medium and transaction identity identification method
CN113010740B (en) * 2021-03-09 2023-05-30 腾讯科技(深圳)有限公司 Word weight generation method, device, equipment and medium
CN115359867B (en) * 2022-09-06 2024-02-02 中国电信股份有限公司 Electronic medical record classification method, device, electronic equipment and storage medium
CN116226702B (en) * 2022-09-09 2024-04-26 武汉中数医疗科技有限公司 Thyroid sampling data identification method based on bioelectrical impedance

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030078899A1 (en) * 2001-08-13 2003-04-24 Xerox Corporation Fuzzy text categorizer
CN102141978A (en) * 2010-02-02 2011-08-03 阿里巴巴集团控股有限公司 Method and system for classifying texts
CN104572892A (en) * 2014-12-24 2015-04-29 中国科学院自动化研究所 Text classification method based on cyclic convolution network
CN107103754A (en) * 2017-05-10 2017-08-29 华南师范大学 A kind of road traffic condition Forecasting Methodology and system
CN108829818A (en) * 2018-06-12 2018-11-16 中国科学院计算技术研究所 A kind of file classification method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11250311B2 (en) * 2017-03-15 2022-02-15 Salesforce.Com, Inc. Deep neural network-based decision network
CN107066553B (en) * 2017-03-24 2021-01-01 北京工业大学 Short text classification method based on convolutional neural network and random forest
CN107832400B (en) * 2017-11-01 2019-04-16 山东大学 A kind of method that location-based LSTM and CNN conjunctive model carries out relationship classification

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699944A (en) * 2020-12-31 2021-04-23 中国银联股份有限公司 Order-returning processing model training method, processing method, device, equipment and medium
CN112699944B (en) * 2020-12-31 2024-04-23 中国银联股份有限公司 Training method, processing method, device, equipment and medium for returning list processing model
CN112784601A (en) * 2021-02-03 2021-05-11 中山大学孙逸仙纪念医院 Key information extraction method and device, electronic equipment and storage medium
CN112784601B (en) * 2021-02-03 2023-06-27 中山大学孙逸仙纪念医院 Key information extraction method, device, electronic equipment and storage medium
CN112950313A (en) * 2021-02-25 2021-06-11 北京嘀嘀无限科技发展有限公司 Order processing method and device, electronic equipment and readable storage medium
CN113221537A (en) * 2021-04-12 2021-08-06 湘潭大学 Aspect-level sentiment analysis method based on truncated recurrent neural network and proximity-weighted convolution
CN113239192A (en) * 2021-04-29 2021-08-10 湘潭大学 Text structuring technology based on sliding window and random discrete sampling
CN113190154B (en) * 2021-04-29 2023-10-13 北京百度网讯科技有限公司 Model training and entry classification methods, apparatuses, devices, storage medium and program
CN113239192B (en) * 2021-04-29 2024-04-16 湘潭大学 Text structuring technology based on sliding window and random discrete sampling
CN113190154A (en) * 2021-04-29 2021-07-30 北京百度网讯科技有限公司 Model training method, entry classification method, device, apparatus, storage medium, and program
CN113468872A (en) * 2021-06-09 2021-10-01 大连理工大学 Biomedical relation extraction method and system based on sentence level graph convolution
CN113468872B (en) * 2021-06-09 2024-04-16 大连理工大学 Biomedical relation extraction method and system based on sentence level graph convolution
CN115249017A (en) * 2021-06-23 2022-10-28 马上消费金融股份有限公司 Text labeling method, intention recognition model training method and related equipment
CN115249017B (en) * 2021-06-23 2023-12-19 马上消费金融股份有限公司 Text labeling method, training method of intention recognition model and related equipment
CN113486347A (en) * 2021-06-30 2021-10-08 福州大学 Deep learning hardware Trojan horse detection method based on semantic understanding
CN113486347B (en) * 2021-06-30 2023-07-14 福州大学 Deep learning hardware Trojan horse detection method based on semantic understanding
CN114021651A (en) * 2021-11-04 2022-02-08 桂林电子科技大学 Block chain violation information perception method based on deep learning
CN114021651B (en) * 2021-11-04 2024-03-29 桂林电子科技大学 Block chain illegal information sensing method based on deep learning
CN114169443B (en) * 2021-12-08 2024-02-06 西安交通大学 Word-level text adversarial example detection method
CN114169443A (en) * 2021-12-08 2022-03-11 西安交通大学 Word-level text adversarial example detection method
CN114510576A (en) * 2021-12-21 2022-05-17 一拓通信集团股份有限公司 Entity relationship extraction method based on BERT and BiGRU fusion attention mechanism
CN114499944B (en) * 2021-12-22 2023-08-08 天翼云科技有限公司 Method, device and equipment for detecting WebShell
CN114499944A (en) * 2021-12-22 2022-05-13 天翼云科技有限公司 Method, device and equipment for detecting WebShell
CN116453385B (en) * 2023-03-16 2023-11-24 中山市加乐美科技发展有限公司 Space-time disk learning machine
CN116453385A (en) * 2023-03-16 2023-07-18 中山市加乐美科技发展有限公司 Space-time disk learning machine
CN116227495A (en) * 2023-05-05 2023-06-06 公安部信息通信中心 Entity classification data processing system
CN117201733A (en) * 2023-08-22 2023-12-08 杭州中汇通航航空科技有限公司 Real-time unmanned aerial vehicle monitoring and sharing system
CN117201733B (en) * 2023-08-22 2024-03-12 杭州中汇通航航空科技有限公司 Real-time unmanned aerial vehicle monitoring and sharing system
CN116958752A (en) * 2023-09-20 2023-10-27 国网湖北省电力有限公司经济技术研究院 Power grid infrastructure archiving method, device and equipment based on IPKCNN-SVM
CN116958752B (en) * 2023-09-20 2023-12-15 国网湖北省电力有限公司经济技术研究院 Power grid infrastructure archiving method, device and equipment based on IPKCNN-SVM
CN117093996A (en) * 2023-10-18 2023-11-21 湖南惟储信息技术有限公司 Safety protection method and system for embedded operating system
CN117093996B (en) * 2023-10-18 2024-02-06 湖南惟储信息技术有限公司 Safety protection method and system for embedded operating system
CN117623735B (en) * 2023-12-01 2024-05-14 广东雅诚德实业有限公司 Production method of high-strength anti-pollution domestic ceramic
CN117668562A (en) * 2024-01-31 2024-03-08 腾讯科技(深圳)有限公司 Training and using method, device, equipment and medium of text classification model
CN117668562B (en) * 2024-01-31 2024-04-19 腾讯科技(深圳)有限公司 Training and using method, device, equipment and medium of text classification model

Also Published As

Publication number Publication date
CN110309304A (en) 2019-10-08

Similar Documents

Publication Publication Date Title
WO2020244066A1 (en) Text classification method, apparatus, device, and storage medium
WO2019136993A1 (en) Text similarity calculation method and device, computer apparatus, and storage medium
WO2021072885A1 (en) Method and apparatus for recognizing text, device and storage medium
CN107612893B (en) Short message auditing system and method and short message auditing model building method
WO2017084586A1 (en) Method, system, and device for inferring malicious code rule based on deep learning method
CN110210617B (en) Adversarial example generation method and generation device based on feature enhancement
WO2020253350A1 (en) Network content publication auditing method and apparatus, computer device and storage medium
CN109670163B (en) Information identification method, information recommendation method, template construction method and computing device
CN110162593A (en) Search result processing and similarity model training method and device
US20200125836A1 (en) Training Method for Descreening System, Descreening Method, Device, Apparatus and Medium
CN111460153A (en) Hot topic extraction method and device, terminal device and storage medium
CN111814770A (en) Content keyword extraction method of news video, terminal device and medium
EP4258610A1 (en) Malicious traffic identification method and related apparatus
US20170193098A1 (en) System and method for topic modeling using unstructured manufacturing data
CN111859968A (en) Text structuring method, text structuring device and terminal equipment
CN111177375B (en) Electronic document classification method and device
CN110929145A (en) Public opinion analysis method, public opinion analysis device, computer device and storage medium
CN109446299B (en) Method and system for searching e-mail content based on event recognition
CN112507167A (en) Method and device for identifying video collection, electronic equipment and storage medium
CN110765286A (en) Cross-media retrieval method and device, computer equipment and storage medium
CN114416998A (en) Text label identification method and device, electronic equipment and storage medium
TWI749349B (en) Text restoration method, device, electronic equipment and computer readable storage medium
WO2023273303A1 (en) Tree model-based method and apparatus for acquiring degree of influence of event, and computer device
CN111444362A (en) Malicious picture intercepting method, device, equipment and storage medium
CN115314268B (en) Malicious encryption traffic detection method and system based on traffic fingerprint and behavior

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19932046; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in the European phase (Ref document number: 19932046; Country of ref document: EP; Kind code of ref document: A1)