WO2020244066A1 - Text classification method, apparatus, device, and storage medium - Google Patents

Text classification method, apparatus, device, and storage medium

Info

Publication number
WO2020244066A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
word
long
text
model
Prior art date
Application number
PCT/CN2019/102464
Other languages
French (fr)
Chinese (zh)
Inventor
李坤 (Li Kun)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2020244066A1 publication Critical patent/WO2020244066A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Definitions

  • This application relates to the field of text classification, and in particular to a text classification method, device, equipment, and storage medium.
  • Text classification is a key task in natural language processing that helps users discover useful information in massive amounts of data. Text classification is mainly applied to spam recognition, sentiment analysis, question answering systems, translation, and so on.
  • A sentence model aims to learn text features that represent sentences; it is a key model for text classification.
  • In intrusion detection systems, WebShell detection is also a kind of text classification.
  • Current text classification is mostly based on statistics and machine learning.
  • The statistical approach splits sentences and, based on a corpus, counts the probability that adjacent characters form a word: the more often adjacent characters appear together, the higher the probability that they form a word, and the text is segmented according to these probability values, so a complete corpus is very important.
  • The machine learning approach computes text features with the TF-IDF algorithm and then classifies the text with classifiers such as logistic regression, SVM, or random forest. These methods, however, are time-consuming and labor-intensive, generalize poorly, and have a high false alarm rate.
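  • As an illustration of this prior-art pipeline (not code from the application itself), a minimal sketch using scikit-learn might look as follows; the toy corpus, labels, and n-gram settings are assumptions for demonstration only:

```python
# Prior-art baseline sketched above: TF-IDF features fed to a classical
# classifier. Corpus and labels are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["eval(base64_decode($_POST['x']));",   # WebShell-like snippet
         "<p>Welcome to our site</p>"]          # benign page content
labels = [1, 0]                                 # 1 = WebShell, 0 = benign

# TF-IDF turns each document into a sparse weighted term vector; logistic
# regression (or an SVM / random forest) then classifies that vector.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(),
)
model.fit(texts, labels)
print(model.predict(["system($_GET['cmd']);"]))
```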
  • This application provides a text classification method, device, equipment and storage medium, which can solve the problem of poor accuracy of text classification in the prior art.
  • In a first aspect, the present application provides a text classification method, which includes: obtaining training text; inputting the training text into the coding layer of a neural network model and performing word vectorization on the training text in the coding layer to obtain a feature vector corresponding to the training text; inputting the feature vector into an RNN model, modeling the sentences, and capturing the long-distance dependency feature of each sentence in the training text, where the long-distance dependency feature refers to the context vector of the text and the context vector has long-term dependencies in the time domain; inputting the feature vector that has captured the long-distance dependency information into the convolutional neural network (CNN) model in the neural network model; extracting local features from the feature vector in the CNN model to obtain a target feature vector, where a local feature refers to local correlation within the feature vector; and inputting the target feature vector into the classifier, which classifies the training text to obtain the classified text.
  • In one possible design, capturing the long-distance dependency features of each sentence in the training text includes: sequentially computing the long-distance dependency feature of each word in the sentence through the LSTM model, where the long-distance dependency feature of a specific word represents the dependency relationship between that word and other distant words in the sentence. The method further includes: sequentially computing the semantic structure feature of each word, where the semantic structure feature of a specific word characterizes the semantic structure of the partial sentence containing that word and the words before it; combining the long-distance dependency feature and the semantic structure feature of each word to obtain the word feature of each word in the sentence; and computing the probability of each word in the sentence based on each word feature.
  • In one possible design, when the training text is continuous data such as speech, lyrics, or an essay, sequentially computing the long-distance dependency features of each word in the sentence through the LSTM model includes: sequentially and cyclically computing the long-distance dependency information of each word in the sentence through the LSTM model, so as to capture the long-distance dependency features from the continuous data.
  • In one possible design, before the training text is classified by the classifier, the method further includes: inputting multiple sentences into the neural network model and performing word vectorization on each sentence to obtain multiple word vectors; inputting each word vector into an LSTM model or a GRU model to extract long-distance dependency features; inputting the long-distance dependency features into the CNN model to extract position-invariant local features, finally obtaining multiple feature vectors, each of which has both long-distance dependency features and position-invariant local features; inputting the multiple feature vectors into the pooling layer to reduce their dimensionality; and inputting the dimensionality-reduced feature vectors into the classifier.
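  • The following is a minimal sketch of this pipeline in tf.keras; all sizes (vocabulary, sequence length, embedding width, filter counts) are illustrative assumptions rather than values taken from the application:

```python
# Embedding -> LSTM (long-distance dependencies) -> Conv1D (local features)
# -> pooling (dimensionality reduction) -> classifier, as described above.
import tensorflow as tf

vocab_size, seq_len, embed_dim = 20000, 200, 128  # assumed sizes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(seq_len,)),
    tf.keras.layers.Embedding(vocab_size, embed_dim),   # word vectorization
    tf.keras.layers.LSTM(128, return_sequences=True),   # long-distance dependency features
    tf.keras.layers.Conv1D(64, 3, activation="relu"),   # position-invariant local features
    tf.keras.layers.GlobalMaxPooling1D(),               # pooling: dimensionality reduction
    tf.keras.layers.Dense(1, activation="sigmoid"),     # binary classifier (WebShell or not)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```

  • Replacing the LSTM layer with tf.keras.layers.GRU gives the GRU variant mentioned above.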
  • In one possible design, before the dimensionality-reduced feature vectors are input into the classifier, the method further includes: presetting a threshold for the classifier, where an output of the classifier greater than the threshold indicates a WebShell and an output less than the preset threshold indicates that the sample is not a WebShell. Classifying the training text by the classifier to obtain the classified text includes: setting the size N of the decision-tree ensemble in the classifier and performing Bootstrap sampling to obtain N data sets; learning the parameter θn of each of the N decision trees; and training the decision trees in parallel, where after the training of a single decision tree is completed, the votes cast on the training results are counted to determine the final output of the CNN-RF model. One representation of the final output of the CNN-RF model is: c^* = \arg\max_c \sum_{i=1}^{N} I(T_i(x) = c), where Ti(x) is the classification result of tree i on sample x, c* is the final category of the sample, and N is the number of decision trees in the classifier.
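  • A small sketch of this voting rule (an assumed illustration using numpy, not code from the application):

```python
# Majority vote over the trees: c* = argmax_c sum_i I(T_i(x) = c).
import numpy as np

def forest_vote(tree_votes: np.ndarray) -> int:
    """tree_votes holds T_i(x), the class predicted by each of the N trees."""
    classes, counts = np.unique(tree_votes, return_counts=True)
    return int(classes[np.argmax(counts)])   # the class with the most votes

print(forest_vote(np.array([1, 0, 1, 1, 0])))  # -> 1 (three of five trees vote 1)
```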
  • In one possible design, the training text is a Webshell, which is a command execution environment existing in the form of web-page files such as asp, php, jsp, or cgi.
  • Obtaining the training text includes one of the following implementations: using a search engine to find common vulnerabilities disclosed on the Internet and obtaining the WebShell if the target site has not been patched; performing a code audit on an open-source CMS through a code-audit strategy and mining code vulnerabilities from the CMS to obtain the WebShell; exploiting an upload vulnerability to obtain the WebShell; using a SQL injection attack to obtain the WebShell; or using a database backup to obtain the WebShell.
  • In a second aspect, the present application provides a text classification device with functions corresponding to the text classification method provided in the first aspect.
  • the function can be realized by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware.
  • Another aspect of the present application provides a computer device, which includes at least one processor, a memory, and a transceiver connected to one another, where the memory is used to store program code and the processor is used to call the program code in the memory to perform the method described in the first aspect.
  • Another aspect of the present application provides a computer storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium, and it stores instructions that, when run on a computer, cause the computer to execute the method described in the first aspect.
  • In this application, the RNN model's ability to process long-term information is first used to capture long-distance dependency features, which makes it possible to accurately identify strongly correlated context vectors and to avoid losing large amounts of information as the signal propagates; the CNN model's sensitivity to local features is then used to extract local features, and the output of the CNN model is finally fed into the classifier for classification. Because the feature vectors entering the classifier carry both long-distance dependency features and local features, the classification of sentences of different lengths is effectively improved, as is the accuracy with which the neural network model recognizes text.
  • FIG. 1 is a schematic flowchart of a text classification method in an embodiment of this application
  • Figure 2a is a schematic flowchart of a text classification method in an embodiment of this application.
  • FIG. 2b is a table comparing classification accuracy on the fudan, Weibo, and MR datasets in an embodiment of the application;
  • FIG. 2c is a schematic diagram of another flow chart of a text classification method according to an embodiment of this application.
  • FIG. 3 is a schematic diagram of a structure of text classification in an embodiment of this application.
  • Fig. 4 is a schematic diagram of a structure of a computer device in an embodiment of the application.
  • This application provides a text classification method, device, equipment, and storage medium, which can be used to classify texts such as news, papers, posts, and emails. This application does not limit the application scenarios of text classification.
  • this application mainly provides the following technical solutions:
  • the neural network model of the present application includes a CNN model and an RNN model, and a schematic diagram of the structure of the neural network model is shown in FIG. 1.
  • the coding layer of the neural network model includes an RNN model and a CNN model.
  • the input of the neural network model is the input of the RNN model
  • the output of the RNN model is the input of the CNN model.
  • the output of the CNN model is the output of the neural network model.
  • a text classification method in an embodiment of the present application is introduced below, and the method includes:
  • the training text includes multiple sentences, and each sentence includes multiple words.
  • the training text in this application is Webshell.
  • Webshell is a command execution environment in the form of web files such as asp, php, jsp, or cgi, which can also be called a web backdoor.
  • After hackers invade a website, they usually mix asp or php backdoor files with the normal web-page files in the WEB directory of the website server, and then use a browser to access the asp or php backdoor to obtain a command execution environment, thereby controlling the website server.
  • In some implementations, a content management system (Content Management System, CMS) may be used to obtain the Webshell, and one of the following implementations may be used to obtain the training text:
  • For example, a publicly disclosed vulnerability may be exploited: a search engine is used to find common vulnerabilities disclosed on the Internet, and if the target site has not been patched, the WebShell is obtained.
  • This application does not limit the method and source of obtaining training text.
  • The feature vector is a text representation in the vector space model.
  • Through word vectors, the text data is turned from a high-dimensional, highly sparse form that is difficult for neural networks to process into continuous, dense data similar to images and speech.
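  • A toy illustration of this step (assumed for demonstration; in practice the embedding matrix is learned during training):

```python
# Word vectorization: sparse vocabulary indices are mapped to dense,
# continuous vectors through an embedding matrix.
import numpy as np

vocab = {"<pad>": 0, "eval": 1, "base64_decode": 2, "echo": 3}  # toy vocabulary
embed_dim = 4
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(vocab), embed_dim))  # learned in practice; random here

sentence = ["eval", "base64_decode"]
ids = [vocab[w] for w in sentence]
features = embedding[ids]     # shape (2, 4): dense, continuous representation
print(features.shape)
```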
  • the long-distance dependence feature refers to the context vector of the text, and the context vector is long-term dependent in the time domain.
  • Capturing the long-distance dependency features of each sentence in the training text includes:
  • sequentially computing the long-distance dependency feature of each word in the sentence through the LSTM model, where the long-distance dependency feature of a specific word represents the dependency relationship between that word and other distant words in the sentence.
  • the RNN model may adopt a Long Short-Term Memory (LSTM) model, through which a wide range of context information can be used in text processing to determine the probability of the next word.
  • the LSTM model can use a wide range of context information in text processing to determine the probability of the next word, including the following steps:
  • The training text may be continuous data, such as speech, lyrics, or essays.
  • A loop operation can be used to capture long-distance dependency information from such continuous data, ensuring that the signal keeps propagating.
  • When the training text is continuous data such as speech, lyrics, or an essay,
  • the sequential calculation of the long-distance dependent features of each word in the sentence through the LSTM model includes:
  • the long-distance dependence information of each word in the sentence is sequentially and cyclically calculated by the LSTM model to capture the long-distance dependence feature from the continuous data.
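  • The sequential, cyclic computation can be pictured with the stripped-down recurrence below (an assumed illustration: a plain tanh recurrence is shown for brevity, whereas a real LSTM adds input, forget, and output gates so that long-distance signals survive):

```python
# Each step reuses the previous hidden state, so information from distant
# words can reach the current word through the recurrent loop.
import numpy as np

def rnn_scan(xs: np.ndarray, hidden_dim: int = 8, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    W_x = rng.normal(size=(xs.shape[1], hidden_dim))
    W_h = rng.normal(size=(hidden_dim, hidden_dim))
    h = np.zeros(hidden_dim)
    states = []
    for x in xs:                          # sequential, cyclic pass over the words
        h = np.tanh(x @ W_x + h @ W_h)    # h_t depends on x_t and on h_{t-1}
        states.append(h)
    return np.stack(states)

print(rnn_scan(np.ones((5, 4))).shape)    # (5, 8): one hidden state per word
```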
  • the local feature refers to the local correlation in the feature vector, and can also be referred to as key information similar to n-gram in the feature vector.
  • the CNN model may adopt the CNN-RF model.
  • The table in FIG. 2b compares the classification accuracy of the NB, CART, RF, CNN, and CNN-RF models on three text datasets (fudan, weibo, and MR).
  • the neural network model includes a classifier, and the input of the classifier is the output of the CNN model.
  • In the neural network model, the classifier is trained on the feature vectors until it converges.
  • A threshold may also be preset for the classifier: if the classifier's output is greater than the threshold, the sample is a WebShell; if the output of SoftMax is less than the threshold, it is not a WebShell.
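  • A sketch of this thresholded decision (the 0.5 value is an assumed placeholder, not a number from the application):

```python
THRESHOLD = 0.5  # preset for the classifier; value is illustrative

def is_webshell(classifier_output: float) -> bool:
    # Output above the threshold means WebShell; below means not a WebShell.
    return classifier_output > THRESHOLD

print(is_webshell(0.91))  # True: flagged as WebShell
print(is_webshell(0.12))  # False: not a WebShell
```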
  • Compared with existing mechanisms, the RNN model's ability to process long-term information is first used to capture long-distance dependency features, which makes it possible to accurately identify strongly correlated context vectors and to avoid losing large amounts of information as the signal propagates; the CNN model's sensitivity to local features is then used to extract local features, and the output of the CNN model is finally fed into the classifier for classification. Because the feature vectors entering the classifier carry both long-distance dependency features and local features, the classification of sentences of different lengths is effectively improved, as is the accuracy with which the neural network model recognizes text. In addition, combining the feature extraction ability of the CNN model with the generalization ability of the random forest, the generalization ability can be analyzed from the following three aspects:
  • First, from a statistical point of view, the hypothesis space of a learning task is often large, and multiple hypotheses may reach the same level of performance on the training set; if a single decision tree is used, mis-selection may lead to poor generalization.
  • Second, from the perspective of feature extraction, dual word vectors describe the meaning of words from two angles, enriching short-text information and expanding the feature information relative to a single word vector.
  • Third, from the perspective of representation, the true hypothesis of some learning tasks may not lie within the hypothesis space of the current decision tree algorithm; with a single classification method, the intended hypothesis space may never be searched. Moreover, Bootstrap sampling in the random forest reduces the machine learning model's dependence on the data and reduces the model's variance, so that the RNN model has better generalization ability.
  • Before the training text is classified by the classifier, the method further includes:
  • the feature vector obtained by the dimensionality reduction process is input to the classifier.
  • Before the dimensionality-reduced feature vectors are input into the classifier, the method further includes:
  • If the output of the classifier is greater than the threshold, the sample is a WebShell; if the output of the classifier is less than the preset threshold, it is not a WebShell.
  • Classifying the training text by the classifier to obtain the classified text includes: setting the size N of the decision-tree ensemble in the classifier and performing Bootstrap sampling to obtain N data sets; learning the parameter θn of each of the N decision trees; and training each decision tree in parallel, where after the training of a single decision tree is completed, the votes cast on the training results are counted to determine the final output of the CNN-RF model. One representation of the final output of the CNN-RF model is:
  • c^* = \arg\max_c \sum_{i=1}^{N} I(T_i(x) = c)
  • Ti(x) is the classification result of tree i on sample x, i.e., its vote;
  • c* is the final category corresponding to the sample
  • N is the number of decision trees in the classifier.
  • the classifier may adopt a random forest model or a Softmax model.
  • When the random forest model is adopted, the fully connected layer feature Cfinal may be sent to the random forest model for training.
  • Alternatively, the dimensionality-reduced feature vector is input into a classifier such as SoftMax, for which a threshold is set in advance.
  • If the output of SoftMax is greater than the threshold, the sample is a WebShell; if the output of SoftMax is less than the threshold, it is not a WebShell.
  • the text classification method in the present application is described above, and the device that executes the text classification method is described below.
  • a schematic structural diagram of a text classification device 30 shown in FIG. 3 can be applied to classify texts such as news, papers, posts, and mails.
  • the text classification device 30 in the embodiment of the present application can implement the steps corresponding to the text classification method executed in the embodiment corresponding to FIG. 1.
  • the functions implemented by the text classification device 30 can be implemented by hardware, or can be implemented by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware.
  • the text classification device 30 may include an input and output module 301, a processing module 302, and a collection module 303.
  • For the functional implementation of the input and output module 301, the processing module 302, and the collection module 303, reference may be made to the operations performed in the embodiment corresponding to FIG. 1, which will not be repeated here.
  • The processing module 302 can be used to control the input and output operations of the input and output module 301 and the collection operations of the collection module 303.
  • the input and output module 301 may be used to obtain training text
  • The processing module 302 may be configured to input the training text obtained by the input and output module 301 into the coding layer of the neural network model, perform word vectorization on the training text in the coding layer to obtain the feature vector corresponding to the training text, and input the feature vector into the RNN model to model the sentences.
  • the acquisition module 303 can be used to capture the long-distance dependent features of each sentence in the training text; wherein the long-distance dependent features refer to the context vector of the text, and the context vector is dependent on the time domain for a long time;
  • the input and output module 301 is further configured to input the feature vector of the long-distance dependence information captured by the acquisition module into the convolutional neural network CNN model in the neural network model;
  • The processing module 302 is also used to extract local features from the feature vector in the CNN model to obtain a target feature vector, where a local feature refers to local correlation within the feature vector; the target feature vector is input into the classifier through the input and output module, and the classifier classifies the training text to obtain the classified text.
  • The RNN model's ability to process long-term information is used to capture long-distance dependency features, which makes it possible to accurately identify strongly correlated context vectors and to avoid losing large amounts of information as the signal propagates; the CNN model's sensitivity to local features is then used to extract local features, and the output of the CNN model is finally fed into the classifier for classification. Because the feature vectors entering the classifier carry both long-distance dependency features and local features, the classification of sentences of different lengths is effectively improved, as is the accuracy with which the neural network model recognizes text.
  • In some implementations, the collection module 303 is specifically configured to:
  • sequentially compute the long-distance dependency feature of each word in the sentence through the LSTM model, where the long-distance dependency feature of a specific word represents the dependency relationship between that word and other distant words in the sentence; sequentially compute the semantic structure feature of each word, where the semantic structure feature of a specific word characterizes the semantic structure of the partial sentence containing that word and the words before it; combine the long-distance dependency feature and the semantic structure feature of each word to obtain the word feature of each word in the sentence; and compute the probability of each word in the sentence based on each word feature.
  • the processing module 302 is specifically configured to:
  • the long-distance dependence information of each word in the sentence is sequentially and cyclically calculated by the LSTM model to capture the long-distance dependence feature from the continuous data.
  • In some implementations, the processing module 302 is further configured to: input multiple sentences into the neural network model through the input and output module 301 and perform word vectorization on each sentence to obtain multiple word vectors; input each word vector into the LSTM model or the GRU model through the input and output module 301 to extract long-distance dependency features; input the long-distance dependency features into the CNN model through the input and output module 301 to extract position-invariant local features, finally obtaining multiple feature vectors, each of which has both long-distance dependency features and position-invariant local features; input the multiple feature vectors into the pooling layer through the input and output module 301 to reduce their dimensionality; and input the dimensionality-reduced feature vectors into the classifier through the input and output module 301.
  • In some implementations, before inputting the dimensionality-reduced feature vectors into the classifier, the processing module 302 is further configured to: preset a threshold for the classifier, where an output of the classifier greater than the threshold indicates a WebShell and an output less than the preset threshold indicates that the sample is not a WebShell; set the size N of the decision-tree ensemble in the classifier and perform Bootstrap sampling to obtain N data sets; learn the parameter θn of each of the N decision trees; and train each decision tree in parallel, where after the training of a single decision tree is completed, the votes cast on the training results are counted to determine the final output of the CNN-RF model. One representation of the final output of the CNN-RF model is:
  • c^* = \arg\max_c \sum_{i=1}^{N} I(T_i(x) = c)
  • Ti(x) is the classification result of tree i on sample x, i.e., its vote;
  • c* is the final category corresponding to the sample
  • N is the number of decision trees in the classifier.
  • In some implementations, the training text is a Webshell, which is a command execution environment existing in the form of web-page files such as asp, php, jsp, or cgi;
  • The input and output module 301 performs one of the following operations to obtain the WebShell: using a search engine to find common vulnerabilities disclosed on the Internet and obtaining the WebShell if the target site has not been patched; performing a code audit on an open-source CMS through a code-audit strategy and mining code vulnerabilities from the CMS to obtain the WebShell; exploiting an upload vulnerability to obtain the WebShell; using a SQL injection attack to obtain the WebShell; or using a database backup to obtain the WebShell.
  • The physical device corresponding to the input-output module 301 shown in FIG. 3 is the input-output unit shown in FIG. 4, which can implement part or all of the functions of the input-output module 301, or implement the same or similar functions as the input-output module 301.
  • the physical device corresponding to the collection module 303 shown in FIG. 3 is the collection device shown in FIG. 4.
  • the physical device corresponding to the processing module 302 shown in FIG. 3 is the processor shown in FIG. 4, and the processor can implement part or all of the functions of the processing module 302 or implement the same or similar functions as the processing module 302.
  • the text classification device 30 in the embodiment of the present application is described above from the perspective of modular functional entities.
  • The following describes a computer device from a hardware perspective, as shown in FIG. 4, which includes: a processor, a memory, an input and output unit (which may also be a transceiver; not labeled in FIG. 4), and a computer program stored in the memory and runnable on the processor.
  • the computer program may be a program corresponding to the text classification method in the embodiment corresponding to FIG. 1.
  • The processor executes the computer program to implement the text classification method executed by the text classification device 30 in the embodiment corresponding to FIG. 3, thereby realizing the functions of each module in the text classification device 30 of that embodiment.
  • The processor may be a central processing unit (Central Processing Unit, CPU), another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor, etc.
  • the processor is the control center of the computer equipment, and various interfaces and lines are used to connect various parts of the entire computer equipment.
  • The memory may be used to store the computer program and/or modules; the processor implements the various functions of the computer device by running or executing the computer programs and/or modules stored in the memory and by calling the data stored in the memory.
  • The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created according to the use of the device (such as audio data and video data).
  • The memory may include high-speed random access memory, and may also include non-volatile memory such as a hard disk, memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
  • the transceiver may also be replaced by a receiver and a transmitter, and may be the same or different physical entities. When they are the same physical entity, they can be collectively referred to as transceivers.
  • the transceiver can be an input and output unit.
  • the memory may be integrated in the processor, or may be provided separately from the processor.
  • the present application also provides a non-volatile computer-readable storage medium, including instructions, which when run on a computer, cause the computer to execute the following steps of the text classification method:
  • obtaining training text, where the training text includes multiple sentences and each sentence includes multiple words;
  • inputting the target feature vector into the classifier, and classifying the training text through the classifier to obtain the classified text.
  • The methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform; they can of course also be implemented by hardware, but in many cases the former is the better implementation.
  • The technical solution of this application, in essence or in the part contributing to the existing technology, can be embodied in the form of a software product.
  • The computer software product is stored in a storage medium (such as ROM/RAM) and includes several instructions for causing a terminal (which may be a mobile phone, a computer, a server, or a network device, etc.) to execute the methods described in the embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to the field of text classification, and provides a text classification method, an apparatus, a device, and a storage medium. The method comprises: acquiring a training text; inputting the training text into a coding layer of a neural network model and performing word vectorization on the training text in the coding layer to obtain a feature vector corresponding to the training text; inputting the feature vector into an RNN model, modeling the sentences, and capturing a long-distance dependency feature of each sentence in the training text; inputting the feature vector carrying the captured long-distance dependency information into a convolutional neural network (CNN) model in the neural network model; extracting a local feature from the feature vector in the CNN model to obtain a target feature vector, the local feature indicating local relevance within the feature vector; and inputting the target feature vector into a classifier and classifying the training text by means of the classifier to obtain a classified text.

Description

Text classification method, device, equipment and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on June 4, 2019, with application number 201910479226.7 and invention title "Text classification method, device, equipment and storage medium", the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of text classification, and in particular to a text classification method, device, equipment, and storage medium.
Background
Text classification is a key task in natural language processing that helps users discover useful information in massive amounts of data; it is mainly applied to spam recognition, sentiment analysis, question answering systems, translation, and so on. A sentence model learns text features that represent sentences and is a key model for text classification.
In intrusion detection systems, WebShell detection is also a kind of text classification. Current text classification is mostly based on statistics and machine learning. The statistical approach splits sentences and, based on a corpus, counts the probability that adjacent characters form a word: the more often adjacent characters appear together, the higher that probability, and the text is segmented according to these probability values, so a complete corpus is very important. The machine learning approach computes text features with the TF-IDF algorithm and then classifies the text with classifiers such as logistic regression, SVM, or random forest. The inventor realized, however, that these methods are time-consuming and labor-intensive, generalize poorly, and have a high false alarm rate.
Summary of the invention
This application provides a text classification method, device, equipment, and storage medium, which can solve the problem of poor text classification accuracy in the prior art.
In a first aspect, this application provides a text classification method, which includes: obtaining training text; inputting the training text into the coding layer of a neural network model and performing word vectorization on the training text in the coding layer to obtain a feature vector corresponding to the training text; inputting the feature vector into an RNN model, modeling the sentences, and capturing the long-distance dependency feature of each sentence in the training text, where the long-distance dependency feature refers to the context vector of the text and the context vector has long-term dependencies in the time domain; inputting the feature vector that has captured the long-distance dependency information into the convolutional neural network (CNN) model in the neural network model; extracting local features from the feature vector in the CNN model to obtain a target feature vector, where a local feature refers to local correlation within the feature vector; and inputting the target feature vector into the classifier, which classifies the training text to obtain the classified text.
In one possible design, capturing the long-distance dependency features of each sentence in the training text includes: sequentially computing the long-distance dependency feature of each word in the sentence through the LSTM model, where the long-distance dependency feature of a specific word represents the dependency relationship between that word and other distant words in the sentence. The method further includes: sequentially computing the semantic structure feature of each word, where the semantic structure feature of a specific word characterizes the semantic structure of the partial sentence containing that word and the words before it; combining the long-distance dependency feature and the semantic structure feature of each word to obtain the word feature of each word in the sentence; and computing the probability of each word in the sentence based on each word feature.
In one possible design, when the training text is continuous data such as speech, lyrics, or an essay, sequentially computing the long-distance dependency features of each word in the sentence through the LSTM model includes: sequentially and cyclically computing the long-distance dependency information of each word in the sentence through the LSTM model, so as to capture the long-distance dependency features from the continuous data.
In one possible design, before the training text is classified by the classifier, the method further includes: inputting multiple sentences into the neural network model and performing word vectorization on each sentence to obtain multiple word vectors; inputting each word vector into an LSTM model or a GRU model to extract long-distance dependency features; inputting the long-distance dependency features into the CNN model to extract position-invariant local features, finally obtaining multiple feature vectors, each of which has both long-distance dependency features and position-invariant local features; inputting the multiple feature vectors into the pooling layer to reduce their dimensionality; and inputting the dimensionality-reduced feature vectors into the classifier.
In one possible design, before the dimensionality-reduced feature vectors are input into the classifier, the method further includes: presetting a threshold for the classifier, where an output of the classifier greater than the threshold indicates a WebShell and an output less than the preset threshold indicates that the sample is not a WebShell. Classifying the training text by the classifier to obtain the classified text includes: setting the size N of the decision-tree ensemble in the classifier and performing Bootstrap sampling to obtain N data sets; learning the parameter θn of each of the N decision trees; and training each decision tree in parallel, where after the training of a single decision tree is completed, the votes cast on the training results are counted to determine the final output of the CNN-RF model. One representation of the final output of the CNN-RF model is:
c^* = \arg\max_c \sum_{i=1}^{N} I(T_i(x) = c)
where Ti(x) is the classification result of tree i on sample x, c* is the final category of the sample, and N is the number of decision trees in the classifier.
In one possible design, the training text is a Webshell, which is a command execution environment existing in the form of web-page files such as asp, php, jsp, or cgi. Obtaining the training text includes one of the following implementations: using a search engine to find common vulnerabilities disclosed on the Internet and obtaining the WebShell if the target site has not been patched; performing a code audit on an open-source CMS through a code-audit strategy and mining code vulnerabilities from the CMS to obtain the WebShell; exploiting an upload vulnerability to obtain the WebShell; using a SQL injection attack to obtain the WebShell; or using a database backup to obtain the WebShell.
In a second aspect, this application provides a text classification device that implements functions corresponding to the text classification method provided in the first aspect. The functions may be implemented by hardware or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware.
Another aspect of this application provides a computer device, which includes at least one processor, a memory, and a transceiver connected to one another, where the memory is used to store program code and the processor is used to call the program code in the memory to perform the method described in the first aspect.
Another aspect of this application provides a computer storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium and stores instructions that, when run on a computer, cause the computer to execute the method described in the first aspect.
In this application, the RNN model's ability to process long-term information is first used to capture long-distance dependency features, which makes it possible to accurately identify strongly correlated context vectors and to avoid losing large amounts of information as the signal propagates; the CNN model's sensitivity to local features is then used to extract local features, and the output of the CNN model is finally fed into the classifier for classification. Because the feature vectors entering the classifier carry both long-distance dependency features and local features, the classification of sentences of different lengths is effectively improved, as is the accuracy with which the neural network model recognizes text.
Description of the drawings
FIG. 1 is a schematic flowchart of a text classification method in an embodiment of this application;
FIG. 2a is a schematic flowchart of a text classification method in an embodiment of this application;
FIG. 2b is a table comparing classification accuracy on the fudan, Weibo, and MR datasets in an embodiment of this application;
FIG. 2c is another schematic flowchart of a text classification method in an embodiment of this application;
FIG. 3 is a schematic structural diagram of text classification in an embodiment of this application;
FIG. 4 is a schematic structural diagram of a computer device in an embodiment of this application.
Detailed description
This application provides a text classification method, device, equipment, and storage medium, which can be used to classify texts such as news, papers, posts, and emails; this application does not limit the application scenarios of text classification.
To solve the above technical problems, this application mainly provides the following technical solution:
The convolutional neural network (CNN) model in deep learning is good at extracting position-invariant local features, while the recurrent neural network (RNN) model is good at modeling whole sentences. Combining the CNN model and the RNN model makes it possible both to capture long-distance dependency information and to extract key phrase information well; as verified in practice on an intrusion detection system project, this achieves higher accuracy than using the CNN model or the RNN model alone. The neural network model of this application includes a CNN model and an RNN model, and a schematic structural diagram of the neural network model is shown in FIG. 1.
In FIG. 1, the coding layer of the neural network model includes the RNN model and the CNN model; the input of the neural network model is the input of the RNN model, the output of the RNN model is the input of the CNN model, and the output of the CNN model is the output of the neural network model.
Referring to FIG. 2a, the following introduces a text classification method in an embodiment of this application. The method includes:
201. Obtain training text.
The training text includes multiple sentences, and each sentence includes multiple words. The training text in this application is a Webshell: a command execution environment that exists in the form of web-page files such as asp, php, jsp, or cgi, which can also be called a web backdoor. After hackers invade a website, they usually mix asp or php backdoor files with the normal web-page files in the WEB directory of the website server, and then use a browser to access the asp or php backdoor to obtain a command execution environment, thereby controlling the website server.
In some implementations, a content management system (Content Management System, CMS) may be used to obtain the Webshell, and one of the following implementations may be used to obtain the training text:
(1) A publicly disclosed vulnerability may be exploited: a search engine is used to find common vulnerabilities disclosed on the Internet, and if the target site has not been patched, the WebShell is obtained.
(2) A code audit is performed on an open-source CMS through a code-audit strategy, and code vulnerabilities are mined from the CMS to obtain the WebShell.
(3) An upload vulnerability is exploited to obtain the WebShell.
(4) A SQL injection attack is used to obtain the WebShell.
(5) A database backup is used to obtain the WebShell.
This application does not limit the method and source of obtaining the training text.
202. Input the training text into the coding layer of the neural network model, and perform word vectorization on the training text in the coding layer to obtain the feature vector corresponding to the training text.
The feature vector is a text representation in the vector space model: through word vectors, the text data is turned from a high-dimensional, highly sparse form that is difficult for neural networks to process into continuous, dense data similar to images and speech.
203. Input the feature vector into the RNN model, model the sentences, and capture the long-distance dependency features of each sentence in the training text.
The long-distance dependency feature refers to the context vector of the text, and the context vector has long-term dependencies in the time domain.
In some implementations, capturing the long-distance dependency features of each sentence in the training text includes:
sequentially computing the long-distance dependency feature of each word in the sentence through the LSTM model, where the long-distance dependency feature of a specific word represents the dependency relationship between that word and other distant words in the sentence.
In some implementations, the RNN model may adopt a Long Short-Term Memory (LSTM) model, which can use a wide range of context information during text processing to determine the probability of the next word. Specifically, this includes the following steps:
sequentially computing the semantic structure feature of each word, where the semantic structure feature of a specific word characterizes the semantic structure of the partial sentence containing that word and the words before it;
combining the long-distance dependency feature and the semantic structure feature of each word to obtain the word feature of each word in the sentence;
computing the probability of each word in the sentence based on each word feature.
In some implementations, considering that the training text may be continuous data, such as speech, lyrics, or essays, a loop operation may be used to capture long-distance dependency information from such continuous data, ensuring that the signal keeps propagating. Specifically, when the training text is continuous data such as speech, lyrics, or an essay, sequentially computing the long-distance dependency features of each word in the sentence through the LSTM model includes:
sequentially and cyclically computing the long-distance dependency information of each word in the sentence through the LSTM model, so as to capture the long-distance dependency features from the continuous data.
204. Input the feature vector that has captured the long-distance dependency information into the convolutional neural network (CNN) model in the neural network model.
205. Extract local features from the feature vector in the CNN model to obtain the target feature vector.
A local feature refers to local correlation within the feature vector, and can also be described as key information in the feature vector similar to an n-gram.
In some implementations, to further improve the generalization ability of the classifier and the accuracy of text classification, the CNN model may adopt the CNN-RF model. The table in FIG. 2b compares the classification accuracy of the NB, CART, RF, CNN, and CNN-RF models on three text datasets (fudan, weibo, and MR).
206. Input the target feature vector into the classifier, and classify the training text through the classifier to obtain the classified text.
In the embodiment of this application, the neural network model includes a classifier, and the input of the classifier is the output of the CNN model. In the neural network model, the classifier is trained on the feature vectors until it converges.
In some implementations, a threshold may also be preset for the classifier: if the classifier's output is greater than the threshold, the sample is a WebShell; if the output of SoftMax is less than the threshold, it is not a WebShell.
与现有机制相比,本申请实施例中,先利用RNN模型处理长期信息的特点捕获长距离依赖特征,这样能够准确的判断相关性较强的上下文向量,以及避免信号在传递过程中损失大量信息,然后利用CNN模型对局部特征的感知特点提取局部特征,最后再将CNN模型的输出输入到分类其中进行分类处理,由于输入分类器中的特征向量同时具备长距离依赖特征和局部特征,所以能够有效的提升不同长度句子的分类效果,以及提高所述神经网络模型识别文本的准确性。此外,结合CNN模型的特征提取能力与随机森林的泛化能力,泛化能力可以从以下三个方面分析:Compared with the existing mechanism, in the embodiment of the present application, the characteristics of long-term information processing by the RNN model are used to capture long-distance dependent features, which can accurately determine the context vector with strong correlation and avoid a large amount of signal loss in the transmission process. Then use the CNN model to extract the local features from the perceptual characteristics of the local features, and finally input the output of the CNN model into the classification for classification processing. Since the feature vector in the input classifier has both long-distance dependent features and local features, The classification effect of sentences of different lengths can be effectively improved, and the accuracy of text recognition by the neural network model can be improved. In addition, combining the feature extraction ability of the CNN model and the generalization ability of the random forest, the generalization ability can be analyzed from the following three aspects:
First, from a statistical point of view, since the hypothesis space of a learning task is often large, multiple hypotheses may achieve the same level of performance on the training set; in this situation, using a single decision tree may lead to poor generalization because the wrong hypothesis is selected.
Second, from the perspective of feature extraction, dual word vectors describe the meaning of a word from two angles, enriching short-text information and expanding the feature information relative to a single word vector.
Third, from the perspective of representation, the true hypothesis of some learning tasks may lie outside the hypothesis space of the current decision tree algorithm; a single classification method would then fail to reach the intended hypothesis. Moreover, the Bootstrap sampling used by the random forest reduces the machine learning model's dependence on the data and lowers the model's variance, giving the model better generalization ability.
Optionally, in some embodiments of the present application, before the training text is classified through the classifier, the method further includes the following steps (a minimal illustrative sketch of this pipeline follows the list):
inputting multiple sentences into the neural network model, and performing word vectorization on each sentence to obtain multiple word vectors;
inputting each word vector into an LSTM model or a GRU model to extract long-distance dependency features;
inputting the long-distance dependency features into the CNN model and extracting position-invariant local features, finally obtaining multiple feature vectors, each of which has both long-distance dependency features and position-invariant local features;
inputting the multiple feature vectors into a pooling layer to perform dimensionality reduction on them;
inputting the feature vectors obtained by the dimensionality reduction into the classifier.
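As a rough, non-authoritative sketch of this pipeline (layer sizes, vocabulary size, and class count are all hypothetical assumptions, not taken from the disclosure), the stages might be composed as follows:

```python
import torch
import torch.nn as nn

class LstmCnnClassifier(nn.Module):
    """Word vectors -> LSTM (long-distance features) -> CNN (local features)
    -> max pooling (dimensionality reduction) -> classifier."""

    def __init__(self, vocab_size=10000, embed_dim=128,
                 hidden_dim=256, num_filters=100, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.conv = nn.Conv1d(hidden_dim, num_filters, kernel_size=3)
        self.fc = nn.Linear(num_filters, num_classes)

    def forward(self, token_ids):                     # (batch, seq)
        x = self.embed(token_ids)                     # word vectorization
        x, _ = self.lstm(x)                           # long-distance dependencies
        x = torch.relu(self.conv(x.transpose(1, 2)))  # local (n-gram-like) features
        x = torch.max(x, dim=2).values                # pooling / dimension reduction
        return self.fc(x)                             # scores for the classifier

model = LstmCnnClassifier()
logits = model(torch.randint(0, 10000, (4, 9)))       # 4 sentences of 9 tokens
print(logits.shape)                                   # torch.Size([4, 2])
```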
Optionally, in some embodiments of the present application, before the feature vectors obtained by the dimensionality reduction are input into the classifier, the method further includes:
presetting a threshold for the classifier;
where an output of the classifier greater than the threshold indicates a WebShell, and an output of the classifier less than the preset threshold indicates that the sample is not a WebShell.
Classifying the training text through the classifier to obtain the classified text includes:
setting the number N of decision trees in the classifier, and performing Bootstrap sampling to obtain N data sets;
learning the parameters θn of each of the N decision trees; and
training each decision tree in parallel; after the training of a single decision tree is completed, counting the votes over the training results of the trained decision trees to determine the final output of the CNN-RF model, where one representation of the final output of the CNN-RF model is:
c^{*} = \arg\max_{c} \sum_{i=1}^{N} I(T_i(x) = c)

where T_i(x) is the classification result of tree i on sample x (i.e., majority voting), I(·) is the indicator function, c* is the final category of the sample, and N is the number of decision trees in the classifier.
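A minimal sketch of this bootstrap-and-vote procedure (using scikit-learn's DecisionTreeClassifier purely as a stand-in; the data shapes and tree count are hypothetical):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))     # hypothetical pooled feature vectors
y = rng.integers(0, 2, size=200)    # hypothetical binary labels
N = 10                              # number of decision trees

trees = []
for _ in range(N):
    # Bootstrap sampling: each tree sees a resampled copy of the data set.
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def predict(x):
    """c* = argmax_c sum_i I(T_i(x) = c): majority vote over the N trees."""
    votes = [t.predict(x.reshape(1, -1))[0] for t in trees]
    return np.bincount(votes).argmax()

print(predict(X[0]))  # final category c* for one sample
```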
In this embodiment of the present application, the classifier may adopt a random forest model or a Softmax model. When the random forest model is adopted, the fully connected layer feature Cfinal may be fed into the random forest model for training.
Since the fully connected layer feature Cfinal is usually of low dimension (for typical data sets, m × s < 10³), the cost of building the random forest model is very small.
For ease of understanding, a specific application scenario is taken as an example below. As shown in Figure 2c, multiple sentences are input into the neural network model and word-vectorized, yielding multiple word vectors (for example, h1, h2, ..., h9). Each word vector is input into the LSTM model or the GRU model to extract long-distance dependency features (for example, y1, y2, ..., y9). The long-distance dependency features are input into the CNN model, position-invariant local features are extracted, and multiple feature vectors are finally obtained, each having both long-distance dependency features and position-invariant local features. The multiple feature vectors are then input into the pooling layer for dimensionality reduction. The feature vectors obtained by the dimensionality reduction are input into a classifier (for example, Softmax) for which a threshold is preset: when the Softmax output is greater than the threshold, the sample is a WebShell; when the Softmax output is less than the threshold, it is not a WebShell.
The technical features mentioned in any of the embodiments or implementations corresponding to Figures 1 to 2c above also apply to the embodiments corresponding to Figures 3 and 4 of this application; similar points are not repeated below.
A text classification method of this application has been described above; the apparatus that performs the text classification method is described below.
Figure 3 is a schematic structural diagram of a text classification apparatus 30, which can be applied to classifying texts such as news articles, papers, posts, and e-mails. The text classification apparatus 30 in this embodiment of the present application can implement the steps of the text classification method executed in the embodiment corresponding to Figure 1 above. The functions implemented by the text classification apparatus 30 can be realized by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware. The text classification apparatus 30 may include an input/output module 301, a processing module 302, and an acquisition module 303; for the functional implementation of these modules, reference may be made to the operations performed in the embodiment corresponding to Figure 1, which are not repeated here. The processing module 302 may be used to control the input/output operations of the input/output module 301 and the acquisition operations of the acquisition module 303.
In some embodiments, the input/output module 301 may be used to obtain training text.
The processing module 302 may be used to input the training text obtained by the input/output module 301 into the encoding layer of the neural network model, perform word vectorization on the training text in the encoding layer to obtain feature vectors corresponding to the training text, and input the feature vectors into the RNN model to model the sentences.
The acquisition module 303 may be used to capture the long-distance dependency features of each sentence in the training text, where the long-distance dependency features refer to the context vectors of the text, and the context vectors exhibit long-term dependence in the time domain.
The input/output module 301 is further used to input the feature vectors in which the acquisition module has captured the long-distance dependency information into the convolutional neural network (CNN) model of the neural network model.
The processing module 302 is further used to extract local features from the feature vectors in the CNN model to obtain target feature vectors, where a local feature refers to a local correlation within the feature vector; and to input, through the input/output module, the target feature vectors into the classifier and classify the training text through the classifier to obtain the classified text.
In this embodiment of the present application, the RNN model's ability to process long-term information is first used to capture long-distance dependency features, which makes it possible to accurately identify strongly correlated context vectors and to avoid losing a large amount of information as the signal propagates; the CNN model's sensitivity to local features is then used to extract local features; finally, the output of the CNN model is input into the classifier for classification. Because the feature vectors entering the classifier carry both long-distance dependency features and local features, the classification of sentences of different lengths is effectively improved, as is the accuracy with which the neural network model recognizes text.
In some embodiments, the acquisition module 303 is specifically used to:
calculate, through the LSTM model, the long-distance dependency features of each word in a sentence in sequence, where the long-distance dependency feature of a particular word characterizes the dependency between that word and other distant words in the sentence; calculate the semantic structure features of each word in sequence, where the semantic structure feature of a particular word characterizes the semantic structure of the partial sentence consisting of that word and the words before it; combine the long-distance dependency feature and the semantic structure feature of each word to obtain the word feature of each word in the sentence; and calculate the probability of each word in the sentence based on the word features.
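As an illustrative sketch only (the combination by concatenation and the projection to a vocabulary softmax are assumptions, not specified by the disclosure), the per-word feature combination and probability computation might look like:

```python
import torch
import torch.nn as nn

hidden_dim, vocab_size, seq_len = 256, 10000, 9   # hypothetical sizes

long_dist = torch.randn(1, seq_len, hidden_dim)   # long-distance dependency features
semantic = torch.randn(1, seq_len, hidden_dim)    # semantic structure features

# Combine the two features of each word (here: by concatenation).
word_features = torch.cat([long_dist, semantic], dim=-1)   # (1, seq, 2*hidden)

# Project each word feature to vocabulary scores and normalize to probabilities.
proj = nn.Linear(2 * hidden_dim, vocab_size)
probs = torch.softmax(proj(word_features), dim=-1)         # (1, seq, vocab)
print(probs.sum(dim=-1))                                   # each word's probabilities sum to 1
```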
In some embodiments, when the training text is continuous data such as a speech transcript, song lyrics, or an essay, the processing module 302 is specifically used to:
cyclically calculate, through the LSTM model, the long-distance dependency information of each word in the sentence in sequence, so as to capture the long-distance dependency features from the continuous data.
In some embodiments, before classifying the training text through the classifier, the processing module 302 is further used to: input multiple sentences into the neural network model through the input/output module 301 and perform word vectorization on each sentence to obtain multiple word vectors; input each word vector into the LSTM model or the GRU model through the input/output module 301 to extract long-distance dependency features; input the long-distance dependency features into the CNN model through the input/output module 301 and extract position-invariant local features, finally obtaining multiple feature vectors, each of which has both long-distance dependency features and position-invariant local features; input the multiple feature vectors into the pooling layer through the input/output module 301 for dimensionality reduction; and input the feature vectors obtained by the dimensionality reduction into the classifier through the input/output module 301.
In some embodiments, before inputting the feature vectors obtained by the dimensionality reduction into the classifier, the processing module 302 is further used to: preset a threshold for the classifier, where an output of the classifier greater than the threshold indicates a WebShell and an output less than the preset threshold indicates that the sample is not a WebShell; set the number N of decision trees in the classifier and perform Bootstrap sampling to obtain N data sets; learn the parameters θn of each of the N decision trees; and train each decision tree in parallel, where, after the training of a single decision tree is completed, the votes over the training results of the trained decision trees are counted to determine the final output of the CNN-RF model. One representation of the final output of the CNN-RF model is:

c^{*} = \arg\max_{c} \sum_{i=1}^{N} I(T_i(x) = c)

where T_i(x) is the classification result of tree i on sample x (i.e., majority voting), I(·) is the indicator function, c* is the final category of the sample, and N is the number of decision trees in the classifier.
In some embodiments, the training text is a WebShell, which is a command execution environment existing in the form of a web page file such as asp, php, jsp, or cgi. The input/output module 301 performs one of the following operations to obtain a WebShell: using a search engine to find common vulnerabilities disclosed on the Internet and obtaining a WebShell if the target site has not been patched; performing a code audit on an open-source CMS through a code audit strategy and mining code vulnerabilities from the CMS to obtain a WebShell; obtaining a WebShell through an upload vulnerability; obtaining a WebShell through a SQL injection attack; or obtaining a WebShell through a database backup.
The physical device corresponding to the input/output module 301 shown in Figure 3 is the input/output unit shown in Figure 4, which can implement part or all of the functions of the input/output module 301, or implement the same or similar functions as the input/output module 301. The physical device corresponding to the acquisition module 303 shown in Figure 3 is the acquisition device shown in Figure 4.
The physical device corresponding to the processing module 302 shown in Figure 3 is the processor shown in Figure 4, which can implement part or all of the functions of the processing module 302, or implement the same or similar functions as the processing module 302.
The text classification apparatus 30 in the embodiments of this application has been described above from the perspective of modular functional entities; a computer device is described below from a hardware perspective. As shown in Figure 4, it includes: a processor, a memory, an input/output unit (which may also be a transceiver, not identified in Figure 4), and a computer program stored in the memory and executable on the processor. For example, the computer program may be the program corresponding to the text classification method in the embodiment corresponding to Figure 1. When the computer device implements the functions of the text classification apparatus 30 shown in Figure 3, the processor, by executing the computer program, implements the steps of the text classification method executed by the text classification apparatus 30 in the embodiment corresponding to Figure 3, or implements the functions of the modules of the text classification apparatus 30 in the embodiment corresponding to Figure 3.
The processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor is the control center of the computer device and connects the parts of the entire computer device through various interfaces and lines.
The memory may be used to store the computer program and/or modules. The processor implements the various functions of the computer device by running or executing the computer program and/or modules stored in the memory and by invoking data stored in the memory. The memory may mainly include a program storage area and a data storage area: the program storage area may store the operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created according to the use of the mobile phone (such as audio data and video data). In addition, the memory may include a high-speed random access memory, and may also include a non-volatile memory such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
The transceiver may also be replaced by a receiver and a transmitter, which may be the same or different physical entities. When they are the same physical entity, they may be collectively referred to as a transceiver. The transceiver may be the input/output unit.
The memory may be integrated in the processor, or may be provided separately from the processor.
This application further provides a non-volatile computer-readable storage medium including instructions which, when run on a computer, cause the computer to execute the following steps of the text classification method:
obtaining training text, where the training text includes multiple sentences and each sentence includes multiple words;
inputting the training text into the encoding layer of the neural network model, and performing word vectorization on the training text in the encoding layer to obtain feature vectors corresponding to the training text;
inputting the feature vectors into the RNN model, modeling the sentences, and capturing the long-distance dependency features of each sentence in the training text, where the long-distance dependency features refer to the context vectors of the text and the context vectors exhibit long-term dependence in the time domain;
inputting the feature vectors that have captured the long-distance dependency information into the convolutional neural network (CNN) model of the neural network model;
extracting local features from the feature vectors in the CNN model to obtain target feature vectors, where a local feature refers to a local correlation within the feature vector; and
inputting the target feature vectors into the classifier, and classifying the training text through the classifier to obtain the classified text.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM) and includes several instructions for causing a terminal (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of this application.

Claims (20)

  1. A text classification method, the method comprising:
    obtaining training text, wherein the training text comprises a plurality of sentences and each sentence comprises a plurality of words;
    inputting the training text into an encoding layer of a neural network model, and performing word vectorization on the training text in the encoding layer to obtain feature vectors corresponding to the training text;
    inputting the feature vectors into an RNN model, modeling the sentences, and capturing long-distance dependency features of each sentence in the training text, wherein the long-distance dependency features refer to context vectors of the text, and the context vectors exhibit long-term dependence in the time domain;
    inputting the feature vectors that have captured the long-distance dependency information into a convolutional neural network (CNN) model of the neural network model;
    extracting local features from the feature vectors in the CNN model to obtain target feature vectors, wherein a local feature refers to a local correlation within the feature vector; and
    inputting the target feature vectors into a classifier, and classifying the training text through the classifier to obtain classified text.
  2. The text classification method according to claim 1, wherein capturing the long-distance dependency features of each sentence in the training text comprises:
    calculating, through an LSTM model, the long-distance dependency features of each word in a sentence in sequence, wherein the long-distance dependency feature of a particular word characterizes the dependency between the particular word and other distant words in the sentence;
    the method further comprising:
    calculating semantic structure features of each word in sequence, wherein the semantic structure feature of a particular word characterizes the semantic structure of the partial sentence comprising the particular word and the words before it;
    combining the long-distance dependency feature and the semantic structure feature of each word to obtain the word feature of each word in the sentence; and
    calculating the probability of each word in the sentence based on the word features.
  3. The text classification method according to claim 2, wherein, when the training text is continuous data such as a speech transcript, song lyrics, or an essay, calculating the long-distance dependency features of each word in the sentence in sequence through the LSTM model comprises:
    cyclically calculating, through the LSTM model, the long-distance dependency information of each word in the sentence in sequence, so as to capture the long-distance dependency features from the continuous data.
  4. The text classification method according to claim 3, wherein, before the training text is classified through the classifier, the method further comprises:
    inputting a plurality of sentences into the neural network model, and performing word vectorization on each sentence to obtain a plurality of word vectors;
    inputting each word vector into the LSTM model or a GRU model to extract long-distance dependency features;
    inputting the long-distance dependency features into the CNN model, and extracting position-invariant local features to finally obtain a plurality of feature vectors, each of which has both long-distance dependency features and position-invariant local features;
    inputting the plurality of feature vectors into a pooling layer to perform dimensionality reduction on the feature vectors; and
    inputting the feature vectors obtained by the dimensionality reduction into the classifier.
  5. The text classification method according to claim 4, wherein, before the feature vectors obtained by the dimensionality reduction are input into the classifier, the method further comprises:
    presetting a threshold for the classifier,
    wherein an output of the classifier greater than the threshold indicates a WebShell, and an output of the classifier less than the preset threshold indicates that the sample is not a WebShell;
    wherein classifying the training text through the classifier to obtain the classified text comprises:
    setting the number N of decision trees in the classifier, and performing Bootstrap sampling to obtain N data sets;
    learning the parameters θn of each of the N decision trees; and
    training each decision tree in parallel, wherein, after the training of a single decision tree is completed, the votes over the training results of the trained decision trees are counted to determine the final output of the CNN-RF model, one representation of the final output of the CNN-RF model being:

    c^{*} = \arg\max_{c} \sum_{i=1}^{N} I(T_i(x) = c)

    wherein T_i(x) is the classification result of tree i on sample x, I(·) is the indicator function, c* is the final category of the sample, and N is the number of decision trees in the classifier.
  6. The text classification method according to any one of claims 1 to 5, wherein the training text is a WebShell, the WebShell being a command execution environment existing in the form of a web page file such as asp, php, jsp, or cgi, and wherein obtaining the training text comprises one of the following implementations:
    using a search engine to find common vulnerabilities disclosed on the Internet, and obtaining a WebShell if a target site has not been patched;
    performing a code audit on an open-source CMS through a code audit strategy, and mining code vulnerabilities from the CMS to obtain a WebShell;
    obtaining a WebShell through an upload vulnerability;
    obtaining a WebShell through a SQL injection attack; or
    obtaining a WebShell through a database backup.
  7. A text classification apparatus, comprising:
    an input/output module configured to obtain training text, wherein the training text comprises a plurality of sentences and each sentence comprises a plurality of words;
    a processing module configured to input the training text into an encoding layer of a neural network model, perform word vectorization on the training text in the encoding layer to obtain feature vectors corresponding to the training text, and input the feature vectors into an RNN model to model the sentences; and
    an acquisition module configured to capture long-distance dependency features of each sentence in the training text, wherein the long-distance dependency features refer to context vectors of the text, and the context vectors exhibit long-term dependence in the time domain;
    wherein the input/output module is further configured to input the feature vectors in which the acquisition module has captured the long-distance dependency information into a convolutional neural network (CNN) model of the neural network model; and
    wherein the processing module is further configured to extract local features from the feature vectors in the CNN model to obtain target feature vectors, a local feature referring to a local correlation within the feature vector, and to input, through the input/output module, the target feature vectors into a classifier and classify the training text through the classifier to obtain classified text.
  8. The text classification apparatus according to claim 7, wherein the acquisition module is specifically configured to:
    calculate, through an LSTM model, the long-distance dependency features of each word in a sentence in sequence, wherein the long-distance dependency feature of a particular word characterizes the dependency between the particular word and other distant words in the sentence;
    calculate semantic structure features of each word in sequence, wherein the semantic structure feature of a particular word characterizes the semantic structure of the partial sentence comprising the particular word and the words before it;
    combine the long-distance dependency feature and the semantic structure feature of each word to obtain the word feature of each word in the sentence; and
    calculate the probability of each word in the sentence based on the word features.
  9. The text classification apparatus according to claim 8, wherein, when the training text is continuous data such as a speech transcript, song lyrics, or an essay, the processing module is specifically configured to:
    cyclically calculate, through the LSTM model, the long-distance dependency information of each word in the sentence in sequence, so as to capture the long-distance dependency features from the continuous data.
  10. The text classification apparatus according to claim 9, wherein, before the training text is classified through the classifier, the processing module is further configured to:
    input, through the input/output module, a plurality of sentences into the neural network model, and perform word vectorization on each sentence to obtain a plurality of word vectors;
    input, through the input/output module, each word vector into the LSTM model or a GRU model to extract long-distance dependency features;
    input, through the input/output module, the long-distance dependency features into the CNN model, and extract position-invariant local features to finally obtain a plurality of feature vectors, each of which has both long-distance dependency features and position-invariant local features;
    input, through the input/output module, the plurality of feature vectors into a pooling layer to perform dimensionality reduction on the feature vectors; and
    input, through the input/output module, the feature vectors obtained by the dimensionality reduction into the classifier.
  11. The text classification apparatus according to claim 10, wherein, before the feature vectors obtained by the dimensionality reduction are input into the classifier, the processing module is further configured to:
    preset a threshold for the classifier,
    wherein an output of the classifier greater than the threshold indicates a WebShell, and an output of the classifier less than the preset threshold indicates that the sample is not a WebShell;
    wherein classifying the training text through the classifier to obtain the classified text comprises:
    setting the number N of decision trees in the classifier, and performing Bootstrap sampling to obtain N data sets;
    learning the parameters θn of each of the N decision trees; and
    training each decision tree in parallel, wherein, after the training of a single decision tree is completed, the votes over the training results of the trained decision trees are counted to determine the final output of the CNN-RF model, one representation of the final output of the CNN-RF model being:

    c^{*} = \arg\max_{c} \sum_{i=1}^{N} I(T_i(x) = c)

    wherein T_i(x) is the classification result of tree i on sample x, I(·) is the indicator function, c* is the final category of the sample, and N is the number of decision trees in the classifier.
  12. The text classification apparatus according to any one of claims 7 to 11, wherein the training text is a WebShell, the WebShell being a command execution environment existing in the form of a web page file such as asp, php, jsp, or cgi, and the input/output module performs one of the following operations to obtain a WebShell:
    using a search engine to find common vulnerabilities disclosed on the Internet, and obtaining a WebShell if a target site has not been patched;
    performing a code audit on an open-source CMS through a code audit strategy, and mining code vulnerabilities from the CMS to obtain a WebShell;
    obtaining a WebShell through an upload vulnerability;
    obtaining a WebShell through a SQL injection attack; or
    obtaining a WebShell through a database backup.
  13. A computer device, comprising:
    at least one processor, a memory, and an input/output unit;
    wherein the memory is configured to store program code, and the processor is configured to invoke the program code stored in the memory to execute the following steps:
    obtaining training text, wherein the training text comprises a plurality of sentences and each sentence comprises a plurality of words;
    inputting the training text into an encoding layer of a neural network model, and performing word vectorization on the training text in the encoding layer to obtain feature vectors corresponding to the training text;
    inputting the feature vectors into an RNN model, modeling the sentences, and capturing long-distance dependency features of each sentence in the training text, wherein the long-distance dependency features refer to context vectors of the text, and the context vectors exhibit long-term dependence in the time domain;
    inputting the feature vectors that have captured the long-distance dependency information into a convolutional neural network (CNN) model of the neural network model;
    extracting local features from the feature vectors in the CNN model to obtain target feature vectors, wherein a local feature refers to a local correlation within the feature vector; and
    inputting the target feature vectors into a classifier, and classifying the training text through the classifier to obtain classified text.
  14. The computer device according to claim 13, wherein, when the processor executes the computer program to capture the long-distance dependency features of each sentence in the training text, the following steps are included:
    calculating, through an LSTM model, the long-distance dependency features of each word in a sentence in sequence, wherein the long-distance dependency feature of a particular word characterizes the dependency between the particular word and other distant words in the sentence;
    the method further comprising:
    calculating semantic structure features of each word in sequence, wherein the semantic structure feature of a particular word characterizes the semantic structure of the partial sentence comprising the particular word and the words before it;
    combining the long-distance dependency feature and the semantic structure feature of each word to obtain the word feature of each word in the sentence; and
    calculating the probability of each word in the sentence based on the word features.
  15. The computer device according to claim 14, wherein, when the training text is continuous data such as a speech transcript, song lyrics, or an essay, calculating the long-distance dependency features of each word in the sentence in sequence through the LSTM model comprises:
    cyclically calculating, through the LSTM model, the long-distance dependency information of each word in the sentence in sequence, so as to capture the long-distance dependency features from the continuous data.
  16. The computer device according to claim 15, wherein, before the processor executes the computer program to classify the training text through the classifier, the following steps are further included:
    inputting a plurality of sentences into the neural network model, and performing word vectorization on each sentence to obtain a plurality of word vectors;
    inputting each word vector into the LSTM model or a GRU model to extract long-distance dependency features;
    inputting the long-distance dependency features into the CNN model, and extracting position-invariant local features to finally obtain a plurality of feature vectors, each of which has both long-distance dependency features and position-invariant local features;
    inputting the plurality of feature vectors into a pooling layer to perform dimensionality reduction on the feature vectors; and
    inputting the feature vectors obtained by the dimensionality reduction into the classifier.
  17. The computer device according to claim 16, wherein, before the processor executes the computer program to input the feature vectors obtained by the dimensionality reduction into the classifier, the following steps are further included:
    presetting a threshold for the classifier,
    wherein an output of the classifier greater than the threshold indicates a WebShell, and an output of the classifier less than the preset threshold indicates that the sample is not a WebShell;
    wherein classifying the training text through the classifier to obtain the classified text comprises:
    setting the number N of decision trees in the classifier, and performing Bootstrap sampling to obtain N data sets;
    learning the parameters θn of each of the N decision trees; and
    training each decision tree in parallel, wherein, after the training of a single decision tree is completed, the votes over the training results of the trained decision trees are counted to determine the final output of the CNN-RF model, one representation of the final output of the CNN-RF model being:

    c^{*} = \arg\max_{c} \sum_{i=1}^{N} I(T_i(x) = c)

    wherein T_i(x) is the classification result of tree i on sample x, I(·) is the indicator function, c* is the final category of the sample, and N is the number of decision trees in the classifier.
  18. The computer device according to any one of claims 13 to 17, wherein the training text is a WebShell, the WebShell being a command execution environment existing in the form of a web page file such as asp, php, jsp, or cgi, and the processor performs one of the following operations to obtain a WebShell:
    using a search engine to find common vulnerabilities disclosed on the Internet, and obtaining a WebShell if a target site has not been patched;
    performing a code audit on an open-source CMS through a code audit strategy, and mining code vulnerabilities from the CMS to obtain a WebShell;
    obtaining a WebShell through an upload vulnerability;
    obtaining a WebShell through a SQL injection attack; or
    obtaining a WebShell through a database backup.
  19. A computer storage medium, wherein the computer-readable storage medium stores instructions which, when run on a computer, cause the computer to execute the following steps:
    obtaining training text, wherein the training text comprises a plurality of sentences and each sentence comprises a plurality of words;
    inputting the training text into an encoding layer of a neural network model, and performing word vectorization on the training text in the encoding layer to obtain feature vectors corresponding to the training text;
    inputting the feature vectors into an RNN model, modeling the sentences, and capturing long-distance dependency features of each sentence in the training text, wherein the long-distance dependency features refer to context vectors of the text, and the context vectors exhibit long-term dependence in the time domain;
    inputting the feature vectors that have captured the long-distance dependency information into a convolutional neural network (CNN) model of the neural network model;
    extracting local features from the feature vectors in the CNN model to obtain target feature vectors, wherein a local feature refers to a local correlation within the feature vector; and
    inputting the target feature vectors into a classifier, and classifying the training text through the classifier to obtain classified text.
  20. The computer-readable storage medium according to claim 19, wherein, when the computer-readable storage medium is executed by a processor, the following steps are further implemented:
    calculating, through an LSTM model, the long-distance dependency features of each word in a sentence in sequence, wherein the long-distance dependency feature of a particular word characterizes the dependency between the particular word and other distant words in the sentence;
    the method further comprising:
    calculating semantic structure features of each word in sequence, wherein the semantic structure feature of a particular word characterizes the semantic structure of the partial sentence comprising the particular word and the words before it; and
    combining the long-distance dependency feature and the semantic structure feature of each word to obtain the word feature of each word in the sentence, and calculating the probability of each word in the sentence based on the word features.
PCT/CN2019/102464 2019-06-04 2019-08-26 Text classification method, apparatus, device, and storage medium WO2020244066A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910479226.7A CN110309304A (en) 2019-06-04 2019-06-04 A kind of file classification method, device, equipment and storage medium
CN201910479226.7 2019-06-04

Publications (1)

Publication Number Publication Date
WO2020244066A1 true WO2020244066A1 (en) 2020-12-10

Family

ID=68075283

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/102464 WO2020244066A1 (en) 2019-06-04 2019-08-26 Text classification method, apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN110309304A (en)
WO (1) WO2020244066A1 (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699944A (en) * 2020-12-31 2021-04-23 中国银联股份有限公司 Order-returning processing model training method, processing method, device, equipment and medium
CN112784601A (en) * 2021-02-03 2021-05-11 中山大学孙逸仙纪念医院 Key information extraction method and device, electronic equipment and storage medium
CN112950313A (en) * 2021-02-25 2021-06-11 北京嘀嘀无限科技发展有限公司 Order processing method and device, electronic equipment and readable storage medium
CN113190154A (en) * 2021-04-29 2021-07-30 北京百度网讯科技有限公司 Model training method, entry classification method, device, apparatus, storage medium, and program
CN113221537A (en) * 2021-04-12 2021-08-06 湘潭大学 Aspect-level emotion analysis method based on truncated cyclic neural network and proximity weighted convolution
CN113239192A (en) * 2021-04-29 2021-08-10 湘潭大学 Text structuring technology based on sliding window and random discrete sampling
CN113468872A (en) * 2021-06-09 2021-10-01 大连理工大学 Biomedical relation extraction method and system based on sentence level graph convolution
CN113486347A (en) * 2021-06-30 2021-10-08 福州大学 Deep learning hardware Trojan horse detection method based on semantic understanding
CN114021651A (en) * 2021-11-04 2022-02-08 桂林电子科技大学 Block chain violation information perception method based on deep learning
CN114169443A (en) * 2021-12-08 2022-03-11 西安交通大学 Word-level text countermeasure sample detection method
CN114499944A (en) * 2021-12-22 2022-05-13 天翼云科技有限公司 Method, device and equipment for detecting WebShell
CN114510576A (en) * 2021-12-21 2022-05-17 一拓通信集团股份有限公司 Entity relationship extraction method based on BERT and BiGRU fusion attention mechanism
CN115249017A (en) * 2021-06-23 2022-10-28 马上消费金融股份有限公司 Text labeling method, intention recognition model training method and related equipment
CN116227495A (en) * 2023-05-05 2023-06-06 公安部信息通信中心 Entity classification data processing system
CN116453385A (en) * 2023-03-16 2023-07-18 中山市加乐美科技发展有限公司 Space-time disk learning machine
CN116958752A (en) * 2023-09-20 2023-10-27 国网湖北省电力有限公司经济技术研究院 Power grid infrastructure archiving method, device and equipment based on IPKCNN-SVM
CN117093996A (en) * 2023-10-18 2023-11-21 湖南惟储信息技术有限公司 Safety protection method and system for embedded operating system
CN117201733A (en) * 2023-08-22 2023-12-08 杭州中汇通航航空科技有限公司 Real-time unmanned aerial vehicle monitoring and sharing system
CN117668562A (en) * 2024-01-31 2024-03-08 腾讯科技(深圳)有限公司 Training and using method, device, equipment and medium of text classification model
CN117623735B (en) * 2023-12-01 2024-05-14 广东雅诚德实业有限公司 Production method of high-strength anti-pollution domestic ceramic

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113032534A (en) * 2019-12-24 2021-06-25 中国移动通信集团四川有限公司 Dialog text classification method and electronic equipment
CN111177392A (en) * 2019-12-31 2020-05-19 腾讯云计算(北京)有限责任公司 Data processing method and device
CN111538840B (en) * 2020-06-23 2023-04-28 基建通(三亚)国际科技有限公司 Text classification method and device
CN111930938A (en) * 2020-07-06 2020-11-13 武汉卓尔数字传媒科技有限公司 Text classification method and device, electronic equipment and storage medium
CN111865960A (en) * 2020-07-15 2020-10-30 北京市燃气集团有限责任公司 Network intrusion scene analysis processing method, system, terminal and storage medium
CN114050908B (en) * 2020-07-24 2023-07-21 中国移动通信集团浙江有限公司 Method, device, computing equipment and computer storage medium for automatically auditing firewall policy
CN112118225B (en) * 2020-08-13 2021-09-03 紫光云(南京)数字技术有限公司 Webshell detection method and device based on RNN
CN112148943A (en) * 2020-09-27 2020-12-29 北京天融信网络安全技术有限公司 Webpage classification method and device, electronic equipment and readable storage medium
CN112491891B (en) * 2020-11-27 2022-05-17 杭州电子科技大学 Network attack detection method based on hybrid deep learning in Internet of things environment
CN112686315A (en) * 2020-12-31 2021-04-20 山西三友和智慧信息技术股份有限公司 Deep learning-based unnatural earthquake classification method
CN112699964A (en) * 2021-01-13 2021-04-23 成都链安科技有限公司 Model construction method, system, device, medium and transaction identity identification method
CN113010740B (en) * 2021-03-09 2023-05-30 腾讯科技(深圳)有限公司 Word weight generation method, device, equipment and medium
CN115359867B (en) * 2022-09-06 2024-02-02 中国电信股份有限公司 Electronic medical record classification method, device, electronic equipment and storage medium
CN116226702B (en) * 2022-09-09 2024-04-26 武汉中数医疗科技有限公司 Thyroid sampling data identification method based on bioelectrical impedance

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030078899A1 (en) * 2001-08-13 2003-04-24 Xerox Corporation Fuzzy text categorizer
CN102141978A (en) * 2010-02-02 2011-08-03 阿里巴巴集团控股有限公司 Method and system for classifying texts
CN104572892A (en) * 2014-12-24 2015-04-29 中国科学院自动化研究所 Text classification method based on cyclic convolution network
CN107103754A (en) * 2017-05-10 2017-08-29 华南师范大学 A kind of road traffic condition Forecasting Methodology and system
CN108829818A (en) * 2018-06-12 2018-11-16 中国科学院计算技术研究所 A kind of file classification method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11250311B2 (en) * 2017-03-15 2022-02-15 Salesforce.Com, Inc. Deep neural network-based decision network
CN107066553B (en) * 2017-03-24 2021-01-01 北京工业大学 Short text classification method based on convolutional neural network and random forest
CN107832400B (en) * 2017-11-01 2019-04-16 山东大学 A kind of method that location-based LSTM and CNN conjunctive model carries out relationship classification

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699944A (en) * 2020-12-31 2021-04-23 中国银联股份有限公司 Order-returning processing model training method, processing method, device, equipment and medium
CN112699944B (en) * 2020-12-31 2024-04-23 中国银联股份有限公司 Training method, processing method, device, equipment and medium for returning list processing model
CN112784601A (en) * 2021-02-03 2021-05-11 中山大学孙逸仙纪念医院 Key information extraction method and device, electronic equipment and storage medium
CN112784601B (en) * 2021-02-03 2023-06-27 中山大学孙逸仙纪念医院 Key information extraction method, device, electronic equipment and storage medium
CN112950313A (en) * 2021-02-25 2021-06-11 北京嘀嘀无限科技发展有限公司 Order processing method and device, electronic equipment and readable storage medium
CN113221537A (en) * 2021-04-12 2021-08-06 湘潭大学 Aspect-level sentiment analysis method based on truncated recurrent neural network and proximity-weighted convolution
CN113239192A (en) * 2021-04-29 2021-08-10 湘潭大学 Text structuring technology based on sliding window and random discrete sampling
CN113190154B (en) * 2021-04-29 2023-10-13 北京百度网讯科技有限公司 Model training and entry classification methods, apparatuses, devices, storage medium and program
CN113239192B (en) * 2021-04-29 2024-04-16 湘潭大学 Text structuring technology based on sliding window and random discrete sampling
CN113190154A (en) * 2021-04-29 2021-07-30 北京百度网讯科技有限公司 Model training method, entry classification method, device, apparatus, storage medium, and program
CN113468872A (en) * 2021-06-09 2021-10-01 大连理工大学 Biomedical relation extraction method and system based on sentence level graph convolution
CN113468872B (en) * 2021-06-09 2024-04-16 大连理工大学 Biomedical relation extraction method and system based on sentence level graph convolution
CN115249017A (en) * 2021-06-23 2022-10-28 马上消费金融股份有限公司 Text labeling method, intention recognition model training method and related equipment
CN115249017B (en) * 2021-06-23 2023-12-19 马上消费金融股份有限公司 Text labeling method, training method of intention recognition model and related equipment
CN113486347A (en) * 2021-06-30 2021-10-08 福州大学 Deep learning hardware Trojan horse detection method based on semantic understanding
CN113486347B (en) * 2021-06-30 2023-07-14 福州大学 Deep learning hardware Trojan horse detection method based on semantic understanding
CN114021651A (en) * 2021-11-04 2022-02-08 桂林电子科技大学 Block chain violation information perception method based on deep learning
CN114021651B (en) * 2021-11-04 2024-03-29 桂林电子科技大学 Block chain illegal information sensing method based on deep learning
CN114169443B (en) * 2021-12-08 2024-02-06 西安交通大学 Word-level text adversarial example detection method
CN114169443A (en) * 2021-12-08 2022-03-11 西安交通大学 Word-level text adversarial example detection method
CN114510576A (en) * 2021-12-21 2022-05-17 一拓通信集团股份有限公司 Entity relationship extraction method based on BERT and BiGRU fusion attention mechanism
CN114499944B (en) * 2021-12-22 2023-08-08 天翼云科技有限公司 Method, device and equipment for detecting WebShell
CN114499944A (en) * 2021-12-22 2022-05-13 天翼云科技有限公司 Method, device and equipment for detecting WebShell
CN116453385B (en) * 2023-03-16 2023-11-24 中山市加乐美科技发展有限公司 Space-time disk learning machine
CN116453385A (en) * 2023-03-16 2023-07-18 中山市加乐美科技发展有限公司 Space-time disk learning machine
CN116227495A (en) * 2023-05-05 2023-06-06 公安部信息通信中心 Entity classification data processing system
CN117201733A (en) * 2023-08-22 2023-12-08 杭州中汇通航航空科技有限公司 Real-time unmanned aerial vehicle monitoring and sharing system
CN117201733B (en) * 2023-08-22 2024-03-12 杭州中汇通航航空科技有限公司 Real-time unmanned aerial vehicle monitoring and sharing system
CN116958752A (en) * 2023-09-20 2023-10-27 国网湖北省电力有限公司经济技术研究院 Power grid infrastructure archiving method, device and equipment based on IPKCNN-SVM
CN116958752B (en) * 2023-09-20 2023-12-15 国网湖北省电力有限公司经济技术研究院 Power grid infrastructure archiving method, device and equipment based on IPKCNN-SVM
CN117093996A (en) * 2023-10-18 2023-11-21 湖南惟储信息技术有限公司 Safety protection method and system for embedded operating system
CN117093996B (en) * 2023-10-18 2024-02-06 湖南惟储信息技术有限公司 Safety protection method and system for embedded operating system
CN117623735B (en) * 2023-12-01 2024-05-14 广东雅诚德实业有限公司 Production method of high-strength anti-pollution domestic ceramic
CN117668562A (en) * 2024-01-31 2024-03-08 腾讯科技(深圳)有限公司 Training and using method, device, equipment and medium of text classification model
CN117668562B (en) * 2024-01-31 2024-04-19 腾讯科技(深圳)有限公司 Training and using method, device, equipment and medium of text classification model

Also Published As

Publication number Publication date
CN110309304A (en) 2019-10-08

Similar Documents

Publication Publication Date Title
WO2020244066A1 (en) Text classification method, apparatus, device, and storage medium
WO2019136993A1 (en) Text similarity calculation method and device, computer apparatus, and storage medium
WO2021072885A1 (en) Method and apparatus for recognizing text, device and storage medium
CN107612893B (en) Short message auditing system and method and short message auditing model building method
WO2017084586A1 (en) Method, system, and device for inferring malicious code rule based on deep learning method
CN110210617B (en) Adversarial example generation method and generation device based on feature enhancement
WO2020253350A1 (en) Network content publication auditing method and apparatus, computer device and storage medium
CN109670163B (en) Information identification method, information recommendation method, template construction method and computing device
CN110162593A (en) Search result processing and similarity model training method and device
US20200125836A1 (en) Training Method for Descreening System, Descreening Method, Device, Apparatus and Medium
CN111460153A (en) Hot topic extraction method and device, terminal device and storage medium
CN111814770A (en) Content keyword extraction method of news video, terminal device and medium
EP4258610A1 (en) Malicious traffic identification method and related apparatus
US20170193098A1 (en) System and method for topic modeling using unstructured manufacturing data
CN111859968A (en) Text structuring method, text structuring device and terminal equipment
CN111177375B (en) Electronic document classification method and device
CN110929145A (en) Public opinion analysis method, public opinion analysis device, computer device and storage medium
CN109446299B (en) Method and system for searching e-mail content based on event recognition
CN112507167A (en) Method and device for identifying video collection, electronic equipment and storage medium
CN110765286A (en) Cross-media retrieval method and device, computer equipment and storage medium
CN114416998A (en) Text label identification method and device, electronic equipment and storage medium
TWI749349B (en) Text restoration method, device, electronic equipment and computer readable storage medium
WO2023273303A1 (en) Tree model-based method and apparatus for acquiring degree of influence of event, and computer device
CN111444362A (en) Malicious picture intercepting method, device, equipment and storage medium
CN115314268B (en) Malicious encryption traffic detection method and system based on traffic fingerprint and behavior

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19932046; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in the European phase (Ref document number: 19932046; Country of ref document: EP; Kind code of ref document: A1)