WO2020244066A1 - Text classification method, apparatus, device and storage medium - Google Patents

Text classification method, apparatus, device and storage medium

Info

Publication number
WO2020244066A1
WO2020244066A1 (PCT/CN2019/102464)
Authority
WO
WIPO (PCT)
Prior art keywords
feature
word
long
text
model
Prior art date
Application number
PCT/CN2019/102464
Other languages
English (en)
Chinese (zh)
Inventor
李坤
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020244066A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Definitions

  • This application relates to the field of text classification, and in particular to a text classification method, apparatus, device, and storage medium.
  • Text classification is a key task in natural language processing, which can help users discover useful information in massive amounts of data. Text classification is mainly used in spam recognition, sentiment analysis, question answering systems, translation, and the like.
  • The purpose of a sentence model is to learn text features that represent sentences; it is a key model for text classification.
  • WebShell detection is also a kind of text classification task.
  • Current text classification methods are mostly based on statistics or machine learning.
  • The statistical approach segments sentences against a corpus by counting the probability that adjacent characters form a word: the more often adjacent characters co-occur, the higher the probability that they form a word. Words are segmented according to this probability value, so a complete corpus is very important.
  • The machine learning approach uses the TF-IDF algorithm to compute text features and then uses classifiers such as logistic regression, SVM, and random forest to classify the text.
  • This application provides a text classification method, apparatus, device, and storage medium, which can solve the problem of poor text classification accuracy in the prior art.
  • In a first aspect, the present application provides a text classification method, which includes: obtaining training text; inputting the training text into the coding layer of a neural network model, and performing word vectorization on the training text in the coding layer to obtain the feature vector corresponding to the training text; inputting the feature vector into the RNN model, modeling the sentences, and capturing the long-distance dependence feature of each sentence in the training text, where the long-distance dependence feature refers to the context vector of the text, which is long-term dependent in the time domain; inputting the feature vector carrying the captured long-distance dependence information into the convolutional neural network (CNN) model in the neural network model; extracting local features from the feature vector in the CNN model to obtain the target feature vector, where a local feature refers to a local correlation within the feature vector; and inputting the target feature vector into the classifier, which classifies the training text to obtain the classified text.
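For orientation only, here is a minimal PyTorch sketch of the pipeline this aspect describes. It is not the patented implementation: every layer size, the two-class output, and all names are illustrative assumptions.

```python
# Hypothetical sketch of the claimed pipeline: coding layer (word vectorization)
# -> RNN (long-distance dependence) -> CNN (local features) -> classifier.
# All dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

class RnnCnnTextClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=128,
                 num_filters=100, kernel_size=3, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)          # coding layer
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)   # long-distance dependence
        self.cnn = nn.Conv1d(hidden_dim, num_filters, kernel_size)    # local (n-gram-like) features
        self.classifier = nn.Linear(num_filters, num_classes)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)          # (batch, seq_len, embed_dim)
        x, _ = self.rnn(x)                     # (batch, seq_len, hidden_dim)
        x = torch.relu(self.cnn(x.transpose(1, 2)))  # (batch, num_filters, seq_len-k+1)
        x = x.max(dim=2).values                # max-pool over time: target feature vector
        return self.classifier(x)              # class scores

model = RnnCnnTextClassifier()
logits = model(torch.randint(0, 10000, (4, 32)))   # 4 sentences of 32 token ids
print(logits.shape)                                # torch.Size([4, 2])
```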
  • In a possible design, capturing the long-distance dependence feature of each sentence in the training text includes: sequentially calculating the long-distance dependence feature of each word in the sentence through the LSTM model, where the long-distance dependence feature of a specific word represents the dependence relationship between that word and other long-distance words in the sentence. The method further includes: sequentially calculating the semantic structure feature of each word, where the semantic structure feature of a specific word represents the semantic structure of the partial sentence containing that word and the words before it; combining the long-distance dependence feature and the semantic structure feature of each word to obtain the word feature of each word in the sentence; and calculating the probability of each word in the sentence based on each word feature.
  • In a possible design, sequentially calculating the long-distance dependence feature of each word in the sentence through the LSTM model includes: sequentially and cyclically calculating the long-distance dependence information of each word in the sentence through the LSTM model, so as to capture the long-distance dependence feature from continuous data.
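A minimal sketch of this sequential, cyclic calculation, assuming precomputed word vectors and arbitrary dimensions: an LSTM cell is stepped word by word, so the hidden state produced for each word carries dependence information accumulated from all earlier words.

```python
# Hypothetical illustration: stepping an LSTM cell word by word, so each hidden
# state encodes long-distance dependence on all preceding words in the sentence.
import torch
import torch.nn as nn

embed_dim, hidden_dim = 128, 128                  # assumed sizes
cell = nn.LSTMCell(embed_dim, hidden_dim)

sentence = torch.randn(10, embed_dim)             # 10 word vectors (assumed precomputed)
h = torch.zeros(1, hidden_dim)                    # hidden state
c = torch.zeros(1, hidden_dim)                    # cell state: the long-term memory

word_features = []
for word_vec in sentence:                         # sequential, cyclic calculation
    h, c = cell(word_vec.unsqueeze(0), (h, c))    # state is carried to the next word
    word_features.append(h)                       # per-word long-distance dependence feature

features = torch.cat(word_features)               # (10, hidden_dim)
print(features.shape)
```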
  • In a possible design, before the training text is classified by the classifier, the method further includes: inputting multiple sentences into the neural network model and performing word vectorization on each sentence to obtain multiple word vectors; inputting each word vector into the LSTM model or GRU model to extract long-distance dependence features; inputting the long-distance dependence features into the CNN model to extract position-invariant local features, finally obtaining multiple feature vectors, each of which has both long-distance dependence features and position-invariant local features; inputting the multiple feature vectors into the pooling layer to perform dimensionality reduction on these feature vectors; and inputting the feature vectors obtained by the dimensionality reduction into the classifier.
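Under assumed shapes, the CNN-plus-pooling stage described above might look like the following sketch: a 1-D convolution slides over the LSTM outputs to pick up position-invariant local correlations, and max-pooling performs the dimensionality reduction.

```python
# Hypothetical sketch of the CNN + pooling stage: convolution extracts
# position-invariant local features from LSTM outputs; pooling reduces dimension.
import torch
import torch.nn as nn
import torch.nn.functional as F

lstm_out = torch.randn(4, 32, 128)                # (batch, seq_len, hidden_dim), assumed
conv = nn.Conv1d(in_channels=128, out_channels=100, kernel_size=3)

local = torch.relu(conv(lstm_out.transpose(1, 2)))        # (4, 100, 30): local features
pooled = F.max_pool1d(local, kernel_size=local.size(2))   # (4, 100, 1): dimensionality reduction
feature_vec = pooled.squeeze(2)                           # (4, 100): one vector per sentence
print(feature_vec.shape)
```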
  • In a possible design, before the feature vectors obtained by the dimensionality reduction are input into the classifier, the method further includes: presetting a threshold for the classifier, where an output of the classifier greater than the threshold indicates a WebShell and an output less than the preset threshold indicates not a WebShell. Classifying the training text by the classifier to obtain the classified text includes: setting the number N of decision trees in the classifier and performing Bootstrap sampling to obtain N data sets; learning the parameter Θn of each of the N decision trees; and training the decision trees in parallel. After the training of each single decision tree is completed, the votes cast by the trained decision trees are counted to determine the final output of the CNN-RF model. One representation of the final output of the CNN-RF model, reconstructed here as the standard majority-vote rule implied by the definitions that follow, is c* = argmax_c Σ_{i=1}^{N} I(T_i(x) = c), where T_i(x) is the classification result of tree i on sample x, c* is the final category corresponding to the sample, and N is the number of decision trees in the classifier.
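The random forest stage can be illustrated with scikit-learn's RandomForestClassifier, whose bootstrap sampling and per-tree voting match the N-tree procedure described above; the toy feature matrix, labels, and N = 100 are assumptions, not values from the application.

```python
# Hypothetical CNN-RF classifier stage: N bootstrap-sampled decision trees vote
# on the CNN feature vectors; the final class is the majority of the N votes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.randn(200, 100)       # assumed CNN feature vectors (e.g., C_final)
y = np.random.randint(0, 2, 200)    # toy labels: 1 = WebShell, 0 = not WebShell

rf = RandomForestClassifier(n_estimators=100, bootstrap=True, n_jobs=-1)  # N trees in parallel
rf.fit(X, y)

votes = np.stack([tree.predict(X[:5]) for tree in rf.estimators_])  # T_i(x) for each tree i
c_star = (votes.mean(axis=0) > 0.5).astype(int)                     # majority vote c*
print(c_star, rf.predict(X[:5]))
```

Note that scikit-learn's predict averages per-tree probabilities rather than counting hard votes, so the explicit majority vote shown here is only an illustration of the formula.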
  • In a possible design, the training text is a Webshell, which is a command execution environment in the form of web files such as asp, php, jsp, or cgi.
  • In a possible design, obtaining the training text includes one of the following implementations: using a search engine to find common vulnerabilities disclosed on the Internet and, if the target site has not been patched, obtaining the WebShell; conducting a code audit on an open-source CMS through a code audit strategy and mining code vulnerabilities from the CMS to obtain the WebShell; exploiting an upload vulnerability to obtain the WebShell; using a SQL injection attack to obtain the WebShell; or using a database backup to obtain the WebShell.
  • The present application further provides a text classification device with functions corresponding to the text classification method provided in the first aspect.
  • The functions can be realized by hardware, or by hardware executing corresponding software.
  • The hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware.
  • Another aspect of the present application provides a computer device, which includes at least one connected processor, a memory, and a transceiver, where the memory is used to store program code and the processor is used to call the program code in the memory to perform the method described in the first aspect above.
  • Another aspect of the present application provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium; the computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute the method described in the first aspect.
  • Compared with the prior art, in the solution provided by this application, the RNN model's ability to process long-term information is used to capture long-distance dependence features, which can accurately determine strongly correlated context vectors and avoid losing a large amount of information during signal transmission; the CNN model's sensitivity to local features is then used to extract local features; finally, the output of the CNN model is input into the classifier for classification. Since the feature vectors input into the classifier have both long-distance dependence features and local features, the solution can effectively improve the classification of sentences of different lengths and improve the accuracy with which the neural network model recognizes text.
  • FIG. 1 is a schematic flowchart of a text classification method in an embodiment of this application;
  • FIG. 2a is a schematic flowchart of a text classification method in an embodiment of this application;
  • FIG. 2b is a schematic table comparing the accuracy rates on fudan, weibo, and MR in an embodiment of this application;
  • FIG. 2c is another schematic flowchart of a text classification method in an embodiment of this application;
  • FIG. 3 is a schematic structural diagram of a text classification device in an embodiment of this application;
  • FIG. 4 is a schematic structural diagram of a computer device in an embodiment of this application.
  • This application provides a text classification method, device, equipment, and storage medium, which can be used to classify texts such as news, papers, posts, and emails. This application does not limit the application scenarios of text classification.
  • To that end, this application mainly provides the following technical solutions:
  • The neural network model of the present application includes a CNN model and an RNN model; a schematic diagram of the structure of the neural network model is shown in FIG. 1.
  • The coding layer of the neural network model includes the RNN model and the CNN model.
  • The input of the neural network model is the input of the RNN model, the output of the RNN model is the input of the CNN model, and the output of the CNN model is the output of the neural network model.
  • A text classification method in an embodiment of the present application is introduced below. The method includes:
  • The training text includes multiple sentences, and each sentence includes multiple words.
  • The training text in this application is a Webshell.
  • A Webshell is a command execution environment in the form of web files such as asp, php, jsp, or cgi, and can also be called a web backdoor.
  • After hackers invade a website, they usually mix asp or php backdoor files with the normal web page files in the web directory of the website server, and then access the asp or php backdoor with a browser to obtain a command execution environment for the purpose of controlling the website server.
  • In some embodiments, the training text is a Webshell, that is, a command execution environment in the form of web files such as asp, php, jsp, or cgi.
  • A content management system (CMS) may be used to obtain the Webshell, and one of the following implementations may be used to obtain the training text:
  • A search engine can be used to find common vulnerabilities publicly disclosed on the Internet; if the target site has not been patched, the WebShell is obtained.
  • This application does not limit the method and source of obtaining training text.
  • The feature vector is the representation of the text in a vector space model.
  • Through word vectorization, the text data is changed from a high-dimensional, highly sparse form that is difficult for a neural network to process into continuous, dense data similar to images and speech.
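To make the sparse-to-dense point concrete, a small illustration under assumed sizes: an embedding layer maps sparse token indices to continuous dense vectors.

```python
# Hypothetical illustration of word vectorization: sparse token indices become
# continuous dense vectors that a neural network can process like images/speech.
import torch
import torch.nn as nn

vocab_size, embed_dim = 10000, 128        # assumed sizes
embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([12, 845, 3])    # a 3-word sentence as sparse indices
dense = embedding(token_ids)              # (3, 128): continuous dense feature vectors
print(dense.shape)
```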
  • The long-distance dependence feature refers to the context vector of the text; the context vector is long-term dependent in the time domain.
  • Capturing the long-distance dependence feature of each sentence in the training text includes:
  • The long-distance dependence feature of each word in the sentence is calculated sequentially by the LSTM model, where the long-distance dependence feature of a specific word represents the dependence relationship between that word and other long-distance words in the sentence.
  • The RNN model may adopt a Long Short-Term Memory (LSTM) model, through which a wide range of context information can be used in text processing to determine the probability of the next word.
  • Using the LSTM model to exploit a wide range of context information in text processing to determine the probability of the next word includes the following steps:
  • The training text may be continuous data, such as spoken language, lyrics, or essays.
  • A loop operation can be used to capture long-distance dependence information from such continuous data to ensure that the signal can continue to propagate.
  • In some embodiments, the training text is continuous data of any one of spoken language, lyrics, or essays.
  • Sequentially calculating the long-distance dependence feature of each word in the sentence through the LSTM model includes:
  • The long-distance dependence information of each word in the sentence is calculated sequentially and cyclically by the LSTM model, so as to capture the long-distance dependence feature from the continuous data.
  • The local feature refers to the local correlation within the feature vector, and can also be described as n-gram-like key information in the feature vector.
  • The CNN model may adopt the CNN-RF model.
  • FIG. 2b shows a table comparing the accuracy rates on three types of text (fudan, weibo, and MR) using the NB model, CART model, RF model, CNN model, and CNN-RF model.
  • The neural network model includes a classifier, and the input of the classifier is the output of the CNN model.
  • The classifier is trained on the feature vectors until it converges.
  • A threshold may be preset for the classifier: if the output of the classifier is greater than the threshold, the text is a WebShell; if the output is less than the threshold, it is not a WebShell.
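A minimal sketch of this thresholding rule, assuming the classifier emits a WebShell probability through softmax; the class index and the 0.5 threshold are assumed values.

```python
# Hypothetical threshold check on the classifier output (softmax probability).
import torch

logits = torch.tensor([2.1, -0.3])             # assumed classifier output for one sample
p_webshell = torch.softmax(logits, dim=0)[0]   # probability of the assumed WebShell class

THRESHOLD = 0.5                                # preset threshold (assumed value)
label = "WebShell" if p_webshell > THRESHOLD else "not WebShell"
print(float(p_webshell), label)
```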
  • In this solution, the RNN model's ability to process long-term information is used to capture long-distance dependence features, which can accurately determine strongly correlated context vectors and avoid losing a large amount of the signal during transmission; the CNN model's sensitivity to local features is then used to extract local features; finally, the output of the CNN model is input into the classifier for classification. Since the feature vectors input into the classifier have both long-distance dependence features and local features, the classification of sentences of different lengths can be effectively improved, as can the accuracy with which the neural network model recognizes text. In addition, combining the feature extraction ability of the CNN model with the generalization ability of the random forest, the benefits can be analyzed from the following aspects:
  • Dual word vectors describe the meaning of a word from two perspectives, enriching short-text information and expanding the feature information compared with a single word vector.
  • The true hypothesis of some learning tasks may not lie in the hypothesis space of the current decision tree algorithm, and if a single classification method is used, that hypothesis space will never be searched. A random forest using Bootstrap sampling can reduce the machine learning model's dependence on the data and reduce the variance of the model, so that the model has better generalization ability.
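A tiny illustration of the Bootstrap sampling mentioned above, on toy data with assumed sizes: each of the N data sets is drawn from the training set with replacement, which is what decorrelates the trees and reduces variance.

```python
# Hypothetical Bootstrap sampling: N data sets drawn with replacement from the
# training set, which decorrelates the trees and reduces model variance.
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(10)        # toy training set of 10 samples
N = 3                    # assumed number of decision trees / data sets

datasets = [X[rng.integers(0, len(X), size=len(X))] for _ in range(N)]
for i, d in enumerate(datasets):
    print(f"bootstrap set {i}: {d}")   # duplicates and omissions are expected
```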
  • In some embodiments, before the training text is classified by the classifier, the method further includes:
  • The feature vectors obtained by the dimensionality reduction are input into the classifier.
  • In some embodiments, before the feature vectors obtained by the dimensionality reduction are input into the classifier, the method further includes:
  • Presetting a threshold for the classifier: if the output of the classifier is greater than the threshold, the text is a WebShell; if the output is less than the preset threshold, it is not a WebShell.
  • Classifying the training text by the classifier to obtain the classified text includes: setting the number N of decision trees in the classifier and performing Bootstrap sampling to obtain N data sets; learning the parameter Θn of each of the N decision trees; and training the decision trees in parallel.
  • After the training of each single decision tree is completed, the votes cast by the trained decision trees are counted to determine the final output of the CNN-RF model. One representation of the final output of the CNN-RF model, reconstructed here as the standard majority-vote rule implied by the definitions below, is:
  • c* = argmax_c Σ_{i=1}^{N} I(T_i(x) = c)
  • T_i(x) is the classification result of tree i on sample x, i.e., its vote;
  • c* is the final category corresponding to the sample;
  • N is the number of decision trees in the classifier.
  • The classifier may adopt a random forest model or a Softmax model.
  • When the random forest model is adopted, the fully connected layer feature C_final may be sent to the random forest model for training.
  • The feature vectors obtained by the dimensionality reduction are input into a classifier (such as Softmax), for which a threshold is set in advance.
  • If the output of Softmax is greater than the threshold, the text is a WebShell; if the output of Softmax is less than the threshold, it is not a WebShell.
  • The text classification method of the present application is described above; the device that executes the text classification method is described below.
  • The text classification device 30, whose schematic structural diagram is shown in FIG. 3, can be applied to classify texts such as news, papers, posts, and emails.
  • The text classification device 30 in the embodiment of the present application can implement the steps of the text classification method executed in the embodiment corresponding to FIG. 1.
  • The functions implemented by the text classification device 30 can be realized by hardware, or by hardware executing corresponding software.
  • The hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware.
  • The text classification device 30 may include an input-output module 301, a processing module 302, and an acquisition module 303.
  • For the functional realization of the input-output module 301, the processing module 302, and the acquisition module 303, reference may be made to the operations performed in the embodiment corresponding to FIG. 1, which will not be repeated here.
  • The processing module 302 can be used to control the input and output operations of the input-output module 301 and the acquisition operations of the acquisition module 303.
  • The input-output module 301 may be used to obtain training text.
  • The processing module 302 may be configured to input the training text obtained by the input-output module 301 into the coding layer of the neural network model, perform word vectorization on the training text in the coding layer to obtain the feature vector corresponding to the training text, and input the feature vector into the RNN model to model the sentences.
  • The acquisition module 303 can be used to capture the long-distance dependence feature of each sentence in the training text, where the long-distance dependence feature refers to the context vector of the text, and the context vector is long-term dependent in the time domain.
  • The input-output module 301 is further configured to input the feature vector carrying the long-distance dependence information captured by the acquisition module into the convolutional neural network (CNN) model in the neural network model.
  • The processing module 302 is also used to extract local features from the feature vector in the CNN model to obtain a target feature vector, where a local feature refers to a local correlation within the feature vector; the target feature vector is input into the classifier through the input-output module, and the classifier classifies the training text to obtain the classified text.
  • In this embodiment, the RNN model's ability to process long-term information is used to capture long-distance dependence features, which can accurately determine strongly correlated context vectors and avoid losing a large amount of information during signal transmission; the CNN model's sensitivity to local features is then used to extract local features; finally, the output of the CNN model is input into the classifier for classification. Since the feature vectors input into the classifier have both long-distance dependence features and local features, the classification of sentences of different lengths can be effectively improved, as can the accuracy with which the neural network model recognizes text.
  • In some embodiments, the acquisition module 303 is specifically configured to:
  • sequentially calculate the long-distance dependence feature of each word in the sentence through the LSTM model, where the long-distance dependence feature of a specific word represents the dependence relationship between that word and other long-distance words in the sentence;
  • sequentially calculate the semantic structure feature of each word, where the semantic structure feature of a specific word characterizes the semantic structure of the partial sentence containing that word and the words before it; combine the long-distance dependence feature and the semantic structure feature of each word to obtain the word feature of each word in the sentence; and calculate the probability of each word in the sentence based on each word feature.
  • In some embodiments, the processing module 302 is specifically configured to:
  • sequentially and cyclically calculate the long-distance dependence information of each word in the sentence through the LSTM model, so as to capture the long-distance dependence feature from the continuous data.
  • In some embodiments, the processing module 302 is further configured to: input multiple sentences into the neural network model through the input-output module 301 and perform word vectorization on each sentence to obtain multiple word vectors; input each word vector into the LSTM model or GRU model through the input-output module 301 to extract long-distance dependence features; input the long-distance dependence features into the CNN model through the input-output module 301 to extract position-invariant local features, finally obtaining multiple feature vectors, each of which has both long-distance dependence features and position-invariant local features; input the multiple feature vectors into the pooling layer through the input-output module 301 to perform dimensionality reduction on these feature vectors; and input the feature vectors obtained by the dimensionality reduction into the classifier through the input-output module 301.
  • In some embodiments, before inputting the feature vectors obtained by the dimensionality reduction into the classifier, the processing module 302 is further configured to: preset a threshold for the classifier, where an output of the classifier greater than the threshold indicates a WebShell and an output less than the preset threshold indicates not a WebShell; set the number N of decision trees in the classifier and perform Bootstrap sampling to obtain N data sets; learn the parameter Θn of each of the N decision trees; and train the decision trees in parallel. After the training of each single decision tree is completed, the votes cast by the trained decision trees are counted to determine the final output of the CNN-RF model. One representation of the final output of the CNN-RF model, again the majority-vote rule reconstructed from the definitions below, is:
  • c* = argmax_c Σ_{i=1}^{N} I(T_i(x) = c)
  • T_i(x) is the classification result of tree i on sample x, i.e., its vote;
  • c* is the final category corresponding to the sample;
  • N is the number of decision trees in the classifier.
  • In some embodiments, the training text is a Webshell, which is a command execution environment in the form of web files such as asp, php, jsp, or cgi.
  • The input-output module 301 performs one of the following operations to obtain the WebShell: using a search engine to find common vulnerabilities disclosed on the Internet and, if the target site has not been patched, obtaining the WebShell; conducting a code audit on an open-source CMS through a code audit strategy and mining code vulnerabilities from the CMS to obtain the WebShell; exploiting an upload vulnerability to obtain the WebShell; using a SQL injection attack to obtain the WebShell; or using a database backup to obtain the WebShell.
  • The physical device corresponding to the input-output module 301 shown in FIG. 3 is the input-output unit shown in FIG. 4, which can realize part or all of the functions of the input-output module 301, or realize the same or similar functions as the input-output module 301.
  • The physical device corresponding to the acquisition module 303 shown in FIG. 3 is the acquisition device shown in FIG. 4.
  • The physical device corresponding to the processing module 302 shown in FIG. 3 is the processor shown in FIG. 4, which can implement part or all of the functions of the processing module 302, or implement the same or similar functions as the processing module 302.
  • The text classification device 30 in the embodiment of the present application is described above from the perspective of modular functional entities.
  • The following describes a computer device from the perspective of hardware. As shown in FIG. 4, the computer device includes: a processor, a memory, an input-output unit (which may also be a transceiver, not labeled in FIG. 4), and a computer program stored in the memory and runnable on the processor.
  • The computer program may be a program corresponding to the text classification method in the embodiment corresponding to FIG. 1.
  • When the processor executes the computer program, the text classification method executed by the text classification device 30 in the embodiment corresponding to FIG. 3 is implemented, and the function of each module in the text classification device 30 of that embodiment is realized.
  • The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
  • The processor is the control center of the computer device and uses various interfaces and lines to connect the parts of the entire computer device.
  • The memory may be used to store the computer program and/or module; the processor implements various functions of the computer device by running or executing the computer program and/or module stored in the memory and calling the data stored in the memory.
  • The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function (such as a sound playback function, an image playback function, etc.), and the data storage area may store data created according to the use of the device (such as audio data, video data, etc.).
  • The memory may include a high-speed random access memory, and may also include a non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
  • The transceiver may also be replaced by a receiver and a transmitter, which may be the same or different physical entities. When they are the same physical entity, they can be collectively referred to as a transceiver.
  • The transceiver may be the input-output unit.
  • The memory may be integrated in the processor, or may be provided separately from the processor.
  • The present application also provides a non-volatile computer-readable storage medium including instructions that, when run on a computer, cause the computer to execute the following steps of the text classification method:
  • obtaining training text, where the training text includes multiple sentences and each sentence includes multiple words;
  • inputting the target feature vector into the classifier, and classifying the training text by the classifier to obtain the classified text.
  • The method of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform; of course, it can also be implemented by hardware, but in many cases the former is better.
  • The technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product.
  • The computer software product is stored in a storage medium (such as ROM/RAM) and includes several instructions for causing a terminal (which may be a mobile phone, a computer, a server, or a network device, etc.) to execute the method described in each embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to the field of text classification and concerns a text classification method, apparatus, device, and storage medium. The method comprises the steps of: acquiring training text; inputting the training text into a coding layer of a neural network model, performing word vectorization on the training text in the coding layer, and obtaining a feature vector corresponding to the training text; inputting the feature vector into an RNN model, performing sentence modeling, and capturing a long-distance dependence feature of each sentence of the training text; inputting the feature vector carrying the captured long-distance dependence information into a convolutional neural network (CNN) model in the neural network model; in the CNN model, extracting local features from the feature vector to obtain a target feature vector, the local feature indicating a local correlation of the feature vector; and inputting the target feature vector into a classifier, performing classification processing on the training text by means of the classifier, and obtaining classified text.
PCT/CN2019/102464 2019-06-04 2019-08-26 Text classification method, apparatus, device and storage medium WO2020244066A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910479226.7A CN110309304A (zh) 2019-06-04 2019-06-04 Text classification method, apparatus, device and storage medium (一种文本分类方法、装置、设备及存储介质)
CN201910479226.7 2019-06-04

Publications (1)

Publication Number Publication Date
WO2020244066A1 (fr)

Family

Family ID: 68075283

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/102464 WO2020244066A1 (fr) Text classification method, apparatus, device and storage medium

Country Status (2)

Country Link
CN (1) CN110309304A (fr)
WO (1) WO2020244066A1 (fr)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699944A (zh) * 2020-12-31 2021-04-23 中国银联股份有限公司 退单处理模型训练方法、处理方法、装置、设备及介质
CN112784601A (zh) * 2021-02-03 2021-05-11 中山大学孙逸仙纪念医院 关键信息提取方法、装置、电子设备和存储介质
CN112950313A (zh) * 2021-02-25 2021-06-11 北京嘀嘀无限科技发展有限公司 订单处理方法、装置、电子设备和可读存储介质
CN113190154A (zh) * 2021-04-29 2021-07-30 北京百度网讯科技有限公司 模型训练、词条分类方法、装置、设备、存储介质及程序
CN113221537A (zh) * 2021-04-12 2021-08-06 湘潭大学 一种基于截断循环神经网络和临近加权卷积的方面级情感分析方法
CN113239192A (zh) * 2021-04-29 2021-08-10 湘潭大学 一种基于滑动窗口和随机离散采样的文本结构化技术
CN113468872A (zh) * 2021-06-09 2021-10-01 大连理工大学 基于句子级别图卷积的生物医学关系抽取方法及系统
CN113486347A (zh) * 2021-06-30 2021-10-08 福州大学 一种基于语义理解的深度学习硬件木马检测方法
CN113822019A (zh) * 2021-09-22 2021-12-21 科大讯飞股份有限公司 文本规整方法、相关设备及可读存储介质
CN114021651A (zh) * 2021-11-04 2022-02-08 桂林电子科技大学 一种基于深度学习的区块链违法信息感知方法
CN114169443A (zh) * 2021-12-08 2022-03-11 西安交通大学 词级文本对抗样本检测方法
CN114499944A (zh) * 2021-12-22 2022-05-13 天翼云科技有限公司 一种检测WebShell的方法、装置和设备
CN114510576A (zh) * 2021-12-21 2022-05-17 一拓通信集团股份有限公司 一种基于BERT和BiGRU融合注意力机制的实体关系抽取方法
CN115249017A (zh) * 2021-06-23 2022-10-28 马上消费金融股份有限公司 文本标注方法、意图识别模型的训练方法及相关设备
CN116227495A (zh) * 2023-05-05 2023-06-06 公安部信息通信中心 一种实体分类的数据处理系统
CN116453385A (zh) * 2023-03-16 2023-07-18 中山市加乐美科技发展有限公司 一种跨时空盘学机
CN116958752A (zh) * 2023-09-20 2023-10-27 国网湖北省电力有限公司经济技术研究院 一种基于ipkcnn-svm的电网基建建筑归档方法、装置及设备
CN117093996A (zh) * 2023-10-18 2023-11-21 湖南惟储信息技术有限公司 嵌入式操作系统的安全防护方法及系统
CN117201733A (zh) * 2023-08-22 2023-12-08 杭州中汇通航航空科技有限公司 一种实时无人机监控分享系统
CN117623735A (zh) * 2023-12-01 2024-03-01 广东雅诚德实业有限公司 高强度抗污日用陶瓷的生产方法
CN117668562A (zh) * 2024-01-31 2024-03-08 腾讯科技(深圳)有限公司 文本分类模型的训练和使用方法、装置、设备和介质

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113032534A (zh) * 2019-12-24 2021-06-25 中国移动通信集团四川有限公司 对话文本的分类方法和电子设备
CN111177392A (zh) * 2019-12-31 2020-05-19 腾讯云计算(北京)有限责任公司 一种数据处理方法及装置
CN111538840B (zh) * 2020-06-23 2023-04-28 基建通(三亚)国际科技有限公司 一种文本分类方法及装置
CN111930938A (zh) * 2020-07-06 2020-11-13 武汉卓尔数字传媒科技有限公司 文本分类方法、装置、电子设备及存储介质
CN111865960A (zh) * 2020-07-15 2020-10-30 北京市燃气集团有限责任公司 一种网络入侵场景分析处理方法、系统、终端及存储介质
CN114050908B (zh) * 2020-07-24 2023-07-21 中国移动通信集团浙江有限公司 防火墙策略自动审核的方法、装置、计算设备及计算机存储介质
CN112118225B (zh) * 2020-08-13 2021-09-03 紫光云(南京)数字技术有限公司 一种基于RNN的Webshell检测方法及装置
CN112148943A (zh) * 2020-09-27 2020-12-29 北京天融信网络安全技术有限公司 网页分类方法、装置、电子设备及可读存储介质
CN112491891B (zh) * 2020-11-27 2022-05-17 杭州电子科技大学 物联网环境下基于混合深度学习的网络攻击检测方法
CN112686315A (zh) * 2020-12-31 2021-04-20 山西三友和智慧信息技术股份有限公司 一种基于深度学习的非自然地震分类方法
CN112699964A (zh) * 2021-01-13 2021-04-23 成都链安科技有限公司 模型构建方法、系统、装置、介质、交易身份识别方法
CN113010740B (zh) * 2021-03-09 2023-05-30 腾讯科技(深圳)有限公司 词权重的生成方法、装置、设备及介质
CN115359867B (zh) * 2022-09-06 2024-02-02 中国电信股份有限公司 电子病历分类方法、装置、电子设备及存储介质
CN116226702B (zh) * 2022-09-09 2024-04-26 武汉中数医疗科技有限公司 一种基于生物电阻抗的甲状腺采样数据识别方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030078899A1 (en) * 2001-08-13 2003-04-24 Xerox Corporation Fuzzy text categorizer
CN102141978A (zh) * 2010-02-02 2011-08-03 阿里巴巴集团控股有限公司 一种文本分类的方法及系统
CN104572892A (zh) * 2014-12-24 2015-04-29 中国科学院自动化研究所 一种基于循环卷积网络的文本分类方法
CN107103754A (zh) * 2017-05-10 2017-08-29 华南师范大学 一种道路交通状况预测方法及系统
CN108829818A (zh) * 2018-06-12 2018-11-16 中国科学院计算技术研究所 一种文本分类方法

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11354565B2 (en) * 2017-03-15 2022-06-07 Salesforce.Com, Inc. Probability-based guider
CN107066553B (zh) * 2017-03-24 2021-01-01 北京工业大学 一种基于卷积神经网络与随机森林的短文本分类方法
CN108694163B (zh) * 2017-04-06 2021-11-26 富士通株式会社 计算句子中的词的概率的方法、装置和神经网络
CN107562784A (zh) * 2017-07-25 2018-01-09 同济大学 基于ResLCNN模型的短文本分类方法
CN107832400B (zh) * 2017-11-01 2019-04-16 山东大学 一种基于位置的lstm和cnn联合模型进行关系分类的方法
CN108334499B (zh) * 2018-02-08 2022-03-18 海南云江科技有限公司 一种文本标签标注设备、方法和计算设备
CN109743732B (zh) * 2018-12-20 2022-05-10 重庆邮电大学 基于改进的cnn-lstm的垃圾短信判别方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030078899A1 (en) * 2001-08-13 2003-04-24 Xerox Corporation Fuzzy text categorizer
CN102141978A (zh) * 2010-02-02 2011-08-03 阿里巴巴集团控股有限公司 一种文本分类的方法及系统
CN104572892A (zh) * 2014-12-24 2015-04-29 中国科学院自动化研究所 一种基于循环卷积网络的文本分类方法
CN107103754A (zh) * 2017-05-10 2017-08-29 华南师范大学 一种道路交通状况预测方法及系统
CN108829818A (zh) * 2018-06-12 2018-11-16 中国科学院计算技术研究所 一种文本分类方法

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699944B (zh) * 2020-12-31 2024-04-23 中国银联股份有限公司 退单处理模型训练方法、处理方法、装置、设备及介质
CN112699944A (zh) * 2020-12-31 2021-04-23 中国银联股份有限公司 退单处理模型训练方法、处理方法、装置、设备及介质
CN112784601A (zh) * 2021-02-03 2021-05-11 中山大学孙逸仙纪念医院 关键信息提取方法、装置、电子设备和存储介质
CN112784601B (zh) * 2021-02-03 2023-06-27 中山大学孙逸仙纪念医院 关键信息提取方法、装置、电子设备和存储介质
CN112950313A (zh) * 2021-02-25 2021-06-11 北京嘀嘀无限科技发展有限公司 订单处理方法、装置、电子设备和可读存储介质
CN113221537A (zh) * 2021-04-12 2021-08-06 湘潭大学 一种基于截断循环神经网络和临近加权卷积的方面级情感分析方法
CN113190154A (zh) * 2021-04-29 2021-07-30 北京百度网讯科技有限公司 模型训练、词条分类方法、装置、设备、存储介质及程序
CN113239192A (zh) * 2021-04-29 2021-08-10 湘潭大学 一种基于滑动窗口和随机离散采样的文本结构化技术
CN113190154B (zh) * 2021-04-29 2023-10-13 北京百度网讯科技有限公司 模型训练、词条分类方法、装置、设备、存储介质及程序
CN113239192B (zh) * 2021-04-29 2024-04-16 湘潭大学 一种基于滑动窗口和随机离散采样的文本结构化技术
CN113468872B (zh) * 2021-06-09 2024-04-16 大连理工大学 基于句子级别图卷积的生物医学关系抽取方法及系统
CN113468872A (zh) * 2021-06-09 2021-10-01 大连理工大学 基于句子级别图卷积的生物医学关系抽取方法及系统
CN115249017A (zh) * 2021-06-23 2022-10-28 马上消费金融股份有限公司 文本标注方法、意图识别模型的训练方法及相关设备
CN115249017B (zh) * 2021-06-23 2023-12-19 马上消费金融股份有限公司 文本标注方法、意图识别模型的训练方法及相关设备
CN113486347A (zh) * 2021-06-30 2021-10-08 福州大学 一种基于语义理解的深度学习硬件木马检测方法
CN113486347B (zh) * 2021-06-30 2023-07-14 福州大学 一种基于语义理解的深度学习硬件木马检测方法
CN113822019A (zh) * 2021-09-22 2021-12-21 科大讯飞股份有限公司 文本规整方法、相关设备及可读存储介质
CN114021651A (zh) * 2021-11-04 2022-02-08 桂林电子科技大学 一种基于深度学习的区块链违法信息感知方法
CN114021651B (zh) * 2021-11-04 2024-03-29 桂林电子科技大学 一种基于深度学习的区块链违法信息感知方法
CN114169443B (zh) * 2021-12-08 2024-02-06 西安交通大学 词级文本对抗样本检测方法
CN114169443A (zh) * 2021-12-08 2022-03-11 西安交通大学 词级文本对抗样本检测方法
CN114510576A (zh) * 2021-12-21 2022-05-17 一拓通信集团股份有限公司 一种基于BERT和BiGRU融合注意力机制的实体关系抽取方法
CN114499944A (zh) * 2021-12-22 2022-05-13 天翼云科技有限公司 一种检测WebShell的方法、装置和设备
CN114499944B (zh) * 2021-12-22 2023-08-08 天翼云科技有限公司 一种检测WebShell的方法、装置和设备
CN116453385B (zh) * 2023-03-16 2023-11-24 中山市加乐美科技发展有限公司 一种跨时空盘学机
CN116453385A (zh) * 2023-03-16 2023-07-18 中山市加乐美科技发展有限公司 一种跨时空盘学机
CN116227495A (zh) * 2023-05-05 2023-06-06 公安部信息通信中心 一种实体分类的数据处理系统
CN117201733B (zh) * 2023-08-22 2024-03-12 杭州中汇通航航空科技有限公司 一种实时无人机监控分享系统
CN117201733A (zh) * 2023-08-22 2023-12-08 杭州中汇通航航空科技有限公司 一种实时无人机监控分享系统
CN116958752B (zh) * 2023-09-20 2023-12-15 国网湖北省电力有限公司经济技术研究院 一种基于ipkcnn-svm的电网基建建筑归档方法、装置及设备
CN116958752A (zh) * 2023-09-20 2023-10-27 国网湖北省电力有限公司经济技术研究院 一种基于ipkcnn-svm的电网基建建筑归档方法、装置及设备
CN117093996B (zh) * 2023-10-18 2024-02-06 湖南惟储信息技术有限公司 嵌入式操作系统的安全防护方法及系统
CN117093996A (zh) * 2023-10-18 2023-11-21 湖南惟储信息技术有限公司 嵌入式操作系统的安全防护方法及系统
CN117623735A (zh) * 2023-12-01 2024-03-01 广东雅诚德实业有限公司 高强度抗污日用陶瓷的生产方法
CN117623735B (zh) * 2023-12-01 2024-05-14 广东雅诚德实业有限公司 高强度抗污日用陶瓷的生产方法
CN117668562A (zh) * 2024-01-31 2024-03-08 腾讯科技(深圳)有限公司 文本分类模型的训练和使用方法、装置、设备和介质
CN117668562B (zh) * 2024-01-31 2024-04-19 腾讯科技(深圳)有限公司 文本分类模型的训练和使用方法、装置、设备和介质

Also Published As

Publication number Publication date
CN110309304A (zh) 2019-10-08

Similar Documents

Publication Publication Date Title
WO2020244066A1 (fr) Text classification method, apparatus, device and storage medium
WO2019136993A1 (fr) Procédé et dispositif de calcul de similarité de texte, appareil informatique, et support de stockage
WO2021072885A1 (fr) Procédé et appareil de reconnaissance de texte, dispositif et support de stockage
CN107612893B (zh) 短信的审核系统和方法以及构建短信审核模型方法
WO2017084586A1 (fr) Procédé, système et dispositif pour inférer une règle de code malveillant sur la base d'un procédé d'apprentissage approfondi
CN110210617B (zh) 一种基于特征增强的对抗样本生成方法及生成装置
WO2020253350A1 (fr) Procédé et appareil de vérification de publication de contenu de réseau, dispositif informatique et support de stockage
CN109670163B (zh) 信息识别方法、信息推荐方法、模板构建方法及计算设备
CN111371806A (zh) 一种Web攻击检测方法及装置
US20200125836A1 (en) Training Method for Descreening System, Descreening Method, Device, Apparatus and Medium
CN111460153A (zh) 热点话题提取方法、装置、终端设备及存储介质
CN111814770A (zh) 一种新闻视频的内容关键词提取方法、终端设备及介质
EP4258610A1 (fr) Procédé d'identification de trafic malveillant et appareil associé
CN110929145A (zh) 舆情分析方法、装置、计算机装置及存储介质
US20170193098A1 (en) System and method for topic modeling using unstructured manufacturing data
CN109446299B (zh) 基于事件识别的搜索电子邮件内容的方法及系统
CN110765286A (zh) 跨媒体检索方法、装置、计算机设备和存储介质
CN112507167A (zh) 一种识别视频合集的方法、装置、电子设备及存储介质
CN114416998A (zh) 文本标签的识别方法、装置、电子设备及存储介质
TWI749349B (zh) 文本還原方法、裝置及電子設備與電腦可讀儲存媒體
WO2023273303A1 (fr) Procédé et appareil basés sur un modèle d'arbre pour acquérir un degré d'influence d'un événement, et dispositif informatique
CN117079645A (zh) 语音模型优化方法、装置、设备及介质
CN111444362A (zh) 恶意图片拦截方法、装置、设备和存储介质
CN116561298A (zh) 基于人工智能的标题生成方法、装置、设备及存储介质
CN115314268B (zh) 基于流量指纹和行为的恶意加密流量检测方法和系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19932046

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19932046

Country of ref document: EP

Kind code of ref document: A1