CN117407530A - Text classification method, device, electronic equipment and storage medium - Google Patents

Text classification method, device, electronic equipment and storage medium

Info

Publication number
CN117407530A
CN202311443529.6A (application number) · CN117407530A (publication number)
Authority
CN
China
Prior art keywords
text
training
classified
classification
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311443529.6A
Other languages
Chinese (zh)
Inventor
刘微
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Original Assignee
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Topsec Technology Co Ltd, Beijing Topsec Network Security Technology Co Ltd, Beijing Topsec Software Co Ltd filed Critical Beijing Topsec Technology Co Ltd
Priority to CN202311443529.6A priority Critical patent/CN117407530A/en
Publication of CN117407530A publication Critical patent/CN117407530A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416Event detection, e.g. attack signature detection
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425Traffic logging, e.g. anomaly detection
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/40Network security protocols

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a text classification method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: processing a text to be classified to obtain a text feature vector corresponding to the text to be classified; and inputting the text feature vector into a pre-trained text classification model to obtain the classification result that the text classification model outputs for the text to be classified, the classification result characterizing the attack intention class of the text to be classified. The text classification model is obtained by calculating bias correction weights from a training text and a test text, correcting the training text with the bias correction weights, and training a classifier on the corrected training text; the training text and the test text have different sources. In the embodiments of the application, the bias correction weights correct the bias of the training text, so that the text classification model is trained on bias-corrected training text, which improves the generalization ability of the text classification model and thereby the accuracy of text classification.

Description

Text classification method, device, electronic equipment and storage medium
Technical Field
The present application relates to the field of network security technologies, and in particular to a text classification method and apparatus, an electronic device, and a storage medium.
Background
In recent years, a large number of organizations have encountered complex network attacks, including advanced persistent threats (APTs). Most of the techniques used in APT attacks readily circumvent common off-the-shelf defense mechanisms. Organizations regularly share threat information in the form of reports, with the aim of improving security by raising awareness of such attacks. The number of threat reports generated by various organizations is trending upward, and automatic parsing of threat reports is a prerequisite for using them efficiently.
In the prior art, features are generally extracted from threat reports and the extracted features are analyzed with a trained model to obtain the classification of the threat report. Because the sources of the training data and the test data often differ, bias arises in the model's analysis, resulting in low classification accuracy.
Disclosure of Invention
An embodiment of the application aims to provide a text classification method and apparatus, an electronic device, and a storage medium, for improving the accuracy of text classification.
In a first aspect, an embodiment of the present application provides a text classification method, including:
processing the text to be classified to obtain a text feature vector corresponding to the text to be classified;

inputting the text feature vector into a pre-trained text classification model to obtain the classification result that the text classification model outputs for the text to be classified; the classification result characterizes the attack intention class of the text to be classified;

wherein the text classification model is obtained by calculating bias correction weights from a training text and a test text, correcting the training text with the bias correction weights, and training a classifier on the corrected training text; the training text and the test text have different sources.

In the embodiments of the application, the bias correction weights correct the bias of the training text, so that the text classification model is trained on bias-corrected training text; this improves the generalization ability of the text classification model and thereby the accuracy of text classification.
In any embodiment, the method further comprises:
acquiring a training text and a test text, the training text and the test text having different sources;

processing the training text and the test text with a language processing model to obtain a training feature vector and a test feature vector, respectively;

mapping the distribution corresponding to the training feature vector and the distribution corresponding to the test feature vector into a reproducing kernel Hilbert space, obtaining a training mapping distribution and a test mapping distribution, respectively;

calculating the bias correction weights from the training mapping distribution and the test mapping distribution;

and performing bias correction on the training text with the bias correction weights, and training the classifier with the bias-corrected training text to obtain the text classification model.

In the embodiments of the application, the bias correction weights are calculated from training texts and test texts of different sources, so that the training feature vectors of the training texts can be bias-corrected with these weights, improving the generalization ability of the text classification model.
In any embodiment, calculating the bias correction weights from the training mapping distribution and the test mapping distribution includes:

calculating the bias correction weights according to the formula

min ‖E[β(x)φ(x1)] − E[φ(x2)]‖

where β(x) is the bias correction weight to be calculated; φ(x1) is the training mapping distribution; φ(x2) is the test mapping distribution; E denotes expectation; and ‖·‖ denotes the two-norm.

In the embodiments of the application, the bias correction weight is obtained by minimizing the maximum mean discrepancy, i.e., the distance between the weighted training mapping distribution and the test mapping distribution, so that the difference between the two is smallest; a bias correction weight calculated in this way performs bias correction effectively.
In any embodiment, processing the text to be classified includes:

encoding the text to be classified with a BERT model to obtain sentence representation vectors corresponding to the text to be classified;

processing the sentence representation vectors through the first Transformer encoder substructure layer, and inputting the result into the next Transformer encoder substructure layer;

and summing the outputs of the last Transformer encoder substructure layer as the chapter representation corresponding to the text to be classified.

In the embodiments of the application, the BERT model and the multi-layer Transformer encoder substructure capture long-term semantic dependencies between sentences and yield sentence representations with text-level features, providing a basis for the accuracy of the subsequent attack intention classification.
In any embodiment, performing bias correction on the training feature vector with the bias correction weights includes:

calculating the product of the bias correction weight and the training feature vector to obtain the corrected training feature vector.

In the embodiments of the application, performing bias correction on the training feature vector with the bias correction weights improves the generalization ability of the trained text classification model.
In any embodiment, inputting the text feature vector into a pre-trained text classification model to obtain the classification result that the text classification model outputs for the text to be classified includes:

classifying the text feature vector through a single-layer feed-forward network with a sigmoid activation function to obtain the classification result.

In the embodiments of the application, because the input text to be classified may belong to one, two, or even more classes, and the classes are independent but not mutually exclusive, the sigmoid activation function, which supports non-exclusive classification problems, enables more accurate classification of the text to be classified.
In any embodiment, the classification result comprises at least one of: reconnaissance, resource development, initial access, execution, persistence, privilege escalation, defense evasion, credential access, discovery, lateral movement, collection, command and control, exfiltration, and impact.

The method provided by the embodiments of the application thus realizes ATT&CK tactic classification of the text to be classified.
In a second aspect, an embodiment of the present application provides a text classification apparatus, including:
the text processing module, configured to process the text to be classified to obtain a text feature vector corresponding to the text to be classified;

the classification module, configured to input the text feature vector into a pre-trained text classification model to obtain the classification result that the text classification model outputs for the text to be classified; the classification result characterizes the attack intention class of the text to be classified;

wherein the text classification model is obtained by calculating bias correction weights from a training text and a test text, correcting the training text with the bias correction weights, and training a classifier on the corrected training text; the training text and the test text have different sources.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor, a memory, and a bus, wherein,
the processor and the memory communicate with each other through the bus;

the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the method of the first aspect.
In a fourth aspect, embodiments of the present application provide a non-transitory computer readable storage medium comprising:
the non-transitory computer-readable storage medium stores computer instructions that cause the computer to perform the method of the first aspect.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the embodiments of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should not be regarded as limiting its scope; a person skilled in the art may obtain other related drawings from these drawings without inventive effort.
Fig. 1 is a schematic flow chart of a text classification method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a text classification device according to an embodiment of the present application;
fig. 3 is a schematic diagram of an entity structure of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the technical solutions of the present application will be described in detail below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical solutions of the present application, and thus are only examples, and are not intended to limit the scope of protection of the present application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description and claims of the present application and in the description of the figures above are intended to cover non-exclusive inclusions.
In the description of the embodiments of the present application, the technical terms "first," "second," etc. are used merely to distinguish between different objects and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated, a particular order or a primary or secondary relationship. In the description of the embodiments of the present application, the meaning of "plurality" is two or more unless explicitly defined otherwise.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In the description of the embodiments of the present application, the term "and/or" merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
In the description of the embodiments of the present application, the term "plurality" refers to two or more (including two), and similarly, "plural sets" refers to two or more (including two), and "plural sheets" refers to two or more (including two).
In the description of the embodiments of the present application, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured" and the like are to be construed broadly and may be, for example, fixedly connected, detachably connected, or integrally formed; or may be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communicated with the inside of two elements or the interaction relationship of the two elements. The specific meaning of the above terms in the embodiments of the present application will be understood by those of ordinary skill in the art according to the specific circumstances.
Currently, there are many methods for identifying the type of threat information, for example: extracting document features and information-security-element features of threat information, and classifying the threat types of threat information through information-security-element extraction, information-security-element relation construction, feature engineering, neural-network models, and the like. In the prior art, the source of the training text used to train a model may differ from the source of the text to be classified at the inference stage, which introduces bias into the model's analysis and thus lowers classification accuracy.

To solve the above problems, an embodiment of the present application provides a text classification method that corrects the training feature vectors corresponding to the training text with pre-computed bias correction weights and trains the classifier with the corrected training feature vectors, improving the generalization ability of the trained text classification model.

It can be understood that the text classification method provided by the embodiments of the application can be applied to an electronic device, where the electronic device includes a terminal or a server; the terminal can be a smartphone, a tablet computer, a personal digital assistant (PDA), or the like; the server may be an application server or a Web server.
Fig. 1 is a schematic flow chart of a text classification method according to an embodiment of the present application, as shown in fig. 1, where the method includes:
step 101: and processing the text to be classified to obtain text feature vectors corresponding to the text to be classified.
The text to be classified may be a threat report obtained from an open-source threat intelligence platform, or obtained through other channels, and its format may be HTML or Adobe PDF. After the electronic device acquires the text to be classified, it performs processing such as sentence segmentation and feature extraction on it to obtain the text feature vector corresponding to the text to be classified.
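As one possible realization of this acquisition step (a minimal sketch; the use of BeautifulSoup and the function name are assumptions, since the text names only the input formats), an HTML report can be reduced to plain text before sentence segmentation:

```python
from bs4 import BeautifulSoup

def html_report_to_text(html: str) -> str:
    """Strip markup from an HTML threat report, keeping only visible text."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script/style elements, which carry no report content.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)
```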
Step 102: inputting the text feature vector into a pre-trained text classification model to obtain a classification result corresponding to the text to be classified, which is output by the text classification model.
In a specific implementation, the text classification model is obtained by calculating bias correction weights from training texts and test texts, correcting the training texts with the bias correction weights, and training a classifier on the corrected training texts; the training texts and the test texts have different sources. Because the training data and the test data come from different threat intelligence platforms, and different platforms describe threats differently, the feature vectors of the training texts are weighted when the text classification model is trained, i.e., multiplied by the pre-computed bias correction weights, which improves the generalization ability of the text classification model.

It may be appreciated that the classification result corresponding to the text to be classified characterizes the attack intention class of the text to be classified, and the classification result may include at least one of the following: reconnaissance, resource development, initial access, execution, persistence, privilege escalation, defense evasion, credential access, discovery, lateral movement, collection, command and control, exfiltration, and impact.
Reconnaissance: an attacker gathers useful information before intruding into an enterprise, in order to plan future operations. Reconnaissance involves the attacker actively or passively collecting information for targeting the attack. Such information may include details of the victim organization, its infrastructure, or its employees. The attacker may also use this information in other stages of the attack lifecycle, for example to plan and perform initial access, to determine the scope of action and target priorities after intrusion, or to drive further reconnaissance.
Resource development: the attacker sets up resources for future operations. Resource development involves the attacker creating, purchasing, or stealing resources that can be used to target the attack, including infrastructure, accounts, or capabilities. The attacker may also use these resources in other stages of the attack lifecycle, such as using purchased domain names for command and control, phishing with mail accounts for initial access, or stealing code-signing certificates for defense evasion.
Initial access: in general, initial access refers to the attacker establishing a foothold in the enterprise environment. From this point on, the attacker can use different techniques to achieve initial access, based on the information collected before the intrusion. For example, an attacker may use a spear-phishing attachment, which exploits some type of vulnerability to achieve this level of access, such as via PowerShell or other scripting techniques. If execution succeeds, the attacker can apply other tactics and techniques to achieve the final objective.
Execution: of all the tactics an attacker adopts, the most widely applied is execution. Attackers turn to this tactic when using off-the-shelf malware, ransomware, or APT attacks. For malware to take effect it must run, which gives the defender the opportunity to block or detect it. However, not every malicious executable is easily found with antivirus software. In addition, the command-line interface and PowerShell are very useful to attackers; much fileless malware uses one of these techniques or a combination of both.
Persistence: once the attacker achieves persistent access, the computer can be re-infected, or keep its existing connection, even if operations staff take measures such as restarting the machine or changing credentials. For example, registry run keys and startup folders are the most common techniques; they execute each time the computer starts. An attacker can thus achieve persistence through the launch of a commonly used application such as a Web browser or Microsoft Office. Among all ATT&CK tactics, persistence is one of the most interesting.
Privilege escalation: ATT&CK suggests that defenders "should focus on preventing attack tools from running in the early stages of the activity chain, and then focus on identifying subsequent malicious behavior". This means using defense in depth to prevent infection, such as perimeter defense on the terminal or application whitelists. For privilege escalation beyond the scope of ATT&CK, prevention means applying a hardened baseline on the terminal. Another way to deal with privilege escalation is audit logging: when attackers use privilege escalation techniques, they often leave traces that expose their purpose. In particular, host-side logs should record all operation and maintenance commands on the server, to facilitate forensics and real-time auditing.
Defense evasion: refers to the techniques an attacker uses to avoid detection by defenses throughout the attack. Defense evasion techniques include uninstalling or disabling security software and obfuscating or encrypting data and scripts. An attacker can also abuse trusted processes to hide and disguise malware. An interesting point about this tactic is that some malware (e.g., ransomware) does not care about defense evasion at all: its only goal is to execute once on the device, however soon it is discovered afterwards. Some techniques can trick antivirus (AV) products into failing to detect the malware at all, or bypass application whitelisting.
Credential access: any attacker intruding into an enterprise wants to maintain a degree of stealth, and wishes to steal as many credentials as possible. Passwords can of course be broken by brute force, but such an attack makes too much noise. There are many examples of stealing password hashes and then passing the hash or cracking the hashes offline. Of all the information worth stealing, attackers most prefer plaintext passwords, which may be stored in plaintext files, databases, or even the registry. A very common behavior is for an attacker to compromise a system, steal the local hashes, and crack the local administrator password. The simplest defense against credential access is to use complex passwords: combining upper and lower case, numbers, and special characters makes passwords difficult to crack. Finally, the use of valid accounts must be monitored, since in many cases data leakage occurs through valid credentials.
Discovery: includes the techniques an attacker uses to obtain information about systems and the internal network. These techniques help the attacker observe the environment and get oriented before deciding how to act. The attacker uses them to explore what can be controlled and what lies near the entry point, and, based on the information obtained, to advance the goals of the attack. Native operating-system tools can also serve this post-intrusion information gathering.
Lateral movement: an attacker typically attempts to move laterally within the network after exploiting a vulnerability on a single system. Even ransomware targeting a single system tries to move laterally through the network to find other targets. The attacker usually first establishes a foothold, then moves through the various systems looking for higher access rights, in order to achieve the final goal. For mitigating and detecting lateral movement, appropriate network segmentation can reduce the risk to a large extent.
Collection: refers to the techniques an attacker uses to gather information, and the sources from which information relevant to the attacker's goals is collected. Typically, the step after collecting data is to exfiltrate it. Common sources include various drive types, browsers, audio, video, and email; common collection methods include capturing screenshots and keyboard input. Enterprises can use the techniques of this tactic to understand how malware handles data in the organization. An attacker may attempt to steal information about a user including what is on the screen, what the user types, what the user discusses, and the user's appearance.
Command and control: consists of the techniques an attacker uses to communicate with compromised systems inside the victim network. Attackers usually avoid discovery by mimicking normal, expected traffic. Depending on the victim's network architecture and defenses, the attacker can establish command and control at different levels of stealth in a variety of ways. Most malware now uses command and control to some degree: the attacker receives data through the command-and-control server and tells the malware which instructions to execute next. For each command-and-control channel, the attacker accesses the network from a remote location, so understanding what happens on the network is critical to countering these techniques.
Exfiltration: includes the techniques an attacker uses to steal data from the victim network. After gaining access rights, the attacker searches everywhere for relevant data and then begins to exfiltrate it, although not all malware reaches this stage. When an attacker steals data over the network, especially in large volumes (such as a customer database), a network intrusion detection or prevention system helps identify when the data is being transmitted.
Impact: the attacker attempts to manipulate, interrupt, or destroy the enterprise's systems and data. Impact techniques include destroying or tampering with data. In some cases business processes appear fine, yet the data may already have been tampered with by an attacker. Attackers may use these techniques to accomplish their end goal or to cover the tracks of a breach.
In the embodiments of the application, the bias correction weights are used to correct the bias of the training text, so that the text classification model is trained on bias-corrected training text; this improves the generalization ability of the text classification model and thereby the accuracy of text classification.
On the basis of the above embodiment, a specific calculation method for the bias correction weights is as follows:

Step 1: acquire training texts and test texts of different sources; it will be appreciated that there are multiple training texts and multiple test texts.

Step 2: process the training texts and the test texts with a language processing model to obtain training feature vectors and test feature vectors, respectively. The training texts and test texts may be preprocessed before being processed with the language processing model; for example, they may be split into sentences, using methods that include:
(1) Rule-based methods: sentence boundaries are identified by writing a series of rules. These rules may involve punctuation marks, spaces, line breaks, and the like; for example, regular expressions can match the start and end positions of sentences.

(2) Statistics-based methods: statistical features of sentences are learned by analyzing a large amount of labeled sentence data, and these features are then used to predict the boundaries of new sentences. Common statistical methods include hidden Markov models (HMM) and maximum entropy models (MEM).

(3) Machine-learning-based methods: the sentence-splitting task is treated as a sequence labeling problem, and a model is trained with machine learning algorithms (e.g., CRF or LSTM). A large amount of labeled sentence data must first be prepared as a training set; the trained model is then used to split new text.

(4) Deep-learning-based methods: the text is encoded with a deep learning technique (e.g., Transformer or BERT), and a decoder then generates the splitting result. This approach typically requires significant computing resources and training data.

(5) Dictionary-based methods: sentences are divided according to the words and phrases in a dictionary. This approach suits certain fields, such as law and medicine, whose texts are often highly structured.

(6) Hybrid methods: the above methods are combined to improve the accuracy and robustness of sentence splitting. For example, a rule-based method can produce a preliminary split, which a statistical or machine-learning-based method then refines.
In a specific implementation, a suitable sentence-splitting method can be selected according to the actual situation. After obtaining the sentences of each training text and test text, the sentences can be parsed with Stanza, and sentences with more than three words are kept (a threshold of four or more words may also be used; the embodiments of the application do not specifically limit this). For longer sentences, to avoid the input-length restriction of the language processing model, a long sentence may be divided into shorter ones. A minimal sketch of this filtering step follows.
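The following is a hedged sketch of the sentence-splitting and filtering step, assuming the Stanza pipeline for English; the function name and the exact threshold handling are illustrative, not prescribed by the text.

```python
import stanza

# Tokenization-only pipeline; Stanza's tokenizer also performs sentence
# segmentation (assumes stanza.download("en") has been run once).
nlp = stanza.Pipeline(lang="en", processors="tokenize")

def split_and_filter(text: str, min_words: int = 4) -> list[str]:
    """Sentence-split `text` and keep sentences with more than three words."""
    doc = nlp(text)
    return [s.text for s in doc.sentences if len(s.words) >= min_words]
```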
After the preprocessed sentences are obtained, a [CLS] tag is inserted at the beginning of each sentence and a [SEP] tag at its end, to distinguish the multiple sentences in the text. After tag insertion, the sentences are concatenated and input into the BERT model. The BERT model outputs the sentence representation vectors corresponding to the text, t = {t1, t2, ..., tm}, where ti, the BERT output at the position of the i-th [CLS] tag, is the vector representation of the i-th sentence.
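A hedged sketch of this encoding step, using the HuggingFace transformers library as one possible realization (the text does not name a concrete implementation; the model choice and truncation handling are assumptions):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode_sentences(sentences: list[str]) -> torch.Tensor:
    """Concatenate [CLS] <sent> [SEP] units and return one vector per sentence."""
    text = " ".join(f"[CLS] {s} [SEP]" for s in sentences)
    enc = tokenizer(text, return_tensors="pt", add_special_tokens=False,
                    truncation=True, max_length=512)
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]        # (seq_len, 768)
    # t_i = the output at the position of the i-th [CLS] tag.
    cls_positions = enc["input_ids"][0] == tokenizer.cls_token_id
    return hidden[cls_positions]                          # (m, 768)
```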
To further incorporate the global context information of the text and capture long-term semantic dependencies between sentences, a stack of Transformer encoder substructures is applied to the output layer of the BERT model. The input of the first Transformer encoder substructure layer is t, so that

h^l = Trans(t)

where Trans(·) denotes the Transformer encoder substructure, l is the number of stacked Transformer encoder substructure layers, and h^l, the output of the last layer of the stack, is a sentence representation with text-level features.

The outputs of the last Transformer encoder substructure layer are summed, x = Sum(h^l), to give the text feature vector of the input text.
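A minimal PyTorch sketch of this stacked-encoder step (layer count, head count, and dimensions are illustrative assumptions; the text fixes only the overall structure):

```python
import torch
import torch.nn as nn

class ChapterEncoder(nn.Module):
    def __init__(self, dim: int = 768, n_layers: int = 2, n_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (batch, m sentences, dim) sentence representation vectors.
        h_l = self.encoder(t)          # output of the last stacked layer
        # Sum over the sentence axis: x = Sum(h^l), the text feature vector.
        return h_l.sum(dim=1)          # (batch, dim)
```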
Step 3: map the distribution corresponding to the training feature vectors and the distribution corresponding to the test feature vectors into a reproducing kernel Hilbert space, obtaining the training mapping distribution and the test mapping distribution, respectively.

Step 4: calculate the bias correction weights from the training mapping distribution and the test mapping distribution.
Sample bias is a known problem in natural language processing, where the training data distribution often deviates from the test data distribution. For text classification, bias in the training text may arise if labeled reports are available from only a few security organizations whose formats and styles do not represent the population well.
Let p_tr(x, y) and p_te(x, y) be the joint probability distributions of the training text and the test text, respectively, let p_tr(x) and p_te(x) be the corresponding marginal distributions, and let y be the class label. The covariate shift assumption is: p_tr(y|x) = p_te(y|x), but p_tr(x) ≠ p_te(x). Under this assumption, the bias in the training text can be corrected by estimating instance weights β(x) such that, for each training text, the weighted training distribution β(x)p_tr(x) matches p_te(x), i.e., β(x) = p_te(x)/p_tr(x).
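For clarity (this identity is standard and added here; it is not part of the original text), such weights make the expected weighted training loss equal the expected test loss under covariate shift:

```latex
% Importance weighting under covariate shift, with beta(x) = p_te(x)/p_tr(x)
% and p_tr(y|x) = p_te(y|x):
\mathbb{E}_{p_{tr}(x,y)}\bigl[\beta(x)\,\ell(f(x),y)\bigr]
  = \iint \frac{p_{te}(x)}{p_{tr}(x)}\,\ell(f(x),y)\,
      p_{tr}(y\mid x)\,p_{tr}(x)\,\mathrm{d}x\,\mathrm{d}y
  = \mathbb{E}_{p_{te}(x,y)}\bigl[\ell(f(x),y)\bigr]
```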
in calculating the bias correction weights β (x), a kernel-mean-matching algorithm (Kernel Mean Matching, KMM) estimation may be employed. The main idea of the algorithm is in feature mappingIs a regenerated nuclear Hilbert space (Reproducing Kernel Hilbert Space, RKHS)/(>In minimizing the weighted training data distribution beta (x) p tr (x) Distribution p with corresponding test data te (x) Average euclidean distance between them. The average distance is measured by calculating the maximum mean difference (Maximum Mean Discrepancy, MMD):
min||E[β(x)φ(x1)]-E[φ(x2)]||
wherein β (x) is the offset correction weight to be calculated; phi (x 1) is the training map distribution; phi (x 2) is the test map distribution; e is used to indicate the desireThe method comprises the steps of carrying out a first treatment on the surface of the The terms |· | are used to represent the two norms. Minimizing MMD to obtain optimal bias correction weights
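A hedged sketch of the finite-sample KMM estimate (the cvxpy solver, the RBF kernel, and the weight bound B are illustrative choices; the text prescribes only the MMD objective). It solves the standard quadratic program min over β of ½βᵀKβ − κᵀβ subject to 0 ≤ β ≤ B, which is the sample version of the MMD minimization above:

```python
import numpy as np
import cvxpy as cp

def rbf_kernel(a: np.ndarray, b: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """k(u, v) = exp(-gamma * ||u - v||^2), computed pairwise."""
    d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def kmm_weights(x_tr: np.ndarray, x_te: np.ndarray,
                gamma: float = 1.0, B: float = 10.0) -> np.ndarray:
    """Estimate beta for each training point by kernel mean matching."""
    n_tr, n_te = len(x_tr), len(x_te)
    # Small ridge keeps K numerically positive semidefinite for the QP.
    K = rbf_kernel(x_tr, x_tr, gamma) + 1e-8 * np.eye(n_tr)
    kappa = (n_tr / n_te) * rbf_kernel(x_tr, x_te, gamma).sum(axis=1)
    beta = cp.Variable(n_tr)
    problem = cp.Problem(
        cp.Minimize(0.5 * cp.quad_form(beta, K) - kappa @ beta),
        [beta >= 0, beta <= B],
    )
    problem.solve()
    return beta.value
```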
It should be noted that the processing of step 2 and step 3 applies equally to the text to be classified, which is not repeated in the embodiments of the application.
Step 5: perform bias correction on the training texts with the bias correction weights, and train the classifier with the bias-corrected training texts to obtain the text classification model.

The bias correction weights are multiplied with the training texts (i.e., with their training feature vectors) to obtain the bias-corrected training data, which are input into the classifier; training the classifier yields the text classification model.
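A minimal training sketch under stated assumptions (PyTorch, a linear classifier, and binary cross-entropy for the multi-label setting; only the weight-times-feature correction itself is prescribed by the text):

```python
import torch
import torch.nn as nn

classifier = nn.Linear(768, 14)   # 14 ATT&CK tactic labels (assumed dimensions)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

def train_step(x: torch.Tensor, y: torch.Tensor, beta: torch.Tensor) -> float:
    """x: (n, 768) training features; y: (n, 14) 0/1 labels; beta: (n,) weights."""
    x_corrected = beta.unsqueeze(1) * x   # bias correction: beta(x) * feature
    loss = loss_fn(classifier(x_corrected), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```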
In the embodiments of the application, the bias correction weights are calculated from training texts and test texts of different sources, so that the training feature vectors of the training texts can be bias-corrected with these weights, improving the generalization ability of the text classification model.
On the basis of the above embodiment, classifying the corrected text feature vector to obtain the classification result corresponding to the text to be classified includes:

classifying the corrected text feature vector through a single-layer feed-forward network with a sigmoid activation function to obtain the classification result.

In a specific implementation, the classifier obtains the category of the text from the computation over its quantified representation, according to a preset classification strategy. In multi-label classification, it must be determined whether each label should be an output label, so the probability of each label being an output label is quantified to a value between 0 and 1.

The embodiments of the application use a single-layer feed-forward network with a sigmoid activation function to build a nonlinear classifier that predicts the attack intention classes of the text to be classified. Because it is a multi-label classifier, the text to be classified may have multiple intentions: whenever the probability of another tactic label lies within 10% of that of the highest tactic label, the text receives that tactic label as well. For example, if the probability difference between the highest and the third-highest tactic label is less than 10%, the text to be classified has three tactic labels (i.e., attack intention classes).
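A sketch of this 10%-window selection rule (reading "within 10%" as an absolute probability difference is an assumption; the tactic names follow the ATT&CK tactics enumerated above):

```python
import numpy as np

TACTICS = ["reconnaissance", "resource development", "initial access",
           "execution", "persistence", "privilege escalation",
           "defense evasion", "credential access", "discovery",
           "lateral movement", "collection", "command and control",
           "exfiltration", "impact"]

def select_labels(logits: np.ndarray, window: float = 0.10) -> list[str]:
    """Return every tactic whose probability is within `window` of the top."""
    probs = 1.0 / (1.0 + np.exp(-logits))     # sigmoid per label
    keep = probs >= probs.max() - window      # within 10% of the highest
    return [t for t, k in zip(TACTICS, keep) if k]
```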
In the embodiments of the application, because the input text to be classified may belong to one, two, or even more classes, and the classes are independent but not mutually exclusive, the sigmoid activation function, which supports non-exclusive classification problems, enables more accurate classification of the text to be classified.
Fig. 2 is a schematic structural diagram of a text classification apparatus provided by an embodiment of the present application; the apparatus may be a module, a program segment, or code on an electronic device. It should be understood that the apparatus corresponds to the method embodiment of fig. 1 above and can perform the steps involved in that embodiment; for the specific functions of the apparatus, reference may be made to the description above, and a detailed description is omitted here where appropriate to avoid redundancy. The apparatus comprises: a text processing module 201 and a classification module 202, wherein:
the text processing module 201 is configured to process the text to be classified to obtain the text feature vector corresponding to the text to be classified;

the classification module 202 is configured to input the text feature vector into a pre-trained text classification model to obtain the classification result that the text classification model outputs for the text to be classified; the classification result characterizes the attack intention class of the text to be classified;

wherein the text classification model is obtained by calculating bias correction weights from a training text and a test text, correcting the training text with the bias correction weights, and training a classifier on the corrected training text; the training text and the test text have different sources.
On the basis of the above embodiment, the apparatus further includes a weight calculation module configured to:
acquiring training texts and test texts with different sources;
respectively processing the training text and the test text by using a language processing model to obtain a training feature vector and a test feature vector;
mapping the distribution corresponding to the training feature vector and the distribution corresponding to the test feature vector into a reproducing kernel Hilbert space, obtaining the training mapping distribution and the test mapping distribution, respectively;

calculating the bias correction weights from the training mapping distribution and the test mapping distribution;

and performing bias correction on the training text with the bias correction weights, and training a classifier with the bias-corrected training text to obtain the text classification model.
On the basis of the above embodiment, the weight calculation module is specifically configured to:

calculate the bias correction weights according to the formula

min ‖E[β(x)φ(x1)] − E[φ(x2)]‖

where β(x) is the bias correction weight to be calculated; φ(x1) is the training mapping distribution; φ(x2) is the test mapping distribution; E denotes expectation; and ‖·‖ denotes the two-norm.
On the basis of the above embodiment, the text processing module 201 is specifically configured to:

encode the text to be classified with a BERT model to obtain the sentence representation vectors corresponding to the text to be classified;

process the sentence representation vectors through the first Transformer encoder substructure layer, and input the result into the next Transformer encoder substructure layer;

and sum the outputs of the last Transformer encoder substructure layer as the chapter representation corresponding to the text to be classified.
Based on the above embodiment, the classification module 202 is specifically configured to:

calculate the product of the bias correction weight and the training feature vector to obtain the corrected training feature vector.
Based on the above embodiment, the classification module 202 is further specifically configured to:

classify the corrected text feature vector through a single-layer feed-forward network with a sigmoid activation function to obtain the classification result.
On the basis of the above embodiment, the classification result includes at least one of the following: reconnaissance, resource development, initial access, execution, persistence, privilege escalation, defense evasion, credential access, discovery, lateral movement, collection, command and control, exfiltration, and impact.
Fig. 3 is a schematic diagram of an entity structure of an electronic device according to an embodiment of the present application, as shown in fig. 3, where the electronic device includes: a processor (processor) 301, a memory (memory) 302, and a bus 303; wherein,
the processor 301 and the memory 302 perform communication with each other through the bus 303;
the processor 301 is configured to invoke the program instructions in the memory 302 to perform the methods provided by the above method embodiments, for example including: processing a text to be classified to obtain a text feature vector corresponding to the text to be classified; inputting the text feature vector into a pre-trained text classification model to obtain the classification result that the text classification model outputs for the text to be classified, the classification result characterizing the attack intention class of the text to be classified; wherein the text classification model is obtained by calculating bias correction weights from a training text and a test text, correcting the training text with the bias correction weights, and training a classifier on the corrected training text, the training text and the test text having different sources.
The processor 301 may be an integrated circuit chip with signal processing capability. The processor 301 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or perform the methods, steps, and logical blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or any conventional processor.
The memory 302 may include, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and the like.
The present embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium. The computer program comprises program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above method embodiments, for example including: processing a text to be classified to obtain a text feature vector corresponding to the text to be classified; inputting the text feature vector into a pre-trained text classification model to obtain the classification result that the text classification model outputs for the text to be classified, the classification result characterizing the attack intention class of the text to be classified; wherein the text classification model is obtained by calculating bias correction weights from a training text and a test text, correcting the training text with the bias correction weights, and training a classifier on the corrected training text, the training text and the test text having different sources.
The present embodiment provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above method embodiments, for example including: processing a text to be classified to obtain a text feature vector corresponding to the text to be classified; inputting the text feature vector into a pre-trained text classification model to obtain the classification result that the text classification model outputs for the text to be classified, the classification result characterizing the attack intention class of the text to be classified; wherein the text classification model is obtained by calculating bias correction weights from a training text and a test text, correcting the training text with the bias correction weights, and training a classifier on the corrected training text, the training text and the test text having different sources.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
Further, the units described as separate units may or may not be physically separate, and units displayed as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Furthermore, functional modules in various embodiments of the present application may be integrated together to form a single portion, or each module may exist alone, or two or more modules may be integrated to form a single portion.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application, and various modifications and variations may be suggested to one skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method of text classification, comprising:
processing a text to be classified to obtain a text feature vector corresponding to the text to be classified;

inputting the text feature vector into a pre-trained text classification model to obtain the classification result that the text classification model outputs for the text to be classified; the classification result characterizes the attack intention class of the text to be classified;

wherein the text classification model is obtained by calculating bias correction weights from a training text and a test text, correcting the training text with the bias correction weights, and training a classifier on the corrected training text, the training text and the test text having different sources.
2. The method according to claim 1, wherein the method further comprises:
acquiring the training text and the test text;
respectively processing the training text and the test text by using a language processing model to obtain a training feature vector and a test feature vector;
mapping the distribution of the training feature vector and the distribution of the test feature vector into a reproducing kernel Hilbert space to obtain a training mapping distribution and a test mapping distribution, respectively;
calculating the bias correction weight from the training mapping distribution and the test mapping distribution; and
performing bias correction on the training feature vector using the bias correction weight, and training a classifier with the bias-corrected training feature vector to obtain the text classification model.
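By way of illustration only, claim 2 reads as a covariate-shift-style training pipeline. A minimal sketch follows, assuming NumPy feature vectors and a stand-in scikit-learn classifier; the helper compute_bias_correction_weights is the claim-3 step, sketched after claim 3 below.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_text_classification_model(train_vecs, train_labels, test_vecs):
    # Claim-3 step: bias correction weights from mean matching in an RKHS
    # (see the sketch following claim 3).
    beta = compute_bias_correction_weights(train_vecs, test_vecs)
    # Claim-5 step: correct each training feature vector by its weight.
    corrected = beta[:, None] * train_vecs
    # Train a classifier on the corrected vectors; logistic regression is a
    # stand-in here, while claim 6 uses a sigmoid single-layer network.
    return LogisticRegression(max_iter=1000).fit(corrected, train_labels)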
3. The method of claim 2, wherein said calculating the bias correction weight from the training mapping distribution and the test mapping distribution comprises:
obtaining the bias correction weight by calculation according to the formula min‖E[β(x)φ(x₁)] - E[φ(x₂)]‖;
wherein β(x) is the bias correction weight to be calculated; φ(x₁) is the training mapping distribution; φ(x₂) is the test mapping distribution; E denotes the expectation; and ‖·‖ denotes the two-norm.
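By way of illustration only, the objective in claim 3 is the kernel mean matching criterion: choose weights β so that the weighted mean embedding of the training distribution matches the mean embedding of the test distribution in the RKHS. A minimal sketch follows, assuming an RBF kernel and a small ridge term for numerical stability; neither choice is fixed by the claim.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def compute_bias_correction_weights(X_train, X_test, gamma=1.0, ridge=1e-3):
    n_tr, n_te = len(X_train), len(X_test)
    K = rbf_kernel(X_train, X_train, gamma=gamma)            # <phi(x1), phi(x1)>
    kappa = rbf_kernel(X_train, X_test, gamma=gamma).sum(axis=1)
    # Setting the gradient of the empirical objective
    #   || (1/n_tr) sum_i beta_i phi(x1_i) - (1/n_te) sum_j phi(x2_j) ||^2
    # to zero gives beta = (n_tr / n_te) * (K + ridge*I)^(-1) * kappa.
    beta = (n_tr / n_te) * np.linalg.solve(K + ridge * np.eye(n_tr), kappa)
    return np.clip(beta, 0.0, None)                          # keep weights non-negative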
4. The method of claim 1, wherein said processing the text to be classified comprises:
encoding the text to be classified by using a BERT model to obtain sentence representation vectors corresponding to the text to be classified;
processing the sentence representation vectors through a first-layer Transformer encoder substructure, and inputting the processing result into the next-layer Transformer encoder substructure; and
summing the results output by the last-layer Transformer encoder substructure to obtain a chapter representation corresponding to the text to be classified.
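By way of illustration only, the chapter representation of claim 4 can be sketched as BERT sentence vectors fed through stacked Transformer encoder layers whose last-layer outputs are summed. The [CLS]-vector pooling, the layer sizes, and the two-layer depth are assumptions made for the sketch.

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed encoder
bert = BertModel.from_pretrained("bert-base-chinese")
layer = nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)
encoder_stack = nn.TransformerEncoder(layer, num_layers=2)      # stacked substructures

def chapter_representation(sentences):
    vecs = []
    for s in sentences:
        inputs = tokenizer(s, return_tensors="pt", truncation=True)
        with torch.no_grad():
            vecs.append(bert(**inputs).last_hidden_state[:, 0])  # sentence vector
    seq = torch.stack(vecs, dim=1)        # (1, num_sentences, 768)
    out = encoder_stack(seq)              # each layer feeds the next layer
    return out.sum(dim=1).squeeze(0)      # sum of last-layer outputs (chapter vector)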
5. The method of claim 2, wherein said performing bias correction on the training feature vector using the bias correction weight comprises:
calculating the product of the bias correction weight and the training feature vector to obtain the corrected training feature vector.
6. The method according to claim 1, wherein said inputting the text feature vector into a pre-trained text classification model to obtain a classification result, output by the text classification model, for the text to be classified comprises:
classifying the text feature vector through a single-layer feed-forward network with a sigmoid activation function to obtain the classification result.
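By way of illustration only, the claim-6 classification head is a single linear layer followed by a sigmoid. The input width (768) and the number of classes (14, matching the categories listed in claim 7) are assumptions made for the sketch.

import torch
import torch.nn as nn

class SingleLayerClassifier(nn.Module):
    def __init__(self, dim=768, num_classes=14):
        super().__init__()
        self.linear = nn.Linear(dim, num_classes)

    def forward(self, feature_vector):
        # Sigmoid yields an independent score per attack-intention category.
        return torch.sigmoid(self.linear(feature_vector))

# scores = SingleLayerClassifier()(feature_vector)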
7. The method of any one of claims 1-6, wherein the classification result comprises at least one of: reconnaissance, resource development, initial access, execution, persistence, privilege escalation, defense evasion, credential access, discovery, lateral movement, collection, command and control, exfiltration, and impact.
8. A text classification device, comprising:
the text processing module is used for processing the text to be classified to obtain text feature vectors corresponding to the text to be classified;
the classification module is used for inputting the text feature vector into a pre-trained text classification model to obtain a classification result, output by the text classification model, for the text to be classified, the classification result representing the attack intention classification of the text to be classified;
wherein the text classification model is obtained by computing bias correction weights from a training text and a test text, correcting the training text with the bias correction weights, and training a classifier on the corrected training text, the training text and the test text being drawn from different sources.
9. An electronic device, comprising: a processor, a memory, and a bus, wherein,
the processor and the memory communicate with each other via the bus; and
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1-7.
10. A non-transitory computer readable storage medium storing computer instructions which, when executed by a computer, cause the computer to perform the method of any of claims 1-7.
CN202311443529.6A 2023-11-01 2023-11-01 Text classification method, device, electronic equipment and storage medium Pending CN117407530A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311443529.6A CN117407530A (en) 2023-11-01 2023-11-01 Text classification method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311443529.6A CN117407530A (en) 2023-11-01 2023-11-01 Text classification method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117407530A true CN117407530A (en) 2024-01-16

Family

ID=89495968

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311443529.6A Pending CN117407530A (en) 2023-11-01 2023-11-01 Text classification method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117407530A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination