
CN112632966A - Alarm information marking method, device, medium and equipment

Info

Publication number
CN112632966A
Authority
CN
China
Prior art keywords
alarm information, context text, context, determining, alarm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011614604.7A
Other languages
Chinese (zh)
Other versions
CN112632966B (en)
Inventor
张润滋
刘文懋
陈磊
薛见新
吴复迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nsfocus Technologies Inc
Nsfocus Technologies Group Co Ltd
Original Assignee
Nsfocus Technologies Inc
Nsfocus Technologies Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Nsfocus Technologies Inc, Nsfocus Technologies Group Co Ltd
Priority to CN202011614604.7A
Publication of CN112632966A
Application granted
Publication of CN112632966B
Legal status: Active
Anticipated expiration

Classifications

    • G06F 40/216 — Handling natural language data; natural language analysis; parsing using statistical methods
    • G06F 40/30 — Handling natural language data; semantic analysis
    • G06N 20/00 — Computing arrangements based on specific computational models; machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Alarm Systems (AREA)

Abstract

The invention relates to an alarm information marking method, apparatus, medium and device. The method uses a pre-trained LDA model to determine the topic distribution vector corresponding to the type of the currently received alarm information, and the topic distribution vector corresponding to each context text associated with that alarm information (each context text is built from alarm statements formed from the alarm information and the types of the alarm information associated with it). The Euclidean distance between these topic distribution vectors then measures the semantic deviation between the type of the currently received alarm information and each of its corresponding context texts. When a Euclidean distance value is large, the semantic deviation between the alarm type and the corresponding context text is considered large, a corresponding context anomaly label is generated for the currently received alarm information, and the alarm is flagged as possibly being high-risk alarm information with respect to that context text.

Description

Alarm information marking method, device, medium and equipment
Technical Field
The present invention relates to the field of network security technologies, and in particular, to a method, an apparatus, a medium, and a device for marking alarm information.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
A major challenge faced by security operation centers today is implementing network security management under the constraints of limited manpower and cost. The volume of alarm information to be processed far exceeds the manual processing capacity of security operators, which causes serious alarm fatigue: network security cannot be maintained effectively, and operators come to distrust the alarm information itself, further reducing network security.
In order to reduce the occurrence of this "alarm fatigue" phenomenon, existing schemes classify and grade alarm information through rule-driven static classification, experience-driven black and white lists, or simple data frequency statistics. The goal is to distinguish, within a large volume of alarms, high-risk alarm information (alarm information whose corresponding attack has a high impact on system security) from low-risk alarm information (alarm information whose corresponding attack has a low impact on system security), so that high-risk alarms are discovered and security operators can handle them effectively and in a targeted manner.
However, current schemes for discovering high-risk alarm information cannot identify it in a timely and accurate manner, so the best opportunity to capture a threat is missed, leaving serious hidden dangers for the stable operation of the data assets and IT systems of enterprises and organizations.
Disclosure of Invention
The embodiments of the present invention provide an alarm information marking method, apparatus, medium, and device, which are used to solve the problem of poor timeliness and accuracy when discovering high-risk alarm information among alarm information.
In a first aspect, the present invention provides an alarm information marking method, where the method includes:
if it is determined that the type of the currently received first alarm information belongs to one of the alarm information types used for training a pre-trained latent Dirichlet allocation (LDA) model, determining second alarm information received within a set time length before the first alarm information is received;
determining at least one context text corresponding to the first alarm information according to the first alarm information and the second alarm information, and determining, by using the LDA model, the topic distribution vector corresponding to each context text treated as a document;
determining the topic distribution vector corresponding to the type of the first alarm information treated as a word of the LDA model, and respectively determining the Euclidean distance values between that topic distribution vector and the topic distribution vector corresponding to each context text treated as a document;
and if at least one Euclidean distance value is greater than a set value, generating, for the first alarm information, a context anomaly label corresponding to each Euclidean distance value greater than the set value.
Optionally, the method further includes: for each Euclidean distance value greater than the set value, obtaining, according to a prior manual labeling result for the LDA model, the semantic description corresponding to a specified topic, where the specified topic is the topic corresponding to the context text associated with that Euclidean distance value; and outputting the Euclidean distance value, the context anomaly label corresponding to it, and the semantic description of the specified topic corresponding to it;
where the topic corresponding to a context text is determined from the topic distribution vector of that context text treated as a document.
Optionally, the at least one context text includes a source context text, a destination context text, and a source-destination context text;
the source context text is formed from the alarm statements whose source Internet Protocol address is the same as that of the first alarm information;
the destination context text is formed from the alarm statements whose destination Internet Protocol address is the same as that of the first alarm information;
and the source-destination context text is formed from the alarm statements whose source and destination Internet Protocol addresses are both the same as those of the first alarm information.
Optionally, determining, by using the LDA model, the topic distribution vector corresponding to each context text treated as a document includes:
determining a vector corresponding to each context text, and determining, by using the LDA model and according to that vector, the topic distribution vector corresponding to the context text treated as a document;
where the vector length of a context text is the number of alarm information types used for training the LDA model, and the vector values are the weights, within that context text, of each alarm information type used for training the LDA model, obtained with a term frequency-inverse document frequency (TF-IDF) model.
Optionally, after determining the at least one context text corresponding to the first alarm information according to the first alarm information and the second alarm information, and before determining, by using the LDA model, the topic distribution vector corresponding to each context text treated as a document, the method further includes:
if the length of at least one of the determined context texts is smaller than a threshold, increasing the set time length and returning to the step of determining second alarm information received within the set time length before the first alarm information is received.
Optionally, the method further includes:
and if the amount of alarm information whose type does not belong to the alarm information types used for training the pre-trained LDA model reaches a threshold, prompting that the LDA model needs to be retrained.
Optionally, the method further includes:
according to the first alarm information and the second alarm information, determining, for each piece of second alarm information, at least one context text corresponding to that second alarm information, and determining, by using the LDA model, the topic distribution vector corresponding to each context text treated as a document;
determining the topic distribution vector corresponding to the type of the second alarm information treated as a word of the LDA model, and respectively determining the Euclidean distance values between that topic distribution vector and the topic distribution vector corresponding to each context text treated as a document;
if at least one Euclidean distance value is greater than the set value, generating, for the second alarm information, a context anomaly label corresponding to each Euclidean distance value greater than the set value;
where each context text corresponding to a piece of second alarm information is formed from the alarm statements corresponding to the first alarm information and to that piece of second alarm information.
In a second aspect, the present invention further provides an alarm information marking apparatus, including:
an analysis module, configured to: if it is determined that the type of the currently received first alarm information belongs to one of the alarm information types used for training a pre-trained latent Dirichlet allocation (LDA) model, determine second alarm information received within a set time length before the first alarm information is received; determine at least one context text corresponding to the first alarm information according to the first alarm information and the second alarm information, and determine, by using the LDA model, the topic distribution vector corresponding to each context text treated as a document; and determine the topic distribution vector corresponding to the type of the first alarm information treated as a word of the LDA model, and respectively determine the Euclidean distance values between that topic distribution vector and the topic distribution vector corresponding to each context text treated as a document;
and a marking module, configured to generate, for the first alarm information, a context anomaly label corresponding to each Euclidean distance value greater than a set value if at least one Euclidean distance value is greater than the set value.
In a third aspect, the present invention also provides a non-volatile computer storage medium storing an executable program for execution by a processor to implement the method as described above.
In a fourth aspect, the present invention further provides an alarm information marking device, including a processor, a communication interface, a memory and a communication bus, where the processor, the communication interface and the memory communicate with one another through the communication bus;
the memory is used for storing a computer program;
the processor, when executing the program stored in the memory, is configured to implement the method steps as described above.
According to the scheme provided by the embodiments of the invention, a pre-trained LDA model can be used to determine the topic distribution vector corresponding to the type of the currently received alarm information, and the topic distribution vector corresponding to each context text associated with that alarm information (each context text is built from alarm statements formed from the alarm information and the types of the alarm information associated with it). The semantic deviation between the type of the currently received alarm information and each of its corresponding context texts is then measured by the Euclidean distance between the topic distribution vectors. When a Euclidean distance value is large, the semantic deviation between the alarm type and the corresponding context text is considered large, a corresponding context anomaly label is generated for the currently received alarm information, and the alarm is flagged as possibly being high-risk alarm information with respect to that context text. Because the label is generated in real time for the currently received alarm information, high-risk alarm information is discovered in a timely manner. And because the semantic deviation between the alarm type and each corresponding context text is measured by the Euclidean distance between topic distribution vectors before deciding whether a context anomaly label needs to be generated, the accuracy of high-risk alarm discovery is effectively ensured.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart of an alarm information marking method according to an embodiment of the present invention;
FIG. 2 is a diagram of an alarm statement provided in an embodiment of the present invention;
FIG. 3 is a diagram of context text provided by an embodiment of the present invention;
FIG. 4 is a schematic flowchart of obtaining a trained LDA model according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an alarm information tagging device according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an alarm information tagging device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that, the "plurality" or "a plurality" mentioned herein means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The terms "first," "second," and the like in the description and in the claims, and in the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein.
Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The inventor of the present application has found that low-risk alarm information, such as scanning-type and informational alarm information, often has a relatively stable context; that is, alarm information of the same types tends to appear together. If a piece of alarm information is anomalous among its associated alarm information, it is often high-risk alarm information hidden within large-scale alarm data and may correspond to a real attack action by an attacker. The present application screens high-risk alarm information based on this finding.
To improve the timeliness and accuracy of discovering high-risk alarm information and to distinguish high-risk from low-risk alarms quickly and accurately, an effective means is needed to characterize the context in which an alarm is generated, for example, which alarm information is generated before and after a given alarm. The behavioral environment in which alarm information is generated needs to be described as a whole through modeling and quantification.
A latent Dirichlet allocation (LDA) model is a document topic generation model. It is a three-layer Bayesian probability model with a word, topic, and document structure. In a document topic generation model, each word of a document is considered to be obtained by first selecting a topic with a certain probability and then selecting a word from that topic with a certain probability: the document-to-topic distribution and the topic-to-word distribution both follow multinomial distributions. The purpose of the LDA model is to identify topics, i.e., to transform the document-word matrix into a document-topic matrix (distribution) and a topic-word matrix (distribution).
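For background, the standard LDA generative process that the above paragraph summarizes can be written as follows (this is general LDA notation with the usual Dirichlet hyperparameters α and β, not notation defined by this patent):

    \begin{aligned}
    &\varphi_k \sim \mathrm{Dirichlet}(\beta), \quad k = 1,\dots,K  &&\text{(word distribution of topic } k\text{)}\\
    &\theta_d \sim \mathrm{Dirichlet}(\alpha), \quad d = 1,\dots,N  &&\text{(topic distribution of document } d\text{)}\\
    &z_{d,i} \sim \mathrm{Multinomial}(\theta_d)                    &&\text{(topic of the } i\text{-th word of document } d\text{)}\\
    &w_{d,i} \sim \mathrm{Multinomial}(\varphi_{z_{d,i}})           &&\text{(the observed word)}
    \end{aligned}

In the setting of this patent, a "word" is an alarm information type, a "document" is a context text, and the document-level vector θ_d is the topic distribution vector used in the steps below.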
In this application, the type of an alarm (which may be any type, such as a scanning type, an informational type, or an exploit type) is treated as a word, a sequence of alarm types is treated as a sentence, and a set of alarm type sequences is treated as a document. A semantic model of alarm type sequences is thus established, and the latent contextual semantic relations are learned from a corpus through an LDA model, so that the trained LDA model can be used to dynamically identify, classify, and screen high-risk alarm information. Of course, a network security expert may further classify or grade the alarm information based on the identification result.
The embodiment of the invention provides an alarm information marking method. The flow of the method may be as shown in Fig. 1, and the method includes the following steps:
step 101, determining whether the type of the currently received alarm information belongs to one of the types of alarm information for LDA model training trained in advance.
In order to mark the warning information by using the pre-trained LDA model, first, it may be determined whether the type of the currently received warning information (first warning information) belongs to one of the types of the warning information for training the pre-trained LDA model (that is, it is determined whether the type of the currently received first warning information belongs to one of the words corresponding to the pre-trained LDA model), if so, the step 102 is continuously performed, otherwise, it may be determined that the currently received warning information cannot be marked by using the pre-trained LDA model, and the process may be ended.
The currently received alarm information may also be understood as the currently generated alarm information, that is, in this embodiment, the real-time marking may be performed on the alarm information generated in real time, so as to ensure the real-time property of the high-risk alarm data discovery.
Step 102, determining the second alarm information received within a set time length before the first alarm information is received.
If the type of the first alarm information is determined to belong to one of the alarm information types used for training the pre-trained LDA model, then in this step the alarm information received within a set time length before the first alarm information (the second alarm information) can be determined. This part of the alarm information serves as the associated alarm information of the currently received alarm, from which the context texts corresponding to the currently received alarm are determined.
Step 103, determining at least one context text corresponding to the first alarm information according to the first alarm information and the second alarm information.
In this embodiment, each context text corresponding to the first alarm information is formed from the alarm statements corresponding to the first alarm information and the second alarm information. An alarm statement is formed, for the alarm information sharing the same source Internet Protocol (IP) address and destination IP address, by arranging the types of those alarms in time order.
That is, an alarm statement is the sequence of alarm types, ordered from the earliest to the latest receiving time, of the alarms having the same source IP address and destination IP address.
In this embodiment, a context text corresponding to the first alarm information can be obtained by aggregating the alarm statements corresponding to the first and second alarm information with any aggregation method and at any granularity, as long as the resulting context text represents context information in some meaningful sense.
In a possible implementation, three context texts corresponding to the first alarm information may be determined in this step: a source context text, a destination context text, and a source-destination context text. Whether the alarm information is high-risk can then be determined from multiple dimensions according to this multi-dimensional context.
The source context text is formed from the alarm statements whose source IP address is the same as that of the first alarm information. It can be understood as the context of the alarms triggered by attacks initiated by one attacker, and reflects the attacker's methods.
The destination context text is formed from the alarm statements whose destination IP address is the same as that of the first alarm information. It can be understood as the context of the alarms triggered by attacks against one server, and reflects the service characteristics and associated vulnerabilities of the attacked server.
The source-destination context text is formed from the alarm statements whose source and destination IP addresses are both the same as those of the first alarm information. It can be understood as the context of the alarms triggered by attacks initiated by one attacker against one server.
Taking as an example the determination of the source context text, the destination context text, and the source-destination context text corresponding to the first alarm information from the first and second alarm information, each context text may be determined as follows:
For the first and second alarm information, the alarms sharing the same source IP address and destination IP address are grouped, and the types of the alarms in each group are arranged in time order to form an alarm statement, yielding at least one alarm statement. A schematic diagram of the alarm statements so formed is shown in Fig. 2 (Fig. 2 contains five alarm statements).
In Fig. 2, each block represents one piece of alarm information, blocks with different fill patterns represent different alarm types (the block outlined by a dotted line represents the first alarm information), each dot represents an IP address, and the arrow points from the source IP address to the destination IP address. An alarm statement is formed by the alarm types represented by the blocks arranged between two dots, in the time order in which they were generated.
The formed alarm statements can then be aggregated using the source IP, the destination IP, and the source-destination IP pair as keys, respectively, to form context texts of different granularities. A schematic diagram of the resulting context texts is shown in Fig. 3.
The alarm statements whose source IP address is the same as that of the first alarm information are aggregated to form the source context text corresponding to the first alarm information.
The alarm statements whose destination IP address is the same as that of the first alarm information are aggregated to form the destination context text corresponding to the first alarm information.
The alarm statements whose source and destination IP addresses are both the same as those of the first alarm information are aggregated to form the source-destination context text corresponding to the first alarm information.
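As an illustrative sketch of this grouping and aggregation, the following Python code builds alarm statements keyed by (source IP, destination IP) and then assembles the three context texts. The Alarm record and its field names are assumptions made for illustration only; the patent does not prescribe a data format.

    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass
    class Alarm:                       # hypothetical record layout
        ts: float                      # receiving time
        src_ip: str
        dst_ip: str
        alarm_type: str                # e.g. "scan", "exploit", ...

    def build_alarm_statements(alarms):
        """Group alarms by (source IP, destination IP) and order each group's types by time."""
        statements = defaultdict(list)
        for a in sorted(alarms, key=lambda a: a.ts):
            statements[(a.src_ip, a.dst_ip)].append(a.alarm_type)
        return statements              # {(src, dst): [type_1, type_2, ...]}

    def build_context_texts(statements, first_alarm):
        """Aggregate alarm statements into source / destination / source-destination context texts."""
        src_ctx, dst_ctx, src_dst_ctx = [], [], []
        for (src, dst), types in statements.items():
            if src == first_alarm.src_ip:
                src_ctx.extend(types)
            if dst == first_alarm.dst_ip:
                dst_ctx.extend(types)
            if src == first_alarm.src_ip and dst == first_alarm.dst_ip:
                src_dst_ctx.extend(types)
        return {"source": src_ctx, "destination": dst_ctx, "source-destination": src_dst_ctx}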
Step 104, determining, by using the LDA model, the topic distribution vector corresponding to each context text treated as a document.
In this step, each context text corresponding to the first alarm information is taken as input to the pre-trained LDA model, and the topic distribution vector corresponding to each context text, treated as a document, is determined.
Each context text may first be vectorized, and then, according to the vector corresponding to each context text, the pre-trained LDA model is used to determine the topic distribution vector of that context text treated as a document.
That is, in this step, the vector corresponding to each context text is determined, and the pre-trained LDA model is used, according to that vector, to determine the topic distribution vector of the context text treated as a document.
In a possible implementation, the vector length of a context text is the number of alarm information types used for training the pre-trained LDA model, and the vector values are the weights, within that context text, of each alarm information type used for training, obtained with a term frequency-inverse document frequency (TF-IDF) model.
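A possible realization of this vectorization, assuming scikit-learn's TfidfVectorizer and a fixed vocabulary consisting of the alarm types used to train the LDA model (the library choice and the example type names are assumptions, not part of the patent):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Vocabulary = the alarm types used for LDA training, so the vector length
    # equals the number of trained alarm types (type names are illustrative).
    alarm_type_vocab = ["scan", "info", "exploit", "ddos", "steal"]

    vectorizer = TfidfVectorizer(vocabulary=alarm_type_vocab, tokenizer=str.split,
                                 token_pattern=None, lowercase=False)

    # IDF statistics are fitted on the training corpus of context texts,
    # where each context text is a list of alarm-type tokens.
    training_corpus = [["scan", "scan", "info"], ["exploit", "scan"], ["ddos", "steal", "ddos"]]
    vectorizer.fit(" ".join(doc) for doc in training_corpus)

    def context_text_to_vector(context_text_types):
        """TF-IDF weight of every trained alarm type within one context text."""
        return vectorizer.transform([" ".join(context_text_types)]).toarray()[0]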
Step 105, determining the topic distribution vector corresponding to the type of the first alarm information, and respectively determining the Euclidean distance values between that vector and the topic distribution vector of each context text.
After the LDA model is trained, the matrix of distributions between each alarm type used for training and each topic is available. In this step, this matrix can be queried to determine the topic distribution vector corresponding to the type of the first alarm information, treated as a word of the pre-trained LDA model.
Considering that low-risk alarm information such as scanning-type and informational alarms tends to have a relatively stable context, if the semantic deviation of an alarm type with respect to a context text (which can also be understood as the topic corresponding to that context text) is large, the alarm is often high-risk alarm information hidden within a large volume of alarms and may correspond to a real attack action by an attacker. The semantic deviation of an alarm's type relative to a context text can therefore be measured to determine whether the alarm is high-risk.
In this step, with the type of the first alarm information treated as a word of the pre-trained LDA model, the Euclidean distance between its topic distribution vector and the topic distribution vector of each context text treated as a document is determined.
Each Euclidean distance value (which may be denoted L_K) can be understood as representing the degree of semantic deviation of the type of the first alarm information with respect to the context text corresponding to that distance value.
Step 106, if at least one Euclidean distance value is greater than the set value, generating, for the first alarm information, a context anomaly label corresponding to each Euclidean distance value greater than the set value.
In this step, for each Euclidean distance value greater than the set value, a corresponding context anomaly label can be generated for the first alarm information, indicating that the semantic deviation between the type of the currently received alarm and the context text corresponding to that Euclidean distance value is high and that the alarm may be high-risk alarm information.
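A rough sketch of steps 104-106 using gensim-style LDA calls is given below. get_topics, get_document_topics, and doc2bow are existing gensim APIs; the column-normalization used to turn a word's topic-word probabilities into a distribution over topics, the threshold value, and the label format are assumptions for illustration.

    import numpy as np
    from gensim import corpora, models

    def topic_vector_of_type(lda: models.LdaModel, dictionary: corpora.Dictionary, alarm_type: str):
        """Topic distribution vector of an alarm type treated as a word (a column of the topic-word matrix)."""
        word_id = dictionary.token2id[alarm_type]
        col = lda.get_topics()[:, word_id]      # shape (K,): P(word | topic) for every topic
        return col / col.sum()                  # normalize into a distribution over topics

    def topic_vector_of_context(lda: models.LdaModel, dictionary: corpora.Dictionary, context_types):
        """Topic distribution vector of a context text treated as a document."""
        bow = dictionary.doc2bow(context_types)
        probs = dict(lda.get_document_topics(bow, minimum_probability=0.0))
        return np.array([probs.get(k, 0.0) for k in range(lda.num_topics)])

    def label_context_anomalies(lda, dictionary, alarm_type, context_texts, threshold=0.8):
        """Generate a context anomaly label for every context whose Euclidean distance exceeds the threshold."""
        w = topic_vector_of_type(lda, dictionary, alarm_type)
        labels = {}
        for name, types in context_texts.items():
            d = float(np.linalg.norm(w - topic_vector_of_context(lda, dictionary, types)))
            if d > threshold:                   # the set value here is a placeholder
                labels[name] = {"context_anomaly": True, "euclidean_distance": d}
        return labels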
Further, the present embodiment may further include the following steps:
and step 107, outputting relevant prompt information aiming at each Euclidean distance value larger than the set value.
In this step, for each euclidean distance value greater than the set value, according to a pre-manual labeling result of the pre-trained LDA model, a semantic description corresponding to a specified topic may be obtained, where the specified topic is a topic corresponding to a context text corresponding to the euclidean distance value. And further outputting the Euclidean distance value, a context abnormal label corresponding to the Euclidean distance value and semantic description corresponding to a specified subject corresponding to the Euclidean distance value, and prompting a context text with higher type semantic deviation degree of the currently received alarm information, the actual alarm meaning of the subject of the context text, and the semantic deviation degree of the type of the currently received alarm information and the context text.
And determining the theme corresponding to one context text according to the context text as the theme distribution vector corresponding to the document. It can be understood that, in the topic distribution vector corresponding to a context text, the corresponding topic with the highest generation probability is the topic corresponding to the context text.
The output related prompt information can realize or assist safety operators to realize classification or classification filtering of the alarm information, and quickly discover the alarm information with the highest risk and most concern.
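A small sketch of this output step; the topic_descriptions mapping stands in for the manual labeling result of the trained LDA model, and its contents are invented placeholders:

    import numpy as np

    # Hypothetical manual labeling result: topic index -> semantic description.
    topic_descriptions = {0: "host scanning activity", 1: "exploit attempt", 2: "DDoS attack"}

    def describe_anomaly(context_topic_vector, euclidean_distance):
        """Pick the most probable topic of the context text and attach its manually labeled description."""
        topic_id = int(np.argmax(context_topic_vector))
        return {
            "euclidean_distance": euclidean_distance,
            "context_anomaly": True,
            "topic": topic_id,
            "semantic_description": topic_descriptions.get(topic_id, "unlabeled topic"),
        }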
It should be noted that, in a possible implementation, to ensure the accuracy of determining whether the currently received alarm is high-risk, step 103' may be performed after step 103 and before step 104:
Step 103', judging whether the length of at least one of the determined context texts is smaller than a threshold.
If the length of at least one of the determined context texts is smaller than the threshold, the set time length can be increased and the process returns to step 102; otherwise, step 104 is performed. This avoids context texts that are too short, and sufficiently long context texts ensure the accuracy of judging whether the currently received alarm is high-risk alarm information.
It should also be noted that, in this embodiment, if the amount of alarm information whose type does not belong to the alarm information types used for training the pre-trained LDA model reaches a threshold, a prompt can be issued that the LDA model needs to be retrained.
That is, if a large amount of alarm information is found whose types are not among the alarm types used for LDA model training, the LDA model can be trained again, so that the trained model better fits the alarm information and has wider applicability.
In addition, when new alarm information arrives, it changes the contextual semantics of the alarm information associated with it, so the context anomaly of an alarm is a continuous evaluation process. Besides the real-time evaluation of whether an alarm is high-risk, a dynamic re-evaluation of whether previously received alarms are high-risk can also be carried out.
In this embodiment, after the first alarm information is received, whether a context anomaly label needs to be generated for each piece of second alarm information can be determined again.
Specifically, according to the first alarm information and the second alarm information, at least one context text corresponding to each piece of second alarm information is determined, and the pre-trained LDA model is used to determine the topic distribution vector of each such context text treated as a document.
The topic distribution vector corresponding to the type of the second alarm information, treated as a word of the pre-trained LDA model, is determined, and the Euclidean distance values between that vector and the topic distribution vector of each context text treated as a document are determined respectively.
If at least one Euclidean distance value is greater than the set value, a context anomaly label corresponding to each Euclidean distance value greater than the set value is generated for the second alarm information.
Each context text corresponding to a piece of second alarm information is formed from the alarm statements corresponding to the first alarm information and to that piece of second alarm information.
Next, a training process of the LDA model will be described.
Firstly, obtaining a training sample.
The training process for the LDA model first requires obtaining training samples.
In the process of obtaining training samples, the batch of alarm information used for training may be grouped, according to the time at which each alarm was received, by a set period T.
For each group of alarm information, the alarms sharing the same source IP address and destination IP address are arranged by type in time order to form an alarm statement, yielding at least one alarm statement per group. The alarm types of the training alarms correspond to the words of the LDA model, and each alarm statement can be understood as a sentence formed from those words.
After the alarm statements are formed for each group, at least one context text corresponding to each alarm in the group can be determined.
Each context text corresponding to an alarm is formed from the alarm statements of the group in which that alarm is located. Each context text can be understood as a document of the LDA model.
In a possible implementation, three context texts can be determined for each alarm: a source context text, a destination context text, and a source-destination context text, yielding a corpus of multi-dimensional context texts assembled from the alarm statements.
After the corpus of context texts is obtained, each context text in the corpus may be vectorized to obtain a corpus represented by vectors.
The vector length of each context text is the number of alarm types used for training, and the vector values are the weights, within that context text, of each training alarm type, obtained with the TF-IDF model.
In a possible implementation, after the context texts are obtained, they may be preprocessed to ensure the accuracy of the trained LDA model.
For example, for an alarm type whose frequency in the training batch is lower than a set frequency, each context text corresponding to that type in the corpus may be duplicated to obtain a set number of context texts. Alternatively, alarm types whose frequency of occurrence is lower than the set frequency may simply be discarded (in which case no context text is obtained for that type in the corpus).
For another example, when an alarm type is repeated consecutively within an alarm statement of a context text in the corpus, only the first occurrence of the run may be retained and the subsequent repeated occurrences deleted.
For another example, each context text in the corpus whose length is smaller than a set length may be spliced, in time order, with the corresponding context text of the adjacently received alarm. For instance, if the source context text of alarm 1 is shorter than the set length, it may be spliced with the source context text of alarm 2, received adjacently before or after alarm 1.
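The consecutive-repeat collapsing described above can be implemented, for example, with itertools.groupby; the other two preprocessing steps (duplicating context texts of rare types and splicing short context texts) would follow the same list-manipulation style. This is a sketch, not the patent's prescribed implementation.

    from itertools import groupby

    def collapse_consecutive_repeats(alarm_statement):
        """Keep only the first of each run of identical consecutive alarm types."""
        return [alarm_type for alarm_type, _run in groupby(alarm_statement)]

    # Example: a burst of repeated scan alarms collapses to a single token.
    assert collapse_consecutive_repeats(["scan", "scan", "scan", "exploit", "scan"]) == ["scan", "exploit", "scan"]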
And secondly, training the LDA model.
After the corpus represented by vectors is obtained, the number of topics of the LDA model is set (for example, to K), the vector corresponding to each context text is used as one training sample in the training sample set, and the pre-established LDA model is trained in an unsupervised manner to obtain the trained LDA model.
From the context texts, the trained LDA model learns the generation probabilities between topics and words (alarm types) and between topics and documents (context texts).
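A minimal training sketch with gensim, assuming the corpus is the set of context texts built as above and K is the chosen number of topics; the hyperparameters and number of passes are illustrative defaults, not values given by the patent.

    from gensim import corpora, models

    def train_alarm_lda(context_texts, num_topics=10, passes=20):
        """context_texts: list of context texts, each a list of alarm-type tokens."""
        dictionary = corpora.Dictionary(context_texts)              # words = alarm types
        bow_corpus = [dictionary.doc2bow(doc) for doc in context_texts]
        lda = models.LdaModel(corpus=bow_corpus, id2word=dictionary,
                              num_topics=num_topics, passes=passes, random_state=0)
        return lda, dictionary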
The generation probabilities learned by the trained LDA model between the topics (which may be denoted T_1, T_2, ..., T_K) and the words (which may be denoted W_1, W_2, ..., W_M) may be as shown in Table 1, assuming the number of alarm types in the training batch is M.
TABLE 1

        T1      T2      T3      ……      TK
W1      0.1     0.1     0.6     ……      0.1
W2      0.4     0.1     0.2     ……      0.13
……      ……      ……      ……      ……      ……
WM      0.23    0.45    0.01    ……      0.1
The generation probabilities learned by the trained LDA model between the topics and the documents (which may be denoted D_1, D_2, ..., D_N) may be as shown in Table 2, assuming the number of context texts in the corpus is N.
TABLE 2

        T1      T2      T3      ……      TK
D1      0.76    0.1     0.02    ……      0.2
D2      0.3     0.67    0.01    ……      0.03
……      ……      ……      ……      ……      ……
DN      0.01    0.1     0.2     ……      0.34
And thirdly, manually marking the trained LDA model.
In this embodiment, the number of topics of the LDA model is set to K; that is, the LDA model assumes that each document in the corpus is generated from K topics with certain probabilities. For example, a given context text might be composed of the four latent topics "host scan", "exploit", "DDoS attack", and "information stealing". Each topic in turn corresponds, with certain probabilities, to words, that is, to alarm types.
The trained LDA model is only responsible for generating the probability distributions among topics, documents, and words; the semantic description of each topic, such as its meaning and name, needs to be labeled manually. That is, after the LDA model is trained, the semantic descriptions corresponding to the K topics T_1, T_2, ..., T_K are determined through manual labeling, producing a manually labeled, trained LDA model whose topics carry actual alarm meanings.
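To support this manual labeling step, one could print each topic's highest-probability alarm types so that an analyst can assign a description to it. show_topics is an existing gensim API; the workflow around it is an assumption for illustration.

    def label_topics_interactively(lda, num_words=5):
        """Show each topic's top alarm types so an analyst can record a semantic description for it."""
        manual_labels = {}
        for topic_id, terms in lda.show_topics(num_topics=-1, num_words=num_words, formatted=False):
            print(topic_id, [(term, round(float(prob), 3)) for term, prob in terms])
            # The analyst fills this in, e.g. manual_labels[topic_id] = "host scan"
        return manual_labels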
A schematic flow diagram of the process of obtaining a trained LDA model including a manual labeling process may be shown in fig. 4.
The topic-word generation probabilities shown in Table 1 for the trained LDA model can subsequently be queried to obtain, for the currently received alarm information, the topic distribution vector of its type treated as a word of the LDA model.
In addition, each context text (that is, its corresponding vector) associated with the currently received alarm information is used as input, so that the LDA model determines the topic distribution vector of each context text treated as a document.
Furthermore, for each Euclidean distance value greater than the set value, after the topic of the corresponding context text is determined from that context text's topic distribution vector, the semantic description of the topic is obtained by querying the prior manual labeling result of the trained LDA model, so that the user learns the actual alarm meaning of the topic of the context text in which the high-risk alarm is located.
According to the scheme provided by the embodiments of the invention, the context of alarm information can be modeled and analyzed with a statistical language model, and the contextual semantics present when an alarm occurs can be described accurately, so that the internal rules of alarm contexts are mined automatically in a data-driven way. Expert labeling and an anomaly handling mechanism can then be combined to automatically identify alarm information that deviates from its contextual semantics, evaluate its degree of anomaly, and provide a context anomaly label that realizes, or assists in realizing, the classification and grading of alarm information.
High-risk or otherwise relevant alarm information can then be identified manually. This effectively addresses the inefficiency of static alarm grading schemes, improves the efficiency of alarm handling in security operations, shortens the analysis and response cycle for threat events, and improves protection capability.
The scheme can effectively screen high-risk alarm data out of a large volume of alarm data, reduce the misdirection of security operations by falsely reported high-risk alarms, and improve the signal-to-noise ratio of the high-risk alarm data of a security operation center.
Corresponding to the provided method, the following device is further provided.
An embodiment of the present invention provides an alarm information marking apparatus, whose structure may be as shown in Fig. 5. The apparatus includes:
an analysis module 12, configured to: if it is determined that the type of the currently received first alarm information belongs to one of the alarm information types used for training a pre-trained latent Dirichlet allocation (LDA) model, determine second alarm information received within a set time length before the first alarm information is received; determine at least one context text corresponding to the first alarm information according to the first alarm information and the second alarm information, and determine, by using the LDA model, the topic distribution vector corresponding to each context text treated as a document; and determine the topic distribution vector corresponding to the type of the first alarm information treated as a word of the LDA model, and respectively determine the Euclidean distance values between that topic distribution vector and the topic distribution vector corresponding to each context text treated as a document;
and a marking module 13, configured to generate, for the first alarm information, a context anomaly label corresponding to each Euclidean distance value greater than a set value if at least one Euclidean distance value is greater than the set value.
The device may further include a determining module 11:
the judging module 11 is configured to determine whether the type of the currently received first alarm information belongs to one of the alarm information types used for training the pre-trained latent Dirichlet allocation (LDA) model;
in this case, "if it is determined that the type of the currently received first alarm information belongs to one of the alarm information types used for training the pre-trained LDA model" can be understood as "if the judging module 11 determines that the type of the currently received first alarm information belongs to one of the alarm information types used for training the pre-trained LDA model".
Each context text corresponding to the first alarm information is formed from the alarm statements corresponding to the first and second alarm information, and an alarm statement is formed, for the alarm information sharing the same source and destination Internet Protocol addresses, by arranging the alarm types in time order.
Optionally, the apparatus further comprises an output module 14:
the output module 14 is configured to: for each Euclidean distance value greater than the set value, obtain, according to the prior manual labeling result for the LDA model, the semantic description corresponding to a specified topic, where the specified topic is the topic corresponding to the context text associated with that Euclidean distance value; and output the Euclidean distance value, the context anomaly label corresponding to it, and the semantic description of the specified topic corresponding to it;
where the topic corresponding to a context text is determined from the topic distribution vector of that context text treated as a document.
Optionally, the at least one context text includes a source context text, a destination context text, and a source-destination context text;
the source context text is formed from the alarm statements whose source Internet Protocol address is the same as that of the first alarm information;
the destination context text is formed from the alarm statements whose destination Internet Protocol address is the same as that of the first alarm information;
and the source-destination context text is formed from the alarm statements whose source and destination Internet Protocol addresses are both the same as those of the first alarm information.
Optionally, the analysis module 12 determining, by using the LDA model, the topic distribution vector corresponding to each context text treated as a document includes:
determining a vector corresponding to each context text, and determining, by using the LDA model and according to that vector, the topic distribution vector corresponding to the context text treated as a document;
where the vector length of a context text is the number of alarm information types used for training the LDA model, and the vector values are the weights, within that context text, of each alarm information type used for training the LDA model, obtained with the term frequency-inverse document frequency (TF-IDF) model.
Optionally, the analysis module 12 is further configured to: after determining the at least one context text corresponding to the first alarm information according to the first and second alarm information, and before determining, by using the LDA model, the topic distribution vector corresponding to each context text treated as a document, increase the set time length if the length of at least one of the determined context texts is smaller than a threshold, and return to the step of determining second alarm information received within the set time length before the first alarm information is received.
Optionally, the analysis module 12 is further configured to prompt that the LDA model needs to be retrained if the amount of alarm information whose type does not belong to the alarm information types used for training the pre-trained LDA model reaches a threshold.
Optionally, the analysis module 12 is further configured to determine, according to the first and second alarm information and for each piece of second alarm information, at least one context text corresponding to that second alarm information, and to determine, by using the LDA model, the topic distribution vector corresponding to each context text treated as a document;
and to determine the topic distribution vector corresponding to the type of the second alarm information treated as a word of the LDA model, and respectively determine the Euclidean distance values between that topic distribution vector and the topic distribution vector corresponding to each context text treated as a document;
the marking module 13 is further configured to generate, for the second alarm information, a context anomaly label corresponding to each Euclidean distance value greater than the set value if at least one Euclidean distance value is greater than the set value;
where each context text corresponding to a piece of second alarm information is formed from the alarm statements corresponding to the first alarm information and to that piece of second alarm information.
The functions of the functional units of the apparatuses provided in the above embodiments of the present invention may be implemented by the steps of the corresponding methods, and therefore, detailed working processes and beneficial effects of the functional units in the apparatuses provided in the embodiments of the present invention are not described herein again.
Based on the same inventive concept, embodiments of the present invention provide the following apparatus and medium.
The structure of the alarm information marking device provided by the embodiment of the present invention may be as shown in Fig. 6. The device includes a processor 21, a communication interface 22, a memory 23, and a communication bus 24, where the processor 21, the communication interface 22, and the memory 23 communicate with one another through the communication bus 24;
the memory 23 is used for storing computer programs;
the processor 21 is configured to implement the steps of the above method embodiments of the present invention when executing the program stored in the memory.
Optionally, the processor 21 may specifically include a central processing unit (CPU), an application-specific integrated circuit (ASIC), one or more integrated circuits for controlling program execution, a hardware circuit developed using a field-programmable gate array (FPGA), or a baseband processor.
Optionally, the processor 21 may include at least one processing core.
Optionally, the memory 23 may include a read-only memory (ROM), a random access memory (RAM), and a disk memory. The memory 23 is used to store data required by the at least one processor 21 during operation. There may be one or more memories 23.
An embodiment of the present invention further provides a non-volatile computer storage medium, where the computer storage medium stores an executable program, and when the executable program is executed by a processor, the method provided in the foregoing method embodiment of the present invention is implemented.
In particular implementations, the computer storage medium may include various storage media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In the embodiments of the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the described unit or division of units is only one division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical or other form.
The functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be an independent physical module.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the technical solutions of the embodiments of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (such as a personal computer, a server, or a network device) or a processor to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A method for marking alarm information, characterized in that the method comprises the following steps:
if it is determined that the type of currently received first alarm information belongs to one of the types of alarm information used for training a pre-trained Latent Dirichlet Allocation (LDA) model, determining second alarm information received within a set time length before the first alarm information is received;
determining at least one context text corresponding to the first alarm information according to the first alarm information and the second alarm information, and determining, by using the LDA model, a theme distribution vector corresponding to each context text taken as a document;
determining, by using the LDA model, a theme distribution vector corresponding to the type of the first alarm information taken as a word, and respectively determining a Euclidean distance value between that theme distribution vector and the theme distribution vector corresponding to each context text taken as a document;
and if at least one Euclidean distance value is greater than a set value, generating, for the first alarm information, a context abnormal label corresponding to each Euclidean distance value greater than the set value.
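By way of a non-limiting illustration of claim 1, the following sketch uses scikit-learn's LatentDirichletAllocation to infer theme distribution vectors for each context text (taken as a document) and for the alarm type (taken as a one-word document), and compares them by Euclidean distance; the alarm types, training corpus, and number of topics are assumptions, and the TF-IDF weighting of claim 4 is replaced by simple counts for brevity.

    # Sketch of the comparison in claim 1 (illustrative data; simple counts instead
    # of the TF-IDF weighting of claim 4).
    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    alarm_types = ["sql_injection", "port_scan", "brute_force", "webshell"]  # hypothetical
    vectorizer = CountVectorizer(vocabulary=alarm_types, token_pattern=r"\S+")

    # Historical context texts used to train the LDA model (each one a
    # space-separated sequence of alarm types); purely illustrative.
    train_docs = ["port_scan brute_force brute_force", "sql_injection webshell",
                  "port_scan sql_injection webshell"]
    lda = LatentDirichletAllocation(n_components=3, random_state=0)
    lda.fit(vectorizer.fit_transform(train_docs))

    def theme_vector(doc: str) -> np.ndarray:
        # Theme (topic) distribution vector of one document under the trained model.
        return lda.transform(vectorizer.transform([doc]))[0]

    first_alarm_type = "webshell"                 # type of the currently received alarm
    context_texts = {                             # built from the second alarm information
        "source": "port_scan brute_force webshell",
        "destination": "sql_injection webshell",
    }

    type_vec = theme_vector(first_alarm_type)     # the type treated as a one-word document
    distances = {name: float(np.linalg.norm(type_vec - theme_vector(text)))
                 for name, text in context_texts.items()}
    print(distances)                              # compare each value with the set value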
2. The method of claim 1, wherein the method further comprises: for each Euclidean distance value greater than the set value, obtaining, according to a manual labeling result prepared in advance for the LDA model, a semantic description corresponding to a specified theme, wherein the specified theme is the theme corresponding to the context text corresponding to that Euclidean distance value; and
outputting the Euclidean distance value, the context abnormal label corresponding to the Euclidean distance value, and the semantic description corresponding to the specified theme corresponding to the Euclidean distance value;
wherein the theme corresponding to a context text is determined according to the theme distribution vector corresponding to that context text taken as a document.
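As a sketch of claim 2 only, the mapping below from LDA topic index to a human-written semantic description stands in for the manual labeling result; the descriptions, the distance value, and the label are hypothetical.

    # Sketch of claim 2: the specified theme of a context text is taken here as its
    # highest-weight topic, and its pre-labeled semantic description is output
    # together with the Euclidean distance value and the context abnormal label.
    import numpy as np

    topic_descriptions = {        # manual labeling result for the trained LDA model (assumed)
        0: "lateral movement / internal scanning",
        1: "web application attack chain",
        2: "credential brute forcing",
    }

    def dominant_topic(theme_distribution: np.ndarray) -> int:
        return int(np.argmax(theme_distribution))

    # Hypothetical output for one Euclidean distance value above the set value:
    context_theme_vec = np.array([0.1, 0.8, 0.1])
    print({
        "euclidean_distance": 0.87,
        "context_abnormal_label": "context-anomaly:destination",
        "semantic_description": topic_descriptions[dominant_topic(context_theme_vec)],
    })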
3. The method of claim 1, wherein the at least one context text comprises a source context text, a destination context text, and a source-destination context text;
the source context text is formed according to alarm sentences whose source Internet Protocol address is the same as that of the first alarm information;
the destination context text is formed according to alarm sentences whose destination Internet Protocol address is the same as that of the first alarm information;
and the source-destination context text is formed according to alarm sentences whose source Internet Protocol address and destination Internet Protocol address are both the same as those of the first alarm information.
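A minimal sketch of claim 3 follows, assuming each alarm is represented as a dictionary with type, src_ip and dst_ip fields and that the alarm sentence is simply the alarm type; whether the first alarm's own sentence is included in each context text is also an assumption here.

    # Sketch of claim 3: the source, destination, and source-destination context
    # texts are formed by filtering alarm sentences on the first alarm's IP addresses.
    from typing import Dict, List

    def build_context_texts(first_alarm: Dict, second_alarms: List[Dict]) -> Dict[str, str]:
        src, dst = first_alarm["src_ip"], first_alarm["dst_ip"]
        contexts = {"source": [], "destination": [], "source-destination": []}
        for alarm in second_alarms + [first_alarm]:
            sentence = alarm["type"]            # alarm sentence formed from the alarm type
            if alarm["src_ip"] == src:
                contexts["source"].append(sentence)
            if alarm["dst_ip"] == dst:
                contexts["destination"].append(sentence)
            if alarm["src_ip"] == src and alarm["dst_ip"] == dst:
                contexts["source-destination"].append(sentence)
        return {name: " ".join(sentences) for name, sentences in contexts.items()}

    # Hypothetical alarms:
    first = {"type": "webshell", "src_ip": "10.0.0.5", "dst_ip": "10.0.0.9"}
    second = [{"type": "port_scan", "src_ip": "10.0.0.5", "dst_ip": "10.0.0.7"},
              {"type": "brute_force", "src_ip": "10.0.0.8", "dst_ip": "10.0.0.9"}]
    print(build_context_texts(first, second))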
4. The method of claim 1, wherein determining, by using the LDA model, the theme distribution vector corresponding to each context text taken as a document comprises:
determining a vector corresponding to each context text, and determining, by using the LDA model and according to the vector, the theme distribution vector corresponding to the context text taken as a document;
wherein the length of the vector corresponding to a context text is the number of types of alarm information used for training the LDA model, and the values of the vector are the weights, obtained according to a term frequency-inverse document frequency (TF-IDF) model, of each type of alarm information used for training the LDA model within the context text.
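A sketch of claim 4 follows, using scikit-learn's TfidfVectorizer with a fixed vocabulary so that the vector length equals the number of alarm types used to train the LDA model; the alarm types and context texts are assumptions.

    # Sketch of claim 4: each context text becomes a TF-IDF weighted vector over the
    # alarm types used for LDA training, and the LDA model then infers its theme
    # distribution vector from that vector.
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import TfidfVectorizer

    alarm_types = ["sql_injection", "port_scan", "brute_force", "webshell"]  # hypothetical
    tfidf = TfidfVectorizer(vocabulary=alarm_types, token_pattern=r"\S+")

    train_contexts = ["port_scan brute_force brute_force", "sql_injection webshell",
                      "port_scan sql_injection webshell"]
    X_train = tfidf.fit_transform(train_contexts)      # each row has length len(alarm_types)

    lda = LatentDirichletAllocation(n_components=3, random_state=0)
    lda.fit(X_train)

    new_context = "port_scan webshell webshell"
    theme_vec = lda.transform(tfidf.transform([new_context]))[0]
    print(theme_vec)                                   # theme distribution vector of the context text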
5. The method of claim 1, wherein after determining the at least one context text corresponding to the first alarm information according to the first alarm information and the second alarm information, and before determining, by using the LDA model, the theme distribution vector corresponding to each context text taken as a document, the method further comprises:
if the length of at least one of the determined context texts is smaller than a threshold value, increasing the set time length, and returning to the step of determining the second alarm information received within the set time length before the first alarm information is received.
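The loop below is a sketch of claim 5 under assumed details: the context length is measured in words, the set time length is doubled on each retry, and a maximum window bounds the retries.

    # Sketch of claim 5: when the determined context text is shorter than a threshold,
    # the set time length is increased and the second alarm information is re-collected.
    from datetime import datetime, timedelta
    from typing import Dict, List

    def second_alarms_in_window(all_alarms: List[Dict], first_time: datetime,
                                window: timedelta) -> List[Dict]:
        return [a for a in all_alarms if first_time - window <= a["time"] < first_time]

    def context_text(alarms: List[Dict]) -> str:
        return " ".join(a["type"] for a in alarms)

    def collect_with_min_length(all_alarms, first_time, window, min_words, max_window):
        while True:
            text = context_text(second_alarms_in_window(all_alarms, first_time, window))
            if len(text.split()) >= min_words or window >= max_window:
                return text, window
            window = window * 2       # increase the set time length, then re-determine

    # Hypothetical usage:
    now = datetime(2020, 12, 30, 12, 0, 0)
    alarms = [{"type": "port_scan", "time": now - timedelta(minutes=40)},
              {"type": "brute_force", "time": now - timedelta(minutes=5)}]
    print(collect_with_min_length(alarms, now, timedelta(minutes=10),
                                  min_words=2, max_window=timedelta(hours=1)))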
6. The method of claim 1, wherein the method further comprises:
and if the quantity of received alarm information whose type does not belong to one of the types of alarm information used for training the pre-trained LDA model reaches a threshold value, prompting that the LDA model needs to be retrained.
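As a sketch of claim 6 under assumed names, a simple counter of alarms whose type falls outside the LDA training vocabulary can trigger the retraining prompt:

    # Sketch of claim 6: count alarm information whose type was not used to train the
    # LDA model and prompt for retraining once a threshold is reached.
    import logging

    class RetrainMonitor:
        def __init__(self, known_types, threshold: int):
            self.known_types = set(known_types)
            self.threshold = threshold
            self.unknown_count = 0

        def observe(self, alarm_type: str) -> None:
            if alarm_type not in self.known_types:
                self.unknown_count += 1
                if self.unknown_count >= self.threshold:
                    logging.warning("LDA model needs retraining: %d alarms of unseen types",
                                    self.unknown_count)

    monitor = RetrainMonitor(known_types={"port_scan", "brute_force"}, threshold=2)
    for alarm_type in ["dns_tunnel", "port_scan", "dga_domain"]:
        monitor.observe(alarm_type)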
7. The method of any of claims 1 to 6, further comprising:
for each piece of second alarm information, determining, according to the first alarm information and the second alarm information, at least one context text corresponding to that piece of second alarm information, and determining, by using the LDA model, a theme distribution vector corresponding to each context text taken as a document;
determining, by using the LDA model, a theme distribution vector corresponding to the type of that piece of second alarm information taken as a word, and respectively determining a Euclidean distance value between that theme distribution vector and the theme distribution vector corresponding to each context text taken as a document;
if at least one Euclidean distance value is greater than the set value, generating, for that piece of second alarm information, a context abnormal label corresponding to each Euclidean distance value greater than the set value;
wherein each context text corresponding to each piece of second alarm information is formed according to the first alarm information and the alarm sentences corresponding to the pieces of second alarm information.
8. An alarm information marking apparatus, characterized in that the apparatus comprises:
an analysis module, used for: if it is determined that the type of currently received first alarm information belongs to one of the types of alarm information used for training a pre-trained Latent Dirichlet Allocation (LDA) model, determining second alarm information received within a set time length before the first alarm information is received; determining at least one context text corresponding to the first alarm information according to the first alarm information and the second alarm information, and determining, by using the LDA model, a theme distribution vector corresponding to each context text taken as a document; and determining, by using the LDA model, a theme distribution vector corresponding to the type of the first alarm information taken as a word, and respectively determining a Euclidean distance value between that theme distribution vector and the theme distribution vector corresponding to each context text taken as a document;
and a marking module, used for: if at least one Euclidean distance value is greater than a set value, generating, for the first alarm information, a context abnormal label corresponding to each Euclidean distance value greater than the set value.
9. A non-transitory computer storage medium storing an executable program for execution by a processor to perform the method of any one of claims 1 to 7.
10. An alarm information marking device, characterized in that the device comprises a processor, a communication interface, a memory and a communication bus, where the processor, the communication interface and the memory communicate with one another through the communication bus;
the memory is used for storing a computer program;
the processor is configured to implement the method steps of any one of claims 1 to 7 when executing the program stored in the memory.
CN202011614604.7A 2020-12-30 2020-12-30 Alarm information marking method, device, medium and equipment Active CN112632966B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011614604.7A CN112632966B (en) 2020-12-30 2020-12-30 Alarm information marking method, device, medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011614604.7A CN112632966B (en) 2020-12-30 2020-12-30 Alarm information marking method, device, medium and equipment

Publications (2)

Publication Number Publication Date
CN112632966A true CN112632966A (en) 2021-04-09
CN112632966B CN112632966B (en) 2023-07-21

Family

ID=75286991

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011614604.7A Active CN112632966B (en) 2020-12-30 2020-12-30 Alarm information marking method, device, medium and equipment

Country Status (1)

Country Link
CN (1) CN112632966B (en)

Citations (10)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170270096A1 (en) * 2015-08-04 2017-09-21 Yissum Research Development Company Of The Hebrew University Of Jerusalem Ltd. Method and system for generating large coded data set of text from textual documents using high resolution labeling
US20180307680A1 (en) * 2015-12-29 2018-10-25 Guangzhou Shenma Mobile Information Technology Co., Ltd. Keyword recommendation method and system based on latent dirichlet allocation model
CN106776503A (en) * 2016-12-22 2017-05-31 东软集团股份有限公司 The determination method and device of text semantic similarity
US20180293978A1 (en) * 2017-04-07 2018-10-11 Conduent Business Services, Llc Performing semantic analyses of user-generated textual and voice content
CN107423282A (en) * 2017-05-24 2017-12-01 南京大学 Semantic Coherence Sexual Themes and the concurrent extracting method of term vector in text based on composite character
CN108062307A (en) * 2018-01-04 2018-05-22 中国科学技术大学 The text semantic steganalysis method of word-based incorporation model
CN108984526A (en) * 2018-07-10 2018-12-11 北京理工大学 A kind of document subject matter vector abstracting method based on deep learning
US20200104367A1 (en) * 2018-09-30 2020-04-02 International Business Machines Corporation Vector Representation Based on Context
US20200160993A1 (en) * 2018-11-16 2020-05-21 International Business Machines Corporation Artificial Intelligence Based Alert System
CN111368532A (en) * 2020-03-18 2020-07-03 昆明理工大学 Topic word embedding disambiguation method and system based on LDA

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
程玉胜等: "基于风险决策的文本语义分类算法", 《计算机应用》, vol. 36, no. 11, pages 2963 - 2968 *

Also Published As

Publication number Publication date
CN112632966B (en) 2023-07-21

Similar Documents

Publication Publication Date Title
CN110505241B (en) Network attack plane detection method and system
CN108874777B (en) Text anti-spam method and device
US11941491B2 (en) Methods and apparatus for identifying an impact of a portion of a file on machine learning classification of malicious content
Gharge et al. An integrated approach for malicious tweets detection using NLP
CN108366045B (en) Method and device for setting wind control scoring card
KR101594452B1 (en) An apparatus for identifying a rumor of online posting
CN111866004B (en) Security assessment method, apparatus, computer system, and medium
CN112989348B (en) Attack detection method, model training method, device, server and storage medium
CN111813960A (en) Data security audit model device and method based on knowledge graph and terminal equipment
CN112487422B (en) Malicious document detection method and device, electronic equipment and storage medium
CN113657773B (en) Method and device for voice operation quality inspection, electronic equipment and storage medium
CN114826681A (en) DGA domain name detection method, system, medium, equipment and terminal
CN108804501B (en) Method and device for detecting effective information
Chua et al. Problem Understanding of Fake News Detection from a Data Mining Perspective
CN117633666A (en) Network asset identification method, device, electronic equipment and storage medium
CN117614644A (en) Malicious website identification method, electronic equipment and storage medium
CN110808947B (en) Automatic vulnerability quantitative evaluation method and system
CN109889471B (en) Structured Query Language (SQL) injection detection method and system
CN115186095B (en) Juvenile text recognition method and device
CN112632966A (en) Alarm information marking method, device, medium and equipment
CN113888760B (en) Method, device, equipment and medium for monitoring violation information based on software application
CN110414251B (en) Data monitoring method and device
CN113688346A (en) Illegal website identification method, device, equipment and storage medium
Choi et al. Discovering message templates on large scale Bitcoin abuse reports using a two-fold NLP-based clustering method
CN113361597A (en) URL detection model training method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant