CN112131259A

CN112131259A - Similar malware recommendation method, device, medium and equipment

Info

Publication number: CN112131259A
Application number: CN202011037893.9A
Authority: CN
Inventors: 周娟; 章瑞康; 袁军; 李文瑾
Original assignee: Nsfocus Technologies Inc; Nsfocus Technologies Group Co Ltd
Current assignee: Nsfocus Technologies Inc; Nsfocus Technologies Group Co Ltd
Priority date: 2020-09-28
Filing date: 2020-09-28
Publication date: 2020-12-25
Anticipated expiration: 2040-09-28
Also published as: CN112131259B

Abstract

The invention relates to a similar malware recommendation method, device, medium and equipment. Analysis data corresponding to each piece of malware is stored in a malware heterogeneous information network, and a vector corresponding to each piece of malware is obtained from that stored analysis data and saved. To search for similar malware, only the malware to be queried needs to be specified: once it is determined that a vector corresponding to the malware to be queried has been saved, a similarity search is performed, and a specified number of pieces of malware whose vectors have the highest similarity to the vector of the malware to be queried are taken as recommended malware similar to the malware to be queried. No complicated query statements need to be written, which simplifies the search for similar malware, and comparing vector similarity improves the accuracy of the search.

Description

Similar malware recommendation method, device, medium and equipment

Technical Field

The invention relates to the technical field of network security, in particular to a method, a device, a medium and equipment for recommending similar malicious software.

Background

This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.

In recent years, with the advent of anti-detection techniques such as obfuscation and metamorphism, malware has grown sharply in both quantity and sophistication, which increases the analysis burden on security researchers. On the one hand, the traditional fragmented storage of analysis data easily leads to repeated analysis work; on the other hand, when a network security threat event occurs, it is difficult to quickly associate malware with the event, so a reasonably comprehensive prevention or remediation decision cannot be made in time.

Therefore, how to perform relevance management and mining on malware based on historical analysis data of the malware and determine similar malware for auxiliary analysis has become a research focus in the field of malware analysis.

In the prior art, information about malware is generally stored in a relational database, and malware analysts look for similar malware through conditional queries, which requires writing complicated query statements. In other words, the prior art requires an analyst to find malware similar to the target malware from experience, which depends heavily on how familiar the analyst is with the malware. Moreover, conditional queries only reach the shallow features of malware and can hardly take the complex associations among malware into account. As a result, searching for similar malware is complicated, and accurate results are difficult to obtain.

Disclosure of Invention

The embodiment of the invention provides a similar malware recommendation method, device, medium and equipment, which are used for solving the problems that similar malware is complex in searching mode and cannot be accurately searched.

In a first aspect, the present invention provides a similar malware recommendation method, including:

determining malicious software to be queried;

if a first vector is determined to be stored, wherein the first vector is a vector corresponding to the malware to be queried, determining the similarity between the stored first vector and each stored second vector, and one second vector is a vector corresponding to a piece of malware which is not the malware to be queried;

taking the malicious software with the highest similarity between the corresponding second vector and the first vector and with the specified quantity as recommended malicious software similar to the malicious software to be inquired;

and the vector corresponding to each piece of saved malicious software is obtained through analysis data corresponding to each piece of malicious software saved on the basis of the malicious software heterogeneous information network.

Optionally, the stored vector corresponding to each malware is a structure vector, a content vector, or a fusion vector obtained by splicing the structure vector and the content vector.

Optionally, the analysis data corresponding to each malware stored in the malware heterogeneous information network is entities and relationships included based on a predefined malware ontology structure, where the entities include malware instances, the obtained relationships between the entities corresponding to each malware, and attribute information of each entity.

Optionally, the structure vector corresponding to one piece of malware is obtained by:

aiming at the relation between entities corresponding to each malicious software stored on the basis of a malicious software heterogeneous information network, defining meta-paths from at least two visual angles according to a specified meta-path definition reference, wherein each visual angle corresponds to a group of meta-paths, and the head node and the tail node of each meta-path are malicious software instances;

determining wandering sequence data corresponding to each meta path at each view angle from the relationship between entities corresponding to each malicious software stored in the malicious software heterogeneous information network through path wandering, and performing feature learning on the wandering sequence data through an embedded model to obtain a sub-structure vector at each view angle corresponding to each malicious software;

and aiming at each malicious software, obtaining a structure vector corresponding to the malicious software according to each sub-structure vector corresponding to the malicious software.

Optionally, a content vector corresponding to the malware is obtained by:

aiming at a piece of malicious software, extracting attribute characteristics of a malicious software instance through feature engineering based on attribute information of each entity corresponding to the malicious software stored in a malicious software heterogeneous information network, and obtaining a content vector corresponding to the malicious software according to the attribute characteristics; or,

aiming at a piece of malicious software, extracting attribute features of a malicious software instance through feature engineering based on the relationship between entities corresponding to the malicious software and attribute information of each entity, which are stored in a malicious software heterogeneous information network, and extracting specified features of adjacent entities of the malicious software instance through the feature engineering, wherein the specified features of one adjacent entity are obtained according to the attribute features of the adjacent entities;

and obtaining a content vector corresponding to the malicious software according to the attribute characteristics and the designated characteristics.

Optionally, taking a specified number of malware with the highest similarity between the corresponding second vector and the first vector as recommended malware similar to the malware to be queried, including:

according to a threat intelligence knowledge graph, updating the similarity between the malicious software with the highest similarity and the malicious software to be inquired in a specified quantity, wherein the threat intelligence knowledge graph is constructed by using specified data;

and according to the updated similarity, taking the malicious software with the highest similarity between the corresponding second vector and the first vector and with the set quantity as recommended malicious software similar to the malicious software to be inquired.

In a second aspect, the present invention further provides a similar malware recommendation apparatus, including:

the determining module is used for determining the malicious software to be inquired;

a comparison module, configured to determine a similarity between a stored first vector and each stored second vector if it is determined that the first vector is stored, where the first vector is a vector corresponding to the malware to be queried, and one of the second vectors is a vector corresponding to a piece of malware which is not the malware to be queried;

the recommending module is used for taking the malicious software with the highest similarity between the corresponding second vector and the first vector and with the specified quantity as recommended malicious software similar to the malicious software to be inquired;

Optionally, a content vector corresponding to the malware is obtained by:

Optionally, the recommendation module is specifically configured to update, according to a threat intelligence knowledge graph, the similarity between each malicious software with the highest similarity and the specified number of malicious software to be queried, where the threat intelligence knowledge graph is constructed by using specified data;

In a third aspect, the present invention also provides a non-volatile computer storage medium storing an executable program for execution by a processor to implement the method as described above.

In a fourth aspect, the present invention further provides a similar malware recommendation device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;

the memory is used for storing a computer program;

the processor, when executing the program stored in the memory, is configured to implement the method steps as described above.

According to the scheme provided by the embodiment of the invention, analysis data corresponding to each piece of malware can be stored in a malware heterogeneous information network, and a vector corresponding to each piece of malware can be obtained from that stored analysis data and saved. To search for similar malware, only the malware to be queried needs to be specified: once it is determined that a vector corresponding to the malware to be queried has been saved, a similarity search is performed, and a specified number of pieces of malware whose vectors have the highest similarity to the vector of the malware to be queried are taken as recommended malware similar to the malware to be queried. No complicated query statements need to be written, the search for similar malware is simplified, and the accuracy of the search is improved.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a schematic flowchart of a similar malware recommendation method according to an embodiment of the present invention;

Fig. 2 is a schematic flowchart of a similar malware recommendation method according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a malware ontology according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of a threat intelligence knowledge graph provided by an embodiment of the invention;

Fig. 5 is a schematic structural diagram of a similar malware recommendation apparatus according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a similar malware recommendation device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that "plurality" or "a plurality" mentioned herein means two or more. "And/or" describes the association relationship of the associated objects and indicates that three relationships are possible. For example, "A and/or B" may mean: A exists alone, both A and B exist, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.

The terms "first," "second," and the like in the description and in the claims, and in the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein.

Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Malware analysis is the process of learning the functions and potential impact of malware. However, when the analysis data corresponding to malware is stored in a relational database, the data is scattered and deep association mining is difficult. This does not help network security researchers quickly grasp the information related to a piece of malware, and it makes direct, in-depth malware association mining hard to support.

An information network (or graph) is an important data structure currently used for representing association between abstract entities, and is widely used in concrete fields such as academia, e-commerce, medical treatment, finance and the like, such as abstract expression of data of paper citation, commodity purchasing behavior, medicine influence, transaction behavior and the like. Through the information network, the associated data can be analyzed and modeled, and potential information contained in the data is deeply mined so as to realize tasks such as classification, prediction, recommendation and the like.

An information network is defined as a directed network graph G = (V, E), where V is the set of all nodes and E is the set of all edges. A heterogeneous information network is a form of information network that has more than one type of node or edge and can therefore carry richer semantic information.
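The definition above can be illustrated with a minimal sketch of a directed graph whose nodes and edges carry types. The class name `HeteroGraph` and the edge labels `belongs-to` and `exhibits` are assumptions made for illustration only; the node type names are borrowed from Table 1 of this description.

```python
from collections import defaultdict


class HeteroGraph:
    """Minimal directed heterogeneous graph: typed nodes and typed edges."""

    def __init__(self):
        self.node_type = {}                 # node id -> node type
        self.out_edges = defaultdict(list)  # node id -> [(edge type, target id)]

    def add_node(self, node_id, node_type):
        self.node_type[node_id] = node_type

    def add_edge(self, src, edge_type, dst):
        if src not in self.node_type or dst not in self.node_type:
            raise KeyError("both endpoints must be added before the edge")
        self.out_edges[src].append((edge_type, dst))

    def neighbors(self, node_id, node_type=None):
        """Targets of outgoing edges, optionally filtered by node type."""
        return [dst for _, dst in self.out_edges[node_id]
                if node_type is None or self.node_type[dst] == node_type]


# Hypothetical data: one malware instance linked to a family and a behavior.
g = HeteroGraph()
g.add_node("sample-1", "malware-instance")
g.add_node("family-x", "malware-family")
g.add_node("behavior-a", "behavior")
g.add_edge("sample-1", "belongs-to", "family-x")   # edge label is an assumption
g.add_edge("sample-1", "exhibits", "behavior-a")   # edge label is an assumption
```

Because every node and edge carries a type, queries such as "all behaviors of a given malware instance" reduce to a typed neighbor lookup.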

Based on the advantages of the heterogeneous information network, in the embodiment of the invention, the analysis data corresponding to the malicious software is abstracted and expressed by using the heterogeneous information network (which can be recorded as the malicious software heterogeneous information network), and rich associated information is mined from the analysis data, so that similar malicious software recommendation is performed, and the accuracy and diversity of the similar malicious software recommendation are improved.

The similarity search is to find target information closest to the input information by calculating the similarity between data in an N-dimensional space. The technology is widely applied to various fields of databases, information retrieval, pattern recognition, data analysis and the like.

In the scheme provided by the embodiment of the invention, the analysis data corresponding to each malicious software can be stored based on the malicious software heterogeneous information network, and the vector corresponding to each malicious software can be obtained and stored based on the analysis data corresponding to each malicious software stored by the malicious software heterogeneous information network. And then, the malware similar to the malware to be inquired is determined to be recommended in a similarity search mode based on the stored vector corresponding to each malware.

Malware Attribute Enumeration and Characterization (MAEC) is a standardized language for sharing structured information about malware. It can be used for information exchange among different devices and addresses the ambiguity and uncertainty that exist in malware descriptions.

Structured Threat Information Expression (STIX) is a standardized language for capturing, characterizing and communicating cyber threat information. Its structured approach supports more effective cyber threat management processes and application automation.

Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) is a set of models and knowledge bases proposed by MITRE that reflect attack behavior in each stage of the attack lifecycle. Building on the Kill Chain model proposed by Lockheed Martin (comprising reconnaissance, weaponization, delivery, exploitation, installation, command and control, and actions on objectives), ATT&CK establishes a fine-grained, easily shared knowledge model and framework for attacker behavior in the last four, observable stages. Through continuous accumulation, it has formed a knowledge base of network attacker behavior that is jointly contributed to and maintained by governments, public service enterprises, private enterprises and academic institutions, and that guides users in targeted detection, defense and response work.

In the scheme provided by the embodiment of the invention, a malware heterogeneous network structure can be predefined, for example by combining the MAEC, STIX and ATT&CK standards, and analysis data corresponding to malware is extracted using big data technology and stored in the malware heterogeneous information network. This effectively standardizes the analysis results of different analysis devices and makes them interoperable, clearly shows the internal associations among malware analysis details, and lets security researchers quickly grasp the analysis information related to a piece of malware.

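A meta-path (also written meta path) is a path defined on the directed network graph G = (V, E). It has the form

$V_1 \xrightarrow{E_1} V_2 \xrightarrow{E_2} \cdots \xrightarrow{E_n} V_{n+1}$,

which defines a composite relation $E = E_1 \circ E_2 \circ \cdots \circ E_n$ from $V_1$ to $V_{n+1}$, where $\circ$ is the composition operator on relations.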

In the scheme provided by the embodiment of the invention, a structure vector corresponding to each malicious software can be determined according to analysis data of each malicious software stored in a malicious software heterogeneous information network based on meta-path definition (a structure vector corresponding to a malicious software can be understood as a vector representation of a structure corresponding to the malicious software), and then similarity search can be performed based on the structure vector.

However, the inventors of the present application found through further study that when the structural features of nodes are learned one meta-path at a time, a large number of meta-paths must be designed, the process of determining the structure vector is complex, the system load is high, and the potential associations between meta-paths are ignored, so it is difficult to express the richer combined structural information of a node. Therefore, the embodiment of the present invention proposes specifying a meta-path definition basis and, according to that basis, defining meta-paths from multiple views. In this way, a set of meta-paths can be designed for each view according to the meta-path definition basis, and the combined structural features of nodes under the different views can be learned.

In the scheme provided by the embodiment of the invention, a content vector corresponding to each malware can be determined according to analysis data of each malware stored in the malware heterogeneous information network (a content vector corresponding to one malware can be understood as a vector representation of content corresponding to the malware), and then similarity search can be performed based on the content vector.

However, the inventor of the present invention further researches and discovers that in a malware heterogeneous information network, one node does not exist independently and is also influenced by a neighbor node (context). Therefore, in the scheme provided by the embodiment of the invention, on the basis of constructing the content vector based on the self characteristics of the malware instance node, the content characteristics of the malware can be represented by a method of fusing the self characteristics of the malware instance node and the characteristics of the neighbor entity nodes, so that the semantic expression of the malware content vector is enriched.

The knowledge graph is a semantic network and formally describes objects and relations in the real world.

A knowledge graph is generally represented by the triple D = (E, R, E), where D denotes the knowledge base; E = {e₁, e₂, …, e_|E|} denotes the set of entities in D, containing |E| different entities; and R = {r₁, r₂, …, r_|R|} denotes the set of relations in D, containing |R| different relations. There are two main types of triples: <entity, attribute, attribute value> and <entity, relation, entity>. A threat intelligence knowledge graph mainly describes knowledge related to threat intelligence, for example entities such as threat actors, campaigns and indicators, and relations such as "uses" and "mitigates".

Vector-based malware similarity search focuses mainly on computing the similarity of malware features and ignores the more abstract knowledge related to the malware. A threat intelligence knowledge graph can extract high-quality information from multi-source intelligence data and fuse it into threat intelligence knowledge, providing intelligence knowledge related to malware, extending the semantic information of malware to a certain extent, and reflecting the characteristics of malware more comprehensively. In the scheme provided by the embodiment of the invention, the similarity obtained from the vectors can therefore be further updated according to a threat intelligence knowledge graph.

Based on the above description, an embodiment of the present invention provides a method for recommending similar malware, where the flow of the steps of the method may be as shown in fig. 1, and the method includes:

Step 101, determining the malware to be queried.

In this step, malware that needs to perform a similarity search, i.e., a similar malware query, may be determined.

Step 102, determining whether a vector corresponding to the malware to be queried is stored.

In this step, it may be determined whether a vector (which may be denoted as a first vector) corresponding to the malware to be queried has been saved. If such a vector has been saved, step 103 may then be executed; otherwise, a prompt may indicate that similar malware recommendation cannot be performed.

In the solution provided in this embodiment, analysis data corresponding to each piece of malware may be saved in a malware heterogeneous information network, for example in a HugeGraph database. Because the analysis data is saved in the malware heterogeneous information network, the vector corresponding to each piece of malware can be obtained from that saved analysis data and then stored.

When similarity search is carried out, whether a vector corresponding to the malware to be inquired is stored or not can be determined for the malware to be inquired, if the vector corresponding to the malware to be inquired is determined to be stored, the subsequent steps are continued, and similarity search for the malware to be inquired is carried out according to the similarity between the stored vector corresponding to the malware to be inquired and the stored vectors corresponding to other malware. Otherwise, if it is determined that the vector corresponding to the malware to be queried is not stored, it may be considered that similarity search for the malware to be queried cannot be performed, and it may be prompted that recommendation of similar malware cannot be performed.

Step 103, determining the similarity between the stored vector corresponding to the malware to be queried and the stored vector corresponding to each other piece of malware.

In this step, the similarity between the vector corresponding to the saved malware to be queried and the vector (which may be denoted as the second vector) corresponding to each of the other saved malware (i.e., malware that is not the malware to be queried) may be determined.

That is, in this embodiment, the similarity search may be performed based on a vector corresponding to malware. And in this embodiment, the similarity between vectors may be determined, but is not limited to, by cosine similarity algorithm.
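As an illustration of the cosine similarity mentioned above, a minimal sketch follows. The function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions of this sketch, not details given in the description.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is treated as dissimilar
    return dot / (norm_u * norm_v)
```

Cosine similarity depends only on the direction of the vectors, so two pieces of malware with proportional feature vectors are scored as maximally similar regardless of magnitude.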

In this embodiment, the vector used for similarity search may be any vector corresponding to malware. For example, it may be a structure vector, a content vector, or a fusion vector obtained by splicing (concatenating) the structure vector and the content vector.

That is to say, in this embodiment, the similarity may be determined based on any one of the structure vector and the content vector, and the fusion vector obtained by splicing the structure vector and the content vector, and the similar malware may be determined according to the similarity determined based on the vector.

It should be noted that, in a possible implementation manner, the analysis data corresponding to each malware stored in the malware heterogeneous information network may be based on entities and relationships included in a predefined malware ontology structure, where the entities include malware instances, obtained relationships between entities corresponding to each malware, and attribute information of each entity.

The malware ontology structure may be predefined by, but is not limited to, the MAEC, STIX, and ATT&CK standards.

Further, if the malware heterogeneous information network stores the relationship between the entities corresponding to each piece of malware and the attribute information of each entity, in a possible implementation manner, a structure vector corresponding to each piece of malware may be obtained by:

determining wandering sequence data corresponding to each meta path under each view angle from the relationship between entities corresponding to each malicious software stored in a malicious software heterogeneous information network through path wandering, and performing feature learning on the wandering sequence data through an embedded model to obtain a sub-structure vector under each view angle corresponding to each malicious software;

In this embodiment, the embedded model (that is, an embedding model) may be, but is not limited to, a Skip-gram model.
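The path walk that produces the sequence data could be sketched as a meta-path-guided random walk. Everything in this sketch is an assumption made for illustration: the function name `metapath_walk`, the undirected adjacency dictionary, and the example meta-path MI → B → MI (two malware instances sharing a behavior, using the codes from Table 1). The walks produced this way would then be fed to an embedding model such as Skip-gram, which is not shown here.

```python
import random


def metapath_walk(adj, node_type, start, metapath, walk_length, rng):
    """Random walk that only follows neighbours whose type matches the meta-path.

    adj:       node id -> list of neighbour ids (treated as undirected)
    node_type: node id -> type code
    metapath:  list of type codes whose first and last entries are equal,
               so the pattern can be repeated along the walk
    """
    if metapath[0] != metapath[-1]:
        raise ValueError("meta-path must start and end with the same type")
    if node_type[start] != metapath[0]:
        raise ValueError("start node does not match the head of the meta-path")
    walk = [start]
    cycle = len(metapath) - 1
    step = 0
    while len(walk) < walk_length:
        wanted = metapath[(step % cycle) + 1]
        candidates = [n for n in adj.get(walk[-1], []) if node_type[n] == wanted]
        if not candidates:
            break  # dead end: no neighbour of the required type
        walk.append(rng.choice(candidates))
        step += 1
    return walk


# Hypothetical toy network: two malware instances share one behavior.
node_type = {"m1": "MI", "m2": "MI", "b1": "B", "f1": "MF"}
adj = {"m1": ["b1", "f1"], "m2": ["b1"], "b1": ["m1", "m2"], "f1": ["m1"]}
walk = metapath_walk(adj, node_type, "m1", ["MI", "B", "MI"], 5, random.Random(0))
```

Restricting each step to the node type required by the meta-path keeps the walk on semantically meaningful paths, and the family node `f1` is never visited under this particular meta-path.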

In this embodiment, the structure vector corresponding to a piece of malware may be obtained according to each sub-structure vector corresponding to the piece of malware by, but not limited to, direct concatenation or weighted summation.
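The two combination options named above, direct concatenation and weighted summation, can be sketched as follows. The function names and the example weights are assumptions of this sketch. The same concatenation could also serve to splice a structure vector and a content vector into a fusion vector.

```python
def concat_vectors(vectors):
    """Direct concatenation of several vectors into one."""
    return [x for v in vectors for x in v]


def weighted_sum(vectors, weights):
    """Element-wise weighted sum of equal-length vectors."""
    if len(vectors) != len(weights):
        raise ValueError("one weight is required per vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(w * v[i] for v, w in zip(vectors, weights)) for i in range(dim)]


# Hypothetical sub-structure vectors of one malware under two views.
view_a = [1.0, 2.0]
view_b = [3.0, 4.0]
structure_by_concat = concat_vectors([view_a, view_b])
structure_by_sum = weighted_sum([view_a, view_b], [0.5, 0.5])
```

Concatenation preserves each view separately at the cost of a larger dimension, while a weighted sum keeps the dimension fixed but requires all views to share one embedding size.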

In addition, if the malware heterogeneous information network stores the relationship between the entities corresponding to each piece of malware and the attribute information of each entity, in a possible implementation manner, a content vector corresponding to one piece of malware may be obtained by:

aiming at a piece of malicious software, extracting attribute characteristics of a malicious software instance through feature engineering (converting original data into characteristics) based on the relationship between entities corresponding to the malicious software and the attribute information of each entity, which are stored in a malicious software heterogeneous information network, and extracting the designated characteristics of adjacent entities of the malicious software instance through the feature engineering, wherein the designated characteristics of one adjacent entity are obtained according to the attribute characteristics of the adjacent entity;

In this embodiment, the attribute features of malware (or the attribute features together with the specified features) may be screened and combined through the term frequency-inverse document frequency (TF-IDF) algorithm, and the content vector corresponding to the malware is then obtained through vectorization.
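A minimal sketch of TF-IDF vectorization is given below, assuming that the attribute features of each malware have already been turned into string tokens by feature engineering. The token names, the function name `tfidf_vectors`, and the smoothed IDF variant used here are assumptions of this sketch; the description does not fix these details.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: malware id -> list of feature tokens.

    Returns (vocabulary, malware id -> TF-IDF vector over that vocabulary).
    """
    vocab = sorted({tok for toks in docs.values() for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many malware each token occurs.
    df = Counter(tok for toks in docs.values() for tok in set(toks))
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = {}
    for name, toks in docs.items():
        counts = Counter(toks)
        total = len(toks) or 1
        vectors[name] = [counts[tok] / total * idf[tok] for tok in vocab]
    return vocab, vectors


# Hypothetical feature tokens for two malware instances.
docs = {
    "m1": ["api:CreateFile", "packer:upx"],
    "m2": ["api:CreateFile", "api:Connect"],
}
vocab, vectors = tfidf_vectors(docs)
```

Tokens shared by every malware receive a lower weight than tokens specific to a few, which is the screening effect TF-IDF provides.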

It should be noted that obtaining the structure vector and the content vector in the above manner takes the complex associations among malware into account comprehensively. Rather than representing malware only by feature dimensions constructed manually from experience, and rather than stopping at the shallow features of malware, this characterizes malware more accurately. Searching for similar malware based on the structure vector and the content vector can then effectively improve the accuracy of the search.

Step 104, taking a specified number of pieces of malware whose vectors have the highest similarity to the vector corresponding to the malware to be queried as recommended malware similar to the malware to be queried.

In this step, a specified number (e.g., N) of malware with the highest similarity to the malware vector to be queried may be taken as recommended malware similar to the malware to be queried.

In a possible implementation manner, in this step, the similarity between each of the specified number of malware with the highest similarity and the malware to be queried may be further updated according to a threat intelligence knowledge graph, where the threat intelligence knowledge graph is constructed using specified data. According to the updated similarity, a set number (for example, K) of malware with the highest similarity to the malware to be queried may then be taken as recommended malware similar to the malware to be queried.

The above embodiment is explained below by a specific example. The embodiment of the present invention further provides a method for recommending similar malware, where the flow of the steps of the method may be as shown in fig. 2, and the method includes:

First, construct a malware ontology structure.

In this embodiment, a hierarchical and applicable malware ontology structure may be designed according to MAEC, STIX, and ATT&CK in combination with actual business requirements, in order to define the structure of the malware heterogeneous network and to describe the concepts and entities related to malware analysis and the relationships between them.

In this embodiment, the malware ontology structure may be as shown in fig. 3, where each node represents one type of entity related to malware and each edge represents a relationship between entities. Fig. 3 includes 13 types of entities and 12 types of relationships. The meaning and code of the 13 types of entities may be as shown in Table 1.

TABLE 1

Entity type	Meaning	Code
malware-instance	Malware instance	MI
malware-family	Malware family	MF
capability	Malware capability	C
behavior	Malware behavior	B
technique	Attack technique	T
tactic	Attack tactic	TA
ip	IPv4, IPv6	I
domain-name	Domain name	D
url	URL	U
file	File	F
api-call-sequence	API call sequence	A
mutex	Mutex	M
process	Process	P

The 12 types of relationships may include: belonging to (below-to), associated internet protocol address (contact-ip), associated uniform resource locator (contact-url), associated domain name (contact-domain), having (has), creating (create), metadata (metadata), related-to, resolved-to, containing (relationship-of), similar (similar), and child-of.

Second, generate the malware heterogeneous information network.

After the malware ontology structure is constructed, in this step, an actual business data set of malware analysis may be collected. The collected data may include static analysis data of the malware (API call sequence, threat type, file type, etc.) as well as dynamic analysis data of the malware (created processes, created files, communication IPs, communication domain names, etc.).

In a possible implementation manner, the collected data may be preprocessed, for example, preprocessing such as special character filtering may be performed, and the relationships among the entities may be obtained from the preprocessed data based on various entities and relationships included in a predefined malware ontology structure, and various attribute information of the entities may be obtained.

Furthermore, based on a pre-designed vocabulary for the network security field that maps multiple words with one meaning to a single term, the acquired data can be uniformly expressed as triples in the formats <entity, relationship, entity> and <entity, attribute, attribute value>. The uniformly expressed data can then be imported into the open-source graph database HugeGraph to generate the malware heterogeneous information network.
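The normalization into triples can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the `SYNONYMS` vocabulary entries, and the function names are hypothetical, and the import into the graph database is deliberately left out.

```python
from typing import Dict, List, Tuple

# Hypothetical synonym vocabulary: several surface words map to one canonical term.
SYNONYMS: Dict[str, str] = {
    "connect-ip": "contact-ip",
    "communicate-ip": "contact-ip",
    "filetype": "file-type",
}


def canonical(term: str) -> str:
    """Lower-case a term and map it to its canonical form if a synonym is known."""
    key = term.strip().lower()
    return SYNONYMS.get(key, key)


def to_triples(record: dict) -> Tuple[List[tuple], List[tuple]]:
    """Turn one (assumed) analysis record into relation and attribute triples."""
    source = record["id"]
    relations = [
        (source, canonical(rel), target)
        for rel, target in record.get("relations", [])
    ]
    attributes = [
        (source, canonical(name), value)
        for name, value in record.get("attributes", {}).items()
    ]
    return relations, attributes


example = {
    "id": "mi-1",
    "relations": [("Connect-IP", "ip-9")],
    "attributes": {"FileType": "PE32"},
}
print(to_triples(example))
```

Keeping the vocabulary lookup in one function means every relationship and attribute name passes through the same mapping before it is stored.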

Third, design multi-view meta-paths to generate the structure vector corresponding to the malware.

A meta-path represents a kind of composite relationship between a source entity and a target entity in an information network structure. For example, using the codes in Table 1, the meta-path MI → I ← MI, in which both edges are contact-ip relationships, can indicate that two malware instances communicate with the same IP.

In view of the potential association that may exist between multiple meta paths, the present embodiment proposes a concept of multi-view meta paths, that is, a meta path definition reference may be specified, and for the meta path definition reference, meta paths are defined from at least two views, and each view corresponds to a group of meta paths.

By combining meta-paths under different views, potential associations among multiple meta-paths can be established. Each view can be designed according to actual requirements. For example, when the number of hops from the source entity to the target entity is taken as the meta-path definition reference, k-hop view meta-paths (k = 1, 2, 3, ...) may be designed, where each view represents the number of hops from the source entity to the target entity, such as 1 hop, 2 hops, 3 hops, and so on. As another example, when the functional meaning expressed by the path is taken as the meta-path definition reference, k-function view meta-paths (k = 1, 2, 3, ...) may be designed, where each view represents a function type, such as meta-paths of a communication-type view or meta-paths of an attack-operation-type view in the malware domain.
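As a minimal sketch of the hop-count reference, meta-paths written as sequences of the entity codes from Table 1 can be grouped into k-hop views. The example meta-paths, the view names, and the `max_hop` cutoff are assumptions made for illustration and are not taken from Table 2.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# A meta-path is written as a sequence of entity codes, e.g. ("MI", "I", "MI").
MetaPath = Tuple[str, ...]


def group_by_hop_view(meta_paths: List[MetaPath], max_hop: int = 5) -> Dict[str, List[MetaPath]]:
    """Group meta-paths into views keyed by hop count (number of edges)."""
    views: Dict[str, List[MetaPath]] = defaultdict(list)
    for path in meta_paths:
        hops = len(path) - 1
        # Paths with max_hop or more hops share one "and above" view.
        name = f"{hops}-hop" if hops < max_hop else f"{max_hop}-hop-and-above"
        views[name].append(path)
    return dict(views)


# Hypothetical meta-paths for illustration only.
paths = [
    ("MI", "I", "MI"),
    ("MI", "D", "I", "MI"),
    ("MI", "B", "T", "TA", "T", "B", "MI"),
]
for view, members in group_by_hop_view(paths).items():
    print(view, members)
```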

On the basis of the generated malware heterogeneous information network, this embodiment takes the number of hops from the source entity to the target entity as the meta-path definition reference and, as an example, defines meta-paths for four views: 2 hops, 3 hops, 4 hops, and 5 hops and above. For each view, a plurality of meta-paths can be defined, and the head and tail nodes of each meta-path are malware-instance entities.

In the present embodiment, the meta-paths defined for the respective views may be as shown in Table 2. For the 2-hop view (2-hop-metapaths), 10 meta-paths may be defined; for the 3-hop view (3-hop-metapaths), 4 meta-paths may be defined; for the 4-hop view (4-hop-metapaths), 4 meta-paths may be defined; and for the 5-hop-and-above view (5-hop-metapaths), 4 meta-paths may be defined.

TABLE 2

According to the meta-paths defined for each view as shown in Table 2, path walks may be performed in the malware heterogeneous information network in turn based on the meta-path sets of the 4 views. The corresponding walk sequence data is extracted and may be stored, in the form <entity, ..., entity>, in a walk sequence file of the corresponding view, such as a TXT file, where the head node and the tail node of each walk sequence are malware-instance entities.

For example, a walk sequence extracted along a meta-path may be represented by entity identifiers (IDs), such as <802, 31505, 31805, 31926, 185>, where the entity corresponding to each ID may be as shown in Table 3:

TABLE 3

Since this example defines meta-paths for 4 views (2-hop-metapaths, 3-hop-metapaths, 4-hop-metapaths, 5-hop-metapaths), walk sequence data for 4 views can be obtained.
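The extraction of walk sequences along one meta-path can be sketched as below. This is a simplified illustration: it enumerates every matching walk on a tiny in-memory graph instead of sampling walks from a graph database, and the node IDs, the adjacency layout, and the function name are hypothetical.

```python
from typing import Dict, List, Tuple


def metapath_walks(
    adj: Dict[int, List[int]],
    node_type: Dict[int, str],
    meta_path: Tuple[str, ...],
) -> List[Tuple[int, ...]]:
    """Enumerate node sequences whose entity types follow meta_path."""
    walks: List[Tuple[int, ...]] = []

    def extend(walk: List[int]) -> None:
        if len(walk) == len(meta_path):
            walks.append(tuple(walk))
            return
        wanted = meta_path[len(walk)]
        for nxt in adj.get(walk[-1], []):
            # Skip nodes already on the walk so a sample is never paired with itself.
            if node_type[nxt] == wanted and nxt not in walk:
                extend(walk + [nxt])

    for start in sorted(adj):
        if node_type[start] == meta_path[0]:
            extend([start])
    return walks


# Toy graph: malware instances 1 and 2 both contact IP node 10.
adj = {1: [10], 2: [10], 10: [1, 2]}
node_type = {1: "MI", 2: "MI", 10: "I"}
for walk in metapath_walks(adj, node_type, ("MI", "I", "MI")):
    print("<" + ", ".join(str(n) for n in walk) + ">")
```

Running the same function once per meta-path of a view, and writing the resulting lines to that view's file, would give one walk sequence file per view.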

After the walk sequence data is obtained, embedding learning may be performed separately on the walk sequence data of each of the 4 views using a Skip-gram model. Because some malware-instances may not satisfy the meta-path relationships of every view, each malware-instance has between 0 and 4 per-view sub-structure vectors.
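A Skip-gram model learns embeddings from (center, context) pairs taken within a window around each position of a sequence. The sketch below shows only that pair-generation step for one walk sequence; the window size and function name are assumptions, and the actual training of the embedding (for instance with an off-the-shelf word-embedding library) is omitted.

```python
from typing import List, Sequence, Tuple


def skipgram_pairs(walk: Sequence[int], window: int = 2) -> List[Tuple[int, int]]:
    """Return (center, context) pairs for every position in one walk sequence."""
    pairs: List[Tuple[int, int]] = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs


# Entity IDs of one walk sequence, as stored in a walk sequence file.
print(skipgram_pairs([802, 31505, 31805, 31926, 185], window=1))
```

Because the walks of each view are kept in separate files, training one model per file naturally yields one sub-structure vector per view for each entity that appears in that view.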

For the sub-structure vectors under each view angle corresponding to each malware-instance, any one of the following two ways can be used to fuse the sub-structure vectors to obtain the structure vector corresponding to each malware:

Mode 1: directly concatenating the sub-structure vectors in a specified order, such as the order of the 2-hop view, 3-hop view, 4-hop view, and 5-hop view;

Mode 2: computing a weighted sum of the sub-structure vectors of the 4 views.
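The two fusion modes can be sketched as follows. The text does not state how a missing view is handled, so filling it with zeros in mode 1 and skipping it in mode 2 are assumptions of this sketch, as are the view names, weights, and function names.

```python
from typing import Dict, List, Sequence

VIEW_ORDER = ("2-hop", "3-hop", "4-hop", "5-hop")


def fuse_concat(sub_vectors: Dict[str, List[float]], dim: int,
                order: Sequence[str] = VIEW_ORDER) -> List[float]:
    """Mode 1: concatenate per-view vectors in a fixed order (zeros if a view is missing)."""
    fused: List[float] = []
    for view in order:
        fused.extend(sub_vectors.get(view, [0.0] * dim))
    return fused


def fuse_weighted(sub_vectors: Dict[str, List[float]],
                  weights: Dict[str, float], dim: int) -> List[float]:
    """Mode 2: weighted sum of the per-view vectors that are present."""
    fused = [0.0] * dim
    for view, vec in sub_vectors.items():
        w = weights.get(view, 0.0)
        for i, x in enumerate(vec):
            fused[i] += w * x
    return fused


subs = {"2-hop": [1.0, 2.0], "4-hop": [3.0, 4.0]}
print(fuse_concat(subs, dim=2))
print(fuse_weighted(subs, {"2-hop": 0.5, "4-hop": 0.5}, dim=2))
```

Concatenation keeps each view in its own coordinates at the cost of a longer vector, while the weighted sum keeps the dimension fixed but requires all views to share one embedding size.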

Fourth, generate the content vector corresponding to the malware based on feature engineering.

The third and fourth steps may be performed in any order. In a malware heterogeneous information network, a malware-instance does not exist independently but is affected by its adjacent entities. Although the multi-view meta-path-based structure vector in this embodiment considers the influence of the structure of the adjacent entities on the malware-instance, it does not consider the influence of the content features of those adjacent entities.

Therefore, in this embodiment, when the content vector corresponding to the malware is defined, not only the content features of the malware-instance itself, such as the threat type, severity, and file type, may be considered, but also the content features of the adjacent entities of the malware-instance, such as whether the malware communicates with a blacklisted IP and the number of blacklisted IPs it communicates with, based on the relationships between the entities corresponding to the malware stored in the malware heterogeneous information network. Taking the content features of adjacent entities as part of the content vector enriches the semantic representation of the content vector.

The features corresponding to the content vector of one malware may be understood to include two types, which are a malware-instance feature (which may be understood as a content feature of the malware-instance itself) and an adjacent entity feature (which may be understood as a content feature of an adjacent entity of the malware-instance), and the feature dimension corresponding to each type, the data type of each feature, and the value example may be, but are not limited to, as shown in table 4:

TABLE 4

In Table 4, SSDEEP is the fuzzy hash value corresponding to a malware-instance. In this embodiment, the ssdeep string may be split into two sub-strings based on the character ":" (the leading block size is not used as a sub-string), and the two sub-strings are then each used as a feature of the malware-instance type.

If the ssdeep string is:

24576:rjh+w9Yz1Wsp9bOWFH2hoiMUx2pP2pBFsRC5:nY9MWFHWJUoK7

then the two corresponding sub-strings are:

rjh+w9Yz1Wsp9bOWFH2hoiMUx2pP2pBFsRC5 and nY9MWFHWJUoK7.

By splitting the ssdeep character string, the search accuracy can be further improved when similarity search is performed based on the content vector.
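A minimal sketch of this split is shown below, using the example digest from this embodiment. The function name is hypothetical, and whitespace is removed defensively on the assumption that an ssdeep digest itself contains no spaces.

```python
from typing import Tuple


def split_ssdeep(digest: str) -> Tuple[str, str]:
    """Split 'blocksize:hash1:hash2' and return the two hash sub-strings."""
    cleaned = digest.replace(" ", "")
    # The first field is the block size; it is not used as a feature here.
    _block_size, first, second = cleaned.split(":", 2)
    return first, second


digest = "24576:rjh+w9Yz1Wsp9bOWFH2hoiMUx2pP2pBFsRC5:nY9MWFHWJUoK7"
print(split_ssdeep(digest))
```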

After determining each feature characterizing the content vector, in this embodiment, a TF-IDF algorithm may be used to screen and combine each feature, and obtain the content vector corresponding to the malware through vectorization.
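The TF-IDF step can be sketched as follows, treating each malware as a "document" whose "terms" are feature tokens. The token format, the unsmoothed formula idf = log(N / df), and the function name are assumptions of this sketch; the embodiment does not fix a particular TF-IDF variant.

```python
import math
from collections import Counter
from typing import List, Tuple


def tfidf_vectors(docs: List[List[str]]) -> Tuple[List[str], List[List[float]]]:
    """Return the sorted vocabulary and one TF-IDF vector per document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each token occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors: List[List[float]] = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors


# Hypothetical feature tokens for two malware instances.
docs = [
    ["threat=trojan", "filetype=pe"],
    ["threat=worm", "filetype=pe"],
]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print(vectors)
```

With this variant, a feature shared by every sample (here `filetype=pe`) receives weight zero, which is one simple way the weighting screens out features that do not discriminate between samples.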

Fifth, determine the vector corresponding to the malware.

After the structure vector and the content vector are represented for each malware, three vectors, namely the structure vector, the content vector and a fusion vector obtained by splicing the structure vector and the content vector, can be used as vectors corresponding to the malware and stored.

Sixth, calculate the similarity based on the vectors corresponding to the malware.

After the three vectors corresponding to each malware are stored, similarity search can be performed for the designated malware using a cosine similarity algorithm based on each of the three stored vectors, and the N (TOP-N) malware with the highest similarity are obtained for each vector for recommendation.

Based on the stored structure vector, similarity search is performed for the designated malware using the cosine similarity algorithm to obtain malware that is similar to the designated malware in structure.

Based on the stored content vector, similarity search is performed for the designated malware using the cosine similarity algorithm to obtain malware that is similar to the designated malware in content.

Based on the stored fusion vector, similarity search is performed for the designated malware using the cosine similarity algorithm to obtain malware that is similar to the designated malware in both structure and content.
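The TOP-N search over one kind of stored vector can be sketched as a brute-force cosine comparison. The in-memory dictionary of vectors, the malware IDs, and the function names are assumptions of this sketch; the same function would be called once each for the structure, content, and fusion vectors.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_similar(query_id: str, vectors: Dict[str, List[float]], n: int) -> List[Tuple[str, float]]:
    """Return the n malware most similar to query_id, best first."""
    query = vectors[query_id]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != query_id
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n]


stored = {"q": [1.0, 0.0], "a": [2.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_n_similar("q", stored, n=2))
```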

Seventh, optimize the recommendation based on the threat intelligence knowledge graph.

Because newly added malware has not yet been comprehensively analyzed, it is difficult for the system to model and recommend it accurately. In order to capture deeper and longer-range associations between the designated malware and the candidate malware (i.e., other malware), in this embodiment, a threat intelligence knowledge graph may further be used for recommendation optimization to compensate for sparse or missing information, thereby improving the accuracy and diversity of the recommendation results to a certain extent.

In this embodiment, a threat intelligence knowledge graph may be constructed from specified data based on STIX. In the constructed threat intelligence knowledge graph, a top-level entity relationship graph can be shown in fig. 4, wherein the entities comprise vulnerabilities, identities, threat organizations, malware, attack patterns and response schemes, and the relationships comprise inclusion, targeting, use, utilization and mitigation.

In the malware heterogeneous information network, malware entities can be considered to be associated at the level of executed actions; for example, two malware entities communicate with the same IP or domain name.

The threat intelligence knowledge graph provides deeper and longer-range knowledge associations. For example, malware B may be recommended through the path [malware A] → [event report 1] → [threat organization 1] → [event report 2] → [malware B]. Such a path mines the threat events and threat organizations related to the malware from the perspective of threat intelligence, compensating for sparse or missing information and improving accuracy and diversity to a certain extent.

After the threat intelligence knowledge graph is constructed, in the step, the TOP-N similar malicious software obtained by similarity search according to the vector can be reordered by utilizing the threat intelligence knowledge graph. For example, the malware used by the same threat organization and appearing in the same threat event is given higher weight, the similarity of the TOP-N similar malware is recalculated, and finally the K (TOP-K) malware with the highest similarity is taken as the final recommendation result.
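The reordering can be sketched as a multiplicative boost applied to the TOP-N candidates. The embodiment does not specify how the weights are computed, so the bonus values, the sets of candidates sharing a threat organization or threat event with the designated malware, and the function name are all assumptions of this sketch.

```python
from typing import List, Set, Tuple


def rerank(
    candidates: List[Tuple[str, float]],
    shared_org: Set[str],
    shared_event: Set[str],
    k: int,
    org_bonus: float = 0.2,
    event_bonus: float = 0.1,
) -> List[Tuple[str, float]]:
    """Boost candidates linked to the query in the knowledge graph and keep the top k."""
    rescored: List[Tuple[str, float]] = []
    for malware_id, similarity in candidates:
        boost = 1.0
        if malware_id in shared_org:      # used by the same threat organization
            boost += org_bonus
        if malware_id in shared_event:    # appeared in the same threat event
            boost += event_bonus
        rescored.append((malware_id, similarity * boost))
    rescored.sort(key=lambda item: item[1], reverse=True)
    return rescored[:k]


top_n = [("a", 0.9), ("b", 0.8), ("c", 0.5)]
print(rerank(top_n, shared_org={"b"}, shared_event={"b"}, k=2))
```

In this toy run, candidate b overtakes a because it shares both a threat organization and a threat event with the designated malware.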

This embodiment provides a TOP-K similarity search and recommendation method for malware. For the malware heterogeneous information network, meta-paths are designed from multiple views, and meta-path walk sequence data is extracted. An embedding model can then be used to learn the structural features of the entities, feature engineering can be used to represent the content features of the entities, and the structural and content features can be fused and applied to the similar malware search task. Finally, the search results are optimized using the threat intelligence knowledge graph to realize recommendation of similar malware.

In the scheme provided by the embodiment of the invention, considering that malware analysis data is fragmented and that analysis work is repeated to a certain degree, a malware ontology structure can be constructed by combining MAEC, STIX, and ATT&CK. The storage structure of malware analysis results is defined in a unified, normalized format, and the data is stored in graph form in a HugeGraph database to obtain the malware heterogeneous information network.

A meta-path represents a composite association pattern between a source entity and a target entity. In a malware heterogeneous information network with a complex structure, the number of meta-paths is large and the semantics expressed by a single path are limited; if the potential associations among meta-paths are ignored, the richer structural semantic features of the entities cannot be learned. Based on this, the concept of multi-view meta-paths is proposed: meta-paths can be designed from multiple views, each view being a combination of a series of meta-paths. Then, through path walking, the meta-paths of each view are walked in the malware heterogeneous information network, and the corresponding walk sequences are extracted and written into the storage files of the corresponding views.

Based on the concept of multi-view meta-path, after extracting corresponding walking sequences and respectively writing the walking sequences into storage files of corresponding views, respectively performing entity structure vector representation learning on meta-path walking sequence data under each view by using a Skip-gram model to obtain structure vector representation of a malicious software entity under each view, and then fusing the structure vector representation under all views by using a direct splicing or weighted summation method to obtain final structure vector representation of the malicious software entity.

In the malware heterogeneous information network, an entity does not exist independently but is also influenced by its neighbor entities (context). To enrich the semantic expression of entities, the content features of an entity are proposed to include both the features of the entity itself and the features of its neighbor entities, and the TF-IDF algorithm is then used for feature screening to obtain the content vector representation of the malware entity. Three forms, namely the entity structure vector, the entity content vector, and the concatenation of the entity structure vector and content vector, can each be used as the final entity vector representation.

For the TOP-N query of similar malware, cosine similarity between entities in the vector space can be calculated for three similarity modes: similar structure, similar content, and similar structure and content. This yields the TOP-N candidate similar malware entities that are similar to the designated malware entity in each of the three modes, where the candidate malware entities are similar at the malware analysis level.

In addition, in the embodiment, recommendation optimization can be performed based on the threat intelligence knowledge graph. That is, to improve the accuracy and diversity of similar search results, a threat intelligence knowledge graph may be used to mine deeper and longer-range associations between malware entities, such as assigning a higher weight to candidate malware entities that have appeared in the same threat event with a given malware entity and have been used by the same threat organization, then the similarity of TOP-N candidate malware entities may be recalculated, and finally TOP-K malware entities are taken as the final recommendation result.

Corresponding to the provided method, the following device is further provided.

An embodiment of the present invention provides a similar malware recommendation apparatus, where the structure of the apparatus may be as shown in fig. 5, and the apparatus includes:

the determining module 11 is used for determining the malware to be queried;

the comparison module 12 is configured to determine, if a first vector is determined to be stored, where the first vector is a vector corresponding to the malware to be queried, a similarity between the stored first vector and each stored second vector, where one of the second vectors is a vector corresponding to a piece of malware that is not the malware to be queried;

the recommending module 13 is configured to take a specified number of malware with the highest similarity between the corresponding second vector and the first vector as recommended malware similar to the malware to be queried;

Optionally, a content vector corresponding to the malware is obtained by:

Optionally, the recommending module 13 is specifically configured to update, according to a threat intelligence knowledge graph, the similarity between each of the specified number of malware with the highest similarity and the malware to be queried, where the threat intelligence knowledge graph is constructed by using specified data;

The functions of the functional units of the apparatuses provided in the above embodiments of the present invention may be implemented by the steps of the corresponding methods, and therefore, detailed working processes and beneficial effects of the functional units in the apparatuses provided in the embodiments of the present invention are not described herein again.

Based on the same inventive concept, embodiments of the present invention provide the following apparatus and medium.

The structure of the device can be as shown in fig. 6, and the device includes a processor 21, a communication interface 22, a memory 23, and a communication bus 24, where the processor 21, the communication interface 22, and the memory 23 complete mutual communication through the communication bus 24;

the memory 23 is used for storing computer programs;

the processor 21 is configured to implement the steps of the above method embodiments of the present invention when executing the program stored in the memory.

Optionally, the processor 21 may specifically include a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), one or more Integrated circuits for controlling program execution, a hardware Circuit developed by using a Field Programmable Gate Array (FPGA), and a baseband processor.

Optionally, the processor 21 may include at least one processing core.

Optionally, the memory 23 may include a Read-Only Memory (ROM), a Random Access Memory (RAM), and a disk memory. The memory 23 is used for storing data required by the at least one processor 21 during operation. There may be one or more memories 23.

An embodiment of the present invention further provides a non-volatile computer storage medium, where the computer storage medium stores an executable program, and when the executable program is executed by a processor, the method provided in the foregoing method embodiment of the present invention is implemented.

In particular implementations, computer storage media may include: various storage media capable of storing program codes, such as a Universal Serial Bus (USB) flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

In the embodiments of the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the described unit or division of units is only one division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical or other form.

The functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be an independent physical module.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the technical solutions of the embodiments of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device, such as a personal computer, a server, or a network device, or a processor to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: various media capable of storing program codes, such as a Universal Serial Bus (USB) flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A similar malware recommendation method, the method comprising:

determining malicious software to be queried;

2. The method of claim 1, wherein the vector corresponding to each malware is stored as a structure vector, a content vector, or a fusion vector based on concatenation of a structure vector and a content vector.

3. The method as claimed in claim 2, wherein the analysis data corresponding to each malware stored in the malware heterogeneous information network is entities and relationships included based on a predefined malware ontology structure, the entities include malware instances, the obtained relationships between the entities corresponding to each malware, and attribute information of each entity.

4. The method of claim 3, wherein a structural vector for malware is obtained by:

5. The method of claim 3, wherein a malware corresponding content vector is obtained by:

6. The method as claimed in any one of claims 1 to 5, wherein the step of using a specified number of malware with highest similarity between the corresponding second vector and the first vector as recommended malware similar to the malware to be queried comprises:

7. A similar malware recommendation apparatus, the apparatus comprising:

8. The apparatus of claim 7, wherein the recommendation module is specifically configured to update, according to a threat intelligence knowledge graph, the similarity between each of the specified number of malware with the highest similarity and the malware to be queried, the threat intelligence knowledge graph being constructed using specified data;

9. A non-transitory computer storage medium storing an executable program for execution by a processor to perform the method of any one of claims 1 to 6.

10. A similar malware recommendation device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other via the communication bus;

the memory is used for storing a computer program;

the processor, when executing the program stored in the memory, implementing the method steps of any of claims 1-6.