CN112131259B

CN112131259B - Similar malicious software recommendation method, device, medium and equipment

Info

Publication number: CN112131259B
Application number: CN202011037893.9A
Authority: CN
Inventors: 周娟; 章瑞康; 袁军; 李文瑾
Original assignee: Nsfocus Technologies Inc; Nsfocus Technologies Group Co Ltd
Current assignee: Nsfocus Technologies Inc; Nsfocus Technologies Group Co Ltd
Priority date: 2020-09-28
Filing date: 2020-09-28
Publication date: 2024-03-15
Anticipated expiration: 2040-09-28
Also published as: CN112131259A

Abstract

The invention relates to a method, a device, a medium and equipment for recommending similar malware. Analysis data corresponding to each piece of malware is stored in a malware heterogeneous information network, and a vector corresponding to each piece of malware is obtained from that stored analysis data and saved. To search for similar malware, it is only necessary to determine the malware to be queried. Once it is determined that a vector corresponding to the malware to be queried is stored, a similarity search is performed, and the specified number of malware whose vectors are most similar to the vector of the malware to be queried are taken as the recommended malware similar to the malware to be queried. Similar malware can therefore be found without writing complicated query statements, which simplifies the search, and comparing vector similarity improves the accuracy of the search.

Description

Similar malicious software recommendation method, device, medium and equipment

Technical Field

The invention relates to the technical field of network security, in particular to a method, a device, a medium and equipment for recommending similar malicious software.

Background

This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.

In recent years, with the emergence of anti-detection techniques such as obfuscation and metamorphism, malware has grown rapidly in both quantity and sophistication, which undoubtedly increases the analytical burden on security researchers. On the one hand, the traditional fragmented storage of analysis data easily leads to repeated analysis work; on the other hand, when a network security threat event occurs, it is difficult to associate malware with the event quickly, so comprehensive prevention or remediation decisions cannot be made in time.

Therefore, how to perform relevance management and mining on the malicious software based on the malicious software historical analysis data and determine similar malicious software for auxiliary analysis has become a research focus in the field of malicious software analysis.

In the prior art, information about malware is generally stored in a relational database, and a malware analyst looks for similar malware by writing complicated query statements that run conditional queries against the database. That is, the prior art requires the analyst to find malware similar to the target malware by experience, which depends heavily on the analyst's familiarity with malware. Moreover, conditional queries stop at the shallow features of malware and can hardly take the complex associations among malware into account, so the search is cumbersome and accurate results are difficult to obtain.

Disclosure of Invention

The embodiment of the invention provides a similar malicious software recommending method, a device, a medium and equipment, which are used for solving the problems that a similar malicious software searching mode is complex and similar malicious software cannot be accurately searched.

In a first aspect, the present invention provides a method for similar malware recommendation, the method comprising:

determining malicious software to be queried;

if it is determined that a first vector is stored, wherein the first vector is a vector corresponding to the malicious software to be queried, the similarity between the stored first vector and each stored second vector is determined, and one second vector is a vector corresponding to one malicious software other than the malicious software to be queried;

taking the specified quantity of malicious software with highest similarity between the corresponding second vector and the first vector as recommended malicious software similar to the malicious software to be queried;

the stored vector corresponding to each piece of malicious software is obtained through analysis data corresponding to each piece of malicious software stored on the basis of a malicious software heterogeneous information network.

Optionally, the saved vector corresponding to each malicious software is a structure vector, a content vector or a fusion vector obtained by splicing the structure vector and the content vector.

Optionally, the analysis data stored in the malware heterogeneous information network for each piece of malware comprises the relationships between the entities corresponding to that malware and the attribute information of each entity, obtained on the basis of the entities and relationships included in a predefined malware ontology structure, wherein the entities include a malware instance.

Optionally, a structural vector corresponding to the malicious software is obtained by the following method:

for the relationships between the entities corresponding to each piece of malware stored in the malware heterogeneous information network, defining meta-paths from at least two views according to a specified meta-path definition criterion, wherein each view corresponds to a group of meta-paths, and the head and tail nodes of each meta-path are malware instances;

determining, by walking along paths in the relationships between the entities corresponding to each piece of malware stored in the malware heterogeneous information network, the walk sequence data corresponding to each meta-path under each view, and performing feature learning on the walk sequence data with an embedding model to obtain a substructure vector corresponding to each piece of malware under each view;

for each piece of malware, obtaining the structure vector corresponding to that malware from its substructure vectors.

Optionally, a content vector corresponding to the malicious software is obtained by the following method:

for one piece of malware, extracting attribute features of the malware instance through feature engineering, based on the attribute information of each entity corresponding to the malware stored in the malware heterogeneous information network, and obtaining the content vector corresponding to the malware from the attribute features; or,

for one piece of malware, based on the relationships between the entities corresponding to the malware and the attribute information of each entity stored in the malware heterogeneous information network, extracting attribute features of the malware instance through feature engineering and extracting specified features of the entities neighboring the malware instance through feature engineering, wherein the specified features of a neighboring entity are obtained from the attribute features of that neighboring entity;

and obtaining the content vector corresponding to the malware from the attribute features and the specified features.

Optionally, the specified number of malware with the highest similarity between the corresponding second vector and the first vector is used as recommended malware similar to the malware to be queried, including:

updating, according to a threat intelligence knowledge graph, the similarity between the malware to be queried and each of the specified number of malware with the highest similarity, wherein the threat intelligence knowledge graph is constructed from specified data;

and, according to the updated similarity, taking the set number of malware with the highest similarity between the corresponding second vector and the first vector as the recommended malware similar to the malware to be queried.

In a second aspect, the present invention also provides a similar malware recommendation device, the device comprising:

the determining module is used for determining the malicious software to be queried;

the comparison module is used for determining the similarity between the stored first vector and each stored second vector if the first vector is determined to be stored, wherein the first vector is a vector corresponding to the malicious software to be queried, and one second vector is a vector corresponding to the malicious software which is not the malicious software to be queried;

the recommendation module is used for taking the specified quantity of malicious software with the highest similarity between the corresponding second vector and the first vector as recommended malicious software similar to the malicious software to be queried;

Optionally, the recommendation module is specifically configured to update the similarity between each of the specified number of malware with the highest similarity and the malware to be queried according to a threat intelligence knowledge graph, where the threat intelligence knowledge graph is constructed by using specified data;

In a third aspect, the present invention also provides a non-volatile computer storage medium storing an executable program for execution by a processor to implement the method as described above.

In a fourth aspect, the present invention further provides a similar malware recommendation device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;

the memory is used for storing a computer program;

the processor, when executing the program stored on the memory, implements the method steps described above.

According to the scheme provided by the embodiment of the invention, analysis data corresponding to each piece of malware can be stored in a malware heterogeneous information network, and a vector corresponding to each piece of malware can be obtained from that stored analysis data and saved. To search for similar malware, it is only necessary to determine the malware to be queried. Once it is determined that a vector corresponding to the malware to be queried is stored, a similarity search can be performed, and the specified number of malware whose vectors are most similar to the vector of the malware to be queried are taken as the recommended malware similar to the malware to be queried. Similar malware can therefore be found without writing complicated query statements, which simplifies the search and improves its accuracy.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart illustrating a method for recommending similar malware according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a method for recommending similar malware according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a malware ontology structure according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of threat intelligence knowledge graph provided by an embodiment of the invention;

FIG. 5 is a schematic diagram of a similar malware recommendation device according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a similar malware recommendation device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

It should be noted that, as used herein, "a plurality of" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may indicate that A exists alone, that A and B both exist, or that B exists alone. The character "/" generally indicates that the objects before and after it are in an "or" relationship.

The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein.

Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Malware analysis is the process of learning the functions and potential impact of malware. When the analysis data corresponding to malware is stored in a relational database, however, the data is stored in scattered form and deep association mining is difficult, which does not help network security researchers quickly grasp information related to the malware or carry out direct, in-depth association mining.

The information network (or graph) is an important data structure for representing association relations between abstract entities, and is widely used in the specific fields of academic, electronic commerce, medical treatment, finance and the like, such as abstract expression of paper citation, commodity purchasing behavior, medicine influence, transaction behavior and the like. Through the information network, the associated data can be analyzed and modeled, and potential information contained in the data is deeply mined, so that tasks such as classification, prediction, recommendation and the like are realized.

The information network is defined as a directed network graph $G = (V, E)$, where $V$ is the set of all nodes and $E$ is the set of all edges. A heterogeneous information network is a form of information network whose nodes and edges may each be of one or more types, so it can carry richer semantic information.
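As a minimal sketch of this definition, a heterogeneous information network can be held in memory as typed nodes plus typed directed edges. The class name `HeteroGraph`, the node types (`malware`, `domain`) and the edge types (`connects_to`, `contacted_by`) are illustrative assumptions and are not taken from the described scheme.

```python
from collections import defaultdict


class HeteroGraph:
    """Directed graph G = (V, E) whose nodes and edges carry a type label."""

    def __init__(self):
        self.node_type = {}                 # node id -> node type
        self.out_edges = defaultdict(list)  # node id -> [(edge type, target id)]

    def add_node(self, node_id, node_type):
        self.node_type[node_id] = node_type

    def add_edge(self, src, edge_type, dst):
        # Both endpoints must already be members of V.
        if src not in self.node_type or dst not in self.node_type:
            raise KeyError("both endpoints must be added before the edge")
        self.out_edges[src].append((edge_type, dst))

    def neighbors(self, node_id, edge_type=None, node_type=None):
        """Targets of outgoing edges, optionally filtered by edge or node type."""
        return [
            dst
            for etype, dst in self.out_edges[node_id]
            if (edge_type is None or etype == edge_type)
            and (node_type is None or self.node_type[dst] == node_type)
        ]


# Hypothetical sample data: two malware instances that contact the same domain.
g = HeteroGraph()
g.add_node("m1", "malware")
g.add_node("m2", "malware")
g.add_node("evil.example", "domain")
g.add_edge("m1", "connects_to", "evil.example")
g.add_edge("m2", "connects_to", "evil.example")
g.add_edge("evil.example", "contacted_by", "m1")
g.add_edge("evil.example", "contacted_by", "m2")
```

Keeping the type on every node and edge is what later allows a walk to be restricted to a particular sequence of types.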

Based on the advantages of the heterogeneous information network, in the embodiment of the invention, the analysis data corresponding to the malicious software is considered to be abstract expressed by using the heterogeneous information network (can be recorded as the malicious software heterogeneous information network), rich associated information is mined from the analysis data, and the similar malicious software is recommended according to the abstract expression, so that the accuracy and the diversity of the similar malicious software recommendation are improved.

The similarity search refers to searching for target information closest to input information by calculating the similarity between data in an N-dimensional space. The technology is widely applied to various fields such as databases, information retrieval, pattern recognition, data analysis and the like.

According to the scheme provided by the embodiment of the invention, the analysis data corresponding to each malicious software can be stored based on the malicious software heterogeneous information network, and the vector corresponding to each malicious software can be obtained and stored through the analysis data corresponding to each malicious software stored based on the malicious software heterogeneous information network. And further, based on the vector corresponding to each stored malicious software, the malicious software similar to the malicious software to be queried can be determined to be recommended in a similarity searching mode.

Malware Attribute Enumeration and Characterization (MAEC) is a standardized language for sharing structured information about malware. It can be used for information exchange between different devices and removes ambiguity and uncertainty from malware descriptions.

Structured Threat Information Expression (STIX) is a language for standardizing the capture, characterization and communication of cyber threat information. It supports more efficient cyber threat management processes and application automation in a structured manner.

Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) is a model and knowledge base proposed by MITRE that reflects attack behavior across the attack lifecycle. Building on the Kill Chain model proposed by Lockheed Martin (reconnaissance, weaponization, delivery, exploitation, installation, command and control, and actions on objectives), ATT&CK constructs a finer-grained and more easily shared knowledge model and framework for attacker behavior in the last four, more observable stages. Through continuous accumulation it has become a knowledge base of network attacker behavior that is jointly contributed to and maintained by government, public service enterprises, private enterprises and academic institutions, guiding users in targeted detection, defense and response.

In the scheme provided by the embodiment of the invention, the structure of the malware heterogeneous information network may be predefined, for example by combining the MAEC, STIX and ATT&CK standards, and the analysis data corresponding to the malware is extracted using big data technology and stored in the malware heterogeneous information network. This effectively standardizes the analysis results of different analysis devices and makes them interoperable, clearly shows how the analysis details of the malware are associated internally, and helps security researchers quickly grasp the analysis information related to the malware.

A meta-path is a path defined on the directed network graph $G = (V, E)$. It is written as $V_1 \xrightarrow{E_1} V_2 \xrightarrow{E_2} \cdots \xrightarrow{E_n} V_{n+1}$ and defines a composite relation $E = E_1 \circ E_2 \circ \cdots \circ E_n$ between $V_1$ and $V_{n+1}$, where $\circ$ is the composition operator on relations.
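The composite relation defined by a meta-path can be illustrated with a short sketch in which each relation is a set of (source, target) pairs. The relation names `connects_to` and `contacted_by`, the sample identifiers and both helper functions are hypothetical and chosen only for illustration.

```python
def compose(r1, r2):
    """Composite relation: pairs (a, c) such that (a, b) is in r1 and (b, c) is in r2."""
    by_source = {}
    for b, c in r2:
        by_source.setdefault(b, set()).add(c)
    return {(a, c) for a, b in r1 for c in by_source.get(b, ())}


def follow_meta_path(relations):
    """Fold E_1, E_2, ..., E_n into the single composite relation E."""
    result = relations[0]
    for rel in relations[1:]:
        result = compose(result, rel)
    return result


# Hypothetical relations for the meta-path malware -> domain -> malware.
connects_to = {
    ("m1", "evil.example"),
    ("m2", "evil.example"),
    ("m3", "other.example"),
}
contacted_by = {(dst, src) for src, dst in connects_to}

# Pairs of malware instances linked through a shared domain.
related = follow_meta_path([connects_to, contacted_by])
```

Here `m1` and `m2` end up related because both reach `evil.example`, while `m3` is related only to itself.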

In the scheme provided by the embodiment of the invention, the structural vector corresponding to each piece of malicious software can be determined based on meta-path definition according to the analysis data of each piece of malicious software stored in the malicious software heterogeneous information network (the structural vector corresponding to one piece of malicious software can be understood as the vector representation of the structure corresponding to the malicious software), and then the similarity search can be performed based on the structural vector.

However, the inventors found through further research that, when meta-paths are defined in this way, a large number of meta-paths must be designed to learn the structural features of the nodes one by one. The process of determining the structure vector is complex and imposes a heavy system load, and the potential associations between meta-paths are ignored, making it difficult to express richer combined structural information about the nodes. Therefore, the embodiment of the invention proposes that a meta-path definition criterion may be specified and that, under the specified criterion, meta-paths are defined from multiple views. A set of meta-paths can thus be designed for each view under each meta-path definition criterion, and the combined structural features of the nodes under different views can be learned.

In the scheme provided by the embodiment of the invention, the content vector corresponding to each piece of malicious software can be determined according to the analysis data of each piece of malicious software stored in the malicious software heterogeneous information network (the content vector corresponding to one piece of malicious software can be understood as the vector representation of the content corresponding to the malicious software), and then the similarity search can be performed based on the content vector.

However, the inventors further found that, in the malware heterogeneous information network, a node does not exist in isolation but is also influenced by its neighboring nodes (its context). Therefore, in the scheme provided by the embodiment of the invention, in addition to constructing the content vector from the features of the malware instance node, the content features of the malware may be represented by combining the features of the malware instance node with the features of its neighboring entity nodes, so as to enrich the semantic expression of the malware content vector.

A knowledge graph is a semantic network that formally describes real-world things and the relationships between them.

A knowledge graph is generally represented by the triple $D = (E, R, E)$, where $D$ denotes the knowledge base; $E = \{e_1, e_2, \ldots, e_{|E|}\}$ denotes the entity set in $D$, which contains $|E|$ different entities; and $R = \{r_1, r_2, \ldots, r_{|R|}\}$ denotes the relationship set in $D$, which contains $|R|$ different relationships. There are two main types of triples: <entity, attribute, attribute value> and <entity, relationship, entity>. The threat intelligence knowledge graph mainly describes knowledge related to threat intelligence, such as entities like threat organizations, campaigns and indicators, and relationships such as "uses" and "mitigates".

Vector-based similarity search focuses mainly on computing the similarity of the malware's own features and ignores more abstract knowledge related to the malware. A threat intelligence knowledge graph can extract high-quality information from multi-source intelligence data and fuse it into threat intelligence knowledge, providing intelligence knowledge related to malware, extending the semantic information of the malware to a certain extent and reflecting its characteristics more comprehensively. In the scheme provided by the embodiment of the invention, the recommendation results obtained by vector-based similarity search can therefore be further optimized according to the knowledge provided by the threat intelligence knowledge graph, further improving the accuracy and diversity of the recommended similar malware.

Based on the above description, the embodiment of the present invention provides a similar malware recommendation method, where the step flow of the method may be as shown in fig. 1, and the method includes:

step 101, determining the malicious software to be queried.

In this step, malware that needs to perform a similarity search, i.e., a similar malware query, may be determined.

Step 102, determining whether a vector corresponding to the malicious software to be queried is stored.

In this step, it may be determined, for the malware to be queried, whether a vector (which may be denoted as a first vector) corresponding to the malware is stored. If it is determined that the vector corresponding to the malware is stored, step 103 may be continued, otherwise, it may be prompted that similar malware recommendation cannot be performed.

In the solution provided in this embodiment, the analysis data corresponding to each piece of malware may be saved in a malware heterogeneous information network, for example, in a HugeGraph database. Because the analysis data corresponding to each piece of malware is stored in the malware heterogeneous information network, the vector corresponding to each piece of malware can be obtained from that stored analysis data and saved.

When a similarity search is carried out, it is first determined whether a vector corresponding to the malware to be queried is stored. If so, the subsequent steps continue, and the search is performed according to the similarity between the stored vector of the malware to be queried and the stored vectors of other malware. Otherwise, the similarity search is considered impossible for the malware to be queried, and a prompt may indicate that similar malware cannot be recommended.
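The check in step 102 can be sketched as a simple lookup. The dictionary-based store, the sample identifiers and the function names below are assumptions made for illustration; the embodiment does not prescribe how the vectors are stored.

```python
from typing import Dict, List, Optional

Vector = List[float]


def load_query_vector(store: Dict[str, Vector], sample_id: str) -> Optional[Vector]:
    """Step 102: return the stored (first) vector, or None if none is stored."""
    return store.get(sample_id)


def recommend_or_prompt(store: Dict[str, Vector], sample_id: str) -> str:
    vector = load_query_vector(store, sample_id)
    if vector is None:
        # No stored vector: the similarity search cannot run for this sample.
        return f"no stored vector for {sample_id}; similar malware cannot be recommended"
    return f"vector found for {sample_id}; continue with step 103"


# Hypothetical vector store keyed by sample hash.
vector_store = {"sha256:aaa": [0.1, 0.9], "sha256:bbb": [0.8, 0.2]}
```

Calling `recommend_or_prompt(vector_store, "sha256:ccc")` returns the prompt text because no vector is stored under that key.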

Step 103, determining the similarity between the vector corresponding to the stored malicious software to be queried and the vector corresponding to each stored other malicious software.

In this step, a similarity between the saved vectors corresponding to the malware to be queried and the saved vectors (which may be denoted as second vectors) corresponding to each other malware (i.e., malware other than the malware to be queried) may be determined.

That is, in the present embodiment, the similarity search may be performed based on the vector corresponding to the malware. And in this embodiment, the similarity between vectors may be determined by, but not limited to, a cosine similarity algorithm.
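For reference, the cosine similarity of two vectors a and b is their dot product divided by the product of their Euclidean norms. A minimal sketch follows; returning 0.0 for an all-zero vector is a convention assumed here and is not specified in the embodiment.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b, in the range [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_a * norm_b)
```

Because the measure depends only on direction, two vectors that differ by a positive scale factor have a similarity of 1.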

In this embodiment, the vector used for the similarity search may be any form of vector corresponding to the malware. For example, the vector may be a structure vector, a content vector, or a fusion vector obtained by concatenating the structure vector and the content vector.

That is, in the present embodiment, the similarity may be determined for any one of the structure vector, the content vector, and the fusion vector obtained by splicing the structure vector and the content vector, and the similar malware may be determined from the similarity determined based on the vector.
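Splicing the two vectors into a fusion vector can be sketched as plain concatenation. The optional L2 normalization of each part before concatenation is an assumption added here so that neither part dominates the similarity purely because of its scale; it is not described in the embodiment.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale v to unit Euclidean length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse(structure_vec: Sequence[float], content_vec: Sequence[float],
         normalize: bool = False) -> List[float]:
    """Fusion vector: structure vector followed by content vector."""
    if normalize:  # assumed optional step, not part of the source description
        structure_vec = l2_normalize(structure_vec)
        content_vec = l2_normalize(content_vec)
    return list(structure_vec) + list(content_vec)
```

The fusion vector has as many dimensions as the two parts combined, so every stored fusion vector must be built from parts of the same sizes for the similarity comparison to be defined.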

It should be noted that, in one possible implementation, the analysis data stored in the malware heterogeneous information network for each piece of malware may comprise the relationships between the entities corresponding to that malware and the attribute information of each entity, obtained on the basis of the entities and relationships included in a predefined malware ontology structure, where the entities include a malware instance.

The malware ontology structure may be predefined on the basis of, but not limited to, the MAEC, STIX and ATT&CK standards.

Further, if the malware heterogeneous information network stores the relationship between the entities corresponding to each piece of malware and the attribute information of each entity, in one possible implementation manner, a structural vector corresponding to each piece of malware may be obtained by:

determining, by walking along paths in the relationships between the entities corresponding to each piece of malware stored in the malware heterogeneous information network, the walk sequence data corresponding to each meta-path under each view, and performing feature learning on the walk sequence data with an embedding model to obtain a substructure vector corresponding to each piece of malware under each view;

In this embodiment, the embedding model may be, but is not limited to being, a Skip-gram model.
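How walk sequence data might be produced for one meta-path can be sketched as a walk that only steps to neighbors of the next required node type. The sample network, the node types and the function `meta_path_walk` are hypothetical; the resulting sequences would then be passed to an embedding model such as Skip-gram, which is not implemented here.

```python
import random
from typing import Dict, List

# Hypothetical network: node id -> node type, and outgoing neighbors per node.
NODE_TYPE = {"m1": "malware", "m2": "malware", "m3": "malware",
             "d1": "domain", "d2": "domain"}
EDGES: Dict[str, List[str]] = {
    "m1": ["d1"], "m2": ["d1", "d2"], "m3": ["d2"],
    "d1": ["m1", "m2"], "d2": ["m2", "m3"],
}


def meta_path_walk(start: str, meta_path: List[str], repeats: int,
                   rng: random.Random) -> List[str]:
    """Walk from `start`, following the node types of `meta_path` cyclically.

    `meta_path` starts and ends with the same type (here "malware"), so the
    pattern is repeated by skipping its first element on each repetition.
    """
    if NODE_TYPE[start] != meta_path[0]:
        raise ValueError("start node does not match the head of the meta-path")
    walk = [start]
    pattern = meta_path[1:]
    for _ in range(repeats):
        for wanted_type in pattern:
            candidates = [n for n in EDGES.get(walk[-1], [])
                          if NODE_TYPE[n] == wanted_type]
            if not candidates:
                return walk  # dead end: stop the walk early
            walk.append(rng.choice(candidates))
    return walk


rng = random.Random(0)  # fixed seed so the sketch is reproducible
walks = [meta_path_walk(m, ["malware", "domain", "malware"], repeats=2, rng=rng)
         for m in ("m1", "m2", "m3")]
```

Each walk is a sequence of node identifiers that plays the role of a sentence for the embedding model, with one set of walks generated per meta-path and per view.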

In this embodiment, the structure vector corresponding to a piece of malware may be obtained from its substructure vectors by direct concatenation or by weighted summation.
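Both ways of combining the per-view substructure vectors can be sketched as follows. The function names are hypothetical, and the default of equal weights is an assumption, since the embodiment does not state how the weights are chosen.

```python
from typing import List, Optional, Sequence


def combine_by_concatenation(sub_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Join the per-view vectors end to end; dimensions add up."""
    return [x for v in sub_vectors for x in v]


def combine_by_weighted_sum(sub_vectors: Sequence[Sequence[float]],
                            weights: Optional[Sequence[float]] = None) -> List[float]:
    """Element-wise weighted sum; all per-view vectors must share one dimension."""
    dims = {len(v) for v in sub_vectors}
    if len(dims) != 1:
        raise ValueError("weighted summation needs vectors of equal dimension")
    if weights is None:
        # Assumed default: every view contributes equally.
        weights = [1.0 / len(sub_vectors)] * len(sub_vectors)
    if len(weights) != len(sub_vectors):
        raise ValueError("one weight is required per substructure vector")
    dim = dims.pop()
    return [sum(w * v[i] for w, v in zip(weights, sub_vectors)) for i in range(dim)]
```

Concatenation preserves each view's information separately at the cost of a longer vector, whereas weighted summation keeps the dimension fixed.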

In addition, if the malware heterogeneous information network stores the relationship between the entities corresponding to each piece of malware and the attribute information of each entity, in one possible implementation, the content vector corresponding to one piece of malware may be obtained by:

for one piece of malware, based on the relationships between the entities corresponding to the malware and the attribute information of each entity stored in the malware heterogeneous information network, extracting attribute features of the malware instance through feature engineering (that is, converting raw data into features) and extracting specified features of the entities neighboring the malware instance through feature engineering, wherein the specified features of a neighboring entity are obtained from the attribute features of that neighboring entity;

In this embodiment, the malware attribute features (or the malware attribute features and the specified features) may be filtered and combined by a word frequency-inverse document frequency (TF-IDF) algorithm, and the content vector corresponding to the malware may be obtained by vectorization.

It should be noted that, with the above way of obtaining the structural vector and the content vector corresponding to the malware, the complex associations among pieces of malware can be considered comprehensively. The malware is no longer represented only by manually constructed, experience-based feature dimensions, and the representation is no longer limited to shallow features, so the malware can be represented more accurately. Performing similar malware search on the basis of the structural vector and the content vector can therefore effectively improve the accuracy of the search.

Step 104, taking the specified quantity of malicious software with highest similarity between the corresponding vector and the vector corresponding to the malicious software to be queried as recommended malicious software similar to the malicious software to be queried.

In this step, a specified number (for example, N) of malware having the highest similarity to the malware vector to be queried may be used as recommended malware similar to the malware to be queried.

In one possible implementation, in this step, the similarity between each of the specified number of most similar pieces of malware and the malware to be queried may further be updated according to a threat intelligence knowledge graph, the threat intelligence knowledge graph being constructed from specified data; then, according to the updated similarity, a set number (for example, K) of pieces of malware with the highest similarity are taken as the recommended malware similar to the malware to be queried.

The above embodiment will be described by way of a specific example. The embodiment of the invention further provides a similar malicious software recommending method, and the step flow of the method can be as shown in fig. 2, and the method comprises the following steps:

First, a malicious software ontology structure is constructed.

In this embodiment, a hierarchical and practical malware ontology structure may first be designed according to MAEC, STIX and ATT&CK in combination with actual service requirements, so as to define the heterogeneous network structure of malware and describe the concepts and entities related to malware analysis and the relationships between them.

In this embodiment, the malware ontology structure may be as shown in fig. 3, where each node represents a class of entity related to malware and each edge represents a relationship between entities. Fig. 3 includes 13 classes of entities and 12 classes of relationships. The meaning and code of the 13 classes of entities may be as shown in Table 1.

TABLE 1

Entity type	Meaning	Code
malware-instance	Malware instance	MI
malware-family	Malware family	MF
capability	Malware capability	C
behavior	Malware behavior	B
technique	Attack technique	T
tactic	Attack tactic	TA
ip	IPv4, IPv6	I
domain-name	Domain name	D
url	URL	U
file	File	F
api-call-sequence	API call function sequence	A
mutex	Mutex	M
process	Process	P

The 12 classes of relationships may include: belongs to (belong-to), associated Internet Protocol address (contact-ip), associated uniform resource locator (contact-url), associated domain name (contact-domain), has (has), creates (create), metadata (metadata), related to (related-to), resolves to (related-to), contains (contact-of), similarity (similarity), and child element (child-of).

And secondly, generating a malicious software heterogeneous information network.

After the malware ontology structure is constructed, in this step, actual service data sets produced by malware analysis may be collected. The collected data may include static analysis data of the malware (API call sequences, threat types, file types, etc.) and dynamic analysis data of the malware (created processes, created files, contacted IPs, contacted domain names, etc.).

In one possible implementation, the collected data may be preprocessed, for example, by special character filtering, and from the preprocessed data, relationships between entities may be obtained based on various entities and relationships included in a predefined malware ontology structure, and various attribute information of the entities may be obtained.

Furthermore, based on a pre-designed one-meaning-to-many-words mapping vocabulary for the network security field, the obtained data may be expressed uniformly in the form of triples of entity, relation, entity and attribute value. The uniformly expressed data may then be imported into the open-source HugeGraph graph database to generate the malware heterogeneous information network.
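As a purely illustrative, non-limiting sketch, the uniform triple representation can be pictured with plain Python data structures. The entity IDs, the attribute names and the <entity, attribute, value> shape of the attribute triples are assumptions made for this example, and an in-memory dictionary stands in for the graph database; no graph database API is used or implied.

```python
from collections import defaultdict

# Hypothetical sample data; IDs and attribute names are illustrative only.
RELATION_TRIPLES = [
    ("mi-1", "contact-ip", "ip-1"),
    ("mi-2", "contact-ip", "ip-1"),
    ("mi-1", "belong-to", "mf-1"),
]
ATTRIBUTE_TRIPLES = [
    ("mi-1", "file_type", "exe"),
    ("ip-1", "blacklisted", True),
]


def build_network(relation_triples, attribute_triples):
    """Return (adjacency, attributes) for a small in-memory network."""
    adjacency = defaultdict(set)
    attributes = defaultdict(dict)
    for head, relation, tail in relation_triples:
        adjacency[head].add((relation, tail))
        adjacency[tail].add((relation, head))  # allow traversal in both directions
    for entity, name, value in attribute_triples:
        attributes[entity][name] = value
    return adjacency, attributes


adjacency, attributes = build_network(RELATION_TRIPLES, ATTRIBUTE_TRIPLES)
```

In this sketch, two malware instances that contact the same IP become two-hop neighbours through the shared `ip-1` node.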

And thirdly, designing a multi-view meta-path to generate a structural vector corresponding to the malicious software.

A meta-path represents a compound relationship between a source entity and a target entity in the information network structure. For example, a meta-path that leads from one malware instance through an IP entity to another malware instance may indicate that two malware instances communicate with the same IP.

In view of the potential associations that may exist among multiple meta-paths, this embodiment proposes the concept of a multi-view meta-path: a meta-path definition criterion may be specified, and meta-paths are defined from at least two views according to that criterion, each view corresponding to a group of meta-paths.

By combining meta-paths under different views, potential associations among multiple meta-paths can be established. Each view may be designed according to actual requirements. For example, when the number of hops from the source entity to the target entity is taken as the meta-path definition criterion, k-hop view meta-paths (k = 1, 2, 3, ...) can be designed, where each view represents a number of hops from the source entity to the target entity, such as 1 hop, 2 hops, 3 hops and above. As another example, when the functional meaning expressed by the path is taken as the meta-path definition criterion, k-function view meta-paths (k = 1, 2, 3, ...) can be designed, where each view represents a function type, such as meta-paths of a communication-type view or meta-paths of an attack-operation-type view in the malware field.

On the basis of the generated malware heterogeneous information network, this embodiment takes the number of hops from the source entity to the target entity as the meta-path definition criterion and describes the meta-path definition using 4 views as an example: 2 hops, 3 hops, 4 hops, and 5 hops and above. For each view, multiple meta-paths may be defined, and the head and tail nodes of each meta-path are both malware-instance entities.
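By way of illustration only, grouping meta-paths into hop-count views might be sketched as follows. The meta-paths listed here are hypothetical examples written with the entity codes of Table 1; they are not the meta-paths of this embodiment, and the view names are likewise assumptions.

```python
from collections import defaultdict

# Hypothetical meta-paths, written as sequences of entity codes from Table 1.
META_PATHS = [
    ("MI", "I", "MI"),
    ("MI", "D", "MI"),
    ("MI", "D", "I", "MI"),
    ("MI", "F", "MI", "F", "MI"),
    ("MI", "D", "I", "D", "I", "MI"),
]


def view_name(meta_path):
    """Name the view by hop count; 5 hops and more share one view."""
    hops = len(meta_path) - 1
    return f"{hops}-hop" if hops < 5 else "5-hop-and-above"


def group_by_view(meta_paths):
    """Group meta-paths into views, checking the head and tail constraint."""
    views = defaultdict(list)
    for path in meta_paths:
        if path[0] != "MI" or path[-1] != "MI":
            raise ValueError(f"head and tail must be malware instances: {path}")
        if len(path) - 1 < 2:
            raise ValueError(f"a meta-path needs at least 2 hops: {path}")
        views[view_name(path)].append(path)
    return dict(views)


views = group_by_view(META_PATHS)
```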

In this embodiment, the meta-paths defined for the respective views may be as shown in Table 2. For the 2-hop view (2-hop-meta), 10 meta-paths may be defined; for the 3-hop view (3-hop-meta), 4 meta-paths may be defined; for the 4-hop view (4-hop-meta), 4 meta-paths may be defined; and meta-paths may likewise be defined for the 5-hop-and-above view (5-hop-meta).

TABLE 2


According to the meta-paths defined for each view as shown in Table 2, path walks may be performed in turn in the malware heterogeneous information network based on the meta-path sets of the 4 views, and the corresponding walk sequence data extracted. The walk sequence data may be stored in the form <entity, ..., entity> in the walk sequence file of the corresponding view, such as a TXT file, where the head and tail nodes of each walk sequence are both malware-instance entities.
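As a non-limiting sketch of a meta-path-guided walk, the following example picks, at each step, a random neighbour whose entity type matches the next type in the meta-path. The toy network, the entity IDs, the number of walks per node and the use of uniform random choice are all assumptions made for illustration; the embodiment does not prescribe them.

```python
import random

# Hypothetical toy network: entity types and undirected neighbour lists.
ENTITY_TYPES = {"mi-1": "MI", "mi-2": "MI", "mi-3": "MI", "ip-1": "I", "d-1": "D"}
NEIGHBORS = {
    "mi-1": ["ip-1", "d-1"],
    "mi-2": ["ip-1"],
    "mi-3": ["d-1"],
    "ip-1": ["mi-1", "mi-2"],
    "d-1": ["mi-1", "mi-3"],
}


def meta_path_walk(start, meta_path, rng):
    """Follow the type sequence in meta_path; return None if the walk gets stuck."""
    if ENTITY_TYPES[start] != meta_path[0]:
        return None
    walk = [start]
    for wanted_type in meta_path[1:]:
        candidates = [n for n in NEIGHBORS[walk[-1]] if ENTITY_TYPES[n] == wanted_type]
        if not candidates:
            return None
        walk.append(rng.choice(candidates))
    return walk


def collect_walks(meta_path, walks_per_node=5, seed=0):
    """Start several walks from every entity whose type opens the meta-path."""
    rng = random.Random(seed)
    walks = []
    for node in sorted(ENTITY_TYPES):
        if ENTITY_TYPES[node] != meta_path[0]:
            continue
        for _ in range(walks_per_node):
            walk = meta_path_walk(node, meta_path, rng)
            if walk is not None:
                walks.append(walk)
    return walks


def format_walk(walk):
    """Render a walk in the <entity, ..., entity> form used for the sequence file."""
    return "<" + ",".join(str(entity) for entity in walk) + ">"


walks = collect_walks(("MI", "I", "MI"))
```

In this toy network `mi-3` has no IP neighbour, so it contributes no walk for the MI, I, MI meta-path, which mirrors the case where an instance does not satisfy a given meta-path.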

For example, a walk sequence along one meta-path can be expressed in terms of entity identifiers (IDs) as <802, 31505, 31805, 31926, 185>, and the entity corresponding to each ID may be as shown in Table 3:

TABLE 3


Since this embodiment defines meta-paths for 4 views (2-hop-metapaths, 3-hop-metapaths, 4-hop-metapaths, 5-hop-metapaths), walk sequence data can be obtained for 4 views.

After the walk sequence data is obtained, embedding learning may be performed separately on the walk sequence data of each of the 4 views using the Skip-gram model. Because some malware-instances may not satisfy every meta-path relationship, each malware-instance has between 0 and 4 sub-structure vectors, at most one per view.

For the sub-structure vectors under each view corresponding to each malware-instance, the sub-structure vectors may be fused in either of the following two ways to obtain the structural vector corresponding to each piece of malware:
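For illustration only, the way a Skip-gram model consumes a walk sequence can be sketched by generating its (center, context) training pairs. The window size is an assumed hyperparameter, and the training of the embedding weights themselves is not shown; in practice an existing embedding library would typically be used for that step.

```python
def skipgram_pairs(walk, window=2):
    """Return (center, context) pairs for every position within the window."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs


# The example walk sequence of entity IDs given in this embodiment.
pairs = skipgram_pairs([802, 31505, 31805, 31926, 185], window=1)
```

Entities that frequently co-occur within the window across many walks are pushed toward nearby positions in the embedding space, which is what lets malware instances sharing meta-path neighbours obtain similar sub-structure vectors.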

mode 1, directly splicing the sub-structure vectors according to a specified sequence, such as a sequence of 2-hop view angle, 3-hop view angle, 4-hop view angle and 5-hop view angle;

mode 2, the sub-structure vectors for the 4 views are weighted and summed.
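The two fusion modes can be sketched as follows. This is a minimal example under stated assumptions: the view names and weights are hypothetical, and filling a missing view with zeros is a choice made for this sketch, since the embodiment does not specify how an absent sub-structure vector is handled.

```python
VIEW_ORDER = ["2-hop", "3-hop", "4-hop", "5-hop-and-above"]


def fuse_by_concatenation(sub_vectors, dim):
    """Mode 1: concatenate sub-structure vectors in a fixed view order."""
    fused = []
    for view in VIEW_ORDER:
        # Assumption: a view with no sub-structure vector contributes zeros.
        fused.extend(sub_vectors.get(view, [0.0] * dim))
    return fused


def fuse_by_weighted_sum(sub_vectors, weights, dim):
    """Mode 2: weighted element-wise sum of the available sub-structure vectors."""
    fused = [0.0] * dim
    for view, vector in sub_vectors.items():
        for i, value in enumerate(vector):
            fused[i] += weights[view] * value
    return fused


# A malware instance that has sub-structure vectors for only two of the four views.
sub_vectors = {"2-hop": [1.0, 2.0], "4-hop": [3.0, 4.0]}
weights = {"2-hop": 0.5, "3-hop": 0.2, "4-hop": 0.2, "5-hop-and-above": 0.1}
concatenated = fuse_by_concatenation(sub_vectors, dim=2)
summed = fuse_by_weighted_sum(sub_vectors, weights, dim=2)
```

Concatenation keeps every view separately addressable at the cost of a vector four times as long, whereas the weighted sum keeps the original dimension but mixes the views.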

And fourthly, generating a content vector corresponding to the malicious software based on the feature engineering.

The third and fourth steps may be performed in any order. In the malware heterogeneous information network, a malware-instance does not exist in isolation; it is also affected by its neighboring entities. Although the structural vector based on multi-view meta-paths in this embodiment takes into account the structural influence of the neighboring entities of the malware-instance, it does not take into account the influence of the content features of those neighboring entities on the malware-instance.

Therefore, in this embodiment, when defining the content vector corresponding to the malware, not only the content features of the malware-instance itself (for example, threat type, severity and file type) may be considered, but also, based on the relationships between the entities corresponding to the malware stored in the malware heterogeneous information network, the content features of the neighboring entities of the malware-instance, such as whether it communicates with a blacklisted IP and the number of blacklisted IPs it communicates with. Using the content features of neighboring entities as part of the content vector enriches the semantic representation of the content vector.

The features corresponding to the content vector of a piece of malware may be understood to include two types, namely malware-instance features (the content features of the malware-instance itself) and neighboring entity features (the content features of the neighboring entities of the malware-instance). The feature dimensions corresponding to each type, the data type of each feature, and example values may be, but are not limited to, as shown in Table 4:

TABLE 4

In Table 4, SSDEEP is the fuzzy hash value corresponding to a malware-instance. In this embodiment, the ssdeep string may be split into two substrings based on the character ":", and the two substrings are then each used as a feature of the malware-instance type.

For example, for the ssdeep string

24576:rjh+w9Yz1Wspe9bOWFH2hoiMUx2p2pBFsRc5:nY9MWFHWWUJUoK7

the two corresponding substrings are:

rjh+w9Yz1Wspe9bOWFH2hoiMUx2p2pBFsRc5 and nY9MWFHWWUJUoK7.

By splitting the ssdeep character string, the accuracy of searching can be further improved when the similarity searching is performed based on the content vector.
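As a small illustrative sketch, the split can be written as below. An ssdeep value has the form block size, first hash, second hash, separated by ":", so the two substrings are the parts after the leading block size. The feature names and the sample value are hypothetical.

```python
def split_ssdeep(ssdeep):
    """Split an ssdeep value into its block size and its two hash substrings."""
    block_size, chunk, double_chunk = ssdeep.split(":", 2)
    return {
        "ssdeep_block_size": int(block_size),
        "ssdeep_chunk": chunk,                # used as one malware-instance feature
        "ssdeep_double_chunk": double_chunk,  # used as a second feature
    }


# Hypothetical ssdeep value, not taken from a real sample.
features = split_ssdeep("96:abcDEF123+xyz:GHI456/jk")
```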

After each feature representing the content vector is determined, in this embodiment, TF-IDF algorithm may be used to screen and combine each feature, and obtain, through vectorization, the content vector corresponding to the malware.
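A minimal sketch of TF-IDF vectorization over feature tokens is given below. The "name=value" token format, the sample features, and the particular TF-IDF variant (raw term frequency divided by document length, multiplied by the logarithm of the inverse document frequency, without smoothing) are assumptions made for this example; other variants of TF-IDF exist.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors), one TF-IDF vector per token list."""
    vocabulary = sorted({token for doc in documents for token in doc})
    n_docs = len(documents)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for doc in documents for token in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid division by zero for an empty document
        vectors.append([
            (counts[token] / length) * math.log(n_docs / doc_freq[token])
            for token in vocabulary
        ])
    return vocabulary, vectors


# Hypothetical feature tokens for three malware instances.
docs = [
    ["platform=windows", "threat_type=trojan", "file_type=exe", "has_blacklisted_ip=1"],
    ["platform=windows", "threat_type=trojan", "file_type=dll", "has_blacklisted_ip=0"],
    ["platform=windows", "threat_type=worm", "file_type=exe", "has_blacklisted_ip=1"],
]
vocabulary, vectors = tfidf_vectors(docs)
```

A token shared by every instance, such as `platform=windows` here, receives a weight of zero, which is how the weighting screens out features that do not help tell instances apart.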

And fifthly, determining vectors corresponding to the malicious software.

After the structural vector and the content vector have been obtained for each piece of malware, three vectors, namely the structural vector, the content vector, and the fusion vector obtained by concatenating the structural vector and the content vector, may be taken as the vectors corresponding to the malware and stored.

And sixthly, calculating the similarity based on the vector corresponding to the malicious software.

After the three vectors corresponding to each piece of malware are stored, a similarity search may be performed for a specified piece of malware using the cosine similarity algorithm on each of the three stored vectors, and the N (TOP-N) most similar pieces of malware obtained for each vector may be recommended.

Based on the saved structure vector, similarity search is performed on the specified malicious software by using a cosine similarity algorithm, so that the malicious software similar to the specified malicious software in structure can be obtained.

Based on the stored content vector, similarity search is carried out on the specified malicious software by utilizing a cosine similarity algorithm, so that the malicious software similar to the specified malicious software in content can be obtained.

Based on the stored fusion vector, similarity search is carried out on the specified malicious software by utilizing a cosine similarity algorithm, and the malicious software similar to the specified malicious software in structure and content can be obtained.
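The three searches can be sketched with a small cosine similarity TOP-N routine. The vectors below are made-up two-dimensional values chosen only to show that the structural, content and fusion searches can return different neighbours; an exhaustive scan is used here for simplicity, which is an assumption of this sketch rather than a requirement of the embodiment.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_similar(query_id, vectors, n):
    """Return the n (id, similarity) pairs most similar to the query, best first."""
    query = vectors[query_id]
    scored = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in vectors.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n]


# Hypothetical stored vectors for three malware instances.
STRUCTURE = {"mi-1": [1.0, 0.0], "mi-2": [0.9, 0.1], "mi-3": [0.0, 1.0]}
CONTENT = {"mi-1": [0.2, 0.8], "mi-2": [0.9, 0.1], "mi-3": [0.3, 0.7]}
# Fusion vector: the structural vector concatenated with the content vector.
FUSION = {key: STRUCTURE[key] + CONTENT[key] for key in STRUCTURE}

by_structure = top_n_similar("mi-1", STRUCTURE, 1)
by_content = top_n_similar("mi-1", CONTENT, 1)
by_fusion = top_n_similar("mi-1", FUSION, 2)
```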

And seventhly, recommending and optimizing based on the threat information knowledge graph.

However, it is difficult for the system to accurately model and recommend newly added malware whose analysis is not yet comprehensive. To strengthen deeper and longer-range associations between the specified malware and candidate malware (i.e., other malware), in this embodiment the threat intelligence knowledge graph may further be used for recommendation optimization to compensate for sparse or missing information, thereby improving the accuracy and diversity of the recommendation results to a certain extent.

In this embodiment, the threat intelligence knowledge graph may be constructed from specified data based on STIX. In the constructed threat intelligence knowledge graph, the top-level entity relationship graph may be as shown in fig. 4, where the entities include vulnerabilities, identities, threat organizations, malware, attack patterns and response schemes, and the relationships include contains, targets, uses, exploits and mitigates.

In the malware heterogeneous information network, the associations between malware entities that are considered lie at the behavior execution level, for example, two pieces of malware communicating with the same IP or domain name.

The threat intelligence knowledge graph provides deeper and longer-range knowledge associations, such as the path [malware A] -> [event report 1] -> [threat organization 1] -> [event report 2] -> [malware B], which leads to the recommendation of malware B. Such a path mines the threat events and threat organizations related to the malware from the threat intelligence perspective, compensates for sparse or missing information, and improves accuracy and diversity to a certain extent.
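Purely as an illustration, such a longer-range association can be found with a breadth-first search over the knowledge graph. The node names and edges below are hypothetical, the edges are treated as undirected, and breadth-first search is a choice made for this sketch; the embodiment does not prescribe a particular traversal.

```python
from collections import defaultdict, deque

# Hypothetical knowledge-graph edges, treated as undirected.
KG_EDGES = [
    ("malware-A", "report-1"),
    ("report-1", "threat-org-1"),
    ("threat-org-1", "report-2"),
    ("report-2", "malware-B"),
    ("malware-C", "report-3"),
]


def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adjacency = defaultdict(set)
    for left, right in edges:
        adjacency[left].add(right)
        adjacency[right].add(left)
    return adjacency


def shortest_path(adjacency, source, target):
    """Breadth-first search; returns the node list of a shortest path, or None."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


kg = build_adjacency(KG_EDGES)
path_a_to_b = shortest_path(kg, "malware-A", "malware-B")
path_a_to_c = shortest_path(kg, "malware-A", "malware-C")
```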

After the threat intelligence knowledge graph is constructed, in this step, the TOP-N similar malware obtained by the vector-based similarity search may be reordered using the threat intelligence knowledge graph. For example, a higher weight is given to malware that appears in the same threat event or is used by the same threat organization as the specified malware, the similarity of the TOP-N similar malware is recalculated, and finally the K (TOP-K) pieces of malware with the highest similarity are taken as the final recommendation result.
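A minimal sketch of this reordering is shown below. The additive boosts and their values are assumptions made for illustration; the embodiment states only that a higher weight is given, not how the weight is combined with the vector similarity.

```python
def rerank(candidates, shared_events, shared_orgs, k, event_boost=0.1, org_boost=0.2):
    """Boost TOP-N candidates linked to the query in the knowledge graph, keep TOP-K.

    candidates:    list of (malware_id, similarity) from the vector search
    shared_events: ids of malware seen in the same threat event as the query
    shared_orgs:   ids of malware used by the same threat organization as the query
    """
    rescored = []
    for malware_id, similarity in candidates:
        score = similarity
        if malware_id in shared_events:
            score += event_boost  # hypothetical additive weight
        if malware_id in shared_orgs:
            score += org_boost    # hypothetical additive weight
        rescored.append((malware_id, score))
    rescored.sort(key=lambda item: item[1], reverse=True)
    return rescored[:k]


# Hypothetical TOP-3 result from the vector search, reduced to TOP-2.
top_n = [("mi-2", 0.90), ("mi-3", 0.85), ("mi-4", 0.70)]
top_k = rerank(top_n, shared_events={"mi-4"}, shared_orgs={"mi-3", "mi-4"}, k=2)
```

Here `mi-2` has the highest vector similarity but no knowledge-graph link to the query, so it drops out of the TOP-2 once the boosts are applied.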

The embodiment provides a TOP-K similar malware search and recommendation method. For the malware heterogeneous information network, meta-paths are designed from different angles based on multiple views, and meta-path walk sequence data is extracted. An embedding model may then be used to learn the structural features of entities, feature engineering may be used to represent the content features of entities, and the structural and content features may be fused and applied to the similar malware search task. Finally, the search results are optimized using the threat intelligence knowledge graph to achieve similar malware recommendation.

In the scheme provided by the embodiment of the invention, considering that malware analysis data is fragmented and that analysis work is to some extent repetitive, a malware ontology structure may be constructed by combining MAEC, STIX and ATT&CK, the storage structure of malware analysis results is defined in a uniform and standardized format, and the data is stored as a graph in the HugeGraph graph database to obtain the malware heterogeneous information network.

A meta-path represents a composite association pattern between a source entity and a target entity. For a malware heterogeneous information network with a complex structure, the number of meta-paths is large, the semantics expressed by a single path are limited, and when the potential associations among meta-paths are ignored, richer structural semantic features of entities cannot be learned. For this reason, the concept of multi-view meta-paths is proposed: meta-paths can be designed from multiple views, each view being a combination of a series of meta-paths. Path walks can then be performed along the meta-paths of each view in the malware heterogeneous information network, and the corresponding walk sequences extracted and written into the storage file of the corresponding view.

Based on the concept of multi-view meta-paths, after the corresponding walk sequences have been extracted and written into the storage files of the corresponding views, the Skip-gram model may be used to learn entity structure vector representations from the meta-path walk sequence data under each view, giving the structure vector representation of each malware entity under each view. The structure vector representations under all views may then be fused by direct concatenation or weighted summation to obtain the final structure vector representation of the malware entity.

In the malware heterogeneous information network, an entity is not independent and is also influenced by its neighboring entities (its context). To enrich the semantic expression of an entity, the entity content features may include both the features of the entity itself and the features of its neighboring entities, and the TF-IDF algorithm may then be used for feature screening to obtain the content vector representation of the malware entity. Three forms, namely the entity structure vector, the entity content vector, and the concatenation of the entity structure vector and content vector, may be used as the final entity vector representations.

The TOP-N similar malware query may use a similarity algorithm to calculate the cosine similarity between entities in vector space under three similarity modes, namely structural similarity, content similarity, and combined structural and content similarity, so as to obtain the TOP-N candidate malware entities most similar to the specified malware entity under each of the three modes; these candidate malware entities are similar at the malware analysis level.

In addition, in the embodiment, recommendation optimization can be performed based on the threat intelligence knowledge graph. That is, to improve accuracy and diversity of similar search results, threat intelligence knowledge maps may be used to mine deeper and longer-range associations between malware entities, such as giving higher weight to candidate malware entities that have been used by the same threat organization and that have occurred in the same threat event as a given malware entity, then similarity of TOP-N candidate malware entities may be recalculated, and finally TOP-K malware entities may be taken therefrom as final recommendation results.

Corresponding to the provided method, the following apparatus is further provided.

An embodiment of the present invention provides a similar malware recommendation device, where the structure of the device may be as shown in fig. 5, and the device includes:

the determining module 11 is used for determining the malicious software to be queried;

the comparison module 12 is configured to, if it is determined that a first vector is stored, determine the similarity between the stored first vector and each stored second vector, where the first vector is the vector corresponding to the malware to be queried and each second vector is the vector corresponding to a piece of malware other than the malware to be queried;

The recommendation module 13 is configured to take, as recommended malware similar to the malware to be queried, a specified number of malware with the highest similarity between the corresponding second vector and the first vector;

Optionally, the recommendation module 13 is specifically configured to update the similarity between each of the specified number of malware with the highest similarity and the malware to be queried according to a threat intelligence knowledge graph, where the threat intelligence knowledge graph is constructed by using specified data;

The functions of the functional units of each device provided in the foregoing embodiments of the present invention may be implemented by the steps of the corresponding methods, so that the specific working process and the beneficial effects of each functional unit in each device provided in the embodiments of the present invention are not repeated herein.

Based on the same inventive concept, embodiments of the present invention provide the following apparatuses and media.

The embodiment of the invention provides similar malicious software recommending equipment, which can be structured as shown in fig. 6 and comprises a processor 21, a communication interface 22, a memory 23 and a communication bus 24, wherein the processor 21, the communication interface 22 and the memory 23 are communicated with each other through the communication bus 24;

The memory 23 is used for storing a computer program;

the processor 21 is configured to implement the steps described in the above method embodiments of the present invention when executing the program stored in the memory.

Optionally, the processor 21 may specifically include a central processing unit (CPU) or an application-specific integrated circuit (ASIC), may be one or more integrated circuits for controlling program execution, may be a hardware circuit developed using a field-programmable gate array (FPGA), or may be a baseband processor.

Optionally, the processor 21 may comprise at least one processing core.

Optionally, the memory 23 may include a read-only memory (ROM), a random access memory (RAM) and a disk memory. The memory 23 is used for storing data required by the operation of the at least one processor 21. The number of memories 23 may be one or more.

The embodiment of the invention also provides a non-volatile computer storage medium, which stores an executable program, and when the executable program is executed by a processor, the method provided by the embodiment of the method of the invention is realized.

In a specific implementation, the computer storage medium may include various storage media capable of storing program code, such as a universal serial bus flash drive (USB flash drive), a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.

In the embodiments of the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into modules or units is merely a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical or take other forms.

The functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be an independent physical module.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the technical solution of the embodiments of the present invention may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be, for example, a personal computer, a server, or a network device) or a processor to perform all or part of the steps of the method described in the embodiments of the present invention. The aforementioned storage medium includes: a universal serial bus flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk or an optical disk, or other various media capable of storing program code.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A method of similar malware recommendation, the method comprising:

Determining malicious software to be queried;

wherein the stored vector corresponding to each piece of malware is obtained from the analysis data corresponding to that malware stored on the basis of a malware heterogeneous information network, and the stored vector corresponding to each piece of malware comprises a structural vector; the analysis data corresponding to each piece of malware is stored on the basis of a predefined malware ontology structure comprising entities and relationships, the entities comprising malware instances, and comprises the obtained relationships between the entities corresponding to each piece of malware and the attribute information of each entity;

the structural vector corresponding to the malware is obtained by the following steps: for the relationships between the entities corresponding to each piece of malware stored on the malware heterogeneous information network, defining meta-paths from at least two views according to a specified meta-path definition criterion, wherein each view corresponds to a group of meta-paths, the head and tail nodes of each meta-path are malware instances, and each view represents either a number of hops from a source entity to a target entity or a function type; determining, through path walks, the walk sequence data corresponding to each meta-path under each view from the relationships between the entities corresponding to each piece of malware stored on the malware heterogeneous information network; performing feature learning on the walk sequence data through an embedding model to obtain a sub-structure vector under each view corresponding to each piece of malware; and obtaining the structural vector corresponding to the malware according to the sub-structure vectors corresponding to the malware.

2. The method of claim 1, wherein each stored vector for malware further comprises a content vector or a fusion vector based on a concatenation of a structure vector and a content vector.

3. The method of claim 1, wherein a content vector corresponding to malware is obtained by:

4. A method according to any one of claims 1 to 3, wherein regarding a specified amount of malware with the highest similarity between the corresponding second vector and the first vector as recommended malware similar to the malware to be queried, the method comprises:

5. A similar malware recommendation device, the device comprising:

6. The apparatus of claim 5, wherein the recommendation module is specifically configured to update a similarity between each of the specified number of malware with highest similarity and the malware to be queried according to a threat intelligence knowledge graph, the threat intelligence knowledge graph being constructed using specified data;

7. A non-transitory computer storage medium storing an executable program that is executed by a processor to implement the method of any one of claims 1-4.

8. A similar malware recommendation device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface, the memory complete communication with each other through the communication bus;

the memory is used for storing a computer program;

the processor is configured to implement the method steps of any one of claims 1 to 4 when executing the program stored on the memory.