CN118133284A - Malicious detection method for third party component package in Maven warehouse - Google Patents

Info

Publication number
CN118133284A
Authority
CN
China
Prior art keywords
sensitive
graph
malicious
subgraph
component package
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410385472.7A
Other languages
Chinese (zh)
Inventor
黄诚
余泓豪
蒋书熠
赵建国
徐健斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Filing date
Publication date
Application filed by Sichuan University

Abstract

A malicious detection method for third party component packages in a Maven repository. First, a component package data set is constructed and a sensitive API call set is defined as the heuristic rule for subsequent subgraph extraction. Because a component package has no fixed entry method, the APIs a user could potentially call are identified by specific rules and used as the entry method set of the call graph generation algorithm, and a context-sensitive pointer analysis algorithm is applied to generate the call graph. Using the sensitive API call set as the heuristic rule, call subgraphs are extracted, the interprocedural control flow graph of each call subgraph is generated, sensitive data flows are analyzed, and sensitive behavior subgraphs are constructed. Vector representations of the nodes and edges of each sensitive behavior subgraph are obtained and used as input to a heterogeneous graph GCN model, which is trained and then used to detect malicious component packages. The invention strengthens the code representation of malicious behavior in component packages, reduces noise during model training, improves detection accuracy, and reduces the false alarm rate.

Description

Malicious detection method for third party component package in Maven warehouse
Technical Field
The application relates to the field of software security, in particular to a malicious detection method for a third party component package in a Maven warehouse.
Background
Rapid, efficient development of large software projects using reusable open source packages has become the mainstay of modern software development. With the relevant build tools, developers can conveniently use third party components and can publish their own components to a software repository, thereby forming a software supply chain. According to statistics, Java remains the preferred programming language for enterprise software development and accounts for nearly 60% of the software projects independently developed by enterprises in China. Maven is the package ecosystem for Java component packages, and both its number of open source projects and its growth rate rank near the top among mainstream package ecosystems. However, attackers can also inject malicious code into open source component packages, or publish malicious packages, and spread the malicious code to other components or projects through the dependency chains among components, which poses a serious security risk to the Java open source component ecosystem.
Malicious detection of third party component packages in Maven is a branch of the malicious code detection task. In a typical malicious code detection scenario, a given piece of software is matched against specific rules or classified by a pre-trained machine learning model, and the output indicates whether the software is malware. The process of malicious detection for Maven component packages is as follows: the Java bytecode files in the target component package set are parsed to obtain an intermediate representation of each component package, and a static program analysis algorithm is applied to the intermediate representation to obtain control flow and data flow representations of the code semantics; a feature embedding method then converts the code representation into a vectorized representation that serves as training data for a deep learning model; after training, the model is used to judge whether a component package is malicious.
Detection methods based on source code text features extract sensitive fields and text statistical features from the source code and apply machine learning classification. They can achieve detection, but they scale poorly, generalize poorly, and depend heavily on the decompilation engine. Methods based on runtime features execute the program in a controlled environment to obtain its dynamic behavior features and its effects on the system environment, but they are difficult and costly to implement. Methods that combine static program analysis with deep learning are currently the mainstream approach in malicious code detection research, including malicious component package detection. However, existing methods still characterize the semantics of malicious code inaccurately and perform static analysis at too coarse a granularity. For example, some existing work converts the entire component package into an interprocedural control flow graph without considering the semantic features of the malicious code separately, which introduces a large amount of noise into deep learning model training; other work lacks fine-grained analysis of the data flows of sensitive operations in the component package, so the semantic characterization of the code is insufficient and it is difficult to train a robust model.
Disclosure of Invention
In view of the above, the invention provides a malicious detection method for third party component packages in a Maven repository, which aims to detect potentially malicious component packages in the Maven repository in real time and to help practitioners in the network security field learn of software supply chain attack threats promptly.
A method of malicious detection for a third party component package in a Maven repository, the method comprising:
Step 1: collecting Java component packages in a Maven central warehouse as benign samples, collecting malicious JAR packages from a main stream malicious software sample library as malicious samples, and constructing a component package data set;
Step 2: setting a sensitive API call set conforming to the characteristics of Java programs and open source component packages;
Step 3: generating a component package method call graph, and performing call subgraph division based on sensitive API calls;
Step 4: performing interprocedural control flow analysis and sensitive data flow analysis to obtain a sensitive behavior subgraph;
Step 5: acquiring the vector representation of the nodes and the vector representation of the edges in the sensitive behavior subgraph, which together form the vector representation of the sensitive behavior subgraph;
Step 6: training a heterogeneous graph neural network model with the sensitive behavior subgraph vector representations of the component package samples as input, and performing whole-graph classification to obtain a malicious detection result.
Preferably, in step 1, the component package data set construction procedure includes:
A web crawler is used to crawl the third party component packages in the top 150 popular categories of the Maven central repository, selecting the latest version of each component package; malicious JAR packages are collected as malicious samples from mainstream malware sample libraries, including VirusShare and VirusTotal, from the IT question-and-answer community StackOverflow, and from the code hosting platforms GitHub and Gitee.
Preferably, in step 2, the Java sensitive API call set construction procedure includes:
A Java sensitive API call set is defined according to the official Java documentation, existing literature, technical reports and malicious code samples; further, according to the different sensitive behaviors, API calls in Java are classified into categories such as data transmission, script execution, command execution, file operation, environment information reading, and encoding, encryption and decryption; these sensitive APIs are collected into a set $S$, in which each API is represented by its method signature.
Preferably, in step 3, the component package call subgraph division procedure includes:
For a given component package, a static program analysis framework applicable to the Java language is used to parse the package and generate an intermediate representation (IR);
The IR of the component package is analyzed to obtain the association information of all methods and classes in the component package: for a class, its access modifier, parent class, static variables, static methods and class signature; for a method, its access modifier, formal parameter types, return value type, method signature, formal parameter names and method body;
Potential entry methods of the component package are identified. Let $m$ be a method and $c$ the class to which it belongs:
Condition 1. The access modifier of $m$ is public or protected;
Condition 2. The access modifier of $c$ is public, or $c$ has a subclass $c'$ whose access modifier is public and in which no method overrides $m$;
If $m$ satisfies condition 1 and condition 2 simultaneously, it is called a user-callable method, denoted $m_{u}$;
Condition 3. $m$ is a static initializer and the class $c$ to which it belongs is public;
Condition 4. $m$ is a user-callable method $m_{u}$, and it is a static method or its class $c$ can be instantiated;
If condition 3 or condition 4 is satisfied, the method $m$ is a potential entry method, denoted $m_{e}$;
The potential entry method discrimination rules are applied to all methods in the component package to screen out the potential entry methods $m_{e}$, which serve as the entry node set $M_{e}$ of the call graph generation algorithm;
A pointer analysis algorithm is applied to the IR of the component package to generate a pointer flow graph that represents how pointers flow in the program; 2-call-site sensitivity (2-CFA) is used as the context sensitivity strategy, with the two most recent call sites serving as the context information to distinguish points-to relations under different contexts;
During pointer analysis, the method call relations are obtained from the analysis results and a method call graph MCG is constructed: $G = (V, E)$, where $V$ is the set of reachable method nodes in the component package and $E$ is the set of method node pairs that have a calling relation, each pair being represented by an edge between nodes in the method call graph;
Using the Java sensitive API call set $S$ described in step 2, for each sensitive API call $s$ in the set, regular expression matching on its method signature is performed against the reachable method node set $V$ of $G$; if a method node $v_s$ is matched, the k-hop induced subgraph $G_k$ of $G$ around $v_s$ is obtained;
$G_k$ is pruned: first, the nodes that cannot reach $v_s$ along call edges are deleted; then, among the remaining nodes, those not connected to $v_s$ are deleted; the processed result is used as the sensitive call subgraph $G_s$.
Preferably, in step 4, the sensitive behavior subgraph construction procedure includes:
Using the sensitive call subgraph constructed in step 3, the control flow graph (CFG) of each developer-defined method is obtained, and the sensitive call subgraph is expanded into an interprocedural control flow graph (ICFG);
Reaching definition analysis is performed on each CFG; if a def-use chain or use-def chain exists between basic blocks, a data dependence edge exists between them, and the data dependence graph (DDG) of each developer-defined method is constructed accordingly;
Using the sensitive API call set of step 2, the basic blocks $b_s$ that contain sensitive API calls are located in the data dependence graph by regular expression matching on method sub-signatures; with $b_s$ as the start node, the data dependence graph is traversed breadth-first, and all data dependence edges reachable from the start node are added to the CFG as sensitive data flow edges;
By performing data flow analysis on the sensitive API calls and collecting sensitive behavior information, a sensitive behavior subgraph with stronger code semantic characterization capability is obtained; the sensitive behavior subgraph consists of basic blocks and edges, where the basic blocks carry the information of method calls and of statements inside methods; there are 5 types of edges: the ordinary edges of the intraprocedural control flow graph, the call-to-return edges connecting a call site to its return site, the call edges and return edges that characterize interprocedural control flow, and the sensitive data flow edges.
Preferably, in step 5, the vector characterization extraction procedure of the sensitive behavior subgraph includes:
Based on the sensitive behavior subgraph of step 4, the basic blocks in the subgraph are preprocessed so that irrelevant tokens do not affect Doc2Vec model training: the line number of each statement in the IR generated by the static analysis framework and its index in the control flow graph are removed;
Fully qualified class names are attached to the variables and parameters in each basic block, so that the basic block carries more code semantics; a Doc2Vec model is trained with the processed basic blocks as the corpus, and the vector representation of each subgraph node is obtained;
Based on the sensitive behavior subgraph of step 4, the vector representation of the edges in the subgraph is obtained by one-hot encoding; the relation between nodes is embedded as a 5-dimensional vector, in which the 1st dimension indicates whether a sensitive data flow exists between the nodes (1 if it exists, 0 otherwise) and the 2nd to 5th dimensions represent the 4 types of edge relations in the interprocedural control flow graph;
The sensitive behavior subgraph is a heterogeneous graph, and the node feature vectors and edge feature vectors together form its vector representation.
Preferably, in step 6, the training procedure of the component package malicious detection model includes:
The vector characterizations of all sensitive behavior subgraphs of a single sample in the component package data set form a vector characterization set; every vector characterization in the set of a benign sample is labeled 0, and every vector characterization in the set of a malicious sample is labeled 1;
The labeled vector characterization sets are divided 5-fold: 80% of the data is used as the training set of the heterogeneous graph convolutional network model, and 20% is used as its test set and validation set;
Sensitive behavior subgraph vector representations of different dimensions are assembled into graph batches, where each graph batch has the same dimension;
The batched graphs are input into the heterogeneous graph GCN, which predicts by whole-graph classification whether each sensitive behavior subgraph is a malicious behavior subgraph;
A component package whose number of malicious behavior subgraphs is greater than or equal to a threshold is determined to be a malicious component package;
Accuracy, recall and false positive rate are used as evaluation metrics to assess the detection performance of the model.
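As a minimal sketch of the evaluation step, the three metrics can be computed from subgraph-level labels and predictions as follows. The function names and the `threshold` parameter are illustrative assumptions; the threshold value for the package-level decision is not preserved in this text, so it is left as a required argument, and counting subgraphs is itself an assumed reading of the rule.

```python
def evaluate(y_true, y_pred):
    """Accuracy, recall and false positive rate for binary labels (1 = malicious)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


def is_malicious_package(subgraph_preds, threshold):
    """Assumed rule: flag the package when the number of subgraphs predicted
    malicious is at least `threshold` (hypothetical parameter)."""
    return sum(subgraph_preds) >= threshold
```

Keeping the subgraph-level metrics separate from the package-level decision makes it possible to tune the threshold without retraining the classifier.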
The application provides a malicious detection method for third party component packages in a Maven repository, which comprises the following: first, a sensitive API call set is collected according to the characteristics of Java programs and open source components and used as the heuristic rule for subsequent subgraph extraction; according to the characteristics of Maven component packages, the APIs a user could potentially call are identified and used as the entry method set of the call graph generation algorithm, and a pointer analysis algorithm with a context sensitivity strategy is applied to call graph generation to obtain a more accurate method call graph; call subgraphs are extracted using the sensitive API call set as the heuristic rule, eliminating program semantics unrelated to sensitive behavior; an interprocedural control flow graph is generated for each call subgraph, and sensitive data flows are analyzed at the same time to construct sensitive behavior subgraphs; vector representations of the nodes and edges of the sensitive behavior subgraphs are obtained and used as input to a heterogeneous graph GCN model for model training. Compared with the prior art, the application has the following beneficial effects: the code semantics of the component package are obtained by static program analysis, and subgraph division is centered on the sensitive API calls in those semantics, which reduces the noise that semantics unrelated to malicious behavior bring to the model; the sensitive data flow information within methods further strengthens the code semantic representation of the subgraphs, supporting high accuracy and a low false alarm rate for the detection model.
Drawings
In order to more clearly illustrate this embodiment or the technical solutions of the prior art, the drawings that are required for the description of the embodiment or the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a system framework diagram of the Maven repository component package malicious detection model provided by an embodiment of the application;
FIG. 2 is a schematic diagram of decompiled code of a malicious sample component package according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an intermediate representation of a malicious sample component package provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a sensitive behavior subgraph according to an embodiment of the present application.
Detailed Description
Specific embodiments of the invention are described in detail below with reference to the accompanying drawings. The following examples and figures illustrate the invention and are not intended to limit its scope.
Referring to FIG. 1, FIG. 1 is a flowchart of a method for detecting the maliciousness of a component package in a Maven repository according to an embodiment of the present application, including:
Step 1: collecting Java component packages in the Maven central repository as benign samples, collecting malicious JAR packages from mainstream malware sample libraries as malicious samples, and constructing a component package data set.
Step 2: setting a sensitive API call set that conforms to the characteristics of Java programs and open source component packages.
Step 3: generating a component package method call graph, and performing call subgraph division based on sensitive API calls.
Step 4: performing interprocedural control flow analysis and sensitive data flow analysis to obtain a sensitive behavior subgraph.
Step 5: obtaining the vector representation of the nodes and the vector representation of the edges in the sensitive behavior subgraph, which together form the vector representation of the sensitive behavior subgraph.
Step 6: training a heterogeneous graph neural network model with the sensitive behavior subgraph vector representations of the component package samples as input, and performing whole-graph classification to obtain a malicious detection result.
For the construction of a component package data set, the specific steps include:
Step 1a: in this embodiment of the invention, the 150 most popular package categories, such as 'Test Frameworks & Tools', 'Logging Frameworks' and 'Java Specifications', are selected from the Maven repository, and Android package categories, such as 'Android Package' and 'Android Platform', are removed; within each category, packages are ranked by popularity in the Maven repository, the top 150 packages are selected, and the latest version of each package is chosen; a web crawler is built to collect the selected component packages, which form the benign sample set.
Optionally, during sample collection, if a category in the Maven repository contains fewer than 150 component packages, the number collected from that category is adjusted to the actual situation, and the data that meets the requirements is selected as samples.
Step 1b: malicious Java package samples are collected from mainstream malicious sample sharing libraries, for example VirusShare and VirusTotal; in addition, malicious JAR packages are collected from the IT question-and-answer community StackOverflow and the code hosting platforms Gitee and GitHub; together these form the malicious sample set.
For the construction of a sensitive API call set, the specific steps include:
Step 2a: the types of malicious behavior in component packages are identified according to the official Java documentation, existing literature, technical reports and malicious code samples; specifically, the malicious behavior segments in malicious component packages cover script execution, command execution, data transmission, file operation, encoding, encryption and decryption, and environment information reading.
Step 2b: the related sensitive API calls are determined for each type of malicious behavior; illustratively, expressed as method signatures:
script execution APIs may include <groovy.lang.GroovyShell: java.lang.Object evaluate(java.lang.String scriptText)> and <javax.script.ScriptEngine: Object eval(String script)>;
command execution APIs may include <java.lang.Runtime: Runtime getRuntime()> and <java.lang.Runtime: Process exec(String|String[])>;
data transfer APIs may include <java.net.InetAddress: java.net.InetAddress getByAddress(byte[])> and <java.net.URLConnection: java.io.OutputStream getOutputStream()>;
the collected sensitive API calls are classified and organized by function, each API being expressed by its method signature, to form the sensitive API call set $S$.
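One possible way to hold such a set is a mapping from behavior category to regular expressions over method signatures. The sketch below is illustrative only: the dictionary layout, the `classify` helper, and the choice to write signatures with fully qualified types and no parameter names are assumptions, not the format used by the authors.

```python
import re

# Hypothetical layout: behavior category -> regexes over method signatures.
SENSITIVE_APIS = {
    "script_execution": [
        r"<groovy\.lang\.GroovyShell: java\.lang\.Object evaluate\(java\.lang\.String\)>",
        r"<javax\.script\.ScriptEngine: java\.lang\.Object eval\(java\.lang\.String\)>",
    ],
    "command_execution": [
        r"<java\.lang\.Runtime: java\.lang\.Runtime getRuntime\(\)>",
        # Covers both exec(String) and exec(String[]).
        r"<java\.lang\.Runtime: java\.lang\.Process exec\(java\.lang\.String(\[\])?\)>",
    ],
    "data_transfer": [
        r"<java\.net\.InetAddress: java\.net\.InetAddress getByAddress\(byte\[\]\)>",
        r"<java\.net\.URLConnection: java\.io\.OutputStream getOutputStream\(\)>",
    ],
}

COMPILED = {
    category: [re.compile(pattern) for pattern in patterns]
    for category, patterns in SENSITIVE_APIS.items()
}


def classify(signature):
    """Return the sensitive behavior category of a method signature, or None."""
    for category, patterns in COMPILED.items():
        if any(p.fullmatch(signature) for p in patterns):
            return category
    return None
```

Using `fullmatch` keeps a pattern from accidentally matching a longer, unrelated signature that merely contains a sensitive one as a substring.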
For the call subgraph division of the component package, the specific steps include:
Step 3a: for a given component package sample, the package is parsed with the publicly available Tai-e static program analysis framework, which converts the package from a JAR file into an intermediate representation IR (Intermediate Representation); FIG. 2 shows the decompiled code of the malicious sample in this embodiment and FIG. 3 shows its intermediate representation, in which each class file in the JAR package is converted into a tir file.
Step 3b: Tai-e is used to analyze the converted tir files and obtain the association information of all methods and classes in the component package: for a class, its access modifier, parent class, static variables, static methods and class signature; for a method, its access modifier, formal parameter types, return value type, method signature, formal parameter names and method body; based on this class and method information, the potential entry methods of the component package are searched for. Let $m$ be a method and $c$ the class to which it belongs; for $m$ to be an entry method, the following criteria must be met:
Condition 1. The access modifier of $m$ is public or protected;
Condition 2. The access modifier of $c$ is public, or $c$ has a subclass $c'$ whose access modifier is public and in which no method overrides $m$;
If $m$ satisfies condition 1 and condition 2 simultaneously, it is called a user-callable method, denoted $m_{u}$;
Condition 3. $m$ is a static initializer and the class $c$ to which it belongs is public;
Condition 4. $m$ is a user-callable method $m_{u}$, and it is a static method or its class $c$ can be instantiated;
If condition 3 or condition 4 is satisfied, the method $m$ is a potential entry method, denoted $m_{e}$;
The entry method discrimination rules are applied to all methods in the component package to screen out all potential entry methods $m_{e}$, which form the entry node set required by the method call graph MCG (Method Call Graph) construction algorithm.
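The four conditions can be read as two small predicates. The sketch below is a hedged illustration: `ClassInfo`, `MethodInfo` and their fields are hypothetical stand-ins for the class and method information extracted from the IR, and only the subclasses listed on a class are checked.

```python
from dataclasses import dataclass, field


@dataclass
class ClassInfo:
    name: str
    is_public: bool
    instantiable: bool = True                      # e.g. not abstract
    subclasses: list = field(default_factory=list)
    declared: set = field(default_factory=set)     # sub-signatures declared here


@dataclass
class MethodInfo:
    subsig: str
    owner: ClassInfo
    modifier: str                                  # "public", "protected", ...
    is_static: bool = False
    is_static_initializer: bool = False


def is_user_callable(m):
    # Condition 1: the method itself is public or protected.
    cond1 = m.modifier in ("public", "protected")
    # Condition 2: the owner is public, or a public subclass inherits m unchanged.
    cond2 = m.owner.is_public or any(
        s.is_public and m.subsig not in s.declared for s in m.owner.subclasses
    )
    return cond1 and cond2


def is_potential_entry(m):
    # Condition 3: static initializer of a public class.
    cond3 = m.is_static_initializer and m.owner.is_public
    # Condition 4: user-callable, and static or owned by an instantiable class.
    cond4 = is_user_callable(m) and (m.is_static or m.owner.instantiable)
    return cond3 or cond4
```

Filtering every method of a package with `is_potential_entry` would then give the entry node set handed to call graph construction.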
Step 3c: Tai-e is used to analyze the tir files converted from the component package; with the potential entry method set of step 3b as the entry of pointer analysis (Pointer Analysis), a pointer flow graph PFG (Pointer Flow Graph) is generated to represent how pointers flow in the program and to provide the points-to information, from pointers to concrete objects, needed to construct the method call graph; 2-call-site sensitivity (2-CFA) is used as the context sensitivity strategy, with the two most recent call sites as context information to distinguish points-to relations under different contexts, which improves the precision of pointer analysis and yields a pointer flow graph that contains context information.
Optionally, if call-site-sensitive analysis cannot meet the precision requirement during pointer analysis, an object-sensitive analysis strategy can be selected, which gives the best precision at some cost in efficiency; a type-sensitive analysis strategy gives the best efficiency, with precision between that of call-site sensitivity and object sensitivity.
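To make the role of the call-site context concrete, the following toy sketch keeps the last k call sites as the context and keys points-to facts by (context, variable). It is not Tai-e's implementation; `extend_context`, `record`, the call-site labels and the helper method `id` are hypothetical names used only for illustration.

```python
def extend_context(caller_ctx, call_site, k=2):
    """Append call_site to the caller's context and keep only the last k sites."""
    if k == 0:
        return ()
    return (caller_ctx + (call_site,))[-k:]


points_to = {}


def record(ctx, var, obj):
    """Store a points-to fact for a variable under a specific context."""
    points_to.setdefault((ctx, var), set()).add(obj)


# Two calls to a hypothetical helper id(x) from different call sites.
ctx_a = extend_context(("Main.main:3",), "Main.run:10")
ctx_b = extend_context(("Main.main:3",), "Main.run:11")
record(ctx_a, "id.x", "o1")
record(ctx_b, "id.x", "o2")
```

Because the two contexts differ, the parameter of `id` points to `o1` in one context and `o2` in the other instead of being merged into a single set, which is the precision gain context sensitivity aims for.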
Step 3d: with the pointer flow graph of step 3c as the basic information of the call graph generation algorithm, the method call graph $G = (V, E)$ of the component package is constructed, where $V$ is the set of reachable method nodes in the component package and $E$ is the set of method node pairs that have a calling relation, each pair being represented in $G$ by an edge between nodes.
Step 3e: using the Java sensitive API call set $S$ described in step 2b, for each sensitive API $s$ in the set, regular expression matching on its method signature is performed against the method node set $V$ of $G$; if a method node $v_s$ is matched, the nodes of $G$ whose distance from $v_s$ does not exceed k are collected, and these nodes together with the call edges between them form the k-hop induced subgraph $G_k$ of $G$; k can be set to several values, and the optimal k is determined by comparing the performance of the subsequent detection model under different values of k.
Step 3f: the subgraph $G_k$ of step 3e is further pruned; because $G_k$ is a directed graph, the nodes of $G_k$ that cannot reach $v_s$ along call edges are deleted together with their call edges; afterwards, the nodes that are no longer connected to $v_s$ and their call edges are deleted, and the processed result is used as the sensitive call subgraph $G_s$.
Optionally, if a sensitive API $s$ already exists in a previously divided sensitive call subgraph $G_s$, then $s$ is not used for subsequent node searching and subgraph division.
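Steps 3e and 3f can be sketched with two breadth-first searches over an edge list. This is a simplified illustration under stated assumptions: distance is measured ignoring edge direction, and the pruning keeps exactly the nodes that can still reach the matched node along call edges inside the induced subgraph, which covers both pruning passes in this toy setting. The function and node names are hypothetical.

```python
from collections import deque


def k_hop_nodes(edges, v, k):
    """Nodes whose undirected distance to v in the call graph is at most k."""
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond k hops
        for w in nbrs.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)


def sensitive_call_subgraph(edges, v, k):
    """k-hop induced subgraph around v, pruned to the nodes that can reach v."""
    nodes = k_hop_nodes(edges, v, k)
    induced = [(a, b) for a, b in edges if a in nodes and b in nodes]
    preds = {}
    for a, b in induced:
        preds.setdefault(b, set()).add(a)
    keep = {v}
    queue = deque([v])
    while queue:  # backward search along call edges
        u = queue.popleft()
        for p in preds.get(u, ()):
            if p not in keep:
                keep.add(p)
                queue.append(p)
    return keep, [(a, b) for a, b in induced if a in keep and b in keep]


call_edges = [("entry", "load"), ("load", "exec"), ("load", "log"),
              ("init", "entry"), ("far", "init")]
kept_nodes, kept_edges = sensitive_call_subgraph(call_edges, "exec", 2)
```

In this example `log` lies within two hops of `exec` but cannot reach it, so it is pruned, while `init` and `far` fall outside the 2-hop neighborhood.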
For the construction of the sensitive behavior subgraph, the specific steps include:
Step 4a: based on the sensitive call subgraph of step 3f, Tai-e is used to analyze each developer-defined method in the subgraph and obtain its control flow graph CFG (Control Flow Graph), and the call subgraph is expanded into an interprocedural control flow graph ICFG (Interprocedural Control Flow Graph).
Step 4b: reaching definition analysis (Reaching Definition Analysis) is performed on each CFG. For a variable $x$, if basic block $b_i$ assigns a value to $x$, basic block $b_j$ passes $x$ to a method call as an actual argument, and there is a path from $b_i$ to $b_j$ on which $x$ is not redefined, then a definition-use relation, i.e. a def-use chain, exists between $b_i$ and $b_j$; conversely, for the actual argument $x$, a use-definition relation, i.e. a use-def chain, exists between $b_j$ and $b_i$. After all basic blocks are analyzed, isolated nodes that have no data dependence on other basic blocks are removed, and the data dependence graph DDG (Data Dependence Graph) is built with the def-use chains and use-def chains as data dependence edges.
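A block-level reaching definition analysis of this kind can be sketched as a standard iterative data flow computation. The representation below is an assumption for illustration: each basic block is reduced to the set of variables it defines and the set it uses, and uses are taken to read the values that reach the start of the block.

```python
def def_use_edges(blocks, succ):
    """blocks: {block_id: (defined_vars, used_vars)}; succ: {block_id: [successors]}.
    Returns the set of (def_block, use_block, var) data dependence edges."""
    preds = {b: [] for b in blocks}
    for b, targets in succ.items():
        for t in targets:
            preds[t].append(b)
    out = {b: set() for b in blocks}   # OUT[b]: reaching (var, def_block) pairs
    in_ = {b: set() for b in blocks}
    changed = True
    while changed:                     # iterate to a fixed point
        changed = False
        for b, (defs, _uses) in blocks.items():
            in_[b] = set().union(*(out[p] for p in preds[b]))
            gen = {(x, b) for x in defs}
            survivors = {(x, d) for (x, d) in in_[b] if x not in defs}
            new_out = gen | survivors
            if new_out != out[b]:
                out[b] = new_out
                changed = True
    return {
        (d, b, x)
        for b, (_defs, uses) in blocks.items()
        for (x, d) in in_[b]
        if x in uses
    }


# b1 defines cmd; b2 redefines it; b3 and b4 use it as a call argument.
toy_blocks = {
    "b1": ({"cmd"}, set()),
    "b2": ({"cmd"}, set()),
    "b3": (set(), {"cmd"}),
    "b4": (set(), {"cmd"}),
}
toy_succ = {"b1": ["b2", "b3"], "b2": ["b4"]}
toy_edges = def_use_edges(toy_blocks, toy_succ)
```

Here the definition in `b1` reaches `b3` directly, but on the path through `b2` it is overwritten, so `b4` depends on `b2` rather than `b1`.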
Step 4c: using the sensitive API call set described in step 2b, the method call basic blocks $B_{call}$ in the DDG are found first, and regular expression matching is then used to search $B_{call}$ for the basic blocks $b_s$ whose target method sub-signature is that of a sensitive API $s$; with $b_s$ as the start node, the data dependence graph is traversed breadth-first, and all data dependence edges of the subgraph reachable from the start node are added to the CFG as sensitive data flow edges.
Step 4d: the nodes of the constructed sensitive behavior subgraph are basic blocks, i.e. the intermediate representation of code statements; the sensitive behavior subgraph contains 5 types of edges: the ordinary edges of the CFG, the call-to-return edges connecting a call site to its return site, the call edges and return edges that represent interprocedural control flow, and the sensitive data flow edges; by performing data flow analysis on the sensitive API calls and collecting sensitive behavior information, the sensitive behavior subgraph shown in FIG. 4 is obtained.
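The traversal of step 4c can be sketched as a breadth-first search seeded with the basic blocks that match a sensitive API. In this illustration the statement texts are made-up stand-ins for IR statements, and the data dependence graph is traversed in both directions because it holds both def-use and use-def chains; both points are assumptions.

```python
import re
from collections import deque


def sensitive_flow_edges(ddg_edges, block_text, api_pattern):
    """Data dependence edges reachable from blocks that call a sensitive API."""
    adj = {}
    for a, b in ddg_edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    starts = [b for b, text in block_text.items() if re.search(api_pattern, text)]
    seen = set(starts)
    queue = deque(starts)
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return [(a, b) for a, b in ddg_edges if a in seen and b in seen]


# Hypothetical statement texts; real IR statements will look different.
toy_text = {
    "b1": 'r1 = "calc.exe"',
    "b2": "r2 = invokestatic Runtime.getRuntime()",
    "b3": "invokevirtual r2.exec(r1)",
    "b4": "r5 = 1",
    "b5": "r6 = r5 + 1",
}
toy_ddg = [("b1", "b3"), ("b2", "b3"), ("b4", "b5")]
flow = sensitive_flow_edges(toy_ddg, toy_text, r"\bexec\(")
```

Only the dependences feeding the `exec` call are returned; the unrelated arithmetic in `b4` and `b5` stays out of the sensitive data flow edges.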
The vector characterization extraction of the sensitive behavior subgraph comprises the following specific steps:
Step 5a: so that irrelevant tokens do not affect Doc2Vec model training, the basic blocks in the sensitive behavior subgraph are preprocessed: the line number of each statement IR (i.e. basic block) generated by Tai-e and its index in the control flow graph are removed.
Step 5b: the types of variables and method parameters are hidden in the statement IR generated by Tai-e, where variables and parameters are named "r1", "r2", etc. in order of appearance; so that the basic block carries more code semantics, the fully qualified class name of each variable and parameter is attached to it in the basic block, for example "(java.lang.String) r1".
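Steps 5a and 5b amount to two text substitutions per statement. The sketch below assumes, for illustration only, that each IR statement starts with a prefix of the form `[index@Lline]` and that variable types are available in a dictionary; both the prefix format and the `preprocess` helper are assumptions rather than details given in the text.

```python
import re

# Assumed statement prefix: "[<index>@L<line>] "
PREFIX = re.compile(r"^\s*\[\d+@L-?\d+\]\s*")
VARIABLE = re.compile(r"\br\d+\b")


def preprocess(stmt, var_types):
    """Drop the index/line prefix and prepend known types to r<N> variables."""
    stmt = PREFIX.sub("", stmt)

    def annotate(match):
        name = match.group(0)
        type_name = var_types.get(name)
        return f"({type_name}) {name}" if type_name else name

    return VARIABLE.sub(annotate, stmt)


cleaned = preprocess(
    "[3@L12] r2 = invokevirtual r1.trim();",
    {"r1": "java.lang.String", "r2": "java.lang.String"},
)
```

Variables without a known type are left unchanged, so the function can be applied even when type information is incomplete.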
Step 5c: the basic blocks in the sensitive behavior subgraph represent intermediate representations of code sentences, which are regarded as a short document, and the basic blocks of all the sensitive behavior subgraphs extracted in the package data set are used for forming a corpus for training the Doc2Vec model.
Step 5d: the dimension of the node embedding vector is set to 20, and after the Doc2Vec model is trained, the vector characterization of each node is obtained from the model.
Step 5e: the vector characterization of the edges in the sensitive behavior subgraph is obtained using one-hot encoding, and the embedded representation of the relation between nodes is set as a 5-dimensional vector: the 1st dimension indicates whether a sensitive data flow exists between the nodes (1 if so, 0 otherwise); the 2nd to 5th dimensions characterize the type of the edge, with the dimension corresponding to the edge type set to 1 and the rest set to 0.
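The 5-dimensional relation vector of step 5e can be sketched as below. The order of the four control-flow edge types within dimensions 2 to 5 is not given in the text, so the order in `CONTROL_EDGE_TYPES`, like the names used, is an assumption.

```python
# Assumed order of dimensions 2-5; the source does not fix this order.
CONTROL_EDGE_TYPES = ("cfg", "call_to_return", "call", "return")

def encode_relation(control_type=None, has_sensitive_flow=False):
    """Return the 5-dimensional relation vector between two nodes."""
    vec = [0] * 5
    if has_sensitive_flow:
        vec[0] = 1  # dimension 1: sensitive data flow present
    if control_type is not None:
        # dimensions 2-5: one-hot over the control-flow edge types
        vec[1 + CONTROL_EDGE_TYPES.index(control_type)] = 1
    return vec

call_edge = encode_relation("call")
data_flow_only = encode_relation(None, has_sensitive_flow=True)
```

Keeping the data flow flag separate from the edge-type dimensions lets a pair of nodes carry both a control-flow edge and a sensitive data flow in one vector.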
For the implementation of component package maliciousness detection, the specific steps include:
Step 6a: the vector characterizations of all sensitive behavior subgraphs of benign samples are labeled 0 (normal behavior subgraphs), and those of all sensitive behavior subgraphs of malicious samples are labeled 1 (malicious behavior subgraphs); the malicious and normal behavior subgraphs together form the vector characterization data set; the data set is divided so that 80% of the data serves as the training set for subsequent model training, 10% as the test set, and 10% as the validation set.
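The labeling and 80/10/10 division of step 6a can be sketched as follows. Shuffling with a fixed seed before splitting is an assumption (the text does not say how samples are assigned to the three sets), and the function names are hypothetical.

```python
import random

def label_subgraphs(benign, malicious):
    """Attach label 0 to benign-sample subgraphs and 1 to malicious ones."""
    return [(g, 0) for g in benign] + [(g, 1) for g in malicious]

def split_dataset(samples, seed=0):
    """Split into 80% training, 10% test and 10% validation sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # assumed: random assignment
    n_train = len(items) * 8 // 10
    n_test = len(items) // 10
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    val = items[n_train + n_test:]  # remainder goes to validation
    return train, test, val

# Placeholder subgraph identifiers stand in for vector characterizations.
dataset = label_subgraphs([f"benign_{i}" for i in range(70)],
                          [f"malicious_{i}" for i in range(30)])
train_set, test_set, val_set = split_dataset(dataset)
```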
Step 6b: taking the vector characterization data set as input, several subgraph vector characterizations of different sizes are combined into one batch using the batching method of whole-graph classification, and training iterates over these graph batches; specifically, the batch size is set to 32 and the Adam optimizer is selected; the heterogeneous graph neural network model comprises:
using 3 graph convolution layers, in each graph convolution layer, regarding different types of edges and connected nodes in the vector representation as subgraphs, and respectively performing graph convolution calculation on subgraphs corresponding to 5 types of edges;
using ReLU as the activation function and sum as the aggregation function, averaging the graph convolution results of the 5 subgraphs, and using the result as the input of the next graph convolution layer, where the calculation involves, for each edge type, the adjacency matrix of the corresponding subgraph, the node feature vector matrix of the graph convolution layer, and that layer's weight matrix for the edge type;
using a graph readout layer to reduce the data dimension and obtain the graph embedding, specifically selecting mean readout as the readout mode, so that the representation of a single graph is computed as the average of its node representations;
using a 3-layer multi-layer perceptron to convert the graph embedding into a softmax output, from which it is judged whether a sensitive behavior subgraph is a malicious behavior subgraph;
after 100 rounds of training, a malicious behavior subgraph detection model is obtained.
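One way to write the model described above is given below. It is a hedged reconstruction from the prose, not the formula of the original filing: the symbols A_r (adjacency matrix of the subgraph for the r-th edge type), H^{(l)} (node feature matrix of the l-th graph convolution layer), W_r^{(l)} (weight matrix of the r-th edge type in the l-th layer), h_v^{(3)} (output feature of node v after the third layer), V (the node set of one subgraph) and h_G (the graph embedding) are chosen here as assumptions, and whether ReLU is applied before or after the averaging is not fixed by the text.

```latex
% Per-layer update: sum aggregation (unnormalized A_r), averaged over the 5 edge types
H^{(l+1)} = \frac{1}{5} \sum_{r=1}^{5} \mathrm{ReLU}\!\left( A_r \, H^{(l)} \, W_r^{(l)} \right),
\qquad l = 0, 1, 2

% Mean readout over the nodes of one sensitive behavior subgraph
h_G = \frac{1}{|V|} \sum_{v \in V} h_v^{(3)}

% Classification by a 3-layer perceptron with softmax output
\hat{y} = \mathrm{softmax}\!\left( \mathrm{MLP}(h_G) \right)
```

Here H^{(0)} holds the 20-dimensional Doc2Vec node vectors, and the two components of the softmax output correspond to the labels 0 (normal) and 1 (malicious).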
Optionally, to prevent model overfitting and improve generalization ability, the performance of the model on the validation set can be monitored during training and training stopped when the model performance is no longer improved; specifically, the loss of each round of the model on the validation set is monitored, and training is stopped in advance when the loss rises 5 times in succession, so as to obtain better model performance.
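The early stopping rule can be sketched as below. It assumes that "rises 5 times in succession" means the validation loss is higher than in the previous round for five consecutive rounds; a common alternative, not described in the text, is to count rounds without improvement over the best loss so far. The function name is hypothetical.

```python
def early_stopping_round(val_losses, patience=5):
    """Return the 1-based round at which training stops, or None.

    Training stops once the validation loss has risen relative to the
    previous round `patience` times in a row (assumed interpretation).
    """
    rises = 0
    for i in range(1, len(val_losses)):
        if val_losses[i] > val_losses[i - 1]:
            rises += 1
        else:
            rises = 0  # any non-rising round resets the counter
        if rises >= patience:
            return i + 1
    return None

# Illustrative loss values only.
losses = [1.0, 0.8, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.6]
stop_round = early_stopping_round(losses)
```

In this example the loss rises in rounds 4 to 8, so training stops at round 8 and the later value 0.6 is never computed in practice.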
Step 6c: the trained malicious behavior subgraph detection model is used to judge the maliciousness of every sensitive behavior subgraph in the component package to be detected, and the component package is judged to be a malicious component package if at least one of its sensitive behavior subgraphs is judged to be a malicious behavior subgraph.
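The package-level decision of step 6c reduces to an "any" over the per-subgraph predictions, as in this short sketch. Taking the predicted class as the index of the largest softmax component, and treating a package with no sensitive behavior subgraph as benign, are assumptions; the function names are hypothetical.

```python
def predict_label(softmax_output):
    """Index of the largest component: 0 = normal, 1 = malicious."""
    return max(range(len(softmax_output)), key=lambda i: softmax_output[i])

def is_malicious_package(subgraph_outputs):
    """A package is malicious if at least one subgraph is predicted malicious."""
    return any(predict_label(out) == 1 for out in subgraph_outputs)

# Illustrative softmax outputs for the subgraphs of two packages.
package_a = is_malicious_package([[0.9, 0.1], [0.2, 0.8]])
package_b = is_malicious_package([[0.9, 0.1], [0.7, 0.3]])
```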
It should be noted that, for simplicity of description, the above method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may be performed in other order or concurrently in accordance with the present application. Further, those skilled in the art will recognize that the embodiments described in the specification are presently preferred, and that the acts and processes involved are not necessarily required for the present application.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Finally, it should be noted that the foregoing is only a preferred embodiment of the present invention and does not limit it; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described therein or make equivalent replacements of some of their technical features.
Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A method for detecting the malicious property of a third party component package in a Maven warehouse, the method comprising:
Step 1: collecting Java component packages in a Maven central warehouse as benign samples, collecting malicious JAR packages from a main stream malicious software sample library as malicious samples, and constructing a component package data set;
step 2: setting a sensitive API call set conforming to the characteristics of Java programs and open source components;
Step 3: generating a package method call graph, and performing call sub-graph division based on sensitive API call;
Step 4: performing inter-procedural control flow analysis and sensitive data flow analysis to obtain a sensitive behavior subgraph;
Step 5: acquiring the vector representation of the nodes and the vector representation of the edges in the sensitive behavior subgraph, which together form the vector representation of the sensitive behavior subgraph;
Step 6: training a heterogeneous graph neural network model using the sensitive behavior subgraph vector representations of the component package samples as input, and performing whole-graph classification to obtain the maliciousness detection result.
2. The method for detecting the malicious property of the third party component package in the Maven warehouse according to claim 1, wherein in the step 1:
a web crawler is used to crawl the third party component packages in the top 150 popular categories of the Maven central warehouse, and the latest version of each component package is selected; malicious JAR packages are collected as malicious samples from mainstream malware sample libraries, including VirusShare and VirusTotal, from the IT question-answering community Stack Overflow, and from the code hosting platforms GitHub and Gitee.
3. The method for detecting the malicious property of the third party component package in the Maven warehouse according to claim 1, wherein in the step 2:
setting a Java sensitive API call set according to Java official documents, existing literature, technical reports and malicious code samples; further, according to the different sensitive behaviors, classifying the API calls in Java into categories such as data transmission, script execution, command execution, file operation, environment information reading, and encoding, encryption and decryption; these sensitive APIs are grouped into sets, where each API is represented by its method signature.
4. The method for detecting the malicious property of the third party component package in the Maven warehouse according to claim 1, wherein in the step 3:
using a static program analysis framework suitable for Java language to analyze the component package and then generating an intermediate language representation IR;
analyzing the IR of the component package, and acquiring the associated information of all methods and classes in the component package: for a class, acquiring its access modifier, parent class, static variables, static methods and class signature; for a method, acquiring its access modifier, formal parameter types, return value type, method signature, formal parameter names and internal method body;
setting a group of effective judging conditions by using a method based on package user available API analysis, screening all methods in the package, obtaining potential user available APIs therein, and using the APIs as an entry node set of a call graph generation algorithm;
analyzing the intermediate representation of the component package using a context-sensitive pointer analysis algorithm, obtaining the method call relations from the analysis result, and constructing the method call graph;
based on the sensitive API call set of step 2, for each sensitive API call in the set, matching its method signature by regular expression against the set of method nodes reachable in the method call graph, and, if a method node is matched, obtaining the induced subgraph within the set number of hops of that node;
pruning the induced subgraph by first deleting the nodes that cannot reach the matched method node along call edges and then deleting the non-connected nodes among the remaining nodes, and taking the result as the sensitive call subgraph.
5. The method for detecting the malicious property of the third party component package in the Maven warehouse according to claim 1, wherein in the step 4:
using the sensitive call subgraph, obtaining the control flow graph (CFG) of each developer-defined method, and expanding the sensitive call subgraph into an inter-procedural control flow graph (ICFG); performing reaching-definition analysis in the CFG, where a data dependency edge exists between basic blocks that are linked by a def-use chain or a use-def chain, thereby constructing the data dependency graph (DDG) of each developer-defined method;
using the sensitive API call set of step 2, searching the data dependency graph, by regular-expression matching on method signatures, for the basic blocks that contain sensitive API calls, and traversing the data dependency graph breadth-first from each such basic block as the start node to obtain its reachable subgraph;
adding all data dependency edges in the reachable subgraph to the CFG as sensitive data flow edges;
The basic blocks in ICFG are used as nodes, and the common edges, the call-return edges, the call edges, the return edges and the sensitive data stream edges of the CFG are used as the types of the edges to form a sensitive behavior subgraph.
6. The method for detecting the malicious property of the third party component package in the Maven warehouse according to claim 1, wherein in the step 5:
Preprocessing basic blocks in the sensitive behavior subgraph, and removing statement line numbers and CFG index information in the basic blocks;
adding fully qualified class names before the variables and parameters in a basic block so that they carry more code semantics;
Training a Doc2Vec model by using the processed basic blocks as a corpus, and obtaining vector representation of each sub-graph node;
representing the edges in the subgraph as vectors using one-hot encoding, wherein each dimension in the vector corresponds to one relationship type in the sensitive behavior subgraph;
the sensitive behavior subgraph is a heterogeneous graph, and its node feature vectors and edge feature vectors together form its vector characterization.
7. The method for detecting the malicious property of the third party component package in the Maven warehouse according to claim 1, wherein in the step 6:
combining sensitive behavior subgraph vector representations of different dimensions into graph batches, wherein each graph batch has the same dimension;
inputting the batched graphs into a heterogeneous graph GCN, and predicting, in the form of a whole-graph classification task, whether each sensitive behavior subgraph is a malicious behavior subgraph;
component packages having at least one malicious behavior subgraph are determined to be malicious component packages.
8. The accuracy rate, the recall rate and the false positive rate are taken as evaluation indexes to evaluate the detection performance of the model.
CN202410385472.7A 2024-04-01 Malicious detection method for third party component package in Maven warehouse Pending CN118133284A (en)

Publications (1)

Publication Number Publication Date
CN118133284A true CN118133284A (en) 2024-06-04

Legal Events

Date Code Title Description
PB01 Publication