CN115221517A

CN115221517A - Open source repository malicious packet detection method and system

Info

Publication number: CN115221517A
Application number: CN202210830258.9A
Authority: CN
Inventors: 程克非; 刘小川; 张亮
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2022-07-15
Filing date: 2022-07-15
Publication date: 2022-10-21

Abstract

The invention belongs to the technical field of network security, and particularly relates to a method and a system for detecting malicious packets in an open source repository; the method comprises the following steps: acquiring a Python packet to be detected, and extracting characteristics of the packet, wherein the characteristics comprise metadata characteristics, static characteristics and dynamic characteristics; processing the metadata features, the static features and the dynamic features to obtain a total feature vector; processing the total feature vector by adopting a trained RNN-Attention model to obtain a malicious packet detection result; the invention reduces resource consumption, improves the efficiency of feature extraction, integrates the metadata feature, the static feature and the dynamic feature as the feature of the packet and inputs the feature into the machine learning model, avoids that the detection of a code layer is neglected by only detecting the packet name, improves the accuracy of a malicious packet detection result, reduces the false alarm rate of malicious packet detection and has high practicability.

Description

Open source repository malicious packet detection method and system

Technical Field

The invention belongs to the technical field of network security, and particularly relates to a method and a system for detecting malicious packets in an open source repository.

Background

The open source repository is a code sharing platform and plays a vital role in a software supply chain and a software development process. With the continuous deepening of the open source software into various research and production fields in recent years, the influence of the security of the open source software supply chain can be expected to be more extensive in the near future, so that the security risk of the open source software supply chain cannot be ignored.

Strengthening security inspection of the open source software supply chain effectively supervises and encourages developers and managers to keep improving its level of security assurance from both the technical and the management side. As the amount of malicious open source software code grows exponentially, traditional machine learning or deep learning can be applied to extract and analyze features of the code and its variants, and to build a vulnerability library of open source software code. Exploration of new open source software supply chain protection technology should also be strengthened: situation awareness is performed on possible supply chain security risks, and intruding viruses are removed as early as possible, so that the security of the open source software supply chain is guaranteed.

Traditional malicious packet detection methods represent the features of malicious packets poorly and extract them inefficiently, so their detection performance is poor; detection methods based on the distance between packet names and on other metadata differences lack code-level analysis, and can therefore only distinguish surface differences between malicious packets and normal packets; traditional methods that combine metadata, static and dynamic features for detection incur excessive resource overhead.

In view of the foregoing, a method for detecting malicious packets that can improve the detection effect of malicious packets and reduce the resource overhead is needed.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a method and a system for detecting malicious packets in an open source repository, wherein the method comprises the following steps:

s1: acquiring a Python packet to be detected, and extracting characteristics of the packet, wherein the characteristics comprise metadata characteristics, static characteristics and dynamic characteristics; the metadata characteristics comprise the editing distance of a package name, an author of the package, the size of a package file and the downloading history of the package, the static characteristics comprise text characteristics, ssdeep Hash, an API (application program interface) calling sequence and confusion characteristics, and the dynamic characteristics comprise a dynamic behavior sequence;

s2: processing the metadata features, the static features and the dynamic features to obtain a total feature vector;

s3: and processing the total feature vector by adopting a trained RNN-Attention model to obtain a malicious packet detection result.

Preferably, the extracting of the ssdeep Hash comprises: collecting malicious packet samples, and constructing a Hash library according to the malicious packet samples; and calculating the ssdeep Hash according to the packet to be detected and the Hash library.

Preferably, the extracting of the API call sequence includes: judging the type of the packet to be detected, and unpacking the packet to be detected according to the judgment result to obtain an unpacking result; processing the unpacking result by adopting a regular expression, and constructing a multi-branch tree dependency relationship graph of the package entry file; traversing the multi-branch tree dependency relationship graph of the package entry file in post-order, and importing the source code corresponding to the nodes in the graph according to the post-order traversal sequence to obtain the complete source code of the packet to be detected; and obtaining an abstract syntax tree according to the complete source code, processing the abstract syntax tree to obtain the name and the parameters of the API call function, and taking the name and the parameters as an API call sequence.

Further, the process of unpacking the packet to be detected includes: for a package in the tar.gz format, extracting the setup.py and __init__.py files and cleaning other files in the tar.gz source code package; for a package in the whl format, extracting the __init__.py file and cleaning the wheel file in the whl package.

Preferably, the process of extracting the confusion features comprises: processing the packet to be detected by adopting a coding and decoding function to obtain an obfuscated code feature vector, and judging whether confusion features exist according to the obfuscated code feature vector.

Preferably, the process of extracting the dynamic behavior sequence includes: performing package installation and package import operation on the package according to the format of the package to be detected, extracting process information in the package installation and package import process by adopting a docker and function hijack technology, and taking the process information as a dynamic behavior sequence; the process information comprises a called command, a read-write sensitive file name, and an IP and a domain name corresponding to the DNS analysis record.

Preferably, the processing of the metadata feature, the static feature and the dynamic feature includes: converting the API calling sequence and the dynamic behavior sequence into sequence vectors by adopting a BERT word vector model; converting other characteristics except the API calling sequence and the dynamic behavior sequence into corresponding characteristic vectors; and splicing the sequence vector and the feature vectors of other features to obtain a total feature vector.

An open source repository malicious packet detection system comprising: the device comprises a metadata feature extraction module, a static feature extraction module, a dynamic feature extraction module, a word vector conversion module, a prediction module and a recording module;

the metadata feature extraction module is used for extracting metadata features of the packet to be detected;

the static feature extraction module is used for extracting static features of the packet to be detected;

the dynamic feature extraction module is used for extracting dynamic features of the packet to be detected;

the word vector conversion module is used for carrying out vector conversion on the metadata characteristics, the static characteristics and the dynamic characteristics to obtain characteristic vectors; processing the feature vector to obtain a total feature vector;

the prediction module is used for obtaining a malicious packet detection result according to the total feature vector;

and the recording module is used for marking the packets according to the detection result of the malicious packets and carrying out isolation operation on the malicious packets.

The invention has the beneficial effects that: according to the method, aiming at the metadata characteristics, the abstract sequence tree of the package entry point is obtained from the package entry point file according to the analysis of the package structure, and all files related to the package are extracted according to the abstract sequence tree; compared with traditional feature extraction methods, which extract the features of all files, the method reduces the number of files that need to be processed and improves the efficiency of subsequent feature extraction; for the extraction of static features, not only the text features of the code are extracted, but code features such as obfuscation, splicing and encoding are also matched, so that the difficulty of detecting obfuscated and deformed malicious packets is effectively addressed, the detection accuracy of malicious packets is improved to the maximum extent, the use of a large number of static features as machine learning features is avoided, and the model construction efficiency is improved.
For the extraction of the dynamic features, two processes of packet installation and packet import are concerned, a lightweight docker and function hijack technology is used, compared with a traditional extraction method which uses a heavyweight sandbox and scanning software, the resource consumption in the process of extracting the dynamic features is reduced, the efficiency of extracting the dynamic features is improved, meanwhile, a detection method is conveniently integrated into other scripts or software, and the method has the characteristic of higher coupling degree; the invention optimizes and improves the traditional feature extraction method, reduces the resource consumption and improves the feature extraction efficiency, integrates the metadata feature, the static feature and the dynamic feature into a packet feature and inputs the packet feature into a machine learning model, avoids neglecting the code detection by only detecting the packet name, improves the accuracy of the malicious packet detection result and reduces the false alarm rate of malicious packet detection compared with a method for detecting by calculating a safety score according to the feature, and has high practicability.

Drawings

FIG. 1 is a flowchart of a method for detecting malicious packets in an open source repository according to the present invention.

FIG. 2 is a schematic diagram of the RNN-Attention model of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention provides a method for detecting malicious packets in an open source repository, which comprises the following steps of:

s1: acquiring a Python packet to be detected, and extracting characteristics of the packet, wherein the characteristics comprise metadata characteristics, static characteristics and dynamic characteristics; the metadata characteristics comprise the editing distance of a package name, the author of the package, the size of a package file and the download history of the package, the static characteristics comprise a text characteristic, ssdeep Hash, an API calling sequence and a confusion characteristic, and the dynamic characteristics comprise a dynamic behavior sequence.

The extraction of the metadata features comprises the following steps:

monitoring the Python packets of the open source repository, acquiring the Python packet to be detected, and acquiring the author of the packet and the download history of the packet through the application programming interface (API) of the official repository according to the name of the packet to be detected;

calculating the Levenshtein distance, namely the edit distance, between the name of the packet to be detected and the names of the other packets in the official repository and of the Python built-in packages;

after statistical analysis is carried out on the obtained malicious packet samples and the normal packet samples, the file sizes of most of the malicious packet samples are within 20KB, some of the malicious packet samples are even only 1KB, and the file sizes of most of the normal packet samples are more than 50 KB. This may be because the time cost of malicious packet obfuscation and concealment is high, and certainly does not preclude the implantation of malicious code in normal packet samples to conceal malicious behavior. Therefore, the file size of the package can also be used as a basis for judging whether the package is malicious or not; and directly acquiring the size of the package file as the size of the package according to the format of the package.
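The edit-distance feature described above can be illustrated with a minimal sketch. It uses the classic dynamic-programming form of the Levenshtein distance; the helper name `min_name_distance` and the sample package names are illustrative assumptions, not part of the described method.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def min_name_distance(name: str, known_names: list[str]) -> int:
    """Smallest edit distance from `name` to any other known package name.

    Hypothetical helper: raises ValueError if no other name is supplied.
    """
    return min(levenshtein(name, other) for other in known_names if other != name)
```

A small distance to a popular name (for example a name that differs from a well-known package by two characters) is the kind of signal this metadata feature is meant to capture.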

The extraction of the static features comprises the following steps:

text characteristics: the text features comprise the longest sentence length and the out-degree and in-degree of the file; the longest sentence length refers to the length of the longest statement used in the code of the package; the number of times a file is referenced by other files is called the in-degree of the file, and the number of files that the file references is called its out-degree; the out-degree and in-degree features respectively refer to the sums of the out-degrees and in-degrees over all source code files of the package;

ssdeep Hash: the ssdeep Hash refers to a fuzzy hash algorithm, also called a piecewise hash algorithm based on content segmentation (context triggered piecewise hashing, CTPH), and is used for judging the similarity of codes. The main principle of fuzzy hashing is that a weak hash is computed over the local content of a file and the file is sliced whenever a specific condition is met; a strong hash is then used to calculate the hash value of each slice, a part of each of these hash values is taken and concatenated, and the result, together with the slicing condition, forms the fuzzy hash result. A character string similarity comparison algorithm is used to judge the similarity of two fuzzy hash values and thereby the similarity of the two files; specifically, the process of extracting the ssdeep Hash includes: collecting malicious packet samples, and constructing a Hash library according to the malicious packet samples; and calculating the ssdeep Hash (fuzzy Hash value) of the packet to be detected according to the packet to be detected and the Hash library.
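To make the slicing principle concrete, the following is a teaching sketch of a piecewise fuzzy hash. It is not ssdeep and does not produce ssdeep-compatible values: the rolling sum, the trigger condition, the use of SHA-256 and the `difflib` similarity measure are all assumptions chosen for brevity, and `best_match` is a hypothetical helper standing in for the comparison against the malicious Hash library.

```python
import hashlib
from difflib import SequenceMatcher

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def toy_fuzzy_hash(data: bytes, block_size: int = 16, window: int = 7) -> str:
    """Slice `data` where a weak rolling sum hits a trigger value, then keep
    one character of a strong hash per slice."""
    signature = []
    piece_start = 0
    for i in range(len(data)):
        rolling = sum(data[max(0, i - window + 1): i + 1])  # weak hash of recent bytes
        if rolling % block_size == block_size - 1:          # slicing condition
            piece = data[piece_start: i + 1]
            signature.append(ALPHABET[hashlib.sha256(piece).digest()[0] % 64])
            piece_start = i + 1
    if piece_start < len(data):                             # trailing slice
        tail = data[piece_start:]
        signature.append(ALPHABET[hashlib.sha256(tail).digest()[0] % 64])
    return "".join(signature)


def similarity(sig_a: str, sig_b: str) -> float:
    """String similarity of two signatures, in the range [0, 1]."""
    return SequenceMatcher(None, sig_a, sig_b).ratio()


def best_match(sample: bytes, malicious_library: dict[str, str]) -> float:
    """Highest similarity between `sample` and any signature in the library."""
    sig = toy_fuzzy_hash(sample)
    return max((similarity(sig, known) for known in malicious_library.values()),
               default=0.0)
```

Because slice boundaries depend only on nearby bytes, a local edit changes only the signature characters of the affected slices, which is why two near-identical files still compare as similar.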

API call sequence:

judging the type of the packet to be detected and unpacking the packet to be detected according to the judgment result to obtain an unpacking result; specifically, for a package in the tar.gz format, the setup.py and __init__.py files are extracted, and other files in the tar.gz source code package are cleaned; for a package in the whl format, the __init__.py file is extracted, and the wheel file in the whl package is cleaned.
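The unpacking step can be sketched with the Python standard library. This is a minimal illustration, assuming the package type is decided from the file name; the function name `extract_entry_files` is hypothetical, and instead of deleting the other files it simply reads only the entry files.

```python
import tarfile
import zipfile
from pathlib import Path, PurePosixPath


def extract_entry_files(archive_path: str) -> dict[str, str]:
    """Return {member_name: source_text} for the entry files of a package.

    tar.gz source packages: setup.py and __init__.py; whl packages: __init__.py.
    """
    path = Path(archive_path)
    found = {}
    if path.name.endswith(".tar.gz"):
        with tarfile.open(path, "r:gz") as tar:
            for member in tar.getmembers():
                base = PurePosixPath(member.name).name
                if member.isfile() and base in {"setup.py", "__init__.py"}:
                    raw = tar.extractfile(member).read()
                    found[member.name] = raw.decode("utf-8", "replace")
    elif path.suffix == ".whl":
        # A wheel is a zip archive, so zipfile can read it directly.
        with zipfile.ZipFile(path) as whl:
            for name in whl.namelist():
                if PurePosixPath(name).name == "__init__.py":
                    found[name] = whl.read(name).decode("utf-8", "replace")
    else:
        raise ValueError(f"unsupported package format: {path.name}")
    return found
```

Reading members directly from the archive avoids writing untrusted files to disk before they have been inspected.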

Processing the unpacking result by adopting a regular expression to construct a multi-branch tree dependency relationship graph of the package inlet file; specifically, only the problem of reference of logic modules in a package is considered, mutual reference between external packages is not considered, a package to be detected is imported, the package names of all modules in the imported package are matched through a regular expression, whether the package name is in a package file is judged after the package name of a module in the package is acquired, if yes, the package names of other imported modules are recursively acquired by the same method, if not, the package name acquisition error is indicated, and the package names of other imported modules are acquired by the same method again.
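A minimal sketch of building such a dependency graph with a regular expression, visiting it in post-order, and then appending the remaining files. The in-memory `sources` mapping (module name to source text), the regular expression and all function names are illustrative assumptions; the pattern only handles simple `import x` and `from x import y` statements and matches modules by their last dotted component.

```python
import re

IMPORT_RE = re.compile(r"^\s*(?:from\s+([\w.]+)\s+import|import\s+([\w.]+))",
                       re.MULTILINE)


def local_imports(source: str, modules: set[str]) -> list[str]:
    """Names of in-package modules imported by `source`, in order of appearance."""
    found = []
    for match in IMPORT_RE.finditer(source):
        dotted = (match.group(1) or match.group(2)).strip(".")
        name = dotted.split(".")[-1]
        if name in modules and name not in found:  # ignore external packages
            found.append(name)
    return found


def post_order(entry: str, sources: dict[str, str]) -> list[str]:
    """Post-order traversal of the dependency tree rooted at the entry module."""
    modules = set(sources)
    order, visited = [], set()

    def visit(name: str) -> None:
        if name in visited:          # also protects against circular imports
            return
        visited.add(name)
        for child in local_imports(sources[name], modules):
            visit(child)
        order.append(name)           # a module comes after everything it imports

    visit(entry)
    return order


def merge_sources(entry: str, sources: dict[str, str]) -> str:
    """Concatenate sources: graph nodes in post-order first, other files after."""
    ordered = post_order(entry, sources)
    ordered += [name for name in sources if name not in ordered]
    return "\n".join(sources[name] for name in ordered)
```

Post-order places each module after the modules it depends on, so the merged source reads roughly in definition-before-use order.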

Traversing the graph in post-order and sequentially importing the source code corresponding to the nodes in the graph according to the post-order traversal sequence, and then sequentially importing the code of the remaining py files in the package, namely importing the source code closely related to the graph nodes first and the remaining, less related source code afterwards, to obtain the complete source code of the packet to be detected;

and obtaining an abstract syntax tree according to the complete source code, processing the abstract syntax tree to obtain the name and the parameters of the API call function, and taking the name and the parameters as an API call sequence.
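The abstract syntax tree step can be sketched with Python's built-in `ast` module. This is an illustration under assumptions: the function names are hypothetical, calls are ordered by source position, and `ast.unparse` requires Python 3.9 or later.

```python
import ast


def call_name(func: ast.expr) -> str:
    """Dotted name of a called function, e.g. 'os.system'; '' if not a plain name."""
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        base = call_name(func.value)
        return f"{base}.{func.attr}" if base else func.attr
    return ""


def api_call_sequence(source: str) -> list[tuple[str, list[str]]]:
    """Return (function name, argument source texts) for each call, in source order."""
    tree = ast.parse(source)
    calls = [node for node in ast.walk(tree) if isinstance(node, ast.Call)]
    calls.sort(key=lambda node: (node.lineno, node.col_offset))
    sequence = []
    for node in calls:
        name = call_name(node.func)
        if not name:
            continue
        args = [ast.unparse(arg) for arg in node.args]
        args += [f"{kw.arg}={ast.unparse(kw.value)}" for kw in node.keywords if kw.arg]
        sequence.append((name, args))
    return sequence
```

Parsing the code rather than executing it means the call sequence can be collected without running any of the package's logic.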

Confusion features: processing the packet to be detected by adopting a coding and decoding function to obtain a confusion code characteristic vector, wherein the coding and decoding function can adopt Base64, hexadecimal coding and decoding and the like; specifically, when the packet is detected to have the corresponding obfuscated code feature, the corresponding element value in the feature vector is 1, and when the packet is detected to have no obfuscated code feature, the corresponding element value in the feature vector is 0.
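The 0/1 feature vector can be illustrated as follows. The patterns are assumptions chosen for the sketch: Base64 and hexadecimal decoding are named in the text, while the `dynamic_exec` entry is a hypothetical extra feature added only to show how the vector extends.

```python
import re

# One compiled pattern per vector element; the order defines the vector layout.
OBFUSCATION_PATTERNS = {
    "base64": re.compile(r"\bb64decode\s*\("),
    "hex": re.compile(r"\bfromhex\s*\(|\bunhexlify\s*\(|(?:\\x[0-9a-fA-F]{2}){4,}"),
    "dynamic_exec": re.compile(r"\b(?:exec|eval)\s*\("),  # hypothetical extra feature
}


def obfuscation_vector(source: str) -> list[int]:
    """1 where the corresponding obfuscation pattern occurs in the source, else 0."""
    return [1 if pattern.search(source) else 0
            for pattern in OBFUSCATION_PATTERNS.values()]


def has_obfuscation(source: str) -> bool:
    """True if any element of the obfuscation feature vector is set."""
    return any(obfuscation_vector(source))
```

For example, a file that decodes a Base64 string and passes it to `exec` sets the first and third elements, while plain code yields an all-zero vector.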

The extraction of the dynamic features comprises the following steps:

the dynamic characteristics refer to dynamic behavior sequences, extraction of the dynamic characteristics is also called extraction of the dynamic behavior sequences, and the process of extracting the dynamic behavior sequences comprises the following steps: performing installation package and package import operation on a package according to the format of the package to be detected, and extracting process characteristics in the installation package and package import process by adopting a docker (container technology) and a function hijack technology, wherein the process characteristics comprise characteristics of network connection information, command calling, file reading, file writing and the like of a process; and acquiring process information such as a called command, a read-write sensitive file name, an IP (Internet protocol) and a domain name corresponding to the DNS (domain name system) resolution record according to the process characteristics, and taking the process information as a dynamic behavior sequence.
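Function hijacking can be illustrated by temporarily replacing library functions with recording wrappers. The sketch below is an assumption about how such hooks might look, not the patented implementation: it hooks only `os.system` and `socket.getaddrinfo`, and the replacements record the call and return a stub value instead of executing it, whereas the method described here runs the package inside a docker container.

```python
import contextlib
import os
import socket


@contextlib.contextmanager
def hijack_calls(log: list):
    """Record command execution and DNS lookups made while the block runs."""
    original_system = os.system
    original_getaddrinfo = socket.getaddrinfo

    def fake_system(command):
        log.append(("command", command))   # called command
        return 0                           # stub: the command is not executed

    def fake_getaddrinfo(host, *args, **kwargs):
        log.append(("dns", host))          # domain name being resolved
        return []                          # stub: no network access

    os.system = fake_system
    socket.getaddrinfo = fake_getaddrinfo
    try:
        yield log
    finally:
        # Always restore the real functions, even if the monitored code fails.
        os.system = original_system
        socket.getaddrinfo = original_getaddrinfo
```

The recorded tuples, in call order, form a simple behavior sequence; file reads and writes could be captured the same way by wrapping the corresponding functions.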

S2: and processing the metadata features, the static features and the dynamic features to obtain a total feature vector.

BERT (Bidirectional Encoder Representations from Transformers) is a bidirectional Transformer encoder proposed by the Google team. It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context. The pre-trained BERT model can therefore be fine-tuned with only one additional output layer to produce state-of-the-art models for various natural language processing tasks. BERT adopts the encoder part of the Transformer, whose self-attention mechanism attends to context information and can better capture global information.

Converting the API calling sequence and the dynamic behavior sequence into sequence vectors by adopting a BERT word vector model; converting other characteristics except the API calling sequence and the dynamic behavior sequence into corresponding characteristic vectors, wherein the method comprises the steps of converting the author name of the package into vector representation by using One-Hot coding, and taking the total download quantity of the packages in the last year as the characteristic vector representation of the download history of the packages; and splicing the sequence vector and the feature vectors of other features to obtain a total feature vector.
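The splicing step can be sketched in plain Python. Note the loudly stated assumption: `hashed_sequence_vector` is a stand-in for the BERT sequence embedding (a hashed bag of tokens, used only so the example runs without a model), and the function names, the vector dimension and the argument layout are all hypothetical.

```python
import hashlib


def one_hot(value: str, vocabulary: list[str]) -> list[float]:
    """One-Hot encoding of `value` over a fixed vocabulary (e.g. author names)."""
    return [1.0 if value == known else 0.0 for known in vocabulary]


def hashed_sequence_vector(tokens: list[str], dim: int = 8) -> list[float]:
    """Stand-in for the BERT embedding: counts tokens in hashed buckets."""
    vector = [0.0] * dim
    for token in tokens:
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vector[bucket] += 1.0
    return vector


def total_feature_vector(api_calls, behaviors, author, authors,
                         downloads_last_year, other_features):
    """Splice sequence vectors and the remaining feature vectors into one list."""
    return (hashed_sequence_vector(api_calls)      # API call sequence
            + hashed_sequence_vector(behaviors)    # dynamic behavior sequence
            + one_hot(author, authors)             # package author
            + [float(downloads_last_year)]         # download history
            + [float(x) for x in other_features])  # e.g. distance, size, flags
```

The resulting list has a fixed length for a fixed vocabulary and dimension, which is what a downstream model expects as input.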

S3: and processing the total feature vector by adopting a trained recurrent neural network model RNN-Attention with an Attention mechanism to obtain a malicious packet detection result.

The structure of the recurrent neural network RNN is shown in FIG. 2. The input of the network at time t is $x_t$, and its neuron state $S_t$ is expressed as follows:

$S_t = \phi(U x_t + W S_{t-1})$

where $S_{t-1}$ is the neuron state at time t-1, U is the weight matrix from the input layer to the hidden layer, W is the weight matrix applied to the previous value of the hidden layer when it is used as input at the current time, and $\phi$ is the activation function tanh.

For convenience of representation, the above two equations are transformed as follows:

when the model is solved, the loss function of the model is used as an optimization target, network parameters are updated iteratively, the loss function is calculated, and when the loss function is minimum, the U, V and W parameters of the model are obtained, so that the state of the model can be completely determined.
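The state update $S_t = \phi(U x_t + W S_{t-1})$ can be sketched in plain Python as a forward pass only. Matrices are represented as lists of rows, and the function names are illustrative assumptions; the output weights V and the training loop are omitted.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_step(U, W, x_t, s_prev):
    """One recurrent update: S_t = tanh(U x_t + W S_{t-1})."""
    pre_activation = [a + b for a, b in zip(matvec(U, x_t), matvec(W, s_prev))]
    return [math.tanh(value) for value in pre_activation]


def run_rnn(U, W, inputs, s0):
    """Apply rnn_step over a sequence and return the state at every time step."""
    states, state = [], s0
    for x_t in inputs:
        state = rnn_step(U, W, x_t, state)
        states.append(state)
    return states
```

Each state depends on the current input and on the previous state, which is how the network carries context along the API call and behavior sequences.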

In the classification process, the API calling sequence and the dynamic behavior sequence have context relationship, an attention mechanism is added into the RNN model, and vocabulary information in sentences can be added into sentence vectors through trained weight vectors, so that the detection result of the model is more accurate.

Adding the attention mechanism comprises: first, a word vector is passed through a single-layer neural network to obtain its hidden representation $u_{n,j}$; then $u_{n,j}$ and a sentence-level word importance vector $u_s$ are used to obtain the normalized weight $a_{n,j}$ through a softmax function; finally, the weighted sum vector $S_n$ of the word information is obtained from the word information $h_{n,j}$ and the normalized weights.

$u_{n,j} = \tanh(W_w h_{n,j} + b_w)$

where $W_w$ is the parameter matrix of the neural network layer, $b_w$ is the bias term, and $u_s$ is the context vector.
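The three attention steps can be sketched as follows. Only the hidden-representation formula is given explicitly in the text; scoring each $u_{n,j}$ by its dot product with $u_s$ before the softmax, and summing $a_{n,j} h_{n,j}$ to obtain $S_n$, follow the common formulation of this kind of attention and are assumptions here, as is the function name.

```python
import math


def attention_pool(hidden_states, W_w, b_w, u_s):
    """Return (normalized weights a, weighted sum vector S) for one sequence."""
    scores = []
    for h in hidden_states:
        # Hidden representation u = tanh(W_w h + b_w)
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + b)
             for row, b in zip(W_w, b_w)]
        # Importance score: dot product with the context vector u_s (assumed form)
        scores.append(sum(a * c for a, c in zip(u, u_s)))
    # Softmax, shifted by the maximum score for numerical stability
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the word information h
    size = len(hidden_states[0])
    pooled = [sum(w * h[k] for w, h in zip(weights, hidden_states))
              for k in range(size)]
    return weights, pooled
```

Elements whose hidden representation aligns with the context vector receive larger weights, so they contribute more to the pooled vector used for classification.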

And inputting the total feature vector into a trained RNN-Attention model to obtain a malicious packet detection result, marking the packet according to the malicious packet detection result, writing the information of the packet into a database, and performing isolation operation on the malicious packet.

After each detection, the RNN-Attention model is retrained with the features of the detected packet so as to fine-tune the model.

The invention also provides a system for detecting malicious packets in the open source repository, which comprises: the device comprises a metadata feature extraction module, a static feature extraction module, a dynamic feature extraction module, a word vector conversion module, a prediction module and a recording module;

the static feature extraction module is used for performing static feature extraction on the packet to be detected;

The above-mentioned embodiments, which are further detailed for the purpose of illustrating the invention, technical solutions and advantages, should be understood that the above-mentioned embodiments are only preferred embodiments of the present invention, and should not be construed as limiting the present invention, and any modifications, equivalents, improvements, etc. made to the present invention within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for detecting malicious packets in an open source repository, comprising:

s3: and processing the total characteristic vector by adopting a trained RNN-Attention model to obtain a malicious packet detection result.

2. The method as claimed in claim 1, wherein the extracting ssdeep Hash comprises: collecting malicious packet samples, and constructing a Hash library according to the malicious packet samples; and calculating the ssdeep Hash according to the packet to be detected and a Hash library.

3. The open source repository malicious packet detection method according to claim 1, wherein the process of extracting the API call sequence comprises: judging the type of the packet to be detected and unpacking the packet to be detected according to the judgment result to obtain an unpacking result; processing the unpacking result by adopting a regular expression, and constructing a multi-branch tree dependency relationship graph of the package entry file; traversing the multi-branch tree dependency relationship graph of the package entry file in post-order, and importing the source code corresponding to the nodes in the graph according to the post-order traversal sequence to obtain the complete source code of the package to be detected; and obtaining an abstract syntax tree according to the complete source code, processing the abstract syntax tree to obtain the name and the parameters of the API call function, and taking the name and the parameters as an API call sequence.

4. The method according to claim 3, wherein the process of unpacking the packet to be detected comprises: for a package in the tar.gz format, extracting the setup.py and __init__.py files and cleaning other files in the tar.gz source code package; and for a package in the whl format, extracting the __init__.py file and cleaning the wheel file in the whl package.

5. The open-source repository malicious packet detection method according to claim 1, wherein the process of extracting the obfuscated features comprises: and processing the packet to be detected by adopting a coding and decoding function to obtain a confusion code characteristic vector, and judging whether confusion characteristics exist according to the confusion code characteristic vector.

6. The method according to claim 1, wherein the process of extracting the dynamic behavior sequence comprises: performing package installation and package import operation on the package according to the format of the package to be detected, extracting process information in the package installation and package import process by adopting a docker and function hijack technology, and taking the process information as a dynamic behavior sequence; the process information comprises a calling command, a read-write sensitive file name, and an IP and a domain name corresponding to the DNS analysis record.

7. The open source repository malicious packet detection method according to claim 1, wherein the processing of the metadata feature, the static feature and the dynamic feature comprises: converting the API calling sequence and the dynamic behavior sequence into sequence vectors by adopting a BERT word vector model; converting other characteristics except the API calling sequence and the dynamic behavior sequence into corresponding characteristic vectors; and splicing the sequence vector and the feature vectors of other features to obtain a total feature vector.

8. An open source repository malicious packet detection system for executing the open source repository malicious packet detection method of any one of claims 1 to 7, characterized by comprising: the device comprises a metadata feature extraction module, a static feature extraction module, a dynamic feature extraction module, a word vector conversion module, a prediction module and a recording module;