WO2023072002A1 - Security detection method and device for open source component packages - Google Patents

Security detection method and device for open source component packages

Info

Publication number
WO2023072002A1
WO2023072002A1 (PCT/CN2022/127118)
Authority
WO
WIPO (PCT)
Prior art keywords
package
malicious
open source
source component
local
Prior art date
Application number
PCT/CN2022/127118
Other languages
English (en)
French (fr)
Inventor
薛迪
赵刚
余志刚
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2023072002A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/50 - Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F 21/55 - Detecting local intrusion or implementing counter-measures
    • G06F 21/56 - Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F 21/562 - Static detection
    • G06F 21/563 - Static detection by source code analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Definitions

  • the present application relates to the technical field of network security, in particular to a security detection method and device for an open source component package.
  • the security risk caused by a malicious package is that it may launch an attack during the installation phase: the attack code is fetched over the network and executed remotely, leaving no local files behind;
  • when software developers call various open source component packages to implement functional modules while writing product source code, they may call malicious packages carefully disguised by attackers; when the developer releases the product, the source code and the component packages are packaged and published together, and although the packaged product is scanned by scanning software, the malicious package injects malicious code into the developed product in a way that evades antivirus scanning.
  • the traditional code security detection framework detects abnormal behavior while an attack is in progress, or retroactively traces the attack from its consequences after it has occurred; this is a passive defense and puts great pressure on emergency response.
  • the existing code security detection framework relies on terminal-side and cloud-side detection of local source files, but an attacker uploads attack code to the package manager, the attack is launched before the user ever uses the component, and only packaged installation code exists in the terminal development environment; the attacker can therefore easily implant malicious code into the development environment, or steal information during the installation phase and transmit it to a specified network address through a network channel.
  • the embodiment of the present application provides a security detection method and device for an open source component package.
  • the malicious code detection capability is moved forward and a secure open source warehouse is built, which effectively suppresses the security impact of open source on the R&D environment and reduces the possibility of being attacked.
  • a security detection method for an open source component package is provided, comprising: obtaining an online open source component package, and performing feature extraction on the online open source component package to obtain feature information of the online open source component package;
  • performing security detection on the feature information of the above-mentioned open source component package to determine whether the online open source component package is a legitimate package; and, if a first component package in the online open source component package is a legitimate package, synchronizing the first component package to a local open source mirror warehouse, where the local open source mirror warehouse is used to provide users with open source component packages to invoke.
  • before the online open source component package is synchronized to the local open source mirror warehouse, it is first subjected to security detection, and only if it is determined to be a legitimate package is it synchronized to the local open source mirror warehouse; this moves the malicious code detection capability forward, builds a secure open source warehouse, effectively suppresses the security impact of open source on the R&D environment, and reduces the possibility of users of open source component packages being attacked.
  • obtaining the feature information of the online open source component package includes obtaining creation information of the online open source component package; performing security detection on the creation information of the open source component package to determine whether the online open source component package is a legitimate package includes: matching the creation information of the online open source component package against multiple rules in a rule database, and determining whether the online open source component package is a legitimate package according to the degree of matching.
  • the method further includes: if it is determined that the second component package in the online open source component package is a malicious package, storing the second component package in an incremental malicious package database.
  • performing security detection on the feature information of the open source component package to determine whether the online open source component package is a legitimate package includes: matching the feature information of the online open source component package against multiple rules in the rule database, and determining whether the online open source component package is a legitimate package according to the degree of matching.
  • the feature information of the online open source component package is matched with multiple rules in the rule database, and whether the online open source component package is a legitimate package is determined according to the degree of matching between the two.
  • generating the rule database is a relatively direct and concise step, which can reduce the consumption of processing resources in the security detection process and improve the efficiency of security detection.
  • the method further includes: obtaining a local malicious package from a local open source component package, and performing feature extraction on the local malicious package to obtain malicious features of the local malicious package; obtaining local malicious source code, and performing feature extraction on the local malicious source code to obtain malicious code features of the local malicious source code; and using the malicious features of the local malicious package and the malicious code features of the local malicious source code as malicious feature rules in the rule database.
  • obtaining the feature information of the online open source component package further includes obtaining creation information of the online open source component package; the method further includes: obtaining creation information of the local malicious package; obtaining hacker information from an external database; and using the creation information of the local malicious package and the hacker information as malicious information rules in the rule database;
  • performing security detection on the feature information of the open source component package further includes: matching the creation information of the online open source component package against the malicious information rules in the rule database.
  • performing security detection on the feature information of the open source component package to determine whether the online open source component package is a legitimate package includes: inputting the feature information of the online open source component package into an artificial intelligence (AI) labeling model, and using the AI labeling model to perform inference on the online open source component package to determine whether the online open source component package is a legitimate package, where an online open source package that is not a legitimate package is a malicious package.
  • the AI labeling model is used to perform security detection on the online open source component package; since the AI labeling model is a machine learning model obtained through iterative training, the AI labeling model is deterministic, so the inference results obtained by inputting the feature information of the online open source component package into the AI labeling model can ensure the accuracy of the results.
  • the feature information includes risk function features, API call sequence features, and opcode sequence features.
  • the label prediction results are used to indicate whether the online open source component package is a legitimate package, and an online open source package that is not a legitimate package is a malicious package.
  • the method further includes: obtaining an adaptive boosting algorithm classifier, where the adaptive boosting algorithm classifier includes N second classifiers corresponding to different weights, and the N second classifiers corresponding to different weights are trained according to multiple malicious features of the local malicious package; performing feature extraction on the source code of the local malicious package to obtain feature information of the local malicious package; and inputting the feature information of the local malicious package into the adaptive boosting algorithm classifier to train and obtain three first classifiers as the AI labeling model.
  • performing security detection on the feature information of the open source component package to determine whether the online open source component package is a legitimate package includes: inputting the feature information of the online open source component package into an incremental AI model, using the incremental AI model to perform inference on the online open source component package to determine whether the online open source component package is a legitimate package, and determining that an online open source component package that is not a legitimate package is a suspected malicious package.
  • the incremental AI model is used to detect the security of the online open source component package.
  • the feature information of local malicious packages and local legitimate packages is considered at the same time, so the inference results of the incremental AI model take more factors into account; online open source component packages that are not legitimate packages are determined to be suspected malicious packages, and re-judging them can further improve the accuracy of the security detection results and reduce the probability of misjudgment.
  • the feature information includes risk function features, API call sequence features, and opcode sequence features.
  • the method further includes: performing feature extraction on local malicious packages and local legitimate packages in the local open source component package to obtain feature information of the local malicious packages and feature information of the local legitimate packages; and iterating with the feature information of the local malicious packages and the feature information of the local legitimate packages as the input of an initial support vector machine (SVM) algorithm classifier until it is determined that the prediction accuracy of the initial SVM algorithm classifier is greater than a first preset threshold, to obtain the final SVM algorithm classifier as the incremental AI model.
  • the method further includes: performing reputation evaluation on the suspected malicious package, obtaining a reputation score of the suspected malicious package, and determining whether the suspected malicious package is a legitimate package according to the reputation score of the suspected malicious package, where a suspected malicious package that is not a legitimate package is a malicious package.
  • the reputation evaluation includes one or more of the following: dependency package evaluation of the suspected malicious package, package name evaluation of the suspected malicious package, package structure evaluation of the suspected malicious package, author reputation evaluation of the suspected malicious package, and package reputation evaluation of the suspected malicious package.
  • the method further includes: obtaining incremental malicious feature rules and/or incremental information rules according to the malicious packages in the incremental malicious package database; and updating the rule database according to the incremental malicious feature rules and/or incremental information rules.
  • the method further includes: performing feature extraction on a target malicious package to obtain feature information of the target malicious package, where the target malicious package is some or all of the malicious packages in the incremental malicious package database; and iterating with the feature information of the target malicious package as the input of the incremental AI model to obtain an updated incremental AI model.
  • a security detection device is provided, which includes: an acquisition unit, configured to obtain an online open source component package and perform feature extraction on the online open source component package to obtain feature information of the online open source component package; a processing unit, configured to perform security detection on the feature information of the open source component package and determine whether the online open source component package is a legitimate package; and a storage unit, configured to, if a first component package in the online open source component package is a legitimate package, synchronize the first component package to the local open source mirror warehouse, where the local open source mirror warehouse is used to provide users with open source component packages to invoke.
  • the storage unit is further configured to: if it is determined that the second component package in the online open source component package is a malicious package, store the second component package in the incremental malicious package database.
  • the processing unit is specifically configured to: match the characteristic information of the online open source component package with multiple rules in the rule database, and determine whether the online open source component package is a legal package according to the degree of matching.
  • the processing unit is further configured to: obtain a local malicious package from the local open source component package, and perform feature extraction on the local malicious package to obtain malicious features of the local malicious package; obtain local malicious source code, and perform feature extraction on the local malicious source code to obtain malicious code features of the local malicious source code; and use the malicious features of the local malicious package and the malicious code features of the local malicious source code as malicious feature rules in the rule database.
  • acquiring the feature information of the online open source component package further includes acquiring creation information of the online open source component package; the processing unit is further configured to: acquire creation information of a local malicious package; acquire hacker information from an external database; and use the creation information of the local malicious package and the hacker information as malicious information rules in the rule database; performing security detection on the feature information of the open source component package further includes: matching the creation information of the online open source component package against the malicious information rules in the rule database.
  • the processing unit is further configured to: input the feature information of the online open source component package into the artificial intelligence AI labeling model, use the AI labeling model to perform inference on the online open source component package, and determine whether the online open source component package is a legitimate package, where online open source packages that are not legitimate packages are malicious packages.
  • the feature information includes risk function features, API call sequence features, and opcode sequence features.
  • the label prediction results are used to indicate whether the online open source component package is a legitimate package, and online open source packages that are not legitimate packages are malicious packages.
  • the processing unit is further configured to: acquire an adaptive boosting algorithm classifier, where the adaptive boosting algorithm classifier includes N second classifiers corresponding to different weights, and the N second classifiers corresponding to different weights are trained according to multiple malicious features of the local malicious package; perform feature extraction on the source code of the local malicious package to obtain feature information of the local malicious package; and input the feature information of the local malicious package into the adaptive boosting algorithm classifier respectively, to train three first classifiers as the AI labeling model.
  • the processing unit is specifically configured to: input the feature information of the online open source component package into the incremental AI model, use the incremental AI model to perform inference on the online open source component package, determine whether the online open source component package is a legitimate package, and determine that an online open source component package that is not a legitimate package is a suspected malicious package.
  • the feature information includes risk function features, API call sequence features, and opcode sequence features
  • the processing unit is further configured to: perform feature extraction on local malicious packages and local legitimate packages in the local open source component package to obtain feature information of the local malicious packages and feature information of the local legitimate packages; and iterate with the feature information of the local malicious packages and the feature information of the local legitimate packages as the input of the initial support vector machine SVM algorithm classifier until it is determined that the prediction accuracy of the initial SVM algorithm classifier is greater than the first preset threshold, to obtain the final SVM algorithm classifier as the incremental AI model.
  • the processing unit is further configured to: perform reputation evaluation on the suspected malicious package, obtain a reputation score of the suspected malicious package, and determine whether the suspected malicious package is a legitimate package according to the reputation score of the suspected malicious package, where a suspected malicious package that is not a legitimate package is a malicious package.
  • the reputation evaluation includes one or more of the following: dependency package evaluation of the suspected malicious package, package name evaluation of the suspected malicious package, package structure evaluation of the suspected malicious package, author reputation evaluation of the suspected malicious package, and package reputation evaluation of the suspected malicious package.
  • the device further includes an update unit, configured to: obtain incremental malicious feature rules and/or incremental information rules according to the malicious packages in the incremental malicious package database; and update the rule database according to the incremental malicious feature rules and/or incremental information rules.
  • the device further includes a unit configured to: perform feature extraction on a target malicious package to obtain feature information of the target malicious package, where the target malicious package is some or all of the malicious packages in the incremental malicious package database; and iterate with the feature information of the target malicious package as the input of the incremental AI model to obtain an updated incremental AI model.
  • an embodiment of the present application provides a communication device, the device includes a communication interface and at least one processor, and the communication interface is used for the device to communicate with other devices.
  • the communication interface may be a transceiver, circuit, bus, module or other types of communication interface.
  • At least one processor is used to call a set of programs, instructions or data to execute the method described in the first aspect or the second aspect above.
  • the device may also include a memory for storing programs, instructions or data invoked by the processor. The memory is coupled to at least one processor, and when the at least one processor executes instructions or data stored in the memory, the method described in the first aspect above can be implemented.
  • the embodiments of the present application also provide a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a computer, the computer is caused to execute the method according to the first aspect or any possible implementation manner of the first aspect.
  • an embodiment of the present application provides a system-on-a-chip, where the system-on-a-chip includes a processor and may also include a memory, for implementing the method in the above-mentioned first aspect or any possible implementation manner of the first aspect.
  • the system-on-a-chip may consist of chips, or may include chips and other discrete devices.
  • the chip system further includes a transceiver.
  • the embodiments of the present application also provide a computer program product, including instructions, which, when run on a computer, cause the computer to execute the method in the first aspect or any possible implementation manner of the first aspect .
  • FIG. 1A is a flow chart of a software supply chain provided by an embodiment of the present application.
  • FIG. 1B is a schematic diagram of a security risk caused by a malicious package provided by the embodiment of the present application.
  • FIG. 1C is a schematic diagram of a new software supply architecture provided by the embodiment of the present application.
  • FIG. 2A is a flow chart of a security detection method for an open source component package provided in an embodiment of the present application
  • FIG. 2B is a schematic diagram of an abstract syntax tree provided by the embodiment of the present application.
  • Figure 2C is a schematic diagram of a disassembly file provided by the embodiment of the present application.
  • FIG. 2D is a flow chart of another open source component package security detection method provided by the embodiment of the present application.
  • FIG. 2E is a flow chart of another open source component package security detection method provided by the embodiment of the present application.
  • FIG. 2F is a flow chart of another open source component package security detection method provided by the embodiment of the present application.
  • FIG. 3 is a structural block diagram of a security detection device provided in an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • Multiple means two or more.
  • "And/or" describes the association relationship of associated objects, indicating that three relationships may exist; for example, A and/or B may indicate: A exists alone, A and B exist simultaneously, or B exists alone.
  • the character “/” generally indicates that the contextual objects are an "or” relationship.
  • Open source component package: open source (Open Source) means open source code; that is, for open source software, anyone can obtain the source code of the software, modify it within the scope of its copyright license, and even redistribute it.
  • a component package is a simple packaging of data and methods that can be combined as part of a system; in different programming languages, a component package may also be called a component, a control package, and so on. An open source component package is therefore a component package whose source code is open.
  • Package manager: an online storage warehouse for open source component packages. All developers can upload component packages to the package manager, and can also obtain component packages from the package manager for their own development.
  • Open source mirror warehouse: a local private storage warehouse for component package storage and management. "Local" can generally refer to a company, a department, or even an individual's development facility.
  • Software development platform: usually a service platform within an organization or company for software developers to develop software products using open source component packages.
  • Fig. 1A is a flow chart of a software supply chain provided by the embodiment of the present application.
  • Institutions develop open source component packages, and after the development is completed, publish the open source component packages to the package manager for use by other developers.
  • the software development platform synchronizes the package manager to its local private open source mirror warehouse; when other developers need them, they can download open source component packages from the open source mirror warehouse and then use them, which includes installing them directly or redeveloping them on the basis of the open source component packages to form new functional modules, and so on.
  • Figure 1B is a schematic diagram of a security risk caused by a malicious package provided by the embodiment of the present application. Package managers cannot identify the security of component packages and synchronize malicious packages to the open source mirror warehouse together with legitimate packages, while software developers, lacking a security background, cannot check the security of the component packages they use, which brings security risks to their R&D environment.
  • when a developer installs an open source component package from the open source mirror warehouse, a malicious package launches an attack during the installation phase, the attack code is fetched over the network and executed remotely, and no local files remain (attack path 1 in Figure 1B).
  • the malicious code scanning software installed on the software developer's host only scans local files, so online malicious packages cannot be scanned, and the attacker can launch information theft, distributed denial of service (DDoS) attacks, and other attacks on the developer's host during the installation phase;
  • alternatively, malicious package attackers hide themselves and do not launch attacks during the installation phase, but instead hide malicious code in the open source component package.
  • company software developers call various open source component packages to implement some functional modules (attack path 2 in Figure 1B).
  • the component package called may be a legitimate package or a malicious package carefully disguised by the attacker.
  • when the company's software developers release the product, they package and release the source code and the component packages together.
  • the packaged product needs to pass the security inspection of the company's malicious code scanning software.
  • current malicious code scanning software cannot recognize the security of open source component packages; in this case, a malicious attacker injects malicious code into the developed product through the malicious package, evading the antivirus software scan.
  • the embodiment of this application discloses a new software supply architecture, as shown in Figure 1C, which is a schematic diagram of a new software supply architecture provided by the embodiment of this application: between the package manager and the open source mirror warehouse, a mirror warehouse security center is introduced, which obtains online open source component packages from the package manager, performs security detection on them, filters out the secure component packages, and stores the secure component packages in the open source mirror warehouse, so that software developers can be assured of the security of the open source component packages they obtain and use from the open source mirror warehouse, thereby ensuring the security of the research and development environment.
  • the mirror warehouse security center can be an independent module, or a module combined with the open source mirror warehouse, which is not specifically limited in this embodiment of the application.
  • an embodiment of the present application provides a security detection method for an open source component package, which is executed by means of the software supply architecture in FIG. 1C.
  • the method includes the following steps:
  • if the first component package in the online open source component package is a legitimate package, the first component package is synchronized to the local open source mirror warehouse, and the local open source mirror warehouse is used to provide users with open source component packages to invoke.
  • the execution subject of the steps of the above method is the mirror warehouse security center in Figure 1C, which can be an independent functional module, or part of the functions in the open source mirror warehouse.
  • the following content will not go into details.
  • the corresponding hardware entity of the execution subject may be a terminal device, or a server, or a computing center, etc.
  • the online open source component package is published by the software supplier on the package manager; an institution or organization, usually a company, downloads the online open source component package to the mirror warehouse security center, and then performs feature extraction on the online open source component package to obtain the feature information of the open source component package.
  • feature extraction includes feature extraction for functions or methods of the source code in the component package to obtain feature information.
  • the process of obtaining the characteristic information of the online open source component package may include:
  • the mirror warehouse security center traverses all open source component packages existing in the package manager, and obtains a list of open source component package names;
  • the mirror warehouse security center traverses each package name in the package name list, obtains the JSON file of the open source component package in the package manager, and parses the JSON file to obtain the package file download link of the open source component package;
  • the mirror warehouse security center downloads the package file from the package file download link of the open source component package and decompresses it, extracts the source code from the package file, and then extracts the features of the package file from the source code, which may specifically include application programming interface (API) call sequence features, opcode sequence features, and risk function features, to form the feature information of the open source component package.
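  • As a rough illustration of steps 4) to 6), the following Python sketch fetches a package's metadata JSON, parses out the source-distribution download link, and unpacks the .py source files. The PyPI-style JSON layout (a "urls" list with "packagetype" and "url" fields), the gzip tarball format, and the helper name download_package_source are assumptions for illustration, not part of the embodiment.

      # Sketch: fetch package metadata, parse the download link, unpack the sources.
      import io
      import json
      import tarfile
      import urllib.request

      def download_package_source(package_name, index_url="https://pypi.org/pypi"):
          """Return {relative_path: source_text} for the package's .py files."""
          with urllib.request.urlopen(f"{index_url}/{package_name}/json") as resp:
              meta = json.load(resp)
          # Pick the source-distribution download link from the metadata JSON.
          sdist_url = next(u["url"] for u in meta["urls"] if u["packagetype"] == "sdist")
          with urllib.request.urlopen(sdist_url) as resp:
              archive = io.BytesIO(resp.read())
          sources = {}
          with tarfile.open(fileobj=archive, mode="r:gz") as tar:
              for member in tar.getmembers():
                  if member.isfile() and member.name.endswith(".py"):
                      sources[member.name] = tar.extractfile(member).read().decode("utf-8", "replace")
          return sources
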
  • the mirror warehouse security center scans whether the source code contains encryption and decryption functions; if such functions exist, it is determined that the file contains obfuscated code.
  • obfuscated code uses encryption and decryption functions to convert some code fragments in the source code into confusing strings, thereby hiding the code structure.
  • the mirror warehouse security center uses the abstract syntax tree extraction function of the programming language corresponding to the source code to extract the abstract syntax tree of the source code, and extracts the API call sequence of the source code from the nodes of the abstract syntax tree;
  • the mirror warehouse security center uses a source code disassembly function to disassemble the source code and extracts the opcode sequence from the disassembly file.
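  • The following sketch shows how the AST node sequence and the opcode sequence could be pulled from Python source with the standard ast and dis modules; treating the node-type names as the "API call sequence" and the sample input are simplifying assumptions.

      # Sketch: extract an AST node-name sequence and an opcode sequence from source code.
      import ast
      import dis

      def extract_sequences(source):
          tree = ast.parse(source)
          # Node-type names in traversal order, e.g. Module, FunctionDef, Assign, Name, Store,
          # roughly matching the root-to-leaf sequences described above.
          api_sequence = [type(node).__name__ for node in ast.walk(tree)]
          # Compile the source and collect the opcode names from its disassembly.
          code = compile(source, "<package>", "exec")
          opcode_sequence = [ins.opname for ins in dis.get_instructions(code)]
          return api_sequence, opcode_sequence

      api_seq, op_seq = extract_sequences("def f(x):\n    return x + 1\n")
      # op_seq contains names such as LOAD_CONST, MAKE_FUNCTION, STORE_NAME, RETURN_VALUE.
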
  • FIG. 2B is a schematic diagram of an abstract syntax tree provided by the embodiment of the present application.
  • the API call sequence is a sequence formed from the root node to the leaf node in the abstract syntax tree.
  • an example of an API call sequence is: Store, Name, Assign, FunctionDef, Module.
  • the generated disassembly file can refer to Figure 2C, which is a schematic diagram of a disassembly file provided by the embodiment of the present application.
  • the operation codes in the disassembly file are extracted: LOAD_CONST, LOAD_CONST, MAKE_FUNCTION, STORE_NAME, LOAD_CONST, RETURN_VALUE, LOAD_CONST, STORE_FAST, to generate the corresponding opcode sequence.
  • API call sequence and opcode sequence of the source code can also be obtained in other ways, for example:
  • the mirror warehouse security center copies the source code containing the obfuscated code into a sandbox (Sandbox) to run; the mirror warehouse security center can run the obfuscated code in real time in the sandbox and obtain the API call sequence and opcode sequence of the obfuscated code by monitoring its real-time execution.
  • the security center of the mirror warehouse uses the n-gram model (n-gram) and term frequency-inverse document frequency (tf-idf) technology to perform feature selection on API call sequences and opcode sequences.
  • the mirror warehouse security center uses n-grams technology to divide the API call sequence and operation code sequence into blocks. Every n API calls or operation codes are divided into one block.
  • taking the n-grams of the opcodes as an example:
  • Package A opcode sequence: LOAD_GLOBAL, LOAD_FAST, CALL_FUNCTION, RETURN_VALUE
  • Package B opcode sequence: LOAD_GLOBAL, LOAD_FAST, MAKE_FUNCTION, STORE_NAME
  • n-grams of component package A are (LOAD_GLOBAL, LOAD_FAST), (LOAD_FAST, CALL_FUNCTION), (CALL_FUNCTION, RETURN_VALUE)
  • n-grams of component package B are (LOAD_GLOBAL, LOAD_FAST), (LOAD_FAST, MAKE_FUNCTION), (MAKE_FUNCTION, STORE_NAME)
  • the mirror warehouse security center uses tf-idf technology to calculate the tf-idf value of each n-grams block of the API call sequence and opcode sequence, and deletes the n-grams blocks whose tf-idf value is lower than the tf-idf threshold preset by the mirror warehouse security center; the remaining n-grams blocks are the result of feature selection.
  • each remaining n-grams block is combined with its tf-idf value to form the sequence features of the opcodes.
  • the sequence feature of the API call of the component package can be obtained.
  • n-grams of component package A are (LOAD_GLOBAL, LOAD_FAST), (LOAD_FAST, CALL_FUNCTION), (CALL_FUNCTION, RETURN_VALUE)
  • n-grams of component package B are (LOAD_GLOBAL, LOAD_FAST), (LOAD_FAST, MAKE_FUNCTION), (MAKE_FUNCTION, STORE_NAME)
  • tf-idf values of n-grams of component package A are: 0.0242, 0.6479, 0.8594;
  • tf-idf values of n-grams of component package B are: 0.0149, 0.5946, 0.8843;
  • Component package A feature selection (LOAD_FAST, CALL_FUNCTION), (CALL_FUNCTION, RETURN_VALUE);
  • Component package B feature selection (LOAD_FAST, MAKE_FUNCTION), (MAKE_FUNCTION, STORE_NAME);
  • the sequence features of the component package extracted by the mirror warehouse security center are the retained n-grams blocks together with their tf-idf values.
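  • A minimal sketch of the n-grams plus tf-idf selection, using scikit-learn's TfidfVectorizer as a stand-in for the tf-idf computation; the threshold value and the exact tf-idf weighting are assumptions, so the numbers will differ from the example values above.

      # Sketch: 2-gram blocks of opcode sequences, filtered by a preset tf-idf threshold.
      from sklearn.feature_extraction.text import TfidfVectorizer

      opcode_sequences = {
          "package_A": "LOAD_GLOBAL LOAD_FAST CALL_FUNCTION RETURN_VALUE",
          "package_B": "LOAD_GLOBAL LOAD_FAST MAKE_FUNCTION STORE_NAME",
      }
      vectorizer = TfidfVectorizer(ngram_range=(2, 2), lowercase=False, token_pattern=r"[A-Z_]+")
      matrix = vectorizer.fit_transform(opcode_sequences.values())
      TFIDF_THRESHOLD = 0.1  # preset by the mirror warehouse security center (assumed value)
      for row, name in zip(matrix.toarray(), opcode_sequences):
          kept = {gram: value for gram, value in zip(vectorizer.get_feature_names_out(), row)
                  if value > TFIDF_THRESHOLD}
          print(name, kept)  # remaining 2-gram blocks together with their tf-idf values
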
  • the first step is the determination of the risk functions.
  • the risk function may be a function stored in a local risk function database, and the local risk function database may be established by the R&D personnel by setting the risk function in advance, or by obtaining the risk function statistically by the R&D personnel.
  • Risk functions include functions such as network connection, command execution, and file reading and writing.
  • the mirror warehouse security center combines risk functions into risk function characteristics.
  • Component package A contains risk functions: socket.recv, urllib.urlretrieve, fileinput.input, os.popen, ctypes.CDLL;
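  • A small sketch of the risk function check: walk the AST, collect dotted call names, and intersect them with a local risk function database; the risk function set below is only a sample drawn from the example above.

      # Sketch: collect risk function calls appearing in the source code.
      import ast

      RISK_FUNCTIONS = {"socket.recv", "urllib.urlretrieve", "fileinput.input",
                        "os.popen", "ctypes.CDLL"}

      def risk_function_features(source):
          called = set()
          for node in ast.walk(ast.parse(source)):
              if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
                  base = node.func.value
                  if isinstance(base, ast.Name):
                      called.add(f"{base.id}.{node.func.attr}")
          return called & RISK_FUNCTIONS

      print(risk_function_features("import os\nos.popen('ls')\n"))  # {'os.popen'}
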
  • the features of the package files obtained in the corresponding processes of 1) to 3) above can be used as the feature information of the open source component package, and optionally, the feature information of these open source component packages can be stored in the component package information database.
  • security detection may use rule verification against the rule base, or other methods such as file signatures, heuristic detection, and the like.
  • three different methods are used for security detection of feature information, namely, rule database matching, AI labeling model labeling, and incremental AI model classification.
  • FIG. 2D is a flow chart of another open source component package security detection method provided in the embodiment of the present application, as shown in FIG. 2D.
  • the difference between this method and the method shown in FIG. 2A is that step 202 in FIG. 2A is replaced with: 202a, matching the feature information of the online open source component package against multiple rules in the rule database, and determining whether the online open source component package is a legitimate package according to the degree of matching.
  • before the feature information of the open source component package is matched against the multiple rules in the rule database, the rule database is obtained first.
  • the rule database includes a plurality of rules that can be matched with the characteristic information of the open source component package, specifically a plurality of rules generated according to the characteristic information of the local malicious package.
  • Local malicious packages refer to component packages that have been judged as malicious packages and are stored in open source mirror warehouses or other developers’ local databases.
  • according to the degree of matching between the feature information of the online open source component package and the rules in the rule database, it can be determined whether the online open source component package is a malicious package: the higher the matching degree, the greater the probability that the online open source component package is a malicious package; the lower the matching degree, the greater the probability that the online open source component package is not a malicious package (that is, a legitimate package).
  • the embodiment of the present application is described by taking the generation of multiple rules in the rule database from the feature information of local malicious packages as an example.
  • feature extraction can be performed on the local malicious package to obtain the feature information of the local malicious package, namely risk function features, API call sequence features, and opcode sequence features.
  • the rule database in the embodiment of the present application is a yara rule database, so yara rules are generated according to the feature information.
  • feature extraction can also be performed on the local malicious source code.
  • the local malicious source code can be the webpage source code that is pre-stored in the local mirror warehouse and judged as malicious source code, or the malicious source code obtained through other means.
  • for the feature extraction method of the malicious source code, refer to the above steps 1) to 3) to obtain the feature information of the local malicious source code, including risk function features, API call sequence features, and opcode sequence features.
  • the characteristic information of the local malicious package and the characteristic information of the local malicious source code can be combined to generate multiple malicious characteristic rules in the rule database.
  • taking the rule database as a yara rule database as an example, multiple malicious feature yara rules are generated according to the feature information of the local malicious package and the feature information of the local malicious source code, including:
  • the mirror warehouse security center obtains the feature information of the local malicious package and the feature information of the local malicious source code, and saves them in a malicious feature array {M_1, ..., M_i, ..., M_n};
  • the mirror warehouse security center deletes the repeated malicious features in the malicious feature array to finally obtain the malicious feature array {M_1, ..., M_i, ..., M_z}, generates malicious feature yara rules from {M_1, ..., M_i, ..., M_z} according to the yara rule writing requirements, and saves them in the yara rule database.
  • the degree of matching can be determined specifically by the number of matching rules; for example, when the feature information of the online open source component package matches the malicious feature yara rules in the yara rule database and the number of matches is greater than or equal to K, the open source component package is determined to be a malicious package, otherwise the open source component package is determined to be a legitimate package, where K is a positive integer.
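  • The sketch below shows one way to turn the de-duplicated malicious feature array {M_1, ..., M_z} into a yara rule and to count rule matches against a package file, using the yara-python package; the sample features, the rule name, the file path, and K = 1 are assumptions for illustration.

      # Sketch: generate a malicious-feature yara rule and count matches against a file.
      import yara

      malicious_features = ["os.popen", "socket.recv", "ctypes.CDLL"]  # sample M_1..M_z
      strings = "\n        ".join(f'$m{i} = "{feat}"' for i, feat in enumerate(malicious_features))
      rule_source = f"""
      rule malicious_feature_rule_0001
      {{
          strings:
              {strings}
          condition:
              any of them
      }}
      """
      rules = yara.compile(source=rule_source)

      K = 1  # preset match-count threshold
      matches = rules.match(filepath="downloaded_package/setup.py")
      is_malicious = len(matches) >= K
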
  • the creation information of the online open source component package can also be extracted, and the rules in the rule database can also be matched against the creation information, and the online open source component package can be further determined as a malicious package or a legitimate package according to the matching degree.
  • Extracting information about the creation of online open source component packages may specifically include:
  • the mirror warehouse security center traverses the online open source component packages that need security detection in the package manager, and obtains a list of online open source component package names.
  • the mirror warehouse security center traverses each package name in the package name list, obtains the JSON file of the open source component package, and parses the JSON file to obtain the package file download link of the open source component package, the source code storage website link (such as Github), the source code scoring website link (such as SourceRank), and the dependency file (such as requirements.txt).
  • Source code storage website link: {Homepage: https://github.com/Kronuz/esprima-python};
  • the mirror warehouse security center obtains the dependent package names of the open source component package from the dependency file requirements.txt, and then uses the above steps 4) and 5) to obtain the package file download links, source code storage website links, source code scoring website links, and dependency files of the dependent packages.
  • the mirror warehouse security center extracts the package creation information.
  • the mirror warehouse security center downloads and decompresses the package files of the open source component package and its dependent packages from the package file download link, and extracts and parses the configuration files from the package files.
  • the mirror warehouse security center extracts package creation information such as the package name, author, author email, organization, description, package file structure, and maintainer from these configuration files.
  • this creation information can be saved to a package information database.
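  • As an illustration of extracting creation information from the configuration files, the sketch below parses a Python source distribution's PKG-INFO metadata (RFC 822 style headers); other package ecosystems use different configuration files, and the file path is an assumption.

      # Sketch: read package creation information from an unpacked sdist's PKG-INFO file.
      from email.parser import Parser

      def parse_creation_info(pkg_info_text):
          headers = Parser().parsestr(pkg_info_text)
          return {
              "package_name": headers.get("Name"),
              "author": headers.get("Author"),
              "author_email": headers.get("Author-email"),
              "maintainer": headers.get("Maintainer"),
              "description": headers.get("Summary"),
              "homepage": headers.get("Home-page"),
          }

      with open("downloaded_package/PKG-INFO", encoding="utf-8") as f:
          creation_info = parse_creation_info(f.read())  # then saved to the package information database
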
  • the rule database may include information rules related to creation information.
  • the specific steps include: the mirror warehouse security center obtains the creation information of the local malicious package; the mirror warehouse security center obtains hacker information from an external database; the mirror warehouse security center uses the creation information of the local malicious package and the hacker information as the Malicious information rules in the above rule database.
  • the creation information of the local malicious package may include information such as the package name of the local malicious package, the author, and the author's email address; the hacker information may be information pre-stored in a local hacker information database, such as the hacker's name and the hacker's email address.
  • the creation information can be directly stored as rules in the rule database, or the creation information can be edited into a format required by the rule database and then stored.
  • malicious information yara rules are generated according to the yara rule writing requirements and saved to the yara rule database; an example is sketched below.
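  • A hypothetical example of such a malicious information yara rule follows; the author name and e-mail are placeholders rather than data from the embodiment, and the rule is compiled with the yara-python package in the same way as the malicious feature rules above.

      # Sketch: a malicious-information yara rule built from creation information and hacker information.
      import yara

      malicious_info_rule = """
      rule malicious_info_rule_0001
      {
          strings:
              $author = "attacker_name"           // placeholder author name
              $email  = "attacker@example.com"    // placeholder author e-mail
          condition:
              any of them
      }
      """
      info_rules = yara.compile(source=malicious_info_rule)
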
  • performing security detection on the characteristic information of the open source component package further includes: matching creation information of the online open source component package with malicious information rules in the rule database.
  • after the malicious information yara rules are generated, they are matched against the creation information of the aforementioned online open source component package.
  • the higher the matching degree, the greater the probability that the online open source component package is a malicious package.
  • the matching degree can be specifically determined by the number of matching rules.
  • the probability that the online open source component package is a malicious package can be determined by combining the malicious feature yara rules and malicious information yara rules in the yara rule database with the characteristic information and creation information of the online open source component package.
  • the feature information (and optionally the creation information) of the online open source component package is matched against multiple rules in the rule database, and whether the online open source component package is a legitimate package is determined according to the degree of matching between the two.
  • generating the rule database is a relatively direct and concise step, which can reduce the consumption of processing resources in the security detection process and improve the efficiency of security detection.
  • step 202 in FIG. 2A can be replaced by: 202b, inputting the feature information of the online open source component package into the artificial intelligence AI labeling model, using the AI labeling model to perform inference on the online open source component package, and determining whether the online open source component package is a legitimate package.
  • the feature extraction process of the online open source component package reference may be made to the aforementioned steps 1) to 3), which will not be repeated here.
  • the feature information is directly read from the component package information database.
  • the artificial intelligence AI labeling model is used to label online open source component packages according to the feature information of online open source component packages.
  • if the AI labeling model is a model obtained by training on the malicious features of the local malicious package, the feature information of the online open source component package is input into the AI labeling model, and the labeling result marks whether the online open source component package is a malicious package or not (i.e., a legitimate package).
  • if the AI labeling model is a model obtained by training on the legitimate features of the local legitimate package, the feature information of the online open source component package is input into the AI labeling model, and the labeling result marks whether the online open source component package is a legitimate package or not (i.e., a malicious package).
  • Input: the feature information of the online open source component package;
  • Output: a mark of the online open source component package as a malicious package or a legitimate package.
  • an AI labeling model obtained through training of feature information of a local malicious package is taken as an example for specific description.
  • the feature vector includes risk function features, API call sequence features, and opcode sequence features.
  • the feature information of the online open source component package is input into the AI labeling model, and using the AI labeling model to perform inference on the online open source component package to determine whether the online open source component package is a legitimate package includes: inputting the feature vectors of the online open source component package into three first classifiers respectively, and obtaining the classification result of each of the three first classifiers; voting on the classification results of the three first classifiers to obtain a voting result, and determining the label prediction result among the classification results of the three first classifiers according to the voting result.
  • the label prediction results are used to indicate whether the online open source component package is a legal package.
  • the AI labeling model may be a combination classifier, specifically, a combination classifier of an adaptive boosting Adaboost algorithm classifier, a random forest classifier, and the like.
  • this combination classifier can perform inference on each of the three types of feature information of the online open source component package, obtaining three inference results as to whether the online open source component package is a malicious package, then vote on the three classification results using the absolute majority voting method, and determine the label prediction result of the online open source component package among the three classification results according to the voting result.
  • the prediction result with votes greater than or equal to 50% may be determined as the label prediction result; or the prediction result with the most votes may be determined as the label prediction result, etc.
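  • A minimal sketch of the absolute majority vote over the three first-classifier results; the "malicious"/"legal" labels and the at-least-50% rule follow the text above, while the helper name is an assumption.

      # Sketch: absolute majority voting over three classification results.
      from collections import Counter

      def vote(classification_results):
          label, votes = Counter(classification_results).most_common(1)[0]
          # Adopt a label only if it receives at least half of the votes.
          return label if votes >= len(classification_results) / 2 else "undecided"

      print(vote(["malicious", "malicious", "legal"]))  # -> malicious
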
  • This AI labeling model can improve the accuracy of classification results.
  • the trained Adaboost algorithm classifier is taken as an example of the first classifier for illustration.
  • the process of training the AI labeling model includes:
  • an adaptive boosting algorithm classifier is obtained, where the adaptive boosting algorithm classifier includes N second classifiers corresponding to different weights, and the N second classifiers corresponding to different weights are obtained by training according to multiple malicious features of the local malicious package;
  • feature extraction is performed on the source code of the local malicious package to obtain the feature vectors of the local malicious package; the feature vectors of the local malicious package are respectively input into the adaptive boosting algorithm classifiers, and the three first classifiers are trained as the AI labeling model.
  • obtaining an adaptive boosting algorithm classifier refers to obtaining an initial (untrained) Adaboost algorithm classifier; each initial Adaboost algorithm classifier includes N second classifiers with different weights, where a second classifier may also be called a weak classifier, meaning that the classification accuracy of this untrained classifier is low, usually 50% or less, and it may specifically be a support vector machine (SVM) classifier.
  • the risk function features, API call sequence features and opcode sequence features are used as the input of an Adaboost algorithm classifier respectively to generate three training tasks: task 1, task 2 and task 3.
  • the mirror warehouse security center calculates the error rate of N weak classifiers, and updates the weight of each weak classifier according to the error rate.
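  • The sketch below mirrors training tasks 1 to 3: one AdaBoost classifier, whose N weighted weak (second) classifiers are SVMs, is trained per feature type, and the three trained classifiers together form the AI labeling model. N, the feature matrices, and the use of scikit-learn (version >= 1.2 for the estimator argument) are assumptions.

      # Sketch: train three AdaBoost (first) classifiers, one per feature type.
      from sklearn.ensemble import AdaBoostClassifier
      from sklearn.svm import SVC

      def train_labeling_model(feature_sets, labels, n_weak=10):
          """feature_sets: [risk_fn_X, api_seq_X, opcode_seq_X]; labels: 0 = legal, 1 = malicious."""
          model = []
          for X in feature_sets:                      # task 1, task 2, task 3
              clf = AdaBoostClassifier(
                  estimator=SVC(kernel="linear"),     # weak (second) classifier
                  n_estimators=n_weak,                # N weak classifiers
                  algorithm="SAMME",                  # re-weights weak classifiers by their error rate
              )
              clf.fit(X, labels)
              model.append(clf)
          return model                                # three first classifiers = AI labeling model
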
  • the AI labeling model is used to perform security detection on the online open source component package; since the AI labeling model is a machine learning model obtained through iterative training, it is deterministic, so the inference results obtained by inputting the feature information of the online open source component package into the AI labeling model can ensure the accuracy of the results.
  • step 202 in FIG. 2A can be replaced by: 202c, inputting the feature information of the online open source component package into the incremental AI model, using the incremental AI model to perform inference on the online open source component package to determine whether the online open source component package is a legitimate package, and determining that an online open source component package that is not a legitimate package is a suspected malicious package.
  • the feature extraction process of the online open source component package reference may be made to the aforementioned steps 1) to 3), which will not be repeated here.
  • the feature information is directly read from the component package information database.
  • the difference between the incremental AI model and the aforementioned AI labeling model is that this process extracts the features of both local malicious packages and local legitimate packages, and jointly trains on the feature information of the local malicious packages and local legitimate packages to obtain the incremental AI model.
  • the AI model can be used to infer whether the obtained online open source component package is a legal package.
  • Input: the feature information of the online open source component package, and the legality threshold ε;
  • Output: the suspicious degree value δ of the online open source component package. If δ > ε, the mirror warehouse security center judges that the component package is a suspected malicious package that needs further analysis; if δ ≤ ε, the mirror warehouse security center judges that the online open source component package is a legitimate package.
  • the output of the incremental AI model is a probability pair, such as [0.6, 0.4], where the predicted probability of a malicious package is 0.6 and the predicted probability of a legitimate package is 0.4, the two values summing to 1; the legality threshold is also a probability value, for example 0.5, and the suspicious degree value corresponds to the predicted probability of a malicious package, so with a suspicious degree value of 0.6 > the legality threshold 0.5 the package is judged to be a suspected malicious package.
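  • A minimal sketch of this legality decision, assuming the incremental AI model is a scikit-learn style classifier with probability output and that label 1 denotes the malicious class:

      # Sketch: compare the suspicious degree value (delta) with the legality threshold (epsilon).
      def classify_package(incremental_model, feature_vector, epsilon=0.5):
          proba = incremental_model.predict_proba([feature_vector])[0]
          # Take the predicted probability of the malicious class as the suspicious degree value.
          malicious_index = list(incremental_model.classes_).index(1)
          delta = proba[malicious_index]
          return "suspected malicious package" if delta > epsilon else "legitimate package"
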
  • the feature information includes risk function features, API call sequence features and opcode sequence features
  • the method also includes: extracting features from the local malicious packages and local legitimate packages in the local open source component package to obtain the feature information of the local malicious packages and the feature information of the local legitimate packages; and iterating with the feature information of the local malicious packages and the feature information of the local legitimate packages as the input of the initial support vector machine SVM algorithm classifier until the prediction accuracy of the initial SVM algorithm classifier is greater than the first preset threshold, to obtain the final SVM algorithm classifier as the incremental AI model.
  • the initial incremental AI model can be an initial SVM algorithm classifier. When selecting a specific SVM algorithm classifier, the following factors may be considered: a fuzzy algorithm reduces the dependence of the SVM on the exact values of the training samples; a least squares formulation speeds up iterative training; and a twin SVM copes with imbalanced samples during training. The initial SVM algorithm classifier may therefore be, for example, an initial fuzzy least squares twin SVM classifier.
  • the mirror warehouse security center obtains the local component packages, including the feature information of the local malicious packages and the local legitimate packages, specifically the risk function features, API call sequence features and opcode sequence features, and combines these three kinds of features, that is, the F1-dimensional risk function feature, the F2-dimensional API call sequence feature and the F3-dimensional opcode sequence feature are combined into one (F1+F2+F3)-dimensional combined feature.
  • the mirror warehouse security center then takes the combined feature as the input of the initial SVM algorithm and performs iterative training until it determines that the prediction accuracy of the SVM algorithm classifier is greater than the first preset threshold, thereby obtaining the incremental AI model.
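• The following sketch shows the feature combination and iterative training loop described above; a standard scikit-learn SVC is used only as a stand-in for the fuzzy least squares twin SVM of the embodiment, which scikit-learn does not provide, and the accuracy threshold value is an assumed example:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split

    def train_incremental_model(risk_feat, api_feat, opcode_feat, labels,
                                accuracy_threshold=0.95, max_rounds=10):
        # Combine the F1-, F2- and F3-dimensional features into a single
        # (F1 + F2 + F3)-dimensional combined feature per component package.
        X = np.hstack([risk_feat, api_feat, opcode_feat])
        X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2)
        model, C = None, 1.0
        for _ in range(max_rounds):
            model = SVC(C=C, probability=True).fit(X_tr, y_tr)
            if model.score(X_te, y_te) > accuracy_threshold:
                break
            C *= 2.0        # crude example of adjusting the classifier between rounds
        return model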
  • the incremental AI model is used to perform security detection on the online open source component package.
  • because the feature information of both the local malicious packages and the local legitimate packages is considered during training, the inference results of the incremental AI model are more comprehensive; determining online open source component packages that are not legitimate packages as suspected malicious packages and re-judging them can further improve the accuracy of the security detection results and reduce the probability of misjudgment.
  • the re-judgment of a suspected malicious package may follow the rule database generation process described above and consider the creation information of the online open source component package, or it may also consider information such as the structure, dependent packages and package name of the online open source component package.
  • optionally, after an online open source component package that is not a legitimate package is determined to be a suspected malicious package, the method corresponding to FIG. 2F also includes step 204: performing reputation evaluation on the suspected malicious package, obtaining a reputation score of the suspected malicious package, and determining, according to the reputation score, whether the suspected malicious package is a legitimate package, where a suspected malicious package that is not a legitimate package is a malicious package.
  • the reputation evaluation includes one or more of the following: dependent-package evaluation of the suspected malicious package, package-name evaluation of the suspected malicious package, structure evaluation of the suspected malicious package, author reputation evaluation of the suspected malicious package, and package reputation evaluation of the suspected malicious package.
  • performing dependent-package evaluation on the suspected malicious package to obtain a dependency score includes: obtaining the dependent package of any suspected malicious package among the multiple suspected malicious packages and determining the probability that the dependent package is a malicious package; and determining the dependency score of that online open source component package according to the probability, where the dependency score is positively correlated with the probability that the dependent package is a malicious package.
  • the suspected malicious package can be an online open source component package, and the method for obtaining the dependent package can refer to the above steps 4) to 6), or the dependent package can also be a local component package.
  • the probability that the dependent package is a malicious package can be determined through the rule database matching method described in step 202a, the AI labeling model inference method described in step 202b, the incremental AI model inference method, or other methods. If the dependent package is determined to be a malicious package, the probability of it being malicious is 100%, and the dependency score of the online open source component package may be 1. If the dependent package is determined to be a legitimate package, the probability of it being malicious is 0, and the dependency score may be 0. If the suspicion value of the dependent package determined by the incremental AI model inference method is θ, the probability of the dependent package being a malicious package can be determined as (θ − Δ)/Δ × 100%, and so on.
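• A small sketch of the dependency-score mapping described above (the verdict strings and the function name are illustrative assumptions):

    def dependency_score(verdict, theta=None, delta=0.5):
        # Probability that the dependent package is malicious; the dependency
        # score of the online open source component package follows it directly.
        if verdict == "malicious":
            return 1.0                              # 100% -> dependency score 1
        if verdict == "legitimate":
            return 0.0                              # 0%   -> dependency score 0
        return max(0.0, (theta - delta) / delta)    # (θ - Δ)/Δ from the suspicion value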
  • alternatively, if the dependent package is an online open source component package, the probability that the dependent package is a malicious package may also be determined by calculating, for example, the ranking of the domain of the author email in the dependent package's configuration file on the Google domain ranking list, the score of each file inside the dependent package in a threat database, and the number of maintainers.
  • Evaluating the package name of the suspected malicious package to obtain a package name score includes: obtaining the package names of multiple online open source component packages, and generating a popular component package list according to the popular component package names among them; matching the package name of the suspected malicious package against the popular component package names in the popular component package list to determine the similarity between the package name of the suspected malicious package and the popular component package names; and determining the package name score according to that similarity, where the package name score is positively correlated with the similarity.
  • the open source component packages with the top P download times can be collected from the open source component package download statistics website, and a list of popular component packages can be generated according to the package names of these open source component packages.
  • P can be 500, 605, 1001, etc.
  • generating the popular component package list according to the package names of these open source component packages includes generating the list in order of download frequency, or generating the list according to how recent the last download time is, and so on.
  • the package name of the suspected malicious package is then compared with the package names in the popular component package list using one or more of semantic similarity calculation and distance comparison. The semantic similarity calculation may, for example, use a semantic similarity function; as an example, the suspected malicious package name organization compared with the popular package name organize yields 0 (semantically similar). As a Levenshtein distance comparison example, the suspected malicious package name PyYMAL compared with the popular package name PyYAML yields a Levenshtein distance of 2, indicating that the suspected malicious package name can be transformed into the popular package name in two steps.
  • the foregoing transformation may specifically include operations such as character deletion, homophone substitution, character replacement, character transposition, character insertion, delimiter changes, order permutation and version modification.
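• A sketch of the package-name check, using a plain Levenshtein edit distance for the distance comparison; the popular-package list below is a placeholder, and the scoring rule is only one possible way to make the score rise with similarity:

    def levenshtein(a, b):
        # Minimum number of single-character edits turning a into b.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    POPULAR = ["PyYAML", "requests", "numpy"]               # placeholder popular-package list

    def package_name_score(name, popular=POPULAR, max_distance=2):
        distances = [levenshtein(name.lower(), p.lower())
                     for p in popular if p.lower() != name.lower()]
        best = min(distances)                               # e.g. PyYMAL vs PyYAML -> 2
        return max(0, max_distance + 1 - best)              # closer name -> higher score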
  • Structural evaluation of the suspected malicious package to obtain a structure score includes: obtaining the package names of multiple online open source component packages and generating a popular component package list according to the popular component package names among them; obtaining a first hash value of the file structure of an open source component package in the popular component package list and a second hash value of the file structure of the suspected malicious package, and calculating the distance between the first hash value and the second hash value; and determining the structure score according to that distance, where the structure score is negatively correlated with the distance.
  • the manner of obtaining the list of popular component packages is the same as the foregoing description, and will not be repeated here.
  • the file structure can refer to the directory structure formed after the open source component package is decompressed; inputting this directory structure into a hash function yields the corresponding hash value. The second hash value corresponding to the file structure of the suspected malicious package is obtained in the same way. The distance between the first hash value and the second hash value is then calculated, and the structure score is determined according to this distance.
  • the structure score is negatively correlated with the distance, that is, the smaller the distance, the higher the structure score. For example, if the distance between the first hash value and the second hash value is 1, the structure score is a/1, where a is a preset value. The structure score can be calculated between a suspected malicious package and each of multiple open source component packages in the popular component package list, and these structure scores are then added together as the final structure score of the suspected malicious package.
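• The embodiment only requires a hash of the decompressed directory structure and a distance between hash values; for that distance to be meaningful, a locality-sensitive hash is assumed in the sketch below (a small simhash over the file paths), with the structure score computed as a divided by the distance, as in the a/1 example:

    import hashlib

    def structure_simhash(paths, bits=64):
        # Locality-sensitive hash of a package's directory structure.
        acc = [0] * bits
        for p in paths:
            h = int.from_bytes(hashlib.sha256(p.encode()).digest()[:8], "big")
            for i in range(bits):
                acc[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i in range(bits) if acc[i] > 0)

    def structure_score(popular_paths, suspect_paths, a=10.0):
        # Hamming distance between the two hash values; a smaller distance
        # (more similar structure) gives a higher structure score.
        d = bin(structure_simhash(popular_paths) ^ structure_simhash(suspect_paths)).count("1")
        return a / max(d, 1)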
  • Author reputation evaluation of the suspected malicious package to obtain an author reputation score includes: obtaining the author reputation features of the suspected malicious package, including the popularity, total number of users, total number of viewers and activity level of all projects uploaded by the author, and calculating the sum of all author reputation feature values of the suspected malicious package as the author reputation score.
  • Package reputation evaluation of the suspected malicious package to obtain a package reputation score includes: obtaining the package reputation features of the suspected malicious package, including the popularity of the package, the number of users, the number of readers and the component package rating, and calculating the sum of all package reputation feature values as the package reputation score.
  • after the reputation scores are calculated according to the foregoing methods, if only a single reputation evaluation has been performed on the suspected malicious package, the final reputation score can be determined from that single reputation score; if multiple reputation evaluations have been performed, the multiple reputation scores can be combined by summation, weighted summation or another combination method to determine the final reputation score. Taking the above five reputation evaluations of the suspected malicious package as an example, five reputation scores are obtained and summed to obtain the final reputation score of the suspected malicious package. The mirror warehouse security center compares this reputation score with a preset evaluation threshold μ: if the reputation score is lower than μ, the mirror warehouse security center judges that the suspected malicious package is a malicious package; otherwise, it judges that the suspected malicious package is a legitimate package.
  • An example of reputation evaluation: the input is a component package labeled as a suspected malicious package A, whose five reputation scores are 1, 1, 2, 0.4 and 0.3.
  • the final reputation score of suspected malicious package A is: 3.7
  • the evaluation threshold preset by the mirror warehouse security center is: 5
  • it is therefore judged that suspected malicious package A is a malicious package.
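• A sketch of the final aggregation, assuming a plain (optionally weighted) sum of the five reputation scores; note that the comparison direction (a total below the threshold μ means malicious) is inferred from the worked example above rather than stated explicitly:

    def judge_by_reputation(scores, weights=None, mu=5.0):
        # scores: [dependency, package name, structure, author reputation,
        # package reputation]; weights default to a plain sum.
        weights = weights or [1.0] * len(scores)
        total = sum(w * s for w, s in zip(weights, scores))
        return ("malicious" if total < mu else "legitimate"), total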
  • the ultimate goal of the mirror warehouse security center is to ensure the security of the company's open source mirror warehouse and to eliminate malicious packages. Therefore, as described in step 203 in FIG. 2A and FIGS. 2D-2F above, the mirror warehouse security center synchronizes the screened-out legitimate packages, which in the embodiments of this application may specifically be the legitimate packages screened out through the rule database, the AI labeling model, the incremental AI model and the reputation evaluation network, to the open source mirror warehouse.
  • the mirror warehouse security center can save the malicious packages detected by the aforementioned rule database, AI labeling model and reputation evaluation network to the incremental malicious package database. A retention period can be set for the incremental malicious packages, for example one week, so that the malicious packages detected by the mirror warehouse security center within the week are saved to the incremental malicious package database and the contents saved in the previous week are cleared.
  • it can be seen that, in the embodiments of this application, before an online open source component package is synchronized to the local open source mirror warehouse, security detection is first performed on it, and only if the online open source component package is determined to be a legitimate package is it synchronized to the local open source mirror warehouse. This moves the malicious code detection capability forward, builds a secure open source repository, effectively restrains the security impact of open source adoption on the R&D environment, and reduces the possibility that users of open source component packages are attacked.
  • the method also includes: extracting features of a target malicious package to obtain the feature information of the target malicious package, where the target malicious package is some or all of the malicious packages in the incremental malicious package database; and using the feature information of the target malicious package as the input of the incremental AI model for iteration to obtain an updated incremental AI model.
  • because the incremental malicious package database may include malicious packages determined by the reputation scoring network, performing feature extraction on these malicious packages and updating the incremental AI model can optimize the incremental AI model, reduce the probability that it outputs suspected malicious packages requiring re-judgment, and improve classification efficiency.
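• A sketch of the incremental update, assuming the training data of the current incremental AI model is still available and that extract_features is the feature-extraction routine of steps 1) to 3); both names are assumptions:

    import numpy as np

    def update_incremental_model(model, old_X, old_y, incremental_malicious_pkgs,
                                 extract_features):
        # Append the features of the newly confirmed malicious packages
        # (label 1 = malicious) and refit the incremental AI model.
        new_X = np.array([extract_features(pkg) for pkg in incremental_malicious_pkgs])
        new_y = np.ones(len(new_X), dtype=int)
        X = np.vstack([old_X, new_X])
        y = np.concatenate([old_y, new_y])
        model.fit(X, y)        # an online learner could use partial_fit instead
        return model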
  • the method further includes: acquiring incremental malicious feature rules and/or incremental information rules according to the malicious packages in the incremental malicious package database; updating the rule database according to the incremental malicious feature rules and/or incremental information rules.
  • the mirror warehouse security center generates the malicious feature rules and malicious information rules of the malicious packages in the incremental malicious package database through the method described above, compares the newly extracted malicious feature rules and malicious information rules with the existing rules in the rule database, removes the newly extracted rules that coincide with rules already in the rule database, and adds the remaining rules to the rule database.
  • likewise, this process also has the effect of optimizing the rule database and improving the classification accuracy of the rule database.
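• A sketch of the de-duplicating update of the rule database; the yara-style rule template below is illustrative only and does not reproduce the embodiment's exact rule text:

    YARA_TEMPLATE = """rule {name}
    {{
        strings:
            $a = "{feature}"
        condition:
            $a
    }}"""

    def update_rule_database(rule_db, new_features):
        # rule_db: set of existing rule strings; new_features: mapping of
        # rule name -> malicious feature string extracted from the
        # incremental malicious package database.
        added = 0
        for name, feature in new_features.items():
            rule = YARA_TEMPLATE.format(name=name, feature=feature)
            if rule not in rule_db:        # drop rules coinciding with existing ones
                rule_db.add(rule)
                added += 1
        return added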
  • FIG. 3 shows a security detection apparatus 300 provided by an embodiment of this application, which can be used to perform the methods and specific embodiments in FIGS. 2A to 2F above.
  • the apparatus 300 includes an acquisition unit 301 , a processing unit 302 and a storage unit 303 .
  • the obtaining unit 301 is used to obtain the online open source component package, and perform feature extraction on the online open source component package, and obtain the feature information of the online open source component package;
  • the processing unit 302 is configured to perform security detection on the characteristic information of the open source component package, and determine whether the online open source component package is a legal package;
  • the storage unit 303 is configured to, if the first component package in the online open source component package is a legitimate package, synchronize the first component package to a local open source mirror warehouse, and the local open source mirror warehouse is used to provide the user with the called open source component package.
  • the storage unit 303 is further configured to: if it is determined that the second component package in the online open source component package is a malicious package, store the second component package in the incremental malicious package database.
  • the processing unit 302 is specifically configured to: match the characteristic information of the online open source component package with multiple rules in the rule database, and determine whether the online open source component package is a legitimate package according to the matching degree.
  • the processing unit 302 is also used to: obtain the local malicious package in the local open source component packages and perform feature extraction on the local malicious package to obtain the malicious features of the local malicious package; obtain local malicious source code and perform feature extraction on the local malicious source code to obtain the malicious code features of the local malicious source code; and use the malicious features of the local malicious package and the malicious code features of the local malicious source code as the malicious feature rules in the rule database.
  • obtaining the feature information of the online open source component package also includes obtaining creation information of the online open source component package
  • the processing unit 302 is also used to: acquire the creation information of the local malicious package; acquire hacker information from an external database; use the creation information of the local malicious package and the hacker information as malicious information rules in the rule database;
  • the security detection also includes: matching the creation information of the online open source component package with the malicious information rules in the rule database.
  • the processing unit 302 is also used to: input the feature information of the online open source component package into the artificial intelligence AI labeling model, use the AI labeling model to perform inference on the online open source component package, and determine whether the online open source component package is a legitimate package, where an online open source package that is not a legitimate package is a malicious package.
  • the feature information includes risk function features, API call sequence features and opcode sequence features. Inputting the feature information of the online open source component package into the AI labeling model and using the AI labeling model to perform inference includes: inputting the feature information of the online open source component package into three first classifiers respectively to obtain the classification result of each of the three first classifiers; and voting on the classification results of the first classifiers using the absolute majority voting method to obtain a voting result, and determining the label prediction result among the classification results of the three first classifiers according to the voting result, where the label prediction result is used to indicate whether the online open source component package is a legitimate package, and an online open source package that is not a legitimate package is a malicious package.
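• A sketch of the absolute majority vote over the three first classifiers, one per feature type (function and variable names are illustrative assumptions):

    from collections import Counter

    def ai_label(clf_risk, clf_api, clf_opcode, risk_feat, api_feat, opcode_feat):
        votes = [clf_risk.predict([risk_feat])[0],
                 clf_api.predict([api_feat])[0],
                 clf_opcode.predict([opcode_feat])[0]]
        label, count = Counter(votes).most_common(1)[0]
        # Absolute majority: the winning label must take more than half of the votes.
        return label if count * 2 > len(votes) else "undetermined"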
  • the processing unit 302 is further configured to: obtain an adaptive boosting algorithm classifier, where the adaptive boosting algorithm classifier includes N second classifiers corresponding to different weights, and the N second classifiers corresponding to different weights are obtained through training based on multiple malicious features of local malicious packages; perform feature extraction on the source code of the local malicious packages to obtain the feature information of the local malicious packages; and input the feature information of the local malicious packages into the adaptive boosting algorithm classifiers respectively, and train them to obtain the three first classifiers as the AI labeling model.
  • the processing unit 302 is specifically configured to: input the feature information of the online open source component package into the incremental AI model, use the incremental AI model to perform inference on the online open source component package, determine whether the online open source component package is a legitimate package, and determine that an online open source component package that is not a legitimate package is a suspected malicious package.
  • the feature information includes a risk function feature, an API call sequence feature, and an operation code sequence feature
  • the processing unit 302 is also used to: perform feature extraction on the local malicious packages and local legitimate packages in the local open source component packages to obtain the feature information of the local malicious packages and the feature information of the local legitimate packages; and use the feature information of the local malicious packages and the feature information of the local legitimate packages as the input of the initial support vector machine SVM algorithm classifier for iteration until it is determined that the prediction accuracy of the initial SVM algorithm classifier is greater than the first preset threshold, so as to obtain the final SVM algorithm classifier as the incremental AI model.
  • after it is determined that an online open source component package that is not a legitimate package is a suspected malicious package, the processing unit 302 is further configured to: perform reputation evaluation on the suspected malicious package, obtain the reputation score of the suspected malicious package, and determine, according to the reputation score of the suspected malicious package, whether the suspected malicious package is a legitimate package, where a suspected malicious package that is not a legitimate package is a malicious package; the reputation evaluation includes one or more of the following: dependent-package evaluation of the suspected malicious package, package-name evaluation of the suspected malicious package, structure evaluation of the suspected malicious package, author reputation evaluation of the suspected malicious package, and package reputation evaluation of the suspected malicious package.
  • the apparatus further includes an updating unit 304, configured to: acquire incremental malicious feature rules and/or incremental information rules according to the malicious packages in the incremental malicious package database; and update the rule database according to the incremental malicious feature rules and/or incremental information rules.
  • the apparatus also includes an updating unit 304, configured to: extract features of the target malicious package to obtain the feature information of the target malicious package, where the target malicious package is some or all of the malicious packages in the incremental malicious package database; and use the feature information of the target malicious package as the input of the incremental AI model for iteration to obtain an updated incremental AI model.
  • processing unit 302 may be a central processing unit (Central Processing Unit, CPU).
  • the acquisition unit 301 may be an interface circuit or a transceiver, used to receive data or instructions from, or send data or instructions to, other electronic devices.
  • the storage unit 303 may be used to store data and/or signaling, and the storage unit may be coupled to the obtaining unit 301 and the processing unit 302 .
  • the processing unit 302 may be configured to read data and/or signaling in the storage unit, so that the security detection process of the open source component package in the foregoing method embodiments is executed.
  • FIG. 4 shows a schematic diagram of a hardware structure of an electronic device in an embodiment of the present application.
  • the structure of the security detection device 300 may refer to the structure shown in FIG. 4 .
  • the electronic device 1000 includes: a memory 1001, a processor 1002, a communication interface 1003 and a bus 1004, where the memory 1001, the processor 1002 and the communication interface 1003 are connected to each other through the bus 1004.
  • the memory 1001 may be a read-only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device or a random access memory (Random Access Memory, RAM).
  • the memory 1001 may store a program, and when the program stored in the memory 1001 is executed by the processor 1002, the processor 1002 and the communication interface 1003 are used to execute the steps of the security detection method of the embodiments of this application.
  • the processor 1002 may be a general-purpose CPU, a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a GPU, or one or more integrated circuits for executing related programs, so as to implement the functions to be performed by the acquisition unit 301, the processing unit 302 and the storage unit 303 in the security detection apparatus 300 of the embodiments of this application, or to execute the security detection method of the method embodiments of this application.
  • the processor 1002 may also be an integrated circuit chip with signal processing capability. In an implementation process, each step of the security detection method of this application may be completed by an integrated logic circuit of hardware in the processor 1002 or by instructions in the form of software.
  • the above-mentioned processor 1002 may also be a general-purpose processor, a digital signal processor (Digital Signal Processing, DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • Various methods, steps, and logic block diagrams disclosed in the embodiments of the present application may be implemented or executed.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a storage medium that is mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory 1001, and the processor 1002 reads the information in the memory 1001 and, in combination with its hardware, completes the functions required to be performed by the modules included in the security detection apparatus 300 of the embodiments of this application, or executes the security detection method of the method embodiments of this application.
  • the communication interface 1003 implements communication between the electronic device 1000 and other devices or communication networks by using a transceiver apparatus such as, but not limited to, a transceiver. For example, the online open source component packages to be detected may be obtained through the communication interface 1003.
  • the bus 1004 may include a path for transferring information between various components of the electronic device 1000 (eg, memory 1001 , processor 1002 , communication interface 1003 ).
  • although the electronic device 1000 shown in FIG. 4 only shows a memory, a processor and a communication interface, in a specific implementation process those skilled in the art should understand that the electronic device 1000 also includes other components necessary for normal operation. Meanwhile, according to specific needs, those skilled in the art should understand that the electronic device 1000 may also include hardware components implementing other additional functions. In addition, those skilled in the art should understand that the electronic device 1000 may also include only the components necessary to implement the embodiments of this application, and does not necessarily include all the components shown in FIG. 4.
  • the present application also provides a computer program, which is used to implement the operations and/or processing performed by the security detection device in the method provided in the present application.
  • the present application also provides a computer-readable storage medium, which stores a computer program or computer-executable instructions; when the computer program or computer-executable instructions are run on a computer, the computer is caused to perform the operations and/or processing performed by the security detection apparatus in the methods provided by this application.
  • the present application also provides a computer program product, which includes computer-executable instructions or a computer program; when the computer-executable instructions or computer program are run on a computer, the operations and/or processing performed by the security detection apparatus in the methods provided by this application are performed.
  • it should be understood that the sequence numbers of the above-mentioned processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation processes of the embodiments of this application.
  • modules and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be regarded as exceeding the scope of the present application.
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of modules is only a logical function division; in actual implementation there may be other division methods, for example multiple modules or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or modules may be in electrical, mechanical or other forms.
  • a module described as a separate component may or may not be physically separated, and a component shown as a module may or may not be a physical module, that is, it may be located in one place, or may also be distributed to multiple network modules. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional module in each embodiment of the present application may be integrated into one processing module, each module may exist separately physically, or two or more modules may be integrated into one module.
  • if the functions are implemented in the form of software function modules and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the prior art, or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods in the various embodiments of the present application.
  • the aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (read-only memory, ROM), a random access memory (random access memory, RAM), a magnetic disk or an optical disc.

Abstract

The present application discloses a security detection method and apparatus for open source component packages. The method includes: obtaining an online open source component package and performing feature extraction on it to obtain feature information of the online open source component package; performing security detection on the feature information of the open source component package to determine whether the online open source component package is a legitimate package; and, if a first component package in the online open source component package is a legitimate package, synchronizing the first component package to a local open source mirror warehouse, where the local open source mirror warehouse is used to provide users with open source component packages to be called. The embodiments of the present application move the malicious code detection capability forward and build a secure open source repository, which can effectively restrain the security impact of open source adoption on the R&D environment and reduce the possibility of being attacked.

Description

开源组件包的安全检测方法及装置
本申请要求于2021年10月31日提交中国专利局、申请号为202111279082.4、申请名称为“开源组件包的安全检测方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及网络安全技术领域,尤其涉及一种开源组件包的安全检测方法及装置。
背景技术
近年来,越来越多的软件产品依赖于免费的开源组件包,软件供应链越来越复杂,导致软件供应链安全问题日益严重,攻击者利用在软件交付环节将包含恶意代码的开源组件包注入到包管理器(Python包管理器、node.js包管理器等),在用户使用环节对软件用户实施攻击,给用户的隐私和财产安全造成了巨大威胁。
恶意包引发的安全风险可能在安装阶段发起攻击,攻击代码从网络段远程执行,无本地文件存留;或者恶意包攻击者在安装阶段隐藏自己不发起攻击,而是将恶意代码隐藏在开源组件包中,软件开发人员在编写产品源代码时调用各种开源组件包实现一些功能模块,则可能调用到攻击者精心伪装的恶意包,并且在软件开发人员发布产品时将源代码和组件包一起打包发布,打包发布的产品通过扫描软件扫描时,恶意包注入恶意代码到开发产品以躲避杀毒软件扫描。
传统代码安全性检测框架是通过攻击行为发生过程中检测到异常行为或攻击发生后根据攻击后果回溯检测攻击行为,均属于被动防守,当面对低成本的脚本类软件供应链攻击时,应急压力大。而且现有代码安全性检测框架基于本地源文件的终端检测和云检测,但是包管理器的攻击者上传攻击代码,攻击代码在用户使用环节之前发起攻击,并打包安装代码存在终端开发环境,攻击者可以轻易在开发环境中植入恶意代码或安装阶段窃取信息通过网络通道传输到指定网络地址。
发明内容
本申请实施例提供了一种开源组件包的安全检测方法及装置,通过先获取在线开源组件包,并对开源组件包的安全性进行评估,使得恶意代码检测能力前移,构建安全开源仓库,有效抑制开源化对研发环境的安全影响,降低被攻击的可能性。
第一方面,提供一种开源组件包的安全检测方法,该方法包括:获取在线开源组件包,并对所述在线开源组件包进行特征提取,获取所述在线开源组件包的特征信息;针对所述开源组件包的特征信息进行安全检测,确定所述在线开源组件包是否为合法包;若所述在线开源组件包中的第一组件包为合法包,则将所述第一组件包同步到本地开源镜像仓,所述本地开源镜像仓用于向用户提供调用的开源组件包。
在本申请实施例中,在将在线开源组件包同步到本地开源镜像仓之前,先对在线开源组件包进行安全检测,确定在线开源组件包为合法包的情况下,将其同步到本地开源镜像仓,使得恶意代码检测能力前移,构建安全开源仓库,有效抑制开源化对研发环境的安全影响,降低开源组件包使用者被攻击的可能性。
在一个可选的示例中,获取在线开源组件包的特征信息包括获取在线开源组件包的创建 信息;针对开源组件包的创建信息进行安全检测,确定在线开源组件包是否为合法包,包括:将在线开源组件包的创建信息与规则数据库中的多条规则进行匹配,根据匹配程度确定在线开源组件包是否为合法包。
在一个可选的示例中,该方法还包括:若确定在线开源组件包中的第二组件包为恶意包,则将第二组件包存储到增量恶意包数据库。
在一个可选的示例中,针对开源组件包的特征信息进行安全检测,确定在线开源组件包是否为合法包,包括:将在线开源组件包的特征信息与规则数据库中的多条规则进行匹配,根据匹配程度确定在线开源组件包是否为合法包。
在本申请实施例中,采用在线开源组件包的特征信息与规则数据库中的多条规则进行匹配,根据两者的匹配程度确定在线开源组件包是否为合法包。这个过程中,生成规则数据库是一个较为直接和简明的步骤,进而可以降低安全检测过程中的处理资源消耗,提升安全检测效率。
在一个可选的示例中,该方法还包括:获取本地开源组件包中的本地恶意包,并对本地恶意包进行特征提取,获取本地恶意包的恶意特征;获取本地恶意源代码,并对本地恶意源代码进行特征提取,获取本地恶意源代码的恶意代码特征;将本地恶意包的恶意特征和本地恶意源码的恶意代码特征作为规则数据库中的恶意特征规则。
在一个可选的示例中,获取在线开源组件包的特征信息还包括获取在线开源组件包的创建信息;该方法还包括:获取本地恶意包的创建信息;从外部数据库中获取黑客信息;将本地恶意包的创建信息和黑客信息作为规则数据库中的恶意信息规则;
针对开源组件包的特征信息进行安全检测还包括:将在线开源组件包的创建信息与规则数据库中的恶意信息规则进行匹配。
在一个可选的示例中,针对开源组件包的特征信息进行安全检测,确定在线开源组件包是否为合法包,包括:将在线开源组件包的特征信息输入人工智能AI标注模型,采用AI标注模型对在线开源组件包进行推理,确定在线开源组件包是否为合法包,其中不为合法包的在线开源包为恶意包。
在本申请实施例中,采用AI标注模型对在线开源组件包进行安全检测,该过程由于AI标注模型为机器学习模型,经过迭代训练获得,因此AI标注模型具有确定性,那么将在线开源组件包的特征信息输入AI标注模型获得的推理结果,能够保证结果的准确性。
在一个可选的示例中,特征信息包括风险函数特征、API调用序列特征和操作码序列特征,将在线开源组件包的特征信息输入AI标注模型,采用AI标注模型对在线开源组件包进行推理,确定在线开源组件包是否为合法包,包括:将在线开源组件包的特征信息分别输入三个第一分类器,获得三个第一分类器中每个第一分类器的分类结果;使用绝对多数投票法对每个第一分类器的分类结果进行投票获得投票结果,根据投票结果确定三个第一分类器的分类结果中的标签预测结果,标签预测结果用于指示在线开源组件包是否为合法包,其中不为合法包的在线开源包为恶意包。
在一个可选的示例中,该方法还包括:获取自适应提升算法分类器,自适应提升算法分类器中包括N个对应不同权值的第二分类器,N个对应不同权值的第二分类器根据本地恶意包的多个恶意特征训练获得;对本地恶意包的源代码进行特征提取,获得本地恶意包的特征信息;将本地恶意包的特征信息分别输入自适应提升算法分类器,训练获得三个第一分类器作为AI标注模型。
在一个可选的示例中,针对开源组件包的特征信息进行安全检测,确定在线开源组件包 是否为合法包,包括:将在线开源组件包的特征信息输入增量AI模型,采用增量AI模型对在线开源组件包进行推理,确定在线开源组件包是否为合法包,并确定不为合法包的在线开源组件包为疑似恶意包。
在本申请实施例中,采用增量AI模型对在线开源组件包进行安全检测,该过程由于增量AI模型的训练过程中,同时考虑了本地恶意包和本地合法包的特征信息,使得增量AI模型的推理结果考虑更全面,将不为合法包的在线开源组件包确定为疑似恶意包,并进行再判断,可以进一步提升安全检测结果的准确性,减少误判的概率。
在一个可选的示例中,特征信息包括风险函数特征、API调用序列特征和操作码序列特征,方法还包括:对本地开源组件包中的本地恶意包和本地合法包进行特征提取,获取本地恶意包的特征信息和本地合法包的特征信息;将本地恶意包的特征信息和本地合法包的特征信息作为初始支持向量机SVM算法分类器的输入进行迭代,直到确定初始SVM算法分类器的预测准确率大于第一预设阈值,获得最终SVM算法分类器作为增量AI模型。
在一个可选的示例中,在确定不为合法包的在线开源组件包为疑似恶意包之后,方法还包括:对所述疑似恶意包进行信誉评估,获得所述疑似恶意包的信誉评分,并根据所述疑似恶意包的信誉评分确定所述疑似恶意包是否为合法包,其中不为合法包的所述疑似恶意包为恶意包,计算获得疑似恶意包的信誉评分,并根据疑似恶意包的信誉评分确定疑似恶意包是否为合法包。
在一个可选的示例中,信誉评估包括以下一项或多项:疑似恶意包的依赖包评估,疑似恶意包的包名评估,疑似恶意包的结构评估,疑似恶意包的作者信誉评估,疑似恶意包的包信誉评估。
在一个可选的示例中,该方法还包括:根据增量恶意包数据库中的恶意包获取增量恶意特征规则和/或增量信息规则;根据增量恶意特征规则和/或增量信息规则更新规则数据库。
在一个可选的示例中,该方法还包括:对目标恶意包进行特征提取,获得目标恶意包的特征信息,目标恶意包为增量恶意包数据库中的部分或全部恶意包;将目标恶意包的特征信息作为增量AI模型的输入进行迭代,获得更新后的增量AI模型。
第二方面,提供一种安全检测装置,该装置包括:获取单元,用于获取在线开源组件包,并对在线开源组件包进行特征提取,获取在线开源组件包的特征信息;处理单元,用于针对开源组件包的特征信息进行安全检测,确定在线开源组件包是否为合法包;存储单元,用于若在线开源组件包中的第一组件包为合法包,则将第一组件包同步到本地开源镜像仓,本地开源镜像仓用于向用户提供调用的开源组件包。
在一个可选的示例中,存储单元还用于:若确定在线开源组件包中的第二组件包为恶意包,则将第二组件包存储到增量恶意包数据库。
在一个可选的示例中,处理单元具体用于:将在线开源组件包的特征信息与规则数据库中的多条规则进行匹配,根据匹配程度确定在线开源组件包是否为合法包。
在一个可选的示例中,处理单元还用于:获取本地开源组件包中的本地恶意包,并对本地恶意包进行特征提取,获取本地恶意包的恶意特征;获取本地恶意源代码,并对本地恶意源代码进行特征提取,获取本地恶意源代码的恶意代码特征;将本地恶意包的恶意特征和本地恶意源码的恶意代码特征作为规则数据库中的恶意特征规则。
在一个可选的示例中,获取在线开源组件包的特征信息还包括获取在线开源组件包的创建信息;处理单元还用于:获取本地恶意包的创建信息;从外部数据库中获取黑客信息;将本地恶意包的创建信息和黑客信息作为规则数据库中的恶意信息规则;针对开源组件包的特 征信息进行安全检测还包括:将在线开源组件包的创建信息与规则数据库中的恶意信息规则进行匹配。
在一个可选的示例中,处理单元还用于:将在线开源组件包的特征信息输入人工智能AI标注模型,采用AI标注模型对在线开源组件包进行推理,确定在线开源组件包是否为合法包,其中不为合法包的在线开源包为恶意包。
在一个可选的示例中,特征信息包括风险函数特征、API调用序列特征和操作码序列特征,将在线开源组件包的特征信息输入AI标注模型,采用AI标注模型对在线开源组件包进行推理,确定在线开源组件包是否为合法包,包括:将在线开源组件包的特征信息分别输入三个第一分类器,获得三个第一分类器中每个第一分类器的分类结果;使用绝对多数投票法对每个第一分类器的分类结果进行投票获得投票结果,根据投票结果确定三个第一分类器的分类结果中的标签预测结果,标签预测结果用于指示在线开源组件包是否为合法包,其中不为合法包的在线开源包为恶意包。
在一个可选的示例中,处理单元还用于:获取自适应提升算法分类器,自适应提升算法分类器中包括N个对应不同权值的第二分类器,N个对应不同权值的第二分类器根据本地恶意包的多个恶意特征训练获得;对本地恶意包的源代码进行特征提取,获得本地恶意包的特征信息;将本地恶意包的特征信息分别输入自适应提升算法分类器,训练获得三个第一分类器作为AI标注模型。
在一个可选的示例中,处理单元具体用于:将在线开源组件包的特征信息输入增量AI模型,采用增量AI模型对在线开源组件包进行推理,确定在线开源组件包是否为合法包,并确定不为合法包的在线开源组件包为疑似恶意包。
在一个可选的示例中,特征信息包括风险函数特征、API调用序列特征和操作码序列特征,处理单元还用于:对本地开源组件包中的本地恶意包和本地合法包进行特征提取,获取本地恶意包的特征信息和本地合法包的特征信息;将本地恶意包的特征信息和本地合法包的特征信息作为初始支持向量机SVM算法分类器的输入进行迭代,直到确定初始SVM算法分类器的预测准确率大于第一预设阈值,获得最终SVM算法分类器作为增量AI模型。
在一个可选的示例中,在确定不为合法包的在线开源组件包为疑似恶意包之后,处理单元还用于:对疑似恶意包进行信誉评估,获得疑似恶意包的信誉评分,并根据疑似恶意包的信誉评分确定疑似恶意包是否为合法包,其中不为合法包的疑似恶意包为恶意包。
在一个可选的示例中,信誉评估包括以下一项或多项:疑似恶意包的依赖包评估,疑似恶意包的包名评估,疑似恶意包的结构评估,疑似恶意包的作者信誉评估,疑似恶意包的包信誉评估。
在一个可选的示例中,该装置还包括更新单元,用于:根据增量恶意包数据库中的恶意包获取增量恶意特征规则和/或增量信息规则;根据增量恶意特征规则和/或增量信息规则更新规则数据库。
在一个可选的示例中,该装置还包括封信单元,用于:对目标恶意包进行特征提取,获得目标恶意包的特征信息,目标恶意包为增量恶意包数据库中的部分或全部恶意包;将目标恶意包的特征信息作为增量AI模型的输入进行迭代,获得更新后的增量AI模型。
第三方面,本申请实施例提供一种通信装置,该装置包括通信接口和至少一个处理器,该通信接口用于该装置与其它设备进行通信。示例性的,通信接口可以是收发器、电路、总线、模块或其它类型的通信接口。至少一个处理器用于调用一组程序、指令或数据,执行上述第一方面或第二方面描述的方法。该装置还可以包括存储器,用于存储处理器调用的程序、 指令或数据。存储器与至少一个处理器耦合,该至少一个处理器执行该存储器中存储的、指令或数据时,可以实现上述第一方面描述的方法。
第四方面,本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有指令,当该指令在计算机上运行时,使得计算机执行如第一方面或第一方面中任一种可能的实现方式中的方法。
第五方面,本申请实施例提供了一种芯片系统,该芯片系统包括处理器,还可以包括存储器,用于实现上述第一方面或第一方面中任一种可能的实现方式中的方法,该芯片系统可以由芯片构成,也可以包含芯片和其他分立器件。
在一个可能的示例中,该芯片系统还包括收发器。
第六方面,本申请实施例中还提供一种计算机程序产品,包括指令,当其在计算机上运行时,使得计算机执行如第一方面或第一方面中任一种可能的实现方式中的方法。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例中所需要使用的附图作简单地介绍。
图1A为本申请实施例提供的一种软件供应链流程图;
图1B为本申请实施例提供的一种恶意包造成安全风险的示意图;
图1C为本申请实施例提供的一种新的软件供应架构示意图;
图2A为本申请实施例提供的一种开源组件包的安全检测方法流程图;
图2B为本申请实施例提供的一种抽象语法树示意图;
图2C为本申请实施例提供的一种反汇编文件示意图;
图2D为本申请实施例提供的另一种开源组件包的安全检测方法流程图;
图2E为本申请实施例提供的另一种开源组件包的安全检测方法流程图;
图2F为本申请实施例提供的另一种开源组件包的安全检测方法流程图;
图3为本申请实施例提供的一种安全检测装置结构框图;
图4为本申请实施例提供的一种电子装置的结构示意图。
具体实施方式
本申请的说明书和权利要求书及所述附图中的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。此外,术语“包括”和“具有”以及它们任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或模块的过程、方法、系统、产品或设备没有限定于已列出的步骤或模块,而是可选地还包括没有列出的步骤或模块,或可选地还包括对于这些过程、方法、产品或设备固有的其它步骤或模块。
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。
“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。字符“/”一般表示前后关联对象是一种“或”的关系。
首先对本申请实施例的专业术语进行介绍。
开源组件包:开源(Open Source)全称为开放源代码。即是说,对于一个开源软件,任何人都可以得到软件的源代码,并在版权范围内加以修改学习,甚至重新发放。组件包是对数据和方法进行简单封装形成包,可用于作为一个系统的部分进行组合。在不同的编程语言中,组件包也可以被称为部件包或控件包等。那么,开源组件包即是开放源代码的组件包。
包管理器:开源组件包的在线存储仓库,所有开发者均可以上传组件包到该包管理器,同时也可以从该包管理器中获取组件包用于自身开发。
开源镜像仓:用于进行组件包存储和管理的本地私有存储仓库。本地通常可以指一个公司,一个部门甚至个人的开发设备。
软件开发平台:一个组织机构,通常为一个公司内的软件开发人员使用开源组件包研发软件产品的服务平台。
以下对本申请实施例的应用场景进行介绍。
请参阅图1A,图1A为本申请实施例提供的一种软件供应链流程图,如图1A所示,在软件供应链流程中,首先由软件供应商(可以是个人,也可以是组织或机构)开发开源组件包,完成开发后,将开源组件包发布到包管理器上,供其他开发者使用。软件开发平台将包管理器同步到本地私有开源镜像仓,其他开发者需要时,可以从开源镜像仓下载开源组件包,然后使用这些开源组件包,包括直接安装,或者在开源组件包的基础上进行再次开发形成新的功能模块等。
在上述过程中,在将开源组件包下载到开源镜像仓后,由于开源组件包数量巨大和频繁更新,导致开源镜像仓管理人员很难检查每一个组件包的合法性,那么公司内部开发环境和发布产品的安全性无法得到保障,这些风险会影响公司平台安全性和开发环境安全性。
具体如图1B所示,图1B为本申请实施例提供的一种恶意包造成安全风险的示意图,如图1B所示,开源组件包恶意攻击者将恶意包上传到包管理器,开源镜像仓管理人员无法识别组件包的安全性,将恶意包同合法包一起同步到开源镜像仓,而软件开发人员由于无安全知识背景无法排查所使用组件包的安全性,会给其研发环境带来安全风险。
恶意包引起的安全风险主要通过两个途径触发:
一是开发人员从开源镜像仓安装开源组件包时,恶意包会在安装阶段发起攻击,攻击代码从网络段远程执行,无本地文件存留(图1B中的攻击途径①)。软件开发人员主机中安装的恶意代码扫描软件又属于针对本地文件的扫描软件,从而无法扫描出在线恶意包,攻击者就会在安装阶段即对开发人员主机发起信息窃取、分布式拒绝服务攻击(distributed denial of service attack,DDoS)等攻击;
二是恶意包攻击者在安装阶段隐藏自己不发起攻击,而是将恶意代码隐藏在开源组件包中,公司软件开发人员在编写产品源代码时为了加快开发速度会调用各种开源组件包实现一些功能模块(图1B中的攻击途径②)。此时,调用的组件包可能是合法包也可能是攻击者精心伪装的恶意包,公司软件开发人员发布产品时会把源代码和组件包一起打包发布,此时打包的产品还需要经过公司恶意代码扫描软件的安全检查,然而当前恶意代码扫描软件无法识别开源组件包的安全性,这种情况下恶意攻击者就会通过恶意包注入恶意代码到开发产品以躲避杀毒软件扫描。
基于上述描述,本申请实施例公开了一种新的软件供应架构,如图1C所示,为本申请实施例提供的一种新的软件供应架构示意图,在包管理器和开源镜像仓之间,引入镜像仓安全中心,用于从包管理器获取在线开源组件包,然后对在线开源组件包进行安全检测,过滤安全组件包,并将安全组件包存储到开源镜像仓,这样软件开发人员就能够保证从开源镜像 仓获取并使用的开源组件包的安全性,进而保证研发环境的安全性。需要说明的是,镜像仓安全中心可以是一个独立的模块,也可以是一个与开源镜像仓组合的模块,本申请实施例中不做具体限定。
具体地,本申请实施例中提供了一种开源组件包的安全检测方法,用于通过图1C中的软件供应架构执行,如图2A所示,该方法包括如下步骤:
201、获取在线开源组件包,并对在线开源组件包进行特征提取,获取在线开源组件包的特征信息;
202、针对开源组件包的特征信息进行安全检测,确定在线开源组件包是否为合法包;
203、若在线开源组件包中的第一组件包为合法包,则将第一组件包同步到本地开源镜像仓,本地开源镜像仓用于向用户提供调用的开源组件包。
上述方法步骤的执行主体为图1C中的镜像仓安全中心,可以为一个独立的功能模块,也可能为开源镜像仓中的部分功能。以下内容对此不再赘述。对应地,该执行主体的对应的硬件实体可能为一个终端设备,也可能为一个服务器,或者计算中心等。
在线开源组件包由软件供应商发布在包管理器上,机构或组织,通常为公司,将在线开源组件包下载到镜像仓安全中心,然后对在线开源组件包进行特征提取,获取开源组件包的特征信息。其中,特征提取包括可以对组件包内的源代码的函数或方法进行特征提取,获得特征信息。
具体地,获取在线开源组件包的特征信息的过程可以包括:
1)镜像仓安全中心遍历包管理器中存在的所有开源组件包,获取开源组件包包名称列表;
2)镜像仓安全中心遍历包名称列表中的每一个包名称,获取包管理器中开源组件包的JSON文件,解析JSON文件获得开源组件包的包文件下载链接;
3)镜像仓安全中心从开源组件包的包文件下载链接下载并解压包文件,从包文件中提取源代码,然后从源代码中提取包文件的特征,具体可以包括应用程序接口(application programming interface,API)调用序列特征、操作码序列特征和风险函数特征,组成开源组件包的特征信息。
以下对源代码的上述三个特征提取过程做具体介绍。
(1)首先介绍API调用序列特征和操作码序列特征的提取。
①.提供一段源代码实例1(清晰展现了代码逻辑关系):
Figure PCTCN2022127118-appb-000001
镜像仓安全中心扫描源代码中是否包含加解密函数,如果存在此类函数,则认定该文件包含混淆代码,混淆代码使用加解密函数将源代码中的部分代码片段转换成混乱的字符串,从而隐藏代码结构。
②.如果源代码包含混淆代码,提供一段混淆代码实例2(上述示例1对应的混淆代码,可以看出混淆代码无法有效阅读):
Figure PCTCN2022127118-appb-000002
Figure PCTCN2022127118-appb-000003
③.如果源代码不包含混淆代码,镜像仓安全中心使用源代码对应的编程语言的抽象语法树提取功能提取源代码的抽象语法树,从抽象语法树的节点中提取源代码的API调用序列;镜像仓安全中心使用源代码汇编功能汇编源代码,从汇编文件中提取操作码序列。
可参阅图2B,为本申请实施例提供的一种抽象语法树示意图,如图2B所示,API调用序列为抽象语法树中根节点到叶节点形成的序列,例如一个API调用序列实例为:Store、Name、Assign、FunctionDef、Module。
生成的反汇编文件可参阅图2C,为本申请实施例提供的一种反汇编文件示意图,如图2C所示,提取反汇编文件中的操作码:LOAD_CONST、LOAD_CONST、MAKE_FUNCTION、STORE_NAME、LOAD_CONST、RETURN_VALUE、LOAD_CONST、STORE_FAST,生成对应的操作码序列。
可选情况下,还可以通过其他方式获取源代码的API调用序列和操作码序列,例如:
A.如果源代码包含混淆代码,镜像仓安全中心将包含混淆代码的源代码拷贝到沙箱(Sandbox)运行,镜像仓安全中心可以在沙箱中实时运行混淆代码,从实时运行中监控混淆代码的代码逻辑,并根据混淆代码运行时的监控记录提取其API调用序列和操作码序列。
B.镜像仓安全中心使用n元语法模型(n-gram)和词频-逆文本频率指数(term frequency-inverse document frequency,tf-idf)技术对API调用序列和操作码序列进行特征选择。
首先镜像仓安全中心使用n-grams技术对API调用序列和操作码序列进行分块,每n个API调用或操作码为一块,以操作码为例的n-grams如下:
输入:
组件包A操作码序列:LOAD_GLOBAL、LOAD_FAST、CALL_FUNCTION、RETURN_VALUE
组件包B操作码序列:LOAD_GLOBAL、LOAD_FAST、MAKE_FUNCTION、STORE_NAME
输出:假设n=2,
组件包A的n-grams为(LOAD_GLOBAL、LOAD_FAST)、(LOAD_FAST、CALL_FUNCTION)、(CALL_FUNCTION、RETURN_VALUE)
组件包B的n-grams为(LOAD_GLOBAL、LOAD_FAST)、(LOAD_FAST、MAKE_FUNCTION)、(MAKE_FUNCTION、STORE_NAME)
然后镜像仓安全中心使用tf-idf技术计算API调用序列和操作码序列的n-grams块的tf-idf值,利用n-grams块的tf-idf值删除tf-idf值低于镜像仓安全中心预设置tf-idf阈值的n-grams块,余下的n-grams块即是特征选择的结果,n-grams块结合tf-idf的值组合成操作码的序列特征。同理,可得到组件包的API调用的序列特征。
tf-idf值实例:
输入:
组件包A的n-grams为(LOAD_GLOBAL、LOAD_FAST)、(LOAD_FAST、CALL_FUNCTION)、(CALL_FUNCTION、RETURN_VALUE)
组件包B的n-grams为(LOAD_GLOBAL、LOAD_FAST)、(LOAD_FAST、MAKE_FUNCTION)、(MAKE_FUNCTION、STORE_NAME)
输出:
组件包A的n-grams的tf-idf值为:0.0242,0.6479,0.8594;
组件包B的n-grams的tf-idf值为:0.0149,0.5946,0.8843;
假设镜像仓预设置tf-idf值为0.5,则特征选择的结果为:
组件包A特征选择:(LOAD_FAST、CALL_FUNCTION)、(CALL_FUNCTION、RETURN_VALUE);
组件包B特征选择:(LOAD_FAST、MAKE_FUNCTION)、(MAKE_FUNCTION、STORE_NAME);
结合特征选择结果和tf-idf值,镜像仓安全中心提取到组件包的序列特征为:
组件包A的序列特征:{(LOAD_FAST、CALL_FUNCTION):0.6479,(CALL_FUNCTION、RETURN_VALUE):0.8594};
组件包B的序列特征:{(LOAD_FAST、MAKE_FUNCTION):0.5946,(MAKE_FUNCTION、STORE_NAME):0.8843}。
(2)然后介绍风险函数特征的提取。
首先是风险函数的确定。风险函数可以为本地风险函数数据库中存储的函数,本地风险函数数据库可以由研发人员预先设置风险函数,或者由研发人员统计获取风险函数等方式建立。风险函数包括网络连接、命令执行和文件读写等函数,镜像仓安全中心将风险函数组合成风险函数特征。
提供一个确定风险函数特征的实例3:
输入:
组件包A包含风险函数:socket.recv,urllib.urlretrieve,fileinput.input,os.popen,ctypes.CDLL;
组件包A风险函数出现次数:3,2,2,1,5;
输出:
组件包A的风险函数特征:
{socket.recv:3,urllib.urlretrieve:2,fileinput.input:2,os.popen:1,ctypes.CDLL:5}。
上述1)~3)对应过程获取的包文件的特征可以作为开源组件包的特征信息,可选地,可以将这些开源组件包的特征信息存储到组件包信息数据库中。
针对开源组件包的特征信息进行安全检测,具体可以包括进行人工智能(artificial intelligence,AI)模型检测,规则库的规则校验,或者其他方式文件签名、启发式检测等。本申请实施例中采用三种不同的方式进行针对特征信息的安全检测,分别为规则数据库匹配,AI标注模型标注,增量AI模型分类。
针对第一种安全检测方式,可参阅图2D,为本申请实施例提供的另一种开源组件包的安全检测方法流程图,如图2D所示,该方法与图2A所示的方法区别在于,将图2A中的步骤202替换为:202a、将在线开源组件包的特征信息与规则数据库中的多条规则进行匹配,根据匹配程度确定在线开源组件包是否为合法包。
在将开源组件包的特征信息与规则数据库中多条规则进行匹配之前,先获取规则数据库。规则数据库中包括多条可以与开源组件包的特征信息进行匹配的规则,具体为根据本地恶意包的特征信息生成的多条规则。本地恶意包是指存储在开源镜像仓或其他开发者本地数据库中的,已经被判定为恶意包的组件包。假设规则数据库中的多条规则根据本地恶意包的特征信息生成,那么根据在线开源组件包的特征信息与规则数据库中的规则的匹配程度可以确定 在线开源组件包是否为恶意包,匹配程度越高,在线开源组件包为恶意包的概率越大,匹配程度越低,在线开源组件包不为恶意包(即为合法包)的概率越大。
本申请实施例以获取本地恶意包的特征信息生成规则数据库中的多条规则为例进行说明。
同样的,可以对本地恶意包进行特征提取,获取本地恶意包的特征信息,为风险函数特征、API调用序列特征和操作码序列特征,具体获取方式可参阅前述步骤1)~3),在此不再赘述。然后根据这些特征信息生成规则数据库中的多条规则。以本申请实施例中的规则数据库为yara数据库,那么根据特征信息生成yara规则。
可选情况下,还可以对本地恶意源代码进行特征提取,本地恶意源代码可以是预先存储在本地镜像仓中被判断为恶意源代码的网页源码,也可以是其他途径获取的恶意源代码等。对恶意源代码的特征提取方式可以参阅前述步骤也可以参阅前述步骤1)~3),获取到本地恶意源代码的特征信息,包括风险函数特征、API调用序列特征和操作码序列特征。
然后,可以组合本地恶意包的特征信息和本地恶意源代码的特征信息,生成规则数据库中的多条恶意特征规则。以规则数据库为yara规则数据库为例,将根据本地恶意包的特征信息和本地恶意源代码的特征信息生成的多条恶意特征yara规则,具体包括:
镜像仓安全中心获取本地恶意包的特征信息和本地恶意源代码的特征信息,统一保存到恶意特征数组{M 1,…,M i,…,M n|n≥1}。镜像仓安全中心删除恶意特征数组中的重复恶意特征,最终得到恶意特征数组{M 1,…,M i,…,M z},镜像仓安全中心将得到的恶意特征数组{M 1,…,M i,…,M z}依据yara规则编写要求生成恶意特征yara规则,并保存到yara规则数据库。
以下提供一个恶意特征yara规则实例:
Figure PCTCN2022127118-appb-000004
生成yara规则数据库后,将获取的在线开源组件包的特征信息与yara规则数据库中的多条恶意特征yara规则进行匹配,匹配程度越高,则说明在线开源组件包为恶意包的概率越大。匹配程度具体可以通过匹配规则条数来确定,例如当在线开源组件包的特征信息与yara规则数据库中的恶意特征yara规则匹配条数大于或等于K时,确定开源组件包为恶意包,否则确定开源组件包为合法包,其中K为正整数。
可能的情况下,还可以对在线开源组件包进行创建信息提取,针对其创建信息也进行规则数据库中的规则匹配,根据匹配程度进一步确定在线开源组件包为恶意包或合法包。
对在线开源组件包进行创建信息提取具体可以包括:
4)镜像仓安全中心遍历包管理器中需要进行安全检测的在线开源组件包,获取在线开源组件包包名称列表。
5)镜像仓安全中心遍历包名称列表中的每一个包名称,获取开源组件包的JSON文件,解析JSON文件获得开源组件包的包文件下载链接、源代码存储网站链接(如Github),源代码评分网站链接(如SourceRank),和依赖文件requirements.txt。
实例:对于组件包esprima
包文件下载链接:
https://files.pythonhosted.org/packages/86/61/ff7a62bcf79cebb6faf42c0ff28756c152a9dcf7244019093ca4513d80ee/esprima-4.0.1.tar.gz;
源代码存储网站链接:{Homepage:https://github.com/Kronuz/esprima-python};
源代码评分网站链接:https://libraries.io/pypi/esprima;
requirements.txt:
numpy==1.16.0  //numpy是依赖组件包名称,1.16.0是版本号信息
Keras==2.4.3
tornado==6.0.3
chardet==3.0.4
6)镜像仓安全中心从依赖文件requirements.txt文件中获取开源组件包的依赖包名称。再使用如上述步骤4)和步骤5)的方式获取依赖包的包文件下载链接、源代码存储网站链接,源代码评分网站链接和依赖文件。
7)镜像仓安全中心提取包创建信息。镜像仓安全中心从包文件下载链接下载并解压开源组件包及其依赖包的包文件,并从包文件中提取并解析配置文件,镜像仓安全中心从这些配置文件中提取包名、作者、作者邮箱、所属机构、描述、包文件结构、维护人员等包创建信息。可选地,可以将这些创建信息保存到组件包信息数据库。
对应地,规则数据库中可以包括创建信息相关的信息规则。具体步骤包括:镜像仓安全中心获取所述本地恶意包的创建信息;镜像仓安全中心从外部数据库中获取黑客信息;镜像仓安全中心将所述本地恶意包的创建信息和所述黑客信息作为所述规则数据库中的恶意信息规则。
其中,本地恶意包的创建信息可以包括本地恶意包的包名、作者和作者邮箱等信息;黑客信息可以为预先存储在本地黑客信息数据库中的信息,那么镜像仓安全中心提取黑客信息,包括提取黑客姓名和黑客邮箱等。同样的,可以直接将这些创建信息作为规则数据库中的规则进行存储,也可以对这些创建信息进行编写形成规则数据库需要的格式再进行存储。
假设按照yara规则编写要求生成恶意信息yara规则,并保存到yara规则数据库。具体为:恶意信息yara规则实例:
Figure PCTCN2022127118-appb-000005
Figure PCTCN2022127118-appb-000006
可选地,针对开源组件包的特征信息进行安全检测还包括:将在线开源组件包的创建信息与规则数据库中的恶意信息规则进行匹配。
生成恶意信息yara规则后,将其与在先开源数据包的创建信息进行匹配,匹配程度越高,则说明在线开源组件包为恶意包的概率越大。同样的,匹配程度具体可以通过匹配规则条数来确定。另外,可以结合yara规则数据库中的恶意特征yara规则和恶意信息yara规则与在线开源组件包的特征信息和创建信息总共的匹配条数来确定在线开源组件包为恶意包的概率,例如当在线开源组件包的特征信息和创建信息总共与yara规则数据库中的规则(包括恶意特征yara规则和恶意信息yara规则)匹配条数大于M时,确定在线开源组件包为恶意包。
可见,在本申请实施例中,采用在线开源组件包的特征信息(或者还包括创建信息)与规则数据库中的多条规则进行匹配,根据两者的匹配程度确定在线开源组件包是否为合法包。这个过程中,生成规则数据库是一个较为直接和简明的步骤,进而可以降低安全检测过程中的处理资源消耗,提升安全检测效率。
针对第二种安全检测方式,可参阅图2E,为本申请实施例提供的另一种开源组件包的安全检测方法流程图,如图2E所示,该方法与图2A所示的方法区别在于,将图2A中的步骤202可以替换为:202b、将在线开源组件包的特征信息输入人工智能AI标注模型,采用AI标注模型对在线开源组件包进行推理,确定在线开源组件包是否为合法包。
具体地,对在线开源组件包的特征提取的过程可以参阅前述步骤1)~步骤3),在此不再赘述。或者,在前述步骤根据获取特征信息并将特征信息存储到组件包信息数据库中之后,直接从组件包信息数据库中读取特征信息。
人工智能AI标注模型,用于根据在线开源组件包的特征信息标注在线开源组件包。例如假设AI标注模型为采用本地恶意包的恶意特征进行训练获得的模型,则将在线开源组件包的特征信息输入AI标注模型,可以获得标注结果标注在线开源组件包是恶意包,或者不是恶意包(即为合法包)。反之,假设AI标注模型为采用本地合法包的合法特征进行训练获得的模型,则将在线开源组件包的特征信息输入AI标注模型,可以获得标注结果标注在线开源组件包是合法包,或者不是合法包(即为恶意包)。
AI标注模型推理过程实例:
输入:在线开源组件包的特征信息;
输出:标注该在线开源组件包为恶意包或者合法包。
本申请实施例以本地恶意包的特征信息训练获得AI标注模型为例进行具体说明。
可选地,特征向量包括风险函数特征、API调用序列特征和操作码序列特征,将在线开源组件包的特征信息输入AI标注模型,采用AI标注模型对在线开源组件包进行推理,确定在线开源组件包是否为合法包,包括:将在线开源组件包的特征向量分别输入三个第一分类器,获得三个第一分类器中每个第一分类器的分类结果;使用绝对多数投票法对每个第一分类器的分类结果进行投票获得投票结果,根据投票结果确定三个第一分类器的分类结果中的 标签预测结果,标签预测结果用于指示在线开源组件包是否为合法包。
具体地,AI标注模型可以是一个组合分类器,具体可以为自适应提升Adaboost算法分类器,随机森林分类器等的组合分类器。这种组合分类器可以分别根据在线开源组件包的三个特征信息进行推理,得到三个在线开源组件包是否为恶意包的推理结果,然后采用然后采用绝对多数投票法对三个分类结果进行投票,根据投票结果确定这三个分类结果中,在线开源组件包的标签预测结果。具体地,可以确定三个分类结果中,票数大于或等于50%的预测结果为标签预测结果;或者确定票数最多的预测结果为标签预测结果等。这种AI标注模型可以提升分类结果准确率。
在采用AI标注模型进行分类标注之前,需要对AI标注模型进行训练,本申请实施例以第一分类器为训练后的Adaboost算法分类器为例进行说明,训练AI标注模型的过程包括:获取自适应提升算法分类器,自适应提升算法分类器中包括N个对应不同权值的第二分类器,N个对应不同权值的第二分类器根据本地恶意包的多个恶意特征训练获得;对本地恶意包的源代码进行特征提取,获得本地恶意包的特征向量;将本地恶意包的特征向量分别输入自适应提升算法分类器,训练获得三个第一分类器作为AI标注模型。
上述过程中,获取自适应提升算法分类器,是指获取初始Adaboost算法分类器(未经过训练的),每个初始Adaboost算法中包括N个具有不同权值的第二分类器,第二分类器也可以被称为弱分类器,其含义为,这种未经过训练的分类器的分类精确度低,通常为50%及以下,具体可以为支持向量机(support vector machine,SVM)分类器。将风险函数特征、API调用序列特征和操作码序列特征,每个特征分别作为一个Adaboost算法分类器的输入,从而生成三个训练任务:任务1、任务2和任务3。镜像仓安全中心在每一个Adaboost算法分类器训练期间,计算N个弱分类器的错误率,根据错误率更新每一个弱分类器权重,经过T轮迭代更新,得到三个第一分类器,相对应的,第一分类器的分类结果准确率更高,通常能够达到80%或80%以上,因此第二分类器可以被称为强分类器。因此,三个特征学习任务共得到三个强分类器H1、H2、H3,组成前述AI标注模型。
可见,在本申请实施例中,采用AI标注模型对在线开源组件包进行安全检测,该过程由于AI标注模型为机器学习模型,经过迭代训练获得,因此AI标注模型具有确定性,那么将在线开源组件包的特征信息输入AI标注模型获得的推理结果,能够保证结果的准确性。
针对第三种安全检测方式,可参阅图2F,为本申请实施例提供的另一种开源组件包的安全检测方法流程图,如图2F所示,该方法与图2F所示的方法区别在于,将图2A中的步骤202可以替换为:202c、将在线开源组件包的特征信息输入增量AI模型,采用增量AI模型对在线开源组件包进行推理,确定在线开源组件包是否为合法包,并确定不为合法包的在线开源组件包为疑似恶意包。
具体地,对在线开源组件包的特征提取的过程可以参阅前述步骤1)~步骤3),在此不再赘述。或者,在前述步骤根据获取特征信息并将特征信息存储到组件包信息数据库中之后,直接从组件包信息数据库中读取特征信息。
增量AI模型,与前述AI标注模型的区别在于,该过程提取了本地恶意包和本地合法包的特征,采用本地恶意包和本地合法包的特征信息共同训练获得增量AI模型,则增量AI模型可以用于推理获得在线开源组件包是否为合法包。
增量AI模型推理过程实例:
输入:在线开源组件包的特征信息,合法阈值Δ;
输出:在线开源组件包的可疑程度值θ,若θ≥Δ,镜像仓安全中心判断组件包为疑似恶 意包,需要进一步分析;若θ<Δ,镜像仓安全中心判断在线开源组件包为合法包。
具体地,增量AI模型输出是一个概率值,例如[0.6,0.4],其中恶意包预测概率为0.6,合法包预测概率为0.4,两个值之和确定为1,合法阈值也是概率值,例如0.5,可疑程度值对应为恶意包预测概率,则可疑程度值0.6>合法阈值0.5,判断该包为疑似恶意包。
可选地,特征信息包括风险函数特征、API调用序列特征和操作码序列特征,该方法还包括:对本地开源组件包中的本地恶意包和本地合法包进行特征提取,获取本地恶意包的特征信息和本地合法包的特征信息;将本地恶意包的特征信息和本地合法包的特征信息作为初始支持向量机SVM算法分类器的输入进行迭代,直到确定初始SVM算法分类器的预设准确率大于第一预设阈值,获得最终SVM算法分类器作为增量AI模型。
具体地,初始增量AI模型可以是一个初始SVM算法分类器,在选择具体的SVM算法分类器时,可以考虑如下因素:对训练样本数值不敏感,在更新训练的过程中,无法确定训练样本的具体数值,因此使用模糊算法降低SVM算法对训练样本数值的依赖性;加快迭代训练速度,因此可以考虑最小二乘方法;应对训练过程中样本不平衡的场景,因此可以考虑孪生SVM。那么,该初始SVM算法分类器例如可以是初始模糊最小二乘孪生SVM分类器。镜像仓安全中心获取本地组件包,包括本地恶意包和本地合法包的特征信息,具体包括风险函数特征、API调用序列特征和操作码序列特征,将这三种特征进行组合,即F 1维的风险函数特征、F 2维的API调用序列特征和F 3维的操作码序列特征组合成一个(F 1+F 2+F 3)维的组合特征。然后镜像仓安全中心将组合特征作为初始SVM算法的输入,进行迭代训练,直到确定该SVM算法分类器的预测准确率大于第一预设阈值,获得增量AI模型。
可见,在本申请实施例中,采用增量AI模型对在线开源组件包进行安全检测,该过程由于增量AI模型的训练过程中,同时考虑了本地恶意包和本地合法包的特征信息,使得增量AI模型的推理结果考虑更全面,将不为合法包的在线开源组件包确定为疑似恶意包,并进行再判断,可以进一步提升安全检测结果的准确性,减少误判的概率。
对疑似恶意包进行再判断,可以包括如前述规则数据库生成规则过程,考虑在线开源组件包的创建信息,或者,还可以考虑在线开源组件包的结构,依赖包,包名等信息。
可选的,在确定不为合法包的在线开源组件包为疑似恶意包之后,图2F对应方法还包括步骤204:对疑似恶意包进行信誉评估,获得疑似恶意包的信誉评分,根据疑似恶意包的信誉评分确定疑似恶意包是否为合法包,其中不为合法包的疑似恶意包为恶意包。信誉评估包括以下一项或多项:疑似恶意包的依赖包评估,疑似恶意包的包名评估,疑似恶意包的结构评估,疑似恶意包的作者信誉评估,疑似恶意包的包信誉评估。
具体地,对疑似恶意包进行依赖包评估,获得依赖得分,包括:获取多个疑似恶意包中的任意疑似恶意包的依赖包,确定依赖包为恶意包的概率;根据依赖包为恶意包的概率确定任意在线开源组件包的依赖得分,其中依赖得分高低与依赖包为恶意包的概率大小呈正相关。
疑似恶意包可以为在线开源组件包,其依赖包获取方法具体可参阅前述步骤4)~步骤6),或者依赖包也可以为本地组件包。确定依赖包为恶意包的概率,可以通过前述步骤202a中描述的规则数据库匹配方法,或者202b中描述的AI标注模型推理方法,增量AI模型推理方法或者其他方法。假设确定依赖包为恶意包,则依赖包为恶意包的概率为100%,在线开源组件包的依赖得分可以为1。假设确定依赖包为合法包,则依赖包为恶意包的概率为0,在线开源组件包的依赖得分可以为0。假设根据增量AI模型推理方法确定依赖包的可疑程度值为θ,则可以确定依赖包为恶意包的概率为(θ-Δ)/Δ*100%等等。
或者,若依赖包为在线开源组件包,则还可以通过计算依赖包包配置文件中作者邮箱的 域名在Google域名排行榜上的排名、依赖包包内的每个文件在威胁数据库中的得分、维护人员数量等,确定依赖包为恶意包的概率。
对疑似恶意包进行包名评估,获得包名得分,包括:获取多个在线开源组件包的包名,并根据多个在线开源组件包的包名中的流行组件包名生成流行组件包列表;将疑似恶意包包的包名与流行组件包列表中的流行组件包名进行匹配,确定疑似恶意包包的包名与流行组件包的相似度;根据疑似恶意包包的包名与流行组件包的相似度确定包名得分,包名得分高低与相似度大小呈正相关。
具体地,可以从开源组件包下载统计网站收集下载次数排名前P的开源组件包,并根据这些开源组件包的包名生成流行组件包列表。其中P可以是500,605,1001等。根据这些开源组件包的包名生成流行组件包列表,包括按照下载频率的大小生成流行组件包列表,或者按照最后下载时间的新鲜度生成流行组件包列表等。然后将疑似恶意包的报名与流行组件包列表中的包名相似度计算,具体包括进行语义相似度计算,距离对比等方式中的一种或多种,语义相似度计算例如可以采用语义相似度函数进行,提供一个语义相似实例如下:
输入:
疑似恶意包包名:organization  ////////流行包包名:organize
输出:0(语义相似).
提供一个Levenshtein距离对比实例:
输入:疑似恶意包包名:PyYMAL  ////////流行包包名:PyYAML
输出:Levenshtein距离=2.
表示疑似恶意包经过两步变换可得到流行包包名。可能的情况下,前述变换具体可以包括删除字符、同音字符、替换字符、交换字符、插入字符、分隔符、顺序置换、版本修改等操作。
疑似恶意包的报名与流行组件包列表中的包名语义相似度越高,或距离越近,则两者的相似度越高,那么疑似恶意包的包名得分越高。
对疑似恶意包进行结构评估,获得结构得分,包括:获取多个在线开源组件包的包名,并根据多个在线开源组件包的包名中的流行组件包名生成流行组件包列表;分别获取流行组件包列表中的开源组件包的文件结构的第一哈希值和疑似恶意包的文件结构的第二哈希值,并计算第一哈希值与第二哈希值之间的距离;根据第一哈希值和第二哈希值之间的距离确定结构得分,结构得分高低与距离大小呈负相关。
具体地,获取流行组件包列表的方式与前述描述相同,在此不再赘述。然后从流行组件包列表中,获取任意一个开源组件包的文件结构的第一哈希值,文件结构具体可以指开源组件包解压缩后形成的目录结构,将该目录结构输入哈希函数,即可获得对应的哈希值。同样的方法可以获取疑似恶意包的文件结构对应的第二哈希值。然后计算第一哈希值与第二哈希值进行距离,并根据第一哈希值和第二哈希值之间的距离确定结构得分,结构得分高低与距离大小呈负相关,也即距离越小,结构得分越高。例如假设第一哈希值与第二哈希值之间的距离为1,则结构得分为a/1,其中a为一个预设的值。可以获取一个疑似恶意包与流行组件包列表中多个开源组件包进行结构得分计算,然后将这些结构得分相加作为该疑似恶意包的最终结构得分。
对疑似恶意包进行作者信誉评估,获得作者信誉得分,包括:获取疑似恶意包的作者信誉特征,包括该作者上传的所有项目的受欢迎程度、使用人数总数、观看人数总数、活跃程度等,计算疑似恶意包中所有作者信誉特征数值之和得到作者信誉得分。
对疑似恶意包的包信誉评估,获得包信誉得分,包括:获取疑似恶意包的包信誉特征,包括该包的受欢迎程度、使用人数、阅读人数、组件包评分等,计算疑似恶意包中所有包信誉特征数值之和得到包信誉得分。
根据前述方法计算信誉评分后,如果疑似恶意包只进行了单项信誉评估,可以根据单项信誉评分确定最终的信誉评分,如果疑似恶意包进行了多项信誉评估,可以对多项信誉评分求和,加权求和,或者其他组合方式确定最终的信誉评分。以针对疑似恶意包进行了上述五种信誉评估为例,获得五种信誉评分,分别为
Figure PCTCN2022127118-appb-000007
然后再对这五种信誉评分进行求和得到最终的疑似恶意包信誉评分
Figure PCTCN2022127118-appb-000008
镜像仓安全中心将信誉评分
Figure PCTCN2022127118-appb-000009
与预先设置的评估阈值μ进行对比,若
Figure PCTCN2022127118-appb-000010
镜像仓安全中心判断该疑似恶意包为恶意包,若
Figure PCTCN2022127118-appb-000011
镜像仓安全中心判断该疑似恶意包为合法包。
以下提供一个信誉评估实例:
输入:标签为疑似恶意包的组件包
输出:
疑似恶意包A的五种信誉评分:1,1,2,0.4,0.3
疑似恶意包A的最终信誉评分为:3.7
镜像仓安全中心预设置的评估阈值为:5
则判断疑似恶意包A为恶意包。
可见,在本申请实施例中,采用信誉评估方式对疑似恶意包的进行进一步安全检测,丰富了安全检测的维度,提升了安全检测判断结构的可靠性。
镜像仓安全中心的最终目的是为了保证公司开源镜像仓的安全性,剔除恶意包,因此,如前述图2A以及图2D~图2F中的方法步骤203描述的,镜像仓安全中心将筛选出的合法包,本申请实施例中具体可以是通过规则数据库、AI标注模型、增量AI模型和信誉评估网络筛选出的合法包,同步到开源镜像仓。
另外,镜像仓安全中心可以将前述规则数据库、AI标注模型、信誉评估网络检测出的恶意包保存到增量恶意包数据库。可以设置增量恶意包数据库中增量恶意包的保存周期,例如保存周期为1周,1周内镜像仓安全中心检测出的恶意包都保存到增量恶意包数据库,并清空上周保存的增量恶意包数据库。
可见,在本申请实施例中,在将在线开源组件包同步到本地开源镜像仓之前,先对在线开源组件包进行安全检测,确定在线开源组件包为合法包的情况下,将其同步到本地开源镜像仓,使得恶意代码检测能力前移,构建安全开源仓库,有效抑制开源化对研发环境的安全影响,降低开源组件包使用者被攻击的可能性。
可选地,该方法还包括:对目标恶意包进行特征提取,获得目标恶意包的特征信息,目标恶意包为增量恶意包数据库中的部分或全部恶意包;将目标恶意包的特征信息作为增量AI模型的输入进行迭代,获得更新后的增量AI模型。
由于增量恶意包数据库中可能包括采用信誉评分网络确定的恶意包,因此针对这些恶意包进行特征提取,更新增量AI模型,可以优化增量AI模型,减少其判断出疑似恶意包的概率,提升分类效率。
可选地,该方法还包括:根据增量恶意包数据库中的恶意包获取增量恶意特征规则和/或增量信息规则;根据增量恶意特征规则和/或增量信息规则更新规则数据库。
镜像仓安全中心通过前述描述的方法生成增量恶意包数据库中恶意包的恶意特征规则和恶意信息规则,并将新提取的恶意特征规则和恶意信息规则与规则数据库中已有规则进行对比,剔除新提取的恶意特征规则和恶意信息规则与规则数据库中的重合规则,剩下的规则添加到规则数据库。
同样的,该过程也可以起到优化规则数据库的效果,提升规则数据库的分类准确率。
图3为本申请实施例提供的一种安全检测装置300,其可以用于执行上述图2A~图2F的方法和具体实施例。在一种可能的实现方式中,如图3所示,该装置300包括获取单元301,处理单元302和存储单元303。
获取单元301,用于获取在线开源组件包,并对在线开源组件包进行特征提取,获取在线开源组件包的特征信息;
处理单元302,用于针对开源组件包的特征信息进行安全检测,确定在线开源组件包是否为合法包;
存储单元303,用于若在线开源组件包中的第一组件包为合法包,则将第一组件包同步到本地开源镜像仓,本地开源镜像仓用于向用户提供调用的开源组件包。
可选地,存储单元303还用于:若确定在线开源组件包中的第二组件包为恶意包,则将第二组件包存储到增量恶意包数据库。
可选地,处理单元302具体用于:将在线开源组件包的特征信息与规则数据库中的多条规则进行匹配,根据匹配程度确定在线开源组件包是否为合法包。
可选地,处理单元302还用于:获取本地开源组件包中的本地恶意包,并对本地恶意包进行特征提取,获取本地恶意包的恶意特征;获取本地恶意源代码,并对本地恶意源代码进行特征提取,获取本地恶意源代码的恶意代码特征;将本地恶意包的恶意特征和本地恶意源码的恶意代码特征作为规则数据库中的恶意特征规则。
可选地,获取在线开源组件包的特征信息还包括获取在线开源组件包的创建信息;
处理单元302还用于:获取本地恶意包的创建信息;从外部数据库中获取黑客信息;将本地恶意包的创建信息和黑客信息作为规则数据库中的恶意信息规则;针对开源组件包的特征信息进行安全检测还包括:将在线开源组件包的创建信息与规则数据库中的恶意信息规则进行匹配。
可选地,处理单元302还用于:将在线开源组件包的特征信息输入人工智能AI标注模型,采用AI标注模型对在线开源组件包进行推理,确定在线开源组件包是否为合法包,其中不为合法包的在线开源包为恶意包。
可选地,特征信息包括风险函数特征、API调用序列特征和操作码序列特征,将在线开源组件包的特征信息输入AI标注模型,采用AI标注模型对在线开源组件包进行推理,确定在线开源组件包是否为合法包,包括:将在线开源组件包的特征信息分别输入三个第一分类器,获得三个第一分类器中每个第一分类器的分类结果;使用绝对多数投票法对每个第一分类器的分类结果进行投票获得投票结果,根据投票结果确定三个第一分类器的分类结果中的标签预测结果,标签预测结果用于指示在线开源组件包是否为合法包,其中不为合法包的在线开源包为恶意包。
可选地,处理单元302还用于:获取自适应提升算法分类器,自适应提升算法分类器中包括N个对应不同权值的第二分类器,N个对应不同权值的第二分类器根据本地恶意包的多个恶意特征训练获得;
对本地恶意包的源代码进行特征提取,获得本地恶意包的特征信息;
将本地恶意包的特征信息分别输入自适应提升算法分类器,训练获得三个第一分类器作为AI标注模型。
可选地,处理单元302具体用于:将在线开源组件包的特征信息输入增量AI模型,采用增量AI模型对在线开源组件包进行推理,确定在线开源组件包是否为合法包,并确定不为合法包的在线开源组件包为疑似恶意包。
可选地,特征信息包括风险函数特征、API调用序列特征和操作码序列特征,处理单元302还用于:对本地开源组件包中的本地恶意包和本地合法包进行特征提取,获取本地恶意包的特征信息和本地合法包的特征信息;将本地恶意包的特征信息和本地合法包的特征信息作为初始支持向量机SVM算法分类器的输入进行迭代,直到确定初始SVM算法分类器的预测准确率大于第一预设阈值,获得最终SVM算法分类器作为增量AI模型。
可选地,在确定不为合法包的在线开源组件包为疑似恶意包之后,处理单元302还用于:
对疑似恶意包进行信誉评估,获得疑似恶意包的信誉评分,并根据疑似恶意包的信誉评分确定疑似恶意包是否为合法包,其中不为合法包的疑似恶意包为恶意包;信誉评估包括以下一项或多项:疑似恶意包的依赖包评估,疑似恶意包的包名评估,疑似恶意包的结构评估,疑似恶意包的作者信誉评估,疑似恶意包的包信誉评估。
可选地,该装置还更新单元304,用于:根据增量恶意包数据库中的恶意包获取增量恶意特征规则和/或增量信息规则;根据增量恶意特征规则和/或增量信息规则更新规则数据库。
可选地,该装置还包括更新单元304,用于:对目标恶意包进行特征提取,获得目标恶意包的特征信息,目标恶意包为增量恶意包数据库中的部分或全部恶意包;将目标恶意包的特征信息作为增量AI模型的输入进行迭代,获得更新后的增量AI模型。
可选的,上述处理单元302可以是中央处理器(Central Processing Unit,CPU)。
可选的,上述获取单元301可以是接口电路或者收发器。用于从其他电子装置接收或发送数据或指令。
可选的,存储单元303可以用于存储数据和/或信令,存储单元可以和获取单元301,以及处理单元302耦合。例如,处理单元302可以用于读取存储单元中的数据和/或信令,使得前述方法实施例中的开源组件包安全检测过程被执行。
如图4所示,图4示出了本申请实施例中的一种电子装置的硬件结构示意图。安全检测装置300的结构可以参考图4所示的结构。电子装置1000包括:存储器1001、处理器1002、通信接口1003和总线1004。其中,存储器1001、处理器1002、通信接口1003通过总线1004实现彼此之间的通信连接。
存储器1001可以是只读存储器(Read Only Memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(Random Access Memory,RAM)。存储器1001可以存储程序,当存储器1001中存储的程序被处理器1002执行时,处理器1002和通信接口1003用于执行本申请实施例的分布式渲染方法的各个步骤。
处理器1002可以采用通用的CPU,微处理器,应用专用集成电路(Application Specific Integrated Circuit,ASIC),GPU或者一个或多个集成电路,用于执行相关程序,以实现本申请实施例的安全检测装置300中的获取单元301,处理单元302和存储单元303所需执行的功能,或者执行本申请方法实施例的安全检测方法。
处理器1002还可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,本申请 的分布式渲染方法的各个步骤可以通过处理器1002中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器1002还可以是通用处理器、数字信号处理器(Digital Signal Processing,DSP)、专用集成电路(ASIC)、现成可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器1001,处理器1002读取存储器1001中的信息,结合其硬件完成本申请实施例的安全检测装置300中包括的模块所需执行的功能,或者执行本申请方法实施例的安全检测方法。
通信接口1003使用例如但不限于收发器一类的收发装置,来实现电子装置1000与其他设备或通信网络之间的通信。例如,可以通过通信接口1003获取确定的分割目标和/或候选目标边界框。总线1004可包括在电子装置1000各个部件(例如,存储器1001、处理器1002、通信接口1003)之间传送信息的通路。
应注意,尽管图4所示的电子装置1000仅仅示出了存储器、处理器、通信接口,但是在具体实现过程中,本领域的技术人员应当理解,电子装置1000还包括实现正常运行所必须的其他器件。同时,根据具体需要,本领域的技术人员应当理解,电子装置1000还可包括实现其他附加功能的硬件器件。此外,本领域的技术人员应当理解,电子装置1000也可仅仅包括实现本申请实施例所必须的器件,而不必包括图4中所示的全部器件。
此外,本申请还提供一种计算机程序,该计算机程序用于实现本申请提供的方法中由安全检测装置执行的操作和/或处理。
本申请还提供一种计算机可读存储介质,该计算机可读存储介质中存储有计算机程序或计算机可执行指令,当计算机程序或计算机可执行指令在计算机上运行时,使得计算机执行本申请提供的方法中由安全检测装置执行的操作和/或处理。
本申请还提供一种计算机程序产品,该计算机程序产品包括计算机可执行指令或计算机程序,当该计算机可执行指令或计算机程序在计算机上运行时,使得本申请提供的方法中由安全检测装置执行的操作和/或处理被执行。
It should be understood that, in the various embodiments of the present application, the sequence numbers of the foregoing processes do not imply an order of execution; the order of execution of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
A person of ordinary skill in the art may be aware that the modules and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the particular application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered to go beyond the scope of the present application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, devices and modules described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not repeated here.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative; for example, the division into modules is merely a division by logical function, and there may be other divisions in actual implementation; for example, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between devices or modules may be electrical, mechanical or in other forms.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed over multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional modules in the embodiments of the present application may be integrated into one processing module, or each module may physically exist alone, or two or more modules may be integrated into one module.
If the functions are implemented in the form of software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods of the embodiments of the present application. The foregoing storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

Claims (29)

  1. A security detection method for an open source component package, wherein the method comprises:
    obtaining an online open source component package, and performing feature extraction on the online open source component package to obtain feature information of the online open source component package;
    performing security detection on the feature information of the open source component package to determine whether the online open source component package is a legitimate package; and
    if a first component package in the online open source component package is a legitimate package, synchronizing the first component package to a local open source mirror repository, wherein the local open source mirror repository is configured to provide open source component packages for users to call.
  2. The method according to claim 1, wherein the method further comprises: if it is determined that a second component package in the online open source component package is a malicious package, storing the second component package into an incremental malicious package database.
  3. The method according to claim 1 or 2, wherein the performing security detection on the feature information of the open source component package to determine whether the online open source component package is a legitimate package comprises:
    matching the feature information of the online open source component package against a plurality of rules in a rule database, and determining, according to a degree of matching, whether the online open source component package is a legitimate package.
  4. The method according to claim 3, wherein the method further comprises:
    obtaining a local malicious package from local open source component packages, and performing feature extraction on the local malicious package to obtain malicious features of the local malicious package;
    obtaining local malicious source code, and performing feature extraction on the local malicious source code to obtain malicious code features of the local malicious source code; and
    using the malicious features of the local malicious package and the malicious code features of the local malicious source code as malicious feature rules in the rule database.
  5. The method according to claim 4, wherein the obtaining feature information of the online open source component package further comprises obtaining creation information of the online open source component package;
    the method further comprises:
    obtaining creation information of the local malicious package;
    obtaining hacker information from an external database; and
    using the creation information of the local malicious package and the hacker information as malicious information rules in the rule database; and
    the performing security detection on the feature information of the open source component package further comprises:
    matching the creation information of the online open source component package against the malicious information rules in the rule database.
  6. The method according to claim 1 or 2, wherein the performing security detection on the feature information of the open source component package to determine whether the online open source component package is a legitimate package comprises:
    inputting the feature information of the online open source component package into an artificial intelligence (AI) labeling model, and performing inference on the online open source component package by using the AI labeling model to determine whether the online open source component package is a legitimate package, wherein an online open source package that is not a legitimate package is a malicious package.
  7. The method according to claim 6, wherein the feature information comprises risk function features, API call sequence features and opcode sequence features, and the inputting the feature information of the online open source component package into the AI labeling model and performing inference on the online open source component package by using the AI labeling model to determine whether the online open source component package is a legitimate package comprises:
    inputting the feature information of the online open source component package into three first classifiers respectively, and obtaining a classification result of each of the three first classifiers; and
    voting on the classification result of each first classifier by using an absolute majority voting method to obtain a voting result, and determining, according to the voting result, a label prediction result among the classification results of the three first classifiers, wherein the label prediction result is used to indicate whether the online open source component package is a legitimate package, and an online open source package that is not a legitimate package is a malicious package.
  8. The method according to claim 7, wherein the method further comprises:
    obtaining an adaptive boosting algorithm classifier, wherein the adaptive boosting algorithm classifier comprises N second classifiers corresponding to different weights, and the N second classifiers corresponding to different weights are trained on a plurality of malicious features of local malicious packages;
    performing feature extraction on source code of the local malicious packages to obtain feature information of the local malicious packages; and
    inputting the feature information of the local malicious packages into the adaptive boosting algorithm classifier respectively, and training to obtain the three first classifiers as the AI labeling model.
  9. The method according to claim 1 or 2, wherein the performing security detection on the feature information of the open source component package to determine whether the online open source component package is a legitimate package comprises:
    inputting the feature information of the online open source component package into an incremental AI model, performing inference on the online open source component package by using the incremental AI model to determine whether the online open source component package is a legitimate package, and determining that an online open source component package that is not a legitimate package is a suspected malicious package.
  10. The method according to claim 9, wherein the feature information comprises risk function features, API call sequence features and opcode sequence features, and the method further comprises:
    performing feature extraction on local malicious packages and local legitimate packages in local open source component packages to obtain feature information of the local malicious packages and feature information of the local legitimate packages; and
    iterating with the feature information of the local malicious packages and the feature information of the local legitimate packages as inputs to an initial support vector machine (SVM) algorithm classifier until it is determined that a prediction accuracy of the initial SVM algorithm classifier is greater than a first preset threshold, to obtain a final SVM algorithm classifier as the incremental AI model.
  11. The method according to claim 9 or 10, wherein after determining that the online open source component package that is not a legitimate package is a suspected malicious package, the method further comprises:
    performing a reputation evaluation on the suspected malicious package to obtain a reputation score of the suspected malicious package, and determining, according to the reputation score of the suspected malicious package, whether the suspected malicious package is a legitimate package, wherein a suspected malicious package that is not a legitimate package is a malicious package, and the reputation evaluation comprises one or more of the following: an evaluation of dependency packages of the suspected malicious package, an evaluation of a package name of the suspected malicious package, an evaluation of a structure of the suspected malicious package, an evaluation of author reputation of the suspected malicious package, and an evaluation of package reputation of the suspected malicious package.
  12. The method according to claim 3 or 4, wherein the method further comprises:
    obtaining incremental malicious feature rules and/or incremental information rules according to malicious packages in the incremental malicious package database; and
    updating the rule database according to the incremental malicious feature rules and/or the incremental information rules.
  13. The method according to any one of claims 9 to 11, wherein the method further comprises:
    performing feature extraction on target malicious packages to obtain feature information of the target malicious packages, wherein the target malicious packages are some or all of the malicious packages in the incremental malicious package database; and
    iterating with the feature information of the target malicious packages as an input to the incremental AI model to obtain an updated incremental AI model.
  14. A security detection device, wherein the device comprises:
    an obtaining unit, configured to obtain an online open source component package, and perform feature extraction on the online open source component package to obtain feature information of the online open source component package;
    a processing unit, configured to perform security detection on the feature information of the open source component package to determine whether the online open source component package is a legitimate package; and
    a storage unit, configured to: if a first component package in the online open source component package is a legitimate package, synchronize the first component package to a local open source mirror repository, wherein the local open source mirror repository is configured to provide open source component packages for users to call.
  15. The device according to claim 14, wherein the storage unit is further configured to: if it is determined that a second component package in the online open source component package is a malicious package, store the second component package into an incremental malicious package database.
  16. The device according to claim 14 or 15, wherein the processing unit is specifically configured to:
    match the feature information of the online open source component package against a plurality of rules in a rule database, and determine, according to a degree of matching, whether the online open source component package is a legitimate package.
  17. The device according to claim 16, wherein the processing unit is further configured to:
    obtain a local malicious package from local open source component packages, and perform feature extraction on the local malicious package to obtain malicious features of the local malicious package;
    obtain local malicious source code, and perform feature extraction on the local malicious source code to obtain malicious code features of the local malicious source code; and
    use the malicious features of the local malicious package and the malicious code features of the local malicious source code as malicious feature rules in the rule database.
  18. The device according to claim 17, wherein the obtaining feature information of the online open source component package further comprises obtaining creation information of the online open source component package;
    the processing unit is further configured to:
    obtain creation information of the local malicious package;
    obtain hacker information from an external database; and
    use the creation information of the local malicious package and the hacker information as malicious information rules in the rule database; and
    the performing security detection on the feature information of the open source component package further comprises:
    matching the creation information of the online open source component package against the malicious information rules in the rule database.
  19. The device according to claim 14 or 15, wherein the processing unit is further configured to:
    input the feature information of the online open source component package into an artificial intelligence (AI) labeling model, and perform inference on the online open source component package by using the AI labeling model to determine whether the online open source component package is a legitimate package, wherein an online open source package that is not a legitimate package is a malicious package.
  20. The device according to claim 19, wherein the feature information comprises risk function features, API call sequence features and opcode sequence features, and the inputting the feature information of the online open source component package into the AI labeling model and performing inference on the online open source component package by using the AI labeling model to determine whether the online open source component package is a legitimate package comprises:
    inputting the feature information of the online open source component package into three first classifiers respectively, and obtaining a classification result of each of the three first classifiers; and
    voting on the classification result of each first classifier by using an absolute majority voting method to obtain a voting result, and determining, according to the voting result, a label prediction result among the classification results of the three first classifiers, wherein the label prediction result is used to indicate whether the online open source component package is a legitimate package, and an online open source package that is not a legitimate package is a malicious package.
  21. The device according to claim 20, wherein the processing unit is further configured to:
    obtain an adaptive boosting algorithm classifier, wherein the adaptive boosting algorithm classifier comprises N second classifiers corresponding to different weights, and the N second classifiers corresponding to different weights are trained on a plurality of malicious features of local malicious packages;
    perform feature extraction on source code of the local malicious packages to obtain feature information of the local malicious packages; and
    input the feature information of the local malicious packages into the adaptive boosting algorithm classifier respectively, and train to obtain the three first classifiers as the AI labeling model.
  22. The device according to claim 14 or 15, wherein the processing unit is specifically configured to:
    input the feature information of the online open source component package into an incremental AI model, perform inference on the online open source component package by using the incremental AI model to determine whether the online open source component package is a legitimate package, and determine that an online open source component package that is not a legitimate package is a suspected malicious package.
  23. The device according to claim 22, wherein the feature information comprises risk function features, API call sequence features and opcode sequence features, and the processing unit is further configured to:
    perform feature extraction on local malicious packages and local legitimate packages in local open source component packages to obtain feature information of the local malicious packages and feature information of the local legitimate packages; and
    iterate with the feature information of the local malicious packages and the feature information of the local legitimate packages as inputs to an initial support vector machine (SVM) algorithm classifier until it is determined that a prediction accuracy of the initial SVM algorithm classifier is greater than a first preset threshold, to obtain a final SVM algorithm classifier as the incremental AI model.
  24. The device according to claim 22 or 23, wherein after determining that the online open source component package that is not a legitimate package is a suspected malicious package, the processing unit is further configured to:
    perform a reputation evaluation on the suspected malicious package to obtain a reputation score of the suspected malicious package, and determine, according to the reputation score of the suspected malicious package, whether the suspected malicious package is a legitimate package, wherein a suspected malicious package that is not a legitimate package is a malicious package, and the reputation evaluation comprises one or more of the following: an evaluation of dependency packages of the suspected malicious package, an evaluation of a package name of the suspected malicious package, an evaluation of a structure of the suspected malicious package, an evaluation of author reputation of the suspected malicious package, and an evaluation of package reputation of the suspected malicious package.
  25. The device according to claim 16 or 17, wherein the device further comprises an updating unit, configured to:
    obtain incremental malicious feature rules and/or incremental information rules according to malicious packages in the incremental malicious package database; and
    update the rule database according to the incremental malicious feature rules and/or the incremental information rules.
  26. The device according to any one of claims 22 to 24, wherein the device further comprises an updating unit, configured to:
    perform feature extraction on target malicious packages to obtain feature information of the target malicious packages, wherein the target malicious packages are some or all of the malicious packages in the incremental malicious package database; and
    iterate with the feature information of the target malicious packages as an input to the incremental AI model to obtain an updated incremental AI model.
  27. A computer-readable storage medium, wherein instructions are stored thereon, and when the instructions are run, the instructions are used to implement the method according to any one of claims 1 to 13.
  28. A chip system, wherein the chip system comprises a processor, the processor is configured to execute a stored computer program, and the computer program is used to perform the method according to any one of claims 1 to 13.
  29. A computer program product, wherein the computer program product comprises a computer program, and when the computer program is run, the method according to any one of claims 1 to 13 is caused to be performed.
PCT/CN2022/127118 2021-10-31 2022-10-24 Security detection method and device for open source component package WO2023072002A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111279082.4 2021-10-31
CN202111279082.4A CN116089938A (zh) 2021-10-31 2021-10-31 开源组件包的安全检测方法及装置

Publications (1)

Publication Number Publication Date
WO2023072002A1 true WO2023072002A1 (zh) 2023-05-04

Family

ID=86160388

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/127118 WO2023072002A1 (zh) 2021-10-31 2022-10-24 开源组件包的安全检测方法及装置

Country Status (2)

Country Link
CN (1) CN116089938A (zh)
WO (1) WO2023072002A1 (zh)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030101290A1 (en) * 2001-11-29 2003-05-29 Chieng-Hwa Lin System and method for dynamic device driver support in an open source operating system
CN112906007A * 2021-02-09 2021-06-04 中国工商银行股份有限公司 Open source software vulnerability management and control method and device
CN113065125A * 2021-03-30 2021-07-02 深圳开源互联网安全技术有限公司 Docker image analysis method and device, electronic device, and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117034275A * 2023-10-10 2023-11-10 北京安天网络安全技术有限公司 Malicious file detection method, device and medium based on the Yara engine
CN117034275B * 2023-10-10 2023-12-22 北京安天网络安全技术有限公司 Malicious file detection method, device and medium based on the Yara engine

Also Published As

Publication number Publication date
CN116089938A (zh) 2023-05-09

Similar Documents

Publication Publication Date Title
Gibert et al. The rise of machine learning for detection and classification of malware: Research developments, trends and challenges
Chumachenko Machine learning methods for malware detection and classification
CN102254111B (zh) 恶意网站检测方法及装置
RU2614557C2 (ru) Система и способ обнаружения вредоносных файлов на мобильных устройствах
US10997307B1 (en) System and method for clustering files and assigning a property based on clustering
US11212297B2 (en) Access classification device, access classification method, and recording medium
Zhang et al. SaaS: A situational awareness and analysis system for massive android malware detection
US11106801B1 (en) Utilizing orchestration and augmented vulnerability triage for software security testing
CN108563951B (zh) 病毒检测方法及装置
US11916937B2 (en) System and method for information gain for malware detection
Huang et al. Open source intelligence for malicious behavior discovery and interpretation
Tchakounté et al. LimonDroid: a system coupling three signature-based schemes for profiling Android malware
Darus et al. Android malware classification using XGBoost on data image pattern
WO2023072002A1 (zh) 开源组件包的安全检测方法及装置
Alam et al. Looking beyond IoCs: Automatically extracting attack patterns from external CTI
Dib et al. Evoliot: A self-supervised contrastive learning framework for detecting and characterizing evolving iot malware variants
Rafiq et al. AndroMalPack: enhancing the ML-based malware classification by detection and removal of repacked apps for Android systems
US20240054210A1 (en) Cyber threat information processing apparatus, cyber threat information processing method, and storage medium storing cyber threat information processing program
Ravi et al. Analysing corpus of office documents for macro-based attacks using machine learning
KR102411383B1 (ko) 사이버 위협 정보 처리 장치, 사이버 위협 정보 처리 방법 및 사이버 위협 정보 처리하는 프로그램을 저장하는 저장매체
Cybersecurity Machine learning for malware detection
US11868473B2 (en) Method for constructing behavioural software signatures
CN114510717A (zh) 一种elf文件的检测方法、装置、存储介质
US20220237238A1 (en) Training device, determination device, training method, determination method, training method, and determination program
Anto et al. Kernel modification APT attack detection in android

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22885879

Country of ref document: EP

Kind code of ref document: A1