CN111367566A - Mixed source code feature extraction and matching method

Info

Publication number: CN111367566A
Application number: CN201910580956.6A
Authority: CN
Inventors: 巨李岗; 从慧珅; 田伟丽
Original assignee: Beijing Keyware Co ltd
Current assignee: Beijing Keyware Co ltd
Priority date: 2019-06-27
Filing date: 2019-06-27
Publication date: 2020-07-03

Abstract

The invention relates to a mixed source code feature extraction and matching method comprising the following steps. Step 1), constructing a knowledge base: open source projects are crawled by a web crawler, and a knowledge base is constructed and maintained from the file-level features and function-level features of each source code file of each open source project. Step 2), two-level feature extraction from source code files: for the current mixed source code project, each source code file under the project directory is identified, and two-level feature extraction is performed on each source code file to obtain its file-level features and its function-level features. Step 3), feature matching and judgment: determining whether each source code file is open source code. On the basis of mixed source code feature extraction and matching, the invention identifies whether each piece of code in a code project is open source, and at the same time provides a mechanism for accurately calculating the open source rate of a file.

Description

Mixed source code feature extraction and matching method

Technical Field

The invention relates to the field of open source identification, in particular to a mixed source code feature extraction and matching method.

Background

Open source software is characterized by guaranteeing that a user who obtains the source code may redistribute it, together with modified versions based on it. An open source software license is needed to ensure that anyone can obtain or share the source program when needed, that anyone can modify and upgrade part of the open source software or use it in new open source software, and that everyone knows they have these rights when developing source code. To that end, an open source license stipulates that no one may deny these rights to others or require others to give them up. These provisions become definite obligations whenever the open source software is modified or a copy of it is released; this is the most typical role of an open source software license and the reason such a license is essential.

Open source licenses vary widely across countries and regions. At present, the only body in the world that certifies open source software licenses is the Open Source Initiative (OSI). The license agreements certified by OSI fall into 5 categories (strict open source rules; non-open-source parts may exist; compatible with non-open-source software; software patents allowed; completely open), 63 licenses in total. Commonly used licenses include GPL, LGPL, MPL, BSD, NOKIA, MIT, and Apache, of which GPL and LGPL are the most widely applied by current open source projects. The GPL is strongly contagious: anyone who modifies code under the GPL and redistributes a binary version must also release the source code. The BSD license is comparatively loose: modified code may be redistributed with only the license included, without releasing the source code, and the modified version may be put to commercial use (for example, Microsoft's products incorporate source code for the BSD network portion, and the modified version is sold as proprietary software).

Software license compliance is analyzed quantitatively by comparing license analysis items such as the number of purchased commercial code licenses contained in the software against the number of licenses actually installed, the types of open source license agreements contained in the software, how the source code has been modified, and whether the use of the software conforms to the corresponding open source license agreements. This analysis provides data support for avoiding intellectual property disputes, pricing software correctly, audit management, and similar work.

Abroad, mainstream software makes heavy use of open source code and third-party plug-ins in mixed source software, and the resulting intellectual property and security risks have attracted attention. The main existing products are the mature American tools Blackduck and Protecode, which are widely used by US law firms, intellectual property offices, enterprise audit departments, software contractors, and similar organizations, as well as by large software companies and enterprise audit units in other countries.

(1)Blackduck

Blackduck currently has the largest market share among code analysis software. It mainly implements scanning, auditing, and code management of source code, and includes the standalone product Protex and the online detection product Hub. The Blackduck KnowledgeBase (KB) is currently the largest and most comprehensive open source knowledge base in the world.

As the foundation of Blackduck's overall solution, the KB has the following main advantages:

1. 5300 billion lines of open source code;

2. 2,000,000 open source software projects;

3. 2,500 unique licenses;

4. 79,000 security vulnerabilities;

5. data from 6,500+ sites;

6. maintenance and continuous updating by a professional team.

Blackduck supports more than 70 programming languages, can scan and detect more than 100 file types, and supports line-by-line code comparison: it shows the matching user code and open source code in side-by-side windows, helping the user confirm code matches more accurately.

Blackduck currently has more than 700 customers in more than 20 countries, including Intel, Cisco, Alcatel-Lucent, Motorola, Qualcomm, and Yahoo. Blackduck products and services are also used for code auditing during enterprise mergers.

(2)Protecode

Protecode is an open source code quality inspection and management tool from Synopsys. It manages the open source content of third-party code, discovers security vulnerabilities in that content, and ensures license and intellectual property compliance. Protecode Enterprise Server (ES) is software that performs scanning, composition analysis, license compliance analysis, and security vulnerability analysis on source code.

A professor at a university in Shandong is researching binary code matching and analysis based on function-level features. The method must disassemble malicious software and analyze the assembly code to obtain function features, so those features are susceptible to interference from obfuscation, and static and dynamic analysis need to be combined. The method is mainly intended for malware detection, and the existing results remain at the laboratory demonstration stage.

Professor Liudongtui of Inner Mongolia University has carried out research on code copy detection. The method matches and identifies code based on feature strings, supports only the C programming language, relies on several third-party analysis tools as auxiliary support, and remains at the test simulation stage.

The University of Defense has proposed a high-dimensional feature fusion method for malicious code analysis. It extracts features such as the static binary file and the disassembly features of malicious code, fuses and processes the multi-dimensional features by borrowing the idea of locality sensitivity, and trains on the fused feature vectors using typical machine learning methods.

Professor Zhang Yi of Chongqing University has carried out research on a method based on code similarity. At present it supports only the C programming language; the parameters governing code recognition still have room for optimization, recognition accuracy needs further improvement, and extensibility and portability require further study.

Researchers at a university in Harbin and Liqing of the China Mobile game product base have each carried out multi-dimensional code analysis research for the Android operating system. In malicious code detection, besides the textual features of permissions and system APIs, they analyze the structure of the function call graph, construct one kernel function that encodes nodes based on sensitive APIs and another that encodes nodes based on instruction opcodes, and describe the similarity of function call graphs with a combined kernel function. At present the multi-feature malicious code detection model is suitable only for the Android operating system, and whether it applies to the closed-source-kernel Windows operating system remains highly uncertain.

Professor Zhao Rongcai of the PLA Information Engineering University has carried out research on code comparison based on feature extraction. Building on a graph-based description of binary code, the approach compares similar binary code at two levels, functions and basic blocks, and analyzes the parts they share and the differences between them. An implementation framework for the technique is given, with an example of its use in analyzing malware variants. However, the method still has considerable uncertainty in identifying malicious code variants, and because it does not rely on a code knowledge base, the credibility of its comparison results is difficult to guarantee.

Xi of the University of Science and Technology of China has carried out research on detecting malicious code with multi-dimensional features and proposed a multi-dimensional-feature-based detection algorithm for obfuscated malicious code. The obfuscated code is disassembled and statically analyzed, and malicious code family features are summarized from several feature dimensions: semantic structure, opcode distribution sequence, call flow graph features, and system call sequence graph. However, the method addresses only discrimination among malicious code families, cannot be applied to large-scale samples, and its extensibility needs further discussion.

Methods that belong to dynamic analysis must run, load, and monitor the code, which consumes resources, so methods based on dynamic analysis are severely limited.

In addition, Xuhaiyin of Huazhong University of Science and Technology has studied code obfuscation and its application in software security protection; Professor Homing of Beijing University of Posts and Telecommunications has studied code obfuscation models; Wangxou of Beijing University of Posts and Telecommunications has studied key technologies of binary code obfuscation; Professor Gichunfu of Nankai University has studied binary code path obfuscation; Professor Yang Wu of the University of Electronic Science and Technology has studied software protection based on binary code obfuscation; Guo Jun of Northwest University has studied a semantics-based binary code deobfuscation method; Associate Professor Wang Wei of Tong University has studied fingerprint-based trusted software watermarking; and Professor Jiaguanjie of Suzhou University has studied plagiarism detection based on a text corpus.

At present, domestic software for analyzing software code has been developed by the China Open Source Security Alliance (a limited company in the industry) and provides users with a free binary code security scanning and analysis service. Meanwhile, Blackduck and Protecode are both mature products that account for most source code analysis users worldwide, but because of US trade restrictions, the open source code libraries of these two products are not sold to China. Their software composition scanning and analysis can be used only as an online detection service, by uploading source code files or binary files, which raises confidentiality and security problems. In addition, most existing domestic technologies remain at the theoretical analysis and technical simulation stages, many key technologies still need research and breakthroughs in practical application, systematization, and engineering, and the range of languages supported by source code analysis is too narrow.

Disclosure of Invention

In order to solve the above problems, the present invention provides a mixed source code feature extraction and matching method, which extracts features from source code and then matches them against open source code. Based on the matching result, analyses such as license agreement compliance can be carried out, thereby supporting the intellectual property protection system and promoting the localization and independent controllability of code analysis.

According to an aspect of the present invention, there is provided a mixed source code feature extraction and matching method, the method including:

step 1), constructing a knowledge base: crawling open source projects with a web crawler, and constructing and maintaining a knowledge base from the file-level features and the function-level features of each source code file of each open source project;

wherein the file-level features of a source code file comprise the file size, the file hash value, and the number of effective code lines, and the function-level features of a source code file comprise the function size, the function hash value, and the number of function code lines;

step 2), two-level feature extraction from source code files: for the current mixed source code project, identifying each source code file under the project directory, and performing two-level feature extraction on each source code file to obtain the file-level features and the function-level features of that source code file;

wherein the file-level features of a source code file comprise the file size, the file hash value, and the number of effective code lines, the function-level features of a source code file comprise the function size, the function hash value, and the number of function code lines, and the source code file is composed of mixed source code;

step 3), feature matching and judgment, performing the following for each source code file: performing a file-level matching query in the knowledge base using the file-level features of the source code file; when a source code file of an open source project is matched, determining that the source code file is open source code; when no source code file of an open source project is matched, performing a function-level matching query in the knowledge base using the function-level features of the source code file; when the source code file contains a function that matches an open source function, determining that the function is an open source function; and when the source code file contains no function that matches an open source function, determining that the source code file is closed source code.
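The decision logic of step 3) can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `Features`, `KnowledgeBase`, and `classify`, the in-memory sets standing in for the knowledge base, the exact-equality matching rule, and the verdict labels "open", "partly open", and "closed" are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Features:
    """One feature triple; used for both file level and function level."""
    size: int          # file size or function size
    hash_value: str    # file hash value or function hash value
    code_lines: int    # effective code lines or function code lines


@dataclass
class KnowledgeBase:
    """Hypothetical in-memory stand-in for the knowledge base of step 1)."""
    file_index: set = field(default_factory=set)
    function_index: set = field(default_factory=set)


def classify(file_features, function_features, kb):
    """Return (verdict, matched_functions) for one source code file.

    A file-level query is tried first; the function-level query runs
    only when the file-level query finds no match.
    """
    if file_features in kb.file_index:
        # Whole file matched: no function-level query is needed.
        return "open", []
    matched = [f for f in function_features if f in kb.function_index]
    if matched:
        # Only some functions are known open source functions.
        return "partly open", matched
    return "closed", []
```

A frozen dataclass is hashable, so set membership gives a constant-time exact match on the whole feature triple; a real knowledge base would more likely be a database indexed on the hash value.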

More specifically, in the mixed source code feature extraction and matching method: in step 3), when the source code file contains functions that match open source functions, the open source rate of the source code file is obtained as (the sum of the code line counts of the functions in the source code file that match open source functions / the number of effective code lines of the source code file) × 100%.
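As a small worked sketch of this calculation (the function name `open_source_rate` and its argument names are assumptions for illustration, not part of the source):

```python
def open_source_rate(matched_function_lines, effective_code_lines):
    """Open source rate of one file, in percent.

    matched_function_lines: code line counts of the functions in the file
        that matched open source functions.
    effective_code_lines: number of effective code lines of the whole file.
    """
    if effective_code_lines <= 0:
        raise ValueError("effective_code_lines must be positive")
    return sum(matched_function_lines) / effective_code_lines * 100.0


# Two matched functions of 30 and 20 lines in a file of 200 effective lines.
rate = open_source_rate([30, 20], 200)  # 25.0
```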

More specifically, in the mixed source code feature extraction and matching method: in step 1), project information of each open source project is read into the knowledge base, the project information comprising the project name, the open source license, the project source, and the project version.
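One possible way to hold this project information is a small relational table. The sketch below uses Python's standard `sqlite3` module; the table name, column names, and helper functions are hypothetical, and the source does not say how the knowledge base is actually stored.

```python
import sqlite3

# Hypothetical schema: one row of project information per open source project.
SCHEMA = """
CREATE TABLE project (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,   -- project name
    license TEXT,            -- open source license
    source  TEXT,            -- where the project was crawled from
    version TEXT             -- project version
);
"""


def create_knowledge_base(path=":memory:"):
    """Create an empty knowledge base and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_project(conn, name, license_name, source, version):
    """Insert one project's information and return its row id."""
    cur = conn.execute(
        "INSERT INTO project (name, license, source, version) VALUES (?, ?, ?, ?)",
        (name, license_name, source, version),
    )
    conn.commit()
    return cur.lastrowid
```

File-level and function-level feature tables would reference `project.id`, so that a match can be traced back to the project and license it came from.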

More specifically, in the mixed source code feature extraction and matching method: in step 2), identifying each source code file under the current mixed source code project directory comprises identifying each source code file according to its file type.
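A minimal sketch of identifying source code files by file type, assuming the file type is taken from the file extension. The extension set `SOURCE_EXTENSIONS` and the function name `find_source_files` are illustrative choices, not values given in the source.

```python
from pathlib import Path

# Assumed set of extensions treated as source code; adjust as needed.
SOURCE_EXTENSIONS = {".c", ".h", ".cpp", ".java", ".py", ".go"}


def find_source_files(project_dir):
    """Recursively list files under project_dir whose extension marks them as source code."""
    root = Path(project_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SOURCE_EXTENSIONS
    )
```

Extension-based detection is simple but can misclassify files with unusual names; content-based detection would be a natural refinement.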

The invention thus provides a technical scheme for mixed source code feature extraction and matching that enables analysis of mixed source code projects. The intelligent detection and analysis engine for mixed source code is limited by the size of the knowledge base: the more code the knowledge base collects, the more code can be matched and identified, the more security vulnerabilities can be found, and the more accurate the resulting code composition analysis.

Drawings

Embodiments of the invention will now be described with reference to the accompanying drawings, in which:

Fig. 1 is a flowchart illustrating the steps of a mixed source code feature extraction and matching method according to an embodiment of the present invention.

Fig. 2 is a block diagram illustrating a knowledge base used in a mixed source code feature extraction and matching method according to an embodiment of the present invention.

Fig. 3 is a detailed flowchart illustrating the steps of a mixed source code feature extraction and matching method according to an embodiment of the present invention.

Detailed Description

Embodiments of the mixed-source code feature extraction and matching method of the present invention will be described in detail below with reference to the accompanying drawings.

Code obfuscation is a program transformation technique for protecting software intellectual property, and can prevent software piracy, tampering, and reverse engineering to a certain extent. It is, however, a double-edged sword: once self-developed code is mixed with open source code and closed source (private) code to form mixed source code, code composition analysis becomes harder, and the software intellectual property of the open source code and the closed source code is difficult to identify and protect. Composition analysis of binary code is especially difficult, mainly because binary code consists only of 0s and 1s, offers very few usable feature dimensions, and allows only very limited recognition and matching; when the binary code is also obfuscated, composition analysis becomes harder still.

Therefore, in order to identify and protect software intellectual property and reduce software security risks, the invention establishes a mixed source code feature extraction and matching method that can effectively solve the above technical problems.

Fig. 1 is a flowchart illustrating the steps of a mixed source code feature extraction and matching method according to an embodiment of the present invention; the method comprises steps 1) to 3) as set out above.

Next, the detailed flow of the mixed-source code feature extraction and matching method of the present invention will be further described.

In the mixed source code feature extraction and matching method:

in step 3), when the source code file contains functions that match open source functions, the open source rate of the source code file is obtained as (the sum of the code line counts of the functions in the source code file that match open source functions / the number of effective code lines of the source code file) × 100%.

In the mixed source code feature extraction and matching method:

in step 1), project information of each open source project is read into the knowledge base, the project information comprising the project name, the open source license, the project source, and the project version.

In the mixed source code feature extraction and matching method:

in step 2), identifying each source code file under the current mixed source code project directory comprises identifying each source code file according to its file type.

As shown in Fig. 3, for a mixed source code project, all files under the project directory are traversed recursively, source code files are identified by file type, and multi-level feature extraction is performed on each source code file to extract its file-level and function-level features. The file-level features include the file size, the file hash value, and the number of effective code lines; the function-level features include the function size, the function hash value, and the number of function code lines.
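The two feature levels can be sketched as below. Several details are assumptions, since the source does not specify them: SHA-256 as the hash, "effective code lines" counted as lines that are neither blank nor whole-line comments, function size measured in bytes of the function's source text, and Python's `ast` module used to locate functions, which restricts this sketch to Python source files.

```python
import ast
import hashlib


def effective_lines(text, comment_prefix="#"):
    """Count lines that are neither blank nor whole-line comments (assumed definition)."""
    return sum(
        1 for line in text.splitlines()
        if line.strip() and not line.strip().startswith(comment_prefix)
    )


def file_features(data):
    """File-level features from raw file bytes: size, hash value, effective code lines."""
    text = data.decode("utf-8", errors="replace")
    return {
        "size": len(data),
        "hash": hashlib.sha256(data).hexdigest(),
        "code_lines": effective_lines(text),
    }


def function_features(source):
    """Function-level features for every function defined in a Python source string."""
    features = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Source text of the function, from its `def` line to its last statement.
            segment = ast.get_source_segment(source, node)
            body = segment.encode("utf-8")
            features.append({
                "name": node.name,
                "size": len(body),
                "hash": hashlib.sha256(body).hexdigest(),
                "code_lines": effective_lines(segment),
            })
    return features
```

Hashing raw text means that any reformatting or renaming changes the hash; a practical extractor would probably normalize whitespace and comments before hashing, and would need a parser for each supported language.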

First, a matching query is performed in the knowledge base using the file-level features to determine whether an open source file is matched. If so, the source code file is open source. If not, function-level matching is performed in the knowledge base using the function-level features. If none of the functions contained in the source code file matches an open source function, the source code file is determined to be closed source code. If functions contained in the source code file do match open source functions, those functions are open source, and the open source rate of the source code file is obtained as (the sum of the code line counts of the open source functions / the number of effective code lines of the source code file) × 100%.

In summary, the invention extracts file-level and function-level features both from the source code files of a mixed source code project and from the source code files of open source projects, thereby identifying whether each piece of code in a code project is open source and providing a mechanism for accurately calculating the open source rate of a file.

It is to be understood that although the present invention has been described in conjunction with its preferred embodiments, it is not intended to be limited to those embodiments. It will be apparent to those skilled in the art from this disclosure that many changes and modifications can be made to the embodiments, or equivalents substituted, without departing from the scope of the invention. Therefore, any simple modification, equivalent change, or adaptation made to the above embodiments according to the technical essence of the present invention still falls within the scope of protection of the technical solution of the present invention, provided it does not depart from the content of that technical solution.

Claims

1. A method for extracting and matching mixed-source code features, the method comprising:

2. The mixed-source code feature extraction and matching method of claim 1, wherein:

3. The mixed-source code feature extraction and matching method of claim 2, wherein:

4. The mixed-source code feature extraction and matching method of claim 3, wherein: