CN111241497A

CN111241497A - Open source code tracing detection method based on software multiplexing feature learning

Info

Publication number: CN111241497A
Application number: CN202010091777.9A
Authority: CN
Inventors: 严亮
Original assignee: Beijing High Quality System Technology Co Ltd
Current assignee: Beijing High Quality System Technology Co Ltd
Priority date: 2020-02-13
Filing date: 2020-02-13
Publication date: 2020-06-05

Abstract

The invention discloses an open source code traceability detection method based on software multiplexing feature learning. According to the features of the software under traceability detection, the method queries the association rule table of a pre-built multiplexing feature association model for the corresponding feature sets of multiplexed open source software and their multiplexing probabilities, searches the comparison open source software library for software code having those features in descending order of multiplexing probability, and compares that code with the detected code one by one.

Description

Open source code tracing detection method based on software multiplexing feature learning

Technical Field

The invention relates to the technical field of software code traceability detection, in particular to an open source code traceability detection method based on software multiplexing feature learning.

Background

With the rapid development of internet technology and open source software, more and more developers take part in open source development, and the open source community has accumulated a large body of excellent open source software and knowledge resources. Open source software is free, open, customizable and supervised by the whole community, so it offers software development favorable conditions such as rapid adoption of technology, more innovation, lower cost and shorter development time. Now that the efficiency and quality of software development urgently need to improve, open source has become a powerful force in the IT industry. The benefits of open source are also a source of risk. According to the Black Duck "2018 Open Source Security and Risk Analysis" report issued by Synopsys, Inc., the number of open source vulnerabilities in each codebase increased by 134% over the previous year. In addition, users of open source software are themselves responsible for tracking the vulnerabilities, fixes and updates of the open source software they use. If an organization does not know all of the open source components it uses, it cannot defend against common attacks on known vulnerabilities in those components, and it exposes itself to legal risks such as license compliance and intellectual property risks, leading to security threats and economic or reputational losses of varying degrees.

However, software projects use open source code in many different ways, the range of open source software used by a given code base is difficult to predict in advance, and code from several different open source projects may be introduced into the same software at once, so the number of open source projects against which code must be compared in traceability detection is very large. Comparing the detected software with all open source software one by one is extremely time-consuming and labor-intensive, and is unrealistic in practice. Therefore, accurately determining the scope of open source software that takes part in the traceability comparison, that is, selecting from a massive open source software library the software whose code is most likely to have been reused, has a great influence on the precision and efficiency of code traceability detection.

Therefore, the invention aims to establish a multiplexing characteristic association model and analyze the multiplexing relation of the software by machine learning of the open source software characteristics with the multiplexing relation, so that the open source software range participating in tracing comparison can be quickly determined by the characteristics of the detected software, and the efficiency and the precision of tracing detection can be obviously improved.

Disclosure of Invention

The invention aims to provide an open source code tracing detection method based on software multiplexing feature learning.

Therefore, the technical scheme of the invention is as follows:

an open source code tracing detection method based on software multiplexing feature learning comprises the following steps:

1) building a multiplexing characteristic correlation model; the model building method comprises the following steps:

1-1) respectively selecting and collecting the feature data of the multiplexing software and the multiplexed software in software projects whose multiplexing relationship is clearly stated in an open source software community;

1-2) mining the multiplexing feature association rules by using the Apriori algorithm, thereby obtaining an association rule table, namely the multiplexing feature association model;

2) according to the software features for tracing detection, a corresponding multiplexed open source software feature set and multiplexing probability thereof are inquired from an association rule table, software codes with the features are searched from a comparison open source software library according to the sequence from high to low of the multiplexing probability, and the tracing comparison is carried out on the software codes one by one with the detected codes.

Further, the software characteristic data selected and collected in the step 1-1) comprises programming language, open source license, software type, software label and collection amount.

Further, the software characteristic data selecting and collecting method in the step 1-1) comprises the following steps:

a) acquiring the names and owner information of the software projects whose features are to be collected, so as to obtain the project git addresses;

b) accessing an interface of the open source community by using a data frame and an API (application program interface) provided by the open source community;

c) operating a remote library of an open source project hosting platform of an open source community website by using a Java web crawler technology and utilizing an open source Java dependent library so as to capture required software characteristic data;

d) storing the acquired target data in a database, completing the collection of the software feature data.

Further, the mining of the multiplexing feature association rule by using the Apriori algorithm includes the following steps:

1-21) mining a frequent item set in the feature data;

1-22) mining association rules based on frequent item sets;

1-23) forming an association rule table by the obtained association rules and storing the association rules in a database.

Further, the mining method of the frequent item sets comprises the following steps: setting a minimum support, calculating the support of every item set, and deleting from the database the item sets whose support is smaller than the minimum together with all of their supersets, wherein the remaining item sets are the frequent item sets.

Further, the method for mining the association rule of the frequent item set comprises the following steps:

setting a minimum confidence, screening the association rules of each frequent item set, and excluding the association rules of a frequent item set whose confidence is less than the minimum, together with the association rules corresponding to subsets of that frequent item set;

and calculating the support of each frequent item set consisting of multiplexing software features and non-multiplexing software features and the support of the corresponding frequent item set consisting of multiplexing software features only, wherein the ratio of the former support to the latter is the confidence of the association rule.

Compared with the prior art, the open source code traceability detection method based on software reuse characteristic learning quickly finds the open source software set with code reuse possibility according to some characteristics of the software to be subjected to traceability detection, thereby reducing the scope of open source software subjected to traceability detection comparison and greatly improving the efficiency and precision of open source code traceability detection.

Drawings

FIG. 1 is a block diagram of a method provided by the present invention.

FIG. 2 is a flow chart of software feature data collection.

FIG. 3 is a flow chart of the establishment of a multiplexing software feature association model.

FIG. 4 is a schematic diagram of the support pruning principle based on the Apriori algorithm.

FIG. 5 is a schematic diagram of the confidence pruning principle based on the Apriori algorithm.

Detailed Description

The invention will be further described with reference to the following figures and specific examples, which are not intended to limit the invention in any way.

A tracing detection method for open source codes based on software multiplexing feature learning mainly comprises the steps of inquiring a corresponding multiplexed open source software feature set and multiplexing probability thereof from an association rule table of a built multiplexing feature association model according to software features for tracing detection, searching software codes with the features from a comparison open source software library according to the sequence from high to low of the multiplexing probability, and performing tracing comparison with detected codes one by one.
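The lookup-and-rank step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the layout of the rule table, the function name `rank_candidates`, and the sample projects `proj_a`, `proj_b` and `proj_c` are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical rule table: features of the detected (multiplexing) software ->
# list of (features of the likely multiplexed software, multiplexing probability).
RuleTable = Dict[FrozenSet[str], List[Tuple[FrozenSet[str], float]]]


def rank_candidates(
    detected_features: Set[str],
    rule_table: RuleTable,
    library: Dict[str, Set[str]],
) -> List[Tuple[str, float]]:
    """Order library projects by descending multiplexing probability."""
    matches: List[Tuple[FrozenSet[str], float]] = []
    for antecedent, consequents in rule_table.items():
        if antecedent <= detected_features:  # the rule applies to the detected software
            matches.extend(consequents)
    matches.sort(key=lambda pair: pair[1], reverse=True)

    ordered: List[Tuple[str, float]] = []
    seen: Set[str] = set()
    for features, probability in matches:
        for name, project_features in library.items():
            # Keep each project once, at its highest-probability position.
            if features <= project_features and name not in seen:
                seen.add(name)
                ordered.append((name, probability))
    return ordered


# Illustrative data only; the probabilities reuse two values from Table 2.
rule_table: RuleTable = {
    frozenset({"mobile development"}): [
        (frozenset({"Android"}), 0.71),
        (frozenset({"Apache 2.0"}), 0.73),
    ],
}
library = {
    "proj_a": {"Android", "java"},
    "proj_b": {"Apache 2.0", "C"},
    "proj_c": {"MIT"},
}
ranking = rank_candidates({"mobile development", "java"}, rule_table, library)
```

The code of each project in `ranking` would then be compared with the detected code in that order, so the most probable sources are checked first and projects matching no rule are left out of the comparison.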

The framework of the multiplexing feature association model building method is shown in FIG. 1.

First, the feature data of the multiplexing software and the multiplexed software are selected and collected from software projects whose multiplexing relationship is clearly stated in an open source software community. The selected and collected software feature data include programming language, open source license, software type, software label and collection amount.

The programming language is the language used for software development. An open source software project may involve several languages, but it has one main programming language. This patent deals with code-level multiplexing and therefore lists the programming language as a software multiplexing feature.

The open source license regulates the use of open source software by some terms, and thus is one of the software reuse features required by this patent.

Most open source projects have tags, which mainly describe the type, function, language and other related information of the project. Information such as the type, function and language of an open source project is an important factor when selecting software for code multiplexing, so the software label is listed as a software multiplexing feature.

Open source community websites such as GitHub and Gitee provide a collection function for open source projects and record the collection amount. The collection amount of an open source project reflects its popularity and, to some extent, the quality and usability of the software, so the collection amount is listed as a software multiplexing feature.

The software characteristic data selecting and collecting method comprises the following steps:

d) the collected target data are stored in a MySQL database, which the Java program accesses through the Hibernate framework, completing the collection of the software feature data.

A flow chart for software feature collection in the GitHub open source community is shown in fig. 2.
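As a rough illustration of what one collected feature record might look like, the sketch below maps a repository description, already fetched from a community API, to the five features named above. It is written in Python rather than the Java crawler the patent describes, and it makes several assumptions: the field names (`full_name`, `language`, `license.spdx_id`, `topics`, `stargazers_count`) follow the GitHub REST API repository response, Gitee may use different names, treating the star count as the "collection amount" is an interpretation, and the software type is passed in separately because it is not part of that response. The function name `extract_features` and the sample payload are hypothetical.

```python
from typing import Any, Dict


def extract_features(repo: Dict[str, Any], software_type: str) -> Dict[str, Any]:
    """Map a repository payload to the feature record used for rule mining."""
    license_info = repo.get("license") or {}  # the license field may be null
    return {
        "name": repo.get("full_name"),
        "language": repo.get("language"),             # main programming language
        "license": license_info.get("spdx_id"),       # open source license
        "type": software_type,                        # software type (supplied by caller)
        "labels": list(repo.get("topics") or []),     # software labels
        "collections": repo.get("stargazers_count", 0),  # assumed collection amount
    }


# Hypothetical payload for illustration; no network request is made here.
sample_payload = {
    "full_name": "example-owner/example-repo",
    "language": "Java",
    "license": {"spdx_id": "Apache-2.0"},
    "topics": ["android", "mobile"],
    "stargazers_count": 1200,
}
record = extract_features(sample_payload, software_type="mobile development")
```

A record of this shape could then be persisted to the database in step d).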

After the software feature collection is completed, the multiplexing feature association rules are mined with the Apriori algorithm to obtain the association rule table, i.e. the multiplexing feature association model; the flow chart is shown in FIG. 3.

When mining the multiplexing feature association rules, the frequent item sets in the feature data must be mined first. A frequent item set is a collection of items that frequently appear together. Here "frequently" is measured by support: the support of an item set is defined as the proportion of records in the data set that contain the item set. Support is a property of an item set; a minimum support is usually set, and an item set whose support is not less than this minimum is considered frequent. In software multiplexing feature association rule mining, a frequent item set is a set of software features such as the programming language, open source license, software type or software label. For example, the frequent item set {Android, java} indicates that "Android" and "java" often appear together.

The Apriori algorithm provides a principle for pruning item sets: if an item set is infrequent, all of its supersets must also be infrequent. The pruning method is shown in FIG. 4: if the item set {a, b} is infrequent, then all of its supersets, which are circled with dashed lines and crossed out, are also infrequent. Therefore, when mining frequent item sets, a minimum support is set, the support of every item set is calculated, and the item sets whose support is smaller than the minimum, together with all of their supersets, are deleted from the database; the remaining item sets are the frequent item sets.
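The support-based pruning just described can be sketched as a level-wise search. This is a minimal illustration of the general Apriori technique, not the patent's own code; the function name `frequent_itemsets` and the four sample records are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def frequent_itemsets(records: List[Set[str]], min_support: float) -> Dict[FrozenSet[str], float]:
    """Return every item set whose support is at least min_support."""
    total = len(records)

    def support(itemset: FrozenSet[str]) -> float:
        # Proportion of records that contain the whole item set.
        return sum(1 for record in records if itemset <= record) / total

    items = sorted({item for record in records for item in record})
    candidates = [frozenset([item]) for item in items]
    result: Dict[FrozenSet[str], float] = {}
    size = 1
    while candidates:
        # Keep only the candidates that meet the minimum support.
        level = {}
        for candidate in candidates:
            value = support(candidate)
            if value >= min_support:
                level[candidate] = value
        result.update(level)

        # Build candidates one item larger from pairs of frequent item sets.
        size += 1
        merged = {a | b for a in level for b in level if len(a | b) == size}
        # Apriori pruning: drop a candidate if any (size - 1)-subset is infrequent.
        candidates = [
            candidate
            for candidate in merged
            if all(frozenset(subset) in level for subset in combinations(candidate, size - 1))
        ]
    return result


# Illustrative records, each the feature set of one multiplexing relationship.
records = [
    {"mobile development", "Android", "java"},
    {"mobile development", "Android"},
    {"mobile development", "java"},
    {"management monitoring", "MIT"},
]
frequent = frequent_itemsets(records, min_support=0.5)
```

With these records, {Android, java} appears in only one of four records, so it is infrequent and the three-item set containing it is pruned without its support ever being counted.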

Secondly, association rules are mined from the frequent item sets. An association rule indicates that there may be a strong relationship between two different items. Here "strong" is measured by confidence. Confidence is a property of an association rule and is a conditional probability, namely the probability that event B occurs given that event A has occurred. Several association rules can be derived from one frequent item set, and their confidence can be calculated from the supports of the item sets; the higher the confidence of an association rule, the more likely the rule is to hold. A minimum confidence is set and the association rules below it are excluded; the remaining rules, which meet the minimum, describe the relationships among features that are of interest to the user.
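As a small worked example, the confidence of a rule A → B can be computed as support(A ∪ B) / support(A), which is the standard definition and matches the conditional probability described above. The function names and the sample records below are hypothetical.

```python
from typing import List, Set


def support(records: List[Set[str]], itemset: Set[str]) -> float:
    """Proportion of records that contain every item in itemset."""
    return sum(1 for record in records if itemset <= record) / len(records)


def confidence(records: List[Set[str]], antecedent: Set[str], consequent: Set[str]) -> float:
    """Confidence of the rule antecedent -> consequent."""
    base = support(records, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    return support(records, antecedent | consequent) / base


# Illustrative records only.
sample_records = [
    {"mobile development", "Android", "java"},
    {"mobile development", "Android"},
    {"mobile development", "java"},
    {"management monitoring", "MIT"},
]
rule_confidence = confidence(sample_records, {"mobile development"}, {"Android"})
```

Here "mobile development" occurs in three of the four records and occurs together with "Android" in two, so the confidence is two thirds, which would fall below a minimum confidence of 0.7.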

When the data volume is large, a frequent item set also generates a large number of association rules; the Apriori principle can be used to reduce the number of rules that must be screened. The following principle applies:

If the rule X → Y - X does not satisfy the minimum confidence, then for any subset X′ of X, the rule X′ → Y - X′ does not satisfy the minimum confidence either.

Take the frequent item set {a, b, c, d} as an example of association rule mining. First, all possible association rules for the item set are generated. As shown in FIG. 5, if the confidence of the rule {b, c, d} → {a} is below the set minimum confidence, then by the above principle every rule whose left-hand side is a subset of {b, c, d} is also below the minimum confidence.
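One way to apply this confidence-based pruning is to start from rules with a single item on the right-hand side and enlarge the right-hand side only from rules that passed. The sketch below follows that common approach. It is an assumption about how the pruning might be implemented, not the patent's stated procedure, and the function name `rules_from_itemset` and the support values are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str], float]  # (left side, right side, confidence)


def rules_from_itemset(
    itemset: FrozenSet[str],
    support_of: Dict[FrozenSet[str], float],
    min_confidence: float,
) -> List[Rule]:
    """Generate the rules of one frequent item set, pruning by confidence."""
    rules: List[Rule] = []
    # Start with single-item right sides, i.e. the largest left sides.
    right_sides = [frozenset([item]) for item in itemset]
    while right_sides:
        surviving = set()
        for right in right_sides:
            left = itemset - right
            if not left:
                continue  # a rule needs a non-empty left side
            conf = support_of[itemset] / support_of[left]
            if conf >= min_confidence:
                rules.append((left, right, conf))
                surviving.add(right)
        # A larger right side means a smaller left side, so it is only worth
        # testing if every smaller right side it contains produced a passing rule.
        size = len(right_sides[0]) + 1
        merged = {a | b for a, b in combinations(surviving, 2) if len(a | b) == size}
        right_sides = [
            candidate
            for candidate in merged
            if all(frozenset(sub) in surviving for sub in combinations(candidate, size - 1))
        ]
    return rules


# Hypothetical supports for the item set {a, b, c} and all of its subsets.
supports = {
    frozenset("abc"): 0.4,
    frozenset("ab"): 0.5, frozenset("ac"): 0.4, frozenset("bc"): 0.8,
    frozenset("a"): 0.5, frozenset("b"): 0.8, frozenset("c"): 0.8,
}
mined_rules = rules_from_itemset(frozenset("abc"), supports, min_confidence=0.7)
```

In this example the rule {b, c} → {a} fails with confidence 0.5, so the rules {b} → {a, c} and {c} → {a, b}, whose left sides are subsets of {b, c}, are never evaluated.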

The method for mining the association rule of the frequent item set comprises the following steps:

103 groups of open source software with a multiplexing relationship were selected from the Gitee open source community as the experimental objects of the software multiplexing relationship analysis experiment; 20 of these groups are shown in Table 1:

Table 1 Open source software reuse cases

The collected and processed data were input into the Apriori algorithm to learn the association rules among the multiplexing features of open source software. Analysis of the output experimental results gives the association rules shown in Table 2:

Table 2 Association rule analysis results

Serial number	If the multiplexing software has the features	Then the multiplexed software may have the features	Confidence
1	mobile development	Android	0.71
2	mobile development	Android, java	0.86
3	management monitoring	MIT	0.85
4	mobile development	Apache 2.0	0.73
5	Android	Android	0.86
6	mobile development, Android	Apache 2.0, java	0.99
7	mobile development, Android	Android, java	0.75

In Table 2, each row represents one association rule: if the multiplexing open source software project has the features in the second column, the software it multiplexes may have the features in the third column, and the confidence of the rule is the value in the fourth column. The minimum confidence is set to 0.7 here; the higher the confidence, the more likely the rule is to hold and the more valuable it is for reference and use. Take rule 1 in Table 2 as an example: if an open source software project is a mobile phone or mobile development project (mobile level), there is a 71% probability that it multiplexes an open source software project whose tags include "Android". Take rule 6 as another example: if an open source software project is a mobile phone or mobile development project (mobile level) and its main programming language is Android, there is a 99% probability that it multiplexes an open source software project that uses the Apache 2.0 open source license and whose main programming language is java.

Claims

1. An open source code tracing detection method based on software multiplexing feature learning is characterized by comprising the following steps:

2. The open source code tracing detection method based on software reuse feature learning according to claim 1, wherein the software feature data selected and collected in step 1-1) includes programming language, open source license, software type, software label and collection amount.

3. The open source code tracing detection method based on software reuse feature learning according to claim 1, wherein the software feature data selecting and collecting method in step 1-1) is as follows:

4. The method for detecting source tracing of open source code based on software multiplexing feature learning according to claim 1, wherein the mining of the multiplexing feature association rule by using Apriori algorithm comprises the following steps:

1-21) mining a frequent item set in the feature data;

1-22) mining association rules based on frequent item sets;

5. The open source code tracing detection method based on software reuse characteristic learning according to claim 4, characterized in that the mining method of the frequent item sets is: setting a minimum support, calculating the support of every item set, and deleting from the database the item sets whose support is smaller than the minimum together with all of their supersets, wherein the remaining item sets are the frequent item sets.

6. The open source code tracing detection method based on software multiplexing feature learning according to claim 4 or 5, characterized in that the frequent item set association rule mining method is as follows: