CN111241497A - Open source code tracing detection method based on software multiplexing feature learning - Google Patents

Info

Publication number
CN111241497A
Authority
CN
China
Prior art keywords
software
open source
multiplexing
association rule
frequent item
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010091777.9A
Other languages
Chinese (zh)
Inventor
严亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing High Quality System Technology Co Ltd
Original Assignee
Beijing High Quality System Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-02-13
Filing date
2020-02-13
Publication date
2020-06-05
Application filed by Beijing High Quality System Technology Co Ltd
Priority to CN202010091777.9A
Publication of CN111241497A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10 Protecting distributed programs or content, e.g. vending or licensing of copyrighted material; Digital rights management [DRM]
    • G06F21/16 Program or content traceability, e.g. by watermarking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Technology Law (AREA)
  • Multimedia (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses an open source code traceability detection method based on software reuse feature learning. Given the features of the software under traceability detection, the method queries the association rule table of a pre-built reuse feature association model for the corresponding feature sets of reused open source software and their reuse probabilities, then searches a reference open source software library for software code having those features, in descending order of reuse probability, and compares each candidate against the detected code one by one.

Description

Open source code tracing detection method based on software multiplexing feature learning
Technical Field
The invention relates to the technical field of software code traceability detection, in particular to an open source code traceability detection method based on software multiplexing feature learning.
Background
With the rapid development of internet technology and open source software, more and more developers take part in open source development, and open source communities have accumulated a large body of high-quality open source software and knowledge resources. Open source software is free, open, customizable and supervised by the whole community, so it gives software development favorable conditions such as rapid adoption of technology, more innovation, lower cost and shorter development time. Now that the efficiency and quality of software development urgently need to improve, open source has become a powerful force in the IT industry. The benefits of open source, however, are also a source of risk. According to the 2018 Black Duck "Open Source Security and Risk Analysis" report issued by Synopsys, Inc., the number of open source vulnerabilities per codebase increased by 134% over the previous year. In addition, users of open source software are themselves responsible for tracking the vulnerabilities, fixes and updates of the components they use. An organization that does not know all of the open source components it uses cannot defend against common attacks on known vulnerabilities in those components, and it exposes itself to legal risks such as license non-compliance and intellectual property infringement, leading to security threats and economic or reputational losses of varying degrees.
However, software projects use open source code in many different ways, the range of open source software a piece of code draws on is hard to predict in advance, and code from several different open source projects may be introduced into the same software at once, so the number of open source projects against which code must be compared during traceability detection is very large. Comparing the detected software with every open source project one by one is extremely time-consuming and labor-intensive, and unrealistic in practice. Therefore, accurately determining the scope of open source software that takes part in the traceability comparison, that is, selecting from a massive open source software library the projects whose code is most likely to have been reused, has a great influence on the precision and efficiency of code traceability detection.
The invention therefore builds a reuse feature association model by applying machine learning to the features of open source software with known reuse relationships and analyzing those relationships, so that the scope of open source software taking part in the traceability comparison can be determined quickly from the features of the detected software, significantly improving the efficiency and precision of traceability detection.
Disclosure of Invention
The invention aims to provide an open source code traceability detection method based on software reuse feature learning.
To this end, the technical scheme of the invention is as follows:
An open source code traceability detection method based on software reuse feature learning comprises the following steps:
1) building a reuse feature association model; the model is built as follows:
1-1) from software projects whose reuse relationship is clearly documented in an open source software community, selecting and collecting the feature data of the reusing software and of the reused software respectively;
1-2) mining reuse feature association rules with the Apriori algorithm, thereby obtaining an association rule table, namely the reuse feature association model;
2) according to the features of the software under traceability detection, querying the association rule table for the corresponding feature sets of reused open source software and their reuse probabilities, searching a reference open source software library for software code having those features in descending order of reuse probability, and performing traceability comparison against the detected code one by one.
Further, the software feature data selected and collected in step 1-1) comprises the programming language, open source license, software type, software tags and collection count (stars).
Further, the software feature data in step 1-1) is selected and collected as follows:
a) obtaining the names and owner information of the software projects whose features are required, so as to obtain each project's git address;
b) accessing the interfaces of the open source community through the data framework and API (application programming interface) that the community provides;
c) using Java web crawler technology and open source Java dependency libraries to operate on the remote repositories of the community website's open source project hosting platform, so as to capture the required software feature data;
d) storing the collected target data in a database, which completes the collection of software feature data.
Further, mining the reuse feature association rules with the Apriori algorithm comprises the following steps:
1-21) mining the frequent item sets in the feature data;
1-22) mining association rules based on the frequent item sets;
1-23) assembling the resulting association rules into an association rule table and storing it in a database.
Further, the frequent item sets are mined as follows: setting a minimum support, calculating the support of every item set, and deleting from the database each item set whose support is below the minimum together with all of its supersets; the remaining item sets are the frequent item sets.
Further, association rules are mined from the frequent item sets as follows:
setting a minimum confidence and screening the association rules of each frequent item set, excluding every association rule whose confidence is below the minimum together with the association rules whose antecedent is a subset of that rule's antecedent;
calculating the support of each frequent item set composed of reusing-software features and reused-software features, and the support of the corresponding frequent item set composed of the reusing-software features only; the ratio of the former to the latter is the confidence of the association rule.
Compared with the prior art, the open source code traceability detection method based on software reuse feature learning quickly finds, from a few features of the software under traceability detection, the set of open source software whose code may have been reused, thereby narrowing the scope of open source software that takes part in the traceability comparison and greatly improving the efficiency and precision of open source code traceability detection.
Drawings
FIG. 1 is a block diagram of a method provided by the present invention.
FIG. 2 is a flow chart of software feature data collection.
FIG. 3 is a flow chart of the establishment of a multiplexing software feature association model.
FIG. 4 is a schematic diagram of the support pruning principle of the Apriori algorithm.
FIG. 5 is a schematic diagram of the confidence pruning principle of the Apriori algorithm.
Detailed Description
The invention will be further described with reference to the following figures and specific examples, which are not intended to limit the invention in any way.
An open source code traceability detection method based on software reuse feature learning proceeds as follows: given the features of the software under traceability detection, the association rule table of a pre-built reuse feature association model is queried for the corresponding feature sets of reused open source software and their reuse probabilities; a reference open source software library is then searched for software code having those features, in descending order of reuse probability, and each candidate is compared against the detected code one by one.
The framework of the method for building the reuse feature association model is shown in FIG. 1:
First, from software projects whose reuse relationship is clearly documented in an open source software community, the feature data of the reusing software and of the reused software are selected and collected respectively. The selected software feature data comprise the programming language, open source license, software type, software tags and collection count (stars).
The programming language is the language used to develop the software. An open source project may involve several languages, but it has one main programming language. This patent deals with code-level reuse and therefore lists the programming language as a software reuse feature.
The open source license regulates the use of open source software through its terms, and is therefore one of the software reuse features this patent requires.
Most open source projects carry tags, which mainly describe the project's type, function, language and other related information. Because a project's type, function and language are important factors when choosing software for code reuse, software tags are listed among the software reuse features.
Open source community websites such as GitHub and Gitee let users bookmark (star) open source projects and record how many times each project has been bookmarked. This collection count reflects a project's popularity and, to some extent, the quality and usability of the software, so it is also listed among the software reuse features.
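The features above have to be turned into discrete items before an association rule miner can use them. The following is a minimal sketch of one way to do that, not the patent's stated procedure: the role prefixes `reusing:` and `reused:`, the dictionary keys, and the star-count thresholds in `star_bucket` are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet


def star_bucket(stars: int) -> str:
    """Discretize a collection (star) count; the thresholds are arbitrary placeholders."""
    if stars >= 1000:
        return "stars:high"
    if stars >= 100:
        return "stars:medium"
    return "stars:low"


def encode_project(role: str, project: Dict) -> FrozenSet[str]:
    """Turn one project's features into items prefixed with its role in the reuse pair."""
    items = {
        f"{role}:lang={project['language']}",
        f"{role}:license={project['license']}",
        f"{role}:type={project['type']}",
        f"{role}:{star_bucket(project['stars'])}",
    }
    # Each tag becomes its own item so that tags can co-occur independently.
    items.update(f"{role}:tag={tag}" for tag in project["tags"])
    return frozenset(items)


def encode_pair(reusing: Dict, reused: Dict) -> FrozenSet[str]:
    """One reuse relationship becomes one transaction holding both sides' items."""
    return encode_project("reusing", reusing) | encode_project("reused", reused)
```

Keeping the two roles apart with a prefix lets a single transaction carry both projects' features, so a mined rule can read "if the reusing software has X, the reused software may have Y".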
The software feature data are selected and collected as follows:
a) obtaining the names and owner information of the software projects whose features are required, so as to obtain each project's git address;
b) accessing the interfaces of the open source community through the data framework and API (application programming interface) that the community provides;
c) using Java web crawler technology and open source Java dependency libraries to operate on the remote repositories of the community website's open source project hosting platform, so as to capture the required software feature data;
d) storing the collected target data in a database, with the Hibernate framework connecting Java to MySQL, which completes the collection of software feature data.
A flow chart for software feature collection in the GitHub open source community is shown in fig. 2.
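Steps a) to d) can be sketched as follows. This is an illustrative substitute, not the patent's implementation: the patent describes a Java crawler with Hibernate and MySQL, whereas this sketch uses Python's standard library and SQLite so that it stays self-contained. It assumes the GitHub REST API repository endpoint and its `language`, `license.spdx_id`, `topics` and `stargazers_count` fields; the table name `software_feature` is hypothetical, and the "software type" feature is not covered because that endpoint does not return it.

```python
import json
import sqlite3
import urllib.request
from typing import Dict


def fetch_repo_json(owner: str, repo: str) -> Dict:
    """Steps a)-c): fetch repository metadata by owner and name (needs network access)."""
    url = f"https://api.github.com/repos/{owner}/{repo}"
    request = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def extract_features(repo_json: Dict) -> Dict:
    """Pick the reuse features out of the API response."""
    license_info = repo_json.get("license") or {}  # "license" may be null
    return {
        "full_name": repo_json["full_name"],
        "git_url": repo_json["clone_url"],
        "language": repo_json.get("language"),
        "license": license_info.get("spdx_id"),
        "tags": ",".join(repo_json.get("topics", [])),
        "stars": repo_json.get("stargazers_count", 0),
    }


def save_features(conn: sqlite3.Connection, features: Dict) -> None:
    """Step d): persist one project's features (SQLite stands in for MySQL here)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS software_feature ("
        "full_name TEXT PRIMARY KEY, git_url TEXT, language TEXT, "
        "license TEXT, tags TEXT, stars INTEGER)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO software_feature "
        "VALUES (:full_name, :git_url, :language, :license, :tags, :stars)",
        features,
    )
    conn.commit()
```

A caller would chain the three functions, for example `save_features(conn, extract_features(fetch_repo_json(owner, repo)))`, where `owner` and `repo` come from step a).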
After the software features have been collected, the reuse feature association rules are mined with the Apriori algorithm to obtain the association rule table, namely the reuse feature association model; the flow chart is shown in FIG. 3.
When mining reuse feature association rules, the frequent item sets in the feature data must be mined first. A frequent item set is a set of items that often appear together. "Often" is measured by support: the support of an item set is the proportion of records in the data set that contain that item set. Support is a property of an item set; a minimum support is usually set, and an item set whose support is not less than this minimum is considered frequent. In mining software reuse feature association rules, a frequent item set is a set of software features such as programming language, open source license, software type or software tags. For example, the frequent item set {Android, java} indicates that "Android" and "java" often appear together.
The Apriori algorithm provides a principle for pruning the search for frequent item sets: if an item set is infrequent, all of its supersets must also be infrequent. The pruning is illustrated in FIG. 4: if the item set {a, b} is infrequent, all of its supersets (circled with dashed lines and crossed out) are also infrequent. Therefore, when mining frequent item sets, a minimum support is set, the support of every item set is calculated, and each item set whose support is below the minimum is deleted from the database together with all of its supersets; the remaining item sets are the frequent item sets.
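The support pruning described above can be sketched as a level-wise search. This is a minimal textbook-style Apriori written for illustration, not code from the patent; the function names are assumptions, and transactions are assumed to be non-empty sets of feature strings.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def support(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Proportion of transactions that contain every item of the item set."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(
    transactions: List[FrozenSet[str]], min_support: float
) -> Dict[FrozenSet[str], float]:
    """Return every item set whose support is at least min_support."""
    level = {}
    for item in {i for t in transactions for i in t}:
        s = support(frozenset([item]), transactions)
        if s >= min_support:
            level[frozenset([item])] = s
    result = dict(level)
    k = 2
    while level:
        previous = set(level)
        # Join: build size-k candidates from frequent item sets of size k-1.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune: a candidate with any infrequent (k-1)-subset cannot be frequent,
        # so its support is never counted.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in previous for sub in combinations(c, k - 1))
        }
        level = {}
        for candidate in candidates:
            s = support(candidate, transactions)
            if s >= min_support:
                level[candidate] = s
        result.update(level)
        k += 1
    return result
```

Because supersets of an infrequent item set are never generated as candidates, the number of support counts grows far more slowly than the number of possible item sets.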
Second, association rules are mined from the frequent item sets. An association rule states that there may be a strong association between two different sets of items. "Strong" is measured by confidence. Confidence is a property of an association rule and is a conditional probability: the probability that event B occurs given that event A has occurred. Several association rules can be derived from one frequent item set, and their confidence can be calculated from the supports of the item sets involved; the higher the confidence of a rule, the more likely the rule is to hold. A minimum confidence is set and the rules below it are excluded; the remaining rules, which meet the minimum, describe the relations among features that are of interest to the user.
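Written out, the two measures take the following standard form. The symbols D (the set of records), t (one record) and A, B (item sets) are notation introduced here for illustration and do not appear in the original text.

```latex
\[
\mathrm{support}(X) = \frac{\lvert \{\, t \in D : X \subseteq t \,\} \rvert}{\lvert D \rvert},
\qquad
\mathrm{confidence}(A \rightarrow B)
  = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)}
  = P(B \mid A)
\]
```

As a purely hypothetical example, if A appears in 50% of the records and A together with B appears in 40%, the confidence of A → B is 0.4 / 0.5 = 0.8.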
When the data volume is large, a single frequent item set can also generate a great many association rules; the Apriori principle can be used to reduce the number of rules that must be screened. The following principle applies:
If the rule X → Y − X does not satisfy the minimum confidence, then for any subset X′ of X, the rule X′ → Y − X′ does not satisfy the minimum confidence either.
Take the frequent item set {a, b, c, d} as an example of association rule mining. First, all possible association rules for the item set are generated, as shown in FIG. 5. If the confidence of the rule b, c, d → a is below the set minimum confidence, then by the above principle every rule whose antecedent is a subset of {b, c, d}, such as b, c → a, d, is also below the minimum confidence.
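The principle holds because shrinking the antecedent can only raise its support, while the numerator support(Y) stays fixed, so the confidence support(Y) / support(X′) can only fall. The sketch below enumerates the rules that can be skipped once one rule fails; the function name `pruned_rules` is hypothetical and the code illustrates the principle rather than reproducing the patent's procedure.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str]]  # (antecedent, consequent)


def pruned_rules(itemset: Iterable[str], failed_antecedent: Iterable[str]) -> List[Rule]:
    """List the further rules that need no confidence check once
    failed_antecedent -> itemset - failed_antecedent is below the minimum."""
    full = frozenset(itemset)
    failed = frozenset(failed_antecedent)
    rules = []
    # Every non-empty proper subset of the failed antecedent yields a pruned rule.
    for size in range(1, len(failed)):
        for subset in combinations(sorted(failed), size):
            antecedent = frozenset(subset)
            rules.append((antecedent, full - antecedent))
    return rules
```

For the item set {a, b, c, d} and the failed rule b, c, d → a, this yields six further rules (three with a one-item antecedent and three with a two-item antecedent) that can be discarded without computing their confidence.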
Association rules are mined from the frequent item sets as follows:
setting a minimum confidence and screening the association rules of each frequent item set, excluding every association rule whose confidence is below the minimum together with the association rules whose antecedent is a subset of that rule's antecedent;
calculating the support of each frequent item set composed of reusing-software features and reused-software features, and the support of the corresponding frequent item set composed of the reusing-software features only; the ratio of the former to the latter is the confidence of the association rule.
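The confidence calculation in the last step can be sketched as below. The sketch assumes that each item carries a role prefix, `reusing:` or `reused:`, marking which side of the reuse relationship it describes; that convention, the function name `reuse_rules` and the example numbers in the usage note are illustrative assumptions, not part of the original text. It also assumes the support table contains every frequent item set, so the antecedent's support can always be looked up.

```python
from typing import Dict, FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str], float]  # (antecedent, consequent, confidence)


def reuse_rules(supports: Dict[FrozenSet[str], float], min_confidence: float) -> List[Rule]:
    """Build rules 'reusing-software features -> reused-software features'."""
    rules = []
    for itemset, itemset_support in supports.items():
        antecedent = frozenset(i for i in itemset if i.startswith("reusing:"))
        consequent = itemset - antecedent
        if not antecedent or not consequent:
            continue  # a rule needs features from both sides
        # Confidence = support(reusing + reused features) / support(reusing features).
        confidence = itemset_support / supports[antecedent]
        if confidence >= min_confidence:
            rules.append((antecedent, consequent, confidence))
    # Highest confidence first, matching the order in which candidates are searched.
    return sorted(rules, key=lambda rule: rule[2], reverse=True)
```

With hypothetical supports of 0.5 for {reusing:mobile development} and 0.4 for {reusing:mobile development, reused:Android}, the function returns one rule with confidence 0.8.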
103 groups of open source software with reuse relationships were selected from the Gitee open source community as the subjects of the software reuse relationship analysis experiment; 20 of these groups are shown in Table 1:
Table 1. Open source software reuse cases
[Table 1 is reproduced as images in the original publication.]
The collected and processed data are input into the Apriori algorithm, which learns the association rules among the reuse features of open source software and outputs the experimental results; analyzing this output yields the association rule analysis results shown in Table 2:
Table 2. Association rule analysis results
Serial number | If the reusing software has the features | Then the reused software may have the features | Confidence
1 | mobile development | Android | 0.71
2 | mobile development | Android, java | 0.86
3 | management monitoring | MIT | 0.85
4 | mobile development | Apache 2.0 | 0.73
5 | Android | Android | 0.86
6 | mobile development, Android | Apache 2.0, java | 0.99
7 | mobile development, Android | Android, java | 0.75
In Table 2, each row represents one association rule: if the reusing open source project has the features in the second column, the software it reuses may have the features in the third column, and the confidence of the rule is the value in the fourth column. Here the minimum confidence is set to 0.7; the higher the confidence, the more likely the rule is to hold and the more valuable it is for reference and use. Take the rule with serial number 1 as an example: if an open source project is a mobile development project, there is a 71% probability that it reuses an open source project whose tags include "Android". Take the rule with serial number 6 as another example: if an open source project is a mobile development project and also has the Android feature, there is a 99% probability that it reuses an open source project that is under the Apache 2.0 open source license and whose main programming language is java.
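The query step of the method, which looks up the rule table with the features of the detected software and orders the results by confidence, can be sketched with the rows of Table 2. The names `RULE_TABLE` and `ranked_candidates` are hypothetical; the subsequent search of the open source library and the code comparison itself are outside this sketch.

```python
from typing import Iterable, List, Set, Tuple

# (features of the reusing software, features the reused software may have, confidence),
# copied from Table 2.
RULE_TABLE: List[Tuple[Set[str], Set[str], float]] = [
    ({"mobile development"}, {"Android"}, 0.71),
    ({"mobile development"}, {"Android", "java"}, 0.86),
    ({"management monitoring"}, {"MIT"}, 0.85),
    ({"mobile development"}, {"Apache 2.0"}, 0.73),
    ({"Android"}, {"Android"}, 0.86),
    ({"mobile development", "Android"}, {"Apache 2.0", "java"}, 0.99),
    ({"mobile development", "Android"}, {"Android", "java"}, 0.75),
]


def ranked_candidates(
    detected_features: Iterable[str],
    rule_table: List[Tuple[Set[str], Set[str], float]] = RULE_TABLE,
) -> List[Tuple[Set[str], float]]:
    """Return the feature sets to search for, highest reuse probability first."""
    detected = set(detected_features)
    # A rule applies when the detected software has all of its antecedent features.
    matches = [
        (consequent, confidence)
        for antecedent, consequent, confidence in rule_table
        if antecedent <= detected
    ]
    return sorted(matches, key=lambda match: match[1], reverse=True)
```

For a detected project with the features "mobile development" and "Android", the first candidate set is {Apache 2.0, java} at confidence 0.99, so open source projects with those features would be compared first.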

Claims (6)

1. An open source code traceability detection method based on software reuse feature learning, characterized by comprising the following steps:
1) building a reuse feature association model, the model being built as follows:
1-1) from software projects whose reuse relationship is clearly documented in an open source software community, selecting and collecting the feature data of the reusing software and of the reused software respectively;
1-2) mining reuse feature association rules with the Apriori algorithm, thereby obtaining an association rule table, namely the reuse feature association model;
2) according to the features of the software under traceability detection, querying the association rule table for the corresponding feature sets of reused open source software and their reuse probabilities, searching a reference open source software library for software code having those features in descending order of reuse probability, and performing traceability comparison against the detected code one by one.
2. The open source code traceability detection method based on software reuse feature learning according to claim 1, characterized in that the software feature data selected and collected in step 1-1) comprises the programming language, open source license, software type, software tags and collection count (stars).
3. The open source code traceability detection method based on software reuse feature learning according to claim 1, characterized in that the software feature data in step 1-1) is selected and collected as follows:
a) obtaining the names and owner information of the software projects whose features are required, so as to obtain each project's git address;
b) accessing the interfaces of the open source community through the data framework and API (application programming interface) that the community provides;
c) using Java web crawler technology and open source Java dependency libraries to operate on the remote repositories of the community website's open source project hosting platform, so as to capture the required software feature data;
d) storing the collected target data in a database, which completes the collection of software feature data.
4. The open source code traceability detection method based on software reuse feature learning according to claim 1, characterized in that mining the reuse feature association rules with the Apriori algorithm comprises the following steps:
1-21) mining the frequent item sets in the feature data;
1-22) mining association rules based on the frequent item sets;
1-23) assembling the resulting association rules into an association rule table and storing it in a database.
5. The open source code traceability detection method based on software reuse feature learning according to claim 4, characterized in that the frequent item sets are mined as follows: setting a minimum support, calculating the support of every item set, and deleting from the database each item set whose support is below the minimum together with all of its supersets; the remaining item sets are the frequent item sets.
6. The open source code traceability detection method based on software reuse feature learning according to claim 4 or 5, characterized in that association rules are mined from the frequent item sets as follows:
setting a minimum confidence and screening the association rules of each frequent item set, excluding every association rule whose confidence is below the minimum together with the association rules whose antecedent is a subset of that rule's antecedent;
calculating the support of each frequent item set composed of reusing-software features and reused-software features, and the support of the corresponding frequent item set composed of the reusing-software features only; the ratio of the former to the latter is the confidence of the association rule.
CN202010091777.9A 2020-02-13 2020-02-13 Open source code tracing detection method based on software multiplexing feature learning Pending CN111241497A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010091777.9A CN111241497A (en) 2020-02-13 2020-02-13 Open source code tracing detection method based on software multiplexing feature learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010091777.9A CN111241497A (en) 2020-02-13 2020-02-13 Open source code tracing detection method based on software multiplexing feature learning

Publications (1)

Publication Number Publication Date
CN111241497A (en) 2020-06-05

Family

ID=70865674

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010091777.9A Pending CN111241497A (en) 2020-02-13 2020-02-13 Open source code tracing detection method based on software multiplexing feature learning

Country Status (1)

Country Link
CN (1) CN111241497A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination