CN114527986B

CN114527986B - C++ language-oriented source code anonymization method and related equipment

Info

Publication number: CN114527986B
Application number: CN202111681398.6A
Authority: CN
Inventors: 金正平; 刘冰; 刘祥昆; 秦素娟; 时忆杰
Original assignee: Beijing University of Posts and Telecommunications; National Computer Network and Information Security Management Center
Current assignee: Beijing University of Posts and Telecommunications; National Computer Network and Information Security Management Center
Priority date: 2021-12-31
Filing date: 2021-12-31
Publication date: 2023-12-26
Anticipated expiration: 2041-12-31
Also published as: CN114527986A

Abstract

The application provides a C++ language-oriented source code de-anonymization method and related equipment. The method extracts the temporal dynamic features and spatial dynamic features generated when a target source code is dynamically executed, performs similarity detection on the source code using these features, and determines the author of the target source code. The temporal and spatial dynamic features extracted during dynamic execution characterize the author's programming style, which supports the accuracy of source code de-anonymization and addresses the problem that existing code de-anonymization methods cannot be migrated to the C++ language.

Description

C++ language-oriented source code de-anonymization method and related equipment

Technical Field

The application relates to the technical field of source code processing, and in particular to a C++ language-oriented source code de-anonymization method and related equipment.

Background

With the rapid development of internet technology, information sharing has changed from a single mode to diversified modes, and massive amounts of information now reach people, who can use it to enrich their own knowledge. However, while information sharing brings convenience, it also causes a series of problems. For example, people can freely copy and store the content of others. If that content is misused, for example by using it without authorization as one's own, or by presenting the views and achievements of others as one's own, problems such as plagiarism and intellectual property disputes arise.

Others use malicious applications to defraud people of money, personal information, and the like. According to the 2019 Android malware special report, 360 Security Brain intercepted about 1.809 million newly added malware samples on mobile terminals in 2019. The number of newly added samples was higher in January and December, and differed little among the other months. Meanwhile, according to the CNCERT Internet security threat report, in November 2019 alone the number of domestic terminals infected with Trojan or botnet malicious programs was 120 or more. These malicious programs are installed and executed privately, without the user's permission, in order to achieve illegitimate purposes. All of the above behaviors use malicious code to infringe the interests of users.

For the above mentioned phenomena of code plagiarism and malicious code, there is an urgent need for effective measures to suppress these behaviors.

Disclosure of Invention

In view of this, the present application aims to provide a C++ language-oriented source code de-anonymization method and related equipment.

Based on the above object, the present application provides a C++ language-oriented source code de-anonymization method, including: acquiring a target source code, wherein the target source code is C++ language code; extracting the temporal dynamic features and spatial dynamic features generated when the target source code is dynamically executed; and performing similarity detection on the target source code according to the temporal dynamic features and the spatial dynamic features by using a learning model, and determining the author of the target source code.

Optionally, the temporal dynamic features comprise the number of functions involved, the function average call time, the function time ratio, the number of function calls, and the program running time. The function average call time comprises: the function average call time including the call time of child (callee) functions, and the function average call time excluding the call time of child functions.

Optionally, calculating the feature value related to the number of functions includes: acquiring the number of functions f of the target source code and the number of code lines l in the target source code, and calculating the feature value NF related to the number of functions from f and l (the formula for NF is not reproduced in this text). Calculating the feature value of the function time ratio comprises: acquiring the function execution time t of the target source code and the total running time v of the target source code, wherein the feature value RF of the function time ratio is calculated as $RF = \frac{t}{v}$.

Optionally, the spatial dynamic features include whether the target source code has a memory leak, the number of memory allocations, the number of memory releases, the average single allocation memory size, the memory release rate, the total memory requested, the total memory released, and the average memory used per function. In response to a memory leak occurring in the target source code, the spatial dynamic features further comprise the type of the memory leak and the proportion of leaked memory.

Optionally, extracting the temporal dynamic features generated by the target source code when dynamically executed includes: compiling the target source code by using a first performance analysis tool and a first command, adding a first option in the compiling process, and generating a first executable file based on the first option; executing the first executable file by using the first performance analysis tool and a second command, and generating a first text file; performing text parsing on the first text file by using the first performance analysis tool and a third command to generate a first data file; and counting and extracting the feature values of the temporal dynamic features of the target source code from the first data file. Extracting the spatial dynamic features generated by the target source code when dynamically executed includes: compiling the target source code and generating a second text file by using a second performance analysis tool during execution of the target source code; performing text parsing on the second text file by using the second performance analysis tool to generate a second data file; and counting and extracting the feature values of the spatial dynamic features of the target source code from the second data file by using the second performance analysis tool.

Optionally, the method further comprises: constructing an initial learning model; acquiring a source code database comprising multiple sample source codes written by multiple different authors; and extracting the sample temporal dynamic features and sample spatial dynamic features of the sample source codes, training the initial learning model based on these sample features, and taking the trained initial learning model as the learning model.

Optionally, the sample temporal dynamic features comprise the number of sample functions involved, the sample function average call time, the sample function time ratio, the number of sample function calls, and the sample program running time. The function average call time comprises: the function average call time including the call time of child (callee) functions, and the function average call time excluding the call time of child functions. Calculating the feature value related to the number of functions includes: acquiring the number of functions f of the source code and the number of code lines l in the source code, and calculating the feature value NF related to the number of functions from f and l (the formula for NF is not reproduced in this text). Calculating the feature value of the function time ratio comprises: acquiring the function execution time t of the source code and the total running time v of the source code, wherein the feature value RF of the function time ratio is calculated as $RF = \frac{t}{v}$. The sample spatial dynamic features comprise whether the sample source code has a memory leak, the number of sample memory allocations, the number of sample memory releases, the sample average single allocation memory size, the sample memory release rate, the total memory requested by the sample, the total memory released by the sample, and the average memory used per sample function. In response to a memory leak occurring in the sample source code, the sample spatial dynamic features further comprise the type of the sample memory leak and the proportion of memory leaked by the sample.

Based on the same object, the application also provides a C++ language-oriented source code de-anonymization device, which comprises: an acquisition module configured to acquire a target source code, wherein the target source code is C++ language code; a feature extraction module configured to extract the temporal dynamic features and spatial dynamic features generated when the target source code is dynamically executed; and a determining module configured to perform similarity detection on the target source code according to the temporal dynamic features and the spatial dynamic features by using a learning model, and to determine the author of the target source code.

Based on the above object, the present application further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor, when executing the program, implements any one of the above C++ language-oriented source code de-anonymization methods.

Based on the above object, the present application further provides a non-transitory computer readable storage medium storing computer instructions, where the computer instructions are configured to cause a computer to execute any one of the above C++ language-oriented source code de-anonymization methods.

From the above, it can be seen that the present application provides a C++ language-oriented source code de-anonymization method and related equipment. Aiming at the problem that existing code de-anonymization methods cannot be migrated to the C++ language, the method performs similarity detection on source code by using the temporal dynamic features and spatial dynamic features extracted when the target source code is dynamically executed, and determines the author of the target source code. The temporal and spatial dynamic features extracted during dynamic execution characterize the author's programming style and support the accuracy of source code de-anonymization.

Drawings

In order to more clearly illustrate the technical solutions of the present application or related art, the drawings that are required to be used in the description of the embodiments or related art will be briefly described below, and it is apparent that the drawings in the following description are only embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort to those of ordinary skill in the art.

FIG. 1 is a schematic flow chart of a C++ language-oriented source code de-anonymization method according to an embodiment of the present application;

FIG. 2 is a flowchart of a C++ language-oriented source code de-anonymization method according to another embodiment of the present application;

FIG. 3 is a schematic diagram of a C++ language-oriented source code de-anonymization device according to an embodiment of the present application;

FIG. 4 is a block diagram of an electronic device for the C++ language-oriented source code de-anonymization method according to an embodiment of the present application.

Detailed Description

For the purposes of making the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail below with reference to the accompanying drawings.

It should be noted that unless otherwise defined, technical or scientific terms used in the embodiments of the present application should be given the ordinary meaning as understood by one of ordinary skill in the art to which the present application belongs. The terms "first," "second," and the like, as used in embodiments of the present application, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.

In the related art, research on source code de-anonymization was already under way as early as 1997. For example, some works studied features of code structure in depth and obtained good results. In the study of Caliskan et al., syntactic features were added to the feature set for source code de-anonymization, because lexical features and layout features are in some respects easy to obfuscate, which affects the de-anonymization result, whereas syntactic features are less likely to be obfuscated. Extracting syntactic features involves parsing the source code into an abstract syntax tree (AST) and analyzing the internal structure of the tree to obtain valid information. In the design of the syntactic features, several sub-features were defined, for example the maximum depth of the abstract syntax tree and the average depth of 58 types of nodes after excluding leaf nodes. The study of Alsulami et al. built on the results of Caliskan et al. Because the features related to the abstract syntax tree had to be designed manually, together with an extraction scheme, this study added Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (BiLSTM) algorithms, which allow features to be extracted from the abstract syntax tree automatically. Experimental results show that the scheme reaches 88.86% accuracy with 70 authors.

Frantzeskou et al. proposed a new scheme, Source Code Author Profiles (SCAP), for identifying source code authors. Unlike earlier ways of representing the style information of a code author, it represents this information in byte-level (binary) form combined with an n-gram language model. Subsequent experiments showed that the approach of Frantzeskou et al. was effective, reaching 100% accuracy with 8 program authors. In 2009, Burrows et al. proposed a new scheme that de-anonymizes source code based on token-level n-grams, and achieved an average accuracy of 76.78%. In a subsequent study in 2014, Burrows et al. noted that, although there was much existing research on source code de-anonymization, different results could not be compared directly because the studies used different evaluation methods. They therefore summarized and compared multiple results, including research based on information retrieval techniques and research based on machine learning. The experiments show that, among machine learning algorithms, the support vector machine (SVM) and neural networks perform better, and that n-gram features can be used effectively in machine learning.

Because large systems are developed collaboratively by many people, Meng et al. proposed the first fine-grained technique for identifying multiple authors in a binary file. To determine the granularity of authorship attribution, three large and long-standing open-source projects were studied to decide whether a function or a basic block should be used as the attribution unit. The study by Meng et al. was directed to the first step in library code recognition.

However, the features mentioned above ignore the fact that code can be dynamically executed, and ignore the large amount of information that code generates during dynamic execution, which can characterize the author's programming style. The study of Wang et al. first proposed the concept of dynamic features and designed dynamic feature classes and an extraction scheme based on the Python programming language. However, that scheme cannot be used in other programming languages: it is designed around the analysis data generated by the memory_profiler and cProfile modules, so for another programming language the dynamic feature extraction scheme must first be redesigned, and part of the dynamic features must then be redesigned on that basis. Therefore, that scheme is difficult to migrate to other programming languages, such as the C++ language.

In view of this, one embodiment of the present application provides a C++ language-oriented source code de-anonymization method, as shown in FIG. 1, including:

S101, acquiring a target source code, wherein the target source code is C++ language code.

S102, extracting the temporal dynamic features and spatial dynamic features generated when the target source code is dynamically executed.

S103, performing similarity detection on the target source code according to the temporal dynamic features and the spatial dynamic features by using a learning model, and determining the author of the target source code.

Aiming at the problem that existing code de-anonymization methods cannot be migrated to the C++ language, the embodiment of the application provides a C++ language-oriented source code de-anonymization method, which performs similarity detection on source code by using the temporal dynamic features and spatial dynamic features extracted when the target source code is dynamically executed, so as to determine the author of the target source code. In this method, during dynamic execution of the code, temporal dynamic features are extracted to characterize the time complexity of the code written by the author and the author's usage habits and degree of preference for library functions; spatial dynamic features are extracted to characterize the author's memory release habits, memory allocation habits, and memory management habits, and the space complexity of the code. By extracting both temporal and spatial dynamic features, the method characterizes the author's programming style comprehensively and supports the accuracy of source code de-anonymization.
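As an illustration of step S103 only, the following minimal sketch attributes a feature vector to the author of its nearest training sample. The text does not specify the learning model, so the nearest-neighbour rule used here is a stand-in assumption, and the feature layout, author names, and numeric values are hypothetical.

```python
import math


def column_stats(rows):
    """Return per-feature mean and standard deviation (std falls back to 1.0 when it is 0)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return means, stds


def scale(vec, means, stds):
    """Standardize one feature vector so that no single feature dominates the distance."""
    return [(x - m) / s for x, m, s in zip(vec, means, stds)]


def predict_author(train, target):
    """train: list of (author, feature_vector). Returns the author of the closest sample."""
    means, stds = column_stats([vec for _, vec in train])
    scaled_target = scale(target, means, stds)
    nearest = min(train,
                  key=lambda item: math.dist(scale(item[1], means, stds), scaled_target))
    return nearest[0]


# Hypothetical vectors: [functionCalls, runTime (s), allocTimes, freeTimes]
samples = [
    ("author_a", [120, 0.02, 10, 10]),
    ("author_a", [150, 0.03, 12, 12]),
    ("author_b", [900, 0.40, 200, 50]),
    ("author_b", [1100, 0.55, 260, 70]),
]

print(predict_author(samples, [1000, 0.50, 220, 60]))
```

Any classifier trained on the sample features could take the place of `predict_author`; the sketch only shows how dynamic feature vectors feed the attribution step.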

In another embodiment, the C++ language-oriented source code de-anonymization method described in the present application may also be as shown in FIG. 2.

In some embodiments, the temporal dynamic features include the number of functions involved, the function average call time, the function time ratio, the number of function calls, and the program running time. The function average call time comprises: the function average call time including the call time of child (callee) functions, and the function average call time excluding the call time of child functions.

The temporal dynamic features cover multiple aspects of the source code during execution, so they can characterize the author's writing preferences comprehensively and support the accuracy of source code de-anonymization.

The number of functions involved (numFunction) counts the functions involved while the source code file is running. It differs from the number of functions in the lexical features in that it also includes the library functions called inside functions, so it may indicate the author's usage habits with respect to library functions.

The number of function calls (functionCalls) counts the function calls made while the source code file is running, including calls to library functions. This feature may indicate the author's degree of preference for functions and for patterns of function calls (such as recursion).

The function average call time (avgFunctionTime) is the average running time of functions while the source code file is running (total function running time / number of function calls), which may indicate the author's usage preference for functions.

The function time ratio (radioFunctionTime) is the ratio of the running time of functions to the total time while the source code file is running. Together with the function average call time, this feature may indicate the author's preference in the use of functions.

The program running time (runTime) is the total time used while the source code file is running. This feature may indicate the time complexity of the code written by the author.
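The feature definitions above can be sketched as follows. This is a minimal illustration under stated assumptions: the `FunctionRecord` type, the key suffixes `_self` and `_total`, and the sample numbers are hypothetical, and `numFunction` is shown as a raw count rather than the normalized value NF.

```python
from dataclasses import dataclass


@dataclass
class FunctionRecord:
    """One profiled function (hypothetical record type for this sketch)."""
    name: str
    calls: int
    self_seconds: float   # time spent in the function itself
    total_seconds: float  # time including the functions it calls


def time_features(records, run_time):
    """Compute the temporal dynamic features from per-function profiling records."""
    n_calls = sum(r.calls for r in records)
    self_total = sum(r.self_seconds for r in records)
    incl_total = sum(r.total_seconds for r in records)
    return {
        "numFunction": len(records),  # user-defined and library functions alike
        "functionCalls": n_calls,
        # average call time excluding / including child-function time
        "avgFunctionTime_self": self_total / n_calls if n_calls else 0.0,
        "avgFunctionTime_total": incl_total / n_calls if n_calls else 0.0,
        # ratio of function running time to total running time
        "radioFunctionTime": self_total / run_time if run_time else 0.0,
        "runTime": run_time,
    }


records = [
    FunctionRecord("main", 1, 0.01, 0.10),
    FunctionRecord("solve", 4, 0.06, 0.08),
    FunctionRecord("std::sort", 5, 0.02, 0.02),
]
features = time_features(records, run_time=0.12)
print(features)
```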

In a specific implementation, the embodiment of the application designs the temporal dynamic features based on the performance analysis tool Gprof. Compared with the related art, the application first adds a feature for the number of functions involved (numFunction). The functions covered by this feature include not only the functions defined by the author but also the library functions called internally, and its value characterizes the author's degree of preference for library functions. Second, the function average call time (avgFunctionTime) is divided into two parts, namely the function average call time including child function call time and the function average call time excluding child function call time; these values characterize the author's writing preferences with respect to functions. Finally, the application adds a function time ratio (radioFunctionTime) feature, which can also be used to characterize the author's writing preferences with respect to functions. Compared with the related art, these three newly added features (the number of functions involved, the function average call time, and the function time ratio) further support the accuracy of source code de-anonymization.

In some embodiments, calculating the feature value related to the number of functions comprises: acquiring the number of functions f of the target source code and the number of code lines l in the target source code, and calculating the feature value NF related to the number of functions from f and l (the formula for NF is not reproduced in this text). Calculating the feature value of the function time ratio comprises: acquiring the function execution time t of the target source code and the total running time v of the target source code, wherein the feature value RF of the function time ratio is calculated as $RF = \frac{t}{v}$.

In some embodiments, the spatial dynamic features include whether the target source code has a memory leak, the number of memory allocations, the number of memory releases, the average single allocation memory size, the memory release rate, the total memory requested, the total memory released, and the average memory used per function. In response to a memory leak occurring in the target source code, the spatial dynamic features further comprise the type of the memory leak and the proportion of leaked memory.

The spatial dynamic features cover multiple aspects of the source code during execution, so they can characterize the author's writing preferences comprehensively and support the accuracy of source code de-anonymization.

Whether a memory leak exists (memLeak) records whether a memory leak occurs during execution of the source code file, and may indicate the author's memory release habits.

Memory leak type (memLeakType) records the type of memory leak that occurs, if a memory leak occurs during execution of the source code file. The memory leak types are mainly classified as definitely lost, indirectly lost, possibly lost, still reachable, and suppressed.

Average single allocation memory size (avgmelloc) is the amount of memory allocated per allocation when the source code file is executed, which may indicate the author's memory allocation habits. Let the variable AM be the feature value, the variable m be the amount of memory allocated by the program, and the variable t be the number of memory allocations made by the program. The average single allocation memory size is then calculated as $AM = \frac{m}{t}$.

The proportion of leaked memory (radiopeakmem) takes the amount of memory allocated and the amount of memory leaked when the source code file is executed and calculates their ratio, which may indicate the author's memory management habits. Let the variable RL be the feature value, the variable m be the amount of memory allocated by the program, and the variable l be the amount of leaked memory. The radiopeakmem feature value is then calculated as $RL = \frac{l}{m}$.

The memory release rate (radio) takes the number of memory allocations and the number of memory releases during execution of the source code file and calculates their ratio, which, together with the proportion of leaked memory, may indicate the author's memory management habits.

The number of memory allocations (allocTimes) counts the memory allocations made when the source code file is executed. This feature may indicate the author's memory allocation habits.

The number of memory releases (freeTimes) counts the memory releases made when the source code file is executed. This feature may indicate the author's memory release habits.

Total memory requested (allocMem) is the amount of memory allocated when the source code file is executed. This feature may indicate the space complexity of the author's code.

Total memory released (freeMem) is the amount of memory released when the source code file is executed. This feature may indicate the author's memory release habits.

Average memory used per function (avgFunctionMameuse) takes the amount of memory allocated when the source code file is executed and the number of functions contained in the code, and calculates the average amount of memory used per function. This feature may indicate the author's memory allocation habits.
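The ratio-type spatial features can be sketched as below. This is an illustration under assumptions: the function and key names are hypothetical, the average single allocation size is taken as allocated memory divided by the number of allocations, the leak proportion as leaked memory divided by allocated memory, and the release rate as releases divided by allocations (the text does not state the direction of that ratio).

```python
def space_features(alloc_times, free_times, alloc_mem, free_mem, leaked_mem, num_functions):
    """Derive spatial dynamic features from raw memory counters (sizes in bytes)."""
    return {
        "memLeak": leaked_mem > 0,
        "allocTimes": alloc_times,
        "freeTimes": free_times,
        "allocMem": alloc_mem,
        "freeMem": free_mem,
        # AM = m / t: memory allocated per allocation
        "avgSingleAlloc": alloc_mem / alloc_times if alloc_times else 0.0,
        # RL = l / m: share of the allocated memory that leaked
        "leakRatio": leaked_mem / alloc_mem if alloc_mem else 0.0,
        # assumed direction: releases per allocation
        "releaseRate": free_times / alloc_times if alloc_times else 0.0,
        # allocated memory averaged over the functions in the code
        "avgFunctionMem": alloc_mem / num_functions if num_functions else 0.0,
    }


features = space_features(alloc_times=4, free_times=3, alloc_mem=1024,
                          free_mem=768, leaked_mem=256, num_functions=8)
print(features)
```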

In a specific implementation, the spatial dynamic features are designed based on the performance analysis tool Valgrind. Compared with the prior art, the application first adds features related to memory leaks, such as whether a memory leak exists (memLeak) and the memory leak type (memLeakType). Whether memory leaks are present in the code shows whether the code author normally takes care to avoid them, which characterizes the author's programming habits. Second, the application adds features related to the number of memory operations, including the number of memory allocations (allocTimes) and the number of memory releases (freeTimes), which characterize the author's preferences for memory allocation when writing code. Compared with the prior art, these four newly added features (whether a memory leak exists, the memory leak type, the number of memory allocations, and the number of memory releases) further support the accuracy of source code de-anonymization.

In some embodiments, extracting the temporal dynamic features generated by the target source code upon dynamic execution in S102 includes:

S201, compiling the target source code by using a first performance analysis tool and a first command, adding a first option in the compiling process, and generating a first executable file based on the first option.

S202, executing the first executable file by using the first performance analysis tool and a second command, and generating a first text file.

S203, performing text parsing on the first text file by using the first performance analysis tool and a third command to generate a first data file.

S204, counting and extracting the feature values of the temporal dynamic features of the target source code from the first data file.

In specific implementation, the embodiment of the application adopts a Gprof tool to extract the time dynamic characteristics, wherein Gprof is a GNU Profile tool and can be run on an operating system such as Linux, AIX, sun and the like and used for performance analysis of a C++ program. When the source code file is run, the running information of the program is recorded in a log form by using the tool, and an analysis file (Flat Profile) is generated; on the basis, the time dynamic characteristics are finally generated by counting indexes such as the number of the related functions, the average calling time of the functions, the duty ratio utilization rate of the functions, the times of the functions calling, the running time of the program and the like. The specific procedure is shown in algorithm pseudocode 1, using first three times the os.system method, which can execute an internal string in the form of a command on the system. The first execution of the os.system in-command is to compile a source code file using the gcc command and add-pg options to be able to generate an executable file for Gprof parsing. The second execution of the os.system internal command is to execute the executable file generated by the last command and add the data in the title corresponding to the programming file as input data. The timeout is added because part of the programming file has a dead loop condition, namely the program is always executing and is not terminated, so the longest execution time is added for solving the condition. Since the-pg option is added when the gcc command of os.system is executed for the first time, after os.system is executed for the second time, a gmon.out file is generated, which is the analysis data generated by the Gprof tool after the program is executed. But the file needs to be parsed using commands. The purpose of executing the os.system in-command a third time is to parse the gmon.out file and put the parsed content into the temp.txt file. 
Then, several methods, including numFunction and functionCalls, are called to obtain the individual sub-feature types, thereby completing the analysis of the text content in temp.
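The three os.system calls described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the patent's pseudocode 1: the helper names build_gprof_commands and run_pipeline are hypothetical, g++ is assumed in place of the gcc command named in the text, and the coreutils timeout command is assumed as the way to impose the maximum execution time.

```python
import os
import shlex


def build_gprof_commands(source, exe, input_file, report="temp.txt", timeout_s=10):
    """Return the compile, run and parse commands, in that order.

    Assumptions: g++ is on PATH, `timeout` is the coreutils command,
    and the executable is written to the current directory.
    """
    src, out, inp, rep = (shlex.quote(p) for p in (source, exe, input_file, report))
    compile_cmd = f"g++ -pg {src} -o {out}"            # -pg adds the profiling instrumentation
    run_cmd = f"timeout {timeout_s} ./{out} < {inp}"   # a normal exit writes gmon.out
    parse_cmd = f"gprof ./{out} gmon.out > {rep}"      # turn gmon.out into readable text
    return [compile_cmd, run_cmd, parse_cmd]


def run_pipeline(commands):
    """Run each command with os.system and stop at the first failure."""
    for cmd in commands:
        if os.system(cmd) != 0:
            return False
    return True
```

With these assumptions, run_pipeline(build_gprof_commands("solution.cpp", "solution", "sample.in")) would return False when compilation fails or the run hits the timeout, so such a file can be skipped rather than parsed.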

The Flat Profile file generated by the Gprof tool contains several data items, such as % time, seconds, calls and ts/call, which represent the function-related running information recorded during execution of the user program, for example the percentage of time occupied and the cumulative running time of the program. The source code files are all obtained from Google Code Jam, and during dynamic execution the problem input comes from the problem's sample data. Because Gprof is not well suited to collecting some statistics for programs with short running times, some features are instead obtained by embedding statements in the program and performing text analysis on the program output.
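As a rough illustration of how the Flat Profile text could be turned into numbers, the sketch below parses the data rows of a flat profile and derives a few aggregate values. The names ProfileRow, parse_flat_profile and summarize, as well as the sample text, are hypothetical; the parser assumes the usual column order of a flat profile (% time, cumulative seconds, self seconds, calls, self per call, total per call, name) and is intended for the flat-profile part of the report only.

```python
from dataclasses import dataclass
from typing import List, Optional

SAMPLE = """Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 60.00      0.03     0.03     1000     0.03     0.03  solve(int)
 40.00      0.05     0.02        1    20.00    50.00  main
  0.00      0.05     0.00                             frame_dummy
"""


@dataclass
class ProfileRow:
    percent_time: float
    cumulative_s: float
    self_s: float
    calls: Optional[int]  # None when the profile leaves the call count blank
    name: str


def _is_float(token: str) -> bool:
    try:
        float(token)
        return True
    except ValueError:
        return False


def parse_flat_profile(text: str) -> List[ProfileRow]:
    rows = []
    for line in text.splitlines():
        tok = line.split()
        # A data row starts with three numeric columns; headers do not.
        if len(tok) < 4 or not all(_is_float(t) for t in tok[:3]):
            continue
        rest = tok[3:]
        if len(rest) >= 4 and rest[0].isdigit() and _is_float(rest[1]) and _is_float(rest[2]):
            calls, name = int(rest[0]), " ".join(rest[3:])
        else:
            calls, name = None, " ".join(rest)
        rows.append(ProfileRow(float(tok[0]), float(tok[1]), float(tok[2]), calls, name))
    return rows


def summarize(rows: List[ProfileRow]) -> dict:
    return {
        "num_functions": len(rows),
        "total_calls": sum(r.calls or 0 for r in rows),
        "profiled_seconds": max((r.cumulative_s for r in rows), default=0.0),
    }
```

Calling summarize(parse_flat_profile(SAMPLE)) gives the function count, the total number of calls and the profiled time for the sample; how these raw values map onto the patent's feature values is not specified here.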

In some embodiments, extracting the spatial dynamic characteristics of the target source code generated upon dynamic execution in S102 includes:

S301, compiling the target source code by using a second performance analysis tool in the target source code executing process and generating a second text file.

S302, performing text analysis on the second text file by using the second performance analysis tool to generate a second data file.

S303, counting and extracting characteristic values of the space dynamic characteristics of the target source code in the second data file by using the second performance analysis tool.

In a specific implementation, the space dynamic characteristics are extracted with the Valgrind tool. Valgrind is an open-source (GPL v2) suite of simulation-based debugging tools for Linux. It consists of a core and a set of debugging tools built on that core, and can be used to detect memory leaks, monitor program cache behavior, and so on. Valgrind includes tools such as Memcheck, Callgrind, Cachegrind and Helgrind; the Memcheck tool is mainly used here. As shown in algorithm pseudocode 2, the Valgrind tool is used during program execution to generate a log file that records the running information of the program, and an analysis file is generated. On this basis, the space dynamic characteristics are generated by computing indexes such as whether a memory leak exists, the memory leak type, the number of memory allocations and the memory allocation size; the procedure is similar to that for extracting the time dynamic characteristics.
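The Memcheck step can be sketched in the same spirit. The code below is a hypothetical illustration, not the patent's pseudocode 2: memcheck_command and parse_memcheck_log are invented names, the log excerpt is made up, and two choices are assumptions of this sketch: any leak category with a non-zero byte count is treated as a leak, and the released memory is estimated as the bytes allocated minus the bytes still in use at exit.

```python
import re

SAMPLE_LOG = """==1234== HEAP SUMMARY:
==1234==     in use at exit: 40 bytes in 1 blocks
==1234==   total heap usage: 4 allocs, 3 frees, 1,064 bytes allocated
==1234==
==1234== LEAK SUMMARY:
==1234==    definitely lost: 40 bytes in 1 blocks
==1234==    indirectly lost: 0 bytes in 0 blocks
==1234==      possibly lost: 0 bytes in 0 blocks
==1234==    still reachable: 0 bytes in 0 blocks
"""

_NUM = r"([\d,]+)"
_HEAP = re.compile(rf"total heap usage:\s*{_NUM} allocs,\s*{_NUM} frees,\s*{_NUM} bytes allocated")
_IN_USE = re.compile(rf"in use at exit:\s*{_NUM} bytes")
_LEAK = re.compile(rf"(definitely lost|indirectly lost|possibly lost|still reachable):\s*{_NUM} bytes")


def _to_int(text):
    return int(text.replace(",", ""))


def memcheck_command(exe, log_file="memcheck.log"):
    """Command line that runs the executable under Memcheck and writes a log file."""
    return f"valgrind --tool=memcheck --leak-check=full --log-file={log_file} ./{exe}"


def parse_memcheck_log(text):
    heap = _HEAP.search(text)
    allocs, frees, total = (_to_int(g) for g in heap.groups()) if heap else (0, 0, 0)
    in_use = _IN_USE.search(text)
    in_use_bytes = _to_int(in_use.group(1)) if in_use else 0
    # Keep only the leak categories that actually lost bytes.
    leaks = {kind: _to_int(size) for kind, size in _LEAK.findall(text) if _to_int(size) > 0}
    leaked = sum(leaks.values())
    return {
        "has_leak": bool(leaks),
        "leak_types": sorted(leaks),
        "allocs": allocs,
        "frees": frees,
        "total_allocated_bytes": total,
        "released_bytes": total - in_use_bytes,  # estimate, see assumptions above
        "avg_alloc_bytes": total / allocs if allocs else 0.0,
        "release_ratio": frees / allocs if allocs else 0.0,
        "leaked_ratio": leaked / total if total else 0.0,
    }
```

For the sample log, parse_memcheck_log(SAMPLE_LOG) reports 4 allocations, 3 releases and one leak category, which correspond to the allocation count, release count and leak type indexes listed above.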

In some embodiments, the method further comprises: constructing an initial learning model; acquiring a source code database comprising a plurality of sample source codes written by a plurality of different authors; and extracting sample time dynamic characteristics and sample space dynamic characteristics of the sample source codes, training the initial learning model based on the sample time dynamic characteristics and the sample space dynamic characteristics of the sample source codes, and taking the trained initial learning model as the learning model.

In a specific implementation, the learning model is constructed as a random forest, and the construction comprises the following steps:

S401, randomly selecting N sample source codes with replacement (one sample is selected at random each time and is returned before the next selection). The N selected sample source codes are used to train one decision tree and serve as the samples at the root node of that decision tree.

S402, each sample has M features. When a node of the decision tree needs to be split, m features are randomly selected from the M features, where m << M. A strategy such as information gain is then used to select one of the m features as the splitting attribute of the node.

S403, during formation of the decision tree, every node is split according to S402 until no further splitting is possible (if the attribute selected for a node is the attribute that was used when its parent node was split, the node has reached a leaf and does not need to be split further). Note that no pruning is performed during the whole decision tree formation process.

S404, a large number of decision trees are established according to the steps S401 to S403 to form a random forest.
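Steps S401 and S402 can be illustrated with a small standard-library sketch. It is a simplified illustration under assumptions, not the embodiment's implementation: the function names are hypothetical, features are assumed to be numeric and split by a threshold, and only the sampling and the choice of one split are shown (growing complete trees and voting across the forest are omitted).

```python
import math
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """S401: draw N rows from N rows, with replacement."""
    return [rng.choice(rows) for _ in rows]


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, feature, threshold):
    """Gain of splitting (features, label) rows on features[feature] <= threshold."""
    left = [y for x, y in rows if x[feature] <= threshold]
    right = [y for x, y in rows if x[feature] > threshold]
    if not left or not right:
        return 0.0
    n = len(rows)
    parent = entropy([y for _, y in rows])
    return parent - len(left) / n * entropy(left) - len(right) / n * entropy(right)


def choose_split(rows, m, rng):
    """S402: pick m of the M features at random, return the best (gain, feature, threshold)."""
    num_features = len(rows[0][0])
    candidates = rng.sample(range(num_features), m)
    best = (0.0, None, None)
    for feature in candidates:
        for threshold in sorted({x[feature] for x, _ in rows}):
            gain = information_gain(rows, feature, threshold)
            if gain > best[0]:
                best = (gain, feature, threshold)
    return best
```

A tree would be grown by applying choose_split recursively to the two halves of each split, and a forest by repeating this on many bootstrap samples; an existing random forest library could equally be used instead of this sketch.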

The learning model is constructed as a random forest because the random forest algorithm can produce a highly accurate classifier and can handle a large number of input variables; while the forest is being built it produces an internal unbiased estimate of the generalization error, and it learns quickly.

In some embodiments, the sample time dynamic characteristics include the number of sample related functions, the sample function average call time, the sample function time-share utilization, the number of sample function calls and the sample program running time. The function average call time comprises the function average call time including child-function call time and the function average call time excluding child-function call time. Calculating the feature value of the number of related functions includes: acquiring the number of functions f of the target source code and the number of code lines l in the target source code, the feature value NF of the number of related functions being calculated from f and l [formula missing in source]. Calculating the feature value of the function time-share utilization includes: acquiring the function execution time t of the target source code and the total running time v of the target source code, the feature value RF of the function time-share utilization being calculated from t and v [formula missing in source]. The sample space dynamic characteristics include whether the sample source code has a memory leak, the number of sample memory allocations, the number of sample memory releases, the sample average single-allocation memory size, the sample memory release ratio, the sample total requested memory, the sample total released memory and the sample function average memory usage. In response to a memory leak occurring in the sample source code, the sample space dynamic characteristics further include the sample memory leak type and the sample leaked-memory ratio.

The learning model is trained with the time dynamic characteristics and the space dynamic characteristics, which capture the source code from several angles during execution, comprehensively characterize the writing preferences of the author, and ensure the accuracy of source code de-anonymization.

Compared with the related art, the accuracy of the C++ language-oriented source code de-anonymization provided by the embodiment of the application is shown in Table 1. Scheme 1 is the study of Frantzeskou et al.; the specific scheme is Source Code Author Profiles (SCAP), which is combined with the N-gram language model to represent the style information of a code author in binary form;

Scheme 2 is the 2009 study of Burrows et al.; the specific scheme combines the N-gram language model and de-anonymizes source code at the word level;

Scheme 3 is the study of Alsulami et al.; the specific scheme further studies the addition of syntactic features to source code de-anonymization, introducing Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (BiLSTM) algorithms so that features in the abstract syntax tree can be extracted automatically;

Scheme 4 is the 2014 study of Burrows et al.; the specific scheme summarizes and compares several types of research results, including research based on information retrieval techniques and research based on machine learning. Its experiments show that, among machine learning algorithms, the support vector machine (SVM) and neural networks perform better, and that the N-gram language model can be used effectively for machine learning;

Scheme 5 is the study of Meng et al.; the specific scheme uses fine-grained techniques to identify multiple authors in a binary file. To determine the granularity at which authorship is attributed, three large, long-standing open-source projects were studied to decide whether to use a function or a basic block as the attribution unit.

The number of training files for training the learning model for each of the schemes in table 1 was 15.

Table 1: Accuracy comparison for source code de-anonymization

As can be seen from the data in Table 1, the accuracy of the C++ language-oriented source code de-anonymization method provided by the application is higher than that of the methods used in the related art, which ensures the accuracy of C++ source code de-anonymization.

It should be noted that, the method of the embodiments of the present application may be performed by a single device, for example, a computer or a server. The method of the embodiment can also be applied to a distributed scene, and is completed by mutually matching a plurality of devices. In the case of such a distributed scenario, one of the devices may perform only one or more steps of the methods of embodiments of the present application, and the devices may interact with each other to complete the methods.

It should be noted that some embodiments of the present application are described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

Based on the same inventive concept and corresponding to the method of any of the above embodiments, the application further provides a C++ language-oriented source code de-anonymization device which, as shown in Fig. 3, includes: an acquisition module 10 configured to acquire a target source code, the target source code being C++ language code; a feature extraction module 20 configured to extract the time dynamic characteristics and space dynamic characteristics generated when the target source code is dynamically executed; and a determining module 30 configured to use a learning model to perform similarity detection on the target source code according to the time dynamic characteristics and the space dynamic characteristics and to determine the author of the target source code.
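One way to picture how the three modules cooperate is the following sketch. It is purely illustrative: the class name DeanonymizationDevice, its field names and the use of plain callables are assumptions of this sketch, and the real feature extractors and learning model would be supplied by the embodiments described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

Features = Dict[str, float]


@dataclass
class DeanonymizationDevice:
    acquire: Callable[[str], str]                      # role of acquisition module 10
    extract_temporal: Callable[[str], Features]        # role of feature extraction module 20
    extract_spatial: Callable[[str], Features]         # role of feature extraction module 20
    predict_author: Callable[[Sequence[float]], str]   # role of determining module 30

    def identify(self, path: str) -> str:
        source = self.acquire(path)
        features = {**self.extract_temporal(source), **self.extract_spatial(source)}
        # Fix the feature order so the model always sees the same layout.
        vector = [features[name] for name in sorted(features)]
        return self.predict_author(vector)
```

Keeping the modules as interchangeable callables mirrors the functional division of the device; any concrete extractor or trained model with a matching signature could be plugged in.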

To address the problem that existing code de-anonymization methods cannot be migrated to the C++ language, the embodiment of the application provides a C++ language-oriented source code de-anonymization device, which performs similarity detection on source code using the time dynamic characteristics and space dynamic characteristics extracted while the target source code is dynamically executed, and thereby determines the author of the target source code. During dynamic execution of the code, the device extracts time dynamic characteristics to represent the time complexity of the code written by the author and the author's habits and preferences in using library functions, and extracts space dynamic characteristics to represent the author's memory release habits, memory allocation habits, memory management habits and the space complexity of the code. By extracting both the time dynamic characteristics and the space dynamic characteristics, the device comprehensively characterizes the author's programming style and ensures the accuracy of source code de-anonymization.

In some embodiments, calculating the feature value of the number of related functions includes: acquiring the number of functions f of the target source code and the number of code lines l in the target source code, the feature value NF of the number of related functions being calculated from f and l [formula missing in source]. Calculating the feature value of the function time-share utilization includes: acquiring the function execution time t of the target source code and the total running time v of the target source code, the feature value RF of the function time-share utilization being calculated from t and v [formula missing in source].

In some embodiments, the feature extraction module further comprises a temporal dynamic feature extraction sub-module and a spatial dynamic feature extraction sub-module. The temporal dynamic feature extraction sub-module is configured to: compile the target source code using a first performance analysis tool and a first command during execution of the target source code, a first option being added during compiling and a first executable file being generated based on the first option; execute the first executable file using the first performance analysis tool and a second command, and generate a first text file; perform text parsing on the first text file using the first performance analysis tool and a third command to generate a first data file; and count and extract the feature values of the time dynamic characteristics of the target source code in the first data file. The spatial dynamic feature extraction sub-module is configured to: compile the target source code and generate a second text file using a second performance analysis tool during execution of the target source code; perform text parsing on the second text file using the second performance analysis tool to generate a second data file; and count and extract the feature values of the space dynamic characteristics of the target source code in the second data file using the second performance analysis tool.

In some embodiments, the C++ language-oriented source code de-anonymization device further comprises: a building module configured to build an initial learning model; a sample acquisition module configured to acquire a source code database comprising a plurality of sample source codes written by a plurality of different authors; and a sample extraction module configured to extract sample time dynamic characteristics and sample space dynamic characteristics of the sample source codes, train the initial learning model based on the sample time dynamic characteristics and the sample space dynamic characteristics of the sample source codes, and take the trained initial learning model as the learning model.

In some embodiments, the sample time dynamic characteristics include the number of sample related functions, the sample function average call time, the sample function time-share utilization, the number of sample function calls and the sample program running time. The function average call time comprises the function average call time including child-function call time and the function average call time excluding child-function call time. Calculating the feature value of the number of related functions includes: acquiring the number of functions f of the target source code and the number of code lines l in the target source code, the feature value NF of the number of related functions being calculated from f and l [formula missing in source]. Calculating the feature value of the function time-share utilization includes: acquiring the function execution time t of the target source code and the total running time v of the target source code, the feature value RF of the function time-share utilization being calculated from t and v [formula missing in source]. The sample space dynamic characteristics include whether the sample source code has a memory leak, the number of sample memory allocations, the number of sample memory releases, the sample average single-allocation memory size, the sample memory release ratio, the sample total requested memory, the sample total released memory and the sample function average memory usage. In response to a memory leak occurring in the sample source code, the sample space dynamic characteristics further include the sample memory leak type and the sample leaked-memory ratio.

For convenience of description, the above devices are described as being functionally divided into various modules, respectively. Of course, the functions of each module may be implemented in the same piece or pieces of software and/or hardware when implementing the present application.

The device of the foregoing embodiment is configured to implement the corresponding C++ language-oriented source code de-anonymization method of any of the foregoing embodiments and has the beneficial effects of the corresponding method embodiment, which are not repeated here.

Based on the same inventive concept, the application also provides an electronic device corresponding to the method of any embodiment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the C++ language-oriented source code anonymization method according to any embodiment when executing the program.

Fig. 4 shows a more specific hardware architecture of an electronic device according to this embodiment, where the device may include: a processor 41, a memory 42, an input/output interface 43, a communication interface 44 and a bus 45. Wherein the processor 41, the memory 42, the input/output interface 43 and the communication interface 44 are in communication connection with each other inside the device via a bus 45.

The processor 41 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, etc. for executing relevant programs to implement the technical solutions provided in the embodiments of the present disclosure.

The memory 42 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 42 may store an operating system and other application programs, and when the technical solutions provided in the embodiments of the present specification are implemented by software or firmware, the relevant program code is stored in the memory 42 and invoked by the processor 41 for execution.

The input/output interface 43 is used to connect with an input/output module to realize information input and output. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.

The communication interface 44 is used to connect a communication module (not shown) to enable communication interaction of the device with other devices. The communication module may implement communication through a wired manner (such as USB, network cable, etc.), or may implement communication through a wireless manner (such as mobile network, WIFI, bluetooth, etc.).

Bus 45 includes a path to transfer information between components of the device (e.g., processor 41, memory 42, input/output interface 43, and communication interface 44).

It should be noted that although the above device only shows the processor 41, the memory 42, the input/output interface 43, the communication interface 44, and the bus 45, in the implementation, the device may further include other components necessary for achieving normal operation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary to implement the embodiments of the present description, and not all the components shown in the drawings.

The electronic device of the foregoing embodiment is configured to implement the corresponding C++ language-oriented source code de-anonymization method of any of the foregoing embodiments and has the beneficial effects of the corresponding method embodiment, which are not repeated here.

Based on the same inventive concept and corresponding to the method of any of the above embodiments, the present application further provides a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the C++ language-oriented source code de-anonymization method according to any of the above embodiments.

The computer readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may be used to implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device.

The storage medium of the foregoing embodiment stores computer instructions for causing the computer to execute the C++ language-oriented source code de-anonymization method according to any of the foregoing embodiments and has the beneficial effects of the corresponding method embodiment, which are not repeated here.

Those of ordinary skill in the art will appreciate that: the discussion of any of the embodiments above is merely exemplary and is not intended to suggest that the scope of the application (including the claims) is limited to these examples; the technical features of the above embodiments or in the different embodiments may also be combined within the idea of the present application, the steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present application as described above, which are not provided in detail for the sake of brevity.

Additionally, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures, in order to simplify the illustration and discussion, and so as not to obscure the embodiments of the present application. Furthermore, the devices may be shown in block diagram form in order to avoid obscuring the embodiments of the present application, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform on which the embodiments of the present application are to be implemented (i.e., such specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.

While the present application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of those embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the embodiments discussed.

The present embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Accordingly, any omissions, modifications, equivalents, improvements and/or the like which are within the spirit and principles of the embodiments are intended to be included within the scope of the present application.

Claims

1. A C++ language-oriented source code de-anonymization method, characterized by comprising the following steps:

acquiring a target source code, wherein the target source code is C++ language code;

extracting time dynamic characteristics and space dynamic characteristics generated when the target source code is dynamically executed;

using a learning model to detect the similarity of the target source code according to the time dynamic characteristics and the space dynamic characteristics, and determining the author of the target source code;

the time dynamic characteristics comprise the number of related functions, the function average call time, the function time-share utilization, the number of function calls and the program running time;

the function average call time comprises: the function average call time including child-function call time and the function average call time excluding child-function call time;

calculating the feature value of the number of related functions comprises: acquiring the number of functions f of the target source code and the number of code lines l in the target source code, the feature value NF of the number of related functions being calculated from f and l [formula missing in source];

calculating the feature value of the function time-share utilization comprises: acquiring the function execution time t of the target source code and the total running time v of the target source code, the feature value RF of the function time-share utilization being calculated from t and v [formula missing in source];

The space dynamic characteristics comprise whether the target source code has memory leakage, memory allocation times, memory release times, average single allocation memory size, memory release ratio, total memory application, total memory release and function average used memory.

2. The C++ language-oriented source code de-anonymization method according to claim 1, wherein,

in response to a memory leak occurring in the target source code, the space dynamic characteristics further comprise the memory leak type and the proportion of leaked memory.

3. The C++ language-oriented source code de-anonymization method according to claim 1, wherein extracting the time dynamic characteristics generated by the target source code upon dynamic execution comprises:

compiling the target source code by using a first performance analysis tool and a first command in the execution process of the target source code, adding a first option in the compiling process, and generating a first executable file based on the first option;

executing the first executable file using the first performance analysis tool and using a second command, and generating a first text file;

using the first performance analysis tool and using a third command to carry out text analysis on the first text file to generate a first data file;

counting and extracting characteristic values of time dynamic characteristics of the target source codes in the first data file;

extracting spatial dynamic characteristics generated by the target source code during dynamic execution, including:

compiling the target source code and generating a second text file by using a second performance analysis tool in the target source code execution process;

performing text parsing on the second text file by using the second performance analysis tool to generate a second data file;

and using the second performance analysis tool to count and extract characteristic values of the spatial dynamic characteristics of the target source code in the second data file.

4. The C++ language-oriented source code de-anonymization method of claim 1, further comprising:

constructing an initial learning model;

acquiring a source code database comprising written multiple sample source codes of multiple different authors;

and extracting sample time dynamic characteristics and sample space dynamic characteristics of the sample source codes, training the initial learning model based on the sample time dynamic characteristics and the sample space dynamic characteristics of the sample source codes, and taking the initial learning model after training as the learning model.

5. The C++ language-oriented source code de-anonymization method according to claim 4, wherein the sample time dynamic characteristics include the number of sample related functions, the sample function average call time, the sample function time-share utilization, the number of sample function calls and the sample program running time;

calculating the feature value of the function time-share utilization comprises: acquiring the function execution time t of the target source code and the total running time v of the target source code, the feature value RF of the function time-share utilization being calculated from t and v [formula missing in source];

The sample space dynamic characteristics comprise whether the sample source code has memory leakage, sample memory allocation times, sample memory release times, sample average single allocation memory size, sample memory release rate, sample application total memory, sample release total memory and sample function average use memory;

and in response to a memory leak occurring in the sample source code, the sample space dynamic characteristics further comprise the sample memory leak type and the sample leaked-memory ratio.

6. A C++ language-oriented source code de-anonymization device, comprising:

the acquisition module is configured to acquire target source codes, wherein the target source codes are C++ language codes;

a feature extraction module configured to extract temporal and spatial dynamic features generated when the target source code is dynamically executed;

a determining module configured to determine an author of the target source code based on similarity detection of the target source code based on the temporal dynamic feature and the spatial dynamic feature using a learning model;

the time dynamic characteristics comprise the number of related functions, the function average call time, the function time-share utilization, the number of function calls and the program running time; the function average call time comprises: the function average call time including child-function call time and the function average call time excluding child-function call time;

wherein calculating the feature value of the number of related functions comprises: acquiring the number of functions f of the target source code and the number of code lines l in the target source code, the feature value NF of the number of related functions being calculated from f and l [formula missing in source]; calculating the feature value of the function time-share utilization comprises: acquiring the function execution time t of the target source code and the total running time v of the target source code, the feature value RF of the function time-share utilization being calculated from t and v [formula missing in source];

7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 5 when the program is executed by the processor.

8. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 5.