CN118132141A - Function automatic reconstruction method and device based on code characteristic diagram and electronic equipment - Google Patents

Publication number: CN118132141A
Application number: CN202410544530.6A
Authority: CN (China)
Legal status: Pending
Original and current assignee: Xidian University
Inventors: 崔笛 (Cui Di), 王路桥 (Wang Luqiao), 庄慧盈 (Zhuang Huiying), 王强强 (Wang Qiangqiang), 王嘉琪 (Wang Jiaqi), 孙嘉莹 (Sun Jiaying)
Other languages: Chinese (zh)
Application filed by Xidian University; priority to CN202410544530.6A.
Abstract

The invention discloses a function automatic reconstruction method and device based on a code characteristic diagram, and electronic equipment. The method comprises the following steps: extracting static features and dynamic features of the code to be reconstructed to obtain a tree view and a flow view; extracting annotations of the code to be reconstructed to obtain an annotation characterization; embedding the tree view and the flow view respectively to obtain a tree view characterization and a flow view characterization; fusing the tree view characterization, flow view characterization, and annotation characterization in multiple stages to obtain a hybrid characterization of the code to be reconstructed; and inputting the hybrid characterization into a trained code reconstruction model to obtain the reconstructed code. In the method provided by the invention, the tree view, which represents the static features of the code, and the flow view, which represents its dynamic features, serve as inputs to the code reconstruction model, so that the model can extract more comprehensive features of the code when reconstructing it, increasing the accuracy of the reconstructed code.

Description

Function automatic reconstruction method and device based on code characteristic diagram and electronic equipment
Technical Field
The invention belongs to the technical field of software engineering, and particularly relates to a function automatic reconstruction method and device based on a code characteristic diagram and electronic equipment.
Background
Code reconstruction (refactoring) improves the quality and performance of software by adjusting the program code without changing the software's observable behavior, making the program's design patterns and architecture more reasonable, improving the software's extensibility and maintainability, and making the code easier for others to understand.
Extract function is one of the most common reconstruction operations for decomposing large, complex code, and it can also be combined with other reconstruction operations to eliminate design defects such as code duplication and overlong code. It gathers a related piece of code together and then places that code into a new function. Such functions are typically smaller and easier to read. Current extract-function reconstruction tools rely on preset extraction criteria and fixed thresholds, and developers treat these tools as technical advisors for extract-function reconstruction. However, there is no uniform criterion for which code lines to extract as a new function; in other words, the choice is subjective, and studies have shown that these tools cannot reliably select the appropriate code lines.
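The extract-function operation described above can be illustrated with a small, hypothetical example (the function names and the scenario are invented for illustration): a cohesive block inside a longer function is moved into its own, smaller function without changing observable behavior.

```python
# Hypothetical illustration of extract-function refactoring: the tax
# computation tangled inside a long function is extracted into its own
# smaller, named, reusable function; observable behavior is unchanged.

def report_total_before(prices, tax_rate):
    # Original "long" function: summing and tax application are mixed together.
    total = 0.0
    for p in prices:
        total += p
    total += total * tax_rate
    return f"Total: {total:.2f}"

def apply_tax(subtotal, tax_rate):
    # Extracted function: smaller and easier to read.
    return subtotal + subtotal * tax_rate

def report_total_after(prices, tax_rate):
    # After refactoring: same output, clearer structure.
    return f"Total: {apply_tax(sum(prices), tax_rate):.2f}"

print(report_total_after([1.0, 2.0], 0.1))
```

The subjective part noted above is exactly the choice of *which* lines (here, the tax computation) become the new function.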
The most advanced related art can currently be roughly classified into two categories: the first is non-machine-learning methods based on heuristics, and the second is machine-learning methods based on historical data. Heuristic-based non-machine-learning methods solve problems using heuristic rules and empirical knowledge, typically setting extraction criteria and fixed thresholds in advance. Machine learning based on historical data is a technique in which the model is trained on past data to predict future events or make decisions; rather than relying on fixed extraction criteria and thresholds, it expects the model to learn from the historical data the experience of manually extracting functions.
Based on these two methods, a certain bank disclosed a code reconstruction method and device (application number: CN 202010580466.9, application publication number: CN 111767076A) that performs static code analysis on the code to be reconstructed to generate an abstract syntax tree, reconstructs the abstract syntax tree according to function code marks preset in the code to be reconstructed, and generates a source code file from the reconstructed abstract syntax tree. That code reconstruction method and device improve the maintainability of code to a certain extent and reduce its complexity, but they also have defects.
First, that invention considers only the generated abstract syntax tree; reconstructing from this single aspect may fail to fully reflect the structure and behavior of the code, so the code cannot be understood more comprehensively and deeply. Second, that invention applies static code analysis to the code to be reconstructed, focusing on static characteristics such as potential code problems and violations of code specifications; it cannot understand behaviors involving the runtime environment, configuration files, external libraries, and the like, so the analysis result is inaccurate.
Therefore, because current code reconstruction methods reconstruct based only on the static characteristics of the code, the reconstructed code cannot comprehensively reflect the characteristics of the source code, causing inaccuracy.
Disclosure of Invention
The embodiments of the invention provide a function automatic reconstruction method and device based on a code characteristic diagram, and electronic equipment, which can solve the problem that current code reconstruction methods reconstruct based only on the static characteristics of the code, so that the reconstructed code cannot comprehensively reflect the characteristics of the source code, causing inaccuracy.
In a first aspect, a method for automatically reconstructing a function based on a code feature map according to an embodiment of the present invention includes:
Extracting static features and dynamic features of codes to be reconstructed to obtain a tree view and a stream view;
Extracting annotation of the code to be reconstructed to obtain annotation characterization;
Respectively carrying out embedding processing on the tree view and the stream view to obtain tree view characterization and stream view characterization;
Fusing the tree view characterization, flow view characterization, and annotation characterization in multiple stages to obtain a mixed characterization of the code to be reconstructed;
and inputting the mixed representation into a trained code reconstruction model to obtain a reconstructed code.
In a second aspect, an embodiment of the present invention provides a function automatic reconstruction device based on a code characteristic diagram, where the device includes a processing unit; the processing unit is used for:
Extracting static features and dynamic features of codes to be reconstructed to obtain a tree view and a stream view;
Extracting annotation of the code to be reconstructed to obtain annotation characterization;
Respectively carrying out embedding processing on the tree view and the stream view to obtain tree view characterization and stream view characterization;
Fusing the tree view characterization, flow view characterization, and annotation characterization in multiple stages to obtain a mixed characterization of the code to be reconstructed;
and inputting the mixed representation into a trained code reconstruction model to obtain a reconstructed code.
In a third aspect, an embodiment of the present invention provides an electronic device including a processor and a memory, where the memory is configured to store a computer program, and the processor is configured to execute the computer program (instructions) stored in the memory to implement the method of the first aspect described above.
Compared with the prior art, the embodiments of the invention have the following beneficial effect: in the method provided by the invention, the tree view, which represents the static features of the code, and the flow view, which represents its dynamic features, serve as inputs to the code reconstruction model, so that the model can extract more comprehensive features of the code when reconstructing it, increasing the accuracy of the reconstructed code.
Drawings
Fig. 1 is a schematic flow chart of a function automatic reconstruction method based on a code characteristic diagram according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a scenario for determining a hybrid characterization provided by an embodiment of the present invention;
Fig. 3 is a schematic flow chart of a training method of a code reconstruction model according to the present invention;
Fig. 4 is a schematic structural diagram of a function automatic reconstruction device based on a code characteristic diagram according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but embodiments of the present invention are not limited thereto.
The function automatic reconstruction method based on the code characteristic diagram provided by the embodiments of the invention can be applied to electronic equipment such as a mobile terminal, a personal notebook computer, or a supercomputer; the embodiments of the invention do not limit the specific type of electronic equipment.
Fig. 1 is a schematic flow chart of a function automatic reconstruction method based on a code characteristic diagram according to an embodiment of the present invention. By way of example and not limitation, the method 100 may be applied in the electronic device described above. The method 100 may include steps S101-S105, each of which is described below.
S101, extracting static features and dynamic features of codes to be reconstructed to obtain a tree view and a stream view.
In some embodiments, static features such as structural features, and dynamic features such as behavioral features and dependency features, of the code to be reconstructed can be extracted to obtain a code characteristic diagram based on a multi-view feature representation. Subgraphs of the code characteristic diagram are then extracted to obtain a tree view representing the static features and a flow view representing the dynamic features.
In one possible implementation, the structural features, behavioral features, and dependency features of the code to be reconstructed may be extracted separately to generate, respectively, an abstract syntax tree (a graph with a tree structure), a control flow graph, and a program dependency graph. The three graphs are then merged to obtain the code characteristic diagram based on the multi-view feature representation. Because the abstract syntax tree, control flow graph, and program dependency graph each have advantages in describing characteristics of the code such as its structure, behavior, and dependencies, the code characteristic diagram generated from these three graphs is more comprehensive and deeper, making subsequent comprehensive and deep analysis of the code possible.
For example, the srcML tool may be used to parse the code to be reconstructed: it performs lexical analysis and syntax analysis according to the syntax rules of the programming language and builds the abstract syntax tree, then converts the analysis result into XML (Extensible Markup Language) format while preserving the documentation and annotation information in the code to be reconstructed. The abstract syntax tree may then be further processed by the static analysis tool Joern to generate the control flow graph and the program dependency graph.
Specifically, the code to be reconstructed, as parsed with srcML, may be imported into Joern's code base and loaded using the command-line tool or API (Application Programming Interface) provided by Joern; an abstract syntax tree and control flow graph are then constructed using the query language provided by Joern. Depending on the user's needs, query statements may be written to select particular code segments and generate the corresponding graphical representations. Likewise, a program dependency graph is generated using the Joern query language: query statements may be written to select particular variables, function calls, or other statements and track their dependencies. After the abstract syntax tree, control flow graph, and program dependency graph are generated, these graphs may be analyzed using the query language or other tools provided by Joern to generate a code characteristic diagram based on the multi-view feature representation (see 201 in FIG. 2).
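A driver script for the two-tool pipeline above might look as follows. This is a sketch: `srcml` and `joern-parse` are the tools' usual command-line entry points, but the exact flags are assumptions that should be checked against the srcML and Joern documentation. The commands are only assembled here, not executed; a real script would pass each list to `subprocess.run`.

```python
# Sketch of the srcML -> Joern pipeline described above. The flags below are
# assumptions based on the tools' typical CLIs; the lists would be passed to
# subprocess.run(...) in a real pipeline.

def build_pipeline_commands(source_file, xml_out, workspace):
    # Step 1: srcML performs lexical/syntax analysis and emits the AST as
    # XML, preserving comments and documentation in the code.
    srcml_cmd = ["srcml", source_file, "-o", xml_out]
    # Step 2: Joern ingests the code and builds the AST/CFG/PDG views that
    # can then be queried and merged into a code property graph.
    joern_cmd = ["joern-parse", source_file, "--output", workspace]
    return srcml_cmd, joern_cmd

srcml_cmd, joern_cmd = build_pipeline_commands("main.c", "main.xml", "cpg.bin")
print(srcml_cmd[0], joern_cmd[0])
```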
In one example, the abstract syntax tree may be structured as:

$$A = (V_A, E_A, \lambda_A, \mu_A)$$

where $A$ is the abstract syntax tree; $V_A$ is the set of abstract syntax tree nodes; $E_A$ is the set of edges marked with the tag function $\lambda_A$; and $\mu_A$ is the attribute function of the abstract syntax tree nodes.

In particular, $\mu_A$ may assign each abstract syntax tree node an attribute key from $K_1 = \{\text{operators}, \text{operands}, \text{numbers}\}$, with attribute values such as a code fragment or the value corresponding to that code, which is intended to reflect the different node types. In addition, the attribute function may also assign attribute keys from $K_2 = \{\text{code}, \text{order}\}$, with attribute values such as the value corresponding to the order, used to represent the ordered structure of the abstract syntax tree.
In one example, a control flow graph may be structured as follows:

$$C = (V_C, E_C, \lambda_C)$$

where $C$ is the control flow graph; $V_C$ is the set of control flow graph nodes, which represent statements or predicates in the code; $E_C$ is the set of control flow graph edges; and $\lambda_C$ is the tag function that adds a label to each edge of the control flow graph.

In particular, $\lambda_C$ may assign each edge a label from the set $\Sigma_C = \{\text{true}, \text{false}, \epsilon\}$, where true and false mark the branches of a predicate and the empty label $\epsilon$ represents no condition.
In one example, a program dependency graph may be constructed as:

$$P = (V_P, E_P, \lambda_P, \mu_P)$$

where $P$ is the program dependency graph; $V_P$ is the set of program dependency graph nodes; $E_P$ is the set of program dependency graph edges; $\lambda_P$ is the tag function of the program dependency graph; and $\mu_P$ is the attribute function of the program dependency graph.

Specifically, the nodes $V_P$ of the program dependency graph are identical to the nodes of the control flow graph, but the edges $E_P$ differ. The tag function $\lambda_P$ assigns a label of C or D to each edge of the program dependency graph; this process can be expressed as $\lambda_P: E_P \to \Sigma_P$ with $\Sigma_P = \{C, D\}$, where $\Sigma_P$ is the label set of the program dependency graph, C denotes a control dependency, and D denotes a data dependency. The attribute function $\mu_P$ assigns a symbol attribute to each data-dependency node of the program dependency graph and a condition attribute to each control-dependency node.
In one example, the code characteristic diagram formed by merging the abstract syntax tree, control flow graph, and program dependency graph may satisfy the following formulas:

$$G = (V, E, \lambda, \mu)$$

where $G$ is the code characteristic diagram; $V$ is the set of nodes of the code characteristic diagram; $E$ is the set of edges of the code characteristic diagram; $\lambda$ is the tag function of the code characteristic diagram; and $\mu$ is the attribute function of the code characteristic diagram, with:

$$V = V_A, \qquad E = E_A \cup E_C \cup E_P, \qquad \lambda = \lambda_A \cup \lambda_C \cup \lambda_P, \qquad \mu = \mu_A \cup \mu_P$$

Specifically, the nodes of the abstract syntax tree are used as the nodes of the code characteristic diagram; the edge sets and tag functions of the three graphs are combined as the edge set and tag function of the code characteristic diagram; and the attribute function of the abstract syntax tree and the attribute function of the program dependency graph are combined as the attribute function of the code characteristic diagram.
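The merge described above can be sketched with toy data structures: nodes come from the abstract syntax tree, edges (with their labels) are unioned over all three graphs, and attribute maps come from the AST and the PDG. The dict-based graph encoding here is an illustrative assumption, not Joern's actual format.

```python
# Toy sketch of merging the three graphs into the code characteristic
# (property) graph, mirroring the set unions in the formulas above.

def merge_cpg(ast, cfg, pdg):
    # Edges are (source, target, label) triples, so each graph keeps its
    # own labeling after the union.
    return {
        "nodes": set(ast["nodes"]),                           # V = V_A
        "edges": ast["edges"] | cfg["edges"] | pdg["edges"],  # E, lambda unioned
        "attrs": {**ast["attrs"], **pdg["attrs"]},            # mu = mu_A + mu_P
    }

ast = {"nodes": {1, 2, 3},
       "edges": {(1, 2, "ast"), (1, 3, "ast")},
       "attrs": {1: {"order": 0}}}
cfg = {"edges": {(2, 3, "true")}}
pdg = {"edges": {(2, 3, "D")}, "attrs": {3: {"symbol": "x"}}}

cpg = merge_cpg(ast, cfg, pdg)
print(len(cpg["edges"]))
```

Note how the CFG edge `(2, 3, "true")` and the PDG edge `(2, 3, "D")` coexist in the merged graph because each carries its own label.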
In one possible implementation, two subgraphs, namely a tree view and a stream view, may be obtained by extracting static features and dynamic features of the code from the code property graph, respectively.
In one example, detailed information expressing the semantic features and programming constructs of the program may be extracted from the abstract syntax tree, which decomposes the program into language structures organized in an ordered tree. All paths that code statements and conditions may traverse during program execution may be extracted from the control flow graph, identifying all possible execution paths of the program. The dependency relationships between statements and predicates in the code may be extracted from the program dependency graph, capturing all statements and predicates in a specified statement and their effect on variable values. By making these complex logical relationships understandable and allowing the code structure to be optimized, the program dependency graph helps in better understanding the behavior of the code.
S102, embedding the tree view and the stream view respectively to obtain a tree view representation and a stream view representation.
For example, the tree view representation and the flow view representation may each satisfy the following formulas:

$$T_i = \{t_{i,1}, t_{i,2}, \dots, t_{i,n_i^T}\}, \qquad F_i = \{f_{i,1}, f_{i,2}, \dots, f_{i,n_i^F}\}$$

where $T_i$ is the tree view representation of the $i$-th code segment (function) to be reconstructed; $t_{i,j} \in \mathbb{R}^d$ is the tree view representation of the $j$-th line of code in the abstract syntax tree of the $i$-th code segment, i.e., the $j$-th node of the tree view representation; $\mathbb{R}^d$ denotes the real field of dimension $d$; $n_i^T$ is the number of nodes of the abstract syntax tree of the $i$-th code segment to be reconstructed; $F_i$ is the flow view representation of the $i$-th code segment to be reconstructed; $f_{i,j} \in \mathbb{R}^d$ is the flow view representation of the $j$-th line of code in the control flow graph and program dependency graph of the $i$-th code segment; and $n_i^F$ is the number of nodes of the control flow graph of the $i$-th code segment to be reconstructed.

Optionally, $d$ may be set to 256. Embedding at this dimension allows the graph-structured data to be processed and analyzed more efficiently, which is convenient for applications in fields such as machine learning and data visualization.

For example, the tree view representation and the flow view representation may be merged into the same set in preparation for the subsequent determination of the hybrid characterization. The merged set may be expressed as:

$$S_i = T_i \cup F_i$$

where $S_i$ is the set merging the tree view representation and the flow view representation of the $i$-th code segment to be reconstructed.
In one possible implementation, the tree view and the flow view may be embedded by different methods so as to better reflect the structural and syntactic properties of the code. By applying tree embedding to the tree view and flow embedding to the flow view, it can be ensured that each view effectively captures its unique information, with an optimal feature extraction method set for each structure.
In one example, the tree view representation emphasizes both node types and the hierarchical structure, which effectively reflects the static structure of the code. The tree view representation (see 202a in FIG. 2) may be obtained using an embedding technique such as CodeBERT, extracting continuously distributed vectors from the tree view with a series of algorithms and models.
Specifically, first, a subgraph view containing only the abstract syntax tree can be extracted from the code characteristic diagram; the pre-trained CodeBERT model is then loaded using the model-loading function provided by the library and fine-tuned on a code dataset for the specific task to improve the model's accuracy and effectiveness. The tree view is then converted into an input format acceptable to the model using functions or methods in the CodeBERT library, which includes encoding the tree view as a sequence of tokens or numbers in a particular format. Next, using the API or functions of the CodeBERT model, the processed tree view is taken as input to obtain the corresponding continuously distributed vector representations; these vectors carry the semantic information of the nodes in the tree view. Finally, the embedding vectors of the nodes are aggregated to obtain the tree view representation.
Illustratively, the re-extracted tree view is merely structurally similar to the pre-merger abstract syntax tree; the specific details differ. The tree view may simultaneously include information from the abstract syntax tree, the control flow graph, and the program dependency graph; for example, the edges of the tree view may simultaneously include edges of all three graphs.
In one example, the flow view characterization may emphasize dynamic features such as the execution flow, control logic, and path and flow control of the code. The flow view characterization (see 202b in FIG. 2) may be obtained using an embedding technique such as Node2Vec, extracting continuously distributed vectors from the control flow and program dependencies with specific algorithms and models.
Specifically, first, a subgraph flow view containing only the control flow graph and the program dependency graph may be extracted from the code characteristic diagram; a Node2Vec model is then installed and trained on the constructed subgraph using the training functions provided in the Node2Vec library. The model learns a continuously distributed vector representation of the nodes from the structure of the graph and the neighbor relations of the nodes. After training, the functions or methods of the Node2Vec model are used to extract the continuously distributed vectors of the nodes from the model; these vectors represent the nodes in a low-dimensional vector space. Finally, an aggregation operation is performed on the embedding vectors of the nodes to obtain the flow view characterization.
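The Node2Vec step above first samples random walks over the flow view and then feeds them to a skip-gram model. The sketch below covers only the walk-sampling stage, using unbiased uniform sampling with the standard library; a real pipeline would use the node2vec/gensim packages and Node2Vec's biased p/q sampling. The toy flow-view adjacency here is invented for illustration.

```python
# Minimal random-walk sampling over a toy flow view (unbiased; Node2Vec's
# actual walks are biased by its p/q return/in-out parameters).
import random

def sample_walks(adjacency, walk_length, walks_per_node, seed=0):
    rng = random.Random(seed)
    walks = []
    for start in sorted(adjacency):           # deterministic node order
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:             # dead end (e.g. exit node)
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

flow_view = {"entry": ["cond"], "cond": ["then", "else"], "then": ["exit"],
             "else": ["exit"], "exit": []}
walks = sample_walks(flow_view, walk_length=4, walks_per_node=2)
print(len(walks))
```

The resulting walks play the role of "sentences" over graph nodes; a skip-gram model trained on them yields the continuously distributed node vectors described above.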
S103, extracting the annotation of the code to be reconstructed, and obtaining annotation characterization.
In some embodiments, the annotations of the code to be reconstructed may be extracted first; the code annotations are then split, embedded, and aggregated in sequence to obtain the annotation characterization.
In one possible implementation, to extract more code content features, the code to be reconstructed may be input into a large language model such as ChatGPT to generate code annotations. In the code characteristic diagram, the code annotations can be used as an additional feature for code analysis.
In one example, when instructions are input into the large model, the impact of prompt engineering (Prompt Engineering) on the resulting code annotations may be considered.
For example, a prompt with the following structure can be input into the model to obtain better code annotations: "Acting as an excellent engineer in the (C language) field, annotate each line in the following (C language) code, and consider the (correctness, comprehensiveness, and accuracy) of the annotations: """<the code is inserted here>""", and give a summary of the code's content features at the end." The content in brackets may be replaced as the code language and the requirements on the annotations change.
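The parameterization described above — the bracketed parts change with the language and the annotation requirements — can be sketched as a template. The template wording is an illustrative assumption, not the patent's exact prompt.

```python
# Hedged sketch of the prompt template: language and quality criteria are
# the replaceable bracketed parts mentioned above.

PROMPT_TEMPLATE = (
    "Acting as an excellent engineer in the {language} field, annotate each "
    "line of the following {language} code, and consider the {criteria} of "
    'the annotations: """{code}""". '
    "Finally, give a summary of the content features of the code."
)

def build_prompt(code, language="C language",
                 criteria="correctness, comprehensiveness and accuracy"):
    return PROMPT_TEMPLATE.format(language=language, criteria=criteria,
                                  code=code)

prompt = build_prompt("int add(int a, int b) { return a + b; }")
print("C language" in prompt)
```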
In one possible implementation, first, the code annotations obtained from the large model may be segmented into words or phrases using common segmentation tools such as NLTK (Natural Language Toolkit). Next, a vocabulary is constructed from the segmented annotation text, mapping each unique word to a unique number; the words in the vocabulary are passed as input to a Word2Vec model for embedding. The Word2Vec model is then trained on the text data containing the segmented code annotations and the constructed vocabulary, learning a continuously distributed representation of each word from a large number of token sequences. After training is completed, the embedding vector of each word in an annotation is obtained by feeding the word to the trained Word2Vec model; these vectors reflect the semantics of the word and its context information. Finally, the embedding vectors of the words contained in a code annotation are combined by averaging, weighted averaging, or another aggregation operation to obtain the embedded representation of the whole annotation: a continuously distributed vector that captures the semantic information of the code annotation, producing the annotation characterization of the code to be reconstructed.
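The tokenize–embed–average pipeline above can be sketched with a toy, hand-made embedding table; a real pipeline would use NLTK for tokenization and a trained Word2Vec model (e.g. gensim) in place of the table.

```python
# Toy annotation-embedding pipeline: split the annotation into words, look
# each up in a (hypothetical) embedding table, and average the vectors to
# get one annotation representation.
import re

def tokenize(text):
    return re.findall(r"[A-Za-z0-9_]+", text.lower())

def embed_annotation(text, table, dim):
    vectors = [table[w] for w in tokenize(text) if w in table]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Hypothetical 2-dimensional embedding table standing in for Word2Vec.
table = {"adds": [1.0, 0.0], "two": [0.0, 1.0], "numbers": [1.0, 1.0]}
rep = embed_annotation("Adds two numbers.", table, dim=2)
print(rep)
```

Weighted averaging (e.g. by TF-IDF) would replace the plain mean in `embed_annotation` without changing the rest of the pipeline.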
S104, merging the tree view representation, the flow view representation and the annotation representation in a multiple way to obtain the mixed representation of the code to be reconstructed.
In some embodiments, the tree view representation and the flow view representation may be fused first to obtain a sub-mix representation; and fusing the sub-mixed characterization and the annotation characterization to obtain the mixed characterization.
In one possible implementation, the tree view characterization and the flow view characterization may first be normalized to obtain a normalized tree view characterization and a normalized flow view characterization. The normalized characterizations are then fused by the compact bilinear pooling technique to obtain the tree-flow inner product. Finally, feature mapping is performed on the tree-flow inner product by the tensor sketch projection algorithm, mapping it into a lower-dimensional space and fusing it to obtain the sub-hybrid characterization.
In one example, statistics of the mean, standard deviation, minimum, and maximum of each feature in the tree view and the flow view may be determined; a scaling transformation is then applied using the standard-score normalization method, and the feature range is adjusted using the min-max normalization method, finally yielding the normalized tree view characterization and the normalized flow view characterization.
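The two normalization schemes mentioned above can be sketched per feature dimension; the toy feature values are invented for illustration.

```python
# Standard-score (z-score) and min-max normalization of one feature dimension.

def zscore(values):
    # Center on the mean and scale by the (population) standard deviation.
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std if std else 0.0 for v in values]

def minmax(values):
    # Rescale the feature range to [0, 1].
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi != lo else 0.0 for v in values]

feature = [1.0, 2.0, 3.0]
print(minmax(feature))
```

Either transform (or both in sequence, as the passage above suggests) leaves the views on comparable scales before compact bilinear pooling.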
In one example, based on the tensor sketch projection algorithm, the tree-flow inner product may be subjected to two random feature mappings through two sets of hash functions and sign functions, which redistribute and weight the dimensions of the tree-flow inner product to obtain a first low-dimensional feature representation and a second low-dimensional feature representation of the tree-flow inner product. A Fourier transform is then applied to each of the two low-dimensional feature representations, yielding a transformed first and second low-dimensional feature representation. The transformed representations are multiplied element by element to obtain the Hadamard product of the tree-flow inner product. An inverse Fourier transform converts this Hadamard product back to the time domain, yielding a feature mapping $\Psi(z_i) \in \mathbb{R}^{d'}$ (i.e., the sub-hybrid characterization), where $d'$ is the dimension of the sub-hybrid characterization.

Specifically, two sets of random but fixed hash functions $h_k$ and sign functions $s_k$ may first be generated, where $k = 1, 2$. These two functions convert the input data randomly but repeatably, providing the basis for the random feature mappings in the dimensionality-reduction process. The hash function randomly maps each dimension of the input vector (i.e., the tree-flow inner product) to one of the dimensions $1$ to $d'$, achieving a random assignment of dimensions such that each dimension in the original space of the input vector is randomly mapped to some dimension of the new space. The sign function assigns a random sign ($+1$ or $-1$) to each dimension of the input vector, which increases the randomness of the random feature mapping while preserving some statistical features of the original data of the input vector.

Exemplarily, the $j$-th element of the first low-dimensional feature representation $\hat{z}^{(1)}$ satisfies

$$\hat{z}^{(1)}_j = \sum_{i:\, h_1(i) = j} s_1(i)\, z_i ,$$

that is, it is obtained by the sign-weighted summation of all elements of the input vector (i.e., the tree-flow inner product) whose hash value equals $j$, where $j$ is a positive integer with $1 \le j \le d'$ and $z$ is the tree-flow inner product. The second low-dimensional feature representation $\hat{z}^{(2)}$ is derived from the second set of hash and sign functions in the same manner as the first.

For example, if the first set of hash functions assigns the elements $x_1, x_2, x_3, x_4, x_5$ of an input vector the values 1, 2, 1, 2, 1 in turn, and the first set of sign functions yields $+1, -1, +1, -1, +1$ in turn, then the first element of the first low-dimensional feature representation obtained by this random feature mapping is $\hat{x}^{(1)}_1 = x_1 + x_3 + x_5$ and the second element is $\hat{x}^{(1)}_2 = -(x_2 + x_4)$.
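The random feature mapping above is a count sketch: each input dimension is hashed to one output dimension and multiplied by a random sign. The hash and sign assignments below are a small self-consistent example chosen for illustration.

```python
# Count-sketch random feature mapping: y[h(i)] += s(i) * x[i].

def count_sketch(x, h, s, d):
    # h[i] in 1..d is the target dimension for x[i]; s[i] in {+1, -1} its sign.
    y = [0.0] * d
    for xi, hi, si in zip(x, h, s):
        y[hi - 1] += si * xi
    return y

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # input vector x1..x5
h = [1, 2, 1, 2, 1]             # illustrative hash values
s = [1, -1, 1, -1, 1]           # illustrative signs
print(count_sketch(x, h, s, d=2))
```

With these assignments, dimension 1 accumulates x1 + x3 + x5 and dimension 2 accumulates -(x2 + x4).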
The tensor sketch projection (Tensor Sketch Projection) algorithm is an algorithm proposed by the invention for data dimensionality reduction and feature mapping, particularly in high-dimensional spaces. The algorithm approximates the feature space of the original data through random mappings and Fourier transforms; its core purpose is to preserve the structural features of the data in the high-dimensional space while reducing computational complexity. The square of the inner product in the original space is mapped to an approximation of the inner product in the new space, i.e.,

$$\langle \Psi(x), \Psi(y) \rangle \approx \langle x, y \rangle^2 ,$$

where $\Psi(\cdot)$ denotes mapping a vector of the original space into the new space by the tensor sketch projection algorithm and $x$, $y$ are vectors in the original space, which allows the similarity of the original high-dimensional data points to be efficiently computed and compared in the new low-dimensional space.
Thus, the tree-flow inner product may satisfy the following formula:

$$z_i = \mathrm{CBP}(\hat{T}_i, \hat{F}_i) = \sum_{j \in V_A^i} \sum_{k \in V_C^i} \hat{t}_{i,j} \otimes \hat{f}_{i,k}$$

where $z_i$ denotes the tree-flow inner product; $\mathrm{CBP}(\cdot,\cdot)$ denotes processing by the compact bilinear pooling technique; $\hat{T}_i$ is the normalized tree view characterization and $\hat{F}_i$ the normalized flow view characterization; $V_A^i$ is the node set of the abstract syntax tree of the $i$-th code segment to be reconstructed; $\hat{t}_{i,j}$ is the normalized tree view characterization of the $j$-th line of code in that abstract syntax tree; $V_C^i$ is the node set of the control flow graph of the $i$-th code segment to be reconstructed; $\hat{f}_{i,k}$ is the normalized flow view characterization of the $k$-th line of code in the control flow graph and program dependency graph; $\psi(\cdot)$ denotes processing each element of the tree-flow inner product by the tensor sketch projection algorithm; and $\Psi(\cdot)$ denotes processing the whole tree-flow inner product by the tensor sketch projection algorithm.
Illustratively, the sub-mixed characterization may satisfy the following formula:

H_i = TS(P_i)

wherein H_i is the sub-mixed characterization of the i-th code segment to be reconstructed, P_i is the tree flow inner product, and TS(·) denotes data dimension reduction and feature mapping by the tensor sketch projection algorithm.
In one possible implementation, the sub-mixed characterization and the annotation characterization may likewise be normalized; the normalized sub-mixed characterization and annotation characterization are then fused by the compact bilinear pooling technique to obtain a sub-annotation inner product, and finally the sub-annotation inner product is subjected to data dimension reduction and feature mapping by the tensor sketch projection algorithm to obtain the mixed characterization (see 203 in fig. 2).
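The two-stage fusion can be sketched as follows. The dimensions and random vectors below are hypothetical stand-ins for the actual tree view, flow view and annotation characterizations; the compact bilinear pooling step follows the common count-sketch/FFT formulation (count-sketch both inputs, multiply in the Fourier domain, transform back):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 512  # output dimension of each fused representation (hypothetical)

def count_sketch(v, h, s, d):
    y = np.zeros(d)
    np.add.at(y, h, s * v)
    return y

def compact_bilinear_fuse(a, b, d, rng):
    """Fuse two L2-normalized vectors; approximates their flattened outer product."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    ha, hb = rng.integers(0, d, a.size), rng.integers(0, d, b.size)
    sa, sb = rng.choice([-1.0, 1.0], a.size), rng.choice([-1.0, 1.0], b.size)
    fa = np.fft.fft(count_sketch(a, ha, sa, d))
    fb = np.fft.fft(count_sketch(b, hb, sb, d))
    return np.fft.ifft(fa * fb).real  # inverse FFT back to the time domain

tree_rep = rng.normal(size=128)   # hypothetical tree view characterization
flow_rep = rng.normal(size=128)   # hypothetical flow view characterization
sub_mixed = compact_bilinear_fuse(tree_rep, flow_rep, d, rng)   # tree-flow fusion

note_rep = rng.normal(size=64)    # hypothetical annotation characterization
mixed = compact_bilinear_fuse(sub_mixed, note_rep, d, rng)      # final mixed characterization
```

Note how the bilinear (two-dimensional) interaction between the two views is reduced to a single d-dimensional vector, which is what keeps the downstream model's parameter count small.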
S105, inputting the mixed representation into a trained code reconstruction model to obtain a reconstructed code.
The mixed representation may be input to a trained code reconstruction model, and the code reconstruction model may reconstruct the code to be reconstructed by a representation learning technique to obtain a reconstructed code.
According to the method provided by the invention, the code is reconstructed by taking the tree view, which represents the static features of the code, and the stream view, which represents its dynamic features, as the input of the code reconstruction model, so that the code reconstruction model can extract more comprehensive features of the code for reconstruction, improving the accuracy of the reconstructed code.
Further, by combining the abstract syntax tree, the control flow graph and the program dependency graph into a code characteristic diagram, the overall properties of the code to be reconstructed are analyzed from the syntactic-semantic, execution-flow and resource-dependency perspectives respectively, providing a clearer code structure for extracting the mixed characterization of the code; the organization and representation of the code can be learned through representation learning techniques, so that the effective attributes of the code are better extracted and more latent patterns and features are discovered; fusing the tree view characterization, the flow view characterization and the annotation characterization by the compact bilinear pooling technique comprehensively considers feature information of different layers and different granularities of the code to be reconstructed, improving the representation capability of the hybrid characterization; and the mixed characterization integrates the feature information of the two views, reducing the features from two dimensions to one dimension and hence the parameter count of the model, so that, used as the input of the code reconstruction model, it provides more suitable features for the machine learning classifier, improves the robustness and generalization of the model, and enables the model to better understand the code characteristic diagram and make more correct decisions.
Fig. 3 is a schematic flow chart of a training method of a code reconstruction model provided by the invention. By way of example and not limitation, the method 300 may include steps S301-S303, each of which is described below.
S301, resampling the sample data to balance the number of the reconstructed samples and the non-reconstructed samples in the sample data, and obtaining balanced sample data.
In one possible implementation, the class with more samples may be sub-sampled to reduce its number, and the class with fewer samples super-sampled to increase its number, thereby equalizing the number of samples of the different classes. This avoids the risks caused by sample imbalance when training the model, such as bias toward the majority class, inaccurate performance metrics, overfitting, and loss of important information.
In one example, the number of reconstructed samples is generally less than the number of non-reconstructed samples. Thus, the non-reconstructed samples may be sub-sampled and the reconstructed samples super-sampled until the frequencies of the two classes are the same; the sampled reconstructed samples and non-reconstructed samples are then combined to obtain the equalized sample data.
Specifically, non-reconstructed samples in the training set may be randomly removed during sub-sampling. During super-sampling, synthetic samples may be constructed by interpolating between a sample and one of its randomly chosen nearest neighbors.
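The resampling step can be sketched as follows. Class sizes and feature dimensions are hypothetical; the super-sampling interpolates toward a random near neighbor, in the spirit of SMOTE:

```python
import numpy as np

rng = np.random.default_rng(0)

def undersample(X, n_keep, rng):
    """Randomly remove majority-class samples, keeping n_keep of them."""
    idx = rng.choice(len(X), size=n_keep, replace=False)
    return X[idx]

def smote_like(X, n_new, k, rng):
    """Synthesize minority samples by interpolating toward a random near neighbor."""
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        dists = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(dists)[1:k + 1]        # k nearest neighbors (excluding self)
        j = rng.choice(nbrs)
        t = rng.random()
        out.append(X[i] + t * (X[j] - X[i]))     # random point on the segment between them
    return np.vstack(out)

non_recon = rng.normal(size=(200, 8))   # hypothetical majority: non-reconstructed samples
recon = rng.normal(size=(40, 8))        # hypothetical minority: reconstructed samples

target = (len(non_recon) + len(recon)) // 2     # meet in the middle: 120 each
non_recon_bal = undersample(non_recon, target, rng)
recon_bal = np.vstack([recon, smote_like(recon, target - len(recon), k=5, rng=rng)])
balanced = np.vstack([non_recon_bal, recon_bal])
```

After this step both classes appear with the same frequency, so the classifier sees a balanced training set.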
S302, determining the mixed representation of the equalized sample data.
Illustratively, a hybrid representation of the sample data may be extracted by steps S101-S104 of method 100 described above.
S303, training the code reconstruction model according to the mixed representation of the equalized sample data to obtain a trained code reconstruction model.
By way of example, the code reconstruction model may be trained based on a hybrid characterization of the equalized sample data to obtain a trained code reconstruction model.
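Since the patent does not fix a concrete model architecture, the training step can be illustrated with a simple stand-in: a logistic-regression classifier trained on hybrid representations to decide whether a code segment should be reconstructed. All data, dimensions and hyper-parameters below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical balanced training set: one hybrid representation per code segment,
# label 1 = "should be reconstructed", label 0 = "leave as is".
X = rng.normal(size=(240, 64))
w_true = rng.normal(size=64)          # hidden rule generating the toy labels
y = (X @ w_true > 0).astype(float)

def train_logreg(X, y, lr=0.5, epochs=300):
    """Minimal gradient-descent logistic regression as a stand-in for the model."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))  # clipped for stability
        w -= lr * X.T @ (p - y) / len(y)
    return w

w = train_logreg(X, y)
acc = ((X @ w > 0).astype(float) == y).mean()  # training accuracy
```

In practice the trained weights would be applied to the hybrid representation of unseen code to produce the reconstruction decision.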
Fig. 4 is a schematic structural diagram of a function automatic reconstruction device based on a code characteristic diagram according to an embodiment of the present invention. By way of example, and not limitation, apparatus 400 may include a processing unit 410.
The processing unit 410 may be configured to:
Extracting static features and dynamic features of codes to be reconstructed to obtain a tree view and a stream view;
Extracting annotation of the code to be reconstructed to obtain annotation characterization;
Respectively carrying out embedding processing on the tree view and the stream view to obtain tree view characterization and stream view characterization;
Merging tree view characterization, flow view characterization and annotation characterization in a multiple way to obtain a mixed characterization of the code to be reconstructed;
and inputting the mixed representation into a trained code reconstruction model to obtain a reconstructed code.
According to the device provided by the invention, the code is reconstructed by taking the tree view, which represents the static features of the code, and the stream view, which represents its dynamic features, as the input of the code reconstruction model, so that the code reconstruction model can extract more comprehensive features of the code for reconstruction, improving the accuracy of the reconstructed code.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device 500 as shown in fig. 5 may include: at least one processor 510 (only one processor is shown in fig. 5), a memory 520, and a computer program 530 stored in the memory 520 and executable on the at least one processor 510, the processor 510 implementing the steps in any of the various method embodiments described above when executing the computer program 530.
The electronic device 500 may be a processing device such as a robot, which can implement the method described above, and the embodiment of the present invention does not limit the specific type of the electronic device.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts not described or detailed in a particular embodiment, reference may be made to the related descriptions of the other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

Claims (10)

1. A method for automatically reconstructing a function based on a code characteristic map, comprising:
Extracting static features and dynamic features of codes to be reconstructed to obtain a tree view and a stream view;
Extracting the annotation of the code to be reconstructed to obtain annotation characterization;
respectively carrying out embedding processing on the tree view and the stream view to obtain tree view characterization and stream view characterization;
merging the tree view representation, the flow view representation and the annotation representation in a multiple manner to obtain a mixed representation of the code to be reconstructed;
and inputting the mixed representation into a trained code reconstruction model to obtain a reconstructed code.
2. The method of claim 1, wherein the extracting static features and dynamic features of the code to be reconstructed to obtain a tree view and a stream view comprises:
Respectively extracting structural features, behavior features and dependency features of the code to be reconstructed to generate an abstract syntax tree, a control flow graph and a program dependency graph, wherein the structural features are the static features, and the behavior features and the dependency features are the dynamic features;
merging the abstract syntax tree, the control flow graph and the program dependency graph to obtain a code characteristic graph of the code to be reconstructed based on multi-view characteristic representation;
The tree view and the stream view are extracted and obtained from the code characteristic diagram.
3. The method of claim 1, wherein the extracting the annotation of the code to be reconstructed to obtain the annotation characterization comprises:
Extracting and obtaining the annotation of the code to be reconstructed;
and sequentially carrying out splitting, embedding and aggregation operations on the annotation of the code to be reconstructed to obtain the annotation characterization.
4. The method of claim 1, wherein the merging the tree view representation, the flow view representation and the annotation representation in a multiple manner to obtain the mixed representation of the code to be reconstructed comprises:
Fusing the tree view representation and the flow view representation to obtain a sub-mixed representation;
and fusing the sub-mixed characterization and the annotation characterization to obtain the mixed characterization.
5. The method of claim 4, wherein the fusing the tree view representation and the flow view representation to obtain the sub-mixed representation comprises:
respectively carrying out normalization processing on the tree view representation and the flow view representation to obtain normalized tree view representation and normalized flow view representation;
fusing the normalized tree view representation and the normalized flow view representation by a compact bilinear pooling technology to obtain a tree flow inner product;
And performing data dimension reduction and feature mapping on the tree flow inner product through a tensor sketch projection algorithm to obtain the sub-mixed characterization.
6. The method of claim 5, wherein the performing data dimension reduction and feature mapping on the tree flow inner product by the tensor sketch projection algorithm to obtain the sub-mixed representation comprises:
Respectively carrying out random feature mapping on the tree flow inner product twice according to a hash function and a symbol function so as to carry out reassignment and weighting treatment on the dimension of the tree flow inner product, and obtaining a first low-dimensional feature representation and a second low-dimensional feature representation of the tree flow inner product;
Performing Fourier transform on the first low-dimensional feature representation and the second low-dimensional feature representation respectively to obtain a transformed first low-dimensional feature representation and a transformed second low-dimensional feature representation;
multiplying the transformed first low-dimensional feature representation and the transformed second low-dimensional feature representation element by element to obtain a hadamard product of the tree-stream inner product;
And performing inverse Fourier transform on the Hadamard product to convert the Hadamard product into a time domain, so as to obtain the sub-mixed representation.
7. The method of claim 1, wherein prior to said inputting the hybrid representation into the trained code reconstruction model to obtain the reconstructed code, the method further comprises:
Resampling the sample data to equalize the number of reconstructed samples and non-reconstructed samples in the sample data to obtain equalized sample data;
and training a code reconstruction model according to the equalized sample data to obtain the trained code reconstruction model.
8. The method of claim 7, wherein the number of the non-reconstructed samples is greater than the number of the reconstructed samples;
the resampling the sample data to equalize the number of the reconstructed samples and the non-reconstructed samples in the sample data to obtain equalized sample data comprises:
sub-sampling the non-reconstructed samples to remove part of the non-reconstructed samples, and obtaining sampled non-reconstructed samples;
supersampling the reconstructed samples to increase the number of the reconstructed samples to obtain sampled reconstructed samples;
And combining the sampled non-reconstructed sample and the sampled reconstructed sample to obtain the equalized sample data.
9. A function automatic reconstruction device based on a code characteristic diagram, which is characterized by comprising a processing unit, wherein the processing unit comprises a code reconstruction model and is used for:
Extracting static features and dynamic features of codes to be reconstructed to obtain a tree view and a stream view;
Extracting the annotation of the code to be reconstructed to obtain annotation characterization;
respectively carrying out embedding processing on the tree view and the stream view to obtain tree view characterization and stream view characterization;
merging the tree view representation, the flow view representation and the annotation representation in a multiple manner to obtain a mixed representation of the code to be reconstructed;
and inputting the mixed representation into a trained code reconstruction model to obtain a reconstructed code.
10. An electronic device comprising a memory, a processor and a computer program stored in the memory, characterized in that the processor implements the method according to any of claims 1-8 when executing the computer program.
CN202410544530.6A 2024-05-06 2024-05-06 Function automatic reconstruction method and device based on code characteristic diagram and electronic equipment Pending CN118132141A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410544530.6A CN118132141A (en) 2024-05-06 2024-05-06 Function automatic reconstruction method and device based on code characteristic diagram and electronic equipment


Publications (1)

Publication Number Publication Date
CN118132141A 2024-06-04

Family

ID=91247637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410544530.6A Pending CN118132141A (en) 2024-05-06 2024-05-06 Function automatic reconstruction method and device based on code characteristic diagram and electronic equipment

Country Status (1)

Country Link
CN (1) CN118132141A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112035165A (en) * 2020-08-26 2020-12-04 山谷网安科技股份有限公司 Code clone detection method and system based on homogeneous network
US20200401386A1 (en) * 2019-06-19 2020-12-24 International Business Machines Corporation Reconfiguring application software into microservice architecture
CN113011514A (en) * 2021-03-29 2021-06-22 吉林大学 Intracranial hemorrhage sub-type classification algorithm applied to CT image based on bilinear pooling
US20210224586A1 (en) * 2017-10-09 2021-07-22 Harbin Institute Of Technology Shenzhen Graduate School Image privacy perception method based on deep learning
CN116661852A (en) * 2023-04-06 2023-08-29 华中师范大学 Code searching method based on program dependency graph
CN116755765A (en) * 2023-07-25 2023-09-15 河北科技大学 Automatic software reconstruction method for structured concurrency


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CUI, DI等: "REMS: Recommending Extract Method Refactoring Opportunities via Multi-view Representation of Code Property Graph", 2023 IEEE/ACM 31ST INTERNATIONAL CONFERENCE ON PROGRAM COMPREHENSION, 16 August 2023 (2023-08-16) *
蓝洁;周欣;何小海;滕奇志;卿粼波;: "基于跨层精简双线性网络的细粒度鸟类识别", 科学技术与工程, no. 36, 28 December 2019 (2019-12-28) *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination