CN115964080A - Code clone detection method, system, equipment and medium based on visual image - Google Patents

Code clone detection method, system, equipment and medium based on visual image

Info

Publication number
CN115964080A
CN115964080A
Authority
CN
China
Prior art keywords
code
clone
visual image
module
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310042780.5A
Other languages
Chinese (zh)
Inventor
邱少健
彭梦晴
胡叶红
王劭晟
黄梦阳
黄晖豪
李琦伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Novi Aichuang Guangzhou Technology Co ltd
South China Agricultural University
Original Assignee
Novi Aichuang Guangzhou Technology Co ltd
South China Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Novi Aichuang Guangzhou Technology Co ltd, South China Agricultural University filed Critical Novi Aichuang Guangzhou Technology Co ltd
Priority to CN202310042780.5A
Publication of CN115964080A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a code clone detection method, system, equipment and medium based on a visual image, comprising the following steps: removing the annotations in the Java code, and calling the VoidVisitorAdapter in JavaParser to separate the code functions in the Java files; converting the code characters in each code function file into ASCII codes, filling the ASCII values into the RGB three-primary-color values to obtain RGB pixel points, and combining the pixel points into a visualized color image to obtain a code visual image; dividing the code visual images into clone code visual images and non-clone code visual images; inputting the code visual images into a pre-established clone detection model for training to obtain a trained clone detection model; and inputting the code visual image to be detected into the trained clone detection model for detection to obtain a detection result. By converting the code into a visual image, the invention preserves the code information completely and avoids the code information loss caused by conversion into intermediate graph forms.

Description

Code clone detection method, system, equipment and medium based on visual image
Technical Field
The invention belongs to the technical field of code clone detection and deep learning, and particularly relates to a code clone detection method, system, equipment and medium based on a visual image.
Background
In recent years, there have been two main approaches to code clone detection: detection based on code shape, and detection based on code syntax and semantic graphs. Detection based on code shape converts the entire code into a gray-scale map in order to distinguish the code portion from the non-code portion (the gray value of the non-code portion is 0, displayed as black), and judges code clones by comparing the shapes of the code portions shown on the map. Detection based on code syntax and semantic graphs converts the code into a graph containing syntactic information, such as an Abstract Syntax Tree (AST), or a graph containing semantic information, such as a Control Flow Graph (CFG) or a Program Dependency Graph (PDG), extracts the syntactic and semantic information of the code, and then judges code clones by comparing the similarity of the graphs.
Disadvantages of detection based on code shape: by comparing only the similarity of code shapes, the judgment relies solely on the surface of the code and cannot detect the change of a single word, character or symbol deep inside the code. This method therefore cannot distinguish Type-1 from Type-2 code clones, and it also struggles with Type-3 clones: in a Type-3 clone, the added or deleted code changes the code shape, which interferes with the shape-similarity judgment and leads to missed and erroneous detections.
Disadvantages of detection based on code syntax and semantic graphs: although this method can reach the syntactic and semantic level of the code to detect similarity, accurately matching graphs requires subgraph isomorphism, which is an NP-complete problem, so the computation grows exponentially and takes considerably more time. Meanwhile, the process involves converting the code into an intermediate graph form, which easily causes loss of code information and leads to missed and erroneous detections.
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art and provide a code clone detection method, a system, equipment and a medium based on a visual image.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a code clone detection method based on a visual image, which comprises the following steps:
obtaining Java code data in the form of java files; removing the annotations of the code in the java files, calling the VoidVisitorAdapter in JavaParser to separate the code functions in each java file, and storing each separated code function in a separate java file to obtain the code function files;
converting the code characters in the code function file into ASCII codes, filling the ASCII codes into RGB three-primary-color values to obtain RGB pixel points, and combining the pixel points into a visualized color image to obtain a code visualized image;
dividing the code visual image into a clone code visual image and a non-clone code visual image, and synthesizing a new clone code visual image through an SMOTE algorithm;
inputting the clone code visual image and the non-clone code visual image into a pre-established clone detection model for training to obtain a trained clone detection model;
the clone detection model comprises two Transformer sub-networks that share weights; each Transformer sub-network comprises Transformer encoding modules and a sparse attention module; the Transformer encoding modules are used for extracting the features of the code visual image, and the sparse attention module is used for finding discriminative pixel blocks in the code visual image and passing the corresponding latent features to the next encoding module; the training process comprises: inputting a clone code visual image pair or a non-clone code visual image pair into the two Transformer sub-networks, which map the pair to a high-dimensional feature space and output the corresponding characterizations, from which a characterization distance is computed with the Contrastive Loss function;
and inputting the code visual image to be detected into the trained clone detection model for similarity detection to obtain a detection result.
As a preferred technical solution, the annotations removed from the code in the java files include single-line comments, multi-line comments and documentation comments.
As a preferred technical solution, the code characters in the code function file are converted into ASCII codes, the ASCII values are filled into the RGB three-primary-color values to obtain RGB pixel points, and the pixel points are combined into a visualized color image to obtain a code visual image; specifically:
the three primary colors are red R, green G and blue B;
the size of a generated java file is between 0 and 30 kB, and the file is converted according to the rule that every 3 characters become 1 pixel, so that the number of pixel points obtained after conversion is between 0 and 10240;
meanwhile, red R, green G and blue B are ordered in several different ways, and the ASCII codes are filled into the three primary color values in a different order each time, so that several different visualized color images are obtained;
in addition, since the method functions in the source code differ, the code visual images converted from the java files differ in size; therefore, code visual images larger than a preset size are cropped to the preset size, and code visual images smaller than the preset size are padded with 0 to the preset size while keeping the original pixels close to the center of the padded image.
As a preferred technical solution, synthesizing new clone code visual images through the SMOTE algorithm means: if the number of clone code visual images is far smaller than the number of non-clone code visual images, the SMOTE algorithm is adopted; for each sample x among the clone code visual images, a clone code visual image y is randomly selected from the neighbors of x, and x and y are then interpolated into a new clone code visual image, which reduces the risk of over-fitting; afterwards, clone code visual images are labeled 1 and non-clone code visual images are labeled 0.
As a preferred technical solution, the Transformer encoding module is specifically: the Transformer sub-network comprises several Transformer encoding modules with the same structure; each Transformer encoding module comprises a multi-head self-attention module and a multi-layer perceptron module;
the multi-head self-attention module comprises several self-attention heads; different heads learn relevant features in mutually independent feature subspaces, and the outputs of all heads are finally concatenated and linearly transformed to obtain the output of the multi-head self-attention module; this output is combined with the input matrix through a residual connection, i.e. matrix addition, and layer normalization is finally applied to produce the input of the following multi-layer perceptron module;
the multi-layer perceptron module comprises two fully connected layers; the activation function of the first layer is ReLU and no activation function is used in the second layer, which deepens the module's ability to fit complex mappings and strengthens the clone detection model.
As a preferred technical solution, the sparse attention module is specifically: if the Transformer sub-network comprises L Transformer encoding modules, the sparse attention module screens the latent features fed into the last Transformer encoding module using the weights learned by the first L-1 Transformer encoding modules; owing to the abstraction of high-level features, a single attention map has difficulty expressing the feature information of the corresponding input image blocks; therefore, the attention map information learned by all previous Transformer encoding modules is used, and the weight of each attention map is learned autonomously with a squeeze-and-excitation module: the sparse attention module first fuses the previously obtained attention maps into a two-dimensional matrix through average pooling, then models the correlation between the attention maps with two fully connected layers to obtain the weight value of each attention map; finally, the weight values are normalized and used to compute a weighted sum of the attention maps, yielding the final attention weights.
As a preferred technical solution, the Transformer sub-network maps the clone code visual image pair or the non-clone code visual image pair to a high-dimensional feature space, outputs the corresponding characterizations, and computes a characterization distance from the characterizations with the Contrastive Loss function; specifically:
firstly, each image of a clone code visual image pair or a non-clone code visual image pair is divided into N image blocks of the same size, the image blocks are linearly mapped into serialized embedding vectors, and a learnable classification vector and position encoding information are added; secondly, the embedding vectors are combined into a matrix and input into several Transformer encoding modules for feature extraction; before the last Transformer encoding module, the corresponding latent features are input into the sparse attention module to find the discriminative pixel blocks of the pair; finally, the classification features output by the last Transformer encoding module are processed through a fully connected layer to obtain the category information of the pair;
secondly, with the help of the Contrastive Loss function, training drives the characterization distance of a clone code visual image pair to be small and that of a non-clone code visual image pair to be large; the computation proceeds as follows:
first, a pair of sample characterizations $(X_a, X_b)$ is selected; their Euclidean distance is:

$$d = \lVert X_a - X_b \rVert_2 = \sqrt{\sum_{i=1}^{n} \left( X_a^{i} - X_b^{i} \right)^2}$$
wherein $X_a$ denotes the characterization of visual image sample a, and $X_b$ denotes the characterization of visual image sample b;
the contextual Loss function is then expressed as:
Figure BDA0004051112050000041
wherein $Y$ denotes the label ($Y = 1$ for a clone pair, $Y = 0$ for a non-clone pair), $d$ denotes the Euclidean distance, and $m$ denotes the distance margin of the samples; when the distance of a non-clone pair $(X_a, X_b)$ exceeds $m$, its loss term becomes 0, so pairs that are already sufficiently separated are not pushed further apart, which preserves the generalization ability of the algorithm to a certain extent;
finally, the trained clone detection model outputs the sample characterization distance based on the generated features; if the characterization distance is below 0.5, the source codes of the two input visual images are judged to be a pair of clone codes.
The invention further provides a code clone detection system based on a visual image, applied to the above code clone detection method based on a visual image, comprising a data set making module, a code visualization module, a data preprocessing module, a clone detection model building module and a clone detection module;
the data set making module is used for obtaining Java code data in the form of java files; removing the annotations of the code in the java files, calling the VoidVisitorAdapter in JavaParser to separate the code functions in each java file, and storing each separated code function in a separate java file to obtain the code function files;
the code visualization module is used for converting the code characters in the code function file into ASCII codes, filling the ASCII values into the RGB three-primary-color values to obtain RGB pixel points, and combining the pixel points into a visualized color image to obtain a code visual image;
the data preprocessing module is used for dividing the visual code image into a clone code visual image and a non-clone code visual image and synthesizing a new clone code visual image through an SMOTE algorithm;
the clone detection model construction module is used for inputting the clone code visual image and the non-clone code visual image into a preset clone detection model for training to obtain a trained clone detection model;
the clone detection model comprises two Transformer sub-networks that share weights; each Transformer sub-network comprises Transformer encoding modules and a sparse attention module; the Transformer encoding modules are used for extracting the features of the code visual image, and the sparse attention module is used for finding discriminative pixel blocks in the code visual image and passing the corresponding latent features to the next encoding module; the training process comprises: inputting a clone code visual image pair or a non-clone code visual image pair into the two Transformer sub-networks, which map the pair to a high-dimensional feature space and output the corresponding characterizations, from which a characterization distance is computed with the Contrastive Loss function;
and the clone detection module is used for inputting the code visual image to be detected into the trained clone detection model for similarity detection to obtain a detection result.
Yet another aspect of the present invention provides an electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores computer program instructions executable by the at least one processor to cause the at least one processor to perform the visual image-based code clone detection method.
Still another aspect of the present invention provides a computer-readable storage medium storing a program which, when executed by a processor, implements the method for detecting code clones based on a visual image.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. the invention can carry out clone detection based on the ASCII coding level of the code, is accurate to each character of the code and goes deep into all details of the code, avoids the confusion of code clone types caused by simply judging the surface of the code or detection mistakes and omissions caused by simply taking the shape of the code as a detection basis, and further more accurately judges the code clone.
2. The ASCII coding of the codes is filled into the three primary color values of the pixel points, and the formed visual image can completely reserve the source code information and cannot cause the loss of the code information.
3. The codes are converted into images, and the similarity of the codes is judged by comparing the similarity of the images, so that the process is simple, popular and easy to understand.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a flow chart of a code clone detection method based on a visual image according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of code visualization according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a clone detection model according to an embodiment of the present invention;
FIG. 4 is a block diagram of a code clone detection system based on a visual image according to an embodiment of the present invention;
Fig. 5 is a block diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
Interpretation of terms: code clone types
Type-1 (complete clone): the two code fragments are identical (except for comments and whitespace).
Type-2 (renamed clone): the two code fragments are identical except for modifications to the names of variables, types, literals and functions.
Type-3 (clone modified by additions and deletions): the two code fragments are similar, with some statements added, deleted or modified, and the layout of the code changed.
Type-4 (semantic clone): the two code fragments implement the same functionality but in different ways.
Referring to fig. 1, in an embodiment of the present application, a code clone detection method based on a visual image is provided, which includes the following steps:
s1, obtaining Java code data in a Java file format; removing the annotation of the codes in the java file, calling the VoidVisitorAdap in the java parser, separating the code functions in each java file, and storing each separated code function in a separate java file to obtain a code function file.
Furthermore, a large amount of Java code data in the form of java files is acquired from BigCloneBench and GitHub; the annotations of the code in the java files, including single-line comments, multi-line comments and documentation comments, are removed; the VoidVisitorAdapter in JavaParser is called to separate the code functions in each java file, and each separated code function is stored in a separate java file to obtain the code function files; the same operations are also performed on the data to be detected.
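As an illustration, a minimal Python sketch of the annotation-removal step is given below, assuming plain .java text as input. The function separation itself is performed with the VoidVisitorAdapter of JavaParser (a Java library) and is not reproduced here; the regex only mirrors the removal of single-line, multi-line and documentation comments while leaving string and character literals intact.

```python
import re

# A minimal sketch of the comment-removal step, assuming plain .java text.
# The patent separates functions with JavaParser's VoidVisitorAdapter (a
# Java library); that parsing step is not reproduced here.
def strip_java_comments(source: str) -> str:
    """Remove single-line (//), multi-line (/* */) and documentation (/** */)
    comments while leaving string and character literals untouched."""
    pattern = re.compile(
        r'//[^\n]*'               # single-line comments
        r'|/\*.*?\*/'             # multi-line and documentation comments
        r'|("(?:\\.|[^"\\])*")'   # double-quoted string literals (kept)
        r"|('(?:\\.|[^'\\])*')",  # character literals (kept)
        re.DOTALL,
    )
    # Matched literals are captured and kept; matched comments are dropped.
    return pattern.sub(lambda m: m.group(1) or m.group(2) or '', source)
```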
S2, converting the code characters in the code function file into ASCII codes, filling the ASCII codes into RGB three-primary-color values, obtaining RGB pixel points, and combining the pixel points into a visualized color image to obtain a code visualized image.
Further, the three primary colors are red R, green G and blue B; each character can be converted into its ASCII code, and the ASCII value can serve as one of the RGB three-primary-color values; for example, converting the character string "public" into ASCII codes gives p (112), u (117), b (98), l (108), i (105), c (99), yielding two three-primary-color values [112,117,98] and [108,105,99], i.e. a yellow-gray pixel point and a brown-gray pixel point, as shown in fig. 2.
After the source code of the selected data set is separated by function in step S1, a java file is generally between 0 kB and 30 kB in size and is converted according to the following rule: every 3 characters are converted into 1 pixel, so that the pixel count obtained after conversion is between 0 and 10240 (30 × 1024 / 3).
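The conversion rule can be sketched in a few lines of Python; padding a trailing 1- or 2-character group with NUL characters is an assumption, since the patent does not specify the behaviour for the remainder.

```python
# A sketch of the 3-characters-to-1-pixel rule. Padding the tail with NUL
# characters (value 0) when the length is not a multiple of 3 is an
# assumption; the patent does not specify the behaviour for the remainder.
def chars_to_pixels(code: str) -> list[tuple[int, int, int]]:
    padded = code + '\0' * (-len(code) % 3)
    return [
        (ord(padded[i]), ord(padded[i + 1]), ord(padded[i + 2]))
        for i in range(0, len(padded), 3)
    ]

# "public" -> [(112, 117, 98), (108, 105, 99)], matching the example above.
print(chars_to_pixels("public"))
```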
Meanwhile, red R, green G and blue B are ordered in 6 different ways: RBG, RGB, BGR, BRG, GRB and GBR; the ASCII codes are filled into the three primary color values in a different order each time, so that 6 different visualized color images can be obtained.
In addition, since the method functions in the source code differ, the code visual images converted from the java files differ in size; to address this, code visual images larger than 105 × 105 are cropped to 105 × 105, and code visual images smaller than 105 × 105 are padded with 0 to 105 × 105 while keeping the original pixels close to the center of the padded image.
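A sketch of the cropping, centered padding and channel-reordering steps follows; it assumes the image is a numpy array of shape (H, W, 3), and the exact centering rule is an assumption.

```python
import numpy as np
from itertools import permutations

TARGET = 105  # preset size used in this embodiment

# A sketch of the resizing step, assuming images are numpy arrays of shape
# (H, W, 3). Larger images are cropped to 105x105; smaller ones are padded
# with 0 so that the original pixels sit near the center.
def fit_to_target(img: np.ndarray) -> np.ndarray:
    img = img[:TARGET, :TARGET]                      # crop if too large
    canvas = np.zeros((TARGET, TARGET, 3), dtype=np.uint8)
    top = (TARGET - img.shape[0]) // 2
    left = (TARGET - img.shape[1]) // 2
    canvas[top:top + img.shape[0], left:left + img.shape[1]] = img
    return canvas

# The six channel orderings (RGB, RBG, GRB, GBR, BRG, BGR) are permutations
# of the last axis:
def channel_variants(img: np.ndarray) -> list[np.ndarray]:
    return [img[:, :, list(p)] for p in permutations(range(3))]
```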
And S3, dividing the code visual image into a clone code visual image and a non-clone code visual image, and synthesizing a new clone code visual image through an SMOTE algorithm.
Further, synthesizing new clone code visual images through the SMOTE algorithm means: if the number of clone code visual images is far smaller than the number of non-clone code visual images, the SMOTE algorithm is adopted; for each sample x among the clone code visual images, a clone code visual image y is randomly selected from the neighbors of x, and x and y are then interpolated into a new clone code visual image; this oversampling method of synthesizing new samples reduces the risk of over-fitting. Afterwards, clone code visual images are labeled 1 and non-clone code visual images are labeled 0, and the data set is divided into a 70% training set and a 30% validation set, where the training set is used to train the model and the validation set is used to evaluate the trained model.
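A hedged sketch of this balancing and splitting step is shown below; it uses the SMOTE implementation from the third-party imbalanced-learn package on flattened images, and the variable names are illustrative.

```python
import numpy as np
from imblearn.over_sampling import SMOTE          # third-party: imbalanced-learn
from sklearn.model_selection import train_test_split

# A sketch of the oversampling and 70/30 split. Flattening the 105x105x3
# images lets SMOTE interpolate between a clone sample and a random neighbor.
def balance_and_split(images: np.ndarray, labels: np.ndarray):
    flat = images.reshape(len(images), -1)        # (N, 105*105*3)
    flat, labels = SMOTE(random_state=0).fit_resample(flat, labels)
    x_train, x_val, y_train, y_val = train_test_split(
        flat, labels, test_size=0.3, stratify=labels, random_state=0)
    shape = (-1,) + images.shape[1:]
    return x_train.reshape(shape), y_train, x_val.reshape(shape), y_val
```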
S4, inputting the clone code visual images and the non-clone code visual images into a preset clone detection model for training to obtain a trained clone detection model; the clone detection model comprises two Transformer sub-networks that share weights; each Transformer sub-network comprises Transformer encoding modules and a sparse attention module; the Transformer encoding modules are used for extracting the features of the code visual image, and the sparse attention module is used for finding discriminative pixel blocks in the code visual image and passing the corresponding latent features to the next encoding module; the training process comprises: inputting a clone code visual image pair or a non-clone code visual image pair into the two Transformer sub-networks, which map the pair to a high-dimensional feature space and output the corresponding characterizations, from which a characterization distance is computed with the Contrastive Loss function.
Further, the Transformer encoding module is specifically: the Transformer sub-network comprises several Transformer encoding modules with the same structure; each Transformer encoding module comprises a multi-head self-attention module and a multi-layer perceptron module;
the multi-head self-attention module comprises several self-attention heads; different heads learn relevant features in mutually independent feature subspaces, and the outputs of all heads are finally concatenated and linearly transformed to obtain the output of the multi-head self-attention module; this output is combined with the input matrix through a residual connection, i.e. matrix addition, and layer normalization is finally applied to produce the input of the following multi-layer perceptron module;
the multi-layer perceptron module comprises two fully connected layers; the activation function of the first layer is ReLU and no activation function is used in the second layer, which deepens the module's ability to fit complex mappings and strengthens the clone detection model.
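A minimal PyTorch sketch of one such encoding module follows; the embedding width, head count and hidden size are illustrative assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

# One encoding module as described: multi-head self-attention, then a
# two-layer perceptron (ReLU after the first layer only), each followed by
# a residual connection and layer normalization.
class EncoderBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8, hidden: int = 1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),    # first layer: ReLU
            nn.Linear(hidden, dim),               # second layer: no activation
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor):
        attn_out, attn_map = self.attn(x, x, x)   # attention map kept for reuse
        x = self.norm1(x + attn_out)              # residual + layer norm
        x = self.norm2(x + self.mlp(x))           # residual + layer norm
        return x, attn_map
```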
The sparse attention module is specifically: in order to fully utilize the weight information to locate the discriminative regions, a sparse attention module is provided; if the Transformer sub-network comprises L Transformer encoding modules, the sparse attention module screens the latent features fed into the last Transformer encoding module using the weights learned by the first L-1 Transformer encoding modules; owing to the abstraction of high-level features, a single attention map has difficulty expressing the feature information of the corresponding input image blocks; therefore, the attention map information learned by all previous Transformer encoding modules is used, and the weight of each attention map is learned autonomously with a squeeze-and-excitation module: the sparse attention module first fuses the previously obtained attention maps into a two-dimensional matrix through average pooling, then models the correlation between the attention maps with two fully connected layers to obtain the weight value of each attention map; finally, the weight values are normalized and used to compute a weighted sum of the attention maps, yielding the final attention weights.
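The following PyTorch sketch shows one way to realize this weighting; it assumes `maps` stacks the (head-averaged) attention maps of the first L-1 encoding modules as a tensor of shape (B, L-1, N, N) and `tokens` holds the latent features with the classification token at index 0. The reduction ratio and the number k of selected blocks are assumptions.

```python
import torch
import torch.nn as nn

# Squeeze-and-excitation style weighting of the earlier attention maps,
# followed by selection of the most discriminative patch tokens.
class SparseAttention(nn.Module):
    def __init__(self, num_maps: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(num_maps, num_maps // reduction), nn.ReLU(),
            nn.Linear(num_maps // reduction, num_maps),
        )

    def forward(self, maps: torch.Tensor, tokens: torch.Tensor, k: int = 12):
        pooled = maps.mean(dim=(2, 3))                    # average pooling: (B, L-1)
        weights = torch.softmax(self.fc(pooled), dim=-1)  # per-map weights
        fused = (weights[:, :, None, None] * maps).sum(dim=1)  # (B, N, N)
        # Importance of each patch = attention it receives from the
        # classification token (assumed to sit at index 0).
        scores = fused[:, 0, 1:]
        idx = scores.topk(k, dim=-1).indices + 1          # skip the class token
        picked = torch.gather(
            tokens, 1, idx[..., None].expand(-1, -1, tokens.size(-1)))
        # Class token plus the k selected tokens feed the last encoding module.
        return torch.cat([tokens[:, :1], picked], dim=1)
```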
Specifically, as shown in fig. 3, the Transformer sub-network maps the clone code visual image pair or the non-clone code visual image pair to a high-dimensional feature space, outputs the corresponding characterizations, and computes a characterization distance from the characterizations with the Contrastive Loss function; specifically:
firstly, each image of a clone code visual image pair or a non-clone code visual image pair is divided into N image blocks of the same size, the image blocks are linearly mapped into serialized embedding vectors, and a learnable classification vector and position encoding information are added; secondly, the embedding vectors are combined into a matrix and input into several Transformer encoding modules for feature extraction; before the last Transformer encoding module, the corresponding latent features are input into the sparse attention module to find the discriminative pixel blocks of the pair; finally, the classification features output by the last Transformer encoding module are processed through a fully connected layer to obtain the category information of the pair;
secondly, with the help of the Contrastive Loss function, training drives the characterization distance of a clone code visual image pair to be small and that of a non-clone code visual image pair to be large; the computation proceeds as follows:
first, a pair of sample characterizations $(X_a, X_b)$ is selected; their Euclidean distance is:

$$d = \lVert X_a - X_b \rVert_2 = \sqrt{\sum_{i=1}^{n} \left( X_a^{i} - X_b^{i} \right)^2}$$
wherein $X_a$ denotes the characterization of visual image sample a, and $X_b$ denotes the characterization of visual image sample b.
The Contrastive Loss function is then expressed as:

$$L = \frac{1}{2N} \sum_{n=1}^{N} \left[ Y d^2 + (1 - Y) \max(m - d,\ 0)^2 \right]$$
wherein $Y$ denotes the label ($Y = 1$ for a clone pair, $Y = 0$ for a non-clone pair), $d$ denotes the Euclidean distance, and $m$ denotes the distance margin of the samples; when the distance of a non-clone pair $(X_a, X_b)$ exceeds $m$, its loss term becomes 0, so pairs that are already sufficiently separated are not pushed further apart, which preserves the generalization ability of the algorithm to a certain extent;
finally, the trained clone detection model outputs the sample characterization distance based on the generated features; if the characterization distance is below 0.5, the source codes of the two input visual images are judged to be a pair of clone codes.
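Written as a PyTorch sketch (assuming z_a and z_b are the characterizations output by the two sub-networks and y is the 0/1 label), the loss and the decision rule read:

```python
import torch

# Contrastive Loss as given above: clone pairs (y = 1) are pulled together,
# non-clone pairs (y = 0) are pushed apart up to the margin m.
def contrastive_loss(z_a, z_b, y, margin: float = 1.0):
    d = torch.norm(z_a - z_b, p=2, dim=1)                 # Euclidean distance
    loss = y * d.pow(2) + (1 - y) * torch.clamp(margin - d, min=0).pow(2)
    return 0.5 * loss.mean()

# Decision rule of this embodiment: a pair is a clone if the distance < 0.5.
def is_clone_pair(z_a, z_b, threshold: float = 0.5):
    return torch.norm(z_a - z_b, p=2, dim=1) < threshold
```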
And S5, inputting the code visual image to be detected into the trained clone detection model for similarity detection to obtain a detection result.
Further, the code visual image to be detected is input into the trained clone detection model for similarity detection; the detection result judges code similarity from image similarity.
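Tying the steps together, a hypothetical end-to-end detection call could look as follows; it reuses the helper names introduced in the earlier sketches (strip_java_comments, chars_to_pixels, fit_to_target, TARGET, is_clone_pair), and `model` stands for the trained Transformer sub-network.

```python
import numpy as np
import torch

# Hypothetical glue code: java file -> code visual image -> characterization.
def load_image(path: str) -> torch.Tensor:
    code = strip_java_comments(open(path, encoding='utf-8').read())
    pixels = np.array(chars_to_pixels(code), dtype=np.uint8)
    rows = max(1, -(-len(pixels) // TARGET))           # ceil division
    grid = np.zeros((rows * TARGET, 3), dtype=np.uint8)
    grid[:len(pixels)] = pixels
    img = fit_to_target(grid.reshape(rows, TARGET, 3))
    return torch.from_numpy(img).permute(2, 0, 1).float().unsqueeze(0)

def detect(path_a: str, path_b: str, model: torch.nn.Module) -> bool:
    with torch.no_grad():
        z_a, z_b = model(load_image(path_a)), model(load_image(path_b))
    return bool(is_clone_pair(z_a, z_b).item())
```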
It should be noted that, for simplicity, the foregoing method embodiments are described as a series of combined actions, but those skilled in the art should understand that the present invention is not limited by the described order of actions, as some steps may be performed in other orders or simultaneously according to the present invention.
Based on the same idea as the visualized image-based code clone detection method in the above embodiment, the present invention further provides a visualized image-based code clone detection system, which can be used to execute the visualized image-based code clone detection method. For convenience of illustration, only the parts related to the embodiments of the present invention are shown in the schematic structural diagram of the code clone detection system based on visual images, and those skilled in the art will understand that the illustrated structure does not constitute a limitation to the device, and may include more or less components than those illustrated, or combine some components, or arrange different components.
Referring to fig. 4, in another embodiment of the present application, a code clone detection system 100 based on visual images is provided, which includes a data set making module 101, a code visualization module 102, a data preprocessing module 103, a clone detection model building module 104, and a clone detection module 105;
the data set making module 101 is used for obtaining Java code data in the form of java files; removing the annotations of the code in the java files, calling the VoidVisitorAdapter in JavaParser to separate the code functions in each java file, and storing each separated code function in a separate java file to obtain the code function files;
the code visualization module 102 is configured to convert code characters in the code function file into ASCII codes, fill the ASCII codes with RGB tristimulus color values to obtain RGB pixel points, and combine the pixel points into a visualized color image to obtain a code visualized image;
the data preprocessing module 103 is configured to divide the code visualization image into a clone code visualization image and a non-clone code visualization image, and synthesize a new clone code visualization image through an SMOTE algorithm;
the clone detection model building module 104 is configured to input the clone code visual image and the non-clone code visual image into a preset clone detection model for training to obtain a trained clone detection model;
the clone detection model comprises two Transformer sub-networks that share weights; each Transformer sub-network comprises Transformer encoding modules and a sparse attention module; the Transformer encoding modules are used for extracting the features of the code visual image, and the sparse attention module is used for finding discriminative pixel blocks in the code visual image and passing the corresponding latent features to the next encoding module; the training process comprises: inputting a clone code visual image pair or a non-clone code visual image pair into the two Transformer sub-networks, which map the pair to a high-dimensional feature space and output the corresponding characterizations, from which a characterization distance is computed with the Contrastive Loss function;
the clone detection module 105 is configured to input the code visualization image to be detected to the trained clone detection model for similarity detection, so as to obtain a detection result.
It should be noted that, the code clone detection system based on a visual image of the present invention corresponds to the code clone detection method based on a visual image of the present invention one to one, and the technical features and the beneficial effects thereof described in the above embodiment of the code clone detection method based on a visual image are all applicable to the embodiment of the code clone detection system based on a visual image, and specific contents may refer to the description in the embodiment of the method of the present invention, which is not described herein again, and thus is stated herein.
In addition, in the implementation of the code clone detection system based on a visual image according to the above embodiment, the logical division of each program module is only an example, and in practical applications, the above function distribution may be performed by different program modules according to needs, for example, due to the configuration requirements of corresponding hardware or the convenience of implementation of software, that is, the internal structure of the code clone detection system based on a visual image is divided into different program modules to perform all or part of the above described functions.
Referring to fig. 5, in an embodiment, an electronic device for implementing a code clone detection method based on a visual image is provided, and the electronic device 200 may include a first processor 201, a first memory 202, a bus, and a computer program stored in the first memory 202 and executable on the first processor 201, such as a code clone detection program 203 of a visual image.
The first memory 202 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), magnetic memory, magnetic disk, optical disk, etc. The first memory 202 may in some embodiments be an internal storage unit of the electronic device 200, such as a hard disk of the electronic device 200. In other embodiments, the first memory 202 may also be an external storage device of the electronic device 200, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card provided on the electronic device 200. Further, the first memory 202 may also include both an internal storage unit and an external storage device of the electronic device 200. The first memory 202 can be used not only to store application software installed in the electronic device 200 and various types of data, such as the code of the code clone detection program 203 for a visual image, but also to temporarily store data that has been or will be output.
The first processor 201 may in some embodiments be formed by integrated circuits, for example a single packaged integrated circuit, or several integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital signal processing chips, graphics processors and combinations of various control chips. The first processor 201 is the control unit of the electronic device; it connects the components of the whole electronic device by means of various interfaces and lines, and executes the functions of the electronic device 200 and processes data by running or executing the programs or modules stored in the first memory 202 and calling the data stored in the first memory 202.
Fig. 5 shows only an electronic device having components, and those skilled in the art will appreciate that the structure shown in fig. 5 does not constitute a limitation of the electronic device 200, and may include fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
The code clone detection program 203 of the visual image stored in the first memory 202 of the electronic device 200 is a combination of instructions that, when executed in the first processor 201, may implement:
obtaining Java code data in the form of java files; removing the annotations of the code in the java files, calling the VoidVisitorAdapter in JavaParser to separate the code functions in each java file, and storing each separated code function in a separate java file to obtain the code function files;
converting the code characters in the code function file into ASCII codes, filling the ASCII codes into RGB three-primary-color values to obtain RGB pixel points, and combining the pixel points into a visualized color image to obtain a code visualized image;
dividing the code visual image into a clone code visual image and a non-clone code visual image, and synthesizing a new clone code visual image through an SMOTE algorithm;
inputting the clone code visual image and the non-clone code visual image into a pre-established clone detection model for training to obtain a trained clone detection model;
the clone detection model comprises two Transformer sub-networks that share weights; each Transformer sub-network comprises Transformer encoding modules and a sparse attention module; the Transformer encoding modules are used for extracting the features of the code visual image, and the sparse attention module is used for finding discriminative pixel blocks in the code visual image and passing the corresponding latent features to the next encoding module; the training process comprises: inputting a clone code visual image pair or a non-clone code visual image pair into the two Transformer sub-networks, which map the pair to a high-dimensional feature space and output the corresponding characterizations, from which a characterization distance is computed with the Contrastive Loss function;
and inputting the code visual image to be detected into the trained clone detection model for similarity detection to obtain a detection result.
Further, if the modules/units integrated in the electronic device 200 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a non-volatile computer-readable storage medium. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a Read-Only Memory (ROM).
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include Read-Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM) or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM) and Rambus Dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such modifications are intended to be included in the scope of the present invention.

Claims (10)

1. The code clone detection method based on the visual image is characterized by comprising the following steps:
obtaining Java code data in the form of java files; removing the annotations of the code in the java files, calling the VoidVisitorAdapter in JavaParser to separate the code functions in each java file, and storing each separated code function in a separate java file to obtain the code function files;
converting the code characters in the code function file into ASCII codes, filling the ASCII codes into RGB three-primary-color values to obtain RGB pixel points, and combining the pixel points into a visualized color image to obtain a code visualized image;
dividing the code visual image into a clone code visual image and a non-clone code visual image, and synthesizing a new clone code visual image through an SMOTE algorithm;
inputting the clone code visual image and the non-clone code visual image into a pre-established clone detection model for training to obtain a trained clone detection model;
the clone detection model comprises two Transformer sub-networks that share weights; each Transformer sub-network comprises Transformer encoding modules and a sparse attention module; the Transformer encoding modules are used for extracting the features of the code visual image, and the sparse attention module is used for finding discriminative pixel blocks in the code visual image and passing the corresponding latent features to the next encoding module; the training process comprises: inputting a clone code visual image pair or a non-clone code visual image pair into the two Transformer sub-networks, which map the pair to a high-dimensional feature space and output the corresponding characterizations, from which a characterization distance is computed with the Contrastive Loss function;
and inputting the code visual image to be detected into the trained clone detection model for similarity detection to obtain a detection result.
2. The visual image-based code clone detection method according to claim 1, wherein the annotations removed from the code in the java files include single-line comments, multi-line comments and documentation comments.
3. The visual image-based code clone detection method according to claim 1, wherein the code characters in the code function file are converted into ASCII codes, the ASCII values are filled into the RGB three-primary-color values to obtain RGB pixel points, and the pixel points are combined into a visualized color image to obtain a code visual image; specifically:
the three primary colors are red R, green G and blue B;
the size of a generated java file is between 0 and 30 kB, and the file is converted according to the rule that every 3 characters become 1 pixel, so that the pixel count obtained after conversion is between 0 and 10240;
meanwhile, red R, green G and blue B are ordered in several different ways, and the ASCII codes are filled into the three primary color values in a different order each time to obtain several different visualized color images;
in addition, since the method functions in the source code differ, the code visual images converted from the java files differ in size; therefore, code visual images larger than a preset size are cropped to the preset size, and code visual images smaller than the preset size are padded with 0 to the preset size while keeping the original pixels close to the center of the padded image.
4. The visual image-based code clone detection method according to claim 1, wherein synthesizing new clone code visual images through the SMOTE algorithm means: if the number of clone code visual images is far smaller than the number of non-clone code visual images, the SMOTE algorithm is adopted; for each sample x among the clone code visual images, a clone code visual image y is randomly selected from the neighbors of x, and x and y are then interpolated into a new clone code visual image, which reduces the risk of over-fitting; afterwards, clone code visual images are labeled 1 and non-clone code visual images are labeled 0.
5. The visual image-based code clone detection method according to claim 1, wherein the Transformer encoding module is specifically: the Transformer sub-network comprises several Transformer encoding modules with the same structure; each Transformer encoding module comprises a multi-head self-attention module and a multi-layer perceptron module;
the multi-head self-attention module comprises several self-attention heads; different heads learn relevant features in mutually independent feature subspaces, and the outputs of all heads are finally concatenated and linearly transformed to obtain the output of the multi-head self-attention module; this output is combined with the input matrix through a residual connection, i.e. matrix addition, and layer normalization is finally applied to produce the input of the following multi-layer perceptron module;
the multi-layer perceptron module comprises two fully connected layers; the activation function of the first layer is ReLU and no activation function is used in the second layer, which deepens the module's ability to fit complex mappings and strengthens the clone detection model.
6. The visual image-based code clone detection method according to claim 1, wherein the sparse attention module is specifically: if the Transformer sub-network comprises L Transformer encoding modules, the sparse attention module screens the latent features fed into the last Transformer encoding module using the weights learned by the first L-1 Transformer encoding modules; owing to the abstraction of high-level features, a single attention map has difficulty expressing the feature information of the corresponding input image blocks; therefore, the attention map information learned by all previous Transformer encoding modules is used, and the weight of each attention map is learned autonomously with a squeeze-and-excitation module: the sparse attention module first fuses the previously obtained attention maps into a two-dimensional matrix through average pooling, then models the correlation between the attention maps with two fully connected layers to obtain the weight value of each attention map; finally, the weight values are normalized and used to compute a weighted sum of the attention maps, yielding the final attention weights.
7. The visual image-based code clone detection method according to claim 1, wherein the Transformer sub-network maps the clone code visual image pair or the non-clone code visual image pair to a high-dimensional feature space, outputs the corresponding characterizations, and computes a characterization distance from the characterizations with the Contrastive Loss function; specifically:
firstly, each image of a clone code visual image pair or a non-clone code visual image pair is divided into N image blocks of the same size, the image blocks are linearly mapped into serialized embedding vectors, and a learnable classification vector and position encoding information are added; secondly, the embedding vectors are combined into a matrix and input into several Transformer encoding modules for feature extraction; before the last Transformer encoding module, the corresponding latent features are input into the sparse attention module to find the discriminative pixel blocks of the pair; finally, the classification features output by the last Transformer encoding module are processed through a fully connected layer to obtain the category information of the pair;
secondly, with the help of the Contrastive Loss function, training drives the characterization distance of a clone code visual image pair to be small and that of a non-clone code visual image pair to be large; the computation proceeds as follows:
first, a pair of sample characterizations $(X_a, X_b)$ is selected; their Euclidean distance is:

$$d = \lVert X_a - X_b \rVert_2 = \sqrt{\sum_{i=1}^{n} \left( X_a^{i} - X_b^{i} \right)^2}$$
wherein $X_a$ denotes the characterization of visual image sample a, and $X_b$ denotes the characterization of visual image sample b;
the Contrastive Loss function is then expressed as:

$$L = \frac{1}{2N} \sum_{n=1}^{N} \left[ Y d^2 + (1 - Y) \max(m - d,\ 0)^2 \right]$$
wherein $Y$ denotes the label ($Y = 1$ for a clone pair, $Y = 0$ for a non-clone pair), $d$ denotes the Euclidean distance, and $m$ denotes the distance margin of the samples; when the distance of a non-clone pair $(X_a, X_b)$ exceeds $m$, its loss term becomes 0, so pairs that are already sufficiently separated are not pushed further apart, which preserves the generalization ability of the algorithm to a certain extent;
finally, the output of the trained clone detection model is the sample characterization distance based on the generated features; if the characterization distance is below 0.5, the source codes of the two input visual images are judged to be a pair of clone codes.
8. A code clone detection system based on a visual image, characterized in that it is applied to the above code clone detection method based on a visual image and comprises a data set making module, a code visualization module, a data preprocessing module, a clone detection model building module and a clone detection module;
the data set making module is used for obtaining Java code data in the form of java files; removing the annotations of the code in the java files, calling the VoidVisitorAdapter in JavaParser to separate the code functions in each java file, and storing each separated code function in a separate java file to obtain the code function files;
the code visualization module is used for converting the code characters in the code function file into ASCII codes, filling the ASCII values into the RGB three-primary-color values to obtain RGB pixel points, and combining the pixel points into a visualized color image to obtain a code visual image;
the data preprocessing module is used for dividing the visual code image into a clone code visual image and a non-clone code visual image and synthesizing a new clone code visual image through an SMOTE algorithm;
the clone detection model construction module is used for inputting the clone code visual image and the non-clone code visual image into a preset clone detection model for training to obtain a trained clone detection model;
the clone detection model comprises two Transformer sub-networks that share weights; each Transformer sub-network comprises Transformer encoding modules and a sparse attention module; the Transformer encoding modules are used for extracting the features of the code visual image, and the sparse attention module is used for finding discriminative pixel blocks in the code visual image and passing the corresponding latent features to the next encoding module; the training process comprises: inputting a clone code visual image pair or a non-clone code visual image pair into the two Transformer sub-networks, which map the pair to a high-dimensional feature space and output the corresponding characterizations, from which a characterization distance is computed with the Contrastive Loss function;
and the clone detection module is used for inputting the code visual image to be detected into the trained clone detection model for similarity detection to obtain a detection result.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores computer program instructions executable by the at least one processor to cause the at least one processor to perform a visual image-based code clone detection method as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium storing a program, wherein the program, when executed by a processor, implements the visual image-based code clone detection method according to any one of claims 1 to 7.
CN202310042780.5A 2023-01-28 2023-01-28 Code clone detection method, system, equipment and medium based on visual image Pending CN115964080A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310042780.5A CN115964080A (en) 2023-01-28 2023-01-28 Code clone detection method, system, equipment and medium based on visual image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310042780.5A CN115964080A (en) 2023-01-28 2023-01-28 Code clone detection method, system, equipment and medium based on visual image

Publications (1)

Publication Number Publication Date
CN115964080A 2023-04-14

Family

ID=87363442

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310042780.5A Pending CN115964080A (en) 2023-01-28 2023-01-28 Code clone detection method, system, equipment and medium based on visual image

Country Status (1)

Country Link
CN (1) CN115964080A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116302089A (en) * 2023-05-23 2023-06-23 华中科技大学 Picture similarity-based code clone detection method, system and storage medium
CN116302089B (en) * 2023-05-23 2023-08-18 华中科技大学 Picture similarity-based code clone detection method, system and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination