CN115951891A - Code clone detection method and device, terminal equipment and readable storage medium - Google Patents


Info

Publication number
CN115951891A
Authority
CN
China
Prior art keywords
function
character string
source code
function body
code file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211101209.8A
Other languages
Chinese (zh)
Inventor
万振华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Seczone Technology Co Ltd
Original Assignee
Seczone Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Seczone Technology Co Ltd
Priority to CN202211101209.8A
Publication of CN115951891A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a code clone detection method and device, a terminal device and a readable storage medium. The method comprises: obtaining a source code file when an extraction instruction is received; analyzing the source code file to obtain function body information corresponding to the source code file; determining at least one sub-function body from the source code file based on the function body information; extracting the sub-function body through a preset matching rule to obtain at least one function character string; and carrying out code clone detection based on the function character string. The function body is obtained efficiently through a preset tool, the part of the function body that carries characteristic significance is extracted, and the remainder is filtered out to obtain the function character string corresponding to the function body. Whether code has been cloned can then be detected from the function character string, which addresses the difficulty of existing code clone detection and improves the accuracy with which cloned code is detected.

Description

Code clone detection method and device, terminal equipment and readable storage medium
Technical Field
The present invention relates to the field of data processing, and in particular, to a code clone detection method and apparatus, a terminal device, and a readable storage medium.
Background
With the rapid development of software technology, the demand for code clone detection is increasing. Code clone detection refers to detecting two or more identical or similar source code segments across code. A function may be used in several pieces of software or projects, and when that function is to be modified, every project that uses it must be modified as well in order to keep operating normally; code clone detection therefore needs to be performed before the modification.
Currently, code clones are detected in one of two ways. In the first, a programmer locates the function in the source code with a search tool and then manually confirms whether similar functions are cloned code. In the second, a machine first clusters identical or highly similar functions from multiple source codes according to preset conditions and then hands them to a programmer for confirmation.
Both approaches depend on the professional skill of the individual programmer, who must know the code structure, syntax and function definitions of each language in order to extract code from source code files of different language types. The demands on the programmer are therefore high, and because manual confirmation is influenced by subjective factors, accuracy cannot be guaranteed.
Disclosure of Invention
The invention mainly aims to provide a code clone detection method, a device, a terminal device and a readable storage medium, aiming at solving the technical problem of high difficulty in code clone detection and improving the detection accuracy of a clone code.
In order to achieve the above object, the present invention provides a code clone detection method, comprising the steps of:
when an extracting instruction is received, a source code file is obtained;
analyzing the source code file to obtain function body information corresponding to the source code file;
determining at least one sub-function body from the source code file based on the function body information;
and extracting the sub-function body through a preset matching rule to obtain at least one function character string, and carrying out code clone detection based on the function character string.
Optionally, the step of analyzing the source code file to obtain the function body information corresponding to the source code file includes:
and when a scanning instruction is received, scanning the source code file through a preset scanning tool to acquire the function body information in the source code file.
Optionally, the step of determining at least one sub-function body from the source code file based on the function body information includes:
acquiring a starting line and an ending line in the function body information;
and obtaining at least one sub-function body from the source code file by intercepting based on the starting line and the ending line.
Optionally, the step of extracting the sub-function body through a preset matching rule to obtain at least one function character string includes:
acquiring a character string in the subfunction body;
matching and extracting the character strings according to the preset matching rule to obtain at least one first function character string;
and matching and filtering the first function character string according to the preset matching rule to obtain at least one second function character string, and taking the second function character string as the function character string.
Optionally, the step of performing matching extraction on the character strings through the preset matching rule to obtain at least one first function character string, performing matching filtering on the first function character string through the preset matching rule to obtain at least one second function character string, and using the second function character string as the function character string includes:
identifying a first feature symbol in the character string;
extracting the characters between the first feature symbols in the character string to obtain a first function character string;
identifying a second feature character string in the first function character string;
and filtering the second feature character string out of the first function character string, and taking the filtered result, namely the second function character string, as the function character string.
Optionally, after the step of filtering the second feature character string out of the first function character string and taking the filtered second function character string as the function character string, the method further includes:
acquiring a font format of the function character string;
and comparing the font format with a preset format, and if the comparison result is inconsistent, performing format conversion on the function character string corresponding to the font format.
Optionally, after the step of extracting the sub-function body through a preset matching rule to obtain at least one function character string, the method further includes:
generating a feature identifier for the function string;
and generating an index identifier for the function character string based on the function body information and the feature identifier.
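The two identifier steps above can be illustrated with a minimal sketch. The patent text does not state how the feature identifier is computed; the use of a SHA-256 digest, the function names `feature_id` and `index_id`, and the layout of the index key are all assumptions made for illustration only.

```python
import hashlib


def feature_id(function_string: str) -> str:
    """Return a SHA-256 hex digest of the function string.

    Assumption: a cryptographic hash stands in for the 'feature identifier';
    the source does not name a specific algorithm.
    """
    return hashlib.sha256(function_string.encode("utf-8")).hexdigest()


def index_id(file_path: str, function_name: str, start_line: int,
             function_string: str) -> str:
    """Combine function body information with a shortened feature identifier.

    The key layout (file:name:line:digest-prefix) is hypothetical.
    """
    return f"{file_path}:{function_name}:{start_line}:{feature_id(function_string)[:16]}"
```

With such a scheme, two function bodies that reduce to the same function character string share a feature identifier, while the index identifier still records where each one came from.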
In addition, to achieve the above object, the present invention also provides a code clone detecting device, including:
the acquisition module is used for acquiring the source code file when receiving the extraction instruction;
the analysis module is used for analyzing the source code file to obtain function body information corresponding to the source code file;
a determining module, configured to determine at least one sub-function body from the source code file based on the function body information;
and the extraction module is used for extracting the sub-function body through a preset matching rule to obtain at least one function character string and carrying out code clone detection based on the function character string.
In addition, to achieve the above object, the present invention further provides a terminal device, which includes a memory, a processor, and a code clone detection program stored on the memory and executable on the processor, wherein the code clone detection program implements the steps of the code clone detection method as described above when executed by the processor.
Further, to achieve the above object, the present invention also provides a computer readable storage medium having stored thereon a code clone detection program, which when executed by a processor implements the steps of the code clone detection method as described above.
The invention provides a code clone detection method, a code clone detection device, a terminal device and a readable storage medium. When an extraction instruction is received, a source code file is obtained;
the source code file is analyzed to obtain function body information corresponding to the source code file;
at least one sub-function body is determined from the source code file based on the function body information;
and the sub-function body is extracted through a preset matching rule to obtain at least one function character string, on which code clone detection is based. In other words, an analysis tool analyzes the source code file to obtain information locating the function body within it, the source code file is intercepted based on that information to obtain the key part of the function body, and characters that are useless for expressing the function's information are replaced through the preset regular-expression matching rule, finally yielding the function character string. The replaced function character string occupies less space than the original function body and is convenient to store, and because only the characters of the key part are extracted, the information expressed by the original function body is not affected. Function features in the source code file are thus extracted automatically, code clone detection is carried out on the resulting function character strings, and ultimately the difficulty of extracting code clones is reduced and detection accuracy is improved.
Drawings
Fig. 1 is a schematic diagram of functional modules of a terminal device to which the code clone detection apparatus of the present application belongs;
FIG. 2 is a schematic flow chart diagram of an exemplary embodiment of a code clone detection method of the present application;
FIG. 3 is a schematic flow chart diagram illustrating another exemplary embodiment of a code clone detection method of the present application;
FIG. 4 is a schematic flow chart diagram illustrating another exemplary embodiment of a code clone detection method of the present application;
FIG. 5 is a schematic flow chart diagram illustrating another exemplary embodiment of a code clone detection method of the present application;
FIG. 6 is a schematic flow chart diagram illustrating another exemplary embodiment of a code clone detection method of the present application;
fig. 7 is a schematic diagram illustrating analysis of a source code file according to the code clone detection method of the present application.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The main solution of the embodiment of the invention is as follows: a source code file is obtained when an extraction instruction is received; the source code file is analyzed to obtain the corresponding function body information; at least one sub-function body is determined from the source code file based on the function body information; and the sub-function body is extracted through a preset matching rule to obtain at least one function character string, on which code clone detection is based. In other words, an analysis tool analyzes the source code file to obtain information locating the function body within it, the source code file is intercepted based on that information to obtain the key part of the function body, and characters that are useless for expressing the function's information are replaced through the preset regular-expression matching rule, finally yielding the function character string. The replaced function character string occupies less space than the original function body and is convenient to store, and because only the characters of the key part are extracted, the information expressed by the original function body is not affected. Function features in the source code file are thus extracted automatically, code clone detection is carried out on the resulting function character strings, and ultimately the difficulty of extracting code clones is reduced and detection accuracy is improved.
Specifically, referring to fig. 1, fig. 1 is a schematic diagram of the functional modules of a terminal device to which the code clone detection apparatus of the present application belongs. The code clone detection apparatus may be a device independent of the terminal device that can acquire a source code file, analyze it to obtain function body information, determine a sub-function body and extract the sub-function body; it may be carried on the terminal device in the form of hardware or software. The terminal device may be an intelligent mobile terminal with a code clone detection function, such as a mobile phone or a tablet computer, or a fixed terminal device or server with a code clone detection function.
In this embodiment, the terminal device to which the code clone detecting apparatus belongs at least includes an output module 110, a processor 120, a memory 130, and a communication module 140.
The memory 130 stores an operating system and a code clone detection program, and the code clone detection device can store information such as a source code file, function body information, a subfunction body, a function character string and the like in the memory 130; the output module 110 may be a display screen or the like. The communication module 140 may include a WIFI module, a mobile communication module, a bluetooth module, and the like, and communicates with an external device or a server through the communication module 140.
Wherein the code clone detection program in the memory 130 when executed by the processor implements the steps of:
when an extracting instruction is received, a source code file is obtained;
analyzing the source code file to obtain function body information corresponding to the source code file;
determining at least one sub-function body from the source code file based on the function body information;
and extracting the sub-function body through a preset matching rule to obtain at least one function character string, and carrying out code clone detection based on the function character string.
Further, the code clone detection program in the memory 130, when executed by the processor, also implements the following steps:
and when a scanning instruction is received, scanning the source code file through a preset scanning tool to acquire function body information in the source code file.
Further, the code clone detection program in the memory 130, when executed by the processor, also implements the following steps:
acquiring a starting line and an ending line in the function body information;
and obtaining at least one sub-function body from the source code file by intercepting based on the starting line and the ending line.
Further, the code clone detection program in the memory 130, when executed by the processor, also implements the following steps:
acquiring character strings in the sub-function bodies;
matching and extracting the character strings through the preset matching rule to obtain at least one first function character string;
and performing matching filtering on the first function character string through the preset matching rule to obtain at least one second function character string, and taking the second function character string as the function character string.
Further, the code clone detection program in the memory 130, when executed by the processor, also implements the following steps:
identifying a first feature symbol in the character string;
extracting the characters between the first feature symbols in the character string to obtain a first function character string;
identifying a second feature character string in the first function character string;
and filtering the second feature character string out of the first function character string, and taking the filtered result, namely the second function character string, as the function character string.
Further, the code clone detection program in the memory 130, when executed by the processor, also implements the following steps:
acquiring a font format of the function character string;
and comparing the font format with a preset format, and if the comparison result is inconsistent, performing format conversion on the function character string corresponding to the font format.
Further, the code clone detection program in the memory 130, when executed by the processor, also implements the following steps:
generating a feature identifier for the function character string;
and generating an index identifier for the function character string based on the function body information and the feature identifier.
Based on the above terminal device architecture but not limited to the above architecture, embodiments of the method of the present application are provided.
Referring to fig. 2, fig. 2 is a schematic flowchart of an exemplary embodiment of the code clone detection method of the present application. The code clone detection method comprises the following steps:
step S1001, when receiving an extraction instruction, acquiring a source code file;
Specifically, when an extraction instruction is received, a source code file uploaded in advance is acquired from the cloud, or a locally stored source code file is retrieved. A source code file is a file containing the original program code; the machine or system carries out the corresponding operations by executing that code. The source code file is written by a programmer and, once written, stored in the cloud or locally.
Step S1002, analyzing the source code file to obtain function body information corresponding to the source code file;
Specifically, after the source code file is obtained, it is first analyzed with the ctags tool (which generates tag files for source code), a tool that supports reading source code files in many languages. This embodiment takes ctags as an example; it is understood that the function body information may also be obtained by other tools or means, such as the jtags tool for Java or the ptags tool for Python. ctags is installed in Vim (Vi IMproved, a text editor), and the function body information is obtained by typing a ctags command with parameters in a command prompt window, for example "ctags -f - --kinds-C=f --fields=neKSt <file path>". After the command is entered, the set of source code files obtained in step S1001 is traversed to obtain all the function information of the source code file, including the function name, file source, function definition, type, start position and end position, that is, the function body information.
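As a hedged illustration of this step, the sketch below assembles such a ctags invocation from Python instead of typing it by hand. The option spelling (`-f -`, `--kinds-C=f`, `--fields=neKSt`) is an assumption based on Universal Ctags and on the command as it appears, partly garbled, in this text; the helper names `build_ctags_command` and `run_ctags` are hypothetical.

```python
import subprocess


def build_ctags_command(source_path: str) -> list[str]:
    """Build an argument list asking ctags for function tags only.

    Assumed Universal Ctags options:
      -f -            write tags to standard output
      --kinds-C=f     restrict C tags to functions
      --fields=neKSt  include line number, end line, kind name,
                      signature and type information
    """
    return ["ctags", "-f", "-", "--kinds-C=f", "--fields=neKSt", source_path]


def run_ctags(source_path: str) -> str:
    """Run ctags and return its raw output (requires ctags to be installed)."""
    result = subprocess.run(
        build_ctags_command(source_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Keeping the command construction separate from its execution makes it possible to swap in another tag generator without touching the later parsing steps.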
Step S1003, based on the function body information, determining at least one sub function body from the source code file;
Specifically, function bodies can be distinguished by function name. When a function name is repeated, a new name can be derived from the existing one for each repeated function; for a swap function swap(), for example, the second and later functions of that name can be named swap1(), swap2(), and so on. The storage path of the function body and its position in the source code file are determined from the file source, start position and end position. The purpose of the function body and the types of its variables are known from the function definition and the type. Based on this function body information, the character segment corresponding to the function body must be located in the source code file and reduced to feature information that can represent the function body.
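The renaming of repeated function names can be sketched as follows. This is only one possible reading of the scheme described above: the helper name `disambiguate` is hypothetical, and the sketch does not guard against a source file that already defines a function literally named swap1.

```python
from collections import defaultdict


def disambiguate(names):
    """Append a running counter to the second and later occurrences of a name."""
    seen = defaultdict(int)  # how many times each name has appeared so far
    result = []
    for name in names:
        count = seen[name]
        # The first occurrence keeps its name; later ones become name1, name2, ...
        result.append(name if count == 0 else f"{name}{count}")
        seen[name] += 1
    return result
```

For example, `disambiguate(["swap", "max", "swap", "swap"])` returns `["swap", "max", "swap1", "swap2"]`.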
First, the function definition usually describes the purpose of the function in a Chinese comment; Chinese is not a machine language and cannot be recognized directly by a machine or system. Second, variables defined in different function bodies may be common variables, so the same variable definition character string can appear many times in different function bodies and carries no significance for extracting the feature information of a source code function. The comment part and the variable definition part of the function body are therefore filtered out.
Step S1004, extracting the sub-function body through a preset matching rule to obtain at least one function character string, and carrying out code clone detection based on the function character string.
Specifically, matching the character string against the rule means obtaining the specific part that is wanted from the character string through a regular expression. For example, the regular expression
“(?://.*?$|[{}]+)|(?:/\*.*?\*/)|('(\\.|[^\\'])*'|\"(\\.|[^\\\"])*\"|.[^/'\"]*)”
matches the comment part in the function body, and the expression
“\n|\t|\r|\{|}|\s+”
matches the spaces, line feeds, carriage returns, tab characters and braces in the function body.
The part of a function body that carries characteristic meaning is usually the execution part, that is, the character string content inside the braces. The characters inside the braces are therefore first extracted through a regular expression, spliced and stored. Characters such as spaces, line feeds, carriage returns, tab characters and braces are then filtered out, because they appear in every function body and cannot reflect its feature information. The result is the feature information that can represent the function body, namely the function character string; each function character string corresponds to one function body.
According to this scheme, a source code file is obtained when an extraction instruction is received; the source code file is analyzed to obtain the corresponding function body information; at least one sub-function body is determined from the source code file based on that information; and the sub-function body is extracted through a preset matching rule to obtain at least one function character string, on which code clone detection is based. In other words, an analysis tool analyzes the source code file to obtain information locating the function body within it, the source code file is intercepted based on that information to obtain the key part of the function body, and characters that are useless for expressing the function's information are replaced through the preset regular-expression matching rule, finally yielding the function character string. The function character string is the feature information of the function body: when detecting cloned code, whether a function body is a clone can be determined simply by comparing its function character string with other function character strings, which places low demands on the professional level of personnel. The replaced function character string also occupies less space than the original function body and is convenient to store. Function features in the source code file are thus extracted automatically, code clone detection is carried out on the resulting function character strings, and ultimately the difficulty of extracting code clones is reduced and detection accuracy is improved.
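The comparison of function character strings described above can be sketched as a simple grouping step. This is a minimal illustration under the assumption that a clone means an exactly equal function character string; the text also speaks of similar code, which would need a similarity measure that is not specified here. The name `find_clone_groups` and the identifier format used in the example are hypothetical.

```python
from collections import defaultdict


def find_clone_groups(function_strings):
    """Group function identifiers whose function strings are identical.

    function_strings maps an identifier (for example "file:function")
    to the function character string extracted for that function body.
    Only groups with more than one member are reported as clone candidates.
    """
    groups = defaultdict(list)
    for identifier, function_string in function_strings.items():
        groups[function_string].append(identifier)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```

Because the function character strings already have comments and layout characters removed, two functions that differ only in formatting or comments fall into the same group.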
Referring to fig. 3, fig. 3 is a schematic flow chart of another exemplary embodiment of the code clone detection method of the present application.
In step S1003, the step of determining at least one sub-function body from the source code file based on the function body information includes:
step A100, obtaining a start line and an end line in the function body information;
Specifically, in step S1002 the source code file is analyzed to obtain the corresponding function body information, which covers all the function information of the source code file: function name, file source, function definition, type, start position and end position. Referring to fig. 7, a schematic diagram of analyzing a source code file in the code clone detection method of the present application, calling the ctags tool with the command "ctags -f - --kinds-C=f --fields=neKSt <file path>" yields the function body information shown in the figure. In fig. 7, each '/;"\tfunction' entry represents one function body, which comprises a definition part and an execution part; "line:" gives the starting line number of the function body, and "end:" gives its ending line number. Acquiring "line:" and "end:" thus yields the start line and end line of the function body, that is, its position in the source code file, for the subsequent extraction of the function body.
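Reading the "line:" and "end:" fields out of one tag entry can be sketched as below. The tab-separated layout assumed here (name, file, search pattern terminated by ;" and then extension fields) follows the common ctags tag-file format; the sample entry used in the example is invented for illustration and is not taken from fig. 7.

```python
def parse_tag_line(line: str) -> dict:
    """Parse one ctags entry into a dictionary of function body information."""
    # Everything before ;"<TAB> is name, file and search pattern;
    # everything after it is the list of extension fields.
    head, _, extension = line.rstrip("\n").partition(';"\t')
    name, path, _pattern = head.split("\t", 2)
    info = {"name": name, "file": path, "kind": None}
    for field in extension.split("\t"):
        if not field:
            continue
        key, sep, value = field.partition(":")
        if not sep:
            info["kind"] = field        # bare field such as "function"
        elif key in ("line", "end"):
            info[key] = int(value)      # start and end line numbers
        else:
            info[key] = value           # other fields, kept as text
    return info
```

For a hypothetical entry `zcfree<TAB>zutil.c<TAB>/^void zcfree(voidpf opaque, voidpf ptr)$/;"<TAB>function<TAB>line:10<TAB>end:25`, the result carries `kind` "function", `line` 10 and `end` 25.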
Step A200, based on the start line and the end line, at least one sub-function body is obtained from the source code file through interception.
Specifically, according to the result generated by the ctags tool and shown in fig. 7, the content from the start line to the end line is intercepted from the source file to obtain a sub-function body. Each '/;"\tfunction' entry corresponds to one function; in general, a software source code contains more than one function.
According to this scheme, the start line and end line in the function body information are obtained, and at least one sub-function body is obtained from the source code file by interception based on them. In other words, taking advantage of the simplicity of operating the ctags tool, the position information of the function body is determined from the source code file and the corresponding function body is obtained by interception. The operator does not need to understand the code structure and syntax of each language, so the difficulty of extracting the function body is reduced and the efficiency of code clone detection is improved.
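The interception by start line and end line amounts to slicing the file's lines, as in this minimal sketch. It assumes 1-based, inclusive line numbers, which is how ctags reports them; the helper name `extract_function_body` is hypothetical.

```python
def extract_function_body(source_text: str, start_line: int, end_line: int) -> str:
    """Return the text from start_line to end_line, both 1-based and inclusive."""
    lines = source_text.splitlines(keepends=True)
    # Convert the 1-based start to a 0-based index; the slice end is exclusive,
    # so end_line can be used directly.
    return "".join(lines[start_line - 1:end_line])
```

Calling this once per tag entry yields one sub-function body per function in the source code file.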
Referring to fig. 4, fig. 4 is a flowchart illustrating another exemplary embodiment of the code clone detection method of the present application.
In step S1004, the step of extracting the sub-function body through a preset matching rule to obtain at least one function character string and carrying out code clone detection based on the function character string includes:
step B100, acquiring character strings in the subfunction body;
specifically, the function body comprises a definition part and a function part, the definition part and the function part are stored in a character string form, the content of the source file is read according to the starting position and the ending position of the function body, and the content is spliced and stored in a local character string.
Step B200, matching and extracting the character strings through the preset matching rules to obtain at least one first function character string;
Specifically, the preset regular expression includes "\{([\s\S]*)\}". A regular expression obtains the wanted part of a character string by judging whether the given string satisfies the filtering logic of the expression; this way of extracting is also called "matching".
In the spliced and stored character string, the content inside the braces is the content that is needed, namely the first function character string. The content inside the braces is the execution part of the function, whereas the definition part outside them, such as the declaration of a void type, has little influence on the function body and serves almost only to annotate and constrain the program, so it need not be extracted.
Illustratively, the following is the character string of a function before the regular expression "\{([\s\S]*)\}" is applied:
[Figures BDA0003839088040000111 and BDA0003839088040000121: listing of the example function before extraction]
After the regular expression "\{([\s\S]*)\}" is applied, the characters between the matched braces are the target characters; the target characters are spliced and stored locally to obtain the first function character string:
[Figure BDA0003839088040000122: listing of the first function character string]
It can be seen that the first line of code is not inside the braces and is therefore not extracted or retained.
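The extraction of step B200 can be illustrated with a short Python sketch. It assumes that the garbled expression in the text is meant to be `\{([\s\S]*)\}`, that is, a greedy capture of everything between the first opening brace and the last closing brace; the function name `extract_first_function_string` and the sample function are hypothetical.

```python
import re

# Assumed reading of the expression in the text: capture everything between
# the first "{" and the last "}" (greedy, and [\s\S] also spans newlines).
BODY_PATTERN = re.compile(r"\{([\s\S]*)\}")

def extract_first_function_string(sub_function_body):
    """Return the text inside the outermost braces, or "" if there are none."""
    match = BODY_PATTERN.search(sub_function_body)
    return match.group(1) if match else ""

# Hypothetical sub-function body: one definition line plus a braced body.
SAMPLE = """void hello(void)
{
    if (ready) { puts("hi"); }
}"""

first_function_string = extract_first_function_string(SAMPLE)
```

The greedy quantifier matters here: with a non-greedy `*?` the match would stop at the first closing brace and cut off nested blocks.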
Step B300, performing matching filtering on the first function character string through the preset matching rule to obtain at least one second function character string, and taking the second function character string as the function character string.
Specifically, characters such as comments, spaces, line feeds, carriage returns, tabs and braces in the first function character string do not help identify the features of the function itself; they may appear in every function body and carry no distinguishing information.
Illustratively, the matching filtered second function string is:
intn;if(*(ush*)&ptr!=0)farfree(ptr);return;for(n=0;n<next_ptr;n++)if(ptr!=table[n].new_ptr)continue;farfree(table[n].org_ptr);while(++n<next_ptr)table[n-1]=table[n];next_ptr--;return;ptr=opaque;Assert(0,"zcfree:ptrnotfound");
The preset matching rule is a preset regular expression. The expression for comments is "(; the regular expression for spaces, line feeds, carriage returns, tabs and braces is "\n|\t|\r|\{|}|\s+". The corresponding target characters are matched through these expressions and then replaced and filtered out.
According to the scheme, the character strings in the sub-function body are obtained; the character strings are matched and extracted through the preset matching rule to obtain at least one first function character string; and the first function character string is matched and filtered through the preset matching rule to obtain at least one second function character string, which is used as the function character string. In other words, within the character string corresponding to the function body, the effective part that helps identify the function body is extracted and retained, and the invalid part is replaced and filtered out, yielding a function character string that is more representative of the function. The detection accuracy for identical or highly similar functions is thereby improved.
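The filtering of step B300 can be sketched in Python as follows. The whitespace and brace pattern follows an assumed reading of the expression in the text. The comment pattern is a hypothetical stand-in for C-style comments, since the comment expression of the application is not legible here; the function name and sample input are also hypothetical.

```python
import re

# Hypothetical comment pattern (C-style block and line comments); this is an
# assumption, not the expression used in the application.
COMMENT_PATTERN = re.compile(r"/\*[\s\S]*?\*/|//[^\n]*")
# Assumed reading of the whitespace and brace expression in the text.
NOISE_PATTERN = re.compile(r"\n|\t|\r|\{|}|\s+")

def filter_function_string(first_function_string):
    """Drop comments first, then whitespace and braces."""
    without_comments = COMMENT_PATTERN.sub("", first_function_string)
    return NOISE_PATTERN.sub("", without_comments)

FIRST = """
    int n;  /* loop index */
    for (n = 0; n < next_ptr; n++) {
        if (ptr != table[n].new_ptr) continue;  // skip
    }
"""

second_function_string = filter_function_string(FIRST)
```

Comments are removed before whitespace so that a line comment still ends at its line feed. A plain regular expression of this kind does not distinguish comment markers inside string literals, which a production implementation would probably need to handle.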
Referring to fig. 5, fig. 5 is a flowchart illustrating another exemplary embodiment of the code clone detection method of the present application.
In step S1004, after the step of extracting the characters between the first feature symbols in the character string to obtain the first function character string, the method further includes:
step D100, obtaining the font format of the function character string;
specifically, the function character string obtained in step B300 contains upper case and lower case letters. When the function character string is used or converted in some scenarios, the presence of both upper case and lower case letters may cause processing errors, so the letters need to be normalized. First, the font format of each character in the character string is obtained, that is, whether the character is in upper case or lower case format.
Step D200, comparing the font format with a preset format, and if the comparison result is inconsistent, performing format conversion on the function character string of the font format.
Specifically, the preset format in this embodiment is the lower case format: lower case letters are used as the uniform format to reduce the risk of errors caused by inconsistent formats. The preset format can be set according to actual needs.
Illustratively, with lower case as the uniform format, each letter is compared against the lower case format and any letter that does not conform is converted to lower case, while Arabic numerals are left unchanged, to obtain the final function character string.
According to the scheme, the font format of the function character string is obtained and compared with the preset format, and if the comparison result is inconsistent, format conversion is performed on the function character string. In other words, the letter case of the characters is unified, which reduces the risk of errors caused by inconsistent formats in subsequent operations and improves the reliability of code clone detection.
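Steps D100 and D200 amount to a per-character case check followed by a conversion. A minimal Python sketch, assuming lower case as the preset format; the function name `normalize_case` and the sample string are hypothetical:

```python
def normalize_case(function_string):
    """Convert upper case letters to lower case; leave everything else as-is."""
    # Digits and symbols are neither upper nor lower case, so they pass through.
    return "".join(ch.lower() if ch.isupper() else ch for ch in function_string)

normalized = normalize_case('Assert(0,"zcfree:ptrnotfound");table[N-1]=table[n];')
```

For plain ASCII input this gives the same result as calling `str.lower()` on the whole string; the explicit loop only mirrors the compare-then-convert wording of the steps.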
Referring to fig. 6, fig. 6 is a flowchart illustrating another exemplary embodiment of the code clone detection method of the present application.
In step S1004, after the step of extracting the sub-function body through a preset matching rule to obtain at least one function character string and performing code clone detection based on the function character string, the method further includes:
step E100, generating a feature identifier for the function character string;
specifically, the main function execution part inside the braces is extracted from the function character string, and the comment parts outside the braces and the interference items that contribute little to function identification, such as spaces, line feeds, carriage returns, tabs and braces, are filtered out, so that the difference between the function character strings corresponding to different function bodies is as large as possible and different function bodies are better distinguished. Feature information can be constructed for the function character string based on these differences: first, a unique hash value is calculated from the function character string by a computer and used as the feature identifier.
Step E200, generating an index identifier for the function character string based on the function body information and the feature identifier.
Specifically, the feature identifier links the function character string to the corresponding function body, and the positioning information in the function body information links the function body to the project source code file. Combining the feature identifier with the function body information yields the index identifier, that is, the association between the function character string, the function body and the source code file.
According to the scheme, in this embodiment, the feature identifier is generated for the function character string, and the index identifier is generated for the function character string based on the function body information and the feature identifier. The related function information is thus associated around the function, which allows fast lookup by feature identifier and allows the function to serve as the object of code clone detection.
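Steps E100 and E200 can be illustrated with a short Python sketch. The application does not name a hash algorithm, so SHA-256 is used here purely as an assumption; the record layout, the function names and the sample file paths are likewise hypothetical.

```python
import hashlib

def feature_identifier(function_string):
    """Hash the normalized function string (SHA-256 is an assumed choice)."""
    return hashlib.sha256(function_string.encode("utf-8")).hexdigest()

def index_identifier(file_path, function_name, start_line, end_line, function_string):
    """Combine function body information with the feature identifier."""
    return {
        "file": file_path,
        "function": function_name,
        "start_line": start_line,
        "end_line": end_line,
        "feature_id": feature_identifier(function_string),
    }

# Two hypothetical functions from different files with identical normalized bodies.
a = index_identifier("a/zutil.c", "zcfree", 10, 25, "intn;return;")
b = index_identifier("b/util.c", "my_free", 40, 55, "intn;return;")
is_clone = a["feature_id"] == b["feature_id"]
```

Equal feature identifiers flag the two functions as clone candidates, and the remaining fields of each record point back to the file and line range where the function body was found.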
In addition, an embodiment of the present application further provides a code clone detection device, where the code clone detection device includes:
the acquisition module is used for acquiring the source code file when receiving the extraction instruction;
the analysis module is used for analyzing the source code file to obtain function body information corresponding to the source code file;
the determining module is used for determining at least one sub function body from the source code file based on the function body information;
and the extraction module is used for extracting the sub-function body through a preset matching rule to obtain at least one function character string and carrying out code clone detection based on the function character string.
For the principle and implementation process of code clone detection in this embodiment, please refer to the above embodiments, which are not described herein again.
In addition, a terminal device is further provided in an embodiment of the present application, where the terminal device includes a memory, a processor, and a code clone detection program stored in the memory and executable on the processor, and when the code clone detection program is executed by the processor, the code clone detection program implements the steps of code clone detection as described above.
Since the code clone detection program is executed by the processor, all technical solutions of all the foregoing embodiments are adopted, so that at least all the beneficial effects brought by all the technical solutions of all the foregoing embodiments are achieved, and no further description is given here.
Furthermore, an embodiment of the present application also provides a computer-readable storage medium, on which a code clone detection program is stored, and the code clone detection program implements the steps of code clone detection as described above when executed by a processor.
Since the code clone detection program is executed by the processor, all technical solutions of all the foregoing embodiments are adopted, so that at least all the beneficial effects brought by all the technical solutions of all the foregoing embodiments are achieved, and no further description is given here.
The label connection relation set in this manner is limited by subjective personal factors, which easily leads to insufficient accuracy and comprehensiveness of the label connection relation and to high maintenance cost.
In the prior art, function bodies cannot be identified from a source code file quickly and effectively by hand, and the judgment of a function body depends on personal experience. Compared with the prior art, the method and the device quickly acquire the function body information through the ctags tool and locate function bodies in various language types, with high acquisition efficiency and wide applicability, which greatly reduces the difficulty of extracting functions. The function body then undergoes effective extraction and ineffective replacement: the main function execution part inside the braces is extracted, and the comment parts outside the braces and the interference items that contribute little to function identification, such as spaces, line feeds, carriage returns, tabs and braces, are filtered out, so that the difference between the function character strings corresponding to different function bodies is as large as possible and different function bodies are better distinguished. The execution steps remain simple, while the clone detection accuracy between two identical or highly similar functions is improved in practice.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system comprising the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method of the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but in many cases, the former is a better implementation. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, a controlled terminal, or a network device) to execute the method of each embodiment of the present application.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A code clone detection method, characterized in that it comprises the following steps:
when an extracting instruction is received, a source code file is obtained;
analyzing the source code file to obtain function body information corresponding to the source code file;
determining at least one sub-function body from the source code file based on the function body information;
and extracting the sub-function body through a preset matching rule to obtain at least one function character string, and carrying out code clone detection based on the function character string.
2. The code clone detection method of claim 1, wherein said step of analyzing said source code file to obtain function body information corresponding to said source code file comprises:
and when a scanning instruction is received, scanning the source code file through a preset scanning tool to acquire the function body information in the source code file.
3. The code clone detection method of claim 2, wherein said step of determining at least one sub-function body from said source code file based on said function body information comprises:
acquiring a starting line and an ending line in the function body information;
and obtaining at least one sub-function body from the source code file by intercepting based on the starting line and the ending line.
4. The code clone detection method of claim 1, wherein said step of extracting said sub-function body by presetting a matching rule to obtain at least one function string comprises:
acquiring character strings in the subfunction bodies;
matching and extracting the character strings according to the preset matching rule to obtain at least one first function character string;
and performing matching filtering on the first function character string through the preset matching rule to obtain at least one second function character string, and taking the second function character string as the function character string.
5. The code clone detection method of claim 4, wherein said matching extraction of said character strings by said preset matching rules to obtain at least one first function character string, said matching filtering of said first function character string by said preset matching rules to obtain at least one second function character string, and said step of using said second function character string as said function character string comprises:
identifying a first characteristic symbol in the character string;
extracting characters among the first characteristic symbols in the character string to obtain a first function character string;
identifying a second characteristic string in the first function string;
and filtering the second characteristic character string in the first function character string, and taking the filtered second function character string as the function character string.
6. The code clone detection method of claim 5, wherein after the step of filtering said second feature string in said first function string and treating said filtered first function string as said function string, further comprising:
acquiring a font format of the function character string;
and comparing the font format with a preset format, and if the comparison result is inconsistent, performing format conversion on the function character string corresponding to the font format.
7. The code clone detection method of claim 1, wherein said step of extracting said sub-function body by presetting a matching rule to obtain at least one function string further comprises:
generating a feature identifier for the function string;
and generating an index identifier for the function character string based on the function body information and the feature identifier.
8. A code clone detecting device, characterized in that it comprises:
the acquisition module is used for acquiring the source code file when receiving the extraction instruction;
the analysis module is used for analyzing the source code file to obtain function body information corresponding to the source code file;
a determining module, configured to determine at least one sub-function body from the source code file based on the function body information;
and the extraction module is used for extracting the sub-function body through a preset matching rule to obtain at least one function character string.
9. A terminal device, characterized in that the terminal device comprises a memory, a processor and a code clone detection program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the code clone detection method according to any one of claims 1 to 7.
10. A computer-readable storage medium, having stored thereon a code clone detection program, which when executed by a processor implements the steps of a code clone detection method according to any one of claims 1 to 7.
CN202211101209.8A 2022-09-08 2022-09-08 Code clone detection method and device, terminal equipment and readable storage medium Pending CN115951891A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211101209.8A CN115951891A (en) 2022-09-08 2022-09-08 Code clone detection method and device, terminal equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211101209.8A CN115951891A (en) 2022-09-08 2022-09-08 Code clone detection method and device, terminal equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN115951891A true CN115951891A (en) 2023-04-11

Family

ID=87281495

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211101209.8A Pending CN115951891A (en) 2022-09-08 2022-09-08 Code clone detection method and device, terminal equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN115951891A (en)

Similar Documents

Publication Publication Date Title
CN111177184A (en) Structured query language conversion method based on natural language and related equipment thereof
CN106843840B (en) Source code version evolution annotation multiplexing method based on similarity analysis
US10482170B2 (en) User interface for contextual document recognition
WO2006136055A1 (en) A text data mining method
CN103544298A (en) Log analysis method and analysis device for component
CN111159982A (en) Document editing method and device, electronic equipment and computer readable storage medium
CN112579466A (en) Test case generation method and device and computer readable storage medium
CN113419721B (en) Web-based expression editing method, device, equipment and storage medium
CN106570095B (en) XML data operation method and equipment
CN111611788B (en) Data processing method and device, electronic equipment and storage medium
CN109325217B (en) File conversion method, system, device and computer readable storage medium
CN110134397A (en) Code snippet interpretation method, device, computer equipment and storage medium
CN117473984A (en) Method and system for dividing txt document content chapters
CN115951891A (en) Code clone detection method and device, terminal equipment and readable storage medium
CN102209279A (en) Extensible markup language (XML)-based multi-language support method
CN115203494A (en) Text-oriented time information extraction method and device
CN112035440B (en) Knowledge base management method, device, electronic equipment and storage medium
CN111460141B (en) Text processing method and device and electronic equipment
CN113805861A (en) Code generation method based on machine learning, code editing system and storage medium
CN113435217A (en) Language test processing method and device and electronic equipment
CN113723082A (en) Method and device for detecting Chinese pinyin from text
CN114035726B (en) Method and system for robot flow automatic page element identification process
CN110618809B (en) Front-end webpage input constraint extraction method and device
US20210295031A1 (en) Automated classification and interpretation of life science documents
CN115964051A (en) Multilingual entry detection method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination