Disclosure of Invention
The invention aims to provide a deep code search method, system, and device based on code structure semantic information, addressing the shortcomings of the prior art. The invention addresses two problems of existing methods: the mismatch between the grammar of code and that of natural language, and the difficulty of understanding the semantics of user queries.
The purpose of the invention is achieved by the following technical solution: a deep code search method based on code structure semantic information, comprising the following steps:
(1) Code data in a software project is obtained, and a code search data set is generated using a word segmentation model.
(2) The code search data set is preprocessed.
(3) The deep code search model is optimized and tested using the preprocessed data set.
(4) Code search is implemented based on the deep code search model.
Further, step (1) is specifically as follows: the scope of Java software projects is determined, and Java methods and their related comments are extracted from the Java files in the projects; the parsed data set is used to train a BPE word segmentation model; and the trained word segmentation model is used to segment the data in the data set, forming the code search data set.
Further, in step (2), the data preprocessing includes: taking the first paragraph of the comment of each Java method as its natural language comment, and omitting the comments on parameters and return values; removing Java methods that do not contain any API, as well as duplicate Java methods; converting each Java method into an abstract syntax tree using the code parsing tool JavaParser, and extracting an API sequence from the abstract syntax tree using a depth-first traversal strategy, so that each Java method in the preprocessed data set comprises an abstract syntax tree and an API sequence; and dividing the preprocessed data set into a training set, a validation set, and a test set, which are used to optimize and test the deep code search model in step (3).
Further, in step (3), the deep code search model uses three attention-based long short-term memory (LSTM) networks to construct a code structure information encoding module, a code semantic information encoding module, and a natural language encoding module, respectively.
To optimize the parameters of the three LSTM networks, the three modules take as input, respectively, the abstract syntax tree, the API sequence, and the related comments of the Java methods in the training data from step (2). An information fusion module is then constructed to fuse the vectors output by the code structure information encoding module and the code semantic information encoding module. Finally, a similarity matching module is constructed to calculate the cosine similarity between the fused vector and the output vector of the natural language encoding module; the loss function of the deep code search model is then calculated, and the parameters of the three encoding modules are optimized.
The optimization process stops when either of the following conditions is met: the number of optimization iterations exceeds the total number of Java methods in the training set, or the performance of the model on the validation set converges.
Further, step (4) is specifically as follows: the natural language query and the Java methods are taken as input, the similarity between the query and each Java method is calculated, the set of Java methods is reordered according to the similarity, and the finally ordered Java methods are output.
A deep code search system based on code structure semantic information comprises an offline end and an online end.
The offline end is responsible for parsing Java files, constructing the structured data set, training the deep code search model, and related functions.
The online end interacts with the user: it provides a web search entry and search result presentation, and records the user's behavior when viewing search results for statistical analysis and subsequent further optimization of the deep code search model.
Further, the offline end comprises a data parsing module, a data storage module, and a model training module.
The data parsing module extracts Java methods and their corresponding comments from Java files, parses each Java method into an abstract syntax tree, prunes the tree according to its depth, extracts the method name and method body from the parsed abstract syntax tree using a regular-expression-based method, and traverses the abstract syntax tree to extract an API sequence.
The data storage module supports large-scale computing and storage, supports multiple computing types, provides fine-grained permission management, sandbox protection, and data monitoring functions, and stores the parsed data in JSON format for training the deep code search model.
The model training module supports various machine learning and deep learning calculation frameworks, including a streaming calculation framework, a deep learning framework and a calculation engine.
Further, the online end comprises a core service layer, a deep code search model, and a data persistence module.
The main function of the core service layer is to complete the code search task requested through the user interaction interface by using the deep code search model, and to store the user's behavior when viewing search results in the data persistence module. The core service layer is built with Spring Boot and provides four functions: ElasticSearch, the elastic algorithm service, user authentication, and user behavior analysis. ElasticSearch is a high-performance search engine built on the Lucene library and is used to store the code search data set. The elastic algorithm service uses Docker to expose the code search function of the deep code search model as a service through a RESTful API. When the core service layer receives a natural language query request from the user interaction interface, user authentication filters out unauthorized requests; for an authorized request, ElasticSearch retrieves the set of Java methods related to the natural language query from the code search data set, the code search service in the elastic algorithm service reorders that set, and the ordered result is finally sent to the user interaction interface. The user's behavior when viewing the search results is fed back to the core service layer and recorded in the data persistence module, where the records can be viewed and statistically analyzed through user behavior analysis.
Further, the core service layer provides a caching mechanism for the user interaction interface to relieve server load and respond quickly to user searches. The functional modules in the core service layer and the dependencies among them are displayed using graphical techniques to form an interactive module display interface, which supports dynamically adjusting the dependencies between modules by dragging module positions with the mouse.
A computer device comprises a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the above method or system.
The beneficial effects of the invention are as follows: the invention makes efficient use of the existing code and comment text in large-scale code repositories, avoids repeated development work, and introduces a deep code search model to improve code reuse. Using code structure information, an abstract syntax tree and an API sequence are constructed for each code fragment (for example, method-level code); a code structure information encoding module, a code semantic information encoding module, an information fusion module, and a natural language encoding module are then constructed. This improves the ability of the code search model to understand both the structural semantic information of code and natural language description text, and thereby improves the search quality and performance of the code search system.
Detailed Description
In order to more clearly explain the present invention, the present invention is further explained below with reference to the embodiments and the drawings. It is to be understood by persons skilled in the art that the following detailed description is illustrative and not restrictive, and is not to be taken as limiting the scope of the invention.
As shown in fig. 1, an embodiment of the deep code search method based on code structure semantic information according to the present invention specifically includes the following steps:
101: code data is acquired and a code search data set is generated using a word segmentation model.
First, the scope of data selection is determined; for example, the Java code files within an enterprise are selected, and pairs of Java methods (method-level Java code) and their related comments are extracted from these files as the data set. A BPE (Byte-Pair Encoding) word segmentation model is then trained on the words in the data set, and the trained BPE model is used to segment the data set, forming the code search data set.
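The BPE training and segmentation step can be illustrated with a minimal sketch. The source does not specify the BPE implementation, vocabulary size, or number of merges; the function names `learn_bpe`, `apply_merge`, and `segment`, the end-of-word marker `</w>`, and the tie-breaking rule are assumptions made only for illustration.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in vocab.items():
            merged[apply_merge(symbols, best)] += freq
        vocab = merged
    return merges


def segment(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


merges = learn_bpe(["read", "reader", "readFile", "read"], num_merges=3)
print(segment("readLine", merges))
```

Because merges are learned from frequent character pairs, an unseen identifier such as `readLine` is still split into known subword units rather than mapped to an out-of-vocabulary token.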
102: the code search data set generated in step 101 is subjected to data preprocessing.
For the deep learning model to achieve good results, the data set must be preprocessed effectively. The first paragraph of the comment related to each Java method is taken as the natural language comment, and any content in it that relates to the parameters and return value of the Java method is removed. Java methods that do not contain any API, as well as duplicate Java methods, are removed.
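A minimal sketch of this cleaning step is given below, assuming Javadoc-style comments. The helper names `first_paragraph`, `naive_has_api`, and `preprocess` are hypothetical, and the regular-expression API check is only a crude stand-in for the AST-based API extraction described in the source.

```python
import re


def first_paragraph(javadoc):
    """Return the leading description of a Javadoc comment, without tag lines."""
    lines = []
    for raw in javadoc.splitlines():
        line = re.sub(r"\*/\s*$", "", raw)           # drop a closing "*/"
        line = re.sub(r"^\s*(/\*\*|\*)?", "", line)  # drop a leading "/**" or "*"
        line = line.strip()
        if line.startswith("@"):
            break      # @param, @return, ... start the tag section
        if not line and lines:
            break      # a blank line ends the first paragraph
        if line:
            lines.append(line)
    return " ".join(lines)


def naive_has_api(code):
    """Crude stand-in (assumption): a method call or object creation counts as an API use."""
    return re.search(r"\.\w+\s*\(|\bnew\s+\w+", code) is not None


def preprocess(pairs, has_api=naive_has_api):
    """Filter (java_method_source, javadoc) pairs and shorten the comments."""
    seen, kept = set(), []
    for code, doc in pairs:
        key = " ".join(code.split())  # whitespace-normalized source as duplicate key
        if key in seen or not has_api(code):
            continue
        seen.add(key)
        kept.append((code, first_paragraph(doc)))
    return kept
```

Duplicates are detected here on whitespace-normalized source text; the source does not say how duplicates are defined, so a stricter or looser key may be appropriate.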
To obtain the structural semantic information of a Java method, the code parsing tool JavaParser is used to convert the Java method into an abstract syntax tree, which expresses the structural semantics of the method. Each part of the abstract syntax tree is then traversed using a depth-first traversal strategy, and the APIs in the tree are extracted to generate an API sequence that expresses the key programming semantics of the Java method. Each Java method in the data set is thus represented by two parts: an abstract syntax tree and an API sequence.
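The depth-first extraction can be sketched on a toy tree. The real system parses Java with JavaParser; the nested-dictionary node layout used here (`type`, `name`, `scope`, `children`), the node type names, and the `Class.new` / `Class.method` naming of APIs are assumptions for illustration only. The optional `max_depth` argument mimics pruning the tree by depth.

```python
def extract_api_sequence(node, max_depth=None, _depth=0):
    """Collect API calls in depth-first, pre-order from a toy AST node."""
    if max_depth is not None and _depth > max_depth:
        return []  # nodes below the depth limit are pruned
    apis = []
    if node["type"] == "ObjectCreation":
        apis.append(node["name"] + ".new")
    elif node["type"] == "MethodCall":
        apis.append(node["scope"] + "." + node["name"])
    for child in node.get("children", []):
        apis.extend(extract_api_sequence(child, max_depth, _depth + 1))
    return apis


# Toy AST for a method body equivalent to:
#   BufferedReader r = new BufferedReader(new FileReader(path));
#   String s = r.readLine();
#   r.close();
toy_ast = {
    "type": "MethodDeclaration",
    "name": "readFirstLine",
    "children": [
        {
            "type": "ObjectCreation",
            "name": "BufferedReader",
            "children": [{"type": "ObjectCreation", "name": "FileReader"}],
        },
        {"type": "MethodCall", "scope": "BufferedReader", "name": "readLine"},
        {"type": "MethodCall", "scope": "BufferedReader", "name": "close"},
    ],
}

print(extract_api_sequence(toy_ast))
```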
Finally, the parsed data set is divided into a training set, a validation set, and a test set.
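A minimal sketch of such a split follows. The source does not give the split ratios or say whether the data is shuffled; the 80/10/10 proportions, the fixed seed, and the name `split_dataset` are assumptions.

```python
import random


def split_dataset(samples, val_ratio=0.1, test_ratio=0.1, seed=0):
    """Shuffle the samples and split them into train, validation, and test lists."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_val = int(len(items) * val_ratio)
    n_test = int(len(items) * test_ratio)
    val = items[:n_val]
    test = items[n_val:n_val + n_test]
    train = items[n_val + n_test:]
    return train, val, test
```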
103: The deep code search model is optimized and tested using the data set preprocessed in step 102.
As shown in fig. 2, the deep code search model mainly includes a code encoding module, a natural language encoding module 207, and a similarity matching module 208. The code encoding module comprises a code structure information encoding module 203, a code semantic information encoding module 204, and an information fusion module 205. The code structure information encoding module 203, the code semantic information encoding module 204, and the natural language encoding module 207 are each formed by an attention-based Long Short-Term Memory (LSTM) network.
To optimize the deep code search model, the parameters of the LSTM networks in the code structure information encoding module 203, the code semantic information encoding module 204, and the natural language encoding module 207 are randomly initialized. The abstract syntax tree 201, the API sequence 202, and the related comment information (query/natural language annotation 206) parsed from a Java method serve as the inputs of modules 203, 204, and 207, respectively, and each module outputs a vector.
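How an attention mechanism turns a sequence of LSTM hidden states into one output vector can be sketched as follows. The source says only that the networks are attention-based; the dot-product scoring against a query vector (which would be a learned parameter in practice) and the name `attention_pool` are assumptions, and the LSTM itself is omitted.

```python
import math


def attention_pool(hidden_states, query):
    """Return the attention-weighted sum of hidden states and the attention weights."""
    # Score each hidden state by its dot product with the query vector.
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the hidden states, one output value per dimension.
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return pooled, weights
```

Each encoding module would apply such pooling to its own hidden states, so tokens or nodes that score higher against the query contribute more to the final vector.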
The information fusion module 205 fuses the vectors output by the code structure information encoding module 203 and the code semantic information encoding module 204 into a single vector of the same size as the output of the natural language encoding module 207, so as to better represent the structural semantic information contained in the Java method.
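One way to fuse two vectors into a vector of a prescribed size is to concatenate them and apply a learned linear projection. The source does not state the fusion operation; concatenation followed by a `tanh`-activated projection, and the names `fuse`, `weight`, and `bias`, are assumptions.

```python
import math


def fuse(struct_vec, sem_vec, weight, bias):
    """Concatenate two code vectors and project them to len(bias) dimensions.

    `weight` has one row per output dimension; each row has
    len(struct_vec) + len(sem_vec) entries.
    """
    concat = list(struct_vec) + list(sem_vec)
    return [
        math.tanh(sum(w * x for w, x in zip(row, concat)) + b)
        for row, b in zip(weight, bias)
    ]
```

The number of rows in `weight` would be chosen to match the output size of the natural language encoding module, so that the fused code vector and the query vector can be compared directly.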
The two vectors output by the natural language encoding module 207 and the information fusion module 205 are sent to the similarity matching module 208, which calculates the similarity between them using cosine similarity. The loss function of the deep code search model is then calculated, and the parameters of the LSTM networks in the code structure information encoding module 203, the code semantic information encoding module 204, and the natural language encoding module 207 are optimized.
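The similarity computation, together with one plausible loss, can be sketched as follows. Cosine similarity is taken from the source; the loss is not named there, so the pairwise margin ranking loss below, the margin value, and the use of a mismatched (negative) query are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def margin_ranking_loss(code_vec, pos_query_vec, neg_query_vec, margin=0.05):
    """Assumed loss: the matching query should score at least `margin` above a mismatched one."""
    return max(0.0, margin - cosine(code_vec, pos_query_vec) + cosine(code_vec, neg_query_vec))
```

Under this assumed loss, a training triple contributes zero loss once the correct description is already ranked above the incorrect one by the margin, so optimization focuses on pairs the model still confuses.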
During the iterative optimization of the model, the training set is used as training data and the validation set is used to monitor the performance of the model. Iteration stops when the training data is exhausted, that is, when the number of iterations exceeds the total number of Java methods in the training set, or when the performance of the model on the validation set converges. The performance of the trained deep code search model on the test set is taken as the final evaluation result of the model.
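The two stopping conditions can be sketched as a training loop. The source does not define convergence; treating it as "no improvement larger than `min_delta` for `patience` consecutive evaluations" is an assumption, as are the names `train`, `step_fn`, `validate_fn`, and `eval_every`.

```python
def train(num_train_samples, step_fn, validate_fn, eval_every=100, patience=3, min_delta=1e-4):
    """Run optimization steps until the data is exhausted or validation stops improving.

    Returns the number of steps performed and the best validation score seen.
    """
    best, stale, step = float("-inf"), 0, 0
    # Condition 1: the iteration count is bounded by the size of the training set.
    while step < num_train_samples:
        step_fn(step)  # one optimization step on the training data
        step += 1
        if step % eval_every == 0:
            score = validate_fn()  # higher is better
            if score > best + min_delta:
                best, stale = score, 0
            else:
                stale += 1
                # Condition 2: validation performance has converged.
                if stale >= patience:
                    break
    return step, best
```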
104: A search system is implemented based on the deep code search model.
As shown in fig. 3, an embodiment of the deep code search system based on code structure semantic information according to the present invention mainly includes two parts: an offline end 301 and an online end 306.
The offline end 301 is divided into three parts: a data parsing module 302, a data storage module 303, and a model training module 304. The deep code search system of the present invention manages the code repository 305 of an enterprise using Git.
The data parsing module 302 is mainly responsible for parsing Java files. Using the code parsing tool JavaParser, it extracts Java methods and their corresponding comments from the Java files in the code repository 305. It parses each Java method into an abstract syntax tree, prunes the tree according to its depth, extracts the method name and method body from the parsed abstract syntax tree using a regular-expression-based method, and traverses the abstract syntax tree to extract an API sequence. The data parsing module 302 thus represents each Java method as structured data comprising the Java method source code, comment, method name, method body, abstract syntax tree, and API sequence. The data storage module 303 saves, in JSON format, all the structured data about Java methods that the data parsing module 302 has parsed from the code repository 305.
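The structured record and its JSON serialization might look like the following sketch. The six fields follow the list in the text, but the field names, the one-record-per-line layout, and the nested-dictionary form of the abstract syntax tree are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class MethodRecord:
    """One parsed Java method (field names are hypothetical)."""
    source: str
    comment: str
    method_name: str
    method_body: str
    ast: dict
    api_sequence: List[str] = field(default_factory=list)


def to_json_line(record):
    """Serialize a record as a single JSON line."""
    return json.dumps(asdict(record), ensure_ascii=False)


def from_json_line(line):
    """Rebuild a record from a JSON line produced by to_json_line."""
    return MethodRecord(**json.loads(line))
```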
All data of the data storage module 303 resides on the server side; the module is a cloud-server-based solution. The data storage module 303 supports large-scale computing and storage, supports multiple computing types such as SQL, MapReduce, and UDF, and provides fine-grained permission management, sandbox protection, and data monitoring functions.
The model training module 304 trains the deep code search model using the Java methods and comment data stored in the data storage module 303. The model training module 304 is an artificial intelligence platform that provides a one-stop deep learning solution and supports various machine learning and deep learning computing frameworks, such as the stream computing framework Flink, the deep learning framework TensorFlow, and the computing engine Spark.
The online end 306 mainly includes three modules: a core service layer 314, a deep code search model 311, and a data persistence module 312. The online end 306 interacts with the user through the user interaction interface 313, providing a web search entry, low-latency presentation of search results, and user behavior tracking points. It records how users use the search system, including the user ID, search time, operating system, browser and version number, the natural language description entered, and which search results were accessed. These data are used to statistically analyze users' code search behavior and provide training data for further model optimization.
The main task of the online end 306 is to implement the functions required by the user interaction interface 313 through the back-end core service layer 314. The core service layer 314 is built with Spring Boot and provides four functions: user authentication 307, user behavior analysis 309, ElasticSearch 308, and the elastic algorithm service 310.
ElasticSearch 308 is a high-performance search engine whose underlying implementation relies on the Lucene library; it imports the code search data set, that is, the parsed Java methods and their related comments, from the data storage module 303. The back-end core service layer 314 receives the natural language query request from the user interaction interface 313, and user authentication 307 determines whether the request is authorized. When the request is authorized, the back-end core service layer 314 uses ElasticSearch 308 to retrieve the set of Java methods related to the natural language query from the code search data set.
Then, the elastic algorithm service 310 uses Docker to expose the code search function of the deep code search model 311 as a service. The service takes the natural language query and the Java methods as input, calculates the similarity between the query and each Java method (that is, the output of the similarity matching module 208 in the deep code search model), reorders the set of Java methods by similarity in descending order, and outputs the finally ordered Java methods.
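The request flow of authentication, candidate retrieval, and reranking can be sketched with the three stages passed in as plain functions. The name `handle_query`, the callables `is_authorized`, `retrieve`, and `score`, the `top_k` cutoff, and the `None` return for rejected requests are all assumptions; in the described system these stages are performed by user authentication 307, ElasticSearch 308, and the elastic algorithm service 310.

```python
def handle_query(user, query, is_authorized, retrieve, score, top_k=10):
    """Filter unauthorized requests, retrieve candidates, and rerank them by model score."""
    if not is_authorized(user):
        return None  # unauthorized requests are filtered out
    candidates = retrieve(query)  # coarse retrieval from the search index
    # Rerank by similarity, highest first.
    ranked = sorted(candidates, key=lambda method: score(query, method), reverse=True)
    return ranked[:top_k]
```

Splitting the work this way lets the search index narrow a large data set to a small candidate set, so the comparatively expensive neural similarity is computed only for those candidates.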
The process of exposing the service comprises creating a Docker container, deploying the deep code search model 311 to the Docker container, optimizing the deep code search model 311 using the code search data set in ElasticSearch 308, and providing a code search function to the back-end core service layer 314 through a RESTful API.
The elastic algorithm service 310 uses Docker to host the algorithm service because Docker offers good environment portability and scalability. On the one hand, with Docker a model environment can be built quickly on a local machine, where the algorithm model is deployed and debugged; once local debugging passes, the local environment image can be uploaded directly to a server for deployment. On the other hand, when the volume of user requests exceeds the load limit of the current Docker service, the elastic algorithm service 310 can rapidly deploy a new Docker service on a newly added server to improve the response speed to user requests.
After the search is completed, the back-end core service layer 314 sends the search result of the elastic algorithm service 310 to the user interaction interface 313, records the user's viewing behavior on the search results in the user interaction interface 313, and stores these data on the server disk through the data persistence module 312 for subsequent further optimization of the deep code search model 311. The data stored by the data persistence module 312 can be viewed and statistically analyzed through user behavior analysis 309.
In addition, it has been found in practice that users often issue the same natural language query several times in succession when searching for a method with a single function, so the core service layer 314 also provides caching to relieve server load and respond quickly to user searches. The deep code search system is updated iteratively over time. To allow the logical structure of the system to be adjusted conveniently and flexibly, the modules of the online end 306 and the dependencies among them are displayed in a background management interface using graphical techniques, forming an interactive module display interface. The interface supports dynamically adjusting the dependencies between modules by dragging module positions with the mouse.
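The caching behavior for repeated queries can be sketched with an in-memory least-recently-used cache. The source does not describe the cache implementation; the use of `functools.lru_cache`, the whitespace normalization of queries, the `max_entries` size, and the name `make_cached_search` are assumptions.

```python
from functools import lru_cache


def make_cached_search(search_fn, max_entries=1024):
    """Wrap a search function so that repeated queries are answered from memory."""

    @lru_cache(maxsize=max_entries)
    def cached(normalized_query):
        # Results are stored as tuples so cached values cannot be mutated by callers.
        return tuple(search_fn(normalized_query))

    def search(query):
        # Collapse runs of whitespace so trivially different queries share one entry.
        return cached(" ".join(query.split()))

    return search
```

A production cache would also need an expiry policy so that results are refreshed after the code search data set or the model is updated; that is outside this sketch.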
The experimental environment of the deep code search system based on the code structure semantic information is as follows:
Operating system: Ubuntu 16.04.1 LTS
CPU: Intel(R) Xeon(R) Gold 6226 CPU @ 2.70GHz
GPU:NVIDIA GeForce RTX 2080Ti
Memory: 64GB
Hard disk: 4TB
Programming language: Python 3.7
Anaconda:5.3.0
TensorFlow:2.3.0
Docker:19.03.4
As shown in fig. 4, an embodiment of a terminal device or a server of the present invention includes a removable medium 401, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, which is mounted on a drive 402 as necessary so that a computer program read from it is installed into a storage section 406 as necessary. The drive 402 is also connected to the I/O interface 407 as necessary. The following components are connected to the I/O interface 407: an input section 404, which accepts control information from, for example, a keyboard and a mouse; an output section 405 including a cathode ray tube (CRT) or other display, a speaker, and the like; a storage section 406 including a hard disk and the like; and a communication section 403 including a network interface card such as a LAN card, a modem, or the like. The communication section 403 performs communication processing via a network such as the Internet.
The central processing unit (CPU) 409 can perform various appropriate actions and processes in accordance with a program stored in the read-only memory (ROM) 410 or a program loaded from the storage section 406 into the random access memory (RAM) 411. The RAM 411 also stores various programs and data necessary for the operation of the terminal device or the server. The CPU 409, ROM 410, and RAM 411 are connected to each other via a bus 408. An input/output (I/O) interface 407 is also connected to the bus 408.
In particular, the processes described above with reference to the flowcharts may be implemented as computer application software according to the embodiments of the present disclosure. For example, the disclosed embodiments of the invention include a computer application comprising a computer program embodied on a computer readable medium, the computer program comprising program code for executing the system shown in FIG. 2. In the present embodiment, the computer program can be downloaded and installed from a network through the communication section 403, and/or installed from the removable medium 401. The above-described functions defined in the system of the present invention are executed when the computer program is executed by the central processing unit CPU 409.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a Random Access Memory (RAM), a read-only memory (ROM), a computer diskette, a hard disk, an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The above-described embodiments are intended to illustrate rather than limit the invention. Those skilled in the art can make various modifications and improvements without departing from the spirit of the invention, and the invention is not limited to the details of the illustrative embodiments set forth herein. Any modification or variation made within the spirit of the present invention and the scope of the claims falls within the scope of protection of the present invention.