CN113761163A - Deep code search method, system and device based on code structure semantic information - Google Patents

Publication number
CN113761163A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110946937.8A
Other languages
Chinese (zh)
Other versions
CN113761163B (en)
Inventor
刘超
夏鑫
李博奥
张洋
张昕东
杨小虎
王新宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN202110946937.8A
Publication of CN113761163A
Application granted
Publication of CN113761163B
Active (current legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/42 Syntactic analysis
    • G06F8/425 Lexical analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/43 Checking; Contextual analysis
    • G06F8/436 Semantic checking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses a deep code search method, system and device based on code structure semantic information. The method comprises: extracting method-level code and related comments from the code files of a target software project to form a data set; training a word segmentation model and generating a code search data set; preprocessing the code search data set by parsing the method-level code into abstract syntax trees, traversing the abstract syntax trees and extracting API sequences; building a deep code search model comprising a code structure information encoding module, a code semantic information encoding module, an information fusion module and a natural language encoding module; optimizing the deep code search model using the preprocessed data set; and using the deep code search model to obtain, from the code search data set, the method-level code corresponding to a natural language query. The method of the invention can effectively improve the code search model's understanding of code structure semantic information and natural language description text, and improve the search effect and performance of the code search system.

Description

Deep code searching method, system and device based on code structure semantic information
Technical Field
The invention belongs to the field of code search, and particularly relates to a deep code search method, a deep code search system and a deep code search device based on code structure semantic information.
Background
With the development of software engineering and internet technology, the volume of code accumulated by IT enterprises keeps growing. Such a large body of code contains a great deal of reusable code. How to exploit the existing code and comment text in massive code repositories, so that programmers avoid repeated development work and existing code is better utilized, is an important problem in the field of intelligent software engineering. For large internet or software companies, considerations such as security and intellectual property create a growing demand for code search systems intended for internal use. Some companies have developed code search systems for open source or enterprise-internal code repositories, most of which are based on keyword matching and information retrieval techniques. The accuracy of these systems is often unsatisfactory, for reasons including:
(1) code search systems lack the ability to understand users' queries;
(2) the semantic information expressed by a natural language query does not match the structured information contained in the code;
(3) source code, written in a highly structured programming language, and natural language, which follows a complex grammar, are heterogeneous.
Thus, an effective natural language code search engine needs to establish a higher-level semantic mapping between code and natural language queries, so that it can understand the semantics of both the query and the source code and thereby improve the accuracy of code search.
Disclosure of Invention
The invention aims to provide a deep code search method, system and device based on code structure semantic information that address the defects of the prior art. The invention addresses the grammatical differences between code and natural language and the difficulty conventional methods have in understanding the semantics of user queries.
The purpose of the invention is achieved by the following technical scheme: a deep code search method based on code structure semantic information comprises the following steps:
(1) Code data in a software project are obtained, and a code search data set is generated using a word segmentation model.
(2) The code search data set is preprocessed.
(3) The deep code search model is optimized and tested using the preprocessed data set.
(4) Code search is implemented based on the deep code search model.
Further, step (1) is specifically as follows: determining the scope of Java software projects, and extracting Java methods and their related comments from the Java files in the projects; using the extracted data set to train a BPE word segmentation model; and segmenting the data in the data set with the trained model to form the code search data set.
Further, in step (2), the data preprocessing includes: taking the first paragraph of the comment of each Java method as its natural language comment, omitting the comments on parameters and return values; removing Java methods that do not contain any API as well as duplicate Java methods; converting each Java method into an abstract syntax tree using the code parsing tool JavaParser, and extracting an API sequence from the abstract syntax tree using a depth-first traversal strategy, so that each preprocessed Java method in the data set comprises an abstract syntax tree and an API sequence; and dividing the preprocessed data set into a training set, a validation set and a test set for optimizing and testing the deep code search model in step (3).
Further, in step (3), the deep code search model uses three attention-based long short-term memory (LSTM) networks to construct a code structure information encoding module, a code semantic information encoding module and a natural language encoding module.
To optimize the parameters of the three LSTM networks, the three modules take as input, respectively, the abstract syntax tree, the API sequence and the related comment of each Java method in the training data from step (2). An information fusion module is then constructed to fuse the vectors output by the code structure information encoding module and the code semantic information encoding module. Finally, a similarity matching module is constructed to calculate the cosine similarity between the fused vector and the output vector of the natural language encoding module; the loss function of the deep code search model is computed from it, and the parameters of the three encoding modules are optimized.
The optimization process stops when the number of optimization iterations exceeds the total number of Java methods in the training set, or when the effect of the model on the validation set converges.
Further, step (4) is specifically as follows: taking a natural language query and Java methods as input, calculating the similarity between the query and each Java method, reordering the set of Java methods by similarity, and outputting the Java methods in their final order.
A deep code search system based on code structure semantic information comprises an offline end and an online end.
The offline end is responsible for functions such as parsing Java files, constructing a structured data set and training the deep code search model.
The online end interacts with the user: it provides a web search entry and presents search results, and it records the user's behavior when viewing search results for statistical analysis and subsequent further optimization of the deep code search model.
Further, the offline end comprises a data parsing module, a data storage module and a model training module.
The data parsing module extracts Java methods and their corresponding comments from Java files, parses each Java method into an abstract syntax tree, prunes the tree according to its depth, extracts the method name and method body from the parsed abstract syntax tree using a regular-expression-based method, and traverses the abstract syntax tree to extract an API sequence.
The data storage module supports large-scale computing and storage and multiple computing types, provides fine-grained permission management, sandbox protection and data monitoring, and stores the parsed data in JSON format for training the deep code search model.
The model training module supports various machine learning and deep learning computing frameworks, including a stream computing framework, a deep learning framework and a computing engine.
Further, the online end comprises a core service layer, a deep code search model and a data persistence module.
The main functions of the core service layer are to complete the code search tasks requested through the user interaction interface using the deep code search model, and to store the user's behavior when viewing search results in the data persistence module. The core service layer uses Spring Boot to build four functional parts: ElasticSearch, an elastic algorithm service, user authentication and user behavior analysis. ElasticSearch is a high-performance search engine built on the Lucene library and is used to store the code search data set. The elastic algorithm service uses Docker to expose the code search function of the deep code search model as a RESTful API. When the back-end core service layer receives a natural language query request from the user interaction interface, user authentication filters out unauthorized requests; for an authorized request, ElasticSearch retrieves the set of Java methods related to the natural language query from the code search data set, the code search service in the elastic algorithm service reorders that set, and the ordered result is finally sent to the user interaction interface. The user's behavior when viewing the search results is fed back to the core service layer and stored in the data persistence module, where the records can be viewed and statistically analyzed through user behavior analysis.
Further, the core service layer provides a caching mechanism for the user interaction interface to relieve server load and respond quickly to user searches. A graphical technique is used to display the functional modules in the core service layer and the dependencies among them, forming an interactive module display interface that supports dynamically adjusting the dependencies among modules by dragging the modules with the mouse.
A computer device comprises a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed, implements the method or the system.
The beneficial effects of the invention are as follows: the invention makes efficient use of the existing code and comment text in massive code repositories, avoids repeated development work, and introduces a deep code search model to improve code utilization. Using code structure information, an abstract syntax tree and an API sequence are constructed for each code segment (such as method-level code), and a code structure information encoding module, a code semantic information encoding module, an information fusion module and a natural language encoding module are constructed. This effectively improves the code search model's understanding of code structure semantic information and natural language description text, and thereby improves the search effect and performance of the code search system.
Drawings
The accompanying drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention:
FIG. 1 is a flowchart of a deep code search method based on code structure semantic information according to an embodiment of the present invention;
FIG. 2 is a block diagram of the overall deep code search model according to an embodiment of the present invention;
FIG. 3 is an architecture diagram of a deep code search system based on code structure semantic information according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a computer system of a terminal device or a server for implementing an embodiment of the present invention.
Detailed Description
In order to more clearly explain the present invention, the present invention is further explained below with reference to the embodiments and the drawings. It is to be understood by persons skilled in the art that the following detailed description is illustrative and not restrictive, and is not to be taken as limiting the scope of the invention.
As shown in FIG. 1, an embodiment of the deep code search method based on code structure semantic information of the present invention specifically includes the following steps:
101: Code data is acquired and a code search data set is generated using a word segmentation model.
First, the scope of data selection is determined; for example, the Java code files within an enterprise are selected, and pairs of Java methods (method-level Java code) and their related comments are extracted from them as a data set. A BPE (Byte-Pair Encoding) word segmentation model is then trained on the words in the data set, and the trained BPE model is used to segment the data set, forming the code search data set.
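By way of illustration only, the BPE training and segmentation described above can be sketched as follows. This is the textbook character-level BPE procedure and not necessarily the implementation used in the embodiment; the function names `learn_bpe`, `merge_pair` and `segment`, the end-of-word marker `</w>` and the toy word list are all assumptions introduced for this sketch.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Each distinct word starts as its characters plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["ab", "ab", "ab", "cb"], num_merges=1)
print(segment("ab", merges))
```

In this toy run the pair `("b", "</w>")` occurs four times and is merged first, so `ab` is segmented into `a` and `b</w>`.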
102: The code search data set generated in step 101 is preprocessed.
For the deep learning model to achieve good results, the data set must be preprocessed effectively. The first paragraph of the comment associated with each Java method is taken as its natural language comment, and any content in it that concerns the method's parameters or return value is removed. Java methods that do not contain any API are removed, as are duplicate Java methods.
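The filtering rules above can be sketched as follows. This is a minimal illustration under stated assumptions: the sample layout (dictionaries with `code`, `comment` and `api_sequence` keys), the helper names, and the choice to treat two methods as duplicates when their code is identical after whitespace normalization are hypothetical and are not specified in the embodiment.

```python
import re


def first_paragraph(javadoc):
    """Return the first paragraph of a Javadoc comment, stopping before any tag line."""
    lines = []
    for raw in javadoc.splitlines():
        line = raw.strip()
        line = re.sub(r"^/\*\*?", "", line)      # opening delimiter
        line = re.sub(r"\*/$", "", line)         # closing delimiter
        line = re.sub(r"^\*", "", line).strip()  # leading asterisk of a comment line
        if line.startswith("@"):
            break                                # @param, @return, ... are omitted
        if not line:
            if lines:
                break                            # a blank line ends the first paragraph
            continue
        lines.append(line)
    return " ".join(lines)


def preprocess(samples):
    """Drop API-free and duplicate methods; keep only the first comment paragraph."""
    seen, kept = set(), []
    for sample in samples:
        if not sample["api_sequence"]:
            continue                             # method contains no API
        key = " ".join(sample["code"].split())   # assumption: whitespace-insensitive duplicates
        if key in seen:
            continue
        seen.add(key)
        kept.append({**sample, "comment": first_paragraph(sample["comment"])})
    return kept
```

For a comment whose first paragraph is followed by a blank comment line and `@param` / `@return` tags, `first_paragraph` returns only the leading description.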
To obtain the structural semantic information in a Java method, the code parsing tool JavaParser is used to convert the Java method into an abstract syntax tree, which expresses the method's structural semantics. Each part of the abstract syntax tree is traversed with a depth-first traversal strategy, the APIs in the tree are extracted, and an API sequence is generated to express the key programming semantics of the Java method. Thus, each Java method in the data set is represented by two parts: an abstract syntax tree and an API sequence.
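A depth-first extraction of this kind can be sketched on a toy tree. The embodiment uses JavaParser on real Java code; the dictionary-based node layout below, the `receiver_type` and `class` fields, the `Type.method` / `Type.new` output format and the pre-order emission order are assumptions made only for illustration.

```python
def extract_api_sequence(root):
    """Collect API calls from a toy AST in depth-first (pre-order) order."""
    apis = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node["type"] == "ObjectCreationExpr":
            apis.append(node["class"] + ".new")
        elif node["type"] == "MethodCallExpr":
            apis.append(node["receiver_type"] + "." + node["name"])
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(node.get("children", [])))
    return apis


# Hypothetical tree for a method that opens a reader, reads a line and closes it.
tree = {
    "type": "MethodDeclaration",
    "children": [
        {"type": "ObjectCreationExpr", "class": "FileReader", "children": []},
        {
            "type": "MethodCallExpr",
            "receiver_type": "BufferedReader",
            "name": "readLine",
            "children": [
                {"type": "MethodCallExpr", "receiver_type": "String", "name": "trim"},
            ],
        },
        {"type": "MethodCallExpr", "receiver_type": "BufferedReader", "name": "close"},
    ],
}

print(extract_api_sequence(tree))
```

An explicit stack is used instead of recursion so that deeply nested trees do not hit Python's recursion limit.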
Finally, the parsed data set is divided into a training set, a validation set and a test set.
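The split can be sketched as below. The embodiment does not state the split proportions or whether the data are shuffled; the 80/10/10 ratios, the fixed random seed and the function name `split_dataset` are assumptions for illustration.

```python
import random


def split_dataset(samples, val_ratio=0.1, test_ratio=0.1, seed=0):
    """Shuffle and split samples into (train, validation, test) lists."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # deterministic shuffle for reproducibility
    n_test = int(len(items) * test_ratio)
    n_val = int(len(items) * val_ratio)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test
```

The three returned lists are disjoint and together contain every input sample exactly once.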
103: The deep code search model is optimized and tested using the data set preprocessed in step 102.
As shown in FIG. 2, the deep code search model mainly includes a code encoding module, a natural language encoding module 207 and a similarity matching module 208. The code encoding module comprises a code structure information encoding module 203, a code semantic information encoding module 204 and an information fusion module 205. The code structure information encoding module 203, the code semantic information encoding module 204 and the natural language encoding module 207 each consist of an attention-based Long Short-Term Memory (LSTM) network.
To optimize the deep code search model, the parameters of the LSTM networks in the code structure information encoding module 203, the code semantic information encoding module 204 and the natural language encoding module 207 are randomly initialized. The abstract syntax tree 201, the API sequence 202 and the related comment information (query/natural language comment 206) parsed from each Java method are used as the inputs of modules 203, 204 and 207, respectively, and each module outputs a vector.
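The embodiment does not specify the form of the attention mechanism. As one common possibility, the sketch below pools a sequence of hidden states (such as those an LSTM would produce) into a single output vector using dot-product attention against a context vector; the function name `attention_pool`, the dot-product scoring and the learned context vector are assumptions, and the LSTM itself is omitted.

```python
import math


def attention_pool(hidden_states, context):
    """Pool hidden states into one vector with softmax-normalized dot-product weights."""
    # Score each hidden state against the (assumed learnable) context vector.
    scores = [sum(h_i * c_i for h_i, c_i in zip(h, context)) for h in hidden_states]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the hidden states, one output dimension at a time.
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return pooled, weights
```

The returned weights sum to one, so the pooled vector is a convex combination of the hidden states and has the same dimensionality as each of them.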
The information fusion module 205 fuses the vectors output by the code structure information encoding module 203 and the code semantic information encoding module 204 into a single vector of the same size as the output of the natural language encoding module 207, so as to better represent the structural semantic information contained in the Java method.
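The fusion operator is not specified in the embodiment. One simple way to obtain a fused vector of a chosen size is to concatenate the two code vectors and apply a dense layer; the sketch below assumes exactly that, with a `tanh` activation, and the names `fuse`, `weights` and `bias` are hypothetical.

```python
import math


def fuse(structure_vec, semantic_vec, weights, bias):
    """Concatenate two vectors and project them with a dense tanh layer.

    `weights` has one row per output dimension, so the number of rows fixes the
    size of the fused vector (chosen to match the natural language vector).
    """
    concat = list(structure_vec) + list(semantic_vec)
    return [
        math.tanh(sum(w * x for w, x in zip(row, concat)) + b)
        for row, b in zip(weights, bias)
    ]
```

With this design the two code encoders may have different output sizes, since only the number of weight rows has to match the size of the natural language vector.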
The two vectors output by the natural language encoding module 207 and the information fusion module 205 are sent to the similarity matching module 208, which calculates their cosine similarity. The loss function of the deep code search model is computed from this similarity and used to optimize the parameters of the LSTM networks in the code structure information encoding module 203, the code semantic information encoding module 204 and the natural language encoding module 207.
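The cosine similarity step can be written directly. The loss function, however, is not named in the embodiment; the sketch below assumes a margin-based ranking loss that contrasts a matching comment with a non-matching one, which is a common choice for this kind of model. The function `ranking_loss`, the negative sample and the margin value 0.05 are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def ranking_loss(code_vec, pos_text_vec, neg_text_vec, margin=0.05):
    """Assumed loss: push the matching text at least `margin` closer than a mismatch."""
    positive = cosine_similarity(code_vec, pos_text_vec)
    negative = cosine_similarity(code_vec, neg_text_vec)
    return max(0.0, margin - positive + negative)
```

Under this assumed loss, a training triple contributes nothing once the matching comment is already more similar to the code vector than the mismatched comment by at least the margin.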
During the model optimization iterations, the training set serves as training data and the validation set is used to check the effect of the model. Iteration stops when the training data are exhausted, that is, when the number of iterations exceeds the total number of Java methods in the training set, or when the effect of the model on the validation set converges. The performance of the trained deep code search model on the test set is taken as the final evaluation result.
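The two stopping conditions can be sketched as a training loop. The embodiment does not define how convergence on the validation set is detected; the sketch assumes a patience rule (stop after several consecutive evaluations without improvement). The callbacks `step_fn` and `validate_fn` and the parameters `eval_every`, `patience` and `tol` are hypothetical.

```python
def train(train_set, step_fn, validate_fn, eval_every=100, patience=3, tol=1e-4):
    """Run one optimization step per training sample until a stopping condition holds.

    Returns (iterations_run, reason), where reason is "converged" or "data exhausted".
    """
    best, stale, iterations = float("-inf"), 0, 0
    for sample in train_set:
        step_fn(sample)                    # one parameter update (supplied by caller)
        iterations += 1
        if iterations % eval_every == 0:
            score = validate_fn()          # validation metric, higher is better
            if score > best + tol:
                best, stale = score, 0
            else:
                stale += 1
                if stale >= patience:      # assumed convergence criterion
                    return iterations, "converged"
    return iterations, "data exhausted"
```

Because the loop consumes each training sample once, the iteration count can never exceed the number of Java methods in the training set, which corresponds to the first stopping condition.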
104: A search system is implemented based on the deep code search model.
As shown in FIG. 3, an embodiment of the deep code search system based on code structure semantic information of the present invention mainly includes two parts: an offline end 301 and an online end 306.
The offline end 301 is divided into three parts: a data parsing module 302, a data storage module 303 and a model training module 304. The deep code search system of the present invention manages the enterprise code repository 305 using Git.
The data parsing module 302 mainly parses Java files. It extracts Java methods and their corresponding comments from the Java files in the code repository 305 using the code parsing tool JavaParser. It parses each Java method into an abstract syntax tree, prunes the tree according to its depth, extracts the method name and method body from the parsed abstract syntax tree using a regular-expression-based method, and traverses the abstract syntax tree to extract an API sequence. The data parsing module 302 thus represents each Java method as structured data comprising the Java method source code, comment, method name, method body, abstract syntax tree and API sequence. The data storage module 303 saves all the structured data that the data parsing module 302 parses from the code repository 305 in JSON format.
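One way such a structured record could be serialized is shown below. The six fields follow the list in the text, but the key names, the nested-dictionary encoding of the abstract syntax tree, the one-record-per-line layout and the helper `to_json_line` are assumptions; the embodiment only states that the data are stored in JSON format.

```python
import json

# Hypothetical key names for the six parts of a parsed Java method.
REQUIRED_FIELDS = (
    "source_code", "comment", "method_name", "method_body", "ast", "api_sequence",
)


def to_json_line(record):
    """Validate a parsed-method record and serialize it as one JSON line."""
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


record = {
    "source_code": "String first(BufferedReader r) { return r.readLine(); }",
    "comment": "Reads the first line.",
    "method_name": "first",
    "method_body": "{ return r.readLine(); }",
    "ast": {"type": "MethodDeclaration", "children": [{"type": "MethodCallExpr"}]},
    "api_sequence": ["BufferedReader.readLine"],
}
line = to_json_line(record)
```

Writing one JSON object per line keeps the file appendable and lets the training code stream records without loading the whole data set.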
All data of the data storage module 303 resides on a server; it is a solution that relies on a cloud server. The data storage module 303 supports large-scale computing and storage, supports multiple computing types such as SQL, MapReduce and UDF, and provides fine-grained permission management, sandbox protection and data monitoring.
The model training module 304 trains the deep code search model using the Java methods and comment data stored in the data storage module 303. The model training module 304 is an artificial intelligence platform that provides a one-stop deep learning solution and supports various machine learning and deep learning computing frameworks, such as the stream computing framework Flink, the deep learning framework TensorFlow and the computing engine Spark.
The online end 306 mainly includes three modules: a core service layer 314, a deep search model 311 and a data persistence module 312. The online end 306 interacts with the user through the user interaction interface 313, providing a web search entry, low-latency presentation of search results and user behavior tracking points. It records how users use the search system, including the user ID, search time, operating system, browser and version number, the natural language description entered, and which search results were accessed. These data are used to statistically analyze users' code search behavior and provide training data for further model optimization.
The main task of the online end 306 is to implement, in the back-end core service layer 314, the functions required by the user interaction interface 313. The core service layer 314 is built with Spring Boot and provides four functions: user authentication 307, user behavior analysis 309, ElasticSearch 308 and elastic algorithm service 310.
ElasticSearch 308 is a high-performance search engine built on the Lucene library; it imports the code search data set, i.e., the parsed Java methods and their related comments, from the data storage module 303. The back-end core service layer 314 receives a natural language query request from the user interaction interface 313, and user authentication 307 determines whether the request is authorized. If it is, the core service layer 314 uses ElasticSearch 308 to retrieve the set of Java methods related to the natural language query from the code search data set.
Then, the elastic algorithm service 310 uses Docker to expose the code search function of the deep search model 311 as a service. The service takes the natural language query and the Java methods as input, calculates the similarity between the query and each method (i.e., the output of the similarity matching module 208 in the deep search model), reorders the set of Java methods by similarity from largest to smallest, and outputs the Java methods in their final order.
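The request flow just described (authenticate, retrieve candidates, rerank by similarity) can be sketched as a single function. The three callbacks stand in for user authentication 307, ElasticSearch 308 and the model service 310; their signatures, the request layout and the `top_k` cut-off are assumptions, and no real ElasticSearch or Docker API is used here.

```python
def handle_query(request, is_authorized, retrieve, score, top_k=10):
    """Authenticate a request, retrieve candidate methods and rerank them.

    Returns None for unauthorized requests, otherwise the candidates sorted by
    similarity to the query from largest to smallest.
    """
    if not is_authorized(request["user_id"]):
        return None                               # unauthorized requests are filtered out
    candidates = retrieve(request["query"])       # first-stage keyword retrieval
    ranked = sorted(
        candidates,
        key=lambda method: score(request["query"], method),
        reverse=True,                             # most similar first
    )
    return ranked[:top_k]


# Toy stand-ins so the sketch runs on its own.
corpus = ["read file lines", "write file", "sort list"]


def toy_retrieve(query):
    return [m for m in corpus if set(query.split()) & set(m.split())]


def toy_score(query, method):
    return len(set(query.split()) & set(method.split())) / len(set(method.split()))


result = handle_query(
    {"user_id": "alice", "query": "read file"},
    is_authorized=lambda user_id: user_id in {"alice"},
    retrieve=toy_retrieve,
    score=toy_score,
)
print(result)
```

Separating cheap retrieval from the more expensive similarity scoring means the deep model only has to score the retrieved candidates rather than the whole data set.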
Deploying the service involves creating a Docker container, deploying the deep search model 311 into the container, optimizing the deep search model 311 using the code search data set in ElasticSearch 308, and providing a code search function to the back-end core service layer 314 through a RESTful API.
The elastic algorithm service 310 uses Docker to host the algorithm service because Docker offers good environment portability and extensibility. On the one hand, with Docker a model environment can be built quickly on a local machine, where the algorithm model is deployed and debugged; once local debugging passes, the local environment image can be uploaded directly to a server for deployment. On the other hand, when the volume of user requests exceeds the load limit of the current Docker service, the elastic algorithm service 310 can rapidly deploy a new Docker service on a newly added server to improve the response speed to user requests.
After the search is completed, the back-end core service layer 314 sends the search result of the elastic algorithm service 310 to the user interaction interface 313, records the user's viewing behavior on the search results there, and stores these data on the server disk through the data persistence module 312 for subsequent further optimization of the deep search model 311. The data stored by the data persistence module 312 can be viewed and statistically analyzed through user behavior analysis 309.
In addition, practice has shown that users often issue the same natural language query several times in succession when searching for a method with a single function, so the core service layer 314 also provides caching to relieve server load and respond quickly to user searches. The deep code search system is updated iteratively over time; to make it convenient to adjust the logical structure of the system flexibly, the modules in the online end 306 and the dependencies among them are displayed graphically in a back-end management interface, forming an interactive module display interface. The interface supports dynamically adjusting the dependencies among modules by dragging the modules with the mouse.
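The caching behavior for repeated queries can be sketched with a small least-recently-used cache. The embodiment does not describe the cache policy; the LRU eviction, the capacity, the normalization of queries by case and whitespace, and the class name `QueryCache` are assumptions for illustration.

```python
from collections import OrderedDict


class QueryCache:
    """Least-recently-used cache placed in front of a search function."""

    def __init__(self, search_fn, capacity=1024):
        self._search_fn = search_fn
        self._capacity = capacity
        self._entries = OrderedDict()
        self.hits = 0

    def search(self, query):
        key = " ".join(query.lower().split())   # assumed normalization of the query text
        if key in self._entries:
            self._entries.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self._entries[key]
        result = self._search_fn(query)         # cache miss: run the real search
        self._entries[key] = result
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)   # evict the least recently used entry
        return result
```

A repeated query is then answered from memory without invoking retrieval or the model service again, which is the load reduction the text describes.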
The experimental environment of the deep code search system based on code structure semantic information is as follows:
Operating system: Ubuntu 16.04.1 LTS
CPU: Intel(R) Xeon(R) Gold 6226 CPU @ 2.70GHz
GPU: NVIDIA GeForce RTX 2080 Ti
Memory: 64 GB
Hard disk: 4 TB
Programming language: Python 3.7
Anaconda: 5.3.0
TensorFlow: 2.3.0
Docker: 19.03.4
As shown in FIG. 4, an embodiment of a terminal device or a server of the present invention includes a removable medium 401, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, which is mounted on a drive 402 as necessary so that a computer program read from it can be installed into a storage section 406 as necessary. The drive 402 is also connected to the I/O interface 407 as necessary. The following components are also connected to the I/O interface 407: an input section 404 that accepts control information from, for example, a keyboard and a mouse; an output section 405 including a cathode ray tube (CRT) or other display, a speaker and the like; a storage section 406 including a hard disk and the like; and a communication section 403 including a network interface card such as a LAN card, a modem or the like. The communication section 403 performs communication processing via a network such as the internet.
The central processing unit CPU 409 can perform various appropriate actions and processes in accordance with a program stored in the read only memory ROM 410 or a program loaded from the storage section 406 into the random access memory RAM 411. In the RAM 411, various programs and data necessary for the operation of the terminal device or the server are also stored. The CPU 409, ROM 410, and RAM 411 are connected to each other via a bus 408. An input/output I/O interface 407 is also connected to bus 408.
In particular, the processes described above with reference to the flowcharts may be implemented as computer application software according to the embodiments of the present disclosure. For example, the disclosed embodiments of the invention include a computer application comprising a computer program embodied on a computer readable medium, the computer program comprising program code for executing the system shown in FIG. 2. In the present embodiment, the computer program can be downloaded and installed from a network through the communication section 403, and/or installed from the removable medium 401. The above-described functions defined in the system of the present invention are executed when the computer program is executed by the central processing unit CPU 409.
It should be noted that the computer readable medium described in the present invention can be a computer readable signal medium, a computer readable storage medium, or any combination of the two. A computer readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a random access memory (RAM), a read-only memory (ROM), a computer diskette, a hard disk, an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable signal medium, by contrast, may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The above-described embodiments are intended to illustrate rather than limit the invention. Those skilled in the art will be able to make various modifications and improvements without departing from the spirit of the invention, and the invention is not limited to the details of the illustrative embodiments set forth herein. Any modification or variation made within the spirit of the present invention and the scope of the claims falls within the scope of protection of the present invention.

Claims (10)

1. A deep code search method based on code structure semantic information, characterized by comprising: (1) obtaining code data from software projects and generating a code search dataset using a word segmentation model; (2) preprocessing the code search dataset; (3) optimizing and testing a deep code search model using the preprocessed dataset; (4) performing code search based on the deep code search model.

2. The deep code search method based on code structure semantic information according to claim 1, characterized in that step (1) specifically comprises: determining the scope of Java software projects, and extracting Java methods and their associated comments from the Java files in the projects; using the parsed dataset to train a BPE word segmentation model; and segmenting the data in the dataset with the trained word segmentation model to form the code search dataset.

3. The deep code search method based on code structure semantic information according to claim 2, characterized in that, in step (2), the data preprocessing comprises: taking the first paragraph of the comment of each Java method as its natural language comment, omitting parameter comments and return value comments; removing Java methods that do not contain any API as well as duplicate Java methods; converting each Java method into an abstract syntax tree with the code parsing tool Javaparser, and extracting an API sequence from the abstract syntax tree using a depth-limited traversal strategy, so that each preprocessed Java method in the dataset comprises two parts, an abstract syntax tree and an API sequence; and dividing the preprocessed dataset into a training set, a validation set and a test set, which are used in step (3) to optimize and test the deep code search model.

4. The deep code search method based on code structure semantic information according to claim 3, characterized in that, in step (3): the deep code search model uses three attention-based long short-term memory (LSTM) networks to construct a code structure information encoding module, a code semantic information encoding module and a natural language encoding module. To optimize the parameters of the three LSTM networks, the three modules take as input, respectively, the abstract syntax trees, the API sequences and the associated comments of the Java methods in the training data from step (2); an information fusion module is then constructed to fuse the vectors output by the code structure information encoding module and the code semantic information encoding module; finally, a similarity matching module is constructed to compute the cosine similarity between the fused vector and the output vector of the natural language encoding module, compute the loss function of the deep code search model, and optimize the parameters of the three encoding modules. The optimization stops when the number of optimization iterations exceeds the total number of Java methods in the training set, or when the performance of the model on the validation set converges.

5. The deep code search method based on code structure semantic information according to claim 4, characterized in that step (4) specifically comprises: taking a natural language query and Java methods as input, computing the similarity between them, re-ranking the set of Java methods according to the similarity, and outputting the finally ranked Java methods.

6. A deep code search system based on code structure semantic information, characterized by comprising an offline end and an online end. The offline end is responsible for parsing Java files, building a structured dataset, and training the deep code search model. The online end interacts with users, provides a web search entry and presents search results, and records users' behavior when viewing search results for statistical analysis and subsequent further optimization of the deep code search model.

7. The deep code search system based on code structure semantic information according to claim 6, characterized in that the offline end comprises a data parsing module, a data storage module and a model training module. The data parsing module extracts Java methods and their corresponding comments from Java files, parses the Java methods into abstract syntax trees, prunes the trees according to their depth, extracts the method name and method body of each Java method from the parsed abstract syntax tree using a regular-expression-based method, and traverses the abstract syntax tree to extract the API sequence. The data storage module supports large-scale computing and storage and multiple computing types, provides fine-grained permission management, sandbox protection and data monitoring functions, and saves the parsed data in JSON format for training the deep code search model. The model training module supports multiple machine learning and deep learning computing frameworks, including streaming computing frameworks, deep learning frameworks and computing engines.

8. The deep code search system based on code structure semantic information according to claim 6, characterized in that the online end comprises a core business layer, a deep code search model and a data persistence module. The main function of the core business layer is to use the deep code search model to complete the code search tasks requested through the user interaction interface, and to store users' behavior when viewing search results in the data persistence module. The core business layer is built with SpringBoot and provides four functional parts: ElasticSearch, an elastic algorithm service, user authentication and user behavior analysis. ElasticSearch is a high-performance search engine library built on Lucene and is used to store the code search dataset. The elastic algorithm service uses Docker to expose the code search function of the deep code search model as a service in the form of a RESTful API. When the back-end business layer receives a natural language query request from the user interaction interface, user authentication filters out unauthorized requests; for authorized requests, ElasticSearch retrieves from the code search dataset the set of Java methods related to the natural language query; the code search service in the elastic algorithm service re-ranks the set of Java methods; and the ranked result is finally sent to the user interaction interface. Users' behavior when viewing search results is fed back to the core business layer and recorded in the data persistence module, and these records can be viewed and statistically analyzed through user behavior analysis.

9. The deep code search system based on code structure semantic information according to claim 8, characterized in that the core business layer provides a caching mechanism for the user interaction interface to relieve server load and respond quickly to user searches. The functional modules in the core business layer and the dependencies between them are displayed graphically, forming an interactive module display interface that supports dynamically adjusting the dependencies between modules by dragging module positions with the mouse.

10. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when executed, the computer program implements the method according to any one of claims 1-3 or the system according to any one of claims 4-8.
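As an illustration of the similarity matching and re-ranking steps recited in the claims, the following minimal sketch ranks candidate Java methods by the cosine similarity between a query vector and each method's fused code vector. It assumes the vectors have already been produced by the encoding and fusion modules; the function names `cosine_similarity` and `rerank`, the method names, and the example vectors are hypothetical and are not taken from the patent.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rerank(query_vec: Sequence[float],
           candidates: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Sort candidate methods by similarity to the query, highest first."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up two-dimensional vectors standing in for encoder outputs.
query = [1.0, 0.0]
methods = {
    "readFile": [0.9, 0.1],
    "sortList": [0.0, 1.0],
    "parseJson": [0.5, 0.5],
}
ranking = rerank(query, methods)
```

In this sketch the candidate set plays the role of the Java methods retrieved in the first stage, and `ranking` is the re-ordered list that would be returned to the user.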
CN202110946937.8A 2021-08-18 2021-08-18 Deep code search method, system and device based on code structure semantic information Active CN113761163B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110946937.8A CN113761163B (en) 2021-08-18 2021-08-18 Deep code search method, system and device based on code structure semantic information

Publications (2)

Publication Number Publication Date
CN113761163A (en) 2021-12-07
CN113761163B (en) 2024-02-02

Family

ID=78790301

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110946937.8A Active CN113761163B (en) 2021-08-18 2021-08-18 Deep code search method, system and device based on code structure semantic information

Country Status (1)

Country Link
CN (1) CN113761163B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063488A (en) * 2010-12-29 2011-05-18 南京航空航天大学 Code searching method based on semantics
CN110716749A (en) * 2019-09-03 2020-01-21 东南大学 Code searching method based on function similarity matching
CN111159223A (en) * 2019-12-31 2020-05-15 武汉大学 An interactive code search method and device based on structured embedding
US10761839B1 (en) * 2019-10-17 2020-09-01 Globant España S.A. Natural language search engine with a predictive writing tool for coding
CN112507065A (en) * 2020-11-18 2021-03-16 电子科技大学 Code searching method based on annotation semantic information
CN112800172A (en) * 2021-02-07 2021-05-14 重庆大学 A code search method based on two-stage attention mechanism

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114461194A (en) * 2021-12-28 2022-05-10 东莞市李群自动化技术有限公司 Auxiliary method, equipment and storage medium for code development
WO2023197397A1 (en) * 2022-04-13 2023-10-19 堡垒科技有限公司 Decentralized trusted tokenization protocol for open source software
CN114896368A (en) * 2022-05-23 2022-08-12 郑州大学产业技术研究院有限公司 Semantic code search model construction method in large-scale candidate set and related device
CN115062106A (en) * 2022-06-07 2022-09-16 华南理工大学 Code searching method, system and medium based on function multiple graph embedding
CN115062106B (en) * 2022-06-07 2024-11-15 华南理工大学 Code search method, system and medium based on multi-graph embedding of function functions
CN115202732B (en) * 2022-06-27 2023-08-08 苏州唐人数码科技有限公司 Intelligent software development auxiliary system and application method
CN115202732A (en) * 2022-06-27 2022-10-18 深圳市互通创新科技有限公司 Intelligent software development auxiliary system and use method
CN115577075A (en) * 2022-10-18 2023-01-06 华中师范大学 Deep code searching method based on relational graph convolutional network
CN115577075B (en) * 2022-10-18 2024-03-12 华中师范大学 Depth code searching method based on relation diagram convolution network
CN116048454A (en) * 2023-03-06 2023-05-02 山东师范大学 A code rearrangement method and system based on iterative contrastive learning
CN116400901A (en) * 2023-04-12 2023-07-07 上海计算机软件技术开发中心 Python code automatic generation method and system
CN116400901B (en) * 2023-04-12 2024-06-11 上海计算机软件技术开发中心 Python code automatic generation method and system
CN117573084A (en) * 2023-08-02 2024-02-20 广东工业大学 Code complement method based on layer-by-layer fusion abstract syntax tree
CN117573084B (en) * 2023-08-02 2024-04-12 广东工业大学 A code completion method based on layer-by-layer fusion of abstract syntax trees
CN116955719A (en) * 2023-09-20 2023-10-27 布谷云软件技术(南京)有限公司 Code management method and system for digital storage of chained network structure
CN116955719B (en) * 2023-09-20 2023-12-05 布谷云软件技术(南京)有限公司 Code management method and system for digital storage of chained network structure
CN117422002A (en) * 2023-12-19 2024-01-19 利尔达科技集团股份有限公司 AIGC-based embedded product generation method, system and storage medium
CN117422002B (en) * 2023-12-19 2024-04-19 利尔达科技集团股份有限公司 AIGC-based embedded product generation method, AIGC-based embedded product generation system and storage medium

Also Published As

Publication number Publication date
CN113761163B (en) 2024-02-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant