CN117112851B - Code searching method based on post-interaction mechanism - Google Patents

Code searching method based on post-interaction mechanism

Info

Publication number
CN117112851B
CN117112851B (application CN202311381385.6A)
Authority
CN
China
Prior art keywords
code
text
vector
module
post
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311381385.6A
Other languages
Chinese (zh)
Other versions
CN117112851A (en)
Inventor
Lu Yunfeng
Zhang Yubo
Liu Yanfang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202311381385.6A priority Critical patent/CN117112851B/en
Publication of CN117112851A publication Critical patent/CN117112851A/en
Application granted granted Critical
Publication of CN117112851B publication Critical patent/CN117112851B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/20Software design
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0499Feedforward networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a code searching method based on a post-interaction mechanism, which comprises the following steps: acquiring training data, wherein the training data comprises code data and text data corresponding to natural language annotations; initializing model parameters, inputting the code data and the text data into a pre-constructed neural network model, and outputting a code characterization vector and a text characterization vector; calculating the fine-grained similarity between the code characterization vector and the text characterization vector through a pre-constructed interaction matrix; and calculating the model loss according to the fine-grained similarity and carrying out parameter optimization on the neural network model. According to the invention, cross-modal mapping is carried out on the code characterization vector and the text characterization vector through the post-interaction mechanism, so that the accuracy of code retrieval is improved.

Description

Code searching method based on post-interaction mechanism
Technical Field
The invention relates to the technical field of computer software, in particular to a code searching method based on a post-interaction mechanism.
Background
Code searching means searching and retrieving relevant code segments from a code repository based on the textual intent expressed by a developer in a search query. Today, with the advent of large code projects and advanced search functions, code searching has become a critical software development activity. It also supports many other important software engineering tasks such as program repair, code synthesis, and vulnerability detection.
Initially, code models relied on traditional Information Retrieval (IR) techniques such as keyword matching or Application Programming Interface (API) matching. Later, researchers demonstrated that pre-trained language models could be built using self-supervised pre-training, thereby significantly improving code search performance. Pre-trained language models such as Corder, Code-MVP, and CodeRetriever achieve good performance in some code search scenarios.
However, these code pre-training methods typically use a Masked Language Modeling (MLM) loss function during the pre-training phase, and studies have shown that this loss function produces inadequate code and text semantic characterizations, resulting in poor performance on tasks based on characterization similarity (e.g., code search).
Furthermore, because program-language code and natural-language text belong to two different modalities of information, creating semantic mappings between code and text is more challenging than mapping between texts. The currently mainstream code searching methods adopt a dual-encoder structure, aiming to simultaneously learn intra-modal and inter-modal relationships with two encoders sharing weights. However, accomplishing both objectives at once is very difficult in practice, because the dual-encoder structure does not explicitly model cross-modal information.
Therefore, how to improve the quality of characterization and text-to-code cross-modality mapping capability in code search has become a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides a code searching method based on a post-interaction mechanism, which performs cross-modal mapping on a code characterization vector and a text characterization vector through the post-interaction mechanism, thereby improving the accuracy of code retrieval.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a code searching method based on a post-interaction mechanism comprises the following steps:
acquiring training data, wherein the training data comprises code data and text data corresponding to natural language annotations;
initializing model parameters, inputting the code data and the text data into a pre-constructed neural network model, and outputting a code characterization vector and a text characterization vector;
calculating the fine-grained similarity between the code characterization vector and the text characterization vector through a pre-constructed interaction matrix;
and calculating the model loss according to the fine-grained similarity and carrying out parameter optimization on the neural network model.
Preferably, the neural network model comprises a prompt template module, a re-parameterization module, a code coding module and a text coding module;
the prompt template module is used for acquiring the code data and generating a first trainable prompt vector and a first prefix vector, and for acquiring the text data and generating a second trainable prompt vector and a second prefix vector;
the re-parameterization module is used for continuously converting the first trainable prompt vector and the second trainable prompt vector;
the continuously converted first trainable prompt vector and the code sequence are reconstructed and then input into the code encoding module to obtain a first character-level characterization vector;
and the continuously converted second trainable prompt vector and the text sequence are reconstructed and then input into the text encoding module to obtain a second character-level characterization vector.
Preferably, the code encoding module and the text encoding module are each composed of a plurality of Transformer modules.
Preferably, the initializing model parameters includes:
freezing the parameters of the code encoding module and the text encoding module; randomly initializing the parameters of the prompt template module and the re-parameterization module.
Preferably, the data flow calculation formula of the prompt template module is as follows:
wherein Φ_X and Φ_Y are the parameters of the trainable PL-end and NL-end prompt templates and are the final optimization objects; P_PL and P_NL are the prompt vectors of the code and text, and V_PL and V_NL are the prefix vectors of the code and text.
Preferably, the continuous conversion is performed by the re-parameterization module, specifically:
P_PL, V_PL ← H(P_PL, V_PL; ψ_X)
P_NL, V_NL ← H(P_NL, V_NL; ψ_Y)
wherein ψ_X and ψ_Y are the parameters of the trainable feed-forward networks at the code end and the text end, respectively.
Preferably, the step of constructing the interaction matrix includes:
obtaining the code characterization vector and the text characterization vector, and constructing an initial matrix M_{i,j};
And performing maximum pooling operation on the rows and columns of the matrix to obtain a final interaction matrix.
Preferably, the calculating the fine-grained similarity between the code characterization vector and the text characterization vector includes:
obtaining two factors according to the maximum pooling result;
calculating the similarity between the i-th code sample and the j-th text sample according to the two factors;
wherein s_{i,j} is the similarity and λ is a preset hyper-parameter.
Preferably, calculating the model loss according to the fine-grained similarity includes:
calculating the code-end-to-text-end loss and the text-end-to-code-end loss;
the final loss function is obtained by combining the two losses.
Compared with the prior art, the invention discloses a code searching method based on a post-interaction mechanism, which realizes cross-modal mapping between code and natural-language text based on the post-interaction mechanism, and adopts an interaction matrix to perform character-level fine-grained interaction matching between the characterization vectors of the code and the characterization vectors of the natural-language text, so as to alleviate the semantic gap between different modalities and improve the accuracy of cross-modal retrieval.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of a code searching method based on a post-interaction mechanism.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, the embodiment of the invention discloses a code searching method based on a post-interaction mechanism, which comprises the following steps:
s1: training data is acquired, the training data including code data and text data corresponding to natural language annotations.
S2: initializing model parameters, inputting the code data and the text data into a pre-constructed neural network model, and outputting a code characterization vector and a text characterization vector.
S3: calculating the fine-grained similarity between the code characterization vector and the text characterization vector through a pre-constructed interaction matrix.
S4: calculating the model loss according to the fine-grained similarity and carrying out parameter optimization on the neural network model.
In the training process, the fine-grained similarity between the code and the natural-language text is calculated through the interaction matrix to form the cross-modal mapping.
In order to further implement the above technical solution, the embodiment of the invention improves the code and text characterization quality through prompt templates. Specifically, the neural network model comprises a prompt template module, a re-parameterization module, a code encoding module, and a text encoding module. The prompt template module acquires the code data and generates a first trainable prompt vector and a first prefix vector, and acquires the text data and generates a second trainable prompt vector and a second prefix vector. The re-parameterization module continuously converts the first trainable prompt vector and the second trainable prompt vector. The continuously converted first trainable prompt vector and the code sequence are reconstructed and then input into the code encoding module to obtain a first character-level characterization vector; the continuously converted second trainable prompt vector and the text sequence are reconstructed and then input into the text encoding module to obtain a second character-level characterization vector.
In this embodiment, the initialization in S2 specifically includes: the parameters of the code-text dual encoder are initialized with the parameters of the pre-trained model UniXcoder and frozen, while the parameters of the code and text prompt templates and the parameters of the re-parameterization encoders are randomly initialized.
Further, the code encoding module and the text encoding module are each composed of multiple stacked Transformer layers. Specifically, each encoder comprises 12 Transformer layers, each layer has 12 attention heads, and the hidden-layer vector size is 768 dimensions. Unlike the common dual-tower structure, the weights are not shared between the two encoders; during training, the encoder parameters are frozen and only the prompt template parameters are updated. The prompt template is essentially a learnable 13-layer tensor structure, with layer 1 corresponding to the continuous prompt vectors of the input layer and the remaining 12 layers corresponding to the respective Transformer layers. The re-parameterization encoder is implemented by a feed-forward network whose input and output dimensions are the same as those of the continuous prompt vectors.
The data flow expression for processing the code data and the text data in the prompt template module is as follows:
wherein Φ_X and Φ_Y are the parameters of the prompt templates at the trainable Programming Language (PL) end and Natural Language (NL) end, respectively, and are the final optimization objects; P_PL and P_NL are the prompt vectors of the code and text, and V_PL and V_NL are the prefix vectors of the code and text.
To improve training stability, the re-parameterization encoders are used to transform the continuous prompts before the code and text information enters the encoders:
P_PL, V_PL ← H(P_PL, V_PL; ψ_X)
P_NL, V_NL ← H(P_NL, V_NL; ψ_Y)
wherein ψ_X and ψ_Y are the parameters of the respective re-parameterization encoders at the code end and the text end, each implemented with a feed-forward network (Feedforward Neural Network, FNN).
In the above step S3, the code and text input information is reconstructed by adding a prompt before the code sequence and the text sequence:
wherein [CLS] and [SEP] are special symbols in the vocabulary that separate the different components of the input information, and X and Y are the inputs finally provided to the encoders. This step is accomplished by inserting the trainable prompts P_PL and P_NL in front of the code and text sequences; each is essentially a learnable one-dimensional vector whose parameters are continuously optimized during training, helping the encoders better understand the code and text information and thereby improving the characterization quality.
Subsequently, the reconstructed code and text information X and Y are input into the code encoder and the text encoder, respectively, to obtain the code characterization vector and the text characterization vector, wherein the two encoders have their own parameters, and V_PL and V_NL are the prefix vectors of the code and text, which are prepended to the code sequence and the text sequence during encoding and fed into each Transformer layer of the corresponding encoder to assist encoding.
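As an illustration only, the following minimal PyTorch-style sketch shows how one such prompted encoder side could be wired together. All names here (ReparamFFN, PromptedEncoder) are hypothetical, the encoder is assumed to accept pre-embedded inputs, and the per-layer prefix vectors V_PL/V_NL are omitted for brevity; this is a sketch of the idea, not the patented implementation.

```python
import torch
import torch.nn as nn

class ReparamFFN(nn.Module):
    """Re-parameterization encoder H(.; psi): a feed-forward network whose
    input and output dimensions equal the continuous prompt dimension."""
    def __init__(self, dim: int, hidden: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, dim),
        )

    def forward(self, prompt: torch.Tensor) -> torch.Tensor:
        return self.net(prompt)

class PromptedEncoder(nn.Module):
    """One side (code or text) of the dual encoder with a trainable prompt."""
    def __init__(self, encoder: nn.Module, embed: nn.Embedding,
                 prompt_len: int, dim: int = 768):
        super().__init__()
        self.encoder = encoder  # frozen pre-trained Transformer stack
        self.embed = embed      # frozen token-embedding layer
        # Continuous prompt P: a learnable tensor, randomly initialized.
        self.prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)
        self.reparam = ReparamFFN(dim)  # trainable H(.; psi)
        for p in self.encoder.parameters():
            p.requires_grad = False  # only prompt + reparam FFN are optimized
        for p in self.embed.parameters():
            p.requires_grad = False

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        tok = self.embed(input_ids)                    # (B, L, dim)
        pr = self.reparam(self.prompt)                 # transform the prompt
        pr = pr.unsqueeze(0).expand(tok.size(0), -1, -1)
        x = torch.cat([pr, tok], dim=1)                # prepend prompt to sequence
        return self.encoder(x)                         # character-level characterizations
```

Freezing the encoder and embedding weights while leaving only the prompt and the re-parameterization FFN trainable mirrors the parameter-efficient tuning described above.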
In order to further implement the above technical solution, the invention discloses an interaction matrix for calculating the fine-grained similarity between cross-modal data. When constructing the interaction matrix, an initial matrix M_{i,j} is first constructed from the code characterization vector output by the code encoding module and the text characterization vector output by the text encoding module.
Then, maximum pooling is performed on the rows and columns of the matrix to obtain the final interaction matrix, yielding two factors, wherein m_c and m_t are the input sequence lengths of the code and the text, M_{k,:} denotes the vector of all elements in row k of M_{i,j}, and M_{:,k} denotes the vector of all elements in column k.
The similarity between the i-th code sample and the j-th text sample is then calculated from the two factors, wherein s_{i,j} is the similarity and λ is a preset hyper-parameter.
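A minimal sketch of this fine-grained matching follows. It assumes that the initial matrix holds pairwise token similarities (cosine similarity is an assumption) and that the two max-pooled factors are averaged and combined linearly with λ; the exact formulas are not reproduced in the text above, so the combination form is a plausible late-interaction reading, not the patent's verbatim equation.

```python
import torch
import torch.nn.functional as F

def fine_grained_similarity(code_vecs: torch.Tensor,
                            text_vecs: torch.Tensor,
                            lam: float = 0.8) -> torch.Tensor:
    """Late-interaction similarity between one code sample and one text sample.

    code_vecs: (m_c, d) character-level code characterization vectors
    text_vecs: (m_t, d) character-level text characterization vectors
    lam: the preset hyper-parameter lambda (set to 0.8 in the description)
    """
    # Initial interaction matrix M[i, j]: similarity between code token i
    # and text token j.
    M = F.normalize(code_vecs, dim=-1) @ F.normalize(text_vecs, dim=-1).T
    # Max-pool the rows and columns of M, then average: two matching factors.
    code_to_text = M.max(dim=1).values.mean()  # best text match per code token
    text_to_code = M.max(dim=0).values.mean()  # best code match per text token
    # Combine the two factors with lambda (the combination form is an assumption).
    return lam * code_to_text + (1.0 - lam) * text_to_code
```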
In order to further implement the above technical solution, calculating the model loss according to the fine-grained similarity includes:
calculating the code-end-to-text-end loss and the text-end-to-code-end loss,
wherein N denotes the batch size and τ is a temperature hyper-parameter;
the final loss function is obtained by combining the two losses.
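A sketch of a symmetric in-batch contrastive (InfoNCE-style) loss consistent with this description follows; since the text does not show how the two directional losses are combined, the sum in the last line is an assumption.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(sim: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """Symmetric in-batch contrastive loss.

    sim: (N, N) matrix of fine-grained similarities s[i, j] for a batch of N
         <code, text> pairs; diagonal entries correspond to matched pairs.
    tau: temperature hyper-parameter (set to 0.05 in the description)
    """
    logits = sim / tau
    targets = torch.arange(sim.size(0), device=sim.device)
    loss_c2t = F.cross_entropy(logits, targets)    # code end -> text end
    loss_t2c = F.cross_entropy(logits.T, targets)  # text end -> code end
    return loss_c2t + loss_t2c                     # combined (sum is an assumption)
```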
finally, the optimization objective of the method is to prompt parameters of templates and re-parameterized encoders, without directly optimizing parameters of dual encoders:
in a specific training process, first will<Code, text>The bimodal training data set is loaded into the data loader in a batch size 256 to update model parameters during training using a small batch gradient descent method. Parameters of the code-text dual encoder are then initialized and frozen using parameters of the pre-training model UniXcoder, and parameters of the code-text hint template and parameters of the re-parameterized encoder are randomly initialized. Weight sharing between the re-parameterized encoder and the code and text hint templates is eliminated. Setting the maximum lengths of the code input sequence and the text input sequence to 256 and 128 respectively, and obtaining the hyper-parameters in the similarity calculation formulaSetting the temperature super parameter in the multi-mode contrast learning loss function to be 0.8 +.>Set to 0.05. Finally, fitting parameters of the model according to the learning rate of 2e-5 and a small batch random gradient descent method until the training is stopped after the training set iterates for 50 times, and storing a weight parameter file of the model.
After the model training process is completed, the optimal model parameters are obtained; the inference prediction process is briefly described below. To save search time during inference, the candidate codes are pre-computed offline, and only the text query is computed online. First, the parameters of the code-text dual encoder are initialized with the parameters of the pre-trained model UniXcoder, and the parameters of the code and text prompt templates and of the re-parameterization encoders are initialized with the optimal parameters obtained through training. Then all codes in the candidate code database are characterized offline in advance: each code is reconstructed through the prompt template and re-parameterization modules and input into the code encoder to obtain the corresponding code characterization vector, and all code characterization vectors are stored in a vector database. During online inference, the natural-language requirement to be matched is reconstructed and input into the text encoding module to obtain the corresponding text characterization vector. The fine-grained similarity, i.e. the matching score, between the text characterization vector and each code characterization vector is then calculated, and the candidate code with the highest matching score is returned as the search result.
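A sketch of this two-stage inference flow follows, with hypothetical helpers encode_code and encode_text standing in for the reconstruct-and-encode steps and fine_grained_similarity reused from the earlier sketch.

```python
import torch

# Offline stage: pre-compute character-level characterization vectors for all
# candidate codes and keep them in a simple in-memory "vector database".
code_db = [encode_code(snippet) for snippet in candidate_codes]  # each (m_c, d)

def search(query: str, lam: float = 0.8) -> int:
    """Online stage: encode the natural-language query once, score it against
    every pre-computed code characterization, and return the best match."""
    text_vecs = encode_text(query)  # (m_t, d)
    scores = torch.stack([fine_grained_similarity(c, text_vecs, lam)
                          for c in code_db])
    return int(scores.argmax())  # index of the highest matching score
```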
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. The code searching method based on the post-interaction mechanism is characterized by comprising the following steps:
acquiring training data, wherein the training data comprises code data and text data corresponding to natural language annotations;
initializing model parameters, inputting the code data and the text data into a pre-constructed neural network model, and outputting a code characterization vector and a text characterization vector; specifically: the neural network model comprises a prompt template module, a re-parameterization module, a code encoding module and a text encoding module;
the prompt template module is used for acquiring the code data and generating a first trainable prompt vector and a first prefix vector, and for acquiring the text data and generating a second trainable prompt vector and a second prefix vector; the re-parameterization module is used for continuously converting the first trainable prompt vector and the second trainable prompt vector;
the continuously converted first trainable prompt vector and the code sequence are reconstructed and then input into the code encoding module to obtain a first character-level characterization vector;
the continuously converted second trainable prompt vector and the text sequence are reconstructed and then input into the text encoding module to obtain a second character-level characterization vector;
calculating the fine-grained similarity between the first character-level characterization vector and the second character-level characterization vector through a pre-constructed interaction matrix;
and calculating the model loss according to the fine-grained similarity and carrying out parameter optimization on the neural network model.
2. The code searching method based on the post-interaction mechanism according to claim 1, wherein the code encoding module and the text encoding module are each composed of a plurality of sequentially connected Transformer modules.
3. The code searching method based on the post-interaction mechanism according to claim 1, wherein initializing the model parameters comprises:
freezing parameters of the code encoding module and the text encoding module; and randomly initializing parameters of the prompt template module and the re-parameterization module.
4. The code search method based on the post-interaction mechanism of claim 1, wherein the data flow calculation formula of the prompt template module is:
wherein Φ_X and Φ_Y are the parameters of the trainable programming-language-end and natural-language-end prompt templates, respectively, and are the final optimization objects; P_PL and P_NL are the prompt vectors of the code and text, respectively, and V_PL and V_NL are the prefix vectors of the code and text.
5. The code searching method based on the post-interaction mechanism according to claim 1, wherein the re-parameterization module performs continuous conversion, specifically:
P_PL, V_PL ← H(P_PL, V_PL; ψ_X)
P_NL, V_NL ← H(P_NL, V_NL; ψ_Y)
wherein ψ_X and ψ_Y are the parameters of the trainable feed-forward networks at the code end and the text end, respectively.
6. The method for searching for codes based on post-interaction mechanism as recited in claim 1, wherein the step of constructing the interaction matrix comprises:
obtaining a code characterization vector and a text characterization vector, and constructing an initial matrix M_{i,j};
And performing maximum pooling operation on the rows and columns of the matrix to obtain a final interaction matrix.
7. The code searching method based on the post-interaction mechanism according to claim 6, wherein the calculating the fine-grained similarity between the code characterization vector and the text characterization vector comprises:
according to the maximum pooling result, two factors are obtained:
wherein m_c and m_t are the input sequence lengths of the code and the text, respectively; M_{k,:} denotes the vector of all elements in row k of the matrix M_{i,j}, and M_{:,k} denotes the vector of all elements in column k;
calculating the similarity between the ith code sample and the jth text sample according to the two factors;
wherein s_{i,j} is the similarity and λ is a preset hyper-parameter.
8. The code searching method based on the post-interaction mechanism according to claim 1, wherein calculating the model loss according to the fine-grained similarity comprises:
calculating the code-end-to-text-end loss and the text-end-to-code-end loss,
wherein N denotes the batch size and τ is a temperature hyper-parameter;
the final loss function is obtained by combining the two losses.
9. The code searching method based on the post-interaction mechanism according to claim 1, wherein, when the parameter optimization is performed, the optimization object is the prompt template module.
CN202311381385.6A 2023-10-24 2023-10-24 Code searching method based on post-interaction mechanism Active CN117112851B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311381385.6A CN117112851B (en) 2023-10-24 2023-10-24 Code searching method based on post-interaction mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311381385.6A CN117112851B (en) 2023-10-24 2023-10-24 Code searching method based on post-interaction mechanism

Publications (2)

Publication Number Publication Date
CN117112851A CN117112851A (en) 2023-11-24
CN117112851B true CN117112851B (en) 2024-04-02

Family

ID=88813277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311381385.6A Active CN117112851B (en) 2023-10-24 2023-10-24 Code searching method based on post-interaction mechanism

Country Status (1)

Country Link
CN (1) CN117112851B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111159223A (en) * 2019-12-31 2020-05-15 武汉大学 Interactive code searching method and device based on structured embedding
CN112507065A (en) * 2020-11-18 2021-03-16 电子科技大学 Code searching method based on annotation semantic information
CN114237621A (en) * 2021-12-20 2022-03-25 重庆大学 Semantic code searching method based on fine-grained common attention mechanism

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220374595A1 (en) * 2021-05-18 2022-11-24 Salesforce.Com, Inc. Systems and methods for semantic code search
US20230196098A1 (en) * 2021-12-22 2023-06-22 Naver Corporation Systems and methods for training using contrastive losses

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111159223A (en) * 2019-12-31 2020-05-15 武汉大学 Interactive code searching method and device based on structured embedding
CN112507065A (en) * 2020-11-18 2021-03-16 电子科技大学 Code searching method based on annotation semantic information
CN114237621A (en) * 2021-12-20 2022-03-25 重庆大学 Semantic code searching method based on fine-grained common attention mechanism

Also Published As

Publication number Publication date
CN117112851A (en) 2023-11-24

Similar Documents

Publication Publication Date Title
Kukačka et al. Regularization for deep learning: A taxonomy
Haaswijk et al. Deep learning for logic optimization algorithms
CN109710915B (en) Method and device for generating repeated statement
CN110826336A (en) Emotion classification method, system, storage medium and equipment
Deshpande et al. A machine learning approach to kinematic synthesis of defect-free planar four-bar linkages
Wang et al. Transferable coupled network for zero-shot sketch-based image retrieval
CN110990555B (en) End-to-end retrieval type dialogue method and system and computer equipment
CN116662582B (en) Specific domain business knowledge retrieval method and retrieval device based on natural language
Deshpande et al. An image-based approach to variational path synthesis of linkages
CN112347756A (en) Reasoning reading understanding method and system based on serialized evidence extraction
CN113033189A (en) Semantic coding method of long-short term memory network based on attention dispersion
CN113314110A (en) Language model based on quantum measurement and unitary transformation technology and construction method
CN115687609A (en) Zero sample relation extraction method based on Prompt multi-template fusion
Qian et al. Multi-vector retrieval as sparse alignment
CN117112851B (en) Code searching method based on post-interaction mechanism
CN116627487A (en) Automatic generation method and system for source code annotation based on word level retrieval
CN114757310B (en) Emotion recognition model and training method, device, equipment and readable storage medium thereof
CN115101122A (en) Protein processing method, apparatus, storage medium, and computer program product
Xia et al. Efficient synthesis of compact deep neural networks
CN115291888A (en) Software community warehouse mining method and device based on self-attention interactive network
CN114997155A (en) Fact verification method and device based on table retrieval and entity graph reasoning
CN112364654A (en) Education-field-oriented entity and relation combined extraction method
Chu et al. History, Development, and Principles of Large Language Models-An Introductory Survey
Phan et al. Deep learning based biomedical NER framework
Chen et al. Eliciting knowledge from language models with automatically generated continuous prompts

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Lu Yunfeng

Inventor after: Zhang Yubo

Inventor after: Liu Yanfang

Inventor before: Lu Yunfeng

Inventor before: Li Chao

Inventor before: Liu Bin

Inventor before: Wang Shihai

Inventor before: Wang Song

GR01 Patent grant
GR01 Patent grant