CN117111952A - Code completion method and device based on generative artificial intelligence, and medium - Google Patents

Code completion method and device based on generative artificial intelligence, and medium

Info

Publication number
CN117111952A
Authority
CN
China
Prior art keywords
code
sample data
model
token
annotation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311076902.9A
Other languages
Chinese (zh)
Inventor
林吴航
袁威强
胡光龙
刘�东
李家诚
沙雨辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202311076902.9A
Publication of CN117111952A
Legal status: Pending


Classifications

    • G06F 8/443 — Optimisation (Transformation of program code; Compilation; Encoding)
    • G06F 11/3684 — Test management for test design, e.g. generating new test cases
    • G06F 11/3688 — Test management for test execution, e.g. scheduling of test suites
    • G06F 11/3696 — Methods or tools to render software testable
    • G06F 40/30 — Semantic analysis (Handling natural language data)
    • G06F 8/427 — Parsing (Syntactic analysis)
    • G06N 3/0455 — Auto-encoder networks; Encoder-decoder networks
    • G06N 3/0475 — Generative networks
    • G06N 3/084 — Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Quality & Reliability (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computer Hardware Design (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the present disclosure provide a code completion method and device based on generative artificial intelligence, and a medium, wherein the method comprises the following steps: inputting first information into a trained code completion model, and outputting, by the code completion model, target code corresponding to the first information, wherein the target code is used to realize an expected code function; the first information comprises a code description statement describing the code function in natural language and/or a code string segment to be completed. Regardless of whether the input code string segment was seen during training, the trained code completion model can generate corresponding target code that realizes the expected function, giving good generalization capability; moreover, a code description statement describing the code function in natural language can be input into the code completion model, either alone or together with a code string segment, to output the target code, so that input contents of different modalities are supported for generating the target code.

Description

Code completion method and device based on generative artificial intelligence, and medium
Technical Field
Embodiments of the present disclosure relate to the field of generative artificial intelligence, and more particularly, to a code completion method and apparatus based on generative artificial intelligence, and a medium.
Background
This section is intended to provide a background or context to the embodiments of the disclosure recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
Generative artificial intelligence is an artificial intelligence technique that utilizes deep learning algorithms to simulate the human ability to create information. Unlike traditional rule-based artificial intelligence, generative artificial intelligence can generate new data of different modalities (e.g., text, images, speech, etc.) by learning from large-scale data sets. Generative artificial intelligence has wide application in many fields, including natural language processing, computer vision, audio synthesis, and the like. For example, in the field of natural language processing, generative artificial intelligence may be used for tasks such as automatic text summarization, article authoring, dialog systems, and the like.
The code completion method in the related art generates completion code based on rules and templates defined in advance. Limited by these predefined rules, it is difficult to adapt to complex business scenarios, and when new or unconventional code fragments are encountered, the generated code cannot realize the expected functions.
Disclosure of Invention
In view of the above, the present disclosure provides a code completion method, device, and medium based on generative artificial intelligence to address the deficiencies in the related art.
In order to achieve the above object, the present disclosure provides the following technical solutions:
in a first aspect of embodiments of the present disclosure, there is provided a code completion method based on generative artificial intelligence, comprising:
inputting first information into a trained code completion model, and outputting, by the code completion model, target code corresponding to the first information, wherein the target code is used to realize an expected code function; the first information comprises a code description statement describing the code function in natural language and/or a code string segment to be completed.
Optionally, the code completion model is obtained by training based on a sample data set; wherein the sample data set comprises: code sample data, and annotation sample data corresponding to the code sample data.
Optionally, the code completion model is a causal language model obtained by fine-tuning a pre-trained initialization model; the causal language model is used to predict missing characters in the code string segment to be completed, according to input sample data formed by splicing the code sample data and the annotation sample data, so as to determine the target code.
Optionally, the causal language model is obtained by fine-tuning the initialization model based on a first loss function;
the annotation sample data and the code sample data each comprise a plurality of tokens, and the first loss function assigns a weight to each token, the weight characterizing how much the loss of that token contributes to the total loss of the initialization model; the weight of each token in the annotation sample data is smaller than the weight of each token in the code sample data corresponding to the annotation sample data.
Optionally, the weight of each token in the annotation sample data is 0, so that the first loss function masks the annotation sample data.
Optionally, the formula of the first loss function is:
L = −( Σ_{i=1}^{N_D} w_i^D · y_i^D · log(ŷ_i^D) + Σ_{j=1}^{N_C} w_j^C · y_j^C · log(ŷ_j^C) )
wherein w_i^D represents the weight of the i-th token in the annotation sample data, y_i^D represents the real label of the i-th token in the annotation sample data, ŷ_i^D represents the predictive label of the i-th token in the annotation sample data, N_D represents the total number of tokens in the annotation sample data, w_j^C represents the weight of the j-th token in the code sample data, y_j^C represents the real label of the j-th token in the code sample data, ŷ_j^C represents the predictive label of the j-th token in the code sample data, and N_C represents the total number of tokens in the code sample data.
Optionally, the initialization model is a GPT model or an OPT model.
Optionally, the target code is function-level code, and the function-level code includes: a function name, a function parameter list, a return value type of the function, and a function body.
Optionally, the sample data set includes a first sample data set, where the first sample data set is obtained after filtering the code sample data based on a static syntax detection tool, and the static syntax detection tool is used to find syntax errors in the code sample data.
Optionally, the sample data set further includes a second sample data set, the second sample data set being obtained by filtering the annotation sample data based on a data cleansing policy;
the data cleansing policy includes at least any one of the following: removing repeated sequences in the annotation sample data, removing useless characters in the annotation sample data, removing annotation sample data whose content length is smaller than a first threshold, removing annotation sample data whose content length is larger than a second threshold, removing annotation sample data whose content is code, removing annotation sample data whose content is garbled characters, removing annotation sample data whose content is an unrecognizable foreign language, and removing annotation sample data whose content is not natural language.
Optionally, in the sample data set, the ratio between the data amount of the first sample data set and the data amount of the second sample data set lies within the data interval [1:1, 1:2].
Optionally, the sample data set is formed by combining the first sample data set and the second sample data set, and the sample data set contains first function-level code sample data, the first function-level code sample data being obtained after the number of function-level codes in the sample data set is adjusted based on a data distribution adjustment policy.
Optionally, the formula of the data distribution adjustment policy is:
N′ = 1 + int(log_M N)
wherein M is a real number greater than 1, N represents the number of function-level codes of the same class in the sample data set before adjustment, and N′ represents the number of function-level codes of the same class in the sample data set after adjustment; function-level codes of the same class have identical function names, function parameter lists, return value types, and function bodies.
In a second aspect of the embodiments of the present disclosure, there is provided a code completion device based on generative artificial intelligence, comprising:
an input module for inputting first information into a trained code completion model;
an output module for outputting, by the code completion model, target code corresponding to the first information, the target code being used to realize an expected code function;
wherein the first information comprises a code description statement describing the code function in natural language and/or a code string segment to be completed.
Optionally, the code completion model is obtained by training based on a sample data set; wherein the sample data set comprises: code sample data, and annotation sample data corresponding to the code sample data.
Optionally, the code completion model is a causal language model obtained by fine-tuning a pre-trained initialization model; the causal language model is used to predict missing characters in the code string segment to be completed, according to input sample data formed by splicing the code sample data and the annotation sample data, so as to determine the target code.
Optionally, the causal language model is obtained by fine-tuning the initialization model based on a first loss function;
the annotation sample data and the code sample data each comprise a plurality of tokens, and the first loss function assigns a weight to each token, the weight characterizing how much the loss of that token contributes to the total loss of the initialization model; the weight of each token in the annotation sample data is smaller than the weight of each token in the code sample data corresponding to the annotation sample data.
Optionally, the weight of each token in the annotation sample data is 0, so that the first loss function masks the annotation sample data.
Optionally, the formula of the first loss function is:
L = −( Σ_{i=1}^{N_D} w_i^D · y_i^D · log(ŷ_i^D) + Σ_{j=1}^{N_C} w_j^C · y_j^C · log(ŷ_j^C) )
wherein w_i^D represents the weight of the i-th token in the annotation sample data, y_i^D represents the real label of the i-th token in the annotation sample data, ŷ_i^D represents the predictive label of the i-th token in the annotation sample data, N_D represents the total number of tokens in the annotation sample data, w_j^C represents the weight of the j-th token in the code sample data, y_j^C represents the real label of the j-th token in the code sample data, ŷ_j^C represents the predictive label of the j-th token in the code sample data, and N_C represents the total number of tokens in the code sample data.
Optionally, the initialization model is a GPT model or an OPT model.
Optionally, the target code is function-level code, and the function-level code includes: a function name, a function parameter list, a return value type of the function, and a function body.
Optionally, the sample data set includes a first sample data set, where the first sample data set is obtained after filtering the code sample data based on a static syntax detection tool, and the static syntax detection tool is used to find syntax errors in the code sample data.
Optionally, the sample data set further includes a second sample data set, the second sample data set being obtained by filtering the annotation sample data based on a data cleansing policy;
the data cleansing policy includes at least any one of the following: removing repeated sequences in the annotation sample data, removing useless characters in the annotation sample data, removing annotation sample data whose content length is smaller than a first threshold, removing annotation sample data whose content length is larger than a second threshold, removing annotation sample data whose content is code, removing annotation sample data whose content is garbled characters, removing annotation sample data whose content is an unrecognizable foreign language, and removing annotation sample data whose content is not natural language.
Optionally, in the sample data set, the ratio between the data amount of the first sample data set and the data amount of the second sample data set lies within the data interval [1:1, 1:2].
Optionally, the sample data set is formed by combining the first sample data set and the second sample data set, and the sample data set contains first function-level code sample data, the first function-level code sample data being obtained after the number of function-level codes in the sample data set is adjusted based on a data distribution adjustment policy.
Optionally, the formula of the data distribution adjustment policy is:
N′ = 1 + int(log_M N)
wherein M is a real number greater than 1, N represents the number of function-level codes of the same class in the sample data set before adjustment, and N′ represents the number of function-level codes of the same class in the sample data set after adjustment; function-level codes of the same class have identical function names, function parameter lists, return value types, and function bodies.
In a third aspect of embodiments of the present disclosure, a medium has stored thereon a computer program which, when executed by a processor, implements a method as described in the first aspect above.
In a fourth aspect of embodiments of the present disclosure, there is provided a computing device comprising:
a processor;
a memory for storing a processor executable program;
wherein the processor is configured to implement the method according to the first aspect by running the executable program.
According to the embodiments of the present disclosure, regardless of whether the input code string segment was seen during training, the trained code completion model can generate corresponding target code that realizes the expected function, giving good generalization capability; moreover, a code description statement describing the code function in natural language can be input into the code completion model, either alone or together with a code string segment, to output the target code, so that input contents of different modalities are supported for generating the target code.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which:
FIG. 1 schematically illustrates a flow diagram of a method of code completion based on generative artificial intelligence, in accordance with an embodiment of the present disclosure;
FIG. 2 schematically illustrates a schematic diagram of a code completion model according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a schematic diagram of another code completion model according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a schematic diagram of a sample dataset according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart of a code sample data screening method based on a static grammar detection tool in accordance with an embodiment of the present disclosure;
FIG. 6 schematically illustrates a schematic diagram of another sample dataset according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a schematic diagram of another sample dataset according to an embodiment of the present disclosure;
FIG. 8 schematically illustrates a flow chart of a method of annotation sample data filtering based on a data cleansing policy, according to an embodiment of the disclosure;
FIG. 9 schematically illustrates a schematic diagram of another sample dataset according to an embodiment of the present disclosure;
FIG. 10 schematically illustrates a block diagram of a code completion device based on generative artificial intelligence, in accordance with an embodiment of the present disclosure;
FIG. 11 schematically illustrates a schematic diagram of a medium according to an embodiment of the present disclosure;
fig. 12 schematically illustrates a schematic diagram of a computing device according to an embodiment of the disclosure.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present disclosure will be described below with reference to several exemplary embodiments. It should be understood that these embodiments are presented merely to enable one skilled in the art to better understand and practice the present disclosure and are not intended to limit the scope of the present disclosure in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Those skilled in the art will appreciate that embodiments of the present disclosure may be implemented as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the following forms, namely: complete hardware, complete software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to the embodiments of the present disclosure, a code completion method, a device, and a medium based on generative artificial intelligence are provided.
In this document, it should be understood that any number of elements in the drawings is for illustration and not limitation, and that any naming is used only for distinction and not for any limitation.
The principles and spirit of the present disclosure are explained in detail below with reference to several representative embodiments thereof.
Summary of The Invention
The code completion method in the related art depends on predefined rules and templates, and generates completion code according to the input context of the code to be completed and the defined rules, thereby reducing the amount of code a user has to enter and improving development efficiency.
Based on the above, the embodiments of the present disclosure provide a code completion method based on generative artificial intelligence. Regardless of whether the input code string segment was seen during training, the trained code completion model can generate corresponding target code that realizes the expected function, giving good generalization capability; moreover, a code description statement describing the code function in natural language can be input into the code completion model, either alone or together with a code string segment, to output the target code, so that input contents of different modalities are supported for generating the target code.
Having described the basic principles of the present disclosure, various non-limiting embodiments of the present disclosure are specifically described below.
Exemplary method
A code completion method based on generative artificial intelligence according to an exemplary embodiment of the present disclosure is described below with reference to FIG. 1.
FIG. 1 schematically illustrates a flow chart of a code completion method based on generative artificial intelligence, in accordance with an embodiment of the present disclosure. The method may include:
step S101, inputting first information into a trained code complement model;
step S102, outputting target codes corresponding to the first information by the code complement model, wherein the target codes are used for realizing expected code functions;
the first information comprises a code description statement for describing the code function by natural language and/or a code character string segment to be complemented.
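As an illustration of steps S101 and S102 only, the following sketch shows how the first information might be fed into a trained code completion model. It assumes a HuggingFace-style causal language model and tokenizer; the model path, prompt format, and generation settings are illustrative assumptions rather than part of the claimed method.

```python
# Illustrative sketch only (not the claimed implementation): feeding the first
# information -- a natural-language code description and/or a code string
# segment to be completed -- into a trained causal code completion model.
# The model path, prompt format, and generation settings are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/finetuned-code-model")  # hypothetical path
model = AutoModelForCausalLM.from_pretrained("path/to/finetuned-code-model")

def complete(description: str = "", code_fragment: str = "") -> str:
    # Build the first information from a description and/or a code fragment.
    first_information = (f"# {description}\n" if description else "") + code_fragment
    inputs = tokenizer(first_information, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=128)  # causal decoding
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Description alone, or together with a partial code string segment:
print(complete(description="sum of two numbers", code_fragment="def get_sum(n, m):"))
```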
In this embodiment, natural language may be a language that has formed naturally with the evolution of culture and that human beings use to communicate and express ideas, such as Chinese or English; it is generally organized according to certain grammatical and logical structures and can convey information.
In an embodiment, the target code may be token-level code, line-level code, block-level code, function-level code, or the like, which is not limited by the present disclosure.
In an embodiment, the code completion model may be obtained by training based on a sample dataset. Wherein the sample dataset may comprise: code sample data, and annotation sample data corresponding to the code sample data.
In the present embodiment, the annotation sample data corresponding to the code sample data may be a code description sentence describing the meaning or function of the corresponding code sample data in natural language.
FIG. 2 schematically illustrates a schematic diagram of a code completion model according to an embodiment of the present disclosure. In one embodiment, as shown in FIG. 2, the code completion model may be a causal language model obtained by fine-tuning a pre-trained initialization model. The causal language model is used to predict the missing characters in the code string segment to be completed according to input sample data formed by splicing the code sample data and the annotation sample data, so as to determine the target code.
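For illustration only, the sketch below shows one way such a spliced input sample might be assembled from an annotation and its corresponding code; the comment prefix and the example function are assumptions, not a format prescribed by the present disclosure.

```python
# Illustrative sketch: splicing annotation sample data (a natural-language
# description) with its corresponding code sample data into one input
# training sample for the causal language model. The comment prefix and the
# example function are assumptions, not a format prescribed by the disclosure.
def build_training_sample(annotation: str, code: str) -> str:
    # The annotation precedes the code so that, during causal training,
    # the model learns to generate code conditioned on the description.
    return f"# {annotation}\n{code}"

sample = build_training_sample("sum of two numbers",
                               "def get_sum(n, m):\n    return n + m")
```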
The inventors have found that when the initialization model is fine-tuned, the input sample data comprises both code sample data and annotation sample data, and the computed model loss includes the loss on both the code sample data and the annotation sample data; as a result, during back propagation the neural network parameters corresponding to both the annotation sample data and the code sample data are optimized, and the model fits the annotation sample data and the code sample data simultaneously.
However, the code completion model of the present disclosure outputs target code and does not need to output a code description statement, so fine-tuning of the initialization model should focus on fitting the code sample data; in that case, the loss on the annotation sample data can cause the model to over-adjust its parameters during back propagation in order to fit the annotation sample data, which reduces the accuracy of the code predicted by the model.
Therefore, the inventors propose that, when the model loss is calculated, the loss on the annotation sample data should have a smaller influence on the model loss than the code sample data, so that the model focuses more on fitting the code sample data and the accuracy of the code predicted by the model is improved.
FIG. 3 schematically illustrates a schematic diagram of another code completion model according to an embodiment of the present disclosure. In one embodiment, as shown in FIG. 3, the causal language model may be obtained by fine-tuning the initialization model based on a first loss function. The annotation sample data and the code sample data each comprise a plurality of tokens, and the first loss function may assign a weight to each token, the weight characterizing how much the loss of that token contributes to the total loss of the initialization model; the weight of each token in the annotation sample data may be smaller than the weight of each token in the code sample data corresponding to the annotation sample data.
In this embodiment, the annotation sample data and the code sample data may be subjected to tokenization, so that the annotation sample data and the code sample data are each divided into a plurality of tokens.
In this embodiment, the weight of each token in the code sample data is usually nonzero, to avoid the situation in which the model loss does not include the loss on the code sample data, the neural network parameters corresponding to the code sample data cannot be optimized during back propagation, and the code sample data cannot be fitted.
In an embodiment, the weight of each token in the annotation sample data may be 0, and the weight of each token in the code sample data may then be a value greater than 0 (e.g., 0.5, 1, 1.1, etc.), so that the first loss function masks the annotation sample data: the computed model loss includes the loss on the code sample data but not the loss on the annotation sample data. Back propagation then concentrates on optimizing the neural network parameters corresponding to the code sample data, avoiding the situation in which the loss on the annotation sample data causes the model to over-adjust its parameters, thereby improving the accuracy of the code predicted by the model.
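A minimal sketch of such a masked, token-weighted loss is given below, assuming PyTorch; the normalization by the sum of weights is a design choice of this sketch and is not specified by the present disclosure.

```python
# Illustrative sketch (assuming PyTorch) of a first loss function with
# per-token weights: annotation tokens receive weight 0 (masked) and code
# tokens receive weight 1, so only code tokens contribute to the fine-tuning loss.
import torch
import torch.nn.functional as F

def weighted_causal_lm_loss(logits: torch.Tensor,
                            labels: torch.Tensor,
                            token_weights: torch.Tensor) -> torch.Tensor:
    # logits: (seq_len, vocab_size); labels: (seq_len,); token_weights: (seq_len,)
    per_token_loss = F.cross_entropy(logits, labels, reduction="none")
    # Normalizing by the weight sum is a design choice of this sketch.
    return (token_weights * per_token_loss).sum() / token_weights.sum().clamp(min=1.0)

# Example weight vector for a spliced sample of 5 annotation tokens + 6 code tokens.
weights = torch.tensor([0.0] * 5 + [1.0] * 6)
```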
Next, the effect of this embodiment on improving the accuracy of the code predicted by the model is described with reference to specific experimental data.
The inventors adopted the published Java function-level evaluation data set AixStandard as the experimental data set. The experimental data set comprises 175 problems; each problem comprises a code description statement that describes the function of the function-level code in natural language, and a plurality of test cases, each consisting of input parameters and a correct answer and covering the different conditions and boundary conditions that may occur in the problem.
The evaluation procedure of the experiment is as follows: for each problem, the code description statement is input into the code completion model to obtain the function-level code generated by the model, and the function-level code is submitted to an online judge (Online Judge) system; for the plurality of test cases of each problem, the online judge system feeds the input parameters of each test case into the function-level code and executes it, and compares the generated answer with the correct answer of the corresponding test case; if they are consistent, the test case is determined to be passed, and finally the number of passed test cases is counted.
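The following sketch only illustrates this kind of test-case execution in Python; the actual evaluation described above runs Java function-level code in an online judge system, and the test-case field names and entry-point handling here are assumptions.

```python
# Illustrative sketch of online-judge-style checking in Python. The actual
# evaluation described above runs Java function-level code in an Online Judge
# system; the test-case fields ("inputs", "expected") and the entry-point name
# are assumptions made for this sketch.
def run_test_cases(generated_code: str, entry_point: str, test_cases: list) -> tuple:
    namespace = {}
    exec(generated_code, namespace)      # define the generated function
    func = namespace[entry_point]
    passed = 0
    for case in test_cases:
        try:
            if func(*case["inputs"]) == case["expected"]:
                passed += 1
        except Exception:
            pass                         # a runtime error counts as a failed test case
    return passed, len(test_cases)
```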
The metrics adopted in the experiment include:
AvgPassRatio: the mean execution accuracy of the code. For example, suppose there are two problems in total: problem A includes 5 test cases and the model-generated code passes 4 of them, and problem B includes 4 test cases and the model-generated code passes 2 of them; the AvgPassRatio is then (4/5 + 2/4) / 2 = 65%.
Pass@1: the full-pass accuracy of the code, i.e., the proportion of problems for which the model-generated code passes all of the test cases. For example, suppose there are three problems in total, A with 5 test cases, B with 4 test cases, and C with 5 test cases; if the model-generated code passes all of the test cases only for problem C, the Pass@1 is 1/3.
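A small sketch of how these two metrics could be computed from per-problem results; the (passed, total) pair format is an assumption made for illustration.

```python
# Illustrative sketch of the two metrics, given per-problem results as
# (passed_cases, total_cases) pairs; the input format is an assumption.
def avg_pass_ratio(per_problem):
    # Mean execution accuracy: average of each problem's pass ratio.
    return sum(p / t for p, t in per_problem) / len(per_problem)

def pass_at_1(per_problem):
    # Full-pass accuracy: fraction of problems whose code passes every test case.
    return sum(1 for p, t in per_problem if p == t) / len(per_problem)

# Example from the text: problem A passes 4/5 cases, problem B passes 2/4 cases.
print(avg_pass_ratio([(4, 5), (2, 4)]))  # (0.8 + 0.5) / 2 = 0.65
print(pass_at_1([(4, 5), (2, 4)]))       # 0 / 2 = 0.0
```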
The initialization model adopted in the experiment is a GPT2 model with the size of 350M.
Table 1 shows the experimental data for the cases where the annotation sample data does or does not participate in the model training process (i.e., the first loss function does not mask, or masks, the annotation sample data):
TABLE 1

                                                           AvgPassRatio    Pass@1
Annotation sample data participates in training            61.41%          41.14%
Annotation sample data does not participate in training    64.49%          46.85%
As can be seen from the experimental data in Table 1, compared with the case where the annotation sample data participates in training, both the AvgPassRatio and Pass@1 values are improved when the annotation sample data does not participate in training, which shows that masking the annotation sample data with the first loss function can improve the accuracy of the code generated by the model.
In one embodiment, the first loss function may be formulated as:
L = −( Σ_{i=1}^{N_D} w_i^D · y_i^D · log(ŷ_i^D) + Σ_{j=1}^{N_C} w_j^C · y_j^C · log(ŷ_j^C) )
wherein w_i^D represents the weight of the i-th token in the annotation sample data, y_i^D represents the real label of the i-th token in the annotation sample data, ŷ_i^D represents the predictive label of the i-th token in the annotation sample data, N_D represents the total number of tokens in the annotation sample data, w_j^C represents the weight of the j-th token in the code sample data, y_j^C represents the real label of the j-th token in the code sample data, ŷ_j^C represents the predictive label of the j-th token in the code sample data, and N_C represents the total number of tokens in the code sample data.
In this embodiment, the real label of a token may represent the actual probability that each token in the vocabulary is that token, and the predictive label of a token may represent the probability, predicted by the initialization model, that each token in the vocabulary is that token; for convenience of calculation, the real label may be represented as a one-hot (one-bit-valid) encoding and the predictive label as the corresponding probability distribution over the vocabulary.
Specifically, for example, suppose the content of the annotation sample data includes "sum of two numbers", the total number of tokens of the annotation sample data is 5, and the 1st token is "sum"; its corresponding serial number in the vocabulary is input into the initialization model. The initialization model then predicts the probability of each token in the vocabulary being the 2nd token. Assuming the vocabulary size is 11, the model predicts that the token "one" (number 2 in the vocabulary) has the largest probability, 0.3, while the 2nd token should actually be "two" (number 3 in the vocabulary). In this case, the real label of the 2nd token in the annotation sample data can be expressed as [0,0,1,0,0,0,0,0,0,0,0], and the predictive label of the 2nd token in the annotation sample data can be expressed as [0.1,0.3,0.15,0.1,0.05,0.05,0.05,0.05,0.05,0.05,0.05]. Assuming the weight of each token in the annotation sample data is 0, the loss value of the 2nd token of the annotation sample data in the first loss function formula is 0.
The content of the code sample data corresponding to the annotation sample data includes "getSum(n, m)", the total number of tokens of the code sample data is 6, and the 1st token is "getSum"; its corresponding serial number in the vocabulary is input into the initialization model. The initialization model then predicts the probability of each token in the vocabulary being the 2nd token. Assuming the vocabulary size is 11, the model predicts that the token numbered 7 in the vocabulary has the largest probability, 0.2, while the 2nd token should actually be "(" (number 8 in the vocabulary). In this case, the real label of the 2nd token in the code sample data can be expressed as [0,0,0,0,0,0,0,1,0,0,0], and the predictive label of the 2nd token in the code sample data can be expressed as [0.1,0.15,0.15,0.1,0.05,0.05,0.2,0.05,0.05,0.05,0.05]. Assuming the weight of each token in the code sample data is 1, the loss value of the 2nd token of the code sample data in the first loss function formula is 1 × (−log 0.05) ≈ 3.0.
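To make the worked example above concrete, the following sketch evaluates the per-token loss terms for the two 2nd tokens, using the one-hot real labels and predicted distributions given in the text; treating each term as weight times cross-entropy is an assumption consistent with the formula above.

```python
# Illustrative check of the worked example: per-token loss = weight x cross-entropy
# between the one-hot real label and the predicted probability distribution.
import math

def token_loss(weight, real_label, pred_label):
    return -weight * sum(y * math.log(p) for y, p in zip(real_label, pred_label))

# 2nd annotation token: weight 0, so its loss contribution is 0.
ann_real = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
ann_pred = [0.1, 0.3, 0.15, 0.1, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]
print(token_loss(0.0, ann_real, ann_pred))    # 0.0

# 2nd code token: weight 1, true token at position 8 with predicted probability 0.05.
code_real = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
code_pred = [0.1, 0.15, 0.15, 0.1, 0.05, 0.05, 0.2, 0.05, 0.05, 0.05, 0.05]
print(token_loss(1.0, code_real, code_pred))  # -log(0.05) ≈ 3.0
```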
In one embodiment, the initialization model may be a GPT (Generative Pre-trained Transformer) model or an OPT (Open Pre-trained Transformer) model.
In this embodiment, the GPT model may refer to any of the GPT series of models, including GPT-1, GPT-2, GPT-3, GPT-4, GPT-5, and the like. The initialization model may also be a language model with a decoder structure, such as LLaMA (Large Language Model Meta AI), BLOOM (BigScience Large Open-science Open-access Multilingual Language Model), or T5 (Text-to-Text Transfer Transformer).
The inventors have further found that function-level code completion is more difficult than completion of token-level, line-level, and block-level code, because it involves conversion between the two text modalities of natural language and code, requiring the code completion model to understand natural-language semantics accurately in order to generate accurate function-level code. The annotation sample data used in the training process of the code completion model can improve the model's accuracy in understanding natural-language semantics.
In one embodiment, the target code may be function-level code, which may include: a function name, a function parameter list, a return value type of the function, and a function body. The function name, function parameter list, and return value type of the function may be collectively referred to as the function signature.
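As an illustration only (this example is not taken from the patent), the hypothetical function-level code below shows the four parts in Python syntax, with the first three forming the function signature.

```python
# Hypothetical function-level code (not taken from the patent) showing the parts
# named above in Python syntax:
#   function name:           get_sum
#   function parameter list: (n: int, m: int)
#   return value type:       int
#   function body:           the indented statements under the signature
def get_sum(n: int, m: int) -> int:
    """Sum of two numbers."""
    return n + m
```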
The inventors have further found that code completion methods in the related art help the model understand code semantics by introducing the code in another form to achieve better completion results, such as converting the code into a code graph or feeding an abstract-syntax-tree representation into the model for training. However, where the code itself contains noise data, the noise still exists even after the code is converted into another form, and a model trained on highly noisy data cannot fundamentally improve its code completion performance. The inventors therefore contemplate that the syntactic quality of the code sample data in the sample data set may be improved to reduce noise data and thereby improve the accuracy of the code generated by the model.
Fig. 4 schematically illustrates a schematic diagram of a sample dataset according to an embodiment of the present disclosure. In one embodiment, as shown in FIG. 4, the sample data set may include a first sample data set that may be obtained after filtering the code sample data based on a static syntax detection tool that may be used to discover syntax errors in the code sample data.
In this embodiment, a corresponding static syntax detection tool may be adopted according to the programming language used in the code sample data: for example, the Java language may use a Java syntax detection tool; the Python language may use tools such as Pycodestyle, Pyflakes, and Pylint; JavaScript may use tools such as JSLint, ESLint, JSHint, and JSHunter, and the present disclosure is not limited thereto.
FIG. 5 schematically illustrates a flow chart of a code sample data screening method based on a static syntax detection tool in accordance with an embodiment of the present disclosure. In one embodiment, as shown in FIG. 5, filtering the code sample data based on a static syntax detection tool may include:
Step S201: acquire a first original sample data set based on a code hosting platform;
Step S202: parse the first original sample data set based on a syntax parsing tool, and filter out code sample data;
Step S203: detect the code sample data based on the static syntax detection tool, and screen out the code sample data with correct syntax.
The code hosting platform may be GitHub, Gitee, GitLab, etc., and the parsing tool may be Tree-sitter, ANTLR, etc., which is not limited by the present disclosure.
Specifically, taking the Java language as an example, the full set of Java data files can be obtained from GitHub; the files are then parsed with the Tree-sitter tool to filter out a plurality of commented Java code segments; the Java syntax detection tool is then used to check these Java code segments, and the Java code with correct syntax is screened out.
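A minimal sketch of the screening step is given below, using Python rather than Java as the target language; Python's built-in compile() stands in for a static syntax detection tool and is only an illustration of the idea.

```python
# Illustrative sketch of the screening step S203, using Python as the target
# language: the built-in compile() stands in for a static syntax detection tool
# (the Java example above would instead use a Java syntax detection tool).
def has_correct_syntax(source: str) -> bool:
    try:
        compile(source, "<code_sample>", "exec")  # parse only, do not execute
        return True
    except SyntaxError:
        return False

def screen_code_samples(code_samples: list) -> list:
    # Keep only the code sample data whose syntax is correct.
    return [code for code in code_samples if has_correct_syntax(code)]
```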
The inventors have further found that, in addition to code with syntax errors, noise in the sample data set also originates from the annotation sample data, and that noise data contained in the annotation sample data can interfere with the model's understanding of natural-language semantics, thereby reducing the accuracy of the code generated by the model.
For example, annotation sample data whose content is too short lacks sufficient information for the model to understand; excessively long annotation sample data may include excessive detail and complex information, increasing the risk of model overfitting and the difficulty of understanding its semantics; repeated sequences in the annotation sample data are redundant information with no new meaning; useless characters in the annotation sample data have no explicit semantics and are only used for formatting or marking text; and annotation sample data whose content is code, garbled characters, an unrecognizable foreign language, or non-natural language contains no explicit semantic information related to natural language. All of these constitute noise data for the model's understanding of natural-language semantics.
The inventors therefore contemplate that the noise data in the annotation sample data may be filtered out to improve the accuracy of the code generated by the model.
Fig. 6 and 7 schematically illustrate schematic diagrams of another sample dataset according to embodiments of the present disclosure.
In an embodiment, as shown in fig. 6, the sample data set may comprise a second sample data set.
In an embodiment, as shown in fig. 7, the sample data set may include a first sample data set and may also include a second sample data set.
The second sample data set may be obtained by filtering the annotation sample data based on a data cleansing policy, where the data cleansing policy may include at least any one of the following: removing repeated sequences in the annotation sample data, removing useless characters in the annotation sample data, removing annotation sample data whose content length is smaller than a first threshold, removing annotation sample data whose content length is larger than a second threshold, removing annotation sample data whose content is code, removing annotation sample data whose content is garbled characters, removing annotation sample data whose content is an unrecognizable foreign language, and removing annotation sample data whose content is not natural language. The first threshold and the second threshold may be set according to actual service requirements; for example, the first threshold may be set to 5 characters and the second threshold to 300 characters, which is not limited by the present disclosure.
FIG. 8 schematically illustrates a flow chart of a method of annotation sample data filtering based on a data cleansing policy according to an embodiment of the disclosure. In one embodiment, as shown in FIG. 8, the step of filtering the annotation sample data based on a data cleansing policy may comprise:
Step S301: acquire a second original sample data set based on a code hosting platform;
Step S302: parse the second original sample data set based on a syntax parsing tool, and filter out annotation sample data;
Step S303: filter the annotation sample data based on the data cleansing policy to obtain noise-reduced annotation sample data.
The code hosting platform may be GitHub, Gitee, GitLab, etc., and the parsing tool may be Tree-sitter, ANTLR, etc., which is not limited by the present disclosure.
Specifically, the full set of code data files can be obtained from GitHub; the files are then parsed with the Tree-sitter tool to filter out a plurality of commented code segments, and the comments are filtered based on the data cleansing policy to obtain the noise-reduced annotation sample data.
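The sketch below illustrates part of such a data cleansing policy. The thresholds follow the example in the text (5 and 300 characters); the detectors for repeated sequences, useless characters, code-like content, and garbled characters are simplified stand-ins rather than the disclosed policy itself.

```python
# Illustrative sketch of part of the data cleansing policy for annotation sample
# data. The thresholds follow the example in the text (5 and 300 characters);
# the detectors for repeated sequences, useless characters, code-like content,
# and garbled characters are simplified stand-ins, not the disclosed policy itself.
import re

FIRST_THRESHOLD, SECOND_THRESHOLD = 5, 300

def clean_annotation(text: str):
    text = re.sub(r"(\b\w+\b)(\s+\1)+", r"\1", text)   # collapse repeated word sequences
    text = re.sub(r"[*#=~-]{3,}", " ", text).strip()   # strip useless formatting characters
    if not (FIRST_THRESHOLD <= len(text) <= SECOND_THRESHOLD):
        return None                                    # content too short or too long
    if re.search(r"[{};]|\bdef\b|\breturn\b", text):
        return None                                    # content looks like code
    if "\ufffd" in text:
        return None                                    # garbled / undecodable characters
    return text

annotations = ["sum of two numbers", "==== ====", "x"]
cleaned = [a for a in (clean_annotation(t) for t in annotations) if a is not None]
```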
The inventors have further found that, if the data volume of the first sample data set and the data volume of the second sample data set differ greatly, the syntax-optimized code sample data or the noise-reduced annotation sample data will account for too small a proportion, which may reduce the accuracy of the code generated by the model. Accordingly, the inventors conceived that the ratio between the data volume of the first sample data set and the data volume of the second sample data set can be maintained within a certain data interval.
In one embodiment, the ratio of the data volume of the first sample data set to the data volume of the second sample data set lies within the data interval [1:1, 1:2].
Specifically, taking a ratio of 1:1.5 between the data volume of the first sample data set and the data volume of the second sample data set as an example, if the data volume of the first sample data set is 1,000,000, then 1,500,000 samples can be randomly selected from the second sample data set to be retained.
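A minimal sketch of this random selection is shown below; the 1:1.5 ratio and the use of random.sample are taken from the description above, while the function and parameter names are assumptions.

```python
# Illustrative sketch: randomly retain samples from the second sample data set so
# that the first-set to second-set data volume ratio is approximately 1:1.5.
import random

def match_ratio(first_set: list, second_set: list, ratio: float = 1.5) -> list:
    target = min(len(second_set), int(len(first_set) * ratio))
    return random.sample(second_set, target)   # randomly select the samples to retain
```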
Table 2 shows the experimental results for different ratios between the data volume of the first sample data set and the data volume of the second sample data set:
TABLE 2

Ratio      AvgPassRatio    Pass@1
1:1        63.34%          42.85%
1:1.5      64.49%          46.85%
1:2        62.59%          40.00%
As can be seen from the experimental data in Table 2, when the ratio of the data volume of the first sample data set to the data volume of the second sample data set takes any of the three values in the table, both the AvgPassRatio and Pass@1 values are satisfactory, and both metrics decline when the ratio moves from 1:1.5 to 1:1 and from 1:1.5 to 1:2. This demonstrates that when the ratio between the data volume of the first sample data set and the data volume of the second sample data set lies within the data interval [1:1, 1:2], the code generated by the model is more accurate.
The inventors further found that commonly used function-level codes appear with high frequency in the sample data set while less commonly used function-level codes appear with low frequency, so during training the model may pay more attention to learning the high-frequency function-level codes and insufficient attention to the low-frequency ones, whose characteristics are then not fully learned, leaving the accuracy of the generated code insufficient. Therefore, the inventors propose that the numbers of high-frequency and low-frequency function-level codes can be balanced by adjusting the number of function-level codes, so that the model can fully learn the characteristics of both, improving the accuracy of the code generated by the model.
FIG. 9 schematically illustrates a schematic diagram of another sample dataset according to an embodiment of the present disclosure; in an embodiment, as shown in fig. 9, the sample data set may be formed by combining a first sample data set and a second sample data set, where the sample data set includes first function level code sample data, and the first function level code sample data may be obtained by adjusting the number of function level codes in the sample data set based on a data distribution adjustment policy.
In one embodiment, the formula for the data distribution adjustment policy may be:
N′ = 1 + int(log_M N)
wherein M is a real number greater than 1, N represents the number of function-level codes of the same class in the sample data set before adjustment, and N′ represents the number of function-level codes of the same class in the sample data set after adjustment; function-level codes of the same class have identical function names, function parameter lists, return value types, and function bodies.
In this embodiment, the value of M in the formula of the data distribution adjustment policy may be determined according to the specific scenario; for example, the formula may be:
N′ = 1 + int(log_5 N), or
N′ = 1 + int(log_10 N), or
N′ = 1 + int(log_2 N)
Specifically, taking the formula N′ = 1 + int(log_5 N) as an example: suppose the number of a certain class of function-level code is 50000, making it a high-frequency function-level code; the data distribution adjustment policy formula gives an expected retained number of 7, so 7 such function-level codes are randomly selected and retained. Suppose the number of another class of function-level code is 5000, making it a low-frequency function-level code; the formula gives an expected retained number of 6, so 6 such function-level codes are randomly selected and retained. The data distribution adjustment policy thus reduces the number of high-frequency function-level codes from 10 times that of the low-frequency function-level codes to about 1.2 times, so that the numbers of high-frequency and low-frequency function-level codes are balanced.
Table 3 shows the experimental results of using different data distribution adjustment policy formulas:
TABLE 3

Data distribution adjustment policy formula    AvgPassRatio    Pass@1
N′ = 1 + int(log_5 N)                          51.00%          30.28%
N′ = 1 + int(log_10 N)                         49.79%          28.57%
N′ = 1 + int(log_2 N)                          46.21%          26.85%
Unadjusted data distribution                   46.05%          26.28%
As can be seen from the experimental data in Table 3, after the number of function-level codes in the sample data set is adjusted using any of the three formulas in the table, both the AvgPassRatio and Pass@1 values are improved, which shows that balancing the numbers of high-frequency and low-frequency function-level codes in the sample data set can improve the accuracy of the code generated by the model.
Exemplary apparatus
Having introduced the method of the exemplary embodiments of the present disclosure, a code completion apparatus based on generative artificial intelligence according to exemplary embodiments of the present disclosure is described next with reference to FIG. 10.
FIG. 10 schematically illustrates a block diagram of a code completion apparatus based on generative artificial intelligence according to an embodiment of the present disclosure, the apparatus comprising:
an input module 11 for inputting first information into a trained code completion model;
an output module 12 for outputting, by the code completion model, target code corresponding to the first information, the target code being used to realize an expected code function;
wherein the first information comprises a code description statement describing the code function in natural language and/or a code string segment to be completed.
In an embodiment, the code completion model may be obtained by training based on a sample dataset. Wherein the sample dataset may comprise: code sample data, and annotation sample data corresponding to the code sample data.
In the present embodiment, the annotation sample data corresponding to the code sample data may be a code description sentence describing the meaning or function of the corresponding code sample data in natural language.
In one embodiment, the code completion model may be a causal language model obtained by fine-tuning a pre-trained initialization model. The causal language model is used to predict the missing characters in the code string segment to be completed according to input sample data formed by splicing the code sample data and the annotation sample data, so as to determine the target code.
In an embodiment, the causal language model may be obtained by fine-tuning the initialization model based on a first loss function. The annotation sample data and the code sample data each comprise a plurality of tokens, and the first loss function may assign a weight to each token, the weight characterizing how much the loss of that token contributes to the total loss of the initialization model; the weight of each token in the annotation sample data may be smaller than the weight of each token in the code sample data corresponding to the annotation sample data.
In an embodiment, the weight of each token in the annotation sample data may be 0, and the weight of each token in the code sample data may then be a value greater than 0 (e.g., 0.5, 1, 1.1, etc.), so that the first loss function masks the annotation sample data: the computed model loss includes the loss on the code sample data but not the loss on the annotation sample data. Back propagation then concentrates on optimizing the neural network parameters corresponding to the code sample data, avoiding the situation in which the loss on the annotation sample data causes the model to over-adjust its parameters, thereby improving the accuracy of the code predicted by the model.
In one embodiment, the first loss function may be formulated as:
L = −( Σ_{i=1}^{N_D} w_i^D · y_i^D · log(ŷ_i^D) + Σ_{j=1}^{N_C} w_j^C · y_j^C · log(ŷ_j^C) )
wherein w_i^D represents the weight of the i-th token in the annotation sample data, y_i^D represents the real label of the i-th token in the annotation sample data, ŷ_i^D represents the predictive label of the i-th token in the annotation sample data, N_D represents the total number of tokens in the annotation sample data, w_j^C represents the weight of the j-th token in the code sample data, y_j^C represents the real label of the j-th token in the code sample data, ŷ_j^C represents the predictive label of the j-th token in the code sample data, and N_C represents the total number of tokens in the code sample data.
In one embodiment, the initialization model may be a GPT (Generative Pre-trained Transformer) model or an OPT (Open Pre-trained Transformer) model.
In this embodiment, the GPT model may refer to any of the GPT series of models, including GPT-1, GPT-2, GPT-3, GPT-4, GPT-5, and the like. The initialization model may also be a language model with a decoder structure, such as LLaMA (Large Language Model Meta AI), BLOOM (BigScience Large Open-science Open-access Multilingual Language Model), or T5 (Text-to-Text Transfer Transformer).
In one embodiment, the target code may be function-level code, which may include: a function name, a function parameter list, a return value type of the function, and a function body. The function name, function parameter list, and return value type of the function may be collectively referred to as the function signature.
In an embodiment, the sample data set may include a first sample data set, which may be obtained after filtering the code sample data based on a static syntax detection tool, which may be used to discover syntax errors in the code sample data.
In an embodiment, the sample data set may comprise a second sample data set.
In an embodiment, the sample data set may comprise a first sample data set and may further comprise a second sample data set.
The second sample data set may be obtained by filtering the annotation sample data based on a data cleansing policy, where the data cleansing policy may include at least any one of the following: removing repeated sequences in the annotation sample data, removing useless characters in the annotation sample data, removing annotation sample data whose content length is smaller than a first threshold, removing annotation sample data whose content length is larger than a second threshold, removing annotation sample data whose content is code, removing annotation sample data whose content is garbled characters, removing annotation sample data whose content is an unrecognizable foreign language, and removing annotation sample data whose content is not natural language. The first threshold and the second threshold may be set according to actual service requirements; for example, the first threshold may be set to 5 characters and the second threshold to 300 characters, which is not limited by the present disclosure.
In one embodiment, the ratio of the data volume of the first sample data set to the data volume of the second sample data set is
In an embodiment, the sample data set may be formed by combining the first sample data set and the second sample data set. The sample data set includes first function-level code sample data, which may be obtained by adjusting the number of function-level codes in the sample data set based on a data distribution adjustment policy.
In one embodiment, the formula for the data distribution adjustment policy may be:
N′ = 1 + int(log_M N)
wherein M is a real number greater than 1, N represents the number of function-level codes of the same class in the sample data set before adjustment, and N′ represents the number of function-level codes of the same class in the sample data set after adjustment; function-level codes belong to the same class when the function name, function parameter list, return value type, and function body of the function-level codes are the same.
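As an illustrative sketch only, the adjustment policy N′ = 1 + int(log_M N) can be applied per class of identical function-level code as follows; choosing M = 10 and grouping by identical source text are assumptions made for this example.

```python
# Hedged sketch: downsample each class of identical function-level code
# from N occurrences to N' = 1 + int(log_M N) occurrences.
import math
import random
from collections import defaultdict

def adjusted_count(n: int, m: float = 10.0) -> int:
    return 1 + int(math.log(n, m))

def adjust_distribution(functions: list[str], m: float = 10.0) -> list[str]:
    groups: dict[str, list[str]] = defaultdict(list)
    for f in functions:
        groups[f].append(f)               # identical source text => same class
    adjusted = []
    for same_class in groups.values():
        keep = min(adjusted_count(len(same_class), m), len(same_class))
        adjusted.extend(random.sample(same_class, keep))
    return adjusted

# 500 copies of the same function shrink to 1 + int(log10(500)) = 3 copies
print(adjusted_count(500))  # 3
```

This keeps at least one example of every class while logarithmically damping over-represented duplicates, which is the distribution adjustment described above.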
Exemplary Medium
Having introduced the method of an exemplary embodiment of the present disclosure, next, a medium of an exemplary embodiment of the present disclosure is described with reference to fig. 11.
In the present exemplary embodiment, the above-described method may be implemented by a program product that includes program code and is stored, for example, on a portable compact disc read-only memory (CD-ROM), and the program product may be run on a device such as a personal computer. However, the program product of the present disclosure is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium.
The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the C programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
Exemplary computing device
Having introduced the methods, devices, and media of the exemplary embodiments of the present disclosure, next, a computing device of the exemplary embodiments of the present disclosure is described with reference to fig. 12.
The computing device 120 shown in fig. 12 is merely an example and should not be taken as limiting the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 12, computing device 120 is in the form of a general purpose computing device. Components of computing device 120 may include, but are not limited to: the at least one processing unit 1201, the at least one memory unit 1202, and a bus 1203 connecting the different system components (including the processing unit 1201 and the memory unit 1202).
Bus 1203 includes a data bus, a control bus, and an address bus.
The storage unit 1202 may include readable media in the form of volatile memory, such as Random Access Memory (RAM) 12021 and/or cache memory 12022, and may further include readable media in the form of nonvolatile memory, such as Read Only Memory (ROM) 12023.
The storage unit 1202 may also include a program/utility 12025 having a set (at least one) of program modules 12024, such program modules 12024 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The computing device 120 may also communicate with one or more external devices 1204 (e.g., keyboard, pointing device, etc.).
Such communication may occur through an input/output (I/O) interface 1205. Moreover, computing device 120 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 1206. As shown in FIG. 12, network adapter 1206 communicates with other modules of computing device 120 via bus 1203. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with computing device 120, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
It should be noted that while several units/modules or sub-units/modules of the generative artificial intelligence based code complement device are mentioned in the above detailed description, such partitioning is merely exemplary and not mandatory. Indeed, the features and functionality of two or more units/modules described above may be embodied in one unit/module in accordance with embodiments of the present disclosure. Conversely, the features and functions of one unit/module described above may be further divided so as to be embodied by a plurality of units/modules.
Furthermore, although the operations of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that this disclosure is not limited to the particular embodiments disclosed, nor does the division into aspects imply that features in these aspects cannot be combined to advantage; this division is adopted for convenience of description only. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A code completion method based on generated artificial intelligence, comprising:
inputting first information into a trained code complement model, and outputting an object code corresponding to the first information by the code complement model, wherein the object code is used for realizing expected code functions;
The first information comprises a code description statement for describing the code function by natural language and/or a code character string segment to be complemented.
2. The method of claim 1, wherein the code complement model is obtained by training based on a sample dataset;
wherein the sample dataset comprises: code sample data, and annotation sample data corresponding to the code sample data.
3. The method of claim 2, wherein the code complement model is a causal language model obtained by fine tuning on the basis of a pre-trained initialization model;
the causal language model is used for predicting missing characters in the code character string segment to be complemented according to input sample data formed by splicing the code sample data and the annotation sample data, so as to determine the object code.
4. The method of claim 3, wherein the causal language model is obtained by fine tuning the initialization model based on a first loss function;
the annotation sample data and the code sample data respectively comprise a plurality of token, the first loss function is provided with a weight for each token, and the weight characterizes the influence degree of the loss of the token on the total loss of the initialization model; the weight of each token in the annotation sample data is less than the weight of each token in the code sample data corresponding to the annotation sample data.
5. The method of claim 4, wherein each token in the annotated sample data has a weight of 0 such that the first loss function masks the annotated sample data.
6. The method of claim 4, wherein the first loss function is formulated as:
wherein
$$L=\sum_{i=1}^{N_D} w_i^{D}\,\ell\!\left(y_i^{D},\hat{y}_i^{D}\right)+\sum_{j=1}^{N_C} w_j^{C}\,\ell\!\left(y_j^{C},\hat{y}_j^{C}\right)$$
where $w_i^{D}$ represents the weight of the i-th token in the annotation sample data, $y_i^{D}$ represents the true label of the i-th token in the annotation sample data, $\hat{y}_i^{D}$ represents the predicted label of the i-th token in the annotation sample data, and $N_D$ represents the total number of tokens in the annotation sample data; $w_j^{C}$ represents the weight of the j-th token in the code sample data, $y_j^{C}$ represents the true label of the j-th token in the code sample data, $\hat{y}_j^{C}$ represents the predicted label of the j-th token in the code sample data, and $N_C$ represents the total number of tokens in the code sample data; $\ell(\cdot,\cdot)$ represents the per-token loss between a true label and a predicted label.
7. A method according to claim 3, wherein the initialization model is a GPT model or an OPT model.
8. A code completion device based on generated artificial intelligence, comprising:
an input module for inputting first information into a trained code complement model;
an output module for outputting, by the code complement model, a target code corresponding to the first information, wherein the target code is used for realizing an expected code function;
the first information comprises a code description statement for describing the code function by natural language and/or a code character string segment to be complemented.
9. A medium having stored thereon a computer program which, when executed by a processor, implements the method of any of claims 1-7.
10. A computing device, comprising:
a processor;
a memory for storing a processor executable program;
wherein the processor is configured to implement the method of any of claims 1-7 by running the executable program.
CN202311076902.9A 2023-08-23 2023-08-23 Code complement method and device based on generation type artificial intelligence and medium Pending CN117111952A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311076902.9A CN117111952A (en) 2023-08-23 2023-08-23 Code complement method and device based on generation type artificial intelligence and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311076902.9A CN117111952A (en) 2023-08-23 2023-08-23 Code complement method and device based on generation type artificial intelligence and medium

Publications (1)

Publication Number Publication Date
CN117111952A true CN117111952A (en) 2023-11-24

Family

ID=88801606

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311076902.9A Pending CN117111952A (en) 2023-08-23 2023-08-23 Code complement method and device based on generation type artificial intelligence and medium

Country Status (1)

Country Link
CN (1) CN117111952A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117648079A (en) * 2024-01-29 2024-03-05 浙江阿里巴巴机器人有限公司 Task processing, code completion, code question answering and task processing model training method
CN117648079B (en) * 2024-01-29 2024-05-14 浙江阿里巴巴机器人有限公司 Task processing, code completion, code question answering and task processing model training method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination