CN117725920A - Code change label generation method, device and equipment - Google Patents

Code change label generation method, device and equipment

Info

Publication number
CN117725920A
Authority
CN
China
Prior art keywords
code
information
code change
label
change information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311642908.8A
Other languages
Chinese (zh)
Inventor
骆朋帅
周金果
狄鹏
范刚
董德俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202311642908.8A priority Critical patent/CN117725920A/en
Publication of CN117725920A publication Critical patent/CN117725920A/en

Landscapes

  • Machine Translation (AREA)

Abstract

The embodiments of this specification disclose a code change label generation method, apparatus, and device, which help add code change labels more efficiently and reliably and meet personalized requirements. The scheme comprises the following steps: acquiring a code change information set and a predefined label information set; according to the label information contained in the label information set, performing semantic comparison and code difference comparison feature analysis on the code change information contained in the code change information set, and generating corresponding labels for the code change information; acquiring a large language model trained with large-scale natural language data and program language data; performing fine-tuning training on the large language model according to the code change information and the corresponding labels to obtain a label generation large language model; and generating, by using the label generation large language model, corresponding labels for target code change information.

Description

Code change label generation method, device and equipment
Technical Field
The present disclosure relates to the field of machine learning, and in particular, to a method, an apparatus, and a device for generating a code change label.
Background
In software development, code change labels are key information used to identify the type of a code change. Common labels include "fix" (e.g., representing a bug fix), "feat" (e.g., representing a new feature), "test" (e.g., representing a test-related change), etc.
In practical applications, the corresponding label needs to be judged and added manually for each code change. However, because code changes occur frequently and the code base involved is large, manual judgment and labeling are time-consuming, labor-intensive, and inefficient; in particular, when a code change was not produced by the current label maintainer, the reliability of the label is difficult to guarantee. Moreover, the currently used label types are relatively fixed and can hardly adapt to personalized requirements.
Based on this, a solution is needed that facilitates more efficient, more reliable, and more personalized addition of code change labels.
Disclosure of Invention
One or more embodiments of the present disclosure provide a method, an apparatus, a device, and a storage medium for generating a code change label, so as to solve the following technical problem: a solution is needed that facilitates adding code change labels more efficiently, more reliably, and in a more personalized manner.
To solve the above technical problems, one or more embodiments of the present specification are implemented as follows:
one or more embodiments of the present disclosure provide a code change label generating method, including:
acquiring a code change information set and a predefined label information set;
according to the label information contained in the label information set, carrying out semantic comparison and code difference comparison feature analysis on the code change information contained in the code change information set, and generating a corresponding label for the code change information;
acquiring a large language model trained by utilizing large-scale natural language data and program language data;
performing fine tuning training on the large language model according to the code change information and the corresponding label to obtain a label generation large language model;
and generating, by using the label generation large language model, a corresponding label for the target code change information.
One or more embodiments of the present specification provide a code change label generating apparatus, including:
the information set acquisition module acquires a code change information set and a predefined label information set;
The first label generation module is used for carrying out semantic comparison and code difference comparison feature analysis on the code change information contained in the code change information set according to the label information contained in the label information set, and generating a corresponding label for the code change information;
the large language model acquisition module acquires a large language model trained by utilizing large-scale natural language data and program language data;
the model fine tuning training module carries out fine tuning training on the large language model according to the code change information and the corresponding label to obtain a label generation large language model;
and the second label generation module generates, by using the label generation large language model, a corresponding label for the target code change information.
One or more embodiments of the present specification provide a code change label generating apparatus including:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
Acquiring a code change information set and a predefined label information set;
according to the label information contained in the label information set, carrying out semantic comparison and code difference comparison feature analysis on the code change information contained in the code change information set, and generating a corresponding label for the code change information;
acquiring a large language model trained by utilizing large-scale natural language data and program language data;
performing fine tuning training on the large language model according to the code change information and the corresponding label to obtain a label generation large language model;
and generating, by using the label generation large language model, a corresponding label for the target code change information.
One or more embodiments of the present specification provide a non-volatile computer storage medium storing computer-executable instructions configured to:
acquiring a code change information set and a predefined label information set;
according to the label information contained in the label information set, carrying out semantic comparison and code difference comparison feature analysis on the code change information contained in the code change information set, and generating a corresponding label for the code change information;
Acquiring a large language model trained by utilizing large-scale natural language data and program language data;
performing fine tuning training on the large language model according to the code change information and the corresponding label to obtain a label generation large language model;
and generating, by using the label generation large language model, a corresponding label for the target code change information.
The at least one technical solution above adopted by one or more embodiments of the present disclosure can achieve the following beneficial effects: through semantic comparison and code difference comparison feature analysis, the essential or implicit semantics of code change information can be understood more accurately and efficiently, so that reliable labels can be generated automatically and efficiently for small-scale code change information; the small-scale data are then used to fine-tune a large language model that understands both natural language and program language well, so that the large language model quickly gains the ability to reliably generate code change labels and can then be used to efficiently generate labels for other, large-scale code change information. Moreover, the scheme also allows users to customize tag information according to actual needs, and small-scale fine-tuning training data can be obtained just as efficiently for the customized tag information.
Drawings
In order to more clearly illustrate the embodiments of the present description or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some of the embodiments described in the present description, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a file modification content composition according to one or more embodiments of the present disclosure;
FIG. 2 is a flow diagram of the trial scheme generating commit information in the case of a single file change, provided by one or more embodiments of the present disclosure;
FIG. 3 is a flow diagram of the trial scheme generating commit information in the case of multiple file changes, provided by one or more embodiments of the present disclosure;
FIG. 4 is a schematic diagram of a structured Commit Message provided in one or more embodiments of the present disclosure;
FIG. 5 is a flowchart illustrating a method for generating a code change label according to one or more embodiments of the present disclosure;
FIG. 6 is a flow diagram of a code change information set generation scheme provided by one or more embodiments of the present disclosure;
FIG. 7 is a schematic diagram of a tag information collection provided in one or more embodiments of the present disclosure;
FIG. 8 is a flow diagram of an approach provided by one or more embodiments of the present disclosure for generating labels for code change information sets;
FIG. 9 is a schematic diagram of a fine training data set provided in one or more embodiments of the present disclosure;
FIG. 10 is a schematic diagram of a solution for fine-tuning training and deployment of a large language model provided in one or more embodiments of the present disclosure;
FIG. 11 is a schematic diagram of a code change label generating device according to one or more embodiments of the present disclosure;
fig. 12 is a schematic structural diagram of a code change label generating apparatus according to one or more embodiments of the present disclosure.
Detailed Description
The embodiment of the specification provides a code change label generation method, a device, equipment and a storage medium.
In order to make the technical solutions in the present specification better understood by those skilled in the art, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, shall fall within the scope of the present application.
In response to the problems in the background art, the applicant has proposed and tried to implement a scheme (called an attempt scheme) for modifying a label based on automatically generating submission information (such as a Commit Message, which may include a label), but a significant defect is found in the actual test, so that a more preferable scheme is further proposed. The trial scheme is first described in order to more fully embody the advantages of the scheme further proposed herein.
In the trial scheme, file changes are taken as input entry by entry. FIG. 1 is a schematic diagram of the composition of file changes involved in the trial scheme provided in one or more embodiments of the present disclosure. In FIG. 1, each file change consists of two parts: the change operation type of the file and the corresponding file path. The change operation type may be represented by a capital letter, for example, A for a new file operation, D for a delete file operation, M for a modify file operation, R for a rename operation, etc.; the corresponding file path may be represented by a full file name or another encoding format. The commit information is generated by analyzing these two parts, with the following main principle and flow:
For a single file change:
FIG. 2 is a flow diagram of the trial scheme generating commit information in the case of a single file change, provided by one or more embodiments of the present disclosure. It can be summarized mainly in the following steps.
A first part of steps. For a single change, only the corresponding File Path and Change operation type (Change Kind) need to be processed. The specific file change operation, such as modification, addition, or renaming, can be obtained from the change operation type; a new file path is generated by transforming the file path into a more readable form; and the derived file change operation and the transformed file path are then combined to generate the Description part of the commit information.
A second part of steps. A Conventional Commit object is generated according to the file path, and a corresponding Label is then generated by string-matching the file name, the file path, and the file extension against predefined constants. Because the matching may be unsuccessful, a substantially invalid label (for example, "Unknown") may be generated in the end.
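The string-matching step of the trial scheme can be sketched as follows; the keyword table and tag names are illustrative assumptions, since the patent does not disclose the actual predefined constants.

```python
# Hypothetical path-keyword table; the real constants are not given in the text.
TAG_PATH_KEYWORDS = {
    "test": ("test", "spec"),          # test-related files
    "docs": ("readme", ".md"),         # documentation files
    "build": ("pom.xml", "makefile"),  # build configuration files
}

def label_from_path(file_path: str) -> str:
    """Match the file name/path/extension against predefined constants; when
    no rule matches, fall back to the substantially invalid 'Unknown' label."""
    lowered = file_path.lower()
    for tag, keywords in TAG_PATH_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return tag
    return "Unknown"
```

This makes the scheme's weakness concrete: any path outside the enumerated patterns falls straight through to "Unknown".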
Finally, the description generated in the first part of steps and the label generated in the second part of steps are assembled to generate the final commit information.
For multiple file changes:
FIG. 3 is a flow diagram of the trial scheme generating commit information in the case of multiple file changes, provided in one or more embodiments of the present disclosure, described briefly below.
For the case of multiple changes, two kinds of commit messages can ultimately be generated. The first is a labeled message that lists the changed files and their change types; it is generated only when the number of changed files is smaller than a set threshold and all change operations are of the same type. Otherwise, a commit message that only counts the file change operation types is generated, for example, "2 files deleted, 3 files added, 1 file modified".
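The branching logic for multiple file changes can be sketched as below; the threshold value and message wording are assumptions, as the patent leaves both unspecified.

```python
from collections import Counter

FILE_COUNT_THRESHOLD = 5  # assumed value; the text does not fix the threshold

def commit_message_for_changes(changes):
    """changes: list of (change_kind, file_path) tuples, e.g. [('M', 'a.py')].
    Produce a labeled message only when the change count is below the threshold
    and all operations share one type; otherwise only count operation types."""
    kind_counts = Counter(kind for kind, _ in changes)
    if len(changes) < FILE_COUNT_THRESHOLD and len(kind_counts) == 1:
        kind = next(iter(kind_counts))
        return f"{kind}: " + ", ".join(path for _, path in changes)
    # fall back to a pure count of file change operation types
    return "; ".join(f"{n} files {kind}" for kind, n in kind_counts.items())
```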
It follows that the problems with the above trial scheme include: the label generation process is based entirely on the changed file paths in the repository, and labels are generated by enumerating the file path characteristics of various label types and performing string matching, so no valid label can be generated when matching fails; moreover, because it relies solely on file paths, the code changes within the files are not considered at all, and semantic understanding of the specific code changes in the repository is lacking, so the generated labels may be unreliable.
To solve the problems of the background art and the above trial scheme, the applicant provides an improved solution (referred to as the present scheme) through the present application. This automatic label generation scheme based on a large language model uses a cleaned and processed data set with label information to perform fine-tuning training on a specified large language model, so that the model gains the ability to generate the specified labels. Because the large language model is trained on a large amount of natural language data and program language data, it understands both natural language and program language, can better understand the semantics of code changes, and can therefore generate more accurate and appropriate labels.
Furthermore, the present scheme also supports generating a multi-tag set, i.e., multiple tags may be generated for one piece of data, because the same code change may have multiple tag properties; for example, it may belong to both "fix" and "build". The generated multi-tags can be applied in several different scenarios, such as a structured Commit Message, a Change Log, and a Release Note, and a user can select suitable tags from among them according to the requirements of the actual scenario.
Commit Message, i.e., commit Message, is a descriptive text provided by a developer when committing a code change to a platform, which typically contains information about the purpose, content, and context of the change. Fig. 4 is a schematic diagram of a structured Commit Message provided in one or more embodiments of the present disclosure. In this Commit Message, one or more tags automatically generated by the present scheme may be included, as well as a Message portion.
Change Log: a change log is a document that records the version and update history of software, including the changes made in each version, the errors fixed, the functions added, etc., so that users or developers can follow the evolution and update content of the software.
Release Note: a document or record describing the changes, new functionality, fixed errors, and other important information for a particular version of software, an application, or a product. It is typically written by a development team or product management team and provided to users, customers, or other interested parties at the time of a software release or update. The purpose of the release note is to give users clear guidance on the improvements, new functionality, and fixed problems in the new version, so that they can better use, understand, and evaluate the updated content of the software.
Labels automatically generated by this scheme can be used similarly in Change Log, release notes.
The present scheme can automatically generate suitable labels for code changes, analyzing the content of a code change with the language understanding and generation capability of a large language model and generating labels that reflect the characteristics of the change. In a Commit Message, these tags provide a concise, explicit change type that lets developers and team members quickly grasp the nature of the code change. In Release Notes and Change Logs, these tags can support more detailed change descriptions that help users understand what changed in each version. The scheme can improve development efficiency and document accuracy: automatically generated labels accurately reflect the characteristics of the code change, reduce the workload of writing labels manually, and also help provide consistent and normalized change descriptions for better change tracking and understanding.
Based on the general description above, the following continues to describe the present solution in detail.
Fig. 5 is a flowchart of a code change label generating method according to one or more embodiments of the present disclosure. In this flow, the large language model is fine-tuned using the cleaned and processed small-scale dataset with tag information, and then tags are generated for more data using the large language model.
The flow in fig. 5 includes the steps of:
s502: a set of code change information is obtained, and a set of predefined tag information is obtained.
Steps S502 and S504 are the preparation process of the small-scale data set for fine tuning training.
In one or more embodiments of the present disclosure, the code change information set may be collected from a developer, or may be collected from a related platform, for example, by calling an interface provided by a code data management platform (for example, a code version control platform such as GitHub, SVN, etc.), to obtain submission information respectively associated with different change codes submitted by the developer to the code data management platform, where the submission information itself may describe a code change condition, so that the code change information set may be constructed based on the submission information, which is less costly and facilitates selection of a specific fine tuning training sample according to actual needs.
It should be noted that the obtained submission information does not necessarily include a tag, or a reliable tag, but these cases do not affect the implementation of the present solution, because the present solution automatically generates a tag for each code change information in the code change information set.
In the above example, the code change information may be commit information such as a Commit Message, or other forms of change information with a similar use, such as the Release Note and Change Log mentioned above, or annotation information describing changes within the code.
Further, the acquired commit information or code change information can be filtered according to some set policies, retaining relatively high-quality data to form the code change information set to be labeled. The set policies include, for example, at least one of the following: retaining commit information whose corresponding changed code is sufficiently well-regarded and/or has sufficiently high secondary-development activity; filtering out commit information with too little effective information; filtering out commit information whose Token count is too high for the large language model; etc.
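A minimal sketch of these filtering policies follows; the field names ("stars", "forks", "message") and all numeric thresholds are illustrative assumptions rather than values from the patent.

```python
def keep_submission(info: dict, max_tokens: int = 2048) -> bool:
    """Apply the set filtering policies to one piece of commit information."""
    # code not sufficiently well-regarded or reused for secondary development
    if info.get("stars", 0) < 100 or info.get("forks", 0) < 20:
        return False
    message = info.get("message", "").strip()
    if not message:
        return False                       # effective information amount too low
    if len(message.split()) > max_tokens:  # crude stand-in for a real tokenizer
        return False                       # too long for the large language model
    return True
```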
In one or more embodiments of the present disclosure, for a set of tag information, tags may be customized as desired in addition to existing tags.
To make the semantics of a tag clearer, the tag and its related description information can be defined together to form a group of tag information, and multiple groups of tag information form the tag information set. This makes tags convenient to extend and effectively improves the extensibility of the scheme.
S504: and according to the label information contained in the label information set, carrying out semantic comparison and code difference comparison feature analysis on the code change information contained in the code change information set, and generating a corresponding label for the code change information.
In one or more embodiments of the present disclosure, the directly expressed and implicitly contained semantics of the code change information can be understood more deeply based on semantic similarity calculation between the code change information and the tag information, thereby attempting to match corresponding tags to those semantics. Text semantic similarity calculation may be performed using a pre-trained machine learning model, which facilitates comprehensive and accurate mining of the various semantics.
In one or more embodiments of the present disclosure, code differences may be compared in the codeDiff form, which is convenient to analyze and cheap to implement, although other forms and corresponding grammars may also be used to describe the differences between the compared code. By applying feature analysis logic defined separately for different tag information, feature analysis is performed on the codeDiff, specifically analyzing statement keywords or other specific change regions involved in the codeDiff, so as to infer which tags are more semantically consistent with the current code change operation. This helps efficiently and precisely mine certain specific semantics.
S506: a large language model trained using large-scale natural language data and program language data is obtained.
A large language model is usually composed of an artificial neural network with a massive number of parameters and is trained with large-scale training data; it supports supervised training with labeled data as well as self-supervised or semi-supervised training with unlabeled data. The large language model adopted by the present scheme has good understanding of both natural language and program language.
S508: and performing fine tuning training on the large language model according to the code change information and the corresponding label to obtain a label generation large language model.
Fine tuning training refers to the process of further tuning and optimizing the pre-trained model in machine learning. By training on task-specific data, fine-tuning training may make the pre-training model more suitable for a particular application scenario or task. This process typically involves a smaller scale data set and fewer training iterations, which is more efficient and less costly.
Fine-tuning training is performed on the large language model in a supervised manner, based on the code change information and the corresponding labels.
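Preparing the supervised fine-tuning data can be sketched as pairing each piece of code change information with its generated tags; the prompt template below is an assumption, since the patent does not fix a concrete format.

```python
def build_finetune_samples(labeled_changes):
    """Turn (code change information, tags) pairs into supervised
    prompt/answer samples for fine-tuning training."""
    samples = []
    for change, tags in labeled_changes:
        prompt = (
            f"message: {change['message']}\n"
            f"codeDiff:\n{change['codeDiff']}\n"
            "Labels:"
        )
        samples.append({"prompt": prompt, "answer": ", ".join(tags)})
    return samples
```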
S510: generating, by using the label generation large language model, a corresponding label for the target code change information.
In one or more embodiments of this specification, the fine-tuned large language model may be used to automatically generate code change labels for target code change information, such as newly generated code change information. In use, the target code change information, or information obtained by adaptively processing it, is used as the Prompt input to the large language model, and the corresponding Answer output is the generated label.
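The Prompt/Answer usage at inference time can be sketched as below; `model` is any callable taking a prompt string and returning an answer string, standing in for the fine-tuned label generation large language model, and the prompt wording is an assumption.

```python
def generate_labels(model, target_change: dict) -> list:
    """Assemble a Prompt from the target code change information and parse
    the model's Answer into a list of labels."""
    prompt = (
        "Generate code change labels for the following change.\n"
        f"message: {target_change['message']}\n"
        f"codeDiff:\n{target_change['codeDiff']}\n"
        "Answer with comma-separated labels:"
    )
    answer = model(prompt)  # Answer returned by the large language model
    return [tag.strip() for tag in answer.split(",") if tag.strip()]
```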
Through the method of FIG. 5, the essential or implicit semantics of code change information can be understood more accurately and efficiently through semantic comparison and code difference comparison feature analysis, so that apt and reliable labels can be generated automatically and efficiently for small-scale code change information; the small-scale data are then used to fine-tune a large language model that understands both natural language and program language well, so that the large language model quickly gains the ability to reliably generate code change labels and can then be used to efficiently generate labels for other, large-scale code change information. Moreover, the scheme also allows users to customize tag information according to actual needs, and small-scale fine-tuning training data can be obtained just as efficiently for the customized tag information.
Based on the method of fig. 5, the present specification also provides some specific embodiments and extensions of the method, and the following description will proceed.
More intuitively, taking the GitHub as an example of a code version control platform, one or more embodiments of the present disclosure provide a flow chart of a code change information set generating scheme, see fig. 6.
GitHub provides interfaces for retrieving the Commit Message, Release Note, and Change Log, which are used by way of example in FIG. 6. On GitHub, a star indicates a favorable rating and a fork indicates that the current code is being used for secondary development, so data with high star counts and high fork counts can be retained; data with too little effective information can be filtered out, such as an empty Message, a Message containing garbled text, or a Message with no information content (for example, "modified the XXX.java file"); data whose Token count is too high can be filtered out according to the input Token length limit of the large language model to be used subsequently; etc.
After filtering, a Commit Message data set is obtained to serve as the code change information set. A structured representation of the Commit Message data set is also described by way of example below:
In this structure, the commitId field indicates the commit identifier, the message field briefly describes the change, and the focus is on the content of the difference field, which contains multiple groups of change information; each group of change information includes an oldPath field indicating the path before the code change, a newPath field indicating the path after the code change, and a codeDiff field indicating the code difference comparison. This structure is also used in the following examples.
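A record with the described fields might look as follows; all concrete values are invented for illustration, and only the field layout comes from the description above.

```python
# Illustrative record matching the described structure; values are invented.
commit_record = {
    "commitId": "a1b2c3d",
    "message": "fix null pointer in parser",
    "difference": [
        {
            "oldPath": "src/parser.py",  # path before the code change
            "newPath": "src/parser.py",  # path after the code change
            "codeDiff": "@@ -10 +10,2 @@\n+if node is None:\n+    return None",
        }
    ],
}
```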
Further, regarding the tag information set: a tag may be defined, and one or more synonymous phrases in different languages and related descriptions may be generated for it to form one group of tag information; different groups of tag information constructed in this way form the tag information set. For ease of use, a relatively representative word in a synonymous phrase may serve as the tag itself. Intuitively, one or more embodiments of the present disclosure also provide a schematic diagram of a tag information set, see FIG. 7.
In FIG. 7, tag information is exemplarily described in two languages, Chinese and English, and short descriptions are used to reduce the data volume. Each group of tag information contains four pieces of content: a Chinese synonymous phrase, an English synonymous phrase, a Chinese short description, and an English short description.
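A group of tag information with those four pieces of content, plus the representative tag word, can be represented as below; the synonyms and descriptions are examples, not the patent's actual data.

```python
# Illustrative tag information set; contents are invented examples.
tag_info_set = [
    {
        "tag": "fix",
        "zh_synonyms": ["修复", "修正"],
        "en_synonyms": ["fix", "repair", "bugfix"],
        "zh_desc": "修复缺陷",
        "en_desc": "repair a bug",
    },
    {
        "tag": "docs",
        "zh_synonyms": ["文档"],
        "en_synonyms": ["docs", "documentation"],
        "zh_desc": "文档类修改",
        "en_desc": "documentation change",
    },
]
```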
In one or more embodiments of the present disclosure, code change information such as commit information generally includes a message part, a code path part, and a code difference comparison part; the aforementioned semantic comparison or code difference comparison feature analysis can then be selectively applied to the different parts, improving the efficiency and reliability of understanding the semantics of each piece of information.
Semantic comparison may specifically include: determining the message part and the code path part in the code change information contained in the code change information set; extracting keywords from the message part and the code path part to obtain keywords to be compared; and performing text semantic similarity calculation between the keywords to be compared and the tag information contained in the tag information set to obtain a corresponding semantic comparison result, so that corresponding tags are generated for the code change information according to the semantic comparison result.
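The keyword-versus-tag matching step can be sketched as follows; `similarity` is any `(str, str) -> float` function standing in for a pre-trained text-similarity model, and the threshold value is an assumption.

```python
def semantic_match(keywords, tag_info_set, similarity, threshold=0.8):
    """Compare each extracted keyword against every synonym in every group of
    tag information; a tag is assigned when similarity meets the threshold."""
    matched = set()
    for keyword in keywords:
        for group in tag_info_set:
            for synonym in group["en_synonyms"]:
                if similarity(keyword, synonym) >= threshold:
                    matched.add(group["tag"])
    return sorted(matched)
```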
For code difference comparison feature analysis, the method may specifically include: determining a code difference comparison part in the code change information contained in the code change information set; determining feature analysis logic respectively defined for one or more groups of tag information contained in the tag information set; and identifying, in the code difference comparison part, whether a code change has occurred in a specified code region involved in the feature analysis logic, so as to generate a corresponding tag for the code change information according to the result of the identification.
Based on such implementation considerations, one or more embodiments of the present disclosure intuitively provide a flowchart of a scheme for generating labels for code change information sets, see fig. 8.
In fig. 8, a Commit Message data set is exemplarily used as the code change information set. In this case, the message part is specifically the message field content, the code path part is specifically the oldPath and/or newPath (which may be collectively referred to as codePath) field content, and the code difference comparison part is specifically the codeDiff field content.
After the Commit Message data set is collected and filtered, tag generation for the data set is started, which mainly comprises the following parts:
Based on the message and the codePath, keywords are extracted by adopting a machine learning model such as HarvestText or KeyBERT; semantic similarity calculation is then performed between the keywords and each group of tag information in the tag information database by adopting a machine learning model such as hiiamid or service_similarity_hindi, and if the similarity exceeds a set threshold, the corresponding tag information is marked.
Corresponding feature analysis logic is written in advance according to the currently defined tag information. For example, three kinds of tags are defined: "fix", "docs (document type modification)", and "style (style modification)".
For the "fix" tag, the corresponding feature analysis logic includes, for example: if it is recognized in the codeDiff that a code change has occurred in a specified code area corresponding to the tag information, the tag information is generated for the code change information. In this case, the specified code area includes a specified program control flow area. The specified program control flow area may include a condition area in a specified keyword statement, where the specified keyword statement may include at least one of a return statement, a continue statement, a break statement, and an if statement; and/or a catch block area in a try-catch statement, and the like.
That is, if it is recognized that the program control flow is changed during the code change, the case is a "fix". For a modification to the condition of a return statement, a continue statement, a break statement, or an if statement, or to the catch block of a try-catch statement, the "fix" tag may be marked.
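A simplified sketch of such feature analysis logic for the "fix" tag is shown below. It scans only the added/removed lines of a unified diff for control-flow keywords; a real implementation would parse the syntax tree of the affected region to confirm the change falls in a condition area or catch block, so this keyword heuristic is an assumption for illustration.

```python
import re

# Keyword statements whose condition areas, plus catch blocks, indicate a
# change to the program control flow.
CONTROL_FLOW = re.compile(r"\b(return|continue|break|if|catch)\b")

def is_fix(code_diff):
    # Inspect only added/removed lines of the unified diff; a control-flow
    # keyword on such a line suggests the program control flow changed.
    for line in code_diff.splitlines():
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---")):
            if CONTROL_FLOW.search(line):
                return True
    return False

assert is_fix("+    if session is None:\n+        return None")
assert not is_fix("+    count = count + 1")
```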
As for the "docs" and "style" tags, both are non-repair cases. In this case, if it is recognized in the codeDiff that only the specified code area corresponding to the tag information has been changed, the tag information is generated for the code change information. Here, the specified code area includes at least one of: an annotation content area and a blank editing area.
Specifically, for the "docs" tag, the specified code region includes an annotation content region; that is, if it is recognized that only annotation content is modified in the code change, the "docs" tag information may be marked. For the "style" tag information, the specified code region includes a blank editing region; that is, if it is recognized that only blank editing (e.g., adding or deleting blanks) occurs during the code change, the "style" tag may be marked.
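The non-repair checks can be sketched in the same style. This version only recognizes single-line comments with one configurable prefix (Python-style `#` by default), so the comment handling is a simplifying assumption; a practical version would cover the comment syntaxes of each supported language.

```python
def classify_non_repair(code_diff, comment_prefix="#"):
    """Return 'docs' if only annotation content changed, 'style' if only
    blank editing occurred, else None. Sketch for one comment syntax."""
    changed = [line[1:] for line in code_diff.splitlines()
               if line.startswith(("+", "-"))
               and not line.startswith(("+++", "---"))]
    if changed and all(l.lstrip().startswith(comment_prefix) for l in changed):
        return "docs"    # every changed line is a comment line
    if changed and all(not l.strip() for l in changed):
        return "style"   # every changed line is blank (added/deleted blanks)
    return None

assert classify_non_repair("+# explain the retry loop\n-# old comment") == "docs"
assert classify_non_repair("+\n+    ") == "style"
assert classify_non_repair("+return None") is None
```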
Similarly, corresponding feature analysis logic can be written for other different tag information according to the difference of the specific areas of the change.
The tags marked by the two parts of processing can be combined; the tag-marked Commit Message data set, as the corresponding tag set generated for the current code change information, is structured as follows:
it can be seen that a labels field is added, which contains one or more generated tags.
In practical applications, structured data such as Commit Message data set is not convenient to be directly used for fine tuning training of a large language model, and needs to be used for training after conversion processing, which may specifically include: extracting field contents respectively corresponding to the message part, the code path part and the code difference comparison part from the code change information, and fusing and converting the field contents into character strings serving as prompt information; one or more labels generated for the code change information are used as answer information, and a prompt and answer pair is formed by the label and the prompt information; and taking the prompt and answer pair as supervised training data to conduct fine tuning training on the large language model. Intuitively, one or more embodiments of the present description provide a schematic diagram of a fine-tuning training dataset, see fig. 9.
In fig. 9, the prompt information is specifically the Prompt, and the answer information is specifically the Answer. The message, oldPath, newPath, and codeDiff in the Commit Message data set can be used as the Prompt, spliced into one character string through newline characters; the content of the corresponding labels field is used as the Answer, spliced into one character string through commas or other separators. Each group of data is processed in this way to form a <Prompt, Answer> pair, i.e., one piece of training data.
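The conversion into <Prompt, Answer> pairs can be sketched as follows. The record shape is assumed from the earlier discussion, and for brevity this sketch splices only the first change group; a real record with multiple difference groups would have all of them spliced in.

```python
def to_prompt_answer(record):
    # Fuse message, oldPath, newPath, and codeDiff into one newline-joined
    # character string as the Prompt; join the generated labels by commas
    # as the Answer.
    group = record["difference"][0]
    prompt = "\n".join([record["message"], group["oldPath"],
                        group["newPath"], group["codeDiff"]])
    answer = ",".join(record["labels"])
    return prompt, answer

record = {"message": "fix: handle empty user session",
          "difference": [{"oldPath": "a.py", "newPath": "a.py",
                          "codeDiff": "+return None"}],
          "labels": ["fix"]}
prompt, answer = to_prompt_answer(record)
assert answer == "fix"
```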
Based on the prepared fine-tuning training dataset, further, one or more embodiments of the present disclosure provide a schematic diagram of a solution for fine-tuning training and deployment of a large language model, see fig. 10.
In the scheme of fig. 10, the labeled data is the fine-tuning training data set, and the large language model is pre-trained with large-scale natural language data and program language data. The labeled data is used to further fine-tune the large language model to generate a new model with code change tag generation capability. The new model is then subjected to inference and evaluation: if it meets the requirement, it can be directly deployed and used; if the effect is not as good as expected, the training data set can be adjusted and optimized for the next iteration, until a satisfactory effect is finally achieved. In addition, since multiple tags may be generated by this scheme while only one or some of them may be needed in actual use, a post-processing model (Post Process Module) may be added in actual deployment for correction and selection of the tags.
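The Post Process Module step can be sketched as a small filter over the model output; the normalization and selection rules below are assumptions for illustration, since the patent leaves the correction and selection logic open.

```python
def post_process(generated_labels, allowed_labels):
    # Post Process Module sketch: in deployment, correct and select among
    # the possibly multiple generated labels, keeping only the ones the
    # caller actually needs, without duplicates.
    seen, selected = set(), []
    for label in generated_labels:
        label = label.strip().lower()       # simple correction: normalize form
        if label in allowed_labels and label not in seen:
            seen.add(label)
            selected.append(label)
    return selected

assert post_process([" Fix", "docs", "fix"], {"fix", "style"}) == ["fix"]
```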
Based on the same thought, one or more embodiments of the present disclosure further provide apparatuses and devices corresponding to the above method, as shown in fig. 11 and fig. 12. The apparatus and device are capable of performing the above method and related alternatives accordingly.
Fig. 11 is a schematic structural diagram of a code change label generating apparatus according to one or more embodiments of the present disclosure, where the apparatus includes:
an information set acquisition module 1102 that acquires a code change information set and a predefined tag information set;
the first tag generation module 1104 performs semantic comparison and code difference comparison feature analysis on the code change information contained in the code change information set according to the tag information contained in the tag information set, and generates a corresponding tag for the code change information;
a large language model acquisition module 1106 that acquires a large language model trained using large-scale natural language data and program language data;
the model fine tuning training module 1108 performs fine tuning training on the large language model according to the code change information and the corresponding label, to obtain a label-generation large language model;
the second tag generation module 1110 uses the label-generation large language model to generate a corresponding label for the target code change information.
Optionally, before the information set obtaining module 1102 obtains the code change information set, the information set obtaining module obtains submitted information respectively associated with different change codes submitted by a developer to the code data management platform by calling an interface provided by the code data management platform;
And filtering the submitted information according to a set strategy to generate the code change information set.
Optionally, the code data management platform comprises a code version control platform;
the setting strategy comprises at least one of the following:
retaining submitted information whose corresponding changed code is of sufficiently high quality and/or has sufficiently high secondary-development popularity;
filtering out submitted information with too low an effective information amount;
and filtering out submitted information whose number of Tokens for the large language model is too high.
Optionally, the information set obtaining module 1102 defines a tag, and synonym phrases and related descriptions of one or more languages generated for the tag, before the obtaining of the predefined tag information set, to form a set of tag information;
the tag information sets composed of different sets of the tag information are determined.
Optionally, the first tag generating module 1104 determines a message part and a code path part in the code change information included in the code change information set;
extracting keywords from the message part and the code path part to obtain keywords to be compared;
and carrying out text semantic similarity calculation on the keywords to be compared and the tag information contained in the tag information set to obtain a corresponding semantic comparison result, so as to generate corresponding tags for the code change information according to the semantic comparison result.
Optionally, the first tag generating module 1104 determines a code difference comparing part in the code change information included in the code change information set;
determining feature analysis logic respectively defined for one or more groups of tag information contained in the tag information set;
in the code difference comparing section, it is identified whether a code change has occurred in a specified code region involved in the feature analysis logic, so that a corresponding tag is generated for the code change information according to the result of the identification.
Optionally, the first tag generation module 1104 generates, for tag information indicating that the code change type is repair, the tag information for the code change information if it is recognized in the code difference comparison part that the code change has occurred in a specified code area related to the feature analysis logic corresponding to the tag information;
the specified code area corresponding to the label information for representing the code change type as repair comprises a specified program control flow area.
Optionally, the specified program control flow area includes a condition area in a specified keyword sentence;
the specified keyword statement comprises at least one of a return statement, a continue statement, a break statement, and an if statement; and/or a catch block region in a try-catch statement.
Optionally, the first tag generating module 1104 generates tag information for the code change information if it is identified that only a specified code area involved in the feature analysis logic corresponding to the tag information has been changed in the code difference comparing part, for the tag information indicating that the code change type is non-repair, where the non-repair includes document type modification, and/or style type modification;
the specified code area corresponding to the label information for representing the code change type as non-repair comprises at least one of the following:
annotating the content area;
blank editing area.
Optionally, the code change information is structured data;
the model fine tuning training module 1108 extracts field contents corresponding to the message part, the code path part and the code difference comparison part from the code change information, and performs fusion conversion to form a character string as prompt information;
one or more labels generated for the code change information are used as answer information, and a prompt and answer pair is formed by the label and the prompt information;
and taking the prompt and answer pair as supervised training data, and performing fine tuning training on the large language model.
Fig. 12 is a schematic structural diagram of a code change label generating apparatus provided in one or more embodiments of the present disclosure, where the apparatus includes:
at least one processor; the method comprises the steps of,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
acquiring a code change information set and a predefined label information set;
according to the label information contained in the label information set, carrying out semantic comparison and code difference comparison feature analysis on the code change information contained in the code change information set, and generating a corresponding label for the code change information;
acquiring a large language model trained by utilizing large-scale natural language data and program language data;
performing fine tuning training on the large language model according to the code change information and the corresponding label, to obtain a label-generation large language model;
and generating a corresponding label for the target code change information by using the label-generation large language model.
Based on the same considerations, one or more embodiments of the present specification further provide a non-volatile computer storage medium storing computer-executable instructions configured to:
acquiring a code change information set and a predefined label information set;
according to the label information contained in the label information set, carrying out semantic comparison and code difference comparison feature analysis on the code change information contained in the code change information set, and generating a corresponding label for the code change information;
acquiring a large language model trained by utilizing large-scale natural language data and program language data;
performing fine tuning training on the large language model according to the code change information and the corresponding label, to obtain a label-generation large language model;
and generating a corresponding label for the target code change information by using the label-generation large language model.
In the 1990s, an improvement of a technology could clearly be distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement to the method flow). However, with the development of technology, many improvements of method flows today can be regarded as direct improvements of hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD) (e.g., a field programmable gate array (Field Programmable Gate Array, FPGA)) is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD, without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, nowadays, instead of manually manufacturing integrated circuit chips, such programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the original code to be compiled is also written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language), among which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently most commonly used.
It will also be apparent to those skilled in the art that a hardware circuit implementing the logic method flow can be readily obtained by merely slightly programming the method flow into an integrated circuit using several of the hardware description languages described above.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer readable medium storing computer readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller. Examples of the controller include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller in purely computer readable program code, it is entirely possible to implement the same functionality by logically programming the method steps such that the controller takes the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may thus be regarded as a kind of hardware component, and the means included therein for performing various functions may also be regarded as structures within the hardware component. Or even the means for performing various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in one or more software and/or hardware elements when implemented in the present specification.
It will be appreciated by those skilled in the art that the present description may be provided as a method, system, or computer program product. Accordingly, the present specification embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description embodiments may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, program modules, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape/magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for apparatus, devices, non-volatile computer storage medium embodiments, the description is relatively simple, as it is substantially similar to method embodiments, with reference to the section of the method embodiments being relevant.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The foregoing is merely one or more embodiments of the present description and is not intended to limit the present description. Various modifications and alterations to one or more embodiments of this description will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, or the like, which is within the spirit and principles of one or more embodiments of the present description, is intended to be included within the scope of the claims of the present description.

Claims (21)

1. A code change label generation method, comprising:
acquiring a code change information set and a predefined label information set;
according to the label information contained in the label information set, carrying out semantic comparison and code difference comparison feature analysis on the code change information contained in the code change information set, and generating a corresponding label for the code change information;
acquiring a large language model trained by utilizing large-scale natural language data and program language data;
performing fine tuning training on the large language model according to the code change information and the corresponding label, to obtain a label-generation large language model;
and generating a corresponding label for the target code change information by using the label-generation large language model.
2. The method of claim 1, prior to the acquiring the set of code change information, the method further comprising:
acquiring submission information respectively associated with different change codes submitted to the code data management platform by a developer by calling an interface provided by the code data management platform;
and filtering the submitted information according to a set strategy to generate the code change information set.
3. The method of claim 2, the code data management platform comprising a code version control platform;
the setting strategy comprises at least one of the following:
retaining submitted information whose corresponding changed code is of sufficiently high quality and/or has sufficiently high secondary-development popularity;
filtering out submitted information with too low an effective information amount;
and filtering out submitted information whose number of Tokens for the large language model is too high.
4. The method of claim 1, prior to the obtaining the predefined set of tag information, the method further comprising:
defining a label, and generating synonymous word groups and related descriptions of one or more languages for the label to form a group of label information;
the tag information sets composed of different sets of the tag information are determined.
5. The method of claim 1, wherein the semantic comparison of the code change information included in the code change information set according to the tag information included in the tag information set specifically includes:
determining a message part and a code path part in code change information contained in the code change information set;
extracting keywords from the message part and the code path part to obtain keywords to be compared;
And carrying out text semantic similarity calculation on the keywords to be compared and the tag information contained in the tag information set to obtain a corresponding semantic comparison result, so as to generate corresponding tags for the code change information according to the semantic comparison result.
6. The method of claim 1, wherein the performing code difference comparison feature analysis on the code change information included in the code change information set according to the tag information included in the tag information set specifically includes:
determining a code difference comparison part in code change information contained in the code change information set;
determining feature analysis logic respectively defined for one or more groups of tag information contained in the tag information set;
in the code difference comparing section, it is identified whether a code change has occurred in a specified code region involved in the feature analysis logic, so that a corresponding tag is generated for the code change information according to the result of the identification.
7. The method of claim 6, wherein the generating a corresponding tag for the code change information according to the result of the identifying specifically includes:
In the tag information indicating that the code change type is repair, if it is recognized that the code change has occurred in the specified code region related to the feature analysis logic corresponding to the tag information in the code difference comparison section, the tag information is generated for the code change information;
the specified code area corresponding to the label information for representing the code change type as repair comprises a specified program control flow area.
8. The method of claim 7, wherein the specified program control flow area comprises a conditional area in a specified keyword sentence;
the specified keyword statement comprises at least one of a return statement, a continue statement, a break statement, and an if statement; and/or a catch block region in a try-catch statement.
9. The method of claim 6, wherein the generating a corresponding tag for the code change information according to the result of the identifying specifically includes:
for the tag information indicating that the code change type is non-repair, if in the code difference comparison part, it is recognized that only the designated code area involved in the feature analysis logic corresponding to the tag information is changed, generating the tag information for the code change information, wherein the non-repair comprises document type modification and/or style type modification;
The specified code area corresponding to the tag information for representing the code change type as non-repair comprises at least one of the following:
annotating the content area;
blank editing area.
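The non-repair check of claim 9 amounts to testing whether every changed line falls in a comment content area or a whitespace editing area. A minimal sketch under that assumption (the comment-marker regex is a crude heuristic, not the patented feature analysis logic):

```python
import re

# Common single-line and block comment markers (crude heuristic).
COMMENT_RE = re.compile(r"^\s*(#|//|/\*|\*)")

def is_non_repair_change(diff_text: str) -> bool:
    """True when every changed line is only comment content or whitespace,
    i.e. the change qualifies for a non-repair (document/style) label."""
    changed = [
        line[1:] for line in diff_text.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]
    if not changed:
        return False
    return all(l.strip() == "" or bool(COMMENT_RE.match(l)) for l in changed)
```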
10. The method of claim 1, the code change information being structured data;
and performing fine tuning training on the large language model according to the code change information and the corresponding label thereof, wherein the fine tuning training specifically comprises the following steps:
extracting, from the code change information, the field contents respectively corresponding to the message portion, the code path portion and the code difference comparison portion, and fusing and converting them into a character string serving as prompt information;
taking one or more labels generated for the code change information as answer information, and forming a prompt-and-answer pair from the answer information and the prompt information;
and taking the prompt and answer pair as supervised training data, and performing fine tuning training on the large language model.
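The prompt-and-answer construction of claim 10 could look roughly like the following sketch; the field names (`message`, `path`, `diff`) and the prompt layout are assumptions, not the format specified by the patent:

```python
def build_training_pair(change: dict, labels: list) -> dict:
    """Fuse the message, code path and diff fields of one structured code
    change record into a prompt string, and pair it with the generated
    labels as the answer (one supervised fine-tuning sample)."""
    prompt = (
        f"commit message: {change['message']}\n"
        f"code path: {change['path']}\n"
        f"diff:\n{change['diff']}"
    )
    return {"prompt": prompt, "answer": ", ".join(labels)}

# Hypothetical example record and its resulting training pair.
sample = build_training_pair(
    {"message": "fix NPE in login", "path": "src/auth/Login.java",
     "diff": "+ if (user == null) return;"},
    ["repair"],
)
```

Pairs of this shape would then be fed to the fine-tuning procedure as supervised training data.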
11. A code change label generating apparatus comprising:
the information set acquisition module acquires a code change information set and a predefined label information set;
the first label generation module is used for carrying out semantic comparison and code difference comparison feature analysis on the code change information contained in the code change information set according to the label information contained in the label information set, and generating a corresponding label for the code change information;
The large language model acquisition module acquires a large language model trained by utilizing large-scale natural language data and program language data;
the model fine tuning training module carries out fine tuning training on the large language model according to the code change information and the corresponding label to obtain a label generation large language model;
and the second label generating module is used for generating a large language label generating model by using the label and generating a corresponding label for the target code change information.
12. The apparatus of claim 11, wherein, before obtaining the code change information set, the information set acquisition module obtains, by calling an interface provided by a code data management platform, submission information respectively associated with different change codes submitted by developers to the code data management platform;
and filtering the submitted information according to a set strategy to generate the code change information set.
13. The apparatus of claim 12, the code data management platform comprising a code version control platform;
the setting strategy comprises at least one of the following:
retaining submission information whose corresponding change code has a sufficiently high approval rating and/or secondary-development popularity;
filtering out submission information whose effective information content is too low;
filtering out submission information whose token count is too high for the large language model.
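The set policy of claims 12 and 13 can be sketched as a simple filter pass. The thresholds, field names, and the use of whitespace splitting as a token estimate are all assumptions for illustration:

```python
def filter_submissions(submissions, min_info_chars=20, max_tokens=4096,
                       min_approval=0):
    """Apply the set policy: drop records with too little effective
    information or too many tokens for the model; keep those whose
    change code is popular/approved enough."""
    kept = []
    for s in submissions:
        text = (s["message"] + s["diff"]).strip()
        if len(text) < min_info_chars:          # too little effective info
            continue
        if len(text.split()) > max_tokens:      # crude token-count estimate
            continue
        if s.get("approval", 0) < min_approval: # not popular/approved enough
            continue
        kept.append(s)
    return kept
```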
14. The apparatus of claim 11, wherein, before obtaining the predefined label information set, the information set acquisition module defines a label and generates, for the label, synonym phrases in one or more languages and a related description, so as to form a group of label information;
and determines the label information set composed of different groups of the label information.
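A group of label information per claim 14 — a label plus multilingual synonym phrases and a description — might be represented as below; the concrete labels, synonyms, and structure are hypothetical:

```python
# Each entry is one "group of label information": the label itself,
# synonym phrases (possibly in several languages), and a description.
LABEL_INFO_SET = [
    {
        "label": "repair",
        "synonyms": ["fix", "bugfix", "repair", "修复"],
        "description": "the change fixes defective program behavior",
    },
    {
        "label": "docs",
        "synonyms": ["docs", "documentation", "comment", "文档"],
        "description": "the change only modifies documents or comments",
    },
]

def synonyms_of(label: str) -> list:
    """Look up the synonym phrases defined for a given label."""
    for info in LABEL_INFO_SET:
        if info["label"] == label:
            return info["synonyms"]
    return []
```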
15. The apparatus of claim 11, the first tag generation module to determine a message portion and a code path portion in code change information included in the set of code change information;
extracting keywords from the message part and the code path part to obtain keywords to be compared;
and carrying out text semantic similarity calculation on the keywords to be compared and the tag information contained in the tag information set to obtain a corresponding semantic comparison result, so as to generate corresponding tags for the code change information according to the semantic comparison result.
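The keyword extraction and semantic comparison of claim 15 could be sketched as follows. A real implementation would likely use embedding-based text similarity; here a simple synonym-overlap score stands in for it, and all names are assumptions:

```python
def extract_keywords(message: str, path: str) -> set:
    """Split the message portion and code path portion into keyword tokens."""
    tokens = set(message.lower().split())
    tokens |= {p.lower()
               for p in path.replace("/", " ").replace(".", " ").split()}
    return tokens

def semantic_score(keywords: set, label_synonyms: list) -> float:
    """Stand-in for text semantic similarity: the fraction of a label's
    synonym phrases that appear among the extracted keywords."""
    if not label_synonyms:
        return 0.0
    hits = sum(1 for s in label_synonyms if s.lower() in keywords)
    return hits / len(label_synonyms)
```

A label whose score exceeds some threshold would then be attached to the code change information.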
16. The apparatus of claim 11, the first tag generation module to determine a code difference comparison portion in code change information included in the code change information set;
Determining feature analysis logic respectively defined for one or more groups of tag information contained in the tag information set;
in the code difference comparing section, it is identified whether a code change has occurred in a specified code region involved in the feature analysis logic, so that a corresponding tag is generated for the code change information according to the result of the identification.
17. The apparatus of claim 16, wherein, for label information indicating that the code change type is repair, the first tag generation module generates the label information for the code change information if it is recognized in the code difference comparison portion that a code change has occurred in the specified code region involved in the feature analysis logic corresponding to the label information;
the specified code area corresponding to the label information for representing the code change type as repair comprises a specified program control flow area.
18. The apparatus of claim 17, wherein the specified program control flow area comprises: a condition area in a specified keyword statement, the specified keyword statement comprising at least one of a return statement, a continue statement, a break statement and an if statement; and/or a catch block area in a try-catch statement.
19. The apparatus of claim 16, wherein, for label information indicating that the code change type is non-repair, the first tag generation module generates the label information for the code change information if it is recognized in the code difference comparison portion that only the specified code region involved in the feature analysis logic corresponding to the label information has been changed, the non-repair type comprising a document-type modification and/or a style-type modification;
the specified code area corresponding to the label information indicating that the code change type is non-repair comprises at least one of the following:
a comment content area;
a whitespace editing area.
20. The apparatus of claim 11, the code change information being structured data;
the model fine tuning training module extracts, from the code change information, the field contents respectively corresponding to the message portion, the code path portion and the code difference comparison portion, and fuses and converts them into a character string serving as prompt information;
takes one or more labels generated for the code change information as answer information, and forms a prompt-and-answer pair from the answer information and the prompt information;
and taking the prompt and answer pair as supervised training data, and performing fine tuning training on the large language model.
21. A code change label generating apparatus comprising:
at least one processor; the method comprises the steps of,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform:
acquiring a code change information set and a predefined label information set;
according to the label information contained in the label information set, carrying out semantic comparison and code difference comparison feature analysis on the code change information contained in the code change information set, and generating a corresponding label for the code change information;
acquiring a large language model trained by utilizing large-scale natural language data and program language data;
performing fine tuning training on the large language model according to the code change information and the corresponding label to obtain a label generation large language model;
and generating a large language label generating model by using the label, and generating a corresponding label for the target code change information.
CN202311642908.8A 2023-12-01 2023-12-01 Code change label generation method, device and equipment Pending CN117725920A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311642908.8A CN117725920A (en) 2023-12-01 2023-12-01 Code change label generation method, device and equipment

Publications (1)

Publication Number Publication Date
CN117725920A true CN117725920A (en) 2024-03-19

Family

ID=90209807


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination