CN115481035A - Model training method and device, code retrieval method and device and storage medium - Google Patents

Model training method and device, code retrieval method and device and storage medium

Info

Publication number
CN115481035A
Authority
CN
China
Prior art keywords
vector
code segment
training
queried
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211163783.6A
Other languages
Chinese (zh)
Inventor
姜林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd
Priority to CN202211163783.6A
Publication of CN115481035A
Legal status: Pending (Current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; error correction; monitoring
    • G06F 11/36 Preventing errors by testing or debugging software
    • G06F 11/3664 Environments for testing or debugging software
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/237 Lexical tools
    • G06F 40/242 Dictionaries
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods


Abstract

The disclosure provides a model training method and device, a code retrieval method and device, and a storage medium, and relates to the field of big data. The model training method comprises: extracting internal information and external information of a sample code segment that throws an exception; converting the internal information and the external information of the sample code segment into an initial vector of the sample code segment; constructing a training sample using the initial vector of the sample code segment; and training a machine learning model with the training samples and writing the embedded vectors output by the machine learning model for the training samples into a vector database. The present disclosure uses the trained machine learning model and the generated vector database to perform code retrieval for code that throws an exception.

Description

Model training method and device, code retrieval method and device, and storage medium
Technical Field
The disclosure relates to the field of big data, and in particular to a model training method and device, a code retrieval method and device, and a storage medium.
Background
In software development, an exception is an abnormal event occurring while software is running; it usually interrupts the normal execution flow and undermines the robustness of the software. If not properly handled, an exception may cause serious problems. To increase robustness, most modern programming languages provide built-in exception handling mechanisms for representing, communicating, and handling exceptions. Exception handling mechanisms help reduce the probability of a program crash while providing the information needed to eliminate bugs. The exception handling mechanism therefore plays an extremely important role in the software development process.
To improve the quality of exception handling code in open source software, current automatic exception-handling-code recommendation methods mainly build abstract representations of open source code and use those representations to train probability models, so that the most appropriate exception handling code is automatically generated for a code segment that throws an exception.
Disclosure of Invention
The inventors have noted that, in the related art, methods that generate exception handling code based on a probabilistic learning model tend to recommend high-frequency code while ignoring less common low-frequency code. Moreover, the exception handling code they recommend is often not syntactically correct, which reduces the development efficiency of the programmer.
The present disclosure therefore provides a code retrieval scheme in which code segments that throw exceptions are embedded into a low-dimensional continuous vector space by a deep learning technique to form a vector database for retrieving exception handling code; this exploits the fitting capability of a deep learning model while avoiding the limitations of directly generating exception handling code.
According to a first aspect of embodiments of the present disclosure, there is provided a model training method, performed by a model training apparatus, comprising: extracting internal information and external information of a sample code segment that throws an exception; converting the internal information and the external information of the sample code segment into an initial vector of the sample code segment; constructing a training sample using the initial vector of the sample code segment; and training a machine learning model using the training sample, and writing the embedded vector output by the machine learning model for the training sample into a vector database.
In some embodiments, extracting the internal information of the sample code snippet that throws the exception comprises: storing the tokens of the sample code segment, traversed in sequence, into a first list, and using the first list as the internal information of the sample code segment, wherein the length of the first list is not greater than a preset length threshold.
In some embodiments, extracting the external information of the sample code fragment that throws the exception comprises: parsing the sample code segment into a first abstract syntax tree, and extracting the features of the method and class of the sample code segment from the first abstract syntax tree to obtain the external information of the sample code segment.
In some embodiments, the features of the method and class of the sample code snippet include at least one of: the name of the enclosing type (enclosingType), the name of the enclosing method (enclosingMethod), the return type of the enclosing method (returnType), the annotation type of the enclosing method (annotation), and the exception type thrown by the enclosing method (throwsException), wherein only the first item of the annotation type and of the exception type is retained.
In some embodiments, converting the internal information and the external information of the sample code segment into the initial vector of the sample code segment comprises: constructing, using a word vector model, a corresponding internal information dictionary and an external information dictionary for the internal information and the external information of the sample code segment, respectively; converting each token of the internal information of the sample code segment into a corresponding first vector according to the internal information dictionary; converting each token of the external information of the sample code segment into a corresponding second vector according to the external information dictionary; and concatenating all first vectors and all second vectors to generate the initial vector of the sample code segment.
In some embodiments, the training sample is item = &lt;input, input&gt;, where input is the initial vector of the sample code fragment.
In some embodiments, the machine learning model comprises, in order, an input layer, a plurality of hidden layers, a fully-connected layer, and an output layer, each hidden layer comprising a convolutional layer and a pooling layer, wherein the input layer and the output layer have the same dimensions; writing the embedded vector output by the machine learning model corresponding to the training sample into a vector database comprises: writing the embedded vectors corresponding to the training samples output by the fully-connected layer in the machine learning model into a vector database.
According to a second aspect of the embodiments of the present disclosure, there is provided a model training apparatus including: a first training module configured to extract internal information and external information of a sample code fragment throwing an exception; a second training module configured to convert internal information and external information of the sample code snippet into an initial vector of the sample code snippet; a third training module configured to construct a training sample using the initial vector of the sample code segment, train a machine learning model using the training sample, and write an embedded vector corresponding to the training sample output by the machine learning model into a vector database.
According to a third aspect of the embodiments of the present disclosure, there is provided a model training apparatus including: a memory configured to store instructions; a processor coupled to the memory, the processor configured to perform a method implementing any of the embodiments described above based on instructions stored by the memory.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a code retrieval method comprising: extracting internal information and external information of a code segment to be queried that currently throws an exception; converting the internal information and the external information of the code segment to be queried into an initial vector of the code segment to be queried; processing the initial vector of the code segment to be queried with a machine learning model to obtain a query vector, wherein the machine learning model is trained using the training method of any of the above embodiments; calculating a similarity between the query vector and each embedded vector in a vector database, wherein the vector database is obtained using the training method of any of the above embodiments; taking the embedded vector corresponding to the maximum similarity as a target embedded vector; and taking the exception handling code corresponding to the target embedded vector as a retrieval result.
In some embodiments, extracting the internal information of the code snippet to be queried that currently throws the exception comprises: storing the tokens of the code segment to be queried, traversed in sequence, into a second list, and using the second list as the internal information of the code segment to be queried, wherein the length of the second list is not greater than a preset length threshold.
In some embodiments, extracting the external information of the code fragment to be queried that currently throws the exception comprises: parsing the code segment to be queried into a second abstract syntax tree, and extracting the features of the method and class of the code segment to be queried from the second abstract syntax tree to obtain the external information of the code segment to be queried.
In some embodiments, the features of the method and class of the code segment to be queried include at least one of: the name of the enclosing type (enclosingType), the name of the enclosing method (enclosingMethod), the return type of the enclosing method (returnType), the annotation type of the enclosing method (annotation), and the exception type thrown by the enclosing method (throwsException), wherein only the first item of the annotation type and of the exception type is retained.
In some embodiments, converting the internal information and the external information of the code segment to be queried into the initial vector of the code segment to be queried comprises: constructing, using a word vector model, a corresponding internal information dictionary and an external information dictionary for the internal information and the external information of the code segment to be queried, respectively; converting each token of the internal information of the code segment to be queried into a corresponding third vector according to the internal information dictionary; converting each token of the external information of the code segment to be queried into a corresponding fourth vector according to the external information dictionary; and concatenating all third vectors and all fourth vectors to generate the initial vector of the code segment to be queried.
According to a fifth aspect of an embodiment of the present disclosure, there is provided a code retrieval apparatus comprising: a first processing module configured to extract internal information and external information of a code segment to be queried that currently throws an exception; a second processing module configured to convert the internal information and the external information of the code segment to be queried into an initial vector of the code segment to be queried; a third processing module configured to process the initial vector of the code segment to be queried with a machine learning model to obtain a query vector, wherein the machine learning model is trained using the training method described in any of the above embodiments; a fourth processing module configured to calculate a similarity between the query vector and each embedded vector in a vector database, wherein the vector database is obtained using the training method described in any of the above embodiments; and a fifth processing module configured to take the embedded vector corresponding to the maximum similarity as a target embedded vector and take the exception handling code corresponding to the target embedded vector as a retrieval result.
According to a sixth aspect of an embodiment of the present disclosure, there is provided a code retrieval apparatus including: a memory configured to store instructions; a processor coupled to the memory, the processor configured to perform a method implementing any of the embodiments described above based on instructions stored by the memory.
According to a seventh aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions, and the instructions, when executed by a processor, implement the method according to any one of the embodiments.
Other features of the present disclosure and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present disclosure; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart diagram of a model training method according to an embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of a machine learning model according to one embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of a model training apparatus according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a model training device according to another embodiment of the present disclosure;
FIG. 5 is a flowchart illustrating a code retrieval method according to an embodiment of the disclosure;
FIG. 6 is a schematic structural diagram of a code retrieval apparatus according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a code retrieval apparatus according to another embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
The relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Fig. 1 is a schematic flow chart of a model training method according to an embodiment of the present disclosure. In some embodiments, the following model training method is performed by a model training apparatus.
At step 101, internal information and external information of a sample code fragment throwing an exception are extracted.
In some embodiments, extracting the internal information of the sample code fragment that throws the exception comprises: storing the tokens of the sample code segment, traversed in sequence, into a list, and using that list as the internal information of the sample code segment, wherein the length of the list is not greater than a preset length threshold.
That is, the tokens of the sample code segment are traversed in order and stored in a list whose maximum length is limited to Nmax; tokens beyond the maximum length are deleted. The resulting list intList serves as the internal information of the sample code snippet.
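For illustration only (this sketch is not part of the claimed subject matter), the internal-information extraction could look as follows in Python; the tokenizer and the value of Nmax are assumptions, not specified by the disclosure:

    import re

    N_MAX = 50  # assumed value of the preset length threshold Nmax

    def extract_internal_info(code_snippet):
        """Traverse the snippet's tokens in order and keep at most N_MAX of them."""
        # Naive lexer: identifiers, number literals, and single punctuation marks.
        tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code_snippet)
        return tokens[:N_MAX]  # tokens beyond the maximum length are deleted

    int_list = extract_internal_info("try { stream.read(buf); } catch (IOException e) { }")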
In some embodiments, extracting the external information of the sample code fragment that throws the exception comprises: parsing the sample code fragment into an abstract syntax tree, and extracting the features of the method and class of the sample code fragment from the abstract syntax tree to obtain the external information of the sample code fragment.
The source code file is parsed into an abstract syntax tree, for example, by using the JavaParser tool.
In some embodiments, the features of the method and class of the sample code snippet include at least one of: the name of the enclosing type (enclosingType), the name of the enclosing method (enclosingMethod), the return type of the enclosing method (returnType), the annotation type of the enclosing method (annotation), and the exception type thrown by the enclosing method (throwsException), where only the first item of the annotation type and of the exception type is retained.
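As an illustrative sketch (the disclosure itself names the JavaParser tool; the Python library javalang is used here merely as a stand-in), the five features above could be collected from the abstract syntax tree like this:

    import javalang

    def extract_external_info(java_source, method_name):
        """Collect the five method/class features for the named method."""
        tree = javalang.parse.parse(java_source)
        for path, node in tree.filter(javalang.tree.MethodDeclaration):
            if node.name != method_name:
                continue
            # enclosingType: nearest enclosing class declaration on the AST path.
            enclosing_type = next(
                (p.name for p in reversed(path)
                 if isinstance(p, javalang.tree.ClassDeclaration)), "<UNK>")
            return [
                enclosing_type,                                          # enclosingType
                node.name,                                               # enclosingMethod
                node.return_type.name if node.return_type else "void",   # returnType
                node.annotations[0].name if node.annotations else "<UNK>",  # first annotation only
                node.throws[0] if node.throws else "<UNK>",              # first thrown exception only
            ]
        return ["<UNK>"] * 5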
At step 102, the internal information and the external information of the sample code segment are converted into an initial vector of the sample code segment.
In some embodiments, converting the internal information and the external information of the sample code segment into an initial vector of the sample code segment comprises:
1) Using a word vector model, construct a corresponding internal information dictionary and an external information dictionary for the internal information and the external information of the sample code segment, respectively.
For example, the word vector model is a word2vec model. The internal information dictionary is an embedding matrix of size D × V_i, and the external information dictionary is an embedding matrix of size D × V_e, where D is the dimension of the vector representation and V_i and V_e are the (limited) dictionary sizes; words not in a dictionary are replaced with the special symbol <UNK>. The internal information dictionary and the external information dictionary are independent of each other.
2) Convert each token of the internal information of the sample code segment into a corresponding first vector according to the internal information dictionary, and convert each token of the external information of the sample code segment into a corresponding second vector according to the external information dictionary. The first vectors and the second vectors are all D-dimensional.
3) Concatenate all first vectors and all second vectors to generate the initial vector of the sample code segment.
For example, the initial vector input consists of Nmax + 5 concatenated token vectors: Nmax for the internal tokens plus 5 for the external features.
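Continuing the sketch above (gensim's word2vec implementation and all sizes are assumptions, not part of the disclosure), steps 1) to 3) might look like this:

    import numpy as np
    from gensim.models import Word2Vec

    N_MAX, D = 50, 128  # assumed list limit Nmax and vector dimension D

    def build_dictionary(token_lists):
        # One independent word2vec dictionary per information type.
        return Word2Vec(sentences=token_lists, vector_size=D, min_count=1)

    def lookup(model, token):
        # Words not in the dictionary are treated as the special symbol <UNK>.
        return model.wv[token] if token in model.wv else np.zeros(D)

    def initial_vector(int_tokens, ext_tokens, int_dict, ext_dict):
        # Pad/trim the internal tokens to N_MAX so the result always holds
        # N_MAX + 5 token vectors (5 external features), each D-dimensional.
        int_tokens = (int_tokens + ["<UNK>"] * N_MAX)[:N_MAX]
        first = [lookup(int_dict, t) for t in int_tokens]    # first vectors
        second = [lookup(ext_dict, t) for t in ext_tokens]   # second vectors
        return np.concatenate(first + second)                # shape: ((N_MAX + 5) * D,)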
In step 103, training samples are constructed using the initial vectors of the sample code fragments.
In some embodiments, the training sample is item = &lt;input, input&gt;, where input is the initial vector of the sample code fragment; that is, the model is trained to reconstruct its own input. Training of the machine learning model can therefore be completed without manual labeling.
In step 104, the machine learning model is trained using the training samples, and the embedded vectors corresponding to the training samples output by the machine learning model are written into a vector database.
In some embodiments, as shown in fig. 2, the machine learning model comprises an input layer, a plurality of hidden layers, a fully-connected layer, and an output layer in sequence, each hidden layer comprising a convolutional layer and a pooling layer, wherein the input layer and the output layer have the same dimensions.
In some embodiments, the embedded vectors corresponding to the training samples, output by the fully-connected layer of the machine learning model, are written into a vector database.
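A hedged PyTorch sketch of the model of fig. 2 follows. The layer sizes and the embedding dimension are illustrative assumptions; what matters is that the input and output layers share the same dimension, that each hidden layer combines convolution and pooling, and that the output of the fully-connected layer is the embedded vector written to the vector database. Because the training sample is &lt;input, input&gt;, the model learns to reconstruct its own input:

    import torch
    import torch.nn as nn

    N_MAX, D = 50, 128
    IN_DIM = (N_MAX + 5) * D  # length of the initial vector input

    class CodeEmbedder(nn.Module):
        def __init__(self, embed_dim=256):
            super().__init__()
            self.hidden = nn.Sequential(  # hidden layers: convolution + pooling
                nn.Conv1d(1, 8, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
                nn.Conv1d(8, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            )
            self.fc = nn.Linear(16 * (IN_DIM // 4), embed_dim)  # embedding layer
            self.out = nn.Linear(embed_dim, IN_DIM)             # output dim = input dim

        def forward(self, x):                     # x: (batch, IN_DIM)
            h = self.hidden(x.unsqueeze(1)).flatten(1)
            embedding = self.fc(h)                # the vector written to the database
            return self.out(embedding), embedding

    model = CodeEmbedder()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(32, IN_DIM)                   # a batch of initial vectors
    optimizer.zero_grad()
    reconstruction, embedding = model(x)
    loss = nn.functional.mse_loss(reconstruction, x)  # target equals the input
    loss.backward()
    optimizer.step()
    # After training, the rows of embedding.detach() are inserted into the vector database.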
In the model training method provided by the above embodiments of the present disclosure, code segments that throw exceptions are embedded into a low-dimensional continuous vector space using a deep learning technique, and a vector database is formed for retrieving exception handling code; this exploits the fitting capability of the deep learning model while avoiding the limitations of directly generating exception handling code.
Fig. 3 is a schematic structural diagram of a model training apparatus according to an embodiment of the present disclosure. As shown in fig. 3, the model training apparatus includes a first training module 31, a second training module 32, and a third training module 33.
The first training module 31 is configured to extract internal information and external information of the sample code fragment throwing the exception.
In some embodiments, the first training module 31 stores the tokens of the sample code segment, traversed in sequence, into a list and uses that list as the internal information of the sample code segment, where the length of the list is not greater than a preset length threshold.
In some embodiments, the first training module 31 parses the sample code segment into an abstract syntax tree and extracts the features of the method and class of the sample code segment from the abstract syntax tree to obtain the external information of the sample code segment.
The source code file is parsed into an abstract syntax tree, for example, by using the JavaParser tool.
In some embodiments, the features of the method and class of the sample code snippet include at least one of: the name of the enclosing type (enclosingType), the name of the enclosing method (enclosingMethod), the return type of the enclosing method (returnType), the annotation type of the enclosing method (annotation), and the exception type thrown by the enclosing method (throwsException), where only the first item of the annotation type and of the exception type is retained.
The second training module 32 is configured to convert the internal information and the external information of the sample code segment into an initial vector of the sample code segment.
In some embodiments, second training module 32 uses the word vector model to construct corresponding internal and external information dictionaries for the internal and external information of the sample code segment, respectively.
For example, the word vector model is a word2vec model. The internal information dictionary is an embedding matrix of size D × V_i, and the external information dictionary is an embedding matrix of size D × V_e, where D is the dimension of the vector representation and V_i and V_e are the (limited) dictionary sizes; words not in a dictionary are replaced with the special symbol <UNK>. The internal information dictionary and the external information dictionary are independent of each other.
Next, the second training module 32 converts each token of the internal information of the sample code segment into a corresponding first vector according to the internal information dictionary, and converts each token of the external information of the sample code segment into a corresponding second vector according to the external information dictionary. The first vectors and the second vectors are all D-dimensional.
The second training module 32 then concatenates all first vectors and all second vectors to generate the initial vector of the sample code segment.
For example, the initial vector input consists of Nmax + 5 concatenated token vectors.
The third training module 33 is configured to construct training samples using the initial vectors of the sample code segments, train the machine learning model using the training samples, and write the embedded vectors corresponding to the training samples output by the machine learning model into the vector database.
In some embodiments, the training sample is item = &lt;input, input&gt;, where input is the initial vector of the sample code fragment; training of the machine learning model can therefore be completed without manual labeling.
For example, the structure of the machine learning model is shown in FIG. 2.
In some embodiments, the third training module 33 writes the embedded vectors corresponding to the training samples, output by the fully-connected layer of the machine learning model, into the vector database.
Fig. 4 is a schematic structural diagram of a model training apparatus according to another embodiment of the present disclosure. As shown in fig. 4, the model training apparatus includes a memory 41 and a processor 42.
The memory 41 is used for storing instructions, the processor 42 is coupled to the memory 41, and the processor 42 is configured to execute the method according to any embodiment in fig. 1 based on the instructions stored in the memory.
As shown in fig. 4, the model training apparatus further includes a communication interface 43 for information interaction with other devices. The model training apparatus also includes a bus 44, through which the processor 42, the communication interface 43, and the memory 41 communicate with each other.
The memory 41 may comprise a high-speed RAM and may also include a non-volatile memory, such as at least one disk memory. The memory 41 may also be a memory array, and may be partitioned into blocks that can be combined into virtual volumes according to certain rules.
Further, the processor 42 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present disclosure.
The present disclosure also relates to a computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions, and the instructions, when executed by a processor, implement the method according to any one of the embodiments in fig. 1.
Fig. 5 is a flowchart illustrating a code retrieval method according to an embodiment of the disclosure. In some embodiments, the following code retrieval method is performed by a code retrieval apparatus.
In step 501, the internal information and the external information of the code segment to be queried, which throws the exception currently, are extracted.
In some embodiments, extracting the internal information of the code snippet to be queried that currently throws the exception comprises: storing the tokens of the code segment to be queried, traversed in sequence, into a list, and using the list as the internal information of the code segment to be queried, wherein the length of the list is not greater than a preset length threshold.
In some embodiments, extracting the external information of the code fragment to be queried that currently throws the exception comprises: parsing the code segment to be queried into an abstract syntax tree, and extracting the features of the method and class of the code segment to be queried from the abstract syntax tree to obtain the external information of the code segment to be queried.
For example, the features of the method and class of the code segment to be queried include at least one of: the name of the enclosing type (enclosingType), the name of the enclosing method (enclosingMethod), the return type of the enclosing method (returnType), the annotation type of the enclosing method (annotation), and the exception type thrown by the enclosing method (throwsException), where only the first item of the annotation type and of the exception type is retained.
In step 502, the internal information and the external information of the code segment to be queried are converted into an initial vector of the code segment to be queried.
In some embodiments, converting the internal information and the external information of the code segment to be queried into an initial vector of the code segment to be queried comprises:
1) Using a word vector model, construct a corresponding internal information dictionary and an external information dictionary for the internal information and the external information of the code segment to be queried, respectively.
For example, the word vector model is a word2vec model. The internal information dictionary is an embedding matrix of size D × V_i, and the external information dictionary is an embedding matrix of size D × V_e, where D is the dimension of the vector representation and V_i and V_e are the (limited) dictionary sizes; words not in a dictionary are replaced with the special symbol <UNK>. The internal information dictionary and the external information dictionary are independent of each other.
2) Convert each token of the internal information of the code segment to be queried into a corresponding third vector according to the internal information dictionary, and convert each token of the external information of the code segment to be queried into a corresponding fourth vector according to the external information dictionary. The third vectors and the fourth vectors are all D-dimensional.
3) Concatenate all third vectors and all fourth vectors to generate the initial vector of the code segment to be queried.
For example, the initial vector input consists of Nmax + 5 concatenated token vectors.
In step 503, the initial vector of the code segment to be queried is processed by using a machine learning model to obtain a query vector (query).
It should be noted that the machine learning model is obtained by training using the training method according to any embodiment of fig. 1.
At step 504, a similarity is calculated between the query vector and each embedded vector in the vector database.
It should be noted that the vector database is obtained by using the training method according to any embodiment in fig. 1.
For example, the similarity may be computed as the cosine similarity shown in formula (1):

sim(query, embedding) = (query · embedding) / (||query|| · ||embedding||)    (1)

where query is the query vector and embedding is an embedded vector in the vector database.
In step 505, the embedding vector corresponding to the maximum similarity is taken as the target embedding vector.
In step 506, the exception handling code corresponding to the target embedded vector is used as the retrieval result.
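For illustration (a numpy sketch under the cosine-similarity assumption above; the names vector_db and handler_codes are hypothetical), steps 504 to 506 amount to a nearest-neighbor lookup:

    import numpy as np

    def retrieve(query, vector_db, handler_codes):
        """Return the exception handling code of the most similar embedded vector."""
        sims = vector_db @ query / (
            np.linalg.norm(vector_db, axis=1) * np.linalg.norm(query) + 1e-12)
        target = int(np.argmax(sims))   # step 505: maximum-similarity embedding
        return handler_codes[target]    # step 506: its exception handling code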
Fig. 6 is a schematic structural diagram of a code retrieval apparatus according to an embodiment of the present disclosure. As shown in fig. 6, the code retrieval apparatus includes a first processing module 61, a second processing module 62, a third processing module 63, a fourth processing module 64, and a fifth processing module 65.
The first processing module 61 is configured to extract internal information and external information of a code segment to be queried for which an exception is currently thrown.
In some embodiments, the first processing module 61 extracting the internal information of the code segment to be queried that currently throws the exception comprises: storing the tokens of the code segment to be queried, traversed in sequence, into a list, and using the list as the internal information of the code segment to be queried, wherein the length of the list is not greater than a preset length threshold.
In some embodiments, the first processing module 61 extracting the external information of the code segment to be queried that currently throws the exception comprises: parsing the code segment to be queried into an abstract syntax tree, and extracting the features of the method and class of the code segment to be queried from the abstract syntax tree to obtain the external information of the code segment to be queried.
For example, the features of the method and class of the code segment to be queried include at least one of: the name of the enclosing type (enclosingType), the name of the enclosing method (enclosingMethod), the return type of the enclosing method (returnType), the annotation type of the enclosing method (annotation), and the exception type thrown by the enclosing method (throwsException), where only the first item of the annotation type and of the exception type is retained.
The second processing module 62 is configured to convert the internal information and the external information of the code segment to be queried into an initial vector of the code segment to be queried.
In some embodiments, the second processing module 62 uses the word vector model to construct a corresponding internal information dictionary and external information dictionary for the internal information and external information of the code segment to be queried, respectively.
For example, the word vector model is a word2vec model. The internal information dictionary is an embedding matrix of size D × V_i, and the external information dictionary is an embedding matrix of size D × V_e, where D is the dimension of the vector representation and V_i and V_e are the (limited) dictionary sizes; words not in a dictionary are replaced with the special symbol <UNK>. The internal information dictionary and the external information dictionary are independent of each other.
Next, the second processing module 62 converts each token of the internal information of the code segment to be queried into a corresponding third vector according to the internal information dictionary, and converts each token of the external information of the code segment to be queried into a corresponding fourth vector according to the external information dictionary. The third vectors and the fourth vectors are all D-dimensional.
Then, the second processing module 62 concatenates all third vectors and all fourth vectors to generate the initial vector of the code segment to be queried.
For example, the initial vector input consists of Nmax + 5 concatenated token vectors.
The third processing module 63 is configured to process the initial vector of the code segment to be queried using a machine learning model to obtain a query vector.
It should be noted that the machine learning model is obtained by training using the training method according to any embodiment of fig. 1.
The fourth processing module 64 is configured to calculate a similarity between the query vector and each embedded vector in the vector database.
It should be noted that the vector database is obtained by using the training method according to any embodiment in fig. 1.
For example, the calculation formula of the similarity is as shown in the above formula (1).
The fifth processing module 65 is configured to take the embedded vector corresponding to the maximum similarity as the target embedded vector and take the exception handling code corresponding to the target embedded vector as the retrieval result.
Fig. 7 is a schematic structural diagram of a code retrieval apparatus according to another embodiment of the present disclosure. As shown in fig. 7, the code retrieval apparatus includes a memory 71, a processor 72, a communication interface 73, and a bus 74, analogous to those of fig. 4. Fig. 7 differs from fig. 4 in that the processor 72 is configured to perform the method of any of the embodiments in fig. 5 based on instructions stored in the memory 71.
The present disclosure also relates to a computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions, and the instructions, when executed by a processor, implement a method according to any one of the embodiments in fig. 5.
By implementing the above embodiments of the present disclosure, the following beneficial effects can be obtained:
1. Compared with traditional keyword-based retrieval, the scheme adopts vector database technology, embedding exception code fragments into a vector space and performing code retrieval in that space, which greatly improves retrieval efficiency.
2. The deep learning model does not need to be trained on manually labeled data, which reduces the labor cost of model training and makes the scheme better suited to retrieving exception handling code for large-scale open source software.
In some embodiments, the functional units described above can be implemented as general-purpose processors, programmable logic controllers (PLCs), digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or any suitable combination thereof for performing the functions described in this disclosure.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The description of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or to limit the disclosure to the form disclosed. Many modifications and variations will be apparent to practitioners skilled in this art. The embodiments were chosen and described in order to best explain the principles of the disclosure and its practical application, and to enable others of ordinary skill in the art to understand the disclosure in its various embodiments, with various modifications as are suited to the particular use contemplated.

Claims (17)

1. A model training method, performed by a model training apparatus, comprising:
extracting internal information and external information of a sample code segment that throws an exception;
converting internal information and external information of the sample code segment into an initial vector of the sample code segment;
constructing a training sample by using the initial vector of the sample code segment;
and training a machine learning model by using the training samples, and writing the embedded vector which is output by the machine learning model and corresponds to the training samples into a vector database.
2. The method of claim 1, wherein extracting internal information of a sample code fragment throwing an exception comprises:
and storing the tokens in the sample code segment traversed in sequence into a first list so as to use the first list as the internal information of the sample code segment, wherein the length of the first list is not greater than a preset length threshold.
3. The method of claim 1, wherein extracting extrinsic information of a sample code snippet throwing an exception comprises:
parsing the sample code snippet into a first abstract syntax tree,
and extracting the features of the method and class of the sample code segment from the first abstract syntax tree to obtain the external information of the sample code segment.
4. The method of claim 3, wherein,
the features of the method and class of the sample code segment comprise at least one of: the name of the enclosing type (enclosingType), the name of the enclosing method (enclosingMethod), the return type of the enclosing method (returnType), the annotation type of the enclosing method (annotation), and the exception type thrown by the enclosing method (throwsException), wherein only the first item of the annotation type and of the exception type is retained.
5. The method of claim 1, wherein converting the internal information and the external information of the sample code segment into an initial vector of the sample code segment comprises:
constructing, using a word vector model, a corresponding internal information dictionary and an external information dictionary for the internal information and the external information of the sample code segment, respectively;
converting each token of the internal information of the sample code segment into a corresponding first vector according to an internal information dictionary;
converting each token of the external information of the sample code segment into a corresponding second vector according to an external information dictionary;
all first vectors and all second vectors are concatenated to generate an initial vector for the sample code segment.
6. The method of claim 1, wherein,
the training sample is item = &lt;input, input&gt;, where input is the initial vector of the sample code fragment.
7. The method of any one of claims 1-6, wherein
the machine learning model comprises, in sequence, an input layer, a plurality of hidden layers, a fully-connected layer, and an output layer, each hidden layer comprising a convolutional layer and a pooling layer, wherein the input layer and the output layer have the same dimension;
writing the embedded vector output by the machine learning model corresponding to the training sample into a vector database comprises:
writing the embedded vectors corresponding to the training samples, output by the fully-connected layer of the machine learning model, into a vector database.
8. A model training apparatus comprising:
a first training module configured to extract internal information and external information of a sample code fragment throwing an exception;
a second training module configured to convert internal information and external information of the sample code segment into an initial vector of the sample code segment;
a third training module configured to construct a training sample by using the initial vector of the sample code segment, train a machine learning model by using the training sample, and write an embedded vector corresponding to the training sample output by the machine learning model into a vector database.
9. A model training apparatus comprising:
a memory configured to store instructions;
a processor coupled to the memory, the processor configured to perform implementing the method of any of claims 1-7 based on instructions stored by the memory.
10. A code retrieval method, comprising:
extracting internal information and external information of a code segment to be queried that currently throws an exception;
converting the internal information and the external information of the code segment to be queried into an initial vector of the code segment to be queried;
processing the initial vector of the code segment to be queried with a machine learning model to obtain a query vector, wherein the machine learning model is trained using the training method of any one of claims 1 to 7;
calculating a similarity between the query vector and each embedded vector in a vector database, wherein the vector database is obtained by using the training method of any one of claims 1-7;
taking the embedded vector corresponding to the maximum similarity as a target embedded vector;
and taking the exception handling code corresponding to the target embedded vector as a retrieval result.
11. The method of claim 10, wherein extracting internal information of a code snippet to be queried that currently throws an exception comprises:
and storing the tokens in the code segment to be queried, which are traversed in sequence, into a second list so as to take the second list as the internal information of the code segment to be queried, wherein the length of the second list is not greater than a preset length threshold.
12. The method of claim 10, wherein extracting extrinsic information of a code snippet to be queried that currently throws an exception comprises:
parsing the code segment to be queried into a second abstract syntax tree;
and extracting the features of the method and class of the code segment to be queried from the second abstract syntax tree to obtain the external information of the code segment to be queried.
13. The method of claim 12, wherein,
the features of the method and class of the code segment to be queried comprise at least one of: the name of the enclosing type (enclosingType), the name of the enclosing method (enclosingMethod), the return type of the enclosing method (returnType), the annotation type of the enclosing method (annotation), and the exception type thrown by the enclosing method (throwsException), wherein only the first item of the annotation type and of the exception type is retained.
14. The method of claim 10, wherein converting the internal information and the external information of the code snippet to be queried into an initial vector of the code snippet to be queried comprises:
constructing, using a word vector model, a corresponding internal information dictionary and an external information dictionary for the internal information and the external information of the code segment to be queried, respectively;
converting each token of the internal information of the code segment to be queried into a corresponding third vector according to the internal information dictionary;
converting each token of the external information of the code segment to be queried into a corresponding fourth vector according to an external information dictionary;
and splicing all the third vectors and all the fourth vectors to generate the initial vector of the code segment to be queried.
15. A code retrieval apparatus comprising:
the first processing module is configured to extract internal information and external information of a code segment to be queried, of which an exception is thrown currently;
a second processing module configured to convert internal information and external information of the code segment to be queried into an initial vector of the code segment to be queried;
a third processing module, configured to process the initial vector of the code segment to be queried by using a machine learning model to obtain a query vector, wherein the machine learning model is obtained by training according to the training method of any one of claims 1 to 7;
a fourth processing module configured to calculate a similarity between the query vector and each embedded vector in a vector database, wherein the vector database is obtained by using the training method of any one of claims 1 to 7;
and a fifth processing module configured to take the embedded vector corresponding to the maximum similarity as a target embedded vector and take the exception handling code corresponding to the target embedded vector as a retrieval result.
16. A code retrieval apparatus comprising:
a memory configured to store instructions;
a processor coupled to the memory, the processor configured to perform implementing the method of any of claims 10-14 based on instructions stored by the memory.
17. A non-transitory computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions which, when executed by a processor, implement the method of any of claims 1-7, 10-14.
CN202211163783.6A 2022-09-23 Model training method and device, code retrieval method and device and storage medium Pending CN115481035A (en)

Priority Application (1)

Application Number: CN202211163783.6A; Priority Date / Filing Date: 2022-09-23; Title: Model training method and device, code retrieval method and device and storage medium

Publication (1)

Publication Number: CN115481035A; Publication Date: 2022-12-16

Family

ID: 84393845


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination