CN111723192B - Code recommendation method and device - Google Patents

Code recommendation method and device

Info

Publication number
CN111723192B
CN111723192B (application CN202010562667.6A)
Authority
CN
China
Prior art keywords
codes
code
candidate
similarity
candidate codes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010562667.6A
Other languages
Chinese (zh)
Other versions
CN111723192A (en)
Inventor
许静
过辰楷
杨卉
张青峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Geizhi Information Technology Co ltd
Nankai University
Original Assignee
Tianjin Geizhi Information Technology Co ltd
Nankai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Geizhi Information Technology Co ltd and Nankai University
Priority to CN202010562667.6A
Publication of CN111723192A
Application granted granted Critical
Publication of CN111723192B
Legal status: Active


Classifications

    • G06F16/335 Information retrieval of unstructured textual data; querying; filtering based on additional data, e.g. user or group profiles
    • G06F16/35 Information retrieval of unstructured textual data; clustering; classification
    • G06F18/214 Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/22 Pattern recognition; matching criteria, e.g. proximity measures
    • G06F18/24323 Pattern recognition; classification techniques; tree-organised classifiers
    • G06F8/42 Software engineering; transformation of program code; compilation; syntactic analysis

Abstract

The embodiments of the present application provide a code recommendation method, an apparatus, an electronic device, and a computer-readable storage medium, which solve the problems that the existing automatic code recommendation function cannot provide class-level references for users and that code recommendation efficiency is low. The code recommendation method comprises the following steps: inputting a plurality of pieces of candidate data into a trained classification model to obtain a plurality of classifications in one-to-one correspondence with the candidate data, wherein the candidate data characterize the contents of a plurality of candidate codes; obtaining a plurality of classification similarity values in one-to-one correspondence with the plurality of classifications; and obtaining a recommended code from the plurality of candidate codes according to the plurality of classification similarity values.

Description

Code recommendation method and device
Technical Field
The present invention relates to the field of code recommendation technologies, and in particular, to a code recommendation method, a code recommendation device, an electronic device, and a computer readable storage medium.
Background
Modern software development workflows center on the Integrated Development Environment (IDE), and automatic code recommendation is among the core functions of many advanced development tools. However, the code recommendation systems built into IDEs can only achieve Application Programming Interface (API)-level recommendation, that is, code completion: they merely save the user from typing lengthy words, cannot provide class-level references, and therefore offer low code recommendation efficiency.
Disclosure of Invention
In view of the above, the embodiments of the present invention provide a code recommendation method, apparatus, electronic device, and computer-readable storage medium that solve the problems that the existing automatic code recommendation function cannot provide class-level references for users and that its recommendation efficiency is low.
According to one aspect of the present application, an embodiment of the present application provides a code recommendation method, including: inputting a plurality of pieces of candidate data into a trained classification model to obtain a plurality of classifications in one-to-one correspondence with the candidate data, wherein the candidate data characterize the contents of a plurality of candidate codes and of a user code; obtaining a plurality of classification similarity values in one-to-one correspondence with the plurality of classifications; and obtaining a recommended code from the plurality of candidate codes according to the plurality of classification similarity values.
In an embodiment of the present application, obtaining the recommended code from the plurality of candidate codes according to the plurality of classification similarity values includes: obtaining the recommended code according to the sums of the classification similarity values and the preliminary similarity values. Before doing so, the method further includes: obtaining a plurality of preliminary similarity values in one-to-one correspondence with the candidate codes according to the candidate codes and the user code; and adding each candidate code's preliminary similarity value to its classification similarity value to obtain, for each candidate code, the sum of the two.
In an embodiment of the present application, before inputting the plurality of pieces of candidate data into the trained classification model, the method further includes: extracting features of a plurality of original candidate codes to obtain a feature matrix of the original candidate codes; extracting features of the user code to obtain a feature vector of the user code; multiplying the feature matrix of the original candidate codes by the feature vector of the user code to obtain a preliminary similarity vector between the original candidate codes and the user code, wherein the similarity vector comprises a plurality of similarity vector element values in one-to-one correspondence with the original candidate codes; and extracting the codes whose similarity vector element values are greater than a preset threshold to obtain the plurality of candidate codes.
In an embodiment of the present application, obtaining a plurality of preliminary similarity values in one-to-one correspondence with the candidate codes according to the plurality of candidate codes and the user code includes: extracting the plurality of similarity vector element values in one-to-one correspondence with the candidate codes; and normalizing those element values to obtain the plurality of preliminary similarity values.
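The normalization step above can be sketched as follows. The patent does not fix a particular normalization scheme, so max-scaling is used here as one plausible, hypothetical choice:

```python
def normalize_similarities(element_values):
    """Turn raw similarity vector element values into preliminary
    similarity values in [0, 1]. Max-scaling is an assumption; the
    embodiment only says the values are normalized."""
    peak = max(element_values)
    if peak == 0:
        return [0.0 for _ in element_values]
    return [value / peak for value in element_values]
```

For example, raw element values [2, 4, 8] become [0.25, 0.5, 1.0], preserving the ordering of the candidate codes.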
In an embodiment of the present application, before obtaining the plurality of preliminary similarity values in one-to-one correspondence with the candidate codes, the method further includes: collecting a plurality of projects from a project-hosting platform; and extracting a plurality of class files from each of the projects to obtain the plurality of original candidate codes.
In an embodiment of the present application, collecting the plurality of projects from the project-hosting platform includes: collecting the projects whose attention degree on the platform is greater than a preset value.
In an embodiment of the present application, before inputting the plurality of pieces of candidate data into the trained classification model, the method includes training the classification model with a training set to obtain the trained classification model. The training includes: extracting abstract syntax tree nodes from a plurality of codes in a code data set to obtain abstract syntax trees of those codes; and performing word vector conversion on each abstract syntax tree to obtain a plurality of training sets characterizing the contents of the code data set.
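The two training-set steps above, AST node extraction followed by word vector conversion, can be sketched as follows. Python's built-in ast module stands in for the patent's unspecified AST tooling, and the tiny one-hot vocabulary is an illustrative assumption:

```python
import ast

def ast_node_sequence(source):
    """Walk a code snippet's abstract syntax tree and return the node
    type names in traversal order (AST node extraction)."""
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]

def one_hot_vectors(node_names, vocabulary):
    """Convert each extracted node name into a one-hot word vector over
    a fixed vocabulary, one row of the training matrix per node."""
    index = {name: i for i, name in enumerate(vocabulary)}
    vectors = []
    for name in node_names:
        row = [0] * len(vocabulary)
        if name in index:  # out-of-vocabulary nodes map to the zero vector
            row[index[name]] = 1
        vectors.append(row)
    return vectors
```

For the snippet `x = 1`, the node sequence begins with `Module` and contains `Assign`; each name then becomes one word-vector row of the training data.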
In an embodiment of the present application, before inputting the plurality of pieces of candidate data into the trained classification model, the method further includes: extracting abstract syntax tree nodes from each of a plurality of code sets to obtain abstract syntax trees of the code sets, wherein the code sets characterize the contents of the plurality of candidate codes together with the content of the user code; and performing word vector conversion on the abstract syntax trees of the code sets to obtain the plurality of pieces of candidate data characterizing the candidate code contents.
In an embodiment of the present application, before inputting the plurality of pieces of candidate data into the trained classification model, the method further includes: splicing the user code with each of the plurality of candidate codes to form a plurality of code sets in one-to-one correspondence with the candidate codes.
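The splicing step can be as simple as concatenating the two snippets; the newline separator below is an assumption, since the embodiment does not specify how the texts are joined:

```python
def splice_code_sets(user_code, candidate_codes):
    """Splice the user code with each candidate code, producing one code
    set per candidate in one-to-one correspondence."""
    return [user_code + "\n" + candidate for candidate in candidate_codes]
```

Each resulting code set is then parsed into an abstract syntax tree and converted to word vectors to form one piece of candidate data.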
According to another aspect of the present application, an embodiment of the present application further provides a code recommendation apparatus, including: a classification module configured to input a plurality of pieces of candidate data into a trained classification model to obtain a plurality of classifications in one-to-one correspondence with the candidate data, wherein the candidate data characterize the contents of a plurality of candidate codes; a classification similarity value acquisition module configured to obtain a plurality of classification similarity values in one-to-one correspondence with the plurality of classifications; and a code recommendation module configured to obtain a recommended code from the plurality of candidate codes according to the plurality of classification similarity values.
According to another aspect of the present application, an embodiment of the present application further provides an electronic device, including: a processor; and a memory having stored therein computer program instructions that, when executed by the processor, cause the processor to perform any of the code recommendation methods described previously.
According to another aspect of the present application, an embodiment of the present application further provides a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform a code recommendation method as described in any of the preceding.
According to the code recommendation method, apparatus, electronic device, and computer-readable storage medium of the embodiments of the present application, the classification model classifies the candidate codes, classification similarity values of the candidate codes are obtained from the classification results, and the recommended code is obtained from the candidate codes according to those values. Different kinds of code can thus be recommended according to the class of each candidate code, achieving not only API-level but also class-level recommendation and providing users with more choices; at the same time, using a classification model with high classification efficiency improves the efficiency of code recommendation.
Drawings
Fig. 1 is a flow chart of a code recommendation method according to an embodiment of the present application.
Fig. 2 is a flow chart of a code recommendation method according to another embodiment of the present application.
Fig. 3 is a flow chart illustrating a code recommendation method according to another embodiment of the present application.
Fig. 4 is a flow chart of a code recommendation method according to another embodiment of the present application.
Fig. 5 is a flowchart illustrating a code recommendation method according to another embodiment of the present application.
Fig. 6 is a flowchart illustrating a code recommendation method according to another embodiment of the present application.
Fig. 7 is a flowchart illustrating a code recommendation method according to another embodiment of the present application.
Fig. 8 is a schematic structural diagram of a code recommendation device according to an embodiment of the present application.
Fig. 9 is a schematic structural diagram of a code recommendation device according to another embodiment of the present application.
Fig. 10 is a schematic structural diagram of a code recommendation device according to another embodiment of the present application.
Fig. 11 is a schematic structural diagram of a code recommendation device according to another embodiment of the present application.
Fig. 12 is a schematic structural diagram of a code recommendation device according to another embodiment of the present application.
Fig. 13 is a schematic structural diagram of a code recommendation device according to another embodiment of the present application.
Fig. 14 is a schematic structural diagram of a code recommendation device according to another embodiment of the present application.
Fig. 15 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings. Apparently, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Summary of the application
As described above, the code recommendation system contained in an existing IDE can only achieve API-level recommendation, cannot provide class-level references for users, and is inefficient at code recommendation. The reason is that such a system takes simple free text or a short code fragment entered by the user and matches it against simple usage descriptions, usage patterns, or relational call graphs of APIs in a database to retrieve the related APIs; for large-scale code, however, it is difficult to derive abstract usage descriptions, usage patterns, or call graphs from simple text, so class-level recommendation cannot be achieved. For example, when a user inputs "SUM", the existing system retrieves the APIs whose description is summation and recommends the "SUM" function, i.e., it recommends the summation function purely from the simple free-text input "SUM". For another example, given the input "AVE", the system searches for APIs containing those letters and recommends "AVERAGE", i.e., it completes the remaining letters of AVERAGE from the first letters the user typed. Only code completion is achieved and only the typing of lengthy words is saved; no more comprehensive reference is provided for the user. In addition, when the code fragment of the user's query is written in a different language, the search takes a long time or fails to produce a matching result, leading to low code recommendation efficiency.
In view of these technical problems, the basic idea of the present application is a code recommendation method in which a classification model classifies a plurality of candidate codes, a plurality of classification similarity values are obtained from the classification results, and the recommended code is obtained from the candidate codes according to those values. Candidate codes with higher similarity can thus be matched against the partial code a user has written, achieving class-level code recommendation and providing the user with a more comprehensive reference. Moreover, because the classification model matches the code content directly, problems such as language differences, long matching times, or failures to obtain a matching result are avoided, making code recommendation more efficient; a classification model with higher classification efficiency can additionally be chosen to improve recommendation efficiency further.
Having described the basic principles of the present application, various non-limiting embodiments of the present application will now be described in detail with reference to the accompanying drawings.
Exemplary code recommendation method
Fig. 1 is a flow chart of a code recommendation method according to an embodiment of the present application. As shown in fig. 1, the code recommendation method includes the steps of:
step 101: and inputting the plurality of data to be selected into the trained classification model to obtain a plurality of classifications corresponding to the plurality of data to be selected one by one, wherein the plurality of data to be selected are used for representing contents of a plurality of candidate codes and user codes.
Specifically, the candidate data are the input of the classification model. A plurality of pieces of candidate data may be input at one time, and the model then outputs in turn the classification corresponding to each piece. The candidate data characterize the contents of the candidate codes and of the user code: each user code has a plurality of corresponding candidate codes, i.e., the user supplies a piece of user code while a plurality of candidate codes wait to be recommended, and each piece of candidate data input at one time characterizes the content of one candidate code together with the same user code.
In one embodiment, the classification model may be a deep forest model, which is divided into a multi-granularity scanning module and a cascade forest module. The multi-granularity scanning module slides windows of several different sizes over the input candidate data, generating a number of sub-samples. The sub-samples are used to train a completely random forest and a random forest; each forest outputs a probability vector, and all probability vectors are concatenated into a re-representation vector corresponding to the input candidate data. The cascade forest module then computes layer by layer: the initial input is the re-representation vector produced by the multi-granularity scanning module, and the input of each subsequent layer is the class vector output by the previous layer concatenated with that re-representation vector. Layers are added until cross-validation shows that the accuracy of the class vector no longer improves over the previous layer. The class vectors output by all forests of the last layer are then averaged; each element of the average class vector represents a classification probability, and the class with the largest probability is selected as the output classification result.
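The sliding-window sampling of the multi-granularity scanning module can be sketched as follows; the window sizes are model parameters, and treating the input as a list of word-vector rows is an assumption for illustration:

```python
def multi_granularity_scan(rows, window_sizes, step=1):
    """Slide windows of several sizes over the input rows (step 1, as in
    the described embodiment) and collect the sub-samples that would be
    fed to the random and completely random forests."""
    sub_samples = []
    for size in window_sizes:
        for start in range(0, len(rows) - size + 1, step):
            sub_samples.append(rows[start:start + size])
    return sub_samples
```

Four rows scanned with windows of 2 and 3 rows yield 3 + 2 = 5 sub-samples; with the embodiment's 100-, 50-, and 25-row windows the count grows accordingly.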
In an embodiment, the parameters of the deep forest model are set as follows for training, so that the trained model outputs more accurate classification results. The three sliding windows of the multi-granularity scanning module comprise 100 rows, 50 rows, and 25 rows respectively; the scanning step is 1 row, i.e., each window slides downward by 1 row at a time; multi-granularity scanning uses 2 forests, one a random forest and the other a completely random forest; and each forest contains 500 decision trees. In this embodiment, the cascade forest module uses 4 forests with 1000 decision trees each. The rows and columns of the sliding windows are chosen on the following principle: since the contents of the candidate codes and the user code are text and the word is the smallest unit of text, the number of columns of each window must equal the dimension of the word vectors so that words are kept intact; the number of rows of each window then corresponds to the number of nodes read in one sliding sample.
Compared with other classification models, the deep forest model requires only the parameters above to be set, so fewer parameters need configuring. In addition, because it classifies through multi-granularity scanning and cascaded forests, its classification accuracy is higher, it occupies less system memory, and its classification efficiency is higher.
It should be understood that the classification model may also be a neural network classification model or other models for classification, and the specific type of the classification model may be selected according to the actual application scenario, which is not specifically limited in this application.
Step 102: and obtaining a plurality of classification similarity values corresponding to the plurality of classifications one by one according to the plurality of classifications corresponding to the plurality of data to be selected one by one.
Specifically, a classification similarity value characterizes the similarity between a piece of candidate data and the user code. Each class corresponds to a known similarity value, so the known similarity value of the class to which each piece of candidate data belongs is that piece's classification similarity value; the plurality of pieces of candidate data therefore correspond to a plurality of classification similarity values.
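Mapping classes to their known similarity values is a plain table lookup; the class labels and per-class values below are hypothetical placeholders, since the embodiment does not enumerate them:

```python
def classification_similarity_values(classifications, known_similarity):
    """Map each candidate datum's predicted class to the known
    similarity value of that class (step 102)."""
    return [known_similarity[label] for label in classifications]
```

The resulting list stays in one-to-one correspondence with the input classifications, and hence with the candidate codes.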
Step 103: and acquiring recommended codes from the candidate codes according to the classification similarity values.
Specifically, the candidate code corresponding to the largest of the plurality of classification similarity values may be selected as the recommended code.
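Step 103 then reduces to an argmax over the candidates, as in this minimal sketch:

```python
def recommend(candidate_codes, similarity_values):
    """Return the candidate code with the largest similarity value
    (ties resolve to the first maximum)."""
    best = max(range(len(similarity_values)),
               key=lambda i: similarity_values[i])
    return candidate_codes[best]
```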
Thus, in the code recommendation method provided by the embodiment of the present application, the classification model classifies the plurality of candidate codes, a plurality of classification similarity values are obtained from the classification results, and the recommended code is obtained from the candidate codes according to those values, so that candidate codes with higher similarity can be matched against the partial code a user has written, class-level code recommendation is achieved, and the user is given a more comprehensive reference. Moreover, because the classification model matches the code content directly, problems such as language differences, long matching times, or failures to obtain a matching result are avoided, making code recommendation more efficient; a classification model with higher classification efficiency can also be selected to improve recommendation efficiency further.
In one embodiment, obtaining the recommended code from the plurality of candidate codes according to the plurality of classification similarity values includes: obtaining the recommended code according to the sums of the classification similarity values and the preliminary similarity values. Using these sums adds a further reference for code recommendation and thereby improves its accuracy.
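Combining the two scores is a per-candidate element-wise sum, sketched below (equal weighting of the two values follows the embodiment's description of simply adding them; any rescaling would be an added assumption):

```python
def combined_similarity(preliminary_values, classification_values):
    """Add each candidate code's preliminary similarity value to its
    classification similarity value, preserving the one-to-one
    correspondence with the candidate codes."""
    return [p + c for p, c in zip(preliminary_values, classification_values)]
```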
Fig. 2 is a flow chart of a code recommendation method according to another embodiment of the present application. As shown in fig. 2, before acquiring the recommended code from the plurality of candidate codes according to the sum of the plurality of classification similarity values and the preliminary similarity value, the method further includes:
step 201: and obtaining a plurality of preliminary similarity values corresponding to the candidate codes one by one according to the candidate codes and the user codes.
Specifically, a preliminary similarity value characterizes the similarity between a candidate code and the user code. To further improve the accuracy of code recommendation, a plurality of preliminary similarity values in one-to-one correspondence with the candidate codes may be obtained according to the candidate codes and the user code.
It should be understood that the preliminary similarity values may be obtained through the same steps as the classification similarity values but with a different classification model, or through any other method of computing similarity values.
Step 202: and respectively adding the plurality of preliminary similarity values and the plurality of classification similarity values which are in one-to-one correspondence with the plurality of candidate codes to obtain the sum of the plurality of classification similarity values and the preliminary similarity values which are in one-to-one correspondence with the plurality of candidate codes.
Fig. 3 is a flow chart illustrating a code recommendation method according to another embodiment of the present application. As shown in fig. 3, before inputting the plurality of candidate data into the trained classification model, the method further includes:
step 301: and extracting the characteristics of the plurality of original candidate codes to obtain characteristic matrixes of the plurality of original candidate codes corresponding to the plurality of original candidate codes.
Specifically, nodes may be extracted from the plurality of original candidate codes through the abstract syntax tree (AST), and the extracted nodes converted into word vectors through one-hot encoding, thereby obtaining the feature matrices of the original candidate codes.
Step 302: and extracting the characteristics of the user codes to obtain the characteristic vectors of the user codes corresponding to the user codes.
Specifically, nodes may be extracted from the user code through the AST, and the extracted nodes converted into a word vector through one-hot encoding or through Word2Vec (word-to-vector) conversion, thereby obtaining the feature vector of the user code.
Step 303: multiplying the feature matrix of the plurality of original candidate codes by the feature vector of the user code to obtain a plurality of preliminary similarity vectors of the original candidate codes and the user code, wherein the preliminary similarity vectors comprise a plurality of similarity vector element values corresponding to the plurality of original candidate codes one by one.
Specifically, the feature matrix of the plurality of original candidate codes is multiplied by the feature vector of the user code to obtain a preliminary similarity vector representing a plurality of similarity vector element values corresponding to the plurality of original candidate codes one by one, and each similarity vector element value of the preliminary similarity vector is the similarity between the original candidate code and the user code.
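The matrix-vector product of step 303 can be sketched as follows; this is a minimal stdlib-only sketch assuming each row of the feature matrix represents one original candidate code, and is not the patent's actual implementation:

```python
def preliminary_similarity(candidate_matrix, user_vector):
    """Multiply the candidate feature matrix (one row per original
    candidate code) by the user-code feature vector, producing one
    similarity vector element value per original candidate code."""
    return [sum(a * b for a, b in zip(row, user_vector))
            for row in candidate_matrix]

# Hypothetical one-hot-style features over a 3-token vocabulary:
# the first candidate shares two vocabulary entries with the user code.
scores = preliminary_similarity([[1, 0, 1], [0, 1, 0]], [1, 0, 1])
```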
Step 304: extract the codes whose similarity vector element values are greater than a preset threshold, so as to obtain a plurality of candidate codes.
Specifically, the preset threshold may be a similarity value preset according to a requirement on similarity: an original candidate code is extracted as a candidate code only when its similarity vector element value is greater than the preset threshold. The preset threshold may also be derived from a requirement on the number of candidate codes. For example, if 1000 candidate codes are to be extracted, the similarity vector element values of all original candidate codes are sorted from largest to smallest, and the element value ranked 1001st is taken as the preset threshold.
It should be understood that the preset threshold may be set according to a specific application scenario, and the setting manner of the preset threshold is not specifically limited in this application.
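Both threshold strategies described above can be sketched together; the function name and signature are assumptions for illustration, and ties at the derived threshold may yield fewer than `top_k` results under the strict comparison described in the text:

```python
def select_candidates(codes, scores, threshold=None, top_k=None):
    """Keep codes whose similarity score exceeds a preset threshold.
    If no explicit threshold is given, derive it as the (top_k+1)-th
    largest score, as described for the 1000-candidate example."""
    if threshold is None:
        ranked = sorted(scores, reverse=True)
        # Use the score ranked just below the desired count as the cutoff;
        # if there are too few scores, keep everything.
        threshold = ranked[top_k] if top_k < len(ranked) else min(ranked) - 1
    return [c for c, s in zip(codes, scores) if s > threshold]

# Hypothetical example: keep the 2 highest-scoring original candidates.
kept = select_candidates(["a", "b", "c", "d"], [3, 1, 4, 2], top_k=2)
```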
Fig. 4 is a flow chart of a code recommendation method according to another embodiment of the present application. As shown in fig. 4, obtaining a plurality of preliminary similarity values corresponding to the plurality of candidate codes one-to-one according to the plurality of candidate codes and the user code includes:
Step 401: extract a plurality of similarity vector element values corresponding one-to-one to the plurality of candidate codes.

Specifically, step 303 obtains a plurality of similarity vector element values corresponding one-to-one to the plurality of original candidate codes, and step 304 extracts the plurality of candidate codes from the plurality of original candidate codes, so each candidate code corresponds to one similarity vector element value; that is, the plurality of similarity vector element values corresponding one-to-one to the plurality of candidate codes may be extracted from the results of steps 303 and 304.

Step 402: normalize the plurality of similarity vector element values corresponding one-to-one to the plurality of candidate codes, so as to obtain a plurality of preliminary similarity values corresponding one-to-one to the plurality of candidate codes.
Specifically, the similarity vector element values result from multiplying the feature matrices of the plurality of candidate codes by the feature vector of the user code, and those features are obtained by one-hot or word2vec (Word to Vector) conversion, so the element values do not all lie between 0 and 1. Normalization is therefore applied to the similarity vector element values to obtain a plurality of preliminary similarity values with values between 0 and 1.
For example, let the plurality of similarity vector element values be (n1, n2, n3, ..., np), with max their maximum and min their minimum. The normalization formula is n1' = (n1 - min)/(max - min), where n1' is the preliminary similarity value obtained by normalizing n1; normalizing all of the similarity vector element values yields the plurality of preliminary similarity values (n1', n2', n3', ..., np').
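The min-max normalization above can be sketched directly; the degenerate all-equal case is handled with an assumed fallback of 0.0, which the patent does not specify:

```python
def normalize(values):
    """Min-max normalization: map each similarity vector element value
    into [0, 1] via (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: if all scores are equal the formula divides by zero,
        # so fall back to 0.0 for every value.
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical example: raw dot-product scores mapped into [0, 1].
prelim = normalize([2, 4, 6])  # smallest -> 0.0, largest -> 1.0
```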
Fig. 5 is a flowchart illustrating a code recommendation method according to another embodiment of the present application. As shown in fig. 5, before inputting the plurality of candidate data into the trained classification model, the method further includes:
step 501: a plurality of items is collected at an item set platform.
Specifically, the item set platform may be GitHub, or another item set platform may be selected; the selection of the item set platform is not specifically limited in this application.
In an embodiment, collecting the plurality of items at the item set platform may be collecting a plurality of items whose attention degree is greater than a preset value. The attention degree is one evaluation criterion of item quality, so selecting items whose attention degree exceeds a preset value improves the quality of the collected items. The preset value can be set freely according to the user's requirement on item quality, and its size is not particularly limited. Besides the attention degree, the download count is also an evaluation criterion of item quality, so a plurality of items whose download count is greater than a preset value may be collected instead.
It should be understood that there may be evaluation criteria of item quality other than attention degree and download count, and a plurality of items whose value under such a criterion is greater than a preset value may likewise be collected.
Step 502: and extracting a plurality of class files of each item in the plurality of items to obtain a plurality of original candidate codes.
Specifically, a plurality of class files of each of the plurality of items are extracted, wherein the class files include class-level codes, and the plurality of class files may be a plurality of original candidate codes.
In one embodiment, before inputting the plurality of candidate data into the trained classification model, the method further includes: training the classification model by using the training set to obtain the trained classification model.
Fig. 6 is a flowchart illustrating a code recommendation method according to another embodiment of the present application. As shown in fig. 6, before training the classification model by using the training set to obtain the trained classification model, the method further includes:
Step 601: extract abstract syntax tree nodes of the plurality of codes in the code dataset to obtain abstract syntax trees of the plurality of codes in the code dataset.
In particular, the code dataset may be the BigCloneBench dataset, but other datasets may also be used. The code dataset may include 4 classifications of known similarity: type 1, type 2, type 3, and type 4. Type 1 may characterize identical code, i.e., code pairs that are identical except for whitespace and comments; type 2 may characterize renamed or parameterized code, i.e., code pairs that are identical except for variable names, type names, and function names; type 3 may characterize nearly identical code, i.e., code pairs that differ by the addition or deletion of a few statements, or by identifiers, literals, types, whitespace, layout, and comments, but are still similar; type 4 may characterize semantically similar code, i.e., heterogeneous code with the same function that is textually or syntactically dissimilar but semantically similar. The similarity values corresponding to the four classification types may be 0.875, 0.625, 0.375, and 0.125, respectively. Extracting the abstract syntax tree nodes of the plurality of codes in the code dataset may be performed through an AST, yielding an abstract syntax tree for each code, where each abstract syntax tree includes a plurality of nodes.
It should be understood that the code dataset may instead be divided into 2, 3, 5, or more classifications of known similarity, and the similarity value of each classification may be chosen freely as required; the number of known-similarity classifications and their corresponding similarity values are not particularly limited in this application.
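The mapping from a predicted clone type to a classification similarity value can be sketched as a simple lookup; the similarity values come from the four-type example above, while the dictionary form and function name are assumptions for illustration:

```python
# Example similarity values for the four BigCloneBench-style clone types
# described above (type 1 = identical ... type 4 = semantically similar).
TYPE_SIMILARITY = {1: 0.875, 2: 0.625, 3: 0.375, 4: 0.125}

def classification_similarity(clone_type):
    """Map the classification model's predicted clone type to the
    corresponding classification similarity value."""
    return TYPE_SIMILARITY[clone_type]
```

As the text notes, the number of classifications and their similarity values can be chosen freely; only the lookup structure would stay the same.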
Step 602: perform word vector conversion on the abstract syntax trees of the plurality of codes, respectively, to obtain a plurality of training sets for characterizing the content of the code dataset.
Specifically, the plurality of nodes included in the abstract syntax trees of the plurality of codes are converted into word vectors through word2vec, so as to obtain a plurality of vectors representing the codes; these vectors are the training sets.
In an embodiment, the conversion of the plurality of nodes included in the abstract syntax trees into word vectors may use word2vec based on the skip-gram model to build neural word embeddings, thereby converting the plurality of nodes included in the abstract syntax trees of the plurality of codes into word vectors.
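A core piece of skip-gram training is generating (target, context) pairs from a linearized AST node sequence; a minimal stdlib-only sketch of that step follows. The node names and the `window` default are hypothetical, and a full embedding model (e.g., gensim's word2vec) would consume such pairs rather than this function being the whole method:

```python
def skipgram_pairs(nodes, window=2):
    """Generate (target, context) training pairs from a linearized AST
    node sequence, as used when training skip-gram word embeddings."""
    pairs = []
    for i, target in enumerate(nodes):
        lo = max(0, i - window)
        hi = min(len(nodes), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target itself
                pairs.append((target, nodes[j]))
    return pairs

# Hypothetical AST node sequence for a tiny method body.
pairs = skipgram_pairs(["MethodDecl", "If", "Return"], window=1)
```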
It should be understood that the plurality of nodes included in the abstract syntax trees of the plurality of codes may be converted into word vectors through word2vec based on the skip-gram model, or through word2vec based on the continuous bag-of-words (CBOW) model; the method of word vector conversion is not specifically limited in this application.

Fig. 7 is a flowchart illustrating a code recommendation method according to another embodiment of the present application. As shown in fig. 7, before inputting the plurality of candidate data into the trained classification model, the method further includes:
Step 701: extract abstract syntax tree nodes of a plurality of code sets, respectively, to obtain abstract syntax trees of the plurality of code sets, where the plurality of code sets are used to characterize sets of the plurality of candidate code contents and the user code content.
Specifically, node extraction may be performed on the code sets through AST to obtain abstract syntax trees of multiple code sets, where the abstract syntax tree of each code set includes multiple nodes. Each code set is used to characterize a set of candidate code content and the same user code content, and multiple code sets are used to characterize multiple sets of candidate code content and user code content.
Step 702: and carrying out word vector transformation on the abstract syntax tree of the plurality of code sets to obtain a plurality of candidate data for representing the contents of the plurality of candidate codes.
Specifically, the plurality of nodes included in the abstract syntax trees of the plurality of code sets are converted into word vectors through word2vec, so as to obtain a plurality of vectors representing the plurality of candidate code contents; these vectors are the plurality of candidate data.
In an embodiment, before inputting the plurality of candidate data into the trained classification model, further comprises: and splicing the user codes with the candidate codes into a plurality of code sets, wherein the code sets are in one-to-one correspondence with the candidate codes.
Specifically, the same user code is spliced with each candidate code into a code set, i.e., each code set corresponds to one candidate code. The user code and the candidate code are spliced by placing them together in one file.
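The splicing step can be sketched as pairing the same user code with every candidate; representing each spliced "file" as a string joined by a newline is an assumption for illustration:

```python
def splice_code_sets(user_code, candidate_codes):
    """Build one code set per candidate by placing the same user code
    and that candidate code together, as if in a single file."""
    return [user_code + "\n" + candidate for candidate in candidate_codes]

# Hypothetical snippets: two code sets, each pairing the user code
# with one candidate code.
code_sets = splice_code_sets("int u(){}", ["int a(){}", "int b(){}"])
```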
Exemplary code recommendation apparatus
Fig. 8 is a schematic structural diagram of a code recommendation device according to an embodiment of the present application. As shown in fig. 8, the code recommendation device 80 includes:
the classification module 810 is configured to input a plurality of candidate data into the trained classification model to obtain a plurality of classifications corresponding to the plurality of candidate data, where the plurality of candidate data is used to characterize contents of the plurality of candidate codes.
The classification similarity value obtaining module 820 is configured to obtain a plurality of classification similarity values corresponding to the plurality of classifications one by one according to the plurality of classifications corresponding to the plurality of candidate data one by one.
The code recommendation module 830 is configured to obtain a recommended code from the plurality of candidate codes according to the plurality of classification similarity values.
Fig. 9 is a schematic structural diagram of a code recommendation device according to another embodiment of the present application. As shown in fig. 9, the code recommendation device 80 further includes:
the preliminary similarity value obtaining module 910 is configured to obtain a plurality of preliminary similarity values corresponding to the plurality of candidate codes one by one according to the plurality of candidate codes and the user code.
The summing module 920 is configured to add the plurality of preliminary similarity values and the plurality of classification similarity values that are in one-to-one correspondence with the plurality of candidate codes, respectively, to obtain a sum of the plurality of classification similarity values and the preliminary similarity values that are in one-to-one correspondence with the plurality of candidate codes.
The code recommendation module 830 is further configured to obtain a recommended code from the plurality of candidate codes based on a sum of the plurality of classification similarity values and the preliminary similarity value.
Fig. 10 is a schematic structural diagram of a code recommendation device according to another embodiment of the present application. As shown in fig. 10, the code recommendation device 80 further includes:
The original candidate code feature extraction module 1010 is configured to perform feature extraction on the plurality of original candidate codes to obtain feature matrices corresponding one-to-one to the plurality of original candidate codes.
The user code feature extraction module 1020 is configured to perform feature extraction on the user code to obtain a feature vector corresponding to the user code.
The similarity vector element value obtaining module 1030 is configured to multiply the feature matrices of the plurality of original candidate codes by the feature vector of the user code to obtain a preliminary similarity vector between the plurality of original candidate codes and the user code, where the preliminary similarity vector includes a plurality of similarity vector element values corresponding one-to-one to the plurality of original candidate codes.
The candidate code obtaining module 1040 is configured to extract the codes whose similarity vector element values are greater than a preset threshold, so as to obtain the plurality of candidate codes.
Fig. 11 is a schematic structural diagram of a code recommendation device according to another embodiment of the present application. As shown in fig. 11, the preliminary similarity value acquisition module 910 includes:
the similarity vector element value extraction unit 9101 is configured to extract a plurality of similarity vector element values corresponding to a plurality of candidate codes one by one.
The normalization processing unit 9102 is configured to perform normalization processing on a plurality of similarity vector element values corresponding to a plurality of candidate codes one by one, so as to obtain a plurality of preliminary similarity values corresponding to a plurality of candidate codes one by one.
Fig. 12 is a schematic structural diagram of a code recommendation device according to another embodiment of the present application. As shown in fig. 12, the code recommendation device 80 further includes:
the project collection module 1210 is configured to collect a plurality of projects at the project set platform.
The original candidate code obtaining module 1220 is configured to extract a plurality of class files of each of the plurality of items to obtain a plurality of original candidate codes.
In an embodiment, the item collection module 1210 is further configured to collect a plurality of items having a focus greater than a preset value at the item collection platform.
Fig. 13 is a schematic structural diagram of a code recommendation device according to another embodiment of the present application. As shown in fig. 13, the code recommendation device 80 further includes:
the code dataset node extraction module 1310 is configured to extract abstract syntax tree nodes of a plurality of codes in the code dataset, and obtain abstract syntax trees of the plurality of codes in the code dataset.
The training set obtaining module 1320 is configured to perform word vector transformation on the abstract syntax trees of the plurality of codes respectively, so as to obtain a plurality of training sets for characterizing the content of the code dataset.
In one embodiment, as shown in fig. 13, the code recommendation device 80 further includes:
the classification model training module 1330 is configured to train the classification model using the training set to obtain a trained classification model.
Fig. 14 is a schematic structural diagram of a code recommendation device according to another embodiment of the present application. As shown in fig. 14, the code recommendation device 80 further includes:
the code set node extraction module 1410 is configured to extract abstract syntax tree nodes of a plurality of code sets, respectively, to obtain abstract syntax trees of the plurality of code sets, where the plurality of code sets are used to represent a set of a plurality of candidate code contents and user code contents.
The candidate data obtaining module 1420 is configured to perform word vector transformation on the abstract syntax trees of the plurality of code sets to obtain a plurality of candidate data for characterizing a plurality of candidate code contents.
In one embodiment, the code recommendation device shown in fig. 14 further includes:
the code set obtaining module 1405 is configured to splice the user code and the plurality of candidate codes into a plurality of code sets, respectively, where the plurality of code sets are in one-to-one correspondence with the plurality of candidate codes.
Exemplary electronic device
Fig. 15 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 15, the electronic device 150 includes: one or more processors 1501 and memory 1502; and computer program instructions stored in the memory 1502, which when executed by the processor 1501, cause the processor 1501 to perform the code recommendation method according to any of the embodiments described above.
The processor 1501 may be a Central Processing Unit (CPU) or other form of processing unit having data processing and/or instruction execution capabilities, and may control other components in the electronic device to perform desired functions.
Memory 1502 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory can include, for example, random Access Memory (RAM) and/or cache memory (cache) and the like. The non-volatile memory may include, for example, read Only Memory (ROM), hard disk, flash memory, and the like. One or more computer program instructions may be stored on a computer readable storage medium and the processor 1501 may execute the program instructions to implement the steps in the code recommendation method of the various embodiments of the present application above and/or other desired functions. Information such as the original candidate code acquisition path, the acquisition method of the code dataset, the feature extraction manner, and the like may also be stored in the computer-readable storage medium.
In one example, the electronic device 150 may further include: an input device 1503 and an output device 1504, which are interconnected by a bus system and/or other forms of connection mechanisms (not shown in fig. 15).
For example, when the electronic device is a stand-alone device, the input means 1503 may be a communication network connector for receiving the acquired input signal from an external removable device. In addition, the input device 1503 may also include, for example, a keyboard, a mouse, a microphone, and the like.
The output device 1504 may output various information to the outside, and may include, for example, a display, a speaker, a printer, and a communication network and a remote output apparatus connected thereto, and the like.
Of course, only some of the components of the electronic device 150 that are relevant to the present application are shown in fig. 15 for simplicity, components such as buses, input/output interfaces, etc. are omitted. In addition, the electronic device 150 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer readable storage Medium
In addition to the methods and apparatus described above, embodiments of the present application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps of the code recommendation method of any of the embodiments described above.
The computer program product may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium, having stored thereon computer program instructions, which when executed by a processor, cause the processor to perform steps in a code recommendation method according to various embodiments of the present application described in the above section of the description of exemplary code recommendation methods.
A computer readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present application have been described above in connection with specific embodiments, however, it should be noted that the advantages, benefits, effects, etc. mentioned in the present application are merely examples and not limiting, and these advantages, benefits, effects, etc. are not to be considered as necessarily possessed by the various embodiments of the present application. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, as the application is not intended to be limited to the details disclosed herein as such.
The block diagrams of the devices, apparatuses, and systems referred to in this application are only illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, these devices, apparatuses, and systems may be connected, arranged, and configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words meaning "including but not limited to," and are used interchangeably therewith. The term "or" as used herein refers to, and is used interchangeably with, the term "and/or" unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to."
It is also noted that in the apparatus, devices and methods of the present application, the components or steps may be disassembled and/or assembled. Such decomposition and/or recombination should be considered as equivalent to the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the application to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is to be construed as including any modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (9)

1. A code recommendation method, comprising:
inputting a plurality of data to be selected into a trained classification model to obtain a plurality of classifications corresponding to the data to be selected one by one, wherein the data to be selected are used for representing contents of a plurality of candidate codes and user codes;
according to a plurality of classifications corresponding to the data to be selected one by one, a plurality of classification similarity values corresponding to the classifications are obtained, wherein the classification similarity values represent the similarity between the data to be selected and the user codes; and
acquiring recommended codes from the candidate codes according to the classification similarity values;
The obtaining recommended codes from the candidate codes according to the classification similarity values comprises: acquiring recommended codes from the candidate codes according to the sum of the classification similarity values and the preliminary similarity values;
wherein before acquiring the recommended codes from the candidate codes according to the sum of the plurality of classification similarity values and the preliminary similarity values, the method further comprises:
obtaining a plurality of preliminary similarity values corresponding to the candidate codes one by one according to the candidate codes and the user codes; and
respectively adding the plurality of preliminary similarity values and the plurality of classification similarity values which are in one-to-one correspondence with the plurality of candidate codes to obtain the sum of the plurality of classification similarity values and the preliminary similarity values which are in one-to-one correspondence with the plurality of candidate codes;
the step of inputting the plurality of candidate data into the trained classification model further comprises the following steps:
extracting features of a plurality of original candidate codes to obtain feature matrixes of the plurality of original candidate codes corresponding to the plurality of original candidate codes;
extracting the characteristics of the user codes to obtain the characteristic vectors of the user codes corresponding to the user codes;
Multiplying the feature matrix of the plurality of original candidate codes by the feature vector of the user code to obtain preliminary similarity vectors of the plurality of original candidate codes and the user code, wherein the similarity vectors comprise a plurality of similarity vector element values which are in one-to-one correspondence with the plurality of original candidate codes; and
and extracting codes with the similarity vector element values larger than a preset threshold value, which are in one-to-one correspondence with the plurality of original candidate codes, so as to obtain a plurality of candidate codes.
2. The code recommendation method of claim 1, wherein obtaining a plurality of preliminary similarity values corresponding to a plurality of the candidate codes one-to-one based on a plurality of the candidate codes and the user code comprises:
extracting a plurality of similarity vector element values corresponding to a plurality of candidate codes one by one; and normalizing the plurality of similarity vector element values corresponding to the plurality of candidate codes one by one to obtain a plurality of preliminary similarity values corresponding to the plurality of candidate codes one by one.
3. The code recommendation method of claim 1, wherein said entering the plurality of candidate data into the trained classification model is preceded by:
Collecting a plurality of items at an item set platform; and
and extracting a plurality of class files of each item in a plurality of items to obtain a plurality of original candidate codes.
4. The code recommendation method of claim 3, wherein said collecting a plurality of items at an item set platform comprises:
and collecting a plurality of items with the attention degree larger than a preset value on the item set platform.
5. The code recommendation method of claim 1, wherein said entering the plurality of candidate data into the trained classification model is preceded by:
training the classification model by using a training set to obtain a trained classification model;
the training of the classification model by using the training set comprises the following steps:
extracting a plurality of codes in a code data set by using abstract syntax tree nodes to obtain abstract syntax trees of the plurality of codes in the code data set; and
and respectively carrying out word vector conversion on the abstract syntax trees of the codes to obtain a plurality of training sets for representing the content of the code data set.
6. The code recommendation method of claim 1, wherein, before said entering the plurality of candidate data into the trained classification model, the method further comprises:
performing abstract syntax tree node extraction on a plurality of code sets respectively to obtain abstract syntax trees of the plurality of code sets, wherein the plurality of code sets represent sets of the plurality of candidate code contents and the user code content; and
performing word vector conversion on the abstract syntax trees of the plurality of code sets to obtain the plurality of candidate data for representing the candidate code contents.
7. The code recommendation method of claim 6, wherein, before said entering the plurality of candidate data into the trained classification model, the method further comprises:
splicing the user code with the plurality of candidate codes into a plurality of code sets, wherein the plurality of code sets are in one-to-one correspondence with the plurality of candidate codes.
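The splicing of claim 7 can be illustrated minimally: each code set concatenates the user code with exactly one candidate code, which yields the one-to-one correspondence the claim requires. The separator and the example inputs below are hypothetical:

```python
# Sketch only: build one spliced code set per candidate code, pairing the
# user's context with exactly one candidate (one-to-one correspondence).

def splice(user_code, candidate_codes, sep="\n"):
    """Return one code set per candidate: user code followed by the candidate."""
    return [user_code + sep + candidate for candidate in candidate_codes]

user_code = "a = read_input()"
candidates = ["print(a)", "log(a)"]
code_sets = splice(user_code, candidates)
```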
8. A code recommendation device, comprising:
a classification module configured to input a plurality of candidate data into a trained classification model to obtain a plurality of classifications in one-to-one correspondence with the plurality of candidate data, wherein the plurality of candidate data represent the contents of a plurality of candidate codes;
a classification similarity value acquisition module configured to obtain, according to the plurality of classifications, a plurality of classification similarity values in one-to-one correspondence with the plurality of classifications, wherein each classification similarity value represents the similarity between the corresponding candidate data and the user code; and
a code recommendation module configured to obtain recommended codes from the plurality of candidate codes according to the plurality of classification similarity values;
wherein the recommendation device further comprises:
a preliminary similarity value acquisition module configured to obtain, according to the plurality of candidate codes and the user code, a plurality of preliminary similarity values in one-to-one correspondence with the plurality of candidate codes;
a summing module configured to sum, for each candidate code, the corresponding preliminary similarity value and classification similarity value to obtain a plurality of sums of classification similarity values and preliminary similarity values in one-to-one correspondence with the plurality of candidate codes;
the code recommendation module being further configured to obtain recommended codes from the plurality of candidate codes according to the sums of the classification similarity values and the preliminary similarity values;
wherein the recommendation device further comprises:
an original candidate code feature extraction module configured to perform feature extraction on a plurality of original candidate codes to obtain a feature matrix of the plurality of original candidate codes;
a user code feature extraction module configured to perform feature extraction on the user code to obtain a feature vector of the user code;
a similarity vector element value acquisition module configured to multiply the feature matrix of the plurality of original candidate codes by the feature vector of the user code to obtain a preliminary similarity vector of the plurality of original candidate codes and the user code, wherein the similarity vector comprises a plurality of similarity vector element values in one-to-one correspondence with the plurality of original candidate codes; and
a candidate code extraction module configured to extract, from the plurality of original candidate codes, the codes whose similarity vector element values are greater than a preset threshold value, to obtain the plurality of candidate codes.
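The matrix-vector multiplication and threshold filtering performed by the last three modules of claim 8 can be sketched with plain lists. The patent does not specify the feature extraction, so the feature matrix, user vector, and threshold below are illustrative assumptions:

```python
# Sketch only: multiply the (candidates x features) feature matrix by the
# user-code feature vector to get one similarity element value per original
# candidate code, then keep the candidates above a preset threshold.

def matvec(matrix, vector):
    """Multiply a (candidates x features) matrix by a feature vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def filter_candidates(codes, similarities, threshold):
    """Keep only the codes whose similarity element value exceeds the threshold."""
    return [c for c, s in zip(codes, similarities) if s > threshold]

feature_matrix = [[1.0, 0.0, 2.0],   # one row per original candidate code
                  [0.0, 1.0, 0.0],
                  [1.0, 1.0, 1.0]]
user_vector = [0.5, 0.5, 0.5]        # user-code feature vector (hypothetical)
sims = matvec(feature_matrix, user_vector)
candidates = filter_candidates(["c0", "c1", "c2"], sims, threshold=1.0)
```

Each row's dot product with the user vector plays the role of one similarity vector element value; the surviving codes are the "plurality of candidate codes" passed on to the classification model.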
9. An electronic device, comprising:
a processor; and
a memory in which computer program instructions are stored which, when executed by the processor, cause the processor to perform the code recommendation method of any one of claims 1 to 7.
CN202010562667.6A 2020-06-19 2020-06-19 Code recommendation method and device Active CN111723192B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010562667.6A CN111723192B (en) 2020-06-19 2020-06-19 Code recommendation method and device

Publications (2)

Publication Number Publication Date
CN111723192A CN111723192A (en) 2020-09-29
CN111723192B (en) 2024-02-02

Family

ID=72567611

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010562667.6A Active CN111723192B (en) 2020-06-19 2020-06-19 Code recommendation method and device

Country Status (1)

Country Link
CN (1) CN111723192B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328232A (en) * 2020-11-26 2021-02-05 杭州安恒信息安全技术有限公司 Code prompting method and related device
CN113344023A (en) * 2021-03-25 2021-09-03 苏宁金融科技(南京)有限公司 Code recommendation method, device and system
CN116540978A (en) * 2022-01-21 2023-08-04 华为云计算技术有限公司 Formal specification recommendation method, device and system

Citations (3)

Publication number Priority date Publication date Assignee Title
CN108717470A (en) * 2018-06-14 2018-10-30 南京航空航天大学 A kind of code snippet recommendation method with high accuracy
CN109683946A (en) * 2018-12-13 2019-04-26 南开大学 A kind of user comment recommended method based on Code Clones technology
CN109739494A (en) * 2018-12-10 2019-05-10 复旦大学 A kind of API based on Tree-LSTM uses code building formula recommended method

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
KR102190813B1 (en) * 2014-03-25 2020-12-14 한국전자통신연구원 Code recommendation and share system and method

Non-Patent Citations (2)

Title
"Personalized Code Recommendation";Tam Nguyen 等;《2019 IEEE International Conference on Software Maintenance and Evolution (ICSME)》;313-317 *
智能代码补全研究综述;杨博;张能;李善平;夏鑫;;软件学报(第05期);199-217 *


Similar Documents

Publication Publication Date Title
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
US10706236B1 (en) Applied artificial intelligence technology for using natural language processing and concept expression templates to train a natural language generation system
US11341330B1 (en) Applied artificial intelligence technology for adaptive natural language understanding with term discovery
CN111723192B (en) Code recommendation method and device
CN107480162B (en) Search method, device and equipment based on artificial intelligence and computer readable storage medium
CN110019732B (en) Intelligent question answering method and related device
EP3819785A1 (en) Feature word determining method, apparatus, and server
CN109872162B (en) Wind control classification and identification method and system for processing user complaint information
CN107102993B (en) User appeal analysis method and device
CN108027814B (en) Stop word recognition method and device
CN110580308B (en) Information auditing method and device, electronic equipment and storage medium
CN115328756A (en) Test case generation method, device and equipment
AU2020272235A1 (en) Methods, systems and computer program products for implementing neural network based optimization of database search functionality
CN111984792A (en) Website classification method and device, computer equipment and storage medium
CN110874535B (en) Dependency relationship alignment component, dependency relationship alignment training method, device and medium
CN113986950A (en) SQL statement processing method, device, equipment and storage medium
CN110795942B (en) Keyword determination method and device based on semantic recognition and storage medium
CN112711678A (en) Data analysis method, device, equipment and storage medium
CN113569578B (en) User intention recognition method and device and computer equipment
CN115687563A (en) Interpretable intelligent judgment method and device, electronic equipment and storage medium
CN112541357B (en) Entity identification method and device and intelligent equipment
CN115062126A (en) Statement analysis method and device, electronic equipment and readable storage medium
CN112307235B (en) Naming method and device of front-end page element and electronic equipment
CN114969347A (en) Defect duplication checking implementation method and device, terminal equipment and storage medium
CN113778864A (en) Test case generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant