CN111723192A - Code recommendation method and device - Google Patents

Code recommendation method and device

Info

Publication number
CN111723192A
Authority
CN
China
Prior art keywords
code
codes
candidate
data
candidate codes
Prior art date
Legal status
Granted
Application number
CN202010562667.6A
Other languages
Chinese (zh)
Other versions
CN111723192B (en)
Inventor
许静
过辰楷
杨卉
张青峰
Current Assignee
Tianjin Geizhi Information Technology Co ltd
Nankai University
Original Assignee
Tianjin Geizhi Information Technology Co ltd
Nankai University
Priority date
Filing date
Publication date
Application filed by Tianjin Geizhi Information Technology Co ltd and Nankai University
Priority to CN202010562667.6A
Publication of CN111723192A
Application granted
Publication of CN111723192B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
      • G06 - COMPUTING; CALCULATING OR COUNTING
        • G06F - ELECTRIC DIGITAL DATA PROCESSING
          • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
            • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
              • G06F16/33 - Querying
                • G06F16/335 - Filtering based on additional data, e.g. user or group profiles
              • G06F16/35 - Clustering; Classification
          • G06F18/00 - Pattern recognition
            • G06F18/20 - Analysing
              • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
              • G06F18/22 - Matching criteria, e.g. proximity measures
              • G06F18/24 - Classification techniques
                • G06F18/243 - Classification techniques relating to the number of classes
                  • G06F18/24323 - Tree-organised classifiers
          • G06F8/00 - Arrangements for software engineering
            • G06F8/40 - Transformation of program code
              • G06F8/41 - Compilation
                • G06F8/42 - Syntactic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application provide a code recommendation method and apparatus, an electronic device, and a computer-readable storage medium, which solve the problems that existing automatic code recommendation functions cannot provide class-level references for users and that code recommendation efficiency is low. The code recommendation method comprises the following steps: inputting a plurality of pieces of candidate data into a trained classification model to obtain a plurality of classifications in one-to-one correspondence with the candidate data, wherein the candidate data characterize the contents of a plurality of candidate codes; obtaining, according to these classifications, a plurality of classification similarity values in one-to-one correspondence with them; and obtaining a recommended code from the candidate codes according to the classification similarity values.

Description

Code recommendation method and device
Technical Field
The present invention relates to the technical field of code recommendation, and in particular to a code recommendation method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Modern software development workflows center on the Integrated Development Environment (IDE), and automatic code recommendation is among the core functions of the IDEs of many advanced development tools. However, the code recommendation systems built into IDEs can only make recommendations at the Application Programming Interface (API) level, that is, they only perform code completion. They therefore merely save the user from typing lengthy words, cannot provide class-level references, and their code recommendation efficiency is low.
Disclosure of Invention
In view of this, embodiments of the present invention provide a code recommendation method and apparatus, an electronic device, and a computer-readable storage medium, which solve the problems that the existing automatic code recommendation function cannot provide class-level references for users and that code recommendation efficiency is low.
According to one aspect of the present application, an embodiment provides a code recommendation method, comprising: inputting a plurality of pieces of candidate data into a trained classification model to obtain a plurality of classifications in one-to-one correspondence with the candidate data, wherein the candidate data characterize the contents of a plurality of candidate codes and of a user code; obtaining, according to these classifications, a plurality of classification similarity values in one-to-one correspondence with them; and obtaining a recommended code from the candidate codes according to the classification similarity values.
In an embodiment of the present application, obtaining a recommended code from the candidate codes according to the classification similarity values comprises: obtaining the recommended code from the candidate codes according to the sums of the classification similarity values and the corresponding preliminary similarity values. Before doing so, the method further comprises: obtaining, according to the candidate codes and the user code, a plurality of preliminary similarity values in one-to-one correspondence with the candidate codes; and adding each preliminary similarity value to the classification similarity value of the same candidate code, thereby obtaining, for each candidate code, the sum of its classification similarity value and its preliminary similarity value.
In an embodiment of the present application, before the candidate data are input into the trained classification model, the method further comprises: performing feature extraction on a plurality of original candidate codes to obtain a feature matrix of the original candidate codes; performing feature extraction on the user code to obtain a feature vector of the user code; multiplying the feature matrix of the original candidate codes by the feature vector of the user code to obtain a preliminary similarity vector of the original candidate codes and the user code, wherein the preliminary similarity vector comprises a plurality of similarity vector element values in one-to-one correspondence with the original candidate codes; and extracting those original candidate codes whose similarity vector element values are greater than a preset threshold, thereby obtaining the plurality of candidate codes.
In an embodiment of the present application, obtaining, according to the candidate codes and the user code, a plurality of preliminary similarity values in one-to-one correspondence with the candidate codes comprises: extracting the similarity vector element values in one-to-one correspondence with the candidate codes; and normalizing these similarity vector element values to obtain the preliminary similarity values in one-to-one correspondence with the candidate codes.
In an embodiment of the present application, before obtaining the preliminary similarity values in one-to-one correspondence with the candidate codes, the method further comprises: collecting a plurality of projects from a project set platform; and extracting a plurality of class files of each project to obtain the plurality of original candidate codes.
In an embodiment of the present application, collecting a plurality of projects from the project set platform comprises: collecting, from the project set platform, a plurality of projects whose attention degree is greater than a preset value.
In an embodiment of the present application, before the candidate data are input into the trained classification model, the method comprises: training the classification model with a training set to obtain the trained classification model. Before training, the method further comprises: performing abstract syntax tree node extraction on a plurality of codes in a code data set to obtain the abstract syntax trees of those codes; and performing word vector conversion on those abstract syntax trees respectively to obtain a plurality of training sets characterizing the content of the code data set.
In an embodiment of the present application, before the candidate data are input into the trained classification model, the method further comprises: performing abstract syntax tree node extraction on a plurality of code sets to obtain the abstract syntax trees of the code sets, wherein the code sets characterize the contents of the candidate codes together with the content of the user code; and performing word vector conversion on the abstract syntax trees of the code sets to obtain the candidate data characterizing the contents of the candidate codes.
In an embodiment of the present application, before the candidate data are input into the trained classification model, the method further comprises: splicing the user code with each of the candidate codes to form a plurality of code sets in one-to-one correspondence with the candidate codes.
According to another aspect of the present application, an embodiment further provides a code recommendation apparatus, comprising: a classification module configured to input a plurality of pieces of candidate data into a trained classification model to obtain a plurality of classifications in one-to-one correspondence with the candidate data, wherein the candidate data characterize the contents of a plurality of candidate codes; a classification similarity value acquisition module configured to obtain, according to these classifications, a plurality of classification similarity values in one-to-one correspondence with them; and a code recommendation module configured to obtain a recommended code from the candidate codes according to the classification similarity values.
According to another aspect of the present application, an embodiment of the present application further provides an electronic device, including: a processor; and a memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform any of the foregoing code recommendation methods.
According to another aspect of the present application, an embodiment further provides a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to execute any of the foregoing code recommendation methods.
With the code recommendation method and apparatus, the electronic device, and the computer-readable storage medium of the embodiments of the present application, the candidate codes are classified by the classification model, classification similarity values of the candidate codes are obtained from the classification results, and the recommended code is obtained from the candidate codes according to those values. Codes of different classes can thus be recommended according to the type of candidate code, so that both API-level and class-level recommendation are achieved and the user is given more choices. At the same time, using a classification model with high classification efficiency improves the efficiency of code recommendation.
Drawings
Fig. 1 is a flowchart illustrating a code recommendation method according to an embodiment of the present application.
Fig. 2 is a flowchart illustrating a code recommendation method according to another embodiment of the present application.
Fig. 3 is a flowchart illustrating a code recommendation method according to another embodiment of the present application.
Fig. 4 is a flowchart illustrating a code recommendation method according to another embodiment of the present application.
Fig. 5 is a flowchart illustrating a code recommendation method according to another embodiment of the present application.
Fig. 6 is a flowchart illustrating a code recommendation method according to another embodiment of the present application.
Fig. 7 is a flowchart illustrating a code recommendation method according to another embodiment of the present application.
Fig. 8 is a schematic structural diagram of a code recommendation apparatus according to an embodiment of the present application.
Fig. 9 is a schematic structural diagram of a code recommendation apparatus according to another embodiment of the present application.
Fig. 10 is a schematic structural diagram of a code recommendation apparatus according to another embodiment of the present application.
Fig. 11 is a schematic structural diagram of a code recommendation apparatus according to another embodiment of the present application.
Fig. 12 is a schematic structural diagram of a code recommendation apparatus according to another embodiment of the present application.
Fig. 13 is a schematic structural diagram of a code recommendation apparatus according to another embodiment of the present application.
Fig. 14 is a schematic structural diagram of a code recommendation apparatus according to another embodiment of the present application.
Fig. 15 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Summary of the application
As described above, the code recommendation systems built into existing IDEs can only make API-level recommendations, cannot provide class-level references for users, and have low recommendation efficiency. This is because such a system matches the simple free text or very short code fragment entered by the user against simple usage descriptions, usage patterns, relational call graphs, and the like of the relevant APIs in a database. For a large segment of code, however, it is difficult to produce an abstract usage description, usage pattern, or relational call graph with a few simple characters, so class-level recommendation cannot be achieved. For example, when a user inputs "SUM", the code recommendation system of an existing IDE retrieves the API whose usage description is SUM from those letters, so the recommended code is "SUM", i.e., the function corresponding to summation is recommended from the simple free text "SUM" entered by the user. As another example, when the user inputs "AVE", the system searches for APIs containing these letters and recommends "AVERAGE"; that is, from the first few letters the user has typed, the remaining letters are completed. Only code completion is achieved, so only the typing of lengthy words is saved, and no more comprehensive reference can be provided for the user. In addition, when the code fragment of the user's query is written in a different language, the search takes a long time or no matching result can be obtained, so code recommendation efficiency is low.
In view of the above technical problems, the basic idea of the present application is to provide a code recommendation method that classifies a plurality of candidate codes with a classification model, obtains a plurality of classification similarity values of the candidate codes from the classification results, and obtains a recommended code from the candidate codes according to those values. Candidate codes with higher similarity can thus be matched from the partial code a user has written, achieving class-level code recommendation and providing the user with a more comprehensive reference. Moreover, because the method obtains the classification of the candidate codes with the classification model and then derives similarity values from the classification, the code content is matched directly by the model: problems such as language differences do not arise, matching does not take excessively long or fail to return a result, and code recommendation is more efficient. A classification model with higher classification efficiency can also be chosen to improve classification efficiency, and thereby code recommendation efficiency, further.
Having described the general principles of the present application, various non-limiting embodiments of the present application will now be described with reference to the accompanying drawings.
Exemplary code recommendation method
Fig. 1 is a flowchart illustrating a code recommendation method according to an embodiment of the present application. As shown in fig. 1, the code recommendation method includes the following steps:
Step 101: input a plurality of pieces of candidate data into the trained classification model to obtain a plurality of classifications in one-to-one correspondence with the candidate data, where the candidate data characterize the contents of a plurality of candidate codes and of the user code.
Specifically, the candidate data are the input of the classification model. A plurality of pieces of candidate data can be input into the classification model at one time, and the model then outputs, in turn, the classification corresponding to each piece. The candidate data characterize the contents of a plurality of candidate codes and of the user code: for each piece of user code there are several candidate codes waiting to be recommended, and each piece of candidate data input at one time characterizes the content of one of those candidate codes together with the same user code.
In an embodiment, the classification model may be a deep forest model, which is divided into a multi-granularity scanning module and a cascade forest module. The multi-granularity scanning module slides several windows of different sizes over the input candidate data to generate a number of sub-samples. The sub-samples are used to train a completely random forest and a random forest; each forest outputs a probability vector, and all probability vectors are concatenated into a re-representation vector, i.e., the re-representation vector corresponding to the piece of candidate data. The cascade forest module concatenates the class vector produced by the previous layer with the re-representation vector from the multi-granularity scanning module and repeats the computation until a sufficiently accurate class vector is obtained. The initial input is the re-representation vector generated by the multi-granularity scanning module; the input of each subsequent layer is the concatenation of the previous layer's class vector with that re-representation vector. Class vectors are passed down layer by layer until cross-validation shows that the accuracy of the class vector no longer improves over the previous layer, at which point deepening stops. The class vectors output by all forests of the last layer are then averaged into an average class vector, in which each element represents the probability of one classification, and the classification with the largest probability is selected as the output classification result.
In an embodiment, the parameters of the deep forest model are set as follows for training, and the trained deep forest model thus obtained outputs accurate classification results. The three sliding windows of the multi-granularity scanning module comprise 100 rows, 50 rows, and 25 rows respectively; the scanning step is 1 row, i.e., each window slides down by one row per scan; the multi-granularity scanning module has 2 forests, one a random forest and the other a completely random forest, with 500 decision trees in each forest. In this embodiment, the cascade forest module is set to 4 forests with 1000 decision trees each. The dimensions of the sliding windows are chosen as follows: since the contents of the candidate codes and the user code are text, and words are the smallest nodes of text, the number of columns of each sliding window must equal the dimension of the word vectors so that the integrity of each word is preserved; the number of rows of each sliding window then determines how many nodes are read in each sliding sample of the multi-granularity scan.
Compared with other classification models, the deep forest model requires only the parameters above to be set, so few parameters need to be configured.
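As an illustration, the deep forest described above can be sketched in Python with scikit-learn, as below. This is a training-time sketch under stated assumptions: each sample is a 2-D array of word vectors (one row per node, at least 100 rows), in-sample probabilities stand in for the k-fold re-representation of a full deep forest implementation, and cv=3, max_layers, and the alternation of forest kinds in the cascade are illustrative. Only the window sizes, the 1-row stride, the forest counts, and the tree counts come from the parameters above.

    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
    from sklearn.model_selection import cross_val_predict

    def multi_grained_scan(samples, labels, window_sizes=(100, 50, 25)):
        # Slide windows of 100/50/25 rows (stride = 1 row) over each sample,
        # train one random forest and one completely random forest (500 trees
        # each) per window size, and concatenate the averaged class-probability
        # vectors into one re-representation vector per sample.
        parts = []
        for w in window_sizes:
            subs = [np.stack([s[i:i + w].ravel() for i in range(len(s) - w + 1)])
                    for s in samples]
            X = np.vstack(subs)
            y = np.concatenate([[lab] * len(sub) for lab, sub in zip(labels, subs)])
            for Forest in (RandomForestClassifier, ExtraTreesClassifier):
                forest = Forest(n_estimators=500).fit(X, y)
                parts.append(np.stack([forest.predict_proba(sub).mean(axis=0)
                                       for sub in subs]))
        return np.hstack(parts)

    def cascade_classify(re_repr, labels, n_forests=4, n_trees=1000, max_layers=8):
        # Cascade forest module: each layer's input is the re-representation
        # vector concatenated with the previous layer's class vectors; deepening
        # stops when the cross-validated accuracy no longer improves.
        classes = np.unique(labels)
        X, best_acc, best_avg = re_repr, -1.0, None
        for _ in range(max_layers):
            probas = []
            for i in range(n_forests):
                Forest = RandomForestClassifier if i % 2 else ExtraTreesClassifier
                probas.append(cross_val_predict(Forest(n_estimators=n_trees),
                                                X, labels, cv=3,
                                                method="predict_proba"))
            avg = np.mean(probas, axis=0)
            acc = float((classes[avg.argmax(axis=1)] == np.asarray(labels)).mean())
            if acc <= best_acc:
                break
            best_acc, best_avg = acc, avg
            X = np.hstack([re_repr] + probas)
        # Average class vector of the best layer; the classification with the
        # largest probability is the output result.
        return classes[best_avg.argmax(axis=1)]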
It should be understood that the classification model may also be a neural network classification model or other models for classification, and the specific category of the classification model may be selected according to the actual application scenario, which is not specifically limited in this application.
Step 102: and acquiring a plurality of classification similarity values which are in one-to-one correspondence with the plurality of classifications according to the plurality of classifications which are in one-to-one correspondence with the plurality of data to be selected.
Specifically, a classification similarity value characterizes the similarity between a piece of candidate data and the user code. Each classification corresponds to a known similarity value, so the known similarity value corresponding to the classification of each piece of candidate data can be looked up; this is the classification similarity value of that piece of candidate data. The plurality of pieces of candidate data therefore correspond to a plurality of classification similarity values.
Step 103: and acquiring a recommended code from the candidate codes according to the classification similarity values.
Specifically, the candidate code corresponding to the largest of the plurality of classification similarity values may be selected as the recommended code.
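As a concrete illustration of steps 102 and 103, the mapping from classifications to known similarity values and the selection of the recommended code can be sketched as follows. The four types and their similarity values (0.875 down to 0.125) are those described for the training set later in this description; all other names are invented for the example.

    TYPE_SIMILARITY = {1: 0.875, 2: 0.625, 3: 0.375, 4: 0.125}

    def recommend(candidate_codes, predicted_types):
        # Map each candidate's predicted classification to its known similarity
        # value (step 102), then pick the candidate with the largest value (step 103).
        scores = [TYPE_SIMILARITY[t] for t in predicted_types]
        return candidate_codes[max(range(len(scores)), key=scores.__getitem__)]

    print(recommend(["code_a", "code_b", "code_c"], [3, 1, 2]))  # -> code_b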
Therefore, the code recommendation method provided by this embodiment of the application classifies a plurality of candidate codes with a classification model, obtains a plurality of classification similarity values of the candidate codes from the classification results, and obtains the recommended code from the candidate codes according to those values, so that candidate codes with higher similarity can be matched from the partial code written by the user. Class-level code recommendation is thus achieved, and the user is given a more comprehensive reference. In addition, because the classification of the candidate codes is obtained with the classification model and similarity values are then derived from the classification, the code content is matched directly by the model: problems such as language differences do not arise, matching does not take excessively long or fail to return a result, and code recommendation is more efficient. A classification model with higher classification efficiency can also be selected to improve classification efficiency, and thereby code recommendation efficiency, further.
In one embodiment, obtaining the recommended code from the plurality of candidate codes according to the plurality of classification similarity values comprises: obtaining the recommended code from the candidate codes according to the sums of the classification similarity values and the corresponding preliminary similarity values. Because the recommended code is selected according to these sums, the preliminary similarity values add a further reference for code recommendation, which further improves its accuracy.
Fig. 2 is a flowchart illustrating a code recommendation method according to another embodiment of the present application. As shown in fig. 2, before the recommended code is obtained from the plurality of candidate codes according to the sums of the classification similarity values and the preliminary similarity values, the method further includes:
step 201: and obtaining a plurality of preliminary similarity values corresponding to the candidate codes one by one according to the candidate codes and the user codes.
Specifically, a preliminary similarity value characterizes the similarity between a candidate code and the user code. To further improve the accuracy of code recommendation, a plurality of preliminary similarity values in one-to-one correspondence with the candidate codes may be obtained from the candidate codes and the user code.
It should be understood that the preliminary similarity values may be obtained by the same steps as the classification similarity values but with a different classification model, or by another method of computing similarity values.
Step 202: and respectively adding the plurality of preliminary similarity values and the plurality of classification similarity values which are in one-to-one correspondence with the plurality of candidate codes to obtain the sum of the plurality of classification similarity values and the preliminary similarity values which are in one-to-one correspondence with the plurality of candidate codes.
Fig. 3 is a flowchart illustrating a code recommendation method according to another embodiment of the present application. As shown in fig. 3, before inputting a plurality of candidate data into the trained classification model, the method further includes:
step 301: and performing feature extraction on the plurality of original candidate codes to obtain a feature matrix of the plurality of original candidate codes corresponding to the plurality of original candidate codes.
Specifically, node extraction may be performed on the original candidate codes through their abstract syntax trees (ASTs), and the extracted nodes are converted into word vectors through one-hot (one-bit) encoding, thereby obtaining the feature matrix of the original candidate codes.
Step 302: and performing feature extraction on the user code to obtain a feature vector of the user code corresponding to the user code.
Specifically, node extraction may be performed on the user code through its AST, and the extracted nodes are converted into word vectors through one-hot encoding, or through Word to Vector, thereby obtaining the feature vector of the user code.
Step 303: and multiplying the feature matrixes of the original candidate codes with the feature vectors of the user codes to obtain the initial similarity vectors of the original candidate codes and the user codes, wherein the initial similarity vectors comprise a plurality of similarity vector element values which are in one-to-one correspondence with the original candidate codes.
Specifically, multiplying the feature matrix of the original candidate codes by the feature vector of the user code yields the preliminary similarity vector, whose element values are in one-to-one correspondence with the original candidate codes; each element value is the similarity between one original candidate code and the user code.
Step 304: and extracting codes of which the similarity vector element values which are in one-to-one correspondence with the original candidate codes are larger than a preset threshold value to obtain a plurality of candidate codes.
Specifically, the preset threshold may be a similarity value set in advance according to the similarity required: only when a similarity vector element value is greater than the preset threshold is the corresponding original candidate code extracted as a candidate code. The preset threshold may also be set according to the required number of candidate codes. For example, if 1000 candidate codes are to be extracted, all original candidate codes are sorted by their similarity vector element values from largest to smallest, and the element value of the original candidate code ranked 1001st is taken as the preset threshold.
It should be understood that the preset threshold may be set according to a specific application scenario, and the setting manner of the preset threshold is not specifically limited in the present application.
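Steps 301 to 304 can be sketched as follows. The pooling of a code's one-hot node vectors into a single feature row, the vocabulary argument, and the threshold value are illustrative assumptions; the embodiment fixes only the encoding, the matrix-times-vector multiplication, and the threshold comparison.

    import numpy as np

    def one_hot_feature(node_names, vocab):
        # Accumulate the one-bit (one-hot) encodings of a code's AST nodes
        # into one feature vector (a simplifying assumption).
        v = np.zeros(len(vocab))
        for name in node_names:
            v[vocab[name]] += 1.0
        return v

    def preliminary_filter(original_node_lists, user_nodes, vocab, threshold=2.0):
        # Step 301: feature matrix of the original candidate codes, one row each.
        F = np.stack([one_hot_feature(nodes, vocab) for nodes in original_node_lists])
        # Step 302: feature vector of the user code.
        u = one_hot_feature(user_nodes, vocab)
        # Step 303: preliminary similarity vector, one element per original code.
        sims = F @ u
        # Step 304: keep the codes whose element value exceeds the preset threshold.
        keep = np.flatnonzero(sims > threshold)
        return keep, sims[keep]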
Fig. 4 is a flowchart illustrating a code recommendation method according to another embodiment of the present application. As shown in fig. 4, obtaining a plurality of preliminary similarity values in one-to-one correspondence with a plurality of candidate codes according to the plurality of candidate codes and the user code includes:
step 401: a plurality of similarity vector element values corresponding to the candidate codes one to one are extracted. Specifically, step 303 obtains a plurality of similarity vector element values corresponding to a plurality of original candidate codes one by one, and step 304 extracts a plurality of candidate codes from the plurality of original candidate codes, so that each candidate code in the plurality of candidate codes corresponds to one similarity vector element value, that is, a plurality of similarity vector element values corresponding to a plurality of candidate codes one by one can be extracted according to the results of step 303 and step 304. Step 402: and normalizing the multiple similarity vector element values which are in one-to-one correspondence with the multiple candidate codes to obtain multiple preliminary similarity values which are in one-to-one correspondence with the multiple candidate codes.
Specifically, the similarity vector is obtained by multiplying the feature matrix of the candidate codes by the feature vector of the user code, both of which come from one-hot or Word to Vector conversion, so its element values do not all lie between 0 and 1. The similarity vector element values are therefore normalized to obtain preliminary similarity values that lie between 0 and 1.
For example, let the plurality of similarity vector element values be (n1, n2, n3, ..., np), with maximum value max and minimum value min. The normalization formula is n1' = (n1 - min)/(max - min), where n1' is the preliminary similarity value obtained by normalizing n1; normalizing all element values yields the preliminary similarity values (n1', n2', n3', ..., np').
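In code, the normalization of step 402 is a one-function min-max scaling (sketch):

    import numpy as np

    def normalize(values):
        # Min-max normalization: n' = (n - min) / (max - min), mapping each
        # similarity vector element value into [0, 1].
        v = np.asarray(values, dtype=float)
        return (v - v.min()) / (v.max() - v.min())

    print(normalize([3.0, 7.0, 5.0]))  # -> [0.  1.  0.5]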
Fig. 5 is a flowchart illustrating a code recommendation method according to another embodiment of the present application. As shown in fig. 5, before inputting a plurality of candidate data into the trained classification model, the method further includes:
step 501: a plurality of items are collected at an item set platform.
Specifically, github may be selected as the project set platform, or another project set platform may be used; the choice of project set platform is not specifically limited in this application.
In one embodiment, collecting a plurality of projects from the project set platform may mean collecting projects whose attention degree is greater than a preset value. Attention degree is one criterion for judging project quality, so selecting projects whose attention degree exceeds a preset value improves the quality of the collected projects. The preset value may be chosen freely according to the user's quality requirements, and its magnitude is not specifically limited here. Besides attention degree, download count is another criterion of project quality, so projects whose download count exceeds a preset value may also be collected from the collected projects.
It should be understood that, besides attention degree and download count, other criteria of project quality exist, and a plurality of projects whose quality by such a criterion exceeds a preset value may likewise be collected.
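A sketch of step 501, assuming github as the project set platform and star count as the attention degree; the use of GitHub's public search API and the stars:>1000 cutoff are illustrative assumptions, since the patent does not specify a collection mechanism.

    import requests

    def collect_projects(min_stars=1000, per_page=100):
        # Query the public search API for repositories whose star count (one
        # possible "attention degree") exceeds the preset value.
        resp = requests.get(
            "https://api.github.com/search/repositories",
            params={"q": f"stars:>{min_stars}", "sort": "stars", "per_page": per_page},
        )
        resp.raise_for_status()
        return [repo["clone_url"] for repo in resp.json()["items"]]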
Step 502: a plurality of class files of each of a plurality of items are extracted to obtain a plurality of original candidate codes.
Specifically, a plurality of class files of each of the plurality of projects are extracted; the class files contain class-level code, and these class files may serve as the plurality of original candidate codes.
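A sketch of step 502, assuming Java projects so that each .java file is a class file (the patent does not fix a language):

    from pathlib import Path

    def extract_class_files(project_dir):
        # Collect the class files of one project; each .java file holds
        # class-level code usable as an original candidate code.
        return [p.read_text(encoding="utf-8", errors="ignore")
                for p in Path(project_dir).rglob("*.java")]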
In one embodiment, before inputting the plurality of candidate data into the trained classification model, the method includes: and training the classification model by using a training set to obtain the trained classification model.
Fig. 6 is a flowchart illustrating a code recommendation method according to another embodiment of the present application. As shown in fig. 6, before the trained classification model is obtained by training the classification model with the training set, the method includes:
step 601: and extracting nodes of the abstract syntax tree of the plurality of codes in the code data set to obtain the abstract syntax tree of the plurality of codes in the code data set.
Specifically, the code data set may be the BigCloneBench data set or another data set. The code data set may include 4 classification types of known similarity, namely type 1, type 2, type 3, and type 4. Type 1 characterizes completely identical code, i.e., code pairs whose two segments are identical except for whitespace and comments. Type 2 characterizes renamed or parameterized code, i.e., code pairs identical except for variable names, type names, and function names. Type 3 characterizes nearly identical code, i.e., pairs that differ by the addition or deletion of a few statements, or that use different identifiers, literals, types, whitespace, layout, and comments, but are still similar. Type 4 characterizes semantically similar code, i.e., heterogeneous code with the same function that is not similar in text or syntax but is similar in semantics. The similarity values corresponding to the four classification types may be 0.875, 0.625, 0.375, and 0.125 respectively. The abstract syntax tree node extraction may be performed on the codes in the code data set by an AST parser, yielding an abstract syntax tree for each code, where each tree comprises a plurality of nodes.
It should be understood that the code data set may instead be divided into 2, 3, 5, or more classification types of known similarity, and the similarity values of those types may likewise be chosen freely according to requirements.
Step 602: and respectively carrying out word vector conversion on the abstract syntax trees of the codes to obtain a plurality of training sets for representing the content of the code data set.
Specifically, the nodes of the abstract syntax trees of the codes are converted into word vectors through Word to Vector, yielding a plurality of vectors characterizing the codes; these vectors constitute the training sets.
In an embodiment, the Word to Vector conversion of the nodes of the abstract syntax trees may be based on the skip-gram model, which builds neural word embeddings and thereby converts the nodes into word vectors.
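A minimal sketch of this conversion, assuming the gensim library (the patent does not name an implementation; the node names and dimensions are illustrative):

    from gensim.models import Word2Vec

    # Each "sentence" is the node-name sequence of one abstract syntax tree.
    ast_node_sequences = [
        ["Module", "FunctionDef", "arguments", "Return", "BinOp"],  # illustrative
        ["Module", "ClassDef", "FunctionDef", "Assign", "Call"],
    ]
    # sg=1 trains skip-gram neural word embeddings; sg=0 would select CBOW.
    model = Word2Vec(sentences=ast_node_sequences, vector_size=128,
                     window=5, min_count=1, sg=1)
    vector = model.wv["FunctionDef"]  # 128-dimensional embedding of one node type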
It should be understood that the nodes of the abstract syntax trees may also be converted into word vectors through Word to Vector based on the continuous bag-of-words (CBOW) model; the word vector conversion method is not specifically limited in this application.
Fig. 7 is a flowchart illustrating a code recommendation method according to another embodiment of the present application. As shown in fig. 7, before the candidate data are input into the trained classification model, the method further includes:
step 701: and respectively extracting abstract syntax tree nodes of the plurality of code sets to obtain abstract syntax trees of the plurality of code sets, wherein the plurality of code sets are used for representing a set of a plurality of candidate code contents and user code contents.
Specifically, node extraction may be performed on the code sets through an AST parser to obtain the abstract syntax tree of each code set, each of which comprises a plurality of nodes. Each code set characterizes the content of one candidate code together with the same user code, so the plurality of code sets characterize the contents of the candidate codes and of the user code.
Step 702: and performing word vector conversion on the abstract syntax trees of the plurality of code sets to obtain a plurality of data to be selected for representing the content of the plurality of candidate codes.
Specifically, the nodes of the abstract syntax trees of the code sets are converted into word vectors through Word to Vector, yielding a plurality of vectors characterizing the contents of the candidate codes; these vectors are the pieces of candidate data.
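The abstract syntax tree node extraction of steps 601 and 701 can be sketched with Python's built-in ast module (an assumption, since the patent does not name a parser):

    import ast

    def ast_node_sequence(source):
        # Parse the code and record each AST node's type name in traversal
        # order; these names are the "nodes" converted into word vectors.
        return [type(node).__name__ for node in ast.walk(ast.parse(source))]

    print(ast_node_sequence("def add(a, b):\n    return a + b"))
    # e.g. ['Module', 'FunctionDef', 'arguments', 'Return', ...]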
In an embodiment, before the candidate data are input into the trained classification model, the method further includes: splicing the user code with each of the candidate codes to form a plurality of code sets in one-to-one correspondence with the candidate codes.
Specifically, the same user code is spliced with each candidate code into one code set, so that each code set corresponds to one candidate code. The splicing places the user code and the candidate code together in one file.
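In code, the splicing step might look like this (illustrative snippets):

    user_code = "def add(a, b): return a + b"            # illustrative
    candidate_codes = ["def plus(x, y): return x + y",
                       "def mean(xs): return sum(xs) / len(xs)"]

    # Each code set places the user code and one candidate code in one file
    # (here, one string), giving one code set per candidate code.
    code_sets = [user_code + "\n" + cand for cand in candidate_codes]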
Exemplary code recommendation apparatus
Fig. 8 is a schematic structural diagram of a code recommendation apparatus according to an embodiment of the present application. As shown in fig. 8, the code recommendation apparatus 80 includes:
the classification module 810 is configured to input a plurality of candidate data into the trained classification model, so as to obtain a plurality of classifications corresponding to the plurality of candidate data one to one, where the plurality of candidate data are used to characterize the content of the plurality of candidate codes.
The classification similarity value obtaining module 820 is configured to obtain a plurality of classification similarity values corresponding to a plurality of classifications according to the plurality of classifications corresponding to the plurality of data to be selected one by one.
The code recommending module 830 is configured to obtain a recommended code from the candidate codes according to the classification similarity values.
Fig. 9 is a schematic structural diagram of a code recommendation apparatus according to another embodiment of the present application. As shown in fig. 9, the code recommending apparatus 80 further includes:
the preliminary similarity value obtaining module 910 is configured to obtain a plurality of preliminary similarity values corresponding to the plurality of candidate codes one to one according to the plurality of candidate codes and the user code.
A summing module 920, configured to add the plurality of preliminary similarity values and the plurality of classification similarity values corresponding to the plurality of candidate codes one to one, respectively, to obtain a sum of the plurality of classification similarity values and the preliminary similarity values corresponding to the plurality of candidate codes one to one.
The code recommending module 830 is further configured to obtain a recommended code from the candidate codes according to a sum of the classification similarity values and the preliminary similarity value.
Fig. 10 is a schematic structural diagram of a code recommendation apparatus according to another embodiment of the present application. As shown in fig. 10, the code recommending apparatus 80 further includes:
the original candidate code feature extraction module 1010 is configured to perform feature extraction on the plurality of original candidate codes to obtain a feature matrix of the plurality of original candidate codes corresponding to the plurality of original candidate codes.
The user code feature extraction module 1020 is configured to perform feature extraction on the user code to obtain a feature vector of the user code.
The similarity vector element value obtaining module 1030 is configured to multiply the feature matrix of the original candidate codes by the feature vector of the user code to obtain a preliminary similarity vector of the original candidate codes and the user code, where the preliminary similarity vector includes similarity vector element values in one-to-one correspondence with the original candidate codes.
The candidate code extracting module 1040 is configured to extract the original candidate codes whose similarity vector element values are greater than a preset threshold, thereby obtaining a plurality of candidate codes.
Fig. 11 is a schematic structural diagram of a code recommendation apparatus according to another embodiment of the present application. As shown in fig. 11, the preliminary similarity value obtaining module 910 includes:
the similarity vector element value extraction unit 9101 is configured to extract a plurality of similarity vector element values corresponding to a plurality of candidate codes one to one.
A normalization processing unit 9102 configured to perform normalization processing on a plurality of similarity vector element values corresponding to the plurality of candidate codes one to one, to obtain a plurality of preliminary similarity values corresponding to the plurality of candidate codes one to one.
Fig. 12 is a schematic structural diagram of a code recommendation apparatus according to another embodiment of the present application. As shown in fig. 12, the code recommending apparatus 80 further includes:
an item collection module 1210 configured to collect a plurality of items at an item set platform.
The original candidate code obtaining module 1220 is configured to extract a plurality of class files of each of the plurality of projects to obtain a plurality of original candidate codes.
In an embodiment, the project collection module 1210 is further configured to collect, from the project set platform, a plurality of projects whose attention degree is greater than a preset value.
Fig. 13 is a schematic structural diagram of a code recommendation apparatus according to another embodiment of the present application. As shown in fig. 13, the code recommending apparatus 80 further includes:
the code data aggregation point extraction module 1310 is configured to perform abstract syntax tree node extraction on the plurality of codes in the code data set to obtain an abstract syntax tree of the plurality of codes in the code data set.
The training set obtaining module 1320 is configured to perform word vector conversion on the abstract syntax trees of the plurality of codes, respectively, to obtain a plurality of training sets for representing the content of the code data set.
In one embodiment, as shown in fig. 13, the code recommending apparatus 80 further includes:
the classification model training module 1330 is configured to train the classification model using the training set to obtain a trained classification model.
Fig. 14 is a schematic structural diagram of a code recommendation apparatus according to another embodiment of the present application. As shown in fig. 14, the code recommending apparatus 80 further includes:
the code aggregation point extraction module 1410 is configured to perform abstract syntax tree node extraction on the plurality of code sets respectively to obtain abstract syntax trees of the plurality of code sets, where the plurality of code sets are used for representing a set of a plurality of candidate code contents and user code contents.
The candidate data obtaining module 1420 is configured to perform word vector conversion on the abstract syntax trees of the multiple code sets to obtain multiple candidate data for representing the content of the multiple candidate codes.
In an embodiment, the code recommendation apparatus shown in fig. 14 further includes:
the code set obtaining module 1405 is configured to splice the user code with a plurality of candidate codes into a plurality of code sets, respectively, where the plurality of code sets correspond to the plurality of candidate codes one to one.
Exemplary electronic device
Fig. 15 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 15, the electronic device 150 includes: one or more processors 1501 and memory 1502; and computer program instructions stored in the memory 1502 which, when executed by the processor 1501, cause the processor 1501 to perform the code recommendation method of any of the embodiments described above.
The processor 1501 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device to perform desired functions.
The memory 1502 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and/or cache memory, or the like. Non-volatile memory may include, for example, read-only memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 1501 to implement the steps of the code recommendation methods of the various embodiments of the present application described above and/or other desired functions. Information such as the acquisition path of the original candidate codes, the acquisition method of the code data set, and the feature extraction manner may also be stored in the computer-readable storage medium.
In one example, the electronic device 150 may further include: an input device 1503 and an output device 1504, which are interconnected by a bus system and/or other form of connection mechanism (not shown in fig. 15).
For example, when the electronic device is a stand-alone device, the input device 1503 may be a communication network connector for receiving collected input signals from an external mobile device. The input device 1503 may also include, for example, a keyboard, a mouse, a microphone, and so forth.
The output device 1504 may output various information to the outside, which may include, for example, a display, speakers, a printer, and a communication network and its connected remote output devices, among others.
Of course, for simplicity, only some of the components of the electronic device 150 relevant to the present application are shown in fig. 15, and components such as a bus, an input device/output interface, and the like are omitted. In addition, the electronic device 150 may include any other suitable components, depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatuses, embodiments of the present application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps of the code recommendation method of any of the above-described embodiments.
The computer program product may include program code for carrying out operations of embodiments of the present application, written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the code recommendation method according to various embodiments of the present application described in the "exemplary code recommendation method" section above in this specification.
A computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments. However, it should be noted that the advantages, effects, and the like mentioned in the present application are merely examples, not limitations, and should not be considered essential to the various embodiments of the present application. Furthermore, the specific details disclosed above are for the purposes of illustration and description only and are not intended to be limiting, as the application is not limited to the precise details disclosed.
The block diagrams of the devices, apparatuses, and systems referred to in this application are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As will be appreciated by those skilled in the art, these devices, apparatuses, and systems may be connected, arranged, and configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The word "such as" as used herein means, and is used interchangeably with, the phrase "such as but not limited to".
It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention; any modifications, equivalent replacements, and the like made within the spirit and principles of the present invention shall fall within the scope of the present invention.

Claims (12)

1. A code recommendation method, comprising:
inputting a plurality of candidate data into a trained classification model to obtain a plurality of classifications in one-to-one correspondence with the plurality of candidate data, wherein the plurality of candidate data represent the contents of a plurality of candidate codes and of a user code;
obtaining, according to the plurality of classifications in one-to-one correspondence with the plurality of candidate data, a plurality of classification similarity values in one-to-one correspondence with the plurality of classifications; and
acquiring a recommended code from the plurality of candidate codes according to the plurality of classification similarity values.
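
As a reading aid, the following minimal Python sketch shows the three-step flow of claim 1. The classification model, the mapping from a predicted class to a similarity value, and all names here are hypothetical stand-ins, since the claim fixes neither a concrete model nor a scoring function.

from typing import Callable, List, Sequence

def recommend(candidate_data: Sequence,
              candidate_codes: List[str],
              classify: Callable,
              class_to_similarity: Callable) -> str:
    # Step 1: feed each candidate datum (candidate code content plus user
    # code content) to the trained classification model.
    classifications = [classify(d) for d in candidate_data]
    # Step 2: map each classification to a classification similarity value.
    scores = [class_to_similarity(c) for c in classifications]
    # Step 3: recommend the candidate code with the highest similarity value.
    best = max(range(len(scores)), key=scores.__getitem__)
    return candidate_codes[best]
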
2. The code recommendation method according to claim 1, wherein said acquiring a recommended code from the plurality of candidate codes according to the plurality of classification similarity values comprises: acquiring the recommended code from the plurality of candidate codes according to the sums of the classification similarity values and the preliminary similarity values;
wherein before acquiring the recommended code according to the sums of the classification similarity values and the preliminary similarity values, the method further comprises:
obtaining, according to the plurality of candidate codes and the user code, a plurality of preliminary similarity values in one-to-one correspondence with the plurality of candidate codes; and
adding the plurality of preliminary similarity values and the plurality of classification similarity values in one-to-one correspondence with the plurality of candidate codes, respectively, to obtain the sums of the classification similarity values and the preliminary similarity values in one-to-one correspondence with the plurality of candidate codes.
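
A minimal sketch of claim 2's scoring, assuming the preliminary and classification similarity values have already been computed per candidate: the per-candidate sum decides the recommendation. Names and example numbers are illustrative only.

def recommend_by_sum(candidate_codes, preliminary_values, classification_values):
    # Add each candidate's preliminary similarity value to its classification
    # similarity value, then recommend the candidate with the largest sum.
    sums = [p + c for p, c in zip(preliminary_values, classification_values)]
    best = max(range(len(sums)), key=sums.__getitem__)
    return candidate_codes[best]

# Example: recommend_by_sum(["a()", "b()"], [0.4, 0.7], [0.9, 0.3]) returns "a()",
# since 0.4 + 0.9 = 1.3 beats 0.7 + 0.3 = 1.0.
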
3. The code recommendation method according to claim 2, wherein before the inputting of the plurality of candidate data into the trained classification model, the method further comprises:
extracting features of a plurality of original candidate codes to obtain a feature matrix corresponding to the plurality of original candidate codes;
extracting features of the user code to obtain a feature vector corresponding to the user code;
multiplying the feature matrix of the original candidate codes by the feature vector of the user code to obtain a preliminary similarity vector between the original candidate codes and the user code, wherein the preliminary similarity vector comprises a plurality of similarity vector element values in one-to-one correspondence with the original candidate codes; and
extracting, from the plurality of original candidate codes, the codes whose similarity vector element values are larger than a preset threshold value, to obtain the plurality of candidate codes.
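
One way to picture claim 3's pre-filtering, sketched in NumPy under the assumption that each code has already been reduced to a fixed-length feature vector; the feature extractor and the 0.5 threshold are illustrative, not prescribed by the claim.

import numpy as np

def prefilter(candidate_features, user_feature, original_candidate_codes,
              threshold=0.5):
    # candidate_features: (n_candidates, d) feature matrix of the original
    # candidate codes; user_feature: (d,) feature vector of the user code.
    # Multiplying them yields the preliminary similarity vector, one element
    # per original candidate code.
    similarity_vector = candidate_features @ user_feature
    # Keep only candidates whose similarity element exceeds the threshold.
    keep = similarity_vector > threshold
    candidates = [c for c, k in zip(original_candidate_codes, keep) if k]
    return candidates, similarity_vector[keep]
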
4. The code recommendation method according to claim 3, wherein said obtaining, according to the plurality of candidate codes and the user code, a plurality of preliminary similarity values in one-to-one correspondence with the plurality of candidate codes comprises:
extracting the plurality of similarity vector element values in one-to-one correspondence with the plurality of candidate codes; and normalizing the similarity vector element values in one-to-one correspondence with the plurality of candidate codes to obtain the plurality of preliminary similarity values in one-to-one correspondence with the plurality of candidate codes.
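
Claim 4 does not name the normalization; min-max scaling is one plausible choice, sketched below on the retained similarity vector elements.

import numpy as np

def normalize(similarity_elements):
    # Scale the retained similarity vector elements into [0, 1] to serve as
    # the preliminary similarity values (min-max scaling is only one option).
    lo, hi = similarity_elements.min(), similarity_elements.max()
    if hi == lo:  # degenerate case: all candidates equally similar
        return np.ones_like(similarity_elements)
    return (similarity_elements - lo) / (hi - lo)
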
5. The code recommendation method according to claim 3, wherein before the inputting of the plurality of candidate data into the trained classification model, the method further comprises:
collecting a plurality of projects on a project hosting platform; and
extracting a plurality of class files of each of the plurality of projects to obtain a plurality of original candidate codes.
6. The code recommendation method according to claim 5, wherein said collecting a plurality of projects on a project hosting platform comprises:
collecting, on the project hosting platform, a plurality of projects whose attention degrees are larger than a preset value.
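
As one purely illustrative reading of claims 5 and 6, the sketch below queries GitHub's public repository-search API, with a star-count threshold standing in for the "attention degree"; the claims name neither a specific platform nor a specific attention metric.

import requests

def collect_popular_repos(min_stars=1000, per_page=10):
    # Search a project hosting platform (here: GitHub) for projects whose
    # attention degree (here: star count) exceeds a preset value.
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": f"language:java stars:>{min_stars}",
                "sort": "stars", "per_page": per_page},
        timeout=10,
    )
    resp.raise_for_status()
    return [item["full_name"] for item in resp.json()["items"]]
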
7. The code recommendation method according to claim 1, wherein before the inputting of the plurality of candidate data into the trained classification model, the method further comprises:
training a classification model by using a training set to obtain the trained classification model;
wherein before the training of the classification model by using the training set, the method further comprises:
performing abstract syntax tree node extraction on a plurality of codes in a code data set to obtain abstract syntax trees of the plurality of codes in the code data set; and
performing word vector conversion on the abstract syntax trees of the plurality of codes, respectively, to obtain a plurality of training sets representing the contents of the code data set.
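
A sketch of the training-set construction in claim 7: parse each code sample into an abstract syntax tree, serialize its node types into a token sequence, and train word vectors over those sequences. Python's ast module and gensim's Word2Vec are stand-ins for whatever parser and embedding an actual implementation would use.

import ast
from gensim.models import Word2Vec

def ast_node_sequence(source):
    # Walk the abstract syntax tree and record each node's type name as one token.
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]

# Serialize every code in the code data set, then learn word vectors over the
# resulting token sequences.
code_data_set = ["def f(x):\n    return x + 1", "def g(y):\n    return y * 2"]
corpus = [ast_node_sequence(code) for code in code_data_set]
w2v = Word2Vec(sentences=corpus, vector_size=64, window=5, min_count=1)
training_set = [[w2v.wv[token] for token in seq] for seq in corpus]
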
8. The code recommendation method according to claim 1, wherein before the inputting of the plurality of candidate data into the trained classification model, the method further comprises:
performing abstract syntax tree node extraction on a plurality of code sets, respectively, to obtain abstract syntax trees of the plurality of code sets, wherein the plurality of code sets represent a plurality of sets of candidate code content and user code content; and
performing word vector conversion on the abstract syntax trees of the plurality of code sets to obtain the plurality of candidate data representing the contents of the candidate codes.
9. The code recommendation method according to claim 8, wherein before the inputting of the plurality of candidate data into the trained classification model, the method further comprises:
splicing the user code with each of the plurality of candidate codes to form a plurality of code sets, wherein the plurality of code sets are in one-to-one correspondence with the plurality of candidate codes.
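
Claims 8 and 9 together splice the user code with each candidate code into one code set per candidate; those code sets are what get turned into ASTs and word vectors. Plain string concatenation, sketched below, is one minimal way to realize the splicing.

def splice(user_code, candidate_codes):
    # One code set per candidate, in one-to-one correspondence with the candidates.
    return [user_code + "\n" + candidate for candidate in candidate_codes]

# Example: splice("def main(): pass", ["def a(): pass", "def b(): pass"])
# yields two code sets, each pairing the user code with one candidate.
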
10. A code recommendation apparatus, comprising:
a classification module configured to input a plurality of candidate data into a trained classification model to obtain a plurality of classifications in one-to-one correspondence with the plurality of candidate data, wherein the plurality of candidate data represent the contents of a plurality of candidate codes;
a classification similarity value acquisition module configured to obtain, according to the plurality of classifications in one-to-one correspondence with the plurality of candidate data, a plurality of classification similarity values in one-to-one correspondence with the plurality of classifications; and
a code recommendation module configured to acquire a recommended code from the plurality of candidate codes according to the plurality of classification similarity values.
11. An electronic device, comprising:
a processor; and
a memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform the code recommendation method of any of claims 1-9.
12. A computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform a code recommendation method as claimed in any one of claims 1 to 9.
CN202010562667.6A 2020-06-19 2020-06-19 Code recommendation method and device Active CN111723192B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010562667.6A CN111723192B (en) 2020-06-19 2020-06-19 Code recommendation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010562667.6A CN111723192B (en) 2020-06-19 2020-06-19 Code recommendation method and device

Publications (2)

Publication Number Publication Date
CN111723192A true CN111723192A (en) 2020-09-29
CN111723192B CN111723192B (en) 2024-02-02

Family

ID=72567611

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010562667.6A Active CN111723192B (en) 2020-06-19 2020-06-19 Code recommendation method and device

Country Status (1)

Country Link
CN (1) CN111723192B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150277860A1 (en) * 2014-03-25 2015-10-01 Electronics And Telecommunications Research Institute System and method for code recommendation and share
CN108717470A (en) * 2018-06-14 2018-10-30 南京航空航天大学 A kind of code snippet recommendation method with high accuracy
CN109739494A (en) * 2018-12-10 2019-05-10 复旦大学 A kind of API based on Tree-LSTM uses code building formula recommended method
CN109683946A (en) * 2018-12-13 2019-04-26 南开大学 A kind of user comment recommended method based on Code Clones technology

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TAM NGUYEN et al.: "Personalized Code Recommendation", 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 313-317 *
YANG Bo; ZHANG Neng; LI Shanping; XIA Xin: "Survey on Intelligent Code Completion" (智能代码补全研究综述), Journal of Software (软件学报), no. 05, pages 199-217 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328232A (en) * 2020-11-26 2021-02-05 杭州安恒信息安全技术有限公司 Code prompting method and related device
CN113344023A (en) * 2021-03-25 2021-09-03 苏宁金融科技(南京)有限公司 Code recommendation method, device and system
WO2023138270A1 (en) * 2022-01-21 2023-07-27 华为云计算技术有限公司 Form specification recommendation method, apparatus and system

Also Published As

Publication number Publication date
CN111723192B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
US11544459B2 (en) Method and apparatus for determining feature words and server
CN110019732B (en) Intelligent question answering method and related device
CN107102993B (en) User appeal analysis method and device
CN110737768B (en) Text abstract automatic generation method and device based on deep learning and storage medium
CN111723192B (en) Code recommendation method and device
US9208218B2 (en) Methods and apparatuses for generating search expressions from content, for applying search expressions to content collections, and/or for analyzing corresponding search results
CN111444330A (en) Method, device and equipment for extracting short text keywords and storage medium
CN108027814B (en) Stop word recognition method and device
AU2020272235A1 (en) Methods, systems and computer program products for implementing neural network based optimization of database search functionality
CN110362601B (en) Metadata standard mapping method, device, equipment and storage medium
CN110210038B (en) Core entity determining method, system, server and computer readable medium thereof
WO2020107864A1 (en) Information processing method, device, service equipment and computer readable storage medium
CN113986950A (en) SQL statement processing method, device, equipment and storage medium
CN109902152B (en) Method and apparatus for retrieving information
CN112711678A (en) Data analysis method, device, equipment and storage medium
CN113569578B (en) User intention recognition method and device and computer equipment
CN114647739B (en) Entity chain finger method, device, electronic equipment and storage medium
CN112307235B (en) Naming method and device of front-end page element and electronic equipment
CN115062126A (en) Statement analysis method and device, electronic equipment and readable storage medium
CN113722584A (en) Task pushing method and device and storage medium
KR20220041336A (en) Graph generation system of recommending significant keywords and extracting core documents and method thereof
KR20220041337A (en) Graph generation system of updating a search word from thesaurus and extracting core documents and method thereof
CN112926297A (en) Method, apparatus, device and storage medium for processing information
CN112784046A (en) Text clustering method, device and equipment and storage medium
WO2020026229A2 (en) Proposition identification in natural language and usage thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant