WO2023060034A1 - Systems and methods for natural language code search - Google Patents

Systems and methods for natural language code search Download PDF

Info

Publication number
WO2023060034A1
Authority
WO
WIPO (PCT)
Prior art keywords
code
natural language
network
language query
candidates
Prior art date
Application number
PCT/US2022/077458
Other languages
French (fr)
Inventor
Akhilesh Deepak Gotmare
Junnan LI
Chu Hong Hoi
Original Assignee
Salesforce.Com, Inc.
Priority date
Filing date
Publication date
Priority claimed from US 17/587,984 (published as US20230109681A1)
Application filed by Salesforce.Com, Inc.
Priority to CN 202280062215.9A (published as CN117957523A)
Publication of WO2023060034A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00: Arrangements for software engineering
    • G06F 8/30: Creation or generation of source code
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/0455: Auto-encoder networks; Encoder-decoder networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/0475: Generative networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/09: Supervised learning


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments are directed to translating a natural language query into a code snippet in a programming language that semantically represents the query. The embodiments include a cascading neural network that includes an encoder network and a classifier network. The encoder network is faster but less accurate than the classifier network. The encoder network is trained using a contrastive learning framework to identify code candidates from a large set of code snippets. The classifier network is trained as a binary classifier to identify, from the code candidates, the code snippet that semantically represents the query.

Description

SYSTEMS AND METHODS FOR NATURAL LANGUAGE CODE SEARCH
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to U.S. Provisional Patent Application No. 63/252,393, filed October 5, 2021, and U.S. Nonprovisional Patent Application No. 17/587,984, filed on January 28, 2022, which are incorporated by reference herein in their entirety.
TECHNICAL FIELD
[0002] The embodiments relate generally to machine learning systems and natural language processing (NLP), and more specifically to searching code snippets using natural language.
BACKGROUND
[0003] Artificial intelligence (AI) models have been widely used in a variety of applications. Some AI models may be used to search and/or generate code snippets in programming languages in response to a natural language input. For example, a natural language input may describe a function such as “filter the sales records that occurred at the zip code 94070,” and the AI model may generate or search a code segment (e.g., in Python, C#, etc.) that implements this function. Existing code generation systems have focused on either improving the speed of natural language search or on improving the accuracy of the natural language search. However, these existing natural language search methods largely struggle with a tradeoff between efficiency and exhaustiveness of the search.
BRIEF DESCRIPTION OF THE DRAWINGS
[0004] FIG. 1 is a simplified diagram of a computing device that implements a code generator, according to some embodiments.
[0005] FIG. 2 is a simplified diagram of a code generator, according to some embodiments. [0006] FIG. 3 is a simplified diagram of a method for training the code generator, according to some embodiments.
[0007] FIG. 4 is a simplified diagram of a method for determining a code snippet that is semantically equivalent to a natural language query, according to some embodiments. [0008] In the figures, elements having the same designations have the same or similar functions.
DETAILED DESCRIPTION
[0009] Natural language queries are being used to improve search in different areas, e.g., web search, database search, legal search, and/or the like. There is also interest in searching large sets of code snippets using natural language queries. Organizations that have large code repositories may benefit from indexing and searching through the code and reusing code that is known to function properly. Some recent approaches to natural language search of code and code snippets leverage pairs of natural language and source code sequences to train a text-to-code search model to search a sample of code snippets.
[0010] One approach to train the model includes using a contrastive learning framework. The model may be a fast encoder neural network, also referred to as a fast encoder. In the contrastive learning framework, pairs of natural language and programming language sequences that semantically match are pulled together while pairs that do not semantically match are pushed apart. Fast encoder networks may use contrastive learning. Fast encoder networks may be efficient for scenarios that include searching a large number of candidate code snippets, at the expense of accuracy in semantic matching.
[0011] Another approach to train a model uses a binary classifier. This type of model uses a trained binary classifier that receives a natural language sequence and a programming language sequence as inputs and predicts whether the two sequences match semantically. Models using binary classifiers may be considered slow classifiers. Slow classifiers, while more accurate, become infeasible when searching a large number of candidate code snippets due to the amount of time the models take to analyze the code snippets against the natural language sequence. In other words, models trained using a contrastive learning framework may be faster by at least a factor of ten, but may also be less accurate by at least a factor of ten or more, than models that use a binary classifier.
[0012] To improve natural language searching of large numbers of code snippets, embodiments are directed to a cascading neural network model that includes both a fast encoder model and an accurate classifier model. The cascading neural network model provides improved natural language search efficiency over large sets of code snippets. Specifically, the cascading neural network model is a hybrid approach that combines a fast encoder network and a slow classifier network. First, the encoder network determines the top K code candidates from the set of code snippets based on a natural language query. Second, the top K code candidates are passed through a slow classifier network that pairs each of the code candidates with the natural language query and generates confidence scores for each pair. The code snippet with the highest confidence score may be the code snippet that semantically matches the natural language query.
[0013] The number K may denote a threshold that identifies the number of code candidates that the encoder network may generate. The K threshold is preferably much smaller than the size of the set of code snippets. If the K threshold is too small, there is an increased likelihood of missing the correct code snippet; if the K threshold is too large, it may be infeasible to efficiently run the second-stage slow classifier.
[0014] In some embodiments, memory overhead for storing a fast encoder network and a slow classifier network may be minimized by sharing or partially sharing the weights of the network. For example, a transformer encoder of the fast encoder and slow classifier may be shared by training the transformer encoder to be used by both the fast encoder network and the slow classifier network.
[0015] As used herein, the term “network” may comprise any hardware or software-based framework that includes any artificial intelligence network or system, neural network or system and/or any training or learning models implemented thereon or therewith.
[0016] As used herein, the term “module” may comprise a hardware- or software-based framework that performs one or more functions. In some embodiments, the module may be implemented on one or more neural networks.
[0017] FIG. 1 is a simplified diagram of a computing device that implements a code generator, according to some embodiments described herein. As shown in FIG. 1, computing device 100 includes a processor 110 coupled to memory 120. Operation of computing device 100 is controlled by processor 110. And although computing device 100 is shown with only one processor 110, it is understood that processor 110 may be representative of one or more central processing units, multi-core processors, microprocessors, microcontrollers, digital signal processors, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), graphics processing units (GPUs) and/or the like in computing device 100. Computing device 100 may be implemented as a stand-alone subsystem, as a board added to a computing device, and/or as a virtual machine.
[0018] Memory 120 may be used to store software executed by computing device 100 and/or one or more data structures used during operation of computing device 100. Memory 120 may include one or more types of machine readable media. Some common forms of machine readable media may include floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
[0019] Processor 110 and/or memory 120 may be arranged in any suitable physical arrangement. In some embodiments, processor 110 and/or memory 120 may be implemented on a same board, in a same package (e.g., system-in-package), on a same chip (e.g., system-on-chip), and/or the like. In some embodiments, processor 110 and/or memory 120 may include distributed, virtualized, and/or containerized computing resources, as well as multiple processors. Consistent with such embodiments, processor 110 and/or memory 120 may be located in one or more data centers and/or cloud computing facilities.
[0020] In some examples, memory 120 may include non-transitory, tangible, machine readable media that includes executable code that when run by one or more processors (e.g., processor 110) may cause the one or more processors to perform the methods described in further detail herein. For example, as shown, memory 120 includes instructions for a natural language (NL) processing module, such as a code generator 130 that may be used to implement and/or emulate the systems and models, and/or to implement any of the methods described further herein. In some examples, the code generator 130 may receive an input 140, e.g., such as natural language text or query or computer code, via a data interface 115. The data interface 115 may be any of a user interface that receives input 140 from a user, or a communication interface that receives or retrieves input 140 stored in memory 120 or another memory storage such as a database. The code generator 130 may generate an output 150, such as a programmable language (PL) sequence, code, or code snippet that is semantically equivalent to the natural language text or query. In some embodiments, code generator 130 may include a cascading neural network that includes an encoder network 132 and a classifier network 134, such that the output of encoder network 132 may be input, in part, into classifier network 134.
[0021] FIG. 2 is a block diagram 200 of a code generator, according to some embodiments. As illustrated in FIG. 2, the code generator 130 includes encoder network 132 and classifier network 134. The code generator 130 receives a natural language query 202 as natural language text. Natural language query 202 may be input 140 discussed in FIG. 1. The natural language query 202 may be human written or spoken text, such as “filter the sales records that occurred at the zip code 94070,” that code generator 130 may translate into a programming language sequence, such as a code snippet. The code generator 130 passes natural language query 202 through encoder network 132. Encoder network 132 may generate a K number of code candidates 204A-K. The code candidates 204A-K may be code snippets in a programming language that may semantically represent and/or semantically match the natural language query 202. The classifier network 134 may receive pairs of code candidates 204A-K and the natural language query 202. Each pair may include one of code candidates 204A-K and the natural language query 202. Classifier network 134 may generate a code snippet 206 that is a semantic representation of the natural language query 202.
[0022] In some embodiments, encoder network 132 is substantially faster, e.g., at least by a factor of ten or more, than the classifier network 134. In fact, due to the speed of the encoder network 132, encoder network 132 may quickly determine code candidates 204A-K from a large set of available code snippets. The classifier network 134, on the other hand, is slower than encoder network 132, but is substantially more accurate, e.g., by at least a factor of ten or more, in identifying a code snippet that semantically matches the natural language query 202. As illustrated in FIG. 2, classifier network 134 receives pairs of code candidates 204A-K and natural language query 202 and identifies code snippet 206 that is a semantic representation of the natural language query 202. By using the hybrid approach that includes encoder network 132 and classifier network 134, code generator 130 improves the speed and accuracy for determining the code snippet 206 that is a semantic representation of the natural language query 202.
[0023] In some embodiments, encoder network 132 may be or include a bidirectional encoder representations from transformers (BERT) network or a variant of a BERT network. The BERT network or a variant of the BERT network may be pre-trained on programming language sequences in a variety of programming languages to retrieve code snippets from text input. Example pre-trained BERT networks may be GraphCodeBERT or CodeBERT. Example programming languages may be Ruby, JavaScript, Go, Python, Java, C, C++, C#, Php, or the like. During a training phase, to recognize code candidates 204, encoder network 132 may further be trained on the contrastive learning framework using a bimodal dataset. In the bimodal dataset, representations of positive pairs, i.e., natural language queries and programming language sequences that match in semantics, are pulled together. On the other hand, representations of negative pairs, i.e., randomly paired natural language queries and programming language sequences, are pushed apart. A contrastive loss function, such as the infoNCE loss function, may be used to train encoder network 132, and is replicated below:

$$\mathcal{L}_{\text{infoNCE}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\!\big(f_\theta(x_i)^\top f_\theta(y_i)/\sigma\big)}{\sum_{j \in B} \exp\!\big(f_\theta(x_i)^\top f_\theta(y_j)/\sigma\big)} \tag{1}$$

where $f_\theta(x_i)$ is the dense representation for the natural language input $x_i$, $y_i$ is the corresponding semantically equivalent programming language sequence, $N$ is the number of training examples in the bimodal dataset, $\sigma$ is a temperature hyper-parameter, and $B$ denotes the current training minibatch. The encoder network 132 may be trained until the contrastive loss function is minimized.
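By way of illustration, a minimal in-batch version of this contrastive objective can be sketched in PyTorch as follows; the function name, the dot-product similarity on normalized embeddings, and the default temperature value are assumptions of this example rather than details taken from the disclosure.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, code_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """In-batch infoNCE over (natural language, code) pairs.

    query_emb, code_emb: (B, d) dense representations f_theta(x_i) and f_theta(y_i);
    row i of each tensor is assumed to be a semantically matching pair.
    """
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(code_emb, dim=-1)
    logits = q @ c.t() / temperature                    # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)   # positives lie on the diagonal
    # Cross-entropy over the rows pulls matching pairs together and pushes the
    # remaining in-batch pairs apart.
    return F.cross_entropy(logits, labels)
```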
[0024] Once trained, encoder network 132 may receive a set of candidate code snippets $C = \{y_1, y_2, \ldots, y_{|C|}\}$, shown as code snippets 208. Code snippets 208 may include potential code snippets, a universe of code snippets, available code snippets, etc., that may correspond to various natural language queries. Code snippets 208 may be encoded into an index $\{f_\theta(y_j) : y_j \in C\}$, shown as code snippet index 210. Code snippet index 210 may be an index of the encodings of each code snippet in code snippets 208. Encoder network 132 may encode the set of code snippets offline, e.g., prior to receiving natural language query 202 from which code generator 130 determines code snippet 206. Code snippet index 210 may be stored within encoder network 132 or elsewhere in memory 120 described in FIG. 1.
[0025] In some embodiments, after generating the code snippet index 210, encoder network 132 may receive a natural language query $x_i$ (natural language query 202), compute $f_\theta(x_i)$, query the code snippet index 210, and return the code snippet from $C$ (code snippets 208) corresponding to the nearest neighbor or neighbors in the code snippet index 210. The neighbor(s) may be computed using a distance metric determined by a similarity function, e.g., a cosine similarity function. The rank $r_i$ assigned to the correct code snippet from the set of code snippets $C$ (code snippets 208) for the natural language query $x_i$ may then be used to compute the mean reciprocal ranking (MRR) metric $\frac{1}{N_{\text{test}}} \sum_{i=1}^{N_{\text{test}}} \frac{1}{r_i}$. From the MRR metric, the code candidates 204 with a rank included in the MRR or a certain distance from the rank in the MRR may be determined. In some embodiments, the number of code candidates 204 may be governed by a hyperparameter K, which may be a threshold value that causes encoder network 132 to identify the top K candidates, e.g., code candidates 204A-K.
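The offline indexing, nearest-neighbor lookup, and MRR computation described above can be illustrated with the following NumPy sketch; the helper names and the use of cosine similarity as the distance metric are assumptions of this example.

```python
import numpy as np

def build_code_index(code_embeddings: np.ndarray) -> np.ndarray:
    # Offline step: L2-normalize the encoder outputs f_theta(y_j) for every snippet
    # in C so that a dot product with a normalized query equals cosine similarity.
    norms = np.linalg.norm(code_embeddings, axis=1, keepdims=True)
    return code_embeddings / np.clip(norms, 1e-12, None)

def top_k_candidates(query_embedding: np.ndarray, index: np.ndarray, k: int = 10) -> np.ndarray:
    # Stage-one retrieval: indices of the K snippets nearest to f_theta(x_i).
    q = query_embedding / max(np.linalg.norm(query_embedding), 1e-12)
    sims = index @ q
    return np.argsort(-sims)[:k]

def mean_reciprocal_rank(ranks: list) -> float:
    # ranks[i] is the 1-based rank the index assigns to the correct snippet for test query i.
    return float(np.mean([1.0 / r for r in ranks]))
```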
[0026] In some embodiments, classifier network 134 may also be or include a bidirectional encoder representations from transformers (BERT) network or a variant of a BERT network. The BERT network or a variant of the BERT network may be pre-trained on programming language sequences to retrieve code snippets from text input. Example pre-trained BERT networks may be GraphCodeBERT or CodeBERT, and example programming languages may be Ruby, JavaScript, Go, Python, Java, C, C++, C#, Php, or the like.
[0027] Classifier network 134 may receive natural language query $x_i$ (shown as 202) and a programming language sequence $y_j$ (one of code candidates 204A-K or another code sequence) as input, encode the natural language input $x_i$ and the code sequence $y_j$ jointly, and perform binary classification. The binary classification may predict whether the natural language input $x_i$ and the code sequence $y_j$ match or do not match in semantics. In some embodiments, classifier network 134 may receive a concatenation of the natural language input $x_i$ and the code sequence $y_j$, such as the concatenated sequence $(x_i, y_j)$.
[0028] Classifier network 134 may be trained for binary classification using training batches. The training batches may include pairs where each pair includes a natural language query and a code snippet. The training batches may be batches from a bimodal dataset where positive pairs denote semantic matches between the natural language query and the code snippet and negative pairs denote semantic mismatches. Given a set of pairs $\{x_i, y_i\}_{i=1}^{N}$ that each include a natural language query and a semantically equivalent programming language sequence, the cross-entropy objective function for this training scheme may be:

$$\mathcal{L}_{\text{CE}} = -\frac{1}{N} \sum_{i=1,\, j \neq i}^{N} \Big[ \log P_\theta(x_i, y_i) + \log\big(1 - P_\theta(x_i, y_j)\big) \Big] \tag{2}$$

where $P_\theta(x_i, y_j)$ represents the probability that the natural language sequence $x_i$ semantically matches the programming language sequence $y_j$, as predicted by the classifier. Classifier network 134 may be trained until the cross-entropy objective function is minimized.
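A hedged PyTorch sketch of this objective is shown below; treating the classifier output as a logit and relying on binary_cross_entropy_with_logits are implementation choices assumed for the example.

```python
import torch
import torch.nn.functional as F

def classifier_ce_loss(pos_logits: torch.Tensor, neg_logits: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy over matched and mismatched (query, code) pairs.

    pos_logits: classifier scores for positive pairs (x_i, y_i), shape (N,).
    neg_logits: classifier scores for negative pairs (x_i, y_j), j != i, shape (N,).
    """
    pos_loss = F.binary_cross_entropy_with_logits(pos_logits, torch.ones_like(pos_logits))
    neg_loss = F.binary_cross_entropy_with_logits(neg_logits, torch.zeros_like(neg_logits))
    return pos_loss + neg_loss
```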
[0029] From a training minibatch of positive pairs $\{(x_i, y_i)\}_{i \in B}$, a training batch of negative pairs may be generated. For example, a negative pair may be generated by randomly selecting a programming language sequence $y_j$ ($j \in B$, $j \neq i$) from the programming language sequences in the minibatch and pairing the selected sequence with $x_i$. When classifier network 134 includes a transformer encoder based classifier, the interactions between the natural language tokens and programming language tokens in the self-attention layers may help improve the precision of the classifier network 134.
[0030] Once trained, classifier network 134 may determine code snippet 206 from natural language query 202 and code candidates 204. For example, during inference, classifier network 134 may receive multiple pairs as inputs, each pair including the natural language sequence $x_i$ (e.g., natural language query 202) and a code snippet $y_j$ (one of code candidates 204A-K) from the set of candidate code snippets $C = \{y_1, y_2, \ldots, y_{|C|}\}$ (code candidates 204A-K). Classifier network 134 may generate a confidence score for each pair and rank each code candidate in the code candidates 204A-K according to the confidence scores. The confidence score may be a probability having a measure, e.g., from zero to one, with values closer to one indicating a greater probability of a match and values closer to zero indicating a greater probability of a mismatch. The code snippet $y_j$ (a code candidate in the code candidates 204A-K) that corresponds to the pair with the highest score may be a semantic match with the natural language sequence $x_i$ (natural language query 202).
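One simple way to realize this in-batch negative sampling is sketched below; the random-offset trick is only one of several valid ways to pick $j \neq i$ and is an assumption of the example (it requires a minibatch of at least two pairs).

```python
import torch

def sample_in_batch_negatives(batch_size: int) -> torch.Tensor:
    """Return, for each position i in the minibatch, an index j != i.

    Pairing the code sequence y_j at position j with the query x_i yields a negative pair.
    """
    idx = torch.arange(batch_size)
    shift = torch.randint(1, batch_size, (batch_size,))  # offsets in [1, B-1] guarantee j != i
    return (idx + shift) % batch_size
```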
[0031] As discussed above, the code generator 130 discussed herein includes a cascade of networks, such as encoder network 132 and classifier network 134, which combines the speed of the fast encoder network 132 with the precision of the classifier network 134 in a two-stage process. In the first stage, encoder network 132 receives natural language query 202 and generates code candidates 204A-K from the set of code snippets C (code snippets 208). Encoder network 132 may determine encodings of the natural language query 202 and match the encodings against the code snippet index 210 of the code snippets 208 using a distance function. In some embodiments, encoder network 132 may determine a K number of code candidates 204A-K, where K is a configurable candidate threshold that may be a hyperparameter. Typically, the K number of candidates are the top candidates that have the closest distance in the code snippet index 210 to the encodings of the natural language query 202.
[0032] In the second stage, the code candidates 204 are paired with the natural language query 202. Example pairs may be 202-204A, 202-204B, ..., 202-204K. Classifier network 134 receives pairs 202-204A, 202-204B, ..., 202-204K. For each pair in pairs 202-204A, 202-204B, ..., 202-204K, classifier network 134 returns a confidence score that the natural language query 202 semantically matches a corresponding one of code candidates 204A-K using a binary classifier. Based on the confidence scores associated with pairs 202-204A, 202-204B, ..., 202-204K, classifier network 134 selects the code snippet 206 that semantically matches the natural language query 202. In some instances, code snippet 206 may correspond to a pair having a highest confidence score.
[0033] As discussed above, encoder network 132, while computationally faster, is also less accurate than classifier network 134 in determining a code snippet that semantically matches the natural language query. In a scheme where K ≪ |C|, adding classifier network 134 in sequence with encoder network 132 may add only a minor computational overhead. The second stage, where classifier network 134 refines code candidates 204A-K, improves the retrieval performance provided that the value of K is set such that the recall of the encoder network 132 is reasonably high. In some embodiments, K may be a hyperparameter. Setting a very low K would lead to a high likelihood of missing the code snippet 206 in the set of code candidates 204 passed to classifier network 134. On the other hand, setting a high K would make the scheme infeasible for retrieval with the classifier network 134. However, setting K to a value such as ten already offers significant gains in retrieval performance over conventional code generation systems, with only marginal gains when K is set to 100 and above.
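Putting the two stages together, an end-to-end search might be sketched as follows; `encode_query`, `pair_score`, `code_index`, and `code_snippets` stand in for the trained encoder network 132, classifier network 134, code snippet index 210, and code snippets 208, and are assumptions of this example.

```python
import numpy as np

def cascade_search(query: str, encode_query, pair_score,
                   code_index: np.ndarray, code_snippets: list, k: int = 10):
    """Two-stage natural language code search.

    Stage 1: the fast encoder shortlists the K nearest snippets from the index.
    Stage 2: the slow classifier rescores each (query, candidate) pair, and the
    candidate with the highest confidence score is returned.
    """
    q = np.asarray(encode_query(query), dtype=np.float64)   # dense query embedding
    q = q / max(np.linalg.norm(q), 1e-12)
    sims = code_index @ q                                    # cosine similarity to every indexed snippet
    candidate_ids = np.argsort(-sims)[:k]                    # top-K code candidates (204A-K)
    scores = [pair_score(query, code_snippets[i]) for i in candidate_ids]
    best = candidate_ids[int(np.argmax(scores))]
    return code_snippets[best], float(max(scores))
```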
[0034] In some embodiments, encoder network 132 and classifier network 134 may share a portion of a neural network structure. For example, encoder network 132 and classifier network 134 may share weights of the layers in the transformer encoder in the BERT network. Sharing the neural network structure minimizes a memory overhead incurred by encoder network 132 and classifier network 134. Sharing the neural network structure, e.g., the transformer layers, by encoder network 132 and classifier network 134 may be achieved by training the transformer encoder with a joint objective of infoNCE ($\mathcal{L}_{\text{infoNCE}}$) shown in Eq. (1) and binary cross-entropy ($\mathcal{L}_{\text{CE}}$) shown in Eq. (2). While the number of parameters in this shared variant would be nearly half of that when the transformer layers are not shared, the computational cost during inference may be similar or the same.
[0035] In the shared embodiment, the classifier network 134 may have an additional classification layer or head that determines confidence scores for the pairs 202-204A, 202-204B, ..., 202-204K. Classifier network 134 would include the classification head on top of the transformer encoder. Further, the shared neural network structure may receive three inputs: natural language query 202, the set of candidate code snippets C (code snippets 208), and pairs 202-204A, 202-204B, ..., 202-204K. In the shared embodiment, two passes are made through the shared layers of the network, where the natural language query 202 is an input during a first pass, and pairs 202-204A, 202-204B, ..., 202-204K are input during a second pass.
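A sketch of this weight-sharing variant appears below; the first-token pooling, the single linear classification head, and the encoder call signature are assumptions made for the example, since the disclosure only specifies that the transformer layers are shared and that the classifier adds a classification head. Training would then minimize the sum of Eq. (1) on the outputs of `embed` and Eq. (2) on the outputs of `classify_pair`; equal weighting of the two terms is likewise an assumption.

```python
import torch
import torch.nn as nn

class SharedCascadeEncoder(nn.Module):
    """One transformer encoder serves both the retrieval stage and the classification stage."""

    def __init__(self, encoder: nn.Module, hidden_size: int):
        super().__init__()
        self.encoder = encoder                      # shared transformer layers
        self.cls_head = nn.Linear(hidden_size, 1)   # classification head used only by the classifier

    def embed(self, token_ids: torch.Tensor) -> torch.Tensor:
        # First pass: dense representation of a query (or snippet) for the code snippet index.
        hidden = self.encoder(token_ids)            # assumed to return hidden states of shape (B, T, H)
        return hidden[:, 0]                         # pooled first-token embedding

    def classify_pair(self, pair_token_ids: torch.Tensor) -> torch.Tensor:
        # Second pass: jointly encoded (query, code) concatenation mapped to a match logit.
        hidden = self.encoder(pair_token_ids)
        return self.cls_head(hidden[:, 0]).squeeze(-1)
```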
[0036] FIG. 3 is a simplified diagram of a method 300 for training a code generator, according to some embodiments. One or more of the processes 302-304 of method 300 may be implemented, at least in part, in the form of executable code stored on non-transitory, tangible, machine-readable media that when run by one or more processors may cause the one or more processors to perform one or more of the processes 302-304.
[0037] At process 302, an encoder network is trained. For example, encoder network 132, which may be a pre-trained BERT network, may further be trained on a contrastive learning framework to identify code snippets that semantically match the natural language sequences. The loss function used to train encoder network 132 may be a contrastive loss function, such as the infoNCE loss function shown in Eq. (1). The training may include batches of negative and positive pairs, each pair including a natural language sequence and a programming language sequence. The training may continue iteratively until the infoNCE loss function is minimized.
[0038] At process 304, a classifier network is trained. For example, classifier network 134, which may be a pre-trained BERT network, may be trained on binary classification to determine a probability score that the code snippets match the natural language sequences. The cross-entropy objective function shown in Eq. (2) may be used to train classifier network 134. The training may include batches of negative and positive pairs, each pair including a natural language sequence and a programming language sequence. The training may continue iteratively until the cross-entropy objective function is minimized.
[0039] FIG. 4 is a simplified diagram of a method 400 for generating a code snippet that is semantically equivalent to a natural language query, according to some embodiments. One or more of the processes 402-408 of method 400 may be implemented, at least in part, in the form of executable code stored on non-transitory, tangible, machine-readable media that when run by one or more processors may cause the one or more processors to perform one or more of the processes 402-408.
[0040] At process 402, a code snippet index is generated. For example, encoder network 132 receives code snippets 208 that may semantically correspond to numerous natural language queries. Encoder network 132 encodes the code snippets 208 and generates a code snippet index 210 that corresponds to the encoded code snippets. Process 402 may occur after encoder network 132 is trained and prior to encoder network 132 processing natural language query 202.
[0041] At process 404, code candidates for a natural language query are generated. For example, encoder network 132 may receive natural language query 202 and generate encodings for the natural language query 202. Encoder network 132 may use the code snippet index 210 to match the encodings of the natural language query 202 to encodings of code snippets 208 to identify code candidates 204A-K that may semantically match natural language query 202. As discussed above, the number of code candidates 204A-K may be set using a number K which may be a hyperparameter.
[0042] At process 406, pairs that include the natural language query and code candidates are generated. For example, code generator 130 may generate pairs 202-204A, 202-204B, ..., 202-204K, where each pair includes natural language query 202 and one of code candidates 204A-K.
[0043] At process 408, a code snippet is determined. For example, classifier network 134 may receive the pairs 202-204A, 202-204B, ..., 202-204K and determine a confidence score for each pair. The code candidate in the pair with the highest confidence score may be code snippet 206 that semantically matches natural language query 202.
[0044] Some examples of computing devices, such as computing device 100 may include non-transitory, tangible, machine readable media that include executable code that when run by one or more processors (e.g., processor 110) may cause the one or more processors to perform the processes of methods 300-400. Some common forms of machine readable media that may include the processes of methods 300-400 are, for example, floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
[0045] This description and the accompanying drawings that illustrate inventive aspects, embodiments, implementations, or applications should not be taken as limiting. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the spirit and scope of this description and the claims. In some instances, well- known circuits, structures, or techniques have not been shown or described in detail in order not to obscure the embodiments of this disclosure. Like numbers in two or more figures represent the same or similar elements.
[0046] In this description, specific details are set forth describing some embodiments consistent with the present disclosure. Numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art that some embodiments may be practiced without some or all of these specific details. The specific embodiments disclosed herein are meant to be illustrative but not limiting. One skilled in the art may realize other elements that, although not specifically described here, are within the scope and the spirit of this disclosure. In addition, to avoid unnecessary repetition, one or more features shown and described in association with one embodiment may be incorporated into other embodiments unless specifically described otherwise or if the one or more features would make an embodiment non-functional.
[0047] Although illustrative embodiments have been shown and described, a wide range of modification, change and substitution is contemplated in the foregoing disclosure and in some instances, some features of the embodiments may be employed without a corresponding use of other features. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. Thus, the scope of the invention should be limited only by the following claims, and it is appropriate that the claims be construed broadly and in a manner consistent with the scope of the embodiments disclosed herein.

Claims

WHAT IS CLAIMED IS:
1. A method for translating a natural language query into a code snippet in a programming language, the method comprising:
generating, at an encoder network, a code snippet index from a plurality of code snippets;
generating, using the code snippet index and the encoder network, code candidates for the natural language query;
generating pairs from the natural language query and the code candidates, a pair including the natural language query and a code candidate from the code candidates; and
determining, using a classifier network that sequentially follows the encoder network and the pairs, the code snippet in the programming language for the natural language query, wherein the code snippet is a semantic representation of the natural language query.
2. The method of claim 1, further comprising:
training the encoder network to determine the code candidates on a contrastive loss function.
3. The method of claims 1 or 2, further comprising:
training the classifier network to determine the code snippet from the pairs using a cross-entropy objective function.
4. The method of claims 1-3, wherein the encoder network is an order of magnitude faster and an order of magnitude less accurate than the classifier network.
5. The method of claims 1-4, wherein the encoder network is trained on a different loss function from the classifier network.
6. The method of claims 1-5, wherein the encoder network shares a portion of a neural network structure with the classifier network.
7. The method of claims 1-6, wherein generating the code candidates further comprises:
generating encodings from the natural language query; and
determining, using the code snippet index, the encodings of the code candidates that are within a distance determined by a distance function from the encodings of the natural language query.
8. The method of claims 1-7, wherein determining the code snippet further comprises:
determining a confidence score that a code candidate of each pair is the semantic representation of the natural language query;
ranking confidence scores of the pairs; and
selecting a code candidate of a pair corresponding to a highest confidence score as the code snippet that is the semantic representation of the natural language query.
9. A system for translating a natural language query into a code snippet in a programming language, the system comprising:
a memory configured to store a cascading neural network;
a processor coupled to the memory and configured to execute instructions for causing the cascading neural network to:
    generate, at an encoder network of the cascading neural network, a code snippet index from a plurality of code snippets;
    generate, using the code snippet index and the encoder network, code candidates for the natural language query;
    generate pairs from the natural language query and the code candidates, a pair including the natural language query and a code candidate from the code candidates; and
    determine, using a classifier network of the cascading neural network and the pairs, the code snippet in the programming language for the natural language query, wherein the code snippet is a semantic representation of the natural language query.
10. The system of claim 9, wherein the processor is further configured to:
train the encoder network to determine the code candidates on a contrastive loss function; and
train the classifier network to determine the code snippet from the pairs using a cross-entropy objective function.
11. The system of claims 9 or 10, wherein the encoder network is an order of magnitude faster and an order of magnitude less accurate than the classifier network.
12. The system of claims 9-11, wherein the encoder network shares a portion of a neural network structure with the classifier network.
13. The system of claims 9-12, wherein to generate the code candidates the processor is further configured to:
generate encodings from the natural language query; and
determine, using the code snippet index, encodings of the code candidates that are within a distance determined by a distance function from the encodings of the natural language query.
14. The system of claims 9-13, wherein to determine the code snippet the processor is further configured to:
determine a confidence score that a code candidate of each pair is a semantic representation of the natural language query;
rank confidence scores of the pairs; and
select a code candidate of a pair corresponding to a highest confidence score as the code snippet that is the semantic representation of the natural language query.
15. A non-transitory computer readable medium having instructions stored thereon, that when executed by a processor cause the processor to perform operations for translating a natural language query into a code snippet in a programming language, the operations comprising:
generating, at an encoder network, a code snippet index from a plurality of code snippets;
generating, using the code snippet index and the encoder network, code candidates for the natural language query;
generating pairs from the natural language query and the code candidates, a pair including the natural language query and a code candidate from the code candidates; and
determining, using a classifier network and the pairs, the code snippet in the programming language for the natural language query, wherein the code snippet is a semantic representation of the natural language query.
16. The non-transitory computer readable medium of claim 15, further comprising:
training the encoder network to determine the code candidates on a contrastive loss function; and
training the classifier network to determine the code snippet from the pairs using a cross-entropy objective function.
17. The non-transitory computer readable medium of claims 15 or 16, wherein the encoder network is an order of magnitude faster and an order of magnitude less accurate than the classifier network.
18. The non-transitory computer readable medium of claims 15-17, wherein the encoder network shares a portion of a neural network structure with the classifier network.
19. The non-transitory computer readable medium of claims 15-18, wherein generating the code candidates further comprises:
generating encodings from the natural language query; and
determining, using the code snippet index, encodings of the code candidates that are within a distance determined by a distance function from the encodings of the natural language query.
20. The non-transitory computer readable medium of claims 15-19, wherein determining the code snippet further comprises:
determining a confidence score that a code candidate of each pair is a semantic representation of the natural language query;
ranking confidence scores of the pairs; and
selecting a code candidate of a pair corresponding to a highest confidence score as the code snippet that is the semantic representation of the natural language query.
PCT/US2022/077458 2021-10-05 2022-10-03 Systems and methods for natural language code search WO2023060034A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202280062215.9A CN117957523A (en) 2021-10-05 2022-10-03 System and method for natural language code search

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202163252393P 2021-10-05 2021-10-05
US63/252,393 2021-10-05
US17/587,984 2022-01-28
US17/587,984 US20230109681A1 (en) 2021-10-05 2022-01-28 Systems and methods for natural language code search

Publications (1)

Publication Number Publication Date
WO2023060034A1 true WO2023060034A1 (en) 2023-04-13

Family

ID=83995678

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/077458 WO2023060034A1 (en) 2021-10-05 2022-10-03 Systems and methods for natural language code search

Country Status (1)

Country Link
WO (1) WO2023060034A1 (en)

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AKHILESH DEEPAK GOTMARE ET AL: "Cascaded Fast and Slow Models for Efficient Semantic Code Search", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 15 October 2021 (2021-10-15), XP091077176 *
GUO DAYA ET AL: "GRAPHCODEBERT: PRE-TRAINING CODE REPRESENTATIONS WITH DATA FLOW", 13 September 2021 (2021-09-13), XP093011060, Retrieved from the Internet <URL:https://arxiv.org/pdf/2009.08366.pdf> [retrieved on 20230102] *
PASQUALE SALZA ET AL: "On the Effectiveness of Transfer Learning for Code Search", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 12 August 2021 (2021-08-12), XP091032288 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116719520A (en) * 2023-08-07 2023-09-08 支付宝(杭州)信息技术有限公司 Code generation method and device
CN116719520B (en) * 2023-08-07 2023-11-17 支付宝(杭州)信息技术有限公司 Code generation method and device
CN117093196A (en) * 2023-09-04 2023-11-21 广东工业大学 Knowledge graph-based programming language generation method and system
CN117093196B (en) * 2023-09-04 2024-03-01 广东工业大学 Knowledge graph-based programming language generation method and system
CN117349453A (en) * 2023-12-04 2024-01-05 武汉大学 Acceleration method of deep learning code search model based on extension code
CN117349453B (en) * 2023-12-04 2024-02-23 武汉大学 Acceleration method of deep learning code search model based on extension code

Similar Documents

Publication Publication Date Title
Mallia et al. Learning passage impacts for inverted indexes
US11562147B2 (en) Unified vision and dialogue transformer with BERT
US20210141799A1 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
US11113479B2 (en) Utilizing a gated self-attention memory network model for predicting a candidate answer match to a query
KR102342066B1 (en) Method and apparatus for machine translation using neural network and method for learning the appartus
WO2023060034A1 (en) Systems and methods for natural language code search
Tang et al. Improving document representations by generating pseudo query embeddings for dense retrieval
US20230109681A1 (en) Systems and methods for natural language code search
KR20220114495A (en) Interaction layer neural network for search, retrieval, and ranking
CN111985228B (en) Text keyword extraction method, text keyword extraction device, computer equipment and storage medium
US11380301B2 (en) Learning apparatus, speech recognition rank estimating apparatus, methods thereof, and program
US20220374595A1 (en) Systems and methods for semantic code search
KR20240011164A (en) Transfer learning in image recognition systems
Jaech et al. Match-tensor: a deep relevance model for search
Kan et al. Zero-shot learning to index on semantic trees for scalable image retrieval
US11822887B2 (en) Robust name matching with regularized embeddings
WO2021118462A1 (en) Context detection
Bai et al. Memory consolidation for contextual spoken language understanding with dialogue logistic inference
WO2023063880A2 (en) System and method for training a transformer-in-transformer-based neural network model for audio data
CN113111649B (en) Event extraction method, system and equipment
Huang et al. Text sentiment analysis based on Bert and Convolutional Neural Networks
Alagarsamy et al. An experimental analysis of optimal hybrid word embedding methods for text classification using a movie review dataset
Kroher et al. MXX@ FinSim3-an LSTM–based approach with custom word embeddings for hypernym detection in financial texts
CN112749565A (en) Semantic recognition method and device based on artificial intelligence and semantic recognition equipment
Wu et al. Probabilistic transformer: A probabilistic dependency model for contextual word representation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 22794056
    Country of ref document: EP
    Kind code of ref document: A1
WWE Wipo information: entry into national phase
    Ref document number: 202280062215.9
    Country of ref document: CN
WWE Wipo information: entry into national phase
    Ref document number: 2022794056
    Country of ref document: EP
NENP Non-entry into the national phase
    Ref country code: DE
ENP Entry into the national phase
    Ref document number: 2022794056
    Country of ref document: EP
    Effective date: 20240506