US12430585B2 - Systems and methods for program synthesis - Google Patents
Systems and methods for program synthesis
- Publication number
- US12430585B2 (U.S. application Ser. No. 17/896,946)
- Authority
- US
- United States
- Prior art keywords
- program
- samples
- sequence
- sub
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/217—Validation; Performance evaluation; Active pattern learning techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/30—Creation or generation of source code
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0475—Generative networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/096—Transfer learning
Definitions
- this program repair model 566 is designed as a sequence-to-sequence generation model.
- the input sequence is the concatenation of the problem description D 505 and buggy program Wfail.
- Additional signals received from the unit test results 112, including the type of test outcome (e.g., one of CompileError, RuntimeError, FailedTest, or PassedTest) and error subtypes (e.g., syntax errors, out-of-index errors, and/or the like), may also be included in the input sequence.
- the error types are extracted from error traces returned by the compiler.
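For illustration, the four outcome types above might be assigned by a helper of this shape (the function and argument names are assumptions for this sketch, not from the patent):

```python
# Map an execution result to one of the four test-outcome types described
# above: failure to compile, a runtime exception, failing at least one unit
# test, or passing all of them.
def classify_outcome(compiled, raised, passed_all):
    if not compiled:
        return "CompileError"
    if raised:
        return "RuntimeError"
    return "PassedTest" if passed_all else "FailedTest"
```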
- the ground-truth program W 106 can be used as the expected correct program.
- repair(θ)=−E (D,W fail,u,c) Σt log[p θ(w t|w 1:t−1,D,W fail,u,c)], where u denotes the unit test outcome and c the error subtype.
- each selected failed sequence can be stacked N/M times for upsampling. This results in the same number of output programs N as in the first round of generation.
- these N repaired programs generated by the program repairing model 561 may be passed to module 543 to apply the code refining procedure 550 as described above.
- FIG. 6 is a simplified diagram of a computing device 600 for implementing the reinforcement learning based program synthesis framework shown in FIGS. 1 - 5 , according to some embodiments.
- computing device 600 includes a processor 610 coupled to memory 620 . Operation of computing device 600 is controlled by processor 610 .
- processor 610 may be representative of one or more central processing units, multi-core processors, microprocessors, microcontrollers, digital signal processors, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), graphics processing units (GPUs) and/or the like in computing device 600 .
- Computing device 600 may be implemented as a stand-alone subsystem, as a board added to a computing device, and/or as a virtual machine.
- Memory 620 may be used to store software executed by computing device 600 and/or one or more data structures used during operation of computing device 600 .
- Memory 620 may include one or more types of machine-readable media. Some common forms of machine-readable media may include floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip, or cartridge, and/or any other medium from which a processor or computer is adapted to read.
- Processor 610 and/or memory 620 may be arranged in any suitable physical arrangement.
- processor 610 and/or memory 620 may be implemented on a same board, in a same package (e.g., system-in-package), on a same chip (e.g., system-on-chip), and/or the like.
- processor 610 and/or memory 620 may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 610 and/or memory 620 may be located in one or more data centers and/or cloud computing facilities.
- memory 620 may include non-transitory, tangible, machine readable media that includes executable code that when run by one or more processors (e.g., processor 610 ) may cause the one or more processors to perform the methods described in further detail herein.
- memory 620 includes instructions for a program synthesis module 630 that may be used to implement and/or emulate the systems and models, and/or to implement any of the methods described further herein.
- a program synthesis module 630 may receive input 640 that includes a natural language problem specification via the data interface 615 and generate a code program as output 650 .
- the program synthesis model 630 includes an actor network module 631 (similar to 130 in FIG. 1 ), a critic network module 632 (similar to 140 in FIG. 1 ) and a language model 633 (similar to 110 or 120 in FIG. 1 ). Details of the program synthesis module 630 and its submodules 631-633 and their interactions may be discussed in relation to FIGS. 1 - 5 .
- the program synthesis module 630 and its submodules 631-633 may be implemented by hardware, software and/or a combination thereof.
- FIG. 7 is a simplified block diagram of a networked system suitable for implementing the program synthesis framework described in FIGS. 1 - 5 and other embodiments described herein.
- block diagram 700 shows a system including the user device 710 which may be operated by user 740 , data vendor servers 745 , 770 and 780 , server 730 , and other forms of devices, servers, and/or software components that operate to perform various methodologies in accordance with the described embodiments.
- Exemplary devices and servers may include device, stand-alone, and enterprise-class servers which may be similar to the computing device 100 described in FIG.
- an OS such as a MICROSOFT® OS, a UNIX® OS, a LINUX® OS, or other suitable device and/or server-based OS.
- the devices and/or servers illustrated in FIG. 7 may be deployed in other ways and that the operations performed, and/or the services provided by such devices and/or servers may be combined or separated for a given embodiment and may be performed by a greater number or fewer number of devices and/or servers.
- One or more devices and/or servers may be operated and/or maintained by the same or different entities.
- the user device 710 , data vendor servers 745 , 770 and 780 , and the server 730 may communicate with each other over a network 760 .
- User device 710 may be utilized by a user 740 (e.g., a driver, a system admin, etc.) to access the various features available for user device 710 , which may include processes and/or applications associated with the server 730 to receive an output data anomaly report.
- User device 710 , data vendor server 745 , and the server 730 may each include one or more processors, memories, and other appropriate components for executing instructions such as program code and/or data stored on one or more computer readable mediums to implement the various applications, data, and steps described herein.
- instructions may be stored in one or more computer readable media such as memories or data storage devices internal and/or external to various components of system 700 , and/or accessible over network 760 .
- User device 710 may be implemented as a communication device that may utilize appropriate hardware and software configured for wired and/or wireless communication with data vendor server 745 and/or the server 730 .
- user device 710 may be implemented as an autonomous driving vehicle, a personal computer (PC), a smart phone, laptop/tablet computer, wristwatch with appropriate computer hardware resources, eyeglasses with appropriate computer hardware (e.g., GOOGLE GLASS®), other type of wearable computing device, implantable communication devices, and/or other types of computing devices capable of transmitting and/or receiving data, such as an IPAD® from APPLE®.
- User device 710 of FIG. 7 contains a user interface (UI) application 712 , and/or other applications 716 , which may correspond to executable processes, procedures, and/or applications with associated hardware.
- the user device 710 may receive a message indicating the generated program from the server 730 and display the message via the UI application 712 .
- user device 710 may include additional or different modules having specialized hardware and/or software as required.
- User device 710 may further include database 718 stored in a transitory and/or non-transitory memory of user device 710 , which may store various applications and data and be utilized during execution of various modules of user device 710 .
- Database 718 may store user profile relating to the user 740 , predictions previously viewed or saved by the user 740 , historical data received from the server 730 , and/or the like.
- database 718 may be local to user device 710 . However, in other embodiments, database 718 may be external to user device 710 and accessible by user device 710 , including cloud storage systems and/or databases that are accessible over network 760 .
- User device 710 includes at least one network interface component 719 adapted to communicate with data vendor server 745 and/or the server 730 .
- network interface component 719 may include a DSL (e.g., Digital Subscriber Line) modem, a PSTN (Public Switched Telephone Network) modem, an Ethernet device, a broadband device, a satellite device and/or various other types of wired and/or wireless network communication devices including microwave, radio frequency, infrared, Bluetooth, and near field communication devices.
- Data vendor server 745 may correspond to a server that hosts one or more of the databases 703 a - n (or collectively referred to as 703 ) to provide training datasets including public code data to the server 730 .
- the database 703 may be implemented by one or more relational database, distributed databases, cloud databases, and/or the like.
- the data vendor server 745 includes at least one network interface component 726 adapted to communicate with user device 710 and/or the server 730 .
- network interface component 726 may include a DSL (e.g., Digital Subscriber Line) modem, a PSTN (Public Switched Telephone Network) modem, an Ethernet device, a broadband device, a satellite device and/or various other types of wired and/or wireless network communication devices including microwave, radio frequency, infrared, Bluetooth, and near field communication devices.
- the data vendor server 745 may send asset information from the database 703 , via the network interface 726 , to the server 730 .
- a problem specification (e.g., 105 in FIG. 1 ) and a corresponding solution program (e.g., 106 in FIG. 1 ) may be received via an input interface (e.g., 615 in FIG. 6 , 733 in FIG. 7 ).
- a pretrained language model (e.g., 120 in FIG. 1 ) may be finetuned based on the problem specification and the corresponding solution program.
- the finetuned pretrained language model may generate a sampled program (e.g., 133 in FIG. 1 ) in response to the problem specification (e.g., 105 in FIG. 1 ) at a decoding time step.
- a predicted token w t for the sampled program may be generated governed by the current parameters of the finetuned pretrained language model (e.g., 130 in FIG. 1 ) at the decoding time step t.
- the policy gradient is computed based on a probability distribution of a predicted test outcome generated by the critic model and a gradient of a conditional probability of a predicted token conditioned on prior predicted tokens and the problem specification, e.g., according to Eq. (9).
- the critic model (e.g., 140 in FIG. 1 ) is trained.
- the critic model may receive a training sequence of the problem specification (e.g., 105 in FIG. 1 ) and the sampled program (e.g., 133 in FIG. 1 ), and generate a predicted test outcome corresponding to the sampled program.
- the predicted test outcome is computed via a softmax operation over max-pooled contextual hidden states of a decoder in the critic model, e.g., according to Eqs. (6)-(7).
- a cross-entropy loss may be computed by comparing the predicted test outcome and the execution result of the sampled program, e.g., according to Eq. (8), and the critic model may be updated based on the cross-entropy loss.
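A minimal numeric sketch of this critic head, assuming max-pooling and treating the Linear projection of each hidden state as already applied (helper names are illustrative, not from the patent):

```python
import math

# Max-pool the (already linearly projected) per-token hidden states over time,
# then softmax into a distribution over the test-outcome classes, and score
# the prediction with cross-entropy against the observed outcome class.
def max_pool(states):
    # states: list of T hidden vectors; elementwise max over the time axis
    return [max(col) for col in zip(*states)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_idx):
    # negative log-likelihood of the ground-truth outcome class
    return -math.log(probs[target_idx])
```

For a program whose ground-truth outcome is, say, PassedTest, the critic loss would be `cross_entropy(softmax(max_pool(hidden_states)), index_of_PassedTest)`.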
- FIG. 9 is an example logic flow diagram illustrating a method of program synthesis based on the LM shown in FIG. 5 , according to some embodiments described herein.
- One or more of the processes of method 900 may be implemented, at least in part, in the form of executable code stored on non-transitory, tangible, machine-readable media that when run by one or more processors may cause the one or more processors to perform one or more of the processes.
- method 900 corresponds to the operation of the program synthesis module 630 (e.g., FIGS. 6 - 7 ).
- a problem specification (e.g., 505 in FIG. 5 ) may be received, via an input interface (e.g., 615 in FIG. 6 , 733 in FIG. 7 ), at a language model (e.g., 130 in FIG. 5 ) pretrained for program synthesis.
- one or more unit test input-output pairs may be extracted from the problem specification (e.g., 105 in FIG. 2 ).
- the language model may generate a plurality of program samples (e.g., 533 in FIG. 5 ) from the problem specification.
- one or more unit tests may be applied to the plurality of program samples (e.g., 533 in FIG. 5 ) based on the one or more unit test input-output pairs.
- a first set of program samples (e.g., 541 in FIG. 5 ) that pass the one or more unit tests and a second set of program samples (e.g., 542 in FIG. 5 ) that are unsuccessful may be determined, from the plurality of program samples.
- program samples in the second set comprise at least one of a compile error, a runtime error, or a failure to pass at least one of the unit tests.
- a critic model may determine a value for a second program sample in the second set based on a predicted probability that the second program sample passes the one or more unit tests, e.g., according to Eq. (11).
- a subset of program samples may be selected with the highest values from the second set.
- An input sequence (e.g., 566 ) is formed by concatenating the problem specification, a selected program sample and error information corresponding to the selected program sample.
- the error information comprises any of: a unit test outcome corresponding to the selected program sample, and an error subtype during compiling or runtime of the selected program sample.
- a program repair model may be used to generate a repaired program sample based on the input sequence.
- the program repair model is trained by a training objective comparing program samples that fail the unit tests and a ground-truth program corresponding to the problem specification, conditioned on a unit test outcome and/or an error subtype corresponding to the program samples.
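The input-sequence concatenation described above can be sketched as follows; the separator token and field order are assumptions for illustration, not specified in the patent:

```python
# Build the repair model's input by concatenating the problem description,
# the failed program, and its error information (test outcome type and, when
# available, the error subtype). "<sep>" is a hypothetical delimiter token.
def build_repair_input(problem, failed_program, outcome, error_subtype=None):
    parts = [problem, "<sep>", failed_program, "<sep>", outcome]
    if error_subtype is not None:
        parts += ["<sep>", error_subtype]
    return " ".join(parts)
```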
- one or more sub-sequences may be selected, via critic scoring, from the first set of program samples.
- Each sub-sequence is a truncated version of a program sample.
- a critic model may determine a value for each token of a first program sample in the first set based on a predicted probability that the sub-sequence up to the respective token passes the one or more unit tests, e.g., according to Eq. (10).
- a particular token of the first program sample having a highest value may be identified, and a sub-sequence of the first program sample up to the particular token may be selected as a sub-sequence.
- if the selected sub-sequence contains a particular token up to which the corresponding sub-sequence has a higher probability to fail than to pass the one or more unit tests, the selected sub-sequence is further chopped at the particular token.
- the language model may generate remaining tokens conditioned on the one or more sub-sequences, e.g., using the sub-sequences as “seeds” (e.g., 545 in FIG. 5 ).
- the generated remaining tokens from step 920 may be combined with the one or more sub-sequences to generate one or more refined program samples.
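Under the assumption that the critic's per-token pass probabilities are available as a list, the seed-selection rule in the steps above can be sketched as (the 0.5 threshold encodes "more likely to fail than to pass"; exact truncation boundaries are an assumption):

```python
# Pick the prefix ending at the token with the highest critic value, then,
# if any earlier token's sub-sequence is judged more likely to fail than to
# pass (pass probability below 0.5), chop the seed again at that token.
def select_seed(tokens, pass_probs):
    t_max = max(range(len(tokens)), key=lambda t: pass_probs[t])
    seed = tokens[: t_max + 1]
    for t in range(t_max + 1):
        if pass_probs[t] < 0.5:
            return seed[:t]
    return seed
```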
- a CodeT5-large model (770M) is pretrained from scratch following T5-large's architecture.
- the code-specific tokenizer from the CodeT5 work is used, and 6 programming languages (PLs) from CodeSearchNet (described in Husain et al., Codesearchnet challenge: Evaluating the state of semantic code search, Computing Research Repository (CoRR), abs/1909.09436, 2019) (CSN) are used instead of the 8 PLs in CodeT5, as the C/C# datasets are not publicly available.
- Only the pretraining task of masked span prediction (MSP) is applied; hence, the model does not have to parse programs into abstract syntax trees (ASTs) to obtain the identifier information.
- Example data experiments are run on a Kubernetes cluster with 16 A100-40G GPUs on Google Cloud Platform, and the total pretraining duration is around 21 days.
- a corruption rate of 15%, a peak learning rate (LR) of 2e-4, and a batch size of 2048 are adopted.
- the model is pretrained on CSN for 150 epochs (10 days) and then on GCPY for 10 epochs (5 days).
- a peak LR of 1e-4, a batch size of 256, and pretraining for 10 epochs (6 days) are adopted.
- the maximum length is set to 768 and 600 for source and target sequences respectively for this objective.
- an AdamW optimizer with a 0.05 weight decay and a linear decay LR scheduler with a warmup step of 1000 is adopted.
- the n@k metric is used, which only considers a subset of n candidates from k generated programs per problem. The subset of n candidates is typically selected by a filtering method: generated programs are passed through the example tests given as part of the problem description.
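As a hedged per-problem sketch (the real n@k metric averages over problems and over choices of n candidates; this simplification only illustrates the filtering step), the metric could look like:

```python
# Illustrative per-problem sketch of the n@k filtering step: from k generated
# programs, keep those passing the example tests, take up to n survivors, and
# count the problem as solved if any of them passes the hidden tests.
def solved_n_at_k(passes_example, passes_hidden, n):
    filtered = [hidden for ex, hidden in zip(passes_example, passes_hidden) if ex]
    return any(filtered[:n])
```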
- Example benchmarks for comparison include the APPS program synthesis benchmark (see Hendrycks et al.), as it has a large number of coding problems of varying difficulty collected from multiple coding websites.
- APPS consists of 10,000 coding problems with a 50-50 train-test split. Each problem is accompanied by 23.2 correct Python programs and 21.2 unit tests on average. The average length per problem is 293.2 words and the average length per program is 18.0 lines.
- the pretrained CodeT5 is finetuned with the RL-based framework described in FIG. 1 .
- the maximum source and target sequence length is set to 600 and 512 respectively.
- MBPP Benchmark is a smaller and simpler Python program synthesis dataset (described in Austin et al., Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021) (Mostly Basic Programming Problems) for evaluation. The dataset contains 974 instances with 374/90/500 instances for training/validation/testing respectively and 10 reserved for few-shot learning. The problems are typically short, usually one sentence of natural language descriptions each. Each problem is accompanied by 1 correct solution (6.8 lines of code on average) and 3 unit tests in the form of assert statements for validating the functional correctness. Unlike APPS, unit tests in MBPP are not hidden and are explicitly incorporated into the source sequences for program synthesis models.
- Example baselines include GPT2 (Radford et al., Language models are unsupervised multitask learners, OpenAI blog, 1(8):9, 2019), GPT-Neo (Black et al., GPT-NEO: Large scale autoregressive language modeling with mesh-tensorflow. URL https://doi.org/10.5281/zenodo, 5297715, 2021), and GPT3 (Brown et al., Language models are few-shot learners. Advances in neural information processing systems, 33:1877-1901, 2020) to compare with the RL-based framework described herein (referred to as “CodeRL”). The results are also compared with Codex (see Chen et al.) and AlphaCode (see Li et al.).
- results of pretrained LMs are from models finetuned on APPS using the standard loss Lce only.
- since CodeRL is model-agnostic, it can also be integrated with GPT variants such as GPT-J and GPT-Neo.
- FIG. 10 ( a ) shows that the CodeRL with the CodeT5 model can achieve significant performance gains, outperforming many pretrained LMs of much larger sizes. Specifically, CodeRL achieved new SOTA results of 2.69% pass@1, 6.81% pass@5, and 20.98% pass@1000.
- FIG. 10 ( b ) shows that when evaluating on a subset of filtered code samples, CodeRL+CodeT5 can achieve SOTA results of 8.48% 1@k and 12.62% 5@k.
- FIG. 10 ( b ) also shows that for challenging programming tasks in interview and competition levels, finetuning can significantly improve model performance. Specifically, Codex, which was not finetuned on APPS and tested in a few-shot setting, can achieve good n@1000 results, but the model fails dramatically at synthesis tasks in interview and competition levels. This observation indicates a significant gap between the pretraining stage and downstream synthesis tasks.
- FIG. 11 shows the results of CodeT5-770M trained by different approaches to estimate returns of code samples.
- the CodeRL objective with relative token-level return estimates by the critic model (Model D) can achieve the best performance on pass@1 and pass@5.
- using absolute returns without a baseline (Model B) heavily penalizes all incorrect samples (even though they might still be better than a naive baseline).
- considering relative return estimates that can effectively exploit imperfect codes can lead to better synthesis systems.
- simply assigning identical rewards to all tokens in a code sample (Model A) is disadvantageous, as these return estimates are too restrictive to be used as feedback signals for RL training.
- FIG. 12 shows the results with different combinations of Lce and Lrl. Since CodeRL is model-agnostic, experiments are performed on both CodeT5 and GPT-Neo. Note that in these experiments, Lce and Lrl are applied on models that are already warm-started/finetuned with Lce for up to 10 epochs. Firstly, when using only Lrl, the problem of vanishing gradients during finetuning was observed. Therefore, the final models actually deteriorate and lead to performance drops. Secondly, when using only Lce for further finetuning, despite improvement in losses during training time, the model performance degrades during test time. These models are thus expected to be overfitting to the training data, as similarly observed in the analysis of pretrained models in FIG. 16.
- FIG. 13 shows the ablation results of critical sampling (CS) during inference, applied on CodeT5 models. Different combinations of program refining and repairing steps are tested. Overall, positive impacts of CS, combining both program refining and repairing, are observed across all metrics, with particularly more significant gains on pass@1000. It is noted that program refining alone can help to bring performance gains, but its impact is reduced on the 1@1000 metric. Note that n@k measures the solving rate among the subset P filtered from k samples. As program refining will technically increase the size of this subset, the n@k metric will consider an exponentially larger number of choices of n samples than before. This normalizes n@k by a larger pool of candidates, resulting in less impact of program refining on model performance.
- the data experiments investigate a subset of the APPS test split, which contains the test samples of the highest difficulty level (i.e. competition programming tasks).
- FIG. 14 shows that the performance gains are quite consistent on both GPT-J and CodeT5.
- the performance gain of CodeRL is more significant on both GPT-J and CodeT5 models.
- the performance of synthesis systems is correlated with the quality of foundation models.
- FIG. 15 reports the results of CodeT5 with different configurations of model sizes, pretraining data, and pretraining objectives. For a fair comparison, all models are only finetuned/warm-started with Lce on APPS up to 12 epochs. It is observed that scaling up the number of model parameters (from 60M to 770M) can significantly improve model performance of downstream synthesis tasks.
- when the pretraining data is improved by adding the GCPY dataset (10× larger than the CSN dataset), good performance improvement may be observed, i.e., from 1.30 to 1.56 pass@1, and from 1.72 to 2.06 pass@5.
- FIG. 16 shows the performance of CodeT5 model variants by finetuning epochs and by difficulty levels of programming tasks. Note that in these experiments, the data experiments only compare among CodeT5 model variants by pretraining strategies, and hence only engage Lce in the finetuning stage on APPS. Consistent with the prior analysis, enhancing both the pretraining data (with the larger GCPY data) and the pretraining objectives (with the next-token prediction (NTP) objective) improves model performance across training epochs in general. Moreover, as noted in the analysis of learning objectives, using only Lce often leads to overfitting, typically after epoch 10 in this case. Hence, to further finetune large-scale LMs, it is beneficial to adopt the RL objective Lrl to utilize synthetic training samples and avoid overfitting.
- FIG. 17 reports the results of CodeRL+CodeT5 on the MBPP benchmark compared with finetuned GPT models of up to 137B size.
- the CodeRL+CodeT5 (ZS) was trained on APPS and then evaluated on MBPP in a zero-shot setting. It is observed that CodeRL with CodeT5 of a much smaller model size yields surprisingly good zero-shot performance, setting a new SOTA result of 63.0% pass@80 over GPT-137B's 61.4% pass@80. This validates the strong zero-shot transfer ability of CodeRL for unseen tasks.
- a common concern about transfer learning is that the source (APPS) and target (MBPP) tasks might have overlap in their training data, which could result in the source model tending to memorize these substantially similar data when applied to the target task.
- it is analyzed how many lines of code appear in both the training set of APPS and the programs of MBPP, following Austin et al. For this analysis, code comments are discarded and the whitespace is normalized for each line; lines that appear more than twice anywhere in MBPP are then excluded, as these are likely to be common Python keywords such as return and break.
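The overlap analysis above can be sketched as follows; the helper names and the `#`-comment handling are assumptions for illustration:

```python
from collections import Counter

def normalize(line):
    # discard code comments and normalize whitespace, as in the analysis above
    code = line.split("#", 1)[0]
    return " ".join(code.split())

def overlapping_lines(apps_train_lines, mbpp_lines):
    mbpp_norm = [normalize(l) for l in mbpp_lines]
    counts = Counter(l for l in mbpp_norm if l)
    # exclude lines appearing more than twice anywhere in MBPP, as these are
    # likely to be common Python keywords such as return and break
    candidates = {l for l, c in counts.items() if c <= 2}
    train_norm = {normalize(l) for l in apps_train_lines}
    return candidates & train_norm
```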
- FIG. 19 demonstrates the average percentages of generated programs per problem, grouped by their test outcomes.
- CodeT5 or CodeRL+CodeT5 is used to generate programs, and 200 generated programs per test sample in the APPS test split are randomly selected. Programs are passed through either example unit tests or hidden unit tests, and the output programs are grouped by their test outcomes. The outcomes are categorized according to the definition in Eq. (2), including CompileError, RuntimeError, FailedTest, and PassedTest.
- example tests are not as comprehensive as hidden tests and hence, limit the positive impacts of the CodeRL generation procedure due to false positives.
- FIG. 20 shows an example of a programming problem from the APPS benchmark and corresponding programs generated by CodeT5 variants.
- the CodeT5 model that is finetuned by Lce only is compared with another model that follows CodeRL framework.
- CodeRL+CodeT5 programs are shown before and after applying the CS procedure. It is observed that applying CodeRL can generate more appropriate programs and using the CS procedure further improves their functional correctness.
- the CodeT5 model misunderstands the problem and focuses on finding the greatest common divisor between a and b only. Instead, the CodeRL model avoids this mistake and tackles the problem of finding the greatest common divisor between the factorials of a and b.
- CodeRL can improve the complexity of the generated programs, an important quality in complex programming problems.
- the generated program is functionally correct but fails during execution due to a timeout error.
- This program simply computes separate factorials of both a and b, which will slow down the execution in scenarios with extremely large a or b.
- applying the CS procedure conditions the model on part of the prior program and (re)generates new tokens to produce a more efficient program.
- only the factorial of min(a,b) is computed, improving the efficiency of the program.
- the resulting final program is able to pass all hidden unit tests (including tests with extremely large values) without timeout errors.
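The improvement turns on the identity gcd(a!, b!) = min(a, b)!, since the smaller factorial divides the larger one. A minimal illustration (not the code from FIG. 20):

```python
from math import factorial, gcd

def gcd_of_factorials(a: int, b: int) -> int:
    """gcd(a!, b!) equals min(a, b)!, so only one factorial is computed."""
    return factorial(min(a, b))
```

For extremely large max(a, b) this avoids ever materializing the larger factorial, which is what eliminates the timeout described above.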
Abstract
Description
$\mathcal{L}_{ce}(\theta) = -\log p_{\theta}(W \mid D) = -\sum_{t} \log p_{\theta}(w_t \mid w_{1:t-1}, D), \qquad (1)$
where the conditional probability pθ is parameterized following the above softmax function. During inference time, models may generate sequences of programs by autoregressively sampling token ŵt from the conditional distribution pθ(wt|ŵ1:t−1,D).
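The decoding loop can be sketched as follows; the `next_token_probs` callable standing in for pθ is an assumed toy interface, not the patent's model API:

```python
import random

def sample_sequence(next_token_probs, prompt, max_len=20, eos="<eos>"):
    """Autoregressive sampling: draw each token w_t from the conditional
    distribution p(w_t | w_{1:t-1}, D) until EOS or max_len."""
    seq = list(prompt)
    for _ in range(max_len):
        probs = next_token_probs(seq)            # dict: token -> probability
        tokens, weights = zip(*probs.items())
        tok = random.choices(tokens, weights=weights, k=1)[0]
        seq.append(tok)
        if tok == eos:
            break
    return seq
```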
$\mathcal{L}_{rl}(\theta) = -\mathbb{E}_{W^s \sim p_{\theta}}\left[r(W^s)\right]$
$\nabla_{\theta} \mathcal{L}_{rl}(\theta) \approx -\mathbb{E}_{W^s \sim p_{\theta}}\left[r(W^s)\, \nabla_{\theta} \log p_{\theta}(W^s \mid D)\right]$
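For a toy one-token "program" drawn from a softmax policy, this policy-gradient estimator reduces to a few lines; the sketch below is for intuition only and is not the patent's training code:

```python
import math

def reinforce_grad(logits, sampled_idx, reward):
    """Single-sample estimate of the loss gradient: returns
    -r(W^s) * d/d(logits) log p(sampled_idx), with p = softmax(logits)."""
    z = sum(math.exp(l) for l in logits)
    probs = [math.exp(l) / z for l in logits]
    # gradient of log-softmax w.r.t. logits: onehot(sampled_idx) - probs
    return [-reward * ((1.0 if j == sampled_idx else 0.0) - probs[j])
            for j in range(len(logits))]
```

A positive reward yields a negative gradient on the sampled token's logit (gradient descent pushes it up); a negative reward pushes it down.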
$h_{pool} = \mathrm{Pooling}(\mathrm{Linear}(h_1), \ldots, \mathrm{Linear}(h_T)). \qquad (6)$
$\hat{u} = \mathrm{softmax}(h_{pool}). \qquad (7)$
$\mathcal{L}_{critic}(\phi) = -\log p_{\phi}(u \mid W^s, D). \qquad (8)$
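Eqs. (6)-(8) can be sketched as follows, with mean-pooling assumed as the pooling operator (one plausible choice; the Pooling operator is left abstract above):

```python
import math

def critic_head(hidden, W, b):
    """Eqs. (6)-(7): a shared Linear applied to each hidden state h_t,
    mean-pooled over the T positions, then softmax over outcome classes.
    hidden: T x d, W: d x k, b: length-k (k = 4 outcome classes)."""
    T, d, k = len(hidden), len(W), len(b)
    proj = [[sum(h[i] * W[i][j] for i in range(d)) + b[j] for j in range(k)]
            for h in hidden]                      # Linear(h_t), shape T x k
    h_pool = [sum(row[j] for row in proj) / T for j in range(k)]  # mean-pool
    m = max(h_pool)                               # stabilized softmax -> u_hat
    exps = [math.exp(v - m) for v in h_pool]
    z = sum(exps)
    return [e / z for e in exps]

def critic_loss(u_hat, target_idx):
    """Eq. (8): negative log-likelihood of the ground-truth outcome u."""
    return -math.log(u_hat[target_idx])
```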
$\nabla_{\theta} \mathcal{L}_{rl}(\theta) \approx -\mathbb{E}_{W^s \sim p_{\theta}}\Big[(r(W^s) - r(W^b)) \sum_{t} \hat{q}_{\phi}(w_t^s)\, \nabla_{\theta} \log p_{\theta}(w_t^s \mid w_{1:t-1}^s, D)\Big],$
where $\hat{q}_{\phi}(w_t^s)$ corresponds to the critic's predicted probability of the sub-sequence up to t passing the unit tests, and $W^b$ denotes a baseline sequence. The position tmax corresponds to the highest critic-assigned value, and the sub-sequence 543 to the left of position tmax is used as the seed 545 for the next stage. If this seed sequence up to tmax contains a token with pϕ
Each program is assigned a value $\hat{q}_{\phi}$, corresponding to the critic's predicted probability of the program passing the unit tests. The top M failed programs 565 with the highest predicted probabilities are selected and passed to a program repair model 566.
$\mathcal{L}_{ce}^{repair}(\theta) = -\sum_{t} \log p_{\theta}(w_t \mid w_{1:t-1}, D, \hat{W}, u, c),$
where $\hat{W}$ is the selected failed program, u is one of {CompileError, RuntimeError, FailedTest, PassedTest}, and c is the error subtype. During inference time, each selected failed sequence can be stacked N/M times for upsampling. This yields the same number N of output programs as in the first round of generation. Finally, these N repaired programs generated by the program repairing model 561 may be passed to module 543 to apply the code refining procedure 550 as described above.
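The selection-and-upsampling step can be sketched as follows (assumed helper name; the scores stand for the critic's predicted pass probabilities):

```python
def upsample_for_repair(failed_programs, scores, M, N):
    """Select the top-M failed programs by critic score, then stack each
    N // M times so the repair round again yields N candidate programs."""
    ranked = sorted(zip(failed_programs, scores), key=lambda pair: -pair[1])
    top_m = [program for program, _ in ranked[:M]]
    return [program for program in top_m for _ in range(N // M)]
```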
Claims (20)
Priority Applications (5)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US17/896,946 US12430585B2 (en) | 2022-05-23 | 2022-08-26 | Systems and methods for program synthesis |
| PCT/US2023/022994 WO2023229946A1 (en) | 2022-05-23 | 2023-05-19 | Systems and methods for program synthesis |
| JP2024569393A JP2025520071A (en) | 2022-05-23 | 2023-05-19 | Systems and methods for program synthesis |
| EP23731466.1A EP4529650A1 (en) | 2022-05-23 | 2023-05-19 | Systems and methods for program synthesis |
| CN202380041792.4A CN119234225A (en) | 2022-05-23 | 2023-05-19 | System and method for program synthesis |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202263344900P | 2022-05-23 | 2022-05-23 | |
| US17/896,946 US12430585B2 (en) | 2022-05-23 | 2022-08-26 | Systems and methods for program synthesis |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US20230376841A1 (en) | 2023-11-23 |
| US12430585B2 (en) | 2025-09-30 |
Family
ID=88791759
Family Applications (2)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US17/896,942 Active 2043-12-26 US12430584B2 (en) | 2022-05-23 | 2022-08-26 | Systems and methods for program synthesis |
| US17/896,946 Active 2043-12-14 US12430585B2 (en) | 2022-05-23 | 2022-08-26 | Systems and methods for program synthesis |
Family Applications Before (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US17/896,942 Active 2043-12-26 US12430584B2 (en) | 2022-05-23 | 2022-08-26 | Systems and methods for program synthesis |
Country Status (1)
| Country | Link |
|---|---|
| US (2) | US12430584B2 (en) |
Families Citing this family (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US12524616B2 (en) * | 2022-10-28 | 2026-01-13 | Microsoft Technology Licensing, Llc | Generation of interactive utterances of code tasks |
| CN120476406A (en) * | 2023-05-10 | 2025-08-12 | 谷歌有限责任公司 | Improved training of large neural networks |
| US12524214B1 (en) | 2023-06-30 | 2026-01-13 | Amazon Technologies, Inc. | Automated error troubleshooting via generative AI software development assistant |
| US12530173B1 (en) | 2023-06-30 | 2026-01-20 | Amazon Technologies, Inc. | Graphical user interface for generative AI software development assistant |
| US12236193B1 (en) * | 2024-03-15 | 2025-02-25 | CAST AI Group, Inc. | Automated selection of large language models in cloud computing environments |
| WO2025234064A1 (en) * | 2024-05-09 | 2025-11-13 | Ntt株式会社 | Information processing device, information processing method, and program |
| CN118446322B (en) * | 2024-06-28 | 2025-02-11 | 北京科技大学 | A method and device for controlling reasoning state based on prior knowledge of large language model |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20020199168A1 (en) * | 2001-06-25 | 2002-12-26 | Mitsubishi Denki Kabushiki Kaisha | Automatic program generation apparatus |
| US20080072100A1 (en) * | 2006-06-05 | 2008-03-20 | International Business Machines Corporation | Generating functional test scripts |
| US20130219374A1 (en) | 2012-02-17 | 2013-08-22 | Infosys Technologies, Ltd. | System and method for automated and objective assessment of programming language code |
| US20230176829A1 (en) | 2021-12-07 | 2023-06-08 | Microsoft Technology Licensing, Llc | Multi-modal program inference |
| US20230244452A1 (en) * | 2022-02-02 | 2023-08-03 | Deepmind Technologies Limited | Computer code generation from task descriptions using neural networks |
| US20230280989A1 (en) * | 2022-03-04 | 2023-09-07 | Microsoft Technology Licensing, Llc | Synthesizing a computer program to include idiomatic function(s) and semantically-meaningful variable(s) using programming by example |
| US11797839B2 (en) * | 2017-10-27 | 2023-10-24 | Google Llc | Training neural networks using priority queues |
| US11941373B2 (en) | 2021-12-17 | 2024-03-26 | Microsoft Technology Licensing, Llc. | Code generation through reinforcement learning using code-quality rewards |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10795645B2 (en) | 2017-03-27 | 2020-10-06 | Microsoft Technology Licensing, Llc | Neural network for program synthesis |
2022
- 2022-08-26 US US17/896,942 patent/US12430584B2/en active Active
- 2022-08-26 US US17/896,946 patent/US12430585B2/en active Active
Non-Patent Citations (10)
| Title |
|---|
| Bahdanau et al., "An Actor-Critic Algorithm for Sequence Prediction", ICLR 2017 Conference Paper, arXiv 1607.07086v3, Mar. 3, 2017, pp. 1-17. |
| Bunel et al., "Leveraging Grammar and Reinforcement Learning for Neural Program Synthesis", ICLR 2018 Conference Paper, arXiv 1805.04276v2, May 22, 2018, pp. 1-15. |
| Elhattami, "Beyond Codex: A Code Generation Model That You Can Train", Towards Data Science, Nov. 23, 2021, Retrieved from the internet: URL:https://towardsdatascience.com/beyond-codex-a-code-generation-model-that-you-can-train-6ac9bdcba07f, pp. 1-18. |
| International Search Report and Written Opinion for PCT/US2023/022994, dated Aug. 8, 2023, 11 pages. |
| Non-Final Office Action for U.S. Appl. No. 17/896,942, dated Mar. 13, 2025, 21 pages. |
| Sanchez-Stern et al., "Generating correctness proofs with neural networks", Proceedings of the 4th ACM Sigplan International Workshop on Machine Learning and Programming Languages, Acmpub27, New York, NY, USA, Jun. 15, 2020, pp. 1-10, XP058452215. |
| Shin et al., "Synthetic Datasets for Neural Program Synthesis", ICLR 2019 Conference Paper, arXiv 1912.12345v1, Dec. 27, 2019, pp. 1-16. |
| Wang et al., "Automating Reinforcement Learning Architecture Design for Code Optimization, Association for Computing Machinery", 2022 Association for Computing Machinery, ACM ISBN 978-1-4503-9183-2, Feb. 22, 2004, pp. 129-143. |
| Xu et al., "Neural Program Synthesis by Self-Learning", arXiv 1910.05865v1, Oct. 13, 2019, pp. 1-11. |
| Yang et al., "Program Synthesis Guided Reinforcement Learning for Partially Observed Environments ", 35th Conference of Neural Information Processing Systems, 2021, pp. 1-15. |
Also Published As
| Publication number | Publication date |
|---|---|
| US20230376841A1 (en) | 2023-11-23 |
| US20230376840A1 (en) | 2023-11-23 |
| US12430584B2 (en) | 2025-09-30 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12430585B2 (en) | Systems and methods for program synthesis | |
| US20240419411A1 (en) | Systems and methods for a conversational framework of program synthesis | |
| US11562147B2 (en) | Unified vision and dialogue transformer with BERT | |
| US11893060B2 (en) | Latent question reformulation and information accumulation for multi-hop machine reading | |
| US11182395B2 (en) | Similarity matching systems and methods for record linkage | |
| US20220343084A1 (en) | Translation apparatus, translation method and program | |
| US20240249113A1 (en) | Systems and methods for semantic parsing with execution for answering questions of varying complexity from unstructured text | |
| US10902350B2 (en) | System and method for relationship identification | |
| US12406154B2 (en) | Systems and methods for search based neural text generation models | |
| US20250103300A1 (en) | Systems and methods for iterative code generation with large language models and representative sub-modules | |
| WO2023229946A1 (en) | Systems and methods for program synthesis | |
| US20250323822A1 (en) | Real-time monitoring ecosystem | |
| US20240428079A1 (en) | Systems and methods for training a language model for code generation | |
| US12456013B2 (en) | Systems and methods for training a neural network model using knowledge from pre-trained large language models | |
| US20240428044A1 (en) | Systems and methods for retrieval based question answering using neural network models | |
| US20240249082A1 (en) | Systems and methods for text simplification with document-level context | |
| EP4529650A1 (en) | Systems and methods for program synthesis | |
| US20250111198A1 (en) | Systems and Methods for Constrained Text Generation Using Large Language Models | |
| US20250378323A1 (en) | Systems and methods for alignment of neural network based models | |
| US20240394539A1 (en) | Systems and methods for factual natural language processing | |
| US20250322229A1 (en) | System and method for mitigating biases during training of a machine learning model | |
| US20250322293A1 (en) | System and method for mitigating biases in a training dataset for a machine learning model in pre-processing | |
| US20250322230A1 (en) | System and method for mitigating biases in a machine learning model during post-processing | |
| US20260044319A1 (en) | Systems and methods for building a code generation agent | |
| US20250384244A1 (en) | Systems and methods for constructing neural networks |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | FEPP | Fee payment procedure | Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| | AS | Assignment | Owner name: SALESFORCE, INC., CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LE, HUNG;WANG, YUE;GOTMARE, AKHILESH DEEPAK;AND OTHERS;SIGNING DATES FROM 20220528 TO 20220530;REEL/FRAME:061025/0251 |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
| | AS | Assignment | Owner name: SALESFORCE, INC., CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LE, HUNG;WANG, YUE;GOTMARE, AKHILESH DEEPAK;AND OTHERS;SIGNING DATES FROM 20230113 TO 20230115;REEL/FRAME:072115/0879 |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: AWAITING TC RESP, ISSUE FEE PAYMENT VERIFIED |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
| | STCF | Information on status: patent grant | Free format text: PATENTED CASE |