CN112905188A - Code translation method and system based on a generative adversarial network (GAN) - Google Patents

Code translation method and system based on a generative adversarial network (GAN)

Info

Publication number
CN112905188A
Authority
CN
China
Prior art keywords
program
sample
code
gan network
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110162786.7A
Other languages
Chinese (zh)
Inventor
杨永全
王占威
魏志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ocean University of China
Qingdao National Laboratory for Marine Science and Technology Development Center
Original Assignee
Ocean University of China
Qingdao National Laboratory for Marine Science and Technology Development Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ocean University of China, Qingdao National Laboratory for Marine Science and Technology Development Center filed Critical Ocean University of China
Priority to CN202110162786.7A
Publication of CN112905188A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/51 Source to source
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/70 Software maintenance or management
    • G06F 8/76 Adapting program code to run in a different environment; Porting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a code translation method and system based on a generative adversarial network (GAN). The method comprises the following steps: a sample program set is fed as input data into a code feature extraction model to obtain a first code feature value vector for each sample program; the first code feature value vector is input into the generator of the GAN network model, and a prediction result in the target language is obtained by setting the relevant parameters; the feature value vector of the prediction result is compared with the feature value vector of the test-set source program, and iterative training is carried out continuously to obtain the optimal GAN network model. A program to be translated is then acquired, code translation is performed with the optimal GAN network model, and the target translation program corresponding to the program to be translated is obtained. The GAN-based code translation method can convert source code from one high-level programming language into another high-level programming language with high accuracy.

Description

Code translation method and system based on a generative adversarial network (GAN)
Technical Field
The present invention relates to the field of machine learning technologies, and in particular to a code translation method and system based on a generative adversarial network (GAN).
Background
In recent years, neural machine translation has achieved great success on some human languages by relying on large-scale parallel corpora. However, commonly used natural language translation models do not account for characteristics of code such as arbitrarily defined user variables and the variety of operator symbols, so applying existing models directly to code translation gives unsatisfactory results. Since Goodfellow proposed it in 2014, the GAN has, through five or six years of development, become a widely used deep learning model. The generative adversarial network is inspired by the two-player zero-sum game (the sum of the gains and losses of the two players is always zero; one player's gain is the other's loss): the model comprises a generator and a discriminator that are trained against each other continuously, learning and evolving from each other.
Code translation based on generative adversarial networks is an important software reuse technique. Code and software written in earlier programming languages can be called legacy code and legacy software. Intelligent source code conversion means converting a program from one programming language to another such that the two programs behave exactly the same after conversion. With GAN-based code translation, legacy code and legacy software can be converted into a high-level programming language in current use, which is of great significance and value for upgrading and improving the original legacy system: it avoids the development cost, development cycle and security risk of building a new system from scratch after abandoning the original one, while inheriting the strengths of the original system, saving development cost, improving efficiency and reducing risk. Moreover, during further development, optimization and upgrading of the converted code, the technology and methods offered by the current high-level programming language can be better exploited to build a more complete and powerful system.
In addition, because of the heterogeneity of computer hardware such as CPUs, the programming languages and runtime environments supported by a system differ, and a program cannot run smoothly on every platform. For example, a C++ program written for the x86 platform may not run on processors of a different architecture (e.g., the self-developed ARM-based M1 processor, or processors based on other architectures). In such cases the code must be migrated to an appropriate high-level programming language. In traditional workflows, manual code conversion is often a huge and tedious workload. GAN-based code translation can quickly rewrite and port software code, preserving the software's functionality while greatly reducing the waste of manpower, material resources and other resources.
Although code written in a high-level programming language shares many features with natural language, since both consist of sequences of tokens and can be represented as syntax trees, existing natural language translation techniques are difficult to apply in the field of code translation because code has a stronger logical structure than natural language, contains custom identifiers, and exhibits long-distance dependencies between identifiers.
Therefore, a code translation method based on a generative adversarial network (GAN) is needed.
Disclosure of Invention
The invention provides a code translation method and system based on a generative adversarial network (GAN), which aim to solve the problem of how to realize code translation quickly and accurately.
In order to solve the above problem, according to one aspect of the present invention, there is provided a code translation method based on a generative adversarial network (GAN), the method comprising:
acquiring a sample program set, and extracting a code characteristic value vector of each sample program in the sample program set by using a code characteristic extraction model to acquire a first code characteristic value vector corresponding to each sample program;
inputting the first code characteristic value vector corresponding to each sample program into a generator of the current GAN network model, so as to obtain a target program corresponding to each sample program by using the generator;
sending the target program corresponding to each sample program to the code characteristic extraction model through a discriminator of the current GAN network to extract code characteristic value vectors so as to obtain second code characteristic value vectors corresponding to each target program;
respectively comparing the first code characteristic value vector and the second code characteristic value vector corresponding to each sample program by using a discriminator of the current GAN network to obtain a comparison result corresponding to each sample program;
if the current GAN network model can distinguish the sample program and the target program according to the comparison result corresponding to each sample program, alternately training a discriminator and a generator of the current GAN network model until the trained GAN network model cannot distinguish the sample program and the target program, and determining that the current GAN network model is the optimal GAN network model;
and acquiring a program to be translated, translating codes by using the optimal GAN network model, and acquiring a target translation program corresponding to the program to be translated.
Preferably, the code feature extraction model is based on the ASTNN idea of cutting the abstract syntax tree (AST): the AST is cut into a plurality of subtrees at the AST nodes that represent the program control blocks corresponding to preset characters; the subtrees are convolved with a TBCNN; a bidirectional recurrent neural network whose internal neurons are LSTM cells is then used to extract the sequence information between code blocks; the code vector generated at each time step is stored in a vector matrix; and finally maximum pooling is applied to obtain the code feature value vector.
Preferably, the preset characters include: "if", "while", "for", and "function".
Preferably, if it is determined that the current GAN network model can distinguish the sample program from the target program according to the comparison result corresponding to each sample program, the alternately training the discriminator and the generator of the current GAN network model until the trained GAN network model cannot distinguish the sample program from the target program, and determining that the current GAN network model is the optimal GAN network model includes:
if it is determined according to the comparison result corresponding to each sample program that the current GAN network model can distinguish the sample programs from the target programs, locking the generator and updating the parameters of the discriminator, saving the parameters of the discriminator after the discriminator has been trained consecutively a preset number of times, sampling data with the generator to obtain new samples, locking the discriminator without updating its network parameters, inputting the new samples into the discriminator again, calculating the loss, back-propagating it into the generator for one round of training and updating the parameters of the generator, and recalculating the comparison result, until the current GAN network model cannot distinguish the sample programs from the target programs according to the comparison result, indicating that the distributions of the feature value vectors at the discriminator tend to be consistent and that the model has converged, and determining that the current GAN network model is the optimal GAN network model.
Preferably, the optimization goal of the GAN network model is:
$$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

where $\max_D$ indicates that the discriminator D tries to maximize the objective function V(D, G), and $\min_G$ indicates that the generator G tries to minimize it; $x \sim p_{data}(x)$ means that x follows the data distribution $p_{data}$ of the real program code, and $z \sim p_z(z)$ means that z follows the data distribution $p_z$ of the generated program code. D(x) denotes the discrimination of a real sample; the closer the result is to 1 the better, so its loss term is $\log D(x)$. z is the input sample program and G(z) denotes the generated sample; for the judgment D(G(z)) of the generated sample, since the output of the discriminator lies between 0 and 1, D(G(z)) should be as small as possible, so that $1 - D(G(z))$ approaches 1, its logarithm approaches 0, and the whole term approaches 0, realizing the zero-sum game; the discriminator therefore wants this objective to be as large as possible.
According to another aspect of the present invention, there is provided a code translation system based on a generative adversarial network (GAN), the system comprising:
the system comprises a first code characteristic value vector acquisition unit, a second code characteristic value vector acquisition unit and a third code characteristic value vector acquisition unit, wherein the first code characteristic value vector acquisition unit is used for acquiring a sample program set and extracting a code characteristic value vector of each sample program in the sample program set by using a code characteristic extraction model so as to acquire a first code characteristic value vector corresponding to each sample program;
the target program generating unit is used for inputting the first code characteristic value vector corresponding to each sample program into a generator of the current GAN network model so as to obtain the target program corresponding to each sample program by using the generator;
a second code characteristic value vector obtaining unit, configured to send, through a discriminator of a current GAN network, a target program corresponding to each sample program to the code characteristic extraction model to perform extraction of a code characteristic value vector, so as to obtain a second code characteristic value vector corresponding to each target program;
a comparison result obtaining unit, configured to compare the first code characteristic value vector and the second code characteristic value vector corresponding to each sample program respectively by using a discriminator of a current GAN network, and obtain a comparison result corresponding to each sample program;
the optimal GAN network model determining unit is used for alternately training a discriminator and a generator of the current GAN network model if the current GAN network model can distinguish the sample program and the target program according to the comparison result corresponding to each sample program, and determining the current GAN network model as the optimal GAN network model until the trained GAN network model can not distinguish the sample program and the target program;
and the target translation program acquisition unit is used for acquiring a program to be translated, translating the code by using the optimal GAN network model and acquiring a target translation program corresponding to the program to be translated.
Preferably, the code feature extraction model is based on the ASTNN idea of cutting the abstract syntax tree (AST): the AST is cut into a plurality of subtrees at the AST nodes that represent the program control blocks corresponding to preset characters; the subtrees are convolved with a TBCNN; a bidirectional recurrent neural network whose internal neurons are LSTM cells is then used to extract the sequence information between code blocks; the code vector generated at each time step is stored in a vector matrix; and finally maximum pooling is applied to obtain the code feature value vector.
Preferably, the preset characters include: "if", "while", "for", and "function".
Preferably, if it is determined that the current GAN network model can distinguish the sample program from the target program according to the comparison result corresponding to each sample program, the optimal GAN network model determining unit alternately trains the discriminator and the generator of the current GAN network model until the trained GAN network model cannot distinguish the sample program from the target program, and determines that the current GAN network model is the optimal GAN network model, including:
if it is determined according to the comparison result corresponding to each sample program that the current GAN network model can distinguish the sample programs from the target programs, locking the generator and updating the parameters of the discriminator, saving the parameters of the discriminator after the discriminator has been trained consecutively a preset number of times, sampling data with the generator to obtain new samples, locking the discriminator without updating its network parameters, inputting the new samples into the discriminator again, calculating the loss, back-propagating it into the generator for one round of training and updating the parameters of the generator, and recalculating the comparison result, until the current GAN network model cannot distinguish the sample programs from the target programs according to the comparison result, indicating that the distributions of the feature value vectors at the discriminator tend to be consistent and that the model has converged, and determining that the current GAN network model is the optimal GAN network model.
Preferably, the optimization goal of the GAN network model is:
$$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

where $\max_D$ indicates that the discriminator D tries to maximize the objective function V(D, G), and $\min_G$ indicates that the generator G tries to minimize it; $x \sim p_{data}(x)$ means that x follows the data distribution $p_{data}$ of the real program code, and $z \sim p_z(z)$ means that z follows the data distribution $p_z$ of the generated program code. D(x) denotes the discrimination of a real sample; the closer the result is to 1 the better, so its loss term is $\log D(x)$. z is the input sample program and G(z) denotes the generated sample; for the judgment D(G(z)) of the generated sample, since the output of the discriminator lies between 0 and 1, D(G(z)) should be as small as possible, so that $1 - D(G(z))$ approaches 1, its logarithm approaches 0, and the whole term approaches 0, realizing the zero-sum game; the discriminator therefore wants this objective to be as large as possible.
The invention provides a code translation method and system based on a generative adversarial network (GAN), which use generative adversarial network technology to realize source code translation and conversion intelligently and have better applicability and reliability than conventional rule-based translation methods and systems; meanwhile, GAN-based code translation can easily be extended to any language, realizing translation and conversion among multiple languages, and is of strong practical significance.
Drawings
A more complete understanding of exemplary embodiments of the present invention may be had by reference to the following drawings in which:
fig. 1 is a flow diagram of a code translation method 100 based on a generative adversarial network (GAN) according to an embodiment of the present invention;
FIG. 2 is a flow diagram of the operation of a generator according to an embodiment of the invention;
FIG. 3 is a flowchart of the operation of a discriminator according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a GAN network learning data distribution process according to an embodiment of the present invention;
FIG. 5 is a flow diagram of an alternate training of generators and discriminators according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a code translation system 600 based on a generative adversarial network (GAN) according to an embodiment of the present invention.
Detailed Description
The exemplary embodiments of the present invention will now be described with reference to the accompanying drawings; however, the present invention may be embodied in many different forms and is not limited to the embodiments described herein, which are provided for a full and complete disclosure of the invention and to convey its scope fully to those skilled in the art. The terminology used in the exemplary embodiments illustrated in the accompanying drawings is not intended to limit the invention. In the drawings, the same units/elements are denoted by the same reference numerals.
Unless otherwise defined, terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Further, it will be understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense.
Fig. 1 is a flowchart of a code translation method 100 based on a generative adversarial network (GAN) according to an embodiment of the present invention. As shown in fig. 1, the GAN-based code translation method provided by the embodiment of the present invention uses generative adversarial network technology to realize source code translation and conversion intelligently, and has better applicability and reliability than conventional rule-based translation methods. The invention learns from a large number of source code samples, grasps the commonalities and characteristics of different programming languages, and automatically maps code instructions with similar semantics to the same latent space, thereby realizing mutual translation of instructions between different languages; these mapping relationships rely on automatic learning by the model rather than manual definition, greatly saving the labor and time cost required for code translation. Moreover, GAN-based code translation can easily be extended to any language, realizing translation among multiple languages, and is of strong practical significance. The code translation method 100 based on a generative adversarial network (GAN) according to the embodiment of the present invention starts from step 101: in step 101, a sample program set is obtained, and a code feature extraction model is used to extract a code feature value vector for each sample program in the set, so as to obtain the first code feature value vector corresponding to each sample program.
Preferably, the code feature extraction model is based on the ASTNN idea of cutting the abstract syntax tree (AST): the AST is cut into a plurality of subtrees at the AST nodes that represent the program control blocks corresponding to preset characters; the subtrees are convolved with a TBCNN; a bidirectional recurrent neural network whose internal neurons are LSTM cells is then used to extract the sequence information between code blocks; the code vector generated at each time step is stored in a vector matrix; and finally maximum pooling is applied to obtain the code feature value vector.
Preferably, the preset characters include: "if", "while", "for", and "function".
In the invention, a sample program set is first obtained, and each sample program in the set is put into the code feature extraction model to generate the corresponding first code feature value vector. The code feature extraction model of the present invention is based on the ASTNN idea of cutting the AST: instead of processing the entire AST directly, the model cuts the AST into a series of subtrees at the 4 AST node types that represent program control blocks, i.e., "if", "while", "for" and "function", and then performs a convolution operation on the subtrees using a TBCNN. A bidirectional recurrent neural network whose internal neurons are LSTM (long short-term memory) cells then extracts the sequence information between code blocks; the code vector generated at each time step is stored in a vector matrix, and finally maximum pooling is applied to obtain the code feature value vector, as sketched below.
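A minimal sketch of how such a feature extractor could be organized is shown below (PyTorch-style code; all module and parameter names are hypothetical, and the TBCNN subtree encoder is reduced to a plain embedding lookup for brevity):

```python
import torch
import torch.nn as nn

class CodeFeatureExtractor(nn.Module):
    """ASTNN-style encoder: statement subtrees -> BiLSTM -> max pooling."""

    def __init__(self, subtree_vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        # Stand-in for the TBCNN subtree encoder: each subtree obtained by
        # cutting the AST at "if"/"while"/"for"/"function" nodes is reduced
        # to an id and embedded.
        self.subtree_embed = nn.Embedding(subtree_vocab_size, embed_dim)
        # Bidirectional recurrent network with LSTM neurons captures the
        # sequence information between code blocks.
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, subtree_ids):
        # subtree_ids: (batch, num_subtrees)
        x = self.subtree_embed(subtree_ids)      # (batch, T, embed_dim)
        steps, _ = self.bilstm(x)                # (batch, T, 2 * hidden_dim)
        # Max pooling over the time steps yields the code feature value vector.
        code_vector, _ = steps.max(dim=1)        # (batch, 2 * hidden_dim)
        return code_vector
```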
In step 102, the first code feature value vector corresponding to each sample program is input into a generator of the current GAN network model, so as to obtain a target program corresponding to each sample program by using the generator.
In the invention, an initial GAN network model is constructed, and the network parameters of the generator and the discriminator in the GAN network model are set. The obtained first code feature value vector is then input into the generator of the current GAN network model to generate the target program corresponding to each sample program. The workflow of the generator is shown in fig. 2.
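As a rough illustration only (the patent does not specify the generator architecture), a generator conditioned on the code feature value vector might decode a sequence of target-language tokens like this (hypothetical names and sizes):

```python
import torch
import torch.nn as nn

class CodeGenerator(nn.Module):
    """Sketch: decode a code feature value vector into target-language token logits."""

    def __init__(self, feature_dim, target_vocab_size, hidden_dim=256, max_len=256):
        super().__init__()
        self.max_len = max_len
        self.init_state = nn.Linear(feature_dim, hidden_dim)
        self.embed = nn.Embedding(target_vocab_size, hidden_dim)
        self.cell = nn.GRUCell(hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, target_vocab_size)

    def forward(self, code_vector, bos_id=1):
        batch = code_vector.size(0)
        h = torch.tanh(self.init_state(code_vector))   # condition on the feature vector
        token = torch.full((batch,), bos_id, dtype=torch.long,
                           device=code_vector.device)
        logits = []
        for _ in range(self.max_len):
            h = self.cell(self.embed(token), h)
            step = self.out(h)
            logits.append(step)
            token = step.argmax(dim=-1)                # greedy decoding in this sketch
        return torch.stack(logits, dim=1)              # (batch, max_len, vocab)
```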
In step 103, the target program corresponding to each sample program is sent to the code feature extraction model through the discriminator of the current GAN network to extract a code feature value vector, so as to obtain a second code feature value vector corresponding to each target program.
In step 104, the discriminator of the current GAN network is used to compare the first code feature value vector and the second code feature value vector corresponding to each sample program, so as to obtain the comparison result corresponding to each sample program.
In the invention, the discriminator of the GAN network sends the target program corresponding to each sample program to the code feature extraction model to extract the code feature value vector, so as to obtain the second code feature value vector corresponding to each target program. The discriminator then compares the first code feature value vector and the second code feature value vector corresponding to each program sample to obtain the comparison result for each program sample. The workflow of the discriminator is shown in fig. 3.
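A minimal sketch of such a discriminator is given below (hypothetical layer sizes; it scores a single code feature value vector, so "comparing" vectors A and B amounts to checking whether the two vectors receive distinguishable scores):

```python
import torch.nn as nn

class CodeDiscriminator(nn.Module):
    """Sketch: score how likely a code feature value vector is to come from a real program."""

    def __init__(self, feature_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),   # output between 0 and 1, matching D(x) in the objective
        )

    def forward(self, feature_vector):
        # feature_vector: (batch, feature_dim); returns (batch,) scores in (0, 1)
        return self.net(feature_vector).squeeze(-1)
```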
In step 105, if it is determined that the current GAN network model can distinguish the sample program and the target program according to the comparison result corresponding to each sample program, the discriminator and the generator of the current GAN network model are alternately trained until the trained GAN network model cannot distinguish the sample program and the target program, and the current GAN network model is determined to be the optimal GAN network model.
Preferably, if it is determined that the current GAN network model can distinguish the sample program from the target program according to the comparison result corresponding to each sample program, the alternately training the discriminator and the generator of the current GAN network model until the trained GAN network model cannot distinguish the sample program from the target program, and determining that the current GAN network model is the optimal GAN network model includes:
if it is determined according to the comparison result corresponding to each sample program that the current GAN network model can distinguish the sample programs from the target programs, locking the generator and updating the parameters of the discriminator, saving the parameters of the discriminator after the discriminator has been trained consecutively a preset number of times, sampling data with the generator to obtain new samples, locking the discriminator without updating its network parameters, inputting the new samples into the discriminator again, calculating the loss, back-propagating it into the generator for one round of training and updating the parameters of the generator, and recalculating the comparison result, until the current GAN network model cannot distinguish the sample programs from the target programs according to the comparison result, indicating that the distributions of the feature value vectors at the discriminator tend to be consistent and that the model has converged, and determining that the current GAN network model is the optimal GAN network model.
Preferably, the optimization goal of the GAN network model is:
$$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

where $\max_D$ indicates that the discriminator D tries to maximize the objective function V(D, G), and $\min_G$ indicates that the generator G tries to minimize it; $x \sim p_{data}(x)$ means that x follows the data distribution $p_{data}$ of the real program code, and $z \sim p_z(z)$ means that z follows the data distribution $p_z$ of the generated program code. D(x) denotes the discrimination of a real sample; the closer the result is to 1 the better, so its loss term is $\log D(x)$. z is the input sample program and G(z) denotes the generated sample; for the judgment D(G(z)) of the generated sample, since the output of the discriminator lies between 0 and 1, D(G(z)) should be as small as possible, so that $1 - D(G(z))$ approaches 1, its logarithm approaches 0, and the whole term approaches 0, realizing the zero-sum game; the discriminator therefore wants this objective to be as large as possible.
In the invention, the obtained comparison result is analysed; if the comparison result indicates a large difference, the parameters of the generator and the discriminator are adjusted and optimized, a suitable loss function and algorithm are selected, and the similarity between the data generated by the generative network and the real data is improved as much as possible. The similarity between them can be measured using objective function formula (1) below. This formula embodies the game between the generator and the discriminator: the generator wants the value to be as small as possible, while the discriminator wants it to be as large as possible, i.e. the discriminator D tries to maximize the objective function V(D, G) and the generator G tries to minimize it. In the formula, $x \sim p_{data}(x)$ means that x follows the data distribution $p_{data}$ of the real program code, and $z \sim p_z(z)$ means that z follows the data distribution $p_z$ of the generated program code.
$$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \qquad (1)$$
Since the GAN network is trained alternately and iteratively, this objective function is also optimized separately for the discriminator and the generator. First, the generator G is fixed and the discriminator D is optimized, in the following form:
$$\max_D V(D,G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
where D(x) denotes the discrimination of a real program sample: if D(x) approaches 1, the discrimination is good, and the term log(D(x)) approaches 0. z is an input program sample and G(z) is the generated sample; for the judgment D(G(z)) of the generated sample, since the output of the discriminator lies between 0 and 1, D(G(z)) should be as small as possible, so that 1 - D(G(z)) approaches 1 and its logarithm approaches 0; the whole expression thus approaches 0, realizing the zero-sum game, and the discriminator wants this objective to be as large as possible.
The generator is then optimized, in the following form:
$$\min_G V(D,G) = \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
here, the optimization of the generative model is simple, and since the discriminator D is fixed, it is only necessary to make the result D (g (z)) of discrimination close to 1.
Fig. 4 is a schematic diagram of the process by which the generative adversarial network learns the data distribution. The dotted line with larger dots is the real program sample distribution $p_{data}$, the solid line is the generated target program sample distribution $p_z$, and the dotted line with smaller dots is the probability D(x) that a sample is judged to be real. The figure shows the solid line gradually approaching the large-dot dotted line. Panel (a) shows the initial state of $p_z$ and $p_{data}$ when the adversarial network is close to convergence. Panel (b) shows the generator parameters fixed and the discriminator improved until $D^*(x)$ converges to

$$D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_z(x)}.$$

Panel (c) shows the discriminator parameters fixed and the generator improved until the discriminator cannot tell real data from fake. Panel (d) shows that, after multiple iterations, $p_z$ and $p_{data}$ coincide completely and D(x) = 0.5; the discriminator can no longer distinguish real data from fake, and the generator can forge realistic data.
Because the generative network and the discriminative network are two completely independent convolutional neural networks, fully optimizing the discriminator on its own in each iteration of training is difficult to realize and easily over-fits a limited data set. In the present invention, therefore, training the discriminator K times in succession is alternated with training the generator once; as long as the generator G changes slowly enough, the discriminator D can keep approaching the optimal solution.
With reference to fig. 5, when optimizing the model in the present invention, the feature value vectors of the two types of samples are input into the discriminator for comparison and the loss is calculated; the generator is locked, the network parameters of the discriminator are updated, and after the discriminator D has been trained K times in succession its parameters are saved. Data are then sampled from the generator G to obtain target samples; the discriminator is locked and its network parameters are not updated; the target samples are input into the discriminator again, the loss is calculated and back-propagated into the generator for training, and the network parameters of the generator are updated. This process is repeated; when the distributions of the feature value vectors of the sample programs and the target programs at the discriminator tend to be consistent, the model has converged, and the GAN network model at that point is the optimal GAN network model.
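This alternation might look roughly as follows (a simplified sketch reusing the loss helpers above; `generate_and_reencode` is a hypothetical callable standing in for "the generator decodes a target program and the feature extractor re-encodes it", since back-propagating through the discrete decoding step in practice needs a differentiable relaxation or policy-gradient training, which is omitted here):

```python
import torch

def train_gan(generate_and_reencode, discriminator, real_vec_batches,
              g_opt, d_opt, k_steps=5, epochs=10):
    """Alternate K consecutive discriminator updates with one generator update."""
    for _ in range(epochs):
        for real_vec in real_vec_batches:           # first code feature value vectors
            # --- train the discriminator K times with the generator locked ---
            for _ in range(k_steps):
                with torch.no_grad():
                    fake_vec = generate_and_reencode(real_vec)   # second feature vectors
                d_opt.zero_grad()
                loss_d = discriminator_loss(discriminator(real_vec),
                                            discriminator(fake_vec))
                loss_d.backward()
                d_opt.step()

            # --- train the generator once with the discriminator locked ---
            g_opt.zero_grad()
            fake_vec = generate_and_reencode(real_vec)
            loss_g = generator_loss(discriminator(fake_vec))
            loss_g.backward()
            g_opt.step()
```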
In step 106, a program to be translated is obtained, code translation is performed by using the optimal GAN network model, and a target translation program corresponding to the program to be translated is obtained.
According to the code translation method, the generator model, the discriminator model and the code feature extraction model are connected in one network and trained against each other within it, improving the translation quality and finally yielding a better translation result, so that source code can be converted from one high-level programming language (such as C++ or Python) into another high-level programming language (such as C or Java).
The specific process of the code translation system based on the generative adversarial network is as follows. A large number of programs in different high-level programming languages are collected to form a parallel corpus used as the training set of the translation model; the many keywords shared among the programming languages are retained, and corresponding fragments of the different high-level languages are mapped into a similar latent space. The server feeds the selected training set as input data into the automatic code feature extraction model based on convolutional and recurrent neural networks to obtain a feature value vector A of the input data, which serves as the input vector of the generator to generate the target program code. The discriminator puts the target program code generated by the generator into the automatic code feature extraction model to generate the corresponding code feature value vector B. The discriminator then begins to compare the two groups of code feature value vectors A and B. If the generated vectors cannot fool the discriminator, the discriminator first optimizes its own algorithm so as to distinguish the two groups of feature value vectors as well as possible; after K such optimizations, the generator is optimized in turn so that the feature value vector B it produces can fool the discriminator. Through continuous alternating training and optimization of the generator and the discriminator, the feature value vector B of the target program code obtained by the generator eventually cannot be distinguished from the feature value vector A of the source program code by the discriminator, the model converges, and a high-quality GAN-based code translation model is obtained.
After the optimal GAN network model is determined, the program to be translated can be input directly into the optimal GAN network model to obtain the corresponding target translation program, realizing code translation efficiently and accurately.
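At inference time, using the trained model could be as simple as the following sketch (`tokenize` and `detokenize` are hypothetical helpers for converting between program text and the model's subtree/token representations):

```python
import torch

def translate(source_text, tokenize, extractor, generator, detokenize):
    """Translate one source program with the converged (optimal) GAN generator."""
    extractor.eval()
    generator.eval()
    with torch.no_grad():
        subtree_ids = tokenize(source_text)            # (1, num_subtrees) tensor
        feature_vec = extractor(subtree_ids)           # first code feature value vector
        token_logits = generator(feature_vec)          # (1, max_len, vocab)
        token_ids = token_logits.argmax(dim=-1)        # greedy decoding
    return detokenize(token_ids[0].tolist())           # target translation program text
```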
Fig. 6 is a schematic structural diagram of a code translation system 600 based on a generative adversarial network (GAN) according to an embodiment of the present invention. As shown in fig. 6, the code translation system 600 based on a generative adversarial network (GAN) according to the embodiment of the present invention includes: a first code feature value vector acquisition unit 601, a target program generation unit 602, a second code feature value vector acquisition unit 603, a comparison result acquisition unit 604, an optimal GAN network model determination unit 605, and a target translation program acquisition unit 606.
Preferably, the first code feature value vector obtaining unit 601 is configured to obtain a sample program set, and perform code feature value vector extraction on each sample program in the sample program set by using a code feature extraction model to obtain a first code feature value vector corresponding to each sample program.
Preferably, the code feature extraction model is based on the ASTNN idea of cutting the abstract syntax tree (AST): the AST is cut into a plurality of subtrees at the AST nodes that represent the program control blocks corresponding to preset characters; the subtrees are convolved with a TBCNN; a bidirectional recurrent neural network whose internal neurons are LSTM cells is then used to extract the sequence information between code blocks; the code vector generated at each time step is stored in a vector matrix; and finally maximum pooling is applied to obtain the code feature value vector.
Preferably, the preset characters include: "if", "while", "for", and "function".
Preferably, the target program generating unit 602 is configured to input the first code feature value vector corresponding to each sample program into a generator of the current GAN network model, so as to obtain the target program corresponding to each sample program by using the generator.
Preferably, the second code characteristic value vector obtaining unit 603 is configured to send, through the arbiter of the current GAN network, the target program corresponding to each sample program to the code characteristic extraction model to perform code characteristic value vector extraction, so as to obtain a second code characteristic value vector corresponding to each target program.
Preferably, the comparison result obtaining unit 604 is configured to compare the first code characteristic value vector and the second code characteristic value vector corresponding to each sample program by using a discriminator of the current GAN network, respectively, to obtain a comparison result corresponding to each sample program.
Preferably, the optimal GAN network model determining unit 605 is configured to, if it is determined that the current GAN network model can distinguish the sample program and the target program according to the comparison result corresponding to each sample program, alternately train the discriminator and the generator of the current GAN network model until the trained GAN network model cannot distinguish the sample program and the target program, and determine that the current GAN network model is the optimal GAN network model.
Preferably, if it is determined that the current GAN network model can distinguish the sample program from the target program according to the comparison result corresponding to each sample program, the optimal GAN network model determining unit 605 alternately trains the discriminator and the generator of the current GAN network model until the trained GAN network model cannot distinguish the sample program from the target program, and determining that the current GAN network model is the optimal GAN network model includes:
if it is determined according to the comparison result corresponding to each sample program that the current GAN network model can distinguish the sample programs from the target programs, locking the generator and updating the parameters of the discriminator, saving the parameters of the discriminator after the discriminator has been trained consecutively a preset number of times, sampling data with the generator to obtain new samples, locking the discriminator without updating its network parameters, inputting the new samples into the discriminator again, calculating the loss, back-propagating it into the generator for one round of training and updating the parameters of the generator, and recalculating the comparison result, until the current GAN network model cannot distinguish the sample programs from the target programs according to the comparison result, indicating that the distributions of the feature value vectors at the discriminator tend to be consistent and that the model has converged, and determining that the current GAN network model is the optimal GAN network model.
Preferably, the optimization goal of the GAN network model is:
$$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

where $\max_D$ indicates that the discriminator D tries to maximize the objective function V(D, G), and $\min_G$ indicates that the generator G tries to minimize it; $x \sim p_{data}(x)$ means that x follows the data distribution $p_{data}$ of the real program code, and $z \sim p_z(z)$ means that z follows the data distribution $p_z$ of the generated program code. D(x) denotes the discrimination of a real sample; the closer the result is to 1 the better, so its loss term is $\log D(x)$. z is the input sample program and G(z) denotes the generated sample; for the judgment D(G(z)) of the generated sample, since the output of the discriminator lies between 0 and 1, D(G(z)) should be as small as possible, so that $1 - D(G(z))$ approaches 1, its logarithm approaches 0, and the whole term approaches 0, realizing the zero-sum game; the discriminator therefore wants this objective to be as large as possible.
Preferably, the target translation program obtaining unit 606 is configured to obtain a program to be translated, perform code translation by using the optimal GAN network model, and obtain a target translation program corresponding to the program to be translated.
The code translation system 600 based on a generative adversarial network (GAN) according to the embodiment of the present invention corresponds to the code translation method 100 based on a generative adversarial network (GAN) according to another embodiment of the present invention, and is not described again here.
The invention has been described with reference to a few embodiments. However, other embodiments of the invention than the one disclosed above are equally possible within the scope of the invention, as would be apparent to a person skilled in the art from the appended patent claims.
Generally, all terms used in the claims are to be interpreted according to their ordinary meaning in the technical field, unless explicitly defined otherwise herein. All references to "a/an/the [ device, component, etc ]" are to be interpreted openly as referring to at least one instance of said device, component, etc., unless explicitly stated otherwise. The steps of any method disclosed herein do not have to be performed in the exact order disclosed, unless explicitly stated.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims (10)

1. A code translation method based on a generative adversarial network (GAN), the method comprising:
acquiring a sample program set, and extracting a code characteristic value vector of each sample program in the sample program set by using a code characteristic extraction model to acquire a first code characteristic value vector corresponding to each sample program;
inputting the first code characteristic value vector corresponding to each sample program into a generator of the current GAN network model, so as to obtain a target program corresponding to each sample program by using the generator;
sending the target program corresponding to each sample program to the code characteristic extraction model through a discriminator of the current GAN network to extract code characteristic value vectors so as to obtain second code characteristic value vectors corresponding to each target program;
respectively comparing the first code characteristic value vector and the second code characteristic value vector corresponding to each sample program by using a discriminator of the current GAN network to obtain a comparison result corresponding to each sample program;
if the current GAN network model can distinguish the sample program and the target program according to the comparison result corresponding to each sample program, alternately training a discriminator and a generator of the current GAN network model until the trained GAN network model cannot distinguish the sample program and the target program, and determining that the current GAN network model is the optimal GAN network model;
and acquiring a program to be translated, translating codes by using the optimal GAN network model, and acquiring a target translation program corresponding to the program to be translated.
2. The method as claimed in claim 1, wherein the code feature extraction model is based on the concept of ASTNN cutting abstract syntax tree AST, the AST is cut into a plurality of sub-trees by using AST nodes representing program control blocks corresponding to preset characters, the plurality of sub-trees are convolved by using TBCNN, a bidirectional recurrent neural network is used, internal neurons use LSTM to extract sequence information between code blocks, code vectors generated at each time step are stored in a vector matrix, and finally maximum pooling is used to obtain the code feature value vectors.
3. The method of claim 2, wherein the predetermined characters comprise: "if", "while", "for", and "function".
4. The method of claim 1, wherein the step of alternately training a discriminator and a generator of the current GAN network model if it is determined that the current GAN network model can distinguish the sample program from the target program according to the comparison result corresponding to each sample program until the trained GAN network model cannot distinguish the sample program from the target program comprises:
if it is determined according to the comparison result corresponding to each sample program that the current GAN network model can distinguish the sample programs from the target programs, locking the generator and updating the parameters of the discriminator, saving the parameters of the discriminator after the discriminator has been trained consecutively a preset number of times, sampling data with the generator to obtain new samples, locking the discriminator without updating its network parameters, inputting the new samples into the discriminator again, calculating the loss, back-propagating it into the generator for one round of training and updating the parameters of the generator, and recalculating the comparison result, until the current GAN network model cannot distinguish the sample programs from the target programs according to the comparison result, indicating that the distributions of the feature value vectors at the discriminator tend to be consistent and that the model has converged, and determining that the current GAN network model is the optimal GAN network model.
5. The method of claim 1, wherein the optimization goal of the GAN network model is:

$$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

where $\max_D$ indicates that the discriminator D tries to maximize the objective function V(D, G), and $\min_G$ indicates that the generator G tries to minimize it; $x \sim p_{data}(x)$ means that x follows the data distribution $p_{data}$ of the real program code, and $z \sim p_z(z)$ means that z follows the data distribution $p_z$ of the generated program code; D(x) denotes the discrimination of a real sample, and the closer the result is to 1 the better, so its loss term is $\log D(x)$; z is the input sample program and G(z) denotes the generated sample; for the judgment D(G(z)) of the generated sample, since the output of the discriminator lies between 0 and 1, D(G(z)) should be as small as possible, so that $1 - D(G(z))$ approaches 1, its logarithm approaches 0, and the whole term approaches 0, realizing the zero-sum game; the discriminator therefore wants this objective to be as large as possible.
6. A code translation system based on a generative adversarial network (GAN), the system comprising:
the system comprises a first code characteristic value vector acquisition unit, a second code characteristic value vector acquisition unit and a third code characteristic value vector acquisition unit, wherein the first code characteristic value vector acquisition unit is used for acquiring a sample program set and extracting a code characteristic value vector of each sample program in the sample program set by using a code characteristic extraction model so as to acquire a first code characteristic value vector corresponding to each sample program;
the target program generating unit is used for inputting the first code characteristic value vector corresponding to each sample program into a generator of the current GAN network model so as to obtain the target program corresponding to each sample program by using the generator;
a second code characteristic value vector obtaining unit, configured to send, through a discriminator of a current GAN network, a target program corresponding to each sample program to the code characteristic extraction model to perform extraction of a code characteristic value vector, so as to obtain a second code characteristic value vector corresponding to each target program;
a comparison result obtaining unit, configured to compare the first code characteristic value vector and the second code characteristic value vector corresponding to each sample program respectively by using a discriminator of a current GAN network, and obtain a comparison result corresponding to each sample program;
the optimal GAN network model determining unit is used for alternately training a discriminator and a generator of the current GAN network model if the current GAN network model can distinguish the sample program and the target program according to the comparison result corresponding to each sample program, and determining the current GAN network model as the optimal GAN network model until the trained GAN network model can not distinguish the sample program and the target program;
and the target translation program acquisition unit is used for acquiring a program to be translated, translating the code by using the optimal GAN network model and acquiring a target translation program corresponding to the program to be translated.
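Purely as an illustrative sketch of how these units could interact (not the claimed implementation), assuming a hypothetical helper extract_features standing in for the code feature extraction model and trained generator/discriminator objects G and D with hypothetical generate_program and compare methods:

```python
def translate(program_source: str, extract_features, G) -> str:
    """Translation with the trained (optimal) generator."""
    v1 = extract_features(program_source)     # first code characteristic value vector
    return G.generate_program(v1)             # target translation program

def discriminator_step(sample_program: str, extract_features, G, D) -> float:
    """One comparison step: can D still tell the sample and target programs apart?"""
    v1 = extract_features(sample_program)     # vector of the sample program
    target_program = G.generate_program(v1)   # generator output
    v2 = extract_features(target_program)     # second code characteristic value vector
    return D.compare(v1, v2)                  # comparison result used during training
```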
7. The system of claim 6, wherein the code feature extraction model is based on the ASTNN idea of cutting the abstract syntax tree (AST): the AST is cut into a plurality of sub-trees at the AST nodes, corresponding to preset characters, that represent program control blocks; the sub-trees are convolved using TBCNN; a bidirectional recurrent neural network whose internal neurons are LSTM units extracts the sequence information between code blocks; the code vectors generated at each time step are stored in a vector matrix; and finally maximum pooling is applied to obtain the code characteristic value vectors.
8. The system of claim 7, wherein the preset characters comprise: "if", "while", "for", and "function".
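A minimal sketch of the extraction pipeline of claims 7 and 8, assuming PyTorch; the sub-trees are assumed to have already been cut at the AST nodes listed in claim 8, and the TBCNN tree convolution is simplified to a mean of node embeddings purely to keep the example short, so this is illustrative rather than the claimed model:

```python
import torch
import torch.nn as nn

PRESET_NODES = {"if", "while", "for", "function"}   # claim 8: control-block split points

class CodeFeatureExtractor(nn.Module):
    """Simplified ASTNN-style extractor: encode each statement sub-tree,
    run a bidirectional LSTM over the sub-tree sequence, then max-pool."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, bidirectional=True, batch_first=True)

    def encode_subtree(self, node_ids):
        # node_ids: 1-D LongTensor of AST node indices for one sub-tree;
        # a real implementation would apply a TBCNN tree convolution here.
        return self.embed(node_ids).mean(dim=0)                 # (embed_dim,)

    def forward(self, subtrees):
        # subtrees: list of LongTensors, one per sub-tree cut at PRESET_NODES
        seq = torch.stack([self.encode_subtree(t) for t in subtrees]).unsqueeze(0)  # (1, T, E)
        outputs, _ = self.bilstm(seq)                           # (1, T, 2*H): one vector per time step
        return outputs.max(dim=1).values.squeeze(0)             # max pooling -> code feature value vector
```

Under these assumptions, the returned vector would play the role of the first (or second) code characteristic value vector fed to the GAN network model.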
9. The system according to claim 6, wherein the optimal GAN network model determining unit is specifically configured to:
if it is determined, according to the comparison result corresponding to each sample program, that the current GAN network model can distinguish the sample program from the target program, lock the generator and update the parameters of the discriminator; after the number of consecutive training iterations of the discriminator reaches the preset number, use the generator to generate new samples from the sample data; lock the discriminator so that its network parameters are not updated, input the new samples into the discriminator again, calculate the loss, back-propagate the loss to the generator, train the generator once and update its parameters; and recalculate the comparison result, until the current GAN network model can no longer distinguish the sample program from the target program according to the comparison result, which indicates that the distributions of the characteristic value vectors at the discriminator tend to be consistent and that the model has converged; and determine that the current GAN network model is the optimal GAN network model.
10. The system of claim 6, wherein the objective of the GAN network model is:

$$\min_{G}\max_{D} V(D,G)=\mathbb{E}_{x\sim p_{\mathrm{data}}(x)}\big[\ln D(x)\big]+\mathbb{E}_{z\sim p_{z}(z)}\big[\ln\big(1-D(G(z))\big)\big]$$

wherein max_D denotes that the discriminator D tries to maximize the objective function V(D, G), and min_G denotes that the generator G tries to minimize the objective function V(D, G); x ~ p_data(x) denotes that x obeys the data distribution p_data of the real program code, and z ~ p_z(z) denotes that z obeys the data distribution p_z of the generator's input code; D(x) is the discriminator's output for a real sample and should be as close to 1 as possible, so its loss term is ln D(x); z is the input sample program and G(z) is the generated sample; since the discriminator's output lies between 0 and 1, D(G(z)) should be as small as possible for a generated sample, so that 1 − D(G(z)) approaches 1 and its logarithm approaches 0; the objective as a whole then approaches 0, realizing a zero-sum game in which the discriminator tries to make V(D, G) as large as possible.
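As an illustrative implementation note (not part of the claim), maximizing V(D, G) over the discriminator is equivalent to minimizing a standard binary cross-entropy loss; a non-saturating variant of the generator loss that is common in practice is also shown:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """d_real = D(x), d_fake = D(G(z)), both float tensors in (0, 1).
    Minimizing this BCE loss is equivalent to maximizing
    E[ln D(x)] + E[ln(1 - D(G(z)))] over the discriminator."""
    return bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))

def generator_loss(d_fake: torch.Tensor) -> torch.Tensor:
    """Non-saturating variant: the generator maximizes ln D(G(z))
    instead of minimizing ln(1 - D(G(z)))."""
    return bce(d_fake, torch.ones_like(d_fake))
```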
CN202110162786.7A 2021-02-05 2021-02-05 Code translation method and system based on generation type countermeasure GAN network Pending CN112905188A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110162786.7A CN112905188A (en) 2021-02-05 2021-02-05 Code translation method and system based on generation type countermeasure GAN network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110162786.7A CN112905188A (en) 2021-02-05 2021-02-05 Code translation method and system based on generation type countermeasure GAN network

Publications (1)

Publication Number Publication Date
CN112905188A true CN112905188A (en) 2021-06-04

Family

ID=76123140

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110162786.7A Pending CN112905188A (en) 2021-02-05 2021-02-05 Code translation method and system based on generation type countermeasure GAN network

Country Status (1)

Country Link
CN (1) CN112905188A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110750997A (en) * 2018-07-05 2020-02-04 普天信息技术有限公司 Machine translation method and device based on generation countermeasure learning
CN109582352A (en) * 2018-10-19 2019-04-05 北京硅心科技有限公司 A kind of code completion method and system based on double AST sequences
CN110489102A (en) * 2019-07-29 2019-11-22 东北大学 A method of Python code is automatically generated from natural language
CN111090461A (en) * 2019-11-18 2020-05-01 中山大学 Code annotation generation method based on machine translation model

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114064472A (en) * 2021-11-12 2022-02-18 天津大学 Automatic software defect repairing and accelerating method based on code representation
CN114064472B (en) * 2021-11-12 2024-04-09 天津大学 Automatic software defect repairing acceleration method based on code representation
US11847436B2 (en) 2022-01-25 2023-12-19 Hewlett Packard Enterprise Development Lp Machine learning (ML) model-based compiler
CN114860241A (en) * 2022-07-07 2022-08-05 中国海洋大学 Code abstract syntax tree generation method based on generation countermeasure network

Similar Documents

Publication Publication Date Title
CN112905188A (en) Code translation method and system based on generation type countermeasure GAN network
CN108446540B (en) Program code plagiarism type detection method and system based on source code multi-label graph neural network
WO2019233112A1 (en) Vectorized representation method for software source codes
KR102027141B1 (en) A program coding system based on artificial intelligence through voice recognition and a method thereof
CN111859978A (en) Emotion text generation method based on deep learning
CN114120041B (en) Small sample classification method based on double-countermeasure variable self-encoder
CA3135717A1 (en) System and method for transferable natural language interface
CN115983274B (en) Noise event extraction method based on two-stage label correction
CN113705196A (en) Chinese open information extraction method and device based on graph neural network
CN115935372A (en) Vulnerability detection method based on graph embedding and bidirectional gated graph neural network
CN113741886A (en) Statement level program repairing method and system based on graph
CN111563148A (en) Dialog generation method based on phrase diversity
CN113723070A (en) Text similarity model training method, text similarity detection method and text similarity detection device
CN110909174B (en) Knowledge graph-based method for improving entity link in simple question answering
CN110489348B (en) Software functional defect mining method based on migration learning
CN113705207A (en) Grammar error recognition method and device
CN115936014A (en) Medical entity code matching method, system, computer equipment and storage medium
CN113190233B (en) Intelligent source code translation method and system for multi-source heterogeneous programming language
CN115686923A (en) Method and system for automatically repairing software source code defects
CN115630652A (en) Customer service session emotion analysis system, method and computer system
CN114065210A (en) Vulnerability detection method based on improved time convolution network
CN111950615A (en) Network fault feature selection method based on tree species optimization algorithm
CN117033733B (en) Intelligent automatic classification and label generation system and method for library resources
CN117875322A (en) Entity extraction method, system, equipment and medium of text data
Huang et al. Improving Just-In-Time Comment Updating via AST Edit Sequence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination