CN117763461A - Code scanning model training method and device, storage medium and electronic equipment - Google Patents

Code scanning model training method and device, storage medium and electronic equipment Download PDF

Info

Publication number
CN117763461A
Authority
CN
China
Prior art keywords
code
model
sample
layer
defect
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311786493.1A
Other languages
Chinese (zh)
Inventor
王磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AlipayCom Co ltd
Original Assignee
AlipayCom Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AlipayCom Co ltd filed Critical AlipayCom Co ltd
Priority to CN202311786493.1A
Publication of CN117763461A
Legal status: Pending


Abstract

One or more embodiments of the present specification disclose a code scanning model training method, comprising: acquiring a code scanning model to be trained, wherein the code scanning model comprises an input layer, a convolution layer, a feature extraction layer, an output layer and an evaluation layer; acquiring a plurality of first training samples, wherein each first training sample comprises a first code sample and the defect number of that first code sample; and training the code scanning model with the first training samples. The training targets include: minimizing the diffusion loss of the code scanning model and maximizing an objective function of the output accuracy of the code scanning model.

Description

Code scanning model training method and device, storage medium and electronic equipment
Technical Field
The embodiment of the specification belongs to the technical field of computers, and particularly relates to a code scanning model training method, a device, a storage medium and electronic equipment.
Background
Code inspection refers to using a code scanning tool to check whether the code submitted by a developer meets the coding standards, and to assess its quality, after the developer finishes writing or modifying the code and submits the code files to a code repository.
Code inspection is an indispensable link in the software development process; inspecting the code improves its readability and maintainability and thus ensures the quality of software development. However, as software systems grow in complexity and size, improving the accuracy and coverage of code inspection remains a major challenge, so there is a need for a more intelligent, more efficient and wider-coverage code inspection scheme.
Disclosure of Invention
The embodiment of the specification provides a code scanning model training method, a device, a storage medium and electronic equipment, and the technical scheme is as follows:
in a first aspect, embodiments of the present disclosure provide a code scan model training method, including:
acquiring a code scanning model to be trained, wherein the code scanning model comprises an input layer, a convolution layer, a feature extraction layer, an output layer and an evaluation layer;
acquiring a plurality of first training samples, wherein the first training samples comprise first code samples and defect numbers of the first code samples;
training the code scan model with the first training sample, the training comprising: inputting the first code sample into the input layer to obtain word vector expression of the first code sample; inputting the word vector expression to the convolution layer to obtain a word vector sequence of the first code sample; inputting the word vector sequence into the feature extraction layer to obtain defect code features of the first code sample; inputting the defect code characteristics to the output layer to obtain a predicted defect type of the first code sample; inputting the predicted defect type to the evaluation layer, wherein the evaluation layer judges whether the output of the code scanning model is accurate or not based on the defect number of the first code sample;
The training targets include: minimizing the diffusion loss of the code scan model and maximizing the objective function of the code scan model output accuracy.
In a second aspect, embodiments of the present disclosure provide a code scan model training apparatus, including:
a model acquisition unit configured to acquire a code scan model to be trained, the code scan model including an input layer, a convolution layer, a feature extraction layer, an output layer, and an evaluation layer;
a first sample acquisition unit configured to acquire a plurality of first training samples including a first code sample and a defect number of the first code sample;
a model training unit configured to train the code scan model using the first training sample, the training comprising: inputting the first code sample into the input layer to obtain word vector expression of the first code sample; inputting the word vector expression to the convolution layer to obtain a word vector sequence of the first code sample; inputting the word vector sequence into the feature extraction layer to obtain defect code features of the first code sample; inputting the defect code characteristics to the output layer to obtain a predicted defect type of the first code sample; inputting the predicted defect type to the evaluation layer, wherein the evaluation layer judges whether the output of the code scanning model is accurate or not based on the defect number of the first code sample;
The training targets include: minimizing the diffusion loss of the code scan model and maximizing the objective function of the code scan model output accuracy.
In a third aspect, embodiments of the present description provide a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to the first aspect described above.
In a fourth aspect, embodiments of the present disclosure provide an electronic device, including:
one or more processors, and a memory associated with the one or more processors, the memory for storing program instructions that, when read for execution by the one or more processors, perform the steps of the method of the first aspect described above.
The technical scheme provided by one or more embodiments of the present specification at least includes:
firstly, the large-model method based on a convolutional neural network applies machine learning and artificial intelligence techniques; by training on large-scale code samples and analyzing actual running conditions, it achieves higher accuracy and coverage and helps developers discover and resolve code performance problems more thoroughly; secondly, during model training, both the defect code block and the code sample containing the defect code block are used as training inputs, which fully reflects the influence of the code context on defect localization, and the defect features are then learned more accurately and reliably through repeated fusion and refinement in the convolution layer and the feature extraction layer; finally, the model is adjusted and verified with scenario-specific code samples and a test set, which further improves the reliability of the model while improving its coverage.
Drawings
In order to more clearly illustrate the technical solutions of one or more embodiments of the present description, the drawings that are required for use in the embodiments will be briefly described below, and it will be apparent that the drawings in the following description are only some embodiments of the present description, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is an exemplary system architecture diagram to which embodiments of the present specification are applied.
Fig. 2 is a flow diagram of a code scan model training method according to one or more embodiments of the present disclosure.
FIG. 3 is a flow diagram of a model training method provided in one or more embodiments of the present disclosure.
Fig. 4 is a schematic structural diagram of three-layer word embedding provided in one or more embodiments of the present disclosure.
Fig. 5 is a schematic structural diagram of a convolutional neural network according to one or more embodiments of the present disclosure.
FIG. 6 is a flow diagram of another method of model training provided in one or more embodiments of the present disclosure.
FIG. 7 is a code scan model training apparatus provided in one or more embodiments of the present disclosure.
FIG. 8 is a diagram illustrating another code scan model training apparatus provided in one or more embodiments of the present disclosure.
Fig. 9 is a schematic block diagram of an electronic device provided in one or more embodiments of the present disclosure.
Detailed Description
The following description of the embodiments will be made clearly and fully with reference to the accompanying drawings in one or more embodiments of the present disclosure.
The terms first, second, third and the like in the description and in the claims and in the above drawings are used for distinguishing between different objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus. Depending on the context, the word "if" as applied herein may be interpreted as "when" or "in response to a determination" or "in response to a detection".
In order to build better software products, developers need to cooperate closely with each other during development, but uneven technical levels among developers pose a great challenge to team collaboration, which is why code inspection was introduced. The purpose of code inspection is to help developers find and correct errors that would otherwise go unnoticed during software development, thereby ensuring that the software functions as intended, improving software quality, and raising developers' skill levels.
Code inspection can generally be divided into three categories: pair inspection, process inspection and tool inspection. Pair inspection means that two developers work on one computer at the same time: one writes the program while the other reviews the program as it is written. Process inspection means that, before the code is merged into the repository, it is manually reviewed by a manager or an expert, and the code submitted by a developer can only be merged and stored after this manual review passes. Tool inspection refers to static code analysis using a code scanning tool; a conventional rule-based code scanning tool examines potential code problems, such as unused variables and potential null pointer exceptions, by defining a series of rules.
In the prior art, pair inspection is inefficient, process inspection consumes a large amount of development time and cost, and tool inspection, although comparatively more effective, can only detect a limited range of static code defects, such as unused variables and potential null pointer exceptions.
In view of this, one or more embodiments of the present disclosure provide a code scanning model training method and training apparatus. The trained code scanning model is deployed on a code inspection platform, so that the large-model-based method, by training on large-scale code samples and analyzing actual running conditions with machine learning and artificial intelligence techniques, can identify more potential performance problems and risks. It can detect not only conventional rule-based problems but also hidden performance problems related to code structure and logic, such as inefficient algorithms and frequent IO operations. The large-model-based method therefore has higher accuracy and coverage and can help developers better discover and resolve code performance problems. To facilitate understanding of the present disclosure, the system architecture used in the embodiments of this specification is briefly described first.
FIG. 1 shows an exemplary system architecture to which the embodiments of the present disclosure apply. The system architecture mainly includes a code inspection platform that contains a trained code scanning model. After a developer finishes modifying the code, the code file and a checklist are uploaded to the code inspection platform, which scans the code. When the code has a defect, the defect location, defect type and a modification suggestion are fed back to the developer; the developer modifies the code, resubmits it to the platform for rescanning, and, once no defect remains, submits the change to an auditor, who confirms it and merges the code into the repository, as sketched below. It should be appreciated that the number of code inspection platforms in FIG. 1 is merely illustrative; any number of code inspection platforms may be provided in the system, as required by the implementation.
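The interaction just described amounts to a simple scan-fix-resubmit loop. The following Python sketch is purely illustrative: the names scan_code, DefectReport, developer_fixes and submit_to_auditor are hypothetical placeholders and do not appear in the patent.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DefectReport:
    location: str      # where the defect was found
    defect_type: str   # defect type predicted by the code scanning model
    suggestion: str    # modification suggestion returned to the developer

def scan_code(code_file: str, checklist: str) -> List[DefectReport]:
    """Placeholder for the code inspection platform calling the trained code scanning model."""
    return []  # assume a clean file for illustration

def developer_fixes(code_file: str, reports: List[DefectReport]) -> str:
    """Placeholder: the developer modifies the code according to the feedback."""
    return code_file

def submit_to_auditor(code_file: str) -> str:
    """Placeholder: the auditor confirms the change and merges it into the repository."""
    return "merged:" + code_file

def review_loop(code_file: str, checklist: str) -> str:
    """Rescan until no defect remains, then hand the change over to the auditor."""
    while True:
        reports = scan_code(code_file, checklist)
        if not reports:
            return submit_to_auditor(code_file)
        code_file = developer_fixes(code_file, reports)
```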
Fig. 2 is a flow diagram of a code scanning model training method provided in one or more embodiments of the present disclosure. The method may be performed by a general-purpose device such as an electronic device, for example a client device, a server device, or a device on which both a client and a server are installed. The method may comprise the following steps:
Step 202: acquiring a code scanning model to be trained, wherein the code scanning model comprises an input layer, a convolution layer, a feature extraction layer, an output layer and an evaluation layer;
step 204: acquiring a plurality of first training samples, wherein the first training samples comprise first code samples and defect numbers of the first code samples;
step 206: training the code scan model with the first training sample;
the training comprises: inputting the first code sample into the input layer to obtain word vector expression of the first code sample; inputting the word vector expression to the convolution layer to obtain a word vector sequence of the first code sample; inputting the word vector sequence into the feature extraction layer to obtain defect code features of the first code sample; inputting the defect code characteristics to the output layer to obtain a predicted defect type of the first code sample; inputting the predicted defect type to the evaluation layer, wherein the evaluation layer judges whether the output of the code scanning model is accurate or not based on the defect number of the first code sample;
the training targets include: minimizing the diffusion loss of the code scan model and maximizing the objective function of the code scan model output accuracy.
Each of the steps in fig. 2 is described separately with reference to specific examples and embodiments.
A big data model is a method of modeling and analyzing large data sets. It is a mathematical and statistical model that reveals hidden patterns, trends and correlations in data. The goal of a big data model is to extract useful information and knowledge from massive amounts of data to support decisions and predictions. Big data models include machine learning models, deep learning models, graph analysis models, recommendation system models, time-series models, natural language models and the like, and different kinds of data call for different models. In step 202, a data model based on a convolutional neural network is employed; it includes an input layer, a convolution layer, a feature extraction layer, an output layer and an evaluation layer, each of which is described in detail in later sections of the specification.
In step 204, a plurality of first training samples are obtained, each first training sample including a first code sample and the defect number of that first code sample. Training a large data model requires a large amount of data, and obtaining these training samples is the first challenge. In one or more embodiments of the present disclosure, a large number of real code samples and their associated performance issues are collected from sources such as open-source projects, business applications, or internal code repositories. In addition to the defect code block itself, a real code sample may also include the context of the defect code block, the code structure, dependency relationships and other information. After the code samples and associated performance issues are obtained, they need to be preprocessed to make the data more uniform and better suited as model input. Specifically, a real code sample is first normalized, where the normalization includes: (1) determining the coding specification; (2) unifying the naming of variables, functions and classes in the code sample according to that specification; and (3) adjusting indentation, line breaks, spacing and similar formatting. After normalization, the code syntax is further parsed and the variables, functions and classes are labeled, as in the preprocessing sketch below. In one or more embodiments of the present disclosure, known code defects are encoded so that each defect is assigned a number, and a defect code feature library is built from the defects and their numbers, thereby digitizing the code defects.
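As a rough illustration of the preprocessing described above, the sketch below normalizes indentation and line endings, performs a crude lexical split, and keeps a defect number-to-type mapping as the seed of a defect code feature library. The concrete numbers, regular expression and function names are assumptions, not taken from the patent.

```python
import re

# Hypothetical seed of the defect code feature library: defect number -> defect type.
DEFECT_LIBRARY = {
    101: "unused variable",
    102: "potential null pointer exception",
}

def normalize_code(source: str, indent: str = "    ") -> str:
    """Unify line endings, strip trailing spaces and replace leading tabs with a fixed indent."""
    lines = source.replace("\r\n", "\n").split("\n")
    normalized = []
    for line in lines:
        line = line.rstrip()
        leading_tabs = len(line) - len(line.lstrip("\t"))
        normalized.append(indent * leading_tabs + line.lstrip("\t"))
    return "\n".join(normalized)

def tokenize(source: str) -> list:
    """Very rough lexical split into identifiers and single symbols; a real implementation
    would use a language-specific parser to label variables, functions and classes."""
    return re.findall(r"[A-Za-z_]\w*|\S", source)
```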
In step 206, the code scanning model is trained using the preprocessed code samples and their defect numbers. Fig. 3 shows a flowchart of model training in one or more embodiments of the present specification; the specific training process for the model comprises the following steps:
step 302: inputting the first code sample into the input layer to obtain word vector expression of the first code sample;
step 304: inputting the word vector expression to the convolution layer to obtain a word vector sequence of the first code sample;
step 306: inputting the word vector sequence into the feature extraction layer to obtain defect code features of the first code sample;
step 308: inputting the defect code characteristics to the output layer to obtain a predicted defect type of the first code sample;
step 310: inputting the predicted defect type to the evaluation layer, wherein the evaluation layer judges whether the output of the code scanning model is accurate or not based on the defect number of the first code sample;
the training targets include: minimizing the diffusion loss of the code scan model and maximizing the objective function of the code scan model output accuracy.
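The patent does not give a concrete form for the diffusion loss or for the accuracy objective. The PyTorch sketch below only illustrates how the two training targets could be combined in one optimization step, assuming the model returns a (logits, diffusion_loss) pair and using the log-likelihood of the correct defect number as a differentiable stand-in for output accuracy; both assumptions are illustrative, not the patent's prescription.

```python
import torch

def training_step(model, batch, optimizer, lam: float = 1.0) -> float:
    """One combined step: minimize the diffusion loss, maximize the accuracy objective."""
    code_sample, defect_number = batch            # defect_number: (batch,) class indices
    logits, diffusion_loss = model(code_sample)   # assumed model interface
    log_probs = torch.log_softmax(logits, dim=-1)
    # differentiable surrogate for output accuracy: mean log-probability of the true defect number
    accuracy_objective = log_probs[torch.arange(logits.size(0)), defect_number].mean()
    loss = diffusion_loss - lam * accuracy_objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```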
The steps of the model training process are further described below.
In step 302, the input layer performs three-layer word embedding on the first code sample and on the defect code block through an ELMO model, where the three layers are, respectively, the word-level embedding, the embedding output by the first-layer bidirectional LSTM network, and the embedding output by the second-layer bidirectional LSTM network; the results of the three layers are then merged according to their respective weights to obtain a multidimensional word vector of the code sample and a multidimensional word vector of the code block.
FIG. 4 illustrates a specific process for three-layer word embedding in one or more embodiments of the present description.
The preprocessed code text requires further conversion before it can be recognized and modeled by the model; in one or more embodiments of the present description, the code scanning model takes the code text as input through the input layer.
When inspecting code, if the judgment does not depend on the context at all and the code text is examined in isolation, the lack of information is likely to cause large judgment errors. Therefore, in order to accurately characterize a code defect in combination with its context, the input layer extracts two texts as inputs: one is the complete code sample, which contains the defect code block together with its context, and the other is the defect code block alone. The input of the code scanning model for each training sample can thus be expressed as (S_r, S_i), where S_r denotes the complete code sample and S_i denotes the defect code block. The complete code sample and the defect code block are fed into the code scanning model as two pieces of data, so that during learning the model can fully combine the defect code with its context to identify and locate defects, and the code scanning platform can subsequently perform code scanning tasks on the same basis.
Once the inputs of the model are defined, the input layer needs to convert the input data into word vectors that the model can read. There are many methods for this conversion, for example word2vec and BERT. One of the most obvious drawbacks of the word vectors generated by the word2vec model is that they are static: each word corresponds to a unique word vector. In practice, however, the same word has different meanings in different contexts, sometimes completely different ones, so representing a word with a single vector is problematic. The ELMO model, by contrast, produces dynamic word vectors: to obtain the vector of a word, the whole text is fed in, and the word vector is generated dynamically from the full context, so the same word receives different vectors in different contexts. Based on these characteristics of the ELMO model, in the embodiments of the present specification the two inputs are converted with the ELMO model, and the specific conversion process is as follows:
First, character-level encoding is performed on the two inputs through a first-layer convolutional neural network to obtain the static word embedding vectors of the two pieces of input data. The static word embedding vectors are then fed into the first-layer bidirectional LSTM network to obtain its forward and backward outputs, which represent the syntactic features of the input data; these are fed into the second-layer bidirectional LSTM network to obtain its forward and backward outputs, which represent the semantic features of the input data.
The outputs of the first-layer convolutional neural network, the first-layer bidirectional LSTM network and the second-layer LSTM network are assigned different weights, and the three outputs are fused according to these weights to obtain the word vector expressions of the two pieces of input data. In one or more embodiments of the present specification, the weights can be learned or obtained by manually adjusting model parameters, as in the fusion sketch below.
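A minimal PyTorch sketch of this weighted fusion, assuming the three layer outputs all have shape (batch, seq_len, dim); the softmax-normalized scalar weights and the global scale follow the usual ELMo recipe and are learned together with the rest of the model.

```python
import torch
import torch.nn as nn

class ElmoStyleFusion(nn.Module):
    """Fuse the character-CNN embedding and the two bidirectional LSTM outputs
    with learned scalar weights, as described above."""
    def __init__(self):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(3))  # one weight per layer
        self.gamma = nn.Parameter(torch.ones(1))           # overall scale

    def forward(self, char_emb, lstm1_out, lstm2_out):
        w = torch.softmax(self.layer_weights, dim=0)
        fused = w[0] * char_emb + w[1] * lstm1_out + w[2] * lstm2_out
        return self.gamma * fused                          # (batch, seq_len, dim) word vector expression
```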
And convolving the code sample multidimensional word vector and the code block multidimensional word vector through a convolutional neural network to respectively obtain a corresponding code sample vector sequence and a code block vector sequence.
In one or more embodiments of the present disclosure, after the word vector expression of the code sample (including its context) and the word vector expression of the defect code block are obtained through the ELMO model, the context features of the code need to be further strengthened. Fig. 5 shows the convolutional neural network adopted in one or more embodiments of the present disclosure, which mainly includes convolution kernels, a pooling layer and a fully connected layer. After the input data is obtained, features of different lengths are captured by adjusting the window width of the convolution kernel: when the window width is set to 2, the kernel captures only the relationship between two adjacent words; when the window width is set to 3, the kernel captures only the relationship among three adjacent words. By using several convolution kernels with different window widths, code line information can be extracted while its integrity is maintained. After the code line information has been fully extracted, the features are further screened by the pooling layer, and finally the important features selected by the pooling layer are flattened by the fully connected layer to obtain the vector sequences of the code sample and of the defect code block, as in the sketch below.
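The following PyTorch sketch mirrors that description: convolution kernels with window widths 2, 3 and 4 slide over a word-vector sequence, each feature map is max-pooled, and the pooled features are flattened through a fully connected layer. The dimensions and the choice of widths are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultiWindowConv(nn.Module):
    """Convolution layer sketch: several kernel widths, max pooling, fully connected flattening."""
    def __init__(self, emb_dim: int = 256, n_filters: int = 128,
                 widths=(2, 3, 4), out_dim: int = 256):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, kernel_size=w) for w in widths
        )
        self.fc = nn.Linear(n_filters * len(widths), out_dim)

    def forward(self, x):                 # x: (batch, seq_len, emb_dim)
        x = x.transpose(1, 2)             # Conv1d expects (batch, emb_dim, seq_len)
        pooled = [torch.relu(conv(x)).max(dim=-1).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=-1))   # flattened code (or code-block) vector
```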
And carrying out fusion compression on the code sample vector sequence and the code block vector sequence to obtain the defect code characteristics.
After screening by the convolutional neural network, a code sample vector sequence and a defect code block vector sequence that fully reflect the context features are obtained. Directly concatenating the two sequences would yield a vector sequence containing both the defect's own features and its context features; this preserves the completeness of the information, but it places a heavy burden on the subsequent classification search and reduces classification efficiency. The code sample vector sequence and the code block vector sequence therefore need to be fused and compressed to reduce the dimensionality. In one or more embodiments of the specification, weights are further assigned to the code sample vector sequence and the code block vector sequence, and the two are fused and compressed according to these weights to obtain the defect code feature that best reflects the code defect, as in the sketch below.
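A minimal sketch of this weighted fusion and compression, assuming both inputs have already been flattened to vectors of the same dimension; the single learned weight and the linear compression are illustrative choices, not the patent's prescribed implementation.

```python
import torch
import torch.nn as nn

class FusionCompress(nn.Module):
    """Feature extraction layer sketch: weight, fuse and compress the two vector sequences."""
    def __init__(self, in_dim: int = 256, out_dim: int = 64):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learned weight of the code sample vector
        self.compress = nn.Linear(in_dim, out_dim)

    def forward(self, sample_vec, block_vec):         # both: (batch, in_dim)
        fused = self.alpha * sample_vec + (1.0 - self.alpha) * block_vec
        return self.compress(fused)                   # (batch, out_dim) defect code feature
```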
And performing classification search on the defect code characteristics through a classifier to obtain the defect code type.
After the defect code feature has been obtained through the above series of steps, it can be looked up in the defect code feature library, for example by binary search, until the corresponding defect code type is found, as in the sketch below.
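A sketch of the output layer under these assumptions: a linear classifier scores the defect code feature, and the winning class index is mapped to a defect number whose type is then looked up in a sorted defect code feature library by binary search. The library contents and dimensions are hypothetical.

```python
import bisect
import torch
import torch.nn as nn

# Hypothetical defect code feature library, sorted by defect number.
DEFECT_NUMBERS = [101, 102, 205, 307]
DEFECT_TYPES = ["unused variable", "null pointer risk", "inefficient algorithm", "frequent IO"]

class DefectClassifier(nn.Module):
    """Output layer sketch: classify the defect feature, then resolve the defect type."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.head = nn.Linear(feat_dim, len(DEFECT_NUMBERS))

    def forward(self, defect_feature):                      # (batch, feat_dim)
        pred = self.head(defect_feature).argmax(dim=-1)     # predicted class index per sample
        numbers = [DEFECT_NUMBERS[i] for i in pred.tolist()]
        # binary search each predicted number in the sorted library to recover its type
        types = [DEFECT_TYPES[bisect.bisect_left(DEFECT_NUMBERS, n)] for n in numbers]
        return numbers, types
```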
FIG. 6 illustrates another model training process in one or more embodiments of the present description which, building on FIG. 2, further comprises the following steps:
step 602: acquiring a second training sample, wherein the second training sample comprises a second code sample and a defect number of the second code sample;
step 604: inputting the second code sample into the code scanning model to obtain a code scanning result of the second code sample;
step 606: and adjusting the code scanning model based on the code scanning result of the second code sample.
In one or more embodiments of the present disclosure, in order to further improve the coverage of the code scanning model, a second training sample is also obtained. The second training sample corresponds to a code defect in a specific scenario, i.e., a defect with a relatively low probability of occurrence, and, like the first training sample, it contains a code sample and a defect number. It is input into the code scanning model for further training, and the code scanning model is adjusted according to the scanning result, so as to further improve the coverage of the code scanning model; a fine-tuning sketch follows.
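A hedged sketch of this second-stage adjustment: the already trained model is further trained on the scenario-specific (second) samples with a small learning rate, so that rare defects are covered without overwriting what was learned from the first samples. Here the model is assumed to return class logits and the defect number is used as the label; the optimizer, learning rate and epoch count are illustrative.

```python
import torch

def fine_tune_on_scenarios(model, scenario_batches, lr: float = 1e-5, epochs: int = 3):
    """Adjust the code scanning model with scenario-specific second training samples."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for code_sample, defect_number in scenario_batches:
            logits = model(code_sample)              # assumed: logits over defect classes
            loss = loss_fn(logits, defect_number)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```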
Acquiring a test set;
and inputting the test set into the code scanning model to obtain the accuracy of the code scanning model.
And comparing the accuracy with a model performance threshold, and when the accuracy is lower than the model performance threshold, adjusting model parameters of the code scanning model, and then continuing training and adjusting.
And when the accuracy rate is higher than the model performance threshold, saving model parameters, and deploying the code scanning model into an actual code scanning platform after solidification.
After the training of the model is completed, the performance of the code scanning model needs to be further verified against a test set, and the most important index for this verification is the accuracy of the code scanning model. The accuracy is compared with a model performance threshold: when the accuracy is below the threshold, the performance of the code scanning model is still insufficient, so its model parameters need to be adjusted and training continues.
If the accuracy is higher than the model performance threshold, training of the model is complete; the model parameters can then be saved and, after solidification, the code scanning model is deployed to an actual code scanning platform, as in the evaluation sketch below.
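A minimal sketch of this accuracy gate, assuming the model returns class logits and the test set yields (code sample, defect number) pairs; the threshold value and file path are illustrative.

```python
import torch

def evaluate_accuracy(model, test_loader) -> float:
    """Accuracy of the code scanning model on the test set."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for code_sample, defect_number in test_loader:
            pred = model(code_sample).argmax(dim=-1)
            correct += (pred == defect_number).sum().item()
            total += defect_number.numel()
    return correct / max(total, 1)

def gate_and_deploy(model, test_loader, threshold: float = 0.9,
                    path: str = "code_scan_model.pt") -> bool:
    acc = evaluate_accuracy(model, test_loader)
    if acc < threshold:
        return False                              # keep adjusting parameters and training
    torch.save(model.state_dict(), path)          # save ("solidify") parameters for deployment
    return True
```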
With the above method, on the one hand, the influence of the code context on code defects is fully considered during model training: the defect code and its context are combined in the training process, so that defects are identified and located more accurately and reliably; on the other hand, the model is adjusted and verified with scenario-specific code samples and a test set, which further improves both the coverage and the reliability of the model.
Referring to fig. 7, a code scan model training apparatus is provided in one or more embodiments of the present disclosure. As shown in fig. 7, the code scan model training apparatus 700 includes a model acquisition unit 702, a first sample acquisition unit 704, and a model training unit 706. Wherein, the main functions of each constituent unit are as follows:
a model acquisition unit 702 configured to acquire a code scan model to be trained, the code scan model including an input layer, a convolution layer, a feature extraction layer, an output layer, and an evaluation layer;
a first sample acquisition unit 704 configured to acquire a plurality of first training samples including a first code sample and a defect number of the first code sample;
A model training unit 706 configured to train the code scan model using the first training sample, the training comprising: inputting the first code sample into the input layer to obtain word vector expression of the first code sample; inputting the word vector expression to the convolution layer to obtain a word vector sequence of the first code sample; inputting the word vector sequence into the feature extraction layer to obtain defect code features of the first code sample; inputting the defect code characteristics to the output layer to obtain a predicted defect type of the first code sample; inputting the predicted defect type to the evaluation layer, wherein the evaluation layer judges whether the output of the code scanning model is accurate or not based on the defect number of the first code sample;
the training targets include: minimizing the diffusion loss of the code scan model and maximizing the objective function of the code scan model output accuracy.
The preprocessed code text requires further conversion before it can be recognized and modeled by the model; in one or more embodiments of the present description, the code scanning model takes the code text as input through the input layer.
When inspecting code, if the judgment does not depend on the context at all and the code text is examined in isolation, the lack of information is likely to cause large judgment errors. Therefore, in order to accurately characterize a code defect in combination with its context, the input layer extracts two texts as inputs: one is the complete code sample, which contains the defect code block together with its context, and the other is the defect code block alone. The input of the code scanning model for each training sample can thus be expressed as (S_r, S_i), where S_r denotes the complete code sample and S_i denotes the defect code block.
Once the inputs of the model are defined, the input layer needs to convert the input data into word vectors that the model can read. There are many methods for this conversion, for example word2vec and BERT. One of the most obvious drawbacks of the word vectors generated by the word2vec model is that they are static: each word corresponds to a unique word vector. In practice, however, the same word has different meanings in different contexts, sometimes completely different ones, so representing a word with a single vector is problematic. The ELMO model, by contrast, produces dynamic word vectors: to obtain the vector of a word, the whole text is fed in, and the word vector is generated dynamically from the full context, so the same word receives different vectors in different contexts. Based on these characteristics of the ELMO model, in the embodiments of the present specification the two inputs are converted with the ELMO model, and the specific conversion process is as follows:
First, character-level encoding is performed on the two inputs through a first-layer convolutional neural network to obtain the static word embedding vectors of the two pieces of input data. The static word embedding vectors are then fed into the first-layer bidirectional LSTM network to obtain its forward and backward outputs, which represent the syntactic features of the input data; these are fed into the second-layer bidirectional LSTM network to obtain its forward and backward outputs, which represent the semantic features of the input data.
The outputs of the first-layer convolutional neural network, the first-layer bidirectional LSTM network and the second-layer LSTM network are assigned different weights, and the three outputs are fused according to these weights to obtain the word vector expressions of the two pieces of input data. In one or more embodiments of the present specification, the weights can be learned or obtained by manually adjusting model parameters.
And convolving the code sample multidimensional word vector and the code block multidimensional word vector through a convolutional neural network to respectively obtain a corresponding code sample vector sequence and a code block vector sequence.
In one or more embodiments of the present disclosure, after the word vector expression of the code sample (including its context) and the word vector expression of the defect code block are obtained through the ELMO model, the context features of the code need to be further strengthened. The convolutional neural network mainly includes convolution kernels, a pooling layer and a fully connected layer. After the input data is obtained, features of different lengths are captured by adjusting the window width of the convolution kernel: when the window width is set to 2, the kernel captures only the relationship between two adjacent words; when the window width is set to 3, the kernel captures only the relationship among three adjacent words. By using several convolution kernels with different window widths, code line information can be extracted while its integrity is maintained. After the code line information has been fully extracted, the features are further screened by the pooling layer, and finally the important features selected by the pooling layer are flattened by the fully connected layer to obtain the vector sequences of the code sample and of the defect code block.
And carrying out fusion compression on the code sample vector sequence and the code block vector sequence to obtain the defect code characteristics.
After screening by the convolutional neural network, a code sample vector sequence and a defect code block vector sequence that fully reflect the context features are obtained. Directly concatenating the two sequences would yield a vector sequence containing both the defect's own features and its context features; this preserves the completeness of the information, but it places a heavy burden on the subsequent classification search and reduces classification efficiency. The code sample vector sequence and the code block vector sequence therefore need to be fused and compressed to reduce the dimensionality. In one or more embodiments of the specification, weights are further assigned to the code sample vector sequence and the code block vector sequence, and the two are fused and compressed according to these weights to obtain the defect code feature that best reflects the code defect.
And performing classification search on the defect code characteristics through a classifier to obtain the defect code type.
After the defect code feature has been obtained through the above series of steps, it can be looked up in the defect code feature library, for example by binary search, until the corresponding defect code type is found.
FIG. 8 is another code scan model training apparatus 800 provided in one or more embodiments of the present disclosure, further comprising:
a second sample acquisition unit 802 configured to acquire a second training sample including a second code sample and a defect number of the second code sample;
a second sample training unit 804 configured to input the second code sample into the code scanning model, to obtain a code scanning result of the second code sample;
a model adjustment unit 806 configured to adjust the code scanning model based on the code scanning result of the second code sample.
In one or more embodiments of the present disclosure, in order to further improve the coverage of the code scanning model, a second training sample is also obtained. The second training sample corresponds to a code defect in a specific scenario, i.e., a defect with a relatively low probability of occurrence, and, like the first training sample, it contains a code sample and a defect number. It is input into the code scanning model for further training, and the code scanning model is adjusted according to the scanning result, so as to further improve the coverage of the code scanning model.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are mutually referred to, and each embodiment mainly describes differences from other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments in part.
In addition, one or more embodiments of the present description also provide another computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any of the methods of the previous embodiments. The constituent modules of the above apparatus, if implemented in the form of software functional units and sold or used as independent products, may be stored in the computer-readable storage medium.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to one or more embodiments of the present description are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in or transmitted across a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital versatile disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
One or more embodiments of the present specification also provide an electronic device, including:
one or more processors; and
a memory associated with the one or more processors, the memory being configured to store program instructions that, when read and executed by the one or more processors, perform the steps of the method of any of the preceding method embodiments.
The present specification also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of any of the preceding method embodiments.
Those skilled in the art will appreciate that all or part of the methods in the above embodiments may be implemented by a computer program instructing relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the flows of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks or optical disks. The technical features in the examples and embodiments herein may be combined arbitrarily as long as there is no conflict.
Fig. 9 illustrates an architecture of an electronic device 900, which may specifically include: a processor 910, a disk drive 920, an input/output interface 930, a network interface 940, and a memory 950. The processor 910, disk drive 920, input/output interface 930, network interface 940, and memory 950 may be communicatively coupled via a communication bus.
The processor 910 may be implemented by a general-purpose CPU, a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, etc. for executing a relevant program to implement the technical solutions provided herein.
The memory 950 may be implemented in the form of ROM (Read-Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 950 may store an operating system 951 for controlling the operation of the electronic device 900 and a Basic Input/Output System (BIOS) 952 for controlling the low-level operation of the electronic device 900. In addition, a web browser 953, a data storage management system 954, a push document generation device 955 and the like may be stored. In general, when the technical solutions provided herein are implemented in software or firmware, the relevant program code is stored in the memory 950 and executed by the processor 910.
The input/output interface 930 is used to connect with input/output modules to achieve information input and output. The input/output module may be configured as a component in a device (not shown in the figure) or may be external to the device to provide corresponding functionality. Wherein the input devices may include keyboards, mice, touch screens, microphones, various types of sensors, etc., and the output devices may include displays, speakers, vibrators, indicator lights, etc.
The network interface 940 is used to connect communication modules (not shown) to enable communication interactions of the device with other devices. The communication module may implement communication through a wired manner (such as USB, network cable, etc.), or may implement communication through a wireless manner (such as mobile network, WIFI, bluetooth, etc.).
The bus includes a path to transfer information between elements of the device (e.g., the processor 910, the disk drive 920, the input/output interface 930, the network interface 940, and the memory 950).
It should be noted that although the above-described device shows only the processor 910, the disk drive 920, the input/output interface 930, the network interface 940, the memory 950 and the bus, in a specific implementation the device may include other components necessary for proper operation. Furthermore, those skilled in the art will understand that the above-described apparatus may include only the components necessary to implement the methods of the present application, and not all of the components shown in the drawings.
The above-described embodiments are merely preferred embodiments of the present disclosure, and do not limit the scope of the disclosure, and various modifications and improvements made by those skilled in the art to the technical solutions of the disclosure should fall within the protection scope defined by the claims of the disclosure without departing from the design spirit of the disclosure.

Claims (22)

1. The code scanning model training method comprises the following steps:
acquiring a code scanning model to be trained, wherein the code scanning model comprises an input layer, a convolution layer, a feature extraction layer, an output layer and an evaluation layer;
acquiring a plurality of first training samples, wherein the first training samples comprise first code samples and defect numbers of the first code samples;
training the code scan model with the first training sample, the training comprising: inputting the first code sample into the input layer to obtain word vector expression of the first code sample; inputting the word vector expression to the convolution layer to obtain a word vector sequence of the first code sample; inputting the word vector sequence into the feature extraction layer to obtain defect code features of the first code sample; inputting the defect code characteristics to the output layer to obtain a predicted defect type of the first code sample; inputting the predicted defect type to the evaluation layer, wherein the evaluation layer judges whether the output of the code scanning model is accurate or not based on the defect number of the first code sample;
the training targets include: minimizing the diffusion loss of the code scan model and maximizing the objective function of the code scan model output accuracy.
2. The method according to claim 1, comprising:
the first code sample includes a defective code block and a context of the defective code block.
3. The method of claim 2, the inputting the first code sample to the input layer, deriving a word vector representation of the first code sample comprising:
the input layer respectively performs three-layer word embedding on the first code sample and the defect code block through an ELMO model, wherein the word embedding is respectively word embedding of a word, word embedding of a first layer of bidirectional LSTM network and word embedding of a second layer of bidirectional LSTM network;
and merging the results of the three-layer word embedding based on the different weights of the three-layer word embedding to obtain the code sample multidimensional word vector and the code block multidimensional word vector.
4. The method of claim 3, the inputting the word vector representation into the convolutional layer, resulting in a word vector sequence of the first code samples comprising:
and convolving the code sample multidimensional word vector and the code block multidimensional word vector through a convolutional neural network to respectively obtain a corresponding code sample vector sequence and a code block vector sequence.
5. The method of claim 4, the inputting the sequence of word vectors into the feature extraction layer, obtaining defect code features of the first code sample comprising:
And carrying out fusion compression on the code sample vector sequence and the code block vector sequence to obtain the defect code characteristics.
6. The method of claim 5, the inputting the defect code feature to the output layer resulting in a predicted defect type for the first code sample comprising:
and performing classification search on the defect code characteristics through a classifier to obtain the predicted defect type.
7. The method of claim 1, further comprising:
acquiring a second training sample, wherein the second training sample comprises a second code sample and a defect number of the second code sample;
inputting the second code sample into the code scanning model to obtain a code scanning result of the second code sample;
and adjusting the code scanning model based on the code scanning result of the second code sample.
8. The method of claim 1, further comprising:
acquiring a test set;
and inputting the test set into the code scanning model to obtain the accuracy of the code scanning model.
9. The method of claim 8, after deriving accuracy of the code scan model, further comprising:
And comparing the accuracy with a model performance threshold, and when the accuracy is lower than the model performance threshold, adjusting model parameters of the code scanning model, and then continuing training and adjusting.
10. The method of claim 9, comprising:
and when the accuracy rate is higher than the model performance threshold, saving model parameters, and deploying the code scanning model into an actual code scanning platform after solidification.
11. A code scan model training device comprising:
a model acquisition unit configured to acquire a code scan model to be trained, the code scan model including an input layer, a convolution layer, a feature extraction layer, an output layer, and an evaluation layer;
a first sample acquisition unit configured to acquire a plurality of first training samples including a first code sample and a defect number of the first code sample;
a model training unit configured to train the code scan model using the first training sample, the training comprising: inputting the first code sample into the input layer to obtain word vector expression of the first code sample; inputting the word vector expression to the convolution layer to obtain a word vector sequence of the first code sample; inputting the word vector sequence into the feature extraction layer to obtain defect code features of the first code sample; inputting the defect code characteristics to the output layer to obtain a predicted defect type of the first code sample; inputting the predicted defect type to the evaluation layer, wherein the evaluation layer judges whether the output of the code scanning model is accurate or not based on the defect number of the first code sample;
The training targets include: minimizing the diffusion loss of the code scan model and maximizing the objective function of the code scan model output accuracy.
12. The apparatus of claim 11, comprising:
the first code sample includes a defective code block and a context of the defective code block.
13. The apparatus of claim 12, the input layer configured to:
respectively carrying out three-layer word embedding on the first code sample and the defect code block through an ELMO model, wherein the word embedding is respectively word embedding of a word, word embedding of a first layer of bidirectional LSTM network and word embedding of a second layer of bidirectional LSTM network;
and merging the results of the three-layer word embedding based on the different weights of the three-layer word embedding to obtain the code sample multidimensional word vector and the code block multidimensional word vector.
14. The apparatus of claim 13, the convolutional layer configured to:
and convolving the code sample multidimensional word vector and the code block multidimensional word vector to respectively obtain a corresponding code sample vector sequence and a code block vector sequence.
15. The apparatus of claim 14, the feature extraction layer configured to:
And carrying out fusion compression on the code sample vector sequence and the code block vector sequence to obtain the defect code characteristics.
16. The apparatus of claim 15, the output layer configured to:
and performing classification search on the defect code characteristics through a classifier to obtain the predicted defect type.
17. The apparatus of claim 11, the apparatus further comprising:
a second sample acquisition unit configured to acquire a second training sample including a second code sample and a defect number of the second code sample;
a second sample training unit configured to input the second code sample into the code scanning model to obtain a code scanning result of the second code sample;
and a model adjustment unit configured to adjust the code scanning model based on a code scanning result of the second code sample.
18. The apparatus of claim 11, further comprising:
a test set acquisition unit configured to acquire a test set;
and the testing unit is configured to input the testing set into the code scanning model to obtain the accuracy of the code scanning model.
19. The apparatus of claim 18, further comprising:
and comparing the accuracy with a model performance threshold, and when the accuracy is lower than the model performance threshold, adjusting model parameters of the code scanning model, and then continuing training and adjusting.
20. The apparatus of claim 19, further comprising:
and the model deployment unit is configured to save model parameters when the accuracy rate is higher than the model performance threshold value, and deploy the code scanning model into an actual code scanning platform after solidification.
21. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, implements the steps of the method of any of claims 1-10.
22. An electronic device, comprising:
one or more processors
A memory associated with the one or more processors for storing program instructions that, when read for execution by the one or more processors, perform the steps of the method of any of claims 1-10.
CN202311786493.1A 2023-12-22 2023-12-22 Code scanning model training method and device, storage medium and electronic equipment Pending CN117763461A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311786493.1A CN117763461A (en) 2023-12-22 2023-12-22 Code scanning model training method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311786493.1A CN117763461A (en) 2023-12-22 2023-12-22 Code scanning model training method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN117763461A (en) 2024-03-26

Family

ID=90325103

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311786493.1A Pending CN117763461A (en) 2023-12-22 2023-12-22 Code scanning model training method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117763461A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination