CN117742769B - Source code intelligent analysis engine based on a Xinchuang rule base - Google Patents

Source code intelligent analysis engine based on a Xinchuang rule base

Info

Publication number: CN117742769B (application number CN202410183113.3A)
Authority: CN (China)
Prior art keywords: code, neural network, network model, data, code data
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Other versions: CN117742769A
Inventors: 郭辉, 胡明光, 裴高翔, 沈铖涛, 董明, 姚拓中, 叶宏武, 陈丹儿
Current assignee: Zhejiang Kingnet Chengdu Westone Information Industry Inc (the listed assignees may be inaccurate)
Original assignee: Zhejiang Kingnet Chengdu Westone Information Industry Inc
Application filed by Zhejiang Kingnet Chengdu Westone Information Industry Inc
Publication of application CN117742769A, followed by grant and publication of CN117742769B

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a source code intelligent analysis engine based on a Xinchuang (IT application innovation) rule base, relating to the technical field of code analysis. The engine comprises a neural network model and a code distributor, the neural network model being constructed by the following steps: S1, collecting code data for training the model and converting it into unstructured code data; S2, preprocessing the data; S3, constructing the neural network model; S4, defining a loss function for the neural network model; S5, training the neural network model; S6, quantizing the neural network model. After training, the neural network model is deployed to a GPU to transcode and pre-run the source code distributed by the code distributor, thereby analyzing the source code and achieving intelligent analysis of source code in an environment defined by the Xinchuang rule base.

Description

Source code intelligent analysis engine based on a Xinchuang rule base
Technical Field
The invention relates to the technical field of code analysis, and in particular to an intelligent source code analysis engine based on a Xinchuang (信创, IT application innovation) rule base.
Background
A Xinchuang rule base is a set of rules, or an algorithm library, constructed with artificial intelligence and machine learning techniques and used in fields such as data mining, risk identification, decision analysis and intelligent recommendation. With such a rule base, risk control systems, anti-fraud systems, intelligent recommendation systems and the like can be built quickly, improving service efficiency and user experience.
Chinese patent publication CN115357481A discloses a data-driven evaluation method for migrating web applications to a domestic Xinchuang environment. The evaluation method comprises the following steps: S1, installing a migration detector; S2, opening the browser page to be analyzed in the IE browser, the page content including JS and CSS APIs; S3, injecting JS through a BHO plugin after page loading is completed; S4, analyzing the page HTML, JS and CSS files; S5, obtaining the migration evaluation workload from the IE-specific JS and CSS information reported by the local analysis service. By collecting the IE-specific JS and CSS information contained in a browser application, the workload needed to resolve JS and CSS compatibility problems during domestic migration can be evaluated quickly and accurately; the obtained positions of the specific JS and CSS problems then allow page issues to be located quickly, accelerating migration and reducing its time cost.
In the prior art, although web applications are checked for compatibility and workload in a Xinchuang environment, improving overall compatibility and application security, the Xinchuang rules are applied only at the application end; verification and analysis at the code end are not involved. An analysis engine capable of verifying and analyzing source code against a Xinchuang rule base is therefore still needed.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an intelligent source code analysis engine based on a Xinchuang rule base, which intelligently analyzes source code in an environment constrained by the Xinchuang rule base.
In order to achieve the above purpose, the present invention provides the following technical solutions:
Provided is a source code intelligent analysis engine based on a Xinchuang rule base, comprising a neural network model and a code distributor, the neural network model being used for training on and analysis of source code, and the code distributor being used for obtaining and distributing the source code to be analyzed, characterized in that:
The neural network model construction method comprises the following steps:
S1, collecting code data for training a model and converting the code data into unstructured code data;
S2, preprocessing data;
s3, constructing a neural network model;
S4, defining a loss function of the neural network model;
S5, training a neural network model;
s6, carrying out quantization processing on the neural network model;
And after the neural network model is trained, it is deployed to the GPU to transcode and pre-run the source code distributed by the code distributor, so as to analyze the source code.
As a further improvement of the present invention, the step S2 of preprocessing the data includes:
S21, code de-duplication: identifying repeated code segments in the converted code data and deleting the redundant copies so that a single copy of each segment is retained;
s22, formatting code data, and formatting the code subjected to duplication removal;
S23, deleting useless codes, and deleting useless fields in the formatted code data;
S24, dividing code data, namely dividing a longer code file or function into preset code segment sizes;
s25, code data conversion, namely converting the code data into a control flow graph form.
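The preprocessing steps S21–S24 above can be sketched as follows. This is a minimal illustrative sketch: the hash-based de-duplication, the marker list for useless fields and the 16-line segment size are assumptions rather than the patent's actual implementation, and the control flow graph conversion of S25 is left as a stub.

```python
import hashlib
import textwrap

def deduplicate(snippets):
    # S21: keep a single copy of each repeated code fragment (hash-based; illustrative)
    seen, unique = set(), []
    for s in snippets:
        h = hashlib.sha256(s.strip().encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(s)
    return unique

def format_code(snippet):
    # S22: normalize indentation and strip trailing whitespace
    return "\n".join(line.rstrip() for line in textwrap.dedent(snippet).splitlines())

def remove_useless(snippet, useless_markers=("TODO", "DEBUG")):
    # S23: drop lines containing fields deemed useless (marker list is an assumption)
    return "\n".join(l for l in snippet.splitlines()
                     if not any(m in l for m in useless_markers))

def segment(snippet, max_lines=16):
    # S24: split long files/functions into fixed-size code segments
    lines = snippet.splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]

def preprocess(snippets):
    out = []
    for s in deduplicate(snippets):
        s = remove_useless(format_code(s))
        out.extend(segment(s))
    return out  # S25 (conversion to control flow graphs) would follow here
```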
As a further improvement of the present invention, the loss function defined in step S4 is a multi-label cross entropy loss function:
Given code data of N samples and K categories, where each sample may belong to several categories, the true label of the code data of the i-th sample is y_i = (y_{i1}, ..., y_{iK}), where y_{ik} ∈ {0, 1} indicates whether the code data of the sample belongs to the k-th category, while the model's prediction for the code data of the i-th sample is p_i = (p_{i1}, ..., p_{iK}), where p_{ik} is the predicted probability that the code data of the sample belongs to the k-th category:

L = -(1/N) · Σ_{i=1}^{N} Σ_{k=1}^{K} [ y_{ik} · log(p_{ik}) + (1 − y_{ik}) · log(1 − p_{ik}) ]

where log denotes the natural logarithm.
As a further improvement of the present invention, the quantization processing step of step S6 includes:
s61: acquiring a neural network model to be quantized, extracting a convolution layer, calculating the weight gradient of neurons in the convolution layer, and calculating the weight influence of the convolution layer according to the calculated weight gradient;
S62: calculating an activation variance value of the convolution layer and obtaining a quantization bit width of the convolution layer;
S63: determining an optimal quantization bit width of each convolution layer;
S64: applying the determined quantization bit width to each convolution layer and converting the full-precision neural network using the quantization function;
s65: training the quantized neural network model by using a loss function and counting training errors.
As a further improvement of the invention, the weight influence is calculated by the following steps:
The weight gradient dw_i of each neuron is calculated by the back propagation algorithm:

dw_i = ∂Loss / ∂w_i

where Loss is the error function of the full-precision neural network:

Loss = (1/n) · Σ_{i=1}^{n} ( Y(x_i) − Ŷ(x_i) )²

where Y(x_i) is the true value corresponding to input x_i and Ŷ(x_i) is the neural network's predicted value for input x_i. Applying

δ_k = (1/n_k) · Σ_i | dw_i |

(the mean absolute weight gradient over the n_k weights of layer k), the weight influence of each layer is calculated.
As a further improvement of the invention, the activation value variance σ_z of the convolution layer is calculated in the initial stage of training the neural network model and combined with the weight influence δ_k to calculate the precision loss caused by quantizing that layer.
The activation value variance of each layer is obtained from statistics of the activation values during forward propagation, the activation values being computed by the ReLU function:

a = ReLU(x) = max(0, x)

The mean of the activation values is calculated as:

μ = (1/n) · Σ_{i=1}^{n} a_i

and the variance σ_z is:

σ_z = (1/n) · Σ_{i=1}^{n} (a_i − μ)²

The precision loss is calculated using:

Loss_k = δ_k · σ_z / 2^{N_k}

where N_k is the quantization bit width of the k-th layer.
As a further improvement of the present invention, in step S63 a genetic algorithm is adopted to determine the optimal quantization bit width of each layer: the population size and crossover rate are set, quantization bit widths are randomly assigned at initialization, and in each iteration the fitness is calculated from the following formula so as to minimize the comprehensive performance loss index P:

P = Σ_k δ_k · σ_{z,k} / 2^{N_k} + λ · Σ_k N_k

where the first term accumulates the per-layer quantization precision loss and the second term penalizes the total bit width (model size), weighted by λ.
As a further improvement of the present invention, the quantization function of step S64 is:

W_q^{(k)} = Δ_k · round( W^{(k)} / Δ_k )

where W^{(k)} is the full-precision weight of the k-th layer, W_q^{(k)} is the quantized weight, and Δ_k is the step size of each quantization level, determined by the quantization bit width N_k:

Δ_k = ( max(W^{(k)}) − min(W^{(k)}) ) / ( 2^{N_k} − 1 )
As a further improvement of the present invention, the neural network model deployed on the GPU includes, when analyzing the source code:
Converting the source code into a control flow graph and inputting it into the neural network model for predictive analysis, which outputs a run result and multi-category classification results so as to evaluate the quality of the source code, wherein the run result comprises run or do-not-run, and the multi-category classification results comprise code quality classification, function classification and vulnerability type classification;
setting an evaluation index of the code quality when classifying the code quality, and configuring the priority of the evaluation index, wherein the evaluation index comprises: readability, maintainability, performance, safety and reliability.
As a further improvement of the present invention, the evaluation of security includes a boundary check, and the judgment method of the boundary check includes:
a1: checking the input data range, and checking the data type and length of the input data;
A2: checking the array boundary: when running code that uses an array, detecting whether the subscript value lies within the valid range of the array; if the subscript exceeds the valid range, an out-of-bounds array access is determined;
A3: checking the pointer reference, judging whether the memory space pointed by the pointer is legal or not, and judging as an attack program if the pointer is empty or points to the illegal memory space;
A4: checking file operation, judging whether the file name is legal or not, checking the authority, and judging as an invalid program if the file name is illegal or the authority is insufficient;
A5: and checking network communication, judging whether the communication protocol and the port accord with the specification, and judging as an invalid program if the communication protocol or the port has loopholes.
The invention has the beneficial effects that:
By training the neural network model and applying mixed-precision quantization to it, the storage requirement and computational complexity of the model are reduced, so that the whole model runs more efficiently in environments with limited computing resources; because the computational load is reduced, the quantized model can perform inference tasks while consuming only a small amount of energy. After the model is deployed on the GPU, the source code is pre-run, evaluation and judgment are carried out on the multi-category classification results of the pre-run, and in particular boundary checking is performed to judge the security of the code, achieving intelligent analysis of the source code.
Drawings
FIG. 1 is a diagram of a system architecture for constructing a neural network model;
FIG. 2 is a system flow diagram of a quantized neural network model;
FIG. 3 is a system flow diagram of analysis of source code.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without creative efforts, based on the described embodiments of the present invention fall within the protection scope of the present invention.
Unless defined otherwise, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this invention belongs. The terms "first," "second," and the like, as used herein, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.
In order to keep the following description of the embodiments of the present invention clear and concise, the detailed description of known functions and known components thereof have been omitted.
Referring to figs. 1 to 3, a specific implementation of the source code intelligent analysis engine based on a Xinchuang rule base according to the present invention comprises a constructed neural network model and a code distributor. The code distributor obtains the source code for analysis and distributes it, and the neural network model is deployed to a GPU to pre-run the received distributed source code so as to obtain analysis data of the source code.
The method for constructing the neural network model comprises the following steps:
S1, collecting code data for training a model and converting the code data into unstructured code data;
S2, preprocessing data;
s3, constructing a neural network model;
S4, defining a loss function of the neural network model;
S5, training a neural network model;
s6, carrying out quantization processing on the neural network model;
and S7, deploying a neural network model.
In step S2, the process of preprocessing data includes:
S21, code de-duplication: identifying repeated code segments in the converted code data and deleting the redundant copies so that a single copy of each segment is retained, reducing the size of the data set;
S22, formatting code data, and formatting the code after de-duplication so as to better understand and analyze the code data;
S23, deleting useless codes, deleting useless fields in the formatted code data, and further reducing the size of a data set;
S24, dividing code data, namely dividing a longer code file or function into preset code segment sizes so as to improve the readability of a data set and better capture local features in the code;
S25, code data conversion is carried out, and the code data is converted into a control flow graph form so as to facilitate training of the neural network model.
Because the intelligent-analysis neural network model is applied as an engine to source code, the preprocessed code data carries class labels of many different types, including labels for variables, functions, classes, comments, statements, macros, constants and the like. A multi-label cross entropy loss function is therefore chosen when defining the loss function. Given code data of N samples and K categories, where each sample may belong to several categories, the true label of the code data of the i-th sample is y_i = (y_{i1}, ..., y_{iK}), where y_{ik} ∈ {0, 1} indicates whether the code data of the sample belongs to the k-th category, while the model's prediction for the code data of the i-th sample is p_i = (p_{i1}, ..., p_{iK}), where p_{ik} is the predicted probability that the code data of the sample belongs to the k-th category. The multi-label cross entropy loss is calculated as:

L = -(1/N) · Σ_{i=1}^{N} Σ_{k=1}^{K} [ y_{ik} · log(p_{ik}) + (1 − y_{ik}) · log(1 − p_{ik}) ]

where log denotes the natural logarithm. The prediction of each sample's code data for each category is compared with the true label: if the code data of the sample belongs to a category, the natural logarithm of the corresponding predicted probability contributes to the loss; otherwise the natural logarithm of the complement of that category's probability contributes. Finally the losses of all samples are summed and divided by the number of samples N to obtain the average loss. The constructed multi-label cross entropy loss function can be used to train the model and effectively handles the multi-label classification problem.
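The average multi-label cross entropy described above can be computed as in the following sketch (the clipping constant `eps` is an added numerical-stability assumption, not part of the patent's formula):

```python
import numpy as np

def multilabel_cross_entropy(y_true, y_pred, eps=1e-12):
    # mean multi-label binary cross entropy over N samples and K categories;
    # y_true holds 0/1 labels, y_pred holds per-category probabilities
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # guard against log(0)
    per_sample = -(y_true * np.log(y_pred)
                   + (1.0 - y_true) * np.log(1.0 - y_pred)).sum(axis=1)
    return per_sample.mean()
```

For a single sample with y = (1, 0) and p = (0.9, 0.1), each category contributes log(0.9), so the loss is −2·log(0.9) ≈ 0.211.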
Because the full training model is large, the demand for computing resources increases, so the neural network model must be quantized to reduce its scale and storage requirements and make it suitable for resource-constrained training environments. The quantization processing of step S6 comprises:
s61: the method comprises the steps of obtaining a neural network model to be quantized, extracting a convolution layer, calculating the weight gradient of neurons in the convolution layer, calculating the weight influence of the convolution layer according to the calculated weight gradient, and analyzing the weight influence, wherein layers with larger influence on performance can be preferentially considered in quantization, more information is reserved, and more aggressive quantization is carried out on layers with smaller influence at the same time, so that the overall size of the model is reduced, and the calculation speed is improved. For each layer in the neural network model, a weight influence δk is calculated, and a weight gradient dwi of each neuron is calculated by a back propagation algorithm, where the formula is:
Wherein, And expressing a weight value, wherein Loss is an error function of the full-precision neural network, and the formula is as follows:
Wherein, For input/>Corresponding true value,/>Input/>, for neural network pairApplying the formula to the predicted value of (2)The weight influence of the first layer is calculated.
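The per-layer weight influence δ_k can be illustrated with a toy single linear layer. The reduction of the per-weight gradients dw_i to one scalar per layer is reconstructed here as a mean absolute gradient, which is an assumption rather than the patent's exact formula:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    # Loss = (1/n) * sum_i (Y(x_i) - Yhat(x_i))^2
    return np.mean((y_true - y_pred) ** 2)

def layer_weight_influence(grad_w):
    # delta_k: per-layer reduction of the gradients dw_i = dLoss/dw_i;
    # mean absolute gradient is an assumed reduction, not confirmed by the patent
    return np.abs(grad_w).mean()

# toy single linear layer y_pred = x @ w, so dLoss/dw = (2/n) * x.T @ (y_pred - y_true)
x = np.array([[1.0, 2.0], [3.0, 4.0]])
w = np.array([0.5, -0.25])
y_true = np.array([1.0, 2.0])
y_pred = x @ w
grad_w = (2.0 / len(x)) * x.T @ (y_pred - y_true)
delta_1 = layer_weight_influence(grad_w)
```

Layers with larger δ_k would then be assigned wider bit widths in the mixed-precision scheme.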
S62: in the initial stage of model training, calculating the variance sigma z of the activation value of the first convolution layer, and assuming that the variance of the activation value of the layer is 0.025, combining the weight influence forceThe loss of precision caused by quantization of this layer is calculated assuming an initial quantization bit width of 8 bits.
The activation value variance of the first layer is obtained according to the activation value statistics in the forward propagation process, the activation value of the activation function is calculated by Relu functions, and the formula is:
x represents the input vector from the upper layer neural network, calculates the mean of the activation values:
the variance σ z formula is:
Wherein, Representing the activation value of the i-th layer, the loss of accuracy is calculated using the following equation:
Where N k is the quantization bit width of the k-th layer.
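The S62 statistics can be sketched as follows: ReLU activations are collected, their variance σ_z is computed, and a per-layer precision loss combining δ_k, σ_z and the bit width N_k is evaluated. The exponential form δ_k·σ_z/2^{N_k} is a reconstruction of the lost formula, an assumption rather than a confirmed expression:

```python
import numpy as np

def relu(x):
    # a = max(0, x)
    return np.maximum(0.0, x)

def activation_variance(acts):
    # sigma_z = (1/n) * sum_i (a_i - mu)^2, over activations collected in forward passes
    mu = acts.mean()
    return ((acts - mu) ** 2).mean()

def quant_precision_loss(delta_k, sigma_z, n_bits):
    # reconstructed per-layer loss: delta_k * sigma_z / 2^N_k (form is an assumption)
    return delta_k * sigma_z / (2 ** n_bits)
```

For the text's example values (σ_z = 0.025, N_k = 8 bits), the loss evaluates to δ_1 · 0.025 / 256.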
S63: determining optimal quantization bit width of each layer by adopting a genetic algorithm, setting population size as 50, crossover rate as 0.7 and mutation rate as 0.1, wherein each individual represents a combination of 8-layer quantization bit widths, and randomly distributing quantization bit widths between 8 and 2 bits, such as {8,8,6,6,4,4,2,2}, in each iteration according to the formula
P=+/>
The fitness is calculated, and a group of optimal quantized bit width combinations {8,8,6,6,4,4,3,3} are found by iteration 100, so that the comprehensive performance loss index P is minimized.
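The genetic search of step S63 can be sketched as follows, using the population size 50, crossover rate 0.7 and mutation rate 0.1 given in the text. The fitness combines accumulated precision loss with a bit-width penalty whose weight λ is an assumed hyperparameter, and the selection/crossover details are illustrative choices:

```python
import random

LAYERS = 8
BITS = (8, 6, 4, 3, 2)  # candidate quantization bit widths per layer
LAMBDA = 0.05           # weight of the model-size penalty (assumed hyperparameter)

def fitness(widths, delta, sigma):
    # P = sum_k delta_k * sigma_k / 2^N_k + lambda * sum_k N_k (reconstructed form)
    acc_loss = sum(d * s / 2 ** n for d, s, n in zip(delta, sigma, widths))
    return acc_loss + LAMBDA * sum(widths)

def search(delta, sigma, pop_size=50, cross_rate=0.7, mut_rate=0.1, iters=100, seed=0):
    rng = random.Random(seed)
    pop = [[rng.choice(BITS) for _ in range(LAYERS)] for _ in range(pop_size)]
    for _ in range(iters):
        pop.sort(key=lambda w: fitness(w, delta, sigma))  # rank by loss index P
        survivors = pop[: pop_size // 2]                  # keep the fitter half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, LAYERS) if rng.random() < cross_rate else 0
            child = a[:cut] + b[cut:]                     # one-point crossover
            if rng.random() < mut_rate:
                child[rng.randrange(LAYERS)] = rng.choice(BITS)
            children.append(child)
        pop = survivors + children
    return min(pop, key=lambda w: fitness(w, delta, sigma))
```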
S64: applying the determined quantized bit widths {8,8,6,6,4,4,3,3} to each layer of the neural network model, converting the full-precision neural network into a full-precision neural network using the quantization functionFor example, for the first layer, if the optimal quantization bit width determined by the genetic algorithm is 8 bits, the weight of the layer is quantized to 8 bits, and the quantization function formula is:
Wherein, Is the full precision weight of the k-th layer,/>Is the quantized weight,/>Is the step size of each quantization level, defined by quantization bit width/>Decision,/>The calculation formula of (2) is as follows:
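Step S64's per-layer conversion can be sketched as uniform quantization with a step size derived from the bit width. The min–max scaling and round-to-nearest scheme are common choices assumed here, not confirmed by the patent text:

```python
import numpy as np

def quantize_layer(w, n_bits):
    # W_q = Delta * round(W / Delta), Delta = (max W - min W) / (2^N - 1);
    # min-max scaling and round-to-nearest are assumed choices
    delta = (w.max() - w.min()) / (2 ** n_bits - 1)
    if delta == 0.0:
        return w.copy()  # constant weights: nothing to quantize
    return delta * np.round(w / delta)
```

At 8 bits the step size for weights in [−1, 1] is 2/255, so the rounding error per weight is at most 1/255.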
S65: the quantized neural network is trained using a cross entropy loss function and a training error is calculated. Retraining the quantized neural network model, using a cross entropy loss function, and counting training errors. The batch size was set to 128, the learning rate initial value was 0.01, and the decay factor was 0.9. After 5 training cycles, if the performance on the validation set is not improved, the learning rate is multiplied by the decay factor, and the cross entropy loss function formula is:
s66: after 10 training periods, the quantization network achieves the accuracy rate equivalent to the full-precision model on the verification set, and the effectiveness of the quantization method is proved.
By mixed-precision quantization of the neural network model, its storage requirement and computational complexity are reduced, so that it runs more efficiently in environments with limited computing resources; because the computational load is reduced, the quantized neural network model can perform inference tasks while consuming only a small amount of energy. By adopting accurate weight-influence evaluation and a regularization strategy, the invention maintains the model's precision before and after quantization and reduces performance loss.
After the neural network model is trained, it is deployed on a GPU. Source code to be analyzed is converted into a control flow graph and input to the neural network model for predictive analysis, which outputs a run result and multi-category classification results. The run result is either run or do-not-run; by pre-running the source code its quality can be judged, and the performance of the neural network model can be judged during the pre-run. When it is judged that the model needs optimization, the neural network model is adjusted and optimized by adjusting the data set and tuning the hyperparameters.
The multi-category classification result includes: the code quality classification, the function classification and the vulnerability type classification can identify the quality of the code by classifying the quality of the code, classify the code by identifying the function use of the code, and identify whether the code has memory leakage, security vulnerability and performance problems by identifying the error and vulnerability of the code.
When classifying code quality, evaluation indices of code quality are set, comprising: readability, maintainability, performance, security and reliability. The assessment of readability covers code naming conventions, comments and code structure; maintainability covers code structure, modularization and interface design; performance covers algorithmic complexity and data-structure selection; security covers input validation, boundary checking and error handling; reliability covers boundary handling, exception handling and error recovery. The priorities of the evaluation indices are configured, from high to low: security, reliability, performance, readability, maintainability.
The method for checking whether the code performs sufficient boundary checking when processing input data, so as to evaluate the security of the code, comprises the following steps:
a1: checking the input data range, checking the data type and length of the input data, and ensuring that the input data meets the requirements;
A2: checking the array boundary: when running code that uses an array, detecting whether the subscript value lies within the valid range of the array; if the subscript exceeds the valid range, an out-of-bounds array access is determined, which easily causes a program crash or exploitation by an attack;
a3: checking the pointer reference, judging whether the memory space pointed by the pointer is legal or not, if the pointer is empty or points to the illegal memory space, judging that the program is attacked, and easily causing the program to crash or be attacked and utilized;
A4: checking file operation, judging whether the file name is legal or not, checking the authority, and if the file name is illegal or the authority is insufficient, judging that the file name is an invalid program, and easily causing read-write failure or being utilized by attack;
A5: and checking network communication, judging whether the communication protocol and the port accord with the specification, and if the communication protocol or the port has loopholes, judging that the communication protocol or the port is an invalid program, and easily causing the communication channel to be utilized by attack.
Wherein step A3 further comprises:
A31: confirming whether the pointer is empty or not, setting a judgment statement to check when judging, and returning an error code when the pointer is empty;
A32: if the pointer is not empty, checking whether the memory space it points to is legal by dereferencing the pointer to obtain the value at the memory address; if the dereference succeeds, no memory-access exception occurred, the memory address is legal, and the function returns;
if a memory-access exception occurs during the dereference, the memory address pointed to by the pointer is invalid or unassigned, and the function returns;
A33: in pointer reference checking, memory allocation boundaries are also considered synchronously, and boundary checking is performed by comparing the subscript value with the array length:
obtaining the array length;
obtaining the subscript of the element to be accessed, i.e. the subscript value corresponding to the element's position;
comparing the subscript value with the array length:
if the subscript value is smaller than 0, the access position is outside the left boundary of the array, indicating an out-of-bounds error;
if the subscript value is greater than or equal to the array length, the access position is outside the right boundary of the array, likewise an out-of-bounds error;
if the subscript value is greater than or equal to zero and smaller than the array length, the subscript is legal, and the element in the array is accessed and read;
processing the out-of-range error, throwing out the abnormality and recording a log;
A34: when the pointer reference check is found to fail, recording relevant error information into a log, wherein the error information comprises: the function name, the pointer variable name and the address pointed by the pointer are used for selecting the error level according to the error type and recording the error level in the log so as to facilitate the subsequent investigation and analysis, attempting to restore the normal operation of the program, restoring the pointer by attempting to recover the invalid pointer or reconstructing the memory, and if the normal operation of the program cannot be restored, throwing out the abnormality and recording the corresponding error information in the log.
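The A33 subscript check and the A34 error logging above can be sketched as follows; the function names and the re-raise-after-logging policy are illustrative assumptions:

```python
import logging

logger = logging.getLogger("boundary-check")

def check_array_access(arr, index):
    # A33: compare the subscript value with the array length
    n = len(arr)                      # obtain the array length
    if index < 0:                     # outside the left boundary
        raise IndexError(f"subscript {index} is left of the array boundary")
    if index >= n:                    # outside the right boundary
        raise IndexError(f"subscript {index} >= array length {n}")
    return arr[index]                 # legal subscript: read the element

def safe_read(arr, index):
    # A34: on failure, record the error context in the log, then re-raise
    try:
        return check_array_access(arr, index)
    except IndexError:
        logger.error("bounds check failed for index %d on array of length %d",
                     index, len(arr))
        raise
```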
By training the neural network model and applying mixed-precision quantization to it, the storage requirement and computational complexity of the model are reduced, so that the whole model runs more efficiently in environments with limited computing resources; because the computational load is reduced, the quantized model can perform inference tasks while consuming only a small amount of energy. After the model is deployed on the GPU, the source code is pre-run, evaluation and judgment are carried out on the multi-category classification results of the pre-run, and in particular boundary checking is performed to judge the security of the code, achieving intelligent analysis of the source code.
Furthermore, although exemplary embodiments have been described in the present disclosure, the scope thereof includes any and all embodiments having equivalent elements, modifications, omissions, combinations (e.g., of the various embodiments across), adaptations or alterations as would be appreciated by those in the art. The elements in the claims are to be construed broadly based on the language employed in the claims and are not limited to examples described in the present specification or during the practice of the application, which examples are to be construed as non-exclusive. It is intended, therefore, that the specification and examples be considered as exemplary only, with a true scope and spirit being indicated by the following claims and their full scope of equivalents.
The above description is intended to be illustrative and not restrictive. For example, the above-described examples (or one or more aspects thereof) may be used in combination with each other, and other embodiments may be devised by those of ordinary skill in the art upon reading the above description. In addition, in the above detailed description, various features may be grouped together to streamline the disclosure. This should not be interpreted as an intention that an unclaimed disclosed feature is essential to any claim. Rather, inventive subject matter may lie in less than all features of a particular disclosed embodiment. Thus, the following claims are hereby incorporated into the detailed description as examples or embodiments, with each claim standing on its own as a separate embodiment, and it is contemplated that these embodiments may be combined with one another in various combinations or permutations. The scope of the invention should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
The above embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, the scope of which is defined by the claims. Various modifications and equivalent arrangements of this invention will occur to those skilled in the art, and are intended to be within the spirit and scope of the invention.

Claims (8)

1. A source code intelligent analysis engine based on an information creation rule base, comprising a neural network model and a code distributor, the neural network model being used for training on and analysis of source code, and the code distributor being used for acquiring and distributing the source code to be analyzed, characterized in that:
The neural network model construction method comprises the following steps:
S1, collecting code data for training a model and converting the code data into unstructured code data;
s2, preprocessing code data;
s3, constructing a neural network model;
S4, defining a loss function of the neural network model;
S5, training a neural network model;
s6, carrying out quantization processing on the neural network model;
After training, the neural network model is deployed on a GPU to transcode and pre-execute the source code distributed by the code distributor, so as to analyze the source code;
When analyzing the source code, the neural network model deployed on the GPU performs the following:
converting the source code into a control flow graph, inputting the control flow graph into the neural network model for predictive analysis, and outputting through the predictive analysis a run result and multi-class classification results to evaluate the quality of the source code, wherein the run result comprises run or do-not-run, and the multi-class classification results comprise code quality classification, function classification, and vulnerability type classification;
setting an evaluation index of the code quality when classifying the code quality, and configuring the priority of the evaluation index, wherein the evaluation index comprises: readability, maintainability, performance, safety, and reliability;
The evaluation of the security comprises boundary checking, and the judging method of the boundary checking comprises the following steps:
a1: checking the input data range, and checking the data type and length of the input data;
A2: checking array boundaries: when running code that uses an array, detecting whether the subscript value is within the valid subscript range, and if it exceeds the valid range of the array, judging that the array access is out of bounds;
A3: checking the pointer reference, judging whether the memory space pointed by the pointer is legal or not, and judging as an attack program if the pointer is empty or points to the illegal memory space;
A4: checking file operation, judging whether the file name is legal or not, checking the authority, and judging as an invalid program if the file name is illegal or the authority is insufficient;
A5: and checking network communication, judging whether the communication protocol and the port accord with the specification, and judging as an invalid program if the communication protocol or the port has loopholes.
2. The source code intelligent analysis engine based on the information creation rule base according to claim 1, wherein: the preprocessing of the code data in step S2 comprises the following steps:
S21, code de-duplication: identifying repeated code segments in the converted code data and deleting the redundant segments so that a single copy of each code segment is retained;
s22, formatting code data, and formatting the code subjected to duplication removal;
S23, deleting useless codes, and deleting useless fields in the formatted code data;
S24, dividing code data, namely dividing a longer code file or function into preset code segment sizes;
s25, code data conversion, namely converting the code data into a control flow graph form.
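A minimal sketch of preprocessing steps S21, S22/S23, and S24 (Python for illustration; hash-based de-duplication and line-based splitting are assumed implementation choices, not specified by the claim):

```python
import hashlib

def dedupe_segments(segments):
    """S21: keep one copy of each repeated code segment (hash-based identification)."""
    seen, out = set(), []
    for seg in segments:
        digest = hashlib.sha256(seg.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            out.append(seg)
    return out

def normalize(segment):
    """S22/S23: strip trailing whitespace and drop blank lines (a stand-in for full formatting
    and useless-field removal)."""
    return "\n".join(line.rstrip() for line in segment.splitlines() if line.strip())

def split_segments(code, max_lines):
    """S24: split a long file or function into chunks of at most max_lines lines."""
    lines = code.splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]
```

Step S25 (conversion to a control flow graph) is omitted here since it depends on a language-specific parser.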
3. The source code intelligent analysis engine based on the information creation rule base according to claim 2, wherein: the loss function defined in step S4 is a multi-label cross-entropy loss function, the multi-label cross-entropy loss function being:
Let there be code data for $N$ samples and $K$ categories, where each sample may belong to multiple categories. The true label of the code data of the $i$-th sample is denoted $y_i = [y_{i1}, \dots, y_{iK}]$, where $y_{ik}$ is 0 or 1, indicating whether the code data of the sample belongs to the $k$-th category; the model's prediction for the code data of the $i$-th sample is $p_i = [p_{i1}, \dots, p_{iK}]$, where $p_{ik}$ represents the predicted probability that the code data of the sample belongs to the $k$-th category:
$$L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K}\left[\, y_{ik}\log p_{ik} + (1 - y_{ik})\log(1 - p_{ik}) \,\right]$$
where $\log$ denotes the natural logarithm.
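The multi-label cross entropy of claim 3 can be written directly in NumPy; the `eps` clipping below is an added numerical-stability guard, not part of the claim:

```python
import numpy as np

def multilabel_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean multi-label cross entropy over N samples and K categories (natural log),
    with y_true in {0, 1} and y_pred the predicted per-category probabilities."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    per_elem = y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    return float(-per_elem.sum() / y_true.shape[0])
```

For a single sample with true labels `[1, 0]` and predictions `[0.9, 0.1]`, both terms contribute `log(0.9)`, giving a loss of `-2*log(0.9)`.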
4. The source code intelligent analysis engine based on the information creation rule base according to claim 3, wherein: the quantization step of step S6 includes:
s61: acquiring a neural network model to be quantized, extracting a convolution layer, calculating the weight gradient of neurons in the convolution layer, and calculating the weight influence of the convolution layer according to the calculated weight gradient;
S62: calculating an activation variance value of the convolution layer and obtaining a quantization bit width of the convolution layer;
S63: determining an optimal quantization bit width of each convolution layer;
S64: applying the determined quantization bit width to each convolution layer and converting the full-precision neural network using the quantization function;
s65: training the quantized neural network model by using a loss function and counting training errors.
5. The source code intelligent analysis engine based on the information creation rule base according to claim 4, wherein: the weight influence is calculated as follows:
The weight gradient of a neuron is $dw_i$, calculated by the back-propagation algorithm as $dw_i = \partial \mathrm{Loss} / \partial w_i$, where $\mathrm{Loss}$ is the error function of the full-precision neural network, $\mathrm{Loss} = \frac{1}{N}\sum_{j=1}^{N}(y_j - \hat{y}_j)^2$, in which $y_j$ is the true value corresponding to input $x_j$ and $\hat{y}_j$ is the prediction of the neural network for input $x_j$; applying the formula $W = \frac{1}{n}\sum_{i=1}^{n}\lvert dw_i \rvert$, with $n$ the number of weights in the layer, the weight influence of each layer is calculated.
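Assuming a mean-squared error function and mean-absolute-gradient aggregation (the claim's exact formulas are not fully legible in this copy of the document), the two quantities of claim 5 sketch as:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Assumed full-precision error function: mean squared error over the batch."""
    return float(np.mean((y_true - y_pred) ** 2))

def weight_influence(weight_grads):
    """Assumed per-layer influence: mean absolute weight gradient of the layer,
    where weight_grads are the dw_i obtained from back-propagation."""
    return float(np.mean(np.abs(weight_grads)))
```

Layers whose weights receive large gradients on average are assigned a large influence, which claim 6 then combines with the activation variance.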
6. The source code intelligent analysis engine based on the information creation rule base according to claim 5, wherein: in the initial stage of neural network model training, the activation-value variance $\sigma_z$ of a convolution layer is calculated and combined with the weight influence $W$ to calculate the precision loss caused by quantizing that layer;
The activation-value variance of each layer is obtained from statistics on the activation values during forward propagation; the activation values are computed by the ReLU function: $a_i = \max(0, z_i)$;
The mean of the activation values is calculated as: $\mu = \frac{1}{M}\sum_{i=1}^{M} a_i$;
The variance $\sigma_z$ is: $\sigma_z = \frac{1}{M}\sum_{i=1}^{M}(a_i - \mu)^2$;
The precision loss is calculated using: $\mathrm{Loss}_k = \dfrac{W_k\,\sigma_{z,k}}{2^{N_k}}$, where $N_k$ is the quantization bit width of the $k$-th layer.
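A NumPy sketch of the claim-6 quantities, under the assumption that the per-layer precision loss scales the product of weight influence and activation variance by $2^{-N_k}$:

```python
import numpy as np

def relu(z):
    """ReLU activation: a_i = max(0, z_i)."""
    return np.maximum(0.0, z)

def activation_variance(z):
    """sigma_z: variance of the ReLU activations gathered during forward propagation."""
    a = relu(z)
    return float(np.mean((a - a.mean()) ** 2))

def precision_loss(weight_influence, sigma_z, n_bits):
    """Assumed loss model: influence * variance, attenuated by 2**N_k as bit width grows."""
    return weight_influence * sigma_z / (2.0 ** n_bits)
```

Doubling the bit width of a layer halves its assumed precision loss per extra bit, so high-influence, high-variance layers are pushed toward wider quantization.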
7. The source code intelligent analysis engine based on the information creation rule base according to claim 6, wherein: in step S63, a genetic algorithm is used to determine the optimal quantization bit width of each layer; the population size and the crossover rate are set, quantization bit widths are randomly assigned at initialization, and in each iteration the fitness is calculated according to the following formula so as to minimize the comprehensive performance loss index $P$: $P = \alpha \sum_k \mathrm{Loss}_k + \beta \sum_k N_k$, where the first term is the total quantization precision loss and the second term is the total bit-width (model size) cost.
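A toy version of the S63 search. The composite index P here assumes a weighted sum of precision loss and total bit width; the population size, crossover scheme, and mutation rate are illustrative values, not taken from the patent:

```python
import random

def fitness(bit_widths, layer_losses, alpha=1.0, beta=0.05):
    """Hypothetical composite index P: accuracy-loss term plus a bit-budget term."""
    acc = sum(l / (2.0 ** n) for l, n in zip(layer_losses, bit_widths))
    return alpha * acc + beta * sum(bit_widths)

def search_bit_widths(layer_losses, choices=(2, 4, 8), pop=20, gens=30, seed=0):
    """Tiny genetic loop: random init, truncation selection, uniform crossover, mutation."""
    rng = random.Random(seed)
    n_layers = len(layer_losses)
    popn = [[rng.choice(choices) for _ in range(n_layers)] for _ in range(pop)]
    for _ in range(gens):
        popn.sort(key=lambda ind: fitness(ind, layer_losses))
        parents = popn[: pop // 2]          # keep the fitter half (elitism)
        children = []
        while len(children) < pop - len(parents):
            a, b = rng.sample(parents, 2)
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            if rng.random() < 0.1:          # mutation: re-draw one layer's bit width
                child[rng.randrange(n_layers)] = rng.choice(choices)
            children.append(child)
        popn = parents + children
    return min(popn, key=lambda ind: fitness(ind, layer_losses))
```

With these weights, a layer with large precision loss is worth spending bits on, while a low-loss layer is cheaper to quantize aggressively.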
8. The source code intelligent analysis engine based on the information creation rule base according to claim 7, wherein: the quantization function of step S64 is:
$$w_k^{q} = \Delta_k \cdot \mathrm{round}\!\left(\frac{w_k}{\Delta_k}\right)$$
where $w_k$ is the full-precision weight of the $k$-th layer, $w_k^{q}$ is the quantized weight, and $\Delta_k$ is the step size of each quantization level, determined by the quantization bit width $N_k$; $\Delta_k$ is calculated as:
$$\Delta_k = \frac{\max(w_k) - \min(w_k)}{2^{N_k} - 1}$$
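Assuming uniform quantization with the step size derived from the weight range (one common reading of the garbled formula), the claim-8 function sketches as:

```python
import numpy as np

def quantize_layer(weights, n_bits):
    """Uniform quantization: step delta from the bit width N_k, weights snapped to the grid."""
    w_min, w_max = float(weights.min()), float(weights.max())
    delta = (w_max - w_min) / (2 ** n_bits - 1)   # Delta_k determined by N_k
    return delta * np.round(weights / delta), delta
```

With 2 bits over the range [0, 1] the step is 1/3, so 0.5 snaps to the nearest grid point 2/3.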
CN202410183113.3A 2024-02-19 2024-02-19 Source code intelligent analysis engine based on information creation rule base Active CN117742769B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410183113.3A CN117742769B (en) 2024-02-19 2024-02-19 Source code intelligent analysis engine based on information creation rule base

Publications (2)

Publication Number Publication Date
CN117742769A CN117742769A (en) 2024-03-22
CN117742769B true CN117742769B (en) 2024-04-30

Family

ID=90256194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410183113.3A Active CN117742769B (en) 2024-02-19 2024-02-19 Source code intelligent analysis engine based on information creation rule base

Country Status (1)

Country Link
CN (1) CN117742769B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110780878A (en) * 2019-10-25 2020-02-11 湖南大学 Method for carrying out JavaScript type inference based on deep learning
WO2021135497A1 (en) * 2020-01-02 2021-07-08 晶晨半导体(深圳)有限公司 Android-based method and device for same copy of source code to be compatible with client demands
CN116702160A (en) * 2023-08-07 2023-09-05 四川大学 Source code vulnerability detection method based on data dependency enhancement program slice


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Hierarchical Deep Neural Network for Detecting Lines of Codes with Vulnerabilities; Arash Mahyari; IEEE; 2023-03-30; full text *
Research on a Malicious Code Classification Method Based on Assembly-Instruction Word Vectors and Convolutional Neural Networks; Qiao Yanchen; Jiang Qingshan; Gu Liang; Wu Xiaoming; Netinfo Security; 2019-04-10 (No. 04); full text *

Also Published As

Publication number Publication date
CN117742769A (en) 2024-03-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant