CN116361815B

CN116361815B - Code sensitive information and hard coding detection method and device based on machine learning

Info

Publication number: CN116361815B
Application number: CN202310636383.0A
Authority: CN
Inventors: 付杰; 高鹏; 靳岩
Original assignee: Shanghai Biling Technology Co ltd; Beijing Biling Technology Co ltd
Current assignee: Shanghai Biling Technology Co ltd; Beijing Biling Technology Co ltd
Priority date: 2023-06-01
Filing date: 2023-06-01
Publication date: 2023-08-15
Anticipated expiration: 2043-06-01
Also published as: CN116361815A

Abstract

The application provides a machine learning-based method and device for detecting sensitive information and hard coding in code, relating to the technical field of information security. The method comprises: obtaining code samples and code to be detected of a financial supervision system; performing feature extraction on the code samples to obtain a vector representation; training a preset machine learning model on the vector representation to obtain a classifier for identifying the first information; parsing the code to be detected, extracting string constants and their corresponding variable names from the syntax tree, and classifying the parsed code with the classifier to obtain predicted categories for the string constants and variable names; and performing regular expression matching on the predicted categories to obtain a detection result. Because parsing, feature extraction, and classification of the code to be detected are carried out automatically with a machine learning model, detection efficiency and the degree of automation can be greatly improved.

Description

Code sensitive information and hard coding detection method and device based on machine learning

Technical Field

The application relates to the technical field of information security, in particular to a method and a device for detecting code sensitive information and hard codes based on machine learning.

Background

A financial supervision system is a software system used by a financial supervision organization to supervise and manage financial markets and financial organizations, maintain financial stability, and prevent financial risks. These systems involve large amounts of sensitive data, such as financial institution liabilities, risk indicators, violations, and penalties; if such data is modified or deleted, financial supervision may fail. At present, sensitive information and hard coding in the code of financial supervision systems are detected mainly by manual review, which is inefficient.

In view of the shortcomings of the prior art, a need exists for a machine learning-based code sensitive information and hard coding detection method.

Disclosure of Invention

The present application aims to provide a machine learning-based method and device for detecting code sensitive information and hard coding, so as to address the above problem. To achieve this purpose, the application adopts the following technical solution:

In one aspect, the present application provides a method for detecting code sensitive information and hard coding based on machine learning, including:

acquiring a code sample and a code to be detected of a financial supervision system, wherein the code sample comprises first information, and the first information is sensitive information and a hard-coded audit password;

performing feature extraction processing on the code samples to obtain vector representation;

training a preset machine learning mathematical model according to the vector representation to obtain a classifier for identifying the first information;

carrying out grammar analysis processing on the code to be detected, extracting character string constants and corresponding variable names from a grammar tree, and carrying out classification processing on the analyzed code to be detected according to the classifier to obtain predicted categories of the character string constants and the variable names;

and carrying out regular expression matching processing on the predicted category to obtain a detection result, wherein the detection result comprises file name, position and category information.

In another aspect, the application also provides a machine learning-based device for detecting code sensitive information and hard coding, comprising:

an acquisition module, used for acquiring a code sample and a code to be detected of a financial supervision system, wherein the code sample comprises first information, and the first information is sensitive information and a hard-coded audit password;

the extraction module is used for carrying out feature extraction processing on the code samples to obtain vector representation;

the construction module is used for training a preset machine learning mathematical model according to the vector representation to obtain a classifier for identifying the first information;

the classification module is used for carrying out grammar analysis processing on the code to be detected, extracting character string constants and corresponding variable names from a grammar tree, and carrying out classification processing on the analyzed code to be detected according to the classifier to obtain predicted categories of the character string constants and the variable names;

and the matching module is used for carrying out regular expression matching processing on the prediction category to obtain a detection result, wherein the detection result comprises file names, positions and category information.

The beneficial effects of the application are as follows:

According to the application, grammar analysis and feature extraction are carried out automatically on the code to be detected, and the code is classified using a machine learning model, so that detection efficiency and the degree of automation can be greatly improved.

Additional features and advantages of the application will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the embodiments of the application.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a method for detecting code sensitive information and hard codes based on machine learning according to an embodiment of the application;

FIG. 2 is a schematic structural diagram of a machine learning-based code sensitive information and hard coding detection device according to an embodiment of the present application.

Reference numerals in the figures: 1. acquisition module; 2. extraction module; 21. first processing unit; 22. second processing unit; 23. third processing unit; 24. fourth processing unit; 3. construction module; 31. fifth processing unit; 32. first training unit; 33. first evaluation unit; 34. first optimizing unit; 35. first building unit; 4. classification module; 41. sixth processing unit; 42. seventh processing unit; 421. second construction unit; 422. first sorting unit; 423. first extraction unit; 424. first judgment unit; 43. eighth processing unit; 44. ninth processing unit; 5. matching module; 51. first matching unit; 52. second judgment unit; 53. second extraction unit; 54. third judgment unit.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.

Example 1:

This embodiment provides a machine learning-based method for detecting code sensitive information and hard coding.

Referring to FIG. 1, the method includes steps S100, S200, S300, S400, and S500.

Step S100, acquiring a code sample and a code to be detected of the financial supervisory system, wherein the code sample comprises first information, and the first information is sensitive information and a hard-coded audit password.

It will be appreciated that in financial supervisory systems there is some sensitive information, such as user identity information, transaction information, bank account information etc., which needs to be strictly protected. At the same time, some hard-coded audit passwords exist in the system, and the passwords can be used for auditing the security and reliability of the system. In this step, the code sample may be obtained by collecting the code of the financial supervisory system. These codes may include codes developed internally by the financial regulatory agency, as well as codes submitted by the financial institution. The code to be detected refers to the code which needs to detect the sensitive information and the hard-coded audit password, and can be the newly submitted code or the existing code. For example, in a banking financial supervisory system, code samples may include a system management module, a business logic module, a data management module, etc. developed inside the bank; the code to be detected refers to various financial transaction codes submitted by banking clients, loan application codes and the like.

Step S200, performing feature extraction processing on the code samples to obtain a vector representation.

It will be appreciated that in this step, the code samples are converted into a computer processable vector representation in preparation for subsequent machine learning model training and code classification to be detected. It should be noted that step S200 includes step S210, step S220, step S230, and step S240.

Step S210, carrying out N-gram processing on the sensitive elements in the code sample to obtain sensitive element features, wherein the sensitive elements comprise audit passwords, API keys and database credentials.

It will be appreciated that in this step, N-gram processing refers to cutting a text sequence according to a fixed N-tuple length, and then processing and representing the cut N-tuples. In this step, sensitive elements (such as audit passwords, API keys, and database credentials) are cut according to the length of the N-tuple, resulting in a series of N-tuples. Through N-gram processing, the sensitive elements can be represented in a vectorization mode, and the characteristics of the sensitive elements are further extracted.
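The cutting described above can be sketched as follows. This is a minimal illustration, assuming character-level N-tuples with N = 3; the description does not state whether the tuples are built over characters or tokens, and the function names and the sample string are hypothetical.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Cut a string into overlapping character N-tuples of fixed length n."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_counts(text: str, n: int = 3) -> Counter:
    """Count how often each N-tuple occurs; the counts act as sparse features."""
    return Counter(char_ngrams(text, n))


# Hypothetical sensitive element, used only for illustration.
sample = "db_pass=abc123"
print(char_ngrams(sample, 3)[:4])
print(ngram_counts(sample, 3).most_common(2))
```

The counts can be turned into a fixed-length vector by indexing them against a vocabulary of N-tuples collected from all code samples.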

Step S220, performing term frequency-inverse document frequency processing on the technical terms in the code sample to obtain technical term features, wherein the technical terms comprise risk indexes and asset liabilities.

It will be appreciated that in this step, the technical terms are processed using the term frequency-inverse document frequency (TF-IDF) method. First, for each code sample, the terms in it are extracted and their term frequency (TF), i.e., the number of occurrences in that sample, is calculated. Then, for each term, the inverse document frequency (IDF) is calculated, i.e., the logarithm of the total number of samples divided by the number of samples in which the term appears. Finally, the term frequency is multiplied by the inverse document frequency to obtain the TF-IDF value of the term in the sample, which is used as a feature. Extracting features with the TF-IDF method effectively captures the technical term features in the code samples and provides more meaningful features for the subsequent machine learning classifier.
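The calculation above can be sketched as follows, taking the term frequency as the raw occurrence count and the inverse document frequency as log(N / df), where N is the number of samples and df is the number of samples containing the term. The tokenized input format and the function name are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(samples: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: TF-IDF value} mapping per tokenized sample."""
    n_samples = len(samples)
    # Document frequency: the number of samples that contain each term.
    doc_freq = Counter(term for sample in samples for term in set(sample))
    features = []
    for sample in samples:
        term_freq = Counter(sample)  # raw occurrence counts within this sample
        features.append({
            term: count * math.log(n_samples / doc_freq[term])
            for term, count in term_freq.items()
        })
    return features


# Hypothetical tokenized samples.
docs = [["risk", "index", "risk"], ["asset", "index"]]
for row in tf_idf(docs):
    print(row)
```

A term that occurs in every sample receives a value of zero under this formula, which is why terms specific to a few samples dominate the resulting features.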

Step S230, performing word vector processing on the entity institutions in the code samples to obtain entity institution features, wherein the entity institutions comprise financial institutions and supervision institutions.

It will be appreciated that entity institution information is very important in a financial supervisory system, since changes to it may have a significant impact on regulatory decisions. Code samples of a financial supervisory system contain the names and abbreviations of multiple financial institutions and regulatory authorities, such as "People's Bank of China (PBOC)" and "China Securities Regulatory Commission (CSRC)". In this step, these entity institutions are encoded using word vector techniques to obtain entity institution feature vectors. Preferably, word embedding technology is used to train word vectors for the entity institutions, the number of times each entity institution appears in the code samples of the financial supervisory system is then used as its weight, and the feature vector representation of the entity institutions is finally obtained.

Step S240, carrying out fusion processing on the sensitive element features, the technical term features and the entity institution features to obtain comprehensive features, and carrying out normalization processing on the comprehensive features to obtain the vector representation.

It will be appreciated that in this step, preferably, the three feature vectors are combined into a comprehensive feature vector by weighted averaging, and the values of the comprehensive feature vector are then scaled to the range 0 to 1 by min-max normalization. Because the different feature types complement one another, fusing them allows the machine learning model to capture sensitive information and hard-coded audit passwords in the code more comprehensively, which improves the accuracy and robustness of code detection.
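A minimal sketch of the fusion and normalization follows. It assumes the three feature vectors have already been brought to the same length and that the weights are supplied by the caller; the description specifies neither, so both are illustrative assumptions, as are the function names.

```python
def fuse_features(vectors: list[list[float]], weights: list[float]) -> list[float]:
    """Weighted average of equal-length feature vectors."""
    if not vectors or len(vectors) != len(weights):
        raise ValueError("need one weight per feature vector")
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("feature vectors must have the same length")
    total = sum(weights)
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(length)
    ]


def min_max_normalize(vector: list[float]) -> list[float]:
    """Scale the values of a vector linearly into the range [0, 1]."""
    lo, hi = min(vector), max(vector)
    if hi == lo:
        return [0.0] * len(vector)  # constant vector: no spread to scale
    return [(x - lo) / (hi - lo) for x in vector]


# Hypothetical sensitive-element, technical-term and entity-institution features.
fused = fuse_features([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [1, 1, 2])
print(fused, min_max_normalize(fused))
```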

Step S300, training a preset machine learning mathematical model according to the vector representation to obtain a classifier for identifying the first information.

It can be understood that in this step, an efficient and accurate classifier is trained to classify the code to be detected, and its predictions serve as the basis for the subsequent steps. The approach is fast, accurate, and automated, can greatly improve the efficiency and accuracy of code detection, and is suitable for detecting and auditing code sensitive information and hard coding in financial supervision systems and similar fields. It should be noted that step S300 includes step S310, step S320, step S330, step S340, and step S350.

Step S310, dividing the vector representation based on a preset stratified sampling strategy to obtain a data set, wherein the data set comprises a training set and a validation set.

It will be appreciated that in this step, the vector representation is partitioned using a preset stratified sampling strategy. The strategy keeps the distribution of samples across categories balanced between the training set and the validation set, and prevents skewed data in either set from degrading the accuracy of the model. Preferably, the code samples are grouped by the financial institution or regulatory agency to which they belong, so that each institution is represented in approximately the same proportion in the training set and the validation set. Reasonable data set division improves the generalization capability and accuracy of the classifier, so that sensitive information and hard-coded audit passwords in code are better identified.
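The partitioning can be sketched as a per-group split, so that each institution contributes the same share of its samples to both sets. The 20% validation ratio, the fixed seed, and the function name are assumptions for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, validation_ratio=0.2, seed=0):
    """Split samples so each label keeps roughly the same share in both sets."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, validation = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * validation_ratio)
        validation.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return [samples[i] for i in train], [samples[i] for i in validation]


# Hypothetical samples labelled by the institution they belong to.
items = list(range(15))
groups = ["bank_a"] * 10 + ["regulator_b"] * 5
train_set, validation_set = stratified_split(items, groups)
print(len(train_set), len(validation_set))
```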

Step S320, performing supervised training of a preset attention-based bidirectional long short-term memory network model on the data set, and obtaining a preliminary recognition model by capturing terms in the financial supervision field such as risk indexes and asset liabilities.

It will be appreciated that in this step, the attention-based bidirectional long short-term memory network model (Attention-based Bidirectional LSTM Model) is a deep learning model that can adaptively learn the language patterns and features of sensitive information and hard-coded audit passwords, improving its ability to identify such information accurately. The model uses a bidirectional long short-term memory layer to capture the context both before and after each position in a code sequence, and thus the meaning of the code. The model also introduces an attention mechanism that assigns a weight to each word in the code sequence, so that it can better focus on and identify sensitive information and hard-coded audit passwords in the code. Using this model can improve the security and stability of the financial supervisory system.
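The weight distribution performed by the attention mechanism can be illustrated in isolation. The sketch below omits the bidirectional LSTM layer entirely and assumes its per-word outputs are already available as `hidden_states`; the dot-product scoring against a `query` vector is one common choice and is an assumption here, since the description does not specify the scoring function.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states: list[list[float]], query: list[float]):
    """Weight each position by softmax(dot(h_t, query)) and sum the states."""
    scores = [sum(h * q for h, q in zip(state, query)) for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, hidden_states))
        for i in range(dim)
    ]
    return weights, context


# Hypothetical per-word states; the second word scores higher against the query.
weights, context = attention_pool([[0.0, 0.0], [2.0, 0.0]], [1.0, 0.0])
print(weights, context)
```

In a full model the query vector and the LSTM parameters would be learned jointly during the supervised training of this step.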

Step S330, performing evaluation processing on the preliminary recognition model according to a preset evaluation index to obtain an evaluation result.

It will be appreciated that in this step, the evaluation indexes typically include precision, recall, and the F1 value. Preferably, by running predictions on the test set and scoring them, the model's precision, recall, and F1 value can be calculated to determine its recognition performance. In the field of financial supervision, precision is the proportion of violations predicted by the model that are real violations, recall is the proportion of real violations that the model correctly predicts, and the F1 value is the harmonic mean of precision and recall. The evaluation result reflects how accurately the model identifies sensitive information and hard-coded audit passwords in the code of the financial supervisory system, which helps improve the security and stability of the financial supervisory system.
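The three indexes can be computed from the counts of true positives, false positives, and false negatives, as in the sketch below. The first index described above (the share of predicted violations that are real) corresponds to what is usually called precision. The label encoding (1 for a violation) and the function name are assumptions.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: 1 marks a sample containing sensitive information.
print(precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```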

Step S340, performing model optimization processing on the preliminary recognition model according to the evaluation result to obtain an optimized recognition model.

It can be appreciated that in this step, model optimization processing is performed on the preliminary recognition model to improve the accuracy and generalization capability of the model. Model optimization methods include adjusting the hyperparameters of the model, adding training data, and improving the data preprocessing method. Preferably, in this embodiment, training data covering supervisory authorities, financial institutions, risk indexes and similar aspects is added according to the characteristics of the financial supervisory system code, so as to improve the model's ability to recognize sensitive information and hard-coded audit passwords in that code.

Step S350, performing verification processing on the optimized recognition model according to the validation set to obtain a classifier for identifying the sensitive information and the hard-coded audit password.

It will be appreciated that in this step, the optimized model is evaluated on the validation set to determine whether it is over-fitted or under-fitted, and the model is further adjusted and optimized accordingly. Evaluation on the validation set establishes the generalization capability and recognition accuracy of the model. If the model performs well on the validation set, it is used as the classifier for identifying sensitive information and hard-coded audit passwords in the financial supervisory system code.

Step S400, carrying out grammar analysis processing on the code to be detected, extracting character string constants and corresponding variable names from the grammar tree, and carrying out classification processing on the analyzed code to be detected according to the classifier to obtain the predicted categories of the character string constants and the variable names.

It will be appreciated that in this step, the predicted class results of the classifier are divided into two classes of sensitive information and non-sensitive information, where the sensitive information includes audit passwords, API keys, database credentials, and the like. It should be noted that step S400 includes step S410, step S420, step S430, and step S440.

Step S410, carrying out grammar analysis processing on the code to be detected to obtain an abstract syntax tree.

It will be appreciated that in this step, an abstract syntax tree is a tree structure used to represent program code. It abstracts the code into syntax element nodes and the relationships between them, which makes it easier for a program to understand, modify and analyze the code. Preferably, the code to be detected is parsed into an abstract syntax tree of the corresponding programming language by using an existing parser for that language. For example, Python code may be parsed using the ast module to obtain its abstract syntax tree, and Java code may be parsed using the JavaParser library to obtain its abstract syntax tree. Converting the code to be detected into an abstract syntax tree representation facilitates subsequent processing and analysis, and provides a foundation for the subsequent identification of sensitive information and audit passwords.
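For the Python case mentioned above, parsing with the standard-library ast module looks like the following. The source string is a made-up stand-in for the code to be detected, and the credential in it is a placeholder.

```python
import ast

# Hypothetical code to be detected; the password value is a made-up placeholder.
source = 'audit_password = "S3cret!"\nretries = 3\n'

tree = ast.parse(source)  # the root of the abstract syntax tree is an ast.Module

# ast.walk visits every node in the tree; collect the node type names.
node_types = [type(node).__name__ for node in ast.walk(tree)]
print(node_types)

# ast.dump shows the structure of a single statement node.
print(ast.dump(tree.body[0]))
```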

Step S420, performing node traversal processing on the abstract syntax tree to obtain code elements, wherein the code elements comprise character string constants and corresponding variable names.

It will be appreciated that in this step, processing is performed for each node by traversing the various nodes of the abstract syntax tree, from which the eligible code elements are extracted. It should be noted that step S420 includes step S421, step S422, step S423, and step S424.

Step S421, constructing an in-order traversal mathematical model according to the abstract syntax tree, and taking the root node of the abstract syntax tree as an input parameter of the in-order traversal mathematical model to obtain the traversal model.

Preferably, in this step, starting from the root node of the abstract syntax tree, the subtrees are traversed in left-root-right order, with each node represented as a vector in the mathematical model and used as an input parameter of the in-order traversal mathematical model. This step preserves the node order information while allowing the character string constants and corresponding variable names in the code to be detected to be extracted effectively, and provides a basis for the subsequent classification processing.

Step S422, according to the relation among the nodes in the abstract syntax tree, the code nodes in the abstract syntax tree are added to the stack of the traversing model one by one, and the traversing node sequence is obtained.

It will be appreciated that in this step, code nodes in the abstract syntax tree are added to the stack of the traversal model one by one, starting from the root node, according to the relationships between the nodes in the abstract syntax tree. Preferably, a depth-first traversal algorithm is used to traverse the abstract syntax tree from left to right and add each node to the stack of the traversal model until all nodes have been traversed. In this process, the type and attribute of each node are recorded for subsequent feature extraction and classification processing. Through the step, the abstract grammar tree of the code to be detected is converted into a traversal model, and a foundation is provided for subsequent feature extraction and classification processing.

Step S423, taking nodes out of the traversal model according to the node sequence, and sequentially extracting the current node in the traversal to obtain the code nodes to be analyzed.

It can be understood that in this step, according to the principle of first-in last-out of the stack, each node is sequentially taken out from the top of the stack, and whether it is the node to be analyzed is determined. In the case of a node to be analyzed, such as a node containing sensitive information or a hard-coded audit password, relevant information, such as node type, string constant, variable name, etc., is processed and recorded. After the processing is completed, the next node in the node sequence is used as the node to be analyzed to be processed until all the nodes in the node sequence are processed. By the processing mode, the sensitive character string constant and the corresponding variable name in the code to be detected can be obtained, and basic data is provided for the next classification processing.

Step S424, judging the type of the code node to be analyzed, and screening out the character string constants and the corresponding variable names to obtain the code elements.

It will be appreciated that in this step, the type of code node to be analyzed is determined, and code elements associated with the financial supervisory system, such as liabilities, financial institutions, regulatory authorities, etc., are screened out.
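Steps S421 to S424 can be sketched for Python source as a stack-driven walk over the abstract syntax tree. Note one deliberate substitution: left-root-right (in-order) visiting is only well defined for binary trees, whereas syntax tree nodes have an arbitrary number of children, so this sketch uses a pre-order depth-first traversal with an explicit stack instead. The function name and the restriction to simple `name = "string"` assignments are assumptions for illustration.

```python
import ast


def extract_string_assignments(source: str) -> list[tuple[str, str, int]]:
    """Return (variable name, string constant, line number) triples."""
    tree = ast.parse(source)
    stack = [tree]  # start from the root node
    found = []
    while stack:
        node = stack.pop()  # last in, first out
        # Judge the node type: keep assignments whose value is a string constant.
        if (
            isinstance(node, ast.Assign)
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, str)
        ):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    found.append((target.id, node.value.value, node.lineno))
        # Push children in reverse so the leftmost child is popped first.
        stack.extend(reversed(list(ast.iter_child_nodes(node))))
    return found


# Hypothetical code to be detected.
code = 'a = "x"\nb = 1\ndef f():\n    token = "abc"\n'
print(extract_string_assignments(code))
```

Recording the line number alongside each element keeps the position information that the final detection result needs.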

Step S430, performing feature conversion processing on the code elements to obtain input features.

It will be appreciated that in this step, each code element is encoded to be converted into a vector form, which is used as an input feature for the classifier. Preferably, the string constants and variable names in the code elements are encoded separately and converted into vector form. The string constant may be encoded as its corresponding ASCII code value, while the variable name may be encoded as its location information present in the code, etc. By the coding processing mode, code elements can be converted into vector forms which can be processed by a classifier, and basic data is provided for the next classification processing.
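The encoding described above can be sketched as follows. The fixed length of 16, the zero padding, and the use of line and column numbers as the position information are assumptions; the description only states that string constants are encoded as code values and variable names by where they occur.

```python
def encode_string_constant(value: str, length: int = 16) -> list[int]:
    """Encode a string as a fixed-length list of character codes, zero-padded."""
    # ord() returns the Unicode code point, which equals the ASCII value
    # for ASCII characters.
    codes = [ord(ch) for ch in value[:length]]
    return codes + [0] * (length - len(codes))


def encode_element(value: str, line: int, column: int, length: int = 16) -> list[int]:
    """Concatenate the string codes with the variable's position in the file."""
    return encode_string_constant(value, length) + [line, column]


# Hypothetical code element: a string constant assigned at line 3, column 0.
print(encode_element("AB", 3, 0, length=4))
```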

Step S440, classifying the input features according to the classifier to obtain the predicted category of the code element.

It will be appreciated that in this step, the input features are fed to the classifier, and classification is performed by the prediction function of the model. The prediction function maps the input features to a predicted category and also outputs a confidence for the prediction. The confidence indicates how reliable the prediction is, and the prediction is taken as the predicted category of the code element. Preferably, the classifier may classify financial transaction code elements according to the training data, determining whether the transaction meets regulatory requirements, whether there is a potential risk of violation, and the like. For example, when classifying transaction codes, a classifier such as a support vector machine (SVM) is trained, and the transaction code elements are classified by the SVM to obtain their predicted categories. In this way, a large number of transaction codes can be classified automatically, supervision efficiency is improved, and potential illegal transaction behaviors are found.

Step S500, carrying out regular expression matching processing on the predicted category to obtain a detection result, wherein the detection result comprises file name, position and category information.

It can be understood that in this step, by matching the regular expressions of the prediction types, the security vulnerability types existing in the codes can be quickly and accurately determined, and the information such as the file name, the code position, the vulnerability types and the like is recorded, so that a final detection result report is generated. It should be noted that step S500 includes step S510, step S520, step S530, and step S540.

Step S510, matching the character string constant with a preset regular expression matching mathematical model according to the character string constant in the prediction category, and performing pattern matching processing on the character string constant to obtain a character string constant set related to the first information.

It will be appreciated that in supervisory systems in the financial field, the string constants in the prediction category may involve sensitive information such as bank account numbers, security codes, and transaction amounts, so pattern matching is required. This step identifies the set of string constants related to the first information by comparing the string constants against a preset regular expression matching mathematical model. Preferably, a set of regular expression matching models is defined; for example, a pattern such as "[0-9]{16,19}" can be used to match bank account numbers, and a pattern such as "[0-9A-Z]{6}" can be used to match security codes. By performing pattern matching on the string constants, sensitive information related to the first information can be identified quickly and accurately, providing effective support for subsequent supervision and processing.
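The two example patterns can be applied with Python's re module as shown below. The category names are hypothetical labels, and requiring the whole string constant to match (rather than any substring) is an assumption chosen to limit false positives.

```python
import re

# Patterns taken from the description; the category names are hypothetical.
PATTERNS = {
    "bank_account": re.compile(r"[0-9]{16,19}"),
    "security_code": re.compile(r"[0-9A-Z]{6}"),
}


def match_categories(constant: str) -> list[str]:
    """Return every category whose pattern matches the whole string constant."""
    return [
        name for name, pattern in PATTERNS.items()
        if pattern.fullmatch(constant)
    ]


# Hypothetical string constants extracted from the code to be detected.
for value in ["1111222233334444", "A1B2C3", "hello"]:
    print(value, match_categories(value))
```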

Step S520, comparing the variable names in the prediction category with the first information in the financial supervisory system, and judging whether the variable names are the same as the variable names related to the first information, to obtain a variable name set related to the first information.

It will be appreciated that in financial supervisory systems there is some sensitive information, such as user identity information, transaction information, bank account information etc., which needs to be strictly protected. At the same time, some hard-coded audit passwords exist in the system, and the passwords can be used for auditing the security and reliability of the system. In the step, by comparing the variable names in the prediction category with the first information in the financial supervisory system, whether the variable names are the same as the variable names related to the first information or not is judged, and a variable name set related to the first information is obtained, so that potential risks and loopholes in the system can be found in time, and the safety and stability of the system are ensured.

Step S530: performing syntactic analysis on the string constant set and the variable name set, and extracting the context information of the string constants and the variable names in the code to obtain input data.

It will be appreciated that a syntactic analysis algorithm can decompose the code into smaller grammar elements, such as phrases and sentence components, and then analyze the relationships and structures between these elements to obtain more comprehensive and accurate context information. In this step, the syntactic analysis algorithm helps identify complex syntactic structures in the code, such as loops and conditional branches, so that the semantic information of the string constants and variable names can be better grasped. Code auditing in the financial field may encounter specific code structures, such as complex conditional statements in transaction processing. With a syntactic analysis algorithm, the meaning and logic of these statements can be better understood and the context information of the string constants and variable names can be further determined, so that code auditing and security detection can be carried out more effectively.
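The context extraction can be sketched as follows, assuming for illustration that the code to be detected is Python and using the standard `ast` module. The source does not name a target language or parser; `CONTEXT_TYPES` and `string_contexts` are hypothetical names. Each string constant is returned together with the chain of enclosing functions, loops and conditional branches.

```python
import ast

# Structures treated as "context" in this sketch (an assumption).
CONTEXT_TYPES = (ast.FunctionDef, ast.If, ast.For, ast.While)


def string_contexts(source):
    """Return (string constant, line number, enclosing structure chain) tuples."""
    tree = ast.parse(source)
    results = []

    def visit(node, chain):
        # Extend the chain whenever we enter a function, branch or loop.
        if isinstance(node, CONTEXT_TYPES):
            chain = chain + [type(node).__name__]
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            results.append((node.value, node.lineno, chain))
        for child in ast.iter_child_nodes(node):
            visit(child, chain)

    visit(tree, [])
    return results
```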

Step S540: performing semantic analysis on the string constants and the variable names in the input data, judging whether they are associated with the first information, and classifying them, according to the judging result, as either related or unrelated to the first information to obtain a detection result.

It will be appreciated that the first information is sensitive information and hard-coded audit passwords, so the string constants and variable names associated with it have a certain specificity and restriction. By performing semantic analysis on the input data, the string constants and variable names related to the first information can be effectively identified and separated from those unrelated to it, so that the final detection result is obtained. As a preferred example, suppose it is necessary to detect whether a piece of code carries a risk of leaking information associated with a bank account. In this step, the variable names and string constants in the code are vectorized using natural language processing techniques and deep learning algorithms, and their similarity to information related to the bank account is then calculated. If the similarity is above a set threshold, the variable name or string constant is considered to be associated with the bank account and is marked as related to the first information. In this way, information leakage risks in the code can be effectively detected and the security of sensitive information such as bank accounts is protected.
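The threshold test can be illustrated with a deliberately simple stand-in. Instead of the deep learning vectorization described above, this sketch uses character trigram counts and cosine similarity; the function names, the reference terms and the threshold value of 0.5 are all assumptions made for illustration.

```python
import math
from collections import Counter


def trigram_vector(text):
    """Bag of character trigrams, padded so word boundaries are represented."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def is_related(name, reference_terms, threshold=0.5):
    """Mark a name as related when its best similarity reaches the threshold."""
    vec = trigram_vector(name)
    return max(cosine(vec, trigram_vector(t)) for t in reference_terms) >= threshold
```

A learned embedding would also capture synonyms that share no characters, which this character-level substitute cannot do.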

Example 2:

As shown in Fig. 2, this embodiment provides a device for detecting code sensitive information and hard coding based on machine learning, the device comprising:

The acquisition module 1 is configured to acquire a code sample and a code to be detected of the financial supervisory system, wherein the code sample comprises first information, and the first information is sensitive information and a hard-coded audit password.

The extraction module 2 is configured to perform feature extraction processing on the code samples to obtain a vector representation.

The construction module 3 is configured to train a preset machine learning mathematical model according to the vector representation to obtain a classifier for identifying the first information.

The classification module 4 is configured to perform grammar analysis processing on the code to be detected, extract the string constants and the corresponding variable names from the syntax tree, and classify the analyzed code to be detected according to the classifier to obtain the prediction category of the string constants and the variable names.

The matching module 5 is configured to perform regular expression matching processing on the prediction category to obtain a detection result, wherein the detection result comprises file name, position and category information.
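A detection result carrying file name, position and category information could be represented as a small record type. The field names and the `to_report` helper below are hypothetical, and representing the position as a line and column pair is an assumption.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Detection:
    file_name: str   # file in which the element was found
    line: int        # position: 1-based line number
    column: int      # position: 0-based column offset
    category: str    # e.g. "bank_account" or "audit_password"


def to_report(detections):
    """Return detections as dictionaries ordered by file, then by position."""
    ordered = sorted(detections, key=lambda d: (d.file_name, d.line, d.column))
    return [asdict(d) for d in ordered]
```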

In one embodiment of the present disclosure, the extraction module 2 includes:

The first processing unit 21 is configured to perform N-gram processing on the sensitive elements in the code sample to obtain sensitive element features, where the sensitive elements include audit passwords, API keys, and database credentials.

The second processing unit 22 is configured to perform term frequency-inverse document frequency (TF-IDF) processing on the terms in the code sample to obtain term features, where the terms include risk indexes and asset liabilities.

The third processing unit 23 is configured to perform word vector processing on the entity institutions in the code samples to obtain entity institution features, where the entity institutions include financial institutions and supervision institutions.

The fourth processing unit 24 is configured to perform fusion processing on the sensitive element features, the term features and the entity institution features to obtain comprehensive features, and perform normalization processing on the comprehensive features to obtain the vector representation.
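The feature pipeline of these units can be sketched in plain Python. This is a minimal sketch under several assumptions: character trigrams for the N-gram step, a smoothed IDF formula, fusion by concatenation, and L2 normalization. The source does not specify any of these choices, and the word vector step is assumed to supply a ready-made list of floats.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count the character n-grams of a sensitive element."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def tfidf(docs, vocabulary):
    """TF-IDF of each vocabulary term for each tokenized document (smoothed IDF)."""
    n_docs = len(docs)
    idf = {
        t: math.log((1 + n_docs) / (1 + sum(1 for d in docs if t in d))) + 1
        for t in vocabulary
    }
    rows = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        rows.append([(counts[t] / total) * idf[t] for t in vocabulary])
    return rows


def fuse_and_normalize(*feature_groups):
    """Concatenate the feature groups and scale the result to unit L2 norm."""
    fused = [x for group in feature_groups for x in group]
    norm = math.sqrt(sum(x * x for x in fused))
    return [x / norm for x in fused] if norm else fused
```

Normalizing after fusion keeps any one feature group from dominating purely because of its scale.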

In one embodiment of the present disclosure, the build module 3 includes:

The fifth processing unit 31 is configured to divide the vector representation based on a preset stratified sampling strategy to obtain a data set, where the data set includes a training set and a verification set.

The first training unit 32 is configured to perform supervised learning training on a preset attention-mechanism bidirectional long short-term memory (BiLSTM) network model according to the data set, and obtain a preliminary recognition model by capturing terms such as risk indexes and asset liabilities in the financial supervision field.

The first evaluation unit 33 is configured to evaluate the preliminary recognition model according to a preset evaluation index to obtain an evaluation result.

The first optimizing unit 34 is configured to perform model optimization on the preliminary recognition model according to the evaluation result to obtain an optimized recognition model.

The first construction unit 35 is configured to verify the optimized recognition model according to the verification set to obtain a classifier for identifying the sensitive information and the hard-coded audit password.
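The division into a training set and a verification set can be sketched as below, reading the preset sampling strategy as stratified sampling by class label, which is an assumption. The `stratified_split` name, the 20% verification ratio and the fixed seed are hypothetical choices; training the attention-based network itself is outside the scope of this sketch.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, val_ratio=0.2, seed=0):
    """Split so each label keeps roughly the same share in both sets."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_ratio)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])

    train = [(samples[i], labels[i]) for i in train_idx]
    val = [(samples[i], labels[i]) for i in val_idx]
    return train, val
```

Stratifying matters here because sensitive samples are usually a small minority, and a purely random split could leave the verification set with almost none of them.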

In one embodiment of the present disclosure, the classification module 4 includes:

The sixth processing unit 41 is configured to perform syntax analysis processing on the code to be detected to obtain an abstract syntax tree.

The seventh processing unit 42 is configured to perform node traversal processing on the abstract syntax tree to obtain code elements, where the code elements include string constants and the corresponding variable names.

The eighth processing unit 43 is configured to perform feature conversion processing on the code elements to obtain input features.

The ninth processing unit 44 is configured to classify the input features according to the classifier to obtain the prediction category of the code elements.

In one embodiment of the present disclosure, the seventh processing unit 42 includes:

The second construction unit 421 is configured to construct an in-order traversal mathematical model according to the abstract syntax tree, and take the root node of the abstract syntax tree as an input parameter of the in-order traversal mathematical model to obtain the traversal model.

The first ordering unit 422 is configured to add the code nodes in the abstract syntax tree to the stack of the traversal model one by one according to the relationships between the nodes in the abstract syntax tree, so as to obtain a traversed node sequence.

The first extraction unit 423 is configured to perform extraction processing on the traversal model according to the node sequence, and sequentially extract the current node in the traversal to obtain a code node to be analyzed.

The first judging unit 424 is configured to perform type judgment on the code node to be analyzed, and filter out the string constants and the corresponding variable names to obtain the code elements.
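The stack-based traversal and type judgment can be sketched with Python's standard `ast` module, assuming for illustration that the code to be detected is Python. In-order traversal is only well defined for binary trees, so this sketch substitutes a stack-driven depth-first traversal in source order; that substitution, and the `traverse_with_stack` name, are assumptions rather than the method of the source.

```python
import ast


def traverse_with_stack(source):
    """Return (variable name, string constant, line) for simple assignments."""
    stack = [ast.parse(source)]  # the root node seeds the traversal
    elements = []
    while stack:
        node = stack.pop()  # current node to be analyzed
        # Type judgment: keep only `name = "string literal"` assignments.
        if (isinstance(node, ast.Assign)
                and isinstance(node.value, ast.Constant)
                and isinstance(node.value.value, str)):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    elements.append((target.id, node.value.value, node.lineno))
        # Push children in reverse so they are popped in source order.
        stack.extend(reversed(list(ast.iter_child_nodes(node))))
    return elements
```

An explicit stack avoids recursion depth limits on deeply nested code, at the cost of slightly less readable control flow.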

In one embodiment of the present disclosure, the matching module 5 includes:

The first matching unit 51 is configured to match the string constants in the prediction category against preset regular expressions, performing pattern matching on the string constants to obtain a string constant set related to the first information.

The second judging unit 52 is configured to compare the variable names in the prediction category with the first information in the financial supervisory system, and judge whether the variable names are the same as the variable names related to the first information, so as to obtain a variable name set related to the first information.

The second extracting unit 53 is configured to perform syntactic analysis on the string constant set and the variable name set, and extract the context information of the string constants and the variable names in the code to obtain input data.

The third judging unit 54 is configured to perform semantic analysis on the string constants and the variable names in the input data, judge whether they are associated with the first information, and classify them, according to the judgment result, as either related or unrelated to the first information, so as to obtain a detection result.

It should be noted that, regarding the apparatus in the above embodiments, the specific manner in which the respective modules perform the operations has been described in detail in the embodiments regarding the method, and will not be described in detail herein.

The above description covers only the preferred embodiments of the present application and is not intended to limit the present application; those skilled in the art can make various modifications and variations to the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims

1. The method for detecting the code sensitive information and the hard code based on the machine learning is characterized by comprising the following steps:

carrying out regular expression matching processing on the predicted category to obtain a detection result, wherein the detection result comprises file name, position and category information;

wherein performing feature extraction processing on the code samples to obtain the vector representation comprises:

performing N-gram processing on the sensitive elements in the code sample to obtain sensitive element features, wherein the sensitive elements comprise audit passwords, API keys and database credentials;

performing term frequency-inverse document frequency (TF-IDF) processing on the terms in the code sample to obtain term features, wherein the terms comprise risk indexes and asset liabilities;

performing word vector processing on entity institutions in the code samples to obtain entity institution features, wherein the entity institutions comprise financial institutions and supervision institutions;

and performing fusion processing on the sensitive element features, the term features and the entity institution features to obtain comprehensive features, and performing normalization processing on the comprehensive features to obtain the vector representation.

2. The machine learning based code sensitive information and hard code detection method of claim 1, wherein training a pre-set machine learning mathematical model based on the vector representation results in a classifier for identifying the first information, comprising:

dividing the vector representation based on a preset stratified sampling strategy to obtain a data set, wherein the data set comprises a training set and a verification set;

performing supervised learning training on a preset attention-mechanism bidirectional long short-term memory (BiLSTM) network model according to the data set, and obtaining a preliminary recognition model by capturing terms such as risk indexes and asset liabilities in the financial supervision field;

performing evaluation processing on the preliminary identification model according to a preset evaluation index to obtain an evaluation result;

performing model optimization processing on the preliminary recognition model according to the evaluation result to obtain an optimized recognition model;

and carrying out verification processing on the optimized identification model according to the verification set to obtain a classifier for identifying sensitive information and hard-coded audit passwords.

3. The machine learning based code sensitive information and hard coding detection method according to claim 1, wherein the parsing processing of the code to be detected, extracting a string constant and a corresponding variable name in a syntax tree, and classifying the parsed code to be detected according to the classifier, to obtain a predicted class of the string constant and the variable name, includes:

carrying out grammar analysis processing on the code to be detected to obtain an abstract grammar tree;

performing node traversal on the abstract syntax tree to obtain code elements, wherein the code elements comprise character string constants and corresponding variable names;

performing feature conversion processing on the code elements to obtain input features;

and classifying the input features according to the classifier to obtain the predicted category of the code element.

4. The machine learning based code sensitive information and hard coding detection method according to claim 3, wherein performing node traversal processing on the abstract syntax tree to obtain code elements comprises:

constructing an in-order traversal mathematical model according to the abstract syntax tree, and taking a root node of the abstract syntax tree as an input parameter of the in-order traversal mathematical model to obtain a traversal model;

according to the relation among the nodes in the abstract syntax tree, code nodes in the abstract syntax tree are added to a stack of the traversing model one by one to obtain a traversed node sequence;

extracting the traversal model according to the node sequence, and sequentially extracting current nodes in traversal to obtain code nodes to be analyzed;

and judging the type of the code node to be analyzed, screening out the character string constant and the corresponding variable name, and obtaining the code element.

5. A machine learning based code sensitive information and hard code detection device, comprising:

the matching module is used for carrying out regular expression matching processing on the prediction category to obtain a detection result, wherein the detection result comprises file names, positions and category information;

wherein, the extraction module includes:

the first processing unit is used for performing N-gram processing on the sensitive elements in the code sample to obtain sensitive element features, wherein the sensitive elements comprise audit passwords, API keys and database credentials;

the second processing unit is used for performing term frequency-inverse document frequency (TF-IDF) processing on the terms in the code sample to obtain term features, wherein the terms comprise risk indexes and asset liabilities;

the third processing unit is used for performing word vector processing on the entity institutions in the code sample to obtain entity institution features, wherein the entity institutions comprise financial institutions and supervision institutions;

and the fourth processing unit is used for performing fusion processing on the sensitive element features, the term features and the entity institution features to obtain comprehensive features, and performing normalization processing on the comprehensive features to obtain the vector representation.

6. The machine-learning based code-sensitive information and hard-coded detection device of claim 5, wherein the construction module comprises:

the fifth processing unit is used for dividing the vector representation based on a preset stratified sampling strategy to obtain a data set, wherein the data set comprises a training set and a verification set;

the first training unit is used for performing supervised learning training on a preset attention-mechanism bidirectional long short-term memory (BiLSTM) network model according to the data set, and obtaining a preliminary recognition model by capturing terms such as risk indexes and asset liabilities in the financial supervision field;

the first evaluation unit is used for performing evaluation processing on the preliminary identification model according to a preset evaluation index to obtain an evaluation result;

the first optimizing unit is used for carrying out model optimization processing on the preliminary recognition model according to the evaluation result to obtain an optimized recognition model;

and the first construction unit is used for carrying out verification processing on the optimized identification model according to the verification set to obtain a classifier for identifying the sensitive information and the hard-coded audit password.

7. The machine-learning based code-sensitive information and hard-coded detection device of claim 5, wherein the classification module comprises:

a sixth processing unit, configured to parse the code to be detected to obtain an abstract syntax tree;

a seventh processing unit, configured to perform node traversal processing on the abstract syntax tree to obtain a code element, where the code element includes a string constant and a corresponding variable name;

an eighth processing unit, configured to perform feature conversion processing on the code element to obtain an input feature;

and a ninth processing unit, configured to perform classification processing on the input feature according to the classifier, to obtain a prediction class of the code element.

8. The machine-learning-based code-sensitive information and hard-coded detection device of claim 7, wherein the seventh processing unit comprises:

the second construction unit is used for constructing an in-order traversal mathematical model according to the abstract syntax tree, and taking a root node of the abstract syntax tree as an input parameter of the in-order traversal mathematical model to obtain a traversal model;

the first ordering unit is used for adding the code nodes in the abstract syntax tree into the stack of the traversing model one by one according to the relation among the nodes in the abstract syntax tree to obtain a traversed node sequence;

the first extraction unit is used for extracting the traversal model according to the node sequence, and sequentially extracting current nodes in traversal to obtain code nodes to be analyzed;

and the first judging unit is used for judging the type of the code node to be analyzed, screening out the character string constant and the corresponding variable name, and obtaining the code element.