CN116881971A - Sensitive information leakage detection method, device and storage medium - Google Patents

Publication number
CN116881971A
Authority
CN
China
Prior art keywords
sensitive information
sensitive
interface
information
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310972203.6A
Other languages
Chinese (zh)
Inventor
冯玉权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qizhi Technology Co ltd
Original Assignee
Qizhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qizhi Technology Co ltd
Priority to CN202310972203.6A
Publication of CN116881971A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/362 Software debugging

Abstract

The invention discloses a sensitive information leakage detection method, a sensitive information leakage detection device and a storage medium. The method comprises the following steps: performing normalization processing on an API interface of an application system through a Swagger interface management unit to obtain a normalized API interface; encapsulating a request method of the normalized API interface, and then obtaining request content returned by the normalized API interface; judging, through a trained sensitive information identification model, whether sensitive information exists in the request content; if so, marking the request content with a sensitive information label to obtain sensitive information content, and determining the corresponding normalized API interface as a sensitive information leakage API interface; and generating a sensitive information leakage detection result according to the sensitive information content and the interface information of the sensitive information leakage API interface. By using the sensitive information identification model, the invention reduces missed detections and false detections of sensitive information and improves the accuracy and efficiency of sensitive information leakage detection.

Description

Sensitive information leakage detection method, device and storage medium
Technical Field
The present invention relates to the field of information security, and in particular, to a method, apparatus, and storage medium for detecting leakage of sensitive information.
Background
In the digital information age, information security faces serious challenges. In application systems, critical sensitive information may be leaked through API interfaces.
In the related art, sensitive information leakage through an API interface is currently detected by analyzing logs, analyzing traffic and the like. However, the analysis capability of these approaches is insufficient, so sensitive information is both missed and falsely detected; as a result, sensitive information is leaked and information security problems arise.
Disclosure of Invention
In view of the above technical problems and defects, the invention aims to provide a sensitive information leakage detection method, device and storage medium that reduce missed detections and false detections of sensitive information through a sensitive information identification model and improve the accuracy and efficiency of sensitive information leakage detection.
To achieve the above object, in a first aspect, the present invention provides a sensitive information leakage detecting method, including:
performing normalization processing on an API interface of an application system through a Swagger interface management unit to obtain a normalized API interface;
encapsulating a request method of the normalized API interface, and then obtaining request content returned by the normalized API interface;
judging, through a trained sensitive information identification model, whether sensitive information exists in the request content, wherein the sensitive information comprises at least one of an identity card number, a telephone number, a mailbox address, business license information and bank card information;
if yes, marking the request content with a sensitive information label to obtain sensitive information content, and determining a corresponding normalized API interface as a sensitive information leakage API interface; the sensitive information label comprises an identity card number label, a telephone number label, a mailbox address label, a business license information label or a bank card information label;
and generating a sensitive information leakage detection result according to the sensitive information content and the interface information of the sensitive information leakage API interface.
By adopting this embodiment, the content returned by the API interface of the application system is examined by the sensitive information identification model to determine whether sensitive information exists, so that missed detections and false detections of sensitive information are reduced and the accuracy and efficiency of sensitive information leakage detection are improved.
In one embodiment, the step of determining whether the sensitive information exists in the requested content through the sensitive information identification model includes:
inputting the request content into a trained dictionary learning algorithm model to obtain sparse representation information of the request content;
inputting the sparse representation information into the trained sensitive information identification model to judge whether sensitive information exists.
By adopting this embodiment, the dictionary learning algorithm model performs dimension reduction and feature extraction on the raw data of the request content, which reduces the redundancy and complexity of the data. Inputting the resulting sparse representation information into the sensitive information identification model reduces the computational load of that model and facilitates processing data in large batches. The embodiment combines the advantages of the dictionary learning model and the neural network model, improves the interpretability, computational efficiency and robustness of the model, can be used for large-batch data detection, and improves the efficiency and accuracy of sensitive information detection.
In an embodiment, before the step of performing normalization processing on the API interface of the application system by the Swagger interface management unit to obtain a normalized API interface, the method further includes:
acquiring a plurality of sample normal data and sample sensitive data, wherein the sample sensitive data contains sensitive information, and the sample normal data does not contain the sensitive information;
preprocessing the sample normal data and the sample sensitive data to obtain processed sample data, wherein the preprocessing comprises denoising and normalization;
obtaining a basis vector and a dictionary matrix according to the processed sample data and a dictionary learning algorithm, wherein the basis vector is used for sparsely representing the processed sample data;
and constructing a dictionary learning algorithm model according to the dictionary matrix.
By constructing and training the dictionary learning algorithm model as in this embodiment, the output of the dictionary learning algorithm model is more accurate.
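The preprocessing step above (denoising and normalization) can be sketched as follows. This is only a minimal illustration, assuming the samples are text strings. The function names, the three character-count features and the choice of unit-length scaling are hypothetical and are not specified by the source.

```python
import math
import re
import unicodedata


def denoise(text: str) -> str:
    """Fold compatibility characters, drop control characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    return re.sub(r"\s+", " ", text).strip()


def featurize(text: str) -> list[float]:
    """Hypothetical features: counts of digits, letters and all other characters."""
    digits = sum(ch.isdigit() for ch in text)
    letters = sum(ch.isalpha() for ch in text)
    return [float(digits), float(letters), float(len(text) - digits - letters)]


def normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit Euclidean length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else list(vector)


def preprocess(samples: list[str]) -> list[list[float]]:
    """Denoise each sample, turn it into a feature vector and normalize it."""
    return [normalize(featurize(denoise(sample))) for sample in samples]
```

The resulting vectors play the role of the processed sample data x from which a dictionary learning algorithm would then learn the dictionary matrix.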
In one embodiment, the step of obtaining the basis vector and the dictionary matrix according to the processed sample data and the dictionary learning algorithm includes:
representing the processed sample data as a linear combination x of basis vectors;
determining the basis vector z and the dictionary matrix D according to the sparse representation expression of x, $\min \|x - Dz\|^2 + \lambda \|z\|_1$, where $\lambda$ is a regularization parameter, $\|z\|_1$ denotes the L1 norm of the basis vector z, and z is a sparse representation of x.
By adopting the embodiment, the accurate basis vector and dictionary matrix can be obtained, so that the dictionary learning algorithm model is more accurate.
In one embodiment, the step of preprocessing the sample normal data and the sample sensitive data further comprises:
modifying at least part of the sample normal data and part of the sample sensitive data to obtain adversarial sample data;
preprocessing the sample normal data, the sample sensitive data and the adversarial sample data.
By adopting this embodiment, adversarial sample training is introduced: generating adversarial samples and using them to train the model enhances the robustness and anti-interference capability of the model, thereby improving its ability to identify sensitive data.
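One way to picture "modifying part of the sample data" is shown below. The source does not say which modifications are used, so the two perturbations here (inserting separators and switching to full-width digits) and all function names are assumptions chosen purely for illustration. Each perturbed sample keeps the label of the sample it was derived from.

```python
import random

# Map ASCII digits to their full-width forms (U+FF10 to U+FF19).
FULLWIDTH_DIGITS = {ord("0") + i: 0xFF10 + i for i in range(10)}


def insert_separators(value: str, rng: random.Random, count: int = 2, sep: str = "-") -> str:
    """Insert up to `count` separators at random interior positions."""
    chars = list(value)
    count = min(count, max(len(chars) - 1, 0))
    # Insert from the right so earlier positions stay valid.
    for pos in sorted(rng.sample(range(1, len(chars)), count), reverse=True):
        chars.insert(pos, sep)
    return "".join(chars)


def to_fullwidth(value: str) -> str:
    """Replace ASCII digits with full-width digits."""
    return value.translate(FULLWIDTH_DIGITS)


def make_adversarial(samples: list[tuple[str, bool]], seed: int = 0) -> list[tuple[str, bool]]:
    """Return perturbed copies of (text, is_sensitive) pairs with unchanged labels."""
    rng = random.Random(seed)
    perturbed = []
    for text, is_sensitive in samples:
        perturbed.append((insert_separators(text, rng), is_sensitive))
        perturbed.append((to_fullwidth(text), is_sensitive))
    return perturbed
```

The perturbed pairs would be appended to the original training set before preprocessing.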
In one embodiment, the sensitive information recognition model includes an identification card number recognition model, a phone number recognition model, a mailbox address recognition model, a business license information recognition model, and a bank card information recognition model, and before the step of obtaining the request content returned by the normalized API interface, the method further includes:
and packaging the identification card number identification model, the telephone number identification model, the mailbox address identification model, the business license information identification model and the bank card information identification model into classes.
By adopting the embodiment, the request content is processed through the sensitive information identification models of different types, so that the condition of missed detection and false detection can be reduced, and the detection accuracy is improved.
In an embodiment, after the step of generating the sensitive information leakage detection result according to the sensitive information content and the interface information of the sensitive information leakage API interface, the method further includes:
and sending the sensitive information leakage detection result to a display terminal so that a manager can check and verify the accuracy of the sensitive information leakage detection result through the display terminal.
By adopting this embodiment, the manager can receive the sensitive information leakage detection result in time and verify its accuracy.
In a second aspect, the present invention provides a sensitive information leakage detecting apparatus comprising:
the normalization module is used for performing normalization processing on the API interface of the application system through the Swagger interface management unit to obtain a normalized API interface;
the acquisition module is used for acquiring request contents returned by the normalized API after the request method of the normalized API is encapsulated;
the judging module is used for judging whether the request content has sensitive information or not through the trained sensitive information identification model, wherein the sensitive information comprises at least one of an identity card number, a telephone number, a mailbox address, business license information and bank card information;
the obtaining module is used for marking the request content with a sensitive information label when the sensitive information exists, obtaining the sensitive information content, and determining the corresponding normalized API interface as a sensitive information leakage API interface; the sensitive information label comprises an identity card number label, a telephone number label, a mailbox address label, a business license information label or a bank card information label;
The generation module is used for generating a sensitive information leakage detection result according to the sensitive information content and the interface information of the sensitive information leakage API interface.
The sensitive information leakage detection device of the embodiment of the present invention can achieve the technical effects of the above method, and is not described herein in detail.
In a third aspect, the present invention provides a mobile terminal comprising a processor and a memory, the memory having stored thereon a computer program which, when executed by the processor, implements the method described above.
The mobile terminal of the embodiment of the present invention can achieve the technical effects of the above method, and is not described herein.
In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method described above.
The storage medium of the embodiment of the present invention may achieve the technical effects of the above method, and is not described herein.
One or more technical solutions provided in the embodiments of the present invention at least have the following technical effects or advantages:
1. The content returned by the API interface of the application system is examined by the sensitive information identification model to determine whether sensitive information exists, so that missed detections and false detections of sensitive information are reduced and the accuracy and efficiency of sensitive information leakage detection are improved.
2. The dictionary learning algorithm model performs dimension reduction and feature extraction on the raw data of the request content, which reduces the redundancy and complexity of the data. Inputting the resulting sparse representation information into the sensitive information identification model reduces the computational load of that model and facilitates processing data in large batches. The embodiments combine the advantages of the dictionary learning model and the neural network model, improve the interpretability, computational efficiency and robustness of the model, can be used for large-batch data detection, and improve the efficiency and accuracy of sensitive information detection.
3. Adversarial sample training is introduced: generating adversarial samples and using them to train the model enhances the robustness and anti-interference capability of the model, thereby improving its ability to identify sensitive data.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is evident that the drawings in the following description are only some embodiments of the present invention and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art. In the drawings:
FIG. 1 is a flow chart of steps of a sensitive information leakage detection method according to an embodiment of the present invention;
FIG. 2 is a flow chart of steps of another sensitive information leakage detection method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of data transmission in a sensitive information leakage detection method according to an embodiment of the present invention;
FIG. 4 is a block diagram of a sensitive information leakage detecting device according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an architecture of an electronic device according to an embodiment of the invention.
Detailed Description
The terminology used in the following embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in this disclosure refers to and encompasses any or all possible combinations of one or more of the listed items. The terms "first" and "second" are used below for descriptive purposes only and are not to be construed as indicating or implying relative importance or the number of technical features indicated. In the description of the embodiments of the present invention, unless otherwise indicated, "a plurality" means two or more. The following describes embodiments of the present invention in detail.
In the digital information age, information security faces serious challenges. In application systems, critical sensitive information may be leaked through API interfaces.
In the related art, sensitive information leakage through an API interface is currently detected by analyzing logs, analyzing traffic and the like. However, the analysis capability of these approaches is insufficient, so sensitive information is both missed and falsely detected; as a result, sensitive information is leaked and information security problems arise.
Therefore, the embodiment of the invention provides a sensitive information leakage detection method in which the content returned by the API interface of the application system is examined by a sensitive information identification model to determine whether sensitive information exists, so that missed detections and false detections of sensitive information are reduced and the accuracy and efficiency of sensitive information leakage detection are improved.
As shown in fig. 1, the sensitive information leakage detection method of the embodiment of the invention may include a step 101, a step 102, a step 103, a step 104 and a step 105, which specifically includes the following steps:
Step 101, performing normalization processing on the API interface of the application system through a Swagger interface management unit to obtain a normalized API interface.
Here, an API (Application Programming Interface) is an application program interface whose main purpose is to let an application developer invoke a set of routines without having to access the underlying source code or understand the details of their internal operating mechanisms. API interfaces reduce the mutual dependence among the parts of a system, improve the cohesion of its constituent units and reduce the coupling among them, thereby improving the maintainability and extensibility of the system.
In this implementation, Swagger is a standardized and complete framework for generating, describing, invoking and visualizing RESTful web services. Its overall goal is to keep the client and the documentation updated at the same pace as the server. The documented methods, parameters and models are tightly integrated into the server-side code, allowing the API to stay synchronized at all times. Swagger handles the exchange of interface documents well: by defining interfaces and related information according to a set of standard specifications, it can generate interface documents in various formats, client and server code in multiple languages, online interface debugging pages and the like. The interface document is regenerated automatically whenever the Swagger description file is updated, which keeps the interface documents used for front-end and back-end joint debugging timely and convenient. The Swagger interface management unit of the present embodiment has the following roles:
Interface description: interface documents can be generated, including information such as the URL (uniform resource locator) of each interface, request methods, request parameters and response parameters, so that developers can conveniently view and understand how to use the interfaces.
Interface test: the interface test function can be provided, and a developer can directly test the interface on Swagger to verify the correctness and usability of the interface.
Interface debugging: the interface debugging function can be provided, and a developer can debug the interface on the Swagger to quickly locate the problems of the interface.
Interface specification: the interface specification, including the data type of the request parameter, the limitation of the request method, and the like, can be constrained, so that the standardization and maintainability of the interface are ensured.
Interface management: all interfaces can be managed, including operations such as adding, modifying and deleting, and the maintenance and management of the interfaces are convenient for developers.
In the above step, the normalization processing specifically consists of collecting and cleaning the API interfaces, including:
1. generating an API document by using the Swagger interface management unit;
2. extracting the needed interface information, such as interface names, URLs, request methods, request parameters and response parameters, from the document;
3. cleaning and organizing the interface information, for example removing duplicate interfaces and merging identical interfaces;
4. storing the cleaned interface information in a database or other data storage system for subsequent use.
Through the steps, the Swagger interface management unit can realize unified management and standardization of all API interfaces, and is convenient for developers to use and maintain the interfaces. Meanwhile, development efficiency and code quality can be improved, and errors and repeated work are reduced.
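The extraction and cleaning steps listed above can be illustrated with a short sketch. It assumes the API document is already available as a parsed Swagger/OpenAPI dictionary with a `paths` section. The function name, the record fields and the rule that URLs differing only by a trailing slash are duplicates are illustrative assumptions, not details given by the source.

```python
HTTP_METHODS = {"get", "post", "put", "delete", "patch"}


def extract_interfaces(spec: dict) -> list[dict]:
    """Flatten a parsed Swagger document into one record per (method, URL) pair."""
    seen = set()
    records = []
    for url, operations in spec.get("paths", {}).items():
        for method, operation in operations.items():
            if method.lower() not in HTTP_METHODS:
                continue  # skip path-level keys such as "parameters"
            # Assumed cleaning rule: "/users" and "/users/" are the same interface.
            key = (method.upper(), url.rstrip("/") or "/")
            if key in seen:
                continue
            seen.add(key)
            records.append({
                "name": operation.get("operationId") or operation.get("summary", ""),
                "url": key[1],
                "method": key[0],
                "request_params": [p.get("name") for p in operation.get("parameters", [])],
                "responses": sorted(operation.get("responses", {})),
            })
    return records
```

The returned records could then be written to whatever database or storage system holds the cleaned interface information.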
Step 102, after the request method of the normalized API is encapsulated, the request content returned by the normalized API is obtained.
Specifically, the request methods include the GET request and the POST request. This step specifically comprises: encapsulating the GET request and the POST request of the normalized API interface, and then obtaining the request content returned by the normalized API interface. The request content is the content transmitted through the normalized API interface, such as pictures, text, audio and video.
Step 103, judging whether the request content has sensitive information or not through the trained sensitive information identification model, wherein the sensitive information comprises at least one of an identity card number, a telephone number, a mailbox address, business license information and bank card information. The telephone number may include a cellular phone number and a landline number.
The sensitive information recognition model comprises an identity card number recognition model, a telephone number recognition model, a mailbox address recognition model, a business license information recognition model and a bank card information recognition model.
The identity card number recognition model can recognize whether an identity card number exists in the request content transmitted through the API interface, according to regular expression rules and by using OCR (Optical Character Recognition) picture recognition technology.
The telephone number recognition model can recognize whether a telephone number exists in the request content transmitted through the API interface, according to regular expression rules and by using OCR picture recognition technology.
The mailbox address recognition model can recognize whether a mailbox address exists in the request content transmitted through the API interface, according to regular expression rules and by using OCR picture recognition technology.
The business license information recognition model can recognize whether business license information exists in the request content transmitted through the API interface, according to regular expression rules and by using OCR picture recognition technology.
The bank card information recognition model can recognize whether bank card information exists in the request content transmitted through the API interface, according to regular expression rules and by using OCR picture recognition technology.
In this embodiment, the sensitive information recognition model is a neural network capable of deep learning. Processing the request content through different types of sensitive information recognition models reduces missed detections and false detections. For each type of sensitive information recognition model, corresponding sample data can be collected for training, which improves the accuracy of model recognition.
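The regular expression rules mentioned for the recognition models can be pictured with the following sketch, which covers only the text part (no OCR and no neural network). The patterns are assumptions: an 18-character mainland identity card number with its check character, an 11-digit mobile number starting with 1, a simplified mailbox address pattern, and a 16 to 19 digit bank card number filtered by the Luhn check. Business license information is omitted.

```python
import re

ID_CARD_RE = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")
MOBILE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
BANK_CARD_RE = re.compile(r"(?<!\d)\d{16,19}(?![\dXx])")

ID_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
ID_CHECK_CODES = "10X98765432"


def valid_id_card(number: str) -> bool:
    """Verify the check character of an 18-character identity card number."""
    total = sum(int(digit) * weight for digit, weight in zip(number[:17], ID_WEIGHTS))
    return ID_CHECK_CODES[total % 11] == number[17].upper()


def luhn_valid(number: str) -> bool:
    """Verify a digit string with the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:  # double every second digit from the right
            digit = digit * 2 - 9 if digit > 4 else digit * 2
        total += digit
    return total % 10 == 0


def find_sensitive(text: str) -> dict[str, list[str]]:
    """Return the matches found in the text, keyed by sensitive information type."""
    hits = {
        "id_card": [m for m in ID_CARD_RE.findall(text) if valid_id_card(m)],
        "phone": MOBILE_RE.findall(text),
        "email": EMAIL_RE.findall(text),
        "bank_card": [m for m in BANK_CARD_RE.findall(text) if luhn_valid(m)],
    }
    return {label: values for label, values in hits.items() if values}
```

The checksum filters are one simple way to cut false detections from digit strings that merely have the right length, such as order numbers.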
Step 104, if the sensitive information exists, marking the request content with a sensitive information label to obtain the sensitive information content, and determining the corresponding normalized API interface as a sensitive information leakage API interface; the sensitive information label includes an identity card number label, a telephone number label, a mailbox address label, a business license information label or a bank card information label.
For example, when the identity card number exists in the request content returned by a certain normalized API interface, the request content is marked with the identity card number label, and the normalized API interface is determined to be a sensitive information leakage API interface; if the request content also has a telephone number, a mailbox address or other sensitive information, the request content is marked with a telephone number label, a mailbox address label or other corresponding sensitive information labels.
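The labeling described in this example can be sketched as below. The `LabeledContent` record and the `label_content` function are hypothetical names, and each recognizer is assumed to be a callable that returns true when its type of sensitive information is present. Content may receive several labels, and an interface counts as leaking as soon as one label is attached.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LabeledContent:
    """Request content returned by one interface plus its sensitive information labels."""
    interface_url: str
    content: str
    labels: list[str] = field(default_factory=list)

    @property
    def is_leaking(self) -> bool:
        # The interface is a sensitive information leakage interface if any label applies.
        return bool(self.labels)


def label_content(
    interface_url: str,
    content: str,
    recognizers: dict[str, Callable[[str], bool]],
) -> LabeledContent:
    """Attach the label of every recognizer that detects sensitive information."""
    labels = [label for label, detect in recognizers.items() if detect(content)]
    return LabeledContent(interface_url, content, labels)
```

For instance, passing `{"mailbox_address": lambda text: "@" in text}` as a toy recognizer set would label any content containing an `@` sign.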
Step 105, generating a sensitive information leakage detection result according to the sensitive information content and the interface information of the sensitive information leakage API interface.
Specifically, the sensitive information leakage detection result may be presented in the form of a report, and may include the application system entry address, the API interface description, the request test case, the API return content and the sensitive information label. The API interface description comprises:
Interface name: the name of the API interface, usually beginning with a verb, which clearly and explicitly expresses the function of the interface.
Function description: a brief, clear description of what the interface does and what it is for.
Input parameters: the input parameters that the interface needs to receive, including information such as parameter names, types, whether each parameter is required, default values and value ranges. Complex parameter structures may be illustrated with examples or data structure descriptions.
Output result: the return result of the interface, including information such as the type, meaning and format of the return value. Where results differ, the scenarios and meanings of the different return values may be illustrated.
Error codes: the possible error codes and their meanings, which indicate error conditions that may occur while the interface is called and help developers with error handling and debugging.
Test cases: specific examples that give developers a better understanding of how to use the interface and of the format of its parameters.
Interface version: the version number of the interface, used for managing and maintaining different versions of the interface.
Through the API interface description, a developer can clearly understand the function and usage of the API interface, so that the API can be correctly called and integrated, development efficiency is improved and the probability of errors is reduced. The API interface description also provides an important reference for team cooperation and document maintenance.
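A report with the fields named above could be assembled as in the following sketch. The JSON layout, the key names and the shape of the `findings` records are assumptions for illustration. Only interfaces that carry at least one sensitive information label are included.

```python
import json


def build_report(entry_address: str, findings: list[dict]) -> str:
    """Serialize the leaking interfaces and their labeled content as a JSON report."""
    leaking = [finding for finding in findings if finding["labels"]]
    report = {
        "application_entry_address": entry_address,
        "leaking_interface_count": len(leaking),
        "leaking_interfaces": [
            {
                "api_description": finding["api_description"],
                "request_test_case": finding["request_test_case"],
                "api_return_content": finding["content"],
                "sensitive_information_labels": finding["labels"],
            }
            for finding in leaking
        ],
    }
    # ensure_ascii=False keeps non-ASCII return content readable in the report.
    return json.dumps(report, ensure_ascii=False, indent=2)
```

A JSON document is only one possible presentation. The same dictionary could equally be rendered as a table or page for the display terminal.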
By adopting the above steps, this embodiment examines the content returned by the API interface of the application system through the sensitive information identification model and judges whether sensitive information exists, so that missed detections and false detections of sensitive information are reduced and the accuracy and efficiency of sensitive information leakage detection are improved.
On the other hand, when sensitive information is detected on a large scale, the large data volume imposes a heavy computational load on the sensitive information identification model and a heavy load on the processor, which lowers the calculation speed and hinders large-scale, efficient detection of sensitive information. In view of this problem, the sensitive information leakage detection method of the present embodiment may further specifically include the following steps, as shown in fig. 2:
Step 201, normalizing the API interface of the application system through the Swagger interface management unit to obtain a normalized API interface.
This step is the same as step 101 and is not described again here.
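As a non-limiting illustration of what a normalized interface description makes possible, the following sketch enumerates the operations declared in a Swagger 2.0 style document. The `paths` and `basePath` keys and the HTTP method keys follow the Swagger 2.0 layout; the function name `list_operations`, the sample document and the shape of the returned records are assumptions made for illustration and are not taken from this embodiment.

```python
from typing import Dict, List

# HTTP method keys that a Swagger 2.0 path item may contain
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch"}


def list_operations(swagger_doc: Dict) -> List[Dict[str, str]]:
    """Return one record per (method, path) operation in the document."""
    base_path = swagger_doc.get("basePath", "").rstrip("/")
    operations = []
    for path, path_item in sorted(swagger_doc.get("paths", {}).items()):
        for key, operation in sorted(path_item.items()):
            if key.lower() not in HTTP_METHODS:
                continue  # skip non-operation keys such as "parameters"
            operations.append({
                "method": key.upper(),
                "url_path": base_path + path,
                "summary": operation.get("summary", ""),
            })
    return operations


# Hypothetical interface description used only for illustration
SAMPLE_DOC = {
    "swagger": "2.0",
    "basePath": "/api/v1",
    "paths": {
        "/users/{id}": {"get": {"summary": "Query user"}, "parameters": []},
        "/orders": {"post": {"summary": "Create order"}},
    },
}

for op in list_operations(SAMPLE_DOC):
    print(op["method"], op["url_path"])
```

Each record gives the detection flow a uniform handle (method and path) for the interface whose returned content is later inspected.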
Step 202, packaging the identification card number identification model, the telephone number identification model, the mailbox address identification model, the business license information identification model and the bank card information identification model into classes.
In this step, the sensitive information identification models are packaged into classes so that they can be called conveniently.
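One possible way of packaging the recognition models into classes is sketched below. The class names `SensitiveInfoRecognizer` and `RecognizerRegistry`, the label strings, and the two lambda functions are hypothetical; the lambdas are simple stand-ins for trained models and are used only so that the sketch runs on its own.

```python
import re
from typing import Callable, List


class SensitiveInfoRecognizer:
    """Wraps one recognition model behind a uniform detect() method."""

    def __init__(self, label: str, predict: Callable[[str], bool]) -> None:
        self.label = label
        self._predict = predict  # a trained model's predict function in practice

    def detect(self, content: str) -> bool:
        return bool(self._predict(content))


class RecognizerRegistry:
    """Holds several recognizers and reports which labels apply to a text."""

    def __init__(self, recognizers: List[SensitiveInfoRecognizer]) -> None:
        self._recognizers = list(recognizers)

    def labels_for(self, content: str) -> List[str]:
        return [r.label for r in self._recognizers if r.detect(content)]


# Stand-in predictors (assumptions, not the trained models of the embodiment)
registry = RecognizerRegistry([
    SensitiveInfoRecognizer(
        "telephone_number",
        lambda s: re.search(r"\b\d{11}\b", s) is not None,
    ),
    SensitiveInfoRecognizer("mailbox_address", lambda s: "@" in s),
])

print(registry.labels_for("tel 13800000000, mail a@b.com"))
```

With this arrangement, the caller only depends on `labels_for`, so an individual model can be retrained or replaced without changing the calling code.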
Step 203, after the request method of the normalized API interface is encapsulated, the request content returned by the normalized API interface is obtained.
The step is the same as step 102, and will not be described here again.
Step 204, inputting the request content into a trained dictionary learning algorithm model to obtain sparse representation information of the request content.
Firstly, preprocessing the data in the request content, including denoising, dimension reduction, normalization and the like, so as to improve the quality and usability of the data. The preprocessed request content is then input into a dictionary learning algorithm model, and a set of basis vectors are learned, which can be used to sparsely represent the original data. Sparse representation refers to the representation of a vector as a linear combination of a set of basis vectors, where only a few coefficients are non-zero.
In this embodiment, the dictionary learning algorithm model adopts a dictionary learning algorithm, typically an unsupervised learning algorithm such as the OMP algorithm or the K-SVD algorithm, to solve for the sparse representation information in the form of a base vector. Specifically, the base vector z of the sparse representation information can be obtained by solving the following optimization problem:
min_z ||x - Dz||^2 + λ||z||_1;
where x represents the data vector corresponding to the request content, D represents the dictionary matrix, λ is the regularization parameter, and ||z||_1 represents the L1 norm of the base vector z.
The term ||x - Dz||^2 represents the error between the original data vector x and the sparse linear combination Dz formed with the dictionary matrix D. The smaller the error, the more accurate z is; in the ideal case the error is 0, that is, x = Dz.
The term λ||z||_1 is the regularization term, which penalizes non-zero coefficients in the coefficient vector z; λ controls the balance between sparsity and error. The L1 norm is also known as the Manhattan norm or absolute sum. Specifically, the L1 norm of the vector z is defined as the sum of the absolute values of its individual elements, namely:
||z||_1 = |z_1| + |z_2| + ... + |z_n|;
In the optimization problem, the L1 norm, as a regularization term, can be used to control the balance between sparsity and error. By adjusting the size of the regularization parameter λ, a trade-off between sparsity and error can be achieved.
In practical applications, the L1 norm is often used as a tool for feature selection and sparse representation. Its advantage is that the coefficients of irrelevant or redundant features can be driven to zero, which achieves feature selection and dimension reduction. In addition, the L1 norm promotes sparsity of the coefficient vector, which improves the generalization capability and interpretability of the model. In sensitive data leakage detection, the L1 norm can be used to control the sparsity of the model parameters, thereby improving the robustness of the model to noise and interference.
The dictionary matrix D can be trained and optimized in advance through experiments; the better the dictionary matrix D is trained and optimized, the more accurate the base vector z obtained by solving.
The basis vector z obtained by solving the above equation is the sparse representation of the request content data vector x. The advantage of sparse representation is that it can represent high-dimensional data as low-dimensional sparse vectors, thereby reducing redundancy and complexity of the data. In addition, sparse representation can also be used in the fields of feature extraction, signal compression, image processing and the like. In the sensitive data leakage detection of the embodiment, sparse representation can be used for performing dimension reduction and feature extraction on the original data, so that the accuracy and efficiency of the model are improved.
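For a fixed dictionary matrix D, the optimization problem min_z ||x - Dz||^2 + λ||z||_1 described above can be solved approximately by several methods. The following is a minimal sketch using iterative soft thresholding (ISTA), which is chosen here only for brevity and is not the OMP or K-SVD algorithm named in this embodiment; the function names and the iteration count are assumptions.

```python
import numpy as np


def soft_threshold(v: np.ndarray, t: float) -> np.ndarray:
    """Shrink every entry of v towards zero by t (proximal map of t * ||.||_1)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def sparse_code(x: np.ndarray, D: np.ndarray, lam: float, n_iter: int = 200) -> np.ndarray:
    """Approximately minimize ||x - D z||^2 + lam * ||z||_1 over z with ISTA."""
    # The gradient of ||x - D z||^2 is 2 D^T (D z - x); its Lipschitz
    # constant is 2 * sigma_max(D)^2, which fixes the step size.
    step = 1.0 / (2.0 * np.linalg.norm(D, 2) ** 2)
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ z - x)
        z = soft_threshold(z - step * grad, lam * step)
    return z


# With D equal to the identity matrix the minimizer is known in closed form:
# every entry of x is shrunk towards zero by lam / 2.
x = np.array([3.0, 0.1, -2.0])
z = sparse_code(x, np.eye(3), lam=1.0)
print(z)
```

In the identity-dictionary example the small entry 0.1 is set exactly to zero, which illustrates how the L1 term produces a sparse z, while a larger λ would zero out more entries at the cost of a larger error ||x - Dz||^2.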
It can be understood that, in order to improve the accuracy of the dictionary learning algorithm model so that the sparse representation information it outputs is more accurate, the dictionary learning algorithm model needs to be trained and optimized. The training process may be performed before step 201 and specifically includes the following steps:
Step 210, acquiring a plurality of sample normal data and sample sensitive data, wherein the sample sensitive data contains sensitive information, and the sample normal data does not contain sensitive information.
The sample sensitive data can comprise one or more types of sensitive information such as an identity card number, a telephone number, a mailbox address, business license information, bank card information and the like. The sample normal data and sample sensitive data may be from existing data sets, data within the enterprise, or third party data suppliers.
Step 220, preprocessing the sample normal data and the sample sensitive data to obtain processed sample data, wherein the preprocessing comprises denoising and normalization.
Preprocessing improves the quality and usability of the sample normal data and the sample sensitive data, which facilitates repeated model training on a large number of samples.
In one embodiment, the step of preprocessing the sample normal data and the sample sensitive data further comprises:
Firstly, at least part of the sample normal data and part of the sample sensitive data are modified to obtain adversarial sample data.
At least part of the sample data may be used as original sample data; the adversarial sample data are obtained by slightly perturbing the original sample data so that the model changes its recognition result. The adversarial samples may be generated by various methods, such as FGSM (fast gradient sign method) and PGD (proximal gradient descent), which make the recognition result of the model for the adversarial sample data differ from that for the original sample data while keeping the perturbation amplitude small.
Then, the sample normal data, the sample sensitive data and the adversarial sample data are subjected to preprocessing such as denoising, dimension reduction and normalization.
In this way, the quality and usability of the adversarial sample data are improved, which facilitates repeated model training on a large number of samples.
The above steps introduce adversarial sample training: by generating a plurality of adversarial samples for model training, the robustness and anti-interference capability of the model can be enhanced, thereby improving the capability of the model to identify sensitive data.
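The FGSM perturbation mentioned above can be illustrated on a small differentiable model. The sketch below assumes a logistic-regression classifier over a numeric feature vector; the weights `w`, bias `b`, sample `x`, label `y` and step size `eps` are hypothetical values chosen for illustration, and the classifier is a stand-in rather than the recognition model of this embodiment.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def log_loss(x, y, w, b):
    """Cross-entropy loss of a logistic-regression model on one sample."""
    p = sigmoid(x @ w + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))


def fgsm_perturb(x, y, w, b, eps):
    """Return x + eps * sign(d loss / d x), the FGSM adversarial sample."""
    p = sigmoid(x @ w + b)
    grad_x = (p - y) * w  # gradient of the cross-entropy loss w.r.t. x
    return x + eps * np.sign(grad_x)


# Hypothetical model parameters and sample (label 1 = "sensitive")
w = np.array([1.0, -2.0, 0.5])
b = 0.1
x = np.array([0.2, 0.4, -0.1])
y = 1

x_adv = fgsm_perturb(x, y, w, b, eps=0.05)
print(log_loss(x, y, w, b), log_loss(x_adv, y, w, b))
```

Every feature moves by at most `eps`, so the perturbation stays small, yet the loss on the perturbed sample is higher than on the original; adding such samples to the training set is what the adversarial training described above relies on.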
Step 230, obtaining a base vector and a dictionary matrix according to the processed sample data and the dictionary learning algorithm, wherein the base vector is used for carrying out sparse representation on the processed sample data.
Specifically, assuming that the sample data is represented as a vector x, the dictionary matrix as D and the base vector as z, where z is the sparse representation of x, then x is approximately equal to Dz. The better the dictionary matrix D is optimized, or the more accurate z is, the closer the product Dz is to x.
Step 240, constructing a dictionary learning algorithm model according to the dictionary matrix.
Specifically, the processed sample data is first expressed as a linear combination x of basis vectors; then, the base vector z and the dictionary matrix D are determined according to the sparse expression min ||x - Dz||^2 + λ||z||_1, where λ is the regularization parameter, ||z||_1 represents the L1 norm of the base vector z, and z is the sparse representation of x.
When the dictionary matrix D has been optimized sufficiently well, D is treated as a fixed constant; x is input into the model as training sample data, and the base vector z can be obtained from min ||x - Dz||^2 + λ||z||_1. The dictionary learning algorithm model is formed on the basis of the above steps.
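The training in steps 230 and 240 can be sketched as an alternation between solving for the sparse codes with D fixed and refitting D with the codes fixed. The sketch below uses ISTA for the codes and a least-squares dictionary update (the method of optimal directions) purely for brevity; these are swapped in for the OMP and K-SVD algorithms named in this embodiment, and the function names, random initialization and hyperparameters are assumptions.

```python
import numpy as np


def ista(X, D, lam, n_iter=100):
    """Sparse-code each column of X against D, starting from zero codes."""
    step = 1.0 / (2.0 * np.linalg.norm(D, 2) ** 2)
    Z = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        V = Z - step * 2.0 * D.T @ (D @ Z - X)
        Z = np.sign(V) * np.maximum(np.abs(V) - lam * step, 0.0)
    return Z


def learn_dictionary(X, n_atoms, lam, n_rounds=10, seed=0):
    """Alternate sparse coding and a least-squares dictionary update.

    X holds one processed sample per column. Returns (D, Z) with X ~ D @ Z.
    """
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    for _ in range(n_rounds):
        Z = ista(X, D, lam)
        D = X @ np.linalg.pinv(Z)  # least-squares fit of D given Z
        # Re-draw atoms that no sample uses, then rescale columns to unit norm
        dead = np.linalg.norm(D, axis=0) < 1e-12
        D[:, dead] = rng.standard_normal((X.shape[0], int(dead.sum())))
        D /= np.linalg.norm(D, axis=0, keepdims=True)
    # With D now fixed, the codes are obtained exactly as at inference time
    return D, ista(X, D, lam)


# Synthetic stand-in for processed sample data (8 features, 40 samples)
X = np.random.default_rng(1).standard_normal((8, 40))
D, Z = learn_dictionary(X, n_atoms=12, lam=0.1, n_rounds=5)
print(D.shape, Z.shape)
```

The final call to `ista` mirrors the last sentence of step 240: once D is good enough it is held constant and only z is solved for.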
By building and training the dictionary learning algorithm model in this way, the output of the dictionary learning algorithm model becomes more accurate. On the other hand, during the training of the dictionary learning algorithm model, its output can be used as sample data for training the sensitive information recognition model, so that the dictionary learning algorithm model and the sensitive information recognition model can be trained jointly, which improves the training effect of both models.
Step 205, inputting the sparse representation information into the trained sensitive information recognition model to judge whether sensitive information exists. If yes, go to step 206; if not, return to step 203.
Specifically, the sensitive information recognition model recognizes and classifies the input sparse representation information, and judges whether the sparse representation information is sensitive information and, if so, which type of sensitive information it is. The types of sensitive information include the identity card number, telephone number, mailbox address, business license information, bank card information, and the like.
Because the data dimension and data size of the sparse representation information are greatly reduced compared with the original request content, the amount of calculation required by the sensitive information recognition model during recognition is greatly reduced, and the calculation precision and speed are improved.
Step 206, if sensitive information exists, labeling the request content with a sensitive information label to obtain the sensitive information content, and determining the corresponding normalized API interface as a sensitive information leakage API interface.
The sensitive information label comprises an identity card number label, a telephone number label, a mailbox address label, a business license information label or a bank card information label.
Step 207, generating a sensitive information leakage detection result according to the sensitive information content and the interface information of the sensitive information leakage API interface.
The step is the same as step 105, and will not be described here again.
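Steps 206 and 207 can be pictured as building one labeled record per leaking interface and then collecting the records into a result. In the sketch below, the `LeakFinding` record, the field names and the result layout are hypothetical and only illustrate one way the sensitive information content and the interface information could be combined.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class LeakFinding:
    method: str        # interface information: HTTP method
    url_path: str      # interface information: path of the normalized API
    labels: List[str]  # sensitive information labels attached to the content
    content: str       # the labeled request content


def label_content(method: str, url_path: str, content: str,
                  detected_labels: List[str]) -> Optional[LeakFinding]:
    """Attach labels to the content; return None when nothing was detected."""
    if not detected_labels:
        return None
    return LeakFinding(method, url_path, sorted(detected_labels), content)


def build_detection_result(findings: List[LeakFinding]) -> Dict:
    """Collect the findings into a single detection result."""
    return {
        "leaking_interface_count": len({(f.method, f.url_path) for f in findings}),
        "findings": [asdict(f) for f in findings],
    }


finding = label_content("GET", "/api/v1/users/{id}",
                        '{"mail": "a@b.com"}', ["mailbox_address"])
result = build_detection_result([finding])
print(result["leaking_interface_count"])
```

Interfaces for which `label_content` returns `None` are simply left out of the result, which corresponds to the branch that returns to step 203.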
Step 208, sending the sensitive information leakage detection result to a display terminal so that a manager can check and verify the accuracy of the sensitive information leakage detection result through the display terminal.
Specifically, the result can be sent by means of communication software or email to a display terminal of the manager, such as a computer or a mobile phone, so that the manager receives the sensitive information leakage detection result in time and can check and verify its accuracy.
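For the email option, the following sketch composes a notification with Python's standard `email.message.EmailMessage` class. The result dictionary, the addresses and the subject wording are hypothetical; actual delivery (for example with `smtplib.SMTP(...).send_message(msg)`) is omitted because the mail server is deployment specific.

```python
import json
from email.message import EmailMessage


def build_notification(result: dict, sender: str, recipient: str) -> EmailMessage:
    """Compose an email that carries the detection result as JSON text."""
    flagged = len(result.get("findings", []))
    msg = EmailMessage()
    msg["Subject"] = f"Sensitive information leakage detection: {flagged} finding(s)"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(json.dumps(result, ensure_ascii=False, indent=2))
    return msg


# Hypothetical detection result and addresses, for illustration only
sample_result = {
    "findings": [
        {"method": "GET", "url_path": "/api/v1/users/{id}",
         "labels": ["mailbox_address"]},
    ]
}
msg = build_notification(sample_result, "detector@example.com", "admin@example.com")
print(msg["Subject"])
```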
The embodiment introduces an algorithm of dictionary learning, and the dictionary learning algorithm has the following advantages:
The interpretability is strong: dictionary learning algorithms can learn the most representative basis vectors, which can be used to interpret the characteristics of the original data and help in understanding the nature of the data.
The sparsity is good: the dictionary learning algorithm can learn sparse representation, namely, the original data is represented by using as few basis vectors as possible, so that the data dimension is reduced and the calculation efficiency is improved.
The application range is wide: the dictionary learning algorithm can be applied to the fields of signal processing, image processing, voice recognition and the like, and has good universality and flexibility.
Meanwhile, the sensitive information identification model of the embodiment is a neural network model capable of deep learning, and the neural network model has the following advantages:
The application range is wide: the neural network model can be applied to the fields of image recognition, natural language processing, voice recognition and the like, and has good universality and flexibility.
The learning ability is strong: the neural network model has strong learning ability and generalization ability, and can learn optimal weights and biases from a large amount of data so as to realize specific tasks.
The expandability is good: the neural network model can be expanded by increasing the number of network layers, the number of nodes and the like, so that the performance and generalization capability of the model are improved.
In the embodiment, as shown in fig. 3, the request content returned by the API interface is input to the dictionary learning algorithm model to obtain sparse representation information of the request content, and the original data of the request content is subjected to dimension reduction and feature extraction, so that the redundancy and complexity of the data are reduced, the calculation amount of a subsequent sensitive information identification model can be reduced, and the processing of large-scale data is facilitated; and inputting the sparse representation information into a sensitive information identification model, identifying the content of the sensitive information, and finally sorting the content of the sensitive information into sensitive information leakage detection results. The method and the device integrate the advantages of the dictionary model and the neural network model, improve the interpretability, the calculation efficiency and the robustness of the model, can be used for detecting mass data, and improve the efficiency and the accuracy of sensitive information detection.
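The overall flow of fig. 3 can be summarized as a loop in which each returned content is first encoded and then classified. In the sketch below, `detect_leaks` takes the encoder and the classifier as parameters; `toy_encode` and `toy_classify` are hypothetical stand-ins for the dictionary learning algorithm model and the sensitive information recognition model and only make the sketch runnable.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple


def detect_leaks(
    responses: Iterable[Tuple[str, str]],
    encode: Callable[[str], Sequence[float]],
    classify: Callable[[Sequence[float]], List[str]],
) -> List[Dict]:
    """Encode each returned content, classify the code, and keep the hits."""
    flagged = []
    for interface_id, content in responses:
        z = encode(content)   # sparse representation of the request content
        labels = classify(z)  # labels of the sensitive information found, if any
        if labels:            # otherwise move on to the next interface
            flagged.append({"interface": interface_id, "labels": list(labels)})
    return flagged


# Stand-ins (assumptions): a two-feature code and a threshold rule
def toy_encode(content: str) -> List[float]:
    return [float(sum(ch.isdigit() for ch in content)), float(content.count("@"))]


def toy_classify(z: Sequence[float]) -> List[str]:
    return ["mailbox_address"] if z[1] > 0 else []


responses = [("GET /users", "mail: a@b.com"), ("GET /health", "ok")]
print(detect_leaks(responses, toy_encode, toy_classify))
```

Because the classifier only sees the short code `z` rather than the full content, its per-request cost depends on the length of the code, which is the efficiency argument made above.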
In a second aspect, an embodiment of the present invention provides a sensitive information leakage detecting device, as shown in fig. 4, including a normalization module 401, an acquisition module 402, a judging module 403, an obtaining module 404, and a generating module 405, specifically:
the normalization module 401 is configured to perform normalization processing on an API interface of the application system through the Swagger interface management unit, so as to obtain a normalized API interface;
an acquisition module 402, configured to acquire, after the request method of the normalized API interface is encapsulated, the request content returned by the normalized API interface;
a judging module 403, configured to judge whether the request content has sensitive information through the trained sensitive information recognition model, where the sensitive information includes at least one of an identification card number, a phone number, a mailbox address, business license information, and bank card information;
an obtaining module 404, configured to tag the request content with sensitive information when there is sensitive information, obtain the sensitive information content, and determine a corresponding normalized API interface as a sensitive information leakage API interface; the sensitive information label comprises an identity card number label, a telephone number label, a mailbox address label, a business license information label or a bank card information label;
The generating module 405 is configured to generate a sensitive information leakage detection result according to the sensitive information content and interface information of the sensitive information leakage API interface.
The sensitive information leakage detection device of this embodiment is used for executing the sensitive information leakage detection method provided by the above embodiment. The content returned by the API interface of the application system under detection can be identified through the sensitive information identification model, so that missed detections and false detections of sensitive information are reduced and the accuracy and efficiency of sensitive information leakage detection are improved.
In this embodiment, the sensitive information leakage detecting device is an electronic device and is hereinafter referred to as the electronic device. Fig. 5 shows a schematic diagram of a computer system suitable for use in implementing an embodiment of the invention.
It should be noted that, the computer system of the electronic device shown in fig. 5 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present invention.
As shown in fig. 5, the computer system includes a central processing unit (Central Processing Unit, CPU) 1801, which can perform various appropriate actions and processes, such as performing the methods described in the above embodiments, according to a program stored in a Read-Only Memory (ROM) 1802 or a program loaded from a storage section 1808 into a random access Memory (Random Access Memory, RAM) 1803. In the RAM 1803, various programs and data required for system operation are also stored. The CPU 1801, ROM 1802, and RAM 1803 are connected to each other via a bus 1804. An Input/Output (I/O) interface 1805 is also connected to the bus 1804.
The following components are connected to the I/O interface 1805: an input section 1806 including a keyboard, a mouse, and the like; an output section 1807 including a Cathode Ray Tube (CRT), a liquid crystal display (Liquid Crystal Display, LCD), a speaker, and the like; a storage section 1808 including a hard disk or the like; and a communication section 1809 including a network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 1809 performs communication processing via a network such as the internet. The drive 1810 is also connected to the I/O interface 1805 as needed. Removable media 1811, such as magnetic disks, optical disks, magneto-optical disks, semiconductor memory, and the like, is installed on the drive 1810 as needed, so that a computer program read therefrom is installed into the storage section 1808 as needed.
In particular, according to embodiments of the present invention, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present invention include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via the communication section 1809, and/or installed from the removable medium 1811. When the computer program is executed by the central processing unit (CPU) 1801, the various functions defined in the system of the present invention are performed.
It should be noted that, the computer readable medium shown in the embodiments of the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with a computer-readable computer program embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
A computer program embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. Where each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present invention may be implemented by software, or may be implemented by hardware, and the described units may also be provided in a processor. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
Specifically, the electronic device of the present embodiment includes a processor and a memory, where a computer program is stored, and when the computer program is executed by the processor, the method provided in the foregoing embodiment is implemented.
As another aspect, the present invention also provides a computer-readable storage medium that may be contained in the electronic device described in the above-described embodiment; or may exist alone without being incorporated into the electronic device. The storage medium carries one or more computer programs which, when executed by a processor of the electronic device, cause the electronic device to implement the methods provided in the embodiments described above.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the invention. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present invention may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a mobile hard disk) or on a network, and which includes several instructions to cause a computing device (such as a personal computer, a host server, a touch terminal, or a network device) to perform the method according to the embodiments of the present invention.
The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the above description, numerous specific details are provided to give a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. A sensitive information leakage detection method, characterized by comprising:
normalizing an API interface of an application system through a Swagger interface management unit to obtain a normalized API interface;
after the request method of the normalized API is encapsulated, obtaining request content returned by the normalized API;
judging whether the request content has sensitive information or not through a trained sensitive information identification model, wherein the sensitive information comprises at least one of an identity card number, a telephone number, a mailbox address, business license information and bank card information;
if yes, marking the request content with a sensitive information label to obtain sensitive information content, and determining a corresponding normalized API interface as a sensitive information leakage API interface; the sensitive information label comprises an identity card number label, a telephone number label, a mailbox address label, a business license information label or a bank card information label;
and generating a sensitive information leakage detection result according to the sensitive information content and the interface information of the sensitive information leakage API interface.
2. The sensitive information leakage detecting method according to claim 1, wherein the step of judging whether the requested content has sensitive information by a sensitive information identification model comprises:
Inputting the request content into a trained dictionary learning algorithm model to obtain sparse representation information of the request content;
and inputting the sparse representation information into a trained sensitive information recognition model to judge whether sensitive information exists or not.
3. The sensitive information leakage detecting method according to claim 2, wherein before the step of normalizing the API interface of the application system by the Swagger interface managing unit to obtain a normalized API interface, further comprising:
acquiring a plurality of sample normal data and sample sensitive data, wherein the sample sensitive data contains sensitive information, and the sample normal data does not contain the sensitive information;
preprocessing the normal sample data and the sensitive sample data to obtain processed sample data, wherein the preprocessing comprises denoising and normalization;
according to the processed sample data and a dictionary learning algorithm, a base vector and a dictionary matrix are obtained, wherein the base vector is used for carrying out sparse representation on the processed sample data;
and constructing a dictionary learning algorithm model according to the dictionary matrix.
4. The sensitive information leakage detecting method according to claim 3, wherein the step of obtaining a basis vector and a dictionary matrix from the processed sample data and a dictionary learning algorithm comprises:
Representing the processed sample data as a linear combination x of basis vectors;
determining a base vector z and a dictionary matrix D according to the sparse expression min ||x - Dz||^2 + λ||z||_1 of x, where λ is a regularization parameter, ||z||_1 denotes the L1 norm of the base vector z, and z is a sparse representation of x.
5. The sensitive information leakage detecting method according to claim 3, wherein the step of preprocessing the sample normal data and the sample sensitive data further comprises:
modifying at least part of the sample normal data and part of the sample sensitive data to obtain adversarial sample data;
preprocessing the sample normal data, the sample sensitive data and the adversarial sample data.
6. The sensitive information leakage detecting method according to claim 1, wherein the sensitive information recognition model includes an identity card number recognition model, a telephone number recognition model, a mailbox address recognition model, a business license information recognition model, and a bank card information recognition model, and the method further comprises, before the step of obtaining the request content returned by the normalized API interface:
and packaging the identification card number identification model, the telephone number identification model, the mailbox address identification model, the business license information identification model and the bank card information identification model into classes.
7. The sensitive information leakage detecting method according to any one of claims 1 to 6, further comprising, after the step of generating a sensitive information leakage detecting result from the sensitive information content and interface information of the sensitive information leakage API interface:
and sending the sensitive information leakage detection result to a display terminal so that a manager can check and verify the accuracy of the sensitive information leakage detection result through the display terminal.
8. A sensitive information leakage detecting apparatus, characterized by comprising:
the normalization module is used for performing normalization processing on the API interface of the application system through the Swagger interface management unit to obtain a normalized API interface;
the acquisition module is used for acquiring the request content returned by the normalized API interface after the request method of the normalized API interface is encapsulated;
the judging module is used for judging, through the trained sensitive information recognition model, whether the request content contains sensitive information, wherein the sensitive information comprises at least one of an identity card number, a telephone number, a mailbox address, business license information and bank card information;
the obtaining module is used for marking the request content with a sensitive information label when sensitive information exists, obtaining sensitive information content, and determining a corresponding normalized API interface as a sensitive information leakage API interface; the sensitive information label comprises an identity card number label, a telephone number label, a mailbox address label, a business license information label or a bank card information label;
and the generation module is used for generating a sensitive information leakage detection result according to the sensitive information content and the interface information of the sensitive information leakage API interface.
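The module chain of claim 8 can be sketched as a short pipeline: list the endpoints from a Swagger/OpenAPI-style description, request each one, run a detector over the returned content, and collect a finding for every endpoint that leaks. The `fetch` and `detect` callables are injected so that the sketch stays runnable without a network or a trained model. These two parameters, the `LeakFinding` record and the function names are hypothetical.

```python
from dataclasses import dataclass

HTTP_METHODS = {"get", "post", "put", "delete", "patch"}


@dataclass
class LeakFinding:
    """One entry of the detection result: which interface leaked what."""
    method: str
    path: str
    labels: dict  # sensitive information label -> list of matched content


def list_endpoints(spec):
    """Flatten a Swagger/OpenAPI-style 'paths' object into (METHOD, path) pairs."""
    endpoints = []
    for path, operations in spec.get("paths", {}).items():
        for method in operations:
            # Path items may also hold non-method keys such as 'parameters'.
            if method.lower() in HTTP_METHODS:
                endpoints.append((method.upper(), path))
    return endpoints


def detect_leaks(spec, fetch, detect):
    """Request every endpoint and record those whose content is flagged.

    fetch(method, path) returns the response body as text.
    detect(text) returns {label: [matches]}, or an empty dict if clean.
    """
    findings = []
    for method, path in list_endpoints(spec):
        labels = detect(fetch(method, path))
        if labels:
            findings.append(LeakFinding(method, path, labels))
    return findings


if __name__ == "__main__":
    spec = {"paths": {"/users": {"get": {}}, "/health": {"get": {}}}}
    canned = {
        ("GET", "/users"): '{"phone": "13800138000"}',
        ("GET", "/health"): '{"status": "ok"}',
    }

    def fake_detect(text):
        # Stand-in for the trained sensitive information recognition model.
        return {"telephone_number": ["13800138000"]} if "13800138000" in text else {}

    for finding in detect_leaks(spec, lambda m, p: canned[(m, p)], fake_detect):
        print(finding)
```

Keeping the request and detection steps as parameters mirrors the separation between the acquisition, judging and generation modules, and lets each be tested on its own.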
9. A sensitive information leakage detecting apparatus, characterized by comprising a processor and a memory, the memory having stored thereon a computer program which, when executed by the processor, implements the sensitive information leakage detecting method of any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the sensitive information leakage detection method according to any one of claims 1 to 7.
CN202310972203.6A 2023-08-02 2023-08-02 Sensitive information leakage detection method, device and storage medium Pending CN116881971A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310972203.6A CN116881971A (en) 2023-08-02 2023-08-02 Sensitive information leakage detection method, device and storage medium

Publications (1)

Publication Number Publication Date
CN116881971A true CN116881971A (en) 2023-10-13

Family

ID=88264385

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310972203.6A Pending CN116881971A (en) 2023-08-02 2023-08-02 Sensitive information leakage detection method, device and storage medium

Country Status (1)

Country Link
CN (1) CN116881971A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117688540A (en) * 2024-02-01 2024-03-12 杭州美创科技股份有限公司 Interface sensitive data leakage detection defense method and device and computer equipment
CN117688540B (en) * 2024-02-01 2024-04-19 杭州美创科技股份有限公司 Interface sensitive data leakage detection defense method and device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination