CN110598411A

CN110598411A - Sensitive information detection method and device, storage medium and computer equipment

Info

Publication number: CN110598411A
Application number: CN201910899954.3A
Authority: CN
Inventors: 胡舸; 张尧
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2019-09-23
Filing date: 2019-09-23
Publication date: 2019-12-20

Abstract

The application relates to a sensitive information detection method and device, a computer-readable storage medium, and computer equipment. A target code repository matching a target feature tag is obtained from a code hosting platform; a code risk scenario is determined and the risk detection rule corresponding to that scenario is obtained; risk detection is then performed locally on the obtained target code repository using the risk detection rule; and sensitive information in the target code repository is determined according to the risk detection result. This provides a coarse-to-fine sensitive information detection strategy: matching target code repositories are first found on the code hosting platform using feature tags, and their code is then examined precisely on the local machine using risk detection rules, so that sensitive information is detected accurately and the risk of sensitive information leakage is reduced.

Description

Sensitive information detection method and device, storage medium and computer equipment

Technical Field

The present application relates to the field of information security technologies, and in particular, to a method and an apparatus for detecting sensitive information, a computer-readable storage medium, and a computer device.

Background

With the rapid development of information processing technology, network applications and computer software are widely used across industries. To build software ecosystems, these industries increasingly open-source the code that supports their network applications and computer software. At the same time, more and more users manage their code on code hosting platforms such as GitHub so that code projects can be developed and managed more conveniently. Open-sourcing code through a code hosting platform, however, carries the risk of leaking sensitive information such as key business logic and user passwords.

However, when the detection schemes provided by the conventional technology detect sensitive information on a code hosting platform, improperly selected features in the identification and detection process easily make the detection inaccurate, so the expected detection effect cannot be achieved.

Disclosure of Invention

Based on this, it is necessary to provide a sensitive information detection method, apparatus, computer-readable storage medium and computer device for solving the technical problem that the conventional technology is inaccurate in detecting sensitive information.

A sensitive information detection method, comprising:

acquiring a target code repository matching a target feature tag from a code hosting platform, the target feature tag being prestored in a local feature tag library;

determining a code risk scenario, and acquiring a risk detection rule corresponding to the code risk scenario;

performing, based on the risk detection rule, risk detection locally on the code of the target code repository to obtain a risk detection result; and

determining sensitive information in the target code repository according to the risk detection result.

An apparatus for sensitive information detection, the apparatus comprising:

a repository acquisition module, used for acquiring a target code repository matching a target feature tag from a code hosting platform, the target feature tag being prestored in a local feature tag library;

a rule acquisition module, used for determining a code risk scenario and acquiring a risk detection rule corresponding to the code risk scenario;

a risk detection module, used for performing risk detection locally on the code of the target code repository based on the risk detection rule to obtain a risk detection result; and

an information determining module, used for determining the sensitive information in the target code repository according to the risk detection result.

A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:

acquiring a target code repository matching a target feature tag from a code hosting platform, the target feature tag being prestored in a local feature tag library; determining a code risk scenario, and acquiring a risk detection rule corresponding to the code risk scenario; performing, based on the risk detection rule, risk detection locally on the code of the target code repository to obtain a risk detection result; and determining sensitive information in the target code repository according to the risk detection result.

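A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of:

acquiring a target code repository matching a target feature tag from a code hosting platform, the target feature tag being prestored in a local feature tag library; determining a code risk scenario, and acquiring a risk detection rule corresponding to the code risk scenario; performing, based on the risk detection rule, risk detection locally on the code of the target code repository to obtain a risk detection result; and determining sensitive information in the target code repository according to the risk detection result.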

The sensitive information detection method and device, the computer-readable storage medium, and the computer equipment described above first acquire the target code repository matching the target feature tag from the code hosting platform, determine the code risk scenario, and acquire the risk detection rule corresponding to the code risk scenario. The risk detection rule is then used to perform risk detection locally on the acquired target code repository, and the sensitive information in the target code repository is determined according to the risk detection result. This provides a coarse-to-fine sensitive information detection strategy: matching target code repositories are first found on the code hosting platform using feature tags, and their code is then examined precisely on the local machine using risk detection rules, so that sensitive information is detected accurately and the risk of sensitive information leakage is reduced.

Drawings

FIG. 1 is a diagram of an application environment of a sensitive information detection method in one embodiment;

FIG. 2 is a flow diagram illustrating a method for sensitive information detection in one embodiment;

FIG. 3 is a schematic flow chart of a sensitive information detection method in another embodiment;

FIG. 4 is a schematic diagram of an interface for a detection report in an example application;

FIG. 5 is a schematic diagram of an interface for asset management in an application example;

FIG. 6 is a schematic flow chart of a method for detecting leakage of source repository data in an application example;

FIG. 7 is a block diagram of a sensitive information detection apparatus according to an embodiment;

FIG. 8 is a block diagram of a computer device in one embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The sensitive information detection method provided by the present application can be applied to the application environment shown in FIG. 1, which is an application environment diagram of the method in one embodiment. The application environment may include a terminal 100 and a server 200. The server 200 may be a server of a code hosting platform such as GitHub or GitLab, and the terminal 100 may be connected to the server 200 through a network. For convenience of description, the code hosting platform is hereinafter treated as the server 200, that is, access by the terminal 100 to the server 200 corresponds to access by the terminal 100 to the code hosting platform 200.

Specifically, the terminal 100 may obtain a target code repository matching a target feature tag from the code hosting platform 200. After obtaining the target code repository, the terminal 100 may determine a code risk scenario and obtain the corresponding risk detection rule. Based on that rule, the terminal 100 may then perform risk detection locally on the code in the target code repository to obtain a risk detection result, and determine the sensitive information in the target code repository according to that result. In this way, target code repositories matching the feature tag are first coarsely searched on the code hosting platform 200, and the risk detection rule corresponding to the code risk scenario is then used locally at the terminal 100 to examine the code precisely. This improves the accuracy of sensitive information detection, helps prevent sensitive information resources that need protection from leaking to the code hosting platform in the form of code data, and reduces the risk of sensitive information leakage.

In the application scenario, the terminal 100 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 200 may be implemented as a stand-alone server or as a server cluster comprising a plurality of servers.

In one embodiment, as shown in FIG. 2, a sensitive information detection method is provided. FIG. 2 is a schematic flow chart of the method in this embodiment, and the description mainly takes the method as applied to the terminal 100 in FIG. 1 as an example. Referring to FIG. 2, the sensitive information detection method specifically includes the following steps:

Step S201, acquiring a target code repository matching the target feature tag from the code hosting platform.

In this step, the code hosting platform 200 may be a public code hosting platform such as GitHub or GitLab, to which users upload code data for management. The uploaded code data may contain sensitive information that needs to be protected, such as account passwords and a company's intranet IP addresses. If such information is leaked on the code hosting platform 200, information security problems such as intrusion into company servers can easily follow.

Based on this, the terminal 100 needs to scan the code hosting platform 200 for sensitive information. The user may preset feature tags, which are mainly used to identify code data on a code hosting platform 200 such as GitHub that is related to information resources needing protection. Common information resources that need protection include domain names, IP addresses, weak passwords (initialization passwords), and the like. With the feature tags, code data related to these information resources can be found among the massive code data on the code hosting platform 200. Specifically, the terminal 100 may store at least one feature tag in advance in a local feature tag library; generally, the library stores multiple feature tags. The terminal 100 may then use at least one feature tag in the feature tag library as a target feature tag and obtain a target code repository matching the target feature tag from the code hosting platform, where the target code repository may be a code repository containing code corresponding to the target feature tag. For example, if user A stores a weak password M in code repository C of the code hosting platform 200 in the form of code data, the terminal 100 may use the content of the weak password M as a feature tag and search the code hosting platform 200 for the code repository C containing it. This step mainly performs a coarse search on the code hosting platform using the target feature tag and acquires the code repositories matching the target feature tag as target code repositories.
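As a minimal illustration of the coarse search in step S201, the following Python sketch matches feature tags against repository contents held in memory. The dictionary of repositories stands in for the search interface of the code hosting platform, and the function name `coarse_search` and the sample data are assumptions made for illustration only.

```python
def coarse_search(repositories, feature_tags):
    """Return {repository name: [matched tags]} for repositories containing any tag.

    `repositories` maps a repository name to its text content. It is an
    in-memory stand-in for the platform's search interface (an assumption).
    """
    matches = {}
    for name, content in repositories.items():
        hits = [tag for tag in feature_tags if tag in content]
        if hits:
            matches[name] = hits
    return matches


# Hypothetical sample data: repository C of user A contains the weak password M.
repositories = {
    "userA/repoC": "db_password = 'weakpassM'",
    "userB/repoD": "print('hello world')",
}
target_repositories = coarse_search(repositories, ["weakpassM"])
```

Only repositories with at least one hit are returned, which mirrors the idea that the coarse stage narrows the massive code data down to a small set of target code repositories.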

Step S202, determining a code risk scene, and acquiring a risk detection rule corresponding to the code risk scene.

In this step, the terminal 100 may obtain known risk scenarios in which sensitive information is easily leaked through code data; for example, the terminal 100 may classify a plurality of code risk scenarios according to historical data of code risk detection. Content in code that carries a risk of revealing sensitive information includes, but is not limited to: ssh account credentials, mysql account credentials, redis account credentials, mongodb account credentials, tokens/keys, and key files. Such content usually appears in code in the form of the following code risk scenarios, and this step may further acquire the risk detection rules corresponding to each scenario, which are used to scan locally and precisely whether the code in the target code repository carries an information leakage risk.

The first is a configuration file scenario. In this scenario, whether the code repository contains specified file name features, extension features, and content features can be used as a risk detection rule. The second is a hard coding scenario, in which risk detection rules can be derived from the way various kinds of code use the APIs for connecting to mysql. Taking Python connecting to mysql as an example, mysqldb.connection() and mysql.connection() may be extracted as two risk detection rules.

The third is an operation and maintenance document scenario, in which account credentials or work notes are mainly recorded in script form. A common example for mysql is: mysql -uroot -ptest -P1234 abc. Here, the risk detection rules mysql -u.* or -p.* may be extracted, or a mysql + ip regex may be used directly for risk detection. Automatic ssh login is typically implemented with expect scripts, such as:

spawn ssh -l conantest.abc.10.10.10.10

expect "password:"

send "testpwd"

At this point, the risk detection rule spawn (ssh|sftp|scp) may be extracted; dropping spawn from the rule covers a wider range of scenarios.

The fourth is a token/key scenario, in which risk detection rules can be extracted by analyzing the tokens and keys of various businesses and cloud services. The fifth is key file monitoring, in which corresponding risk detection rules can be obtained for each key type, such as RSA Private Key, EC Private Key, and PGP Private Key.
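The scenario-specific rules described above can be sketched as a table of regular expressions keyed by scenario. This is only an illustrative sketch: the scenario names, the table `RISK_RULES`, and the exact patterns are assumptions chosen to resemble the examples in the text, not the rules actually used.

```python
import re

# Hypothetical rule table: scenario name -> compiled patterns applied to one line of code.
RISK_RULES = {
    # Hard coding scenario: API calls that open a mysql connection from Python.
    "hard_coding": [re.compile(r"mysqldb\.connect|mysql\.connect", re.IGNORECASE)],
    # Operation and maintenance document scenario: credentials on a command line.
    "ops_document": [
        re.compile(r"\bmysql\s+-u\s*\S+"),
        re.compile(r"\b(?:ssh|sftp|scp)\s+\S+"),  # "spawn" prefix dropped for wider coverage
    ],
    # Key file monitoring: header lines of private key files.
    "key_file": [re.compile(r"-----BEGIN (?:RSA|EC|PGP) PRIVATE KEY")],
}


def matching_scenarios(line):
    """Return the sorted names of all scenarios whose rules hit the given line."""
    return sorted(
        name
        for name, patterns in RISK_RULES.items()
        if any(pattern.search(line) for pattern in patterns)
    )
```

Keeping the rules grouped by scenario makes it possible to enable, tune, or extend one scenario without touching the others.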

Step S203, based on the risk detection rule, performing risk detection on the code of the target code repository locally, and obtaining a risk detection result.

In this step, after acquiring the risk detection rule, the terminal 100 may use it to scan the code in the target code repository precisely, completing the risk detection locally and obtaining a risk detection result. The risk detection result indicates whether the target code repository contains code content that conforms to the risk detection rule; it may include a code line suspected of revealing sensitive information, the context of that code, the address of the file in which the code is located, and the like. After obtaining the risk detection result, the terminal 100 may further apply display optimizations (for example, marking the field that hit the rule in red, displaying the code in a predetermined format, and displaying comments in green) and then display the result. Because the terminal 100 performs the risk detection on the code of the target code repository locally, the risk detection rules can be more detailed, accurate, and standardized. When the code of the target code repository is checked against multiple risk detection rules formulated for multiple different code risk scenarios, code data at risk of leaking sensitive information can therefore be found more accurately and used as the current risk detection result.
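A local scan of this kind might look like the sketch below, which records the suspicious line, its surrounding context, and the file path for each hit. The `Finding` class, the function `scan_file`, and the sample pattern are hypothetical names introduced for illustration; the display optimizations mentioned above are not modeled.

```python
import re
from dataclasses import dataclass


@dataclass
class Finding:
    path: str       # address of the file in which the code is located
    line_no: int    # 1-based number of the suspicious line
    line: str       # the code line suspected of revealing sensitive information
    context: list   # the suspicious line together with its neighbouring lines


def scan_file(path, text, patterns, context_lines=1):
    """Apply every pattern to every line and collect one Finding per matching line."""
    lines = text.splitlines()
    findings = []
    for index, line in enumerate(lines):
        if any(pattern.search(line) for pattern in patterns):
            start = max(0, index - context_lines)
            end = min(len(lines), index + context_lines + 1)
            findings.append(Finding(path, index + 1, line, lines[start:end]))
    return findings


# Hypothetical input: an operations note containing a mysql command line.
sample_text = "# connect to the test db\nmysql -uroot -ptest -P1234 abc\n# done"
sample_findings = scan_file("docs/ops.md", sample_text, [re.compile(r"\bmysql\s+-u")])
```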

And step S204, determining sensitive information in the target code warehouse according to the risk detection result.

In this step, after obtaining the risk detection results, the terminal 100 may display the risk detection results, so that the user may select a part of the risk detection results from the risk detection results, and the terminal 100 may determine the part of the risk detection results selected by the user according to the selection operation of the user and use the part of the risk detection results as the sensitive information in the target code repository, thereby completing the detection process of the sensitive information. In some embodiments, the terminal 100 may also directly determine all risk detection results obtained in step S203 as sensitive information in the target code repository.

According to the sensitive information detection method, the terminal firstly obtains the target code warehouse matched with the target characteristic mark from the code hosting platform, then the terminal can determine the code risk scene and obtain the risk detection rule corresponding to the code risk scene, the risk detection rule is utilized to carry out the risk detection on the obtained target code warehouse locally, and finally the sensitive information in the target code warehouse is determined according to the risk detection result, so that a sensitive information detection strategy from rough to fine is provided.

In one embodiment, before step S201 acquires the target code repository matching the target feature tag from the code hosting platform, the following steps may be further included:

the terminal 100 acquires a target information resource to be protected, analyzes the target information resource, determines a sensitive information feature of the target information resource, generates a feature tag corresponding to the sensitive information feature, and constructs a feature tag library based on the feature tag.

In this embodiment, the terminal 100 may generate a corresponding feature tag for each target information resource to be protected, so as to construct a feature tag library from a plurality of feature tags. After the feature tag library is constructed, it may be stored locally at the terminal 100. When the terminal 100 needs to detect sensitive information on the code hosting platform 200, one or more feature tags may be extracted from the locally prestored feature tag library and used as target feature tags for scanning. For example, for a company, a user may take domain names, intranet and extranet IP addresses, account passwords of various system servers, and commonly used tokens/keys as target information resources. These target information resources may then be analyzed and classified to determine their sensitive information features, for example by grouping domain names into one class and account passwords into another, and the corresponding feature tags may be generated from the sensitive information features of each class. The following description takes a domain name, an IP address, and a password as examples, in conjunction with GitHub:

a. For domain name processing:

The GitHub API supports exact search of domain names; enclosing the search term in quotation marks ("") achieves an exact search. However, a company's domain name assets are generally large, ranging from a few thousand to tens of thousands, and many test domain names change constantly (being newly added or abandoned). The domain names can therefore be aggregated, and the aggregated domain names used as feature tags. Aggregating to higher-level domain names not only preserves the discovery rate but also makes the system more robust, since the rules do not need to change frequently.

b. For IP processing:

According to practical experience with GitHub search, after the intranet address segments are obtained, they are all aggregated to C segments to obtain feature tags, which can then be used for detection on GitHub. Then, following the same method of counting traversal results described for domain names, the C segments whose number of returned results is below a threshold (400) are selected for monitoring.

c. For password processing:

When the business systems and internal servers of a company are initialized, default passwords are used. Password information of this type leaks easily through code data, and the harm is serious. The number of such passwords is generally small, so they can be deployed in full in the GitHub rules without optimization. However, GitHub search cannot exactly match characters other than English letters and digits, so for a weak password such as "admin#333" the monitoring accuracy is low. In such cases, the result counting method described above is also needed to sort out weak password fields suitable for monitoring as feature tags.
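The three kinds of feature tag preparation described in items a to c can be sketched in Python as follows. This is a simplified illustration under stated assumptions: domain aggregation naively keeps the last two labels and ignores public-suffix rules, a C segment is taken to be the enclosing /24 network, and the check for exactly searchable passwords simply requires ASCII letters and digits. All function names are hypothetical.

```python
import ipaddress
from collections import Counter


def aggregate_domains(domains, levels=2):
    """Count domains by their last `levels` labels (naive; ignores public-suffix rules)."""
    counts = Counter()
    for domain in domains:
        labels = domain.strip().lower().rstrip(".").split(".")
        counts[".".join(labels[-levels:])] += 1
    return counts


def aggregate_to_c_segments(ip_addresses):
    """Collapse IPv4 addresses to their enclosing /24 (C segment) networks."""
    segments = {
        str(ipaddress.ip_network(f"{ip}/24", strict=False)) for ip in ip_addresses
    }
    return sorted(segments)


def is_exact_searchable(tag):
    """Assumed check: only ASCII letters and digits are treated as exactly searchable."""
    return tag.isascii() and tag.isalnum()
```

For instance, thousands of changing test domain names under one parent domain collapse into a single aggregated tag, so the tag library stays stable while the underlying assets change.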

Therefore, after the feature labels of various target information resources are obtained, the feature labels can be stored in a local database of the terminal 100, so as to construct a feature label library. Further, considering that the feature tags stored in the feature tag library are used for risk detection in the code hosting platform, if the information stored in the feature tag library is maliciously modified, the accuracy of the risk detection will be seriously affected, and a serious information security problem may occur.

Based on this, in some embodiments, the method may further include the following steps to ensure that the information in the feature tag library is not tampered with. The specific steps include:

the terminal 100 uploads the generated feature tag to the blockchain for storage.

The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, in which each data block contains the information uploaded to the blockchain for storage at one time and can be used to verify the validity of that information (anti-counterfeiting) and to generate the next block. In this embodiment, the terminal 100 may generate a plurality of feature tags and upload them to the blockchain for storage while also storing them locally.

Further, on the basis that the terminal 100 stores the feature tag on the blockchain, the step of acquiring, by the terminal 100, the target code repository matched with the target feature tag from the code hosting platform 200 may specifically include:

the terminal 100 obtains a candidate feature tag from the local feature tag library. The terminal 100 may then use the candidate feature tag as a query condition to check whether the blockchain stores a feature tag identical to it. If an identical feature tag is stored in the blockchain, the candidate feature tag stored locally at the terminal 100 has not been maliciously tampered with, so the terminal 100 can set the candidate feature tag as the target feature tag. This ensures that the target feature tag used for detecting sensitive information on the code hosting platform is reliable, and thereby also ensures the accuracy of the sensitive information detection. Finally, the terminal 100 may download the target code repository matching the target feature tag on the code hosting platform 200 to the local machine, so that the target code repository can be scanned precisely at the terminal 100 using the risk detection rule.
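The tamper check described above can be illustrated with a small Python sketch. The source does not specify how the blockchain is queried, so an in-memory set of SHA-256 digests stands in for the on-chain record here; that substitution, the use of digests rather than the tags themselves, and the function names are all assumptions.

```python
import hashlib


def tag_digest(tag):
    """Return the SHA-256 hex digest of a feature tag."""
    return hashlib.sha256(tag.encode("utf-8")).hexdigest()


def select_target_tags(local_tags, ledger_digests):
    """Keep only the local candidate tags whose digest is also recorded on the ledger.

    `ledger_digests` is an in-memory stand-in for a blockchain query (an assumption).
    """
    return [tag for tag in local_tags if tag_digest(tag) in ledger_digests]


# Hypothetical data: one tag was uploaded when it was generated, one was injected later.
ledger = {tag_digest("example.com"), tag_digest("10.1.2.0/24")}
trusted_tags = select_target_tags(["example.com", "attacker-injected.com"], ledger)
```

A candidate tag that was never uploaded, or that was altered locally after upload, produces a digest that is absent from the ledger and is therefore not used as a target feature tag.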

In some embodiments, downloading the target code repository matching the target feature tag on the code hosting platform to the local machine, as described in the foregoing embodiments, may specifically include:

the terminal 100 may search the code hosting platform 200 for a code repository matching the target feature tag as a candidate code repository. The terminal 100 may then obtain the update time of the candidate code repository, where the update time refers to the time of the most recent code push to that repository. Next, the terminal 100 may acquire a time range preset by the user, for example the last two months. If the update time of the candidate code repository falls within the time range, that is, the code repository was updated within the last two months, the terminal 100 may set the candidate code repository as the target code repository and download it to the local machine. The main consideration in this embodiment is that, if the time range is not limited, the search interface of a code hosting platform such as the GitHub API returns all results since the platform went online. In practical applications, little attention is usually paid to data leaks that occurred long ago, because the data may already have been exploited or may no longer be valid. A suitable time range can therefore be set according to the user's actual business needs when results are returned, so that target code repositories are screened accurately and detection efficiency is improved while the accuracy of sensitive information detection is maintained.
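A minimal Python sketch of this time-range screening is shown below. The mapping from repository name to last push time, the 60-day default standing in for "two months", and the function name are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def filter_by_update_time(candidates, now, max_age_days=60):
    """Return the names of candidate repositories pushed to within the last `max_age_days`.

    `candidates` maps a repository name to the time of its most recent code push.
    """
    cutoff = now - timedelta(days=max_age_days)
    return [name for name, pushed_at in candidates.items() if pushed_at >= cutoff]


# Hypothetical data, with a fixed "now" so that the example is reproducible.
now = datetime(2019, 9, 23, tzinfo=timezone.utc)
candidates = {
    "userA/recent-repo": datetime(2019, 9, 1, tzinfo=timezone.utc),
    "userB/stale-repo": datetime(2018, 1, 15, tzinfo=timezone.utc),
}
target_repositories = filter_by_update_time(candidates, now)
```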

In addition, the number of target code repositories returned for a target feature tag may be limited. For example, a user may set a threshold for the number of search results according to operating capacity (without loss of generality, 400 may be chosen), and only domain names whose number of search results is below 400 are set as feature tags. Here, operating capacity refers to the technical effort and time that personnel can invest. Because search results may be audited by the relevant managers, too many search results would raise the required labor cost to the point that not all results can be audited. Limiting the number of returned target code repositories therefore keeps the workload within the actual operating capacity.
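The result-count threshold can be expressed as a short filter. In this sketch the result counts are supplied as a plain dictionary, which stands in for counts returned by the platform's search interface; the names and sample numbers are hypothetical.

```python
def select_monitorable_tags(result_counts, threshold=400):
    """Keep the candidate tags whose number of search results is below the threshold."""
    return sorted(tag for tag, count in result_counts.items() if count < threshold)


# Hypothetical search result counts for three candidate feature tags.
result_counts = {"example.com": 12000, "test.example.com": 35, "10.1.2.": 399}
monitorable_tags = select_monitorable_tags(result_counts)
```

A tag that returns too many results, such as a very common top-level company domain, is left out so that the remaining alerts can actually be audited by hand.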

In an embodiment, the step of searching, by the terminal 100, a code repository matching the target feature tag from the code hosting platform 200 as a candidate code repository may specifically include:

the terminal 100 obtains a plurality of access tokens distributed by the code hosting platform 200, divides the target feature tag into a plurality of sub-target feature tags according to the number of the access tokens, then the terminal 100 searches a plurality of sub-code warehouses matched with the plurality of sub-target feature tags in the code hosting platform through the plurality of access tokens respectively, and finally sets the plurality of sub-code warehouses as candidate code warehouses.

The code hosting platform 200 usually imposes a strict limit on the frequency of access requests: an access token (token) can typically issue only a limited number of requests in a given period, after which another access token is needed to continue. For example, the GitHub API strictly limits the request frequency, typically to no more than 30 requests per minute. Because the terminal 100 usually works with multiple target feature tags when detecting sensitive information on the code hosting platform 200, searching the code repositories may take a long time. In this embodiment, the terminal 100 may first obtain multiple access tokens distributed by the code hosting platform 200 by creating different users. For GitHub, the access tokens of different users each carry their own quota of 30 requests per minute; for example, if the access tokens of user A and user B are both configured, the scanner can access GitHub 60 times per minute. Based on this, the terminal 100 may split the target feature tags into a plurality of sub-target feature tags according to the number of access tokens, allocate different sub-target feature tags to each access token, and complete the code repository scanning task in parallel through the different access tokens. That is, the terminal 100 may search the code hosting platform, through the plurality of access tokens, for the sub-code repositories matching each sub-target feature tag, and finally set each scanned sub-code repository as a candidate code repository. This works around the limit that the code hosting platform 200 places on access requests and improves the efficiency of sensitive information detection.
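The token-based splitting and parallel scan can be sketched as below. The tags are distributed round-robin across the tokens, and each token's share is searched in its own worker thread. The `search_fn(token, tag)` callable is a hypothetical stand-in for a request to the platform's search interface; no real API call is made, and rate limiting within each worker is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor


def split_tags(tags, tokens):
    """Distribute feature tags round-robin across the available access tokens."""
    if not tokens:
        raise ValueError("at least one access token is required")
    assignment = {token: [] for token in tokens}
    for index, tag in enumerate(tags):
        assignment[tokens[index % len(tokens)]].append(tag)
    return assignment


def scan_in_parallel(assignment, search_fn):
    """Search each token's share of tags in its own thread and merge the results."""

    def worker(item):
        token, tags = item
        repositories = []
        for tag in tags:
            repositories.extend(search_fn(token, tag))
        return repositories

    candidates = set()
    with ThreadPoolExecutor(max_workers=len(assignment)) as pool:
        for repositories in pool.map(worker, assignment.items()):
            candidates.update(repositories)
    return sorted(candidates)


# Hypothetical tokens and a fake search function used only for demonstration.
assignment = split_tags(["tag-a", "tag-b", "tag-c"], ["token-1", "token-2"])
candidate_repositories = scan_in_parallel(assignment, lambda token, tag: [f"repo-for-{tag}"])
```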

In one embodiment, the obtaining of the risk detection result in step S203 may include:

after the terminal 100 completes the risk detection of the code of the target code repository locally, it may obtain the initial risk detection result produced by that detection. The terminal 100 may then determine a risk code ignore baseline and filter the initial risk detection result with it to obtain the risk detection result.

Because of the particular nature of code hosting platforms, code is often spread widely through forks and copies. Applying baseline processing to the initial risk detection results with a risk code ignore baseline can eliminate a large number of false alarms in a single operation. For example, the initial risk detection result for scanning a feature keyword, "open.", may be:

… [TC open platform](https://open.AAA.com/) …

According to investigation, this initial risk detection result is code from an official source and can be propagated on the public network without harm. Because it has spread widely, however, dozens or even hundreds of files and repositories may hit the same pattern line baseline (Pattern line). Handling them file by file and repository by repository would incur a huge labor cost. Instead, "[TC open platform](https://open.AAA.com/)" can be used as a risk code ignore baseline, so that all alarms of this type are ignored directly, which greatly improves the risk detection effect and operating efficiency. The specific principle of the pattern line baseline implementation is as follows:

When a risk code ignore baseline is selected and a field of that baseline appears in an initial risk detection result, the line of the result in which the field is located can be identified and its content extracted, for example "- [TC open platform](https://open.AAA.com/)", to generate a new risk code ignore baseline. At the same time, re-filtering of all initial risk detection results is triggered: any initial risk detection result in which "- [TC open platform](https://open.AAA.com/)" appears is automatically marked as a false alarm. False alarms are filtered out in this way, and the initial risk detection results that remain after filtering with the risk code ignore baseline are used as the risk detection results.
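The baseline filtering principle can be sketched in Python as follows: the content of a line judged to be harmless becomes an ignore baseline, and every initial result whose line contains that baseline is dropped. The dictionary layout of a finding and the function names are assumptions made for illustration.

```python
def baseline_from_finding(finding):
    """Use the stripped content of the matched line as a new ignore baseline."""
    return finding["line"].strip()


def apply_ignore_baselines(findings, baselines):
    """Drop every finding whose line contains any of the ignore baselines."""
    return [
        finding
        for finding in findings
        if not any(baseline in finding["line"] for baseline in baselines)
    ]


# Hypothetical initial results: the same harmless line spread across forks and copies.
initial_findings = [
    {"repo": "u1/docs", "line": "- [TC open platform](https://open.AAA.com/)"},
    {"repo": "u2/fork", "line": "  - [TC open platform](https://open.AAA.com/)  "},
    {"repo": "u3/app", "line": "api = 'https://open.AAA.com/v1?key=abc'"},
]
baselines = [baseline_from_finding(initial_findings[0])]
remaining_findings = apply_ignore_baselines(initial_findings, baselines)
```

Marking one result as a false alarm thus removes all of its copies, while a different line that merely mentions the same domain is still reported.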

In practice, however, it often happens that sensitive information such as a server account and password is leaked but is no longer valid. This situation differs from one in which the leaked account and password are still valid, and the latter requires heightened attention. Accordingly, in some embodiments, after the sensitive information in the target code repository is determined according to the risk detection result in step S104, an automatic verification mechanism may be provided, the specific steps of which may include:

the method comprises the steps of determining the sensitive type of sensitive information, obtaining a sensitive information verification scheme adapted to the sensitive type, utilizing the sensitive information verification scheme to verify the validity of the sensitive information, and if the sensitive information is valid, sending alarm information to a terminal associated with the sensitive information.

In this embodiment, after obtaining the sensitive information in the target code repository, the terminal 100 may further automatically check whether the sensitive information is valid. Sensitive information of different sensitive types requires different sensitive information verification schemes. For example, in an environment where the network is reachable, if the terminal 100 determines that the sensitive information is a MySQL account and password, the terminal 100 may write a connection script and directly assign the matched MySQL account and password to its variables, for example:

mysql_test = "mysql -u {} -p{} -P {}".format(username, password, port). The terminal 100 may then obtain the returned result and analyze it. If the analysis shows that the sensitive information is valid, it is classed as valid sensitive information; a MySQL account and password obtained by a malicious party would seriously affect information security. The terminal 100 may therefore raise the priority of the sensitive information and may also send alarm information to the associated terminal through channels such as email and telephone alarms, thereby completing the detection, verification, and alarm process for the sensitive information.
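A minimal sketch of such a validity check is given below, assuming the standard `mysql` command-line client is installed on the verifying terminal. The function names, the `host` parameter, the timeouts, and the rule "exit code 0 means valid" are illustrative assumptions rather than details from the text.

```python
import subprocess


def build_mysql_check(username, password, host, port):
    """Return the argv for a one-shot login attempt with the mysql client."""
    return [
        "mysql",
        "-h", host,                # assumption: host extracted with the credentials
        "-P", str(port),
        "-u", username,
        f"-p{password}",           # the mysql client takes the password with no space after -p
        "--connect-timeout=5",
        "-e", "SELECT 1",          # harmless statement; only the login result matters
    ]


def credentials_are_valid(argv, run=subprocess.run):
    """Treat exit code 0 as a successful login; anything else as invalid."""
    try:
        result = run(argv, capture_output=True, timeout=15)
    except (OSError, subprocess.TimeoutExpired):
        return False               # client missing or host unreachable
    return result.returncode == 0


argv = build_mysql_check("root", "s3cret", "10.0.0.5", 3306)
print(argv)
```

Passing an argument list instead of a formatted shell string keeps leaked values, which are untrusted input, from being interpreted by a shell. A caller could raise the priority and trigger the alarm whenever `credentials_are_valid(argv)` returns `True`.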

In an embodiment, a sensitive information detection method is further provided, referring to fig. 3, where fig. 3 is a schematic flow chart of the sensitive information detection method in another embodiment, and the sensitive information detection method may include the following steps:

step S301, acquiring a target information resource to be protected, analyzing the target information resource, determining the sensitive information characteristic of the target information resource, generating a characteristic mark corresponding to the sensitive information characteristic, and constructing a characteristic mark library based on the characteristic mark;

step S302, a plurality of access tokens distributed by the code hosting platform are obtained, the target feature tag is divided into a plurality of sub-target feature tags according to the number of the access tokens, a plurality of sub-code warehouses matched with the sub-target feature tags are searched in the code hosting platform through the plurality of access tokens respectively, and the plurality of sub-code warehouses are set as candidate code warehouses;

step S303, obtaining the update time of the candidate code warehouse, setting the candidate code warehouse as a target code warehouse if the update time is within a set time range, and downloading the target code warehouse to the local;

step S304, determining a code risk scene, and acquiring a risk detection rule corresponding to the code risk scene;

step S305, based on a risk detection rule, carrying out risk detection on codes of a target code warehouse locally, obtaining an initial risk detection result obtained by the risk detection, determining a risk code ignore baseline, and filtering the initial risk detection result by using the risk code ignore baseline to obtain a risk detection result;

step S306, determining the sensitive information in the target code warehouse according to the risk detection result, determining the sensitive type of the sensitive information, acquiring a sensitive information verification scheme adapted to the sensitive type, performing validity verification on the sensitive information by using the sensitive information verification scheme, and if the sensitive information is valid, sending alarm information to a terminal associated with the sensitive information.
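The token-based parallel search of step S302 can be sketched as follows. The round-robin split, the thread pool, and the `search(token, mark)` callable are assumptions made for illustration; the fake index stands in for the code hosting platform's search interface.

```python
from concurrent.futures import ThreadPoolExecutor


def split_marks(marks, tokens):
    """Round-robin split: one sub-list of feature marks per access token."""
    groups = {token: [] for token in tokens}
    for i, mark in enumerate(marks):
        groups[tokens[i % len(tokens)]].append(mark)
    return groups


def search_all(marks, tokens, search):
    """Run search(token, mark) for each group in parallel and merge the hits."""
    groups = split_marks(marks, tokens)

    def work(item):
        token, sub_marks = item
        hits = set()
        for mark in sub_marks:
            hits.update(search(token, mark))
        return hits

    with ThreadPoolExecutor(max_workers=len(tokens)) as pool:
        results = list(pool.map(work, groups.items()))

    candidates = set()
    for hits in results:
        candidates |= hits
    return candidates


# Hypothetical stand-in for the platform search: mark -> matching warehouses.
fake_index = {"corp.example": {"a/x"}, "10.1.": {"b/y", "a/x"}, "init_pw": set()}


def fake_search(token, mark):
    return fake_index[mark]


candidates = search_all(list(fake_index), ["token-1", "token-2"], fake_search)
print(sorted(candidates))
```

Each token handles only its own share of the marks, so the per-token request budget is spent independently, and the merged set becomes the candidate code warehouses.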

The sensitive information detection method of this embodiment constructs a feature tag library for the target information resource to be protected, searches the code hosting platform in parallel, using a plurality of access tokens, for candidate code warehouses matching the target feature tags, screens target code warehouses from the candidates according to a set time range, and downloads them locally. Risk detection rules corresponding to the code risk scene are then used to perform accurate risk detection on the code of the target code warehouses locally. Combined with baseline processing, the initial risk detection results are filtered using the risk code ignore baseline to obtain the sensitive information, the validity of the sensitive information is then verified, and finally an alarm is raised for valid sensitive information.

To explain the sensitive information detection scheme provided in the embodiments of the present application more clearly, the sensitive information detection method is described below as applied to a scenario in which a large internet enterprise's information is leaked on public code hosting platforms such as Github and Gitlab. Specifically, the sensitive information detection method can be applied to protecting the information assets of the enterprise. On the product side, its function is to detect leakage of enterprise information assets, and the main product forms may be a Github scanner, a Github Hunter, and the like.

The main scenario in which sensitive information leaks from a large internet company is that employees use public code warehouses such as Github, Gitlab, and Code Cloud for code management. Even if the company clearly stipulates that company source code is a company information asset and must not be disclosed privately, the large number of employees and limited means of supervision mean that employees upload company source code to public code warehouses such as Github, intentionally or unintentionally, causing the source code to leak. Leaked code often contains sensitive information such as the company's intranet IP addresses, domain names, and initial default passwords. In addition, leaked source code allows attackers to perform white-box vulnerability mining on it, find vulnerabilities, and intrude into the company.

Public code warehouses such as Github, Gitlab, and Code Cloud can be regarded as cloud storage systems that provide cloud storage services. Cloud storage refers to a system that uses functions such as cluster applications, grid technology, or distributed file systems to bring together, through application software, a large number of storage devices of different types in a network, and that provides data storage and service access functions externally. Accordingly, a public code warehouse can provide a cloud storage service for code data to users such as employees of a large internet company, and the users can upload code data to the public code warehouse or download the code data they need from it.

The sensitive information detection scheme provided in the embodiments of the present application may be implemented as a distributed sensitive information detection system. As shown in fig. 4 and fig. 5, the system may include modules such as detection reports and asset management, and may be used by users such as company administrators as a web application. Because there are generally multiple such users, the system is in effect used by multiple terminal devices to detect sensitive information in public code warehouses and thereby ensure information security. This detection approach resembles the policy concept of cloud security: the more users there are, the safer each user is. For example, as the number of users such as company administrators grows, the number of terminal device nodes detecting sensitive information in code data grows as well. Detecting sensitive information in public code warehouses through a large mesh of terminal device nodes makes it possible to obtain the latest sensitive information leakage situation, push it to the corresponding server for automatic analysis and processing, and then distribute a solution to each terminal device.

Referring to fig. 4, fig. 4 is a schematic diagram of the detection report interface in an application example. The detection report can be divided into two parts: a Github mark report and a local detection report. The Github mark report may display the marked pattern (corresponding to the feature tag), the update time of the code warehouse, and the marked code content and its context, and may also optimize the display of the marked code content and its context. Five processing options are available for the marked code content and its context, corresponding to five different scenarios: sensitive confirmation: the warehouse is directly judged to be at risk; accurate scanning: the warehouse belongs to the company, but no risk has been found directly, so further accurate scanning is needed; ignore line pattern: the line pattern is risk-free and is added to the safe baseline; ignore file: the file is risk-free and is added to the safe baseline; ignore warehouse: the warehouse is risk-free and is added to the safe baseline.

The local detection report lists the keywords and details corresponding to each detected risk, together with four processing options: confirm: the file code leaks information, and the alarm is confirmed to be valid; confirm and verify: the alarm is confirmed to be valid, and the sensitive information in it is extracted and checked for validity; ignore file code: the alarm is invalid, that is, a false alarm; ignore warehouse: the warehouse poses no harm, and the alarm is a false alarm. Here, verification means extracting the sensitive words and calling a background script to check the validity of the sensitive information.

As shown in fig. 5, fig. 5 is a schematic diagram of the asset management interface in an application example. The asset management module is mainly divided into five sub-modules: temporary scanning, marked assets, rule management, Github account management, and administrator account management. Temporary scanning is used to issue a temporary scanning task targeting a warehouse, a user, or an organization; all warehouses covered by the task are fetched locally and enter the accurate scanning stage directly. Specifically, this is implemented through three fields, url, user, and org, which correspond to three organizational forms on Github: a warehouse url represents a single warehouse, a user may own one or more warehouses, and an organization may own one or more warehouses. By submitting the corresponding parameter, the warehouse addresses can be obtained from the Github interface for scanning. Marked assets are the assets marked on Github during scanning; they can be added and deleted, and when status is 0 an asset is not accurately scanned, otherwise it is. The rule management module manages the marking rules and the accurate matching rules in a unified way: the enable flag indicates that a rule is enabled, and the selected flag indicates that a rule is currently being scanned (this field is used to synchronize scanning progress in distributed scanning). Because the Github API strictly limits the request frequency to fewer than 30 requests per minute, the Github account management module allows multiple Github tokens to be added to get around this limit and improve scanning efficiency. The administrator account module is mainly used for managing the system, and multiple accounts can be set.
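One way to spread requests over several Github tokens while respecting a per-token budget is sketched below. The `TokenPool` class, the sliding-window bookkeeping, and the default budget of 30 requests per 60 seconds (the figure cited above, treated here as an upper bound per token) are illustrative assumptions.

```python
import collections


class TokenPool:
    """Hands out tokens so that no token exceeds `limit` calls per `window` seconds."""

    def __init__(self, tokens, limit=30, window=60.0):
        self.limit = limit
        self.window = window
        # One queue of request timestamps per token.
        self.calls = {token: collections.deque() for token in tokens}

    def acquire(self, now):
        """Return a token that still has budget at time `now`, or None."""
        for token, stamps in self.calls.items():
            # Drop timestamps that have left the sliding window.
            while stamps and now - stamps[0] >= self.window:
                stamps.popleft()
            if len(stamps) < self.limit:
                stamps.append(now)
                return token
        return None  # every token is exhausted; the caller should wait


pool = TokenPool(["token-1", "token-2"], limit=2, window=60.0)
handed_out = [pool.acquire(now=0.0) for _ in range(5)]
print(handed_out)
print(pool.acquire(now=60.0))
```

With N tokens the scanner's aggregate budget is roughly N times the single-token budget. Taking the clock as an argument keeps the sketch deterministic; a real scanner would pass `time.monotonic()`.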

This application example provides a complete data leakage detection scheme for open-source warehouses. By analyzing a large number of data leakage samples, multiple high-risk scenarios are subdivided. First, the company's information assets are analyzed and a company mark library is extracted; the company's assets are then searched for on Github to identify target warehouses with potential leaks; next, the target warehouses are checked locally for leaks, and the accuracy of the data is verified automatically. The main steps are shown in fig. 6, which is a schematic flow chart of the open-source warehouse data leakage detection method in an application example. The method can be further subdivided into the following steps:

Step 1: Establish company marking rules. This includes analyzing the company's internal and external network domain names, IP addresses, initial passwords, and common token/key formats, and extracting rules from them for detection on Github, so that target warehouses are marked on Github.

Step 2: Establish and implement leakage detection rules. Leakage rule detection mainly scans locally the target warehouses marked on Github; at this stage, multiple approaches such as regex patterns and machine learning can be combined for accurate modeling. For the detection of known risks, common known risks include, but are not limited to: ssh accounts, mysql accounts, redis accounts, mongodb accounts, tokens/keys, key files, and the like.
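A regex-based local scan for a few of the known risks listed above might look like the following sketch. The rule names and the patterns themselves are hypothetical examples written for illustration, not the rules used by the scheme; real rules would need tuning against actual leakage samples.

```python
import re

# Hypothetical rule set: rule name -> compiled pattern.
RULES = {
    "mysql_account": re.compile(r"mysql\s+-u\s*\S+\s+-p\s*\S+", re.IGNORECASE),
    "private_key_file": re.compile(r"-----BEGIN (?:RSA |OPENSSH |EC )?PRIVATE KEY-----"),
    "redis_url": re.compile(r"redis://[^\s:@/]*:[^\s@/]+@[^\s/]+"),
    "generic_token": re.compile(
        r"(?i)\b(?:token|secret|api[_-]?key)\b\s*[:=]\s*['\"][A-Za-z0-9_\-]{16,}['\"]"
    ),
}


def scan_text(text):
    """Return (line_number, rule_name) for every rule that matches a line."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in RULES.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits


sample = '''
import os
cmd = "mysql -u root -phunter2 -h 10.0.0.5"
API_KEY = "abcd1234abcd1234abcd"
cache = "redis://:p4ss@cache.internal:6379/0"
print("hello")
'''

hits = scan_text(sample)
print(hits)
```

Keeping each rule as a named pattern makes it straightforward to enable or disable individual rules and to report which risk type a hit belongs to, which the later verification step needs in order to pick a matching check.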

Step 3: Perform false alarm optimization and automatic verification. Because of the nature of Github, code is often propagated widely through forks and copies. Performing baseline processing by pattern line allows a large number of false alarms to be cleared in a single operation, which is a feature specific to open-source code hosting platforms. When a pattern line is selected to be ignored, the line content is located in the file according to the pattern field in the result, and a new pattern line baseline is generated, for example "- [TC open platform](https://open.AAA.com/)". At the same time, filtering of all search results is triggered, and any result in which "- [TC open platform](https://open.AAA.com/)" appears is automatically set as a false alarm.

Step 4: In practice, it often happens that a server account and password are leaked but are no longer valid. The risk in this case differs from that of leaking a valid account and password, which requires heightened attention. An automatic verification mechanism is therefore provided: on a terminal with network access, features of known high risks are extracted and verified automatically. For example, if a mysql account and password are extracted, a connection script is written and the matched account and password are assigned directly to its variables, for example: mysql_test = "mysql -u {} -p{} -P {}".format(username, password, port). The returned result is then evaluated, and if the credentials are verified as valid, the priority is raised and email and telephone alarms are triggered.

Because the internet is an open environment, every enterprise faces data leakage risks brought about by open-sourcing code and needs to guard against and detect them. This application example therefore provides a complete scheme for detecting leakage of enterprise sensitive data: by thoroughly working through the detection, verification, and alarm flows for sensitive information, it offers a monitoring scheme for sensitive information leakage suited to large internet enterprises with millions of assets.

In an embodiment, a sensitive information detecting apparatus is provided, as shown in fig. 7, where fig. 7 is a block diagram of a structure of the sensitive information detecting apparatus in an embodiment, and the sensitive information detecting apparatus 700 may include:

a repository acquisition module 701, configured to acquire a target code repository matching the target feature tag from the code hosting platform; the target characteristic mark is prestored in a local characteristic mark library;

a rule obtaining module 702, configured to determine a code risk scenario, and obtain a risk detection rule corresponding to the code risk scenario;

the risk detection module 703 is configured to perform risk detection on the codes of the target code repository locally based on a risk detection rule, and obtain a risk detection result;

and an information determining module 704, configured to determine sensitive information in the target code repository according to the risk detection result.

In one embodiment, the sensitive information detecting apparatus 700 may further include:

the mark library construction module is used for acquiring target information resources to be protected; analyzing the target information resource and determining the sensitive information characteristics of the target information resource; generating a feature tag corresponding to the sensitive information feature; and constructing a feature tag library based on the feature tags.

and the mark uploading module is used for uploading the characteristic mark to the block chain for storage.

In one embodiment, the warehouse acquisition module 701 is further configured to: acquiring candidate feature marks from a feature mark library; if the block chain stores the feature tag which is the same as the candidate feature tag, setting the candidate feature tag as a target feature tag; and downloading the target code warehouse matched with the target characteristic mark in the code hosting platform to the local.

In one embodiment, the warehouse acquisition module 701 is further configured to: searching a code warehouse matched with the target feature tag in the code hosting platform to serve as a candidate code warehouse; acquiring the update time of the candidate code warehouse; and if the update time is within the set time range, setting the candidate code warehouse as a target code warehouse, and downloading the target code warehouse to the local.
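The update-time screening can be illustrated with a short sketch. The 30-day window, the function name, and the mapping from warehouse name to update time are assumptions for the example; the text only specifies "a set time range".

```python
from datetime import datetime, timedelta, timezone


def select_targets(candidates, now, max_age=timedelta(days=30)):
    """Keep candidates whose update time lies within [now - max_age, now]."""
    earliest = now - max_age
    return [
        name
        for name, updated in candidates.items()
        if earliest <= updated <= now
    ]


now = datetime(2019, 9, 23, tzinfo=timezone.utc)
candidates = {
    "a/x": datetime(2019, 9, 20, tzinfo=timezone.utc),   # updated recently
    "b/y": datetime(2019, 1, 1, tzinfo=timezone.utc),    # stale
}

targets = select_targets(candidates, now)
print(targets)
```

Only the warehouses returned here would be downloaded locally, which keeps the expensive local scan focused on recently changed code.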

In one embodiment, the warehouse acquisition module 701 is further configured to: acquiring a plurality of access tokens distributed by a code hosting platform; dividing the target characteristic mark into a plurality of sub-target characteristic marks according to the number of the access tokens; searching a plurality of sub-code warehouses matched with the sub-target feature labels in the code hosting platform through a plurality of access tokens respectively; the plurality of child code repositories are set as candidate code repositories.

In one embodiment, the risk detection module 703 is further configured to: acquiring an initial risk detection result obtained by risk detection; determining a risk code ignore baseline; and filtering the initial risk detection result by using the risk code ignore baseline to obtain a risk detection result.

In one embodiment, the sensitive information detecting apparatus 700 may further include an alarm processing module, configured to: determine the sensitive type of the sensitive information; acquire a sensitive information verification scheme adapted to the sensitive type; perform a validity check on the sensitive information using the sensitive information verification scheme; and, if the sensitive information is valid sensitive information, send alarm information to a terminal associated with the sensitive information.

Fig. 8 is a diagram showing the internal structure of a computer device in one embodiment. The computer device may specifically be the terminal 100 in fig. 1. As shown in fig. 8, the computer device includes a processor, a memory, a network interface, an input device, and a display screen connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the sensitive information detection method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the sensitive information detection method. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen, a key, a track ball, or a touch pad arranged on the housing of the computer device, or an external keyboard, touch pad, mouse, or the like.

Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program, the computer program, when executed by the processor, causing the processor to perform the steps of the above sensitive information detection method. Here, the steps of the sensitive information detection method may be steps in the sensitive information detection methods of the above-described respective embodiments.

In one embodiment, a computer-readable storage medium is provided, in which a computer program is stored, which, when executed by a processor, causes the processor to perform the steps of the above-mentioned sensitive information detection method. Here, the steps of the sensitive information detection method may be steps in the sensitive information detection methods of the above-described respective embodiments.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, any such combination should be considered within the scope of the present specification as long as the combined technical features do not contradict one another.

The above-mentioned embodiments express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method for sensitive information detection, comprising:

2. The method of claim 1, wherein, prior to obtaining from the code hosting platform the target code repository that matches the target feature tag, the method further comprises:

acquiring target information resources to be protected;

analyzing the target information resource and determining the sensitive information characteristics of the target information resource;

generating a feature tag corresponding to the sensitive information feature;

and constructing the feature tag library based on the feature tags.

3. The method of claim 2, further comprising:

uploading the feature tag to a block chain for storage;

the obtaining of the target code repository matched with the target feature tag from the code hosting platform includes:

acquiring candidate feature marks from the feature mark library;

if the block chain stores the feature tag which is the same as the candidate feature tag, setting the candidate feature tag as the target feature tag;

and downloading the target code warehouse matched with the target characteristic mark in the code hosting platform to the local.

4. The method of claim 3, wherein downloading locally the target code repository of the code hosting platform that matches the target feature tag comprises:

searching a code repository matching the target feature tag in the code hosting platform as a candidate code repository;

acquiring the update time of the candidate code warehouse;

and if the updating time is within a set time range, setting the candidate code warehouse as the target code warehouse, and downloading the target code warehouse to the local.

5. The method of claim 4, wherein the searching the code hosting platform for a code repository matching the target feature tag as a candidate code repository comprises:

obtaining a plurality of access tokens distributed by the code hosting platform;

dividing the target characteristic mark into a plurality of sub-target characteristic marks according to the number of the access tokens;

searching the code hosting platform for a plurality of sub-code repositories matching the plurality of sub-target feature tags through the plurality of access tokens, respectively;

setting the plurality of sub-code repositories as the candidate code repository.

6. The method of claim 1, wherein obtaining the risk detection result comprises:

acquiring an initial risk detection result obtained by risk detection;

determining a risk code ignore baseline;

and filtering the initial risk detection result by using the risk code ignore baseline to obtain the risk detection result.

7. The method of claim 1, wherein determining sensitive information in the target code repository based on the risk detection result comprises:

determining a sensitive type of the sensitive information;

acquiring a sensitive information verification scheme adapted to the sensitive type;

carrying out validity check on the sensitive information by using the sensitive information checking scheme;

and if the sensitive information is effective sensitive information, sending alarm information to a terminal associated with the sensitive information.

8. An apparatus for sensing sensitive information, the apparatus comprising:

9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 7.

10. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the computer program, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 7.