CN115373694A

CN115373694A - Sensitive information detection method, device, equipment and storage medium

Info

Publication number: CN115373694A
Application number: CN202211117463.7A
Authority: CN
Inventors: 徐艺庭; 白兴伟
Original assignee: Beijing Huayuan Information Technology Co Ltd
Current assignee: Beijing Huayuan Information Technology Co Ltd
Priority date: 2022-09-14
Filing date: 2022-09-14
Publication date: 2022-11-22

Abstract

Embodiments of the disclosure provide a sensitive information detection method, apparatus, device, and storage medium, belonging to the field of data security. The method comprises: decompiling a program to be detected to obtain corresponding assembly language code; extracting character strings from the assembly language code to generate a data set; clustering the data set and classifying the clusters formed by the data object points in the data set; and determining whether each cluster is sensitive information. In this way, the category and content of sensitive information in the program to be detected can be identified efficiently and accurately, and sensitive information arising in different scenarios within an application program can be detected automatically.

Description

Sensitive information detection method, device, equipment and storage medium

Technical Field

The disclosure relates to the technical field of network data security, and in particular relates to a method, a device, equipment and a storage medium for detecting program sensitive information.

Background

APP database files often contain sensitive information. For example, when storing application data, a developer may fail to set correct file permissions, so that the files are globally readable and can be accessed by other application programs without authorization. The APP database files can therefore be leaked, and a leaked database may contain sensitive information. At present, APP sensitive information detection is mostly performed by manually clicking through function-point pages. Different evaluators differ in experience, inspection standards, and thoroughness, so sensitive information may be missed and subsequently leaked.

Disclosure of Invention

The disclosure provides a sensitive information detection method, apparatus, device, and storage medium. According to a first aspect of the present disclosure, a sensitive information detection method is provided. The method comprises: decompiling a program to be detected to obtain corresponding assembly language code; extracting character strings from the assembly language code to generate a data set; clustering the data set and classifying the clusters formed by the data object points in the data set; and determining whether each cluster is sensitive information.

Further, the decompiling process comprises: if the program to be detected is a reinforced application program, performing technical identification on the reinforced application program, reversing the reinforcement with the reverse tool corresponding to the identified reinforcement method, and then completing the decompilation.

Further, extracting the character strings comprises: extracting preset character strings related to sensitive information from the assembly language code according to a preset character string extraction standard.

Further, clustering the data set comprises: randomly selecting a data object point P from the data set; if the selected data object point P is a core point, finding all data object points that are density-reachable from P to form a cluster; if the selected data object point P is an edge point, selecting another data object point; and repeating the above steps until all data object points have been processed.

Further, classifying the clusters formed by the data object points comprises: dividing all clusters formed in the clustering process into two types, namely standard clusters, which are obtained through the preset character string extraction standard, and other clusters.

Further, determining whether a cluster is sensitive information comprises: determining the sensitive information category of each standard cluster; analyzing the other clusters to determine whether the information corresponding to their character strings is sensitive information; and if so, treating that information as a new sensitive information type and adding the new type to the preset character string extraction standard.

According to a second aspect of the present disclosure, a sensitive information detection apparatus is provided. The apparatus includes: a decompiling module, configured to decompile the program to be detected to obtain corresponding assembly language code; a data set generating module, configured to extract character strings from the assembly language code to generate a data set; a cluster classification module, configured to cluster the data set and classify the clusters formed by the data object points in the data set; and a determining module, configured to determine whether each cluster is sensitive information.

According to a third aspect of the present disclosure, an electronic device is provided. The electronic device includes: a memory having a computer program stored thereon and a processor implementing the method as described above when executing the program.

According to a fourth aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method as described above.

The present disclosure provides a sensitive information detection method in which the program to be detected is decompiled, character strings are extracted according to a preset character string extraction standard, and each character string is used as a data object point for clustering and classification, so that the category and content of sensitive information in the program to be detected are identified efficiently and accurately. In addition, the other clusters formed by clustering are evaluated and the preset character string extraction standard is continuously optimized, so that sensitive information arising in different scenarios within an application program can be detected automatically.

It should be understood that the statements in this section are not intended to identify key or essential features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. The accompanying drawings are included to provide a further understanding of the present disclosure, and are not intended to limit the disclosure thereto, and the same or similar reference numerals will be used to indicate the same or similar elements, where:

FIG. 1 shows a flow diagram of a sensitive information detection method according to an embodiment of the present disclosure;

FIG. 2 shows a block diagram of a sensitive information detection apparatus, according to an embodiment of the present disclosure;

FIG. 3 shows a block diagram of an electronic device for implementing the sensitive information detection method of the embodiment of the present disclosure.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without inventive step, are intended to be within the scope of the present disclosure.

In addition, the term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the objects before and after it are in an "or" relationship.

According to the sensitive information detection method, after the application program is subjected to decompiling processing, the character strings are extracted through the preset character string extraction rule, and the extracted character strings are subjected to clustering and classification processing, so that all sensitive information in the application program is obtained after the processing.

Fig. 1 shows a flow diagram of a sensitive information detection method according to an embodiment of the present disclosure. In some embodiments, the sensitive information detection method includes:

S1: Decompiling the program to be detected to obtain corresponding assembly language code.

Specifically, a program (APP) to be detected is acquired and decompiled to obtain the assembly language code corresponding to it.

The program to be detected is decompiled into corresponding assembly language code using a decompilation tool (e.g., apktool, JD-GUI, or XJad).

In some embodiments, the decompiling process further comprises:

directly decompiling an APP that is not reinforced or only weakly reinforced;

performing technical identification on a reinforced APP (for example, shell detection using ApkScan-PKID), reversing the reinforcement with the reverse tool corresponding to the identified reinforcement method, and then completing the decompilation; if the technical identification is unsuccessful, the APP is discarded.

Reinforcement addresses the security defects and risks of mobile applications. A reinforced mobile application gains protections such as resistance to reverse analysis, secondary packaging, dynamic debugging, process injection, and data tampering; the core code and algorithms of the APP are protected, and cracking, piracy, and secondary packaging become more difficult. However, reinforcement is applied with reinforcement software, and most reinforcement methods can be determined through technical identification.

The present disclosure obtains the decoded resources by decompiling the Android APP, i.e., the apk file. These resources, such as the APP source code and resource files, may contain sensitive information.
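The branch in step S1 between unreinforced and reinforced APPs can be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: `plan_decompilation`, `identify_packer`, and `unpackers` are hypothetical names, an unsuccessful identification is modeled as a reinforcement name with no registered reverse tool, and the returned command assumes apktool's `d` (decode) subcommand with the `-o` and `-f` options.

```python
from typing import Callable, Dict, List, Optional


def plan_decompilation(
    apk_path: str,
    out_dir: str,
    identify_packer: Callable[[str], Optional[str]],
    unpackers: Dict[str, Callable[[str], str]],
) -> Optional[List[str]]:
    """Return the apktool command to run, or None if the APP is discarded.

    identify_packer (hypothetical) returns None for an unreinforced APP,
    otherwise the name of the detected reinforcement method.
    unpackers (hypothetical) maps a reinforcement name to a function that
    reverses it and returns the path of the unpacked apk.
    """
    packer = identify_packer(apk_path)
    if packer is None:
        # Not reinforced: decompile the original file directly.
        target = apk_path
    elif packer in unpackers:
        # Reinforced and identified: reverse the reinforcement first.
        target = unpackers[packer](apk_path)
    else:
        # No matching reverse tool: the APP is discarded.
        return None
    return ["apktool", "d", target, "-o", out_dir, "-f"]
```

The returned list could be passed to `subprocess.run` on a machine where apktool is installed; keeping the planning step free of side effects makes the branching easy to check in isolation.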

S2: Extracting character strings from the assembly language code to generate a data set.

In some embodiments of the present disclosure, the assembly language code obtained by decompiling the application program (APP) to be detected is extracted in units of character strings, yielding a data set composed of character strings. Preset character strings related to sensitive information are extracted from the decompiled assembly language code, together with the context near each character string.

Preset character strings related to sensitive information are extracted from the assembly language code according to a preset character string extraction standard. The preset character string extraction standard comprises extracting: all URLs; all IPs; character strings that may be hash values; character strings that may be transcoded (such as base64 or unicode); possible sensitive words (such as password and pass); and possible accessKey values. The context near each extracted character string is also extracted.
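As an illustration of extraction with nearby context, the following sketch applies a small table of regular expressions to decompiled text. The patterns, the `CRITERIA` table, the `extract` function, and the 20-character context window are all assumptions made for this example; the disclosure names the categories but does not give concrete expressions, and the transcoded-string category is omitted here for brevity.

```python
import re
from typing import Dict, List

# Illustrative patterns only; they are assumptions, not the patent's rules.
CRITERIA = {
    "url": re.compile(r"https?://[^\s\"'<>]+"),
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    # Hex strings with the lengths of common digests (32, 40, 64 characters).
    "hash": re.compile(r"\b(?:[0-9a-fA-F]{32}|[0-9a-fA-F]{40}|[0-9a-fA-F]{64})\b"),
    "sensitive_word": re.compile(r"password|pass", re.IGNORECASE),
    "access_key": re.compile(r"access_?key", re.IGNORECASE),
}


def extract(code: str, window: int = 20) -> List[Dict[str, str]]:
    """Return one record per match: category, matched text, nearby context."""
    records = []
    for category, pattern in CRITERIA.items():
        for m in pattern.finditer(code):
            start = max(0, m.start() - window)
            end = min(len(code), m.end() + window)
            records.append(
                {"category": category, "match": m.group(0), "context": code[start:end]}
            )
    return records


if __name__ == "__main__":
    sample = 'const-string v0, "http://10.0.0.5/api"\nconst-string v1, "password=hunter2"'
    for record in extract(sample):
        print(record["category"], record["match"])
```

Each record keeps its category alongside the match, so later steps can tell which extraction standard produced a data object point.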

In some embodiments, the character strings and their contexts are extracted as follows:

For example, if the sensitive information is an identification number, then when a character string contains 18 consecutive digits, that character string is extracted as a character string related to sensitive information; that is, "contains 18 consecutive digits" is one of the preset character string extraction standards. If the sensitive information is an intranet IP address, then when a character string contains a substring that conforms to the intranet IP address rule (namely, four groups of up to three digits separated by dots, each group having a value between 0 and 255 inclusive, for example 192.168.1.100), the character string is extracted; "contains an IP that conforms to the intranet IP address rule" is likewise one of the preset character string extraction standards.

For example, if the sensitive information is a login key (that is, a string labeled accesskey, secretkey, or key), then when accesskey appears in a character string, the characters labeled with accesskey are extracted together with the nearby context (for example, Tencent Cloud uses a 36-character accesskey and a 32-character secretkey; Alibaba Cloud uses a 24-character accesskey and a 32-character secretkey), and this serves as one of the preset character string extraction standards. If the sensitive information is financial, then when a character string contains price and $, the character string containing price and $ is extracted together with its nearby context (the text after the price string and before a currency symbol such as $ or £), and this serves as one of the preset character string extraction standards. If the sensitive information is an account password, then when username and password, or name, pass, and the like, appear in a character string, the character string containing username is extracted together with its nearby context (the strings containing username, pass, and the like), and this serves as one of the preset character string extraction standards.

It should be noted that the preset character string extraction standards in the sensitive information detection method of the present disclosure are updated according to the requirements of different detection scenarios and are not limited to the above examples; those skilled in the art may modify them according to the actual implementation. For example, the last character of an identification number may also be x, so the corresponding extraction standard may be broadened from 18 consecutive digits to either 18 consecutive digits or 17 consecutive digits followed by the letter x. Likewise, different industries may use different preset character string extraction standards: education may use a student number ID, and medical care may use a medical insurance ID.
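The identification-number and IP-address rules in the examples above can be expressed as two small checks. This is a sketch under assumptions: the function names and regular expressions are hypothetical, and the IP check only enforces the stated format (four dot-separated groups, each between 0 and 255), not membership in a private address range.

```python
import re
from typing import List

# Exactly 18 consecutive digits, not part of a longer digit run.
ID_RUN = re.compile(r"(?<!\d)\d{18}(?!\d)")

# Four dot-separated groups of one to three digits; range is checked below.
IP_CANDIDATE = re.compile(
    r"(?<![\d.])(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})(?![\d.])"
)


def find_id_numbers(text: str) -> List[str]:
    """Return every run of exactly 18 consecutive digits."""
    return ID_RUN.findall(text)


def find_ip_addresses(text: str) -> List[str]:
    """Return dotted quads whose four groups are all between 0 and 255."""
    hits = []
    for m in IP_CANDIDATE.finditer(text):
        if all(0 <= int(group) <= 255 for group in m.groups()):
            hits.append(m.group(0))
    return hits
```

Splitting the IP rule into a loose pattern plus a numeric range check keeps the regular expression readable while still rejecting strings such as 300.1.1.1.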

S3: Clustering the data set and classifying the clusters formed by the data object points in the data set.

In some embodiments, the DBSCAN clustering algorithm is used. A data object point P is selected arbitrarily from the data set. If, for the radius parameter Eps and the minimum-count parameter MinPts, the selected point P is a core point, all data object points that are density-reachable from P are found and form a cluster. If the selected point P is an edge point, another data object point is selected. These steps are repeated until all points in the data set have been processed. All clusters formed in the clustering process are then divided into two types: standard clusters, which are obtained through the preset character string extraction standard, and other clusters.

It should be noted that the clustering algorithm used in the present disclosure is not limited to DBSCAN; a K-nearest-neighbor algorithm, a GMM, or another clustering algorithm may also be used. In the method, the character strings extracted after decompilation, together with their contexts, form the data set, and the density of the sensitive-information-related strings in that data set is not known in advance. DBSCAN can cluster dense data of arbitrary shape and is only slightly affected by outliers, so the clustering result is relatively accurate.

In some embodiments, each character string extracted by the preset character string extraction standard is used as a data object point P, and the radius may be set adaptively as a weight. For example, with "card" as the data object point, the character string following "card" is compared with the number of digits and the content of an identity card, a medical insurance card, a bank card, and so on, and the weight (i.e., the radius) with which it conforms to a type of sensitive information (for example, the identity card) is determined; the more closely the string conforms to that type, the closer it is placed to the corresponding data cluster. The degree of conformity is determined by the number of digits, the string conformity, the context relevance, and the like. An accesskey is handled in the same way: the character string following the accesskey is obtained, and the weight is derived from its length, conformity, and context relevance, optionally supplemented by link verification. The parameters of the clustering algorithm, namely the clustering radius and the number of points for each character string, are determined in this way.
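The DBSCAN procedure described above can be written compactly in plain Python. The following is a minimal sketch of the standard algorithm, not the disclosed implementation: points are visited in index order rather than arbitrarily, and the distance function `dist` is supplied by the caller because the disclosure does not specify how the distance between two character strings is measured.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")
NOISE = -1


def dbscan(
    points: Sequence[T],
    eps: float,
    min_pts: int,
    dist: Callable[[T, T], float],
) -> List[int]:
    """Return one label per point: a cluster id (0, 1, ...) or NOISE (-1)."""
    labels: List[Optional[int]] = [None] * len(points)
    cluster_id = -1

    def region(i: int) -> List[int]:
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            # Not a core point; it may still be claimed later as an edge point.
            labels[i] = NOISE
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # edge point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return [label if label is not None else NOISE for label in labels]
```

For example, `dbscan([10, 11, 12, 50, 51, 52, 90], eps=1, min_pts=2, dist=lambda a, b: abs(a - b))` groups the first three values, groups the next three, and marks 90 as noise. For character strings, an edit distance or a similarity-based distance could be passed as `dist`.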
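One way to picture a degree of conformity built from the number of digits, the string content, and the context is a simple averaged score. The sketch below is purely illustrative: the `conformity` function, its three components, and their equal weighting are assumptions made for this example and are not taken from the disclosure.

```python
from typing import List


def conformity(value: str, expected_len: int, context: str, keywords: List[str]) -> float:
    """Return a score in [0, 1]; higher means a closer match to the category.

    Hypothetical scoring: equal-weight average of a length score, a
    digit-ratio score, and a context-keyword score.
    """
    if not value or expected_len <= 0:
        return 0.0
    # 1.0 when the length matches exactly, falling linearly to 0.0.
    length_score = 1.0 - min(abs(len(value) - expected_len) / expected_len, 1.0)
    # Fraction of characters that are digits.
    digit_score = sum(ch.isdigit() for ch in value) / len(value)
    # 1.0 if any category keyword appears in the surrounding context.
    lowered = context.lower()
    context_score = 1.0 if any(k in lowered for k in keywords) else 0.0
    return (length_score + digit_score + context_score) / 3.0


def distance(value: str, expected_len: int, context: str, keywords: List[str]) -> float:
    """Turn conformity into a distance: closer conformity, smaller distance."""
    return 1.0 - conformity(value, expected_len, context, keywords)
```

A score of this kind could feed the clustering step as a distance to a category template, so that strings conforming closely to a type land near that type's cluster.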

S4: Determining whether each cluster is sensitive information.

Specifically, all formed clusters are classified and the category of each cluster is determined. The other clusters are analyzed to determine whether the information corresponding to their character strings is sensitive information; if so, that information is treated as a new sensitive information type and is added to the preset character string extraction standard.

In some embodiments, the standard clusters are obtained according to the preset character string extraction standard. Each standard cluster has a sensitive information category, and the sensitive information categories correspond one-to-one with the preset extraction categories in the preset character string extraction standard. Taking the URL as an example, a URL is a uniform resource locator, for example the Baidu URL https://www.baidu.com/, and URLs are one preset sensitive information type. Taking the IP as an example, an IP address such as 192.168.1.100 is an intranet IP address, and IP addresses are one sensitive information type. Similarly, transcoded strings (base64, unicode), accessKey values, hash values, and sensitive words (password, pass, and the like) are each one preset sensitive information type. That is, in the initial state six preset sensitive information types are set, and the standard clusters produced by clustering accordingly fall into six sensitive information categories.

In some embodiments, the other clusters formed by the clustering algorithm are reviewed manually and divided into standard clusters to be updated and invalid clusters. A standard cluster to be updated is a cluster that manual review of the corresponding character strings finds to be related to sensitive information; from it, an extraction standard for that sensitive information is derived. This extraction standard is added to the preset character string extraction standard to update it. The preset extraction standard can thus be optimized iteratively, so that all sensitive information is filtered out when a program to be detected is examined with the sensitive information detection method.
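The split into standard clusters and other clusters, followed by an update of the extraction standard, can be sketched as below. The record layout (a `category` field that is `None` when no preset standard matched), the label `-1` for noise, and the function names are assumptions for this example; in the disclosure the decision to promote another cluster is made by manual review, which is represented here only by the explicit call to `promote`.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]


def classify_clusters(
    labels: List[int], records: List[Record]
) -> Tuple[Dict[int, List[str]], Dict[int, List[Record]]]:
    """Split clusters into standard clusters and other clusters.

    A cluster is treated as standard if at least one member carries a preset
    category; noise points (label -1) are ignored.
    """
    clusters: Dict[int, List[Record]] = defaultdict(list)
    for label, record in zip(labels, records):
        if label != -1:
            clusters[label].append(record)

    standard: Dict[int, List[str]] = {}
    other: Dict[int, List[Record]] = {}
    for label, members in clusters.items():
        categories = {m["category"] for m in members if m["category"] is not None}
        if categories:
            standard[label] = sorted(categories)
        else:
            other[label] = members
    return standard, other


def promote(criteria: Dict[str, str], name: str, pattern: str) -> Dict[str, str]:
    """Return a copy of the extraction standard with one new category added."""
    updated = dict(criteria)
    updated[name] = pattern
    return updated
```

Returning a new dictionary from `promote` leaves the previous standard intact, which makes it straightforward to compare detection results before and after an update.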

In some embodiments, the sensitive information includes user sensitive asset, IP, password, identification card, user data, and the like.

According to the embodiment of the disclosure, the following technical effects are achieved:

(1) The program to be detected is decompiled, character strings are extracted according to a preset character string extraction standard, each character string is used as a data object point for clustering and classification, and the preset character string extraction standard is continuously optimized by evaluating the other clusters formed by clustering, so that the type and content of sensitive information in the program to be detected can be identified efficiently and accurately.

(2) Different preset character string extraction standards are selected for different application scenarios, so that sensitive information arising in different scenarios within an application program can be detected automatically.

It is noted that while for simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that acts and modules referred to are not necessarily required by the disclosure.

The above is a description of embodiments of the method, and the embodiments of the apparatus are further described below.

Fig. 2 shows a block diagram of a sensitive information detection apparatus 200 according to an embodiment of the present disclosure.

The apparatus 200 comprises:

the decompiling module 210 is configured to decompile the program to be detected to obtain a corresponding assembly language code;

a data set generating module 220, configured to perform character string extraction on the assembly language code to generate a data set;

a cluster classification module 230, configured to cluster the data sets and classify clusters formed by data object points in the data sets;

a determining module 240, configured to determine whether the cluster is sensitive information.
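The four modules of apparatus 200 can be pictured as a simple pipeline in which each module is a pluggable callable. This is a structural sketch only: the class name, field names, and type signatures are hypothetical, and the stand-in lambdas in the usage below carry no real detection logic.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class SensitiveInfoDetector:
    """Chains the four modules; each field stands for one module of FIG. 2."""

    decompile: Callable[[str], str]                        # module 210
    build_dataset: Callable[[str], List[Any]]              # module 220
    cluster_and_classify: Callable[[List[Any]], Dict[Any, Any]]  # module 230
    judge: Callable[[Dict[Any, Any]], Dict[Any, bool]]     # module 240

    def run(self, program_path: str) -> Dict[Any, bool]:
        code = self.decompile(program_path)
        dataset = self.build_dataset(code)
        clusters = self.cluster_and_classify(dataset)
        return self.judge(clusters)


if __name__ == "__main__":
    detector = SensitiveInfoDetector(
        decompile=lambda path: "password url",
        build_dataset=lambda code: code.split(),
        cluster_and_classify=lambda ds: {i: [s] for i, s in enumerate(ds)},
        judge=lambda cl: {k: v[0] == "password" for k, v in cl.items()},
    )
    print(detector.run("a.apk"))
```

Keeping the modules as injected callables mirrors the apparatus description, where each module's working process refers back to the corresponding method step.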

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the described module may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.

In the technical solution of the present disclosure, the acquisition, storage, and use of users' personal information all comply with relevant laws and regulations and do not violate public order and good morals.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

FIG. 3 shows a schematic block diagram of an electronic device 300 that may be used to implement embodiments of the present disclosure.

Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.

The device 300 comprises a computing unit 301 which may perform various suitable actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 302 or a computer program loaded from a storage unit 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data required for the operation of the device 300 can also be stored. The computing unit 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.

Various components in device 300 are connected to I/O interface 305, including: an input unit 306 such as a keyboard, a mouse, or the like; an output unit 307 such as various types of displays, speakers, and the like; a storage unit 308 such as a magnetic disk, optical disk, or the like; and a communication unit 309 such as a network card, modem, wireless communication transceiver, etc. The communication unit 309 allows the device 300 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The computing unit 301 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 301 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 301 performs the respective methods and processes described above, such as the sensitive information detection method. For example, in some embodiments, the sensitive information detection method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 308. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 300 via ROM 302 and/or communication unit 309. When the computer program is loaded into RAM 303 and executed by the computing unit 301, one or more steps of the sensitive information detection method described above may be performed. Alternatively, in other embodiments, the computing unit 301 may be configured to perform the sensitive information detection method by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/acts specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.

It should be understood that the flows shown above may be used in various forms, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, which is not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A method for sensitive information detection, the method comprising:

decompiling the program to be detected to obtain a corresponding assembly language code;

extracting character strings from the assembly language code to generate a data set;

clustering the data set, and classifying clusters formed by data object points in the data set;

determining whether the cluster is sensitive information.

2. The method of claim 1, wherein the decompilation process comprises:

if the program to be detected is a reinforced application program, identifying the reinforcement technique used by the reinforced application program, performing reverse processing with a reverse tool corresponding to that reinforcement method, and then completing the decompilation processing.

3. The method of claim 1, wherein extracting the character strings comprises:

extracting, from the assembly language code, preset character strings related to sensitive information according to a preset character string extraction standard.
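
The extraction step of claim 3 can be illustrated with a minimal sketch. It assumes that the preset character string extraction standard is expressed as a table of regular expressions keyed by sensitive information category; the disclosure does not fix this representation, and the names `PRESET_PATTERNS` and `extract_preset_strings`, the three patterns, and the sample input are all hypothetical.

```python
import re

# Hypothetical "preset character string extraction standard": category -> regex.
# These patterns are illustrative assumptions, not taken from the disclosure.
PRESET_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "phone_cn": re.compile(r"\b1[3-9]\d{9}\b"),
}


def extract_preset_strings(assembly_code: str) -> list[tuple[str, str]]:
    """Return (category, matched string) pairs found in the decompiled code."""
    data_set = []
    for line in assembly_code.splitlines():
        for category, pattern in PRESET_PATTERNS.items():
            for match in pattern.findall(line):
                data_set.append((category, match))
    return data_set


# Made-up decompiler output used only to exercise the function.
sample = 'const-string v0, "admin@example.com"\nconst-string v1, "192.168.1.10"\n'
result = extract_preset_strings(sample)
```

Keeping the standard as data rather than code would allow new categories to be appended later without changing the extraction routine.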

4. The method of claim 3, wherein clustering the data set comprises:

randomly selecting a data object point P from the data set;

if the selected data object point P is a core point, finding all data object points that are density-reachable from P to form a cluster;

if the selected data object point P is an edge point, selecting another data object point; and repeating the above steps until all data object points have been processed.
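
The steps of claim 4 correspond to density-based clustering in the style of DBSCAN. The following is a minimal sketch under stated assumptions: each data object point is a numeric tuple, Euclidean distance is used, and the neighborhood radius `eps` and the threshold `min_pts` are hypothetical parameters that the claim does not name. Points are visited in index order rather than at random, which does not change which points end up in a cluster as core points.

```python
from math import dist

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks unclustered points."""
    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already processed
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            # Not a core point: move on; it may later be claimed as an edge point.
            labels[i] = NOISE
            continue
        # Core point: grow a new cluster from everything density-reachable from it.
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # edge point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point; keep expanding
    return labels


# Made-up two-dimensional points: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

For character strings, the points would first have to be mapped to feature vectors or compared with a string distance; that mapping is outside this sketch.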

5. The method of claim 4, wherein classifying the clusters formed by the data object points comprises:

dividing all clusters formed in the clustering process into two types: standard clusters obtained through the preset character string extraction standard, and other clusters.

6. The method of claim 5, wherein the determining whether the cluster is sensitive information comprises:

judging the sensitive information category of the standard cluster;

analyzing the other clusters and judging whether the character strings corresponding to the other clusters are sensitive information; and if so, determining whether those character strings belong to a new sensitive information category, and adding them to the preset character string extraction standard.
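
Claims 5 and 6 can be sketched together as follows. This is an illustration under assumptions, not the claimed implementation: a cluster is treated as a standard cluster when every string in it fully matches one preset pattern, and the judgment on the other clusters is delegated to a caller-supplied `judge` callback, since the claims do not specify how that judgment is made. All names, patterns, and sample strings are hypothetical.

```python
import re


def classify_clusters(clusters, preset_patterns):
    """Split clusters into standard clusters (by category) and other clusters."""
    standard, other = {}, []
    for strings in clusters:
        # Assumption: a cluster is "standard" if all its strings match one preset pattern.
        category = next(
            (name for name, pat in preset_patterns.items()
             if strings and all(pat.fullmatch(s) for s in strings)),
            None,
        )
        if category is not None:
            standard.setdefault(category, []).extend(strings)
        else:
            other.append(strings)
    return standard, other


def update_standard(other, preset_patterns, judge):
    """Add newly identified sensitive categories to the preset standard.

    judge(strings) is a hypothetical callback returning (category, regex)
    when the cluster is judged sensitive, or None otherwise.
    """
    for strings in other:
        verdict = judge(strings)
        if verdict is not None:
            category, regex = verdict
            # Only categories not yet in the standard are added.
            preset_patterns.setdefault(category, re.compile(regex))
    return preset_patterns


# Made-up preset standard and clusters used only to exercise the functions.
patterns = {"ipv4": re.compile(r"(?:\d{1,3}\.){3}\d{1,3}")}
clusters = [["10.0.0.1", "192.168.1.10"], ["sk_live_abc123", "sk_live_def456"]]
standard, other = classify_clusters(clusters, patterns)


def judge(strings):
    # Toy rule standing in for the unspecified analysis of other clusters.
    if all(s.startswith("sk_live_") for s in strings):
        return ("api_key", r"sk_live_\w+")
    return None


patterns = update_standard(other, patterns, judge)
```

Feeding the new category back into the pattern table means that a later run would treat similar strings as a standard cluster.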

7. An apparatus for sensitive information detection, the apparatus comprising:

a decompiling module configured to decompile the program to be detected to obtain a corresponding assembly language code;

a data set generating module configured to extract character strings from the assembly language code to generate a data set;

a cluster classification module configured to cluster the data set and classify clusters formed by data object points in the data set;

and a judging module configured to determine whether the cluster is sensitive information.

8. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.

9. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method according to any one of claims 1-6.