CN113032811A

CN113032811A - Database sensitive information checking method

Info

Publication number: CN113032811A
Application number: CN202110395454.3A
Authority: CN
Inventors: 门嘉平
Original assignee: Beijing Guolian Yian Information Technology Co ltd
Current assignee: Beijing Guolian Yian Information Technology Co ltd
Priority date: 2021-04-13
Filing date: 2021-04-13
Publication date: 2021-06-25

Abstract

The invention discloses a method for checking sensitive information in a database, comprising the following steps: setting inspection keywords, collecting database information, inspecting for confidential information, automatically eliminating the confidential information, and outputting the inspection result. The method checks a database for confidential information efficiently and quickly, with high inspection precision and a low false alarm rate.

Description

Database sensitive information checking method

Technical Field

The invention relates to the field of information analysis and processing, in particular to a database sensitive information checking method.

Background

With the rise of big data and cloud computing, more and more data is consolidated in data center databases for centralized storage, where it is spread across a very large number of database tables. This gives rise to new characteristics such as massive volume and distributed storage:

1) The storage capacity of a data center database keeps growing and generally exceeds 50 TB;

2) Data center databases store more and more types of data, commonly including structured data, unstructured data, pictures, text, audio and video.

The traditional inspection methods used by security personnel cannot cope with these new characteristics. New database inspection tools must therefore be developed promptly, in line with these trends, to meet the serious security challenges brought by the centralization of data assets and their growth to big-data scale.

Disclosure of Invention

In view of the above, the present invention provides a method for checking sensitive information of a database, which can efficiently and quickly check confidential information in the database, and has high checking accuracy and low false alarm rate.

In order to solve the problems, the technical scheme adopted by the invention is as follows:

A method for checking sensitive information in a database comprises the following steps: setting inspection keywords, collecting database information, inspecting for confidential information, and automatically eliminating the confidential information.

In a further technical solution, the method also comprises outputting the inspection result.

In a further technical solution, setting the inspection keywords specifically comprises: automatically generating inspection keywords by intelligent semantic sample analysis, which identifies the content of current-period documents in combination with machine learning; and intelligently expanding and refining the inspection keywords by intelligent semantic analysis. During continuous inspection, the machine learning technique refines what it has learned and supplements the sensitive keyword library.

In a further technical solution, intelligently expanding and refining the inspection keywords by intelligent semantic analysis comprises: performing an intelligent expanded search on the inspection keywords of different industries based on synonym and near-synonym attribute algorithms; and analyzing the confidentiality-related semantics of different industries and, using an intelligent semantic combination technique, combining the inspection keywords with a plurality of element words to refine them. By identifying the sensitive keywords of different industries and applying machine learning, an automatically updated sensitive word library is formed.
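The synonym-based expansion described above can be illustrated with a minimal sketch. The patent does not disclose its synonym or near-synonym algorithm, so the lookup table `SYNONYMS` and the function `expand_keywords` below are hypothetical stand-ins: the sketch simply follows a hand-written synonym table until no new terms appear.

```python
from typing import Dict, Iterable, Set

# Hypothetical synonym/near-synonym table. A real system would learn or load
# such a table per industry; the entries here are illustrative only.
SYNONYMS: Dict[str, Set[str]] = {
    "confidential": {"classified", "secret"},
    "secret": {"top secret"},
}


def expand_keywords(seeds: Iterable[str], synonyms: Dict[str, Set[str]]) -> Set[str]:
    """Return the seed keywords plus every term reachable through the table."""
    expanded: Set[str] = set()
    pending = [seed.lower() for seed in seeds]
    while pending:
        term = pending.pop()
        if term in expanded:
            continue  # already processed; this also guards against cycles
        expanded.add(term)
        pending.extend(synonyms.get(term, ()))
    return expanded


KEYWORDS = expand_keywords(["confidential"], SYNONYMS)
```

Following the table transitively means that a single seed keyword can pull in terms that are only indirectly related to it, which is one plausible reading of "expanded search".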

In a further technical solution, collecting the database information specifically comprises: performing distributed data acquisition using the big data tool Sqoop. Distributed acquisition provides a foundation for sensitive word recognition across multiple databases and in high-concurrency scenarios; it keeps sensitive word inspection efficient under high concurrency and prevents the inspection engine from running short of performance or producing incomplete results.

In a further technical solution, inspecting for confidential information specifically comprises: performing distributed inspection of the acquired information using the big data MapReduce technique, comparing the acquired information item by item with the contents of the sensitive word sample library, marking the information that matches, and raising an alarm.

In a further technical solution, the inspection supports checking all text documents in the collected information for confidential information; it also supports checking all picture files in the collected information, automatically retrieving and extracting the characters embedded in an image and raising an alarm.

In a further technical solution, automatically eliminating the confidential information specifically comprises encryption processing: the detected confidential information is encrypted with a cryptographic algorithm, which converts it into a special ciphertext field that cannot be meaningfully recognized and prevents the sensitive information from leaking.

In a further technical solution, automatically eliminating the confidential information specifically comprises format-preserving privacy protection: the detected confidential information is processed so that its format is preserved, protecting the sensitive data without damaging the format of the original information. For example, the sensitive word "Zhang-bright" can be changed to "Zhang-three", a meaningless value in the same format.

In a further technical solution, automatically eliminating the confidential information specifically comprises replacement processing: the detected confidential information is replaced with special characters so that it can no longer be identified, which keeps it confidential.

The invention has the beneficial effects that:

(1) Distributed data acquisition is implemented with the big data tool Sqoop, and a distributed data capture technique makes acquisition efficient.

(2) Distributed inspection of the acquired information is implemented with the big data MapReduce technique, making inspection 200 to 300 times more efficient than traditional inspection.

(3) An in-memory database stores the intermediate result data. Once the inspection is finished or the system is withdrawn from the inspection site, the system is automatically powered off or restarted and the user data is automatically destroyed, which meets the security requirements that the inspection leave no trace and take no data away.

(4) All text documents (Word, PDF and the like) in the collected information can be checked for confidential information. All picture files in the collected information can also be checked: the characters embedded in an image are automatically retrieved and extracted, and an alarm is raised. Many types of content can be inspected, the recognition rate is high and positioning is accurate: recognition precision exceeds 80%.

(5) The method has high inspection precision, can accurately locate confidential information or data within the content of a massive database, and has a low false alarm rate.
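Effect (3), keeping intermediate results only in memory so that nothing survives the inspection, can be illustrated with a small sketch. The patent does not name the in-memory database it uses; SQLite's `:memory:` mode from the Python standard library is an assumption made here for illustration, and the table layout and function names are hypothetical.

```python
import sqlite3


def open_scratch_store() -> sqlite3.Connection:
    """Open a database that lives only in RAM; nothing is written to disk."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hits (keyword TEXT NOT NULL, location TEXT NOT NULL)")
    return conn


def record_hit(conn: sqlite3.Connection, keyword: str, location: str) -> None:
    """Store one intermediate inspection result."""
    conn.execute("INSERT INTO hits (keyword, location) VALUES (?, ?)", (keyword, location))


def hit_count(conn: sqlite3.Connection) -> int:
    """Return how many intermediate results are currently held."""
    return conn.execute("SELECT COUNT(*) FROM hits").fetchone()[0]


store = open_scratch_store()
record_hit(store, "secret", "hr.staff:1")  # hypothetical keyword and location
HITS_BEFORE_CLOSE = hit_count(store)
store.close()  # closing the connection discards the whole in-memory database
```

Because the store exists only for the lifetime of the connection, closing it, or losing power, leaves no intermediate data behind, which matches the "no trace left" requirement in spirit.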

Drawings

FIG. 1 is a schematic diagram of the system of the present invention;

FIG. 2 is a schematic flow chart of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Fig. 1 shows a schematic diagram of the system of the present invention.

A method for checking sensitive information in a database, as shown in Fig. 2, comprises the following steps: setting inspection keywords, collecting database information, inspecting for confidential information, automatically eliminating the confidential information, and outputting the inspection result. After the inspection is completed, a detailed inspection report is generated to guide the subsequent remediation work.

Setting the inspection keywords specifically comprises: automatically generating inspection keywords through learning, using intelligent semantic sample analysis, and in particular building a keyword library from sample documents provided by the user; and intelligently expanding and refining the inspection keywords by intelligent semantic analysis. The intelligent expansion and refinement has two parts. First, the inspection keywords of different industries are expanded by an intelligent search based on synonym and near-synonym attribute algorithms, which greatly improves the retrieval hit rate. Second, the confidentiality-related semantics of different industries are analyzed and, using an intelligent semantic combination technique, the inspection keywords are combined with a plurality of element words to refine them. For example, flagging a file as confidential merely because it contains the word "equipment" has a high probability of error, whereas "equipment" accompanied by a textual description and digits or English letters is very likely to describe confidential information. As another example, a file containing the word "exercise" is not necessarily confidential, but "exercise" combined with a numbered unit designation, such as "15" followed by army, brigade, regiment or battalion, is very likely to indicate confidential information. Adopting this technique can improve accuracy by 200%.
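The keyword-plus-element-word combination can be sketched as a rule that fires only when both parts appear close together. This is a minimal illustration under stated assumptions, not the patent's semantic combination technique: the function `build_combination_rule`, the character window and the `MODEL_TOKEN` pattern (letters followed by digits, as in a model number) are all hypothetical.

```python
import re
from typing import List, Pattern


def build_combination_rule(keyword: str, element_pattern: str, window: int = 10) -> Pattern[str]:
    """Match `keyword` only when an element pattern follows within `window` characters."""
    return re.compile(
        re.escape(keyword) + r".{0,%d}?" % window + "(?:" + element_pattern + ")",
        re.IGNORECASE,
    )


# Assumed element pattern: a model-style token mixing letters and digits, e.g. "X-200".
MODEL_TOKEN = r"\b[A-Za-z]+-?\d+\b"
EQUIPMENT_RULE = build_combination_rule("equipment", MODEL_TOKEN)


def is_suspect(text: str, rules: List[Pattern[str]]) -> bool:
    """Return True when any combination rule matches the text."""
    return any(rule.search(text) for rule in rules)
```

With this rule, a sentence that mentions "equipment" next to a model-style token is flagged, while an ordinary sentence about office equipment is not, which mirrors the false-positive reduction the paragraph describes.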

Collecting the database information specifically comprises: performing distributed data acquisition using the big data tool Sqoop, with a distributed data capture technique making acquisition efficient. Distributed acquisition provides a foundation for sensitive word recognition across multiple databases and in high-concurrency scenarios; it keeps sensitive word inspection efficient under high concurrency and prevents the inspection engine from running short of performance or producing incomplete results.
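As a rough illustration of Sqoop-based acquisition, the sketch below builds the argument list for one `sqoop import` job per table. The options used (`--connect`, `--username`, `--password-file`, `--table`, `--target-dir`, `--num-mappers`) are Sqoop 1 import arguments; the JDBC URL, user name, paths and table name are hypothetical, and the patent does not state which options its system uses.

```python
from typing import List


def sqoop_import_command(jdbc_url: str, username: str, password_file: str,
                         table: str, target_dir: str, mappers: int = 4) -> List[str]:
    """Build the argv list for one `sqoop import` job.

    `mappers` sets how many parallel map tasks Sqoop uses to copy the table.
    """
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),
    ]


# All values below are hypothetical placeholders.
CMD = sqoop_import_command(
    jdbc_url="jdbc:mysql://db.example.internal:3306/hr",
    username="auditor",
    password_file="/user/auditor/.db-password",
    table="staff",
    target_dir="/inspection/raw/hr/staff",
    mappers=8,
)
```

On a host where Sqoop is installed, such a list could be passed to `subprocess.run(CMD, check=True)`; issuing one job per table is one simple way to spread acquisition across many tables.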

Inspecting for confidential information specifically comprises: performing distributed inspection of the acquired information using the big data MapReduce technique, comparing the acquired information item by item with the contents of the sensitive word sample library, marking the information that matches, and raising an alarm. Inspection is 200 to 300 times more efficient than traditional inspection. In this embodiment, an in-memory database stores the intermediate result data. Once the inspection is finished or the system is withdrawn from the inspection site, the system is automatically powered off or restarted and the user data is automatically destroyed, which meets the security requirements that the inspection leave no trace and take no data away. All text documents (Word, PDF and the like) in the collected information can be checked for confidential information. All picture files in the collected information can also be checked: the characters embedded in an image are automatically retrieved and extracted, and an alarm is raised. According to the configured inspection keywords, the method processes images in various formats (including png, jpg, jpeg, bmp and tif) obtained by mobile phone photography, scanning, photocopying, screen capture and similar means, in order to check whether the characters embedded in the images violate the rules or disclose confidential information. The system has a high recognition rate and accurate positioning: recognition precision exceeds 80%.
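The item-by-item comparison can be pictured as a map step that emits a pair for every keyword found in a record and a reduce step that groups the pairs by keyword. The sketch below runs both steps locally in plain Python to show the data flow only; it is not Hadoop code, and the record identifiers, function names and case-insensitive substring matching are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (record_id, text); the id format is hypothetical


def map_record(record: Record, keywords: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Map step: emit (keyword, record_id) for each keyword found in the text."""
    record_id, text = record
    lowered = text.lower()
    for keyword in keywords:
        if keyword.lower() in lowered:
            yield keyword, record_id


def reduce_hits(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Reduce step: group the ids of matching records under each keyword."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for keyword, record_id in pairs:
        grouped[keyword].append(record_id)
    return dict(grouped)


def check(records: Iterable[Record], keywords: Iterable[str]) -> Dict[str, List[str]]:
    """Run the map step over every record, then reduce the emitted pairs."""
    keyword_list = list(keywords)
    pairs = (pair for record in records for pair in map_record(record, keyword_list))
    return reduce_hits(pairs)
```

Each record is examined independently in the map step, which is what would allow a real MapReduce job to spread the comparison across many nodes; the grouped output gives the marked records from which an alarm could be raised.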

Automatically eliminating the confidential information specifically comprises: three modes are provided for handling the detected confidential information, preventing its further leakage to the greatest possible extent. The first is encryption processing: the detected confidential information is encrypted with a cryptographic algorithm, which converts it into a special ciphertext field that cannot be meaningfully recognized and prevents the sensitive information from leaking. The second is format-preserving privacy protection: the detected confidential information is processed so that its format is preserved, protecting the sensitive data without damaging the format of the original information. For example, the sensitive word "Zhang-bright" can be changed to "Zhang-three", a meaningless value in the same format. The third is replacement processing: the detected confidential information is replaced with special characters so that it can no longer be identified, which keeps it confidential.
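Two of the three modes lend themselves to a short standard-library sketch: replacement with a special character, and a format-preserving substitution. The encryption mode is left out because the Python standard library has no general-purpose cipher and the patent does not name an algorithm. The function names and key are hypothetical, and the substitution shown is a keyed, one-way character swap for illustration, not a standardized format-preserving encryption scheme.

```python
import hashlib
import hmac
import string


def mask(text: str, sensitive: str, fill: str = "*") -> str:
    """Replacement processing: overwrite each occurrence with a fill character."""
    return text.replace(sensitive, fill * len(sensitive))


def format_preserving_substitute(value: str, key: bytes) -> str:
    """Swap letters for letters and digits for digits, keeping everything else.

    Length, letter case and punctuation are preserved. The mapping is derived
    from an HMAC of the value, so it is deterministic for a given key but
    cannot be reversed to recover the original value.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    for index, char in enumerate(value):
        byte = digest[index % len(digest)]
        if char.isascii() and char.isdigit():
            out.append(string.digits[byte % 10])
        elif char.isascii() and char.isupper():
            out.append(string.ascii_uppercase[byte % 26])
        elif char.isascii() and char.islower():
            out.append(string.ascii_lowercase[byte % 26])
        else:
            out.append(char)  # punctuation, spaces and non-ASCII stay as they are
    return "".join(out)
```

Masking is the simplest option but destroys the field's shape, whereas the substitution keeps a value such as a serial number looking like a serial number, so downstream systems that validate the format would still accept it.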

Although the invention has been described herein with reference to a number of illustrative embodiments thereof, it should be understood that numerous other modifications and embodiments can be devised by those skilled in the art that will fall within the spirit and scope of the principles of this disclosure. More specifically, various variations and modifications are possible in the component parts and/or arrangements of the subject combination arrangement within the scope of the disclosure, the drawings and the appended claims. In addition to variations and modifications in the component parts and/or arrangements, other uses will also be apparent to those skilled in the art.

Claims

1. A method for checking sensitive information in a database, characterized by comprising the following steps: setting inspection keywords, collecting database information, inspecting for confidential information, and automatically eliminating the confidential information.

2. The database sensitive information checking method according to claim 1, wherein the method further comprises outputting the inspection result.

3. The database sensitive information checking method according to claim 1, wherein setting the inspection keywords specifically comprises: automatically generating inspection keywords by intelligent semantic sample analysis, which identifies the content of current-period documents in combination with machine learning; and intelligently expanding and refining the inspection keywords by intelligent semantic analysis.

4. The database sensitive information checking method according to claim 3, wherein intelligently expanding and refining the inspection keywords by intelligent semantic analysis comprises: performing an intelligent expanded search on the inspection keywords of different industries based on synonym and near-synonym attribute algorithms; and analyzing the confidentiality-related semantics of different industries and, using an intelligent semantic combination technique, combining the inspection keywords with a plurality of element words to refine the inspection keywords.

5. The database sensitive information checking method according to claim 1, wherein collecting the database information specifically comprises: performing distributed data acquisition using the big data tool Sqoop.

6. The database sensitive information checking method according to claim 1, wherein inspecting for confidential information specifically comprises: performing distributed inspection of the acquired information using the big data MapReduce technique, comparing the acquired information item by item with the contents of the sensitive word sample library, marking the information that matches, and raising an alarm.

7. The database sensitive information checking method according to claim 6, wherein the inspection supports checking all text documents in the collected information for confidential information, and supports checking all picture files in the collected information, automatically retrieving and extracting the characters embedded in an image and raising an alarm.

8. The database sensitive information checking method according to claim 1, wherein automatically eliminating the confidential information specifically comprises encryption processing, in which the detected confidential information is encrypted with a cryptographic algorithm.

9. The database sensitive information checking method according to claim 1, wherein automatically eliminating the confidential information specifically comprises format-preserving privacy protection, in which the detected confidential information is processed so that its format is preserved, protecting the sensitive data without damaging the format of the original information.

10. The database sensitive information checking method according to claim 1, wherein automatically eliminating the confidential information specifically comprises replacement processing, in which the detected confidential information is replaced with special characters.