CN113032811A - Database sensitive information checking method - Google Patents

Database sensitive information checking method

Publication number
CN113032811A
Authority
CN
China
Prior art keywords
information
database
inspection
adopting
checking method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110395454.3A
Other languages
Chinese (zh)
Inventor
门嘉平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Guolian Yian Information Technology Co ltd
Original Assignee
Beijing Guolian Yian Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Guolian Yian Information Technology Co ltd
Priority to CN202110395454.3A
Publication of CN113032811A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Fuzzy Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Storage Device Security (AREA)

Abstract

The invention discloses a method for checking a database for sensitive information, comprising the following steps: setting inspection keywords, collecting database information, inspecting for confidential information, automatically eliminating the confidential information, and outputting the inspection results. The method checks a database for confidential information efficiently and quickly, with high inspection precision and a low false alarm rate.

Description

Database sensitive information checking method
Technical Field
The invention relates to the field of information analysis and processing, in particular to a database sensitive information checking method.
Background
With the rise of big data and cloud computing, more and more data is stored centrally in data center databases, where it is spread across a very large number of database tables. This gives the data new characteristics: it is massive in volume and distributed. In particular:
1) The storage capacity of a data center database keeps growing and generally exceeds 50 TB;
2) Data center databases store more and more types of data, commonly including structured data, unstructured data, pictures, text, and audio and video data.
Given these new characteristics, traditional inspection means are no longer sufficient for security workers. New database inspection tools must therefore be developed in step with these trends and characteristics to meet the serious security challenges brought by the centralization and growing scale of data assets.
Disclosure of Invention
In view of the above, the present invention provides a method for checking sensitive information of a database, which can efficiently and quickly check confidential information in the database, and has high checking accuracy and low false alarm rate.
To solve the above problems, the invention adopts the following technical solution:
A method for checking a database for sensitive information comprises the following steps: setting inspection keywords, collecting database information, inspecting for confidential information, and automatically eliminating the confidential information.
In a further technical solution, the method also comprises outputting the inspection results.
In a further technical solution, setting the inspection keywords specifically comprises: automatically generating inspection keywords by intelligent semantic sample analysis, which identifies the content of the files of the current period in combination with machine learning; and intelligently expanding and refining the inspection keywords by semantic analysis. As inspections continue, machine learning refines what has been learned and supplements the sensitive keyword library.
In a further technical solution, intelligently expanding and refining the inspection keywords by semantic analysis comprises: expanding the search for the inspection keywords of different industries using synonym and near-synonym algorithms; and analyzing the semantics of confidential information in different industries and using an intelligent semantic combination technique to combine the inspection keywords with several element words, thereby refining the inspection keywords. By identifying the sensitive keywords of different industries and applying machine learning, an automatically updated sensitive word library is formed.
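The synonym-based expansion step can be illustrated with a minimal Python sketch. The patent does not disclose its synonym algorithm, so the lookup table `SYNONYMS` and the function `expand_keywords` below are hypothetical stand-ins: a hand-written dictionary takes the place of whatever learned or curated per-industry synonym source a real system would use.

```python
from typing import Dict, Iterable, Set

# Hypothetical synonym table (assumption): a real system would learn or
# curate one such table per industry.
SYNONYMS: Dict[str, Set[str]] = {
    "confidential": {"secret", "classified"},
    "blueprint": {"design drawing", "schematic"},
}


def expand_keywords(seeds: Iterable[str], synonyms: Dict[str, Set[str]]) -> Set[str]:
    """Return the lower-cased seed keywords plus every synonym listed for them."""
    expanded: Set[str] = set()
    for seed in seeds:
        key = seed.lower()
        expanded.add(key)
        # Seeds without an entry simply contribute themselves.
        expanded |= synonyms.get(key, set())
    return expanded
```

Calling `expand_keywords(["Confidential", "budget"], SYNONYMS)` yields the two seeds plus "secret" and "classified"; the expanded set is what a later matching step would search for.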
In a further technical solution, collecting the database information specifically comprises: performing distributed data acquisition with the big data tool Sqoop. Distributed acquisition provides the foundation for recognizing sensitive words across multiple databases and in high-concurrency scenarios. It keeps sensitive word inspection efficient under high concurrency and prevents the inspection engine from running short of capacity or leaving the inspection incomplete.
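As a rough illustration of what Sqoop-based acquisition might look like, the Python sketch below only assembles the argument list for a `sqoop import` invocation; it does not run anything. The function name `build_sqoop_import` and all connection values in the usage note are assumptions made for the example. The flags themselves (`--connect`, `--username`, `--password-file`, `--table`, `--target-dir`, `--num-mappers`, `--split-by`) are standard Sqoop 1 import arguments.

```python
from typing import List, Optional


def build_sqoop_import(jdbc_url: str, username: str, password_file: str,
                       table: str, target_dir: str, num_mappers: int = 4,
                       split_by: Optional[str] = None) -> List[str]:
    """Assemble (but do not run) a `sqoop import` command for one table."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    cmd = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,    # keeps the password off the command line
        "--table", table,
        "--target-dir", target_dir,          # HDFS directory that receives the rows
        "--num-mappers", str(num_mappers),   # degree of parallelism of the import
    ]
    if split_by is not None:
        # Column used to partition the table among the parallel mappers.
        cmd += ["--split-by", split_by]
    return cmd
```

The resulting list could be passed to `subprocess.run` on a host where Sqoop is installed. Parallelism comes from `--num-mappers`; when a table has no primary key, Sqoop needs `--split-by` (or a single mapper) to divide the work.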
In a further technical solution, inspecting for confidential information specifically comprises: performing distributed inspection of the collected information with the big data framework MapReduce, comparing the collected information item by item against the contents of the sensitive word sample library, marking the information that matches, and raising an alarm.
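The map and reduce roles in this comparison can be sketched in plain Python. This is a single-process illustration of the programming model only, not Hadoop code and not the patent's implementation; the names `map_record`, `reduce_hits`, and `inspect` are hypothetical, and matching is reduced to case-insensitive substring search.

```python
from collections import Counter
from functools import reduce
from typing import Iterable, List, Tuple

Record = Tuple[str, str]  # (record_id, text)


def map_record(record: Record, keywords: Iterable[str]) -> List[Tuple[str, str]]:
    """Map step: emit (record_id, keyword) for each keyword found in the text."""
    record_id, text = record
    lowered = text.lower()
    return [(record_id, kw) for kw in keywords if kw.lower() in lowered]


def reduce_hits(acc: Counter, hits: List[Tuple[str, str]]) -> Counter:
    """Reduce step: count hits per record so that flagged records can be marked."""
    for record_id, _ in hits:
        acc[record_id] += 1
    return acc


def inspect(records: Iterable[Record], keywords: Iterable[str]) -> Counter:
    """Run the map step over every record, then fold the results together."""
    keyword_list = list(keywords)
    mapped = (map_record(r, keyword_list) for r in records)
    return reduce(reduce_hits, mapped, Counter())
```

Because each record is mapped independently, the map step can be spread across many workers, which is the property a MapReduce deployment relies on. Records with a non-zero count are the ones to mark and alarm on.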
In a further technical solution, the inspection covers all text documents in the collected information. It also covers all picture files in the collected information: text embedded in an image is automatically detected and extracted, and an alarm is raised.
In a further technical solution, automatically eliminating the confidential information specifically comprises encryption: the detected confidential information is encrypted with a cryptographic algorithm. The confidential information is thereby turned into a ciphertext field that cannot be meaningfully read, which prevents sensitive information from leaking.
In a further technical solution, automatically eliminating the confidential information specifically comprises format-preserving privacy protection: the detected confidential information is processed so that its format is retained, protecting the sensitive data without damaging the format of the original information. For example, the sensitive name "Zhang Liang" can be changed to "Zhang San", a placeholder with the same format that carries no meaning.
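One simple way to keep the format while hiding the value is to substitute each character with another of the same class. The Python sketch below does this with a keyed HMAC so that the same input always masks to the same output. It is an assumption-laden illustration: `mask_preserving_format` is a hypothetical name, the scheme is one-way masking rather than a standardized, reversible format-preserving encryption mode, and non-ASCII characters (such as the Chinese name in the example above) pass through unchanged, so names like that would need a replacement-name table instead.

```python
import hashlib
import hmac
import string


def mask_preserving_format(value: str, key: bytes) -> str:
    """Replace ASCII digits with digits and letters with same-case letters.

    Punctuation, spaces, and non-ASCII characters are kept as they are,
    so the length and layout of the value stay the same.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    for i, ch in enumerate(value):
        b = digest[i % len(digest)]  # pseudo-random byte derived from key and value
        if ch.isascii() and ch.isdigit():
            out.append(string.digits[b % 10])
        elif ch.isascii() and ch.isupper():
            out.append(string.ascii_uppercase[b % 26])
        elif ch.isascii() and ch.islower():
            out.append(string.ascii_lowercase[b % 26])
        else:
            out.append(ch)
    return "".join(out)
```

A value such as an ID of the form `AB-1234` comes out as two uppercase letters, a hyphen, and four digits, so downstream code that validates the field's shape keeps working.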
In a further technical solution, automatically eliminating the confidential information specifically comprises replacement: the detected confidential information is replaced with special characters so that it can no longer be identified, thereby achieving confidentiality.
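Special-character replacement is the simplest of the three modes to sketch. The Python function below, with the hypothetical name `redact`, overwrites every case-insensitive occurrence of a keyword with a mask character of the same length; the choice of `*` as the mask is an assumption.

```python
import re
from typing import Iterable


def redact(text: str, keywords: Iterable[str], mask_char: str = "*") -> str:
    """Overwrite each keyword occurrence with mask characters of equal length."""
    # Longest keywords first, so that a longer term wins over its own prefix.
    terms = sorted({kw for kw in keywords if kw}, key=len, reverse=True)
    if not terms:
        return text
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    return pattern.sub(lambda m: mask_char * len(m.group(0)), text)
```

For example, `redact("Project Falcon budget: falcon-7", ["falcon"])` returns `"Project ****** budget: ******-7"`.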
The beneficial effects of the invention are as follows:
(1) Distributed data acquisition is implemented with the big data tool Sqoop, and this distributed data capture makes data acquisition efficient.
(2) Distributed inspection of the collected information is implemented with MapReduce, and inspection is 200 to 300 times more efficient than traditional inspection.
(3) An in-memory database stores the intermediate result data. Once the inspection is finished or the system is withdrawn from the inspection site, the system is automatically powered off or restarted and the user data is automatically destroyed, which meets the security requirements that the inspection leave no trace and take no data away.
(4) All text documents (Word, PDF, and the like) in the collected information can be inspected for confidential information. All picture files in the collected information can also be inspected: text embedded in an image is automatically detected and extracted, and an alarm is raised. Many content types are covered, the recognition rate is high, and positioning is accurate: recognition precision exceeds 80%.
(5) Inspection precision is high: confidential information or data can be located accurately within the contents of massive databases, with a low false alarm rate.
Drawings
FIG. 1 is a schematic diagram of the system of the present invention;
FIG. 2 is a schematic flow chart of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Fig. 1 shows a schematic diagram of the system of the present invention.
As shown in fig. 2, a method for checking a database for sensitive information comprises the following steps: setting inspection keywords, collecting database information, inspecting for confidential information, automatically eliminating the confidential information, and outputting the inspection results. After the inspection is completed, a detailed inspection report is generated to guide the subsequent remediation work.
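The final reporting step can be pictured with a small Python sketch. The patent does not specify the report's contents, so the `Finding` record, its fields, and the text layout produced by `build_report` are all assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Finding:
    """One detected item (hypothetical structure)."""
    table: str
    column: str
    keyword: str
    action: str  # e.g. "encrypted", "format-preserved", "replaced"


def build_report(findings: Iterable[Finding]) -> str:
    """Group findings by table and render a plain-text summary."""
    by_table: Dict[str, List[Finding]] = defaultdict(list)
    for finding in findings:
        by_table[finding.table].append(finding)
    lines: List[str] = []
    for table in sorted(by_table):
        items = by_table[table]
        lines.append(f"{table}: {len(items)} finding(s)")
        for item in items:
            lines.append(f"  {item.column}: matched '{item.keyword}', {item.action}")
    return "\n".join(lines)
```

Grouping by table gives whoever performs the follow-up work a per-table list of which columns matched which keyword and how each match was handled.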
Setting the inspection keywords specifically comprises: automatically generating inspection keywords through learning, using intelligent semantic sample analysis, so that a keyword library is built from the sample documents provided by the user; and intelligently expanding and refining the inspection keywords by semantic analysis. The expansion and refinement comprise two parts. First, the search for the inspection keywords of different industries is expanded using synonym and near-synonym algorithms, which greatly improves the retrieval hit rate. Second, the semantics of confidential information in different industries are analyzed, and an intelligent semantic combination technique combines the inspection keywords with several element words to refine them. For example, judging a file to be confidential merely because it contains the word "equipment" has a high error rate, but when "equipment" appears together with digits or English letters, the probability that confidential information is being described is high. As another example, a text containing the word "exercise" is not necessarily confidential, but when "exercise" is combined with a numbered unit designation such as the "15th" army, brigade, regiment, or battalion, it is very likely confidential; this technique can improve accuracy by 200%.
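The keyword-plus-element-word idea from the two examples can be sketched as a proximity rule in Python. Everything concrete here is an assumption: the `RULES` table, the 40-character `WINDOW`, and the function `combined_hits` are hypothetical, and the "equipment" rule is adapted for English text by requiring a nearby digit (in English prose, nearby letters alone would not be distinctive).

```python
import re
from typing import List

# Hypothetical rules (assumption): a base keyword only counts as a hit when
# its "element" pattern occurs within WINDOW characters of the keyword.
RULES = {
    # Any nearby digit stands in for a model or serial number.
    "equipment": re.compile(r"\d"),
    # A numbered unit designation, e.g. "15th brigade".
    "exercise": re.compile(
        r"\b\d+(?:st|nd|rd|th)?\s+(?:army|brigade|regiment|battalion)\b"
    ),
}
WINDOW = 40


def combined_hits(text: str) -> List[str]:
    """Return the base keywords whose element pattern appears close by."""
    lowered = text.lower()
    hits: List[str] = []
    for keyword, element in RULES.items():
        start = lowered.find(keyword)
        while start != -1:
            lo = max(0, start - WINDOW)
            hi = start + len(keyword) + WINDOW
            if element.search(lowered[lo:hi]):
                hits.append(keyword)
                break  # one qualifying occurrence is enough
            start = lowered.find(keyword, start + 1)
    return hits
```

With these rules, "Morning exercise is healthy." produces no hit, while "Exercise plan for the 15th Brigade" is flagged under "exercise". Requiring the combination is what cuts down false alarms from the bare keyword.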
Collecting the database information specifically comprises: performing distributed data acquisition with the big data tool Sqoop, whose distributed data capture makes acquisition efficient. Distributed acquisition provides the foundation for recognizing sensitive words across multiple databases and in high-concurrency scenarios. It keeps sensitive word inspection efficient under high concurrency and prevents the inspection engine from running short of capacity or leaving the inspection incomplete.
Inspecting for confidential information specifically comprises: performing distributed inspection of the collected information with MapReduce, comparing the collected information item by item against the contents of the sensitive word sample library, marking the information that matches, and raising an alarm. Inspection is 200 to 300 times more efficient than traditional inspection. In this embodiment, an in-memory database stores the intermediate result data. Once the inspection is finished or the system is withdrawn from the inspection site, the system is automatically powered off or restarted and the user data is automatically destroyed, which meets the security requirements that the inspection leave no trace and take no data away. All text documents (Word, PDF, and the like) in the collected information can be inspected for confidential information. All picture files in the collected information can also be inspected: text embedded in an image is automatically detected and extracted, and an alarm is raised. Using the configured inspection keywords, the method processes images in various formats (including png, jpg, jpeg, bmp, and tif) obtained by mobile phone photography, scanning, copying, screen capture, and similar means, in order to check whether the text embedded in the images violates regulations or discloses confidential information. The system's recognition rate is high and positioning is accurate: recognition precision exceeds 80%.
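The idea of keeping intermediate results only in memory can be illustrated with Python's built-in `sqlite3` module and its `:memory:` database. SQLite is a stand-in chosen for the sketch; the patent does not name its in-memory database, and the function `run_with_ephemeral_store` and the `hit` table are hypothetical.

```python
import sqlite3
from typing import Iterable, List, Tuple


def run_with_ephemeral_store(hits: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Aggregate (record_id, keyword) hits in a database that lives only in RAM."""
    conn = sqlite3.connect(":memory:")  # nothing is written to disk
    try:
        conn.execute(
            "CREATE TABLE hit (record_id TEXT NOT NULL, keyword TEXT NOT NULL)"
        )
        conn.executemany("INSERT INTO hit VALUES (?, ?)", hits)
        return conn.execute(
            "SELECT record_id, COUNT(*) FROM hit "
            "GROUP BY record_id ORDER BY record_id"
        ).fetchall()
    finally:
        # Closing the connection discards the in-memory database.
        conn.close()
```

Closing the connection drops the database, which mirrors the "leave no trace" goal at the application level. It does not by itself guarantee that the underlying memory pages are wiped, which is presumably why the embodiment also powers off or restarts the system.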
Automatically eliminating the confidential information specifically comprises: three processing modes are provided for the detected confidential information, which prevents further leakage to the greatest possible extent. Encryption: the detected confidential information is encrypted with a cryptographic algorithm. The confidential information is thereby turned into a ciphertext field that cannot be meaningfully read, which prevents sensitive information from leaking. Format-preserving privacy protection: the detected confidential information is processed so that its format is retained, protecting the sensitive data without damaging the format of the original information. For example, the sensitive name "Zhang Liang" can be changed to "Zhang San", a placeholder with the same format that carries no meaning. Replacement: the detected confidential information is replaced with special characters so that it can no longer be identified, thereby achieving confidentiality.
Although the invention has been described herein with reference to a number of illustrative embodiments thereof, it should be understood that numerous other modifications and embodiments can be devised by those skilled in the art that will fall within the spirit and scope of the principles of this disclosure. More specifically, various variations and modifications are possible in the component parts and/or arrangements of the subject combination arrangement within the scope of the disclosure, the drawings and the appended claims. In addition to variations and modifications in the component parts and/or arrangements, other uses will also be apparent to those skilled in the art.

Claims (10)

1. A method for checking a database for sensitive information, characterized by comprising the following steps: setting inspection keywords, collecting database information, inspecting for confidential information, and automatically eliminating the confidential information.
2. The database sensitive information checking method according to claim 1, further comprising outputting the inspection results.
3. The database sensitive information checking method according to claim 1, wherein setting the inspection keywords specifically comprises: automatically generating inspection keywords by intelligent semantic sample analysis, which identifies the content of the files of the current period in combination with machine learning; and intelligently expanding and refining the inspection keywords by semantic analysis.
4. The database sensitive information checking method according to claim 3, wherein intelligently expanding and refining the inspection keywords by semantic analysis comprises: expanding the search for the inspection keywords of different industries using synonym and near-synonym algorithms; and analyzing the semantics of confidential information in different industries and using an intelligent semantic combination technique to combine the inspection keywords with several element words, thereby refining the inspection keywords.
5. The database sensitive information checking method according to claim 1, wherein collecting the database information specifically comprises: performing distributed data acquisition with the big data tool Sqoop.
6. The database sensitive information checking method according to claim 1, wherein inspecting for confidential information specifically comprises: performing distributed inspection of the collected information with the big data framework MapReduce, comparing the collected information item by item against the contents of the sensitive word sample library, marking the information that matches, and raising an alarm.
7. The database sensitive information checking method according to claim 6, wherein the inspection covers all text documents in the collected information, and covers all picture files in the collected information, text embedded in an image being automatically detected and extracted and an alarm being raised.
8. The database sensitive information checking method according to claim 1, wherein automatically eliminating the confidential information specifically comprises encryption: the detected confidential information is encrypted with a cryptographic algorithm.
9. The database sensitive information checking method according to claim 1, wherein automatically eliminating the confidential information specifically comprises format-preserving privacy protection: the detected confidential information is processed so that its format is retained, protecting the sensitive data without damaging the format of the original information.
10. The database sensitive information checking method according to claim 1, wherein automatically eliminating the confidential information specifically comprises replacement: the detected confidential information is replaced with special characters.
CN202110395454.3A 2021-04-13 2021-04-13 Database sensitive information checking method Pending CN113032811A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110395454.3A CN113032811A (en) 2021-04-13 2021-04-13 Database sensitive information checking method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110395454.3A CN113032811A (en) 2021-04-13 2021-04-13 Database sensitive information checking method

Publications (1)

Publication Number Publication Date
CN113032811A true CN113032811A (en) 2021-06-25

Family

ID=76456547

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110395454.3A Pending CN113032811A (en) 2021-04-13 2021-04-13 Database sensitive information checking method

Country Status (1)

Country Link
CN (1) CN113032811A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102819604A (en) * 2012-08-20 2012-12-12 徐亮 Method for retrieving confidential information of file and judging and marking security classification based on content correlation
CN107577939A (en) * 2017-09-12 2018-01-12 中国石油集团川庆钻探工程有限公司 A kind of data leakage prevention method based on key technology
CN111723280A (en) * 2019-03-20 2020-09-29 北京字节跳动网络技术有限公司 Information processing method and device, storage medium and electronic equipment
CN112347079A (en) * 2020-11-06 2021-02-09 杭州世平信息科技有限公司 Database content security check system and check method
CN112612875A (en) * 2020-12-29 2021-04-06 重庆农村商业银行股份有限公司 Method, device and equipment for automatically expanding query words and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
余宣杰等: "开放创新模式下研究型图书馆战略变革", 科学技术文献出版社, pages: 208 - 177 *

Similar Documents

Publication Publication Date Title
US6178417B1 (en) Method and means of matching documents based on text genre
US8200642B2 (en) System and method for managing electronic documents in a litigation context
US8359472B1 (en) Document fingerprinting with asymmetric selection of anchor points
WO2010102515A1 (en) Automatic and semi-automatic image classification, annotation and tagging through the use of image acquisition parameters and metadata
Rosin et al. Learning word relatedness over time
Wu et al. Efficient near-duplicate detection for q&a forum
CN110909120B (en) Resume searching/delivering method, device and system and electronic equipment
CN111291177A (en) Information processing method and device and computer storage medium
US20230205824A1 (en) Contextual Clarification and Disambiguation for Question Answering Processes
Tan et al. Efficient mining of multiple partial near-duplicate alignments by temporal network
Middleton et al. Social computing for verifying social media content in breaking news
EP3301603A1 (en) Improved search for data loss prevention
US20120254166A1 (en) Signature Detection in E-Mails
CN102819612A (en) Full text search method based on print documents
Joshi et al. Auto-grouping emails for faster e-discovery
CN113220821A (en) Index establishing method and device for test question retrieval and electronic equipment
Raghavan et al. Eliciting file relationships using metadata based associations for digital forensics
CN113032811A (en) Database sensitive information checking method
CN112528056B (en) Double-index field data retrieval system and method
Foo et al. Discovery of image versions in large collections
Baratis et al. Automatic website summarization by image content: a case study with logo and trademark images
Papadopoulou et al. Context aggregation and analysis: a tool for user-generated video verification
Adefowoke Ojokoh et al. Automated document metadata extraction
CN117556112B (en) Intelligent management system for electronic archive information
Song et al. A pointillism approach for natural language processing of social media

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination