CN111522950B

CN111522950B - Rapid identification system for unstructured massive text sensitive data

Info

Publication number: CN111522950B
Application number: CN202010338431.4A
Authority: CN
Inventors: 章明珠; 刘超
Original assignee: Chengdu Siwei Century Technology Co ltd
Current assignee: Chengdu Siwei Century Technology Co ltd
Priority date: 2020-04-26
Filing date: 2020-04-26
Publication date: 2023-06-27
Anticipated expiration: 2040-04-26
Also published as: CN111522950A

Abstract

The invention discloses a rapid identification system for unstructured massive text sensitive data. The system comprises a modeling unit, an identification layer unit, a storage unit, a support layer unit and a serialization unit. The modeling unit comprises an information acquisition module and a modeling calculation module and is electrically connected with the identification layer unit. The storage unit provides persistent storage for the metadata of the modeling unit and is electrically connected with the modeling unit. The support layer unit comprises a business monitoring module, a man-machine interaction module, a service hosting module and a log tracking module. To classify unstructured data rapidly, a learning engine autonomously selects a suitable algorithm from common classification algorithms, which improves identification efficiency. To identify unstructured data efficiently, a query method matching the sensitive type is autonomously selected for scanning, which improves scanning efficiency.

Description

Rapid identification system for unstructured massive text sensitive data

Technical Field

The invention belongs to the fields of data security, data classification algorithms and data modeling, and particularly relates to a rapid identification system for unstructured massive text sensitive data.

Background

For massive unstructured text data, solutions currently on the market model the text of the unstructured data and compare text similarity, refine and optimize large-scale unstructured data classification algorithms, and then classify the unstructured data and extract the sensitive content. The related technical schemes mainly use a neural network data analysis engine to classify and summarize text data and then extract and identify the data; the core technology is a sensitive identification engine that rapidly classifies and systematizes text data. With the development and popularization of Internet technology, a large amount of unstructured electronic text exists on the Internet, and as the volume of web page data keeps growing, sensitive data threatens the daily life of enterprises and individuals at any time. There is therefore a growing market demand for helping enterprises identify sensitive data efficiently, classify it rapidly out of massive unstructured text, and express unstructured text data in a form that a computer can understand, so that identification cost is reduced while the data is mined and stored efficiently.

The main disadvantage of the prior art for sensitive identification of unstructured data is that identification efficiency is quite low when massive data is processed. The main causes are the classification efficiency of the data and the scanning efficiency of key information.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a rapid identification system for unstructured massive text sensitive data.

In order to achieve the above purpose, the present invention provides the following technical solutions:

A rapid identification system for unstructured massive text sensitive data comprises a modeling unit, an identification layer unit, a storage unit, a support layer unit and a serialization unit. The modeling unit comprises an information acquisition module and a modeling calculation module and is electrically connected with the identification layer unit. The storage unit provides persistent storage for the metadata of the modeling unit, is provided with a read-write interface, and is electrically connected with the modeling unit. The support layer unit comprises a business monitoring module, a man-machine interaction module, a service hosting module and a log tracking module; it is used for optimizing the algorithm of the serialization unit and for providing a strategy and a basis for the acquisition focus of the information acquisition module.

Preferably, the information acquisition module comprises a manual acquisition module and a machine acquisition module. The manual acquisition module is used for manually sorting sample data into the storage unit and labeling each sample with its sensitive type and grade. The manual acquisition module is provided with an interface connected with the identification layer unit and supports manual batch import of keywords. At least 100 pieces of information are acquired for each sample, and the samples are stored in the storage unit.

Preferably, the modeling calculation module is used for providing corresponding human acquisition calculation and computer acquisition calculation for the manual acquisition module and the machine acquisition module respectively. The computer acquisition calculation adopts algorithms open-sourced by enterprises at the technical forefront of the industry in the fields of neural networks and artificial intelligence. The human acquisition calculation corrects the business relevance of the computer acquisition calculation and applies smooth transition processing to the sensitive classification and rating scoring system. The human acquisition calculation introduces similarity calculation and an extensible Hamming distance algorithm, and adds natural language processing of approximate terms and lexical association.
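As a non-limiting illustration of the similarity calculation and Hamming distance mentioned above, the following Python sketch compares two texts through bit fingerprints. The description does not state how fingerprints are produced; the SimHash-style `simhash` function, the 64-bit width and the use of MD5 as the token hash are assumptions chosen only to make the example self-contained.

```python
import hashlib


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def simhash(tokens, bits: int = 64) -> int:
    """SimHash-style fingerprint (assumed technique, not named in the source).

    `bits` must be a multiple of 8 and at most 128 (the MD5 digest size).
    """
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def fingerprint_similarity(a: int, b: int, bits: int = 64) -> float:
    """Map the Hamming distance to a similarity in [0, 1]; 1.0 means identical."""
    return 1.0 - hamming_distance(a, b) / bits
```

Texts whose fingerprints differ in only a few bits can then be treated as near-duplicates of a labeled sample; the threshold for "a few" would have to be tuned on real data.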

Preferably, the identification layer unit takes the output of the modeling unit as input to perform the initial loading of the models. The identification layer unit dynamically adds and removes model items according to service requirements and supports hot-plug operation. The hit scores returned by the identification layer unit for each sensitive model are combined by a summarizing algorithm: the weight of each classification is multiplied by the accumulated matching degree and the logarithm is taken, giving a floating point number between zero and one that serves as a correction value in the final sensitivity evaluation.
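The summarizing algorithm can be sketched as follows. The description only states that the classification weight is multiplied by the accumulated matching degree, that the logarithm is taken, and that the result lies between zero and one. The logarithm base, the `log1p` offset and the `saturation` parameter used here to bound the result are assumptions, not part of the disclosure.

```python
import math


def correction_value(weight: float, match_degrees, saturation: float = 100.0) -> float:
    """Log-scaled score for one classification, clamped to [0, 1].

    weight        -- the classification's own weight
    match_degrees -- matching degrees of the individual hits
    saturation    -- assumed product at which the score reaches 1.0
    """
    accumulated = sum(match_degrees)
    product = weight * accumulated
    if product <= 0:
        return 0.0  # no hits, or a non-positive weight
    # log1p keeps the value at 0 for a zero product and grows slowly after that.
    value = math.log1p(product) / math.log1p(saturation)
    return min(1.0, value)
```

The logarithm damps the effect of many repeated hits, so a text that matches a model a thousand times is not scored a thousand times higher than one that matches it once.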

Preferably, the identification layer unit groups the massive unstructured text data with a classification algorithm or a clustering algorithm, then detects the character set and language and converts the text into the internally stored character set as required, segments the metadata with a word segmentation system, and extracts the keywords of the current text after deleting stop words.
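The preprocessing chain described above (character set handling, word segmentation, stop word deletion, keyword extraction) can be illustrated with a minimal Python sketch. The candidate encodings, the stop word list and the regular-expression tokenizer are assumptions; in particular, the regular expression is only a stand-in for a real word segmentation system, which Chinese text would require.

```python
import re
from collections import Counter

# Assumed candidate encodings, tried in order.
CANDIDATE_ENCODINGS = ("utf-8", "gb18030")
# Assumed, deliberately tiny stop word list.
STOP_WORDS = frozenset({"the", "a", "an", "of", "and", "to", "is", "in"})


def decode_text(raw: bytes) -> str:
    """Convert raw bytes to the internal representation (a Python str)."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep what can be decoded and mark the rest.
    return raw.decode("utf-8", errors="replace")


def extract_keywords(raw: bytes, top_n: int = 5):
    """Decode, segment, drop stop words, and return the most frequent terms."""
    text = decode_text(raw)
    tokens = re.findall(r"\w+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [word for word, _ in Counter(kept).most_common(top_n)]
```

Frequency counting is the simplest possible keyword criterion; it is used here only to show where keyword extraction sits in the chain.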

Preferably, based on the characteristics of network data, the storage unit adopts a semi-structured distributed storage solution to store web page content with high scalability; the message queue in the storage unit is first-in first-out and can be freely subscribed to.

Preferably, the read-write interface periodically reads incremental data and transmits it to the storage unit to form a message queue, which is pushed to each service node of the identification layer unit. Each service node consumes the queue data according to its own equipment load and writes the result back to the message queue immediately after processing is completed.
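The interplay of incremental reading, the first-in first-out message queue and the service nodes can be sketched in Python with standard-library queues and threads. The function names, the offset-based notion of "incremental data" and the trivial `"password"` check standing in for model matching are all hypothetical. Load-dependent consumption is approximated by letting each worker pull a new record only when it is idle.

```python
import queue
import threading


def read_increment(store, offset):
    """Return the records appended since `offset` and the new offset."""
    new_records = store[offset:]
    return new_records, offset + len(new_records)


def run_pipeline(records, worker_count=2):
    """Push records through a FIFO queue to workers; collect written-back results."""
    pending = queue.Queue()   # FIFO queue fed by the read-write interface
    results = queue.Queue()   # queue that receives the write-back

    def worker():
        while True:
            item = pending.get()
            if item is None:  # sentinel: no more data for this worker
                break
            # Placeholder for real model matching.
            label = "sensitive" if "password" in item else "clean"
            results.put((item, label))

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for record in records:
        pending.put(record)
    for _ in threads:
        pending.put(None)
    for t in threads:
        t.join()

    collected = []
    while not results.empty():
        collected.append(results.get())
    return collected
```

Because several workers run concurrently, results may be written back in a different order from the one in which records were queued; only the input queue is strictly first-in first-out.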

Preferably, the serialization unit includes a sensitive information module used for encapsulating sensitive words and sensitive fields. Serialization is performed at the production end and deserialization at the consumption end. The information to be serialized by the serialization unit includes a version number, information type, operation type, encryption identifier and key, data length, data information and identification result.
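A possible wire layout for the serialized information is sketched below in Python. The field list follows the description; the field widths, the byte order, the added length fields for the key and the result, and the class and function names are assumptions. No encryption is performed in the sketch: the encryption identifier and key are merely carried as fields.

```python
import struct
from dataclasses import dataclass

# Assumed fixed header: version, info type, operation type, encryption flag,
# key length, data length, result length (big-endian, no padding).
HEADER = struct.Struct(">BBB?HIH")


@dataclass
class Message:
    version: int
    info_type: int
    op_type: int
    encrypted: bool
    key: bytes
    data: bytes
    result: str


def serialize(msg: Message) -> bytes:
    """Producer side: pack the header, then append the variable-length fields."""
    result_bytes = msg.result.encode("utf-8")
    header = HEADER.pack(
        msg.version, msg.info_type, msg.op_type, msg.encrypted,
        len(msg.key), len(msg.data), len(result_bytes),
    )
    return header + msg.key + msg.data + result_bytes


def deserialize(blob: bytes) -> Message:
    """Consumer side: read the header, then slice out each field by length."""
    (version, info_type, op_type, encrypted,
     key_len, data_len, result_len) = HEADER.unpack_from(blob, 0)
    pos = HEADER.size
    key = blob[pos:pos + key_len]
    pos += key_len
    data = blob[pos:pos + data_len]
    pos += data_len
    result = blob[pos:pos + result_len].decode("utf-8")
    return Message(version, info_type, op_type, encrypted, key, data, result)
```

Placing the version number first lets a consumer reject or adapt to messages produced by a newer serializer before it interprets any other field.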

Preferably, the specific workflow of the rapid identification system for unstructured massive text sensitive data is as follows:

S1: Data acquisition and storage. The data provided by the institution or enterprise to be identified is stored in HBase, ES or another non-relational database.

S2: Identification. Some or all of the identification models are loaded according to the configuration items, and the data is identified with these models using relation extraction. A thread pool reads records from the message queue one by one, deserializes them, and executes different processing flows according to the data type. After model matching is completed, the summarizing calculation is performed and the serialized result is written back to the message queue topic of the system bus. A log is recorded for the current execution for offline effect analysis.

S3: Bus queue. When the system bus starts, a production worker thread and a consumption worker thread are created. The production worker thread periodically tracks changes in the incremental data of the underlying storage; when data arrives, it extracts the data to be consumed from the storage unit and puts it into the consumption topic. The consumption worker thread is suspended and waits at the entry; when a new message arrives, it automatically triggers the write-back operation and updates the data at the original address in the underlying storage.

S4: Log analysis. The serialization unit analyzes log data in batch mode in units of one hour and generates a report; the statistics cover data scale, proportion of sensitive information, intensity of sensitive information, propagation frequency (heat) and identification accuracy.

S5: Support system. The support layer unit provides support capability for the whole framework of the rapid identification system, mainly the database component and the learning engine. The database needs to be cleaned and optimized regularly, and the learning engine needs to update the identification algorithms and the identification library in time.

S6: External interface. The external interface provides import of sensitive identification data and a request interface for sensitive data identification.
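The hourly batch log analysis of the workflow can be illustrated with the following Python sketch. The log entry layout (a `time` stamp plus `sensitive` and `correct` flags) is an assumption, and only three of the listed statistics (data scale, proportion of sensitive information, identification accuracy) are computed.

```python
from collections import defaultdict
from datetime import datetime


def hourly_report(entries):
    """Group log entries by hour and compute per-hour statistics.

    Each entry is assumed to be a dict with keys:
      "time"      -- datetime of the identification
      "sensitive" -- True if the record was flagged as sensitive
      "correct"   -- True if the flag matched a later manual check
    """
    buckets = defaultdict(list)
    for entry in entries:
        hour = entry["time"].replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(entry)

    report = {}
    for hour, items in sorted(buckets.items(), key=lambda pair: pair[0]):
        total = len(items)
        sensitive = sum(1 for e in items if e["sensitive"])
        correct = sum(1 for e in items if e["correct"])
        report[hour] = {
            "records": total,                      # data scale
            "sensitive_ratio": sensitive / total,  # proportion of sensitive information
            "accuracy": correct / total,           # identification accuracy
        }
    return report
```

Running the analysis offline on completed hours keeps reporting off the identification path, so it does not compete with the worker threads for resources.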

The technical effects and advantages of the invention are as follows: compared with the prior art, the rapid identification system for unstructured massive text sensitive data provided by the invention uses a learning engine to autonomously select a suitable algorithm from common classification algorithms to classify unstructured data rapidly, which improves identification efficiency; for efficient identification of unstructured data, a query method matching the sensitive type is autonomously selected for scanning, which improves scanning efficiency.

Drawings

FIG. 1 is a block diagram of the rapid identification system for unstructured massive text sensitive data of the present invention;

FIG. 2 is a flow chart of the rapid identification system for unstructured massive text sensitive data of the present invention.

Detailed Description

The present invention will be described in further detail with reference to specific embodiments in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Example 1

The rapid identification system for unstructured massive text sensitive data comprises a modeling unit, an identification layer unit, a storage unit, a support layer unit and a serialization unit. The modeling unit comprises an information acquisition module and a modeling calculation module and is electrically connected with the identification layer unit. The storage unit provides persistent storage for the metadata of the modeling unit, is provided with a read-write interface, and is electrically connected with the modeling unit. The support layer unit comprises a business monitoring module, a man-machine interaction module, a service hosting module and a log tracking module; it is used for optimizing the algorithm of the serialization unit and for providing a strategy and a basis for the acquisition focus of the information acquisition module.

The information acquisition module comprises a manual acquisition module and a machine acquisition module. The manual acquisition module is used for manually sorting sample data into the storage unit and labeling each sample with its sensitive type and grade. The manual acquisition module is provided with an interface connected with the identification layer unit and supports manual batch import of keywords. At least 100 pieces of information are acquired for each sample, and the samples are stored in the storage unit.

The modeling calculation module is used for providing corresponding human acquisition calculation and computer acquisition calculation for the manual acquisition module and the machine acquisition module respectively. The computer acquisition calculation adopts algorithms open-sourced by enterprises at the technical forefront of the industry in the fields of neural networks and artificial intelligence. The human acquisition calculation corrects the business relevance of the computer acquisition calculation and applies smooth transition processing to the sensitive classification and rating scoring system. The human acquisition calculation introduces similarity calculation and an extensible Hamming distance algorithm, and adds natural language processing of approximate terms and lexical association.

The identification layer unit takes the output of the modeling unit as input to perform the initial loading of the models. The identification layer unit dynamically adds and removes model items according to service requirements and supports hot-plug operation. The hit scores returned by the identification layer unit for each sensitive model are combined by a summarizing algorithm: the weight of each classification is multiplied by the accumulated matching degree and the logarithm is taken, giving a floating point number between zero and one that serves as a correction value in the final sensitivity evaluation.
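The hot-plug behavior of this embodiment, in which model items are added and removed while the system keeps running, can be sketched as a thread-safe registry. The class name `ModelRegistry`, its methods and the representation of a model as a callable returning a matching degree are hypothetical and serve only as an illustration.

```python
import threading


class ModelRegistry:
    """Holds the currently loaded sensitive models; safe to modify at run time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._models = {}

    def load(self, name, matcher):
        """Add or replace a model item without stopping the service."""
        with self._lock:
            self._models[name] = matcher

    def unload(self, name):
        """Remove a model item; unknown names are ignored."""
        with self._lock:
            self._models.pop(name, None)

    def match(self, text):
        """Run every loaded model on the text and return its matching degree."""
        with self._lock:
            snapshot = dict(self._models)  # copy so matching runs outside the lock
        return {name: matcher(text) for name, matcher in snapshot.items()}
```

Matching against a snapshot means that a model loaded or unloaded during a scan takes effect from the next text onward, which keeps each individual result internally consistent.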

The identification layer unit groups the massive unstructured text data with a classification algorithm or a clustering algorithm, then detects the character set and language and converts the text into the internally stored character set as required, segments the metadata with a word segmentation system, deletes stop words, and extracts the keywords of the current text.

Based on the characteristics of network data, the storage unit adopts a semi-structured distributed storage solution to store web page content with high scalability; the message queue in the storage unit is first-in first-out and can be freely subscribed to.

The read-write interface periodically reads incremental data and transmits it to the storage unit to form a message queue, which is pushed to each service node of the identification layer unit. Each service node consumes the queue data according to its own equipment load and writes the result back to the message queue immediately after processing is completed.

The serialization unit comprises a sensitive information module used for encapsulating sensitive words and sensitive fields. Serialization is performed at the production end and deserialization at the consumption end. The information to be serialized by the serialization unit comprises a version number, information type, operation type, encryption identifier and key, data length, data information and identification result.

Example 2

The specific workflow of the rapid identification system for unstructured massive text sensitive data is as follows:

To sum up: compared with the prior art, the rapid identification system for unstructured massive text sensitive data provided by the invention uses a learning engine to autonomously select a suitable algorithm from common classification algorithms to classify unstructured data rapidly, which improves identification efficiency; for efficient identification of unstructured data, a query method matching the sensitive type is autonomously selected for scanning, which improves scanning efficiency.

Finally, it should be noted that: the foregoing description is only illustrative of the preferred embodiments of the present invention, and although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described, or equivalents may be substituted for elements thereof, and any modifications, equivalents, improvements or changes may be made without departing from the spirit and principles of the present invention.

Claims

1. A rapid identification system for unstructured massive text sensitive data, comprising a modeling unit, an identification layer unit, a storage unit, a support layer unit and a serialization unit, characterized in that: the modeling unit comprises an information acquisition module and a modeling calculation module, and the modeling unit is electrically connected with the identification layer unit; the storage unit is used for providing persistent storage for metadata of the modeling unit, a read-write interface is arranged on the storage unit, and the storage unit is electrically connected with the modeling unit; the support layer unit comprises a business monitoring module, a man-machine interaction module, a service hosting module and a log tracking module, and the support layer unit is used for optimizing an algorithm of the serialization unit and providing a strategy and a basis for the acquisition focus of the information acquisition module;

the information acquisition module comprises a manual acquisition module and a machine acquisition module, wherein the manual acquisition module is used for manually sorting sample data into the storage unit and labeling each sample with its sensitive type and grade, an interface connected with the identification layer unit is arranged on the manual acquisition module, the manual acquisition module is used for manual batch import of keywords, at least 100 pieces of information are acquired for each sample, and the samples are stored in the storage unit;

the identification layer unit takes the output of the modeling unit as input to perform the initial loading of the models, the identification layer unit dynamically adds and removes model items according to service requirements and supports hot-plug operation, and the hit scores returned by the identification layer unit for each sensitive model are combined by a summarizing algorithm, namely, the weight of each classification is multiplied by the accumulated matching degree and the logarithm is taken, the result being a floating point number between zero and one that is used as a correction value in the final sensitivity evaluation.

2. The rapid identification system for unstructured massive text sensitive data according to claim 1, wherein: the modeling calculation module is used for providing corresponding human acquisition calculation and computer acquisition calculation for the manual acquisition module and the machine acquisition module respectively, the computer acquisition calculation adopts algorithms open-sourced by enterprises at the technical forefront of the industry in the fields of neural networks and artificial intelligence, the human acquisition calculation is used for correcting the business relevance of the computer acquisition calculation and applying smooth transition processing to the sensitive classification and rating scoring system, the human acquisition calculation introduces similarity calculation and an extensible Hamming distance algorithm, and the human acquisition calculation adds natural language processing of approximate terms and lexical association.

3. The rapid identification system for unstructured massive text sensitive data according to claim 1, wherein: the identification layer unit groups the massive unstructured text data with a classification algorithm or a clustering algorithm, then detects the character set and language and converts the text into the internally stored character set as required, segments the metadata with a word segmentation system, and extracts the keywords of the current text after deleting stop words.

4. A rapid identification system for unstructured massive text-sensitive data according to claim 1, wherein: the storage unit adopts a semi-structured distributed storage solution to store high-expansibility webpage content based on network data characteristics, and the message queue in the storage unit meets the first-in first-out characteristics and can be freely subscribed.

5. The rapid identification system for unstructured massive text sensitive data according to claim 4, wherein: the read-write interface is used for periodically reading incremental data and transmitting it to the storage unit to form a message queue, which is pushed to each service node of the identification layer unit; each service node consumes the queue data according to its own equipment load and writes the result back to the message queue immediately after processing is completed.

6. The rapid identification system for unstructured massive text sensitive data according to claim 1, wherein: the serialization unit comprises a sensitive information module used for encapsulating sensitive words and sensitive fields, serialization is performed at a production end and deserialization at a consumption end, and the information to be serialized by the serialization unit comprises a version number, information type, operation type, encryption identifier and key, data length, data information and identification result.

7. A rapid identification system for unstructured massive text-sensitive data according to claim 1, wherein: the specific workflow of the rapid identification system for unstructured massive text sensitive data is as follows: