CN111522950A - Rapid identification system for unstructured massive text sensitive data

Info

Publication number: CN111522950A
Application number: CN202010338431.4A
Authority: CN
Inventors: 章明珠; 刘超
Original assignee: Chengdu Siwei Century Technology Co ltd
Current assignee: Chengdu Siwei Century Technology Co ltd
Priority date: 2020-04-26
Filing date: 2020-04-26
Publication date: 2020-08-11
Anticipated expiration: 2040-04-26
Also published as: CN111522950B

Abstract

The invention discloses a rapid identification system for sensitive data in massive unstructured text, comprising a modeling unit, an identification layer unit, a storage unit, a supporting layer unit and a serialization unit. The modeling unit comprises an information acquisition module and a modeling calculation module and is electrically connected with the identification layer unit. The storage unit provides persistent storage for the metadata of the modeling unit and is electrically connected with the modeling unit. The supporting layer unit comprises a business monitoring module, a human-computer interaction module, a service hosting module and a log tracking module. To classify unstructured data rapidly, a learning engine autonomously selects a suitable algorithm from common classification algorithms, which improves identification efficiency and enables efficient identification of unstructured data; a corresponding query method can likewise be selected automatically according to the sensitive type for scanning, which improves scanning efficiency.

Description

Rapid identification system for unstructured massive text sensitive data

Technical Field

The invention belongs to the fields of data security, data classification algorithms and data modeling, and particularly relates to a rapid identification system for sensitive data in massive unstructured text.

Background

For massive unstructured text data, products currently on the market model the text of the unstructured data and compare text similarity in order to refine and optimize classification algorithms for large-scale unstructured data, and then classify the data and extract sensitive content. The mainstream technical scheme uses a neural-network data analysis engine to classify and summarize text data and then extracts and identifies the data; its core technology is a sensitive identification engine that quickly classifies and organizes text data. With the development and popularization of internet technology, a large amount of unstructured electronic text exists on the internet, and as web page data keeps growing, sensitive data poses a constant threat to the daily life of enterprises and individuals. How to help enterprises identify sensitive data efficiently, how to classify sensitive data quickly from massive unstructured text, how to express unstructured text data in a form a computer can understand, how to reduce identification cost, and how to mine and store the data efficiently are questions for which market demand is growing.

The main disadvantage of the prior art for sensitive identification of unstructured data is that identification efficiency is very low when processing massive data, mainly because of the low efficiency of data classification and of key-information scanning.

Disclosure of Invention

The invention aims to overcome the shortcomings of the prior art and provides a rapid identification system for sensitive data in massive unstructured text.

In order to achieve the purpose, the invention provides the following technical scheme:

A rapid identification system for sensitive data in massive unstructured text comprises a modeling unit, an identification layer unit, a storage unit, a supporting layer unit and a serialization unit. The modeling unit comprises an information acquisition module and a modeling calculation module and is electrically connected with the identification layer unit. The storage unit provides persistent storage for the metadata of the modeling unit, is provided with a read-write interface, and is electrically connected with the modeling unit. The supporting layer unit comprises a business monitoring module, a human-computer interaction module, a service hosting module and a log tracking module; it is used for optimizing the algorithm of the serialization unit and for providing the strategy and basis for the collection focus of the information acquisition module.

Preferably, the information acquisition module comprises a manual acquisition module and a machine acquisition module. The manual acquisition module is used for manually sorting sample data into the storage unit and labeling the samples with their sensitive types and grades. The manual acquisition module is provided with an interface connected with the identification layer unit and supports the manual import of keywords in batches. The manual acquisition module collects no fewer than 100 pieces of information for each sample, and the samples are stored in the storage unit.

Preferably, the modeling calculation module provides manual mining calculation and machine mining calculation for the manual acquisition module and the machine acquisition module respectively. The machine mining calculation adopts algorithms open-sourced by leading technology enterprises in the industry in the fields of neural networks and artificial intelligence. The manual mining calculation corrects the business relevance of the machine mining calculation and applies smooth transition processing to the sensitive classification and rating scoring system; it introduces extended algorithms such as similarity calculation and Hamming distance, and adds natural language processing of similarity and lexical association.
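The similarity calculation and Hamming distance mentioned above can be illustrated with a minimal sketch. The source does not say how text is turned into bit strings, so the SimHash-style fingerprint below, the function names and the 64-bit width are all assumptions chosen for illustration; for Chinese text a proper word segmenter would replace the simple regular expression.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Build a SimHash-style fingerprint from word tokens (illustrative assumption)."""
    counts = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        # Use the first 8 bytes of an MD5 digest as a stable 64-bit token hash.
        digest = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (digest >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which the two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(a: str, b: str, bits: int = 64) -> float:
    """Map the Hamming distance between fingerprints to a similarity in [0, 1]."""
    return 1.0 - hamming_distance(simhash(a, bits), simhash(b, bits)) / bits
```

Identical texts give a similarity of 1.0, and texts sharing most tokens tend to give fingerprints that differ in only a few bits, which is what makes the distance usable as a cheap near-duplicate test.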

Preferably, the identification layer unit takes the output of the modeling unit as input to perform the initial loading of the models, dynamically adds and removes model items according to business needs, and supports hot-plug operation. The hit scores returned by the identification layer unit for each sensitive model are combined by a summarizing algorithm: the weight of each classification is multiplied by the logarithm of the accumulated matching degree of that classification, and the result, a floating-point number between zero and one, serves as a correction value in the final sensitivity evaluation.
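The summarizing algorithm is described only in words, so the following is a sketch of one possible reading rather than the patented formula. It assumes that weights are normalized to sum to one, that `log1p` is used so an empty match list contributes zero, and that a hypothetical `cap` parameter bounds the accumulated matching degree so the result stays between zero and one. None of these details come from the source.

```python
import math
from typing import Dict, List


def summarize_hits(
    weights: Dict[str, float],
    match_degrees: Dict[str, List[float]],
    cap: float = 10.0,  # hypothetical upper bound on the accumulated matching degree
) -> float:
    """Weighted sum of log-scaled accumulated matching degrees, kept within [0, 1]."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        return 0.0
    score = 0.0
    for category, weight in weights.items():
        accumulated = sum(match_degrees.get(category, []))
        # Clamp to [0, cap], take the logarithm, and scale so each term lies in [0, 1].
        term = math.log1p(min(max(accumulated, 0.0), cap)) / math.log1p(cap)
        score += (weight / total_weight) * term
    # Guard against floating-point rounding slightly above 1.
    return min(1.0, score)
```

The logarithm dampens classifications that match very often, so a single noisy model cannot dominate the correction value.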

Preferably, the identification layer unit first groups the massive unstructured text data by using a classification algorithm or a clustering algorithm, then detects the character set and language, converts the text into the character set used for internal storage where necessary, segments the metadata into words by using a word segmentation system, and extracts the keywords of the current text after deleting stop words.
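The preprocessing steps above (character set handling, word segmentation, stop-word removal, keyword extraction) can be sketched with the standard library alone. The fallback list of encodings, the tiny stop-word set, and the regular-expression tokenizer that treats each CJK character as a token are assumptions standing in for the real character-set detector and word segmentation system, which the source does not name.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical stop-word list; a production list would be far larger.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is", "in"}


def decode_text(raw: bytes, encodings: Iterable[str] = ("utf-8", "gb18030")) -> str:
    """Try candidate character sets in order and return text in Python's internal form."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")


def segment(text: str) -> List[str]:
    """Crude stand-in for a word segmentation system."""
    return re.findall(r"[A-Za-z0-9_]+|[\u4e00-\u9fff]", text.lower())


def extract_keywords(raw: bytes, top_n: int = 5) -> List[Tuple[str, int]]:
    """Decode, segment, drop stop words, and return the most frequent tokens."""
    tokens = [t for t in segment(decode_text(raw)) if t not in STOP_WORDS]
    return Counter(tokens).most_common(top_n)
```

For example, `extract_keywords(b"the card number and the card holder", 1)` yields `[("card", 2)]`, since "the" and "and" are removed before counting.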

Preferably, the storage unit stores highly extensible web page content by adopting a semi-structured distributed storage solution suited to the characteristics of network data, and the message queue in the storage unit is first-in first-out and can be freely subscribed to.

Preferably, the read-write interface is configured to periodically read incremental data and transmit it to the storage unit to form a message queue, which is pushed to each service node of the identification layer unit. The service nodes are provisioned according to device load, consume the queue data, and write the results back to the message queue immediately after processing is completed.
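The queue-based hand-off between the read-write interface and the service nodes can be sketched with Python's in-process `queue.Queue`, which is first-in first-out. This is a simplification under stated assumptions: a real deployment would use a distributed message queue, the producer here reads its records once rather than periodically, and the names `run_bus`, `identify` and `workers` are hypothetical.

```python
import queue
import threading
from typing import Callable, Dict, List


def run_bus(records: List[str], identify: Callable[[str], bool], workers: int = 2) -> Dict[str, bool]:
    """Push records through a FIFO queue to worker threads and collect written-back results."""
    todo = queue.Queue()  # incremental data waiting to be identified (FIFO)
    done = queue.Queue()  # results written back after processing

    def producer() -> None:
        for record in records:
            todo.put(record)
        for _ in range(workers):
            todo.put(None)  # one stop marker per service node

    def consumer() -> None:
        while True:
            record = todo.get()  # blocks until a message is available
            if record is None:
                break
            done.put((record, identify(record)))  # write back immediately

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    results: Dict[str, bool] = {}
    while not done.empty():
        record, flag = done.get()
        results[record] = flag
    return results
```

The number of worker threads plays the role of the service nodes; in practice it would be chosen from the observed device load.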

Preferably, the serialization unit includes a sensitive information module, which is used for encapsulating the sensitive words and sensitive fields. Serialization is performed at the production end and deserialization at the consumption end, and the information to be serialized includes the version number, information type, operation type, encryption identifier and key, data length, data information, and identification result.
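A minimal sketch of a message carrying the listed fields is shown below. The source does not specify a wire format, so the fixed big-endian header, the field widths, and the names `Message`, `serialize` and `deserialize` are assumptions; no actual encryption is performed, the key is simply carried as bytes.

```python
import struct
from dataclasses import dataclass

# Assumed header: version, info type, operation type, encryption flag (1 byte each),
# key length (2 bytes), data length (4 bytes), result length (2 bytes).
_HEADER = struct.Struct(">BBBBHIH")


@dataclass
class Message:
    version: int
    info_type: int
    op_type: int
    encrypted: bool
    key: bytes
    data: bytes
    result: str


def serialize(msg: Message) -> bytes:
    """Production end: pack the header, then append key, data and result."""
    result = msg.result.encode("utf-8")
    header = _HEADER.pack(
        msg.version, msg.info_type, msg.op_type, int(msg.encrypted),
        len(msg.key), len(msg.data), len(result),
    )
    return header + msg.key + msg.data + result


def deserialize(blob: bytes) -> Message:
    """Consumption end: read the header, then slice out the variable-length fields."""
    version, info_type, op_type, encrypted, key_len, data_len, result_len = _HEADER.unpack_from(blob)
    offset = _HEADER.size
    key = blob[offset:offset + key_len]
    offset += key_len
    data = blob[offset:offset + data_len]
    offset += data_len
    result = blob[offset:offset + result_len].decode("utf-8")
    return Message(version, info_type, op_type, bool(encrypted), key, data, result)
```

Carrying the version number first lets a consumer reject or adapt to messages produced by a newer format before parsing the rest.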

Preferably, the specific workflow of the rapid identification system for sensitive data in massive unstructured text is as follows:

S1: Data acquisition and storage. Data provided by the organization or enterprise that requires identification is stored in HBase, Elasticsearch (ES) or another non-relational database.

S2: Identification operation. Some or all of the identification models are loaded according to the configuration items, and relation extraction is used to identify sensitive data according to the identification models. A thread pool reads records from the message queue one by one and deserializes them, and different processing flows are executed according to the data type. After model matching is finished, the summary calculation is performed, and the result is serialized and written back to the message queue topic of the system bus. The current execution is recorded in a log for offline effect analysis.

S3: Bus queue. When the system bus starts, it creates a production worker thread and a consumption worker thread. The production worker thread periodically tracks changes in the incremental data of the underlying storage and, when data arrives, extracts the data to be consumed from the storage unit and puts it into the consumption topic. The consumption worker thread is suspended and waits at the entrance; when a new message arrives, the write-back operation is triggered automatically so that the data at the original address in the underlying storage is updated.

S4: Log analysis. The serialization unit analyzes the log data in whole batches on an hourly basis and generates a report, with statistics including the data scale, the proportion of sensitive information, the intensity of sensitive information, the propagation frequency and popularity, and the identification accuracy.

S5: Support system. The supporting layer unit provides support for the overall framework of the rapid identification system, mainly for the database components and the learning engine. The database needs to be cleaned and optimized regularly, and the learning engine needs to update its recognition algorithms and recognition library in a timely manner.

S6: External interface. The system externally provides an interface for importing sensitive identification data and a request interface for identifying sensitive data.
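The hourly batch log analysis in the workflow can be sketched as a simple aggregation. The log format is not given in the source, so the sketch assumes each entry is an ISO 8601 timestamp paired with a flag saying whether the record was identified as sensitive; `hourly_report` and the report keys are hypothetical names, and only two of the listed statistics (data scale and sensitive proportion) are computed.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple


def hourly_report(entries: Iterable[Tuple[str, bool]]) -> Dict[str, Dict[str, float]]:
    """Group log entries by hour and compute record count and sensitive ratio."""
    buckets = defaultdict(lambda: {"total": 0, "sensitive": 0})
    for timestamp, is_sensitive in entries:
        # Truncate each timestamp to the start of its hour.
        hour = datetime.fromisoformat(timestamp).strftime("%Y-%m-%d %H:00")
        buckets[hour]["total"] += 1
        buckets[hour]["sensitive"] += int(is_sensitive)
    return {
        hour: {"total": c["total"], "sensitive_ratio": c["sensitive"] / c["total"]}
        for hour, c in sorted(buckets.items())
    }
```

Further statistics such as identification accuracy would need labeled ground truth in the log, which this sketch does not assume.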

The technical effects and advantages of the invention are as follows: compared with the prior art, the rapid identification system for sensitive data in massive unstructured text uses a learning engine to autonomously select a suitable algorithm from common classification algorithms for the rapid classification of unstructured data, which improves identification efficiency and enables unstructured data to be identified efficiently; the corresponding query method can also be selected automatically according to the sensitive type for scanning, which improves scanning efficiency.

Drawings

FIG. 1 is a block diagram of a fast recognition system for unstructured massive text sensitive data according to the present invention;

FIG. 2 is a flowchart of the fast recognition system for unstructured massive text sensitive data according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

This example provides a rapid identification system for sensitive data in massive unstructured text, comprising a modeling unit, an identification layer unit, a storage unit, a supporting layer unit and a serialization unit. The modeling unit comprises an information acquisition module and a modeling calculation module and is electrically connected with the identification layer unit. The storage unit provides persistent storage for the metadata of the modeling unit, is provided with a read-write interface, and is electrically connected with the modeling unit. The supporting layer unit comprises a business monitoring module, a human-computer interaction module, a service hosting module and a log tracking module; it is used for optimizing the algorithm of the serialization unit and for providing the strategy and basis for the collection focus of the information acquisition module.

The information acquisition module comprises a manual acquisition module and a machine acquisition module. The manual acquisition module is used for manually sorting sample data into the storage unit and labeling the samples with their sensitive types and grades. The manual acquisition module is provided with an interface connected with the identification layer unit and supports the manual import of keywords in batches. The manual acquisition module collects no fewer than 100 pieces of information for each sample, and the samples are stored in the storage unit.

The modeling calculation module provides manual mining calculation and machine mining calculation for the manual acquisition module and the machine acquisition module respectively. The machine mining calculation adopts algorithms open-sourced by leading technology enterprises in the industry in the fields of neural networks and artificial intelligence. The manual mining calculation corrects the business relevance of the machine mining calculation and applies smooth transition processing to the sensitive classification and rating scoring system; it introduces extended algorithms such as similarity calculation and Hamming distance, and adds natural language processing of similarity and lexical association.

The identification layer unit takes the output of the modeling unit as input to perform the initial loading of the models, dynamically adds and removes model items according to business needs, and supports hot-plug operation. The hit scores returned by the identification layer unit for each sensitive model are combined by a summarizing algorithm: the weight of each classification is multiplied by the logarithm of the accumulated matching degree of that classification, and the result, a floating-point number between zero and one, serves as a correction value in the final sensitivity evaluation.

The identification layer unit first groups the massive unstructured text data by using a classification algorithm or a clustering algorithm, then detects the character set and language, converts the text into the character set used for internal storage where necessary, segments the metadata into words by using a word segmentation system, and extracts the keywords of the current text after deleting stop words.

The storage unit stores highly extensible web page content by adopting a semi-structured distributed storage solution suited to the characteristics of network data, and the message queue in the storage unit is first-in first-out and can be freely subscribed to.

The read-write interface is used for periodically reading incremental data and transmitting it to the storage unit to form a message queue, which is pushed to each service node of the identification layer unit. The service nodes are provisioned according to device load, consume the queue data, and write the results back to the message queue immediately after processing is completed.

The serialization unit comprises a sensitive information module, which is used for encapsulating sensitive words and sensitive fields. Serialization is performed at the production end and deserialization at the consumption end, and the information to be serialized comprises the version number, information type, operation type, encryption identifier and key, data length, data information, and identification result.

Example 2

The specific work flow of the rapid identification system for the unstructured massive text sensitive data is as follows:

In summary: compared with the prior art, the rapid identification system for sensitive data in massive unstructured text uses a learning engine to autonomously select a suitable algorithm from common classification algorithms for the rapid classification of unstructured data, which improves identification efficiency and enables unstructured data to be identified efficiently; the corresponding query method can also be selected automatically according to the sensitive type for scanning, which improves scanning efficiency.

Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments or portions thereof without departing from the spirit and scope of the invention.

Claims

1. A rapid identification system for sensitive data in massive unstructured text, comprising a modeling unit, an identification layer unit, a storage unit, a supporting layer unit and a serialization unit, characterized in that: the modeling unit comprises an information acquisition module and a modeling calculation module, the modeling unit is electrically connected with the identification layer unit, the storage unit is used for providing persistent storage for metadata of the modeling unit, a read-write interface is arranged on the storage unit, the storage unit is electrically connected with the modeling unit, the supporting layer unit comprises a business monitoring module, a human-computer interaction module, a service hosting module and a log tracking module, and the supporting layer unit is used for optimizing the algorithm of the serialization unit and for providing the strategy and basis for the collection focus of the information acquisition module.

2. The rapid identification system for sensitive data in massive unstructured text according to claim 1, characterized in that: the information acquisition module comprises a manual acquisition module and a machine acquisition module, the manual acquisition module is used for manually sorting sample data into the storage unit and labeling the samples with their sensitive types and grades, an interface connected with the identification layer unit is arranged on the manual acquisition module, the manual acquisition module supports the manual import of keywords in batches, the manual acquisition module collects no fewer than 100 pieces of information for each sample, and the samples are stored in the storage unit.

3. The rapid identification system for sensitive data in massive unstructured text according to claim 2, characterized in that: the modeling calculation module provides manual mining calculation and machine mining calculation for the manual acquisition module and the machine acquisition module respectively, the machine mining calculation adopts algorithms open-sourced by leading technology enterprises in the industry in the fields of neural networks and artificial intelligence, the manual mining calculation is used for correcting the business relevance of the machine mining calculation and for applying smooth transition processing to the sensitive classification and rating scoring system, the manual mining calculation introduces extended algorithms such as similarity calculation and Hamming distance, and the manual mining calculation adds natural language processing of similarity and lexical association.

4. The rapid identification system for sensitive data in massive unstructured text according to claim 1, characterized in that: the identification layer unit takes the output of the modeling unit as input to perform the initial loading of the models, dynamically adds and removes model items according to business needs and supports hot-plug operation, and the hit scores returned by the identification layer unit for each sensitive model are combined by a summarizing algorithm, namely, the weight of each classification is multiplied by the logarithm of the accumulated matching degree of that classification, and the result, a floating-point number between zero and one, serves as a correction value in the final sensitivity evaluation.

5. The rapid identification system for sensitive data in massive unstructured text according to claim 4, characterized in that: the identification layer unit first groups the massive unstructured text data by using a classification algorithm or a clustering algorithm, then detects the character set and language, converts the text into the character set used for internal storage where necessary, segments the metadata into words by using a word segmentation system, and extracts the keywords of the current text after deleting stop words.

6. The rapid identification system for sensitive data in massive unstructured text according to claim 1, characterized in that: the storage unit stores highly extensible web page content by adopting a semi-structured distributed storage solution suited to the characteristics of network data, and the message queue in the storage unit is first-in first-out and can be freely subscribed to.

7. The rapid identification system for sensitive data in massive unstructured text according to claim 6, characterized in that: the read-write interface is used for periodically reading incremental data and transmitting it to the storage unit to form a message queue, the message queue is pushed to each service node of the identification layer unit, and the service nodes are provisioned according to device load, consume the queue data, and write the results back to the message queue immediately after processing is completed.

8. The rapid identification system for sensitive data in massive unstructured text according to claim 1, characterized in that: the serialization unit comprises a sensitive information module used for encapsulating sensitive words and sensitive fields, serialization is performed at the production end and deserialization at the consumption end, and the information to be serialized by the serialization unit comprises a version number, an information type, an operation type, an encryption identifier and key, a data length, data information and an identification result.

9. The system for rapidly recognizing the unstructured massive text sensitive data according to claim 1, is characterized in that: the specific work flow of the rapid identification system for the unstructured massive text sensitive data is as follows: