CN111522951A

CN111522951A - Sensitive data identification and classification technical method based on image identification

Info

Publication number: CN111522951A
Application number: CN202010338824.5A
Authority: CN
Inventors: 章明珠; 刘超
Original assignee: Chengdu Siwei Century Technology Co ltd
Current assignee: Chengdu Siwei Century Technology Co ltd
Priority date: 2020-04-26
Filing date: 2020-04-26
Publication date: 2020-08-11

Abstract

The invention discloses a sensitive data identification and classification technical method based on image identification, comprising a method for constructing a sensitive policy model and a method for labeling and classifying sensitive information in images. The method extracts the text in an image directly by OCR (optical character recognition) and classifies the sensitive information of the image according to a preset sensitive information policy, so that image information can be recognized and classified accurately. The sensitive information policy can be modified directly at a later stage without any image classification training, which gives the method higher extensibility and a finer-grained sensitivity classification and lets it adapt to the differing business requirements of different business systems. By extracting image text and applying sensitive policies, sensitive information is identified quickly and accurately, sensitive features can be extended, user-defined sensitive information in images can be matched, and the leakage of private data is reduced.

Description

Sensitive data identification and classification technical method based on image identification

Technical Field

The invention belongs to the fields of data security and machine learning, and particularly relates to a sensitive data identification and classification technical method based on image identification.

Background

For identifying sensitive content in images, deep-learning-based image classification methods are currently available. At their core is the task of assigning to an image a label from a predefined set of classes: the method identifies information such as objects, scenes, and behaviors in the image and returns the corresponding label information.

In such methods, each object is recognized by a machine learning model trained with TensorFlow (a symbolic mathematics system based on dataflow programming). Training on image data with TensorFlow is divided into three steps, namely labeling, training, and classification, of which the labeling step is very time-consuming.

When images are classified with a machine learning model built in TensorFlow, a large image training set is used to train the model, a validation set is used to check whether the model is overfitted, and a test set is used to measure the accuracy of the model. The prior art therefore has the following disadvantages: 1. a large amount of sample data is required for learning, and many sample pictures must be provided for each picture type; 2. the computational load is large, the demands on computer hardware are high, and training the model takes a long time; 3. revising the classification samples requires retraining; 4. different learning rates may lead to different results: if the learning rate is too high, the accuracy keeps jumping up and down during training, and if it is too low, the expected accuracy may not be reached before training ends; 5. images that look similar but differ in the type of sensitive text information they contain cannot be finely distinguished.

Image classification with a TensorFlow-trained machine learning model therefore cannot meet the need for accurate classification and extraction of the text information in images, and a technical method for identifying and classifying sensitive data based on image identification is needed.

Disclosure of Invention

The invention aims to overcome the deficiencies of the prior art by providing a technical method for identifying and classifying sensitive data based on image identification.

In order to achieve this purpose, the invention provides the following technical solution: a sensitive data identification and classification technical method based on image identification, comprising a method for constructing a sensitive policy model and a method for labeling and classifying image sensitive information, wherein the method for constructing the sensitive policy model comprises the following steps:

S1, acquiring sensitive information features: sensitive information features and sensitive elements are obtained from sensitive samples, wherein a sensitive feature is a set of minimal sensitive metadata, is used only for configuring sensitive policies, and cannot be used directly for sensitive identification;

S2, constructing sensitive policy rules: different sensitive policies are composed according to the actual scenario, industry specifications, and scope of use, and sensitivity classification and grading are set on the basis of the sensitive policies, wherein a sensitive policy combines sensitive elements and limits the identification range;
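As a minimal sketch of what steps S1 and S2 could produce, the following Python fragment models a sensitive policy as a combination of sensitive elements with a classification, a grade, and a scope of use. The class name `SensitivePolicy`, its fields, the example policies, and the numeric grades are all hypothetical illustrations, not structures defined by the invention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SensitivePolicy:
    name: str
    elements: frozenset  # sensitive elements that must all be hit
    category: str        # sensitivity classification (hypothetical labels)
    level: int           # sensitivity grade; higher means more sensitive
    scope: str           # business system the policy is limited to


def applicable_policies(policies, scope):
    """Keep only the policies configured for the given scope of use."""
    return [p for p in policies if p.scope == scope]


def classify(hit_elements, policies):
    """Return the policies whose elements were all hit, most sensitive first."""
    hit = set(hit_elements)
    matched = [p for p in policies if p.elements <= hit]
    return sorted(matched, key=lambda p: p.level, reverse=True)


# Hypothetical policies for two business systems.
POLICIES = [
    SensitivePolicy("contact", frozenset({"name", "mobile_phone"}),
                    "personal contact", 2, "crm"),
    SensitivePolicy("identity", frozenset({"name", "id_number"}),
                    "personal identity", 3, "crm"),
    SensitivePolicy("shipping", frozenset({"name", "address"}),
                    "logistics", 1, "warehouse"),
]

matched = classify({"name", "mobile_phone", "id_number"},
                   applicable_policies(POLICIES, "crm"))
print([p.name for p in matched])  # ['identity', 'contact']
```

Because a policy is plain data, adding or revising a classification in this sketch means editing the policy list rather than retraining a model, which is the property the method relies on.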

the method for labeling and classifying image sensitive information comprises the following steps:

A1, image text region segmentation: the text regions of an image are segmented by the PSENet algorithm, whose backbone is set to a ResNet structure. For each text instance in the image, multiple predictions are made, corresponding to a plurality of kernels of different scales. The progressive scale expansion (PSE) algorithm gradually expands the minimal-scale kernel until it matches the complete shape of the text instance. Because there are large geometric margins between the minimal-scale kernels, the PSE algorithm can distinguish adjacent text instances and can detect text instances of arbitrary shape;

A2, text recognition: the image-based sequence recognition problem is solved with a convolutional recurrent neural network (CRNN) structure. Sequence features are extracted from the input image to obtain a feature map (CNN); the feature sequence is predicted by a deep bidirectional recurrent neural network (BLSTM), which learns each feature vector in the sequence and outputs a predicted label distribution as an estimate of the true labels (RNN); and, using the CTC loss, the series of label distributions obtained from the recurrent layer (RNN) is converted and the final label sequence is output by selecting the label sequence with the highest probability (CTC);

A3, sensitive engine identification: the image to be identified is added to a queue; a background task obtains the queue information, recognizes the image, obtains the text information in the image, and saves it as a sample file; the available sensitive policies are obtained according to the system configuration items; the hit sensitive feature information is summarized; after the feature information has been identified, policy rules are matched according to the configured policy information; and the hit policy rules are summarized and recorded.

Preferably, in step S1, the sensitive features are specifically sensitive information such as mobile phone numbers, names, addresses, and identification numbers.
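The feature types named here could, for example, be located in OCR output with regular expressions. The sketch below is only an illustration: the two patterns (an 11-digit mainland China mobile number and an 18-character resident identity number) are assumed formats, and `FEATURE_PATTERNS` and `find_features` are hypothetical names. Names and addresses generally need dictionaries or labeled fields rather than a single pattern and are left out. In line with step S1, the hits are raw material for policy matching and are not themselves a sensitivity verdict.

```python
import re

# Assumed formats, for illustration only.
FEATURE_PATTERNS = {
    "mobile_phone": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    "id_number": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
}


def find_features(text):
    """Return (feature, start offset, matched text) tuples ordered by offset."""
    hits = []
    for feature, pattern in FEATURE_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((feature, match.start(), match.group()))
    return sorted(hits, key=lambda hit: hit[1])


sample = "Contact 13812345678, ID 11010519900307123X"
for feature, offset, value in find_features(sample):
    print(feature, offset, value)
```

The digit lookarounds keep the phone pattern from firing inside the longer identity number, so each stretch of digits is reported as one feature only.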

Preferably, in step S2, the sensitive data identification criterion of a sensitive policy is a combination of sensitive elements, and the ordering rule of the combination is set to one of three types: unordered, ordered, and interval ordering.
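The three ordering rules can be sketched as predicates over the positions at which sensitive elements were found in the recognized text. The text does not define "interval ordering" precisely; the sketch assumes it means that all elements of the combination occur within a window of a given number of characters. The function names and the `(element, position)` hit format are hypothetical.

```python
from itertools import product


def positions(hits, element):
    """Character offsets at which the given element was hit."""
    return [pos for name, pos in hits if name == element]


def match_unordered(elements, hits):
    """Every element occurs somewhere, in any order."""
    return all(positions(hits, e) for e in elements)


def match_ordered(elements, hits):
    """Elements occur in the listed order (earliest-first greedy scan)."""
    cursor = -1
    for element in elements:
        later = [p for p in positions(hits, element) if p > cursor]
        if not later:
            return False
        cursor = min(later)
    return True


def match_interval(elements, hits, window):
    """Assumed meaning: one occurrence of each element fits in `window` chars."""
    choices = [positions(hits, e) for e in elements]
    if not all(choices):
        return False
    return any(max(combo) - min(combo) <= window for combo in product(*choices))


hits = [("name", 0), ("id_number", 12), ("mobile_phone", 40)]
print(match_unordered(["mobile_phone", "name"], hits))              # True
print(match_ordered(["name", "mobile_phone", "id_number"], hits))   # False
print(match_interval(["name", "id_number"], hits, window=20))       # True
```

Under these assumptions the unordered rule is the loosest and the interval rule the strictest, so a policy can trade recall for precision by choosing its rule.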

Preferably, in step A1, the network structure of the PSENet algorithm is set as an FPN-like pyramid network structure.

Preferably, in step A1, the plurality of kernels of different scales have shapes similar to that of the original complete text instance and share the same center point as the original complete text instance, differing only in scale.
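The expansion from the minimal-scale kernel to the full text instance can be illustrated with a small breadth-first search over binary masks. This is a simplified sketch of progressive scale expansion on plain Python lists, assuming the kernel masks have already been predicted by the network and are ordered from smallest to largest; the function names are hypothetical, and conflicts between instances are resolved first come, first served.

```python
from collections import deque

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))  # 4-connectivity


def label_components(mask):
    """Give each connected region of the smallest kernel its own label."""
    height, width = len(mask), len(mask[0])
    labels = [[0] * width for _ in range(height)]
    next_label = 0
    for r in range(height):
        for c in range(width):
            if mask[r][c] and not labels[r][c]:
                next_label += 1
                labels[r][c] = next_label
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in NEIGHBORS:
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and mask[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
    return labels


def expand(labels, kernel):
    """Grow existing labels into unlabeled pixels of the next larger kernel."""
    height, width = len(kernel), len(kernel[0])
    queue = deque((r, c) for r in range(height) for c in range(width)
                  if labels[r][c])
    while queue:
        y, x = queue.popleft()
        for dy, dx in NEIGHBORS:
            ny, nx = y + dy, x + dx
            if (0 <= ny < height and 0 <= nx < width
                    and kernel[ny][nx] and not labels[ny][nx]):
                labels[ny][nx] = labels[y][x]
                queue.append((ny, nx))
    return labels


def progressive_scale_expansion(kernels):
    """kernels: binary masks ordered from the smallest scale to the largest."""
    labels = label_components(kernels[0])
    for kernel in kernels[1:]:
        labels = expand(labels, kernel)
    return labels


# Two text instances that touch at full scale but are separate at small scale.
kernels = [
    [[0, 1, 0, 0, 0, 1, 0]],
    [[1, 1, 1, 1, 1, 1, 1]],
]
labels = progressive_scale_expansion(kernels)
print(labels)  # [[1, 1, 1, 1, 2, 2, 2]]
```

In the example the full-scale mask is one connected strip, yet the two instances keep distinct labels because each grows outward from its own well-separated minimal kernel.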

Preferably, in step A2, the CRNN network structure comprises three parts: CNN (convolutional layers), RNN (recurrent layers), and CTC (transcription layer).
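The transcription layer turns the per-frame label distributions from the recurrent layers into a final label sequence. A common approximation of "the label sequence with the highest probability" is greedy best-path decoding: take the most probable label in each frame, merge consecutive repeats, and drop the blank symbol. The sketch below assumes that index 0 is the CTC blank and uses a made-up four-symbol alphabet and made-up probabilities; it is not the network or the decoder of the invention.

```python
def ctc_greedy_decode(probs, alphabet, blank=0):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    best_path = [max(range(len(row)), key=row.__getitem__) for row in probs]
    decoded = []
    previous = blank
    for index in best_path:
        if index != blank and index != previous:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


def frame(peak, size=4):
    """Hypothetical per-frame distribution that peaks at one label index."""
    row = [0.1] * size
    row[peak] = 0.7
    return row


alphabet = ["-", "1", "3", "8"]  # index 0 stands for the CTC blank
probs = [frame(i) for i in (1, 1, 0, 2, 0, 2, 3)]
print(ctc_greedy_decode(probs, alphabet))  # 1338
```

The blank between the two frames labeled "3" is what lets the decoder emit a genuinely doubled character instead of merging it into one.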

Preferably, in step A3, the sensitive engine recognizes the images in the queue with multiple threads.
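A queue drained by several worker threads can be sketched with the Python standard library as follows. The function `recognize_text` is a hypothetical stand-in for the OCR stage and simply looks up canned text; the image identifiers, `FAKE_OCR`, and `run_engine` are likewise assumptions made for illustration.

```python
import queue
import threading

# Hypothetical OCR results keyed by image identifier.
FAKE_OCR = {
    "img-1": "Name Zhang San, phone 13812345678",
    "img-2": "Invoice total 42",
    "img-3": "",
}


def recognize_text(image_id):
    """Stand-in for the real text recognition step (an assumption)."""
    return FAKE_OCR.get(image_id, "")


def worker(tasks, results, lock):
    """Take images from the queue until a None sentinel arrives."""
    while True:
        image_id = tasks.get()
        if image_id is None:
            tasks.task_done()
            break
        text = recognize_text(image_id)
        with lock:  # protect the shared results dictionary
            results[image_id] = text
        tasks.task_done()


def run_engine(image_ids, num_threads=3):
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()
    threads = [threading.Thread(target=worker, args=(tasks, results, lock))
               for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for image_id in image_ids:
        tasks.put(image_id)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    tasks.join()
    for thread in threads:
        thread.join()
    return results


results = run_engine(["img-1", "img-2", "img-3"])
print(sorted(results))  # ['img-1', 'img-2', 'img-3']
```

In a fuller engine, each worker would pass the recognized text on to feature extraction and policy matching before recording the hit rules.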

The technical effects and advantages of the invention are as follows: the invention provides a sensitive data identification and classification technical method based on image identification, which extracts the text in an image directly by OCR (optical character recognition) and classifies the sensitive information of the image according to a preset sensitive information policy; image information can be recognized and classified accurately, the sensitive information policy can be modified directly at a later stage without any further image classification training, and the method has higher extensibility and a finer-grained sensitivity classification;

the sensitive policy rules and the sensitivity classification and grading methods are established according to the sequence of sensitive feature information, and the sensitive rules are converted into data by mathematical analysis methods such as clustering, sampling, and probability in order to realize sensitive information classification and grading; the algorithm by which the engine identifies sensitive information according to the sensitive policies realizes sensitive information matching by mathematical analysis methods such as probability and statistics; the method can optimize and update itself automatically, improving and enriching the sensitive features by analyzing scenarios, and can adapt to the differing business requirements of different business systems; by extracting image text and applying sensitive policies, sensitive information is identified quickly and accurately, sensitive features are extended, user-defined sensitive information in images is matched, and the leakage of private data is reduced.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

A sensitive data identification and classification technical method based on image identification comprises a method for constructing a sensitive policy model and a method for labeling and classifying image sensitive information, wherein the method for constructing the sensitive policy model comprises steps S1 and S2 and the method for labeling and classifying image sensitive information comprises steps A1 to A3, as set out in the Disclosure of Invention.

Further, in step S1, the sensitive features are specifically sensitive information such as mobile phone numbers, names, addresses, and identification numbers.

Further, in step S2, the sensitive data identification criterion of a sensitive policy is a combination of sensitive elements, and the ordering rule of the combination is set to one of three types: unordered, ordered, and interval ordering.

Further, in step A1, the network structure of the PSENet algorithm is set as an FPN-like pyramid network structure.

Further, in step A1, the plurality of kernels of different scales have shapes similar to that of the original complete text instance and share the same center point as the original complete text instance, differing only in scale.

Further, in step A2, the CRNN network structure comprises three parts: CNN (convolutional layers), RNN (recurrent layers), and CTC (transcription layer).

Further, in step A3, the sensitive engine recognizes the images in the queue with multiple threads.

To sum up: the invention provides a sensitive data identification and classification technical method based on image identification, which extracts the text in an image directly by OCR (optical character recognition) and classifies the sensitive information of the image according to a preset sensitive information policy; image information can be recognized and classified accurately, the sensitive information policy can be modified directly at a later stage without any further image classification training, and the method has higher extensibility and a finer-grained sensitivity classification;

the method overcomes the shortcomings of conventional image sensitive information identification, namely low identification accuracy, few preset classes, low efficiency when sensitive classes are added later, and the inability to accurately identify similar images and extract specific sensitive information from them. Sensitive policies are enriched through machine learning and manual intervention; classification and extraction of image sensitive information are realized by extracting text and matching it against policies; and the accuracy and extensibility of image sensitive information identification are improved. Developers can easily extend the sensitive features to match user-defined sensitive information in images without repeatedly training on and identifying samples, and the leakage of private data and the like is reduced;

the sensitive policy rules and the sensitivity classification and grading methods are established according to the sequence of sensitive feature information, and the sensitive rules are converted into data by mathematical analysis methods such as clustering, sampling, and probability in order to realize sensitive information classification and grading; the algorithm by which the engine identifies sensitive information according to the sensitive policies realizes sensitive information matching by mathematical analysis methods such as probability and statistics; the method can optimize and update itself automatically, improving and enriching the sensitive features by analyzing scenarios, and can adapt to the differing business requirements of different business systems; and image text is extracted and sensitive information is identified quickly and accurately by applying sensitive policies.

Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments or portions thereof without departing from the spirit and scope of the invention.

Claims

1. A sensitive data identification and classification technical method based on image identification comprises a method for constructing a sensitive strategy model and a method for classifying image sensitive information labels, and is characterized in that:

the method for constructing the sensitive strategy model comprises the following steps:

2. The technical method for sensitive data identification and classification based on image identification as claimed in claim 1, wherein: in step S1, the sensitive features are specifically sensitive information such as mobile phone numbers, names, addresses, and identification numbers.

3. The technical method for sensitive data identification and classification based on image identification as claimed in claim 1, wherein: in step S2, the sensitive data identification criterion of a sensitive policy is a combination of sensitive elements, and the ordering rule of the combination is set to one of three types: unordered, ordered, and interval ordering.

4. The technical method for sensitive data identification and classification based on image identification as claimed in claim 1, wherein: in step A1, the network structure of the PSENet algorithm is set as an FPN-like pyramid network structure.

5. The technical method for sensitive data identification and classification based on image identification as claimed in claim 1, wherein: in step A1, the plurality of kernels of different scales have shapes similar to that of the original complete text instance and share the same center point as the original complete text instance, differing only in scale.

6. The technical method for sensitive data identification and classification based on image identification as claimed in claim 1, wherein: in step A2, the CRNN network structure comprises three parts: CNN (convolutional layers), RNN (recurrent layers), and CTC (transcription layer).

7. The technical method for sensitive data identification and classification based on image identification as claimed in claim 1, wherein: in step A3, the sensitive engine recognizes the images in the queue with multiple threads.