CN114282258A - Screen capture data desensitization method and device, computer equipment and storage medium - Google Patents


Info

Publication number: CN114282258A
Application number: CN202111262538.6A
Authority: CN (China)
Prior art keywords: data, sensitive, layer, sensitive data, training
Legal status: Pending (assumed status; not a legal conclusion)
Other languages: Chinese (zh)
Inventor: 司新鲁
Current assignee: Ping An Bank Co Ltd
Original assignee: Ping An Bank Co Ltd
Application filed by Ping An Bank Co Ltd
Priority to CN202111262538.6A

Abstract

The embodiments of this application belong to the field of artificial intelligence and relate to a screen capture data desensitization method. The method comprises: obtaining a sensitive data sample set; inputting the preprocessed sensitive data sample set into a direction classification correction model for direction correction to obtain a corrected data set; obtaining a training data set from the corrected data set; inputting the training data set into a pre-constructed initial sensitive data recognition model for training and outputting a recognition result; determining a loss function from the recognition result, iteratively updating the initial sensitive data recognition model based on the loss function, and outputting the trained sensitive data recognition model; obtaining current screen capture data, inputting it into the sensitive data recognition model, and recognizing the sensitive data; and desensitizing the sensitive data. The application also provides a screen capture data desensitization device, computer equipment, and a storage medium. In addition, the application relates to blockchain technology, in which the sensitive data may be stored. The method and device can improve the security of sensitive data.

Description

Screen capture data desensitization method and device, computer equipment and storage medium
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a screen capture data desensitization method and device, computer equipment and a storage medium.
Background
Information technology provides powerful support for the development of modern society. The rapid development of big data has brought a new starting point for modern science and technology, and massive data is widely used in industries such as medical care, social services, insurance, taxation, and banking, as well as on data platforms such as social networks. As various data platforms and information systems collect data, more and more sensitive data involving personal privacy is gathered; while this brings convenience, it easily leads to privacy disclosure, and personal privacy faces huge risks. A large amount of sensitive data that has not undergone desensitization processing is stored and used on the Internet, where it is easily stolen by lawbreakers who exploit it for profit. The leakage of sensitive data not only seriously affects a platform's core confidentiality, competitiveness within its industry, and market reputation, but also harms user privacy and personal information security to varying degrees. How to take effective measures to protect user privacy has long been an urgent problem to be solved.
Disclosure of Invention
The embodiments of this application aim to provide a screen capture data desensitization method and device, computer equipment, and a storage medium, so as to solve the technical problem in the related art that sensitive data is insufficiently secured and easily leaked.
In order to solve the above technical problem, an embodiment of the present application provides a method for desensitizing screen capture data, which adopts the following technical scheme:
acquiring a sensitive data sample set, and preprocessing the sensitive data sample set;
inputting the preprocessed sensitive data sample set into a direction classification correction model for direction correction to obtain a correction data set;
acquiring a training data set according to the correction data set, inputting the training data set into a pre-constructed initial sensitive data recognition model for training, and outputting a recognition result;
determining a loss function according to the recognition result, performing iterative updating on the initial sensitive data recognition model based on the loss function, and outputting a trained sensitive data recognition model;
acquiring current screen capture data, inputting the current screen capture data into the sensitive data identification model, and identifying sensitive data;
and desensitizing the sensitive data to obtain desensitized data.
Further, the initial sensitive data recognition model comprises a text detection layer, a convolutional network layer, a recurrent network layer, and a transcription layer, and the step of inputting the training data set into the pre-constructed initial sensitive data recognition model for training and outputting the recognition result comprises:
inputting the training data set into the text detection layer to perform text region detection, and outputting text region data;
performing feature extraction on the text region data through the convolutional network layer to obtain sensitive features;
inputting the sensitive features into the recurrent network layer for classification prediction, and outputting a classification prediction result;
and carrying out alignment operation on the classification prediction result through the transcription layer to obtain an identification result.
Further, the step of inputting the training data set into the text detection layer for text region detection and outputting text region data includes:
dividing correction data in the training data set into grid units of a first preset number through the text detection layer, and predicting prediction frames of a second preset number and confidence degrees corresponding to the prediction frames for each grid unit;
screening out the prediction frame with the largest intersection over union (IoU) with the real text box as the predicted text box;
and outputting the correction data marked with the predicted text box as text region data.
Further, the recurrent network layer includes an LSTM layer, a fully connected layer, and a softmax layer, and the step of inputting the sensitive features into the recurrent network layer for classification prediction and outputting a classification prediction result includes:
performing feature extraction on the sensitive features through a forward layer and a backward layer of the LSTM layer to respectively obtain forward hidden layer features and backward hidden layer features;
inputting the forward hidden layer features and the backward hidden layer features into the fully connected layer, splicing them position by position to obtain hidden layer states, and obtaining sensitive sequence features from the hidden layer states;
and predicting the sensitive sequence features through the softmax layer to obtain a classification prediction result.
Further, the step of iteratively updating the initial sensitive data recognition model based on the loss function and outputting a trained sensitive data recognition model includes:
adjusting model parameters of the initial sensitive data identification model based on the loss function;
and when the iteration ending condition is met, generating a sensitive data identification model according to the model parameters.
Further, the step of determining a loss function according to the recognition result includes:
the loss function is calculated as follows:
L=-∑(x,y)∈Slnp(y|x)
wherein x is a sensitive feature, y is a recognition result, S is a training data set, p (y | x) is the probability that the input is x and the output is y.
Further, the desensitizing processing is performed on the sensitive data, and the step of obtaining desensitized data includes:
extracting a picture area where the sensitive data are located;
covering a preset shielding layer in the picture area.
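A minimal sketch of the two desensitization steps above — extract the picture area where the sensitive data sits, then cover it with a shielding layer. The (x1, y1, x2, y2) box format and the solid fill value are illustrative assumptions, not the patent's specification:

```python
import numpy as np

def mask_region(image, box, fill=0):
    """Cover the picture area containing sensitive data with a preset
    shielding layer -- here a solid fill (assumed for illustration)."""
    x1, y1, x2, y2 = box
    out = image.copy()
    out[y1:y2, x1:x2] = fill   # overwrite only the sensitive region
    return out

img = np.full((4, 4), 9)               # toy image
masked = mask_region(img, (1, 1, 3, 3))
print(masked[0, 0], masked[1, 1])      # 9 0
```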
In order to solve the above technical problem, an embodiment of the present application further provides a screen capture data desensitization apparatus, which adopts the following technical solution:
the preprocessing module is used for acquiring a sensitive data sample set and preprocessing the sensitive data sample set;
the correcting module is used for inputting the preprocessed sensitive data sample set into a direction classification correcting model to carry out direction correction to obtain a corrected data set;
the training module is used for acquiring a training data set according to the correction data set, inputting the training data set into a pre-constructed initial sensitive data recognition model for training and outputting a recognition result;
the updating module is used for determining a loss function according to the recognition result, carrying out iterative updating on the initial sensitive data recognition model based on the loss function and outputting a trained sensitive data recognition model;
the acquisition module is used for acquiring current screen capture data, inputting the current screen capture data into the sensitive data identification model and identifying sensitive data;
and the desensitization module is used for desensitizing the sensitive data to obtain desensitization data.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solutions:
the computer device includes a memory having computer readable instructions stored therein and a processor that when executed implements the steps of the screen capture data desensitization method described above.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solutions:
the computer readable storage medium has stored thereon computer readable instructions which, when executed by a processor, implement the steps of the screen capture data desensitization method described above.
Compared with the prior art, the embodiment of the application mainly has the following beneficial effects:
the method comprises the steps of obtaining a sensitive data sample set, preprocessing the sensitive data sample set, inputting the preprocessed sensitive data sample set into a direction classification correction model for direction correction to obtain a correction data set, obtaining a training data set according to the correction data set, inputting the training data set into a pre-constructed initial sensitive data recognition model for training, outputting a recognition result, determining a loss function according to the recognition result, iteratively updating the initial sensitive data recognition model based on the loss function, outputting the trained sensitive data recognition model, obtaining current screen capture data, inputting the current screen capture data into the sensitive data recognition model, recognizing the sensitive data, desensitizing the sensitive data, and obtaining desensitized data; this application discerns the sensitive data on the real-time screenshot data through the sensitive data recognition model who uses the training to carry out desensitization to sensitive data and handle, can effectively promote sensitive data's security, be favorable to protecting sensitive data, and then protect customer privacy, can effectively strengthen customer's privacy security consciousness simultaneously.
Drawings
To illustrate the solution of the present application more clearly, the drawings needed to describe the embodiments of the present application are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method of screen capture data desensitization according to the present application;
FIG. 3 is a flowchart of one embodiment of step S203 in FIG. 2;
FIG. 4 is a schematic block diagram illustrating one embodiment of a screenshot data desensitization apparatus according to the present application;
FIG. 5 is a schematic block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
The application provides a screenshot data desensitization method, which relates to artificial intelligence, and can be applied to a system architecture 100 shown in fig. 1, where the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III, mpeg compression standard Audio Layer 3), MP4 players (Moving Picture Experts Group Audio Layer IV, mpeg compression standard Audio Layer 4), laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that the screenshot data desensitization method provided in the embodiment of the present application is generally executed by a server/terminal device, and accordingly, the screenshot data desensitization apparatus is generally disposed in the server/terminal device.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow diagram of one embodiment of a method of screen capture data desensitization according to the present application is shown, including the steps of:
step S201, a sensitive data sample set is obtained, and the sensitive data sample set is preprocessed.
Sensitive data in the sensitive data sample set refers to data whose leakage may cause serious harm to society or individuals. It includes personal privacy data, such as names, identification numbers, addresses, telephone numbers, bank accounts, email addresses, passwords, medical information, and educational background, and also data that enterprises or social institutions should not publish, such as an enterprise's business conditions, network structure, and IP address lists.
The sensitive data sample set can be obtained from public data sets. For Chinese data, it can be obtained from the MSRA-NER named entity data set, which contains 24 entity types such as person names, place names, organization names, ages, email addresses, and digital entities (telephone numbers and postcodes). For English data, it can be obtained from the MJSynth and SynthText synthetic data sets. Data can also be randomly crawled with Python crawlers, or collected in-house from a bank's identity cards, bank cards, financial statements, and the like.
It should be understood that the sensitive data in the sensitive data sample set is labeled data. For example, an English data set may be labeled with the "BIO" tagging scheme (B denotes entity beginning, I denotes entity inside, and O denotes outside any entity), and a Chinese data set may be labeled with the "BIOES" tagging scheme (E denotes entity end, and S denotes a single-token entity) to tag sensitive information entities.
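The BIO scheme described above can be sketched as follows. The token/span representation and the entity labels are assumptions for illustration, not the patent's data format:

```python
def bio_tags(tokens, entity_spans):
    """Assign BIO tags to tokens given (start, end, type) entity spans:
    B- marks the first token of an entity, I- the tokens inside it,
    and O everything outside any entity."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entity_spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

tokens = ["John", "Smith", "lives", "in", "Boston"]
spans = [(0, 2, "PER"), (4, 5, "LOC")]          # hypothetical annotations
print(bio_tags(tokens, spans))
# ['B-PER', 'I-PER', 'O', 'O', 'B-LOC']
```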
Because image quality varies with user shooting habits, framing constraints, random interference, and other conditions, the data needs to be preprocessed. Preprocessing mainly reduces useless information in the image data to obtain effective images and facilitate subsequent processing, and includes image enhancement, image graying, binarization, noise reduction, and the like.
Image enhancement strengthens the useful information in an image and may be a distortion process; its purpose is to improve the image's visual effect. For the application scenario of a given image, it purposefully emphasizes the overall or local characteristics of the image, turns an originally unclear image into a clear one or highlights certain features of interest, enlarges the differences between features of different objects, suppresses features of no interest, improves image quality, enriches the amount of information, and strengthens image interpretation and recognition. Image graying removes color information irrelevant to character recognition and keeps only brightness information; the main methods are the maximum value method, the average value method, and the weighted average method. Binarization selects an appropriate threshold for the grayed image, further simplifying the image content while retaining its effective information. Denoising removes noise pollution irrelevant to the effective information in the image; noise is an uncontrollable factor of the camera or the shooting process.
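Two of the preprocessing steps above — weighted-average graying and global-threshold binarization — can be sketched in a few lines of NumPy. The 0.299/0.587/0.114 weights are the common luma coefficients, and the fixed threshold of 128 is an assumption; the patent does not specify either:

```python
import numpy as np

def to_gray(rgb):
    # Weighted-average graying (one of the three methods named above),
    # using the common 0.299/0.587/0.114 luma weights.
    return rgb[..., 0] * 0.299 + rgb[..., 1] * 0.587 + rgb[..., 2] * 0.114

def binarize(gray, threshold=128):
    # Global-threshold binarization: keep structure, discard shading.
    return np.where(gray >= threshold, 255, 0)

img = np.array([[[255.0, 255.0, 255.0], [0.0, 0.0, 0.0]]])  # one white, one black pixel
print(binarize(to_gray(img)))  # white pixel -> 255, black -> 0
```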
And S202, inputting the preprocessed sensitive data sample set into a direction classification correction model for direction correction to obtain a correction data set.
The images in the sensitive data sample set may be tilted at different angles due to some factor during the shooting process, and the tilt correction is needed.
In this embodiment, a direction classification correction model is used for correction. Specifically, a deep convolutional neural network comprising convolutional layers, a fully connected layer, and a softmax layer is pre-constructed and trained; after training is completed, the final deep convolutional neural network is output as the direction classification correction model.
Taking the VGG16 network as the deep convolutional neural network as an example: the VGG16 network has 16 convolutional and fully connected layers. Its last fully connected layer is replaced with one whose outputs correspond to the possible tilt conditions (for example, tilt angles of 0°, 90°, 180°, and 270°), and a softmax layer is connected after the fully connected layer for classification. The network is trained with the collected image data, and the earlier convolutional and fully connected layers of the VGG16 network are frozen during training until training is completed.
The inclination angle of the image in the sensitive data sample set is detected through the direction classification correction model, and the image is rotated by a corresponding angle, so that the direction correction is realized, and the subsequent sensitive data identification is facilitated.
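Once the tilt class is predicted, the correction itself is a quarter-turn rotation. A minimal sketch, assuming the four-class layout from the VGG16 example above (class order is an assumption) and counterclockwise tilts:

```python
import numpy as np

# The four tilt classes assumed in the VGG16 example above.
ANGLE_CLASSES = [0, 90, 180, 270]

def correct_orientation(image, predicted_class):
    """Rotate the image back by the tilt angle predicted by the
    direction classification correction model (sketch)."""
    angle = ANGLE_CLASSES[predicted_class]
    # np.rot90 rotates counterclockwise; undo a counterclockwise tilt
    # by rotating clockwise the same number of quarter turns.
    return np.rot90(image, k=-(angle // 90))

orig = np.arange(6).reshape(2, 3)
tilted = np.rot90(orig, 1)                 # image tilted 90° counterclockwise
restored = correct_orientation(tilted, 1)  # class 1 -> 90°
print(np.array_equal(restored, orig))      # True
```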
And step S203, obtaining a training data set according to the correction data set, inputting the training data set into a pre-constructed initial sensitive data recognition model for training, and outputting a recognition result.
And dividing the correction data set into a training data set and a testing data set randomly according to a preset value, and inputting the training data set into a pre-constructed initial sensitive data recognition model for training.
In this embodiment, referring to FIG. 3, the initial sensitive data recognition model is an OCR (Optical Character Recognition) model comprising a text detection layer, a convolutional network layer, a recurrent network layer, and a transcription layer. The step of inputting the training data set into the pre-constructed initial sensitive data recognition model for training and outputting a recognition result includes:
step S301, inputting the training data set into a text detection layer for text region detection, and outputting text region data.
Specifically, the correction data in the training data set is divided by the text detection layer into a first preset number of grid cells; for each grid cell, a second preset number of prediction boxes and the confidence corresponding to each prediction box are predicted; the prediction box with the largest intersection over union (IoU) with the real text box is screened out as the predicted text box; and the correction data marked with the predicted text box is output as text region data.
The text detection layer may be a YOLO (You Only Look Once, real-time object detection) layer, or text region detection may instead be performed with the segmentation-based DB (Differentiable Binarization) method. This embodiment explains the text detection layer as a YOLO layer.
The YOLO layer divides the correction data into S × S (i.e., the first preset number of) grid cells, each of which performs detection independently; if the center point of a text object falls within a grid cell, that grid cell is responsible for predicting the text object. Each grid cell is assigned a second preset number of prediction boxes (bounding boxes), i.e., each grid cell predicts B bounding boxes. In addition to its own position, each bounding box predicts a confidence value representing both how confident the model is that the box contains a text object and how accurate the box is.
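The screening step — keep the predicted box with the largest IoU against the real text box — can be sketched as follows. The (x1, y1, x2, y2) box format is an assumption for illustration:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def best_prediction(pred_boxes, true_box):
    """Screen out the predicted box with the largest IoU against the
    real text box, as the step above describes."""
    return max(pred_boxes, key=lambda b: iou(b, true_box))

preds = [(0, 0, 4, 4), (1, 1, 5, 5), (8, 8, 9, 9)]  # hypothetical predictions
print(best_prediction(preds, (1, 1, 5, 5)))  # (1, 1, 5, 5)
```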
Taking the YOLOv3 network as an example of the YOLO layer: YOLOv3 adopts a network structure called Darknet-53, which contains 53 convolutional layers. Borrowing from residual networks (ResNet), shortcut links, i.e., residual structures, are arranged between some layers; the residual structure better controls the propagation of gradients and avoids situations unfavorable to training, such as vanishing or exploding gradients. The network also includes an average pooling layer (Avgpool), a fully connected layer, and a softmax layer.
In this embodiment, the confidence and the position and size of the prediction box are calculated by the following coordinate offset formulas:

Pr(object) × IOU(b, object) = σ(t_o)
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w × e^{t_w}
b_h = p_h × e^{t_h}

where t_x, t_y, t_w, and t_h are the targets learned by the convolutional layers in the YOLO layer, i.e., the prediction outputs of the YOLO layer: t_x and t_y are the coordinate offsets of the predicted bounding box, and t_w and t_h are the width and height scales; c_x and c_y are the center coordinates of the grid cell; p_w and p_h are the preset width and height of the prediction box; and b_x, b_y, b_w, and b_h are the predicted center coordinates, width, and height of the prediction box. Pr(object) is the probability that a text object exists in the prediction box: 1 if a text object exists, and 0 otherwise. IOU^{truth}_{pred}, the IOU (Intersection over Union) between the prediction box and the real text box, reflects how close the prediction box is to the real text box.
And step S302, performing feature extraction on the text region data through the convolutional network layer to obtain sensitive features.
The convolutional network layer extracts sensitive features from the text region data and outputs them as a sequence. It comprises convolutional layers and pooling layers. A convolutional layer is essentially a set of filters; its outputs are called feature maps, each feature map being the output of one convolution kernel convolved over the image. The pooling layer compresses the input features, reducing the amount of data to process and thereby simplifying computation; convolutional and pooling layers are arranged alternately.
And step S303, inputting the sensitive characteristics into a circulating network layer for classification prediction, and outputting a classification prediction result.
In this embodiment, the recurrent network layer includes an LSTM layer, a fully connected layer (FC), and a softmax layer. An LSTM (Long Short-Term Memory) network is a recurrent neural network specially designed to solve the long-term dependence problem of the general RNN (Recurrent Neural Network). The LSTM contains LSTM blocks, which can memorize values over indefinite lengths of time; the gates inside an LSTM block determine whether the input information is important enough to be remembered and whether the output should be emitted. To minimize training error, the LSTM is trained with gradient descent, applying a back-propagation-through-time algorithm to modify the LSTM's weights.
One disadvantage of a unidirectional LSTM is that the network can only use preceding input information and cannot obtain the following context of the current feature. Therefore, a Bi-LSTM, which can fully use both past and future context, can be chosen for feature extraction to predict the label distribution of the feature sequence.
Specifically, for the input sensitive features the Bi-LSTM obtains two independent hidden layer representations using forward-order and reverse-order recurrent networks, then combines the two (by splicing or adding) into a final hidden layer representation, which is sent to the output layer for subsequent computation. This hidden layer representation means that the sensitive feature at the current moment contains contextual information from both the previous and the following moments.
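The position-by-position splicing of the forward and backward hidden states can be illustrated with a small NumPy sketch. The LSTM recurrences themselves are elided, and the shapes are assumptions:

```python
import numpy as np

def bilstm_features(forward_h, backward_h):
    """Splice the forward and backward hidden states position by
    position, as the fully connected layer step above describes.
    forward_h, backward_h: arrays of shape (seq_len, hidden_size)."""
    return np.concatenate([forward_h, backward_h], axis=-1)

fwd = np.ones((4, 8))    # hypothetical forward hidden states
bwd = np.zeros((4, 8))   # hypothetical backward hidden states
print(bilstm_features(fwd, bwd).shape)  # (4, 16)
```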
And step S304, carrying out alignment operation on the classification prediction results through the transcription layer to obtain an identification result.
The transcription layer performs the CTC (Connectionist Temporal Classification) operation. CTC is a neural-network-based temporal classification, and the CTC operation mainly solves the alignment problem between the input sensitive features and the output classification prediction results.
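One way to picture the alignment CTC resolves is its many-to-one collapse rule: merge repeated labels, then drop the blank symbol. A sketch of that rule applied to a single best path (the `-` blank symbol is a convention assumed here):

```python
def ctc_collapse(path, blank="-"):
    """CTC alignment rule: merge consecutive repeated labels,
    then remove blanks -- the mapping the transcription layer
    uses to align per-frame predictions with the label sequence."""
    out = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != blank:
            out.append(symbol)
        prev = symbol
    return "".join(out)

print(ctc_collapse("--hh-e-ll-lo--"))  # hello
```

Note that a blank between two identical labels keeps them distinct (`"l-l"` collapses to `"ll"`, while `"ll"` collapses to `"l"`), which is how CTC represents genuinely doubled characters.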
In this embodiment, the sensitive data recognition model is obtained through training, so that the accuracy of sensitive data recognition can be improved, and meanwhile, the recognition efficiency is improved.
And S204, determining a loss function according to the recognition result, carrying out iterative updating on the initial sensitive data recognition model based on the loss function, and outputting the trained sensitive data recognition model.
In this embodiment, the loss function is the CTC loss function. The loss between the real recognition result and the recognition result output by the model is calculated with the CTC loss formula, and the model parameters are adjusted according to the loss.
The computational formula for the CTC loss function is as follows:

L = -Σ_{(x,y)∈S} ln p(y|x)

where x = (x_1, x_2, …, x_t) is the sequence corresponding to the sensitive features, with sequence length t; y = (y_1, y_2, …, y_U) is the recognition result output after the recurrent network layer (i.e., the aligned classification prediction result, the predicted label sequence), with sequence length U; and S is the training data set.
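Given the per-sample probabilities p(y|x), the loss formula above is a plain sum of negative log-probabilities. This sketch assumes the probabilities are already computed (the CTC forward algorithm that produces them is elided):

```python
import math

def ctc_loss(probs):
    """L = -sum over (x, y) in S of ln p(y|x), computed from the
    per-sample probabilities p(y|x)."""
    return -sum(math.log(p) for p in probs)

# Two hypothetical training samples, each with p(y|x) = 0.5:
print(ctc_loss([0.5, 0.5]))  # 2 * ln 2, about 1.386
```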
In this embodiment, the step of iteratively updating the initial sensitive data recognition model based on the loss function and outputting the trained sensitive data recognition model includes:
adjusting model parameters of the initial sensitive data identification model based on a loss function;
and when the iteration ending condition is met, generating a sensitive data identification model according to the model parameters.
The training data set is input into the initial sensitive data recognition model for training. After each round of training is finished, the loss function of the initial sensitive data recognition model is calculated to obtain a loss function value, the model parameters are adjusted according to the loss value, and iterative training continues. Convergence is judged simply by comparing the loss function values of two consecutive iterations: if the loss function value is still changing, training data continue to be selected and input into the model with the adjusted parameters for further iterative training; if the loss function value no longer changes significantly, the model can be considered converged. After the model converges, the final sensitive data recognition model is generated from the finally adjusted model parameters.
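The convergence criterion described here, comparing the loss values of two consecutive iterations, can be sketched as follows. The one-parameter model is a hypothetical stand-in used only so the loop is runnable; it is not the patent's recognition model:

```python
class ToyModel:
    """Hypothetical one-parameter model with quadratic loss (w - target)^2."""
    def __init__(self, target):
        self.target = target
        self.w = 0.0

    def compute_loss(self):
        return (self.w - self.target) ** 2

    def adjust(self, lr):
        # one gradient-descent step on the quadratic loss
        self.w -= lr * 2.0 * (self.w - self.target)


def train_until_converged(model, lr=0.1, tol=1e-4, max_rounds=1000):
    """Iterate: compute the loss, adjust parameters, and stop once the
    loss values of two consecutive rounds differ by less than tol."""
    prev_loss = None
    for _ in range(max_rounds):
        loss = model.compute_loss()
        if prev_loss is not None and abs(prev_loss - loss) < tol:
            break  # loss no longer changes significantly: converged
        model.adjust(lr)
        prev_loss = loss
    return model
```

In practice the stopping tolerance and learning rate would be tuned; the structure (loss, adjust, compare consecutive losses) is what matters here.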
In some optional implementations, after training is completed, a test data set is input into the sensitive data recognition model for model evaluation, which uses three evaluation indexes common in machine learning: precision P (Precision), recall R (Recall), and the F-score.
It should be noted that the F value is a harmonic mean of the accuracy and the recall ratio, the influence of the accuracy and the recall ratio is fully considered, and the F value is equivalent to a comprehensive evaluation index of the accuracy and the recall ratio, and therefore, the F value is used as a main evaluation index of the model.
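The three evaluation indexes can be computed directly from true-positive, false-positive and false-negative counts; the F value below is the harmonic mean of precision and recall when beta = 1 (function name and arguments are illustrative):

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision P, recall R, and the F-score.

    For beta = 1 the F-score is the harmonic mean of P and R, serving
    as a comprehensive evaluation index of the two.
    """
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
    return p, r, f
```

For example, with 6 true positives, 2 false positives and 4 false negatives, P = 0.75, R = 0.6 and F = 2/3, which sits between the two as a harmonic mean should.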
Step S205, acquiring current screen capture data, inputting the current screen capture data into a sensitive data identification model, and identifying the sensitive data.
After screen capture data of a user are acquired in real time, the screen capture data are transmitted to a sensitive data identification model, and the screen capture data are intelligently identified through the sensitive data identification model to identify the sensitive data. For example, when a user opens an APP (Application), the user wants to share a screenshot in the APP, and after the screenshot is captured, the screenshot is transmitted to a sensitive data recognition model for automatic recognition.
It is emphasized that to further ensure privacy and security of sensitive data, the sensitive data may also be stored in a node of a blockchain.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain (Blockchain) is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
And step S206, desensitizing the sensitive data to obtain desensitized data.
Desensitization treatment protects sensitive data. Its purpose is to safeguard the security of users' private information while preserving the use value of data in the information era, preventing the uncontrolled spread of information and keeping data highly secure, usable, open and valuable.
This application uses the trained sensitive data recognition model to recognize sensitive data in real-time screen capture data and desensitizes that data, which can effectively improve the security of sensitive data, help protect sensitive data and thereby customer privacy, and at the same time effectively strengthen customers' awareness of privacy security.
In some optional implementation manners of this embodiment, the step of inputting the sensitive features into the cyclic network layer for classification prediction, and outputting the classification prediction result includes:
performing feature extraction on the sensitive features through a forward layer and a backward layer of the LSTM layer to respectively obtain forward hidden layer features and backward hidden layer features;
inputting the forward hidden layer characteristic and the backward hidden layer characteristic into a full-connection layer, splicing according to the positions to obtain a hidden layer state, and obtaining a sensitive sequence characteristic according to the hidden layer state;
and predicting the sensitive sequence characteristics through the softmax layer to obtain a classification prediction result.
In this embodiment, the cyclic network layer includes an LSTM layer, a fully connected layer and a softmax layer. The input of the LSTM layer is the vector sequence of the sensitive features. The forward layer of the LSTM produces the forward hidden layer feature →h_t for each input vector, and the backward layer produces the backward hidden layer feature ←h_t. The hidden layer states output by the forward and backward directions are spliced position by position to obtain h_t = [→h_t; ←h_t], with h_t ∈ R^m, yielding the complete hidden layer state sequence (h1, h2, ..., hn) ∈ R^{n×m}.

Before entering the next layer, a dropout mechanism is applied to mitigate overfitting. After dropout, the hidden state vectors are mapped from m dimensions to k dimensions through the fully connected layer, where k is the number of labels, giving the sensitive sequence features P = (P1, P2, ..., Pn) ∈ R^{n×k}.
In this embodiment, after the sensitive sequence features are processed by the softmax layer, a probability distribution over the labels is obtained at each position, and the label with the highest probability is selected as the classification prediction result.
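The splice, fully connected mapping and softmax steps above can be sketched in NumPy. The LSTM outputs themselves are taken as given, and the shapes and names are illustrative assumptions rather than the patent's implementation:

```python
import numpy as np

def classify(h_fwd, h_bwd, W, b):
    """Splice forward/backward hidden features per position, map the
    m-dimensional hidden state to k label dimensions with a fully
    connected layer, then apply softmax per position.

    h_fwd, h_bwd: (n, m/2) hidden features from the two LSTM directions.
    W: (m, k) fully connected weight; b: (k,) bias.
    """
    h = np.concatenate([h_fwd, h_bwd], axis=1)    # (n, m) hidden states
    logits = h @ W + b                            # (n, k) sensitive sequence features
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    probs = e / e.sum(axis=1, keepdims=True)      # probability distribution
    return probs.argmax(axis=1), probs            # highest-probability label per position
```

Dropout is omitted here since it only applies during training.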
In some optional implementations, the desensitizing processing on the sensitive data to obtain desensitized data includes:
extracting a picture area where the sensitive data are located;
and covering the preset shielding layer in the picture area.
In this embodiment, the sensitive data is determined according to the sensitive information type. Sensitive information types include, but are not limited to, certificate, phone, address and asset types. Sensitive information consists of sensitive keywords and the sensitive data corresponding to them; sensitive keywords include, but are not limited to, name, identity card number, address, mobile phone number, and the like. The character string closest to a sensitive keyword is its corresponding sensitive data: for example, if the sensitive keyword is "identity card number" and the closest character string is "1234567890ABCDEFGH", then "1234567890ABCDEFGH" is the sensitive data.
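A minimal sketch of the keyword-to-nearest-string pairing, under the simplifying assumption that recognized text arrives as an ordered token list in which a value follows its keyword (token layout and names are assumptions for illustration):

```python
def extract_sensitive(tokens, keywords):
    """Map each sensitive keyword to the nearest following token,
    treating that token as the corresponding sensitive data."""
    found = {}
    for i, tok in enumerate(tokens):
        for kw in keywords:
            if kw in tok and i + 1 < len(tokens):
                found[kw] = tokens[i + 1]  # nearest string after the keyword
    return found
```

A real layout would compare text-box coordinates rather than token order, but the keyword-then-value pairing logic is the same.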
In the process of identifying the sensitive data by the sensitive data identification model, the coordinate parameters of the text area are recorded, and the picture area of the sensitive data can be determined according to the coordinate parameters.
Covering the preset shielding layer on the picture area desensitizes the sensitive data. The preset shielding layer is an opaque layer, such as a mosaic-style layer or a frosted-glass-style layer, and its area is greater than or equal to the picture area where the sensitive data is located. Covering the preset shielding layer on that picture area yields the desensitized sensitive data, namely the desensitization data. In this way the sensitive data in the screen capture data is shielded, so that leakage of the sensitive data can be prevented.
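One way to realise a mosaic-style opaque layer over the picture area is to replace each tile of the region with its mean value. The NumPy sketch below is an illustrative assumption, not the patent's implementation (any opaque overlay would satisfy the claim):

```python
import numpy as np

def mask_region(img, x0, y0, x1, y1, block=8):
    """Cover the picture area [y0:y1, x0:x1] with a mosaic-style mask
    by replacing each block x block tile with its mean value."""
    out = img.copy()             # leave the input screenshot untouched
    region = out[y0:y1, x0:x1]   # view into the copy
    h, w = region.shape[:2]
    for by in range(0, h, block):
        for bx in range(0, w, block):
            tile = region[by:by + block, bx:bx + block]
            region[by:by + block, bx:bx + block] = tile.mean(axis=(0, 1), keepdims=True)
    return out
```

The tile size controls how coarse the mosaic looks; the region coordinates would come from the text-box parameters recorded during recognition.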
It should be noted that, the user may be allowed to cancel the occlusion of the preset occlusion layer, and the cancellation behavior of the user is recorded at the same time.
The embodiments of the application can acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware associated with computer readable instructions, which can be stored in a computer readable storage medium, and when executed, the processes of the embodiments of the methods described above can be included. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly ordered and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and whose order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps, or with at least part of the sub-steps or stages of other steps.
With further reference to fig. 4, as an implementation of the method shown in fig. 2 described above, the present application provides an embodiment of a device for desensitizing screen capture data, which corresponds to the embodiment of the method shown in fig. 2 and which is particularly applicable in various electronic devices.
As shown in fig. 4, the screen capture data desensitization apparatus 400 according to the present embodiment includes: a pre-processing module 401, a correction module 402, a training module 403, an update module 404, an acquisition module 405, and a desensitization module 406. Wherein:
the preprocessing module 401 is configured to obtain a sensitive data sample set, and preprocess the sensitive data sample set;
the correcting module 402 is configured to input the preprocessed sensitive data sample set into a direction classification correcting model to perform direction correction, so as to obtain a corrected data set;
the training module 403 is configured to obtain a training data set according to the correction data set, input the training data set into a pre-constructed initial sensitive data recognition model for training, and output a recognition result;
the updating module 404 is configured to determine a loss function according to the recognition result, perform iterative updating on the initial sensitive data recognition model based on the loss function, and output a trained sensitive data recognition model;
the obtaining module 405 is configured to obtain current screen capture data, input the current screen capture data into the sensitive data identification model, and identify sensitive data;
the desensitization module 406 is configured to perform desensitization processing on the sensitive data to obtain desensitization data.
It is emphasized that to further ensure privacy and security of sensitive data, the sensitive data may also be stored in a node of a blockchain.
The above screen capture data desensitization apparatus uses the trained sensitive data recognition model to recognize sensitive data in real-time screen capture data and desensitizes that data, which can effectively improve the security of sensitive data, help protect sensitive data and thereby customer privacy, and at the same time effectively strengthen customers' awareness of privacy security.
In this embodiment, the training module 403 includes a text region detection sub-module, a feature extraction sub-module, a classification sub-module, and an alignment sub-module, where:
the text region detection submodule is used for inputting the training data set into the text detection layer to carry out text region detection and outputting text region data;
the feature extraction submodule is used for performing feature extraction on the text region data through the convolutional network layer to obtain sensitive features;
the classification submodule is used for inputting the sensitive characteristics into a circulating network layer for classification prediction and outputting a classification prediction result;
and the alignment submodule is used for performing alignment operation on the classification prediction result through the transcription layer to obtain an identification result.
In this embodiment, the sensitive data recognition model is obtained through training, so that the accuracy of sensitive data recognition can be improved, and meanwhile, the recognition efficiency is improved.
In some optional implementations of this embodiment, the text region detection sub-module is further configured to:
dividing correction data in the training data set into grid units of a first preset number through the text detection layer, and predicting prediction frames of a second preset number and confidence degrees corresponding to the prediction frames for each grid unit;
screening out a prediction frame with the largest intersection ratio with the real text box as a prediction text box;
and outputting the correction data marked with the predicted text box as text region data.
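The three detection steps above hinge on an intersection-over-union comparison between prediction boxes and the real text box; a minimal sketch (box format and names are assumptions for illustration):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def pick_text_box(pred_boxes, gt_box):
    """Screen out the prediction box with the largest intersection
    ratio against the real text box, as the predicted text box."""
    return max(pred_boxes, key=lambda pb: iou(pb, gt_box))
```

The confidence scores predicted per grid cell would normally filter the candidate boxes before this comparison.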
In some optional implementations of this embodiment, the classification sub-module is further configured to:
performing feature extraction on the sensitive features through a forward layer and a backward layer of the LSTM layer to respectively obtain forward hidden layer features and backward hidden layer features;
inputting the forward hidden layer characteristics and the backward hidden layer characteristics into the full-connection layer, splicing according to positions to obtain hidden layer states, and obtaining sensitive sequence characteristics according to the hidden layer states;
and predicting the sensitive sequence characteristics through a softmax layer to obtain a classification prediction result.
In this embodiment, the updating module 404 includes an adjusting submodule and a generating submodule, and the adjusting submodule is configured to adjust a model parameter of the initial sensitive data identification model based on the loss function; and the generation submodule is used for generating a sensitive data identification model according to the model parameters when the iteration ending condition is met.
In this embodiment, the update module 404 further includes a calculation submodule, configured to:
the loss function is calculated as follows:
L=-∑(x,y)∈Slnp(y|x)
where x is the sensitive feature, y is the recognition result, S is the training data set, and p(y|x) is the probability of the output being y given the input x.
In some optional implementation manners of this embodiment, the desensitization module 406 includes an extraction sub-module and a coverage sub-module, where the extraction sub-module is configured to extract an image area where the sensitive data is located; the covering submodule is used for covering a preset shielding layer in the picture area.
The embodiment can guarantee the safety of the private information of the user and ensure the use value of the data in the information era.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 5, fig. 5 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 5 comprises a memory 51, a processor 52 and a network interface 53 communicatively connected to each other via a system bus. It is noted that only a computer device 5 having components 51-53 is shown, but it should be understood that not all of the shown components need be implemented, and that more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 51 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 51 may be an internal storage unit of the computer device 5, such as a hard disk or a memory of the computer device 5. In other embodiments, the memory 51 may also be an external storage device of the computer device 5, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 5. Of course, the memory 51 may also comprise both an internal storage unit of the computer device 5 and an external storage device thereof. In this embodiment, the memory 51 is generally used for storing an operating system installed in the computer device 5 and various types of application software, such as computer readable instructions of a screen capture data desensitization method. Further, the memory 51 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 52 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 52 is typically used to control the overall operation of the computer device 5. In this embodiment, the processor 52 is configured to execute computer readable instructions stored in the memory 51 or to process data, such as executing computer readable instructions of the screen capture data desensitization method.
The network interface 53 may comprise a wireless network interface or a wired network interface, and the network interface 53 is generally used for establishing communication connections between the computer device 5 and other electronic devices.
In the embodiment, the steps of the screenshot data desensitization method in the above embodiment are realized when the processor executes the computer readable instructions stored in the memory, the sensitive data on the real-time screenshot data is identified by using the trained sensitive data identification model, and desensitization processing is performed on the sensitive data, so that the security of the sensitive data can be effectively improved, the sensitive data is protected, the privacy of a client is further protected, and meanwhile, the privacy security awareness of the client can be effectively enhanced.
The application further provides another embodiment, that is, a computer-readable storage medium is provided, where computer-readable instructions are stored, and the computer-readable instructions are executable by at least one processor, so that the at least one processor executes the steps of the screenshot desensitization method, and identifies the sensitive data on the real-time screenshot data by using the trained sensitive data identification model, and performs desensitization processing on the sensitive data, so that the security of the sensitive data can be effectively improved, the sensitive data can be protected, the privacy of the client can be further protected, and the security awareness of the privacy of the client can be effectively enhanced.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It is to be understood that the above-described embodiments are merely illustrative, not restrictive, of the application, and that the appended drawings illustrate preferred embodiments without limiting its scope. This application can be embodied in many different forms; the embodiments are provided so that the disclosure of the application will be thorough. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that the technical solutions of the foregoing embodiments may still be modified, or some of their features equivalently replaced. All equivalent structures made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.

Claims (10)

1. A method of desensitizing screenshot data, comprising the steps of:
acquiring a sensitive data sample set, and preprocessing the sensitive data sample set;
inputting the preprocessed sensitive data sample set into a direction classification correction model for direction correction to obtain a correction data set;
acquiring a training data set according to the correction data set, inputting the training data set into a pre-constructed initial sensitive data recognition model for training, and outputting a recognition result;
determining a loss function according to the recognition result, performing iterative updating on the initial sensitive data recognition model based on the loss function, and outputting a trained sensitive data recognition model;
acquiring current screen capture data, inputting the current screen capture data into the sensitive data identification model, and identifying sensitive data;
and desensitizing the sensitive data to obtain desensitized data.
2. The screenshot data desensitization method according to claim 1, wherein the initial sensitive data recognition model comprises a text detection layer, a convolutional network layer, a cyclic network layer, and a transcription layer, the step of inputting the training data set into a pre-constructed initial sensitive data recognition model for training, and the step of outputting the recognition result comprises:
inputting the training data set into the text detection layer to perform text region detection, and outputting text region data;
performing feature extraction on the text region data through the convolutional network layer to obtain sensitive features;
inputting the sensitive characteristics into a circulating network layer for classification prediction, and outputting a classification prediction result;
and carrying out alignment operation on the classification prediction result through the transcription layer to obtain an identification result.
3. The screenshot data desensitization method according to claim 2, wherein said inputting the training data set into the text detection layer for text region detection, the step of outputting text region data comprising:
dividing correction data in the training data set into grid units of a first preset number through the text detection layer, and predicting prediction frames of a second preset number and confidence degrees corresponding to the prediction frames for each grid unit;
screening out a prediction frame with the largest intersection ratio with the real text box as a prediction text box;
and outputting the correction data marked with the predicted text box as text region data.
4. The method of desensitizing screenshot data according to claim 2, wherein said cyclic network layer comprises an LSTM layer, a fully connected layer and a softmax layer, said inputting said sensitive features into the cyclic network layer for classification prediction, and said outputting classification prediction results comprises:
performing feature extraction on the sensitive features through a forward layer and a backward layer of the LSTM layer to respectively obtain forward hidden layer features and backward hidden layer features;
inputting the forward hidden layer characteristics and the backward hidden layer characteristics into the full-connection layer, splicing according to positions to obtain hidden layer states, and obtaining sensitive sequence characteristics according to the hidden layer states;
and predicting the sensitive sequence characteristics through a softmax layer to obtain a classification prediction result.
5. The method of claim 1, wherein the step of iteratively updating the initial sensitive data recognition model based on the loss function and outputting a trained sensitive data recognition model comprises:
adjusting model parameters of the initial sensitive data identification model based on the loss function;
and when the iteration ending condition is met, generating a sensitive data identification model according to the model parameters.
6. The method of desensitizing screen capture data according to claim 2, wherein said step of determining a loss function based on said identification comprises:
the loss function is calculated as follows:
L=-∑(x,y)∈Slnp(y|x)
where x is the sensitive feature, y is the recognition result, S is the training data set, and p(y|x) is the probability of the output being y given the input x.
7. The method for desensitizing screen capture data according to claim 1, wherein said desensitizing the sensitive data to obtain desensitized data comprises:
extracting a picture area where the sensitive data are located;
covering a preset shielding layer in the picture area.
8. A screen capture data desensitization apparatus, comprising:
the preprocessing module is used for acquiring a sensitive data sample set and preprocessing the sensitive data sample set;
the correcting module is used for inputting the preprocessed sensitive data sample set into a direction classification correcting model to carry out direction correction to obtain a corrected data set;
the training module is used for acquiring a training data set according to the correction data set, inputting the training data set into a pre-constructed initial sensitive data recognition model for training and outputting a recognition result;
the updating module is used for determining a loss function according to the recognition result, carrying out iterative updating on the initial sensitive data recognition model based on the loss function and outputting a trained sensitive data recognition model;
the acquisition module is used for acquiring current screen capture data, inputting the current screen capture data into the sensitive data identification model and identifying sensitive data;
and the desensitization module is used for desensitizing the sensitive data to obtain desensitization data.
9. A computer device comprising a memory having computer readable instructions stored therein and a processor which when executed implements the steps of the screen capture data desensitization method of any of claims 1 to 7.
10. A computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, implement the steps of the screen capture data desensitization method of any of claims 1 to 7.
CN202111262538.6A 2021-10-28 2021-10-28 Screen capture data desensitization method and device, computer equipment and storage medium Pending CN114282258A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111262538.6A CN114282258A (en) 2021-10-28 2021-10-28 Screen capture data desensitization method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111262538.6A CN114282258A (en) 2021-10-28 2021-10-28 Screen capture data desensitization method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114282258A true CN114282258A (en) 2022-04-05

Family

ID=80868733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111262538.6A Pending CN114282258A (en) 2021-10-28 2021-10-28 Screen capture data desensitization method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114282258A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115953651A (en) * 2023-03-13 2023-04-11 浪潮电子信息产业股份有限公司 Model training method, device, equipment and medium based on cross-domain equipment
CN115953651B (en) * 2023-03-13 2023-09-12 浪潮电子信息产业股份有限公司 Cross-domain equipment-based model training method, device, equipment and medium
CN117391076A (en) * 2023-12-11 2024-01-12 东亚银行(中国)有限公司 Acquisition method and device of identification model of sensitive data, electronic equipment and medium
CN117391076B (en) * 2023-12-11 2024-02-27 东亚银行(中国)有限公司 Acquisition method and device of identification model of sensitive data, electronic equipment and medium

Similar Documents

Publication Publication Date Title
US20220058426A1 (en) Object recognition method and apparatus, electronic device, and readable storage medium
US20230401828A1 (en) Method for training image recognition model, electronic device and storage medium
CN107545241A (en) Neural network model is trained and biopsy method, device and storage medium
WO2022126970A1 (en) Method and device for financial fraud risk identification, computer device, and storage medium
WO2022105118A1 (en) Image-based health status identification method and apparatus, device and storage medium
WO2021143267A1 (en) Image detection-based fine-grained classification model processing method, and related devices
CN112863683B (en) Medical record quality control method and device based on artificial intelligence, computer equipment and storage medium
CN111914775B (en) Living body detection method, living body detection device, electronic equipment and storage medium
JP2022177232A (en) Method for processing image, method for recognizing text, and device for recognizing text
CN114282258A (en) Screen capture data desensitization method and device, computer equipment and storage medium
CN115861462B (en) Training method and device for image generation model, electronic equipment and storage medium
CN110795714A (en) Identity authentication method and device, computer equipment and storage medium
CN114612743A (en) Deep learning model training method, target object identification method and device
CN115050064A (en) Face living body detection method, device, equipment and medium
CN114550051A (en) Vehicle loss detection method and device, computer equipment and storage medium
CN113111880A (en) Certificate image correction method and device, electronic equipment and storage medium
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
CN111062019A (en) User attack detection method and device and electronic equipment
CN116774973A (en) Data rendering method, device, computer equipment and storage medium
CN115730237A (en) Junk mail detection method and device, computer equipment and storage medium
CN115565186A (en) Method and device for training character recognition model, electronic equipment and storage medium
CN113362249A (en) Text image synthesis method and device, computer equipment and storage medium
CN112733645A (en) Handwritten signature verification method and device, computer equipment and storage medium
CN113836297A (en) Training method and device for text emotion analysis model
CN113343898B (en) Mask shielding face recognition method, device and equipment based on knowledge distillation network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination