CN114898373A - File desensitization method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN114898373A
Authority
CN
China
Prior art keywords
information
file
desensitization
picture
sensitive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210645565.XA
Other languages
Chinese (zh)
Inventor
李书涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202210645565.XA
Publication of CN114898373A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/15 Cutting or merging image elements, e.g. region growing, watershed or clustering-based techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/26 Techniques for post-processing, e.g. correcting the recognition result
    • G06V30/262 Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context

Abstract

The invention relates to the field of intelligent decision making and discloses a file desensitization method and device, electronic equipment, and a readable storage medium. The method comprises: receiving a file uploaded by a user; when the file is in a picture format, performing picture information identification on the file to obtain its picture information; classifying the file according to the picture information and preset certificate type information to obtain the certificate type of the file; classifying the picture information according to the certificate type, and determining from that classification the sensitive information of the file and the position of the sensitive information; performing data deformation on the sensitive information in the file according to its position to obtain desensitized information; and, when the file is in a text format, performing text desensitization on the file to obtain a desensitized text. The invention can improve the accuracy and efficiency of file desensitization.

Description

File desensitization method and device, electronic equipment and storage medium
Technical Field
The invention relates to the field of intelligent decision making, in particular to a file desensitization method, a file desensitization device, electronic equipment and a readable storage medium.
Background
Desensitization refers to deforming sensitive information according to desensitization rules so that sensitive private data is reliably protected. For example, sensitive information in personal text information entered by a user is replaced with "*".
At present, common desensitization techniques apply only to text information. When a user uploads image information such as an identity card, a bank card, or a driver's license, these techniques cannot desensitize it, so the user's personal information is easily leaked or stolen when business personnel display the image. Business personnel can only rely on their own judgment to cover the user's personal information manually, which reduces their working efficiency.
Disclosure of Invention
The invention provides a file desensitization method, a file desensitization device, electronic equipment and a readable storage medium, and aims to improve the accuracy and efficiency of file desensitization.
In order to achieve the above object, the present invention provides a method for desensitizing a file, the method comprising:
receiving a file uploaded by a user;
when the file format of the file is a picture format, picture information identification is carried out on the file to obtain the picture information of the file;
classifying the types of the files according to the picture information and preset certificate type information to obtain the certificate types of the files;
according to the certificate type, information classification is carried out on the picture information, and sensitive information of the file and the position of the sensitive information are determined according to the information classification;
according to the position of the sensitive information, carrying out data deformation on the sensitive information on the file to obtain desensitization information;
and when the file format of the file is a text format, performing text desensitization on the file to obtain a desensitized text.
Optionally, the classifying the file according to the picture information and preset certificate type information to obtain the certificate type of the file includes:
performing model training by using the certificate type information to obtain a certificate type classification model;
inputting the picture information into the certificate type classification model to obtain a classification result;
and judging the certificate type of the file according to the classification result.
Optionally, the inputting the picture information into the certificate type classification model to obtain a classification result includes:
character encoding is carried out on the picture information by utilizing an encoding layer in the certificate type classification model, and a character vector is obtained;
performing matrix splicing on the character vectors by utilizing a decoding layer in the certificate type classification model to obtain a picture information character matrix;
extracting keywords from the picture information character matrix by using an attention mechanism layer in the certificate type classification model to obtain picture information keywords;
and outputting the classification result of the picture information by utilizing a full connection layer in the certificate type classification model according to the picture information keyword.
Optionally, the classifying the information of the picture information according to the certificate type, and determining the sensitive information of the file and the position of the sensitive information according to the information classification includes:
acquiring information needing desensitization in the certificate type;
matching the picture information with the information needing desensitization, and taking the successfully matched information as sensitive information of the file;
and tracing the position of the sensitive information in the picture information to obtain the position of the sensitive information.
Optionally, the text desensitizing the file to obtain a desensitized text includes:
vectorizing the file to obtain a word vector of the file;
labeling the word vectors according to preset text characteristics;
combining the marked word vectors according to the corresponding word units in the file to obtain a word unit set;
creating a frequent item set according to the support of the word units contained in the word unit set;
calculating the lift of the frequent items contained in the frequent item set;
taking the word units corresponding to frequent items whose lift is greater than a preset threshold as sensitive word units;
and desensitizing the file according to the sensitive word units to obtain a desensitized text.
Optionally, the performing picture information identification on the file to obtain the picture information of the file includes:
carrying out digital image processing on the file to obtain a target contour region;
performing character segmentation on the target contour region to obtain a character contour region;
identifying the character outline area to obtain character information;
and splicing the character information to obtain picture information.
Optionally, the performing digital image processing on the file to obtain a target contour region includes:
carrying out graying processing on the file to obtain a pixel matrix;
carrying out binarization processing on the pixel matrix to obtain a binary pixel matrix;
performing dilation processing on the binary pixel matrix to obtain a dilated pixel matrix;
framing clustered pixels in the dilated pixel matrix to obtain a plurality of target object matrices to be screened;
screening the target object matrices to be screened according to a preset rule to obtain a target object matrix;
and extracting a region corresponding to the target object matrix to obtain the target contour region.
In order to solve the above problem, the present invention also provides a file desensitization apparatus, including:
the picture information identification module is used for receiving a file uploaded by a user, and when the file format of the file is a picture format, picture information identification is carried out on the file to obtain the picture information of the file;
the sensitive information identification module is used for classifying the file according to the picture information and preset certificate type information to obtain the certificate type of the file, classifying the picture information according to the certificate type, and determining the sensitive information of the file and the position of the sensitive information according to the information classification;
and the sensitive information desensitization module is used for performing data deformation on the sensitive information on the file according to the position of the sensitive information to obtain desensitization information, and performing text desensitization on the file to obtain a desensitization text when the file format of the file is a text format.
In order to solve the above problem, the present invention also provides an electronic device, including:
a memory storing at least one computer program; and
a processor executing the computer program stored in the memory to implement the file desensitization method described above.
To solve the above problem, the present invention also provides a computer-readable storage medium having at least one computer program stored therein, the at least one computer program being executed by a processor in an electronic device to implement the file desensitization method described above.
According to the embodiment of the invention, the desensitization mode to be carried out is determined according to the file format of the file, so that the file desensitization efficiency and accuracy are improved; when the file format of the file is the picture format, the file is subjected to type classification according to the picture information of the file and preset certificate type information to obtain the certificate type of the file, so that key information needing desensitization is convenient to confirm, and the accuracy and the intelligent degree of picture desensitization are improved. Therefore, the file desensitization method, the file desensitization device, the electronic equipment and the readable storage medium provided by the embodiment of the invention can improve the intelligent degree, accuracy and efficiency of file desensitization.
Drawings
FIG. 1 is a schematic flow chart of a file desensitization method according to an embodiment of the present invention;
FIG. 2 to FIG. 7 are flowcharts each illustrating the detailed implementation of one step of a file desensitization method according to an embodiment of the present invention;
FIG. 8 is a block schematic diagram of a file desensitization apparatus provided in accordance with an embodiment of the present invention;
FIG. 9 is a schematic internal structural diagram of an electronic device implementing a file desensitization method according to an embodiment of the present invention.
The implementation, functional features, and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the invention provides a file desensitization method. The execution subject of the method includes, but is not limited to, at least one electronic device that can be configured to execute the method provided by the embodiments of the present application, such as a server or a terminal. In other words, the file desensitization method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server may be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms.
Referring to FIG. 1, a schematic flow diagram of a file desensitization method provided in an embodiment of the present invention, the file desensitization method includes the following steps S1-S7:
S1, receiving the file uploaded by the user.
In the embodiment of the present invention, the user may be a customer handling a business transaction, for example, a customer applying for a credit card at a bank. The file may be user data required by the corresponding business; for example, when a bank handles a credit card application, the user is required to provide identification data and other supporting data.
In the optional embodiment of the invention, the file uploaded by the user is received through the service handling page without being processed by service personnel, so that the information security of the user is ensured, and the security of the private information of the user is improved.
S2, determining the file format of the file.
The file format is the encoding scheme used to store information and identifies the kind of data stored in the file. In the embodiment of the present invention, the file format may include a text format and a picture format. In an optional embodiment of the present invention, the file format of the file is determined by reading the suffix format information of the file.
When the suffix format information is bmp, jpg, png, tif, gif, pcx, tga, exif, fpx, svg, psd, cdr, pcd, dxf, ufo, eps, ai, raw, wmf, webp, avif, or apng, the file format of the file may be determined to be a picture format.
When the suffix format information of the file is txt, rtf, doc, xls, ppt, htm, html, wpd, pdf, or the like, the file format of the file may be determined to be a text format.
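The suffix check in step S2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `detect_file_format`, the set names, and the "unknown" fallback for unlisted suffixes are assumptions; only the suffix lists come from the description.

```python
from pathlib import Path

# Suffix lists copied from the description; storing them as sets is an assumption.
PICTURE_SUFFIXES = {
    "bmp", "jpg", "png", "tif", "gif", "pcx", "tga", "exif", "fpx", "svg", "psd",
    "cdr", "pcd", "dxf", "ufo", "eps", "ai", "raw", "wmf", "webp", "avif", "apng",
}
TEXT_SUFFIXES = {"txt", "rtf", "doc", "xls", "ppt", "htm", "html", "wpd", "pdf"}


def detect_file_format(filename: str) -> str:
    """Return "picture", "text", or "unknown" based on the file name suffix."""
    # Compare case-insensitively so that "SCAN.JPG" and "scan.jpg" agree.
    suffix = Path(filename).suffix.lower().lstrip(".")
    if suffix in PICTURE_SUFFIXES:
        return "picture"
    if suffix in TEXT_SUFFIXES:
        return "text"
    # The description does not say what happens to unlisted suffixes (assumption).
    return "unknown"
```

A caller would route "picture" results to steps S3-S6 and "text" results to step S7.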
S3, when the file format of the file is the picture format, carrying out picture information identification on the file to obtain the picture information of the file.
According to the embodiment of the invention, the picture information of the file is obtained by carrying out picture information identification on the file, so that the type of the file is convenient to judge, and the accuracy of picture desensitization is improved.
Further, referring to FIG. 2, as an alternative embodiment of the present invention, the identifying the picture information of the file to obtain the picture information of the file includes steps S31-S34:
S31, carrying out digital image processing on the file to obtain a target contour region;
S32, carrying out character segmentation on the target contour region to obtain a character contour region;
S33, recognizing the character contour region to obtain character information;
and S34, splicing the character information to obtain picture information.
In the embodiment of the present invention, the target contour region may be a region including character information, where the character information includes recognizable Chinese characters, English letters, Arabic numerals, and the like.
Further, referring to FIG. 3, the carrying out digital image processing on the file to obtain a target contour region includes steps S311 to S316:
S311, carrying out graying processing on the file to obtain a pixel matrix;
S312, carrying out binarization processing on the pixel matrix to obtain a binary pixel matrix;
S313, performing dilation processing on the binary pixel matrix to obtain a dilated pixel matrix;
S314, framing clustered pixels in the dilated pixel matrix to obtain a plurality of target object matrices to be screened;
S315, screening the target object matrices to be screened according to a preset rule to obtain a target object matrix;
and S316, extracting the region corresponding to the target object matrix to obtain the target contour region.
In the embodiment of the present invention, the preset rule may be a description of the features of the target object; for example, when screening for the name on an identity card, the preset rule may be that the region is rectangular and widely spaced from the other regions.
In the optional embodiment of the invention, the target contour region is obtained by carrying out a series of traditional digital image processing on the file, so that the contour region containing the character information is obtained, the accuracy of picture information identification is improved, and the possibility of information loss is reduced.
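Steps S311 to S316 can be sketched in plain Python on a small nested-list image. This is an illustration under stated assumptions rather than the patented implementation: the luminance weights, the threshold of 128, the one-pixel dilation radius, the 4-connected clustering, and the "wide rectangle" screening rule in `screen_boxes` are all hypothetical choices, and every function name is invented here.

```python
from collections import deque


def to_gray(rgb):
    """S311: convert rows of (r, g, b) tuples into a gray pixel matrix."""
    return [[(r * 299 + g * 587 + b * 114) // 1000 for r, g, b in row] for row in rgb]


def binarize(gray, threshold=128):
    """S312: dark pixels (assumed to be ink) become 1, all others 0."""
    return [[1 if v < threshold else 0 for v in row] for row in gray]


def dilate(binary, radius=1):
    """S313: set every pixel within `radius` of a 1-pixel to 1."""
    h, w = len(binary), len(binary[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if binary[y][x]:
                for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[ny][nx] = 1
    return out


def frame_clusters(binary):
    """S314: bounding boxes (top, left, bottom, right) of 4-connected clusters."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not binary[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:  # breadth-first flood fill over one cluster
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def screen_boxes(boxes, min_width=3, min_aspect=2.0):
    """S315: hypothetical preset rule that keeps wide, text-line-like rectangles."""
    kept = []
    for top, left, bottom, right in boxes:
        width, height = right - left + 1, bottom - top + 1
        if width >= min_width and width / height >= min_aspect:
            kept.append((top, left, bottom, right))
    return kept


def target_contour_regions(rgb):
    """S311-S316: return the screened boxes as the target contour regions."""
    dilated = dilate(binarize(to_gray(rgb)))
    return screen_boxes(frame_clusters(dilated))
```

Dilation is what merges separate strokes of one text line into a single cluster, so the framing step yields one box per line instead of one box per stroke. A production system would more likely use an image library for these operations; the pure-Python version is only meant to make each step visible.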
In another optional embodiment of the present invention, the picture information of the file may be obtained by inputting the file into a convolutional neural network for convolution and pooling.
S4, classifying the file according to the picture information and preset certificate type information to obtain the certificate type of the file.
In the embodiment of the present invention, the certificate type information may be the information contained in a picture of a given certificate type; for example, the certificate type information of an identity card may include field labels such as name, gender, ethnicity, and date of birth. The certificate types include identity cards, business licenses, vehicle licenses, driver's licenses, property ownership certificates, and the like.
According to the embodiment of the invention, the file is subjected to type classification according to the picture information and the preset certificate type information to obtain the certificate type of the file, so that the accurate control of the file in the picture format is realized, the accuracy of desensitization of the picture is improved, and the risk coefficient of user information leakage caused by the fact that information cannot be covered is reduced.
Further, referring to FIG. 4, as an alternative embodiment of the present invention, the classifying the file according to the picture information and the preset certificate type information to obtain the certificate type of the file includes steps S41-S43:
S41, performing model training by using the certificate type information to obtain a certificate type classification model;
S42, inputting the picture information into the certificate type classification model to obtain a classification result;
and S43, judging the certificate type of the file according to the classification result.
In the embodiment of the invention, the certificate type classification model can be a multi-classification model based on deep learning.
In the optional embodiment of the invention, the certificate type information is subjected to data training by using a model training method to obtain the certificate type classification model, so that the intelligent degree of certificate type identification is improved, and the efficiency of certificate type identification is improved.
Further, referring to FIG. 5, the inputting the picture information into the certificate type classification model to obtain a classification result includes steps S421 to S424:
S421, carrying out character encoding on the picture information by utilizing an encoding layer in the certificate type classification model to obtain character vectors;
S422, carrying out matrix splicing on the character vectors by utilizing a decoding layer in the certificate type classification model to obtain a picture information character matrix;
S423, extracting keywords from the picture information character matrix by using an attention mechanism layer in the certificate type classification model to obtain picture information keywords;
and S424, outputting the classification result of the picture information by utilizing a fully connected layer in the certificate type classification model according to the picture information keywords.
In the embodiment of the present invention, the certificate type classification model is built on a neural network and may be a BERT model. The certificate type classification model includes an encoding layer, a decoding layer, an attention mechanism layer, and a fully connected layer.
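To make the input and output of step S4 concrete, the sketch below deliberately swaps the trained neural model for a plain keyword-overlap score. It is not the BERT-style classifier described above and does not reproduce its layers; it only illustrates how recognized picture information and preset certificate type information could map to a certificate type. The identity card field labels follow the description; the other keyword sets, the type names, and the function name are hypothetical.

```python
from typing import Optional

# Preset certificate type information: field labels expected on each certificate.
CERTIFICATE_KEYWORDS = {
    "identity_card": {"name", "gender", "ethnicity", "date of birth"},
    # The two entries below are hypothetical examples, not taken from the source.
    "driver_license": {"name", "license class", "valid from"},
    "business_license": {"company name", "legal representative", "registered capital"},
}


def classify_certificate(picture_information: str) -> Optional[str]:
    """Return the certificate type whose keywords best cover the recognized text.

    Stand-in for the trained classification model: the score is simply the
    fraction of a type's keywords that occur in the text.
    """
    text = picture_information.lower()
    best_type, best_score = None, 0.0
    for cert_type, keywords in CERTIFICATE_KEYWORDS.items():
        score = sum(1 for kw in keywords if kw in text) / len(keywords)
        if score > best_score:
            best_type, best_score = cert_type, score
    return best_type  # None when no keyword of any type is present
```

A learned model would replace the scoring loop, but the surrounding contract (text in, one certificate type or no decision out) would stay the same.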
S5, carrying out information classification on the picture information according to the certificate type, and determining the sensitive information and the position of the sensitive information of the file according to the information classification.
In the embodiment of the present invention, the sensitive information may be private confidential information of the user, for example, information such as an identification number, a bank card number, a telephone number, a home address, and a name. The location of the sensitive information may be the location where the sensitive information is presented in the file.
According to the embodiment of the invention, the information of the picture information is classified according to the certificate type, so that desensitization of unnecessary information is eliminated, unnecessary information masking is reduced, and the efficiency of desensitization of the picture is improved.
According to the embodiment of the invention, the sensitive information of the file and the position of the sensitive information are determined according to the information classification, so that the desensitization position of the sensitive information is conveniently and accurately positioned, the desensitization rate of the sensitive information is improved, and the accuracy and the efficiency of desensitization of the picture are improved.
Further, referring to FIG. 6, as an alternative embodiment of the present invention, the classifying the picture information according to the certificate type, and determining the sensitive information of the file and the position of the sensitive information according to the information classification, include steps S51-S53:
S51, acquiring the information needing desensitization in the certificate type;
S52, matching the picture information with the information needing desensitization, and taking the successfully matched information as sensitive information of the file;
and S53, tracing the position of the sensitive information in the picture information to obtain the position of the sensitive information.
According to the optional embodiment of the invention, firstly, the characteristic of the information needing desensitization is obtained by acquiring the information needing desensitization in the certificate type, secondly, the picture information is matched with the information needing desensitization, the successfully matched information is used as the sensitive information of the file, the accuracy of the sensitive information is ensured, the accuracy of the picture desensitization is improved, and finally, the position of the sensitive information in the picture information is traced to obtain the position of the sensitive information, so that the accurate picture desensitization is realized.
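Steps S51 to S53 can be sketched as pattern matching over recognized text segments that carry their bounding boxes. This is an illustration only: the source does not specify how the information needing desensitization is represented, so the regular expressions, the `TextSegment` type, the "Name: " label format, and the sample values are all assumptions.

```python
import re
from typing import List, NamedTuple, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right) in pixels


class TextSegment(NamedTuple):
    """One recognized text segment and where it sits in the picture (assumed shape)."""
    text: str
    box: Box


# S51: information needing desensitization per certificate type (hypothetical rules).
SENSITIVE_PATTERNS = {
    "identity_card": {
        "id_number": re.compile(r"\d{17}[\dXx]"),  # 18-character identity number
        "name": re.compile(r"(?<=Name: ).+"),      # text after an assumed "Name: " label
    },
}


def locate_sensitive(segments: List[TextSegment], certificate_type: str):
    """Return (field, matched text, box) for every segment matching a rule."""
    patterns = SENSITIVE_PATTERNS.get(certificate_type, {})
    found = []
    for segment in segments:
        for field, pattern in patterns.items():
            match = pattern.search(segment.text)  # S52: match against the rules
            if match:
                # S53: the segment's box is taken as the position of the match.
                found.append((field, match.group(), segment.box))
    return found
```

Using the segment box as the position is a simplification; a finer-grained version would keep per-character boxes so that only the matched characters are covered.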
S6, performing data deformation on the sensitive information on the file according to the position of the sensitive information to obtain desensitized information.
In an embodiment of the present invention, the desensitized information may be hidden data; for example, sensitive information in the personal information entered by a user is replaced with "*", and the resulting "*" is the desensitized information.
According to the embodiment of the invention, data deformation is carried out on the sensitive information on the file according to the position of the sensitive information to obtain desensitization information, so that image desensitization is realized, the artificial participation degree is reduced, and the intelligent degree of image desensitization is improved.
In an optional embodiment of the invention, the encoding information of the sensitive information is read to obtain a sensitive code, the sensitive code is modified with a preset encoding modifier to obtain a desensitized code, and the desensitized code is compiled to obtain the desensitized information, thereby desensitizing the information and protecting the user's information security.
In the embodiment of the present invention, the encoding information includes an encoding sequence and encoded characters. The preset encoding modifier is a modifier that changes normal encoding information into the encoding information of non-sensitive characters such as "*" and "#".
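The data deformation of step S6 can be sketched in two forms: overwriting a character range with a mask character, and filling a rectangular region of a pixel matrix. Both helpers are hypothetical illustrations; the source describes the deformation only in terms of an encoding modifier, so the function names, the half-open character range, and the fill value are assumptions.

```python
def mask_text(text: str, start: int, end: int, mask_char: str = "*") -> str:
    """Replace text[start:end] with mask characters, keeping the length unchanged."""
    start, end = max(0, start), min(len(text), end)
    if start >= end:
        return text  # nothing to mask
    return text[:start] + mask_char * (end - start) + text[end:]


def mask_region(pixels, box, fill=0):
    """Return a copy of a pixel matrix with the (top, left, bottom, right) box filled."""
    top, left, bottom, right = box
    masked = [row[:] for row in pixels]  # leave the original matrix untouched
    for y in range(max(0, top), min(len(masked), bottom + 1)):
        for x in range(max(0, left), min(len(masked[y]), right + 1)):
            masked[y][x] = fill
    return masked
```

For a picture, `mask_region` would be applied to each position found in step S5; returning a copy keeps the uploaded original available for authorized processing.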
S7, when the file format of the file is a text format, performing text desensitization on the file to obtain a desensitized text.
In the embodiment of the invention, when the file format of the file is not the picture format, common text desensitization can be used for completing desensitization of the file, so that the information security of a user is ensured.
Further, referring to FIG. 7, as an alternative embodiment of the present invention, the text desensitizing the file to obtain a desensitized text includes steps S71-S77:
S71, vectorizing the file to obtain word vectors of the file;
S72, labeling the word vectors according to preset text features;
S73, combining the labeled word vectors according to the corresponding word units in the file to obtain a word unit set;
S74, creating a frequent item set according to the support of the word units contained in the word unit set;
S75, calculating the lift of the frequent items contained in the frequent item set;
S76, taking the word units corresponding to frequent items whose lift is greater than a preset threshold as sensitive word units;
and S77, desensitizing the file according to the sensitive word units to obtain a desensitized text.
In the embodiment of the present invention, the preset text feature may be a sensitive-vocabulary feature preset by the user. The support may be the proportion that a word unit accounts for in the word unit set and in the file.
In an optional embodiment of the invention, candidate sensitive word units are selected according to the text features of sensitive information, which narrows the range in which sensitive information must be judged and improves the efficiency of text desensitization. The sensitive word units are then screened out by calculating the lift of the candidate sensitive word units, which improves the desensitization of sensitive information in text-format files, effectively protects the sensitive information in the desensitized text, and ensures the information security of the user.
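Steps S74 to S77 can be sketched with the standard association-rule definitions of support and lift, where lift(A, B) = support(A and B) / (support(A) x support(B)). The source does not define the transactions or the shape of a frequent item, so this sketch assumes that each sentence contributes one set of labeled word units and that frequent items are pairs of word units; the thresholds, function names, and sample data are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def sensitive_word_units(transactions, min_support=0.3, min_lift=1.5):
    """S74-S76: keep word units from frequent pairs whose lift exceeds the threshold."""
    if not transactions:
        return set()
    items = sorted(set().union(*transactions))
    single = {item: support({item}, transactions) for item in items}
    sensitive = set()
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, transactions)
        if pair_support < min_support:
            continue  # S74: the pair is not a frequent item
        lift = pair_support / (single[a] * single[b])  # S75
        if lift > min_lift:  # S76
            sensitive.update((a, b))
    return sensitive


def desensitize(text, sensitive_units, mask_char="*"):
    """S77: overwrite every sensitive word unit with mask characters."""
    # Longest units first, so a unit containing a shorter unit is masked whole.
    for unit in sorted(sensitive_units, key=len, reverse=True):
        text = text.replace(unit, mask_char * len(unit))
    return text
```

A lift above 1 means two word units co-occur more often than independence would predict, which is the property this sketch uses to separate units that consistently appear together (such as a name and its phone number) from common words.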
According to the embodiment of the invention, the desensitization mode to be applied is determined according to the file format of the file, which improves the efficiency and accuracy of file desensitization. When the file format of the file is the picture format, the file is classified by type according to the picture information of the file and preset certificate type information to obtain the certificate type of the file, so that the key information needing desensitization is easy to confirm and the accuracy and degree of intelligence of picture desensitization are improved. Therefore, the file desensitization method provided by the embodiment of the invention can improve the degree of intelligence, the accuracy and the efficiency of file desensitization.
FIG. 8 is a functional block diagram of the file desensitization apparatus of the present invention.
The file desensitization apparatus 100 of the present invention may be installed in an electronic device. According to the functions implemented, the file desensitization apparatus 100 may include a picture information identification module 101, a sensitive information identification module 102 and a sensitive information desensitization module 103. A module according to the present invention, which may also be referred to as a unit, is a series of computer program segments that are stored in a memory of the electronic device, can be executed by a processor of the electronic device, and perform a fixed function.
In the present embodiment, the functions regarding the respective modules/units are as follows:
The picture information identification module 101 is configured to receive a file uploaded by a user, and when the file format of the file is a picture format, perform picture information identification on the file to obtain picture information of the file.
In the embodiment of the present invention, the user may be a customer handling a business transaction, for example, a customer applying for a credit card at a bank. The file may be the user data required by the corresponding business; for example, when a bank handles a credit card application, the user is required to provide identification documents and other data.
In the optional embodiment of the invention, the file uploaded by the user is received through the service handling page without being processed by service personnel, so that the information security of the user is ensured, and the security of the private information of the user is improved.
The file format is the specific encoding scheme used to store information and identifies the kind of data stored in the file. In the embodiment of the present invention, the file format may include a text format and a picture format. In an optional embodiment of the present invention, the file format of the file is determined by reading the suffix format information (file extension) of the file.
When the suffix format information is bmp, jpg, png, tif, gif, pcx, tga, exif, fpx, svg, psd, cdr, pcd, dxf, ufo, eps, ai, raw, WMF, webp, avif, or apng, the file format of the file may be determined to be a picture format.
When the suffix format information of the file is txt, rtf, doc, xls, ppt, htm, html, wpd, pdf, etc., the file format of the file can be determined to be a text format.
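The suffix check described above can be sketched as a simple lookup. The suffix lists are the ones named in this description; the function name `file_format` and the "unknown" fallback for unlisted suffixes are assumptions added for illustration.

```python
from pathlib import Path

# Suffixes listed in the description for each file format.
PICTURE_SUFFIXES = {
    "bmp", "jpg", "png", "tif", "gif", "pcx", "tga", "exif", "fpx", "svg", "psd",
    "cdr", "pcd", "dxf", "ufo", "eps", "ai", "raw", "wmf", "webp", "avif", "apng",
}
TEXT_SUFFIXES = {"txt", "rtf", "doc", "xls", "ppt", "htm", "html", "wpd", "pdf"}


def file_format(filename):
    """Classify a file as 'picture', 'text' or 'unknown' from its suffix."""
    suffix = Path(filename).suffix.lower().lstrip(".")
    if suffix in PICTURE_SUFFIXES:
        return "picture"
    if suffix in TEXT_SUFFIXES:
        return "text"
    return "unknown"  # assumption: the description does not say how other suffixes are handled
```

Comparing the suffix case-insensitively lets `ID_CARD.JPG` and `id_card.jpg` follow the same picture desensitization branch.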
According to the embodiment of the invention, the picture information of the file is obtained by carrying out picture information identification on the file, so that the type of the file is convenient to judge, and the accuracy of picture desensitization is improved.
Further, referring to fig. 2, as an optional embodiment of the present invention, the identifying the picture information of the file to obtain the picture information of the file includes:
carrying out digital image processing on the file to obtain a target contour region;
performing character segmentation on the target contour region to obtain a character contour region;
identifying the character outline area to obtain character information;
and splicing the character information to obtain picture information.
In the embodiment of the present invention, the target contour region may be a region containing character information, where the character information includes recognizable Chinese characters, English letters, Arabic numerals, and the like.
Further, referring to fig. 3, the performing digital image processing on the file to obtain a target contour region includes:
carrying out graying processing on the file to obtain a pixel matrix;
carrying out binarization processing on the pixel matrix to obtain a binary pixel matrix;
performing dilation (expansion) processing on the binary pixel matrix to obtain an expanded pixel matrix;
framing clustered pixels in the expanded pixel matrix to obtain a plurality of target object matrixes to be screened;
screening the target object matrix to be screened according to a preset rule to obtain a target object matrix;
and extracting a region corresponding to the target object matrix to obtain a target contour region.
In the embodiment of the present invention, the preset rule may be a description of the features of the target object; for example, when screening for the name on an identity card, the preset rule may be that the region is rectangular and is separated from the other rectangular regions by a large interval.
In the optional embodiment of the invention, the target contour region is obtained by carrying out a series of traditional digital image processing on the file, so that the contour region containing the character information is obtained, the accuracy of picture information identification is improved, and the possibility of information loss is reduced.
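The graying, binarization, dilation, framing and screening steps can be sketched in plain Python on a tiny synthetic image. This is an illustrative sketch only: the threshold, the 3x3 dilation neighbourhood, the 4-connected clustering and the "wider than tall" screening rule are all assumptions, and a practical implementation would normally rely on an image processing library.

```python
def to_gray(rgb):
    """Graying: convert rows of (r, g, b) pixels to integer luminance values."""
    return [[(299 * r + 587 * g + 114 * b) // 1000 for r, g, b in row] for row in rgb]


def binarize(gray, threshold=128):
    """Binarization: dark pixels (ink) become 1, light background becomes 0."""
    return [[1 if v < threshold else 0 for v in row] for row in gray]


def dilate(binary):
    """Dilation with a 3x3 square: a pixel is set if it or any neighbour is set."""
    h, w = len(binary), len(binary[0])
    return [[int(any(binary[j][i]
                     for j in range(max(0, y - 1), min(h, y + 2))
                     for i in range(max(0, x - 1), min(w, x + 2))))
             for x in range(w)] for y in range(h)]


def bounding_boxes(binary):
    """Frame each 4-connected cluster of set pixels as (top, left, bottom, right)."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not binary[y][x] or seen[y][x]:
                continue
            stack, top, left, bottom, right = [(y, x)], y, x, y, x
            seen[y][x] = True
            while stack:
                cy, cx = stack.pop()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def screen(boxes, min_width=3):
    """Hypothetical preset rule: keep boxes that are wide enough and wider than tall."""
    kept = []
    for top, left, bottom, right in boxes:
        width, height = right - left + 1, bottom - top + 1
        if width >= min_width and width > height:
            kept.append((top, left, bottom, right))
    return kept


# Synthetic 7x12 white image: two nearby ink pixels (a "text line") and one isolated speck.
WHITE, BLACK = (255, 255, 255), (0, 0, 0)
image = [[WHITE] * 12 for _ in range(7)]
for x in (2, 4, 9):
    image[3][x] = BLACK

boxes = bounding_boxes(dilate(binarize(to_gray(image))))
targets = screen(boxes)
```

Dilation merges the two nearby ink pixels into one cluster, so framing yields a single wide box for the "text line" and a separate square box for the speck, which the screening rule then discards.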
In another optional embodiment of the present invention, the picture information of the file may be obtained by inputting the file into a convolutional neural network and applying convolution and pooling.
The sensitive information identification module 102 is configured to classify the file by type according to the picture information and preset certificate type information to obtain the certificate type of the file, classify the picture information according to the certificate type, and determine the sensitive information of the file and the position of the sensitive information according to the information classification.
In the embodiment of the present invention, the certificate type information may be information contained in a picture of a given certificate type; for example, the certificate type information of an identity card may include characters such as name, gender, ethnicity, and year and month of birth. The certificate types include identity cards, business licenses, vehicle licenses, driver's licenses, property ownership certificates and the like.
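To make the role of the certificate type information concrete, the sketch below scores recognized picture information against the characteristic labels of each certificate type. Note that this swaps in plain keyword overlap for the trained classification model that the embodiments describe; the keyword lists, the function name and the scoring rule are hypothetical.

```python
# Hypothetical characteristic labels per certificate type (illustrative only).
CERTIFICATE_KEYWORDS = {
    "identity card": {"name", "gender", "ethnicity", "date of birth", "address"},
    "business license": {"unified social credit code", "legal representative",
                         "registered capital"},
    "driver's license": {"name", "vehicle class", "valid period"},
}


def classify_certificate(picture_information):
    """Return the certificate type whose labels best match the text, or None."""
    text = picture_information.lower()
    scores = {
        certificate_type: sum(keyword in text for keyword in keywords) / len(keywords)
        for certificate_type, keywords in CERTIFICATE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Normalizing by the number of labels keeps a type with a long label list from winning merely because it has more chances to match.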
According to the embodiment of the invention, the file is subjected to type classification according to the picture information and the preset certificate type information to obtain the certificate type of the file, so that the accurate control of the file in the picture format is realized, the accuracy of desensitization of the picture is improved, and the risk coefficient of user information leakage caused by the fact that information cannot be covered is reduced.
Further, referring to fig. 4, as an optional embodiment of the present invention, the classifying the file type according to the picture information and preset certificate type information to obtain the certificate type of the file includes:
performing model training by using the certificate type information to obtain a certificate type classification model;
inputting the picture information into the certificate type classification model to obtain a classification result;
and judging the certificate type of the file according to the classification result.
In the embodiment of the invention, the certificate type classification model can be a multi-classification model based on deep learning.
In the optional embodiment of the invention, the certificate type information is subjected to data training by using a model training method to obtain the certificate type classification model, so that the intelligent degree of certificate type identification is improved, and the efficiency of certificate type identification is improved.
Further, referring to fig. 5, the inputting the picture information into the certificate type classification model to obtain a classification result includes:
character encoding is carried out on the picture information by utilizing an encoding layer in the certificate type classification model, and a character vector is obtained;
performing matrix splicing on the character vectors by utilizing a decoding layer in the certificate type classification model to obtain a picture information character matrix;
extracting keywords from the picture information character matrix by using an attention mechanism layer in the certificate type classification model to obtain picture information keywords;
and outputting the classification result of the picture information by utilizing a full connection layer in the certificate type classification model according to the picture information keyword.
In the embodiment of the present invention, the certificate type classification model is built on a neural network and may be a BERT model. The certificate type classification model includes an encoding layer, a decoding layer, an attention mechanism layer and a fully connected layer.
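The interplay of the attention mechanism layer and the fully connected layer can be illustrated with a toy forward pass. This is not BERT and not the trained model of the embodiment: the character matrix, the query vector and the layer weights are hand-picked hypothetical values, and the sketch only shows attention-weighted pooling followed by a linear layer and a softmax.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(matrix, query):
    """Weight each row (character vector) by softmax(row . query) and sum the rows."""
    scores = [sum(a * b for a, b in zip(row, query)) for row in matrix]
    weights = softmax(scores)
    pooled = [sum(w * row[d] for w, row in zip(weights, matrix))
              for d in range(len(matrix[0]))]
    return pooled, weights


def fully_connected(vector, weight_rows, bias):
    """One linear layer: one output logit per weight row."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weight_rows, bias)]


# Hypothetical 3 x 2 picture information character matrix and parameters.
character_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
pooled, weights = attention_pool(character_matrix, query)
logits = fully_connected(pooled, [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0])
probabilities = softmax(logits)  # one probability per hypothetical certificate type
```

The rows that align with the query receive most of the attention weight, so they dominate the pooled vector that the fully connected layer turns into class probabilities.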
In the embodiment of the present invention, the sensitive information may be private confidential information of the user, for example, information such as an identification number, a bank card number, a telephone number, a home address, and a name. The location of the sensitive information may be the location where the sensitive information is presented in the file.
According to the embodiment of the invention, the picture information is classified according to the certificate type, so that information that does not need desensitization is excluded, unnecessary masking is reduced, and the efficiency of picture desensitization is improved.
According to the embodiment of the invention, the sensitive information of the file and the position of the sensitive information are determined according to the information classification, so that the desensitization position of the sensitive information is conveniently and accurately positioned, the desensitization rate of the sensitive information is improved, and the accuracy and the efficiency of desensitization of the picture are improved.
Further, referring to fig. 6, as an optional embodiment of the present invention, the classifying the picture information according to the certificate type, and determining the sensitive information and the location of the sensitive information of the file according to the information classification includes:
acquiring information needing desensitization in the certificate type;
matching the picture information with the information needing desensitization, and taking the successfully matched information as sensitive information of the file;
and tracing the position of the sensitive information in the picture information to obtain the position of the sensitive information.
According to this optional embodiment of the invention, the information needing desensitization for the certificate type is first acquired, which gives the characteristics of the information to be desensitized. The picture information is then matched with the information needing desensitization, and the successfully matched information is taken as the sensitive information of the file, which ensures that the sensitive information is identified accurately and improves the accuracy of picture desensitization. Finally, the position of the sensitive information in the picture information is traced to obtain the position of the sensitive information, so that accurate picture desensitization is achieved.
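The three steps above (acquire the information needing desensitization, match, trace the position) can be sketched as follows. The sketch assumes that picture information identification has already produced labelled fields with bounding boxes; the field structure, the `SENSITIVE_FIELDS` table and the function name are hypothetical.

```python
# Hypothetical table of the information needing desensitization per certificate type.
SENSITIVE_FIELDS = {
    "identity card": {"name", "id number", "address"},
    "business license": {"legal representative"},
}


def locate_sensitive(recognized_fields, certificate_type):
    """Return (label, value, box) for every recognized field that needs desensitization."""
    wanted = SENSITIVE_FIELDS.get(certificate_type, set())
    return [
        (field["label"], field["value"], field["box"])
        for field in recognized_fields
        if field["label"] in wanted  # successful match -> sensitive information
    ]


# Hypothetical recognition output; each box is (top, left, bottom, right) in pixels.
fields = [
    {"label": "name", "value": "Zhang San", "box": (40, 120, 70, 260)},
    {"label": "gender", "value": "Male", "box": (90, 120, 120, 180)},
    {"label": "id number", "value": "123456199001011234", "box": (300, 200, 340, 620)},
]
sensitive = locate_sensitive(fields, "identity card")
```

Keeping the bounding box next to each matched value is what allows the later data deformation step to cover exactly the sensitive region and nothing else.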
The sensitive information desensitization module 103 is configured to perform data deformation on the sensitive information on the file according to the position of the sensitive information to obtain desensitization information, and to perform text desensitization on the file to obtain a desensitized text when the file format of the file is a text format.
In an embodiment of the present invention, the desensitization information may be hidden data; for example, the sensitive information in the personal information entered by the user is replaced with "x", and the resulting "x" is the desensitization information.
According to the embodiment of the invention, data deformation is performed on the sensitive information on the file according to the position of the sensitive information to obtain the desensitization information, so that picture desensitization is achieved, the degree of manual involvement is reduced, and the degree of intelligence of picture desensitization is improved.
In the optional embodiment of the invention, the sensitive code is obtained by reading the coding information of the sensitive information, the sensitive code is modified by using a preset code modifier to obtain the desensitization code, and the desensitization code is compiled to obtain the desensitization information, so that the information desensitization is realized and the information safety of a user is ensured.
In the embodiment of the present invention, the encoding information includes an encoding sequence and encoded characters. The preset encoding modifier is a modifier that changes normal encoding information into the encoding information of non-sensitive characters such as "+" and "#".
In the embodiment of the invention, when the file format of the file is not the picture format, conventional text desensitization can be used to desensitize the file, so that the information security of the user is ensured.
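The read, modify and rebuild sequence described above can be sketched on a character string. The sketch assumes that the encoding information is the sequence of Unicode code points and that the preset encoding modifier simply substitutes the code point of a mask character; the function name, the default "*" mask and the sample value are hypothetical.

```python
def desensitize(text, start, end, mask="*"):
    """Replace the characters in text[start:end] by rewriting their code points."""
    codes = [ord(ch) for ch in text]  # read the encoding information
    for index in range(max(0, start), min(end, len(codes))):
        codes[index] = ord(mask)  # apply the (assumed) encoding modifier
    return "".join(chr(code) for code in codes)  # rebuild the desensitized text


# Fictional identifier; positions 9..16 hold the eight birth-date digits.
masked = desensitize("ID 123456199001011234", 9, 17)
```

Because only the code points inside the located range change, the length and layout of the surrounding information are preserved.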
Further, referring to fig. 7, as an alternative embodiment of the present invention, the performing text desensitization on the file to obtain a desensitized text includes:
vectorizing the file to obtain a word vector of the file;
labeling the word vectors according to preset text characteristics;
combining the marked word vectors according to the corresponding word units in the file to obtain a word unit set;
creating a frequent item set according to the support of the word units contained in the word unit set;
calculating the lift (promotion degree) of the frequent items contained in the frequent item set;
taking the word units corresponding to the frequent items whose lift is greater than a preset threshold in the frequent item set as sensitive word units;
and desensitizing the file according to the sensitive word units to obtain a desensitized text.
In the embodiment of the present invention, the preset text feature may be a sensitive-vocabulary feature preset by the user. The support may be the proportion of the file accounted for by a given word unit of the word unit set, that is, how frequently the word unit occurs in the file.
In this optional embodiment of the invention, candidate sensitive word units are selected according to the text features of sensitive information, which narrows the range in which sensitive information must be judged and improves the efficiency of text desensitization. The sensitive word units are then screened out by calculating the lift (promotion degree) of the candidate sensitive word units, which improves the desensitization of sensitive information in text format, effectively protects the sensitive information in the desensitized text, and ensures the information security of the user.
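Once the sensitive word units are known, the final step of desensitizing the file according to them can be sketched as a masked replacement. The function name and the choice of "*" as the mask character are assumptions; the embodiment does not prescribe a particular replacement.

```python
import re


def mask_word_units(text, sensitive_units, mask="*"):
    """Replace every occurrence of each sensitive word unit with mask characters."""
    if not sensitive_units:
        return text
    # Longest units first, so that a unit contained in a longer one cannot pre-empt it.
    ordered = sorted(sensitive_units, key=len, reverse=True)
    pattern = "|".join(re.escape(unit) for unit in ordered)
    return re.sub(pattern, lambda match: mask * len(match.group()), text)


desensitized = mask_word_units(
    "Call Zhang San at 13800000000.", ["Zhang San", "13800000000"]
)
```

Masking with the same number of characters as the original keeps the layout of the text stable while hiding the content.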
Fig. 9 is a schematic structural diagram of an electronic device implementing the file desensitization method according to the present invention.
The electronic device may comprise a processor 10, a memory 11, a communication bus 12 and a communication interface 13, and may further comprise a computer program, such as a file desensitization program, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device, for example a removable hard disk of the electronic device. The memory 11 may also be an external storage device of the electronic device in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device. The memory 11 may be used not only to store application software installed in the electronic device and various types of data, such as codes of a file desensitization program, etc., but also to temporarily store data that has been output or is to be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the entire electronic device using various interfaces and lines, and executes various functions of the electronic device and processes data by running or executing programs or modules (e.g., file desensitization programs, etc.) stored in the memory 11 and calling data stored in the memory 11.
The communication bus 12 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The bus may be divided into an address bus, a data bus, a control bus, etc. The communication bus 12 is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
Fig. 9 shows only an electronic device having components, and those skilled in the art will appreciate that the structure shown in fig. 9 does not constitute a limitation of the electronic device, and may include fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device may further include a power supply (such as a battery) for supplying power to each component. Preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that functions such as charge management, discharge management and power consumption management are realized through the power management device. The power supply may also include one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device may further include various sensors, a Bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Optionally, the communication interface 13 may include a wired interface and/or a wireless interface (e.g., a Wi-Fi interface, a Bluetooth interface, etc.), which is generally used to establish a communication connection between the electronic device and other electronic devices.
Optionally, the communication interface 13 may further include a user interface, which may be a Display (Display), an input unit (such as a Keyboard (Keyboard)), and optionally, a standard wired interface, or a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable, among other things, for displaying information processed in the electronic device and for displaying a visualized user interface.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The file desensitization program stored in the memory 11 in the electronic device is a combination of computer programs which, when run in the processor 10, implement:
receiving a file uploaded by a user;
when the file format of the file is a picture format, picture information identification is carried out on the file to obtain the picture information of the file;
classifying the types of the files according to the picture information and preset certificate type information to obtain the certificate types of the files;
according to the certificate type, information classification is carried out on the picture information, and sensitive information of the file and the position of the sensitive information are determined according to the information classification;
according to the position of the sensitive information, carrying out data deformation on the sensitive information on the file to obtain desensitization information;
and when the file format of the file is a text format, performing text desensitization on the file to obtain a desensitized text.
Specifically, for the implementation of the computer program by the processor 10, reference may be made to the description of the relevant steps in the embodiment corresponding to fig. 1, which is not repeated here.
Further, the integrated modules/units of the electronic device, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. The computer readable storage medium may be non-volatile or volatile. The computer readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a Read-Only Memory (ROM).
Embodiments of the present invention may also provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor of an electronic device, the computer program may implement:
receiving a file uploaded by a user;
when the file format of the file is a picture format, picture information identification is carried out on the file to obtain the picture information of the file;
classifying the types of the files according to the picture information and preset certificate type information to obtain the certificate types of the files;
according to the certificate type, information classification is carried out on the picture information, and sensitive information of the file and the position of the sensitive information are determined according to the information classification;
according to the position of the sensitive information, carrying out data deformation on the sensitive information on the file to obtain desensitization information;
and when the file format of the file is a text format, performing text desensitization on the file to obtain a desensitized text.
Further, the computer usable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
In the embodiments provided in the present invention, it should be understood that the disclosed electronic device, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, in which each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
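The idea of data blocks associated by a cryptographic method can be illustrated with a minimal hash chain. This is a generic sketch, not a blockchain platform and not part of the desensitization method: it assumes SHA-256 as the hash and omits consensus, networking and signatures; the block layout and function names are hypothetical.

```python
import hashlib
import json


def make_block(transactions, previous_hash):
    """Build a block whose hash covers its transactions and the previous block's hash."""
    body = json.dumps(
        {"transactions": transactions, "previous_hash": previous_hash}, sort_keys=True
    )
    return {
        "transactions": transactions,
        "previous_hash": previous_hash,
        "hash": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }


def is_valid(chain):
    """Check each block's own hash and its link to the preceding block."""
    for index, block in enumerate(chain):
        expected = make_block(block["transactions"], block["previous_hash"])["hash"]
        if block["hash"] != expected:
            return False
        if index > 0 and block["previous_hash"] != chain[index - 1]["hash"]:
            return False
    return True


genesis = make_block(["desensitized file record 1"], previous_hash="0" * 64)
second = make_block(["desensitized file record 2"], previous_hash=genesis["hash"])
ledger = [genesis, second]
```

Changing the contents of an earlier block changes its recomputed hash, so the stored hash no longer matches and validation fails; this is the anti-counterfeiting property mentioned above.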
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A method of desensitizing a file, the method comprising:
receiving a file uploaded by a user;
when the file format of the file is a picture format, picture information identification is carried out on the file to obtain the picture information of the file;
classifying the types of the files according to the picture information and preset certificate type information to obtain the certificate types of the files;
according to the certificate type, information classification is carried out on the picture information, and sensitive information of the file and the position of the sensitive information are determined according to the information classification;
according to the position of the sensitive information, carrying out data deformation on the sensitive information on the file to obtain desensitization information;
and when the file format of the file is a text format, performing text desensitization on the file to obtain a desensitized text.
2. The file desensitization method according to claim 1, wherein said classifying the types of the files according to the picture information and preset certificate type information to obtain the certificate types of the files comprises:
performing model training by using the certificate type information to obtain a certificate type classification model;
inputting the picture information into the certificate type classification model to obtain a classification result;
and judging the certificate type of the file according to the classification result.
3. The file desensitization method of claim 2, wherein said inputting the picture information into the certificate type classification model to obtain a classification result comprises:
character encoding is carried out on the picture information by utilizing an encoding layer in the certificate type classification model, and a character vector is obtained;
performing matrix splicing on the character vectors by utilizing a decoding layer in the certificate type classification model to obtain a picture information character matrix;
extracting keywords from the picture information character matrix by using an attention mechanism layer in the certificate type classification model to obtain picture information keywords;
and outputting the classification result of the picture information by utilizing a full connection layer in the certificate type classification model according to the picture information keyword.
4. The file desensitization method of claim 1, wherein said classifying the picture information according to the certificate type, and determining the sensitive information of the file and the position of the sensitive information according to the information classification comprises:
acquiring information needing desensitization in the certificate type;
matching the picture information with the information needing desensitization, and taking the successfully matched information as sensitive information of the file;
and tracing the position of the sensitive information in the picture information to obtain the position of the sensitive information.
5. The file desensitization method of claim 1, wherein said text desensitizing said file to obtain desensitized text, comprises:
vectorizing the file to obtain a word vector of the file;
labeling the word vectors according to preset text characteristics;
combining the marked word vectors according to the corresponding word units in the file to obtain a word unit set;
creating a frequent item set according to the support of the word units contained in the word unit set;
calculating the promotion degree of frequent items contained in the frequent item set;
taking word units corresponding to frequent items with the promotion degree larger than a preset threshold value in the frequent item set as sensitive word units;
and desensitizing the file according to the sensitive word units to obtain a desensitized text.
6. The file desensitization method of claim 1, wherein said identifying picture information of the file to obtain picture information of the file comprises:
carrying out digital image processing on the file to obtain a target contour region;
performing character segmentation on the target contour region to obtain a character contour region;
identifying the character outline area to obtain character information;
and splicing the character information to obtain picture information.
7. The file desensitization method according to claim 6, wherein carrying out digital image processing on the file to obtain a target contour region comprises:
carrying out graying processing on the file to obtain a pixel matrix;
carrying out binarization processing on the pixel matrix to obtain a binary pixel matrix;
performing dilation processing on the binary pixel matrix to obtain a dilated pixel matrix;
framing clustered pixels in the dilated pixel matrix to obtain a plurality of target object matrices to be screened;
screening the target object matrices to be screened according to a preset rule to obtain a target object matrix;
and extracting a region corresponding to the target object matrix to obtain a target contour region.
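The image-processing chain of claim 7 can be sketched in pure Python on a small RGB pixel matrix. Everything specific here is an assumption made for illustration: the BT.601 luma weights, the fixed threshold of 128, the 3x3 dilation neighborhood, 4-connected clustering, and the width and aspect-ratio screening rule. A production system would more likely rely on an image-processing library.

```python
from collections import deque


def to_gray(rgb):
    """Grayscale with ITU-R BT.601 luma weights (integer arithmetic)."""
    return [[(299 * r + 587 * g + 114 * b) // 1000 for r, g, b in row]
            for row in rgb]


def binarize(gray, threshold=128):
    """Dark pixels (assumed to be text) become 1, the rest 0."""
    return [[1 if v < threshold else 0 for v in row] for row in gray]


def dilate(binary):
    """Dilate with a 3x3 square so nearby strokes merge into one cluster."""
    h, w = len(binary), len(binary[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if binary[y][x]:
                for ny in range(max(0, y - 1), min(h, y + 2)):
                    for nx in range(max(0, x - 1), min(w, x + 2)):
                        out[ny][nx] = 1
    return out


def bounding_boxes(binary):
    """Frame each 4-connected cluster as (x0, y0, x1, y1), inclusive."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not binary[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cy, cx = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((x0, y0, x1, y1))
    return boxes


def screen(boxes, min_width=3, min_aspect=1.5):
    """Hypothetical preset rule: keep wide, text-line-shaped boxes."""
    kept = []
    for x0, y0, x1, y1 in boxes:
        width, height = x1 - x0 + 1, y1 - y0 + 1
        if width >= min_width and width / height >= min_aspect:
            kept.append((x0, y0, x1, y1))
    return kept


def target_regions(rgb):
    """Run the chain: gray -> binary -> dilated -> framed -> screened."""
    return screen(bounding_boxes(dilate(binarize(to_gray(rgb)))))
```

Dilation runs before framing so that the separate strokes of one text line merge into a single cluster; the screening rule then discards clusters too small or too square to be a line of text.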
8. A file desensitization apparatus, characterized in that the apparatus comprises:
the picture information identification module is used for receiving a file uploaded by a user, and when the file format of the file is a picture format, picture information identification is carried out on the file to obtain the picture information of the file;
the sensitive information identification module is used for classifying the type of the file according to the picture information and preset certificate type information to obtain the certificate type of the file, classifying the picture information according to the certificate type, and determining the sensitive information of the file and the position of the sensitive information according to the information classification;
and the sensitive information desensitization module is used for performing data deformation on the sensitive information on the file according to the position of the sensitive information to obtain desensitization information, and performing text desensitization on the file to obtain a desensitization text when the file format of the file is a text format.
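How the apparatus might route a file by format and deform a sensitive region can be sketched as below. The claim names neither the supported formats nor the deformation, so the extension sets, the function names `route` and `mask_region`, and the choice of overwriting the region with a fill value are all assumptions.

```python
import os

PICTURE_FORMATS = {".png", ".jpg", ".jpeg", ".bmp"}  # assumed extensions
TEXT_FORMATS = {".txt", ".csv"}                      # assumed extensions


def route(filename):
    """Decide which desensitization branch handles the uploaded file."""
    suffix = os.path.splitext(filename)[1].lower()
    if suffix in PICTURE_FORMATS:
        return "picture"
    if suffix in TEXT_FORMATS:
        return "text"
    return "unsupported"


def mask_region(pixels, box, fill=0):
    """Return a copy of the matrix with box (x0, y0, x1, y1) overwritten."""
    x0, y0, x1, y1 = box
    return [
        [fill if (x0 <= x <= x1 and y0 <= y <= y1) else value
         for x, value in enumerate(row)]
        for y, row in enumerate(pixels)
    ]
```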
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores computer program instructions executable by the at least one processor to cause the at least one processor to perform a file desensitization method according to any of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements a file desensitization method according to any of claims 1 to 7.
CN202210645565.XA 2022-06-08 2022-06-08 File desensitization method and device, electronic equipment and storage medium Pending CN114898373A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210645565.XA CN114898373A (en) 2022-06-08 2022-06-08 File desensitization method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210645565.XA CN114898373A (en) 2022-06-08 2022-06-08 File desensitization method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114898373A (en) 2022-08-12

Family

ID=82727749

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210645565.XA Pending CN114898373A (en) 2022-06-08 2022-06-08 File desensitization method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114898373A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115098877A (en) * 2022-08-25 2022-09-23 北京前沿信安科技股份有限公司 File encryption and decryption method and device, electronic equipment and medium
CN116663065A (en) * 2023-07-27 2023-08-29 北京亿赛通科技发展有限责任公司 Stream file desensitizing method and device applied to computer security system

Similar Documents

Publication Publication Date Title
CN112507936B (en) Image information auditing method and device, electronic equipment and readable storage medium
CN114898373A (en) File desensitization method and device, electronic equipment and storage medium
CN112699775A (en) Certificate identification method, device and equipment based on deep learning and storage medium
CN112528616B (en) Service form generation method and device, electronic equipment and computer storage medium
CN112396005A (en) Biological characteristic image recognition method and device, electronic equipment and readable storage medium
CN113704614A (en) Page generation method, device, equipment and medium based on user portrait
CN112581227A (en) Product recommendation method and device, electronic equipment and storage medium
CN113157927A (en) Text classification method and device, electronic equipment and readable storage medium
CN113887438A (en) Watermark detection method, device, equipment and medium for face image
CN114398557A (en) Information recommendation method and device based on double portraits, electronic equipment and storage medium
CN113792089A (en) Illegal behavior detection method, device, equipment and medium based on artificial intelligence
CN113221888B (en) License plate number management system test method and device, electronic equipment and storage medium
CN113536782A (en) Sensitive word recognition method and device, electronic equipment and storage medium
CN113704474B (en) Bank outlet equipment operation guide generation method, device, equipment and storage medium
CN114943306A (en) Intention classification method, device, equipment and storage medium
CN115203364A (en) Software fault feedback processing method, device, equipment and readable storage medium
CN114996386A (en) Business role identification method, device, equipment and storage medium
CN113888760A (en) Violation information monitoring method, device, equipment and medium based on software application
CN113626605A (en) Information classification method and device, electronic equipment and readable storage medium
CN114120347A (en) Form verification method and device, electronic equipment and storage medium
CN112712797A (en) Voice recognition method and device, electronic equipment and readable storage medium
CN111784499A (en) Service integration method and device based on cloud platform, electronic equipment and storage medium
CN111680513B (en) Feature information identification method and device and computer readable storage medium
CN113704478B (en) Text element extraction method, device, electronic equipment and medium
CN113793121A (en) Automatic litigation method and device for litigation cases, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination