CN115544560A

CN115544560A - Desensitization method and device for sensitive information, computer equipment and storage medium

Info

Publication number: CN115544560A
Application number: CN202211170352.2A
Authority: CN
Inventors: 李连钢
Original assignee: Ping An Property and Casualty Insurance Company of China Ltd
Current assignee: Ping An Property and Casualty Insurance Company of China Ltd
Priority date: 2022-09-22
Filing date: 2022-09-22
Publication date: 2022-12-30

Abstract

The embodiment of the application belongs to the field of information security and relates to a desensitization method for sensitive information, which comprises: creating a desensitization tool, and packaging a trained sensitive data identification model in the desensitization tool; responding to a sensitive data access request sent by a client, and obtaining response data corresponding to the sensitive data access request; identifying sensitive information in the response data through the sensitive data identification model; configuring desensitization rules of the sensitive information based on the access interface carried in the sensitive data access request, and issuing the desensitization rules to the desensitization tool; and desensitizing the sensitive information through the desensitization tool by using the desensitization rules to obtain desensitization data, and returning the desensitization data to the client. The application also provides a desensitization device for sensitive information, computer equipment and a storage medium. In addition, the application also relates to blockchain technology, and the sensitive information can be stored in a blockchain. The application can improve the security of response data and avoid privacy disclosure.

Description

Desensitization method and device for sensitive information, computer equipment and storage medium

Technical Field

The present application relates to the field of information security technologies, and in particular, to a desensitization method and apparatus for sensitive information, a computer device, and a storage medium.

Background

With the development of the information age, people pay more and more attention to the security of data. With the development of computer technology, it is increasingly common to acquire response data by sending a data access request to a server. However, the response data fed back by the server may include sensitive information such as an identification number, a mobile phone number, a card number, a client number, and the like, so the sensitive information needs to be desensitized. Traditional desensitization technology identifies and locates sensitive information by means such as regular expression matching, keyword code table mapping, data type definition discrimination, and data characteristic calculation, and then desensitizes it. The identification accuracy of these approaches is low, and sensitive information is easily missed, so that private data is indirectly leaked.

Disclosure of Invention

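An embodiment of the present application aims to provide a desensitization method and apparatus for sensitive information, a computer device, and a storage medium, so as to solve the problem in the related art that sensitive information is identified with low accuracy and is easily missed, which indirectly leaks private data.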

In order to solve the above technical problem, an embodiment of the present application provides a method for desensitizing sensitive information, which adopts the following technical solutions:

creating a desensitization tool, and packaging the trained sensitive data recognition model in the desensitization tool;

responding to a sensitive data access request sent by a client, and obtaining response data corresponding to the sensitive data access request;

identifying sensitive information in the response data through the sensitive data identification model;

configuring desensitization rules of the sensitive information based on an access interface carried in the sensitive data access request, and issuing the desensitization rules to the desensitization tool;

desensitizing the sensitive information by using the desensitization rule through the desensitization tool to obtain desensitization data, and returning the desensitization data to the client.

Further, before the step of packaging the trained sensitive data recognition model in the desensitization tool, the method further comprises:

acquiring a historical service data set, and performing word segmentation processing on service data in the historical service data set to obtain word segmentation data;

inputting the word segmentation data into a pre-constructed initial sensitive data recognition model, wherein the initial sensitive data recognition model comprises a word vector layer, a Bi-LSTM layer, a CRF layer and an output layer;

converting the word segments in the word segmentation data into word vectors through the word vector layer;

inputting the word vector into the Bi-LSTM layer for feature extraction to obtain a semantic feature vector;

inputting the semantic feature vector into the CRF layer for calculation, and outputting an optimal labeling sequence with the maximum probability;

and calculating a loss function value according to the optimal labeling sequence, adjusting model parameters of the initial sensitive data recognition model based on the loss function value, continuing iterative training until the model converges, and outputting a final sensitive data recognition model.

Further, the step of converting the word segments in the word segmentation data into word vectors through the word vector layer comprises:

coding each word segmentation and converting the word segmentation into a vocabulary vector;

inputting the vocabulary vectors into the word vector layer, and obtaining a word vector mapping table according to the context information of each word segmentation;

and obtaining a word vector corresponding to each word segmentation based on the word vector mapping table.

Further, the step of inputting the word vector into the Bi-LSTM layer for feature extraction to obtain a semantic feature vector includes:

performing feature extraction on the word vectors through a forward layer and a backward layer of the Bi-LSTM layer to respectively obtain forward hidden features and backward hidden features;

splicing the forward hidden features and the backward hidden features according to positions to obtain global hidden features;

and obtaining the semantic feature vector according to the global hidden feature.

Further, the step of configuring the desensitization rule of the sensitive information based on the access interface carried in the sensitive data access request includes:

determining a service type based on an access interface carried in the sensitive data access request;

and configuring a corresponding desensitization rule for the sensitive information according to the service type.

Further, the step of configuring a corresponding desensitization rule for the sensitive information according to the service type includes:

identifying first sensitive data and second sensitive data in the sensitive information according to the service type;

labeling corresponding first annotation information and second annotation information for the first sensitive data and the second sensitive data respectively;

and configuring a first desensitization rule and a second desensitization rule corresponding to the first annotation information and the second annotation information.

Further, after the step of configuring the corresponding desensitization rule for the sensitive information according to the service type, the method further includes:

establishing a mapping relation among the service type, the sensitive information and the desensitization rule;

and configuring the mapping relation into a desensitization rule table.

In order to solve the above technical problem, an embodiment of the present application further provides a desensitization apparatus for sensitive information, which adopts the following technical solution:

the building module is used for building a desensitization tool and packaging the trained sensitive data recognition model in the desensitization tool;

the response module is used for responding to a sensitive data access request sent by a client and obtaining response data corresponding to the sensitive data access request;

the identification module is used for identifying the sensitive information in the response data through the sensitive data identification model;

the configuration module is used for configuring desensitization rules of the sensitive information based on the access interface carried in the sensitive data access request and sending the desensitization rules to the desensitization tool;

and the desensitization module is used for desensitizing the sensitive information by using the desensitization rule through the desensitization tool to obtain desensitization data and returning the desensitization data to the client.

In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solutions:

the computer device includes a memory and a processor, the memory having computer readable instructions stored therein which, when executed by the processor, implement the steps of the method for desensitizing sensitive information as described above.

In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solutions:

the computer readable storage medium has stored thereon computer readable instructions which, when executed by a processor, implement the steps of the method of desensitizing sensitive information as described above.

Compared with the prior art, the embodiment of the application mainly has the following beneficial effects:

the method comprises the steps of creating a desensitization tool, and packaging a trained sensitive data identification model in the desensitization tool; responding to a sensitive data access request sent by a client, and obtaining response data corresponding to the sensitive data access request; identifying sensitive information in the response data through a sensitive data identification model; the desensitization rule of the sensitive information is configured on the basis of the access interface carried in the sensitive data access request, and the desensitization rule is issued to a desensitization tool; desensitizing the sensitive information by using a desensitization rule through a desensitization tool to obtain desensitization data, and returning the desensitization data to the client; according to the method and the device, the sensitive information in the response data is identified through the sensitive data identification model, the access interface corresponding to the identified sensitive information is used as the sensitive interface, and the corresponding desensitization rule is configured, so that the identification efficiency and accuracy of the sensitive information can be improved, meanwhile, the safety of the response data is improved, and privacy disclosure is avoided.

Drawings

In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.

FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;

FIG. 2 is a flow diagram of one embodiment of a method of desensitizing sensitive information according to the present application;

FIG. 3 is a flow diagram of another embodiment of a method of desensitizing sensitive information according to the present application;

FIG. 4 is a schematic diagram illustrating the structure of one embodiment of an apparatus for desensitizing sensitive information according to the present application;

FIG. 5 is a schematic block diagram of one embodiment of a computer device according to the present application.

Detailed Description

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the foregoing drawings are used for distinguishing between different objects and not for describing a particular sequential order.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein may be combined with other embodiments.

In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.

The application provides a desensitization method of sensitive information, which relates to artificial intelligence and can be applied to the system architecture 100 shown in FIG. 1. The system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired links, wireless communication links, or fiber optic cables.

The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to a smart phone, a tablet computer, an e-book reader, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a laptop computer, a desktop computer, and the like.

The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.

It should be noted that the desensitization method for sensitive information provided in the embodiment of the present application is generally executed by a server/terminal device, and accordingly, the desensitization apparatus for sensitive information is generally disposed in the server/terminal device.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

With continued reference to FIG. 2, a flow diagram of one embodiment of a method of desensitizing sensitive information according to the present application is shown, including the steps of:

Step S201, a desensitization tool is created, and the trained sensitive data recognition model is packaged in the desensitization tool.

The desensitization tool is a toolkit containing the desensitization rules of sensitive information under the different service types in the service system, and can be provided as an SDK package.

In this embodiment, a trained sensitive data recognition model is further packaged in the desensitization tool, and is used for efficiently and accurately recognizing sensitive information in data.

In this embodiment, referring to fig. 3, before the step of packaging the trained sensitive data recognition model in the desensitization tool, the method further includes:

Step S301, a historical service data set is obtained, and word segmentation processing is performed on the service data in the historical service data set to obtain word segmentation data.

The service system provides various types of services for users to handle, such as transfer service, loan service, insurance service and the like. It provides an interface for each service type, and the corresponding service is accessed by calling that interface. In an actual business scenario, a large amount of business data is generated. The business data comprises sensitive information and non-sensitive information. The sensitive information is labeled with a sensitive label, and entity labeling schemes such as BIO, BIOE, BIOES, BMES and the like can be used. For example, if the BMES scheme is used, B represents the first character of a piece of sensitive information, M represents a middle character, E represents the last character, and S represents single-character sensitive information.
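The BMES scheme described above can be illustrated with a minimal sketch. The function name `bmes_tags`, the character-offset span format, and the `O` tag for non-sensitive characters are assumptions made for illustration; the source only defines B, M, E and S.

```python
def bmes_tags(text, spans):
    """Tag each character B/M/E/S inside sensitive spans and O elsewhere.

    spans holds (start, end) character offsets with end exclusive.
    The O tag for non-sensitive characters is an assumption.
    """
    tags = ["O"] * len(text)
    for start, end in spans:
        if end - start == 1:
            tags[start] = "S"          # single-character sensitive item
            continue
        tags[start] = "B"              # first character
        for i in range(start + 1, end - 1):
            tags[i] = "M"              # middle characters
        tags[end - 1] = "E"            # last character
    return tags


text = "tel 15666"
print(list(zip(text, bmes_tags(text, [(4, 9)]))))
```

Each character receives exactly one tag, which is the form of supervision a sequence labeling model such as the one in this embodiment is trained on.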

In some optional implementations, the acquired historical service data set is preprocessed, including text deduplication, special symbol removal and the like, and the preprocessed service data is subjected to word segmentation processing to obtain word segmentation data. The historical service data set is randomly divided into a training data set and a testing data set according to a preset proportion, where the preset proportion of training data to testing data is 7:3.
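The preprocessing and 7:3 split could look roughly like the following sketch. The function names, the regular expression used to define "special symbols", and the fixed random seed are all assumptions; the source does not specify them.

```python
import random
import re


def preprocess(records):
    """Strip special symbols, then drop duplicates while keeping first occurrences."""
    seen, cleaned = set(), []
    for text in records:
        # Assumption: "special symbols" means anything that is not a word
        # character or whitespace.
        text = re.sub(r"[^\w\s]", "", text).strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


def split_dataset(records, train_ratio=0.7, seed=0):
    """Shuffle a copy of the records and split it into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


data = preprocess([f"record {i}!" for i in range(10)] + ["record 0"])
train, test = split_dataset(data)
print(len(train), len(test))
```

Shuffling before cutting keeps the split random, and the seed makes the split reproducible between runs.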

Step S302, inputting the word segmentation data into a pre-constructed initial sensitive data recognition model, wherein the initial sensitive data recognition model comprises a word vector layer, a Bi-LSTM layer, a CRF layer and an output layer.

The word segmentation data carrying the sensitive labels is input into the pre-constructed initial sensitive data recognition model, which comprises a word vector layer, a Bi-LSTM (bidirectional long short-term memory network) layer, a CRF (conditional random field) layer and an output layer. The word vector layer maps words from a sparse one-hot representation to dense vectors of lower dimensionality, that is, the words are represented by vectors that form a vector space, so that the text information is embedded into a mathematical space and can serve as the underlying input representation; this layer is the word embedding layer. The Bi-LSTM layer takes the word vectors as input and builds a higher-level feature representation of the context information. The CRF layer outputs the label sequence with the maximum probability, thereby identifying the sensitive information. The output layer directly outputs the predicted sensitive labels.

Step S303, converting the word segments in the word segmentation data into word vectors through the word vector layer.

Specifically, the word segmentation data is input into the word vector layer, and the context information contained in each data entity is converted into word vectors by using the Word2Vec algorithm, yielding the word vector corresponding to each word segment.
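The lookup side of the word vector layer (encode each word segment, build a mapping table, read off one vector per segment) can be sketched as follows. This is only an illustration: the table here is randomly initialised rather than trained with Word2Vec, and the names `build_vocab`, `init_embedding_table` and `embed`, the `<unk>` entry and the dimension of 4 are assumptions.

```python
import random


def build_vocab(segmented_sentences):
    """Assign an integer index to every distinct word segment."""
    vocab = {"<unk>": 0}  # assumption: index 0 is reserved for unseen segments
    for sentence in segmented_sentences:
        for token in sentence:
            vocab.setdefault(token, len(vocab))
    return vocab


def init_embedding_table(vocab, dim=4, seed=0):
    """Create a mapping table from index to vector (random here, not trained)."""
    rng = random.Random(seed)
    return {index: [rng.uniform(-1, 1) for _ in range(dim)] for index in vocab.values()}


def embed(sentence, vocab, table):
    """Look up the word vector of each segment, falling back to <unk>."""
    return [table[vocab.get(token, 0)] for token in sentence]


vocab = build_vocab([["phone", "number"], ["phone"]])
table = init_embedding_table(vocab)
print(embed(["phone", "unseen"], vocab, table))
```

In a trained model the table entries would be learned from the context of each segment, so that segments appearing in similar contexts end up with similar vectors.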

Step S304, inputting the word vectors into the Bi-LSTM layer for feature extraction to obtain semantic feature vectors.

The Long Short-Term Memory (LSTM) network is a recurrent neural network specially designed to solve the long-term dependency problem of ordinary RNNs (recurrent neural networks).

One disadvantage of a unidirectional LSTM is that the network can only use the preceding input and has no way to obtain the semantic information that follows the current word. This embodiment therefore uses a Bi-LSTM layer, which can make full use of both past and future context information for feature extraction.

The Bi-LSTM layer can automatically acquire sentence features. For each input sentence, two independent hidden layer representations are obtained with a forward-order and a reverse-order recurrent neural network; the two representations are then combined (by splicing or adding) to obtain the final hidden layer representation, which is sent to the CRF layer for subsequent calculation. For each word in the sentence, this hidden layer representation contains semantic information from both the preceding and the following context.
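The way forward and backward hidden features are spliced by position can be shown with a deliberately simplified sketch. A plain tanh recurrence with fixed scalar weights stands in for the LSTM cell, so this is not an LSTM; the function names and weights are assumptions chosen only to make the bidirectional wiring visible.

```python
import math


def run_direction(vectors, weight_in=0.5, weight_rec=0.5):
    """Run a toy tanh recurrence over the sequence, one hidden vector per position."""
    hidden = [0.0] * len(vectors[0])
    states = []
    for vec in vectors:
        hidden = [math.tanh(weight_in * x + weight_rec * h) for x, h in zip(vec, hidden)]
        states.append(hidden)
    return states


def bidirectional_features(vectors):
    """Concatenate forward and backward hidden states position by position."""
    forward = run_direction(vectors)
    # Process the reversed sequence, then flip the result back so that
    # backward[i] lines up with position i of the original sentence.
    backward = run_direction(vectors[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]


features = bidirectional_features([[1.0], [0.0], [-1.0]])
print(features)
```

Each output vector is twice as wide as a single direction's hidden state: the first half summarises everything up to that position and the second half everything after it.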

Step S305, inputting the semantic feature vectors into the CRF layer for calculation, and outputting the optimal labeling sequence with the maximum probability.

The CRF layer can effectively use sentence-level label information to further mine the relations between different pieces of sensitive information, and it imposes constraints that ensure the final prediction is valid. These constraints are learned automatically by the CRF layer from the training data.

The labeling probability of each word segment is calculated from the semantic feature vectors as follows:

The parameter of the CRF layer is a $(k+2) \times (k+2)$ matrix $A$, where $k$ is the number of labels; 2 is added because a start state and an end state are added to the head and the tail of the sentence, respectively. $A_{ij}$ represents the transition score from the $i$-th label to the $j$-th label, so that previously assigned labels are fully used when labeling a given position. Suppose the sentence in which sensitive information is to be identified is $x = (x_1, x_2, \ldots, x_n)$ and a label sequence of the same length as the sentence is $y = (y_1, y_2, \ldots, y_n)$. The score of the sentence $x$ with the label sequence $y$ is then:

$$s(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

where $A$ is the transition score matrix and $A_{y_i, y_{i+1}}$ is the score of transferring from label $y_i$ to label $y_{i+1}$; $y_0$ and $y_{n+1}$ are the start and end tags of the sentence, which is why the dimension of $A$ is $(k+2) \times (k+2)$; $P$ is the semantic feature matrix output by the Bi-LSTM layer, with dimension $n \times k$, and $P_{i, y_i}$ is the score of the $i$-th word in the sentence having the label $y_i$.

The score is normalized with a softmax function to obtain the probability of the label sequence $y$:

$$p(y \mid x) = \frac{\exp\big(s(x, y)\big)}{\sum_{y'} \exp\big(s(x, y')\big)}$$

where $y'$ ranges over all possible label sequences for the sentence $x$; that is, every candidate label sequence of the sentence has a score and a probability, and training maximizes the probability of the actual label sequence of the sentence.
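The sentence score and its softmax normalization can be checked on a tiny example by enumerating every label sequence. This brute-force sketch is only practical for very small inputs and assumes that the start and end states occupy the last two rows and columns of the transition matrix; the function names are illustrative.

```python
import math
from itertools import product


def sequence_score(emissions, transitions, tags):
    """Sum the transition scores along start -> tags -> end plus the emission scores."""
    k = len(emissions[0])
    start, end = k, k + 1  # assumption: indices of the two extra states
    path = [start, *tags, end]
    transition_part = sum(transitions[a][b] for a, b in zip(path, path[1:]))
    emission_part = sum(emissions[i][t] for i, t in enumerate(tags))
    return transition_part + emission_part


def sequence_probability(emissions, transitions, tags):
    """Softmax of the score of `tags` over all k**n candidate label sequences."""
    n, k = len(emissions), len(emissions[0])
    scores = [sequence_score(emissions, transitions, list(c))
              for c in product(range(k), repeat=n)]
    log_z = math.log(sum(math.exp(s) for s in scores))
    return math.exp(sequence_score(emissions, transitions, tags) - log_z)


emissions = [[1.0, 0.0], [0.0, 1.0]]          # n = 2 words, k = 2 labels
transitions = [[0.0] * 4 for _ in range(4)]   # (k + 2) x (k + 2)
print(sequence_probability(emissions, transitions, [0, 1]))
```

Real implementations avoid the enumeration by computing the normalizer with a forward (dynamic programming) pass, which costs on the order of n·k² operations instead of kⁿ.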

Step S306, calculating a loss function value according to the optimal labeling sequence, adjusting model parameters of the initial sensitive data recognition model based on the loss function value, continuing iterative training until the model converges, and outputting the final sensitive data recognition model.

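Specifically, the loss function is designed as the negative log-likelihood of the actual labeling sequence, and training seeks the minimum loss function value:

$$Loss = -\log p(y \mid x) = -s(x, y) + \log \sum_{y'} \exp\big(s(x, y')\big)$$

where $s(x, y)$ is the score of the sentence $x$ with the label sequence $y$ and $p(y \mid x)$ is its normalized probability. Finally, the optimal labeling sequence is calculated with the Viterbi algorithm:

$$y^{*} = \arg\max_{y'} s(x, y')$$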
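The Viterbi decoding step can be sketched as follows. The layout of the inputs is an assumption: `emissions` is an n × k list of per-word label scores and `transitions` is a (k + 2) × (k + 2) list whose last two indices are the start and end states.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence and its score."""
    n, k = len(emissions), len(emissions[0])
    start, end = k, k + 1  # assumption: the two extra states come last
    # best[t] is the best score of any partial path ending in label t.
    best = [transitions[start][t] + emissions[0][t] for t in range(k)]
    backpointers = []
    for i in range(1, n):
        step, pointers = [], []
        for t in range(k):
            prev = max(range(k), key=lambda p: best[p] + transitions[p][t])
            step.append(best[prev] + transitions[prev][t] + emissions[i][t])
            pointers.append(prev)
        best = step
        backpointers.append(pointers)
    last = max(range(k), key=lambda t: best[t] + transitions[t][end])
    score = best[last] + transitions[last][end]
    # Walk the backpointers from the last position to the first.
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1], score


emissions = [[1.0, 0.0], [0.0, 3.0]]
transitions = [[0.0] * 4 for _ in range(4)]
transitions[0][1] = -5.0  # discourage moving from label 0 to label 1
print(viterbi(emissions, transitions))
```

Because each position only keeps the best path into every label, the search visits on the order of n·k² combinations rather than all kⁿ sequences.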

The model parameters are adjusted according to the loss function value and iterative training continues. Once the model has been trained to a certain degree, its performance reaches an optimal state and the loss function value can no longer decrease, that is, the model has converged. To judge convergence, it is only necessary to compare the loss function values of two successive iterations: if the loss function value is still changing, training data continues to be selected for iterative training of the model; if the loss function value no longer changes significantly, the model can be considered converged, and the final sensitive data recognition model is output.
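The convergence test on two successive loss values could be written roughly like this. The tolerance, the iteration cap and the `step` callback (one training iteration that returns the current loss) are assumptions; the source only says the loss should no longer change significantly.

```python
def train_until_converged(step, tolerance=1e-4, max_iterations=1000):
    """Call step() repeatedly until two successive losses differ by less than tolerance.

    Returns the last loss and the number of iterations that were run.
    """
    previous = step()
    for iteration in range(2, max_iterations + 1):
        current = step()
        if abs(previous - current) < tolerance:
            return current, iteration
        previous = current
    return previous, max_iterations


# Stand-in for real training: replay a fixed sequence of loss values.
losses = iter([1.0, 0.5, 0.25, 0.25, 0.1])
print(train_until_converged(lambda: next(losses)))
```

The iteration cap guards against a loss that keeps oscillating and would otherwise never satisfy the tolerance.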

In some optional implementations, after the sensitive data recognition model is obtained, it is evaluated with the following test indexes: precision P (Precision), recall R (Recall), and F value (F-Score).

It should be noted that the F value is the harmonic mean of precision and recall. It takes the influence of both into account and is equivalent to a comprehensive evaluation index of precision and recall; therefore, the F value is used as the main evaluation index of the model in this embodiment.

Specifically, the test data set is input into the sensitive data identification model, a labeling result is output, a sensitive information identification evaluation standard F value is calculated based on the labeling result, the F value is used as an evaluation result, and the sensitive data identification model is evaluated according to the F value.

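The evaluation indexes are calculated as follows:

Precision:

$$P = \frac{N_{\text{correct}}}{N_{\text{identified}}} \times 100\%$$

Recall:

$$R = \frac{N_{\text{correct}}}{N_{\text{actual}}} \times 100\%$$

F value:

$$F = \frac{2 \times P \times R}{P + R}$$

where $N_{\text{correct}}$ is the number of correctly identified pieces of sensitive information, $N_{\text{identified}}$ is the total number of pieces identified by the model, and $N_{\text{actual}}$ is the number of pieces of sensitive information actually present in the test data set.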

wherein, the higher the F value is, the more accurate the sensitive information identification of the model is. When the F value is larger than or equal to a preset threshold value, the sensitive data recognition model meets the condition; and when the F value is smaller than the preset threshold, the training data set is acquired again, and the steps S301 to S306 are executed until the F value is larger than or equal to the preset threshold.
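The evaluation can be sketched with set arithmetic over identified and labeled items. Representing each piece of sensitive information as a `(start, end, label)` tuple, the function name `evaluate` and the example threshold of 0.9 are assumptions; the source does not state the threshold value.

```python
def evaluate(predicted, gold):
    """Return precision, recall and F value for two sets of sensitive items."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


predicted = {(0, 4, "PHONE"), (10, 12, "NAME"), (20, 25, "ID")}
gold = {(0, 4, "PHONE"), (20, 25, "ID"), (30, 33, "ADDR"), (40, 44, "CARD")}
precision, recall, f_value = evaluate(predicted, gold)

F_THRESHOLD = 0.9  # hypothetical preset threshold
print(precision, recall, f_value, f_value >= F_THRESHOLD)
```

An item only counts as correct when its position and label both match, so partially overlapping detections are treated as misses in this sketch.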

In the embodiment, the efficiency and the accuracy of sensitive information identification can be improved by training the constructed sensitive data identification model.

Step S202, responding to the sensitive data access request sent by the client, and obtaining response data corresponding to the sensitive data access request.

In this embodiment, the desensitization tool is further configured to monitor a sensitive data access request sent by the client, intercept response data, and asynchronously analyze whether the response data contains sensitive information, where the sensitive information includes, but is not limited to, a name, a mobile phone number, an identification number, a bank card number, an address, and the like.

The data access request refers to an access request sent when the client needs to acquire network data, and the sensitive data access request refers to an access request sent when the client needs to access a sensitive function to acquire data. The response data is network data which is fed back to the client side by the access object according to the access request of the client side and is to be acquired by the client side.

Specifically, whether the access interface receives a sensitive data access request sent by the client is monitored in real time through the desensitization tool. When the client sends an access request to the access interface, the access interface feeds back response data to the client according to the access request of the client, and the desensitization tool intercepts and acquires the response data fed back to the client by an access object when monitoring that the client sends a sensitive data access request to the access interface.
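The interception described above can be pictured as a wrapper around the request handler. This is a synchronous sketch, whereas the embodiment analyzes responses asynchronously; the digit-run detector stands in for the sensitive data identification model, and the interface paths, handler and function names are all hypothetical.

```python
def looks_sensitive(response):
    """Stand-in detector: flags any string value made of 10 or more digits."""
    return any(isinstance(v, str) and v.isdigit() and len(v) >= 10
               for v in response.values())


def intercept_responses(handler, detector, sensitive_interfaces):
    """Wrap a request handler so every response is inspected before it is returned."""
    def wrapped(interface, request):
        response = handler(interface, request)
        if detector(response):
            sensitive_interfaces.add(interface)  # record the interface for later rule configuration
        return response
    return wrapped


def demo_handler(interface, request):
    # Hypothetical backend: only the profile interface returns a phone number.
    if interface == "/user/profile":
        return {"name": "user-1", "phone": "1566668888"}
    return {"status": "ok"}


flagged = set()
guarded = intercept_responses(demo_handler, looks_sensitive, flagged)
guarded("/user/profile", {})
guarded("/health", {})
print(flagged)
```

Keeping the detector as a parameter means the wrapper does not need to change when the stand-in is replaced by a trained model.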

Step S203, identifying the sensitive information in the response data through the sensitive data identification model.

The desensitization tool intercepts and acquires the response data of the access object and passes it to the sensitive data identification model, where it is processed in turn by the word vector layer, the Bi-LSTM layer, the CRF layer and the output layer, which outputs the identified sensitive information.

It is emphasized that to further ensure the privacy and security of sensitive information, the sensitive information may also be stored in a node of a blockchain.

The block chain referred by the application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.

Step S204, configuring a desensitization rule of the sensitive information based on the access interface carried in the sensitive data access request, and issuing the desensitization rule to the desensitization tool.

In this embodiment, the access object may be an application program or a function in the application program. The client accesses the corresponding application program through the access interface. If the sensitive data identification model in the desensitization tool identifies that the response data fed back by the application program for the sensitive data access request contains sensitive information, the corresponding access interface is asynchronously recorded and marked as a sensitive interface, a corresponding desensitization rule is configured for the sensitive interface, and the desensitization rule is issued to the desensitization tool.
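One possible shape for the mapping among access interface, service type, sensitive information and desensitization rule is a small lookup table. Everything concrete below (the interface paths, service type names, field names and the `MaskRule` structure) is hypothetical and only illustrates the lookup, not the patent's actual rule format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaskRule:
    keep_head: int        # characters shown at the front
    keep_tail: int        # characters shown at the back
    mask_char: str = "*"  # preset non-sensitive replacement character


# Hypothetical mapping from access interface to service type.
INTERFACE_SERVICE_TYPE = {
    "/loan/apply": "loan",
    "/insurance/quote": "insurance",
}

# Hypothetical desensitization rule table keyed by (service type, sensitive field).
RULE_TABLE = {
    ("loan", "phone"): MaskRule(3, 4),
    ("loan", "id_card"): MaskRule(3, 4),
    ("insurance", "phone"): MaskRule(0, 4),
}


def rules_for_interface(interface):
    """Resolve the service type of an interface and collect its field rules."""
    service_type = INTERFACE_SERVICE_TYPE.get(interface)
    return {field: rule for (stype, field), rule in RULE_TABLE.items()
            if stype == service_type}


print(rules_for_interface("/loan/apply"))
```

Keying the table by service type rather than by interface lets several interfaces of the same service share one set of rules.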

Step S205, desensitizing the sensitive information by using the desensitization rule through the desensitization tool to obtain desensitization data, and returning the desensitization data to the client.

In this embodiment, different desensitization rules may be configured for different sensitive information: characters in a preset range of positions and/or at several specified positions of the sensitive information are retained, and the characters at the other positions are replaced with preset non-sensitive characters; alternatively, some of the characters in the sensitive information may be deleted.

For example, suppose the data to be desensitized is the mobile phone number 1566668888 and the configured desensitization rule is (3, 4, "*"). Here 3 and 4 form the replacement position instruction: 3 is the number of characters displayed from the front of the data to be desensitized, and 4 is the number of characters displayed from the back. That is, the first 3 characters and the last 4 characters are displayed, and the remaining middle characters are replaced with the preset non-sensitive character "*", so desensitization outputs 156***8888. Similarly, if the data to be desensitized is the identification number 110100000006024713 and the configured desensitization rule is (ID Card No, 3, 4, "*"), desensitization outputs 110***********4713.
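A keep-head, keep-tail masking rule of the kind in this example can be sketched in a few lines. The function name `mask`, the `*` default (the replacement character the example appears to use) and the choice to return short values unchanged are assumptions.

```python
def mask(value, keep_head, keep_tail, mask_char="*"):
    """Keep the first keep_head and last keep_tail characters and mask the rest."""
    hidden = len(value) - keep_head - keep_tail
    if hidden <= 0:
        # Nothing left to mask; a stricter policy might mask the whole value instead.
        return value
    return value[:keep_head] + mask_char * hidden + value[len(value) - keep_tail:]


print(mask("1566668888", 3, 4))
print(mask("110100000006024713", 3, 4))
```

The number of mask characters equals the number of hidden characters, so the masked value keeps its original length.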

Desensitizing sensitive information with desensitization rules effectively protects the security of business data, and because the rules are configurable, they can be adapted to a variety of usage scenarios.

The desensitization tool in this embodiment can be connected to any service system. It automatically monitors whether the service system exposes a sensitive function and notifies the accessing party by mail, short message, or similar means, so that security problems can be found in advance and the security and reliability of the system are improved.

The development-test scenario is described in detail below.

The service system is connected to the desensitization tool of this embodiment. Each time a function test is performed before going online, the desensitization tool monitors the client's access requests to the newly developed function and intercepts the response data generated for each access request. The sensitive data identification model in the desensitization tool then examines the response data. If sensitive information is identified in the response data, the access interface carried in the corresponding access request is determined to be a sensitive interface, the sensitive interface is recorded, and the development and test personnel are notified of it. The desensitization rule that the development and test personnel configure for the sensitive interface is received and issued to the desensitization tool, which desensitizes the sensitive information with the rule and returns the desensitization data to the client.
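The flow described above (intercept the response, identify sensitive information, record and report the sensitive interface, receive a rule, desensitize) can be sketched as follows. This is only an illustration under stated assumptions: the class `DesensitizationTool`, its method names, and the interface path are hypothetical, a regular expression for 11-digit runs stands in for the trained sensitive data identification model, and fully masking a value when no rule has been issued yet is a choice made here, not something the application specifies.

```python
import re
from typing import Callable, Dict, List

# Stand-in for the trained identification model (assumption): flag 11-digit runs.
PHONE = re.compile(r"\d{11}")


class DesensitizationTool:
    def __init__(self) -> None:
        self.sensitive_interfaces: List[str] = []         # recorded sensitive interfaces
        self.rules: Dict[str, Callable[[str], str]] = {}  # interface -> issued rule
        self.notifications: List[str] = []                # stand-in for mail or short message

    def issue_rule(self, interface: str, rule: Callable[[str], str]) -> None:
        """Receive the rule configured by development and test personnel."""
        self.rules[interface] = rule

    def handle(self, interface: str, response: str) -> str:
        """Inspect intercepted response data and return the desensitized data."""
        if not PHONE.search(response):
            return response  # no sensitive information identified
        if interface not in self.sensitive_interfaces:
            self.sensitive_interfaces.append(interface)
            self.notifications.append(f"sensitive interface detected: {interface}")
        # Assumption: mask the whole value until a rule has been issued.
        rule = self.rules.get(interface, lambda s: "*" * len(s))
        return PHONE.sub(lambda m: rule(m.group(0)), response)


tool = DesensitizationTool()
first = tool.handle("/api/customer", "tel=15666668888")
tool.issue_rule("/api/customer", lambda s: s[:3] + "*" * (len(s) - 7) + s[-4:])
second = tool.handle("/api/customer", "tel=15666668888")
```

The first call records the interface and sends one notification; the second call applies the rule issued in between.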

In this application, the sensitive information in the response data is identified by the sensitive data identification model, the access interface corresponding to the identified sensitive information is treated as a sensitive interface, and a corresponding desensitization rule is configured for it. This improves the efficiency and accuracy of sensitive information identification, improves the security of the response data, avoids privacy disclosure, and further improves the security and reliability of the system.

In some optional implementations of this embodiment, the step of converting the participles in the participle data into word vectors through the word vector layer includes:

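coding each participle and converting the participle into a vocabulary vector;

inputting the vocabulary vector into the word vector layer, and obtaining a word vector mapping table according to the context information of each participle;

and obtaining a word vector corresponding to each participle based on the word vector mapping table.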

In this embodiment, each participle is encoded using one-hot encoding. Before encoding, the number of distinct characters in the sample space is determined, and each character is then converted by one-hot encoding. One-hot encoding, also known as one-bit-effective encoding, uses an N-bit status register to encode N states, with exactly one bit active at any time. The encoded participles are then converted into vectors: the dimension of the embedding vector is set, the one-hot code corresponding to each character is converted into a low-dimensional dense vector, and the numerical vector representation of the character is finally obtained.
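As a minimal sketch of the two conversions described above, the following Python code builds one-hot codes for a small vocabulary and turns them into dense vectors by multiplying with an embedding table. The token list, the embedding dimension of 2, and the randomly initialized table are assumptions for illustration; in a trained model the table values would be learned. The table is stored here with one row per token, which is an orientation chosen for the sketch.

```python
import random


def build_vocab(tokens):
    """Assign each distinct token an index in first-seen order."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


def one_hot(index, size):
    """N-bit code in which exactly one bit is active."""
    vec = [0] * size
    vec[index] = 1
    return vec


def embed(one_hot_vec, table):
    """Multiply a one-hot row vector by an N x d table; this selects one row."""
    dim = len(table[0])
    return [
        sum(one_hot_vec[i] * table[i][j] for i in range(len(table)))
        for j in range(dim)
    ]


tokens = ["customer", "phone", "number", "phone"]
vocab = build_vocab(tokens)                      # 3 distinct tokens
rng = random.Random(0)
table = [[rng.uniform(-1, 1) for _ in range(2)] for _ in vocab]  # N x d, d = 2
phone_code = one_hot(vocab["phone"], len(vocab))
phone_vector = embed(phone_code, table)
```

Because a one-hot code has a single active bit, the multiplication reduces to a table lookup, which is why the dense vector for a token can be read directly from the mapping table.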

The Skip-gram algorithm maps the vocabulary vectors into a word vector mapping table according to the context information of each participle. Each column in the word vector mapping table corresponds one-to-one to a participle, so the word vector corresponding to each participle can be obtained from the word vector mapping table.
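Skip-gram learns a participle's vector by predicting the participles around it, so its training data consists of (center, context) pairs drawn from a sliding window. The sketch below shows only this pair-generation step; the function name `skipgram_pairs` and the window size are assumptions, and the training of the mapping table itself is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["policy", "holder", "phone"], window=1)
```

Each pair supplies one prediction target during training, which is how context information ends up encoded in the resulting word vectors.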

The word vector of each participle containing context information is obtained through the word vector layer, so that the obtained word vector is more accurate, and the accuracy of subsequent sensitive information identification is ensured.

In this embodiment, the step of inputting the word vector into the Bi-LSTM layer for feature extraction to obtain the semantic feature vector includes:

splicing the forward hidden feature and the backward hidden feature according to positions to obtain a global hidden feature;

The input of the Bi-LSTM layer is the word vector sequence of the words in each sentence. The forward hidden features $\overrightarrow{h_t}$ of the word vectors are obtained through the forward layer of the Bi-LSTM layer, and the backward hidden features $\overleftarrow{h_t}$ of the word vectors are obtained through the backward layer of the Bi-LSTM layer.

The hidden states output by the forward hidden feature and the backward hidden feature at each position are spliced by position to obtain $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}] \in \mathbb{R}^{m}$, and the complete global hidden feature $(h_1, h_2, \ldots, h_n) \in \mathbb{R}^{n \times m}$ is then obtained.

Before the next layer, a dropout mechanism is applied to mitigate overfitting. After dropout, the hidden state vector is mapped from m dimensions to k dimensions, where k is the number of sensitive labels, which yields the automatically learned sentence feature P. The sentence feature P is the semantic feature vector and is expressed as $P = (P_1, P_2, \ldots, P_n) \in \mathbb{R}^{n \times k}$. Each dimension $P_{ij}$ of $P_i \in \mathbb{R}^{k}$ can be regarded as the probability value that the word $x_i$ is classified to the j-th label.

The embodiment can fully utilize the past and future context information and improve the accuracy of semantic feature vector extraction.
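The splicing and the m-to-k mapping can be sketched in plain Python. The hidden states below are made-up numbers rather than the output of a real Bi-LSTM, the weight matrix `weights` and bias `bias` are hypothetical parameters of the mapping, and dropout is omitted because it is only active during training.

```python
def splice(forward, backward):
    """Concatenate forward and backward hidden states position by position."""
    if len(forward) != len(backward):
        raise ValueError("both directions must cover the same n positions")
    return [f + b for f, b in zip(forward, backward)]


def project(hidden, weights, bias):
    """Map each m-dimensional state h_i to k label scores: P_i = h_i W + b."""
    k = len(bias)
    return [
        [sum(h[r] * weights[r][c] for r in range(len(h))) + bias[c] for c in range(k)]
        for h in hidden
    ]


# n = 2 positions, 2 units per direction, so m = 4; k = 3 labels.
forward = [[1.0, 0.0], [0.5, 0.5]]
backward = [[0.0, 1.0], [1.0, 0.0]]
weights = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]  # m x k
bias = [0.0, 0.0, 0.0]

H = splice(forward, backward)   # n x m global hidden feature
P = project(H, weights, bias)   # n x k sentence feature
```

Row i of `P` holds one value per label for word i, which is the form the CRF layer consumes.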

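In some optional implementation manners, the step of configuring the desensitization rule of the sensitive information based on the access interface carried in the sensitive data access request includes:

determining the service type based on the access interface carried in the sensitive data access request;

and configuring a corresponding desensitization rule for the sensitive information according to the service type.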

In this embodiment, the service system provides interfaces to be called according to service type. Calling an interface accesses the corresponding service, that is, the service type corresponding to that interface, and different service types process sensitive information in different ways. For example, some service types can desensitize sensitive information directly, whereas other service types need to check the sensitive information, so some sensitive information cannot be desensitized.

Specifically, first sensitive data and second sensitive data in the sensitive information are identified according to the service type; the first sensitive data and the second sensitive data are labeled with corresponding first annotation information and second annotation information, respectively; and a first desensitization rule and a second desensitization rule corresponding to the first annotation information and the second annotation information are configured.

The first sensitive data is sensitive information which needs desensitization processing, and the second sensitive data is sensitive information which does not need desensitization processing. The annotation information is used for representing the sensitive information, and the sensitive information is defined by the @ sensiveinfo annotation method to obtain the annotation information of the sensitive information.

In this embodiment, the first sensitive data is labeled with the first annotation information, and the second sensitive data is labeled with the second annotation information. A first desensitization rule corresponding to the first annotation information is configured. The first desensitization rule replaces some or all of the characters in the sensitive information field with preset characters. For example, for a customer name, either the surname or the given name may be replaced with the preset character, or both may be replaced; for a certificate number, either part of the number or the whole number may be replaced with preset characters. It should be noted that the preset character may be "x" or another non-sensitive character.

A second desensitization rule corresponding to the second annotation information is configured. The second desensitization rule records access information in a preset format and notifies the relevant personnel, so as to avoid abnormal access.
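The difference between the two rules can be sketched as follows: the first rule alters the value, while the second leaves the value intact but records the access and raises a notice. The annotation labels "first" and "second", the `AccessLog` class, the log format, and the interface paths are assumptions for illustration; the application only states that access information is recorded in a preset format and that related personnel are notified.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AccessLog:
    entries: List[str] = field(default_factory=list)  # recorded access information
    notices: List[str] = field(default_factory=list)  # stand-in for notifying personnel


def first_rule(value: str, preset: str = "x") -> str:
    """First rule: replace every character with the preset character."""
    return preset * len(value)


def second_rule(value: str, interface: str, log: AccessLog) -> str:
    """Second rule: return the value unchanged, but record and report the access."""
    log.entries.append(f"interface={interface};length={len(value)}")
    log.notices.append(f"undesensitized sensitive data accessed via {interface}")
    return value


def apply_rule(annotation: str, value: str, interface: str, log: AccessLog) -> str:
    """Dispatch on the annotation attached to the sensitive data."""
    if annotation == "first":
        return first_rule(value)
    if annotation == "second":
        return second_rule(value, interface, log)
    raise ValueError(f"unknown annotation: {annotation}")


log = AccessLog()
masked = apply_rule("first", "A1234567", "/claims/query", log)
checked = apply_rule("second", "A1234567", "/claims/verify", log)
```

Only the second call touches the log, which keeps the audit trail limited to accesses where the sensitive value actually leaves the system unmasked.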

In this embodiment, desensitization rules for sensitive information are configured according to the service type. This protects the security of the desensitization data and, at the same time, guards against abnormal access to sensitive information that cannot be desensitized, thereby protecting that information as well.

In some optional implementation manners, after the step of configuring the corresponding desensitization rule for the sensitive information according to the service type, the method further includes:

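establishing a mapping relation among the service types, the sensitive information and the desensitization rule;

and configuring the mapping relation into a desensitization rule table.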

A desensitization rule table is configured, and the service type, the sensitive information, and the desensitization rule are stored in it. After the service type is determined from the access interface, the corresponding desensitization rule is looked up by service type and sensitive information and used for desensitization, which improves the configuration efficiency of desensitization rules.
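A desensitization rule table of this kind can be modeled as a dictionary keyed by service type and sensitive information type. Everything named below is hypothetical: the interface paths, the service types "claims" and "underwriting", the information type "mobile", and the fallback of masking the whole value when no entry matches.

```python
from typing import Callable, Dict, Tuple

Rule = Callable[[str], str]


def mask_all(value: str) -> str:
    return "*" * len(value)


def keep_last4(value: str) -> str:
    """Show only the last four characters; mask short values completely."""
    if len(value) <= 4:
        return mask_all(value)
    return "*" * (len(value) - 4) + value[-4:]


# Hypothetical mapping from access interface to service type.
INTERFACE_TO_SERVICE: Dict[str, str] = {
    "/claims/query": "claims",
    "/policy/detail": "underwriting",
}

# Rule table: (service type, sensitive information type) -> rule.
RULE_TABLE: Dict[Tuple[str, str], Rule] = {
    ("claims", "mobile"): keep_last4,
    ("underwriting", "mobile"): mask_all,
}


def desensitize(interface: str, info_type: str, value: str) -> str:
    """Determine the service type from the interface, then look up and apply the rule."""
    service = INTERFACE_TO_SERVICE[interface]
    rule = RULE_TABLE.get((service, info_type), mask_all)  # assumed fallback
    return rule(value)


claims_view = desensitize("/claims/query", "mobile", "15666668888")
underwriting_view = desensitize("/policy/detail", "mobile", "15666668888")
```

With the table in place, supporting a new service type means adding entries rather than changing the desensitization code.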

The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by computer-readable instructions that direct the relevant hardware. The computer-readable instructions can be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, or a Read-Only Memory (ROM), or it may be a Random Access Memory (RAM).

It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages. These sub-steps or stages are not necessarily performed at the same time and may be performed at different times; nor are they necessarily performed in sequence, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

With further reference to fig. 4, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a desensitization apparatus for sensitive information, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied in various electronic devices.

As shown in fig. 4, the desensitizing apparatus 400 for sensitive information according to the present embodiment includes: a creation module 401, a response module 402, an identification module 403, a configuration module 404, and a desensitization module 405. Wherein:

the creating module 401 is configured to create a desensitization tool, and package the trained sensitive data recognition model in the desensitization tool;

the response module 402 is configured to respond to a sensitive data access request sent by a client, and obtain response data corresponding to the sensitive data access request;

the identification module 403 is configured to identify the sensitive information in the response data through the sensitive data identification model;

the configuration module 404 is configured to configure a desensitization rule of the sensitive information based on an access interface carried in the sensitive data access request, and send the desensitization rule to the desensitization tool;

the desensitization module 405 is configured to desensitize the sensitive information by using the desensitization rule through the desensitization tool, obtain desensitization data, and return the desensitization data to the client.

According to the desensitization device based on the sensitive information, the sensitive information in the response data is identified through the sensitive data identification model, the access interface corresponding to the identified sensitive information is used as the sensitive interface, and the corresponding desensitization rule is configured, so that the identification efficiency and accuracy of the sensitive information can be improved, meanwhile, the security of the response data is improved, privacy disclosure is avoided, and the safety and reliability of the system are further improved.

In some optional implementation manners of this embodiment, the desensitization apparatus 400 for sensitive information further includes a training module, where the training module includes a word segmentation sub-module, an input sub-module, a conversion sub-module, a feature extraction sub-module, a calculation sub-module, and an iteration sub-module, where:

the word segmentation submodule is used for acquiring a historical service data set and performing word segmentation processing on service data in the historical service data set to obtain word segmentation data;

the input submodule is used for inputting the word segmentation data into a pre-constructed initial sensitive data recognition model, and the initial sensitive data recognition model comprises a word vector layer, a Bi-LSTM layer, a CRF layer and an output layer;

the conversion submodule is used for converting the participles in the participle data into word vectors through the word vector layer;

the feature extraction sub-module is used for inputting the word vectors into the Bi-LSTM layer for feature extraction to obtain semantic feature vectors;

the calculation submodule is used for inputting the semantic feature vector into the CRF layer for calculation and outputting an optimal labeling sequence with the maximum probability;

and the iteration submodule is used for calculating a loss function value according to the optimal labeling sequence, adjusting model parameters of the initial sensitive data recognition model based on the loss function value, continuing iterative training until the model is converged, and outputting a final sensitive data recognition model.

By training the constructed sensitive data recognition model, the efficiency and accuracy of sensitive information recognition can be improved.
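Finding the labeling sequence with the maximum probability in a linear-chain CRF is commonly done with Viterbi decoding, which the sketch below illustrates. The application does not name the decoding algorithm, so Viterbi is an assumption here, as are the function name and the score matrices: `emissions[i][j]` plays the role of the semantic feature value for word i and label j, and `transitions[a][b]` is a hypothetical score for label b following label a.

```python
def viterbi(emissions, transitions):
    """Return the label index sequence with the highest total score.

    emissions:   n x k scores, one row per position
    transitions: k x k scores, transitions[a][b] for label a followed by b
    """
    n, k = len(emissions), len(emissions[0])
    score = list(emissions[0])   # best score of any path ending in each label
    backptr = []                 # best previous label for each position and label
    for i in range(1, n):
        new_score, ptr = [], []
        for b in range(k):
            best_a = max(range(k), key=lambda a: score[a] + transitions[a][b])
            ptr.append(best_a)
            new_score.append(score[best_a] + transitions[best_a][b] + emissions[i][b])
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final label.
    path = [max(range(k), key=lambda j: score[j])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


emissions = [[3.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
free = [[0.0, 0.0], [0.0, 0.0]]          # no transition preference
penalized = [[0.0, -5.0], [0.0, 0.0]]    # discourage label 0 followed by label 1

path_free = viterbi(emissions, free)
path_penalized = viterbi(emissions, penalized)
```

The two runs differ only in the transition scores, which shows how a CRF layer can override per-word scores to keep the label sequence consistent.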

In this embodiment, the converting submodule is further configured to:

inputting the vocabulary vector into the word vector layer, and obtaining a word vector mapping table according to the context information of each participle;

In this embodiment, the feature extraction sub-module is further configured to:

In some optional implementations of this embodiment, the configuration module 404 includes a determination submodule and a configuration submodule, where:

the determining submodule is used for determining the service type based on the access interface carried in the sensitive data access request;

and the configuration sub-module is used for configuring a corresponding desensitization rule for the sensitive information according to the service type.

In this embodiment, the configuration submodule is further configured to:

In some optional implementations, the configuration module 404 further includes a mapping sub-module for:

establishing a mapping relation among the service types, the sensitive information and the desensitization rule;

and configuring the mapping relation into a desensitization rule table.

According to the embodiment, the corresponding desensitization rule is called according to the service type and the sensitive information to perform desensitization treatment, so that the configuration efficiency of the desensitization rule is improved.

In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 5, fig. 5 is a block diagram of a basic structure of a computer device according to the present embodiment.

The computer device 5 comprises a memory 51, a processor 52, and a network interface 53 communicatively connected to each other via a system bus. It is noted that only a computer device 5 having components 51-53 is shown, but it is understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.

The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user in a keyboard mode, a mouse mode, a remote controller mode, a touch panel mode or a voice control equipment mode.

The memory 51 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 51 may be an internal storage unit of the computer device 5, such as a hard disk or a memory of the computer device 5. In other embodiments, the memory 51 may also be an external storage device of the computer device 5, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, provided on the computer device 5. Of course, the memory 51 may also comprise both an internal storage unit of the computer device 5 and an external storage device thereof. In this embodiment, the memory 51 is generally used for storing an operating system installed in the computer device 5 and various types of application software, such as computer readable instructions of a desensitization method of sensitive information. Further, the memory 51 may also be used to temporarily store various types of data that have been output or are to be output.

The processor 52 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 52 is typically used to control the overall operation of the computer device 5. In this embodiment, the processor 52 is configured to execute computer readable instructions stored in the memory 51 or to process data, such as computer readable instructions for executing a desensitization method of the sensitive information.

The network interface 53 may comprise a wireless network interface or a wired network interface, and the network interface 53 is generally used for establishing communication connections between the computer device 5 and other electronic devices.

In this embodiment, when the processor executes the computer readable instructions stored in the memory, the steps of the desensitization method for sensitive information according to the above embodiments are implemented, the sensitive information in the response data is identified by the sensitive data identification model, and the access interface corresponding to the identified sensitive information is used as the sensitive interface, and the corresponding desensitization rule is configured, so that the identification efficiency and accuracy of the sensitive information can be improved, meanwhile, the security of the response data is improved, privacy disclosure is avoided, and the security and reliability of the system are further improved.

The present application further provides another embodiment, that is, a computer-readable storage medium is provided, where the computer-readable instructions are stored, and the computer-readable instructions are executable by at least one processor, so as to cause the at least one processor to perform the steps of the desensitization method for sensitive information as described above, identify the sensitive information in the response data through a sensitive data identification model, and configure corresponding desensitization rules by using an access interface corresponding to the identified sensitive information as a sensitive interface, so as to improve identification efficiency and accuracy of the sensitive information, and at the same time, improve security of the response data, avoid privacy disclosure, and further improve security and reliability of the system.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.

It should be understood that the above-described embodiments are merely some, and not all, of the embodiments of the present application, and that the drawings illustrate preferred embodiments of the present application without limiting the scope of the claims. This application can be embodied in many different forms; the embodiments are provided so that the disclosure of the application will be understood thoroughly. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that the technical solutions described in the foregoing embodiments may still be modified, or some of their features replaced with equivalents. All equivalent structures made using the contents of the specification and the drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.

Claims

1. A method of desensitizing sensitive information, comprising the steps of:

desensitizing the sensitive information by the desensitizing tool by using the desensitizing rule to obtain desensitizing data, and returning the desensitizing data to the client.

2. The desensitization method of sensitive information according to claim 1, further comprising, prior to said step of encapsulating the trained sensitive data recognition models in the desensitization tool:

and calculating a loss function value according to the optimal labeling sequence, adjusting model parameters of the initial sensitive data identification model based on the loss function value, continuing iterative training until the model is converged, and outputting a final sensitive data identification model.

3. The method of desensitizing sensitive information according to claim 2, wherein said step of converting participles in said participle data into word vectors by said word vector layer comprises:

4. The method for desensitizing sensitive information according to claim 2, wherein said step of inputting said word vectors into said Bi-LSTM layer for feature extraction to obtain semantic feature vectors comprises:

5. The desensitization method according to claim 1, wherein said step of configuring desensitization rules for said sensitive information based on access interfaces carried in said sensitive data access requests comprises:

6. The method of claim 5, wherein the step of configuring corresponding desensitization rules for the sensitive information according to the service types comprises:

7. The desensitization method according to claim 5, wherein after the step of configuring corresponding desensitization rules for the sensitive information according to the service types, the method further comprises:

and configuring the mapping relation into a desensitization rule table.

8. An apparatus for desensitizing sensitive information, comprising:

9. A computer device comprising a memory having stored therein computer readable instructions which, when executed by a processor, implement the steps of a method of desensitizing sensitive information according to any of claims 1 to 7.

10. A computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, implement the steps of a method of desensitizing sensitive information according to any of claims 1 to 7.