CN114021548A

CN114021548A - Sensitive information detection method, training method, device, equipment and storage medium

Info

Publication number: CN114021548A
Application number: CN202111317840.7A
Authority: CN
Inventors: 杜悦艺; 许艳茹; 孙亚生
Original assignee: Baidu Online Network Technology Beijing Co Ltd
Current assignee: Baidu Online Network Technology Beijing Co Ltd
Priority date: 2021-11-08
Filing date: 2021-11-08
Publication date: 2022-02-08

Abstract

The disclosure provides a sensitive information detection method, a training method, an apparatus, a device and a storage medium, and relates to the technical fields of artificial intelligence and the internet, in particular to the technical field of data security. The specific implementation scheme is as follows: extracting feature information in a text to be detected to obtain an initial feature vector, wherein the feature information comprises sentence-level feature information, syntactic structure feature information and semantic feature information; performing context feature extraction on the initial feature vector to obtain a predicted feature vector; and determining a detection result about the sensitive information in the text to be detected based on the predicted feature vector, wherein the detection result comprises a sensitive information category detection result and a sensitive information position detection result.

Description

Sensitive information detection method, training method, device, equipment and storage medium

Technical Field

The present disclosure relates to the fields of artificial intelligence technology and internet technology, in particular to the field of data security technology, and more particularly to a sensitive information detection method, a training method, an apparatus, a device, and a storage medium.

Background

With the development of internet technology, a huge amount of data has been or will be spread on the internet. This data may include sensitive information relating to personal privacy, property security or information security, and the leakage of such sensitive information can cause serious losses to the individuals, enterprises or organizations concerned.

Disclosure of Invention

The disclosure provides a sensitive information detection method, a training method, an apparatus, an electronic device, a storage medium, and a program product.

According to an aspect of the present disclosure, there is provided a sensitive information detection method, including: extracting feature information in a text to be detected to obtain an initial feature vector, wherein the feature information comprises sentence-level feature information, syntactic structure feature information and semantic feature information; performing context feature extraction on the initial feature vector to obtain a predicted feature vector; and determining a detection result about the sensitive information in the text to be detected based on the predicted feature vector, wherein the detection result comprises a sensitive information category detection result and a sensitive information position detection result.

According to another aspect of the present disclosure, there is provided a method for training a sensitive information detection model, including: training a sensitive information detection model by using a training sample to obtain a trained sensitive information detection model, wherein the sensitive information detection model is used for: extracting feature information in a sample text to generate a sample initial feature vector, wherein the feature information comprises sentence-level feature information, syntactic structure feature information and semantic feature information; performing context feature extraction on the sample initial feature vector to obtain a sample predicted feature vector; and determining a sample detection result about the sensitive information in the sample text based on the sample predicted feature vector, wherein the sample detection result comprises a sample sensitive information category detection result and a sample sensitive information position detection result.

According to another aspect of the present disclosure, there is provided a sensitive information detection apparatus including: an initial feature extraction module for extracting feature information in the text to be detected to obtain an initial feature vector, wherein the feature information comprises sentence-level feature information, syntactic structure feature information and semantic feature information; a predicted feature extraction module for performing context feature extraction on the initial feature vector to obtain a predicted feature vector; and a detection result determining module for determining the detection result about the sensitive information in the text to be detected based on the predicted feature vector, wherein the detection result comprises a sensitive information category detection result and a sensitive information position detection result.

According to another aspect of the present disclosure, there is provided a training apparatus for a sensitive information detection model, including: a training module for training the sensitive information detection model by using a training sample to obtain a trained sensitive information detection model, wherein the sensitive information detection model is used for: extracting feature information in a sample text to generate a sample initial feature vector, wherein the feature information comprises sentence-level feature information, syntactic structure feature information and semantic feature information; performing context feature extraction on the sample initial feature vector to obtain a sample predicted feature vector; and determining a sample detection result about the sensitive information in the sample text based on the sample predicted feature vector, wherein the sample detection result comprises a sample sensitive information category detection result and a sample sensitive information position detection result.

According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method as described above.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method as described above.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. In the drawings:

FIG. 1 schematically illustrates an exemplary system architecture to which the sensitive information detection method and apparatus may be applied, according to an embodiment of the present disclosure;

FIG. 2 schematically illustrates a flow chart of a sensitive information detection method according to an embodiment of the present disclosure;

FIG. 3 schematically illustrates an application scenario of a sensitive information detection method according to an embodiment of the present disclosure;

FIG. 4 schematically illustrates a flow chart of a method of training a sensitive information detection model according to an embodiment of the present disclosure;

FIG. 5 is a diagram schematically illustrating an application scenario of a training method of a sensitive information detection model according to an embodiment of the present disclosure;

FIG. 6 schematically shows a block diagram of a sensitive information detection apparatus according to an embodiment of the present disclosure;

FIG. 7 schematically illustrates a block diagram of a training apparatus for a sensitive information detection model according to an embodiment of the present disclosure; and

FIG. 8 shows a schematic block diagram of an example electronic device that may be used to implement embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

According to the embodiment of the disclosure, the sensitive information detection method comprises the following steps: extracting feature information in a text to be detected to obtain an initial feature vector, wherein the feature information comprises sentence-level feature information, syntactic structure feature information and semantic feature information; performing context feature extraction on the initial feature vector to obtain a predicted feature vector; and determining a detection result about the sensitive information in the text to be detected based on the predicted feature vector, wherein the detection result comprises a sensitive information category detection result and a sensitive information position detection result.

According to the embodiment of the disclosure, by utilizing the initial feature extraction operation, complete information can be extracted from the text to be detected, so that the problem of information loss is prevented; deep level feature information can be obtained by utilizing the context feature extraction operation, and local information and global information are considered; the initial feature extraction operation and the context feature extraction operation are combined, so that the detection precision of subsequent sensitive information is improved.

In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of the personal information of the users concerned all comply with relevant laws and regulations and do not violate public order and good customs.

Fig. 1 schematically illustrates an exemplary system architecture to which the sensitive information detection method and apparatus may be applied, according to an embodiment of the present disclosure.

It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied, intended to help those skilled in the art understand the technical content of the present disclosure; it does not mean that the embodiments of the present disclosure cannot be applied to other devices, systems, environments or scenarios. For example, in another embodiment, an exemplary system architecture to which the sensitive information detection method and apparatus may be applied may include a terminal device, and the terminal device may implement the sensitive information detection method and apparatus provided in the embodiments of the present disclosure without interacting with a server.

As shown in fig. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired and/or wireless communication links, and so forth.

The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as a knowledge reading application, a web browser application, a search application, an instant messaging tool, a mailbox client, and/or social platform software, etc. (by way of example only).

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.

The server 105 may be a server providing various services, such as a background management server (for example only) providing support for content browsed by the user using the terminal devices 101, 102, 103. The background management server may analyze and perform other processing on the received data such as the user request, and feed back a processing result (e.g., a webpage, information, or data obtained or generated according to the user request) to the terminal device.

It should be noted that the sensitive information detection method provided by the embodiment of the present disclosure may be generally executed by the terminal device 101, 102, or 103. Correspondingly, the sensitive information detection device provided by the embodiment of the present disclosure may also be disposed in the terminal device 101, 102, or 103.

Alternatively, the sensitive information detection method provided by the embodiment of the present disclosure may also be generally executed by the server 105. Accordingly, the sensitive information detection apparatus provided by the embodiments of the present disclosure may be generally disposed in the server 105. The sensitive information detection method provided by the embodiment of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Correspondingly, the sensitive information detection apparatus provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Fig. 2 schematically shows a flow chart of a sensitive information detection method according to an embodiment of the present disclosure.

As shown in fig. 2, the method includes operations S210 to S230.

In operation S210, feature information in the text to be detected is extracted to obtain an initial feature vector, where the feature information includes sentence-level feature information, syntactic structure feature information, and semantic feature information.

In operation S220, context feature extraction is performed on the initial feature vector to obtain a predicted feature vector.

In operation S230, a detection result regarding the sensitive information in the text to be detected is determined based on the predicted feature vector, where the detection result includes a sensitive information category detection result and a sensitive information position detection result.
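Operations S210 to S230 form a three-stage pipeline. The following is a minimal structural sketch of that flow, not the disclosed implementation: the names `DetectionResult` and `detect_sensitive_information` and the three stage callables are hypothetical, and the stages are passed in as plain functions so that any concrete model could be substituted.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass(frozen=True)
class DetectionResult:
    """One detected item: its category plus its position (hypothetical layout)."""
    category: str
    start: int  # index of the first character of the sensitive information
    end: int    # index of the last character (inclusive)


def detect_sensitive_information(
    text: str,
    extract_initial: Callable[[str], List[Vector]],
    extract_context: Callable[[List[Vector]], List[Vector]],
    classify: Callable[[List[Vector]], List[DetectionResult]],
) -> List[DetectionResult]:
    """Chain the three operations in order."""
    initial = extract_initial(text)        # S210: initial feature vectors
    predicted = extract_context(initial)   # S220: predicted feature vectors
    return classify(predicted)             # S230: category and position results
```

Keeping the stages as injected callables mirrors the way the description later allows each stage to be built from different models.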

According to an embodiment of the present disclosure, the sensitive information may include sensitive information related to an individual, such as personal identification information, personal health information, personal property information, and the like, but is not limited thereto, and may also include sensitive information related to an institution or organization, such as the name, office address, or domain name address of the institution. The sensitive information can be defined by the person skilled in the art according to the relevant practical requirements.

According to an embodiment of the present disclosure, the text to be detected may include unpublished text or published text, such as statements made by individuals, academic papers, publicity articles published by institutions, and information from other source channels.

According to an embodiment of the present disclosure, the sentence-level feature information may include character-level feature information, phrase-level feature information, sentence-level feature information, and the like in the text to be detected. The syntactic structure feature information may include syntactic structure information between sentences of the text to be detected, syntactic structure information between phrases in a single sentence, and the like, for example a subject-predicate relationship, a verb-object relationship, and the like. The semantic feature information may include semantic feature information at a character level, semantic feature information at a phrase level, and semantic feature information at a sentence boundary level in the text to be detected. The semantic feature information provided by the embodiment of the disclosure can improve the accuracy of recognizing polyphonic characters, polysemous characters, polysemous words and the like.

According to the embodiment of the disclosure, the sentence level feature information, the syntactic structure feature information and the semantic feature information are comprehensively considered to extract the features, so that the preliminary feature extraction processing is complete, the information in the obtained initial feature vector is complete, and the problem of information loss caused by extracting only single feature information such as the sentence level feature information or the semantic feature information from the text to be detected is avoided.

According to the embodiment of the disclosure, performing context feature extraction on the initial feature vector extracts further semantic features from the shallow-level feature information while taking the context into account, so that the resulting high-level feature information, namely the predicted feature vector, also reflects global features.

According to the embodiment of the disclosure, the detection result about the sensitive information in the text to be detected, for example, whether sensitive information exists in the text to be detected, may be determined based on the predicted feature vector. When the text to be detected contains sensitive information, the detection result may include a sensitive information category detection result and a sensitive information position detection result.

According to an embodiment of the present disclosure, the sensitive information category detection result may be a result characterizing a sensitive information category, for example, a detection result of a personal information category, a detection result of an institution information category, and the like. Personal information may include information such as the name, identity, and institution of an individual. Institution information may include information such as the contracts and business secrets of the institution.

It should be noted that, the embodiment of the present disclosure does not limit the specific determination manner or the specific category of the sensitive information, and a person skilled in the art may determine the sensitive information category according to actual needs or according to related criteria.

According to the embodiment of the disclosure, the sensitive information position detection result can represent the position of the sensitive information at the sentence level in the text to be detected. For example, the sensitive information position detection result may be a position identifier indicating that the sensitive information appears in the second sentence of the first paragraph of the text to be detected. But not limited thereto, the sensitive information position detection result may also characterize the position of the sensitive information at the character level. For example, the sensitive information position detection result may be the position identifier of the first character and the position identifier of the last character of the sensitive information in the text to be detected.
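The two position granularities described above can be related to each other. The sketch below is an illustration only: the helper names `sentence_spans` and `sentence_index_of` are hypothetical, and splitting sentences on a fixed set of end punctuation marks is an assumption. It maps a character-level position to the index of the sentence that contains it.

```python
import re
from typing import List, Tuple

# Assumed sentence terminators (Latin and CJK); a real system may differ.
_SENTENCE_END = re.compile(r"[.!?。！？]")


def sentence_spans(text: str) -> List[Tuple[int, int]]:
    """Return (start, end_exclusive) character offsets of each sentence."""
    spans = []
    start = 0
    for match in _SENTENCE_END.finditer(text):
        spans.append((start, match.end()))
        start = match.end()
    if start < len(text):  # trailing text without end punctuation
        spans.append((start, len(text)))
    return spans


def sentence_index_of(text: str, char_start: int) -> int:
    """Map a character-level position to a zero-based sentence index."""
    for index, (start, end) in enumerate(sentence_spans(text)):
        if start <= char_start < end:
            return index
    raise ValueError("character position is outside the text")
```

For instance, with the text `"Hello there. Call Alice now."`, the character-level position of `Alice` is the pair (18, 22), and `sentence_index_of` reports that it falls in the sentence with index 1.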

By using the sensitive information detection method provided by the embodiment of the disclosure, complete information can be extracted from the text to be detected by using the initial feature extraction operation, so that the problem of information loss is prevented; deep level feature information can be obtained by utilizing the context feature extraction operation, and local information and global information are considered; the initial feature extraction operation and the context feature extraction operation are combined, so that the detection precision of subsequent sensitive information is improved.

According to an embodiment of the present disclosure, the operation S210 of extracting feature information in a text to be detected and generating an initial feature vector may include the following operations.

Preprocessing a text to be detected to generate an input vector corresponding to the text to be detected; and extracting feature information in the input vector by using a feature extraction module to generate an initial feature vector.

According to the embodiment of the disclosure, the preprocessing of the text to be detected may include sequentially performing sentence segmentation, word segmentation and character segmentation on the text to be detected, finally obtaining character-level information. The character-level information may also be vectorized, for example, encoded according to a preset encoding rule, to obtain vector expression information, i.e., an input vector corresponding to the text to be detected. With the input vector provided by the embodiment of the disclosure, the subsequent feature extraction module can extract features more conveniently, which simplifies the subsequent operation steps and improves the processing efficiency.
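A minimal sketch of this preprocessing chain follows. It rests on several assumptions that are not in the disclosure: sentences end at a fixed set of punctuation marks, words are separated by whitespace (Chinese text would need a dedicated word segmenter), and the "preset encoding rule" is modeled as a character-to-id vocabulary with a hypothetical `unk_id` for unknown characters.

```python
import re
from typing import Dict, List

# Assumed rule: a sentence is a run of non-terminators plus an optional terminator.
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]?")


def split_sentences(text: str) -> List[str]:
    """Sentence segmentation on end punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def preprocess(text: str, vocab: Dict[str, int], unk_id: int = 0) -> List[int]:
    """Sentence -> word -> character segmentation, then encode each character."""
    ids: List[int] = []
    for sentence in split_sentences(text):
        for word in sentence.split():      # word segmentation (whitespace; assumption)
            for char in word:              # character segmentation
                ids.append(vocab.get(char, unk_id))
    return ids
```

The returned list of ids plays the role of the input vector; characters absent from the vocabulary fall back to `unk_id`.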

According to the embodiment of the disclosure, the text to be detected may include sensitive information composed of numbers or letters, such as an email address, a bank card number, a mobile phone number, and the like. Such sensitive information may be replaced with corresponding attribute marking elements according to a preset marking rule. For example, in the case that the sensitive information is the email address "xy@aa.com", the sensitive information "xy@aa.com" may be replaced with information describing its attribute, such as "email information", according to the preset marking rule. In this way, sensitive information that can be matched by regular expression rules is excluded from the text to be detected, which simplifies the subsequent operation steps and improves the detection precision for the sensitive information addressed by the embodiment of the disclosure.
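The marking step can be sketched with Python's `re` module. The marker strings and the patterns below are illustrative assumptions (for example, an 11-digit mobile number starting with 1, and a 16 to 19 digit bank card number); the disclosure does not specify the preset marking rule.

```python
import re

# Hypothetical marking rules: (attribute marker, pattern). Order matters:
# longer digit runs are handled before shorter ones.
MARKING_RULES = [
    ("[EMAIL]", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("[BANK_CARD]", re.compile(r"(?<!\d)\d{16,19}(?!\d)")),
    ("[PHONE]", re.compile(r"(?<!\d)1\d{10}(?!\d)")),
]


def mark_pattern_based_items(text: str) -> str:
    """Replace pattern-matchable sensitive items with attribute markers."""
    for marker, pattern in MARKING_RULES:
        text = pattern.sub(marker, text)
    return text
```

Digit lookarounds are used instead of `\b` so that the patterns still match when a number is directly adjacent to CJK characters, which Python treats as word characters.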

According to an embodiment of the present disclosure, the feature extraction module may be constructed based on a BERT (Bidirectional Encoder Representations from Transformers) model, and may include, for example, a plurality of multi-head self-attention layers, a feed-forward network layer, a normalization layer, and a residual network layer. But not limited thereto, the feature extraction module may also be constructed based on other neural network models or based on other related algorithms, such as a word2vec network model.

According to the embodiment of the disclosure, a feature extraction module constructed based on the BERT model can extract the sentence-level feature information, the syntactic structure feature information and the semantic feature information in the text to be detected, so that the specific meaning of ambiguous character-level information can be determined more accurately.

According to an embodiment of the present disclosure, the operation S220 of performing context feature extraction on the initial feature vector to obtain a predicted feature vector may include the following operations.

A context feature extraction module is used to perform context feature extraction on the initial feature vector to obtain the predicted feature vector.

According to an embodiment of the present disclosure, the context feature extraction module may be constructed based on a neural network, for example, a context feature extraction module constructed based on a Recurrent Neural Network (RNN), a long short term memory network (LSTM), or a bidirectional long short term memory network (BiLSTM).

According to the embodiment of the disclosure, the bidirectional long short term memory network (BiLSTM) comprises a forward LSTM layer and a backward LSTM layer. When a context feature extraction module constructed with the BiLSTM network structure processes the initial feature vector, the feature vector output by the forward LSTM layer and the feature vector output by the backward LSTM layer can be combined as the final output to obtain the predicted feature vector. The predicted feature vector thus fully captures the context feature information in the initial feature vector, its overall feature information is more comprehensive, and a foundation is provided for the subsequent detection of sensitive information.
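The way the forward and backward outputs are combined can be shown without a deep learning framework. In the sketch below the step functions are simplified stand-ins for LSTM cells, which are not implemented here, and concatenation is assumed as the combining operation; the names `run_direction` and `bidirectional` are hypothetical.

```python
from typing import Callable, List

Vector = List[float]
Step = Callable[[Vector, Vector], Vector]  # (previous state, input) -> new state


def run_direction(inputs: List[Vector], step: Step, hidden_size: int) -> List[Vector]:
    """Run one recurrent pass and collect the state after every position."""
    state = [0.0] * hidden_size
    outputs = []
    for x in inputs:
        state = step(state, x)
        outputs.append(state)
    return outputs


def bidirectional(
    inputs: List[Vector], forward_step: Step, backward_step: Step, hidden_size: int
) -> List[Vector]:
    """Concatenate forward and backward outputs position by position."""
    forward = run_direction(inputs, forward_step, hidden_size)
    # The backward pass reads the sequence from the end, then is re-aligned.
    backward = run_direction(inputs[::-1], backward_step, hidden_size)[::-1]
    return [f + b for f, b in zip(forward, backward)]
```

With a running-sum step and inputs `[[1.0], [2.0], [3.0]]`, each output position holds the sum of everything to its left (inclusive) followed by the sum of everything to its right (inclusive), which illustrates how each position sees context from both directions.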

According to an embodiment of the present disclosure, the operation S230 of determining a detection result about the sensitive information in the text to be detected based on the predicted feature vector may include the following operations.

The predicted feature vector is processed by a sensitive information classifier to determine the detection result about the sensitive information in the text to be detected.

According to an embodiment of the present disclosure, the sensitive information classifier may be a hidden Markov model (HMM), a conditional random field (CRF) model, or the like, but is not limited thereto, and may also be a span classifier.

According to the embodiment of the disclosure, when a span classifier is used as the sensitive information classifier, the predicted feature vector can be processed to obtain a detection result including the sensitive information category and the sensitive information position. The detection result has high precision and covers a wide range of sensitive information categories, and the portability of the sensitive information detection method is enhanced, which expands the application range of sensitive information detection.
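The disclosure does not detail the internals of the span classifier. One common span-style formulation, assumed here purely for illustration, predicts a start label and an end label for every character and then pairs each start with the nearest following end of the same category. The function name `decode_spans` and the label `"O"` for "no boundary" are hypothetical.

```python
from typing import List, Tuple


def decode_spans(
    start_labels: List[str], end_labels: List[str], none_label: str = "O"
) -> List[Tuple[str, int, int]]:
    """Turn per-character start/end labels into (category, start, end) spans."""
    if len(start_labels) != len(end_labels):
        raise ValueError("start and end label sequences must have equal length")
    spans = []
    for i, label in enumerate(start_labels):
        if label == none_label:
            continue
        # Pair this start with the nearest end at or after it with the same category.
        for j in range(i, len(end_labels)):
            if end_labels[j] == label:
                spans.append((label, i, j))
                break
    return spans
```

Each decoded tuple carries both parts of the detection result at once: the category and the character-level position of the first and last character.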

Meanwhile, according to the sensitive information type detection result and the sensitive information position detection result, sensitive information can be conveniently found from the text to be detected, corresponding measures are taken according to different sensitive information types, accurate processing basis is provided for relevant sensitive information processing users, and the working efficiency of the relevant sensitive information processing users is improved.

Fig. 3 schematically shows an application scenario of the sensitive information detection method according to an embodiment of the present disclosure.

As shown in fig. 3, the input vector 310 may be an input vector corresponding to the text to be detected, which is generated after preprocessing the text to be detected. The feature extraction module 320 is used to extract feature information from the input vector to generate an initial feature vector 330. The feature extraction module 320 may be constructed based on a BERT model. The initial feature vector 330 may represent feature information of the text to be detected, and the feature information of the text to be detected may include sentence-level feature information, syntax structure feature information, and semantic feature information.

The initial feature vector 330 is context feature extracted by the context feature extraction module 340 to obtain a predicted feature vector 350. In an embodiment of the present disclosure, the contextual feature extraction module 340 may be constructed based on a bidirectional long short term memory network (BiLSTM) to fully extract contextual features in the initial feature vector.

The predicted feature vector 350 is input into the sensitive information classifier 360, and a detection result 370 about the sensitive information in the text to be detected is output. The detection results 370 may include sensitive information category detection results and sensitive information location detection results. In an embodiment of the present disclosure, sensitive information classifier 360 includes a classifier constructed based on a Span model.

According to the embodiment of the disclosure, detecting the text to be detected with the sensitive information detection method allows the sensitive information to be classified automatically, which raises the level of automation and intelligence of sensitive information detection and enhances the portability of the sensitive information detection method, so that the application range of sensitive information detection can be conveniently expanded. Meanwhile, according to the sensitive information category detection result and the sensitive information position detection result, sensitive information can be conveniently located in the text to be detected and corresponding measures can be taken for different sensitive information categories, which provides users who handle the sensitive information with an accurate basis for processing and improves their working efficiency.

According to an embodiment of the present disclosure, the sensitive information detection method may further include the following operations.

Based on the detection result, the sensitivity level of the sensitive information in the text to be detected is determined according to a predetermined sensitivity level.

According to the embodiment of the present disclosure, the predetermined sensitivity level may be determined according to the sensitive information category detection result, that is, a corresponding sensitivity level may be determined for each sensitive information category. For example, the sensitivity level corresponding to the personal information category may be set to level two, and the sensitivity level corresponding to the organization name category may be set to level one, so as to distinguish the sensitivity levels of different sensitive information categories and make it easier for the relevant user or organization to determine corresponding measures for different sensitivity levels.

According to the embodiment of the disclosure, the predetermined sensitivity level may also be determined according to the number of times of occurrence of the sensitive information, for example, the number of times of occurrence of the sensitive information may be determined according to the position detection result of the sensitive information, and the corresponding sensitivity level may be determined according to the number of times of occurrence of the sensitive information, so as to prompt the relevant user to pay attention to the corresponding sensitive information in time.

According to the embodiment of the disclosure, the preset sensitivity level can be determined by comprehensively considering the sensitive information type detection result and the sensitive information position detection result, so as to meet the individual requirements of different users on sensitive information detection.
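The level determination described above can be sketched as follows. This is a minimal illustration only: the category names, the level numbers, the occurrence threshold, the rule of raising the level by one, and the function name `sensitivity_level` are all assumptions made for the example and are not specified in the disclosure.

```python
from collections import Counter

# Hypothetical mapping from sensitive information category to a base level.
CATEGORY_LEVEL = {"personal_information": 2, "organization_name": 1}


def sensitivity_level(detections, count_threshold=3):
    """Combine category and occurrence count into a level per category.

    detections: list of (category, start, end) tuples, i.e. a category
    detection result paired with a position detection result.
    The base level comes from the category; it is raised by one when the
    category occurs at least count_threshold times (an assumed rule).
    """
    counts = Counter(category for category, _, _ in detections)
    levels = {}
    for category, occurrences in counts.items():
        base = CATEGORY_LEVEL.get(category, 1)
        levels[category] = base + 1 if occurrences >= count_threshold else base
    return levels


example = [
    ("personal_information", 0, 2),
    ("organization_name", 10, 14),
    ("organization_name", 30, 34),
    ("organization_name", 50, 54),
]
print(sensitivity_level(example))
```

A deployment would replace the mapping and the threshold with values chosen by the user, which is how the personalized requirements mentioned above could be met.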

Fig. 4 schematically shows a flowchart of a training method of a sensitive information detection model according to an embodiment of the present disclosure.

As shown in fig. 4, the training method may include operations S410 to S420.

In operation S410, training samples are acquired.

In operation S420, the sensitive information detection model is trained using the training samples, so as to obtain a trained sensitive information detection model.

The training method of the embodiment of the present disclosure may include only operation S420, but is not limited thereto, and may also include both operation S410 and operation S420.

According to an embodiment of the present disclosure, for obtaining a training sample in operation S410, the training sample may include a sample text and a tag sequence corresponding to the sample text, where a plurality of tag elements in the tag sequence are in one-to-one correspondence with a plurality of words in the sample text, and each of the plurality of tag elements indicates a relationship between a word corresponding to the tag element and sensitive information.

According to embodiments of the present disclosure, the training samples may include positive training samples, such as training samples containing sensitive information, or negative training samples, such as training samples that do not contain sensitive information. The training samples may also include both positive training samples and negative training samples. Training the sensitive information detection model with both positive and negative training samples allows the model to better learn the feature information of different training samples.

According to an embodiment of the present disclosure, the sensitive information in the sample text of the training sample may include sensitive information related to individuals, such as personal identification information, personal health information, personal property information, and the like, but is not limited thereto. The sensitive information in the sample text may also include sensitive information related to institutions or organizations, such as the name, office address, or domain name address of an organization.

According to an embodiment of the present disclosure, the sample text may include one or more pieces of sentence-level information. For example, the sample text may include the sentence-level information "this is y-agency", which consists of 5 pieces of character-level information, each corresponding to one character of the sentence, and the sensitive information in the sample text may be "y-agency". The tag sequences corresponding to the sample text may include a first tag sequence (0, 0, 1, 0, 0) and a second tag sequence (0, 0, 0, 0, 1). The tag element "1" in the first tag sequence corresponds to the character "y" in the sample text, and the tag element "1" in the second tag sequence corresponds to the last character of "y-agency". Together, the first tag sequence and the second tag sequence indicate the start position and the end position of the sensitive information "y-agency".
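The two tag sequences in this example can be built and read back as in the following sketch. The function names `build_tag_sequences` and `recover_span` are hypothetical, and the five placeholder characters merely stand in for the five characters of the example sentence.

```python
def build_tag_sequences(chars, span_start, span_end):
    """Return (start_tags, end_tags) marking one sensitive span.

    span_start and span_end are inclusive, 0-based character indices.
    """
    if not 0 <= span_start <= span_end < len(chars):
        raise ValueError("span out of range")
    start_tags = [0] * len(chars)
    end_tags = [0] * len(chars)
    start_tags[span_start] = 1
    end_tags[span_end] = 1
    return start_tags, end_tags


def recover_span(chars, start_tags, end_tags):
    """Read the sensitive text back from the two tag sequences."""
    start = start_tags.index(1)
    end = end_tags.index(1)
    return "".join(chars[start:end + 1])


# Placeholder characters: positions 2..4 play the role of the sensitive span.
chars = list("abycd")
start_tags, end_tags = build_tag_sequences(chars, 2, 4)
print(start_tags, end_tags, recover_span(chars, start_tags, end_tags))
```

To also carry the category, the tag value 1 could be replaced by a category identifier, which is again an assumption rather than something the text above prescribes.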

The training sample provided by the embodiment of the disclosure is used for training the sensitive information detection model provided by the embodiment of the disclosure, which is beneficial to simultaneously obtaining two types of detection results of a sensitive information type detection result and a sensitive information position detection result after the trained sensitive information detection model is applied, and is beneficial to improving the detection precision of the trained sensitive information detection model.

According to an embodiment of the present disclosure, in operation S420, the sensitive information detection model may be used to: extracting feature information in a sample text, and generating a sample initial feature vector, wherein the feature information comprises statement level feature information, syntax structure feature information and semantic feature information; extracting context characteristics of the initial characteristic vector of the sample to obtain a predicted characteristic vector of the sample; and determining a sample detection result related to the sensitive information in the sample text based on the sample prediction feature vector, wherein the sample detection result comprises a sample sensitive information category detection result and a sample sensitive information position detection result.

According to the embodiment of the disclosure, the trained sensitive information detection model can be used for the sensitive information detection method. The sensitive information detection model can comprise a feature extraction module, a context feature extraction module and a sensitive information classifier.

According to an embodiment of the present disclosure, the sentence-level feature information may include character-level feature information, phrase-level feature information, sentence-level feature information, and the like in the sample text. The syntactic structure feature information may include syntactic structure information between the sentences of the sample text, syntactic structure information between the phrases within a single sentence, and the like, for example a subject-predicate relationship or a verb-object relationship. The semantic feature information may include semantic feature information at the character level, semantic feature information at the phrase level, and semantic feature information at sentence boundaries in the sample text. The semantic feature information provided by the embodiment of the disclosure can improve the recognition accuracy for polyphonic characters, polysemous characters, polysemous words, and the like.

According to the embodiment of the disclosure, the sentence level feature information, the syntactic structure feature information and the semantic feature information are comprehensively considered to extract features, so that the preliminary feature extraction processing is complete, the information in the obtained sample initial feature vector is complete, and the problem of information loss caused by extracting only single feature information from a sample text, such as the sentence level feature information or the semantic feature information, is avoided.

According to the embodiment of the disclosure, performing context feature extraction on the sample initial feature vector extracts semantic features from the shallow feature information while taking the context into account, so that the resulting high-level feature information, namely the sample predicted feature vector, also reflects global features.

According to the embodiment of the disclosure, the detection result of the sensitive information in the sample text, for example, whether the sensitive information exists in the sample text, may be determined based on the sample prediction feature vector. In the case that there is sensitive information in the sample text, the detection result may include a sample sensitive information category detection result and a sample sensitive information position detection result.

According to an embodiment of the present disclosure, the sample sensitive information category detection result may be a result characterizing a sensitive information category, for example, a detection result of a personal information category, a detection result of an organization information category, and the like. Personal information may include information such as the name, identity, institution, etc. of an individual. The institution information may include contract, business confidentiality, etc. information for the institution.

According to the embodiment of the disclosure, the sample sensitive information position detection result can represent the sentence-level position of the sensitive information in the sample text. For example, the sample sensitive information position detection result may be a position identifier indicating that the sensitive information appears in the second sentence of the first paragraph of the sample text. The result is not limited thereto and may also represent the character-level position of the sensitive information. For example, the sample sensitive information position detection result may be position identifiers indicating where the first character and the last character of the sensitive information appear in the sample text.

By using the sensitive information detection model provided by the embodiment of the disclosure, the initial characteristic extraction operation can be used for extracting complete information from a sample text, so that the problem of information loss is prevented; the context feature extraction operation can be used for obtaining deep level feature information and considering local information and global information; the initial feature extraction operation and the context feature extraction operation are combined, so that the detection precision of the trained sensitive information detection model on the sensitive information is improved.

According to an embodiment of the present disclosure, the sensitive information detection model may include a feature extraction module. The sample text is preprocessed to generate a sample input vector corresponding to the sample text, and the feature extraction module extracts the feature information in the sample input vector to generate the sample initial feature vector.

According to the embodiment of the disclosure, the preprocessing of the sample text may include sequentially performing sentence segmentation, word segmentation, and character segmentation on the sample text to finally obtain character-level information. The character-level information may then be vectorized, for example encoded according to a preset encoding rule, to obtain a vector representation, i.e., the sample input vector corresponding to the sample text. With the sample input vector provided by the embodiment of the disclosure, the subsequent feature extraction module can perform feature extraction more conveniently, which simplifies the subsequent operation steps and improves processing efficiency.
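The preprocessing described above can be illustrated with the following simplified sketch. It covers sentence segmentation, character segmentation, and encoding; the intermediate word segmentation step is omitted because it would normally rely on a dedicated tokenizer. The punctuation-based splitting rule, the vocabulary-based encoding, and all function names are assumptions made for illustration, since the disclosure does not specify the preset encoding rule.

```python
import re


def split_sentences(text):
    """Split on common sentence-ending punctuation (an assumed rule)."""
    pieces = re.findall(r"[^.!?。！？]+[.!?。！？]?", text)
    return [p.strip() for p in pieces if p.strip()]


def to_characters(sentence):
    """Character-level segmentation; whitespace is dropped."""
    return [ch for ch in sentence if not ch.isspace()]


def build_vocab(sentences):
    """Assign each character an integer id; id 0 is reserved for unknowns."""
    vocab = {}
    for sentence in sentences:
        for ch in to_characters(sentence):
            vocab.setdefault(ch, len(vocab) + 1)
    return vocab


def preprocess(text, vocab, unk_id=0):
    """Return one list of character ids per sentence (the input vectors)."""
    return [
        [vocab.get(ch, unk_id) for ch in to_characters(sentence)]
        for sentence in split_sentences(text)
    ]


vocab = build_vocab(split_sentences("Hi. Yo!"))
print(preprocess("Hi. Yo!", vocab))
```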

According to the embodiment of the disclosure, the sample text may include sensitive information composed of numbers or letters, such as an email address, a bank card number, a mobile phone number, and the like. Such sensitive information may be replaced with a corresponding attribute labeling element according to a preset labeling rule. For example, when the sensitive information is the email address "xy@aa.com", it may be replaced with attribute-related information such as "email information". Removing from the sample text the sensitive information that can be matched by regular expressions simplifies the subsequent training steps and improves the training speed of the sensitive information detection model of the embodiment of the disclosure.
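A rule-based replacement of this kind might look like the sketch below. The patterns, the label strings such as `[EMAIL]`, and the function name `replace_with_labels` are illustrative assumptions; production rules would need to be stricter and adapted to the local formats of card and phone numbers.

```python
import re

# Assumed patterns and attribute labels; applied in this order.
RULES = [
    ("[EMAIL]", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("[BANK_CARD]", re.compile(r"\b\d{16,19}\b")),
    ("[PHONE]", re.compile(r"\b\d{11}\b")),
]


def replace_with_labels(text):
    """Replace pattern-matchable sensitive strings with attribute labels.

    Returns the rewritten text and the list of (label, original) pairs.
    """
    found = []
    for label, pattern in RULES:
        found.extend((label, m.group(0)) for m in pattern.finditer(text))
        text = pattern.sub(label, text)
    return text, found


print(replace_with_labels("Mail xy@aa.com or call 13800000000."))
```

Keeping the list of replaced strings lets them be reported alongside the model's own detections.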

According to an embodiment of the present disclosure, the feature extraction module may be constructed based on a BERT model, and may include, for example, a plurality of multi-head self-attention layers, a feed-forward network layer, a normalization layer, and a residual network layer. The feature extraction module is not limited thereto and may also be constructed based on other neural network models or other related algorithms, such as a word2vec network model.

According to the embodiment of the disclosure, the feature extraction module constructed with the BERT model can extract the sentence-level feature information, the syntactic structure feature information, and the semantic feature information in the sample text, so that it can determine the specific meaning of ambiguous character-level information more intelligently and accurately.

According to an embodiment of the present disclosure, the sensitive information detection model may include a context feature extraction module, configured to perform context feature extraction on the sample initial feature vector by using the context feature extraction module, so as to obtain a sample predicted feature vector.

According to the embodiment of the disclosure, the context feature extraction module may be constructed with the network structure of a bidirectional long short-term memory network (BiLSTM), which comprises a forward LSTM layer and a reverse LSTM layer. When the context feature extraction module processes the sample initial feature vector, the feature vector output by the forward LSTM layer and the feature vector output by the reverse LSTM layer can be combined as the final output to obtain the sample predicted feature vector. The sample predicted feature vector thus fully captures the context feature information in the sample initial feature vector, its global feature information is more comprehensive, and it provides a foundation for the subsequent detection of sensitive information.
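The idea of combining a forward pass and a reverse pass can be illustrated with the toy sketch below. It is deliberately not an LSTM: a gate-free tanh recurrent cell with scalar weights stands in for each LSTM layer so that the example stays dependency-free, and the weights and function names are assumptions. Only the bidirectional structure, one pass per direction with the two outputs concatenated per position, corresponds to the description above.

```python
import math


def run_direction(inputs, w_in, w_rec):
    """Run a simple tanh recurrent cell over inputs; return every state.

    This gate-free cell is a stand-in for an LSTM layer.
    """
    state = [0.0] * len(inputs[0])
    states = []
    for x in inputs:
        state = [math.tanh(w_in * xi + w_rec * si) for xi, si in zip(x, state)]
        states.append(state)
    return states


def bidirectional(inputs, forward_weights=(0.5, 0.5), backward_weights=(0.3, 0.7)):
    """Concatenate forward and reverse states at each position."""
    forward = run_direction(inputs, *forward_weights)
    # Process the reversed sequence, then flip back to align positions.
    backward = run_direction(inputs[::-1], *backward_weights)[::-1]
    return [f + b for f, b in zip(forward, backward)]


features = bidirectional([[1.0], [0.0], [-1.0]])
print(features)
```

Each output vector is twice as wide as the input: the first half depends only on earlier positions and the second half only on later positions, which is what gives every position access to context on both sides.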

According to an embodiment of the disclosure, the sensitive information detection model may include a sensitive information classifier, which is configured to process the sample prediction feature vector by using the sensitive information classifier, and determine a detection result about the sensitive information in the sample text.

According to the embodiment of the disclosure, a classifier such as a Span classifier may be used as the sensitive information classifier. It processes the sample predicted feature vector to obtain a detection result that includes both the sample sensitive information category detection result and the sample sensitive information position detection result. The detection result is highly accurate and the range of detectable categories is wide, which improves the portability of the sensitive information detection method and makes it easier to extend the application range of sensitive information detection.
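How start and end predictions might be turned into category and position results can be sketched as follows. The pairing rule used here, matching each predicted start with the nearest following end of the same category, is one common decoding heuristic for span-style classifiers and is an assumption; the disclosure does not state how the classifier output is decoded. The function name `decode_spans` is likewise hypothetical.

```python
def decode_spans(start_labels, end_labels):
    """Pair each start label with the nearest end label of the same category.

    Labels are integer category ids, one per character, with 0 meaning
    "not a boundary". Returns (category_id, start_index, end_index) tuples
    with inclusive indices.
    """
    spans = []
    for i, category in enumerate(start_labels):
        if category == 0:
            continue
        for j in range(i, len(end_labels)):
            if end_labels[j] == category:
                spans.append((category, i, j))
                break
    return spans


# Category 2 spans characters 0..1, category 1 spans characters 3..4.
print(decode_spans([2, 0, 0, 1, 0], [0, 2, 0, 0, 1]))
```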

According to an embodiment of the present disclosure, the operation S420 of training the sensitive information detection model using the training samples, and obtaining the trained sensitive information detection model may include the following operations.

Inputting the sample text in the training sample into the sensitive information detection model to obtain a sample detection result; processing the sample detection result and the tag sequence with a loss function to obtain a loss value; adjusting the parameters of the sensitive information detection model based on the loss value until the loss function converges; and taking the model obtained when the loss function converges as the trained sensitive information detection model.

According to an embodiment of the present disclosure, the loss function may include a cross-entropy loss function, a likelihood loss function, and the like, and the parameters of the sensitive information detection model may be adjusted with a gradient descent algorithm, such as a stochastic gradient descent algorithm or a batch gradient descent algorithm.
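For illustration, a cross-entropy loss over per-character tag predictions and a single gradient descent update can be written as below. The per-character logits layout, the averaging over characters, the learning rate, and the function names are assumptions for the sketch, not details taken from the disclosure.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits_per_char, tags):
    """Mean negative log-probability of the true tag at each character."""
    losses = [
        -math.log(softmax(logits)[tag])
        for logits, tag in zip(logits_per_char, tags)
    ]
    return sum(losses) / len(losses)


def sgd_step(params, grads, learning_rate=0.1):
    """One gradient descent update: move each parameter against its gradient."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


# Two characters, two possible tags each; true tags are 1 and 0.
print(cross_entropy([[0.0, 2.0], [1.5, -0.5]], [1, 0]))
```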

By utilizing the loss function and the parameter adjusting mode provided by the embodiment of the disclosure, the trained sensitive information detection model can be obtained more quickly.

Fig. 5 schematically illustrates an application scenario of a training method of a sensitive information detection model according to an embodiment of the present disclosure.

As shown in fig. 5, the training sample 510 may include a sample text 511 and a tag sequence 512 corresponding to the sample text, wherein tag elements in the tag sequence 512 are in one-to-one correspondence with words in the sample text 511, and each of a plurality of tag elements indicates a relationship between a word corresponding to a tag element and sensitive information.

In operation S510, sample text is preprocessed, resulting in a sample input vector 520 corresponding to the sample text 511.

In operation S520, the sample input vector 520 is input into the sensitive information detection model to obtain a sample detection result 530. The sensitive information detection model may comprise a feature extraction module, a context feature extraction module, and a sensitive information classifier. The feature extraction module may be constructed based on a BERT model, and may include, for example, a plurality of multi-head self-attention layers, a feed-forward network layer, a normalization layer, and a residual network layer. The context feature extraction module may be constructed based on a bidirectional long short-term memory network (BiLSTM), and the sensitive information classifier may comprise a classifier constructed based on a Span model.

In operation S530, the sample detection result 530 and the tag sequence 512 are processed using a loss function. For example, the sample detection result 530 and the tag sequence 512 are input to a loss function to obtain a loss value.

In operation S540, it is determined whether the loss value has converged. If the determination result is negative, operation S550 is performed, that is, the parameters of the sensitive information detection model are adjusted based on the loss value. Operations S520 to S550 are then executed in a loop to iteratively update the parameters of the sensitive information detection model. Once the determination result in operation S540 is positive, that is, the loss value has converged, operation S560 is performed to end the procedure, and the trained sensitive information detection model is obtained.
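The loop over operations S520 to S560 can be sketched as follows. The convergence test used here, stopping when the loss changes by less than a tolerance between iterations, is an assumed criterion because the disclosure does not define one. The toy quadratic objective merely stands in for the real model and loss, and all names are hypothetical.

```python
def train_until_converged(params, loss_and_grad, learning_rate=0.1,
                          tolerance=1e-6, max_steps=10_000):
    """Iterate forward pass, loss, convergence check, and parameter update.

    loss_and_grad(params) must return (loss_value, gradients).
    Returns (params, last_loss, steps_taken).
    """
    previous = float("inf")
    loss = float("inf")
    for step in range(max_steps):
        loss, grads = loss_and_grad(params)       # S520 and S530
        if abs(previous - loss) < tolerance:      # S540: converged?
            return params, loss, step             # S560: stop
        params = [p - learning_rate * g           # S550: adjust parameters
                  for p, g in zip(params, grads)]
        previous = loss
    return params, loss, max_steps


def quadratic(params):
    """Toy objective (w - 3)^2 with its gradient, standing in for the model."""
    (w,) = params
    return (w - 3.0) ** 2, [2.0 * (w - 3.0)]


print(train_until_converged([0.0], quadratic))
```

The `max_steps` cap is a practical safeguard added for the sketch so that the loop terminates even if the tolerance is never reached.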

Fig. 6 schematically shows a block diagram of a sensitive information detection apparatus according to an embodiment of the present disclosure.

As shown in fig. 6, the sensitive information detecting apparatus 600 may include an initial feature extraction module 610, a predicted feature extraction module 620, and a detection result determination module 630.

The initial feature extraction module 610 is configured to extract feature information in a text to be detected to obtain an initial feature vector, where the feature information includes statement level feature information, syntax structure feature information, and semantic feature information.

The predicted feature extraction module 620 is configured to perform context feature extraction on the initial feature vector to obtain a predicted feature vector.

The detection result determining module 630 is configured to determine a detection result about the sensitive information in the text to be detected based on the predicted feature vector, where the detection result includes a sensitive information category detection result and a sensitive information position detection result.
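The way the three modules of apparatus 600 hand data to one another can be sketched as a simple composition. The class name, the type aliases, and the stub functions are assumptions for illustration; the stubs are trivial placeholders and do not implement BERT, BiLSTM, or Span-based processing.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vector = List[float]
Detection = Tuple[str, int, int]  # (category, start index, end index)


@dataclass
class SensitiveInformationDetector:
    initial_feature_extractor: Callable[[str], List[Vector]]              # 610
    predicted_feature_extractor: Callable[[List[Vector]], List[Vector]]   # 620
    result_determiner: Callable[[List[Vector]], List[Detection]]          # 630

    def detect(self, text: str) -> List[Detection]:
        initial = self.initial_feature_extractor(text)
        predicted = self.predicted_feature_extractor(initial)
        return self.result_determiner(predicted)


# Stub modules: flag digits, pass features through, report one digit span.
def toy_initial(text):
    return [[1.0 if ch.isdigit() else 0.0] for ch in text]


def toy_context(vectors):
    return vectors


def toy_classifier(vectors):
    hits = [i for i, v in enumerate(vectors) if v[0] > 0.5]
    return [("number", hits[0], hits[-1])] if hits else []


detector = SensitiveInformationDetector(toy_initial, toy_context, toy_classifier)
print(detector.detect("id 42"))
```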

According to an embodiment of the present disclosure, the initial feature extraction module may include a preprocessing sub-module and an initial feature generation sub-module.

The preprocessing sub-module is configured to preprocess the text to be detected and generate an input vector corresponding to the text to be detected.

The initial feature generation sub-module is configured to extract the feature information in the input vector by using the feature extraction module to generate the initial feature vector.

According to an embodiment of the present disclosure, the predictive feature extraction module may include a predictive feature extraction sub-module.

The predicted feature extraction sub-module is configured to perform context feature extraction on the initial feature vector by using the context feature extraction module to obtain the predicted feature vector.

According to an embodiment of the present disclosure, the detection result determination module may include a detection sub-module.

The detection sub-module is configured to process the predicted feature vector by using the sensitive information classifier and determine the detection result about the sensitive information in the text to be detected.

According to an embodiment of the present disclosure, the sensitive information detecting apparatus may further include a level determining module.

The level determining module is configured to determine, based on the detection result, the sensitivity level of the sensitive information in the text to be detected according to a predetermined sensitivity level.

Fig. 7 schematically shows a block diagram of a training apparatus of a sensitive information detection model according to an embodiment of the present disclosure.

As shown in fig. 7, the training apparatus 700 for the sensitive information detection model may include an obtaining module 710 and a training module 720.

The obtaining module 710 is configured to obtain a training sample, where the training sample includes a sample text and a tag sequence corresponding to the sample text, where a plurality of tag elements in the tag sequence are in one-to-one correspondence with a plurality of words in the sample text, and each of the plurality of tag elements indicates a relationship between a word corresponding to the tag element and sensitive information.

The training module 720 is configured to train a sensitive information detection model using the training sample to obtain a trained sensitive information detection model, where the sensitive information detection model is configured to: extract feature information in the sample text to generate a sample initial feature vector, wherein the feature information comprises sentence-level feature information, syntactic structure feature information, and semantic feature information; perform context feature extraction on the sample initial feature vector to obtain a sample predicted feature vector; and determine a sample detection result about the sensitive information in the sample text based on the sample predicted feature vector, wherein the sample detection result comprises a sample sensitive information category detection result and a sample sensitive information position detection result.

According to an embodiment of the present disclosure, the training apparatus 700 of the sensitive information detection model may include only the training module 720, but is not limited thereto, and may further include the obtaining module 710 and the training module 720.

According to an embodiment of the present disclosure, the training module may include a detection sub-module, a loss value determination sub-module, a parameter adjustment sub-module, and a determination sub-module.

The detection sub-module is configured to input the sample text in the training sample into the sensitive information detection model to obtain a sample detection result.

The loss value determination sub-module is configured to process the sample detection result and the tag sequence with a loss function to obtain a loss value.

The parameter adjustment sub-module is configured to adjust the parameters of the sensitive information detection model based on the loss value until the loss function converges.

The determination sub-module is configured to take the model obtained when the loss function converges as the trained sensitive information detection model.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.

According to an embodiment of the present disclosure, a non-transitory computer readable storage medium stores computer instructions for causing a computer to perform the method as described above.

According to an embodiment of the disclosure, a computer program product comprises a computer program which, when executed by a processor, implements the method as described above.

Fig. 8 shows a schematic block diagram of an example electronic device that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 8, the device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The computing unit 801 may be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 executes the methods and processes described above, such as the sensitive information detection method or the training method of the sensitive information detection model. For example, in some embodiments, the sensitive information detection method or the training method of the sensitive information detection model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the sensitive information detection method or the training method of the sensitive information detection model described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured by any other suitable means (e.g., by means of firmware) to perform the sensitive information detection method or the training method of the sensitive information detection model.

Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A sensitive information detection method, comprising:

extracting feature information in a text to be detected to obtain an initial feature vector, wherein the feature information comprises statement level feature information, syntax structure feature information and semantic feature information;

extracting context features of the initial feature vector to obtain a predicted feature vector; and

determining a detection result about the sensitive information in the text to be detected based on the predicted feature vector, wherein the detection result comprises a sensitive information type detection result and a sensitive information position detection result.
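
As an illustration of how a detection result can carry both a type and a position, the following minimal sketch decodes a per-token label sequence into (type, start, end) tuples. The BIO-style label format (`B-`, `I-`, `O`), the function name `decode_detections` and the category names are assumptions made for this example only; the claim does not prescribe any particular encoding.

```python
from typing import List, Tuple


def decode_detections(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Turn per-token BIO labels (an assumed scheme) into
    (category, start, end_exclusive) tuples."""
    detections = []
    category, start = None, None
    # A trailing "O" sentinel closes a span that runs to the end of the text.
    for i, label in enumerate(labels + ["O"]):
        continues = label.startswith("I-") and category == label[2:]
        if category is not None and not continues:
            detections.append((category, start, i))
            category, start = None, None
        if label.startswith("B-"):
            category, start = label[2:], i
        # An "I-" label with no open span of the same category is ignored.
    return detections


# Hypothetical predicted labels for a five-token text.
predicted = ["O", "B-PHONE", "I-PHONE", "O", "B-NAME"]
print(decode_detections(predicted))
# [('PHONE', 1, 3), ('NAME', 4, 5)]
```

Each tuple pairs the type result (the category) with the position result (the token range), which is one simple way to present the two outputs together.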

2. The method according to claim 1, wherein the extracting feature information in the text to be detected to obtain an initial feature vector comprises:

preprocessing the text to be detected to generate an input vector corresponding to the text to be detected; and

extracting the feature information in the input vector by using a feature extraction module to generate the initial feature vector.

3. The method of claim 1, wherein the extracting the context feature of the initial feature vector to obtain a predicted feature vector comprises:

extracting the context features of the initial feature vector by using a context feature extraction module to obtain the predicted feature vector.

4. The method of claim 1, wherein the determining, based on the predicted feature vector, a detection result of sensitive information in the text to be detected comprises:

5. The method of any of claims 1 to 4, further comprising:

6. A training method for a sensitive information detection model, comprising:

training the sensitive information detection model by using a training sample to obtain a trained sensitive information detection model,

wherein the sensitive information detection model is configured to:

extract feature information in a sample text to generate a sample initial feature vector, wherein the feature information comprises statement level feature information, syntax structure feature information and semantic feature information;

extract context features of the sample initial feature vector to obtain a sample predicted feature vector; and

determine a sample detection result about sensitive information in the sample text based on the sample predicted feature vector, wherein the sample detection result comprises a sample sensitive information type detection result and a sample sensitive information position detection result.

7. The method of claim 6, further comprising:

obtaining a training sample, wherein the training sample comprises a sample text and a label sequence corresponding to the sample text, a plurality of label elements in the label sequence correspond to a plurality of words in the sample text one to one, and each label element in the plurality of label elements indicates a relationship between the word corresponding to the label element and sensitive information.
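
A minimal sketch of such a label sequence is given below, assuming a BIO-style tagging scheme in which `B-` marks the first word of a piece of sensitive information, `I-` marks a following word, and `O` marks a word unrelated to sensitive information. The scheme, the helper name `build_label_sequence`, the span format and the `PHONE` category are all illustrative assumptions and are not taken from the claim.

```python
from typing import List, Tuple

# Hypothetical annotation format: (start_index, end_index_exclusive, category)
Span = Tuple[int, int, str]


def build_label_sequence(words: List[str], spans: List[Span]) -> List[str]:
    """Return exactly one label per word, using an assumed BIO scheme."""
    labels = ["O"] * len(words)
    for start, end, category in spans:
        if not (0 <= start < end <= len(words)):
            raise ValueError(f"span {(start, end)} is outside the text")
        labels[start] = f"B-{category}"      # first word of the sensitive span
        for i in range(start + 1, end):
            labels[i] = f"I-{category}"      # remaining words of the span
    return labels


words = ["call", "me", "at", "555", "0100", "today"]
labels = build_label_sequence(words, [(3, 5, "PHONE")])
print(list(zip(words, labels)))
```

Because the labels list is created with the same length as the word list, the one-to-one correspondence between label elements and words holds by construction.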

8. The method of claim 7, wherein the training the sensitive information detection model by using the training sample to obtain the trained sensitive information detection model comprises:

inputting a sample text in the training sample into the sensitive information detection model to obtain a sample detection result;

processing the sample detection result and the label sequence by using a loss function to obtain a loss value;

adjusting parameters of the sensitive information detection model based on the loss value until the loss function converges; and

taking the model obtained when the loss function converges as the trained sensitive information detection model.
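
The loop of computing a loss value, adjusting parameters and stopping at convergence can be sketched as follows. This is a toy example under stated assumptions: a per-token softmax classifier over two hand-made features stands in for the sensitive information detection model, cross-entropy is assumed as the loss function, and convergence is assumed to mean that the loss changes by less than `tol` between epochs. None of these choices is specified by the claim.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(features: List[List[float]], labels: List[int], num_classes: int,
          lr: float = 0.5, tol: float = 1e-4, max_epochs: int = 5000):
    """Gradient descent on mean cross-entropy; stops when the loss change
    drops below `tol` (assumed convergence test) or after `max_epochs`."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    prev_loss = float("inf")
    loss = prev_loss
    for _ in range(max_epochs):
        loss = 0.0
        grads = [[0.0] * dim for _ in range(num_classes)]
        for x, y in zip(features, labels):
            scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in weights]
            probs = softmax(scores)
            loss -= math.log(probs[y])   # loss value from prediction vs. label
            for c in range(num_classes):
                err = probs[c] - (1.0 if c == y else 0.0)
                for j in range(dim):
                    grads[c][j] += err * x[j]
        loss /= len(features)
        if abs(prev_loss - loss) < tol:  # loss function treated as converged
            break
        prev_loss = loss
        for c in range(num_classes):     # adjust parameters based on the loss
            for j in range(dim):
                weights[c][j] -= lr * grads[c][j] / len(features)
    return weights, loss


def predict(weights: List[List[float]], x: List[float]) -> int:
    scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in weights]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical token features: [bias, token_is_numeric]; label 1 = sensitive.
features = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
labels = [0, 0, 1, 1]
weights, final_loss = train(features, labels, num_classes=2)
print(final_loss, [predict(weights, x) for x in features])
```

The weights returned at the point where the stopping test is met play the role of the trained model in this sketch.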

9. A sensitive information detection apparatus comprising:

an initial feature extraction module configured to extract feature information in a text to be detected to obtain an initial feature vector, wherein the feature information comprises statement level feature information, syntax structure feature information and semantic feature information;

a predicted feature extraction module configured to extract context features of the initial feature vector to obtain a predicted feature vector; and

a detection result determining module configured to determine a detection result about the sensitive information in the text to be detected based on the predicted feature vector, wherein the detection result comprises a sensitive information type detection result and a sensitive information position detection result.

10. The apparatus of claim 9, wherein the initial feature extraction module comprises:

a preprocessing submodule configured to preprocess the text to be detected to generate an input vector corresponding to the text to be detected; and

an initial feature generation submodule configured to extract the feature information in the input vector by using a feature extraction module to generate the initial feature vector.

11. The apparatus of claim 9, wherein the predicted feature extraction module comprises:

a predicted feature extraction submodule configured to extract the context features of the initial feature vector by using a context feature extraction module to obtain the predicted feature vector.

12. The apparatus of claim 9, wherein the detection result determination module comprises:

a detection submodule configured to process the predicted feature vector by using a sensitive information classifier and to determine the detection result about the sensitive information in the text to be detected.

13. The apparatus of any of claims 9 to 12, further comprising:

14. A training device for a sensitive information detection model, comprising:

a training module configured to train the sensitive information detection model by using a training sample to obtain a trained sensitive information detection model,

wherein the sensitive information detection model is configured to:

15. The training device of claim 14, further comprising:

the training sample comprises a sample text and a label sequence corresponding to the sample text, wherein a plurality of label elements in the label sequence correspond to a plurality of words in the sample text one to one, and each label element in the plurality of label elements indicates a relationship between the word corresponding to the label element and sensitive information.

16. The training device of claim 15, wherein the training module comprises:

a detection submodule configured to input the sample text in the training sample into the sensitive information detection model to obtain a sample detection result;

a loss value determining submodule configured to process the sample detection result and the label sequence by using a loss function to obtain a loss value;

a parameter adjusting submodule configured to adjust the parameters of the sensitive information detection model based on the loss value until the loss function converges; and

a determining submodule configured to take the model obtained when the loss function converges as the trained sensitive information detection model.

17. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 5 or to perform the method of any one of claims 6 to 8.

18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1 to 5 or to perform the method of any one of claims 6 to 8.

19. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1 to 5 or the method of any one of claims 6 to 8.