CN112134858B - Sensitive information detection method, device, equipment and storage medium - Google Patents


Info

Publication number
CN112134858B
CN112134858B (application CN202010940328.7A)
Authority
CN
China
Prior art keywords
sensitive information
information detection
sample
trained
message
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010940328.7A
Other languages
Chinese (zh)
Other versions
CN112134858A (en)
Inventor
周一枫
侯姗姗
张云蕾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Hangzhou Information Technology Co Ltd
Priority to CN202010940328.7A
Publication of CN112134858A
Application granted
Publication of CN112134858B
Status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • H04L63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods


Abstract

The embodiment of the invention relates to the field of data security and discloses a sensitive information detection method, device, equipment and storage medium. In the embodiment of the invention, a byte stream message to be detected is first obtained; a text vector is extracted from the byte stream message to be detected by a GRU network in a preset sensitive information detection model; the text vector is then processed under an attention mechanism in the preset sensitive information detection model to obtain text feature representation information; finally, the text feature representation information is normalized through a classifier to obtain a sensitive information detection result. The model structure of the preset sensitive information detection model thus jointly uses a GRU network structure and an attention mechanism structure; a model built in this arrangement can greatly improve the accuracy of the detection result when processing the byte stream message to be detected, thereby solving the technical problem that the detection accuracy of current sensitive information detection methods is low.

Description

Sensitive information detection method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the field of data security, in particular to a sensitive information detection method, a sensitive information detection device, sensitive information detection equipment and a storage medium.
Background
With the continuous development of data security technology, sensitive information is attracting more and more attention.
By definition, sensitive information generally means specific information that, if improperly used, or if exposed or modified by unauthorized persons, would adversely affect a country or organization and harm its interests. Sensitive information can also refer to specific information related to an individual.
Taking document information as an example, various identification means exist for identifying document-class sensitive files, so that sensitive information can be detected within large volumes of information. For example, sensitive files can be identified based on sensitive keywords. However, the inventors found that identifying sensitive documents based on sensitive keywords has at least the following problem:
although sensitive documents are easy to detect using sensitive keywords, detection usually works only for specific keywords, for example the keyword "secret", and the detection effect on more complicated or ambiguous sentences is often poor.
Therefore, the technical problem of low detection accuracy exists in the conventional sensitive information detection method.
Disclosure of Invention
The embodiment of the invention aims to provide a sensitive information detection method, a sensitive information detection device, sensitive information detection equipment and a storage medium, and aims to solve the technical problem that the detection accuracy of the conventional sensitive information detection method is not high.
In order to solve the above technical problem, an embodiment of the present invention provides a sensitive information detection method, including the following steps:
acquiring a byte stream message to be detected;
extracting a text vector from the byte stream message to be detected based on a gated recurrent unit (GRU) network in a preset sensitive information detection model;
processing the text vector under an attention mechanism in the preset sensitive information detection model to obtain text characteristic representation information;
and carrying out normalization processing on the text characteristic representation information through a classifier to obtain a sensitive information detection result.
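The input stage of the steps above can be sketched as follows. This is a minimal, hypothetical tokenization (function name, padding scheme and length are illustrative, not from the patent) that maps a raw byte-stream message to integer inputs for such a model:

```python
def bytes_to_tokens(message: bytes, max_len: int = 64) -> list[int]:
    """Map each byte of a captured network message to an integer token id
    (vocabulary size 256), padding or truncating to a fixed length. This
    mirrors the patent's idea of feeding raw byte streams to the input
    layer instead of decoded text."""
    tokens = list(message[:max_len])          # each byte is already 0..255
    tokens += [0] * (max_len - len(tokens))   # pad with 0 up to max_len
    return tokens

# A UTF-8 byte stream carrying Chinese text, as it could appear on the wire.
msg = "秘密文件".encode("utf-8")
ids = bytes_to_tokens(msg, max_len=16)
```

Because the model consumes the bytes directly, no decoding of the payload back into characters is needed before detection.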
An embodiment of the present invention further provides a sensitive information detection apparatus, including:
the message acquisition module is used for acquiring a byte stream message to be detected;
the GRU processing module is used for extracting a text vector from the byte stream message to be detected based on a gated recurrent unit (GRU) network in a preset sensitive information detection model;
the attention mechanism processing module is used for processing the text vector under the attention mechanism in the preset sensitive information detection model to obtain text characteristic representation information;
and the result output module is used for carrying out normalization processing on the text characteristic representation information through the classifier so as to obtain a sensitive information detection result.
An embodiment of the present invention further provides a sensitive information detecting apparatus, including:
at least one processor; and
a memory communicatively coupled to the at least one processor, wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the sensitive information detection method as described above.
Embodiments of the present invention also provide a computer-readable storage medium storing a computer program, which when executed by a processor implements the sensitive information detection method as described above.
Compared with the prior art, the embodiment of the invention provides a preset sensitive information detection model whose model structure jointly uses a GRU network structure and an attention mechanism structure. A model built in this arrangement can greatly improve the accuracy of the detection result when processing the byte stream message to be detected, so the embodiment obtains sensitive information detection results with higher accuracy and solves the technical problem that the detection accuracy of existing sensitive information detection methods is not high.
Meanwhile, the input layer of the preset sensitive information detection model described in this embodiment uses byte stream messages as they exist in network transmission, in the form of byte stream data, rather than directly using text as the input layer of a traditional deep learning model does. The adjusted sensitive information detection model omits the step of extracting message information from network transmission and converting it into characters, further improving detection efficiency.
Additionally, the GRU network includes an update gate and a reset gate. Correspondingly, extracting a text vector from the byte stream message to be detected based on the gated recurrent unit (GRU) network in the preset sensitive information detection model specifically includes: determining, from the byte stream message to be detected, a calculation result corresponding to the update gate and a calculation result corresponding to the reset gate; and determining a text vector from those two calculation results.
The sensitive information detection method provided by this embodiment thus provides a text vector extraction mode. In particular, the GRU structure contains reset gates and update gates: the reset gate determines how to combine new input information with the previous memory, and the update gate defines how much of the previous memory is carried to the current time step. For example, if the reset gate is set to 1 and the update gate is set to 0, a standard RNN model is recovered. Since the basic idea of using a gating mechanism to learn long-term dependencies is the same as in the LSTM, but the GRU has two gates instead of the LSTM's three, the GRU trains faster, which greatly reduces training time on large-scale corpora.
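The degeneration to a standard RNN can be checked numerically. This sketch assumes the convention h_t = z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h̃_t with a single shared weight matrix and no bias terms; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4 + 3))   # shared weights: hidden size 4, input size 3
h_prev = rng.standard_normal(4)
x_t = rng.standard_normal(3)

# GRU step with the gates pinned: reset gate r_t = 1, update gate z_t = 0.
r_t = np.ones(4)
z_t = np.zeros(4)
h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))  # candidate memory
h_gru = z_t * h_prev + (1 - z_t) * h_tilde                  # final memory

# Plain RNN step with the same weights.
h_rnn = np.tanh(W @ np.concatenate([h_prev, x_t]))

assert np.allclose(h_gru, h_rnn)   # identical: the GRU collapsed to a vanilla RNN
```

With the reset gate fully open and the update gate fully closed, the candidate memory is exactly the vanilla RNN update and is passed through unchanged.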
In addition, before the obtaining of the byte stream message to be detected, the sensitive information detection method further includes: acquiring a message sample to be trained; training a sensitive information detection model to be trained according to the message sample to be trained to obtain a trained preset sensitive information detection model;
the sensitive information detection model to be trained is a deep learning model based on a GRU network and an attention mechanism;
wherein the GRU network includes an update gate and a reset gate.
The sensitive information detection method provided by this embodiment provides a model training mode for the preset sensitive information detection model, which differs from the traditional model training mode. The traditional mode trains a standard RNN and subsequently uses the trained standard RNN for sensitive information detection. However, training a standard RNN has many drawbacks, such as the appearance of vanishing gradients. In contrast, this embodiment can control input, memory and other information through the gating mechanism of the GRU network in order to make a prediction at the current time step, and this processing mode copes better with vanishing gradients.
Meanwhile, vanishing and shrinking gradients during the training of a standard RNN cause the RNN to lose the ability to use long-range information. The method proposed in this embodiment solves both drawbacks of the RNN structure by using the GRU's gated structure. Specifically, the update gate and reset gate adopted by the GRU determine which information should be passed on and output, so the model constructed by this embodiment can retain information from long before and discard irrelevant information, overcoming the structural defects of the RNN model.
In addition, before the obtaining of the message sample to be trained, the sensitive information detection method further includes: acquiring a text sample to be trained; performing code conversion on the text sample to be trained to obtain a coding sample; and performing a packet capture operation on the coding sample to obtain a message sample to be trained.
The sensitive information detection method provided by the embodiment provides a mode for acquiring a type of message samples, so that the method can be used in a model training link of a preset sensitive information detection model. Moreover, during the acquisition of the message sample, the text sample can be converted into a specific coding form to ensure that the information can be correctly detected and identified when the information is transmitted in the form of a byte stream message.
In addition, the performing code conversion on the text sample to be trained to obtain a code sample specifically includes: performing code conversion on the text sample to be trained in a preset coding mode to obtain a coding sample; the preset coding mode comprises Unicode coding, UTF-8 coding and GB2312 coding.
The sensitive information detection method provided by this embodiment obtains the corresponding hexadecimal message information by converting the Chinese text under three coding formats, Unicode, UTF-8 and GB2312; compared with traditional technical means, this expands the range of text encoding formats that can be detected.
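A minimal sketch of the three-format transcoding described above. Mapping "Unicode coding" to UTF-16 big-endian is an assumption on our part, since the patent does not fix the byte order; the sample text is illustrative:

```python
# Encode the same Chinese sample text in the three formats the patent names.
sample = "机密文件"

encodings = {
    "utf-8": sample.encode("utf-8"),
    "unicode (utf-16-be)": sample.encode("utf-16-be"),
    "gb2312": sample.encode("gb2312"),
}

for name, raw in encodings.items():
    # The hexadecimal form is what would appear in a captured message payload.
    print(name, raw.hex())
```

The same four characters yield 12 bytes under UTF-8 but 8 bytes under UTF-16-BE and GB2312, which is why a detector trained on only one format would miss the others.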
In addition, after the packet capture operation is performed on the coding sample to obtain a message sample to be trained, the sensitive information detection method further includes: performing big-endian mode processing on the message sample to be trained to obtain a first message sample; and performing word segmentation processing on the first message sample to obtain a second message sample, which is taken as the new message sample to be trained.
The sensitive information detection method provided by this embodiment thus provides a preprocessing mode for message samples to be trained, in particular endianness (big-endian/little-endian) processing and word segmentation processing. By matching the big-endian or little-endian storage mode to the byte order of the byte stream message, the correctness of the byte stream text is ensured.
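A small illustration of why the endianness choice matters when interpreting message bytes (the sample bytes are illustrative):

```python
import struct

payload = b"\x4e\x2d"          # two bytes taken from a message sample

# Interpret the same bytes under the two byte orders.
big = struct.unpack(">H", payload)[0]      # big-endian reading
little = struct.unpack("<H", payload)[0]   # little-endian reading

# 0x4E2D is the Unicode code point of "中" (UTF-16-BE), so only the
# big-endian reading recovers the intended character here.
assert big == 0x4E2D and little == 0x2D4E
assert chr(big) == "中"
```

Choosing the wrong byte order silently yields a different code point, corrupting the reconstructed text before it ever reaches the model.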
Drawings
One or more embodiments are illustrated by way of example in the accompanying drawings, which are not to be construed as limiting the embodiments. Elements with the same reference numerals represent like elements throughout, and the drawings are not drawn to scale unless otherwise specified.
Fig. 1 is a detailed flowchart of a sensitive information detection method according to a first embodiment of the present invention;
fig. 2 is a detailed flowchart of a sensitive information detection method according to a second embodiment of the present invention;
fig. 3 is a structural view of a GRU according to a second embodiment of the present invention;
FIG. 4 is a flow chart of detection according to a second embodiment of the present invention;
fig. 5 is a detailed flowchart of a sensitive information detection method according to a third embodiment of the present invention;
fig. 6 is a schematic structural view of a sensitive information detecting apparatus according to a fourth embodiment of the present invention;
fig. 7 is a schematic configuration diagram of a sensitive information detecting apparatus according to a fifth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the various embodiments to aid understanding of the present application; the technical solution claimed in the present application can, however, be implemented without these technical details, and with various changes and modifications based on the following embodiments.
The following embodiments are divided for convenience of description, and should not constitute any limitation to the specific implementation manner of the present invention, and the embodiments may be mutually incorporated and referred to without contradiction.
A first embodiment of the present invention relates to a sensitive information detection method. The specific process is shown in fig. 1, and comprises the following steps:
step 101, acquiring a byte stream message to be detected.
It can be understood that, for the technical problem that the detection accuracy of the current sensitive information detection method is not high, the first embodiment provides a type of information detection means based on a preset sensitive information detection model, and the accuracy of a detection result can be improved.
Specifically, the preset sensitive information detection model is a deep learning model based on a Gated Recurrent Unit (GRU) network and an attention mechanism; its input layer accepts a byte stream message, and its output layer outputs a sensitive information detection result.
Further, it should be noted that while the input layer of a conventional deep learning model directly uses text, the input layer of the preset sensitive information detection model described in the first embodiment uses byte stream messages, i.e., byte stream data in network transmission, instead of text. The first embodiment therefore provides a new sensitive information detection model.
This mode of operation greatly accelerates detection, allows detection control to be applied directly at the source of information transmission, and reduces the risk of sensitive file leakage.
Step 102, extracting a text vector from the byte stream message to be detected based on the gated recurrent unit (GRU) network in the preset sensitive information detection model.
In a specific implementation, a byte stream message to be detected can be captured first; and then, processing the byte stream message to be detected through the GRU network, and extracting a text vector.
Step 103, processing the text vector under the attention mechanism in the preset sensitive information detection model to obtain text feature representation information.
Then, by introducing an attention mechanism, the text vector can express its importance along different dimensions, i.e., its weights, which are represented in the form of text feature representation information.
Step 104, performing normalization processing on the text feature representation information through a classifier to obtain a sensitive information detection result.
Then, the text feature representation information is normalized through the classifier, probability values for the different sensitive categories are output numerically, and the sensitive category with the largest probability value is output as the sensitive information detection result.
The classifier may be a softmax multi-classifier.
The sensitive categories may include a military sensitive category, a personal rights and interests sensitive category, and the like; if the probability value corresponding to the military sensitive category is the largest, the output sensitive information detection result may be the military sensitive category.
Of course, if the probability values corresponding to the various sensitive categories are all smaller than a certain probability threshold, the output sensitive information detection result may be no sensitive information.
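The normalize-then-threshold logic described above can be sketched as follows. The category names and the 0.5 threshold are illustrative assumptions, not values from the patent:

```python
import math

def classify(logits: dict[str, float], threshold: float = 0.5) -> str:
    """Softmax-normalize per-category scores and report the top category,
    or 'no sensitive information' when no probability clears the threshold."""
    exps = {c: math.exp(v) for c, v in logits.items()}
    total = sum(exps.values())
    probs = {c: e / total for c, e in exps.items()}
    top = max(probs, key=probs.get)
    return top if probs[top] >= threshold else "no sensitive information"

# One confident case and one where no category dominates.
print(classify({"military": 2.0, "personal": 0.1, "commercial": -1.0}))
print(classify({"military": 0.0, "personal": 0.0, "commercial": 0.0}))
```

In the second call the three probabilities are each 1/3, so none clears the threshold and the detector reports no sensitive information.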
It should be understood that the foregoing is only an example, and the technical solution of the present embodiment is not limited at all, and in practical applications, a person skilled in the art may set the technical solution according to business needs, and the present embodiment does not limit the technical solution.
Furthermore, traditional sensitive information detection methods also include identifying sensitive files based on Word and PDF file watermarks, identifying them based on document encryption keys, and identifying them based on text semantics. This embodiment is superior to these traditional sensitive information detection methods in detection accuracy.
From the above description it is easy to see that the sensitive information detection method provided by this embodiment provides a preset sensitive information detection model whose structure jointly uses a GRU network and an attention mechanism. A model built in this arrangement can greatly improve the accuracy of the detection result when processing the byte stream message to be detected, so this embodiment obtains sensitive information detection results with higher accuracy and solves the technical problem that the detection accuracy of existing sensitive information detection methods is not high.
Meanwhile, the input layer of the preset sensitive information detection model described in this embodiment uses byte stream messages as they exist in network transmission, in the form of byte stream data, rather than directly using text as the input layer of a traditional deep learning model does. This omits the step of extracting message information from network transmission and converting it into characters, further improving detection efficiency.
A second embodiment of the present invention relates to a sensitive information detection method. The second embodiment is substantially the same as the first embodiment, and a specific flow is shown in fig. 2, and the main differences are that: the GRU network comprises a first GRU network and a second GRU network;
the attention mechanism is a layered attention mechanism which comprises a vocabulary level attention mechanism and a sentence level attention mechanism;
the model structure of the preset sensitive information detection model sequentially comprises the first GRU network, the vocabulary level attention mechanism, the second GRU network and the sentence level attention mechanism.
It can be understood that the embodiment provides a more specific model structure of the preset sensitive information detection model.
Specifically, the GRU network in the preset sensitive information detection model may be a Bi-directional GRU (Bi-GRU) network, and the Attention mechanism may be a Hierarchical Attention Network (HAN).
Therefore, the model structure of the preset sensitive information detection model can be a first GRU network, a vocabulary level attention mechanism, a second GRU network and a sentence level attention mechanism in sequence.
When processing long texts in a deep network, the Bi-GRU network has better memory-feature extraction capability; meanwhile, because the model structure involves both a vocabulary-level and a sentence-level attention mechanism, feature processing can be performed separately at each level, improving the accuracy of the detection result.
For example, given that the model structure is, in order, the first GRU network, the vocabulary-level attention mechanism, the second GRU network and the sentence-level attention mechanism: in the vocabulary-level attention processing link, the importance weight of each vocabulary-level vector within its sentence can be calculated and introduced into the model structure; in the sentence-level attention processing link, the importance weight of each sentence-level vector within the article can be calculated and introduced into the model structure.
Further, the GRU network includes an update gate and a reset gate.
Correspondingly, the step 102 specifically includes:
step 1021, respectively determining a calculation result corresponding to the updating gate and a calculation result corresponding to the resetting gate according to the byte stream message to be detected.
In a specific implementation, the GRU network contains an update gate and a reset gate. Essentially, these two gating vectors determine which information is ultimately used as the output of the gated recurrent unit.
The special feature of these two gating mechanisms is that they can preserve information from long-term sequences, neither washing it away over time nor discarding it merely because it seems irrelevant to the current prediction.
Further, the first calculation formula of the GRU gating mechanism, corresponding to the update gate, is

z_t = σ(W_z · [h_{t-1}, x_t])

where z_t is the update gate result and σ is the sigmoid activation function. x_t is the input vector at the t-th time step, i.e., the t-th component of the input sequence X, and undergoes a linear transformation, namely multiplication by the weight matrix W_z. The information of the previous time step t-1 is stored in h_{t-1}, which likewise undergoes a linear transformation.
Here the input sequence X is the byte stream message to be detected.
The update gate thus adds these two transformed pieces of information and passes them through the sigmoid activation function, compressing the activation result to between 0 and 1.
Further, the second calculation formula of the GRU gating mechanism, corresponding to the reset gate, is

r_t = σ(W_r · [h_{t-1}, x_t])

where r_t is the reset gate result. x_t is the input vector at the t-th time step, i.e., the t-th component of the input sequence X, and undergoes a linear transformation, namely multiplication by the weight matrix W_r. The information of the previous time step t-1 is stored in h_{t-1}, which likewise undergoes a linear transformation.
The reset gate likewise adds these two transformed pieces of information and passes them through the sigmoid activation function, compressing the activation result to between 0 and 1.
Two activation results are thus obtained, which may be recorded as the calculation result corresponding to the update gate and the calculation result corresponding to the reset gate.
Step 1022, determining a text vector according to the calculation result corresponding to the update gate and the calculation result corresponding to the reset gate.
Further, the third calculation formula of the GRU gating mechanism, for the current memory content, is

h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}, x_t])

where r_t is the output value of the reset gate, h_{t-1} stores the information of the previous time step t-1, W_h is a weight matrix, and ⊙ denotes element-wise multiplication. The current memory content value h̃_t is obtained after applying the tanh function.
Further, a fourth calculation formula for the GRU gating mechanism is shown below:

h_t = (1 − z_t) ⊙ h_(t-1) + z_t ⊙ h̃_t

wherein z_t indicates the calculation result of the update gate and h̃_t indicates the current memory content value. h_t represents the final memory value of the current time step, i.e., the text vector mentioned above.
Reference is made to fig. 3 showing a structure of a GRU according to a second embodiment of the present invention.
Through the above description, it is easy to find that the sensitive information detection method provided by this embodiment provides a text vector extraction mode. Specifically, the GRU structure contains reset gates and update gates: the reset gate determines how to combine new input information with the previous memory, while the update gate defines how much of the previous memory is carried over to the current time step. For example, if the reset gate is set to 1 and the update gate is set to 0, a standard Recurrent Neural Network (RNN) model is obtained again. The basic idea of using a gating mechanism to learn long-term dependencies is consistent with the Long Short-Term Memory (LSTM) network; however, compared with the LSTM, which has three gate mechanisms, the GRU with two gates trains faster, which can greatly reduce training time on large-scale corpora.
Further, the text vector corresponds to a first GRU network;
correspondingly, the processing the text vector under the attention mechanism in the preset sensitive information detection model to obtain text feature representation information specifically includes:
processing a text vector corresponding to the first GRU network through a vocabulary level attention mechanism to obtain first text feature representation information;
processing the first text feature representation information through a second GRU network to obtain a text vector corresponding to the second GRU network;
processing a text vector corresponding to the second GRU network through a sentence-level attention mechanism to obtain second text feature representation information.
In a specific implementation, the hierarchical attention mechanism models the vocabulary-level text vectors and the sentence-level text vectors. By introducing the hierarchical attention mechanism, the importance of vocabulary-level vectors within sentences and of sentence-level vectors within articles can be expressed, and finally the conditional probability values of the sensitive categories can be calculated by the softmax multi-classifier, thereby realizing the detection of sensitive information.
Meanwhile, the HAN mechanism is introduced, so that the sentences can be understood according to the importance of the vocabulary level vectors in the sentences, and the articles can be understood according to the importance of the sentence level vectors in the articles, thereby achieving a better detection effect.
More specifically, the formulas for the vocabulary-level attention mechanism are shown below:

u_it = tanh(W_ω h_it + b_ω)

a_it = exp(u_it^T u_ω) / Σ_t exp(u_it^T u_ω)

s_i = Σ_t a_it h_it

wherein u_it is the output vector corresponding to the i-th word at time t, h_it is the output vector of the Bi-GRU at time t, W_ω is its corresponding character vector weight, and b_ω is its corresponding bias term. tanh is the activation function, and u_ω is the vocabulary-level attention weight. a_it is the normalized weight value of the i-th word at time t. s_i is the weighted-sum sentence representation obtained at time t by the dot multiplication of a_it and h_it, i.e., the first text feature representation information described above.
More specifically, the formulas for the sentence-level attention mechanism are shown below:

u_i = tanh(W_s h_i + b_s)

a_i = exp(u_i^T u_s) / Σ_i exp(u_i^T u_s)

v = Σ_i a_i h_i

wherein u_i is the attention output corresponding to the i-th sentence, h_i is the sentence representation computed from s_i by the bidirectional GRU network, W_s is the sentence-level vector weight, and b_s is its corresponding bias term. u_s is the sentence-level attention weight, and a_i is the weight obtained after normalization. v is the weighted-sum article representation obtained by the dot multiplication of h_i and a_i, i.e., the second text feature representation information described above.
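Both attention levels apply the same pooling pattern, which can be sketched in numpy as follows. The dimensions, random weights, and function names are illustrative assumptions rather than the patent's implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(H, W, b, u_ctx):
    """Attention pooling used at both levels of the hierarchical attention network.

    H: (T, d) Bi-GRU outputs; W: (d, d) weight; b: (d,) bias; u_ctx: (d,) context vector.
    Returns the attention-weighted sum of the rows of H.
    """
    U = np.tanh(H @ W + b)   # u = tanh(W h + b)
    a = softmax(U @ u_ctx)   # normalized attention weights
    return a @ H             # weighted sum: s_i at word level, v at sentence level

rng = np.random.default_rng(1)
d = 6
W_w, b_w, u_w = rng.standard_normal((d, d)), rng.standard_normal(d), rng.standard_normal(d)
W_s, b_s, u_s = rng.standard_normal((d, d)), rng.standard_normal(d), rng.standard_normal(d)

# Word level: 3 sentences of 5 word vectors each -> one vector per sentence
sentences = [attention_pool(rng.standard_normal((5, d)), W_w, b_w, u_w) for _ in range(3)]
# Sentence level: pool the sentence vectors into one article representation v
v = attention_pool(np.stack(sentences), W_s, b_s, u_s)
print(v.shape)  # (6,)
```

The same pooling function serving both levels mirrors the structural symmetry of the two formula groups above; only the weights and the inputs differ.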
Further, in the subsequent operation, the second text feature representation information obtained here can be normalized through the classifier to obtain a sensitive information detection result.
Further, the relevant hyper-parameters in the preset sensitive information detection model may be designed as follows: the batch size is 256, the learning rate is set to 0.001, the attention dimension is 512, the optimizer is Adam, the maximum text size is set to 800, and so on.
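For illustration, these hyper-parameters might be collected in a single configuration mapping; the key names below are assumptions, not identifiers from the patent.

```python
# Hyper-parameter values as listed above; the key names are illustrative.
HPARAMS = {
    "batch_size": 256,
    "learning_rate": 0.001,
    "attention_dim": 512,
    "optimizer": "Adam",
    "max_text_size": 800,
}
print(HPARAMS["learning_rate"])  # 0.001
```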
As can be easily found from the above description, the sensitive information detection method provided in this embodiment provides a specific operation mode of a specific model structure based on a sensitive information detection model. Specifically, in the model structure, a vocabulary level text vector and a sentence level text vector are modeled, the importance of the vocabulary level vector in a sentence and the importance of the sentence level vector in an article can be represented by introducing the attention mechanism hierarchically, and finally, the conditional probability values of the sensitive categories can be calculated by a softmax multi-classifier, so that the detection work of the sensitive information is realized.
Meanwhile, the HAN mechanism is introduced, so that the sentences can be understood according to the importance of the vocabulary level vectors in the sentences, and the articles can be understood according to the importance of the sentence level vectors in the articles, thereby achieving a better detection effect.
Further, after the second text feature representation information is obtained, normalization processing can be performed on the second text feature representation information through a classifier, so that a sensitive information detection result is obtained.
Specifically, the output numerical values of multiple classifications can be converted into relative probabilities through a softmax function, so that the message information is normalized, and the result is easier to understand and compare.
The formula of the softmax function can be shown as follows:

S_i = e^(V_i) / Σ_(j=1..C) e^(V_j)

wherein V_i represents the output of the preceding output unit of the classifier, i denotes the category index, and C denotes the total number of categories. S_i indicates the ratio of the exponential of the current element to the sum of the exponentials of all elements.

It can be seen that the softmax function converts the output values of multiple classes into relative probabilities.
Further, after the normalized result is obtained, that is, after the conditional probability value corresponding to the sensitive packet class is output, the class with the largest conditional probability value can be used as the class basis for checking the sensitive packet, so as to detect whether the data stream packet belongs to the sensitive packet and judge the sensitive class thereof.
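The normalization and the selection of the class with the largest conditional probability can be sketched in a few lines; the example logits are illustrative.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))  # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])       # classifier pre-activation outputs V_i
probs = softmax(logits)                  # relative probabilities S_i, summing to 1
predicted_class = int(np.argmax(probs))  # class with the largest conditional probability
print(predicted_class)  # 0
```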
The information input by the input layer may be "today is good weather". In UTF-8 encoding format, the hexadecimal data stream of the input layer is 0xe4bb8ae5a4a9e698afe4b8aae5a5bde5a4a9e6b094: "today" corresponds to 0xe4bb8a, "day" corresponds to 0xe5a4a9, "yes" corresponds to 0xe698af, "one" corresponds to 0xe4b8aa, "good" corresponds to 0xe5a5bd, "day" corresponds to 0xe5a4a9, and "gas" corresponds to 0xe6b094. The specific operation flow can refer to the detection flow chart of the second embodiment of the present invention shown in fig. 4.
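This character-to-byte correspondence can be reproduced with Python's built-in UTF-8 codec:

```python
text = "今天是个好天气"  # "today is good weather"
stream = text.encode("utf-8").hex()
print(stream)  # e4bb8ae5a4a9e698afe4b8aae5a5bde5a4a9e6b094

# Per-character mapping, matching the correspondence listed above
for ch in text:
    print(ch, "0x" + ch.encode("utf-8").hex())
```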
Where category represents a sensitive category.
A third embodiment of the present invention relates to a sensitive information detection method. The third embodiment is substantially the same as the first embodiment, and a specific flow is as shown in fig. 5, and the main differences are as follows.
In a third embodiment, before the step 101, the sensitive information detecting method further includes:
step 01, obtaining a message sample to be trained;
and step 02, training the sensitive information detection model to be trained according to the message sample to be trained to obtain a trained preset sensitive information detection model.
The sensitive information detection model to be trained is a deep learning model based on a GRU network and an attention mechanism;
wherein the GRU network includes an update gate and a reset gate.
It can be understood that the first embodiment describes the usage stage of the preset sensitive information detection model, and this embodiment provides a model training stage for the preset sensitive information detection model that precedes that usage stage.

The message sample to be trained in the model training stage corresponds to the byte stream message to be detected in the model usage stage, and is the same type of data content.

For details of the model training stage, reference may be made to the details of the model usage stage described in the first and second embodiments of the present invention, which are not repeated here.
Through the above description, it is easy to find that the sensitive information detection method provided by this embodiment offers a model training mode for the preset sensitive information detection model that differs from the traditional one. The traditional mode trains a standard RNN and subsequently uses the trained RNN for sensitive information detection. However, training a standard RNN has many drawbacks, such as the vanishing-gradient problem. Obviously, this embodiment can control input, memory, and other information through the gating mechanism embodied in the operation of the GRU network so as to make a prediction at the current time step, and this processing mode copes better with vanishing gradients.

Meanwhile, structural defects such as gradient vanishing and gradient decay exist in the process of training the standard RNN, causing the RNN to lose the ability to acquire long-range information. The method proposed in this embodiment can overcome these two drawbacks of the RNN structure by using the GRU gated structure. Specifically, the update-gate and reset-gate structure adopted by the GRU can decide what information should be transmitted and output, so the model constructed in this embodiment can retain information from long before and remove irrelevant information, thereby resolving the structural defects of the RNN model.
Further, before the step 01, the sensitive information detecting method further includes:
and 001, acquiring a text sample to be trained.
It can be understood that, in the embodiment, the text sample may be processed first to obtain a message sample for a subsequent model training link.
In particular, the text sample to be trained may be a chinese sample.
And 002, performing code conversion on the text sample to be trained to obtain a code sample.
The text samples may then be converted to a particular encoded form to ensure that the information is properly detected and identified when communicated in a byte stream message.
The preset coding mode adopted by the code conversion can be one or more. For example, the predetermined encoding method may be Unicode encoding, UTF-8 encoding, or GB2312 encoding.
And step 003, performing packet grabbing operation on the coding sample to obtain a to-be-trained message sample.
Further, wireshark software can be called to perform the packet capture operation. Of course, the type of packet capture tool is not limited herein.
For example, the text sample to be trained may be "today is good weather", whose hexadecimal data stream in UTF-8 encoding format is 0xe4bb8ae5a4a9e698afe4b8aae5a5bde5a4a9e6b094: "today" corresponds to 0xe4bb8a, "day" corresponds to 0xe5a4a9, "yes" corresponds to 0xe698af, "one" corresponds to 0xe4b8aa, "good" corresponds to 0xe5a5bd, "day" corresponds to 0xe5a4a9, and "gas" corresponds to 0xe6b094, so as to provide corresponding byte stream message material for the subsequent sensitive text.
As can be easily found from the above description, the sensitive information detection method provided in this embodiment provides a method for obtaining a type of packet sample, so as to be used in a model training link for presetting a sensitive information detection model. Moreover, during the acquisition of the message sample, the text sample can be converted into a specific coding form to ensure that the information can be correctly detected and identified when the information is transmitted in the form of a byte stream message.
Further, the performing code conversion on the text sample to be trained to obtain a code sample specifically includes:
performing code conversion on the text sample to be trained in a preset coding mode to obtain a coding sample;
the preset coding mode comprises Unicode coding, UTF-8 coding and GB2312 coding.
In a specific implementation, the transcoding operation may simultaneously use multiple encoding modes, for example, unicode encoding, UTF-8 encoding, and GB2312 encoding may be used to simultaneously obtain three types of encoded samples for subsequent operations.
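A minimal sketch of producing the three types of encoded samples follows. UTF-16-BE is used here as a stand-in for the two-byte Unicode (UCS-2) form; treating "Unicode encoding" as UTF-16 is an assumption.

```python
text = "今天是个好天气"  # "today is good weather"

# Transcode the same text sample under three preset encoding modes
for codec in ("utf-8", "gb2312", "utf-16-be"):
    sample = text.encode(codec)
    print(codec, sample.hex(), len(sample), "bytes")
```

Each Chinese character here occupies 3 bytes in UTF-8 and 2 bytes in both GB2312 and UTF-16, which matches the per-encoding byte counts described below.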
Further, unicode encoding is typically composed of two bytes, called USC-2, and individual excerpt words are composed of four bytes, called USC-4. The first 127 characters represent the characters in the original ASCII code, but only one byte is changed into two bytes.
It should be noted that the Unicode series codes have two modes, namely Big-end mode (Big-end) and small-end mode (Little-end), and the error of Big-end and small-end can also cause the error decoding.
Further, in UTF-8, characters are encoded in an 8-bit sequence, with one or several bytes representing a character.
Further, in the GB2312 simplified chinese coding, one chinese character occupies 2 bytes, which is a domestic main coding mode.
It is not difficult to find from the above description that the sensitive information detection method provided by this embodiment obtains the corresponding hexadecimal message information by converting the Chinese text into the Unicode, UTF-8, and GB2312 encoding formats. Compared with conventional technical means, this embodiment expands the range of text encoding formats that can be detected.
Further, after the packet capturing operation is performed on the coding sample to obtain the packet sample to be trained, the sensitive information detection method further includes:
carrying out big-end mode processing on the message sample to be trained to obtain a first message sample;
and performing word segmentation processing on the first message sample to obtain a second message sample, and taking the second message sample as a new message sample to be trained.
It can be understood that, after the packet capturing operation is performed to obtain the packet sample to be trained, the packet sample to be trained may be put into the sensitive information detection model to be trained for training.
However, before being put into the sensitive information detection model to be trained, the message sample to be trained may also be adjusted, where the adjustment operations include a big/small-end mode processing operation, a word segmentation processing operation, and the like.
Specifically, an unadjusted sample of the message to be trained may be recorded as a sample a, and an adjusted sample of the message to be trained may be recorded as a sample B.
First, the sample a may be subjected to a high-low byte conversion operation, i.e., a so-called big-end mode processing operation.
Here, the Unicode encoding method will be described as an example.
The small end mode means that the low-order byte is arranged at the low address end of the memory, and the high-order byte is arranged at the high address end of the memory.
In addition, this embodiment can perform fast small-end mode processing on the byte stream by using the function qToLittleEndian provided in the Qt integrated development environment.
For example, the hexadecimal representation of the word "today" is 0xe4bb8ae5a4a9; in memory (small-end mode):

memory: low address → high address
0xa9 | 0xa4 | 0xe5 | 0x8a | 0xbb | 0xe4
low-order byte → high-order byte
The big-end mode means that the high-order byte is arranged at the low address end of the memory, and the low-order byte is arranged at the high address end of the memory.
In addition, the present embodiment can perform fast big-end mode processing on the byte stream using the function qToBigEndian provided in the QT integration development environment.
For example, the hexadecimal representation of the word "today" is 0xe4bb8ae5a4a9; in memory (big-end mode):

memory: low address → high address
0xe4 | 0xbb | 0x8a | 0xe5 | 0xa4 | 0xa9
high-order byte → low-order byte
It can be understood that, by handling the correspondence between the big-end mode and the small-end mode of the byte stream text, it can be ensured that the big-end or small-end storage mode conforming to the message is adopted, thereby guaranteeing the correctness of the byte stream text.
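The big-end/small-end correspondence can be illustrated with Python's built-in codecs. Note that endianness properly concerns multi-byte code units (for example the two-byte UCS-2 units), while the memory layouts above show a whole-sequence byte reversal of a UTF-8 stream.

```python
text = "今天"  # "today": U+4ECA U+5929
# UTF-16 (UCS-2) makes the byte order of each two-byte code unit visible:
print(text.encode("utf-16-be").hex())  # 4eca5929  (big-end: high-order byte first)
print(text.encode("utf-16-le").hex())  # ca4e2959  (small-end: low-order byte first)

# Byte reversal of the UTF-8 stream 0xe4bb8ae5a4a9, as in the layouts above:
utf8 = bytes.fromhex("e4bb8ae5a4a9")
print(bytes(reversed(utf8)).hex())     # a9a4e58abbe4
```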
Second, the first message sample may be subjected to Chinese word segmentation to obtain a second message sample, which may also be denoted as sample B. Sample B is subsequently used for model training operations.

Meanwhile, the sensitive text information can be labeled with its sensitive category while the word segmentation is carried out.
as can be readily seen from the above description, the sensitive information detection method provided in this embodiment provides a preprocessing method for a type of packet sample to be trained, and specifically relates to a large-small-end mode processing operation, a word segmentation processing operation, and the like. By means of the correspondence of the big-end mode and the small-end mode of the byte stream message, the adoption of the big-end storage mode or the small-end storage mode which accords with the message can be ensured, and the correctness of the byte stream text is further ensured.
Further, in the model training link of this embodiment, which combines Bi-GRU with the HAN hierarchical attention mechanism, sensitive semantic information in the text can be learned, so that the sensitivity of a text can be detected according to its semantics, making the detection of sensitive files more robust.
Further, the traditional sensitive information detection means detects and judges sensitive files by constructing a sensitive-keyword library: if a sensitive word is detected, the file is considered sensitive; if not, it is considered non-sensitive.

However, in the actual implementation process, expanding the sensitive word bank is a tedious and labor-intensive matter.

In contrast, the Bi-GRU + HAN deep learning model adopted in this embodiment can acquire good learning ability from a certain amount of text corpora; that is, through the learning ability of the deep neural network, good detection ability can be achieved even for sensitive files that were not seen during training.
Further, when information detection is performed on parsed byte stream message information, extra time is often consumed parsing the messages, so good detection efficiency and response speed are often not achieved against the backdrop of a severe information network security situation.
Obviously, the embodiment performs deep learning model training on the byte stream in network transmission, that is, takes the data form of the text transmitted in the network as the input layer of the training, and is different from the traditional detection method for detecting text information, and the embodiment can directly detect the message information of the data byte stream transmitted in the network. Obviously, the operation mode greatly accelerates the detection efficiency, so that the detection control can be directly carried out on the source of information transmission, and the leakage risk of sensitive files is reduced.
The steps of the above methods are divided for clarity, and the implementation may be combined into one step or split some steps, and the steps are divided into multiple steps, so long as the same logical relationship is included, which are all within the protection scope of the present patent; it is within the scope of this patent to add insignificant modifications or introduce insignificant designs to the algorithms or processes, but not to change the core designs of the algorithms and processes.
A fourth embodiment of the present invention relates to a sensitive information detecting apparatus, as shown in fig. 6, including: a message acquisition module 401, a GRU processing module 402, an attention mechanism processing module 403 and a result output module 404;
a message acquiring module 401, configured to acquire a byte stream message to be detected;
a GRU processing module 402, configured to extract a text vector from the byte stream message to be detected based on a gated cycle unit GRU network in a preset sensitive information detection model;
an attention mechanism processing module 403, configured to process the text vector under an attention mechanism in the preset sensitive information detection model to obtain text feature representation information;
and a result output module 404, configured to perform normalization processing on the text feature representation information through a classifier, so as to obtain a sensitive information detection result.
Further, in another example, the GRU network comprises a first GRU network, a second GRU network;
the attention mechanism is a layered attention mechanism which comprises a vocabulary level attention mechanism and a sentence level attention mechanism;
the model structure of the preset sensitive information detection model sequentially comprises the first GRU network, the vocabulary level attention mechanism, the second GRU network and the sentence level attention mechanism.
Further, in another example, the GRU network includes an update gate and a reset gate;
the GRU processing module 402 is further configured to determine, according to the to-be-detected byte stream packet, a calculation result corresponding to the update gate and a calculation result corresponding to the reset gate, respectively; and determining a text vector according to the calculation result corresponding to the updating gate and the calculation result corresponding to the resetting gate.
In addition, in another example, the sensitive information detecting apparatus further includes: a model training module;
the model training module is used for acquiring a message sample to be trained; training a sensitive information detection model to be trained according to the message sample to be trained to obtain a trained preset sensitive information detection model;
the sensitive information detection model to be trained is a deep learning model based on a GRU network and an attention mechanism;
wherein the GRU network includes an update gate and a reset gate.
In addition, in another example, the sensitive information detecting apparatus further includes: a message extraction module;
the message extraction module is used for acquiring a text sample to be trained; performing code conversion on the text sample to be trained to obtain a code sample; and carrying out packet grabbing operation on the coding sample to obtain a message sample to be trained.
In addition, in another example, the message extraction module is further configured to perform code conversion on the text sample to be trained through a preset coding mode to obtain a coding sample;
the preset coding mode comprises a Unicode code, a UTF-8 code and a GB2312 code.
In addition, in another example, the message extraction module is further configured to perform a large-small-end mode processing on the message sample to be trained to obtain a first message sample; and performing word segmentation processing on the first message sample to obtain a second message sample, and taking the second message sample as a new message sample to be trained.
It is to be understood that the present embodiment corresponds to the first, second, or third embodiment, and that the present embodiment may be implemented in cooperation with the first, second, or third embodiment. The related technical details mentioned in the first embodiment, the second embodiment and the third embodiment are still valid in the present embodiment, and are not described herein again in order to reduce the repetition. Accordingly, the related-art details mentioned in the present embodiment can also be applied to the first embodiment, the second embodiment, and the third embodiment.
It should be noted that each module referred to in this embodiment is a logical module, and in practical applications, one logical unit may be one physical unit, may be a part of one physical unit, and may be implemented by a combination of multiple physical units. In addition, in order to highlight the innovative part of the present invention, elements that are not so closely related to solving the technical problems proposed by the present invention are not introduced in the present embodiment, but this does not indicate that other elements are not present in the present embodiment.
A fifth embodiment of the present invention relates to a sensitive information detecting apparatus, as shown in fig. 7, including at least one processor 501; and a memory 502 communicatively coupled to the at least one processor 501; the memory 502 stores instructions executable by the at least one processor 501, and the instructions are executed by the at least one processor 501, so that the at least one processor 501 can execute the sensitive information detection method described in the first or second embodiment.
The memory 502 and the processor 501 are connected by a bus, which may include any number of interconnected buses and bridges that link one or more of the various circuits of the processor 501 and the memory 502. The bus may also link various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor is transmitted over a wireless medium via an antenna, which further receives the data and transmits the data to the processor.
The processor 501 is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. While memory 502 may be used to store data used by processor 501 in performing operations.
Those skilled in the art can understand that all or part of the steps in the method of the foregoing embodiments may be implemented by a program to instruct related hardware, where the program is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps of the method described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific embodiments for practicing the invention, and that various changes in form and details may be made therein without departing from the spirit and scope of the invention in practice.

Claims (10)

1. A method for sensitive information detection, comprising:
acquiring a byte stream message to be detected;
extracting a text vector from the byte stream message to be detected based on a gate control loop unit GRU network in a preset sensitive information detection model;
processing the text vector under the attention mechanism in the preset sensitive information detection model to obtain text characteristic representation information;
and carrying out normalization processing on the text characteristic representation information through a classifier to obtain a sensitive information detection result.
2. The sensitive information detection method of claim 1, wherein the GRU network comprises a first GRU network, a second GRU network;
the attention mechanism is a layered attention mechanism which comprises a vocabulary level attention mechanism and a sentence level attention mechanism;
the model structure of the preset sensitive information detection model sequentially comprises the first GRU network, the vocabulary level attention mechanism, the second GRU network and the sentence level attention mechanism.
3. The sensitive information detection method of claim 1, wherein the GRU network comprises an update gate and a reset gate;
correspondingly, the extracting a text vector from the byte stream message to be detected based on the gated recurrent unit (GRU) network in the preset sensitive information detection model specifically comprises:
determining, from the byte stream message to be detected, a calculation result corresponding to the update gate and a calculation result corresponding to the reset gate, respectively; and
determining the text vector from the calculation result corresponding to the update gate and the calculation result corresponding to the reset gate.
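For illustration only (not part of the claims), the standard GRU gate computations that claim 3 refers to — an update-gate result, a reset-gate result, and a hidden state derived from both — can be written out directly; the weight shapes and toy input are assumptions of the sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU step: compute the update- and reset-gate results, then
    the new hidden state that feeds the text vector."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev)              # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev)              # reset gate
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev))  # candidate
    return (1 - z) * h_prev + z * h_tilde                    # blend by z

rng = np.random.default_rng(1)
d_in, d_h = 4, 3
p = {k: rng.normal(scale=0.5, size=(d_h, d_in if k[0] == "W" else d_h))
     for k in ["Wz", "Wr", "Wh", "Uz", "Ur", "Uh"]}

h = np.zeros(d_h)
for x in rng.normal(size=(6, d_in)):   # a toy byte-derived sequence
    h = gru_step(x, h, p)
print(h)                               # final hidden state = text vector here
```

The update gate z decides how much of the previous state to keep, while the reset gate r decides how much of it to expose when forming the candidate state.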
4. The sensitive information detection method according to claim 1 or 2, wherein before the acquiring of the byte stream message to be detected, the sensitive information detection method further comprises:
acquiring a message sample to be trained; and
training a sensitive information detection model to be trained on the message sample to be trained to obtain the trained preset sensitive information detection model;
wherein the sensitive information detection model to be trained is a deep learning model based on a GRU network and an attention mechanism, and
the GRU network comprises an update gate and a reset gate.
5. The sensitive information detection method according to claim 4, wherein before the acquiring of the message sample to be trained, the sensitive information detection method further comprises:
acquiring a text sample to be trained;
transcoding the text sample to be trained to obtain an encoded sample; and
performing a packet capture operation on the encoded sample to obtain the message sample to be trained.
6. The sensitive information detection method according to claim 5, wherein the transcoding the text sample to be trained to obtain an encoded sample specifically comprises:
transcoding the text sample to be trained in a preset encoding scheme to obtain the encoded sample;
wherein the preset encoding scheme comprises Unicode encoding, UTF-8 encoding, and GB2312 encoding.
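By way of example only, the transcoding step of claim 6 amounts to producing byte-level representations of the same text under each scheme. The sample string is hypothetical, and "Unicode encoding" is taken here as UTF-16 — an assumption, since the claim does not fix the variant:

```python
# Encode one hypothetical sensitive text sample in each preset scheme.
sample = "身份证号"  # "ID card number" -- an illustrative sample
encodings = {
    "utf-16": sample.encode("utf-16"),   # stand-in for "Unicode encoding"
    "utf-8": sample.encode("utf-8"),
    "gb2312": sample.encode("gb2312"),
}
for name, raw in encodings.items():
    # Each byte string round-trips back to the original text sample.
    assert raw.decode(name) == sample
print({n: len(b) for n, b in encodings.items()})
```

Training on all three byte representations matters because the same sensitive text yields entirely different byte streams on the wire depending on the encoding in use.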
7. The sensitive information detection method according to claim 5, wherein after the packet capture operation is performed on the encoded sample to obtain the message sample to be trained, the sensitive information detection method further comprises:
performing big-endian processing on the message sample to be trained to obtain a first message sample; and
performing word segmentation on the first message sample to obtain a second message sample, and taking the second message sample as a new message sample to be trained.
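As an illustrative sketch only (the claim does not fix a unit size or a segmentation rule — both are assumptions here), the two steps of claim 7 can be modeled as normalizing each 16-bit unit of a captured payload to big-endian order, then splitting the result into fixed-width "words":

```python
import struct

def to_big_endian_words(payload: bytes):
    """Reorder each 16-bit unit of a captured payload into big-endian
    (network) byte order -- the 'first message sample' -- then split it
    into fixed-width words -- the 'second message sample'."""
    if len(payload) % 2:
        payload += b"\x00"                        # pad to a 16-bit boundary
    units = struct.unpack("<%dH" % (len(payload) // 2), payload)
    big = struct.pack(">%dH" % len(units), *units)
    word_size = 2
    # Fixed-width segmentation, an illustrative stand-in for the claimed
    # word-segmentation step.
    return [big[i:i + word_size] for i in range(0, len(big), word_size)]

words = to_big_endian_words(b"\x34\x12\x78\x56")
print(words)
```

Normalizing byte order first ensures that messages captured from little-endian and big-endian hosts segment into the same token sequence before training.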
8. A sensitive information detection apparatus, comprising:
a message acquisition module, configured to acquire a byte stream message to be detected;
a GRU processing module, configured to extract a text vector from the byte stream message to be detected based on a gated recurrent unit (GRU) network in a preset sensitive information detection model;
an attention mechanism processing module, configured to process the text vector under an attention mechanism in the preset sensitive information detection model to obtain text feature representation information; and
a result output module, configured to normalize the text feature representation information through a classifier to obtain a sensitive information detection result.
9. A sensitive information detection device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the sensitive information detection method of any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the sensitive information detection method according to any one of claims 1 to 7.
CN202010940328.7A 2020-09-09 2020-09-09 Sensitive information detection method, device, equipment and storage medium Active CN112134858B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010940328.7A CN112134858B (en) 2020-09-09 2020-09-09 Sensitive information detection method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112134858A CN112134858A (en) 2020-12-25
CN112134858B true CN112134858B (en) 2022-12-13

Family

ID=73846536

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010940328.7A Active CN112134858B (en) 2020-09-09 2020-09-09 Sensitive information detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112134858B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112688950A (en) * 2020-12-26 2021-04-20 中国农业银行股份有限公司 Message classification method and device
CN113660260B (en) * 2021-08-13 2022-12-20 杭州安恒信息技术股份有限公司 Message detection method, system, computer equipment and readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3534283A1 (en) * 2018-03-01 2019-09-04 Crowdstrike, Inc. Classification of source data by neural network processing
CN110414219A (en) * 2019-07-24 2019-11-05 长沙市智为信息技术有限公司 Detection method for injection attack based on gating cycle unit Yu attention mechanism

Also Published As

Publication number Publication date
CN112134858A (en) 2020-12-25

Similar Documents

Publication Publication Date Title
CN111371806B (en) Web attack detection method and device
CN112487807A (en) Text relation extraction method based on expansion gate convolution neural network
CN111797241B (en) Event Argument Extraction Method and Device Based on Reinforcement Learning
CN113722490B (en) Visual rich document information extraction method based on key value matching relation
CN113011191B (en) Knowledge joint extraction model training method
CN108959474B (en) Entity relation extraction method
CN110516210B (en) Text similarity calculation method and device
CN112134858B (en) Sensitive information detection method, device, equipment and storage medium
CN112016313B (en) Spoken language element recognition method and device and warning analysis system
CN110232127B (en) Text classification method and device
CN112015901A (en) Text classification method and device and warning situation analysis system
CN111881256B (en) Text entity relation extraction method and device and computer readable storage medium equipment
CN113836896A (en) Patent text abstract generation method and device based on deep learning
CN113065349A (en) Named entity recognition method based on conditional random field
CN115658934A (en) Image-text cross-modal retrieval method based on multi-class attention mechanism
CN115017879A (en) Text comparison method, computer device and computer storage medium
CN114239584A (en) Named entity identification method based on self-supervision learning
CN112084783B (en) Entity identification method and system based on civil aviation non-civilized passengers
CN111859979A (en) Ironic text collaborative recognition method, ironic text collaborative recognition device, ironic text collaborative recognition equipment and computer readable medium
CN111813927A (en) Sentence similarity calculation method based on topic model and LSTM
CN113449517B (en) Entity relationship extraction method based on BERT gated multi-window attention network model
Tang et al. Interpretability rules: Jointly bootstrapping a neural relation extractorwith an explanation decoder
CN113037729A (en) Deep learning-based phishing webpage hierarchical detection method and system
CN111860662B (en) Training method and device, application method and device of similarity detection model
CN113191149B (en) Method for automatically extracting information of Internet of things equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant