CN111666414A

CN111666414A - Method for detecting cloud service by sensitive data and cloud service platform

Info

Publication number: CN111666414A
Application number: CN202010537941.4A
Authority: CN
Inventors: 周晓勇; 梁淑云; 刘胜; 马影; 陶景龙; 王启凡; 魏国富; 徐�明; 殷钱安; 余贤喆
Original assignee: Information and Data Security Solutions Co Ltd
Current assignee: Information and Data Security Solutions Co Ltd
Priority date: 2020-06-12
Filing date: 2020-06-12
Publication date: 2020-09-15
Anticipated expiration: 2040-06-12
Also published as: CN111666414B

Abstract

The invention discloses a method for a sensitive data detection cloud service and a cloud service platform. The method comprises the following steps: S01, an enterprise uploads training samples to the server through a data interface opened by the server; S02, the server uses the training samples to perform model training and obtains a Bert + BiLSTM classification model; S03, the server uses the Bert + BiLSTM classification model to predict Internet documents and obtains a prediction result; and S04, the server returns the suspected documents in the prediction result to the enterprise. The method provides sensitive data detection to enterprises in the form of a cloud service, which lowers the cost and threshold for enterprises to obtain the service and avoids duplicated investment. Large, medium and small enterprises can thereby obtain detection service of the same standard, the security of network data is improved as a whole, and the method is of considerable social significance.

Description

Method for detecting cloud service by sensitive data and cloud service platform

Technical Field

The invention relates to the technical field of computer data security, and in particular to a method for a sensitive data detection cloud service and a cloud service platform.

Background

Internal enterprise data is a high-value intangible asset, and data leakage incidents have occurred frequently in recent years. Because information on the Internet spreads quickly and in large volume, it is very difficult for enterprises to search for and retrieve leaked data, and large economic losses often result.

At present, large enterprises must spend heavily on software and hardware systems and on operation and maintenance services provided by third parties, while small and medium-sized enterprises cannot afford them. As a result, data security maintenance is costly for large enterprises, small and medium-sized enterprises operate with no protection at all, and serious data security risks exist. Technically, most existing solutions rely on keyword matching, which suffers from low efficiency, high cost, and difficult operation and maintenance.

The method for mining industrial accident record text disclosed in application number 201911106089.9 uses existing natural language algorithms to design a text mining algorithm suited to industrial accident time analysis, which reduces the amount of computation to a certain extent and lowers labor cost. However, this technology still cannot meet the needs of all the different types of enterprises.

Disclosure of Invention

The invention aims to solve the technical problem of reducing the cost and threshold of obtaining the sensitive data detection service by enterprises.

The invention solves the technical problems through the following technical means:

A method for a sensitive data detection cloud service comprises the following steps:

S01, the enterprise uploads training samples

The enterprise uploads the training samples to the server through a data interface opened by the server;

S02, the server uses the training samples to perform model training to obtain a Bert + BiLSTM classification model;

S03, the server predicts Internet documents by using the Bert + BiLSTM classification model to obtain a prediction result;

S04, the server returns the suspected documents in the prediction result to the enterprise.

The method provides sensitive data detection to enterprises in the form of a cloud service, which lowers the cost and threshold for enterprises to obtain the service and avoids duplicated investment. Large, medium and small enterprises can thereby obtain detection service of the same standard, the security of network data is improved as a whole, and the method is of considerable social significance.

Further, in step S01, the training samples are internal documents provided by the enterprise: the enterprise provides a document set whose content is similar to the type of sensitive data it wishes to search for on the Internet, and this set forms the training samples.

Further, in step S02, the training samples sent by the enterprise are defined as positive samples, other document sets that differ from the positive samples are selected as negative samples, and then the following steps are performed:

S021, text preprocessing

The positive and negative samples are preprocessed to obtain the text content of all documents, stored in csv form. The text content is labeled according to whether each document comes from the positive or the negative samples, with label 1 for positive samples and label 0 for negative samples, to obtain a labeled data set. Stratified random sampling is then performed at a set ratio to divide the data set into a training set and a verification set;

S022, Bert + BiLSTM classification model

The Bert + BiLSTM classification model is fine-tuned on the training set to generate a target model suited to the training set;

S023, model evaluation

Using the verification set, the target model is evaluated with the set classification evaluation index. If the result is better than the set threshold, the model training step ends; if it is worse than the set threshold, model optimization or sample optimization is performed.

Further, step S021 specifically includes:

S0211, document processing

The nested directory is traversed recursively, the documents under all subdirectories are copied to a new single-level directory, and files whose names collide are given distinguishing names;

S0212, file format conversion

Files in specific formats are converted to obtain files in the target format;

S0213, text extraction

Different reading functions are used according to file format to output the text content of all positive and negative sample documents, with the text content of each document serving as one piece of training data;

S0214, data set establishment and partitioning

First, the output of step S0213 is labeled according to whether each document comes from the positive or the negative samples, and a data set is established, with label 1 for positive samples and label 0 for negative samples. Stratified random sampling is then performed at a set ratio to divide the data set into a training set and a verification set; the training set is used to train the model, and the verification set is used to assess the model's actual performance.

Further, in step S03, before the Bert + BiLSTM classification model is used to predict the Internet documents, the Internet documents need to be processed by steps S0211-S0213.

Further, in step S023, the set classification evaluation index is one or more of F1-score, accuracy, precision and recall.

Correspondingly, the invention also provides a sensitive data detection cloud service platform to which the above method is applied. The platform comprises:

a data interface module, used by enterprises to upload training samples;

a model training module, used to perform model training with the training samples to generate a target model;

a model prediction module, used to predict Internet documents with the model to obtain a prediction result;

and a prediction result returning module, used to return the suspected documents in the prediction result to the enterprise.

Further, the model training module is implemented as follows: the training samples sent by the enterprise are defined as positive samples, other document sets that differ from the positive samples are selected as negative samples, and then the following are performed:

text preprocessing

Bert + BiLSTM classification model

model evaluation

Further, the text preprocessing specifically comprises the following steps:

document processing

file format conversion

text extraction

data set creation and partitioning

First, the output text content is labeled according to whether each document comes from the positive or the negative samples, and a data set is established, with label 1 for positive samples and label 0 for negative samples. Stratified random sampling is then performed at a set ratio to divide the data set into a training set and a verification set; the training set is used to train the model, and the verification set is used to assess the model's actual performance.

Correspondingly, the present invention further provides a storage medium in which a plurality of instructions are stored, the instructions being suitable for loading and execution by a processor. The instructions are:

data interface, used by enterprises to upload training samples;

model training, in which model training is performed with the training samples to generate a target model;

model prediction, in which the model is used to predict Internet documents to obtain a prediction result;

The invention has the advantages that:

1. The method provides sensitive data detection to enterprises in the form of a cloud service, which lowers the cost and threshold for enterprises to obtain the service and avoids duplicated investment. Large, medium and small enterprises can thereby obtain detection service of the same standard, the security of network data is improved as a whole, and the method is of considerable social significance.

2. Compared with traditional techniques, the method uses AI technology from the field of natural language processing, which greatly improves efficiency and reduces operation and maintenance complexity.

Drawings

Fig. 1 is a flowchart of the method for a sensitive data detection cloud service in embodiment 1 of the present invention;

Fig. 2 is a flowchart of model training in the method for a sensitive data detection cloud service in embodiment 1 of the present invention;

Fig. 3 is an architecture diagram of the sensitive data detection cloud service platform in embodiment 2 of the present invention;

Fig. 4 is a business process diagram of the sensitive data detection cloud service platform in embodiment 2 of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

Referring to the flowchart of Fig. 1, the method for a sensitive data detection cloud service of the present invention includes the following steps:

S01, the enterprise uploads training samples;

The enterprise uploads the training samples to the server through a data interface opened by the server.

The training samples are a set of internal documents provided by the enterprise; the enterprise needs to provide a document set whose content is similar to the type of sensitive data it expects to search for on the Internet. Training sample file formats include, but are not limited to, Office files (docx/xlsx/pptx/csv), scripts (sh/sql/java), web pages (html/css), data (json/log), text (txt), and the like.

S02, the server uses the training samples to train the model and generate a target model;

This is the core step. Referring to Fig. 2, it uses AI technology from the field of natural language processing, including Bert and BiLSTM, and the whole modeling process can be fully automated without human intervention.

Because the model to be built is a classification prediction model, a set of positive and negative samples must be constructed. The training samples provided by the enterprise serve as positive samples, and the negative samples can be built from any other document set as long as it is not similar to the positive samples.

S021, text preprocessing

The input of this step is a training sample, i.e. a document set with a directory structure; the output is the textual content of all documents stored in csv form.

S0211, document processing

In this step, the nested directory is traversed recursively, the documents in all subdirectories are copied to a new single-level directory, and possible file name collisions are handled, for example by distinguishing the files with suffixes such as doc_1, doc_2 and doc_3.
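The flattening in S0211 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `flatten_directory` and its parameters are hypothetical, the destination directory is assumed to lie outside the source tree, and only colliding names receive a numeric suffix (the first file keeps its original name).

```python
import shutil
from pathlib import Path


def flatten_directory(src_root: Path, dst_dir: Path) -> list:
    """Copy every file under src_root into the single-level dst_dir.

    Hypothetical helper: assumes dst_dir is not located inside src_root.
    """
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    # sorted() makes the suffix assignment deterministic across runs
    for src in sorted(p for p in src_root.rglob("*") if p.is_file()):
        target = dst_dir / src.name
        counter = 1
        # On a name collision, append _1, _2, ... before the extension
        while target.exists():
            target = dst_dir / f"{src.stem}_{counter}{src.suffix}"
            counter += 1
        shutil.copy2(src, target)
        copied.append(target)
    return copied
```

Three files named `report.txt` in different subdirectories would end up as `report.txt`, `report_1.txt` and `report_2.txt` in the flat directory.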

S0212, file format conversion

In this step, files in specific formats are converted, for example pdf to docx, doc to docx, and xls to xlsx.

S0213, text extraction

Because files of different formats store text differently, different reading functions are used according to file format, including functions for reading docx, xlsx, pptx, csv, md and other plain-text files. The text content of all positive and negative sample documents is output, and the text content of each document serves as one piece of training data.
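One way to organize the per-format reading functions is a dispatch table keyed by file extension. The sketch below uses only the Python standard library and covers docx, csv and plain text; readers for xlsx and pptx are omitted. All function names and the `READERS` table are assumptions for illustration and do not come from the patent.

```python
import csv
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# XML namespace used by WordprocessingML (the docx body format)
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_plain(path: Path) -> str:
    """Read txt, md, json, log and other plain-text files."""
    return path.read_text(encoding="utf-8", errors="replace")


def read_csv(path: Path) -> str:
    """Join the cells of each row with spaces, one row per line."""
    with path.open(newline="", encoding="utf-8", errors="replace") as fh:
        return "\n".join(" ".join(row) for row in csv.reader(fh))


def read_docx(path: Path) -> str:
    """A .docx file is a zip archive; the body text is in word/document.xml."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{W_NS}p"):
        paragraphs.append("".join(t.text or "" for t in para.iter(f"{W_NS}t")))
    return "\n".join(paragraphs)


# Hypothetical dispatch table: extension -> reading function
READERS = {".docx": read_docx, ".csv": read_csv}


def extract_text(path: Path) -> str:
    """Pick a reader by extension; unknown extensions are read as plain text."""
    reader = READERS.get(path.suffix.lower(), read_plain)
    return reader(path)
```

Adding a new format then only requires registering another function in `READERS`, which keeps the extraction step independent of how many formats are supported.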

S0214, data set establishment and partitioning

First, the output of S0213 is labeled according to whether each document comes from the positive or the negative samples, and a data set is established, with label 1 for positive samples and label 0 for negative samples. Stratified random sampling is then performed at a set ratio to divide the data set into a training set and a verification set; the training set is used to train the model, and the verification set is used to assess the model's actual performance.
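The labeling and partitioning of S0214 might look like the sketch below, which reads the sampling as stratified random sampling so that both labels keep the same proportion in the training and verification sets. The function names, the 0.2 verification ratio and the fixed seed are illustrative assumptions; the patent only says "a certain proportion".

```python
import random


def build_dataset(positive_texts, negative_texts):
    """Label positive documents 1 and negative documents 0."""
    return [(t, 1) for t in positive_texts] + [(t, 0) for t in negative_texts]


def stratified_split(dataset, val_ratio=0.2, seed=42):
    """Split (text, label) pairs so each label keeps the same ratio in both sets.

    val_ratio and seed are assumed values chosen for illustration.
    """
    rng = random.Random(seed)
    train, val = [], []
    for label in (0, 1):
        group = [row for row in dataset if row[1] == label]
        rng.shuffle(group)
        n_val = round(len(group) * val_ratio)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    # Shuffle again so the two labels are interleaved within each set
    rng.shuffle(train)
    rng.shuffle(val)
    return train, val
```

With 10 positive and 20 negative documents and a ratio of 0.2, the verification set receives 2 positive and 4 negative documents, and the remaining 24 go to the training set.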

S022, Bert + BiLSTM classification model

The use of Bert + BiLSTM is the key to improving the performance of the prediction model.

Bert (Bidirectional Encoder Representations from Transformers) is a language model published by Google in 2018 which, at the time, represented the state of the art in natural language processing. Compared with other leading technologies such as ELMo and GPT, Bert can acquire information from both the preceding and the following context by training on a Masked Language Model task and a next-sentence prediction task, so it can better understand contextual relationships.

LSTM (Long Short-Term Memory) is a recurrent neural network. Compared with other techniques, LSTM takes into account the order of and distance between words and is particularly suited to sequence data such as text, but it can only process information in one direction. BiLSTM (Bi-directional Long Short-Term Memory) combines a forward LSTM and a backward LSTM and can capture bidirectional semantic dependencies.

In this step, the training set output by S0214 is used to fine-tune the Bert + BiLSTM model as a whole, generating a target model suited to the training set. Bert does not merely convert text into vectors; it is combined with the subsequent BiLSTM into a single model, with the output of Bert connected to the classification task. During training, the parameters of some Bert layers are also updated along with the task so that they better fit the current task, which allows the model as a whole to perform as well as possible.
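The bidirectional combination can be written compactly in standard BiLSTM notation. The symbols are assumptions introduced here for illustration: x_t is the Bert output vector for token t, T is the sequence length, and the final line shows one common way to attach a binary classifier (sigmoid over the last forward and last backward states with weights w and bias b). The patent does not specify the classification head.

```latex
\begin{aligned}
\overrightarrow{h}_t &= \mathrm{LSTM}_{\mathrm{fwd}}\bigl(x_t,\ \overrightarrow{h}_{t-1}\bigr), \\
\overleftarrow{h}_t  &= \mathrm{LSTM}_{\mathrm{bwd}}\bigl(x_t,\ \overleftarrow{h}_{t+1}\bigr), \\
h_t &= \bigl[\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\bigr], \qquad t = 1, \dots, T, \\
\hat{y} &= \sigma\bigl(w^{\top}\bigl[\overrightarrow{h}_T \,;\, \overleftarrow{h}_1\bigr] + b\bigr).
\end{aligned}
```

Here the predicted value lies between 0 and 1 and can be read as the confidence that a document belongs to the positive (label 1) class.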

S023, model evaluation

In this step, the verification set produced by S0214 is used to evaluate the model generated in S022 with the set classification evaluation index. If the result is better than the set threshold, the model training step ends; if it is worse than the set threshold, model optimization or sample optimization is required.

The set classification evaluation index is chosen for a classification prediction model. Candidate indexes include, but are not limited to, F1-score, accuracy, precision and recall; F1-score is usually selected in this method.
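The evaluation indexes named above follow directly from the confusion-matrix counts on the verification set. The sketch below computes them for binary labels and applies a threshold check; the function names and the 0.9 threshold are illustrative assumptions, since the patent does not state a threshold value.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def training_finished(metrics, index="f1", threshold=0.9):
    """True when the chosen index is better than the (assumed) threshold."""
    return metrics[index] > threshold
```

When `training_finished` returns False, the flow described in S023 would go back to model optimization or sample optimization instead of ending the training step.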

S03, the server predicts Internet documents by using the model to obtain a prediction result;

Internet documents are documents that the server obtains from the Internet with permission; sources include, but are not limited to, web pages, code hosting platforms, libraries, forums, posts and the like.

The prediction result indicates, for each document, whether it is a suspected document: a predicted label of "yes" or "no" together with a prediction confidence. Before prediction, the Internet documents also need to undergo the text preprocessing of S0211-S0213.

S04, the server returns the suspected documents in the prediction result to the enterprise;

Based on the predicted label and the prediction confidence in the prediction result, the server screens out the subset of documents whose predicted label is "yes" and whose confidence is relatively high, and returns this subset to the enterprise through the data interface.
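The screening in S04 amounts to a filter on label and confidence. In the sketch below, the `Prediction` record, the function name and the 0.8 cut-off are hypothetical; the patent only requires a "yes" label and "relatively high" confidence.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    doc_id: str        # identifier of the Internet document (assumed field)
    label: int         # 1 means "yes" (suspected), 0 means "no"
    confidence: float  # prediction confidence in [0, 1]


def select_suspected(predictions, min_confidence=0.8):
    """Keep documents predicted "yes" with confidence at or above the cut-off.

    min_confidence is an assumed value; results are sorted most confident first.
    """
    hits = [p for p in predictions
            if p.label == 1 and p.confidence >= min_confidence]
    return sorted(hits, key=lambda p: p.confidence, reverse=True)
```

Sorting by confidence is a design choice made here so that the enterprise sees the most likely leaks first; it is not stated in the source.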

Example 2

Referring to Fig. 3, the invention further discloses a sensitive data detection cloud service platform, which deploys a collector and a model in the cloud and thereby combines strong document acquisition capability with model service capability. Referring to Fig. 4, the business process by which the cloud service platform serves an enterprise involves the following core modules:

a data interface module, used by enterprises to upload training samples;

a model training module, used to perform model training with the training samples to obtain a Bert + BiLSTM classification model;

a model prediction module, used to predict Internet documents with the model to obtain a prediction result;

Example 3

On the basis of embodiment 1 and embodiment 2, this embodiment provides a storage medium in which a plurality of instructions are stored, the instructions being suitable for loading and execution by a processor. The instructions are:

data interface, used by enterprises to upload training samples;

model training, in which model training is performed with the training samples to generate a target model; this instruction is executed as follows:

Referring to Fig. 2, AI technology from the field of natural language processing, including Bert and BiLSTM, is used, and the whole modeling process can be fully automated without human intervention.

S021, text preprocessing

S0211, document processing

S0212, file format conversion

S0213, text extraction

S0214, data set establishment and partitioning

S022, Bert + BiLSTM classification model

S023, model evaluation

Model prediction, in which the model is used to predict Internet documents to obtain a prediction result; this instruction is executed as follows:

Prediction result return, in which the suspected documents in the prediction result are returned to the enterprise. Based on the predicted label and the prediction confidence in the prediction result, the server screens out the subset of documents whose predicted label is "yes" and whose confidence is relatively high, and returns this subset to the enterprise through the data interface.

The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for a sensitive data detection cloud service, characterized by comprising the following steps:

S01, the enterprise uploads training samples

2. The method for a sensitive data detection cloud service according to claim 1, wherein: in step S01, the training samples are internal documents provided by the enterprise, and the enterprise provides a document set whose content is similar to the sensitive data it wishes to search for on the Internet, to form the training samples.

3. The method for a sensitive data detection cloud service according to claim 1, wherein: in step S02, the training samples sent by the enterprise are defined as positive samples, other document sets that differ from the positive samples are selected as negative samples, and then the following steps are performed:

S021, text preprocessing

S022, Bert + BiLSTM classification model

S023, model evaluation

4. The method for a sensitive data detection cloud service according to claim 3, wherein step S021 specifically comprises the following steps:

S0211, document processing

S0212, file format conversion

S0213, text extraction

S0214, data set establishment and partitioning

5. The method for a sensitive data detection cloud service according to claim 4, wherein: in step S03, before the Bert + BiLSTM classification model is used to predict the Internet documents, the Internet documents need to be processed by steps S0211-S0213.

6. The method for sensitive data detection cloud service according to claim 3, wherein: in step S023, the classification evaluation index is set to be one or more of F1-score, accuracy, precision, and recall.

7. A sensitive data detection cloud service platform applying the method of any one of claims 1 to 6, characterized in that the platform comprises

8. The sensitive data detection cloud service platform according to claim 7, wherein the model training module is executed as follows: the training samples sent by the enterprise are defined as positive samples, other document sets that differ from the positive samples are selected as negative samples, and then the following are performed:

text preprocessing

Bert + BiLSTM classification model

The Bert + BiLSTM classification model is fine-tuned on the training set to generate a target model suited to the training set;

model evaluation

9. The sensitive data detection cloud service platform according to claim 8, wherein the text preprocessing specifically comprises:

document processing

file format conversion

text extraction

data set creation and partitioning

10. A storage medium storing a plurality of instructions suitable for loading and execution by a processor, characterized in that the instructions are:

data interface, used by the enterprise to upload training samples and by the server to return the suspected documents in the prediction result to the enterprise;

prediction result return, in which the suspected documents in the prediction result are returned to the enterprise.