CN111666414A - Method for detecting cloud service by sensitive data and cloud service platform - Google Patents

Publication number
CN111666414A
Authority
CN
China
Prior art keywords
training
model
sample
document
enterprise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010537941.4A
Other languages
Chinese (zh)
Other versions
CN111666414B (en
Inventor
周晓勇
梁淑云
刘胜
马影
陶景龙
王启凡
魏国富
徐�明
殷钱安
余贤喆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information and Data Security Solutions Co Ltd
Original Assignee
Information and Data Security Solutions Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information and Data Security Solutions Co Ltd
Priority to CN202010537941.4A
Publication of CN111666414A
Application granted
Publication of CN111666414B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The invention discloses a sensitive data detection cloud service method and a cloud service platform. The method comprises the following steps: S01, an enterprise uploads training samples to the service provider through a data interface opened by the service provider; S02, the service provider performs model training with the training samples to obtain a Bert + BiLSTM classification model; S03, the service provider predicts Internet documents with the Bert + BiLSTM classification model to obtain a prediction result; and S04, the service provider returns the suspected documents in the prediction result to the enterprise. The method provides sensitive data detection to enterprises in the form of a cloud service, which lowers the cost and threshold of obtaining the service and avoids repeated investment. Large, medium and small enterprises can therefore obtain detection service of the same standard, which improves the security of network data as a whole and is of considerable social significance.

Description

Method for detecting cloud service by sensitive data and cloud service platform
Technical Field
The invention relates to the technical field of computer data security, and in particular to a sensitive data detection cloud service method and a cloud service platform.
Background
Data inside an enterprise is a high-value intangible asset, and data leakage incidents have occurred frequently in recent years. Because information on the Internet spreads quickly and in large volume, it is very difficult for an enterprise to search for and retrieve leaked data, and heavy economic losses often result.
At present, large enterprises must spend heavily on software and hardware systems and on operation and maintenance services provided by third parties, while small and medium-sized enterprises cannot afford them. As a result, data security maintenance is costly for large enterprises, and small and medium-sized enterprises operate essentially unprotected, which poses a serious data security risk. Technically, most existing solutions rely on keyword matching, which is inefficient, costly, and difficult to operate and maintain.
The method for mining industrial accident record text disclosed in application No. 201911106089.9 uses existing natural language algorithms to design a text mining algorithm suited to industrial accident time analysis, which reduces the amount of computation to some extent and lowers labor cost. However, that technique still cannot meet the needs of the different types of enterprises.
Disclosure of Invention
The technical problem to be solved by the invention is how to reduce the cost and threshold for enterprises to obtain a sensitive data detection service.
The invention solves this technical problem through the following technical means:
A sensitive data detection cloud service method comprises the following steps:
S01, the enterprise uploads training samples
The enterprise uploads the training samples to the service provider through a data interface opened by the service provider;
S02, the service provider performs model training with the training samples to obtain a Bert + BiLSTM classification model;
S03, the service provider predicts Internet documents with the Bert + BiLSTM classification model to obtain a prediction result;
S04, the service provider returns the suspected documents in the prediction result to the enterprise.
The method provides sensitive data detection to enterprises in the form of a cloud service, which lowers the cost and threshold of obtaining the service and avoids repeated investment. Large, medium and small enterprises can therefore obtain detection service of the same standard, which improves the security of network data as a whole and is of considerable social significance.
Further, in step S01, the training samples are internal documents provided by the enterprise: the enterprise provides a document set whose content is similar to the type of sensitive data it wishes to search for on the Internet, and this set forms the training samples.
Further, in step S02, the training samples sent by the enterprise are defined as positive samples, other document sets that differ from the positive samples are selected as negative samples, and then:
S021, text preprocessing
The positive and negative samples are preprocessed to obtain the text content of all documents, stored in csv form. The text content is labeled according to whether each document comes from the positive or the negative samples (label 1 for positive samples, label 0 for negative samples) to obtain a labeled data set; the data set is then divided into a training set and a verification set by stratified random sampling at a set ratio;
S022, Bert + BiLSTM classification model
The Bert + BiLSTM classification model is fine-tuned on the training set to generate a target model suited to the training set;
S023, model evaluation
The target model is evaluated on the verification set with the set classification evaluation index. If the result is better than the set threshold, the model training step ends; if it is worse than the set threshold, model optimization or sample optimization is performed.
Further, step S021 specifically includes:
S0211, document processing
The nested directories are traversed, the documents under all subdirectories are copied to a new single-level directory, and files whose names collide are given distinguishing names;
S0212, file format conversion
Files in specific formats are converted to obtain files in the target format;
S0213, text extraction
The text content of all positive and negative sample documents is output using a reading function chosen according to the file format, and the text content of each document serves as one piece of training data;
S0214, data set establishment and partitioning
First, the results output in step S0213 are labeled according to whether each document comes from the positive or the negative samples (label 1 for positive samples, label 0 for negative samples) to establish a data set. The data set is then divided into a training set and a verification set by stratified random sampling at a set ratio; the training set is used to train the model, and the verification set is used to judge the real effect of the model.
Further, in step S03, before the Internet documents are predicted with the Bert + BiLSTM classification model, they are processed through steps S0211 to S0213.
Further, in step S023, the set classification evaluation index is one or more of F1-score, accuracy, precision and recall.
Correspondingly, the invention also provides a sensitive data detection cloud service platform to which the above method is applied. The platform comprises:
a data interface module, used by the enterprise to upload training samples;
a model training module, used to perform model training with the training samples to generate a target model;
a model prediction module, used to predict Internet documents with the model to obtain a prediction result;
a prediction result returning module, used to return the suspected documents in the prediction result to the enterprise.
Further, the model training module is implemented as follows: the training samples sent by the enterprise are defined as positive samples, other document sets that differ from the positive samples are selected as negative samples, and then:
Text preprocessing
The positive and negative samples are preprocessed to obtain the text content of all documents, stored in csv form. The text content is labeled according to whether each document comes from the positive or the negative samples (label 1 for positive samples, label 0 for negative samples) to obtain a labeled data set; the data set is then divided into a training set and a verification set by stratified random sampling at a set ratio;
Bert + BiLSTM classification model
The Bert + BiLSTM classification model is fine-tuned on the training set to generate a target model suited to the training set;
Model evaluation
The target model is evaluated on the verification set with the set classification evaluation index. If the result is better than the set threshold, the model training step ends; if it is worse than the set threshold, model optimization or sample optimization is performed.
Further, the text preprocessing specifically comprises the following steps:
Document processing
The nested directories are traversed, the documents under all subdirectories are copied to a new single-level directory, and files whose names collide are given distinguishing names;
File format conversion
Files in specific formats are converted to obtain files in the target format;
Text extraction
The text content of all positive and negative sample documents is output using a reading function chosen according to the file format, and the text content of each document serves as one piece of training data;
Data set establishment and partitioning
First, the output text content is labeled according to whether each document comes from the positive or the negative samples (label 1 for positive samples, label 0 for negative samples) to establish a data set. The data set is then divided into a training set and a verification set by stratified random sampling at a set ratio; the training set is used to train the model, and the verification set is used to judge the real effect of the model.
Correspondingly, the invention further provides a storage medium storing a plurality of instructions suitable for being loaded and executed by a processor, the instructions being:
data interface, used by the enterprise to upload training samples;
model training, used to perform model training with the training samples to generate a target model;
model prediction, used to predict Internet documents with the model to obtain a prediction result;
prediction result returning, used to return the suspected documents in the prediction result to the enterprise.
The invention has the following advantages:
1. The method provides sensitive data detection to enterprises in the form of a cloud service, which lowers the cost and threshold of obtaining the service and avoids repeated investment. Large, medium and small enterprises can therefore obtain detection service of the same standard, which improves the security of network data as a whole and is of considerable social significance.
2. Compared with traditional techniques, the invention uses AI technology from the field of natural language processing, which greatly improves efficiency and reduces operation and maintenance complexity.
Drawings
Fig. 1 is a flowchart of the sensitive data detection cloud service method in Example 1 of the present invention;
Fig. 2 is a flowchart of model training in the sensitive data detection cloud service method in Example 1 of the present invention;
Fig. 3 is an architecture diagram of the sensitive data detection cloud service platform in Example 2 of the present invention;
Fig. 4 is a business process diagram of the sensitive data detection cloud service platform in Example 2 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
Referring to the flowchart of Fig. 1, the sensitive data detection cloud service method of the invention includes the following steps:
S01, the enterprise uploads training samples;
The enterprise uploads the training samples to the service provider through a data interface opened by the service provider.
The training samples are an internal document set provided by the enterprise; the enterprise needs to provide a document set whose content is similar to the type of sensitive data it wishes to search for on the Internet. Training sample file formats include, but are not limited to, Office files (docx/xlsx/pptx/csv), scripts (sh/sql/java), web pages (html/css), data (json/log), and text (txt).
S02, the service provider performs model training with the training samples to generate a target model;
This is the core step. Referring to Fig. 2, it uses AI technology from the field of natural language processing, including Bert and BiLSTM, and the whole modeling process can be fully automated without human intervention.
Because the model to be built is a classification prediction model, positive and negative sample sets need to be constructed. The training samples provided by the enterprise serve as positive samples, and the negative samples can be built from any other document set as long as it is not similar to the positive samples.
S021, text preprocessing
The input of this step is a training sample, i.e. a document set with a directory structure; the output is the textual content of all documents stored in csv form.
S0211, document processing
In this step, the nested directories are traversed, the documents in all subdirectories are copied to a new single-level directory, and possible file name collisions are handled, for example by appending a suffix to distinguish the files, such as doc_1, doc_2 and doc_3.
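The flattening step described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the patent's implementation: the function name `flatten_documents` is hypothetical, the suffix is inserted before the file extension, and the destination directory is assumed to lie outside the source tree.

```python
import shutil
from pathlib import Path
from typing import List


def flatten_documents(src_root: Path, dst_dir: Path) -> List[Path]:
    """Copy every file under src_root into the single-level directory dst_dir.

    When a file name is already taken, _1, _2, ... is appended to the stem.
    Assumes dst_dir is not located inside src_root.
    """
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    # Sort the paths so that suffix assignment is deterministic across runs.
    for src in sorted(p for p in src_root.rglob("*") if p.is_file()):
        target = dst_dir / src.name
        counter = 1
        while target.exists():
            target = dst_dir / f"{src.stem}_{counter}{src.suffix}"
            counter += 1
        shutil.copy2(src, target)
        copied.append(target)
    return copied
```

Two files named `doc.txt` in different subdirectories would end up as `doc.txt` and `doc_1.txt` in the flat directory.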
S0212, file format conversion
In this step, files in specific formats are converted, for example pdf to docx, doc to docx, and xls to xlsx.
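One way to organize this step is a lookup table from source format to target format. The sketch below only plans the conversion; the conversion itself would be delegated to an external converter, which the source does not name. `CONVERSIONS` and `plan_conversion` are hypothetical names, and the table contains only the three examples given in the text.

```python
from pathlib import Path
from typing import Optional

# Source formats mapped to the target formats named in the text.
CONVERSIONS = {".pdf": ".docx", ".doc": ".docx", ".xls": ".xlsx"}


def plan_conversion(path: Path) -> Optional[Path]:
    """Return the target path for a file that needs conversion, else None."""
    target_suffix = CONVERSIONS.get(path.suffix.lower())
    if target_suffix is None:
        return None  # already in a format the readers can handle
    return path.with_suffix(target_suffix)
```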
S0213, text extraction
Because files in different formats store text differently, a different reading function is used for each file format. These include separate functions for reading docx, xlsx, pptx, csv, md and other plain-text files. The text content of all positive and negative sample documents is output, and the text content of each document serves as one piece of training data.
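The per-format dispatch can be illustrated with a small reader table. This sketch uses only the Python standard library and covers just two cases: plain text, and docx read directly as a zip archive containing `word/document.xml`. Readers for xlsx and pptx are omitted. The names `READERS`, `read_plain`, `read_docx` and `extract_text` are hypothetical, and a production system would more likely rely on dedicated document libraries.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# XML namespace used by WordprocessingML elements inside a docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_plain(path: Path) -> str:
    """Read txt, md, csv, json, log and similar files as plain text."""
    return path.read_text(encoding="utf-8", errors="replace")


def read_docx(path: Path) -> str:
    """Extract paragraph text from a docx file (a zip holding word/document.xml)."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{W_NS}p"):
        text = "".join(node.text or "" for node in para.iter(f"{W_NS}t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


# Extension-to-reader table; unknown extensions fall back to plain text.
READERS = {".docx": read_docx}


def extract_text(path: Path) -> str:
    """Pick a reading function by file extension and return the document text."""
    reader = READERS.get(path.suffix.lower(), read_plain)
    return reader(path)
```

Adding a format then amounts to registering one more function in `READERS`.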
S0214, data set establishment and partitioning
First, the results output by S0213 are labeled according to whether each document comes from the positive or the negative samples (label 1 for positive samples, label 0 for negative samples) to establish a data set. The data set is then divided into a training set and a verification set by stratified random sampling at a set ratio; the training set is used to train the model, and the verification set is used to judge the real effect of the model.
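Labeling and stratified splitting can be sketched in a few lines. The source does not state the split ratio, so the 20% verification share below is an assumption, as are the function names `build_dataset` and `stratified_split` and the fixed random seed.

```python
import random


def build_dataset(positive_texts, negative_texts):
    """Label positive documents 1 and negative documents 0."""
    return [(t, 1) for t in positive_texts] + [(t, 0) for t in negative_texts]


def stratified_split(dataset, val_ratio=0.2, seed=0):
    """Split each label group separately so both sets keep the class ratio."""
    rng = random.Random(seed)
    train, val = [], []
    for label in (0, 1):
        group = [row for row in dataset if row[1] == label]
        rng.shuffle(group)
        n_val = round(len(group) * val_ratio)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    # Shuffle again so the two classes are interleaved within each set.
    rng.shuffle(train)
    rng.shuffle(val)
    return train, val
```

Splitting per label is what makes the sampling stratified: a rare positive class keeps the same share in the verification set as in the training set.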
S022, Bert + BiLSTM classification model
The use of Bert + BiLSTM is the key to improving the effect of the prediction model.
Bert (Bidirectional Encoder Representations from Transformers) is a language model published by Google in 2018 that represented the highest level of natural language processing at the time of filing. Compared with other leading techniques such as ELMo and GPT, Bert is pre-trained on a masked language model task and a next sentence prediction task, which lets it draw on information from both the preceding and the following context and therefore understand contextual relationships better.
LSTM (Long Short-Term Memory) is a recurrent neural network. Compared with other techniques, LSTM takes the order of and distance between words into account, which makes it particularly suitable for sequence data such as text, but it can only process information in one direction. BiLSTM (Bi-directional Long Short-Term Memory) combines a forward LSTM and a backward LSTM and can capture bidirectional semantic dependencies.
In this step, the training set output by S0214 is used to fine-tune the Bert + BiLSTM model as a whole, generating a target model suited to the training set. Bert is not used only to convert text into vectors: it is combined with the subsequent BiLSTM into a single model, with the output of Bert connected to the classification task. During training, the parameters of some of Bert's layers are also updated along with the task so that they better fit the current task, with the aim of optimizing the result of the overall model.
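One common way to write such a combined model is given below. It is an assumed formulation for illustration, not the patent's stated architecture: the source does not specify how the BiLSTM states are pooled, what the output layer is, or which loss is used. Here w_1, ..., w_T are the input tokens, x_t the Bert output vectors, y in {0, 1} the document label, and W, b the parameters of a hypothetical linear output layer.

```latex
\begin{aligned}
(x_1, \dots, x_T) &= \mathrm{Bert}(w_1, \dots, w_T) \\
\overrightarrow{h}_t &= \overrightarrow{\mathrm{LSTM}}\bigl(x_t, \overrightarrow{h}_{t-1}\bigr), \qquad
\overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}\bigl(x_t, \overleftarrow{h}_{t+1}\bigr) \\
z &= \bigl[\overrightarrow{h}_T \,;\, \overleftarrow{h}_1\bigr] \\
\hat{p} &= \sigma(W z + b) \\
\mathcal{L} &= -\bigl[\, y \log \hat{p} + (1 - y) \log(1 - \hat{p}) \,\bigr]
\end{aligned}
```

Fine-tuning the model as a whole means that the gradient of the loss flows through the output layer and the BiLSTM back into the Bert layers that are left trainable, which matches the text's statement that some of Bert's parameters are updated along with the task.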
S023, model evaluation
In this step, the verification set produced by S0214 is used, and the set classification evaluation index is adopted to evaluate the model generated by S022. If the result is better than the set threshold, the model training step ends; if it is worse, model optimization or sample optimization is required.
The classification evaluation index is chosen to suit a classification prediction model; candidate indexes include, but are not limited to, F1-score, accuracy, precision and recall, and this method usually uses F1-score.
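The evaluation indexes named above, and the threshold check that decides whether training ends, can be computed as follows. The metric formulas are the standard ones for binary classification; the function names and the threshold value of 0.9 are assumptions, since the source does not state a threshold.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def training_finished(metrics, metric="f1", threshold=0.9):
    """Return True when the chosen metric is better than the set threshold."""
    return metrics[metric] > threshold
```

When `training_finished` returns False, the process described in the text goes back to model optimization or sample optimization.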
S03, the service provider predicts Internet documents with the model to obtain a prediction result;
Internet documents are documents that the service provider obtains from the Internet with permission; sources include, but are not limited to, web pages, code hosting platforms, document libraries, forums, and posts.
The prediction result is a prediction of whether each document is a suspected document, consisting of a "yes" or "no" prediction label and a prediction confidence. Before prediction, the Internet documents also undergo the text preprocessing of S0211 to S0213.
S04, the service provider returns the suspected documents in the prediction result to the enterprise;
Based on the prediction labels and prediction confidences in the prediction result, the service provider screens out the subset of documents whose prediction label is "yes" and whose confidence is relatively high, and returns this subset to the enterprise through the data interface.
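The screening in S04 can be sketched as a filter on label and confidence. The source only says "relatively high" confidence, so the cutoff of 0.8 is an assumption, as are the `Prediction` record and the `select_suspected` function.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    doc_id: str
    label: int         # 1 = predicted "yes" (suspected), 0 = predicted "no"
    confidence: float  # confidence in the predicted label, in [0, 1]


def select_suspected(predictions, min_confidence=0.8):
    """Keep 'yes' predictions at or above the cutoff, most confident first."""
    hits = [p for p in predictions if p.label == 1 and p.confidence >= min_confidence]
    return sorted(hits, key=lambda p: p.confidence, reverse=True)
```

Raising the cutoff returns fewer documents to the enterprise but lowers the share of false alarms among them.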
Example 2
Referring to Fig. 3, the invention further discloses a sensitive data detection cloud service platform, which integrates strong document collection capability and model service capability by deploying a collector and a model in the cloud. With reference to Fig. 4, the business process by which the cloud service platform serves an enterprise involves the following core modules:
a data interface module, used by the enterprise to upload training samples;
a model training module, used to perform model training with the training samples to obtain a Bert + BiLSTM classification model;
a model prediction module, used to predict Internet documents with the model to obtain a prediction result;
a prediction result returning module, used to return the suspected documents in the prediction result to the enterprise.
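The way the four modules fit together can be sketched as one small class. This is a structural illustration only: the class `DetectionPlatform`, its method names, the injected `train_fn` callable and the 0.8 confidence cutoff are all hypothetical, and the real training module would fine-tune the Bert + BiLSTM model rather than accept an arbitrary function.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# A trained model maps document text to (label, confidence).
Model = Callable[[str], Tuple[int, float]]


@dataclass
class DetectionPlatform:
    train_fn: Callable[[List[str]], Model]  # stands in for the training procedure
    min_confidence: float = 0.8             # hypothetical screening cutoff
    samples: List[str] = field(default_factory=list)
    model: Optional[Model] = None

    def upload_samples(self, documents):    # data interface module
        self.samples.extend(documents)

    def train(self):                        # model training module
        self.model = self.train_fn(self.samples)

    def predict(self, internet_docs):       # model prediction module
        return [(doc, *self.model(doc)) for doc in internet_docs]

    def return_suspected(self, results):    # prediction result returning module
        return [doc for doc, label, conf in results
                if label == 1 and conf >= self.min_confidence]
```

Keeping the training procedure behind a single callable lets the same platform skeleton serve many enterprises, each with its own uploaded samples and its own target model.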
Example 3
This embodiment provides a storage medium on the basis of Example 1 and Example 2. The storage medium stores a plurality of instructions suitable for being loaded and executed by a processor, the instructions being:
data interface, used by the enterprise to upload training samples;
model training, used to perform model training with the training samples to generate a target model; the instruction is executed as follows:
The execution process is the same as step S02 of Example 1 (see Fig. 2), that is, sub-steps S021 (text preprocessing, comprising S0211 to S0214), S022 (Bert + BiLSTM classification model) and S023 (model evaluation) as described there.
model prediction, used to predict Internet documents with the model to obtain a prediction result; the instruction is executed as follows:
Internet documents are documents that the service provider obtains from the Internet with permission; sources include, but are not limited to, web pages, code hosting platforms, document libraries, forums, and posts.
The prediction result is a prediction of whether each document is a suspected document, consisting of a "yes" or "no" prediction label and a prediction confidence. Before prediction, the Internet documents also undergo the text preprocessing of S0211 to S0213.
prediction result returning, used to return the suspected documents in the prediction result to the enterprise. Based on the prediction labels and prediction confidences in the prediction result, the service provider screens out the subset of documents whose prediction label is "yes" and whose confidence is relatively high, and returns this subset to the enterprise through the data interface.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A sensitive data detection cloud service method, characterized in that the method comprises the following steps:
S01, the enterprise uploads training samples
The enterprise uploads the training samples to the service provider through a data interface opened by the service provider;
S02, the service provider performs model training with the training samples to obtain a Bert + BiLSTM classification model;
S03, the service provider predicts Internet documents with the Bert + BiLSTM classification model to obtain a prediction result;
S04, the service provider returns the suspected documents in the prediction result to the enterprise.
2. The sensitive data detection cloud service method according to claim 1, wherein: in step S01, the training samples are internal documents provided by the enterprise, and the enterprise provides a document set whose content is similar to the sensitive data it wishes to search for on the Internet, so as to form the training samples.
3. The sensitive data detection cloud service method according to claim 1, wherein: in step S02, the training samples sent by the enterprise are defined as positive samples, other document sets that differ from the positive samples are selected as negative samples, and then:
S021, text preprocessing
The positive and negative samples are preprocessed to obtain the text content of all documents, stored in csv form. The text content is labeled according to whether each document comes from the positive or the negative samples (label 1 for positive samples, label 0 for negative samples) to obtain a labeled data set; the data set is then divided into a training set and a verification set by stratified random sampling at a set ratio;
S022, Bert + BiLSTM classification model
The Bert + BiLSTM classification model is fine-tuned on the training set to generate a target model suited to the training set;
S023, model evaluation
The target model is evaluated on the verification set with the set classification evaluation index. If the result is better than the set threshold, the model training step ends; if it is worse than the set threshold, model optimization or sample optimization is performed.
4. The method for sensitive data detection cloud service according to claim 3, wherein: step S021 specifically comprises the following steps:
S0211. Document processing
The nested directories are traversed, the documents under all subdirectories are copied into a new single-level directory, and files whose names collide are given distinguishing names;
S0212. File format conversion
Files in specific formats are converted to obtain files in the target format;
S0213. Text extraction
The text content of all positive and negative sample documents is output using a reading function chosen according to the file format, and the text content of each document serves as one piece of training data;
S0214. Data set establishment and partitioning
First, the output of step S0213 is labeled according to whether each document comes from the positive or the negative samples, to establish a data set in which positive samples are labeled 1 and negative samples are labeled 0; the data set is then divided into a training set and a verification set by stratified random sampling at a given ratio, the training set being used to train the model and the verification set to judge the model's real performance.
5. The method for sensitive data detection cloud service according to claim 4, wherein: in step S03, before the Bert + BiLSTM classification model is used to predict the Internet documents, the Internet documents are first processed by steps S0211 to S0213.
6. The method for sensitive data detection cloud service according to claim 3, wherein: in step S023, the classification evaluation index is set to be one or more of F1-score, accuracy, precision, and recall.
7. A sensitive data detection cloud service platform applying the method of any one of claims 1 to 6, characterized in that the platform comprises:
a data interface module, used by the enterprise to upload training samples;
a model training module, used to perform model training with the training samples to generate a target model;
a model prediction module, used to predict Internet documents with the model to obtain a prediction result;
and a prediction result returning module, used to return the suspected documents in the prediction result to the enterprise.
8. The sensitive data detection cloud service platform according to claim 7, wherein: the model training module executes as follows: the training samples sent by the enterprise are defined as positive samples, other document sets that differ from the positive samples are selected as negative samples, and then the following steps are performed:
Text preprocessing
The positive and negative samples are preprocessed to obtain the text content of all documents, stored in csv form; the text content is labeled according to whether each document comes from the positive or the negative samples, with positive samples labeled 1 and negative samples labeled 0, to obtain a labeled data set; the data set is then divided into a training set and a verification set by stratified random sampling at a given ratio;
Bert + BiLSTM classification model
The Bert + BiLSTM classification model is fine-tuned on the training set to generate a target model suited to the training set;
Model evaluation
The target model is evaluated on the verification set using the set classification evaluation index; if the result is better than the set threshold, the model training step ends, and if it is worse than the set threshold, model optimization or sample optimization is performed.
9. The sensitive data detection cloud service platform according to claim 8, wherein: the text preprocessing specifically comprises:
Document processing
The nested directories are traversed, the documents under all subdirectories are copied into a new single-level directory, and files whose names collide are given distinguishing names;
File format conversion
Files in specific formats are converted to obtain files in the target format;
Text extraction
The text content of all positive and negative sample documents is output using a reading function chosen according to the file format, and the text content of each document serves as one piece of training data;
Data set establishment and partitioning
First, the output text content is labeled according to whether each document comes from the positive or the negative samples, to establish a data set in which positive samples are labeled 1 and negative samples are labeled 0; the data set is then divided into a training set and a verification set by stratified random sampling at a given ratio, the training set being used to train the model and the verification set to judge the model's real performance.
10. A storage medium storing a plurality of instructions adapted to be loaded and executed by a processor, characterized in that the plurality of instructions are:
data interface, used by the enterprise to upload training samples and by the service party to return the suspected documents in the prediction result to the enterprise;
model training, used to perform model training with the training samples to generate a target model;
model prediction, used to predict Internet documents with the model to obtain a prediction result;
and prediction result returning, used to return the suspected documents in the prediction result to the enterprise.
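The document processing, data set partitioning, and model evaluation steps recited in the claims above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the numeric-suffix renaming scheme, the 20% verification fraction, and the fixed random seed are all hypothetical choices, and F1-score is shown as just one of the evaluation indices the claims allow.

```python
import random
import shutil
from pathlib import Path


def flatten_directory(src, dst):
    """Copy every file under src into the single-level directory dst.

    Colliding file names get a numeric suffix (an assumed renaming scheme)
    so that no document is overwritten.
    """
    src, dst = Path(src), Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    # The file list is built before any copying starts.
    for path in sorted(p for p in src.rglob("*") if p.is_file()):
        target = dst / path.name
        counter = 1
        while target.exists():
            target = dst / f"{path.stem}_{counter}{path.suffix}"
            counter += 1
        shutil.copy2(path, target)
        copied.append(target)
    return copied


def stratified_split(rows, val_fraction=0.2, seed=0):
    """Split (text, label) rows so each label keeps the same share in both sets."""
    rng = random.Random(seed)
    train, val = [], []
    for label in sorted({lbl for _, lbl in rows}):
        group = [row for row in rows if row[1] == label]
        rng.shuffle(group)
        n_val = round(len(group) * val_fraction)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val


def f1_score(y_true, y_pred):
    """F1-score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Training would be considered complete when the chosen index on the verification set exceeds the set threshold; otherwise the model or the samples would be revised, as the model evaluation step describes.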
CN202010537941.4A 2020-06-12 2020-06-12 Method for detecting cloud service by sensitive data and cloud service platform Active CN111666414B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010537941.4A CN111666414B (en) 2020-06-12 2020-06-12 Method for detecting cloud service by sensitive data and cloud service platform

Publications (2)

Publication Number Publication Date
CN111666414A (en) 2020-09-15
CN111666414B (en) 2023-10-17

Family

ID=72387440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010537941.4A Active CN111666414B (en) 2020-06-12 2020-06-12 Method for detecting cloud service by sensitive data and cloud service platform

Country Status (1)

Country Link
CN (1) CN111666414B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115203506A (en) * 2022-06-27 2022-10-18 海南电网有限责任公司信息通信分公司 Archive filing similarity calculation method based on multi-mode verification algorithm

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106383853A (en) * 2016-08-30 2017-02-08 刘勇 Realization method and system for electronic medical record post-structuring and auxiliary diagnosis
CN107239529A (en) * 2017-05-27 2017-10-10 中国矿业大学 A kind of public sentiment hot category classification method based on deep learning
CN109543084A (en) * 2018-11-09 2019-03-29 西安交通大学 A method of establishing the detection model of the hidden sensitive text of network-oriented social media
CN110263166A (en) * 2019-06-18 2019-09-20 北京海致星图科技有限公司 Public sentiment file classification method based on deep learning
CN110287334A (en) * 2019-06-13 2019-09-27 淮阴工学院 A kind of school's domain knowledge map construction method based on Entity recognition and attribute extraction model
CN110309306A (en) * 2019-06-19 2019-10-08 淮阴工学院 A kind of Document Modeling classification method based on WSD level memory network
CN110705293A (en) * 2019-08-23 2020-01-17 中国科学院苏州生物医学工程技术研究所 Electronic medical record text named entity recognition method based on pre-training language model
CN110826320A (en) * 2019-11-28 2020-02-21 上海观安信息技术股份有限公司 Sensitive data discovery method and system based on text recognition
CN111061868A (en) * 2019-11-05 2020-04-24 百度在线网络技术(北京)有限公司 Reading prediction model obtaining method, reading prediction device and storage medium
CN111144119A (en) * 2019-12-27 2020-05-12 北京联合大学 Entity identification method for improving knowledge migration
CN111178586A (en) * 2019-12-06 2020-05-19 浙江工业大学 Method for tracking, predicting and dredging public opinion events of network patriots

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
吕飞亚等: "注意力机制的BiLSTM模型在招聘信息分类中的应用", 《计算机系统应用》, no. 04, 15 April 2020 (2020-04-15) *
曹步清;肖巧翔;张祥平;刘建勋;: "融合SOM功能聚类与DeepFM质量预测的API服务推荐方法", 计算机学报, no. 06 *

Also Published As

Publication number Publication date
CN111666414B (en) 2023-10-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant