CN116775881A

CN116775881A - Data detection method and device and electronic equipment

Info

Publication number: CN116775881A
Application number: CN202310790613.9A
Authority: CN
Inventors: 龚双双
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Suzhou Software Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Suzhou Software Technology Co Ltd
Priority date: 2023-06-30
Filing date: 2023-06-30
Publication date: 2023-09-19

Abstract

The application discloses a data detection method, a data detection device and electronic equipment, relates to the technical field of data processing, and aims to solve the problem that existing sensitive information detection methods produce incomplete or inaccurate detection results. The method comprises the following steps: acquiring data to be detected; respectively extracting features of tag class data and text class data in the data to be detected to obtain a first tag class feature vector and a first text class feature vector; splicing the first tag class feature vector and the first text class feature vector to obtain a first feature vector; and determining whether the data to be detected is sensitive data according to the similarity between the first feature vector and a second feature vector, wherein the second feature vector is obtained in advance by respectively extracting features of tag class data and text class data in labeled sensitive data and then splicing the extracted features. The embodiment of the application can improve the accuracy and the comprehensiveness of sensitive data detection.

Description

Data detection method and device and electronic equipment

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a data detection method, a data detection device, and an electronic device.

Background

In the prior art, sensitive information is mainly identified by keyword detection. In the keyword detection method, a sensitive word stock is built to define keyword matching patterns for sensitive data; information such as table field names and annotations is then matched exactly or fuzzily against these patterns, and a field is judged to be sensitive data when it satisfies a keyword matching pattern.

However, the keyword detection method depends strongly on the sensitive information database, and keyword comparison merely judges whether a set keyword is "present" or "absent", so the classification is coarse and the detection result tends to be incomplete or inaccurate.

Disclosure of Invention

The embodiment of the application provides a data detection method, a data detection device and electronic equipment, which are used for solving the problem that the detection result is incomplete or inaccurate in the existing sensitive information detection method.

In a first aspect, an embodiment of the present application provides a data detection method, including:

acquiring data to be detected;

respectively extracting features of the tag class data and the text class data in the data to be detected to obtain a first tag class feature vector and a first text class feature vector;

splicing the first tag class feature vector and the first text class feature vector to obtain a first feature vector;

and determining whether the data to be detected is sensitive data according to the similarity between the first feature vector and a second feature vector, wherein the second feature vector is obtained in advance by respectively extracting features of tag class data and text class data in labeled sensitive data and then splicing the extracted features.

Optionally, the acquiring data to be detected includes:

sampling P similar data to obtain Q data to be detected, wherein P is an integer greater than 1, and Q is a positive integer less than P;

the determining whether the data to be detected is sensitive data according to the similarity between the first feature vector and the second feature vector comprises:

determining average similarity according to the similarity between the first feature vector and the second feature vector of each piece of data to be detected in the Q pieces of data to be detected;

and determining whether the P pieces of similar data are sensitive data according to the average similarity.
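
The sampling variant above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `detect_batch`, the decision threshold of 0.8, the fixed random seed and the precomputed similarity values are all assumptions made for the example.

```python
import random


def detect_batch(records, similarity_fn, q, threshold=0.8, seed=0):
    """Sample Q of the P records, average their similarities, decide for all P.

    `similarity_fn` maps one record to the similarity between its first
    feature vector and the second feature vector (assumed to exist already).
    """
    if not 0 < q < len(records):
        raise ValueError("Q must be a positive integer smaller than P")
    sampled = random.Random(seed).sample(records, q)
    average = sum(similarity_fn(record) for record in sampled) / q
    return average, average >= threshold


# Hypothetical precomputed similarities for P = 4 similar records.
precomputed = {"row1": 0.93, "row2": 0.88, "row3": 0.91, "row4": 0.86}
average_similarity, is_sensitive = detect_batch(
    list(precomputed), precomputed.get, q=2
)
```

Because only Q of the P records go through feature extraction, the cost of the decision scales with Q rather than P, which is the efficiency argument made for structured data whose fields hold similar information.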

Optionally, after the obtaining the data to be detected, before the extracting the features of the tag class data and the text class data in the data to be detected, the method further includes:

determining characteristic class data in the data to be detected according to the data types of the fields in the data to be detected;

and dividing the feature class data into tag class data and text class data according to at least one of the length, the type and the composition structure of each field in the feature class data.

Optionally, the feature extraction is performed on the tag class data and the text class data in the data to be detected to obtain a first tag class feature vector and a first text class feature vector, which includes:

embedding and encoding the tag class data in the data to be detected by using a Word vector model Word2Vec to obtain a first tag class feature vector;

and extracting features of the text class data in the data to be detected by using a text convolutional neural network (Text Convolutional Neural Network, TextCNN) to obtain a first text class feature vector.

Optionally, the determining whether the data to be detected is sensitive data according to the similarity between the first feature vector and the second feature vector includes:

respectively calculating the similarity between the first feature vector and each second feature vector in a plurality of second feature vectors, wherein the plurality of second feature vectors respectively correspond to labeling sensitive data of different sensitive levels;

and determining whether the data to be detected is sensitive data or not according to the sensitive level of the marked sensitive data corresponding to the second feature vector with the highest similarity to the first feature vector, and determining the sensitive level of the data to be detected.

Optionally, the method further comprises:

acquiring labeling sensitive data of different sensitive levels;

respectively dividing the labeled sensitive data of different sensitive levels into tag class data and text class data;

respectively extracting the characteristics of the tag class data and the text class data in the labeling sensitive data of different sensitive levels to obtain a second tag class characteristic vector and a second text class characteristic vector of the labeling sensitive data of different sensitive levels;

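and splicing the second tag class feature vector and the second text class feature vector of the labeled sensitive data of each sensitive level to obtain the second feature vector of the labeled sensitive data of the corresponding sensitive level.

Optionally, the respectively extracting features of the tag class data and the text class data in the data to be detected to obtain a first tag class feature vector and a first text class feature vector includes: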

respectively extracting features of n tag data in the data to be detected to obtain n first tag feature vectors, wherein n is a positive integer;

respectively extracting features of m text data in the data to be detected to obtain m first text feature vectors, wherein m is a positive integer;

fully connecting the n first tag class feature vectors to obtain the first tag class feature vector;

and fully connecting the m first text class feature vectors to obtain the first text class feature vector.

In a second aspect, an embodiment of the present application further provides a data detection apparatus, including:

the first acquisition module is used for acquiring data to be detected;

the first feature extraction module is used for respectively carrying out feature extraction on the tag data and the text data in the data to be detected to obtain a first tag feature vector and a first text feature vector;

the first splicing module is used for splicing the first tag type feature vector and the first text type feature vector to obtain a first feature vector;

the first determining module is used for determining whether the data to be detected is sensitive data according to the similarity between the first feature vector and a second feature vector, wherein the second feature vector is obtained in advance by respectively performing feature extraction on tag class data and text class data in labeled sensitive data and then splicing the extracted features.

Optionally, the first obtaining module is configured to sample P pieces of similar data to obtain Q pieces of data to be detected, where P is an integer greater than 1, and Q is a positive integer less than P;

the first determining module includes:

the first determining unit is used for determining average similarity according to the similarity between the first feature vector and the second feature vector of each piece of data to be detected in the Q pieces of data to be detected;

and the second determining unit is used for determining whether the P pieces of similar data are sensitive data according to the average similarity.

Optionally, the data detection device further includes:

the second determining module is used for determining characteristic class data in the data to be detected according to the data types of the fields in the data to be detected;

and the first classification module is used for dividing the feature class data into tag class data and text class data according to at least one of the length, the type and the composition structure of each field in the feature class data.

Optionally, the first feature extraction module includes:

a word vector model Word2Vec, used for embedding and encoding the tag class data in the data to be detected to obtain a first tag class feature vector;

and a text convolutional network TextCNN, used for extracting features of the text class data in the data to be detected to obtain a first text class feature vector.

Optionally, the first determining module includes:

the computing unit is used for computing the similarity between the first feature vector and each of a plurality of second feature vectors, and the plurality of second feature vectors respectively correspond to the labeling sensitive data of different sensitive levels;

and the third determining unit is used for determining whether the data to be detected is sensitive data or not according to the sensitive level of the marked sensitive data corresponding to the second feature vector with the highest similarity to the first feature vector, and determining the sensitive level of the data to be detected.

Optionally, the data detection device further includes:

the second acquisition module is used for acquiring the labeling sensitive data of different sensitive levels;

the second classification module is used for dividing the labeling sensitive data with different sensitive levels into label data and text data respectively;

the second feature extraction module is used for extracting features of tag class data and text class data in the labeling sensitive data of different sensitive levels respectively to obtain second tag class feature vectors and second text class feature vectors of the labeling sensitive data of different sensitive levels;

and the second splicing module is used for splicing the second tag type feature vector and the second text type feature vector of the marked sensitive data of each sensitive level to obtain the second feature vector of the marked sensitive data of the corresponding sensitive level.

Optionally, the first feature extraction module includes:

the first feature extraction unit is used for respectively carrying out feature extraction on n tag data in the data to be detected to obtain n first tag feature vectors, wherein n is a positive integer;

the second feature extraction unit is used for respectively carrying out feature extraction on m text data in the data to be detected to obtain m first text feature vectors, wherein m is a positive integer;

the first full-connection unit is used for full-connecting the n first tag feature vectors to obtain the first tag feature vectors;

and the second full-connection unit is used for full-connecting the m first text feature vectors to obtain the first text feature vectors.

In a third aspect, an embodiment of the present application further provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, which processor implements the steps in the data detection method as described above when executing the computer program.

In a fourth aspect, embodiments of the present application also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the data detection method as described above.

In the embodiment of the application, the data to be detected is obtained; respectively extracting the characteristics of the tag data and the text data in the data to be detected to obtain a first tag characteristic vector and a first text characteristic vector; splicing the first tag class feature vector and the first text class feature vector to obtain a first feature vector; and determining whether the data to be detected is sensitive data according to the similarity of the first feature vector and the second feature vector, wherein the second feature vector is obtained by respectively extracting features of label data and text data in marked sensitive data in advance and then splicing the extracted features. Therefore, the data features are converted into the feature vector form which is convenient for the machine learning algorithm to calculate, and the information entropy of the feature data is increased, so that whether the data to be detected is the sensitive data or not is determined by comparing the similarity of the feature vectors of the data to be detected and the marked sensitive data, and the accuracy and the comprehensiveness of detecting the sensitive data can be improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.

FIG. 1 is a flowchart of a data detection method according to an embodiment of the present application;

FIG. 2 is a second flowchart of a data detection method according to an embodiment of the present application;

FIG. 3 is a schematic diagram of an implementation of the TextCNN network provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of an implementation of a full connection layer provided by an embodiment of the present application;

FIG. 5 is a schematic diagram of the structure and implementation of a sensitive data detection model provided by an embodiment of the present application;

FIG. 6 is a block diagram of a data detection device according to an embodiment of the present application;

FIG. 7 is a block diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

To make the embodiments of the present application clearer, the related technical background is described first:

With the rise of the digital economy, in application scenarios where enterprises or organizations use data resources to participate in collaboration, data flows have broken through system boundaries, producing cross-system access and multi-party data aggregation for joint computation. Leakage of sensitive information causes serious economic and brand losses to enterprises. To better realize the value of data assets, business operation and use should be as convenient as possible, while sensitive information must be protected and must not be accessible at will. Therefore, how to identify personal information, business secrets or unique data resources during collaboration and ensure their confidentiality is a problem that must be solved for data to be interconnected securely and in an orderly manner.

The main methods for identifying sensitive information are keyword detection and regular expression matching. In keyword detection, a sensitive word stock is built and keyword matching patterns for sensitive data are defined; metadata information is then used to match database tables and files field by field, exactly or fuzzily, against table field names, comments and similar information, and when a field satisfies a keyword matching pattern it is judged to be sensitive data and is graded automatically. Alternatively, a common sensitive word stock is compiled from the data content, the similarity between the segmented content and the sensitive words is calculated by a Chinese near-synonym comparison algorithm, and if the similarity exceeds a certain threshold, the content is considered to fall under the classification of that sensitive word.

Regular expression matching scans all content in the database and matches the large amount of numeric and English sensitive information in the system, such as mobile phone numbers, identity card numbers and mailbox addresses, against predefined regular expressions, thereby determining the sensitive data and its level.
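
As a rough illustration of the regular expression approach described above, the sketch below scans a value against a few predefined patterns. The patterns are simplified assumptions written for this example (an 11-digit mainland mobile number, an 18-character identity card number, a generic e-mail address); they are not taken from the application and are too loose for production use.

```python
import re

# Illustrative patterns only; real rule sets are stricter and locale-aware.
PATTERNS = {
    "mobile_phone": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    "id_card": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
}


def scan(value):
    """Return the names of all patterns that match somewhere in `value`."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(value)]


hits = scan("contact: 13800138000, alice@example.com")
```

Each new data format needs its own hand-written rule, which is the maintenance burden the application sets out to avoid.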

However, the keyword detection method depends strongly on the sensitive information database. If the submitted keywords are limited in number or unrepresentative, or the sensitive information database is not fully established, the detection result tends to be incomplete or inaccurate. Moreover, keyword comparison merely judges whether a set keyword is "present" or "absent", so the classification is coarse and the classification judgment is not accurate enough. For example, content containing the keyword "contract" is not necessarily a legal contract, while content containing "agreement" but not "contract" may well be a legal contract.
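
The "contract" versus "agreement" problem can be reproduced with a few lines. The word stock, field names and comments below are hypothetical and exist only to show one false positive and one false negative of exact keyword matching.

```python
SENSITIVE_KEYWORDS = {"contract"}  # hypothetical sensitive word stock


def keyword_detect(field_name, comment=""):
    """Exact keyword matching on metadata: True if any keyword appears."""
    text = f"{field_name} {comment}".lower()
    return any(keyword in text for keyword in SENSITIVE_KEYWORDS)


# False positive: the keyword is present, but the field is not a legal contract.
flag_a = keyword_detect("contract_template_font", "font used when printing")
# False negative: a legal agreement whose metadata never says "contract".
flag_b = keyword_detect("service_agreement_text", "signed legal agreement")
```

Here `flag_a` is `True` and `flag_b` is `False`, the opposite of what a reviewer would want in both cases.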

Regular expression matching relies on manually written matching rules for sensitive fields and checks line by line whether the target data in a target object contains sensitive data. This method readily matches data that follows a regular pattern, but sensitive data comes in many types, including pure numbers, pure Chinese, mixed Chinese and English, and mixed Chinese, English and numbers, and the characteristic parameters of different types of data differ greatly. With regular expression matching, sensitive data can be identified for only about 50% of customers across the whole network, so sensitive data is difficult to protect well.

To overcome the defects of the above methods, the application provides a sensitive data detection method comprising the following steps: classifying the data to be detected in two layers according to its data attribute characteristics, so that the content of each field is assigned to a tag class or a text class; then respectively extracting tag class data features and text class data features, and performing a full connection operation on the two obtained features to obtain a unified feature representation that characterizes the data to be detected; meanwhile, performing the same data classification, feature extraction and full connection operations on the labeled data of each sensitivity level at the bottom layer of the data warehouse; and calculating the similarity between the feature vector obtained for the data to be detected and the feature vector of the labeled sensitive data of each level, and judging the suspected sensitivity level of the data to be detected according to the final similarity. The application can identify sensitive data intelligently, avoids the time and labor cost and poor portability of manually building sensitive information bases and matching rules, and effectively improves the accuracy of sensitive information identification.

The data detection method provided by the embodiment of the application is described in detail below through specific embodiments and application scenes thereof with reference to the accompanying drawings.

Referring to fig. 1, fig. 1 is a flowchart of a data detection method according to an embodiment of the present application, as shown in fig. 1, including the following steps:

Step 101, acquiring data to be detected.

The data to be detected may be any data for which sensitive data or a sensitivity level needs to be identified, for example, customer data, customer communication data, communication content data, and the like.

Acquiring the data to be detected may mean acquiring, from a specific database, data on which sensitive data detection is to be performed. In some embodiments, sensitive data detection in practice has to cover the full data set, so the volume of data to be detected is huge; the data is large-scale, and each piece of data is usually structured data, in which the content of a given field is essentially the same kind of information across records. Therefore, one piece of data may be sampled from the full data set at a time as the data to be detected, or several pieces of data may be sampled at a time as the data to be detected, which improves computing efficiency.

Step 102, respectively extracting features of the tag class data and the text class data in the data to be detected to obtain a first tag class feature vector and a first text class feature vector.

After the data to be detected is obtained, it can be divided into tag class data and text class data according to field type. Tag class data generally has a simple structure and a fixed rule and format, and carries the most critical information items, such as gender, age, occupation, mobile phone number, identity card number, mailbox address and credit level. Text class data is usually descriptive short text of varying length and without a fixed rule, and contains a large number of text features, such as work unit, residence address, government and enterprise customer name, value-added service information description and communication content.

Then, feature extraction can be performed on the tag data and the text data in the data to be detected, for example, feature extraction can be performed on the tag data in the data to be detected by using a feature extraction network capable of well extracting tag features to obtain corresponding first tag feature vectors, and feature extraction can be performed on the text data in the data to be detected by using a feature extraction network capable of well extracting text features to obtain corresponding first text feature vectors. Through feature extraction, the tag class data features and the text class data features in the data to be detected can be well characterized.

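Optionally, after the step 101 and before the step 102, the method further includes:

determining feature class data in the data to be detected according to the data types of the fields in the data to be detected;

and dividing the feature class data into tag class data and text class data according to at least one of the length, the type and the composition structure of each field in the feature class data.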

In one embodiment, the tag class data and the text class data in the data to be detected can be obtained by performing two-layer classification on the data to be detected.

The purpose of classifying the data to be detected is to divide it into different data feature types according to the characteristics of the data, so that key information in the data can be extracted better. In practice, the data to be detected is usually structured data, which can be divided into dimension class data and index class data. Dimension class data, also called feature class data, describes a certain characteristic of a thing or phenomenon and is usually character type, text type or floating point type data, such as region, time, product, user and occupation type. Index class data is the result of statistics on certain data and is usually a numerical value or proportion, such as the number of times a product is used, the number of users, the number of occupation types, a year-over-year ratio or a month-over-month ratio.

In practice, sensitive information is usually concentrated in the feature class data, so the index class data can be removed during sensitive information detection and only the feature class data retained, which reduces the noise introduced by the index class data. To further increase the information entropy of the data features, the feature class data can be further divided into tag class data and text class data according to content length, data type, composition structure and similar characteristics. Tag class data generally has a simple structure and a fixed rule and format, and carries the most critical information items, such as gender, age, occupation, mobile phone number, identity card number, mailbox address and credit level. Text class data is usually descriptive short text of varying length and without a fixed rule, and contains a large number of text features, such as work unit, residence address, government and enterprise customer name, value-added service information description and communication content.

In this embodiment, the main classification steps of the data to be detected are as follows:

acquiring the data types of all fields in the data to be detected; based on the data type characteristics of each field, dividing the data to be detected into feature class data and index class data by a classification algorithm such as, but not limited to, logistic regression, wherein the feature class data is retained for further classification;

acquiring the data length, the data type and the composition of English letters and digits of each field in the feature class data; based on the obtained data length, data type and composition of English letters and digits, dividing the feature class data into tag class data and text class data by a classification algorithm such as, but not limited to, logistic regression.

The two-layer classification results are as follows:

$$y_1 \in \{1, -1\}$$

where 1 represents the feature class and -1 represents the index class.

$$y_2 \in \{1, -1\}$$

where 1 represents the tag class and -1 represents the text class.

Therefore, the data to be detected is classified into different data characteristic types twice, so that noise influence caused by index data can be reduced, key information in the data can be conveniently and better extracted later, and information entropy of the data characteristic is further improved.
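
The two-layer classification described above might look like the following sketch. It uses a small hand-written logistic regression so that it runs without third-party packages. The feature encodings (`type_features`, `content_features`), the declared type names, the training examples and the hyperparameters are assumptions chosen for illustration; the application only states that length, type and composition are used and that logistic regression is one possible classifier.

```python
import math


def train_logistic(samples, labels, epochs=2000, lr=0.2):
    """Fit logistic regression by stochastic gradient descent; labels are +1/-1."""
    dim = len(samples[0])
    weights = [0.0] * (dim + 1)  # last entry is the bias
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + weights[-1]
            grad = -y / (1.0 + math.exp(y * z))  # d/dz of log(1 + exp(-y*z))
            for i in range(dim):
                weights[i] -= lr * grad * x[i]
            weights[-1] -= lr * grad
    return weights


def predict(weights, x):
    z = sum(w * xi for w, xi in zip(weights, x)) + weights[-1]
    return 1 if z >= 0 else -1


def type_features(declared_type):
    """Layer 1 input (hypothetical): 1.0 for character-like declared types."""
    return [1.0 if declared_type in {"char", "varchar", "text"} else 0.0]


def content_features(value):
    """Layer 2 input (hypothetical): capped length and share of digits."""
    length = min(len(value), 50) / 50.0
    digits = sum(ch.isdigit() for ch in value) / max(len(value), 1)
    return [length, digits]


# Layer 1: y1 = 1 is the feature class, y1 = -1 is the index class.
layer1 = train_logistic(
    [type_features(t) for t in ["varchar", "text", "int", "bigint"]],
    [1, 1, -1, -1],
)

# Layer 2: y2 = 1 is the tag class, y2 = -1 is the text class.
tag_examples = ["13800138000", "M", "alice@example.com", "35"]
text_examples = [
    "No. 18 Example Road, Suzhou Industrial Park",
    "Customer reported intermittent network drops",
]
layer2 = train_logistic(
    [content_features(v) for v in tag_examples + text_examples],
    [1] * len(tag_examples) + [-1] * len(text_examples),
)


def classify(declared_type, value):
    """Drop index class fields, then split feature class fields into tag/text."""
    if predict(layer1, type_features(declared_type)) == -1:
        return "index"
    return "tag" if predict(layer2, content_features(value)) == 1 else "text"
```

For example, `classify("int", "42")` drops the field as index class data, while a short phone-number-like string in a `varchar` field goes to the tag class and a long address goes to the text class.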

Optionally, the step 102 includes:

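embedding and encoding the tag class data in the data to be detected by using a word vector model Word2Vec to obtain a first tag class feature vector;

and extracting features of the text class data in the data to be detected by using a text convolutional network TextCNN to obtain a first text class feature vector.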

In one embodiment, considering that the feature class data takes the form of tag class data and text class data, Word2Vec and TextCNN are used respectively to extract tag class data features and text class data features in a form suited to neural network algorithms.

Embedding coding is a low-dimensional vector representation of words. This vector form is easy for neural network algorithms to use, allows closeness in meaning to be measured by the similarity between word vectors, and has strong representational power. Because tag class data consists of key information items, the Embedding code of each tag class feature can be obtained through a Word2Vec model and used as a classification feature for judging whether the data to be detected is sensitive data.

The text data contains a large number of text features, and key information in the text features can be extracted based on the processing of the text data by the textCNN network, so that local relevance of the text is better captured, and feature information gain is realized.

The implementation principle of the TextCNN network is shown in FIG. 3: the embedded vector of each word of the text class data is obtained from the embedding matrix, convolution kernels of different sizes are applied to the text embedding layer followed by max pooling, and the text class data features are then obtained through random inactivation (Dropout) and the full connection layer.

The implementation of the full connection layer is shown in FIG. 4: each node of the full connection layer is connected to all nodes of the previous layer, so as to integrate the extracted features and convert them into a one-dimensional vector.

In this way, Word2Vec and TextCNN can be used to extract the key tag features and key text features in the data to be detected, which helps to determine whether the data to be detected is sensitive data according to the extracted tag features and text features.
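
The TextCNN forward pass described above (embedding lookup, convolution with kernels of several widths, max pooling, full connection) can be sketched in plain Python. The sketch is inference-only with random, untrained weights; the toy sizes, the ReLU activation, the use of one filter per kernel width and token id 0 as padding are all assumptions for illustration. Dropout is omitted because it is the identity at inference time.

```python
import random


def text_cnn_features(token_ids, embedding, kernels, fc_weights):
    """Embedding lookup -> convolutions -> max-over-time pooling -> full connection."""
    max_width = max(len(kernel) for kernel in kernels)
    # Pad short inputs with token id 0 so every kernel fits at least once.
    ids = list(token_ids) + [0] * max(0, max_width - len(token_ids))
    embedded = [embedding[t] for t in ids]  # seq_len x emb_dim

    pooled = []
    for kernel in kernels:  # each kernel is width x emb_dim
        width = len(kernel)
        responses = []
        for start in range(len(embedded) - width + 1):
            window = embedded[start:start + width]
            total = sum(
                k * e
                for k_row, e_row in zip(kernel, window)
                for k, e in zip(k_row, e_row)
            )
            responses.append(max(total, 0.0))  # ReLU
        pooled.append(max(responses))  # max pooling over positions

    # Full connection: every output node sees every pooled feature.
    return [sum(w * p for w, p in zip(row, pooled)) for row in fc_weights]


rng = random.Random(0)
VOCAB, EMB_DIM, OUT_DIM = 10, 4, 8  # toy sizes; the text uses 128-dim text features
embedding = [[rng.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in range(VOCAB)]
kernels = [
    [[rng.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in range(width)]
    for width in (2, 3, 4)
]
fc_weights = [[rng.uniform(-1, 1) for _ in range(len(kernels))] for _ in range(OUT_DIM)]

features = text_cnn_features([1, 5, 2, 7, 3], embedding, kernels, fc_weights)
```

Kernels of different widths respond to word groups of different lengths, which is how the network captures the local relevance of the text.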

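Optionally, the step 102 includes:

respectively extracting features of n tag class data in the data to be detected to obtain n first tag class feature vectors, wherein n is a positive integer;

respectively extracting features of m text class data in the data to be detected to obtain m first text class feature vectors, wherein m is a positive integer;

fully connecting the n first tag class feature vectors to obtain the first tag class feature vector;

and fully connecting the m first text class feature vectors to obtain the first text class feature vector.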

In one embodiment, after the data to be detected is divided into tag class data and text class data, assume there are n pieces of tag class data and m pieces of text class data. Feature extraction may be performed on each of the n pieces of tag class data; for example, each piece is passed through Word2Vec in turn to obtain the corresponding tag feature Embedding code, with the dimension of each tag feature set to 64, so that n 64-dimensional first tag class feature vectors are obtained. The n 64-dimensional vectors are then fully connected, with feature values of the same dimension added together, to obtain one 64-dimensional first tag class feature vector. Likewise, feature extraction may be performed on each of the m pieces of text class data; for example, the corresponding text feature Embedding code is obtained through TextCNN, with the dimension of each text feature set to 128, so that m 128-dimensional first text class feature vectors are obtained. The m 128-dimensional vectors are then fully connected, with feature values of the same dimension added together, to obtain one 128-dimensional first text class feature vector.

Thus, according to this embodiment, feature extraction on the tag class data in the data to be detected generates a tag class feature vector representing the global tag features of the data to be detected, and feature extraction on the text class data generates a text class feature vector representing its global text features.
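
The merging step, in which feature values of the same dimension are added across the n (or m) per-field vectors, reduces to an element-wise sum. The sketch below assumes the per-field vectors already exist; the constant values stand in for real Word2Vec outputs, and `merge_by_sum` is a name introduced here.

```python
def merge_by_sum(vectors):
    """Add feature values of the same dimension across all vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must share one dimension")
    return [sum(column) for column in zip(*vectors)]


# n = 3 hypothetical 64-dimensional tag class feature vectors.
tag_vectors = [[0.1] * 64, [0.2] * 64, [0.3] * 64]
first_tag_vector = merge_by_sum(tag_vectors)
```

The result keeps the per-field dimension (64 for tag class data, 128 for text class data) regardless of how many fields were merged.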

Step 103, splicing the first tag class feature vector and the first text class feature vector to obtain a first feature vector.

After the first tag class feature vector and the first text class feature vector of the data to be detected are extracted, they can be spliced; for example, the two vectors are processed through a full connection layer to obtain a first feature vector representing both the tag features and the text features of the data to be detected. Since the tag class feature vector and the text class feature vector have different dimensions, a 64-dimensional first tag class feature vector and a 128-dimensional first text class feature vector can, after full connection, yield a 200-dimensional first feature vector.
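
One way to read the dimensions above: plain concatenation of a 64-dimensional and a 128-dimensional vector gives 192 values, so reaching 200 dimensions implies that the full connection layer also projects the spliced vector. The sketch below follows that reading, which is an interpretation rather than something the text states outright; the random weights and the zero bias are placeholders for trained parameters.

```python
import random


def splice(tag_vector, text_vector):
    """Concatenate the two feature vectors end to end."""
    return list(tag_vector) + list(text_vector)


def fully_connected(vector, weights, bias):
    """Each output node is a weighted sum over every input value plus a bias."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


rng = random.Random(0)
tag_vector = [rng.random() for _ in range(64)]    # stands in for Word2Vec output
text_vector = [rng.random() for _ in range(128)]  # stands in for TextCNN output

spliced = splice(tag_vector, text_vector)  # 64 + 128 = 192 values
weights = [[rng.uniform(-0.1, 0.1) for _ in range(len(spliced))] for _ in range(200)]
bias = [0.0] * 200
first_feature_vector = fully_connected(spliced, weights, bias)  # 200 values
```

The same splice and projection would have to be applied to the labeled sensitive data so that the first and second feature vectors are comparable.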

Step 104, determining whether the data to be detected is sensitive data according to the similarity between the first feature vector and a second feature vector, wherein the second feature vector is obtained in advance by respectively performing feature extraction on the tag class data and the text class data in labeled sensitive data and then splicing the extracted features.

In the embodiment of the application, to determine whether the data to be detected is sensitive data, pre-labeled sensitive data can be used. Specifically, labeled sensitive data of the same structure type as the data to be detected can be obtained, and its feature vector can be extracted in the same or a similar manner. The feature vector of the data to be detected is then compared for similarity with the feature vector of the labeled sensitive data: if the similarity is high, the data to be detected is determined to be sensitive data; otherwise, it is determined to be non-sensitive data. In some embodiments, the sensitivity level of the labeled sensitive data may also be marked, so that when the feature vector of the data to be detected is highly similar to the feature vector of the labeled sensitive data, the sensitivity level of the data to be detected can be determined to be that of the labeled sensitive data. For example, when the data to be detected is highly similar to labeled sensitive data with a high sensitivity level, the data to be detected may be determined to be data with a high sensitivity level.

Optionally, the step 104 includes:

In one embodiment, the data to be detected may be classified according to the sensitivity level, i.e. the sensitivity level of the data to be detected is identified, so that the privacy of the data to be detected is protected to a corresponding extent according to the sensitivity level classification result.

In this embodiment, a plurality of pieces of sensitive data labeled with different sensitivity levels may be obtained in advance, and the feature vectors of the labeled sensitive data of the different sensitivity levels may be extracted in the same feature extraction manner as for the data to be detected, so as to obtain a plurality of second feature vectors corresponding to the different sensitivity levels. The first feature vector of the data to be detected is then compared for similarity with each of the second feature vectors, the second feature vector with the highest similarity to the first feature vector is determined, and whether the data to be detected is sensitive data, and the sensitivity level to which it belongs, is determined according to the sensitivity level corresponding to that second feature vector.

For example, from the standpoint of data security, privacy protection, and compliance, the degree of sensitivity of information can generally be divided into 4 levels, including a very sensitive level, a more sensitive level, and a less sensitive level. When data is introduced, data security management personnel can classify and mark the sensitivity of the data assets: the enterprise data classification and grading protection strategy divides the data into different sensitivity categories and sensitivity levels, and the data is associated with the sensitivity classification and grading strategy to support subsequent automatic sensitivity grading. The grading criteria and content can be as shown in Tables 1 and 2 below.

TABLE 1 sensitivity level criteria

Table 2 data classification hierarchy content

In this embodiment, sensitive data detection and automatic sensitivity grading are realized mainly by calculating the similarity between the feature vector of the data to be detected and the feature vector of the labeled sensitive data of each sensitivity level. The structure of the sensitive data identification and grading model used can be as shown in fig. 4, and the main steps are as follows:

1) Extracting the tag field features and text field features of the data to be detected, and generating a feature vector $A = [a_1, a_2, a_3, \ldots, a_{200}]$ after full connection;

2) Extracting the tag field features and text field features of the data of each sensitivity level, and generating a feature vector $B_k = [b_{k1}, b_{k2}, b_{k3}, \ldots, b_{k200}]$, $k \in \{1, 2, 3, 4\}$, for each sensitivity level after full connection;

3) Calculating the cosine similarity between the feature vector $A$ of the data to be detected and the feature vector $B_k$ of each sensitivity level. The larger the cosine similarity value, the more similar the two feature vectors, so the sensitivity level of the data to be detected can be judged according to the cosine similarity. The cosine similarity calculation formula is as follows:

$$\cos(\theta_k) = \frac{A \cdot B_k}{\lVert A \rVert \, \lVert B_k \rVert} = \frac{\sum_{i=1}^{200} a_i b_{ki}}{\sqrt{\sum_{i=1}^{200} a_i^2}\,\sqrt{\sum_{i=1}^{200} b_{ki}^2}}$$

4) According to the calculated similarity values between the data to be detected and the data of the four sensitivity levels, and based on the property that a higher similarity between vectors indicates a higher similarity between the data, determining the suspected sensitivity level of the data to be detected according to the final $\max\{\cos(\theta_k)\}$.
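
Steps 1) to 4) above can be sketched in a few lines once the vectors are available. The sketch assumes the feature vectors have already been produced; it uses toy 3-dimensional vectors instead of 200-dimensional ones, and the function names are hypothetical. The mapping from index k to a named sensitivity level is not given in the text, so the levels are kept as plain integers.

```python
import math


def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (|A| * |B|); returns 0.0 for a zero vector."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat an all-zero vector as dissimilar
    return dot / (norm_a * norm_b)


def classify_level(a, level_vectors):
    """Return (k, cos(theta_k)) for the level whose vector B_k is most
    similar to A, i.e. the level that attains max{cos(theta_k)}."""
    sims = {k: cosine_similarity(a, b_k) for k, b_k in level_vectors.items()}
    best = max(sims, key=sims.get)
    return best, sims[best]


# Toy example with k = 1..4 (3 dimensions for readability)
A = [1.0, 2.0, 0.0]
B = {
    1: [0.0, 0.0, 1.0],
    2: [2.0, 4.0, 0.0],
    3: [1.0, 0.0, 0.0],
    4: [-1.0, -2.0, 0.0],
}
level, score = classify_level(A, B)
```

In this toy case the vector for level 2 points in the same direction as `A`, so level 2 is selected with a similarity of approximately 1. Cosine similarity ignores vector length, which is why `[2.0, 4.0, 0.0]` matches `[1.0, 2.0, 0.0]` exactly.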

In this way, through this embodiment, the data to be detected can be automatically graded by sensitivity, so that privacy protection of a corresponding degree can subsequently be applied to the data to be detected according to its sensitivity grade.

Optionally, the method further comprises:

acquiring labeling sensitive data of different sensitive levels;

respectively extracting the characteristics of the tag class data and the text class data in the labeling sensitive data of different sensitive levels to obtain a second tag class characteristic vector and a second text class characteristic vector of the labeling sensitive data of different sensitive levels:

In other words, in one embodiment, to obtain the second feature vectors of the labeled sensitive data of different sensitivity levels, so that they can be compared with the first feature vector of the data to be detected to identify its sensitivity level, one piece of labeled sensitive data can be extracted for each sensitivity level. Data type division and feature vector extraction are then performed on the labeled sensitive data of each sensitivity level in the same feature extraction manner as for the data to be detected, so as to obtain the second feature vector corresponding to each sensitivity level.

Optionally, the step 101 includes:

the step 104 includes:

In one embodiment, the volume of data to be checked for sensitive information in practical applications is large, and computing over the entire data set is very time-consuming and resource-intensive. Therefore, to save system resources and reduce the amount of calculation, and based on the fact that the content of each field in structured data is basically the same kind of information, a plurality of pieces of data can be sampled from the entire data set as the data to be detected, and the sensitivity detection result of the entire data set is determined according to the sensitivity detection result of the sampled data, thereby improving computational efficiency.

Specifically, P pieces of similar data that need sensitivity detection may be sampled to obtain Q pieces of data to be detected. For example, Q pieces of data may be randomly extracted from the P pieces of similar data in one pass, or multiple rounds of sampling may be performed on the P pieces of similar data, with one piece extracted in each round as data to be detected, for a total of Q rounds.
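
Both sampling variants can be sketched with the Python standard library. The function names are hypothetical, and the sketch assumes that multi-round sampling draws without replacement, which the text does not state. The constraint that Q is a positive integer less than P follows the device description.

```python
import random


def sample_once(records, q, seed=None):
    """Draw Q distinct records from the P records in a single pass."""
    if not 0 < q < len(records):
        raise ValueError("Q must satisfy 0 < Q < P")
    return random.Random(seed).sample(records, q)


def sample_in_rounds(records, q, seed=None):
    """Draw one record per round for Q rounds.

    Assumption: a record picked in one round is not eligible again."""
    if not 0 < q < len(records):
        raise ValueError("Q must satisfy 0 < Q < P")
    rng = random.Random(seed)
    remaining = list(records)
    picked = []
    for _ in range(q):
        picked.append(remaining.pop(rng.randrange(len(remaining))))
    return picked


# P = 1000 illustrative records, Q = 20 samples
records = [f"row-{i}" for i in range(1000)]
batch_a = sample_once(records, 20, seed=42)
batch_b = sample_in_rounds(records, 20, seed=42)
```

Passing a fixed `seed` makes a run reproducible, which can help when a detection result has to be audited later.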

In this way, when calculating the similarity with the labeled sensitive data, the similarity between the first feature vector of each piece of data to be detected and the second feature vector of the labeled sensitive data can be calculated respectively, and the results averaged to obtain an average similarity. Whether the Q pieces of data to be detected are sensitive data is determined according to the average similarity, and this result can be used as the sensitivity detection result of the P pieces of similar data; for example, if the average similarity is greater than a preset threshold, the P pieces of similar data are determined to be sensitive data.

When the P pieces of similar data need to be graded by sensitivity, the similarity between the first feature vector of each piece of data to be detected and the second feature vector of each sensitivity level can be calculated respectively. For each sensitivity level, the similarities of the Q pieces of data to be detected are then averaged to obtain the average similarity for that level, and the sensitivity level of the P pieces of similar data is determined to be the level corresponding to the maximum average similarity.
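
The averaging logic for a sampled batch can be sketched as below, taking already-computed similarity values as input. The function names, the similarity values, and the threshold of 0.75 are illustrative assumptions; the text specifies only that a preset threshold is used and that the level with the maximum average similarity is chosen.

```python
from statistics import mean


def is_batch_sensitive(sample_similarities, threshold):
    """Binary decision: average similarity of the Q samples vs. a threshold."""
    return mean(sample_similarities) > threshold


def average_by_level(similarities):
    """similarities[i][k] is the similarity of sample i to level k.
    Returns the average similarity for each level k."""
    if not similarities:
        raise ValueError("at least one sample is required")
    levels = list(similarities[0].keys())
    return {k: mean(row[k] for row in similarities) for k in levels}


def grade_batch(similarities):
    """Return (level, average similarity) for the best-matching level."""
    averages = average_by_level(similarities)
    best = max(averages, key=averages.get)
    return best, averages[best]


# Q = 3 samples compared against levels k = 1..4 (made-up values)
sims = [
    {1: 0.2, 2: 0.9, 3: 0.4, 4: 0.1},
    {1: 0.3, 2: 0.7, 3: 0.6, 4: 0.1},
    {1: 0.1, 2: 0.8, 3: 0.5, 4: 0.1},
]
level, avg = grade_batch(sims)
flag = is_batch_sensitive([row[level] for row in sims], threshold=0.75)
```

Averaging before taking the maximum means that a single outlying sample cannot decide the level of the whole batch on its own.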

Thus, according to the embodiment, system resources can be saved, the calculated amount can be reduced, and the data calculation efficiency can be improved.

In combination with the above embodiment, a flowchart of an implementation of a sensitive data detection method may be shown in fig. 5.

The embodiment of the application has the following advantages. In the sensitive information grading task, feature data from which the data sensitivity level can be judged is extracted by means of double-layer feature data classification and multi-round sampling, which removes the influence of noisy classification features, reduces the occupation and consumption of computing resources, and improves the efficiency of data feature extraction. Based on the classified tag class data and text class data, vectorized features of the two types of data are extracted with Word2Vec and textCNN respectively and then fully connected to generate a feature matrix representation of the sensitive data; this converts the data features into a numerical form that is convenient for machine learning algorithms to compute, increases the information entropy of the feature data, realizes automatic extraction of data features, and improves the accuracy of sensitive data grading. The similarity between the feature vector of the data to be detected and the feature vector of each sensitivity level is calculated with the cosine similarity algorithm, and the suspected sensitivity level is judged according to the final cosine values, so that the classification and grading of sensitive data is realized automatically and the time and labor cost of manually formulating a sensitive information base and matching rules is reduced.
Therefore, the embodiment of the application offers a high degree of intelligence and high accuracy in automatically identifying and grading sensitive data. When applied to the data security management of enterprises or organizations, it can effectively reduce the tedious work of data security personnel in manually discovering sensitive data and formulating sensitivity levels, reduce the possibility of manual misoperation, and further ensure the security of sensitive data.

The data detection method of the embodiment of the application acquires data to be detected; respectively extracts features of the tag class data and the text class data in the data to be detected to obtain a first tag class feature vector and a first text class feature vector; splices the first tag class feature vector and the first text class feature vector to obtain a first feature vector; and determines whether the data to be detected is sensitive data according to the similarity between the first feature vector and a second feature vector, wherein the second feature vector is obtained in advance by respectively extracting features of the tag class data and the text class data in labeled sensitive data and then splicing the extracted features. In this way, the data features are converted into a feature vector form that is convenient for machine learning algorithms to compute, and the information entropy of the feature data is increased, so that whether the data to be detected is sensitive data is determined by comparing the similarity of the feature vectors of the data to be detected and the labeled sensitive data, which can improve the accuracy and comprehensiveness of sensitive data detection.

The embodiment of the application also provides a data detection device. Referring to fig. 6, fig. 6 is a block diagram of a data detection device according to an embodiment of the present application. Because the principle by which the data detection device solves the problem is similar to that of the data detection method in the embodiment of the present application, the implementation of the data detection device can refer to the implementation of the method, and repeated description is omitted.

As shown in fig. 6, the data detection device 600 includes:

a first obtaining module 601, configured to obtain data to be detected;

the first feature extraction module 602 is configured to perform feature extraction on the tag class data and the text class data in the data to be detected, so as to obtain a first tag class feature vector and a first text class feature vector;

a first stitching module 603, configured to stitch the first tag class feature vector and the first text class feature vector to obtain a first feature vector;

the first determining module 604 is configured to determine whether the data to be detected is sensitive data according to the similarity between the first feature vector and the second feature vector, where the second feature vector is obtained by performing feature extraction on tag class data and text class data in the labeled sensitive data in advance, and then splicing the tag class data and the text class data.
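
The four-module structure listed above can be pictured as a small pipeline object. The following is only an illustrative sketch: the class name, the callable signatures, and the stub lambdas are assumptions made for illustration, and the embodiment does not prescribe any particular software interface.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]


@dataclass
class DataDetectionDevice:
    """Hypothetical wiring of modules 601 to 604 as injectable callables."""
    acquire: Callable[[], dict]                       # first obtaining module 601
    extract: Callable[[dict], Tuple[Vector, Vector]]  # first feature extraction module 602
    stitch: Callable[[Vector, Vector], Vector]        # first stitching module 603
    decide: Callable[[Vector], bool]                  # first determining module 604

    def run(self) -> bool:
        """Run the modules in the order of steps 101 to 104."""
        record = self.acquire()
        tag_vector, text_vector = self.extract(record)
        first_feature_vector = self.stitch(tag_vector, text_vector)
        return self.decide(first_feature_vector)


# Stub modules, used only to show how the pieces connect
device = DataDetectionDevice(
    acquire=lambda: {"tags": ["card_type=ID"], "texts": ["home address"]},
    extract=lambda record: ([1.0], [2.0]),
    stitch=lambda tag_vec, text_vec: list(tag_vec) + list(text_vec),
    decide=lambda vector: sum(vector) > 2.5,
)
result = device.run()
```

Keeping each module behind a plain callable makes it straightforward to swap, for example, a different feature extractor without touching the decision logic.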

Optionally, the first obtaining module 601 is configured to sample P pieces of similar data to obtain Q pieces of data to be detected, where P is an integer greater than 1 and Q is a positive integer less than P;

the first determination module 604 includes:

Optionally, the data detection device 600 further includes:

Optionally, the first feature extraction module 602 includes:

Optionally, the first determining module 604 includes:

Optionally, the data detection device 600 further includes:

Optionally, the first feature extraction module 602 includes:

The data detection device 600 provided in the embodiment of the present application may perform the method embodiment shown in fig. 1, and its implementation principle and technical effects are similar, and this embodiment will not be described herein again.

The data detection device 600 of the embodiment of the application acquires data to be detected; respectively extracts features of the tag class data and the text class data in the data to be detected to obtain a first tag class feature vector and a first text class feature vector; splices the first tag class feature vector and the first text class feature vector to obtain a first feature vector; and determines whether the data to be detected is sensitive data according to the similarity between the first feature vector and a second feature vector, wherein the second feature vector is obtained in advance by respectively extracting features of the tag class data and the text class data in labeled sensitive data and then splicing the extracted features. In this way, the data features are converted into a feature vector form that is convenient for machine learning algorithms to compute, and the information entropy of the feature data is increased, so that whether the data to be detected is sensitive data is determined by comparing the similarity of the feature vectors of the data to be detected and the labeled sensitive data, which can improve the accuracy and comprehensiveness of sensitive data detection.

The embodiment of the application also provides an electronic device. Because the principle by which the electronic device solves the problem is similar to that of the data detection method in the embodiment of the present application, the implementation of the electronic device can refer to the implementation of the method, and repeated description is omitted. As shown in fig. 7, an electronic device according to an embodiment of the present application includes:

the processor 700 is configured to read the program in the memory 720, and execute the following procedures:

acquiring data to be detected;

In fig. 7, the bus architecture may comprise any number of interconnected buses and bridges, which link together various circuits, in particular one or more processors represented by the processor 700 and memory represented by the memory 720. The bus architecture may also link together various other circuits such as peripheral devices, voltage regulators, and power management circuits, which are well known in the art and therefore will not be described further herein. The bus interface provides an interface. The processor 700 is responsible for managing the bus architecture and general processing, and the memory 720 may store data used by the processor 700 in performing operations.

Optionally, the processor 700 is further configured to read the program in the memory 720, and perform the following steps:

acquiring labeling sensitive data of different sensitive levels;

The electronic device provided in the embodiment of the present application may execute the method embodiment shown in fig. 1, and its implementation principle and technical effects are similar, and this embodiment will not be described herein.

Furthermore, a computer readable storage medium of an embodiment of the present application is used for storing a computer program, where the computer program can be executed by a processor to implement the steps of the method embodiment shown in fig. 1.

In the several embodiments provided in the present application, it should be understood that the disclosed methods and devices may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division of the units is merely a division by logical function, and other divisions are possible in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection shown or discussed between components may be an indirect coupling or communication connection via some interfaces, devices, or units, and may be in electrical, mechanical, or other form.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may be physically included separately, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.

The integrated units implemented in the form of software functional units described above may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform part of the steps of the data detection method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or various other media capable of storing program code.

While the foregoing is directed to the preferred embodiments of the present application, it will be appreciated by those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present application, and such modifications and adaptations are intended to be comprehended within the scope of the present application.

Claims

1. A data detection method, comprising:

acquiring data to be detected;

2. The method of claim 1, wherein the acquiring the data to be detected comprises:

3. The method according to claim 1, wherein after the obtaining the data to be detected, before the feature extraction is performed on the tag class data and the text class data in the data to be detected, the method further comprises:

4. The method according to claim 1, wherein the feature extracting the tag class data and the text class data in the data to be detected to obtain a first tag class feature vector and a first text class feature vector includes:

5. The method according to any one of claims 1 to 4, wherein determining whether the data to be detected is sensitive data according to the similarity of the first feature vector and the second feature vector comprises:

6. The method of claim 5, wherein the method further comprises:

acquiring labeling sensitive data of different sensitive levels;

7. The method according to any one of claims 1 to 4, wherein the feature extracting the tag class data and the text class data in the data to be detected to obtain a first tag class feature vector and a first text class feature vector includes:

8. A data detection apparatus, comprising:

the first acquisition module is used for acquiring data to be detected;

9. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor is configured to read the program in the memory to implement the steps in the data detection method according to any one of claims 1 to 7.

10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps in the data detection method according to any one of claims 1 to 7.