CN117592085A

CN117592085A - Data security detection method, device, equipment and storage medium

Info

Publication number: CN117592085A
Application number: CN202311753428.9A
Authority: CN
Inventors: 方少波
Original assignee: Kunming Shaobo Data Technology Co ltd
Current assignee: Kunming Shaobo Data Technology Co ltd
Priority date: 2023-12-19
Filing date: 2023-12-19
Publication date: 2024-02-23

Abstract

The application relates to the technical field of data detection, and more particularly discloses a data security detection method, device, equipment and storage medium, which aim to cope with the growth in data volume and the security risks brought about by the rapid development of cloud computing and big data technology.

Description

Data security detection method, device, equipment and storage medium

Technical Field

The present application relates to the technical field of data detection, and more particularly, to a data security detection method, device, equipment and storage medium.

Background

With the rapid development of cloud computing and big data technology, data volume is growing exponentially. This rapid growth brings with it the risk of security events such as data leaks. Abnormal user behaviors, such as data theft and unauthorized access, can not only leak sensitive information and damage an enterprise's brand reputation, but also threaten the privacy of citizens and, increasingly, the security of network information.

Traditional data leakage protection products, which address the threat of data leakage inside enterprises, mainly rely on preset rules for enterprise sensitive data and protect enterprise information by controlling employees' online behavior, that is, by blocking outgoing paths such as copying to a USB flash disk, sending sensitive files, uploading and printing. However, this approach has limitations: it cannot effectively handle the leakage scenario in which an internal employee steals unknown data, such as enterprise sensitive data. Because internal employees hold legitimate access rights to enterprise data assets and generally know where the enterprise sensitive data is stored, traditional behavior analysis cannot detect such behavior, so the security of enterprise data and the privacy of users are threatened.

Accordingly, a data security detection method, device, equipment and storage medium are desired.

Disclosure of Invention

The present application has been made in order to solve the above technical problems. The embodiments of the application provide a data security detection method, device, equipment and storage medium, which aim to cope with the growth in data volume and the security risks brought about by the rapid development of cloud computing and big data technology. The method acquires data security detection associated data, converts it into associated data embedded vectors, and performs semantic coding to obtain a global semantic feature vector, a local associated feature vector and a multi-scale associated feature vector. These feature vectors are then fused into a classification feature vector, and whether to generate a data leakage early warning is determined based on the classification feature vector. In this way, the leakage scenario in which an internal employee steals unknown data, such as enterprise sensitive data, can be handled effectively, and the security of data and user privacy is improved.

Accordingly, according to one aspect of the present application, there is provided a data security detection method, comprising:

acquiring data security detection associated data, wherein the data security detection associated data comprises user access data, data transmission data, system log data, files and data access records, data modification and audit logs and vulnerability assessment data;

converting the data security detection associated data into associated data embedded vectors, and performing semantic coding to obtain associated data global semantic feature vectors, associated data local associated feature vectors and multi-scale associated data associated feature vectors respectively;

fusing the associated data global semantic feature vector, the associated data local associated feature vector and the multi-scale associated data associated feature vector into a classification feature vector;

determining, based on the classification feature vector, whether to generate a data leakage early warning.

According to another aspect of the present application, there is provided a data security detection device, comprising:

a data acquisition module, used for acquiring data security detection associated data, wherein the data security detection associated data comprises user access data, data transmission data, system log data, file and data access records, data modification and audit logs and vulnerability assessment data;

a semantic coding module, used for converting the data security detection associated data into associated data embedded vectors and then performing semantic coding to obtain an associated data global semantic feature vector, an associated data local associated feature vector and a multi-scale associated data associated feature vector, respectively;

a fusion module, used for fusing the associated data global semantic feature vector, the associated data local associated feature vector and the multi-scale associated data associated feature vector into a classification feature vector;

and a result generation module, used for determining, based on the classification feature vector, whether to generate a data leakage early warning.

According to another aspect of the present application, there is provided an electronic device including: a processor; a memory in which computer program instructions are stored which, when executed by the processor, cause the processor to perform a data security detection method as described above.

According to another aspect of the present application, there is provided a computer readable medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform a data security detection method as described above.

Compared with the prior art, the data security detection method, device, equipment and storage medium provided by the application aim to cope with the growth in data volume and the security risks brought about by the rapid development of cloud computing and big data technology. Data security detection associated data is acquired and converted into associated data embedded vectors, which are semantically coded to obtain a global semantic feature vector, a local associated feature vector and a multi-scale associated feature vector. These feature vectors are then fused into a classification feature vector, and whether to generate a data leakage early warning is determined based on the classification feature vector. The leakage scenario in which an internal employee steals unknown data, such as enterprise sensitive data, can thus be handled effectively, and the security of data and user privacy is improved.

Drawings

The foregoing and other objects, features and advantages of the present application will become more apparent from the following more particular description of embodiments of the present application, as illustrated in the accompanying drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification; they serve to illustrate the application and do not constitute a limitation to it. In the drawings, like reference numerals generally refer to like parts or steps.

Fig. 1 is a flowchart of a data security detection method according to an embodiment of the present application.

Fig. 2 is a flowchart of the step, in the data security detection method according to an embodiment of the present application, of converting the data security detection associated data into associated data embedded vectors and then performing semantic coding to obtain the associated data global semantic feature vector, the associated data local associated feature vector and the multi-scale associated data associated feature vector, respectively.

Fig. 3 is a flowchart of the step, in the data security detection method according to an embodiment of the present application, of converting the data security detection associated data into associated data embedded vectors and then passing them through a context encoder based on a converter (Transformer) to obtain a plurality of associated data feature vectors.

Fig. 4 is a flowchart of the step, in the data security detection method according to an embodiment of the present application, of processing the plurality of associated data feature vectors through concatenation, convolutional encoding and multi-scale feature extraction, respectively, to obtain the associated data global semantic feature vector, the associated data local associated feature vector and the multi-scale associated data associated feature vector.

Fig. 5 is a schematic architecture diagram of a data security detection method according to an embodiment of the present application.

Fig. 6 is a block diagram of a data security detection device according to an embodiment of the present application.

Fig. 7 is a schematic diagram of an electronic device according to an embodiment of the present application.

Fig. 8 is a schematic diagram of a computer-readable storage medium according to an embodiment of the present application.

Detailed Description

Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application and not all of the embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.

Fig. 1 illustrates a flowchart of a data security detection method according to an embodiment of the present application. As shown in Fig. 1, the data security detection method according to an embodiment of the present application includes the steps of: S110, acquiring data security detection associated data, wherein the data security detection associated data comprises user access data, data transmission data, system log data, file and data access records, data modification and audit logs and vulnerability assessment data; S120, converting the data security detection associated data into associated data embedded vectors, and then performing semantic coding to obtain an associated data global semantic feature vector, an associated data local associated feature vector and a multi-scale associated data associated feature vector, respectively; S130, fusing the associated data global semantic feature vector, the associated data local associated feature vector and the multi-scale associated data associated feature vector into a classification feature vector; and S140, determining, based on the classification feature vector, whether to generate a data leakage early warning.

In step S110 of the embodiment of the present application, data security detection associated data is acquired, where the data security detection associated data includes user access data, data transmission data, system log data, file and data access records, data modification and audit logs, and vulnerability assessment data. It should be appreciated that monitoring and analyzing users' access behavior makes it possible to detect abnormal or suspicious activity, such as unauthorized access or frequent access to sensitive data. Monitoring and analyzing the data transmission process can reveal the risk of data leakage, including tampering, interception and forwarding of data in transit. System logs record the running state, events and activities of the system, and analyzing them can reveal abnormal behaviors, security events or potential vulnerabilities. Monitoring and recording access to files and data makes it possible to track and audit access to sensitive data and to identify potential data leakage behavior. Recording data modification and audit logs helps track the modification history and the operators involved, so that unauthorized data changes can be discovered and handled in a timely manner. Vulnerability assessment data provides security vulnerability information about systems and applications, and analyzing it can reveal vulnerabilities and potential security risks in the system. Specifically, the above data may be acquired as follows. User login, access and operation behavior may be recorded through logging, or acquired by monitoring network traffic and access logs. Data transmission data may be captured and analyzed with network monitoring tools, or obtained from the audit logs of the data transmission system. The system typically generates logs that record system events and activities, so system log data may be acquired by collecting and analyzing these logs. File and data access monitoring tools may be used to record and audit access to files and data.
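As an illustration only, the six categories of data security detection associated data listed for step S110 could be gathered into one record per observation window. The class name `DetectionRecord`, its field names and the sample log strings below are hypothetical and do not come from the application; the sketch merely shows one way to keep the categories in a fixed order for later embedding.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DetectionRecord:
    """One observation window of data security detection associated data.

    Hypothetical schema: each field holds raw entries for one of the six
    categories named in step S110.
    """
    user_access: List[str] = field(default_factory=list)
    data_transmission: List[str] = field(default_factory=list)
    system_log: List[str] = field(default_factory=list)
    file_access: List[str] = field(default_factory=list)
    modification_audit: List[str] = field(default_factory=list)
    vulnerability_assessment: List[str] = field(default_factory=list)

    def dimensions(self) -> List[List[str]]:
        """Return the six categories in a fixed order, one entry per dimension."""
        return [
            self.user_access,
            self.data_transmission,
            self.system_log,
            self.file_access,
            self.modification_audit,
            self.vulnerability_assessment,
        ]


# Made-up sample entries, for illustration only.
record = DetectionRecord(
    user_access=["login user=alice"],
    file_access=["read path=/finance/report.xlsx user=alice"],
)
```

Keeping the order of `dimensions()` fixed means that each position in the later embedded vector sequence always refers to the same category of data.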

In step S120 of the embodiment of the present application, after the data security detection associated data is converted into associated data embedded vectors, semantic coding is performed to obtain an associated data global semantic feature vector, an associated data local associated feature vector and a multi-scale associated data associated feature vector, respectively. It should be appreciated that converting different types of data security detection associated data into embedded vectors gives them a unified representation that facilitates subsequent analysis and processing. Semantic coding of the data security detection associated data captures semantic associations and similarities between the data, so that the relationships between the data can be better understood and useful information extracted from them. The global semantic feature vector captures the overall features of the entire dataset, which helps to find global patterns and trends. The local associated feature vector captures local association relations, which helps to discover local patterns and anomalies in the data and to analyze its characteristics at a finer level. The multi-scale associated feature vector describes the relationships between the data at different scales and granularities, which makes it possible to consider association characteristics at different levels together. Therefore, by converting the data security detection associated data into associated data embedded vectors and semantically coding them, associated features among the data, including global, local and multi-scale features, can be extracted, so that the semantic relevance of the data can be better understood and analyzed. This conversion and encoding process provides the basis for subsequent data analysis and processing.

Fig. 2 illustrates a flowchart of the step, in the data security detection method according to the embodiment of the application, of converting the data security detection associated data into associated data embedded vectors and then performing semantic coding to obtain the associated data global semantic feature vector, the associated data local associated feature vector and the multi-scale associated data associated feature vector, respectively. As shown in Fig. 2, on the basis of the embodiment shown in Fig. 1, the step S120 includes: S210, converting the data security detection associated data into associated data embedded vectors, and then passing them through a context encoder based on a converter (Transformer) to obtain a plurality of associated data feature vectors; S220, processing the plurality of associated data feature vectors through cascading, convolutional encoding and multi-scale feature extraction, respectively, to obtain the associated data global semantic feature vector, the associated data local associated feature vector and the multi-scale associated data associated feature vector.

Specifically, in one specific example of the present application, the step S210 converts the data security detection associated data into associated data embedded vectors, and then obtains a plurality of associated data feature vectors through a context encoder based on a converter (Transformer). It should be appreciated that the converter-based context encoder performs context modeling on the associated data embedded vectors, that is, it takes into account the interrelationships and dependencies between the data, and therefore captures the semantic information and contextual features of the associated data better. The context encoder outputs a plurality of associated data feature vectors, each of which is a different representation of the associated data, so that the relationships and features of the associated data are described from multiple perspectives. Further, the converter-based context encoder can learn different types of associated data features, including global features, local features and semantic features, which allows it to capture the diversity and complexity of the associated data. Obtaining a plurality of associated data feature vectors also improves the expressive power of the model, since different feature vectors carry different information for subsequent data analysis and processing tasks.

Accordingly, Fig. 3 illustrates a flowchart of the step, in the data security detection method according to an embodiment of the present application, of converting the data security detection associated data into associated data embedded vectors and then passing them through a context encoder based on a converter (Transformer) to obtain a plurality of associated data feature vectors. As shown in Fig. 3, on the basis of the embodiment shown in Fig. 2, the step S210 includes: S2101, passing the data security detection associated data through an embedding layer to convert each dimension of the data security detection associated data into an associated data embedded vector, so as to obtain a sequence of associated data embedded vectors, wherein the embedding layer uses a learnable embedding matrix to perform embedded encoding on each dimension of the data; S2102, passing the sequence of associated data embedded vectors through the converter-based context encoder to obtain the plurality of associated data feature vectors.

Specifically, in step S2101, the data security detection associated data is passed through an embedding layer to convert each dimension of the data security detection associated data into an associated data embedded vector, so as to obtain a sequence of associated data embedded vectors, where the embedding layer uses a learnable embedding matrix to perform embedded encoding on each dimension of the data. It should be appreciated that converting the data of each dimension into an associated data embedded vector gives the data a unified representation that facilitates subsequent processing and analysis. As a sequence of embedded vectors, multidimensional data is uniformly encoded into a sequence of continuous vectors, which is more convenient for the model to process and learn from. The embedded encoding of the embedding layer converts the original high-dimensional data into low-dimensional embedded vectors, which reduces the dimensionality of the data and extracts its important features. A learnable embedding matrix maps data into the embedding space by learning the distribution and characteristics of the data, making the representation more compact and meaningful. The embedded encoding also models semantic associations across the dimensions: during learning, the embedding matrix captures semantic associations and similarities between data in different dimensions, so that related data is mapped to nearby positions in the embedding space. The relationships between the data can thus be better understood and represented, which provides a basis for subsequent association analysis and processing. Because the embedding matrix is learnable, the best data representation can be learned for the specific task and data characteristics. The parameters of the embedding matrix may be trained by back propagation and an optimization algorithm to obtain an embedded representation that fits the data better, so that the embedded vectors express the semantics and features of the data more fully.

Specifically, an embedding layer is first defined, whose input is the original multidimensional data and whose output is the corresponding sequence of embedded vectors. The parameters of the embedding layer are a learnable embedding matrix, whose dimensions are determined by the length of the embedded vector and the dimension of the original data. During training, the weights of the embedding matrix are adjusted by back propagation and an optimization algorithm according to the specific task and loss function, so that the embedded vectors capture the semantics and features of the data as fully as possible. The sequence of associated data embedded vectors is the result of encoding and converting each dimension of the data through the embedding layer and its embedding matrix. Each embedded vector represents one dimension of the original data, and through learning of the embedding matrix it can better express the semantics and features of that data.
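A minimal sketch of the lookup performed by such an embedding layer is given below. It assumes that each dimension of the associated data has already been discretized to an integer id; the vocabulary size, embedding length and id values are hypothetical, and the matrix is only randomly initialized, since training by back propagation is outside the scope of the sketch.

```python
import random


def make_embedding_matrix(vocab_size, embed_dim, seed=0):
    """Create a randomly initialized embedding matrix (vocab_size x embed_dim).

    In a trained model these weights would be updated by back propagation;
    here they stay at their initial values.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(embed_dim)]
            for _ in range(vocab_size)]


def embed(token_ids, matrix):
    """Look up one embedded vector per input id, preserving order."""
    return [list(matrix[i]) for i in token_ids]


# Hypothetical ids, one per dimension of the associated data.
token_ids = [3, 0, 7, 1, 4, 2]
embedding_matrix = make_embedding_matrix(vocab_size=8, embed_dim=4)
sequence = embed(token_ids, embedding_matrix)
```

The result `sequence` is the sequence of associated data embedded vectors: one vector of length 4 for each of the six input dimensions.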

Specifically, in step S2102, the sequence of associated data embedded vectors is passed through the context encoder based on a converter (Transformer) to obtain the plurality of associated data feature vectors. It should be appreciated that the converter-based context encoder captures context information and semantic associations in sequence data. Through a self-attention mechanism, it models the relationship between each embedded vector in the sequence and the other embedded vectors, which allows the dependencies and interactions between the associated data embedded vectors to be better understood and represented. When processing associated data, it is generally necessary to consider the relationships and interactions between multiple dimensions of data. By inputting the sequence of associated data embedded vectors into the context encoder, a plurality of associated data feature vectors are obtained, each representing one position or one dimension of data in the sequence. The features and relations of the associated data are thus represented more comprehensively, and richer information is provided for subsequent analysis and processing. Because the self-attention mechanism captures semantic associations and similarities between the associated data, related embedded vectors are aggregated and their features become more prominent, which helps to model the relationships between the associated data and extract its important features.

Specifically, the sequence of associated data embedded vectors is first input into the converter-based context encoder. The context encoder employs a self-attention mechanism to encode the sequence by modeling the relationship between each embedded vector in the sequence and the other embedded vectors. The self-attention mechanism weights the embedded vectors according to their similarity and importance, so as to better capture the semantic associations between the associated data. The context encoder encodes each position or dimension of data in the sequence and outputs a corresponding feature vector. These feature vectors represent important features of the different dimensions or positions of the associated data.
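The weighting described above can be sketched with scaled dot-product self-attention, the usual core of a Transformer encoder. This is a simplified illustration, not the encoder of the application: it assumes identity query, key and value projections and a single head, and it omits the feed-forward layers, residual connections and normalization of a full encoder. The input values are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(sequence):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    Each output vector is a weighted average of all input vectors, with
    weights derived from the similarity between the current vector and
    every vector in the sequence.
    """
    dim = len(sequence[0])
    outputs = []
    for query in sequence:
        scores = [dot(query, key) / math.sqrt(dim) for key in sequence]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[j] for w, value in zip(weights, sequence))
            for j in range(dim)
        ])
    return outputs


# Made-up sequence of three embedded vectors of length 2.
embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
features = self_attention(embedded)
```

One feature vector is produced per input position, which matches the statement that the context encoder outputs a feature vector for each position or dimension of the sequence.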

Specifically, in one specific example of the present application, step S220 extracts the plurality of associated data feature vectors through concatenation, convolutional encoding and multiscale feature extraction to obtain an associated data global semantic feature vector, an associated data local associated feature vector and a multiscale associated data associated feature vector, respectively. It should be appreciated that the associated data may contain global semantic information, where each feature vector represents a feature of the overall data. By concatenating multiple associated data feature vectors, their features can be integrated to obtain a global semantic feature vector that better represents the overall semantic information and important features of the associated data. Further, different parts or portions of the association data may have particular relevance and importance. By convolutional encoding, a filter can be applied over the sequence of feature vectors to highlight or capture locally associated features. In this way, a set of local associated feature vectors can be obtained that represent the associated information of different parts of the associated data, facilitating a finer granularity of understanding and processing of the associated data. Still further, the association data often has different scale and hierarchy associations. Through multi-scale feature extraction, features of the associated data may be extracted at different scales. This can be achieved by applying filters of different sizes or using receptive fields of different sizes. The multi-scale associated feature vector can capture multi-level associated information of associated data, thereby providing a more comprehensive and rich feature representation.
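The idea of applying filters of different sizes can be sketched as follows. The application does not specify the internals of the multi-scale feature extraction in this passage, so the sketch makes several assumptions: one-dimensional filters of widths 3 and 5, fixed averaging weights in place of learned ones, and global max pooling to reduce each scale to one value.

```python
def conv1d(signal, kernel):
    """Valid one-dimensional cross-correlation of a signal with a kernel."""
    width = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(width))
        for i in range(len(signal) - width + 1)
    ]


def multi_scale_features(signal, kernel_sizes=(3, 5)):
    """Extract one pooled feature per kernel size (one per scale)."""
    features = []
    for size in kernel_sizes:
        # Placeholder averaging filter; a trained model would learn these weights.
        kernel = [1.0 / size] * size
        response = conv1d(signal, kernel)
        # Global max pooling keeps the strongest response at this scale.
        features.append(max(response))
    return features


# Made-up one-dimensional feature vector.
signal = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
multi_scale = multi_scale_features(signal)
```

A wider filter has a larger receptive field, so the two entries of `multi_scale` summarize the same vector at two different neighborhood sizes.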

Accordingly, Fig. 4 illustrates a flowchart of the step, in the data security detection method according to the embodiment of the present application, of processing the plurality of associated data feature vectors through concatenation, convolutional encoding and multi-scale feature extraction, respectively, to obtain the associated data global semantic feature vector, the associated data local associated feature vector and the multi-scale associated data associated feature vector. As shown in Fig. 4, on the basis of the embodiment shown in Fig. 2, the step S220 includes: S2201, cascading the plurality of associated data feature vectors to obtain the associated data global semantic feature vector; S2202, arranging the plurality of associated data feature vectors two-dimensionally into an associated data feature matrix, and then passing the matrix through a first convolutional neural network model serving as a feature extractor to obtain an associated data associated feature map; S2203, performing global mean pooling on each feature matrix of the associated data associated feature map along the channel dimension to obtain the associated data local associated feature vector; S2204, arranging the plurality of associated data feature vectors into a one-dimensional associated data feature vector, and then passing it through a multi-scale neighborhood feature extraction module to obtain the multi-scale associated data associated feature vector.

Specifically, in step S2201, the plurality of associated data feature vectors are concatenated to obtain an associated data global semantic feature vector. It should be appreciated that when processing associated data, each feature vector typically represents a sample or instance of the data. These feature vectors may contain information about different aspects and properties. By concatenating these feature vectors together, their features can be connected into a longer vector, forming a more global view. The concatenation operation may splice together the dimensions of different feature vectors, resulting in a larger feature vector. This has the advantage of capturing the consistency and comprehensive characteristics of the overall data. Through cascading, features of different samples or instances in the associated data can be integrated together to form a global semantic feature vector, and the vector can better represent features and semantic information of the whole data. Global semantic feature vectors are useful in many tasks such as data classification, clustering, and retrieval. They can provide a comprehensive understanding of the overall data, helping us capture important features and patterns in the data. Therefore, by cascading a plurality of associated data feature vectors together, a feature vector with more characterization capability and global view angle can be obtained, thereby improving the analysis and application capability of the associated data.

Specifically, it is first necessary to determine whether the plurality of feature vectors have the same dimension. If the dimensions differ, a preprocessing step such as dimension matching or feature dimension reduction is needed. An empty global feature vector is then created; its final length will be n × d, where n is the number of feature vectors and d is the dimension of each feature vector. Each feature vector is traversed in turn and cascaded with the global feature vector, that is, the elements of the current feature vector are appended to the end of the global feature vector by simple vector splicing. After the traversal is completed, a global feature vector of length n × d is obtained, which contains the information of all the feature vectors. Thus, the associated data global semantic feature vector is the vector obtained by concatenating the plurality of associated data feature vectors. It represents the comprehensive features and semantic information of the overall data and can be used to analyze, model and apply the associated data. The global feature vector captures the consistency and comprehensive features of the associated data, providing a more global view of the overall data. It can be used for data representation, feature extraction, machine learning and similar purposes.
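The cascading procedure described above can be sketched as follows. The function name `cascade` and the sample values are illustrative; the dimension check stands in for the dimension matching step, which the sketch does not implement.

```python
def cascade(feature_vectors):
    """Concatenate n feature vectors of dimension d into one vector of length n * d."""
    dims = {len(vector) for vector in feature_vectors}
    if len(dims) != 1:
        # The text calls for dimension matching or reduction in this case;
        # that preprocessing is not implemented in this sketch.
        raise ValueError("feature vectors must share one dimension")
    global_vector = []
    for vector in feature_vectors:
        # Append the elements of the current vector to the end of the global vector.
        global_vector.extend(vector)
    return global_vector


# Made-up feature vectors: n = 3, d = 2.
vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
global_semantic = cascade(vectors)
```

With n = 3 and d = 2, the resulting vector has length 6 and keeps the original order of the feature vectors.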

Specifically, in step S2202, the plurality of associated data feature vectors are arranged two-dimensionally into an associated data feature matrix, which is then passed through a first convolutional neural network model serving as a feature extractor to obtain the associated data associated feature map. It should be understood that arranging the feature vectors in matrix form preserves structural information between the associated data. The rows and columns of the matrix may correspond to associations between the data, such as time steps in time series data or locations in spatial data. Convolutional neural networks (CNNs) are powerful feature extractors that can effectively capture local and global features in data. By inputting the associated data feature matrix into the CNN model, the convolution layers of the CNN can be used to extract features of the associated data. The convolution layers generate a series of feature maps, each corresponding to a particular feature, and these feature maps capture different patterns and structures in the associated data. Together they form the associated feature map of the associated data, from which the association information in the data can be further understood and used. Using the CNN model as a feature extractor also allows higher-level feature representations to be learned: the convolution layers and pooling layers gradually extract abstract features from the data, making the feature representation more discriminative and expressive.

Specifically, two-dimensionally arranging the plurality of associated data feature vectors into an associated data feature matrix and then obtaining the associated data associated feature map through the first convolutional neural network model serving as a feature extractor comprises the following steps. The dimension and number of the associated data feature vectors are determined, and the feature vectors are extracted or prepared. A blank two-dimensional matrix is created, in which the number of rows is the number of feature vectors and the number of columns is the dimension of the feature vectors. The feature vectors are arranged into the matrix row by row, each feature vector occupying one row. The structure of the convolutional neural network model is determined, including convolution layers, pooling layers, activation functions, and the like, and the inputs and outputs of the model are defined. The associated data feature matrix is passed as input to the convolutional neural network model. The convolution layer performs a convolution operation on the associated data feature matrix to extract features. A pooling layer may be applied to downsample the features, reducing the size of the feature map. The extracted features are mapped into the associated feature map by an activation function. The generated associated feature map is then obtained from the convolutional neural network model. The associated feature map may consist of multiple channels, each of which is a two-dimensional matrix corresponding to a particular feature.
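As an illustration of these steps, the following sketch arranges feature vectors into a matrix and applies one convolution layer with a ReLU activation, producing one channel per kernel. It is a simplified stand-in and not the first convolutional neural network model itself: the kernels are fixed, hypothetical values rather than learned parameters, the optional pooling layer is omitted, and the sliding dot product (cross-correlation) commonly used in neural-network practice is assumed.

```python
from typing import List

Matrix = List[List[float]]


def stack_rows(vectors: List[List[float]]) -> Matrix:
    """Arrange n feature vectors of dimension d into an n x d matrix, one vector per row."""
    d = len(vectors[0])
    if any(len(v) != d for v in vectors):
        raise ValueError("all feature vectors must share the same dimension")
    return [list(v) for v in vectors]


def conv2d_relu(matrix: Matrix, kernel: Matrix) -> Matrix:
    """Valid (no padding, stride 1) 2D cross-correlation followed by ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(matrix) - kh + 1
    out_w = len(matrix[0]) - kw + 1
    out: Matrix = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = sum(matrix[i + p][j + q] * kernel[p][q]
                    for p in range(kh) for q in range(kw))
            row.append(max(0.0, s))  # ReLU activation
        out.append(row)
    return out


def extract_feature_map(matrix: Matrix, kernels: List[Matrix]) -> List[Matrix]:
    """Return one 2D channel per kernel."""
    return [conv2d_relu(matrix, k) for k in kernels]


vectors = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]  # illustrative values
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],    # hypothetical kernel: sums along the diagonal
    [[1.0, -1.0], [1.0, -1.0]],  # hypothetical kernel: horizontal difference
]
feature_map = extract_feature_map(stack_rows(vectors), kernels)
print(feature_map[0])  # [[6.0, 8.0], [12.0, 14.0]]
print(feature_map[1])  # [[0.0, 0.0], [0.0, 0.0]]
```

The second channel is all zeros because the horizontal difference is negative everywhere in this sample and ReLU clips it.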

Specifically, in step S2203, global average pooling is performed on each feature matrix of the associated data associated feature map along the channel dimension to obtain the associated data local associated feature vector. It should be appreciated that global average pooling converts the feature matrix of each channel into a single value, thereby reducing the dimension of the features. This helps to reduce the number of parameters and the computational complexity of the model. Global average pooling also summarizes the information of the whole feature matrix, so that the overall features of the associated data are retained. This helps to capture global patterns and statistical properties in the associated data. Specifically, for the feature matrix of each channel, the mean value of all its elements is calculated. The mean value of each channel is taken as one element of the local associated feature vector. A vector is finally obtained in which each element corresponds to the mean value of one channel, and this vector is the associated data local associated feature vector. Through global average pooling, a lower-dimensional local associated feature vector can be extracted from the associated data associated feature map, and this vector can represent the overall features of the associated data more effectively. Such a feature vector may be used for subsequent tasks such as classification and regression.
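The per-channel averaging can be illustrated with a short sketch. The function name `global_average_pool` and the three-channel sample feature map are assumptions made for the example.

```python
from typing import List


def global_average_pool(feature_map: List[List[List[float]]]) -> List[float]:
    """Reduce each channel (a 2D matrix) to the mean of all its elements."""
    pooled = []
    for channel in feature_map:
        values = [x for row in channel for x in row]
        pooled.append(sum(values) / len(values))
    return pooled


feature_map = [  # illustrative feature map with three 2 x 2 channels
    [[6.0, 8.0], [12.0, 14.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 2.0], [3.0, 4.0]],
]
local_vector = global_average_pool(feature_map)
print(local_vector)  # [10.0, 0.0, 2.5]
```

The output length equals the number of channels, regardless of the spatial size of each channel.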

Specifically, in step S2204, the plurality of associated data feature vectors are arranged into a one-dimensional associated data feature vector, and then the multi-scale associated data associated feature vector is obtained through a multi-scale neighborhood feature extraction module. It should be appreciated that the information in the associated data may exist on different scales, such as local detail and global structure. By performing multi-scale neighborhood feature extraction, the associated data can be analyzed on different scales to obtain a more comprehensive feature representation. The multi-scale neighborhood feature extraction module may capture contextual information of the associated data features on different scales. In this way, local and global contexts of the associated data features can be obtained, facilitating a better understanding of the relationships between the associated data. Through the multi-scale neighborhood feature extraction module, the associated features extracted on different scales can be fused. This feature fusion can enhance the expressive capacity of the associated data features, giving the associated data associated feature vector greater discriminative and expressive capacity. The multi-scale associated data associated feature vector may be used for various tasks such as classification, regression, and object detection. By using a multi-scale feature representation, the ability to understand and process the associated data can be improved, resulting in better performance on these tasks.

Specifically, the multi-scale neighborhood feature extraction module is a module for extracting multi-scale features from associated data. It is typically used to process data having a spatial or temporal relationship, such as images, videos, text sequences, and the like. The main objective of the multi-scale neighborhood feature extraction module is to capture local and global features of associated data by considering neighborhood information on different scales to obtain a more comprehensive feature representation.

Specifically, arranging the plurality of associated data feature vectors into the one-dimensional associated data feature vector and then obtaining the multi-scale associated data associated feature vector through the multi-scale neighborhood feature extraction module comprises the following steps: using a first convolution layer of the multi-scale neighborhood feature extraction module to perform one-dimensional convolutional encoding on the one-dimensional associated data feature vector with a one-dimensional convolution kernel of a first scale, so as to obtain a first-scale associated data associated feature vector; using a second convolution layer of the multi-scale neighborhood feature extraction module to perform one-dimensional convolutional encoding on the one-dimensional associated data feature vector with a one-dimensional convolution kernel of a second scale, so as to obtain a second-scale associated data associated feature vector; and cascading the first-scale associated data associated feature vector and the second-scale associated data associated feature vector to obtain the multi-scale associated data associated feature vector. Obtaining the first-scale associated data associated feature vector comprises: performing one-dimensional convolutional encoding on the one-dimensional associated data feature vector by using the first convolution layer of the multi-scale neighborhood feature extraction module according to the following formula;

wherein the formula is:

$$\mathrm{Cov}_1(X) = \sum_{a=1}^{w} F(a) \cdot G(x-a)$$

wherein $a$ is the width of the first convolution kernel in the $x$ direction, $F(a)$ is the first convolution kernel parameter vector, $G(x-a)$ is the local vector matrix operated on with the convolution kernel function, $w$ is the size of the first convolution kernel, $X$ represents the one-dimensional associated data feature vector, and $\mathrm{Cov}_1(X)$ represents the first-scale associated data associated feature vector.

Using the second convolution layer of the multi-scale neighborhood feature extraction module to perform one-dimensional convolutional encoding on the one-dimensional associated data feature vector with a one-dimensional convolution kernel of a second scale, so as to obtain the second-scale associated data associated feature vector, comprises: performing one-dimensional convolutional encoding on the one-dimensional associated data feature vector by using the second convolution layer of the multi-scale neighborhood feature extraction module according to the following formula;

wherein the formula is:

$$\mathrm{Cov}_2(X) = \sum_{b=1}^{m} F(b) \cdot G(x-b)$$

wherein $b$ is the width of the second convolution kernel in the $x$ direction, $F(b)$ is the second convolution kernel parameter vector, $G(x-b)$ is the local vector matrix operated on with the convolution kernel function, $m$ is the size of the second convolution kernel, $X$ represents the one-dimensional associated data feature vector, and $\mathrm{Cov}_2(X)$ represents the second-scale associated data associated feature vector.
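The two-scale encoding and cascading described above can be sketched as follows. This is a minimal sketch, not the module of the application: the kernel sizes 3 and 5, the kernel values, and the function names are hypothetical, no padding is used, and the sliding dot product (cross-correlation) common in neural-network practice stands in for the convolution.

```python
from typing import List


def conv1d(x: List[float], kernel: List[float]) -> List[float]:
    """Valid 1D cross-correlation with stride 1."""
    w = len(kernel)
    return [sum(kernel[a] * x[i + a] for a in range(w))
            for i in range(len(x) - w + 1)]


def multi_scale_features(x: List[float],
                         first_kernel: List[float],
                         second_kernel: List[float]) -> List[float]:
    """Encode x at two scales and cascade (concatenate) the results."""
    first_scale = conv1d(x, first_kernel)    # first convolution layer
    second_scale = conv1d(x, second_kernel)  # second convolution layer
    return first_scale + second_scale


x = [1.0, 2.0, 3.0, 4.0, 5.0]             # one-dimensional feature vector (illustrative)
first_kernel = [1.0, 0.0, -1.0]           # hypothetical kernel of size 3
second_kernel = [1.0, 1.0, 1.0, 1.0, 1.0]  # hypothetical kernel of size 5
multi_scale = multi_scale_features(x, first_kernel, second_kernel)
print(multi_scale)  # [-2.0, -2.0, -2.0, 15.0]
```

The smaller kernel responds to local differences between neighbors, while the larger kernel aggregates over the whole window, which is the sense in which the two branches see different neighborhood scales.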

In step S130 of the embodiment of the present application, the associated data global semantic feature vector, the associated data local associated feature vector, and the multi-scale associated data associated feature vector are fused into a classification feature vector. It should be appreciated that the global semantic feature vector captures semantic information of the overall associated data, the local associated feature vector captures local relationships of the associated data, and the multi-scale associated data associated feature vector provides a multi-scale feature representation. Fusing them allows information at different levels to be used together, making the feature representation more comprehensive and richer. Different types of feature vectors describe the features of the associated data from different angles, so fusing the three can enhance the expressive capacity of the classification feature vector and give it better discriminative and generalization performance. Fusing multiple feature vectors also provides more information, thereby improving the performance of the classification task. Different feature vectors may discriminate between categories to different degrees, and by fusing them the differences between categories can be captured better, improving the accuracy and robustness of the classifier.

In step S140 of the embodiment of the present application, it is determined, based on the classification feature vector, whether to generate a data leakage early warning. It should be understood that data leakage refers to sensitive data being disclosed to unauthorized persons or organizations, whether without authorization or by accident. Data leakage may cause problems such as disclosure of personal privacy and of business secrets, and may cause serious losses to individuals and organizations. Therefore, it is important to detect data leakage and take countermeasures in time. A classifier is a machine learning model that learns patterns and rules from known data and then applies them to new data for classification. In this problem, the classifier can learn from known data the patterns and rules that identify whether data leakage occurs. By inputting the classification feature vector into the classifier, a classification result, that is, a prediction of whether data leakage occurs, can be obtained. The classification result is used for indicating whether a data leakage early warning is generated, so that potential data leakage risks are found in time. If the classification result indicates that data may be leaked, corresponding countermeasures can be taken immediately, such as strengthening data security measures, tracking data access records, and notifying relevant parties for risk assessment. In this way, the loss caused by data leakage can be reduced to the greatest extent, and personal privacy and business secrets can be protected.

In particular, in the present solution, it is considered that when cascading multiple associated data feature vectors into a global semantic feature vector, if some associated data features are missing or near zero over most of the data samples, the global semantic feature vector will become very sparse. This may be due to the different frequency of occurrence of different types of associated data in different samples, resulting in some features rarely occurring in global semantic feature vectors. Meanwhile, after a plurality of associated data feature vectors are arranged as a feature matrix, if some features are missing or near zero in most samples, the feature matrix becomes very sparse. This may be due to the non-uniform distribution of different types of associated data in different samples, resulting in few occurrences of certain features in the feature matrix. Moreover, when global averaging is performed, if some features of the associated data associated feature map do not change greatly in the channel dimension, the local associated feature vector obtained after global averaging is very sparse. In a multi-scale neighborhood feature extraction process, a sliding window may be used to capture associated data features of different scales. If the associated data does not change much on different scales, the extracted multi-scale associated data associated feature vectors will be very sparse. That is, the reason why the classified feature vector obtained by fusing the associated data global semantic feature vector, the associated data local associated feature vector and the multi-scale associated data associated feature vector has high sparsity is that a large number of missing or near-zero features exist in the processes of feature cascading, feature arrangement, global averaging and feature extraction. 
Such sparsity can adversely affect classification accuracy, because it is difficult for the classifier to extract valid information from sparse features, which reduces the performance of the classifier. Sparsity can also cause problems such as data imbalance, the curse of dimensionality, and classifier instability, which further affect classification accuracy. To address the adverse effect of the sparsity of the classification feature vector on classification accuracy when the classification feature vector is classified by a classifier, in the technical solution of the present application, the posterior expression of the motion distribution model of the classification feature vector relative to the target classification function is calculated.

Specifically, calculating posterior expression of the motion distribution model of the classification feature vector relative to the target classification function by the following formula to obtain an optimized classification feature vector;

wherein, the formula is:

wherein $v_i$ is the feature value at the $i$-th position of the classification feature vector, the mean term in the formula is the global feature mean of the classification feature vector, $\log$ denotes the base-2 logarithm, $\lambda$ is a predetermined hyperparameter, and $v_i'$ is the feature value at the $i$-th position of the optimized classification feature vector.

That is, to improve the classification accuracy of the classification feature vector obtained from the encoder model, the technical solution of the present application uses a motion distribution model to approximate the target classification function. Specifically, a KL-like divergence is used to measure the difference between the motion distribution model and the target classification function, that is, the parameters of the neural network are optimized with a cross-entropy loss function. The posterior expression is then used to calculate or estimate the output of the motion distribution model, that is, the output of the encoder model is used as the input of the neural network. A KL-like divergence is also used to measure the difference between the posterior expression and the motion distribution model, that is, the parameters of the encoder model are optimized by maximum likelihood estimation, so that a sparsity constraint promotes the sparsity of the encoder model. In this way, the implicit expression of the features is sparsity-constrained, so that the parameter space of the encoder is sparsity-limited during training, which improves the group optimization capacity of the encoder model and thus the classification accuracy of the classification feature vector obtained from the encoder model.

In a specific example of the present application, determining, based on the classification feature vector, whether to generate the data leakage early warning includes: calculating the posterior expression of the motion distribution model of the classification feature vector relative to the target classification function to obtain an optimized classification feature vector; and passing the optimized classification feature vector through a classifier to obtain a classification result, wherein the classification result is used for indicating whether a data leakage early warning is generated so that countermeasures can be taken in time. Specifically, a fully connected layer of the classifier is used to perform fully connected encoding on the classification feature vector to obtain an encoded classification feature vector; the encoded classification feature vector is passed through a Softmax classification function of the classifier to obtain a first probability of generating the data leakage early warning and a second probability of not generating the data leakage early warning; and the classification result is determined based on a comparison between the first probability and the second probability.
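The classifier stage described above can be sketched as a fully connected layer followed by Softmax and a comparison of the two probabilities. This is a minimal sketch: the weights, bias, input values and function names are hypothetical, untrained placeholders chosen only to make the example runnable.

```python
import math
from typing import List, Tuple


def fully_connected(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """One fully connected layer: each output is a weighted sum of the inputs plus a bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable Softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(x: List[float], weights: List[List[float]], bias: List[float]) -> Tuple[bool, float, float]:
    """Return (warning?, first probability, second probability)."""
    p_warning, p_no_warning = softmax(fully_connected(x, weights, bias))
    return p_warning > p_no_warning, p_warning, p_no_warning


x = [0.5, -1.0, 2.0]                         # illustrative classification feature vector
weights = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # hypothetical, untrained weights (2 outputs)
bias = [0.0, 0.0]
warning, p_warning, p_no_warning = classify(x, weights, bias)
print(warning)  # True
```

Here the two logits are 2.5 and -1.0, so the first probability dominates and a warning is indicated. In practice the weights would come from training on labeled data.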

The embodiment of the present application also provides an architecture diagram, shown in fig. 5. Fig. 5 is a schematic architecture diagram of a data security detection method according to an embodiment of the present application. First, data security detection associated data is acquired, where the data security detection associated data includes user access data, data transmission data, system log data, file and data access records, data modification and audit logs, and vulnerability assessment data. Then, the data security detection associated data is converted into associated data embedded vectors, which are passed through a context encoder based on a converter to obtain a plurality of associated data feature vectors. Next, the plurality of associated data feature vectors are cascaded to obtain an associated data global semantic feature vector. The plurality of associated data feature vectors are also two-dimensionally arranged into an associated data feature matrix, which is passed through a first convolutional neural network model serving as a feature extractor to obtain an associated data associated feature map. Global average pooling is then performed on each feature matrix of the associated data associated feature map along the channel dimension to obtain an associated data local associated feature vector. In addition, the plurality of associated data feature vectors are arranged into a one-dimensional associated data feature vector, which is passed through a multi-scale neighborhood feature extraction module to obtain a multi-scale associated data associated feature vector. The associated data global semantic feature vector, the associated data local associated feature vector and the multi-scale associated data associated feature vector are then fused to obtain the classification feature vector. 
Then, a posterior expression of the motion distribution model of the classification feature vector relative to the target classification function is calculated to obtain an optimized classification feature vector. Finally, the optimized classification feature vector is passed through a classifier to obtain a classification result, which is used for indicating whether a data leakage early warning is generated so that countermeasures can be taken in time.

In summary, the data security detection method, device, equipment and storage medium according to the embodiments of the present application are intended to cope with the increase in data volume and security risk caused by the rapid development of cloud computing and big data technology. Data security detection associated data is acquired and converted into associated data embedded vectors for semantic coding, so as to obtain a global semantic feature vector, a local associated feature vector and a multi-scale associated data associated feature vector. These feature vectors are then fused into a classification feature vector, and whether a data leakage early warning is generated is judged based on the classification feature vector. In this way, leakage scenarios in which internal employees steal unknown data such as enterprise sensitive data can be handled effectively, and the security of data and user privacy is improved.

Fig. 6 is a block diagram of a data security detection device according to an embodiment of the present application. As shown in fig. 6, the data security detection device 100 according to the embodiment of the present application includes: a data acquisition module 110, configured to acquire data security detection association data, where the data security detection association data includes user access data, data transmission data, system log data, file and data access records, data modification and audit logs, and vulnerability assessment data; the semantic coding module 120 is configured to convert the data security detection associated data into associated data embedded vectors, and perform semantic coding to obtain associated data global semantic feature vectors, associated data local associated feature vectors, and multi-scale associated data associated feature vectors, respectively; a fusion module 130, configured to fuse the associated data global semantic feature vector, the associated data local associated feature vector, and the multi-scale associated data associated feature vector into a classification feature vector; and the result generating module 140 is configured to obtain whether to generate a data leakage early warning based on the classification feature vector.
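The division into modules 110 to 140 can be sketched as a simple composition of four stages. This is only an illustration of the data flow: the class name `DataSecurityDetector`, the stage signatures, and the placeholder stages are all hypothetical, and fusion by concatenation is an assumption, since the fusion operation is not specified here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Vector = List[float]
RawData = Dict[str, List[str]]


@dataclass
class DataSecurityDetector:
    """Hypothetical wiring of modules 110 to 140; each stage is injected as a callable."""
    acquire: Callable[[], RawData]                              # data acquisition module 110
    encode: Callable[[RawData], Tuple[Vector, Vector, Vector]]  # semantic coding module 120
    fuse: Callable[[Vector, Vector, Vector], Vector]            # fusion module 130
    decide: Callable[[Vector], bool]                            # result generating module 140

    def run(self) -> bool:
        data = self.acquire()
        global_vec, local_vec, multi_scale_vec = self.encode(data)
        classification_vec = self.fuse(global_vec, local_vec, multi_scale_vec)
        return self.decide(classification_vec)


# Placeholder stages that only show the data flow; they are not the models of the application.
detector = DataSecurityDetector(
    acquire=lambda: {"system_log": ["login ok", "bulk export"]},
    encode=lambda data: ([1.0, 0.0], [0.5], [0.2, 0.3]),
    fuse=lambda g, l, m: g + l + m,  # concatenation is an assumption
    decide=lambda v: sum(v) > 1.5,   # stand-in for the classifier
)
warning = detector.run()
print(warning)  # True
```

Injecting the stages as callables mirrors the statement that the modules may be realized as software and/or hardware components and swapped independently.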

Here, it will be understood by those skilled in the art that the specific functions and operations of the respective units and modules in the above-described data security detection device have been described in detail in the above description of the data security detection method with reference to fig. 1 to 5, and thus, repetitive descriptions thereof will be omitted.

As described above, the data security detection device 100 according to the embodiment of the present application may be implemented in various terminal devices, for example, a server on which a data security detection algorithm is deployed. In one example, the data security detection device 100 may be integrated into the terminal device as a software module and/or a hardware module. For example, the data security detection device 100 may be a software module in the operating system of the terminal device, or may be an application developed for the terminal device; of course, the data security detection device 100 may also be one of a plurality of hardware modules of the terminal device.

Alternatively, in another example, the data security detection device 100 and the terminal device may be separate devices, and the data security detection device 100 may be connected to the terminal device through a wired and/or wireless network and transmit interaction information in an agreed data format.

Specifically, the present application also provides another embodiment, and an electronic device according to an embodiment of the present application is described below with reference to fig. 7.

Fig. 7 illustrates a block diagram of an electronic device according to an embodiment of the present application. As shown in fig. 7, an electronic device 70 according to an embodiment of the present disclosure includes a memory 701 and a processor 702. The components in the electronic device 70 are interconnected by a bus system and/or other forms of connection mechanisms (not shown).

The memory 701 is used to store computer readable instructions. In particular, memory 701 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory (cache), and the like. The non-volatile memory may include, for example, Read-Only Memory (ROM), hard disk, flash memory, and the like.

The processor 702 may be a Central Processing Unit (CPU), a Graphics Processing Unit (GPU) or another form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 70 to perform desired functions. In one embodiment of the present disclosure, the processor 702 is configured to execute the computer readable instructions stored in the memory 701, so that the electronic device 70 performs the data security detection method described with reference to fig. 1 and 5 or implements the data security detection device described with reference to fig. 6.

Further, it is to be understood that the components and configuration of the electronic device 70 shown in fig. 7 are exemplary only and not limiting, and the electronic device 70 may have other components and configurations as desired, such as a data acquisition device and an output device (not shown). The data acquisition device may be used to acquire data security detection associated data and store it in the memory 701 for use by other components. Of course, other acquisition devices may also be used to acquire the data security detection associated data and send it to the electronic device 70, and the electronic device 70 may store the received data in the memory 701. The output device may output various information, such as whether a data leakage early warning is generated, to the outside (e.g., to a user). The output device may include one or more of a display, a speaker, a projector, a network card, and the like.

Fig. 8 is a schematic diagram illustrating a computer-readable storage medium according to an embodiment of the present disclosure. As shown in fig. 8, a computer-readable storage medium 800 according to an embodiment of the present disclosure has computer-readable instructions 801 stored thereon. When the computer readable instructions 801 are executed by a processor, the data security detection method described with reference to fig. 1 and 5 is performed, or the data security detection device described with reference to fig. 6 is implemented.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.

In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto. Any variation or substitution that a person skilled in the art can readily conceive of shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A data security detection method, comprising:

2. The method for detecting data security according to claim 1, wherein converting the data security detection associated data into associated data embedded vectors and performing semantic coding to obtain associated data global semantic feature vectors, associated data local associated feature vectors and multi-scale associated data associated feature vectors respectively, comprises:

converting the data security detection associated data into associated data embedded vectors, and then obtaining a plurality of associated data feature vectors through a context encoder based on a converter;

and respectively extracting the plurality of associated data feature vectors through cascading, convolution coding and multi-scale feature extraction to obtain an associated data global semantic feature vector, an associated data local associated feature vector and a multi-scale associated data associated feature vector.

3. The data security detection method according to claim 2, wherein converting the data security detection associated data into associated data embedded vectors and then passing them through a transformer-based context encoder to obtain a plurality of associated data feature vectors comprises:

passing the data security detection associated data through an embedding layer to convert each dimension of the data security detection associated data into an associated data embedded vector, so as to obtain a sequence of associated data embedded vectors, wherein the embedding layer uses a learnable embedding matrix to embed each dimension of the data;

and passing the sequence of associated data embedded vectors through the transformer-based context encoder to obtain the plurality of associated data feature vectors.
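The embedding step of claim 3 can be illustrated with a minimal sketch. It assumes that each dimension of a record has already been discretized into an integer index; the names `make_embedding_matrix` and `embed`, the matrix size and the random initialization are hypothetical, and the context encoder that follows the embedding layer is omitted. In a trained model the matrix entries would be learned rather than drawn at random.

```python
import random


def make_embedding_matrix(vocab_size, dim, seed=0):
    """Build a (vocab_size x dim) matrix of small random values.

    Assumption: in a real model these entries are learnable parameters
    updated during training; random values only stand in for them here.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(vocab_size)]


def embed(indices, matrix):
    """Look up one embedding vector (one matrix row) per input index."""
    return [list(matrix[i]) for i in indices]


# Hypothetical record: three dimensions, each already mapped to an index.
matrix = make_embedding_matrix(vocab_size=8, dim=4)
record = [3, 0, 5]
sequence = embed(record, matrix)  # sequence of embedded vectors
```

The resulting `sequence` holds one 4-dimensional vector per dimension of the record, which is the shape of input a sequence encoder would expect.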

4. The data security detection method according to claim 3, wherein processing the plurality of associated data feature vectors through concatenation, convolutional encoding and multi-scale feature extraction, respectively, to obtain an associated data global semantic feature vector, an associated data local associated feature vector and a multi-scale associated data associated feature vector comprises:

cascading the plurality of associated data feature vectors to obtain an associated data global semantic feature vector;

two-dimensionally arranging the plurality of associated data feature vectors into an associated data feature matrix, and then obtaining an associated data associated feature map through a first convolutional neural network model serving as a feature extractor;

performing global average pooling on each feature matrix of the associated data associated feature map along the channel dimension to obtain an associated data local associated feature vector;

and arranging the plurality of associated data feature vectors into one-dimensional associated data feature vectors, and then obtaining the multi-scale associated data associated feature vectors through a multi-scale neighborhood feature extraction module.
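The first three steps of claim 4 can be sketched as follows. This is a minimal illustration, not the claimed model: a single convolution layer with two fixed, hand-picked kernels and a ReLU activation stands in for the first convolutional neural network model, and all function names and numeric values are assumptions chosen for the example.

```python
def concatenate(vectors):
    """Join several feature vectors end to end into one vector."""
    return [x for v in vectors for x in v]


def conv2d_valid(matrix, kernel):
    """2D convolution without padding, followed by ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    rows, cols = len(matrix), len(matrix[0])
    out = []
    for i in range(rows - kh + 1):
        row = []
        for j in range(cols - kw + 1):
            s = sum(matrix[i + a][j + b] * kernel[a][b]
                    for a in range(kh) for b in range(kw))
            row.append(max(0.0, s))
        out.append(row)
    return out


def global_average_pool(feature_map):
    """Reduce each channel (a 2D matrix) to its mean value."""
    return [sum(sum(r) for r in ch) / (len(ch) * len(ch[0]))
            for ch in feature_map]


# Hypothetical encoder outputs: three feature vectors of length 3.
feature_vectors = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

# Global semantic feature vector: plain concatenation.
global_vec = concatenate(feature_vectors)

# 2D arrangement: one feature vector per row of the feature matrix.
feature_matrix = feature_vectors

# Two illustrative kernels give a feature map with two channels.
kernels = [[[1.0, 0.0], [0.0, 1.0]],
           [[0.25, 0.25], [0.25, 0.25]]]
feature_map = [conv2d_valid(feature_matrix, k) for k in kernels]

# Local associated feature vector: one pooled value per channel.
local_vec = global_average_pool(feature_map)  # [10.0, 5.0]
```

The pooled vector has one entry per channel of the feature map, so its length depends on the number of kernels rather than on the size of the feature matrix.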

5. The data security detection method according to claim 4, wherein arranging the plurality of associated data feature vectors into a one-dimensional associated data feature vector and then passing it through a multi-scale neighborhood feature extraction module to obtain the multi-scale associated data associated feature vector comprises:

using a first convolution layer of the multi-scale neighborhood feature extraction module to perform one-dimensional convolutional encoding on the one-dimensional associated data feature vector with a one-dimensional convolution kernel of a first scale, so as to obtain a first-scale associated data associated feature vector;

using a second convolution layer of the multi-scale neighborhood feature extraction module to perform one-dimensional convolutional encoding on the one-dimensional associated data feature vector with a one-dimensional convolution kernel of a second scale, so as to obtain a second-scale associated data associated feature vector;

and cascading the first scale associated data associated feature vector and the second scale associated data associated feature vector to obtain the multi-scale associated data associated feature vector.
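The two-branch structure of claim 5 can be sketched as below. The kernel sizes (3 and 5), the all-ones kernel weights, the zero padding and the function names are assumptions made for illustration; the claim only requires that the two convolution layers use one-dimensional kernels of different scales and that their outputs be concatenated.

```python
def conv1d_same(signal, kernel):
    """1D convolution with zero padding so output length equals input length.

    Assumes an odd kernel length so the padding is symmetric.
    """
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(padded[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal))]


def multi_scale_features(signal, kernel_a, kernel_b):
    """Run two kernels of different sizes and concatenate the results."""
    return conv1d_same(signal, kernel_a) + conv1d_same(signal, kernel_b)


# Hypothetical one-dimensional feature vector and two kernel scales.
signal = [1.0, 2.0, 3.0, 4.0]
kernel_small = [1.0, 1.0, 1.0]   # first scale: neighborhood of 3
kernel_large = [1.0] * 5         # second scale: neighborhood of 5

features = multi_scale_features(signal, kernel_small, kernel_large)
# first-scale part:  [3.0, 6.0, 9.0, 7.0]
# second-scale part: [6.0, 10.0, 10.0, 9.0]
```

The smaller kernel responds to immediate neighbors while the larger one aggregates a wider window, so the concatenated vector carries both views of the same input.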

6. The data security detection method according to claim 5, wherein determining whether to generate a data leakage early warning based on the classification feature vector comprises:

calculating a posterior expression of the motion distribution model of the classification feature vector relative to the target classification function to obtain an optimized classification feature vector;

and passing the optimized classification feature vector through a classifier to obtain a classification result, wherein the classification result indicates whether a data leakage early warning is generated, so that countermeasures can be taken in time.
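The final classification step of claim 6 can be illustrated with a minimal sketch. The claim does not specify the classifier, so a single linear layer followed by softmax is assumed here; the weights, the bias, the input vector and the label convention (0 for "generate warning", 1 for "no warning") are all hypothetical, and the optimization step that precedes the classifier is not reproduced.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(vector, weights, bias):
    """Linear layer plus softmax; returns (label index, probabilities)."""
    logits = [sum(w * x for w, x in zip(row, vector)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    return probs.index(max(probs)), probs


# Hypothetical parameters: row 0 scores "generate warning",
# row 1 scores "no warning".
weights = [[0.5, -0.2, 0.1],
           [-0.5, 0.2, -0.1]]
bias = [0.0, 0.0]

label, probs = classify([1.0, 0.0, 2.0], weights, bias)
generate_warning = (label == 0)
```

With these example values the first logit is larger than the second, so the sketch selects label 0 and sets `generate_warning` to `True`.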

7. The data security detection method according to claim 6, wherein calculating a posterior expression of the motion distribution model of the classification feature vector relative to the target classification function to obtain an optimized classification feature vector comprises: calculating the posterior expression of the motion distribution model of the classification feature vector relative to the target classification function by the following formula to obtain the optimized classification feature vector;

wherein, the formula is:

8. A data security detection device, comprising:

9. An electronic device, comprising:

a processor;

a memory having stored therein computer program instructions that, when executed by the processor, cause the processor to perform the data security detection method of any of claims 1-7.

10. A computer-readable medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the data security detection method of any of claims 1-7.