CN116049397A - Sensitive information discovery and automatic classification method based on multi-mode fusion - Google Patents

Sensitive information discovery and automatic classification method based on multi-mode fusion

Info

Publication number
CN116049397A
CN116049397A (application CN202211705972.1A)
Authority
CN
China
Prior art keywords
data
unstructured data
mode
fusion
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211705972.1A
Other languages
Chinese (zh)
Other versions
CN116049397B (en)
Inventor
Cai Liang
Zou Zhenzhen
Liu Zhichao
Yang Xiaojian
Du Haijiao
Chen Peipei
Xiao Xuexue
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huoyin Technology Co., Ltd.
Original Assignee
Beijing Huoyin Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Huoyin Technology Co., Ltd.
Priority claimed from application CN202211705972.1A
Publication of CN116049397A
Application granted
Publication of CN116049397B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 - Browsing; Visualisation therefor
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 - Named entity recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/54 - Interprogram communication
    • G06F9/546 - Message passing systems or structures, e.g. queues
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 - Indexing scheme relating to G06F9/00
    • G06F2209/54 - Indexing scheme relating to G06F9/54
    • G06F2209/548 - Queue

Abstract

The application relates to a sensitive information discovery and automatic classification method based on multi-modal fusion, which judges the data type of accessed unstructured data to obtain single-modal unstructured data or multi-modal unstructured data; processes and analyzes the single-modal or multi-modal unstructured data with a preset model to obtain a corresponding data classification result; and outputs and stores the data classification result. A shared-parameter BERT structure simultaneously completes the judgment of the relation between each image/video frame and the text and the fusion of visual and text features, realizing linked processing and analysis of data in different modalities such as text, image and video. The method solves the problems of the data security classification products on the market, such as a single supported data modality, insufficient mining of data information, low degrees of automation and customization, a narrow applicable service range, low accuracy, and heavy computing-resource consumption caused by multiple models.

Description

Sensitive information discovery and automatic classification method based on multi-mode fusion
Technical Field
The disclosure relates to the technical field of unstructured data, and in particular to a method, an apparatus and a control system for sensitive information discovery and automatic classification and grading based on multi-modal fusion.
Background
In today's increasingly developed network and big-data environment, data security and privacy protection face unprecedented challenges. Accurate identification of sensitive data is a prerequisite for data security. Social media and corporate institutions generate large amounts of unstructured production and operation data daily; if such data contains sensitive information, the loss to the company or individual once it is leaked would be incalculable. The modality of this data is not single: text commonly embeds key picture information, video carries important subtitle information, and the speech and text within audio are inseparable. Data in multiple modalities expressing the same information therefore complement one another. How to fully utilize and combine the data of each modality in massive multi-modal unstructured data, discover sensitive data within it, and automatically classify and grade it is a critical problem of data security.
Data classification and grading means defining which business field, i.e., category, data belongs to according to its industry background and application scenario; the grade represents the sensitivity level of the data. Sharing strategies for internal and external use change with the data's sensitivity level. Unlike traditional keyword or regular-expression matching of key information, data content is now analyzed and processed with technologies such as machine learning, natural language processing, text semantic analysis and computer vision, and automatic classification can be achieved through repeated sample training and model correction. However, existing models target data of a single modality and do not fully utilize the remaining modalities of the data, which greatly reduces the accuracy and efficiency of model inference.
The current mainstream processing mode for security classification and grading of unstructured data is the traditional template matching method: existing unstructured data is compared against an established keyword library, regular-expression templates or fixed sentence-pattern templates, and key information is extracted. The method is rigid and limited in processing capacity; once the application scenario changes slightly, the templates can hardly adapt to the new business. Some products do use artificial-intelligence-based methods for sensitive information discovery and automatic classification, but they only complete the most basic recognition tasks with general-purpose algorithm models, cannot meet users' higher-level customization requirements, and suffer from coarse classification granularity and low precision. In addition, existing security classification products or systems require different types of AI models to extract sensitive information from different types of unstructured data: natural language processing models handle plain text and ignore the associated picture information, while visual models handle plain pictures and video and ignore important text information. In practice, whether in national defense security data, social media data, internal production and management data or personal information data, multiple modalities such as text, picture, video and audio are fused together. Because single-modal models separate text information from picture and video information and cannot combine them, existing classification and grading systems have a low degree of model integration and low accuracy, cannot fully mine sensitive information, and consume more resources by running multiple models.
Disclosure of Invention
In order to solve the above problems, the application provides a method, an apparatus and a control system for sensitive information discovery and automatic classification and grading based on multi-modal fusion.
In one aspect of the present application, a method for sensitive information discovery and automatic classification and grading based on multi-modal fusion is provided, including the following steps:
accessing unstructured data;
judging the data type of the unstructured data to obtain single-modal unstructured data or multi-modal unstructured data;
processing and analyzing the single-modal or multi-modal unstructured data with a preset model to obtain a corresponding data classification result;
and outputting and storing the data classification result.
As an optional embodiment of the present application, optionally, accessing unstructured data includes:
configuring a data source connector;
accessing corresponding unstructured data from the selected data source by adopting the configured data source connector;
and reporting the accessed unstructured data in real time according to a preset message queue.
As an optional embodiment of the present application, optionally, accessing unstructured data further includes:
presetting distributed storage conditions;
and carrying out distributed storage on the accessed unstructured data according to the distributed storage conditions.
As an optional embodiment of the present application, optionally, accessing unstructured data further includes:
pre-configuring a visualization management interface;
and adopting the visualization management interface to perform visual management on the accessed unstructured data.
As an optional embodiment of the present application, optionally, after accessing the unstructured data, the method further comprises:
presetting preprocessing conditions;
and preprocessing the accessed unstructured data according to the preprocessing conditions.
As an optional implementation manner of the present application, optionally, processing and analyzing the single-modal unstructured data with a preset model to obtain a corresponding data classification result includes:
training and generating different AI models based on AI technology, according to the different types of unstructured data and different task targets;
inputting the single-modal unstructured data into the different AI models according to its data type, and extracting features;
and classifying the single-modal unstructured data according to the extracted features, and storing the obtained data classification result.
As an optional implementation manner of the present application, optionally, processing and analyzing the multi-modal unstructured data with a preset model to obtain a corresponding data classification result includes:
obtaining a text feature matrix based on the pre-training model;
obtaining a visual feature matrix of the image/video based on the visual preprocessing model;
fusing the text feature matrix and the visual feature matrix to obtain a fused feature vector;
inputting the fused feature vector into the pre-training model, obtaining a text-visual correlation probability after training calculation, and constructing a visual mask matrix according to the text-visual correlation probability;
fusing the visual mask matrix and the visual feature matrix to obtain a fusion feature;
inputting the fusion feature into the pre-training model, and obtaining a text encoding with visual cues after training calculation;
inputting the text encoding with visual cues into a preset named entity recognition model, and extracting key entity information;
and processing and analyzing the multi-modal unstructured data according to the extracted key entity information to obtain and store a corresponding data classification result.
As an optional embodiment of the present application, optionally, the pre-training model is a parameter-sharing BERT model, and the visual preprocessing model is a 152-layer residual network (ResNet-152).
In another aspect of the present application, an apparatus for implementing the above method for sensitive information discovery and automatic classification and grading based on multi-modal fusion is provided, including:
the data access module, used for accessing unstructured data;
the data type judging module, used for judging the data type of the unstructured data to obtain single-modal unstructured data or multi-modal unstructured data;
the classification and grading processing module, used for processing and analyzing the single-modal or multi-modal unstructured data with a preset model to obtain a corresponding data classification and grading result;
and the distributed storage module, used for outputting and storing the data classification and grading result.
In another aspect of the present application, a control system is also provided, including:
a processor;
a memory for storing processor-executable instructions;
the processor is configured to implement the above sensitive information discovery and automatic classification and grading method based on multi-modal fusion when executing the executable instructions.
The technical effects of the invention are as follows:
The application accesses unstructured data; judges its data type to obtain single-modal or multi-modal unstructured data; processes and analyzes the single-modal or multi-modal unstructured data with a preset model to obtain a corresponding data classification result; and outputs and stores the data classification result. A shared-parameter BERT structure simultaneously completes the judgment of the relation between each image/video frame and the text and the fusion of visual and text features; the automatic security classification and grading system based on multi-modal fusion can process large-scale unstructured data and realizes fused, linked processing and analysis of data in different modalities such as text, image and video. This solves the technical problems of data security classification products on the market, such as a single supported data modality, insufficient mining of data information, low degrees of automation and customization, a narrow applicable service range, low accuracy, and heavy computing-resource consumption caused by multiple models.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a schematic flow chart of the implementation of the sensitive information discovery and automatic classification method based on multi-modal fusion;
FIG. 2 is a flow chart illustrating the processing and analysis of multi-modal unstructured data according to the present invention.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
The invention designs an automatic security classification and grading system based on multi-modal fusion that can process large-scale unstructured data, realizing fused, linked processing and analysis of data in different modalities such as text, image and video. The invention uses a BERT (Bidirectional Encoder Representations from Transformers) structure with shared parameters to simultaneously complete the judgment of the relation between each image/video frame and the text and the fusion of visual and text features.
The system comprises the following steps:
1) Unstructured data access: a distributed big data framework is used to solve the problems of unstructured data source management and access;
2) Unstructured data preprocessing: the ETL capability of a big data platform is applied to complete the necessary preprocessing of unstructured data;
3) Single-modal unstructured data analysis and processing: AI technology is used to train and correct processing models and to analyze and process single-modal data;
4) Multi-modal unstructured data analysis and processing: for large-scale complex multi-modal unstructured data, corresponding multi-modal fusion modeling and optimization are performed to complete multi-modal data relation judgment and key information extraction. Specifically, in the first stage, a shared-parameter multi-modal BERT structure is used: character features encoded by a bidirectional long short-term memory network and image features encoded by a residual network are connected with the [SEP] symbol and input into the shared BERT model, and the output [CLS] vector is used to judge whether the information expressed by data of different modalities is related. In the second stage, the confidence of the transformed image-text association matrix is multiplied element-wise with the visual feature matrix; the resulting product matrix and the character features are spliced as [CLS] + character features + [SEP] + visual representation, input into the shared BERT model to perform the multi-modal named entity recognition task, and the extracted key information is finally output.
Embodiments of the respective steps will be specifically described below.
Example 1
As shown in FIG. 1, in one aspect, the application provides a method for sensitive information discovery and automatic classification and grading based on multi-modal fusion, which includes the following steps:
s1, accessing unstructured data;
unstructured data, the type of which is not limited, is determined primarily by the data source to which the system is connected. Unstructured data accessed from a distributed file storage system, such as through a data connector;
s2, judging the data type of the unstructured data to obtain single-mode unstructured data or multi-mode unstructured data;
unstructured data, which may be unstructured data of a single-mode or multi-mode data structure, needs to be firstly judged on the data type of the accessed unstructured data, and whether the unstructured data is the single-mode unstructured data or the multi-mode unstructured data is judged.
In this embodiment, the multiple modes are at least data types with two or more modes.
S3, processing and analyzing the single-mode unstructured data or the multi-mode unstructured data by adopting a preset model to obtain a corresponding data classification result;
for single-mode unstructured data or multi-mode unstructured data, different hierarchical classification methods are adopted in the scheme:
training a correction processing model for the single-mode unstructured data by using an AI-based technology, extracting data characteristics from the single-mode data, and analyzing and processing the data characteristics;
and carrying out corresponding multi-modal fusion modeling and optimization on the modal unstructured data aiming at the large-scale complex multi-modal unstructured data to finish the judgment of the relation of the multi-modal data and carry out key information extraction.
And S4, outputting and storing the data classification result.
In this embodiment, a distributed storage manner is adopted to perform distributed storage on unstructured data obtained by acquisition and/or data characteristics/classification results obtained by extraction. The distributed storage technology may be a technology in the prior art, and this embodiment is not described.
The implementation of each step will be described in detail below.
As an optional embodiment of the present application, optionally, accessing unstructured data includes:
configuring a data source connector;
accessing corresponding unstructured data from the selected data source by adopting the configured data source connector;
and reporting the accessed unstructured data in real time according to a preset message queue.
Unstructured data access uses a distributed big data framework to solve the problems of unstructured data source management and access. Specifically:
the connection source type is selected through the data connector, building a data pipeline from the data source to the security classification system; the data source may be connected to a database or a distributed file storage system, or, in scenarios requiring real-time task processing, a corresponding message queue is configured.
The collected data is reported in order through the message queue, which is convenient to manage.
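For illustration only (the patent does not specify an implementation), the real-time reporting step could be realized with a message queue such as Apache Kafka. In the following sketch the broker address, topic name and file-walking logic are assumptions:

    # Hypothetical sketch: report accessed unstructured files to a message queue.
    # Assumes the kafka-python package and a broker at localhost:9092; the topic
    # name "unstructured-data" is an illustrative choice, not from the patent.
    import json
    import pathlib

    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def report_files(source_dir: str, topic: str = "unstructured-data") -> None:
        """Walk a data-source directory and report each file for downstream processing."""
        for path in pathlib.Path(source_dir).rglob("*"):
            if path.is_file():
                producer.send(topic, {
                    "path": str(path),
                    "suffix": path.suffix.lower(),  # used later for modality judgment
                    "size": path.stat().st_size,
                })
        producer.flush()  # block until every queued message is delivered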
As an optional embodiment of the present application, optionally, accessing unstructured data further includes:
presetting distributed storage conditions;
and carrying out distributed storage on the accessed unstructured data according to the distributed storage conditions.
The unstructured data collected in real time is stored in a distributed manner (this step is optional). The distributed storage conditions are not limited, as long as the data can be stored with distributed storage technology.
As an optional embodiment of the present application, optionally, accessing unstructured data further includes:
pre-configuring an imaging management interface;
and adopting the imaging management interface to perform visual management on the accessed unstructured data.
Visualization technology is adopted to realize visual management of the data: a visualization configuration interface is applied to manage large-scale unstructured data.
The visual management interface is selected by the user in the classification and grading system, and the imported data is managed and applied through the visual interface.
As an optional embodiment of the present application, optionally after accessing the unstructured data, further comprising:
presetting pretreatment conditions;
and preprocessing the accessed unstructured data according to the preprocessing conditions.
Here, after unstructured data is accessed, preprocessing such as format conversion or cleaning may be needed, so the application uses ETL (Extract-Transform-Load) capability to perform the necessary preprocessing of unstructured data, including basic processing such as retrieval, conversion, cleaning and mining. Specifically:
data is connected to and extracted from different database systems and file systems;
the extracted data is split, merged, format-converted, checked for value validity, deduplicated, cleared of empty values and purged of erroneous data according to rules;
and the preprocessed data is loaded into the target database for the next step.
As an optional implementation manner of the present application, optionally, processing and analyzing the single-modal unstructured data with a preset model to obtain a corresponding data classification result includes:
training and generating different AI models based on AI technology, according to the different types of unstructured data and different task targets;
inputting the single-modal unstructured data into the different AI models according to its data type, and extracting features;
and classifying the single-modal unstructured data according to the extracted features, and storing the obtained data classification result.
Single-modal unstructured data analysis and processing
A correction processing model is trained with AI (Artificial Intelligence) technology, and analysis and processing are performed on single-modal data that can express complete information. Specifically:
3.1 Model training stage
1) Collect a large amount of data with characteristics similar to the problem to be processed as the training set of the machine learning model;
2) Preprocess the data to be trained and analyze the basic statistical characteristics of the data;
3) Select corresponding AI models according to the different types of unstructured data and different task targets, and use the training set to train and optimize the parameters of the selected models;
3.2 Model testing stage
1) Test the model trained in 3.1 with the corresponding test set, and tune parameters according to the test results;
2) Repeat training step 3) of stage 3.1;
3) When the model outputs a satisfactory test result, store the model parameters corresponding to the optimal result for later inference;
3.3 Model inference stage
1) Input the preprocessed data into the different trained models according to data type, and extract the required data features;
2) Obtain the data classification result.
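The three stages above correspond to a conventional train/test/infer loop. The sketch below is a generic PyTorch skeleton under assumed names (the model, datasets and hyperparameters are illustrative, not the patent's concrete choices):

    # Hypothetical single-modal training/testing skeleton (PyTorch).
    import torch
    from torch.utils.data import DataLoader

    def train_model(model, train_set, test_set, epochs=10, lr=1e-4):
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()
        best_acc, best_state = -1.0, None
        for _ in range(epochs):                       # 3.1: parameter training
            model.train()
            for x, y in DataLoader(train_set, batch_size=32, shuffle=True):
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
            model.eval()                              # 3.2: test and tune
            correct = total = 0
            with torch.no_grad():
                for x, y in DataLoader(test_set, batch_size=32):
                    correct += (model(x).argmax(dim=1) == y).sum().item()
                    total += y.numel()
            acc = correct / total
            if acc > best_acc:                        # keep the optimal parameters
                best_acc = acc
                best_state = {k: v.clone() for k, v in model.state_dict().items()}
        model.load_state_dict(best_state)             # 3.3: restore for inference
        return model, best_acc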
As an optional implementation manner of the present application, optionally, processing and analyzing the multi-modal unstructured data with a preset model to obtain a corresponding data classification result includes:
obtaining a text feature matrix based on the pre-training model;
obtaining a visual feature matrix of the image/video based on the visual preprocessing model;
fusing the text feature matrix and the visual feature matrix to obtain a fused feature vector;
inputting the fused feature vector into the pre-training model, obtaining a text-visual correlation probability after training calculation, and constructing a visual mask matrix according to the text-visual correlation probability;
fusing the visual mask matrix and the visual feature matrix to obtain a fusion feature;
inputting the fusion feature into the pre-training model, and obtaining a text encoding with visual cues after training calculation;
inputting the text encoding with visual cues into a preset named entity recognition model, and extracting key entity information;
and processing and analyzing the multi-modal unstructured data according to the extracted key entity information to obtain and store a corresponding data classification result.
As an optional embodiment of the present application, optionally, the pre-training model is a parameter-sharing BERT model, and the visual preprocessing model is a 152-layer residual network (ResNet-152).
The processing method for multi-modal unstructured data comprises the following steps:
As shown in FIG. 2, for large-scale complex multi-modal unstructured data, corresponding multi-modal fusion modeling and optimization are performed to complete multi-modal data relation judgment and key information extraction. The specific process of multi-modal unstructured data fusion modeling and the handling of the obtained results are described below; the structure of the multi-modal fusion model is shown in FIG. 2.
Specifically, in the first stage, a shared-parameter multi-modal BERT structure is used (a BERT tokenizer processes the text sequence to obtain the text feature matrix, and a 152-layer residual network followed by a fully connected layer processes the picture sequence to obtain the visual feature matrix). Character features encoded by a bidirectional long short-term memory network and picture features encoded by the residual network are connected with the [SEP] symbol and input into the shared BERT model, and the output [CLS] vector is used to judge whether the information expressed by data of different modalities is related. In the second stage, the confidence of the transformed image-text association matrix is multiplied element-wise with the visual feature matrix; the resulting product matrix and the character features are spliced as [CLS] + character features + [SEP] + visual representation, input into the shared BERT model to perform the multi-modal named entity recognition task, and the extracted key information is finally output.
The method specifically comprises the following steps:
4.1 Unstructured data feature extraction
A text feature matrix is obtained based on the BERT pre-training model. The text feature expression matrix is generated by the BERT pre-training model, whose tokenizer can decompose a word at a given position into several subword tokens. The input of the BERT pre-training model is the individual tokens, and the output is the word embedding vector of each token, denoted T.
A visual feature matrix of the image/video is obtained based on the 152-layer residual network. The visual expression matrix is generated by a 152-layer residual network whose input is one picture/video frame. The output divides the picture into 49 block regions, since the output size of the last convolutional layer of the residual network is 7 × 7 × 2048, where 7 × 7 corresponds to the 49 block regions of the image. The features extracted from the 49 block regions ({f_{i,j}}, i, j = 1, ..., 7) are arranged in order into an image block embedding sequence, denoted V, whose dimension is the same as that of the word embedding vector above.
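As a sketch of this feature-extraction step, using Hugging Face transformers and torchvision as assumed tooling (the bert-base-chinese checkpoint and the projection to BERT's hidden size are assumptions):

    # Hypothetical extraction of the word embedding matrix T and visual matrix V.
    import torch
    import torchvision
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
    bert = BertModel.from_pretrained("bert-base-chinese")

    resnet = torchvision.models.resnet152(weights="IMAGENET1K_V1")
    backbone = torch.nn.Sequential(*list(resnet.children())[:-2])   # drop avgpool + fc
    project = torch.nn.Linear(2048, bert.config.hidden_size)        # align with T's dim

    def text_features(sentence: str) -> torch.Tensor:
        ids = tokenizer(sentence, return_tensors="pt")
        return bert(**ids).last_hidden_state.squeeze(0)             # T: (seq_len, 768)

    def visual_features(frame: torch.Tensor) -> torch.Tensor:
        # frame: (1, 3, 224, 224); the last conv layer outputs (1, 2048, 7, 7)
        fmap = backbone(frame)
        blocks = fmap.flatten(2).transpose(1, 2).squeeze(0)         # 49 block regions
        return project(blocks)                                      # V: (49, 768)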
4.2 Multi-modal feature embedding matrix fusion
Following the input method of the BERT model, the model input is the sum of the word embedding (or image block embedding), the segment embedding and the position embedding. Here, the embeddings are the word embedding vectors and the image block embedding sequence extracted in 4.1. The segment embedding is learned for the two modalities, for example, 0 denotes the word embedding vector and 1 denotes the image block embedding sequence. Word position embeddings are learned from the word order in the sentence, whereas the position marks of the image block embedding sequence are all identical, a single fixed value.
The fused model input sequence is: [CLS] + T + text segment embedding + text position embedding + [SEP] + V + image segment embedding + image position embedding, where [CLS] is the start flag of the fused feature and [SEP] is the separator between the text features and the visual features.
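A sketch of assembling that fused input sequence (the hidden size, maximum sequence length and the single fixed image position index are illustrative assumptions):

    # Hypothetical fusion of text and visual embedding sequences for the shared BERT.
    import torch

    HIDDEN = 768
    segment_emb = torch.nn.Embedding(2, HIDDEN)            # 0 = text, 1 = image
    position_emb = torch.nn.Embedding(512, HIDDEN)
    cls_tok = torch.nn.Parameter(torch.zeros(1, HIDDEN))   # [CLS] start flag
    sep_tok = torch.nn.Parameter(torch.zeros(1, HIDDEN))   # [SEP] separator

    def fuse(T: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
        """T: (n_words, HIDDEN); V: (49, HIDDEN) -> fused (n_words + 51, HIDDEN)."""
        n = T.size(0)
        text = T + segment_emb(torch.zeros(n, dtype=torch.long)) \
                 + position_emb(torch.arange(n))           # learned word positions
        img_pos = torch.full((49,), n, dtype=torch.long)   # one identical fixed value
        image = V + segment_emb(torch.ones(49, dtype=torch.long)) \
                  + position_emb(img_pos)
        return torch.cat([cls_tok, text, sep_tok, image], dim=0)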
4.3 Feeding into BERT model training to obtain the text-visual relation
The feature vector fused in 4.2 is fed into the BERT pre-training model, and a fully connected layer is attached to the [CLS] output of the model to obtain the text-visual correlation probability. This probability can be used to judge whether the text and the image are related. The visual mask matrix R is constructed according to the correlation probability.
4.4 Fusing the visual feature matrix with the visual mask matrix R to obtain the fusion feature
In this step, the mask matrix R is used to control the additional visual cues: V is replaced by the element-wise product of V and R. This operation masks the visual feature matrix, with R derived from the text-visual correlation probability of the previous layer.
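Steps 4.3 and 4.4 together might look like the following sketch (assuming the shared encoder is a transformers BertModel, which accepts precomputed embeddings through inputs_embeds; the sigmoid relation head and the broadcast form of R are assumptions):

    # Hypothetical text-visual relation probability and visual masking (4.3-4.4).
    import torch

    HIDDEN = 768
    relation_head = torch.nn.Linear(HIDDEN, 1)  # fully connected layer on [CLS]

    def mask_visual(bert, fused: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
        out = bert(inputs_embeds=fused.unsqueeze(0)).last_hidden_state
        cls = out[0, 0]                               # the output [CLS] vector
        p = torch.sigmoid(relation_head(cls))         # text-visual correlation prob.
        R = p.expand_as(V)                            # visual mask matrix R
        return V * R                                  # element-wise product V ⊙ R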
4.5 Feeding into BERT model training to obtain the text encoding with visual cues
The fusion feature of 4.4 is fed into the BERT model to obtain a text encoding with visual cues.
4.6 Feeding into the named entity recognition model to extract key entity information
For the named entity recognition model, a BiLSTM-CRF (Bidirectional Long Short-Term Memory - Conditional Random Field) model is selected. The BiLSTM-CRF model consists of a bidirectional LSTM and a conditional random field (CRF). The input of the BiLSTM-CRF is the concatenation of the word embeddings and the text encoding with visual cues output in 4.5. The CRF labels the sequence with entity tags, using the BiLSTM hidden vector of each token.
The key entities to be extracted are finally obtained.
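A sketch of such a BiLSTM-CRF tagger, using the third-party pytorch-crf package as an assumed CRF implementation (the patent names no library, and the hidden size is illustrative):

    # Hypothetical BiLSTM-CRF named entity recognizer (pip install pytorch-crf).
    import torch
    from torchcrf import CRF

    class BiLSTMCRF(torch.nn.Module):
        def __init__(self, input_dim: int, num_tags: int, hidden: int = 256):
            super().__init__()
            self.lstm = torch.nn.LSTM(input_dim, hidden // 2, bidirectional=True,
                                      batch_first=True)
            self.emit = torch.nn.Linear(hidden, num_tags)  # per-token emission scores
            self.crf = CRF(num_tags, batch_first=True)

        def loss(self, x: torch.Tensor, tags: torch.Tensor) -> torch.Tensor:
            # x: word embeddings concatenated with the visual-cue text encoding (4.5)
            h, _ = self.lstm(x)
            return -self.crf(self.emit(h), tags)           # negative log-likelihood

        def decode(self, x: torch.Tensor) -> list:
            h, _ = self.lstm(x)
            return self.crf.decode(self.emit(h))           # best tag sequence per item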
Thus, the automatic security classification method based on multi-modal fusion, capable of processing large-scale unstructured data, realizes fused, linked processing and analysis of data in different modalities such as text, image and video, and improves the accuracy of the sensitive information discovery model. The data is mined comprehensively, which solves problems such as the high computing-resource consumption caused by multiple single-modal models. Based on artificial intelligence technology, the degrees of automation and customization of the classification system are improved and its applicable service range is widened.
It should be noted that, although the classification and grading process for single-modal or multi-modal unstructured data is described above by way of example with respective training/recognition models, those skilled in the art will appreciate that the present disclosure is not limited thereto. In fact, the user can flexibly set the application model of each modality according to the actual application scenario, as long as the technical functions of the application can be realized.
It will be appreciated by those skilled in the art that all or part of the above-described method embodiments may be implemented by a computer program instructing relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the flow of each of the control method embodiments described above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the storage medium may also comprise a combination of the above kinds of memories.
Example 2
Based on the implementation principle of Example 1, another aspect of the present application provides an apparatus for implementing the above method for sensitive information discovery and automatic classification and grading based on multi-modal fusion, the apparatus including:
the data access module, used for accessing unstructured data;
the data type judging module, used for judging the data type of the unstructured data to obtain single-modal unstructured data or multi-modal unstructured data;
the classification and grading processing module, used for processing and analyzing the single-modal or multi-modal unstructured data with a preset model to obtain a corresponding data classification and grading result;
and the distributed storage module, used for outputting and storing the data classification and grading result.
For the above modules, refer to the description in Example 1; the details are not repeated in this embodiment.
The modules or steps of the invention described above may be implemented in a general-purpose computing device; they may be centralized in a single computing device or distributed across a network of computing devices. Alternatively, they may be implemented as program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
Example 3
Still further, another aspect of the present application provides a control system, including:
a processor;
a memory for storing processor-executable instructions;
the processor is configured to implement the sensitive information discovery and automatic classification method based on multi-modal fusion when executing the executable instructions.
The control system of the embodiments of the present disclosure includes a processor and a memory for storing processor-executable instructions. The processor is configured to implement any one of the above sensitive information discovery and automatic classification methods based on multi-modal fusion when executing the executable instructions.
Here, it should be noted that there may be one or more processors. Meanwhile, the control system of the embodiments of the present disclosure may further include an input device and an output device. The processor, the memory, the input device and the output device may be connected by a bus or in other ways, which is not specifically limited herein.
The memory, as a computer-readable storage medium, may be used to store software programs, computer-executable programs and various modules, such as the programs or modules corresponding to the sensitive information discovery and automatic classification method based on multi-modal fusion of the embodiments of the present disclosure. The processor executes the various functional applications and data processing of the control system by running the software programs or modules stored in the memory.
The input device may be used to receive input numbers or signals, where a signal may be a key signal generated in connection with the user settings and function control of the device/terminal/server. The output device may include a display device such as a display screen.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvement of the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A sensitive information discovery and automatic classification and grading method based on multi-modal fusion, characterized by comprising the following steps:
accessing unstructured data;
judging the data type of the unstructured data to obtain single-modal unstructured data or multi-modal unstructured data;
processing and analyzing the single-modal or multi-modal unstructured data with a preset model to obtain a corresponding data classification result;
and outputting and storing the data classification result.
2. The sensitive information discovery and automatic classification and grading method based on multi-modal fusion as claimed in claim 1, wherein accessing unstructured data comprises:
configuring a data source connector;
accessing corresponding unstructured data from the selected data source by adopting the configured data source connector;
and reporting the accessed unstructured data in real time according to a preset message queue.
3. The sensitive information discovery and automatic classification and grading method based on multi-modal fusion of claim 2, wherein accessing unstructured data further comprises:
presetting distributed storage conditions;
and carrying out distributed storage on the accessed unstructured data according to the distributed storage conditions.
4. The sensitive information discovery and automatic classification and grading method based on multi-modal fusion of claim 2, wherein accessing unstructured data further comprises:
pre-configuring a visualization management interface;
and adopting the visualization management interface to perform visual management on the accessed unstructured data.
5. The sensitive information discovery and automatic classification and grading method based on multi-modal fusion according to claim 1, further comprising, after accessing unstructured data:
presetting preprocessing conditions;
and preprocessing the accessed unstructured data according to the preprocessing conditions.
6. The sensitive information discovery and automatic classification and grading method based on multi-modal fusion according to claim 1, wherein processing and analyzing the single-modal unstructured data with a preset model to obtain a corresponding data classification result comprises the following steps:
training and generating different AI models based on AI technology, according to the different types of unstructured data and different task targets;
inputting the single-modal unstructured data into the different AI models according to its data type, and extracting features;
and classifying the single-modal unstructured data according to the extracted features, and storing the obtained data classification result.
7. The sensitive information discovery and automatic classification and grading method based on multi-modal fusion according to claim 6, wherein processing and analyzing the multi-modal unstructured data with a preset model to obtain a corresponding data classification result comprises:
obtaining a text feature matrix based on the pre-training model;
obtaining a visual feature matrix of the image/video based on the visual preprocessing model;
fusing the text feature matrix and the visual feature matrix to obtain a fused feature vector;
inputting the fused feature vector into the pre-training model, obtaining a text-visual correlation probability after training calculation, and constructing a visual mask matrix according to the text-visual correlation probability;
fusing the visual mask matrix and the visual feature matrix to obtain a fusion feature;
inputting the fusion feature into the pre-training model, and obtaining a text encoding with visual cues after training calculation;
inputting the text encoding with visual cues into a preset named entity recognition model, and extracting key entity information;
and processing and analyzing the multi-modal unstructured data according to the extracted key entity information to obtain and store a corresponding data classification result.
8. The sensitive information discovery and automatic classification and grading method based on multi-modal fusion according to claim 7, wherein the pre-training model is a parameter-sharing BERT model, and the visual preprocessing model is a 152-layer residual network.
9. An apparatus for implementing the sensitive information discovery and automatic classification and grading method based on multi-modal fusion according to any one of claims 1-8, characterized by comprising:
the data access module, used for accessing unstructured data;
the data type judging module, used for judging the data type of the unstructured data to obtain single-modal unstructured data or multi-modal unstructured data;
the classification and grading processing module, used for processing and analyzing the single-modal or multi-modal unstructured data with a preset model to obtain a corresponding data classification and grading result;
and the distributed storage module, used for outputting and storing the data classification and grading result.
10. A control system, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the sensitive information discovery and automatic classification and grading method based on multi-modal fusion of any one of claims 1-8 when executing the executable instructions.
CN202211705972.1A 2022-12-29 2022-12-29 Sensitive information discovery and automatic classification method based on multi-mode fusion Active CN116049397B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211705972.1A CN116049397B (en) 2022-12-29 2022-12-29 Sensitive information discovery and automatic classification method based on multi-mode fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211705972.1A CN116049397B (en) 2022-12-29 2022-12-29 Sensitive information discovery and automatic classification method based on multi-mode fusion

Publications (2)

Publication Number Publication Date
CN116049397A true CN116049397A (en) 2023-05-02
CN116049397B CN116049397B (en) 2024-01-02

Family

ID=86130697

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211705972.1A Active CN116049397B (en) 2022-12-29 2022-12-29 Sensitive information discovery and automatic classification method based on multi-mode fusion

Country Status (1)

Country Link
CN (1) CN116049397B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116681086A (en) * 2023-07-31 2023-09-01 深圳市傲天科技股份有限公司 Data grading method, system, equipment and storage medium
CN117351257A (en) * 2023-08-24 2024-01-05 长江水上交通监测与应急处置中心 Multi-mode information-based shipping data extraction method and system
CN117787924A (en) * 2024-02-28 2024-03-29 中国航空工业集团公司西安飞机设计研究所 Method and system for issuing data packets for aircraft design data


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2014213560A1 (en) * 2013-08-15 2015-03-05 Po, Wilson MR Communication Platform and Method for Participants of a Text Message Conversation to Convey Real Emotions, and to Experience Shared Content Together at the Same Time
CN108932549A (en) * 2017-05-25 2018-12-04 百度(美国)有限责任公司 It listens attentively to, interact and talks:It is spoken by interactive learning
CN109508375A (en) * 2018-11-19 2019-03-22 重庆邮电大学 A kind of social affective classification method based on multi-modal fusion
WO2022227294A1 (en) * 2021-04-30 2022-11-03 山东大学 Disease risk prediction method and system based on multi-modal fusion
CN113822340A (en) * 2021-08-27 2021-12-21 北京工业大学 Image-text emotion recognition method based on attention mechanism
CN114936623A (en) * 2022-04-20 2022-08-23 西北工业大学 Multi-modal data fused aspect-level emotion analysis method
CN115146057A (en) * 2022-05-27 2022-10-04 电子科技大学 Supply chain ecological region image-text fusion emotion recognition method based on interactive attention
CN115510224A (en) * 2022-07-14 2022-12-23 南京邮电大学 Cross-modal BERT emotion analysis method based on fusion of vision, audio and text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Deng Yijiao; Zhang Fengli; Chen Xueqin; Ai Qing; Yu Su?: "Collaborative Attention Network Model for Cross-Modal Retrieval", Computer Science, no. 04 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116681086A (en) * 2023-07-31 2023-09-01 深圳市傲天科技股份有限公司 Data grading method, system, equipment and storage medium
CN116681086B (en) * 2023-07-31 2024-04-02 深圳市傲天科技股份有限公司 Data grading method, system, equipment and storage medium
CN117351257A (en) * 2023-08-24 2024-01-05 长江水上交通监测与应急处置中心 Multi-mode information-based shipping data extraction method and system
CN117351257B (en) * 2023-08-24 2024-04-02 长江水上交通监测与应急处置中心 Multi-mode information-based shipping data extraction method and system
CN117787924A (en) * 2024-02-28 2024-03-29 中国航空工业集团公司西安飞机设计研究所 Method and system for issuing data packets for aircraft design data

Also Published As

Publication number Publication date
CN116049397B (en) 2024-01-02

Similar Documents

Publication Publication Date Title
US20210303921A1 (en) Cross-modality processing method and apparatus, and computer storage medium
CN112199375B (en) Cross-modal data processing method and device, storage medium and electronic device
CN116049397B (en) Sensitive information discovery and automatic classification method based on multi-mode fusion
CN109271539B (en) Image automatic labeling method and device based on deep learning
WO2021139191A1 (en) Method for data labeling and apparatus for data labeling
CN110968695A (en) Intelligent labeling method, device and platform based on active learning of weak supervision technology
CN111582409A (en) Training method of image label classification network, image label classification method and device
CN110598019B (en) Repeated image identification method and device
CN114596566B (en) Text recognition method and related device
CN111931859B (en) Multi-label image recognition method and device
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN114461890A (en) Hierarchical multi-modal intellectual property search engine method and system
CN113657115A (en) Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion
CN114492601A (en) Resource classification model training method and device, electronic equipment and storage medium
CN114445826A (en) Visual question answering method and device, electronic equipment and storage medium
CN113705293A (en) Image scene recognition method, device, equipment and readable storage medium
CN112966676A (en) Document key information extraction method based on zero sample learning
CN112070093A (en) Method for generating image classification model, image classification method, device and equipment
CN112084788A (en) Automatic marking method and system for implicit emotional tendency of image captions
CN115690816A (en) Text element extraction method, device, equipment and medium
CN112528674B (en) Text processing method, training device, training equipment and training equipment for model and storage medium
CN114708472A (en) AI (Artificial intelligence) training-oriented multi-modal data set labeling method and device and electronic equipment
CN114842301A (en) Semi-supervised training method of image annotation model
CN116030375A (en) Video feature extraction and model training method, device, equipment and storage medium
CN115269781A (en) Modal association degree prediction method, device, equipment, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant