CN112381159A

CN112381159A - Sensitive data identification method, device and equipment

Info

Publication number: CN112381159A
Application number: CN202011296573.5A
Authority: CN
Inventors: 张弥
Original assignee: Beijing Kingsoft Cloud Network Technology Co Ltd
Current assignee: Beijing Kingsoft Cloud Network Technology Co Ltd
Priority date: 2020-11-18
Filing date: 2020-11-18
Publication date: 2021-02-19

Abstract

Embodiments of the invention provide a sensitive data identification method, apparatus, and device, applied to the technical field of data identification. The method comprises: acquiring multimedia data; performing element recognition separately on each of a plurality of designated elements of the multimedia data to obtain a target recognition result for each designated element, wherein the element recognition performed on each designated element is a recognition mode whose result can be used to judge whether the multimedia data is sensitive data; performing fusion analysis in a designated manner on the target recognition results of the plurality of designated elements to obtain a fusion analysis result; and identifying, based on the fusion analysis result, whether the multimedia data is sensitive data. With this scheme, whether multimedia data is sensitive data can be identified effectively.

Description

Sensitive data identification method, device and equipment

Technical Field

The invention relates to the technical field of data identification, in particular to a sensitive data identification method, a sensitive data identification device and sensitive data identification equipment.

Background

Nowadays, multimedia data, as a common carrier for conveying information, is ubiquitous in people's work and life, and its volume keeps growing.

Multimedia data generally has a plurality of elements, and combining those elements can express a meaning that no single element expresses on its own. For example, in a picture with both image content and text content, the image content and the text content are different elements, and combining them can express a certain meaning; or, in a full-body photograph, the face part and the clothing part are different elements, and combining them can express a certain meaning.

Therefore, how to effectively identify whether multimedia data belongs to sensitive data is an urgent problem to be solved.

Disclosure of Invention

The embodiment of the invention aims to provide a sensitive data identification method, a sensitive data identification device and sensitive data identification equipment so as to effectively identify whether multimedia data belong to sensitive data. The specific technical scheme is as follows:

in a first aspect, an embodiment of the present invention provides a sensitive data identification method, where the method includes:

acquiring multimedia data;

performing element recognition separately on each of a plurality of designated elements of the multimedia data to obtain a target recognition result for each designated element; wherein the element recognition performed on each designated element is a recognition mode whose result can be used to judge whether the multimedia data is sensitive data;

performing fusion analysis of a designated mode on the target recognition results of the various designated elements to obtain a fusion analysis result;

and identifying whether the multimedia data is sensitive data or not based on the fusion analysis result.

Optionally, the performing fusion analysis of the target recognition results of the plurality of designated elements in a designated manner to obtain a fusion analysis result includes:

detecting whether the combined content of the target identification results of the various specified elements can represent at least one of preset meaning contents or not to obtain a detection result as a fusion analysis result; wherein the meaning represented by each meaning content belongs to sensitive content;

the identifying whether the multimedia data is sensitive data based on the fusion analysis result comprises:

if the fusion analysis result shows that the combined content of the target identification results of the various designated elements can represent at least one of preset meaning contents, determining that the multimedia data is sensitive data; otherwise, determining that the multimedia data is not sensitive data.

Optionally, the detecting whether the combined content of the target recognition results of the plurality of designated elements can represent at least one of a plurality of preset meaning contents, to obtain a detection result, includes:

detecting, through a graph database storing a preset knowledge graph, whether the combined content of the target recognition results of the plurality of designated elements can represent at least one of the preset meaning contents, to obtain a detection result;

the preset knowledge graph at least records a plurality of first-class nodes, a plurality of second-class nodes and an incidence relation between each first-class node and each second-class node; the plurality of first-class nodes at least comprise a plurality of nodes representing potential recognition results of the plurality of specified elements, each node in the plurality of nodes represents one potential recognition result, each second-class node represents one meaning content, and the association relationship is used for representing the relevance of the content represented by each first-class node and the content represented by each second-class node.

Optionally, each potential recognition result belongs to entity content and corresponds to ontology content;

the plurality of first-class nodes further comprises: nodes representing the ontology content corresponding to the potential recognition results.

Optionally, the representation form of the association relationship between each first class node and each second class node includes:

first-class nodes and second-class nodes that are related are connected, and the attribute value of a designated attribute of each second-class node is the node content of a plurality of target first-class nodes connected to that second-class node, the target first-class nodes being nodes whose represented contents, when combined, can represent the content represented by the second-class node;

the detecting, by using a graph database in which a preset knowledge graph is stored, whether combined content of target recognition results of the plurality of specified elements can represent at least one of preset meaning contents or not to obtain a detection result includes:

inputting the target recognition results of the various specified elements into the graph database, so that the graph database detects whether a second type node meeting a first preset condition exists or not based on the target recognition results of the various specified elements to obtain a detection result;

the first preset condition is that the second-class node is connected to a plurality of designated nodes and the attribute value of its designated attribute is the content of those designated nodes; the plurality of designated nodes are the first-class nodes corresponding to the target recognition results of the plurality of designated elements, and the first-class node corresponding to each target recognition result is either the first-class node representing that target recognition result or the first-class node representing the ontology content corresponding to it.
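The first preset condition can be illustrated with a minimal in-memory sketch. It is not the graph database query of the embodiment: the node names `M1`, `M2`, `a`, `c`, `d` and the attribute name `combo` are hypothetical, and ontology nodes are left out for brevity.

```python
# Each second-class node records the first-class nodes it is connected to,
# plus a designated attribute ("combo", a hypothetical name) holding the node
# contents that jointly express its meaning. All values are made up.
SECOND_CLASS = {
    "M1": {"connected": {"a", "c", "d"}, "combo": {"a", "c"}},
    "M2": {"connected": {"a", "d"}, "combo": {"a", "d"}},
}


def meets_first_condition(node, designated):
    """Connected to every designated node, and the attribute matches them."""
    designated = set(designated)
    return designated <= node["connected"] and node["combo"] == designated


def detect(designated):
    """Return the second-class nodes that meet the first preset condition."""
    return sorted(
        name for name, node in SECOND_CLASS.items()
        if meets_first_condition(node, designated)
    )


print(detect(["a", "c"]))  # ['M1']
print(detect(["c", "d"]))  # []
```

In the second call, `c` and `d` are both connected to `M1`, but the attribute value does not list them as a combination, so connection alone is not enough under this condition.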

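Optionally, the representation form of the association relationship between each first-class node and each second-class node includes:

first-class nodes and second-class nodes that are related are connected, and a weight of the relevance for sensitive data identification is set for each connection;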

inputting the target recognition results of the various specified elements into the graph database, so that the graph database detects whether a second type of node meeting a second preset condition exists or not based on the target recognition results of the various specified elements to obtain a detection result;

the second preset condition is that the second-class node is connected to a plurality of designated nodes and the comprehensive weight is greater than a preset weight threshold; the plurality of designated nodes are the first-class nodes corresponding to the target recognition results of the plurality of designated elements, and the first-class node corresponding to each target recognition result is either the first-class node representing that target recognition result or the first-class node representing the ontology content corresponding to it; the comprehensive weight is the sum, over the designated nodes, of the weight of each designated node's relevance for sensitive data identification.
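The weighted variant can be sketched in the same spirit. The edge weights, node names, and threshold below are invented for illustration; only the rule itself (connected to every designated node, and summed weight above a preset threshold) comes from the text.

```python
# (first-class node, second-class node) -> weight of the relevance for
# sensitive data identification. All names and values are hypothetical.
WEIGHTS = {
    ("a", "M1"): 0.6,
    ("c", "M1"): 0.5,
    ("a", "M2"): 0.2,
    ("d", "M2"): 0.3,
}
WEIGHT_THRESHOLD = 1.0  # preset weight threshold (made-up value)


def meets_second_condition(meaning, designated):
    # The second-class node must be connected to every designated node.
    if any((node, meaning) not in WEIGHTS for node in designated):
        return False
    # Comprehensive weight: sum of the weights of the designated nodes.
    total = sum(WEIGHTS[(node, meaning)] for node in designated)
    return total > WEIGHT_THRESHOLD


print(meets_second_condition("M1", ["a", "c"]))  # True
print(meets_second_condition("M2", ["a", "d"]))  # False
```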

Optionally, the number of the plurality of specified elements is two;

the representation form of the incidence relation between each first class node and each second class node comprises the following steps:

the first class nodes and the second class nodes with relevance are connected;

inputting the target recognition results of the two designated elements into the graph database, so that the graph database detects whether there is a link that involves two designated nodes and a second-class node and whose node-loop path length is smaller than a preset threshold, to obtain a detection result;

the two designated nodes are two first-class nodes representing target recognition results of the two designated elements.

Optionally, the plurality of specified elements includes at least two of:

data content of the multimedia data under a plurality of data types;

and part of the data content of the multimedia data under the specified data type.

In a second aspect, an embodiment of the present invention provides a sensitive data identification apparatus, where the apparatus includes:

the acquisition module is used for acquiring multimedia data;

the identification module is used for performing element recognition separately on each of a plurality of designated elements of the multimedia data to obtain a target recognition result for each designated element; wherein the element recognition performed on each designated element is a recognition mode whose result can be used to judge whether the multimedia data is sensitive data;

the analysis module is used for performing fusion analysis of a designated mode on the target recognition results of the various designated elements to obtain a fusion analysis result;

and the determining module is used for identifying whether the multimedia data is sensitive data or not based on the fusion analysis result.

In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus;

a memory for storing a computer program;

a processor, configured to implement the steps of the method provided by the first aspect when executing the program stored in the memory.

In a fourth aspect, the present invention provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the method provided in the first aspect.

The embodiment of the invention has the following beneficial effects:

In the solution provided by the embodiment of the present invention, after the multimedia data is acquired, element recognition is performed separately on a plurality of designated elements of the multimedia data to obtain a target recognition result for each designated element, where the element recognition performed on each designated element is a recognition mode whose result can be used to judge whether the multimedia data is sensitive data; fusion analysis in a designated manner is performed on the target recognition results of the plurality of designated elements to obtain a fusion analysis result; and whether the multimedia data is sensitive data is identified based on the fusion analysis result. Because multiple elements in multimedia data can express a certain meaning when combined, the scheme first recognizes each designated element individually to obtain the meaning that element expresses on its own, and then performs fusion analysis on these individual recognition results to identify whether the multimedia data is sensitive data. The scheme can therefore effectively identify whether multimedia data is sensitive data.

Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other embodiments can be obtained by using the drawings without creative efforts.

FIG. 1 is a flow chart of a sensitive data identification method according to an embodiment of the present invention;

FIG. 2 is another flow chart of a sensitive data identification method according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating the relationship between node contents of a knowledge-graph according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a sensitive data identification device according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to achieve the purpose of effectively identifying whether multimedia data belongs to sensitive data, the embodiment of the invention provides a sensitive data identification method, a system, a device, electronic equipment and a storage medium.

The sensitive data identification method provided by the embodiment of the invention is described below firstly.

The sensitive data identification method provided by the embodiment of the invention can be applied to electronic equipment. In a specific application, the electronic device may be a terminal device, for example: notebook computers, desktop computers, tablet computers, smart phones, and the like; of course, the electronic device may also be a server.

In addition, an execution subject of the sensitive data identification method provided by the embodiment of the invention can be a sensitive data identification device running in the electronic equipment. The sensitive data identification device can be special client software for identifying sensitive data, and can also be a plug-in program in the existing client software with sensitive data identification requirements.

In addition, the multimedia data mentioned in the embodiment of the present invention may include pictures or videos, and the like, that is, one picture may be used as the multimedia data to be subjected to the sensitive data identification, and one video may also be used as the multimedia data to be subjected to the sensitive data identification.

It will be appreciated that sensitive data, which may also be referred to as violation data, is data that has an adverse effect or is related to a criminal offence, for example, terrorism-related data or data that damages the image of a country or an ethnic group. The definition of sensitive data may differ between scenarios. In an administrative scenario, multimedia data that damages state welfare, national unity, the image of a leader, and the like can be regarded as sensitive data. For example, in a picture whose image content shows a rural child struggling to walk through mud and whose text content praises the happiness of the people, the image content and the text content combined covertly satirize the country's politics or situation, so the picture is sensitive data.

As shown in fig. 1, the sensitive data identification method provided in the embodiment of the present invention may include the following steps:

S101, acquiring multimedia data;

The multimedia data acquired by the sensitive data identification device is the data to be checked for sensitivity. It may be a picture or a video, there may be one or more items, and the identification process is the same for each item.

There may also be a variety of ways to obtain multimedia data. Illustratively, multimedia data can be crawled from the web using a web crawler, acquired from a specified file path, or obtained from a manually supplied file package containing the multimedia data.
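As an illustration of the second acquisition route (a specified file path), a minimal sketch might look as follows. The function name `collect_multimedia` and the list of file extensions are assumptions made for the example and are not part of the embodiment.

```python
from pathlib import Path

# Hypothetical set of extensions treated as pictures or videos.
MEDIA_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".mp4", ".avi"}


def collect_multimedia(root):
    """Return the picture and video files found under a specified file path."""
    base = Path(root)
    return sorted(
        p for p in base.rglob("*")
        if p.is_file() and p.suffix.lower() in MEDIA_SUFFIXES
    )
```

Each returned path would then be processed by the same identification steps, since the process is identical for every piece of multimedia data.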

S102, performing element recognition separately on each of a plurality of designated elements of the multimedia data to obtain a target recognition result for each designated element; wherein the element recognition performed on each designated element is a recognition mode whose result can be used to judge whether the multimedia data is sensitive data;

wherein each designated element is data that can be individually identified and that can characterize a specific meaning. Wherein the plurality of specified elements includes at least two of: data content of the multimedia data under a plurality of data types; a portion of the data content of the multimedia data under the specified data type.

For example, if the multimedia data is a picture, the plurality of designated elements may include: two elements, namely image content and text content; or two elements, namely the face part and the clothing part in the image content; or three elements, namely the text content together with the face part and the clothing part in the image content; or four elements, namely the text content together with the face part, the clothing part, and the object part in the image content; and so on.

For example, if the multimedia data is a video, the plurality of designated elements may include: two elements, namely a video frame sequence and an audio frame sequence; or the face data in the video frame sequence together with the audio frame sequence.

It is to be understood that, to accommodate multimedia data from a plurality of application scenarios, the plurality of designated elements may cover the elements of multimedia data across those scenarios; in this case, the number of designated elements actually contained in a given piece of acquired multimedia data may differ from the number of designated elements configured. Of course, the number and specific types of the designated elements may also differ between application scenarios. For example, in application scenario 1, the plurality of designated elements include type-A elements and type-B elements; in application scenario 2, the plurality of designated elements may include type-A elements, type-B elements, and type-C elements.

In addition, since the target recognition result of each designated element is needed to analyze whether the multimedia data is sensitive data, the element recognition performed on each designated element is a recognition mode whose result can be used to judge whether the multimedia data is sensitive data. For example, the plurality of designated elements may include image content and text content. Considering that a combination of image content and text content with different emotional colors may express sensitive content, the element recognition performed on the image content may be emotional color recognition, and the element recognition performed on the text content may be emotional color recognition, or recognition of both emotional color and keywords; the results of emotional color recognition and keyword recognition can then be used for sensitive data identification. Emotional color recognition may distinguish at least two emotional colors, for example: positive and negative; or sad and cheerful; or lean and non-lean; etc.

For another example, the plurality of designated elements may include the face part and the clothing part in the image content. Considering that multimedia data may be sensitive data when the face part and the clothing part do not match, the element recognition performed on the face part may be identity recognition, and the element recognition performed on the clothing part may be clothing type recognition; the results of identity recognition and clothing type recognition can then be used for sensitive data identification.

For another example: the plurality of specified elements may include: video frame sequence and audio frame sequence, then, considering that the combination of video frame sequence and audio frame sequence with different emotional colors may characterize sensitive content, the element identification performed for the video frame sequence and audio frame sequence may be: the emotion color recognition, and further, the recognition result obtained by the emotion color recognition can be used for the recognition of the sensitive data.

It should be emphasized that the specific element types and recognition schemes given above are merely examples and should not be construed as limiting the embodiments of the present invention. For the recognition of any one element, there may be a plurality of potential recognition results, and the target recognition result is one of them. Any method capable of achieving the recognition purpose may be used to recognize an element, which is not limited in the embodiments of the present invention; for example, the emotional color of an element may be recognized using a pre-trained emotion analysis model.

S103, performing fusion analysis in a specified mode on the target recognition results of the various specified elements to obtain fusion analysis results;

and S104, identifying whether the multimedia data is sensitive data or not based on the fusion analysis result.

After the target recognition result of each kind of designated element is obtained, considering that a plurality of kinds of designated elements can represent a certain meaning together, the target recognition results of the plurality of kinds of designated elements can be subjected to fusion analysis in a designated manner, so that whether the multimedia data is sensitive data or not is recognized based on the fusion analysis result.
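The overall flow of S101 to S104 can be summarized in a minimal sketch. The recognizers and the fusion rule below are toy stand-ins with hypothetical names; a real system would plug in trained recognition models and one of the fusion analysis manners described in this document.

```python
from typing import Callable, Dict

# A recognizer maps raw multimedia data to the target recognition result of
# one designated element. All names here are hypothetical.
Recognizer = Callable[[dict], str]


def identify_sensitive(
    media: dict,
    recognizers: Dict[str, Recognizer],
    fuse: Callable[[Dict[str, str]], bool],
) -> bool:
    # S102: recognize every designated element separately.
    results = {name: rec(media) for name, rec in recognizers.items()}
    # S103 and S104: fuse the per-element results and decide.
    return fuse(results)


# Toy stand-ins for real recognition models.
recognizers = {
    "image": lambda m: m["image_emotion"],
    "text": lambda m: m["text_emotion"],
}


def fuse(results):
    # Toy fusion rule: opposite emotional colors are flagged.
    return {results["image"], results["text"]} == {"negative", "positive"}


# S101: the acquired multimedia data, reduced here to pre-labelled fields.
picture = {"image_emotion": "negative", "text_emotion": "positive"}
print(identify_sensitive(picture, recognizers, fuse))  # True
```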

There may be various specific implementations of performing fusion analysis in a specified manner on the target recognition results of the plurality of designated elements and then identifying, based on the fusion analysis result, whether the multimedia data is sensitive data. For example, in one implementation, performing fusion analysis in a specified manner on the target recognition results of the plurality of designated elements to obtain a fusion analysis result may include:

vectorizing the target recognition result of each designated element to obtain a vector for each target recognition result; splicing the vectors of the target recognition results according to a preset splicing mode to obtain a spliced vector; and inputting the spliced vector into a pre-trained neural network model for identifying sensitive data, and taking the output result as the fusion analysis result, wherein the output result is the confidence that the multimedia data is sensitive data;

correspondingly, identifying whether the multimedia data is sensitive data or not based on the fusion analysis result comprises the following steps:

when the fusion analysis result is greater than a preset confidence threshold, determining that the multimedia data is sensitive data; otherwise, determining that the multimedia data is not sensitive data.
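The vectorize, splice, and threshold steps might be sketched as follows. This is only an illustration: a single sigmoid unit with made-up parameters stands in for the pre-trained neural network model, one-hot encoding is an assumed vectorization, and "image vector first, then text vector" is an assumed splicing mode.

```python
import math

# Hypothetical vocabularies for two designated elements.
IMAGE_LABELS = ["positive", "negative"]
TEXT_LABELS = ["positive", "negative"]


def one_hot(label, vocabulary):
    """Vectorize one target recognition result."""
    return [1.0 if label == v else 0.0 for v in vocabulary]


def splice(image_label, text_label):
    # Assumed splicing mode: image vector first, then text vector.
    return one_hot(image_label, IMAGE_LABELS) + one_hot(text_label, TEXT_LABELS)


def confidence(vector, weights, bias):
    # Stand-in for the pre-trained model: a single sigmoid unit.
    z = sum(w * x for w, x in zip(weights, vector)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Made-up parameters, not trained values.
WEIGHTS = [-2.0, 2.0, 2.0, -2.0]
BIAS = -3.0
CONFIDENCE_THRESHOLD = 0.5


def is_sensitive(image_label, text_label):
    vector = splice(image_label, text_label)
    return confidence(vector, WEIGHTS, BIAS) > CONFIDENCE_THRESHOLD


print(is_sensitive("negative", "positive"))  # True
print(is_sensitive("positive", "positive"))  # False
```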

The neural network model may adopt any model structure. Its training process may include: determining sample multimedia data, and performing element recognition on the plurality of designated elements in the sample multimedia data to obtain a plurality of sample recognition results; vectorizing the sample recognition results and splicing the resulting vectors according to the preset splicing mode to obtain a sample spliced vector; inputting the sample spliced vector into the neural network model being trained to obtain a prediction result; and determining, based on the difference between the prediction result and the label information of the sample multimedia data, whether the neural network model has converged: if so, training ends; if not, the parameters of the neural network model are adjusted and training continues. The label information of the sample multimedia data indicates whether the sample multimedia data is sensitive data.
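The training loop described above can be illustrated with a deliberately small sketch. It assumes a single sigmoid unit trained by gradient descent on cross-entropy loss, with "loss improvement below a tolerance" as the convergence test; the embodiment allows any model structure, so these choices, the sample vectors, and the labels are assumptions made for the example.

```python
import math


def predict(weights, bias, x):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, lr=0.5, tol=1e-5, max_epochs=20000):
    """Fit a single sigmoid unit; stop once the loss stops improving."""
    n = len(samples)
    weights, bias = [0.0] * len(samples[0]), 0.0
    previous_loss = float("inf")
    for _ in range(max_epochs):
        loss, grad_w, grad_b = 0.0, [0.0] * len(weights), 0.0
        for x, y in zip(samples, labels):
            p = predict(weights, bias, x)
            # Difference between the prediction and the label information.
            loss -= y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12)
            for i, xi in enumerate(x):
                grad_w[i] += (p - y) * xi
            grad_b += p - y
        loss /= n
        if previous_loss - loss < tol:
            break  # converged: training ends
        # Not converged: adjust the parameters and continue training.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
        previous_loss = loss
    return weights, bias


# Sample spliced vectors (one-hot image emotion + one-hot text emotion) and
# labels (1 = sensitive). Values are made up for illustration.
SAMPLES = [[0, 1, 1, 0], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 0, 1]]
LABELS = [1, 0, 0, 0]

weights, bias = train(SAMPLES, LABELS)
print([predict(weights, bias, x) > 0.5 for x in SAMPLES])
```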

It can be understood that, besides using a neural network model, the fusion analysis in a specified manner and the identification of whether the multimedia data is sensitive data based on the fusion analysis result may be implemented in other ways. Other specific implementations are described in detail below with reference to a specific embodiment.

The sensitive data identification method provided by the embodiment of the invention is described below with reference to another embodiment.

As shown in fig. 2, the sensitive data identification method provided in the embodiment of the present invention may include the following steps:

S201, acquiring multimedia data;

S202, performing element recognition separately on each of a plurality of designated elements of the multimedia data to obtain a target recognition result for each designated element;

wherein the element recognition performed on each designated element is a recognition mode whose result can be used to judge whether the multimedia data is sensitive data.

In this embodiment, steps S201 to S202 are the same as steps S101 to S102 in the above embodiment, and are not described herein again.

S203, detecting whether the combined content of the target identification results of the various designated elements can represent at least one of preset meaning contents or not to obtain a detection result as a fusion analysis result; wherein the meaning represented by each meaning content belongs to sensitive content;

S204, if the fusion analysis result shows that the combined content of the target identification results of the various designated elements can represent at least one of preset meaning contents, determining that the multimedia data is sensitive data; otherwise, the multimedia data is determined not to be sensitive data.

S203-S204 are a specific implementation manner of S103-S104 in the above embodiment.

In this embodiment, a plurality of meaning contents may be preset based on the specific application scenario, and the meaning represented by each meaning content belongs to sensitive content. For example, in an administrative scenario, the plurality of meaning contents may include: satirizing national politics, attacking leaders, gang alone, etc.

Furthermore, in the sensitive data identification process, after the target recognition results are obtained, whether the combined content of the target recognition results of the plurality of designated elements can represent at least one of the preset meaning contents is detected, and the detection result is taken as the fusion analysis result; whether the multimedia data is sensitive data can then be determined according to whether the fusion analysis result indicates that the combined content can represent at least one of the preset meaning contents.

Optionally, in one implementation, the meaning contents represented by the various combinations of potential recognition results of the plurality of designated elements are analyzed in advance, so that a mapping relation between combined contents of potential recognition results and meaning contents can be established, where each combined content includes one potential recognition result of each designated element. For example, suppose designated element 1 has potential recognition results a and b, designated element 2 has potential recognition results c and d, and designated element 3 has potential recognition results e and f. If manual analysis shows that a, c, and e combined represent meaning content L1, that a, d, and f combined represent meaning content L2, and that b, c, and f combined represent meaning content L3, the following mapping relation can be established: (a, c, e) corresponds to L1; (a, d, f) corresponds to L2; (b, c, f) corresponds to L3.

Accordingly, detecting whether the combined content of the target recognition results of the plurality of designated elements can represent at least one of the preset meaning contents, and obtaining the detection result may include:

and detecting whether meaning content corresponding to the combined content of the target identification results of the various specified elements exists or not from a preset mapping relation to obtain a detection result.
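Using the a to f and L1 to L3 example given earlier, the preset mapping relation and the lookup can be sketched as a plain dictionary. The function names are hypothetical, and the tuple order (designated element 1, 2, 3) is an assumption of this sketch.

```python
# Preset mapping relation: each key is one combination of potential
# recognition results, ordered as (element 1, element 2, element 3).
MAPPING = {
    ("a", "c", "e"): "L1",
    ("a", "d", "f"): "L2",
    ("b", "c", "f"): "L3",
}


def detect_meaning(target_results):
    """Return the meaning content for the combination, or None if absent."""
    return MAPPING.get(tuple(target_results))


def is_sensitive(target_results):
    # Sensitive exactly when some preset meaning content is represented.
    return detect_meaning(target_results) is not None


print(detect_meaning(["a", "d", "f"]))  # L2
print(is_sensitive(["b", "d", "e"]))    # False
```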

Optionally, in one implementation, a knowledge graph may be constructed based on the individual meaning content and the potential recognition results, and detection may be implemented based on the knowledge graph; accordingly, detecting whether the combined content of the target recognition results of the plurality of designated elements can represent at least one of the preset meaning contents, and obtaining the detection result may include:

detecting, through a graph database storing a preset knowledge graph, whether the combined content of the target recognition results of the plurality of designated elements can represent at least one of the preset meaning contents, to obtain a detection result;

the preset knowledge graph records at least a plurality of first-class nodes, a plurality of second-class nodes, and the association relationship between each first-class node and each second-class node; the plurality of first-class nodes include at least a plurality of nodes representing the potential recognition results of the multiple designated elements, each of which represents one potential recognition result; each second-class node represents one meaning content; and the association relationship is used for representing the relevance between the content represented by each first-class node and the content represented by each second-class node.

In addition, each potential recognition result belongs to entity content and corresponds to ontology content; correspondingly, the plurality of first-class nodes further include nodes representing the ontology content corresponding to each potential recognition result. The ontology content is a generalized content. For example: leaders 1, 2 and 3 all belong to entity content; if leaders 1 and 2 are domestic leaders and leader 3 is a foreign leader, then "domestic leader" is ontology content corresponding to leaders 1 and 2, and "foreign leader" is ontology content corresponding to leader 3. As another example: poverty and sadness both belong to entity content, and the semantic "negative" can be ontology content corresponding to poverty and sadness. For a designated element with many potential recognition results, a first-class node may be set in the knowledge graph to represent the ontology content corresponding to those potential recognition results, so that when the first-class node representing the ontology content is connected with a second-class node, each entity content under the ontology content has relevance to the content of the connected second-class node. In addition, the ontology content may be organized in one or more levels, for example: the first level is ontology content-entity content, and the second level is ontology content-entity content.
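The relation between entity content and ontology content can be illustrated with a short sketch. It assumes a simple child-to-parent table with no cycles; the table `PARENT` and the function `candidates` are hypothetical names introduced only for illustration.

```python
# Hypothetical entity -> ontology links (child -> parent). A parent may itself
# have a parent, which would give an ontology with more than one level.
PARENT = {
    "leader 1": "domestic leader",
    "leader 2": "domestic leader",
    "leader 3": "foreign leader",
    "poverty": "negative",
    "sadness": "negative",
}


def candidates(result: str) -> list:
    """Return the result itself followed by its ontology contents, nearest first."""
    chain = [result]
    while chain[-1] in PARENT:  # assumes the table contains no cycles
        chain.append(PARENT[chain[-1]])
    return chain


print(candidates("leader 1"))  # ['leader 1', 'domestic leader']
print(candidates("poverty"))   # ['poverty', 'negative']
```

Any entry of the returned list can stand in for the recognition result when looking for connected second-class nodes.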

For clarity of the scheme and of the layout, a specific implementation of detecting, through the graph database storing the preset knowledge graph, whether the combined content of the target recognition results of the multiple designated elements can represent at least one of the preset meaning contents, to obtain the detection result, is described later by way of example.

According to this scheme, since multimedia data represents a certain meaning through multiple elements, when sensitive data is identified, each of the multiple designated elements is first recognized individually to obtain the individual meaning represented by each element, and fusion analysis is then performed on the individual recognition results through the preset knowledge graph to identify whether the multimedia data is sensitive data. The scheme can therefore effectively identify whether multimedia data belongs to sensitive data.

For clarity of the scheme and of the layout, a specific implementation of detecting, through the graph database storing the preset knowledge graph, whether the combined content of the target recognition results of the multiple designated elements can represent at least one of the preset meaning contents, to obtain the detection result, is described below by way of example.

For example, in an implementation manner, the characterizing form of the association relationship between each first class node and each second class node may include:

correspondingly, detecting, through the graph database storing the preset knowledge graph, whether the combined content of the target recognition results of the multiple designated elements can represent at least one of the preset meaning contents, to obtain the detection result, may include:

inputting the target recognition results of the multiple designated elements into the graph database, so that the graph database detects, based on these target recognition results, whether a second-class node meeting a first preset condition exists, to obtain a detection result;

the first preset condition is: being connected with a plurality of designated nodes, with an attribute value of the designated attribute being the contents of the plurality of designated nodes; the plurality of designated nodes are the first-class nodes corresponding to the target recognition results of the multiple designated elements, and the first-class node corresponding to each target recognition result is either the first-class node representing that target recognition result or the first-class node representing the ontology content corresponding to that target recognition result.

It is understood that any second-class node may have one or more attribute values of the designated attribute, and the designated attribute may be an attribute for characterizing sensitivity. Moreover, a certain first-class node having relevance to a certain second-class node specifically means that the content represented by that first-class node, when combined with the contents represented by other first-class nodes, can characterize the content represented by that second-class node.

When the plurality of first-class nodes only include nodes representing the potential recognition results, the plurality of target first-class nodes connected with each second-class node are first-class nodes representing potential recognition results, and the contents represented by these first-class nodes, when combined, can represent the content represented by the second-class node. For example, one potential recognition result 1 of designated element A is negative emotion, and one potential recognition result 2 of designated element B is positive emotion. If potential recognition result 1 and potential recognition result 2, when combined, can represent meaning content 1, then a first-class node a representing potential recognition result 1, a first-class node b representing potential recognition result 2, and a second-class node c representing meaning content 1 may be set in the knowledge graph; first-class node a and first-class node b are each connected with second-class node c, and the combined content of potential recognition result 1 represented by first-class node a and potential recognition result 2 represented by first-class node b is used as an attribute value of the designated attribute of second-class node c.

When the plurality of first-class nodes include both nodes representing the potential recognition results and nodes representing the ontology content corresponding to the potential recognition results, the plurality of target first-class nodes connected with each second-class node may include first-class nodes representing potential recognition results and/or first-class nodes representing ontology content. For example, one potential recognition result 1 of designated element A is person 1, and one potential recognition result 2 of designated element B is place 1. If potential recognition result 1 and potential recognition result 2, when combined, can represent meaning content 2, then a first-class node a representing potential recognition result 1, a first-class node b representing potential recognition result 2, a first-class node c representing ontology content 1 corresponding to potential recognition result 1, and a second-class node d representing meaning content 2 may be set in the knowledge graph; first-class node a is connected with first-class node c, first-class node c and first-class node b are each connected with second-class node d, and the combined content of ontology content 1 represented by first-class node c and potential recognition result 2 represented by first-class node b is used as an attribute value of the designated attribute of second-class node d.

Based on this form of representing the association relationship between each first-class node and each second-class node, after the plurality of target recognition results are obtained, they can be input into the graph database as the detection basis; the graph database can then detect, based on the detection basis, whether a second-class node meeting the first preset condition exists, to obtain a detection result.
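The check against the first preset condition can be sketched as below. This is an illustrative in-memory stand-in rather than a real graph database: the dictionaries `ONTOLOGY` and `SECOND_CLASS` and the function names are assumptions, populated with the person 1 / place 1 example above.

```python
from itertools import product

# Hypothetical in-memory stand-in for the graph database.
ONTOLOGY = {"person 1": "ontology content 1"}  # entity -> ontology content
SECOND_CLASS = {
    "meaning content 2": {
        # first-class nodes connected to this second-class node
        "connected": {"ontology content 1", "place 1"},
        # attribute values of the designated attribute
        "attribute_values": [frozenset({"ontology content 1", "place 1"})],
    },
}


def node_options(result):
    """A target result may be matched by its own node or by its ontology node."""
    options = [result]
    if result in ONTOLOGY:
        options.append(ONTOLOGY[result])
    return options


def find_first_condition(target_results):
    """Return second-class nodes that satisfy the first preset condition."""
    matches = []
    for name, node in SECOND_CLASS.items():
        # Try every choice of (entity node or ontology node) per target result.
        for combo in product(*(node_options(r) for r in target_results)):
            chosen = frozenset(combo)
            if chosen <= node["connected"] and chosen in node["attribute_values"]:
                matches.append(name)
                break
    return matches


print(find_first_condition(["person 1", "place 1"]))  # ['meaning content 2']
print(find_first_condition(["person 1", "place 2"]))  # []
```

A non-empty result plays the role of a detection result indicating that the combined content can represent at least one preset meaning content.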

Optionally, in another implementation manner, the characterizing form of the association relationship between each first class node and each second class node may include:

the first-class nodes and second-class nodes that have relevance are connected, and a weight of the relevance for data sensitivity identification is set, that is, each connecting edge has a corresponding weight;

correspondingly, detecting, through the graph database storing the preset knowledge graph, whether the combined content of the target recognition results of the multiple designated elements can represent at least one of the preset meaning contents, to obtain the detection result, may include:

inputting the target recognition results of the multiple designated elements into the graph database, so that the graph database detects, based on these target recognition results, whether a second-class node meeting a second preset condition exists, to obtain a detection result;

the second preset condition is: being connected with a plurality of designated nodes, with a composite weight greater than a preset weight threshold; the plurality of designated nodes are the first-class nodes corresponding to the target recognition results of the multiple designated elements, and the first-class node corresponding to each target recognition result is either the first-class node representing that target recognition result or the first-class node representing the ontology content corresponding to that target recognition result; the composite weight is the sum of the weights, for data sensitivity identification, of the relevance between the second-class node and each designated node.

When the plurality of first-class nodes only include nodes representing the potential recognition results, the first-class nodes having relevance to the second-class nodes are the nodes representing the potential recognition results; when the plurality of first-class nodes include both nodes representing the potential recognition results and nodes representing the ontology content corresponding to the potential recognition results, the first-class nodes having relevance to the second-class nodes may be either kind of node. The closeness of the relevance between a connected second-class node and first-class node can be determined by manual analysis, and the weight of that relevance for data sensitivity identification is then determined based on the closeness. For example: one potential recognition result 1 of designated element A is negative emotion, and one potential recognition result 2 of designated element B is positive emotion. If potential recognition result 1 and potential recognition result 2, when combined, can represent meaning content 1, then a first-class node a representing potential recognition result 1, a first-class node b representing potential recognition result 2, and a second-class node c representing meaning content 1 may be set in the knowledge graph; first-class node a and first-class node b are each connected with second-class node c, the weight corresponding to the relevance between first-class node a and second-class node c is set to 0.7, and the weight corresponding to the relevance between first-class node b and second-class node c is set to 0.2.

When setting the weights, for a first-class node a and a first-class node b that both have relevance to a second-class node c, if first-class node a contributes more than first-class node b to representing the second-class node, that is, contributes more to sensitivity, the weight corresponding to the relevance between first-class node a and second-class node c is greater than the weight corresponding to the relevance between first-class node b and second-class node c, and the weight difference may be greater than a predetermined threshold. In addition, the weight corresponding to the relevance between each first-class node and each second-class node has an initial value; when designated content exists among the input contents fed into the graph database, the weights corresponding to the relevance between the first-class nodes to which the input contents belong and each second-class node can be increased. For example, the designated content may be a predetermined recognition result that contributes significantly to sensitivity, but is not limited thereto.
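The weight adjustment described above might look like the following sketch. The amount of the increase is not specified in the embodiment, so the constant `BOOST`, the set `DESIGNATED_CONTENTS` and the function `adjusted_weights` are all assumptions made for illustration.

```python
# Initial edge weights: (first-class node, second-class node) -> weight.
INITIAL_WEIGHTS = {
    ("negative emotion", "meaning content 1"): 0.7,
    ("positive emotion", "meaning content 1"): 0.2,
}
DESIGNATED_CONTENTS = {"negative emotion"}  # assumed high-contribution results
BOOST = 0.1                                 # assumed size of the increase


def adjusted_weights(inputs):
    """Return edge weights, raised for the input nodes if designated content is present."""
    weights = dict(INITIAL_WEIGHTS)
    if DESIGNATED_CONTENTS & set(inputs):
        for (first, second), weight in INITIAL_WEIGHTS.items():
            if first in inputs:
                weights[(first, second)] = weight + BOOST
    return weights


# No designated content among the inputs: the initial values are kept.
print(adjusted_weights(["positive emotion"]) == INITIAL_WEIGHTS)  # True
```

When the inputs do contain designated content, every edge leaving an input node is raised by the same amount in this sketch; a per-edge increase would be an equally valid reading.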

In this implementation, a weight is set for the relevance between each connected first-class node and second-class node, that is, for each connecting edge, so that after the graph database obtains the plurality of target recognition results serving as the detection basis, a second-class node can be selected by calculating the composite weight.
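The composite-weight check of the second preset condition can be sketched as follows. The edge weights 0.7 and 0.2 come from the example above; the threshold value 0.8 is an assumption, since the embodiment does not give one, and ontology nodes are omitted for brevity. The names `EDGE_WEIGHT`, `composite_weight` and `find_second_condition` are hypothetical.

```python
# Edge weights from the example: (first-class node, second-class node) -> weight.
EDGE_WEIGHT = {
    ("negative emotion", "meaning content 1"): 0.7,
    ("positive emotion", "meaning content 1"): 0.2,
}
WEIGHT_THRESHOLD = 0.8  # assumed preset weight threshold


def composite_weight(target_results, second_node):
    """Sum the edge weights; return None if any designated node is not connected."""
    total = 0.0
    for result in target_results:
        weight = EDGE_WEIGHT.get((result, second_node))
        if weight is None:
            return None
        total += weight
    return total


def find_second_condition(target_results):
    """Return second-class nodes whose composite weight exceeds the threshold."""
    second_nodes = sorted({second for _, second in EDGE_WEIGHT})
    hits = []
    for second in second_nodes:
        total = composite_weight(target_results, second)
        if total is not None and total > WEIGHT_THRESHOLD:
            hits.append(second)
    return hits


print(find_second_condition(["negative emotion", "positive emotion"]))
print(find_second_condition(["negative emotion"]))
```

With the assumed threshold, the two results together select meaning content 1, whereas the single result with weight 0.7 does not.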

Optionally, in another implementation, the number of the plurality of specified elements is two;

the first class nodes and the second class nodes with relevance are connected;

inputting the target recognition results of the two designated elements into the graph database, so that the graph database detects whether there exists a link that involves the two designated nodes and a second-class node and whose path length across the traversed nodes is smaller than a preset threshold, to obtain a detection result; the two designated nodes are the two first-class nodes representing the target recognition results of the two designated elements.

When the plurality of first-class nodes only comprise a plurality of nodes representing various potential identification results, the first-class nodes which are associated with the second-class nodes refer to the nodes representing the potential identification results; when the plurality of first-class nodes include both a plurality of nodes representing respective potential recognition results and nodes representing ontology contents corresponding to the potential recognition results, the first-class nodes having an association with the second-class nodes may be the nodes representing the potential recognition results or the nodes representing the ontology contents corresponding to the potential recognition results.

It can be understood that, since the first-class nodes and second-class nodes having relevance are connected, and the first-class node representing a potential recognition result is connected with the first-class node representing the ontology content corresponding to that potential recognition result, if two target recognition results combined together can represent a meaning content, the two first-class nodes representing the two target recognition results can form a link, and the path length of the formed link is short.

Therefore, in this implementation, whether the combined content of the target recognition results of the designated elements can represent at least one of the preset meaning contents is detected by checking whether the two first-class nodes representing the two target recognition results can form a link, and whether the path length across the nodes traversed in forming the link is smaller than a preset threshold.
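The path-length check can be sketched with a breadth-first search. This is one possible reading only: it assumes an undirected graph, measures the link as the shortest route from one designated node to the other through a second-class node, and uses an assumed threshold of 4. The edge list reuses the person 1 / place 1 example; all identifiers are hypothetical.

```python
from collections import deque

# Undirected edges of a small hypothetical knowledge graph.
EDGES = [
    ("person 1", "ontology content 1"),
    ("ontology content 1", "meaning content 2"),
    ("place 1", "meaning content 2"),
]
SECOND_CLASS = {"meaning content 2"}
MAX_PATH_LENGTH = 4  # assumed preset threshold


def build_adjacency(edges):
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def distances_from(adjacency, start):
    """Breadth-first search: number of edges from start to each reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def linked_meanings(node_a, node_b):
    """Second-class nodes joining both designated nodes within the threshold."""
    adjacency = build_adjacency(EDGES)
    da = distances_from(adjacency, node_a)
    db = distances_from(adjacency, node_b)
    return sorted(
        s for s in SECOND_CLASS
        if s in da and s in db and da[s] + db[s] < MAX_PATH_LENGTH
    )


print(linked_meanings("person 1", "place 1"))  # ['meaning content 2']
```

Here the route person 1, ontology content 1, meaning content 2, place 1 has three edges, which is below the assumed threshold.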

The sensitive data identification method provided by the embodiment of the invention is described below with reference to a specific example.

A knowledge graph is established in advance, and fig. 3 shows the relationship between the node contents of the first-class nodes and the node contents of the second-class nodes in the knowledge graph; specifically, the first column shows the node contents of the first-class nodes, and the second column shows the node contents of the second-class nodes. The three contents "positive", "70th anniversary National Day" and "poverty", when combined, can represent "veiled satire of national politics", that is, all three contents have relevance to "veiled satire of national politics". Therefore, the three first-class nodes representing "positive", "70th anniversary National Day" and "poverty" in the knowledge graph are each connected with the second-class node representing "veiled satire of national politics", and one attribute value of the designated attribute of that second-class node is set as: "positive", "70th anniversary National Day" and "poverty". Fig. 3 schematically shows the three first-class nodes representing "positive", "70th anniversary National Day" and "poverty" each connected with the second-class node representing "veiled satire of national politics".

Assume a target video whose video frame sequence depicts a child carrying a school bag and walking through mud on the way to school, and whose audio frame sequence has the audio content: wishing China a happy 70th anniversary National Day;

the process of processing the video by using the sensitive data identification method provided by the embodiment of the invention comprises the following steps:

acquiring the target video;

performing frame extraction on the video frame sequence of the target video, which serves as designated element A, to obtain at least one video frame, and performing emotion recognition on the at least one video frame using a deep learning algorithm to obtain the target recognition result "poverty" of designated element A;

performing emotion recognition on the audio frame sequence of the target video, which serves as designated element B, to obtain the target recognition result "positive" of designated element B;

performing keyword recognition on the audio frame sequence of the target video, which serves as designated element B, to obtain the target recognition result "70th anniversary National Day" of designated element B;

and inputting each obtained target recognition result into the graph database storing the knowledge graph, so that the graph database detects a second-class node that is connected with the first-class nodes representing "poverty", "positive" and "70th anniversary National Day" and whose designated attribute has these three contents as an attribute value; the video is therefore judged to be sensitive data.

Therefore, this scheme can effectively identify whether multimedia data belongs to sensitive data.
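The specific example can be traced end to end with a small sketch. The recognizers are replaced by fixed lookups, since the actual deep learning models are outside the scope of this illustration; the strings are English renderings of the labels in the example, and `KNOWLEDGE_GRAPH`, `recognize_elements`, `matched_meanings` and `is_sensitive` are hypothetical names.

```python
# Stand-in for the knowledge graph: each second-class node lists the attribute
# values of its designated attribute as sets of first-class node contents.
KNOWLEDGE_GRAPH = {
    "veiled satire of national politics": [
        frozenset({"positive", "70th anniversary National Day", "poverty"}),
    ],
}


def recognize_elements(video: dict) -> list:
    """Placeholder recognizers; a real system would run models on frames and audio."""
    return [
        video["frame_emotion"],  # designated element A, emotion recognition
        video["audio_emotion"],  # designated element B, emotion recognition
        video["audio_keyword"],  # designated element B, keyword recognition
    ]


def matched_meanings(results) -> list:
    """Second-class nodes whose designated attribute holds exactly these contents."""
    combined = frozenset(results)
    return [name for name, values in KNOWLEDGE_GRAPH.items() if combined in values]


def is_sensitive(video: dict) -> bool:
    return bool(matched_meanings(recognize_elements(video)))


target_video = {
    "frame_emotion": "poverty",
    "audio_emotion": "positive",
    "audio_keyword": "70th anniversary National Day",
}
print(is_sensitive(target_video))  # True
```

None of the three results is sensitive on its own in this sketch; only their combination matches an attribute value of a second-class node.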

Corresponding to the above method embodiment, an embodiment of the present invention provides a sensitive data identification apparatus, as shown in fig. 4, where the apparatus may include:

an obtaining module 410, configured to obtain multimedia data;

the identification module 420 is configured to perform element recognition on each of multiple designated elements of the multimedia data, to obtain a target recognition result of each designated element; wherein the element recognition for each designated element is a recognition mode whose recognition result can be used for judging whether the multimedia data is sensitive data;

the analysis module 430 is configured to perform fusion analysis in a specified manner on the target recognition results of the multiple specified elements to obtain a fusion analysis result;

a determining module 440, configured to identify whether the multimedia data is sensitive data based on the fusion analysis result.
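The cooperation of the four modules can be sketched as a simple composition. This is only an illustration of the data flow, assuming each module is supplied as a callable; the class `SensitiveDataIdentifier` and its field names are hypothetical, and the comments map them to the reference numerals above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SensitiveDataIdentifier:
    obtain: Callable[[str], dict]                  # obtaining module 410
    recognizers: Dict[str, Callable[[dict], str]]  # identification module 420
    analyze: Callable[[List[str]], bool]           # analysis module 430

    def identify(self, source: str) -> bool:
        data = self.obtain(source)
        results = [recognize(data) for recognize in self.recognizers.values()]
        fusion_result = self.analyze(results)
        return self.determine(fusion_result)

    @staticmethod
    def determine(fusion_result: bool) -> bool:    # determining module 440
        return bool(fusion_result)


identifier = SensitiveDataIdentifier(
    obtain=lambda source: {"image": "person 1", "text": "place 1"},
    recognizers={"image": lambda d: d["image"], "text": lambda d: d["text"]},
    analyze=lambda results: set(results) == {"person 1", "place 1"},
)
print(identifier.identify("example.jpg"))  # True
```

The lambdas stand in for real acquisition, recognition and fusion-analysis logic and would be replaced by the corresponding implementations.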

Optionally, the analysis module 430 includes:

the analysis submodule is used for detecting whether the combined content of the target identification results of the various specified elements can represent at least one of preset meaning contents or not to obtain a detection result which is used as a fusion analysis result; wherein the meaning represented by each meaning content belongs to sensitive content;

the determining module 440 includes:

the determining submodule is used for determining the multimedia data as sensitive data if the fusion analysis result shows that the combined content of the target identification results of the various designated elements can represent at least one of preset meaning contents; otherwise, determining that the multimedia data is not sensitive data.

Optionally, the analysis submodule includes:

the analysis unit is used for detecting, through a graph database storing a preset knowledge graph, whether the combined content of the target recognition results of the multiple designated elements can represent at least one of the preset meaning contents, to obtain a detection result serving as the fusion analysis result;

the analysis unit is specifically configured to:

Optionally, the number of the plurality of specified elements is two;

the first class nodes and the second class nodes with relevance are connected;

the analysis unit is specifically configured to:

Optionally, the plurality of specified elements includes at least two of:

data content of the multimedia data under a plurality of data types;

An embodiment of the present invention further provides an electronic device, as shown in fig. 5, which includes a processor 501, a communication interface 502, a memory 503 and a communication bus 504, where the processor 501, the communication interface 502 and the memory 503 communicate with one another through the communication bus 504;

a memory 503 for storing a computer program;

the processor 501 is configured to implement the steps of the sensitive data identification method provided by the embodiment of the present invention when executing the program stored in the memory 503.

The communication bus mentioned for the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the electronic equipment and other equipment.

The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.

The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but also Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components.

In yet another embodiment provided by the present invention, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the sensitive data identification method provided by the embodiment of the present invention.

In yet another embodiment provided by the present invention, a computer program product containing instructions is also provided, which when run on a computer, causes the computer to perform the steps of the sensitive data identification method provided by the embodiment of the present invention.

In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for embodiments of devices, apparatuses, storage media, etc., since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for relevant points.

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A method for sensitive data identification, the method comprising:

acquiring multimedia data;

2. The method according to claim 1, wherein performing fusion analysis of the target recognition results of the plurality of designated elements in a designated manner to obtain a fusion analysis result comprises:

3. The method of claim 2, wherein the detecting whether the combined content of the target recognition results about the plurality of designated elements can characterize at least one of a plurality of preset meaning contents, and obtaining the detection result comprises:

4. The method of claim 3, wherein each potential recognition result belongs to entity content and corresponds to ontology content;

5. The method according to claim 3 or 4, wherein the characterizing form of the association relationship between each first class node and each second class node comprises:

6. The method according to claim 3 or 4, wherein the characterizing form of the association relationship between each first class node and each second class node comprises:

7. The method of claim 3 or 4, wherein the number of the plurality of specified elements is two;

the first class nodes and the second class nodes with relevance are connected;

8. The method of any of claims 1-4, wherein the plurality of specified elements include at least two of:

data content of the multimedia data under a plurality of data types;

9. An apparatus for identifying sensitive data, the apparatus comprising:

the acquisition module is used for acquiring multimedia data;

10. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;

a memory for storing a computer program;

a processor for implementing the method steps of any of claims 1 to 8 when executing a program stored in the memory.

11. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of the claims 1-8.