CN115809429A - Network media data supervision method and device, electronic equipment and readable storage medium

Network media data supervision method and device, electronic equipment and readable storage medium

Info

Publication number
CN115809429A
CN115809429A
Authority
CN
China
Prior art keywords
media data
network media
matrix
row vector
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211212072.3A
Other languages
Chinese (zh)
Inventor
崔晓峰
徐鹏飞
王波
周文明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dt Dream Technology Co Ltd
Original Assignee
Hangzhou Dt Dream Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dt Dream Technology Co Ltd
Priority to CN202211212072.3A
Publication of CN115809429A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The present application provides a network media data supervision method and apparatus, an electronic device, and a readable storage medium. The method includes: acquiring multi-modal network media data; generating a feature matrix according to the network media data of each modality in the multi-modal network media data, where each row vector in the feature matrix corresponds to the feature vector of one modality; inputting the feature matrix into a pre-trained classification model for classification processing to obtain a classification result, output by the classification model, for the multi-modal network media data; and performing supervision processing on the multi-modal network media data based on the classification result. The method realizes supervision of multi-modal network media data and avoids the supervision blind spots that can arise when only single-modality data is supervised.

Description

Network media data supervision method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of big data technologies, and in particular to a network media data supervision method and apparatus, an electronic device, and a readable storage medium.
Background
With the rapid development of Internet media, the network has become an important channel through which people obtain information and encounter new things. At the same time, some users publish content that violates relevant laws and regulations, so network media data needs to be supervised in order to maintain a healthy network environment.
In recent years, media data on the network has become increasingly multi-modal; for example, a video playing website often contains video media data, picture media data and text media data at the same time. Multi-modal data enriches the sensory experience of users on the one hand, and on the other hand poses a greater challenge to Internet media supervision. Existing network media data supervision methods target data of only a single modality, so violating content can easily evade a single-modality review; statements violating relevant legal provisions cannot be accurately detected from a single modality alone, leaving supervision blind spots.
Disclosure of Invention
The application provides a network media data supervision method, which comprises the following steps:
acquiring multi-modal network media data;
generating a feature matrix according to the network media data of each modality in the multi-modal network media data; each row vector in the feature matrix corresponds to the feature vector of one modality;
inputting the feature matrix into a pre-trained classification model for classification processing to obtain a classification result, output by the classification model, for the multi-modal network media data; wherein the classification model is a deep learning model comprising a self-attention coding layer; the self-attention coding layer is used for learning the relations among the network media data of the respective modalities; and the classification result is used for indicating the violation type of the multi-modal network media data;
and performing supervision processing on the multi-modal network media data based on the classification result.
Optionally, inputting the feature matrix into a pre-trained classification model for classification processing to obtain a classification result, output by the classification model, for the multi-modal network media data includes:
inputting the feature matrix into the self-attention coding layer in the pre-trained classification model, and coding the feature matrix to obtain a coding result indicating the relations among the network media data of the respective modalities;
and classifying the multi-modal network media data based on the coding result to obtain the classification result, output by the classification model, for the multi-modal network media data.
Optionally, inputting the feature matrix into the self-attention coding layer in the pre-trained classification model and coding the feature matrix to obtain a coding result indicating the relations among the network media data of the respective modalities includes:
generating a first matrix, a second matrix and a third matrix corresponding to each row vector in the feature matrix;
respectively taking each row vector in the feature matrix as a target row vector, and calculating a weight coefficient corresponding to the target row vector based on a first matrix corresponding to the target row vector and a second matrix corresponding to all the row vectors, wherein the weight coefficient corresponding to the target row vector is used for representing the correlation degree between elements in the target row vector and elements of other row vectors;
and adjusting the target row vector according to the weight coefficient to obtain a coding result.
Optionally, the adjusting the target row vector according to the weight coefficient to obtain the encoding result includes:
calculating a weighting result of a third matrix corresponding to the target row vector and a weight coefficient corresponding to the target row vector;
and performing addition operation on the weighting results respectively corresponding to the row vectors in the feature matrix to obtain the coding result.
Optionally, calculating the weight coefficient corresponding to the target row vector based on the first matrix corresponding to the target row vector and the second matrices corresponding to all row vectors includes:
performing dot product operation on a first matrix corresponding to the target row vector and second matrices corresponding to all row vectors to obtain a weight matrix corresponding to the target row vector;
and carrying out normalization processing on the weight matrix to obtain a weight coefficient corresponding to the target row vector.
Optionally, the multimodal network media data includes at least two of text media data, picture media data, audio media data, and video media data.
Optionally, generating a feature matrix according to the network media data of each modality in the multi-modal network media data includes: generating, based on the network media data of each modality in the multi-modal network media data, a feature vector corresponding to the network media data of that modality;
and generating the feature matrix based on the feature vectors corresponding to the network media data of the respective modalities.
Optionally, before generating the feature matrix based on the feature vectors corresponding to the network media data of each modality, the method further includes:
aligning the feature vectors corresponding to the network media data of the respective modalities in the multi-modal network media data, so that the dimensions of the feature vectors in the feature matrix are the same.
The present application further provides a device for supervising network media data, the device comprising:
the data acquisition unit is used for acquiring multi-modal network media data;
the feature matrix generating unit is used for generating a feature matrix according to the network media data of each modality in the multi-modal network media data; each row vector in the feature matrix corresponds to the feature vector of one modality;
the classification processing unit is used for inputting the feature matrix into a pre-trained classification model for classification processing to obtain a classification result, output by the classification model, for the multi-modal network media data; wherein the classification model is a deep learning model comprising a self-attention coding layer; the self-attention coding layer is used for learning the relations among the network media data of the respective modalities; and the classification result is used for indicating the violation type of the multi-modal network media data;
and the supervision processing unit is used for carrying out supervision processing on the multi-modal network media data based on the classification result.
The application also provides an electronic device, which comprises a communication interface, a processor, a memory and a bus, wherein the communication interface, the processor and the memory are mutually connected through the bus;
the memory stores machine-readable instructions, and the processor executes the above method by calling the machine-readable instructions.
The present application also provides a computer readable storage medium having stored thereon machine readable instructions which, when invoked and executed by a processor, implement the above method.
In the solutions described in the above embodiments, the multi-modal network media data is input into the pre-trained classification model, and the information contained in the network media data of the respective modalities is fused through the self-attention mechanism, so that the relations among the network media data of the respective modalities can be learned. The classification result obtained by the classification model can therefore more accurately indicate the violation type of the multi-modal network media data, supervision of the multi-modal network media data can be realized more accurately, and the supervision blind spots that may exist when only single-modality data is supervised are avoided.
Drawings
Fig. 1 is a flowchart illustrating a network media data supervision method according to an exemplary embodiment.
Fig. 2 is a flowchart illustrating encoding of the feature matrix according to an exemplary embodiment.
Fig. 3 is a hardware structure diagram of an electronic device in which a network media data supervision apparatus is located, according to an exemplary embodiment.
Fig. 4 is a block diagram of a network media data supervision apparatus according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of apparatuses and methods consistent with certain aspects of the application, as detailed in the appended claims.
It should be noted that: in other embodiments, the steps of the corresponding methods are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than those described herein. Moreover, a single step described in this specification may be broken down into multiple steps for description in other embodiments; multiple steps described in this specification may be combined into a single step in other embodiments.
With the rapid development of Internet media, the network has become an important channel through which people obtain information and encounter new things. At the same time, some users publish content that violates relevant laws and regulations, so network media data needs to be supervised in order to maintain a healthy network environment.
In recent years, media data on the network has gradually become multi-modal. Multi-modal data enriches the sensory experience of users on the one hand, and on the other hand poses a greater challenge to Internet media supervision. Existing Internet media supervision targets data of only a single modality, while data of the various modalities often complement and match one another; statements violating relevant legal provisions therefore cannot be accurately detected from a single modality alone, leaving supervision blind spots.
In view of this, the present application provides a technical solution in which multi-modal network media data is converted into a feature matrix, the feature matrix is input into a pre-trained classification model for classification processing to obtain a classification result, and supervision processing is performed on the multi-modal network media data based on the classification result.
Referring to Fig. 1, Fig. 1 is a flowchart illustrating a network media data supervision method according to an exemplary embodiment. The method may include the following steps:
step 102, obtaining multi-modal network media data.
In this specification, the multi-modal network media data may be read from a pre-constructed multi-modal database, or may be obtained through the background server of a website. The multi-modal network media data may include at least two of text media data, picture media data, audio media data and video media data.
Step 104, generating a feature matrix according to the network media data of each modality in the multi-modal network media data; each row vector in the feature matrix corresponds to the feature vector of one modality.
After the multi-modal network media data is obtained, the feature matrix corresponding to the multi-modal network media data can be generated directly from it. The feature matrix may include a plurality of row vectors, and each row vector may correspond to the network media data of one modality. This specification does not limit the way the feature matrix is generated: a corresponding feature vector may be generated for the network media data of each modality in the multi-modal network media data, and the feature vectors corresponding to the network media data of the respective modalities may be combined into the feature matrix; alternatively, the multi-modal network media data may be input into a pre-constructed matrix generation model, which generates a feature vector for the network media data of each modality and combines the feature vectors into the feature matrix.
Step 106, inputting the feature matrix into a pre-trained classification model for classification processing to obtain a classification result, output by the classification model, for the multi-modal network media data.
After generating the feature matrix for the multi-modal network media data, the feature matrix may be input to a pre-trained classification model for classification.
The pre-trained classification model may be a deep learning model with a self-attention coding layer. Through the self-attention coding layer of the classification model, the feature matrix can be encoded and the relations among the network media data of the respective modalities can be learned. Based on the encoding of the feature matrix, the pre-trained classification model can classify the multi-modal network media data over the various violation types and output a classification result indicating the violation type of the multi-modal network media data.
For example, the pre-trained classification model may output, for the multi-modal network media data, a markedly higher probability for one violation type than for all the others, and that violation type can then be determined to be the violation type of the multi-modal network media data.
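As an illustration of the kind of model described here, the following is a minimal Python (PyTorch) sketch of a classifier that applies a self-attention coding layer to the per-modality feature matrix and a softmax classification layer over violation types. The layer sizes, the single attention head, the mean-pooling step and the number of violation types are illustrative assumptions, not details stated in this application.

import torch
import torch.nn as nn

class MultiModalViolationClassifier(nn.Module):
    def __init__(self, feature_dim=128, num_violation_types=5):
        super().__init__()
        # Self-attention coding layer: learns the relations among the
        # per-modality feature vectors (rows of the feature matrix).
        self.encoder = nn.TransformerEncoderLayer(
            d_model=feature_dim, nhead=1, batch_first=True)
        # Classification layer: one score per violation type.
        self.classifier = nn.Linear(feature_dim, num_violation_types)

    def forward(self, feature_matrix):
        # feature_matrix: (batch, num_modalities, feature_dim)
        encoded = self.encoder(feature_matrix)
        # Pool the per-modality encodings into one vector (mean pooling
        # is an assumption; the application does not specify this step).
        pooled = encoded.mean(dim=1)
        return torch.softmax(self.classifier(pooled), dim=-1)

model = MultiModalViolationClassifier()
# Three modality rows, e.g. title text, cover picture, video content.
probs = model(torch.randn(1, 3, 128))  # shape: (1, num_violation_types)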
Step 108, performing supervision processing on the multi-modal network media data based on the classification result.
After the classification result for the multi-modal network media data output by the classification model is obtained, the multi-modal network media data can be supervised according to the violation type of the multi-modal network media data indicated by the classification result.
For example, when it is determined that the multi-modal network media data belongs to a certain violation type, the multi-modal network media data may be directly subjected to supervision processing such as shielding or blocking.
In the solution described in the above embodiment, the multi-modal network media data is input into the pre-trained classification model, the relations among the network media data of the respective modalities are learned through the self-attention mechanism, and the information contained in the network media data of the respective modalities is fused, so that the classification result obtained by the classification model can more accurately indicate the violation type of the multi-modal network media data. Supervision of the multi-modal network media data is thus realized more accurately, and the supervision blind spots that may exist when only single-modality data is supervised are avoided.
The present application is described below with reference to specific embodiments and specific application scenarios.
Step 102, obtaining multi-modal network media data.
In this specification, the multimodal network media data may include at least two of text media data, picture media data, audio media data, and video media data.
For example, when supervising the network media data of a video playing website, the text content, pictures, advertisement data and the like displayed on the website pages may be acquired, and multi-modal network media data such as the videos provided by the website and the comment text related to those videos may also be acquired.
This specification does not limit the specific manner of obtaining the multi-modal network media data. For example, the multi-modal network media data may be read from a pre-constructed multi-modal database, or may be obtained through the background server of a website.
Step 104, generating a feature matrix according to the network media data of each modality in the multi-modal network media data; each row vector in the feature matrix corresponds to the feature vector of one modality.
In step 104, after the multi-modal network media data is obtained, a corresponding feature vector may be generated for the network media data of each modality in the multi-modal network media data, and the feature matrix may then be generated based on the feature vectors of the respective modalities; each row vector in the feature matrix corresponds to the feature vector of one modality.
In one possible embodiment, the text media data may be converted into a feature vector based on a pre-trained BERT model.
In one possible embodiment, the picture media data may be converted into a feature vector based on a pre-trained iBot model.
Similarly, for audio media data, audio segments can be extracted and the feature vector corresponding to the audio media data generated with a pre-trained neural network model; for video media data, video key frames can be extracted and the feature vector corresponding to the video media data generated from those key frames with a pre-trained neural network model.
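As a concrete illustration of the text branch, here is a hedged Python sketch that turns text media data into a feature vector with a pre-trained BERT model via the Hugging Face transformers library. The checkpoint name bert-base-chinese and the use of the [CLS] hidden state as the text feature vector are illustrative assumptions, not choices stated in this application.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
bert = AutoModel.from_pretrained("bert-base-chinese")

def text_to_feature_vector(text):
    # Tokenize the text media data and run it through BERT.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    # Use the hidden state of the [CLS] token as the text feature vector.
    return outputs.last_hidden_state[0, 0]

vec = text_to_feature_vector("example video title")  # shape: (768,)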
After generating the feature vectors for the network media data of each modality in the multi-modal network media data, the feature matrix may be generated based on these feature vectors. Specifically, the feature vectors of the respective modalities may be combined so that each row vector of the feature matrix corresponds to the feature vector of one modality.
Taking a video website as an example, auditing video content alone often involves a heavy workload and takes a long time, so the title of a video displayed on the website page and the cover of the video can be audited together with the video content to identify violation data more quickly and accurately. The title text of the video can be extracted and a corresponding feature vector generated for it, such as [1,0,0,1]. Similarly, the video cover can be extracted and a corresponding feature vector generated for it, such as [1,2,2,0]. Further, a feature vector corresponding to the video content, such as [0,1,1,0], may be generated.
These feature vectors may then be combined to generate the feature matrix. For example, the three feature vectors [1,0,0,1], [1,2,2,0] and [0,1,1,0] are used as the row vectors of the feature matrix:
[1 0 0 1]
[1 2 2 0]
[0 1 1 0]
In an embodiment, since the feature vectors output after the network media data of different modalities are input into their pre-trained neural network models may differ in length, the feature vectors corresponding to the network media data of the respective modalities may be aligned so that all feature vectors in the feature matrix have the same dimension.
Specifically, the feature vector corresponding to the network media data of each modality may be input into an N-dimensional feature vector learning layer, which converts each of them into an N-dimensional feature vector, thereby aligning the vectors of the respective modalities. The three feature vectors in the above example are all 4-dimensional and therefore need no alignment; when the lengths of the feature vectors are inconsistent, alignment is required.
For example, the feature vectors [1,0,0], [1,2,2,0] and [0,1,1,0,2], which are 3-, 4- and 5-dimensional respectively, may be input into a 5-dimensional feature vector learning layer to obtain the 5-dimensional feature vectors [1,1,0,0,0], [1,2,2,0,0] and [0,1,1,0,2].
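The alignment step can be pictured as one learned projection per modality into the shared N-dimensional space. The sketch below assumes plain linear layers as the "N-dimensional feature vector learning layer"; the application does not state what that layer consists of, so this is only one plausible realization.

import torch
import torch.nn as nn

N = 5  # target dimension shared by all modalities
projections = nn.ModuleDict({
    "text":  nn.Linear(3, N),  # 3-dimensional text feature vector
    "image": nn.Linear(4, N),  # 4-dimensional picture feature vector
    "video": nn.Linear(5, N),  # 5-dimensional video feature vector
})

vectors = {
    "text":  torch.tensor([1., 0., 0.]),
    "image": torch.tensor([1., 2., 2., 0.]),
    "video": torch.tensor([0., 1., 1., 0., 2.]),
}

# Project every modality to N dimensions and stack the aligned vectors
# into the feature matrix, one row vector per modality.
feature_matrix = torch.stack(
    [projections[name](vec) for name, vec in vectors.items()])
print(feature_matrix.shape)  # torch.Size([3, 5])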
Step 106, inputting the feature matrix into a pre-trained classification model for classification processing to obtain a classification result, output by the classification model, for the multi-modal network media data.
Wherein the classification model is a deep learning model comprising a self-attention coding layer; the self-attention coding layer is used for learning the relation among the network media data of each modality; the classification result is used for indicating the violation type of the multi-modal network media data.
After the feature matrix is generated for the multi-modal network media data, the feature matrix can be input into a pre-trained classification model for classification processing.
In this specification, the pre-trained classification model is a deep learning model that includes a self-attention coding layer. Through the self-attention mechanism, the pre-trained classification model can learn the relations among the network media data of the respective modalities and generate a classification result for the multi-modal network media data, the classification result indicating the violation type of the multi-modal network media data.
After the feature matrix is input into the pre-trained classification model for classification processing, the classification result, output by the classification model, for the multi-modal network media data can be obtained. The classification result is used for indicating the violation type of the multi-modal network media data.
In this specification, the violation types of the multi-modal network media data may include violations of relevant legal provisions, violations of other relevant normative documents, and the like, and are not specifically limited in this specification.
In one embodiment, the feature matrix may be input into a self-attention coding layer in the pre-trained classification model, and the feature matrix is coded to obtain a coding result indicating a relationship between network media data of each modality;
after the coding result is obtained, classifying the multi-modal network media data based on the coding result to obtain a classification result of the multi-modal network media data output by the classification model.
Referring to Fig. 2, Fig. 2 is a flowchart illustrating encoding of the feature matrix according to an exemplary embodiment. The encoding may include the following steps:
Step 202, generating a first matrix, a second matrix and a third matrix corresponding to each row vector in the feature matrix.
In this step, for each row vector in the feature matrix, a corresponding first matrix, second matrix and third matrix may be generated; for example, the first, second and third matrices may be a Q matrix, a K matrix and a V matrix, respectively.
Specifically, preset Wq, Wk and Wv matrices are each dot-multiplied with a row vector to generate the Q, K and V matrices corresponding to that row vector. The Wq, Wk and Wv matrices may be preset by the user according to actual needs, and are not limited in this specification.
Through the Wq, Wk and Wv matrices, the Q, K and V matrices corresponding to each row vector in the feature matrix are generated; for example, the dot product of a row vector with the Wq matrix gives the Q matrix corresponding to that row vector, and the Wk and Wv matrices are used in the same way to give the K and V matrices. (The example Wq, Wk and Wv matrices and the Q matrix computed from them appear only as equation images in the original publication and are not reproduced here.)
Similarly, the Q, K and V matrices corresponding to each row vector can be calculated, which is not described in detail in this specification.
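Step 202 can be written compactly with matrix products: stacking the row vectors gives the feature matrix X, and X @ Wq, X @ Wk and X @ Wv give the Q, K and V matrices of every row at once. In the Python sketch below the Wq, Wk and Wv values are random placeholders, since the preset example values in this publication survive only as images; the shapes (4x2 for Wq and Wk, 4x3 for Wv) are likewise assumptions consistent with the worked example.

import numpy as np

# Feature matrix from the worked example, one row vector per modality.
X = np.array([[1, 0, 0, 1],
              [1, 2, 2, 0],
              [0, 1, 1, 0]])

rng = np.random.default_rng(0)
Wq = rng.integers(0, 3, size=(4, 2))  # placeholder preset matrices;
Wk = rng.integers(0, 3, size=(4, 2))  # the real values would be preset
Wv = rng.integers(0, 3, size=(4, 3))  # by the user per actual needs

Q = X @ Wq  # row i of Q is the Q matrix of row vector i
K = X @ Wk  # row i of K is the K matrix of row vector i
V = X @ Wv  # row i of V is the V matrix of row vector i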
Step 204, respectively taking each row vector in the feature matrix as a target row vector, and calculating the weight coefficients corresponding to the target row vector based on the first matrix corresponding to the target row vector and the second matrices corresponding to all row vectors, wherein the weight coefficients corresponding to the target row vector are used for characterizing the degree of correlation between the elements in the target row vector and the elements of the other row vectors.
After the first, second and third matrices corresponding to each row vector are calculated, one of the row vectors may be selected as the target row vector, and the weight coefficients corresponding to the target row vector calculated based on the first matrix corresponding to the target row vector and the second matrices corresponding to all row vectors. The weight coefficients corresponding to the target row vector are used for characterizing the degree of correlation between the elements in the target row vector and the elements of the other row vectors.
Specifically, a dot product operation may be performed between the first matrix corresponding to the target row vector and the second matrices corresponding to all row vectors to obtain a weight matrix corresponding to the target row vector, and the weight coefficients may then be generated from the weight matrix.
For example, in the above example, the Q matrix corresponding to the row vector [1,0,0,1] is dot-multiplied with the K matrices corresponding to all three row vectors [1,0,0,1], [1,2,2,0] and [0,1,1,0]. (The individual example Q and K values survive only partially in the original publication and are not reproduced here.) This yields the weight matrix corresponding to the row vector [1,0,0,1]:
[5 1 6]
After the weight matrix corresponding to the row vector is calculated, normalization processing may be performed on it to obtain the weight coefficients corresponding to the target row vector.
For example, normalization may be performed on the weight matrix with a softmax function: softmax([5, 1, 6]) ≈ [0.27, 0.0, 0.73]; that is, the weight coefficient corresponding to the row vector [1,0,0,1] is 0.27, the weight coefficient corresponding to the row vector [1,2,2,0] is 0.0, and the weight coefficient corresponding to the row vector [0,1,1,0] is 0.73.
The weight coefficients characterize the degree of correlation between the elements in the target row vector and the elements of the other row vectors; since each row vector corresponds to the feature vector of the network media data of a different modality, the weight coefficients can characterize the degree of correlation between the network media data of different modalities.
Step 206, adjusting the target row vector according to the weight coefficients to obtain the encoding result.
After the weight coefficient is calculated, the target row vector may be adjusted according to the weight coefficient, so as to obtain the encoding result.
For example, a weighting calculation may be performed based on the weight coefficients, and the result of the weighting calculation taken as the encoding result.
In an embodiment, the weighting result of the third matrix corresponding to each row vector and its corresponding weight coefficient may be calculated, and the weighting results corresponding to the respective row vectors in the feature matrix added together to obtain the encoding result.
For example, in the above example feature matrix, the V matrix corresponding to the row vector [1,0,0,1] is [2 4 0], the V matrix corresponding to the row vector [1,2,2,0] is [2 8 0], and the V matrix corresponding to the row vector [0,1,1,0] is [1 3 3]. The V matrix corresponding to each row vector may be multiplied by its corresponding weight coefficient in [0.27, 0.0, 0.73] to obtain the corresponding weighting results:
0.27 * [2 4 0] = [0.54 1.08 0.0]
0.0 * [2 8 0] = [0.0 0.0 0.0]
0.73 * [1 3 3] = [0.73 2.19 2.19]
Further, the weighting results may be added to obtain the encoding result:
[0.54 1.08 0.0] + [0.0 0.0 0.0] + [0.73 2.19 2.19] = [1.27 3.27 2.19]
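The arithmetic of steps 204-206 for the target row vector [1,0,0,1] can be checked with a few lines of Python, using the values that survive in the worked example: the weight matrix [5, 1, 6] and the three V matrices. Rounding the softmax output to two decimals, as the text does, reproduces the encoding result [1.27, 3.27, 2.19].

import numpy as np

weight_matrix = np.array([5.0, 1.0, 6.0])  # Q of [1,0,0,1] dotted with every K
V = np.array([[2.0, 4.0, 0.0],   # V matrix of row vector [1,0,0,1]
              [2.0, 8.0, 0.0],   # V matrix of row vector [1,2,2,0]
              [1.0, 3.0, 3.0]])  # V matrix of row vector [0,1,1,0]

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

# Normalize the weight matrix into weight coefficients, rounded to two
# decimals to match the example's arithmetic.
coeffs = softmax(weight_matrix).round(2)
# Weight each V matrix by its coefficient and add the weighting results.
encoding = coeffs @ V
print(coeffs)    # [0.27 0.   0.73]
print(encoding)  # [1.27 3.27 2.19]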
In summary, by inputting the feature matrix corresponding to the multi-modal network media data into the self-attention coding layer in the pre-trained classification model, the encoding result for the multi-modal network media data can be obtained.
After the encoding result for the multi-modal network media data is obtained, the multi-modal network media data may be further classified based on the encoding result to obtain a classification result for the multi-modal network media data output by the classification model.
In this specification, the encoding result of the multi-modal network media data may be input into a neural network classification layer, so that the classification layer performs classification processing on the multi-modal network media data. The classification layer can be implemented with a softmax function: the encoding result is input into the softmax function, which outputs the probabilities of the various violation types for the multi-modal network media data, and the violation type with the highest probability can be taken as the violation type of the multi-modal network media data.
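For completeness, a minimal sketch of such a classification layer: the encoding result is mapped through a linear layer to one logit per violation type and normalized with softmax. The weight values and the type labels below are made up for illustration; in practice they would come from training.

import numpy as np

encoding = np.array([1.27, 3.27, 2.19])  # encoding result from above

# Hypothetical trained classifier weights: 3 encoding features mapped
# to 3 violation types.
W = np.array([[ 0.2, -0.1,  0.4],
              [-0.3,  0.5,  0.1],
              [ 0.6,  0.2, -0.2]])
labels = ["violation type A", "violation type B", "violation type C"]

logits = encoding @ W
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over types
print(labels[int(np.argmax(probs))])  # highest-probability violation type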
Step 108, performing supervision processing on the multi-modal network media data based on the classification result.
Inputting the feature matrix of the multi-modal network media data into the pre-trained classification model yields the classification result, which indicates the violation type of the multi-modal network media data. Based on that violation type, the multi-modal network media data can be supervised.
For example, if certain multi-modal network media data violates relevant legal regulations, the pre-trained classification model can identify its violation type, and based on that type the multi-modal network media data can be shielded, deleted from the multi-modal network media database, reported to a network administrator with a violation warning, and so on.
In the solution described in the above embodiment, the multi-modal network media data is input into the pre-trained classification model, and the information contained in the network media data of the respective modalities is fused through the self-attention mechanism, so that the relations among the network media data of the respective modalities can be learned. The classification result obtained by the classification model can therefore more accurately indicate the violation type of the multi-modal network media data, supervision of the multi-modal network media data can be realized more accurately, and the supervision blind spots that may exist when only single-modality data is supervised are avoided.
Corresponding to the embodiment of the network media data supervision method, the specification also provides an embodiment of a network media data supervision device.
Referring to Fig. 3, Fig. 3 is a hardware structure diagram of an electronic device in which a network media data supervision apparatus is located, according to an exemplary embodiment. At the hardware level, the device includes a processor 302, an internal bus 304, a network interface 306, a memory 308 and a non-volatile storage 310, and may of course also include hardware required for other services. One or more embodiments of this specification may be implemented in software, for example by the processor 302 reading a corresponding computer program from the non-volatile storage 310 into the memory 308 and then running it. Of course, besides a software implementation, the one or more embodiments of this specification do not exclude other implementations, such as logic devices or combinations of software and hardware; that is, the execution subject of the following processing flow is not limited to logic units, and may also be hardware or logic devices.
Referring to Fig. 4, Fig. 4 is a block diagram of a network media data supervision apparatus according to an exemplary embodiment. The network media data supervision apparatus can be applied to the electronic device shown in Fig. 3 to implement the technical solution of this specification. The apparatus may include:
a data acquisition unit 402, configured to acquire multimodal network media data;
a feature matrix generating unit 404, configured to generate a feature matrix according to the network media data of each modality in the multi-modal network media data; each row vector in the feature matrix corresponds to the feature vector of one modality;
the classification processing unit 406 is configured to input the feature matrix into a pre-trained classification model for classification processing, so as to obtain a classification result for the multi-modal network media data output by the classification model; wherein the classification model is a deep learning model comprising a self-attention coding layer; the self-attention coding layer is used for learning the relation among the network media data of each modality; the classification result is used for indicating the violation type of the multi-modal network media data;
a supervision processing unit 408, configured to perform supervision processing on the multi-modal network media data based on the classification result.
Optionally, the classification processing unit 406 is specifically configured to input the feature matrix into a self-attention coding layer in the pre-trained classification model, and code the feature matrix to obtain a coding result indicating a relationship between network media data of each modality;
and classifying the multi-modal network media data based on the coding result to obtain a classification result of the multi-modal network media data output by the classification model.
Optionally, the classification processing unit 406 is specifically configured to generate a first matrix, a second matrix, and a third matrix corresponding to each row vector in the feature matrix;
respectively taking each row vector in the feature matrix as a target row vector, and calculating a weight coefficient corresponding to the target row vector based on a first matrix corresponding to the target row vector and a second matrix corresponding to all the row vectors, wherein the weight coefficient corresponding to the target row vector is used for representing the correlation degree between elements in the target row vector and elements of other row vectors;
and adjusting the target row vector according to the weight coefficient to obtain a coding result.
The classification processing unit 406 is specifically configured to calculate the weighting result of the third matrix corresponding to the target row vector and the weight coefficient corresponding to the target row vector;
and performing addition operation on the weighting results respectively corresponding to each row vector in the feature matrix to obtain the coding result.
The classification processing unit 406 is specifically configured to perform a dot product operation between the first matrix corresponding to the target row vector and the second matrices corresponding to all row vectors to obtain the weight matrix corresponding to the target row vector;
and carrying out normalization processing on the weight matrix to obtain a weight coefficient corresponding to the target row vector.
Optionally, the multimodal network media data includes at least two of text media data, picture media data, audio media data, and video media data.
Optionally, the feature matrix generating unit 404 is specifically configured to generate feature vectors corresponding to network media data of each modality based on the network media data of each modality in the multi-modality network media data;
and generate the feature matrix based on the feature vectors corresponding to the network media data of the respective modalities.
Optionally, the feature matrix generating unit 404 is specifically configured to align feature vectors corresponding to network media data of each modality in the multi-modality network media data, so that the dimensions of the feature vectors in the feature matrix are the same.
For specific details of how the functions and roles of each unit in the above apparatus are implemented, refer to the implementation process of the corresponding steps in the above method; they are not described here again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are only illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in the specification. One of ordinary skill in the art can understand and implement it without inventive effort.
This specification also provides an embodiment of a computer-readable storage medium. The computer readable storage medium stores machine readable instructions, which when called and executed by a processor, can implement the network media data supervision method provided by any one of the embodiments in this specification.
Embodiments of this specification provide computer-readable storage media, which may include, but are not limited to, any type of disk including floppy disks, hard disks, optical disks, CD-ROMs and magneto-optical disks, ROMs (Read-Only Memories), RAMs (Random Access Memories), EPROMs (Erasable Programmable Read-Only Memories), EEPROMs (Electrically Erasable Programmable Read-Only Memories), flash memories, magnetic cards or fiber optic cards. That is, a readable storage medium includes a readable medium that can store or transfer information.
The systems, apparatuses, modules or units described in the above embodiments may be specifically implemented by a computer chip or an entity, or implemented by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include a volatile form of computer readable media, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of another identical element in the process, method, article or apparatus that comprises the element.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of this specification to describe various information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments of this specification. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon" or "in response to determining".
The above description is only for the purpose of illustrating the preferred embodiments of the one or more embodiments of the present disclosure, and is not intended to limit the scope of the one or more embodiments of the present disclosure, and any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the one or more embodiments of the present disclosure should be included in the scope of the one or more embodiments of the present disclosure.

Claims (10)

1. A method for supervising network media data, the method comprising:
acquiring multi-modal network media data;
generating a feature matrix according to the network media data of each modality in the multi-modal network media data; each row vector in the feature matrix corresponds to the feature vector of one modality;
inputting the feature matrix into a pre-trained classification model for classification processing to obtain a classification result, output by the classification model, for the multi-modal network media data; wherein the classification model is a deep learning model comprising a self-attention coding layer; the self-attention coding layer is used for learning the relations among the network media data of the respective modalities; and the classification result is used for indicating the violation type of the multi-modal network media data;
and performing supervision processing on the multi-modal network media data based on the classification result.
2. The method of claim 1, wherein inputting the feature matrix into a pre-trained classification model for classification processing to obtain a classification result, output by the classification model, for the multi-modal network media data comprises:
inputting the feature matrix into the self-attention coding layer in the pre-trained classification model, and coding the feature matrix to obtain a coding result indicating the relations among the network media data of the respective modalities;
and classifying the multi-modal network media data based on the coding result to obtain the classification result, output by the classification model, for the multi-modal network media data.
3. The method of claim 2, wherein inputting the feature matrix into a self-attention coding layer in the pre-trained classification model, and coding the feature matrix to obtain a coding result indicating a relationship between network media data of respective modalities comprises:
generating a first matrix, a second matrix and a third matrix corresponding to each row vector in the feature matrix;
respectively taking each row vector in the feature matrix as a target row vector, and calculating a weight coefficient corresponding to the target row vector based on a first matrix corresponding to the target row vector and a second matrix corresponding to all the row vectors, wherein the weight coefficient corresponding to the target row vector is used for representing the correlation degree between elements in the target row vector and elements of other row vectors;
and adjusting the target row vector according to the weight coefficient to obtain a coding result.
4. The method of claim 3, wherein the adjusting the target row vector according to the weight coefficient to obtain the encoding result comprises:
calculating a weighting result of a third matrix corresponding to the target row vector and a weight coefficient corresponding to the target row vector;
and performing addition operation on the weighting results respectively corresponding to the row vectors in the feature matrix to obtain the coding result.
5. The method of claim 3, wherein calculating the weight coefficient corresponding to the target row vector based on the first matrix corresponding to the target row vector and the second matrices corresponding to all row vectors comprises:
performing dot product operation on a first matrix corresponding to the target row vector and second matrices corresponding to all row vectors to obtain a weight matrix corresponding to the target row vector;
and carrying out normalization processing on the weight matrix to obtain a weight coefficient corresponding to the target row vector.
6. The method of claim 1, wherein the multimodal network media data comprises at least two of text media data, picture media data, audio media data, and video media data.
7. The method of claim 1, wherein generating a feature matrix according to the network media data of each modality in the multi-modal network media data comprises: generating, based on the network media data of each modality in the multi-modal network media data, a feature vector corresponding to the network media data of that modality;
and generating the feature matrix based on the feature vectors corresponding to the network media data of the respective modalities.
8. An apparatus for supervising network media data, the apparatus comprising:
the data acquisition unit is used for acquiring multi-modal network media data;
the feature matrix generating unit is used for generating a feature matrix according to the network media data of each modality in the multi-modal network media data; each row vector in the feature matrix corresponds to the feature vector of one modality;
the classification processing unit is used for inputting the feature matrix into a pre-trained classification model for classification processing to obtain a classification result, output by the classification model, for the multi-modal network media data; wherein the classification model is a deep learning model comprising a self-attention coding layer; the self-attention coding layer is used for learning the relations among the network media data of the respective modalities; and the classification result is used for indicating the violation type of the multi-modal network media data;
and the supervision processing unit is used for carrying out supervision processing on the multi-modal network media data based on the classification result.
9. An electronic device, comprising a communication interface, a processor, a memory and a bus, wherein the communication interface, the processor and the memory are connected to each other through the bus;
the memory stores machine-readable instructions, and the processor performs the method of any one of claims 1-7 by calling the machine-readable instructions.
10. A computer-readable storage medium having stored thereon machine-readable instructions which, when invoked and executed by a processor, implement the method of any of claims 1-7.
CN202211212072.3A 2022-09-30 2022-09-30 Network media data supervision method and device, electronic equipment and readable storage medium Pending CN115809429A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211212072.3A CN115809429A (en) 2022-09-30 2022-09-30 Network media data supervision method and device, electronic equipment and readable storage medium


Publications (1)

Publication Number Publication Date
CN115809429A (en) 2023-03-17

Family

ID=85482700

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211212072.3A Pending CN115809429A (en) 2022-09-30 2022-09-30 Network media data supervision method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN115809429A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination