CN115775363A - Illegal video detection method based on text and video fusion - Google Patents

Illegal video detection method based on text and video fusion

Info

Publication number
CN115775363A
Authority
CN
China
Prior art keywords
text
video
data
fusion
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210453529.3A
Other languages
Chinese (zh)
Inventor
于碧辉
魏靖烜
王克
刘畅
贺少婕
孙华军
卜立平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Institute of Computing Technology of CAS
Original Assignee
Shenyang Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Institute of Computing Technology of CAS filed Critical Shenyang Institute of Computing Technology of CAS
Priority to CN202210453529.3A
Publication of CN115775363A
Pending legal-status Critical Current


Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T — CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 — Road transport of goods or passengers
    • Y02T 10/10 — Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 — Engine management systems

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a violation video detection method based on text and video fusion. The method comprises the following steps: acquiring video from a live-streaming platform together with the text data of the corresponding live-stream topic; processing the acquired video and text data with a two-stream model and feeding the processed data into a feature extraction layer for shallow information extraction; passing the collected feature information to a deep feature extractor that learns the contextual associations of the features extracted by the previous layer, so that the deep semantic relations expressed by consecutive frames are learned effectively; sending the extracted deep semantic relations to a fusion layer for information fusion, so that violation videos are identified better through modal complementarity; and finally feeding the fused features into a fully connected layer and outputting the final classification information. The method can identify violations in internet live streams promptly and accurately, saves labor cost, can be generalized to related fields, and adapts to a variety of text and video classification tasks with good extensibility.

Description

Illegal video detection method based on text and video fusion
Technical Field
The invention relates to image and text detection, in particular to a violation video detection method based on text and video fusion.
Background
In recent years, watching webcasts has become one of the ways internet users entertain and relax. On live-streaming platforms, streams containing sensitive imagery and bullet-screen (danmaku) text that violate laws, regulations and public morality have gradually appeared as broadcasters try to attract viewers with different kinds of content. The core means of such streams is to carry the violating content over the platform's real-time video stream. Experience with live-streaming platforms shows that when violating content appears during a broadcast, the popularity and heat of the live room rise, the volume of bullet-screen comments grows, and comments referring to the violating content are mixed in. Therefore, when auditing webcast video content, the live video can be fused with the corresponding live-topic text data for multi-modal violation detection.
At present, internet live-stream violation detection generally relies on manual screening, or on simple image recognition applied to images as a whole. In practice, this approach has the following problems:
1. The manual review workload is enormous, and effective review of every video is hard to guarantee;
2. Manual review is highly subjective, and its quality is hard to guarantee;
3. Existing image recognition and classification methods only consider parameters such as the skin-color ratio of the image in a video frame, ignore the high-level semantic features of the video, and do not use the information in the corresponding topic text when a sensitive picture appears in a stream.
Disclosure of Invention
The invention aims to overcome the high cost and the poor timeliness and accuracy of the prior art by providing a violation video detection algorithm that fuses text and video: the extracted deep semantic relations between video and text are sent to a fusion layer for information fusion, so that text information and video information are combined and violation videos are identified better through modal complementarity. Studying internet live-stream violations with a text-and-video fusion method has significant research value, saves labor cost, and offers high stability, simple operation and good real-time performance.
To achieve this purpose, the invention designs a violation video detection algorithm fusing text and video that can accurately identify violating content in webcasts, imitations of such actions and behaviors, and dance performances containing such actions.
The technical scheme adopted by the invention to realize this purpose is as follows: a violation video detection method based on text and video fusion, comprising the following steps:
data acquisition and preprocessing step: crawl video stream data from webcasts and apply denoising preprocessing to obtain single-frame pictures and the text data appearing on them; the video stream data contains image picture data and text data;
dataset construction step: classify and label the single-frame pictures and text data as violating or non-violating actions or text, and build a labelled picture-text pair sample set;
network model construction step: build a network model comprising a shallow feature extraction module, a deep feature extraction module, a feature fusion module, a fully connected module and a classifier module connected in sequence; repeatedly train the network on batches of picture-text samples from the sample set, iteratively tune the model and set a violation judgment probability threshold until an optimized network model is obtained (a minimal training sketch follows these steps); the model judges whether consecutive multi-frame dynamic picture-text inputs contain violating actions or text information;
real-time video detection step: acquire a video stream in real time, preprocess it, input it into the optimized network model, and obtain the violation judgment result.
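As a hedged illustration of the training step above, the following is a minimal PyTorch sketch of the batched training loop with a tuned violation probability threshold; the loss choice, optimizer settings, model interface and the 0.5 threshold are illustrative assumptions, not values given by the patent.

```python
import torch
import torch.nn as nn

def train_model(model: nn.Module, loader, epochs: int = 10) -> None:
    """Train the fused text-video classifier on batched picture-text pairs."""
    criterion = nn.BCELoss()                       # binary: violation vs. non-violation
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    model.train()
    for _ in range(epochs):
        for frames, text_ids, labels in loader:    # one batch of picture-text samples
            optimizer.zero_grad()
            probs = model(frames, text_ids)        # sigmoid probability in [0, 1]
            loss = criterion(probs.squeeze(-1), labels.float())
            loss.backward()
            optimizer.step()

# A violation judgment probability threshold is set during iterative tuning;
# 0.5 is an assumed starting value, not a value stated in the patent.
VIOLATION_THRESHOLD = 0.5
```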
The image sample dataset and the text sample dataset are divided into a training set, a validation set and a test set;
the denoising preprocessing comprises frame extraction and text cleaning:
noise is removed from the text data;
and the video data is cut into frames, with consecutive frames extracted and stored every 10 s (a preprocessing sketch follows).
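A minimal preprocessing sketch, assuming OpenCV for frame extraction and a simple regular expression for text cleaning; only the 10 s sampling interval comes from the text above, everything else is an assumption.

```python
import re
import cv2  # OpenCV, assumed here as the frame-extraction library

def clean_text(raw: str) -> str:
    """Remove cluttered symbols, keeping CJK characters, letters and digits."""
    return re.sub(r"[^\u4e00-\u9fffA-Za-z0-9 ]+", "", raw)

def extract_frames(video_path: str, interval_s: float = 10.0) -> list:
    """Extract one frame every `interval_s` seconds, per the 10 s rule above."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0        # fall back if FPS is unknown
    step = max(1, int(fps * interval_s))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)                   # stored for the feature extractor
        idx += 1
    cap.release()
    return frames
```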
The shallow feature extraction module comprises: a BERT model for extracting shallow semantic text features of the current sentence, and a ResNet model for extracting shallow semantic features of video frame pictures;
the deep feature extraction module comprises: a TextCNN model and an LSTM model that continue learning from the shallow text and video features, producing deep text semantic features and deep video-frame semantic features that carry the contextual associations of consecutive video frames and text content;
the feature fusion module is a self-attention module that fuses the acquired deep video-frame features and text features;
the fully connected module splices the output features;
and the classifier module uses a Sigmoid function to score the probability of the current batch of multi-frame picture-text inputs and outputs a classification result indicating whether a violation occurs.
The ResNet model for extracting shallow video-frame picture features comprises the following: shallow video-frame feature extraction is performed with a ResNet152 network that has an embedded SE (squeeze-and-excitation) module; the embedded SE module readjusts the features, uses the extracted global information to measure the importance of the extracted features, and computes the relevance of each channel (see the SE-block sketch below).
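The SE module described here matches the standard squeeze-and-excitation design; below is a minimal PyTorch sketch of such a block, with the reduction ratio of 16 assumed as the common default rather than taken from the patent.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)        # squeeze: extract global information
        self.fc = nn.Sequential(                   # excitation: per-channel relevance
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                               # readjust features channel by channel
```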
The BERT model for extracting the shallow text features comprises the following:
BERT is formed by stacking multiple Transformer encoder layers that process its input; each encoder layer consists of a multi-head attention layer and a feed-forward neural network, the model has 12 layers, and each layer has 12 attention heads;
the input of BERT is the sum of the word encodings, position encodings and segment encodings extracted from the input text information (a sketch of this input construction follows).
The LSTM model uses a bidirectional LSTM over the shallow video features, extracting the contextual associations of consecutive video frames to obtain deeper semantic features (a sketch follows).
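A minimal sketch of such a bidirectional LSTM over per-frame features; the 2048-dimensional input matches ResNet152's final pooled feature size, while the hidden size is an assumption.

```python
import torch
import torch.nn as nn

class FrameBiLSTM(nn.Module):
    def __init__(self, feat_dim: int = 2048, hidden: int = 512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                            bidirectional=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim) from SE-ResNet152
        out, _ = self.lstm(frame_feats)            # (batch, num_frames, 2 * hidden)
        return out                                 # context-aware deep frame features
```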
The TextCNN is divided into an encoding layer, a convolutional layer, a max-pooling layer and a fully connected layer, and learns the contextual relations of the text to extract deeper text features (a sketch follows).
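Below is a minimal TextCNN sketch with the four layers named above; the kernel sizes, filter count and output dimension are common defaults assumed for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, in_dim: int = 768, n_filters: int = 128,
                 kernel_sizes=(2, 3, 4), out_dim: int = 256):
        super().__init__()
        # Convolutional layer: one branch per kernel size over the token axis.
        self.convs = nn.ModuleList(
            [nn.Conv1d(in_dim, n_filters, k) for k in kernel_sizes])
        # Fully connected layer producing the deep text feature.
        self.fc = nn.Linear(n_filters * len(kernel_sizes), out_dim)

    def forward(self, token_feats: torch.Tensor) -> torch.Tensor:
        # token_feats: (batch, seq_len, in_dim), e.g. the encoding-layer output.
        x = token_feats.transpose(1, 2)            # (batch, in_dim, seq_len)
        pooled = []
        for conv in self.convs:
            h = F.relu(conv(x))                    # (batch, n_filters, L')
            pooled.append(h.max(dim=-1).values)    # max pooling over time
        return self.fc(torch.cat(pooled, dim=1))   # deep text feature
```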
Learning the deep video features and the deep text features to obtain deep fusion features specifically comprises:
in the early fusion, the input feature map is learned autonomously based on self-attention and weights are assigned, so the important information points in the feature map are obtained, dependence on external information is reduced, and the deep video features and deep text features are aligned and fused;
the fused result then passes through a fully connected layer followed by a Sigmoid, which classifies the currently given video pictures containing text information and judges whether they violate the rules (see the fusion sketch below).
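A hedged sketch of this early-fusion stage: the deep video and text feature sequences are concatenated and weighted by self-attention, then passed through the fully connected layer and Sigmoid. The dimensions, head count and mean pooling are assumptions, not choices stated in the patent.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fc = nn.Linear(dim, 1)                # fully connected layer

    def forward(self, video_feats: torch.Tensor,
                text_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, Tv, dim); text_feats: (batch, Tt, dim)
        seq = torch.cat([video_feats, text_feats], dim=1)  # early fusion
        fused, _ = self.attn(seq, seq, seq)        # self-attention: Q = K = V
        pooled = fused.mean(dim=1)                 # aggregate the fused sequence
        return torch.sigmoid(self.fc(pooled))      # violation probability
```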
A multi-modal violation video detection system based on text and video fusion, the system comprising:
a data acquisition and preprocessing program module: crawls video stream data from webcasts and applies denoising preprocessing to obtain single-frame pictures and the text data appearing on them; the video stream data contains image picture data and text data;
a dataset construction program module: classifies and labels the single-frame pictures and text data as violating or non-violating actions or text, and builds a labelled picture-text pair sample set;
a network model construction program module: builds a network model comprising a shallow feature extraction module, a deep feature extraction module, a feature fusion module, a fully connected module and a classifier module connected in sequence; repeatedly trains the network on batches of picture-text samples from the sample set, iteratively tunes the model and sets a violation judgment probability threshold until an optimized network model is obtained; the model judges whether consecutive multi-frame dynamic picture-text inputs contain violating actions or text information;
a real-time video detection program module: acquires a video stream in real time, preprocesses it, inputs it into the optimized network model, and obtains the violation judgment result.
A multi-modal violation video detection device based on text and video fusion, comprising: a server-side processor and a computer-readable storage medium; the processor is configured to execute instructions, and the computer-readable storage medium stores a plurality of instructions adapted to be loaded by the processor to perform the multi-modal violation video detection method based on text and video fusion described above.
The invention has the following beneficial effects and advantages:
1. The method adopts a violation video detection algorithm based on text-video fusion: live video data and text data such as the live topic are acquired by a crawler and used for training, effectively improving the accuracy of violation detection in live streams.
2. The invention extracts text features with BERT + TextCNN. The BERT model extracts information relevant to the current stream, but the meanings expressed in live streams tend to be vague, so a TextCNN model afterwards extracts deeper semantic features, bringing the extracted content closer to the meaning expressed by the live-stream topic.
3. The ResNet network used by the invention is a ResNet152 with an embedded SE module. The SE module readjusts the features, measures the importance of the extracted features using the extracted global information, and computes the relevance of each channel, aiding feature extraction and the understanding of the content before and after a video frame.
4. The invention uses an LSTM model to extract deep semantic features from the video features: video frames may carry continuous semantic information that has not been learned after SE-ResNet extracts the image features, so an LSTM model follows to extract those deep semantic features.
5. The algorithm has high recognition accuracy and is suitable for wide deployment in today's live-streaming era.
Drawings
FIG. 1 is a model structure diagram of the text-video fusion violation video detection algorithm according to the present invention;
FIG. 2 is a schematic flow diagram of the text-video fusion violation video detection algorithm provided by the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention more comprehensible, embodiments of the present invention are described in detail below with reference to the accompanying drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein, but rather should be construed as modified in the spirit and scope of the present invention as set forth in the appended claims.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
A violation video is defined as follows: sensitive pictures and bullet-screen text that violate laws, regulations or public morality, or imitations of such actions and behaviors, including dance performances containing such actions. For such live content, violation detection is performed by combining the live topic with the video semantics.
The violation video detection system is internally provided with: a data acquisition and storage module, which prepares samples of live videos and their corresponding topic texts; a data preprocessing module, which performs frame extraction on the video and text cleaning; a text-feature and video-frame-feature extraction module, which extracts the live video content and the semantics of the corresponding live topic text; a deep feature extraction module, which performs deep association extraction over the video and text semantic information; and a feature fusion module, which fuses the video and text semantic information. The detection model installed in the system is a violation video detection model that fuses text and video multi-modal information.
The text-video multi-modal violation video detection model adopts TextCNN, a bidirectional LSTM, and models pre-trained on large-scale image and text datasets.
The weight parameters of the model installed in the violation video detection system are obtained by training the detection model on normal and abnormal video stream data from an internet live-streaming platform.
The video acquisition program used in the violation video detection system collects video stream data publicly shared by internet platforms. The video parsing program processes video streams into image data frame by frame and compresses the analysis workload through sparse sampling. The video abnormal-behavior detection program takes the key video frames produced by the parsing program and the corresponding live text topic as input, performs recognition and detection with a deep learning neural network model, and finally outputs whether the input video is abnormal.
The text-video fusion violation video detection model of the present invention is described in detail below with reference to fig. 1.
The invention extracts text features with BERT + TextCNN and video-frame features with ResNet152 + bidirectional LSTM, and fuses the deep feature semantics with an early-fusion strategy. The early fusion uses self-attention to fuse the video-frame features and the text features: self-attention learns the input feature map autonomously and assigns weights, so the important information points in the feature map are found, dependence on external information is reduced, and feature alignment and fusion are achieved, with the network focusing more on capturing the internal relevance of the information. After the final model fusion, classification is performed through a fully connected layer, effectively classifying violating texts and videos.
After the input text and video are preprocessed, cluttered symbols, uncommon characters and the like are removed from the text, which can then be treated as clean text and passed to the text feature extractor for text feature extraction. After the input video is preprocessed, consecutive frames are extracted every 10 s as a representation of the video content, and the processed data are stored in corresponding files for the video feature extractor to read.
For text feature extraction, BERT extracts features from information such as the live topic; because the topic contains the main content of the stream and correlates strongly with the rest of the live video content, the live topic is chosen as the text input. After input, the BERT model extracts information relevant to the current stream, but the meanings expressed in live streams tend to be vague, so a TextCNN model afterwards extracts deeper semantic features, bringing the extracted content closer to the meaning expressed by the live-stream topic. BERT is formed by stacking multiple Transformer encoder layers; each encoder layer consists of a multi-head attention layer and a feed-forward neural network, the model has 12 layers, and each layer has 12 attention heads. The input of BERT is the sum of three embeddings: word encoding, position encoding and segment encoding. TextCNN has four layers, namely an encoding layer, a convolutional layer, a max-pooling layer and a fully connected layer, so it learns the contextual relations of the text better and extracts text features at a deeper level.
For video feature extraction, the preprocessed video frames are fed to a ResNet152 network serving as the frame feature extraction network. The ResNet network used by the invention is a ResNet152 with an embedded SE module: the SE module readjusts the features, measures the extracted features against the extracted global information and computes the relevance of each channel, aiding feature extraction and the understanding of the text content before and after a video frame. After SE-ResNet extracts the image features, the continuous semantic information that video frames may contain has not yet been learned, so an LSTM model then extracts the deep semantic features of the video features.
Once the deep text semantic features and the deep image semantic features are extracted, they are fused with an early-fusion strategy, using self-attention to fuse the video-frame features and the text features. After fusion, a fully connected layer followed by a sigmoid performs classification, judging whether the given text information and the corresponding video constitute a violating video. (An end-to-end wiring sketch of this pipeline follows.)
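Below is a hedged end-to-end wiring sketch of this pipeline. To stay brief it substitutes a plain torchvision ResNet152 for the SE-embedded variant and a linear projection for TextCNN; the `bert-base-chinese` checkpoint and all hyper-parameters are assumptions, not choices stated in the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet152
from transformers import BertModel

class ViolationDetector(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        cnn = resnet152(weights="IMAGENET1K_V1")   # stand-in for SE-ResNet152
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])  # 2048-d per frame
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        self.frame_lstm = nn.LSTM(2048, dim // 2, batch_first=True,
                                  bidirectional=True)          # deep video features
        self.text_proj = nn.Linear(768, dim)       # simplified stand-in for TextCNN
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)              # fully connected layer + sigmoid

    def forward(self, frames, input_ids, attention_mask):
        # frames: (batch, num_frames, 3, 224, 224); ids/mask: tokenized live topic
        b, t = frames.shape[:2]
        f = self.cnn(frames.flatten(0, 1)).flatten(1).view(b, t, -1)
        video, _ = self.frame_lstm(f)              # contextual frame features
        text = self.text_proj(
            self.bert(input_ids, attention_mask=attention_mask).last_hidden_state)
        seq = torch.cat([video, text], dim=1)      # early fusion of both modalities
        fused, _ = self.attn(seq, seq, seq)        # self-attention: Q = K = V
        return torch.sigmoid(self.head(fused.mean(dim=1)))
```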
The detection process of the text-video fusion violation video detection is described in detail below with reference to fig. 2.
1. Data acquisition and storage module: violating and non-violating videos in webcasts are crawled, labelled as training, validation and test sets respectively, and stored. After labelling, the data are manually audited to check whether the labels are reasonable and valid.
2. Data preprocessing module: the input video and text data both contain considerable noise, and the video is a continuous segment, so the noise is removed from the text data and the video data is cut into consecutive frames, finally yielding the data samples the model requires.
3. Text-feature and video-frame-feature extraction module: BERT performs feature extraction on the text and SE-ResNet on the video frames, extracting the semantic features of the current text sentence and of the current video frame.
4. Deep feature extraction module: TextCNN is applied to the text and a bidirectional LSTM to the video frames, extracting the contextual associations of consecutive video frames and text content to obtain deeper semantic features.
5. Feature fusion module: the extracted deep text and image features are fused by the feature fusion module, using self-attention in an early-fusion manner to fuse the video-frame features and the text features.
6. Classification module: the fused features are finally fed into a fully connected layer and classified through a Sigmoid, judging whether the given video and text violate the rules (an inference sketch follows).
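A small inference sketch for this final step, assuming a batch of one and a 0.5 decision threshold (the patent only states that a threshold is set during tuning):

```python
import torch

@torch.no_grad()
def detect(model, frames, input_ids, attention_mask,
           threshold: float = 0.5) -> bool:        # threshold value is assumed
    """Return True if the clip-plus-topic pair is judged a violation."""
    model.eval()
    prob = model(frames, input_ids, attention_mask)  # sigmoid output, shape (1, 1)
    return bool((prob >= threshold).item())
```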
The invention discloses a multi-modal violation video detection device based on text and video fusion, comprising: a server-side processor and a computer-readable storage medium; the processor is configured to execute instructions, and the computer-readable storage medium stores a plurality of instructions adapted to be loaded by the processor to execute the multi-modal violation video detection method based on text and video fusion.
The logic instructions in the computer-readable storage medium according to the present invention may be implemented as software functional units and stored in a computer-readable storage medium when sold or used as independent products. Based on this understanding, the technical solution of the present invention may be embodied as a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
The embodiments described above assist those skilled in the art in further understanding the invention but do not limit the invention in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention, all of which fall within the scope of the present invention.

Claims (10)

1. A violation video detection method based on text and video fusion, characterized by comprising the following steps:
a data acquisition and preprocessing step: crawling video stream data from webcasts and applying denoising preprocessing to obtain single-frame pictures and the text data appearing on them, the video stream data containing image picture data and text data;
a dataset construction step: classifying and labelling the single-frame pictures and text data as violating or non-violating actions or text, and building a labelled picture-text pair sample set;
a network model construction step: building a network model comprising a shallow feature extraction module, a deep feature extraction module, a feature fusion module, a fully connected module and a classifier module connected in sequence; repeatedly training the network on batches of picture-text samples from the sample set, iteratively tuning the model and setting a violation judgment probability threshold until an optimized network model is obtained, the model judging whether consecutive multi-frame dynamic picture-text inputs contain violating actions or text information;
a real-time video detection step: acquiring a video stream in real time, preprocessing it, inputting it into the optimized network model, and obtaining the violation judgment result.
2. The method according to claim 1, characterized in that the image sample dataset and the text sample dataset are divided into a training set, a validation set and a test set;
the denoising preprocessing comprises frame extraction and text cleaning:
denoising the text data;
and cutting the video data into frames, with consecutive frames extracted and stored every 10 s.
3. The violation video detection method based on text and video fusion according to claim 1, characterized in that:
the shallow feature extraction module comprises a BERT model for extracting shallow semantic text features of the current sentence and a ResNet model for extracting shallow semantic features of video frame pictures;
the deep feature extraction module comprises a TextCNN model and an LSTM model that continue learning from the shallow video and text features to obtain deep text semantic features and deep video-frame semantic features carrying the contextual associations of consecutive video frames and text content;
the feature fusion module is a self-attention module that fuses the acquired deep video-frame features and text features;
the fully connected module splices the output features;
and the classifier module uses a Sigmoid function to score the probability of the current batch of multi-frame picture-text inputs and outputs a classification result indicating whether a violation occurs.
4. The method according to claim 3, characterized in that the ResNet model for extracting shallow video-frame picture features comprises: a ResNet152 network with an embedded SE module performing shallow video-frame feature extraction; the embedded SE module readjusts the features, uses the extracted global information to measure the importance of the extracted features, and computes the relevance of each channel.
5. The violation video detection method based on text-video fusion according to claim 3, characterized in that the BERT model for extracting shallow text features comprises:
BERT formed by stacking multiple Transformer encoder layers that process its input, each encoder layer consisting of a multi-head attention layer and a feed-forward neural network, the model having 12 layers with 12 attention heads per layer;
the input of BERT being the sum of the word encodings, position encodings and segment encodings extracted from the input text information.
6. The violation video detection method based on text and video fusion according to claim 3, characterized in that the LSTM model uses a bidirectional LSTM over the shallow video features, extracting the contextual associations of consecutive video frames to obtain deeper semantic features.
7. The violation video detection method based on text and video fusion according to claim 3, characterized in that the TextCNN comprises an encoding layer, a convolutional layer, a max-pooling layer and a fully connected layer, and learns the contextual relations of the text to extract deeper text features.
8. The violation video detection method based on text-video fusion according to claim 3, characterized in that learning the deep video features and the deep text features to obtain deep fusion features specifically comprises:
in the early fusion, learning the input feature map autonomously based on self-attention and assigning weights, obtaining the important information points in the feature map, reducing dependence on external information, and aligning and fusing the deep video features and deep text features;
and passing the result through a fully connected layer followed by a Sigmoid to classify the currently given video pictures containing text information and judge whether they violate the rules.
9. A multi-modal violation video detection system based on text and video fusion, the system comprising:
a data acquisition and preprocessing program module: crawling video stream data from webcasts and applying denoising preprocessing to obtain single-frame pictures and the text data appearing on them, the video stream data containing image picture data and text data;
a dataset construction program module: classifying and labelling the single-frame pictures and text data as violating or non-violating actions or text, and building a labelled picture-text pair sample set;
a network model construction program module: building a network model comprising a shallow feature extraction module, a deep feature extraction module, a feature fusion module, a fully connected module and a classifier module connected in sequence; repeatedly training the network on batches of picture-text samples from the sample set, iteratively tuning the model and setting a violation judgment probability threshold until an optimized network model is obtained, the model judging whether consecutive multi-frame dynamic picture-text inputs contain violating actions or text information;
a real-time video detection program module: acquiring a video stream in real time, preprocessing it, inputting it into the optimized network model, and obtaining the violation judgment result.
10. A multi-modal violation video detection device based on text and video fusion, comprising: a server-side processor and a computer-readable storage medium; the processor is configured to execute instructions, and the computer-readable storage medium stores a plurality of instructions adapted to be loaded by the processor to perform the text and video fusion based multi-modal violation video detection method of any of claims 1-8.
CN202210453529.3A 2022-04-27 2022-04-27 Illegal video detection method based on text and video fusion Pending CN115775363A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210453529.3A CN115775363A (en) 2022-04-27 2022-04-27 Illegal video detection method based on text and video fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210453529.3A CN115775363A (en) 2022-04-27 2022-04-27 Illegal video detection method based on text and video fusion

Publications (1)

Publication Number Publication Date
CN115775363A true CN115775363A (en) 2023-03-10

Family

ID=85388249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210453529.3A Pending CN115775363A (en) 2022-04-27 2022-04-27 Illegal video detection method based on text and video fusion

Country Status (1)

Country Link
CN (1) CN115775363A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116778438A (en) * 2023-08-17 2023-09-19 苏州市吴江区盛泽镇人民政府 Illegal forklift detection method and system based on large language model
CN116778438B (en) * 2023-08-17 2023-11-14 苏州市吴江区盛泽镇人民政府 Illegal forklift detection method and system based on large language model
CN116996470A (en) * 2023-09-27 2023-11-03 创瑞技术有限公司 Rich media information sending system
CN116996470B (en) * 2023-09-27 2024-02-06 创瑞技术有限公司 Rich media information sending system
CN117478838A (en) * 2023-11-01 2024-01-30 珠海经济特区伟思有限公司 Distributed video processing supervision system and method based on information security
CN117478838B (en) * 2023-11-01 2024-05-28 珠海经济特区伟思有限公司 Distributed video processing supervision system and method based on information security

Similar Documents

Publication Publication Date Title
CN110020437B (en) Emotion analysis and visualization method combining video and barrage
CN115775363A (en) Illegal video detection method based on text and video fusion
CN112465008B (en) Voice and visual relevance enhancement method based on self-supervision course learning
CN111339305B (en) Text classification method and device, electronic equipment and storage medium
CN110223675B (en) Method and system for screening training text data for voice recognition
CN106708949A (en) Identification method of harmful content of video
CN112131347A (en) False news detection method based on multi-mode fusion
CN115994230A (en) Intelligent archive construction method integrating artificial intelligence and knowledge graph technology
CN111488487B (en) Advertisement detection method and detection system for all-media data
CN115982350A (en) False news detection method based on multi-mode Transformer
CN114170411A (en) Picture emotion recognition method integrating multi-scale information
CN113469214A (en) False news detection method and device, electronic equipment and storage medium
CN116796251A (en) Poor website classification method, system and equipment based on image-text multi-mode
CN115775349A (en) False news detection method and device based on multi-mode fusion
CN114548274A (en) Multi-modal interaction-based rumor detection method and system
CN111986259A (en) Training method of character and face detection model, auditing method of video data and related device
CN114912026B (en) Network public opinion monitoring analysis processing method, equipment and computer storage medium
CN116977701A (en) Video classification model training method, video classification method and device
CN113361615B (en) Text classification method based on semantic relevance
CN117009577A (en) Video data processing method, device, equipment and readable storage medium
CN115690636A (en) Intelligent detection system for internet live broadcast abnormal behaviors
CN113095319A (en) Multidirectional scene character detection method and device based on full convolution angular point correction network
CN111813996A (en) Video searching method based on sampling parallelism of single frame and continuous multi-frame
CN111666437A (en) Image-text retrieval method and device based on local matching
CN113255461B (en) Video event detection and semantic annotation method and device based on dual-mode depth network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination