CN112989950A - Violent video recognition system oriented to multi-mode feature semantic correlation features - Google Patents
- Publication number
- CN112989950A (application number CN202110185761.9A)
- Authority
- CN
- China
- Prior art keywords
- video
- module
- violent
- information
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/60—Control of cameras or camera modules
- H04N23/698—Control of cameras or camera modules for achieving an enlarged field of view, e.g. panoramic image capture
Abstract
The invention discloses a violent video recognition system oriented to multi-modal feature semantic correlation features, which comprises a video acquisition module, a video segmentation module, a video processing module and a violent video recognition module that are connected in sequence. The system can identify violent videos through the semantic association of three modal features, namely visual, auditory and text information, and offers the advantages of accuracy and high recognition efficiency.
Description
Technical Field
The invention relates to the technical field of recognition, and in particular to a violent video recognition system oriented to multi-modal feature semantic association features.
Background
With the rapid development of video network technology, people are exposed to a wide variety of videos, including violent videos. The existence of violent videos harms people's physical and mental health, so violent videos need to be searched for and detected when people watch online videos or when staff review surveillance footage, and such searching usually requires video recognition. At present, the recognition mode for violent videos is single, the preparation of the recognition files and the recognition environment is complex, and violent videos cannot be identified accurately, which makes their recognition difficult.
Therefore, a violent video recognition system oriented to multi-modal feature semantic association features is urgently needed to solve these problems.
Disclosure of Invention
In view of the above, the present invention provides a violent video recognition system oriented to the multi-modal feature semantic association feature, so as to solve the above technical problems.
In order to achieve the purpose, the invention provides the following technical scheme:
A violent video recognition system oriented to multi-modal feature semantic association features comprises a video acquisition module, a video segmentation module, a video processing module and a violent video recognition module, which are connected in sequence;
the video acquisition module is used for acquiring video information, and the video information comprises violent video information and non-violent video information;
the video segmentation module is used for segmenting the video information obtained by the video acquisition module into a plurality of video shots according to a video segmentation technology, and extracting relevant visual features, audio features and text features from each video shot to obtain corresponding image information to be identified, audio information to be identified and text information to be identified;
the video processing module is used for performing video preprocessing on the plurality of segmented video shots, and comprises an image processing module, an audio processing module and a text processing module;
the violence video identification module is used for judging whether the video information processed by the video processing module belongs to violence video information or not, and comprises a violence audio judgment module, a violence image judgment module and a violence text judgment module.
Further, the violent video recognition system oriented to the multi-modal feature semantic association features further comprises a common violent scene module used for storing a violent scene template, and the common violent scene module is connected with the violent video recognition module.
Further, the common violent scene template comprises common violent scene audio characteristic information, common violent scene image characteristic information and common violent scene text characteristic information.
Further, the image processing module is configured to pre-process the images in each video shot, and comprises an image deduplication module, an image gray scale calculation module and an image contrast enhancement module; the image deduplication module is configured to remove overlapping information from the image information in the video shot, the gray scale calculation module is configured to calculate the gray scale values of the image, the image contrast enhancement module is configured to enhance the gray scale values of the image, and the three modules are connected in sequence.
Further, the audio processing module is configured to pre-process the audio information in each video shot, and comprises a low-frequency filtering module configured to remove the low-frequency components of the audio information in the video shot.
Further, the text processing module is used for preprocessing the text information in each video shot, and comprises a text denoising module used for removing redundant noise from the text information in the video shot.
Further, the judgment method of the violent audio judgment module is as follows: firstly, the processed audio feature information is fused with the audio features of common violent scenes to obtain fused audio feature information; secondly, a classifier compares the common violent scene audio feature information with the processed audio feature information, and the processed audio feature information that matches common violent scene audio information is marked as violent audio information. The judgment methods of the violent image judgment module and the violent text judgment module are the same as that of the violent audio judgment module.
Further, the violent video information comprises at least one of marked violent audio characteristic information, marked violent image characteristic information and marked violent text characteristic information.
Further, the violent video identification system facing the multi-modal characteristic semantic correlation characteristics further comprises a timing starting module, wherein the timing starting module is used for starting a video system at regular time to carry out violent video identification, and the timing starting module is connected with the video acquisition module.
The technical scheme can show that the invention has the advantages that:
1. the video is divided into a plurality of video shots by a video segmentation technology, each video shot comprises image information to be identified, audio information to be identified and text information to be identified, and each shot is processed and identified separately so as to achieve accurate recognition;
2. the features in each video shot are extracted by combining image, audio and text, and the multi-modal features are identified jointly, so that violent videos are identified more accurately and the practicability of the system is improved;
3. by providing the timing starting module, the violent video recognition system can be opened or closed at a set time without manual operation, realizing intelligent starting and stopping of the system.
In addition to the objects, features and advantages described above, the present invention has other objects, features and advantages, which will be described in further detail below with reference to the drawings.
Drawings
In the drawings:
fig. 1 is a schematic structural diagram of a violent video recognition system oriented to multi-modal feature semantic association features.
FIG. 2 is a step diagram of a video segmentation technique in a violent video recognition system oriented to multi-modal feature semantic related features according to the present invention.
FIG. 3 is a step diagram of acquiring clear video shots in the violent video recognition system oriented to multi-modal feature semantic relation features.
FIG. 4 is a structural diagram of the components of a video shot in the violent video identification system facing the multi-modal characteristic semantic relation characteristic.
FIG. 5 is a schematic step diagram of a violent video feature recognition system oriented to the multi-modal feature semantic relation features.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The violent video recognition system oriented to the multi-modal feature semantic association features comprises a timing starting module, a video acquisition module, a video segmentation module, a video processing module, a common violent scene module and a violent video recognition module. The timing starting module starts the video recognition system at a set time; the video segmentation module segments the video information received by the video acquisition module into video shots; the video processing module processes the received video shots; and the violent video recognition module determines whether the processed video belongs to a violent video by comparing it with the common violent scene module. The timing starting module, the video acquisition module, the video segmentation module, the video processing module and the violent video recognition module are connected in sequence, and the common violent scene module is connected with the violent video recognition module.
The timing starting module is used for starting the video acquisition module at a fixed time, and is connected with the video acquisition module.
Specifically, the timing starting module comprises a timing starter that controls the switch of the violence recognition system. When the set starting time is reached, the video acquisition module is opened to acquire video information.
The video acquisition module can adopt a plurality of panoramic cameras to simultaneously acquire video data and audio data and acquire video information, and the acquired video information comprises violent video information and non-violent video information.
Specifically, when the timing starting module is opened and a video acquisition request is received, the plurality of panoramic cameras acquire video information. In order to ensure the definition of the video, the video acquisition range of the panoramic camera is 5 m.
The video segmentation module is used for segmenting the video information obtained by the video acquisition module into a plurality of video shots according to a video segmentation technology.
Specifically, as shown in fig. 2, the implementation of video segmentation includes the following two steps:
the first step is as follows: judging the continuity of the acquired video information;
the second step is that: and dividing the shot into a plurality of video shots according to the judgment result, if the video information is continuous, the current video shot is determined, and if the video information is discontinuous, the next video shot is determined.
Generally, video information obtained from a television, a movie or an on-scene panoramic camera is filmed with a plurality of shots; in the common case, the video information within the same shot is continuous while the video information between two shots is discontinuous, so most video information is composed of a plurality of video shots.
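The two-step continuity segmentation described above can be sketched as follows. The frame-difference metric and the threshold value are illustrative assumptions; the patent states only that continuity is judged and shots are cut at discontinuities.

```python
# Illustrative sketch of the two-step segmentation: a small mean
# inter-frame difference means the current shot continues, a large
# one marks a cut and starts the next video shot.

def frame_difference(a, b):
    """Mean absolute pixel difference between two grayscale frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def segment_into_shots(frames, threshold=40.0):
    """Split a frame sequence into shots at discontinuities."""
    shots, current = [], []
    current.append(frames[0])
    for prev, frame in zip(frames, frames[1:]):
        if frame_difference(prev, frame) > threshold:  # discontinuity: cut
            shots.append(current)
            current = []
        current.append(frame)
    shots.append(current)
    return shots

# Two synthetic "shots": three dark frames followed by two bright frames.
frames = [[10] * 16] * 3 + [[200] * 16] * 2
print(len(segment_into_shots(frames)))  # prints 2
```

A production system would compute the difference on downsampled frames or color histograms rather than raw pixels, but the cut-detection logic is the same.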
Preferably, the method for judging the sharpness of the acquired video shots and extracting the video shots with higher sharpness includes the following steps, as shown in fig. 3:
step 1: and intercepting video stream information from the video lens for acquiring the video information.
Specifically, a computer may be used to intercept a continuous video stream to obtain an image, and then analyze the image.
Step 2: judging whether the video stream is in YUV format; if so, executing step 3, and if not, executing step 4.
Step 3: analyzing the image area of the intercepted video stream.
Specifically, factors of the place and time appearing in the video stream are removed, and a rectangular area in the middle part of the image is reserved.
Step 4: calculating an evaluation function of the sharpness of the video stream.
Specifically, the sum of the gradients of the sharp points and the sum of the gradients of all the pixels are calculated for the rectangular region of the remaining middle portion, and the sharpness evaluation function is determined according to the ratio of the sum of the gradients of the sharp points to the sum of the gradients of all the pixels.
Step 5: judging whether the images within a specific time are clear.
Specifically, the specific time may be one unit of time. The number of video frames with abnormal definition within the unit time is counted; if it exceeds a certain ratio of the total number of video frames in that unit time, the definition of the acquired video shot is abnormal, and otherwise it is normal.
Step 6: acquiring the video shot information with higher definition.
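Steps 4 and 5 above can be sketched as follows. The gradient threshold, minimum score and maximum bad-frame ratio are illustrative assumptions (the patent names the ratio-based evaluation function and the frame-count criterion but fixes no numbers), and for brevity the sketch scores a 1-D scanline rather than the full central rectangular region.

```python
# Step 4: sharpness = (sum of gradients at "sharp" points, i.e. points
# whose gradient exceeds a threshold) / (sum of gradients at all pixels).
# Step 5: a unit of time is clear unless too many frames score low.

def sharpness_score(row, sharp_threshold=30):
    """Gradient-ratio sharpness for a 1-D grayscale scanline."""
    grads = [abs(b - a) for a, b in zip(row, row[1:])]
    total = sum(grads)
    if total == 0:
        return 0.0
    sharp = sum(g for g in grads if g >= sharp_threshold)
    return sharp / total

def unit_is_clear(scores, min_score=0.5, max_bad_ratio=0.2):
    """Clear unless the blurry-frame ratio exceeds max_bad_ratio."""
    bad = sum(1 for s in scores if s < min_score)
    return bad / len(scores) <= max_bad_ratio

sharp_row = [0, 100, 0, 100, 0]   # strong edges -> score 1.0
blurry_row = [0, 5, 10, 15, 20]   # gentle ramp -> score 0.0
print(sharpness_score(sharp_row), sharpness_score(blurry_row))
```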
As shown in fig. 4, for each clear video shot, relevant visual features, audio features and text features are extracted through a deep-learning neural network model to obtain the corresponding image information to be recognized, audio information to be recognized and text information to be recognized. The features of the video shot are extracted in three modes, namely image, audio and text; the video shot comprises an image module, an audio module and a text module, wherein the image module is used for storing image information, the audio module is used for storing audio information, and the text module is used for storing text information.
The video processing module is used for performing video preprocessing on the plurality of segmented video shots and comprises an image processing module, an audio processing module and a text processing module.
The image processing module comprises an image deduplication module, an image gray scale calculation module and an image contrast enhancement module; the image deduplication module is used for removing overlapping information from the image information in the video shot, the gray scale calculation module is used for calculating the gray scale values of the image, and the image contrast enhancement module is used for enhancing the gray scale values of the image so as to improve its recognizability.
Specifically, the image information in each shot is de-overlapped according to image area, and the image information with the larger area is retained; the range of the gray values of the de-overlapped image is calculated by image binarization to obtain the minimum gray value min and the maximum gray value max; and the gray values of the image are stretched into the interval [0, 255] to enhance the recognizability of the image. Preprocessing the image module in this way yields relatively clear image information of relatively good quality, which can conveniently be compared with the image information in the common violent scene module.
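The min/max contrast stretch just described can be sketched in a few lines (a minimal illustration on a flat list of gray values; the flat-image guard is an added assumption, since the patent does not address the min == max case):

```python
# Find the minimum and maximum gray values, then linearly stretch
# them onto the interval [0, 255].

def stretch_contrast(pixels):
    lo, hi = min(pixels), max(pixels)
    if hi == lo:                      # flat image: nothing to stretch
        return [0 for _ in pixels]
    return [round((p - lo) * 255 / (hi - lo)) for p in pixels]

gray = [50, 100, 150]                 # narrow dynamic range
print(stretch_contrast(gray))         # prints [0, 128, 255]
```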
The audio processing module is used for processing the audio information in the video shot and comprises a low-frequency filtering module, which removes the low-frequency components of the audio information in the video shot, thereby enhancing the quality of the audio.
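The patent names the goal (removing low frequencies) but not a filter design; as one illustrative realization, a first-order high-pass filter with an assumed coefficient would look like:

```python
# First-order high-pass filter: y[n] = a * (y[n-1] + x[n] - x[n-1]).
# Slow (low-frequency) components are attenuated; fast ones pass.
# The coefficient alpha is an illustrative assumption.

def high_pass(samples, alpha=0.9):
    out = [0.0]
    for prev_x, x in zip(samples, samples[1:]):
        out.append(alpha * (out[-1] + x - prev_x))
    return out

dc_signal = [1.0] * 8                 # pure low-frequency (constant) input
print(high_pass(dc_signal))           # the constant component is removed
```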
The text processing module is used for processing the text information in the video shot and comprises a text denoising module; the text denoising module is used for removing irrelevant noise from the text information.
The common violent scene module stores the templates of violent videos against which the video recognition compares, and is connected with the violent video recognition module. The templates comprise common violent scene audio feature information, common violent scene image feature information and common violent scene text feature information.
The extracted common violent scene audio feature information comprises: audio energy feature information, short-time average energy intensity, fundamental (pitch) frequency, audio energy entropy and other feature information.
Specifically, common violent scene audio information may be defined to include audio such as screams, hoarse shouting and explosions. The audio extraction steps for common violent scenes are as follows: extract the audio signal from the common violent scene template through a high-pass filter, convert the audio information into a spectrogram, perform a forward pass through a neural network to extract the violent audio information, and use it as a comparison template.
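The spectrogram-conversion step can be sketched as follows: frame the audio and take a discrete Fourier transform of each frame. This stdlib-only DFT and the frame size are illustrative assumptions; a real system would use an FFT library and windowed, overlapping frames.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitude spectrum (bins 0..n/2) of one audio frame."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def spectrogram(samples, frame_size=8):
    """One magnitude spectrum per non-overlapping frame."""
    return [dft_magnitudes(samples[i:i + frame_size])
            for i in range(0, len(samples) - frame_size + 1, frame_size)]

# A pure tone whose period equals the frame size peaks in bin 1.
tone = [math.cos(2 * math.pi * i / 8) for i in range(16)]
spec = spectrogram(tone)
print(len(spec), max(range(len(spec[0])), key=lambda k: spec[0][k]))
```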
The extracted common violent scene image feature information comprises: average motion intensity information, gore (bloodiness) feature information, flame feature information and other feature information.
Specifically, the image information of common violent scenes can be defined to include images of bleeding, knives, guns, explosions, violent actions and the like. The image extraction steps for common violent scenes are as follows: extract the image signal from the common violence template, extract the violent image information through a forward pass of a neural network, and use it as a comparison template.
The extracted common violent scene text feature information comprises: sensitive word information, sensitive phrase information, and the like.
Specifically, the text information of common violent scenes can be defined to include words such as terror, violence, gore and blood. The extraction steps for common violent scene text comprise extracting the text signal from the common violent scene template and extracting common violent text features from the text information, where the common violent text features can be extracted with a bag-of-words model.
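The bag-of-words step can be sketched as follows; the sensitive-word vocabulary below is an illustrative stand-in for the template's word list.

```python
# Count how often each word of a fixed sensitive-word vocabulary
# appears in the shot's text, yielding a fixed-length feature vector.

def bag_of_words(text, vocabulary):
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

vocab = ["horror", "violence", "blood"]
vec = bag_of_words("Blood and more blood in a violence scene", vocab)
print(vec)  # prints [0, 1, 2]
```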
The judgment method of the violent audio judgment module is as follows: firstly, the processed audio feature information is fused with the audio features of common violent scenes to obtain the processed fused audio feature information and the fused common violent scene audio feature information; secondly, a classifier compares the fused common violent scene audio feature information with the processed audio feature information, and the processed audio feature information matching the common violent scene audio information is marked as violent audio information.
Specifically, if a certain video shot contains explosion audio, the audio information in the shot is compared with the common violent scene audio information: if the shot contains an explosion sound, it contains audio feature information corresponding to common violent scene audio information, i.e. violent audio information, and is marked as violent audio information; otherwise it is marked as non-violent audio feature information.
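The patent does not name a concrete classifier; as one illustrative stand-in, the comparison-and-marking step can be sketched as cosine-similarity matching against the common-violent-scene template, with an assumed decision threshold.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mark_shot(features, violent_template, threshold=0.8):
    """Mark the shot violent when it is close enough to the template."""
    sim = cosine_similarity(features, violent_template)
    return "violent" if sim >= threshold else "non-violent"

explosion_template = [0.9, 0.8, 0.1]       # illustrative audio template
print(mark_shot([0.85, 0.75, 0.2], explosion_template))   # violent
print(mark_shot([0.05, 0.10, 0.9], explosion_template))   # non-violent
```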
The judgment methods of the violent image judgment module and the violent text judgment module are the same as that of the violent audio judgment module.
The violent video information comprises at least one of marked violent audio feature information, marked violent image feature information and marked violent text feature information.
Preferably, the violent video recognition system oriented to the multi-modal feature semantic association features can further be provided with a violent video warning module connected with the violent video recognition module. When the system identifies video information as violent video information, the violent video warning module reminds the user and notifies that a violent video exists.
Specifically, the violent video warning module can adopt an audible and visual alarm that reminds through sound and light. This warning module draws attention to recognized violent videos, making it more convenient and intuitive for staff to review the recognition results.
As shown in fig. 5, a violent video recognition system oriented to the multi-modal feature semantic association features is implemented as follows:
s1: acquiring video information, S2: segmenting the acquired video information into a plurality of video shots, S3: processing the video information in each video shot, S4: and comparing the processed video characteristic information with the characteristic information of the common violent scenes to determine whether the processed video characteristic information is violent video information.
Step S1: video information is obtained through video acquisition modules such as a panoramic camera and the like, and the video information comprises violent video information and non-violent video information.
Step S2: the acquired video information is segmented into a plurality of video shots by a video segmentation technology, where each video shot's information comprises audio information to be identified, image information to be identified and text information to be identified.
Step S3: and preprocessing the audio information to be identified, the image information to be identified and the text information to be identified to obtain corresponding audio processing information, image processing information and text processing information.
Step S4: the audio processing information, image processing information and text processing information in each video shot are compared with the audio, image and text information in the common violent scene information; video information in which at least one of the image processing information corresponds to violent scene image feature information, the audio processing information corresponds to violent scene audio feature information, or the text processing information corresponds to violent scene text feature information is violent video information.
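The overall S1-S4 flow can be sketched end to end as follows; the per-modality matching rule and the toy features are illustrative stand-ins, with the per-modality processing assumed to have already produced comparable features.

```python
# S4 in miniature: a shot is marked violent when ANY of its three
# modalities matches the common-violent-scene template.

def identify_violent_shots(shots, templates, match):
    results = []
    for shot in shots:
        violent = any(match(shot[m], templates[m])
                      for m in ("image", "audio", "text"))
        results.append("violent" if violent else "non-violent")
    return results

# Toy modality features: 1 = matches the violent-scene template.
templates = {"image": 1, "audio": 1, "text": 1}
shots = [
    {"image": 0, "audio": 1, "text": 0},   # violent audio only
    {"image": 0, "audio": 0, "text": 0},   # nothing matches
]
print(identify_violent_shots(shots, templates, lambda f, t: f == t))
```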
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (9)
1. A violent video recognition system oriented to multi-modal feature semantic related features, comprising a video acquisition module, a video segmentation module, a video processing module and a violent video recognition module, which are connected in sequence;
the video acquisition module is used for acquiring video information, and the video information comprises violent video information and non-violent video information;
the video segmentation module is used for segmenting the video information obtained by the video acquisition module into a plurality of video shots using a video segmentation technique, extracting relevant visual features, audio features and text features from each video shot, and obtaining corresponding image information to be identified, audio information to be identified and text information to be identified;
the video processing module is used for performing video preprocessing on the plurality of segmented video shots, and comprises an image processing module, an audio processing module and a text processing module;
the violent video identification module is used for judging whether the video information processed by the video processing module belongs to violent video information, and comprises a violent image judgment module, a violent audio judgment module and a violent text judgment module.
2. The violent video recognition system oriented to multi-modal feature semantic related features of claim 1, further comprising a common violent scene module for storing violent scene templates, wherein the common violent scene module is connected with the violent video identification module.
3. The violent video recognition system oriented to multi-modal feature semantic related features of claim 2, wherein the common violent scene templates comprise common violent scene audio feature information, common violent scene image feature information and common violent scene text feature information.
4. The violent video recognition system oriented to multi-modal feature semantic related features of claim 1, wherein the image processing module is used for processing image information in the video shot and comprises an image de-duplication module, an image gray-scale calculation module and an image contrast enhancement module, which are connected in sequence; the image de-duplication module is used for removing overlapped information from the image information in the video shot, the gray-scale calculation module is used for calculating image gray-scale values, and the image contrast enhancement module is used for enhancing the contrast of the gray-scale image.
5. The violent video recognition system oriented to multi-modal feature semantic related features of claim 1, wherein the audio processing module is used for processing audio information in the video shot and comprises a low-frequency filtering module used for removing low-frequency components from the audio information in the video shot.
6. The violent video recognition system oriented to multi-modal feature semantic related features of claim 1, wherein the text processing module is used for preprocessing text information in the video shot and comprises a text denoising module used for removing noise from the text information in the video shot.
7. The violent video recognition system oriented to multi-modal feature semantic related features of claim 5, wherein the violent audio judgment module performs judgment as follows: firstly, fusing the processed audio feature information with the common violent scene audio features to obtain fused audio feature information; secondly, comparing the fused audio feature information with the common violent scene audio feature information using a classifier, and marking the processed audio feature information that matches the common violent scene audio information as violent audio information; the judgment methods of the violent image judgment module and the violent text judgment module are the same as that of the violent audio judgment module.
8. The violent video recognition system oriented to multi-modal feature semantic related features of claim 1, wherein the violent video information comprises at least one of marked violent audio feature information, marked violent image feature information and marked violent text feature information.
9. The violent video recognition system oriented to multi-modal feature semantic related features of claim 1, further comprising a timed start module used for starting the system at scheduled times to perform violent video recognition, wherein the timed start module is connected with the video acquisition module.
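The image-processing chain of claim 4 (de-duplication, then gray-scale calculation, then contrast enhancement) can be sketched as below, under assumed concrete implementations: de-duplication by dropping consecutive identical frames, gray scale via the ITU-R BT.601 luma weights, and contrast enhancement by a linear stretch to the full [0, 255] range. None of these specific choices is mandated by the claims; they merely illustrate the three modules connected in sequence.

```python
def deduplicate_frames(frames):
    """Image de-duplication module: drop frames identical to the previous
    one (assumed criterion; the claim only says 'remove overlapped
    information')."""
    out = []
    for frame in frames:
        if not out or frame != out[-1]:
            out.append(frame)
    return out


def to_grayscale(frame):
    """Gray-scale calculation module: ITU-R BT.601 luma from (R, G, B)
    pixel tuples."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in frame]


def stretch_contrast(gray):
    """Contrast enhancement module: linear stretch of gray values to
    the full [0, 255] range."""
    values = [v for row in gray for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        return gray  # flat image: nothing to stretch
    return [[round((v - lo) * 255 / (hi - lo)) for v in row] for row in gray]


def preprocess_images(frames):
    """Claim 4 chain: de-duplicate, then gray-scale, then enhance contrast."""
    return [stretch_contrast(to_grayscale(f)) for f in deduplicate_frames(frames)]
```

In a real system each stage would typically use a library implementation (for example perceptual hashing for de-duplication and histogram equalization for contrast), but the data flow between the three modules is the same.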
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110185761.9A CN112989950A (en) | 2021-02-11 | 2021-02-11 | Violent video recognition system oriented to multi-mode feature semantic correlation features |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112989950A true CN112989950A (en) | 2021-06-18 |
Family
ID=76393237
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110185761.9A Pending CN112989950A (en) | 2021-02-11 | 2021-02-11 | Violent video recognition system oriented to multi-mode feature semantic correlation features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112989950A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113673364A (en) * | 2021-07-28 | 2021-11-19 | 上海影谱科技有限公司 | Video violence detection method and device based on deep neural network |
CN114239570A (en) * | 2021-12-02 | 2022-03-25 | 北京智美互联科技有限公司 | Sensitive data identification method and system based on semantic analysis |
CN114519828A (en) * | 2022-01-17 | 2022-05-20 | 天津大学 | Video detection method and system based on semantic analysis |
CN114821385A (en) * | 2022-03-08 | 2022-07-29 | 阿里巴巴(中国)有限公司 | Multimedia information processing method, device, equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101834982A (en) * | 2010-05-28 | 2010-09-15 | 上海交通大学 | Hierarchical screening method of violent videos based on multiplex mode |
CN103218608A (en) * | 2013-04-19 | 2013-07-24 | 中国科学院自动化研究所 | Network violent video identification method |
US20170289624A1 (en) * | 2016-04-01 | 2017-10-05 | Samsung Electrônica da Amazônia Ltda. | Multimodal and real-time method for filtering sensitive media |
WO2019127659A1 (en) * | 2017-12-30 | 2019-07-04 | 惠州学院 | Method and system for identifying harmful video based on user id |
WO2019127651A1 (en) * | 2017-12-30 | 2019-07-04 | 惠州学院 | Method and system thereof for identifying malicious video |
CN112069884A (en) * | 2020-07-28 | 2020-12-11 | 中国传媒大学 | Violent video classification method, system and storage medium |
2021-02-11: application CN202110185761.9A filed in China (CN); status: Pending.
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112989950A (en) | Violent video recognition system oriented to multi-mode feature semantic correlation features | |
CN109284729B (en) | Method, device and medium for acquiring face recognition model training data based on video | |
CN110569720B (en) | Audio and video intelligent identification processing method based on audio and video processing system | |
CN110704682B (en) | Method and system for intelligently recommending background music based on video multidimensional characteristics | |
CN110795595B (en) | Video structured storage method, device, equipment and medium based on edge calculation | |
CN109766779B (en) | Loitering person identification method and related product | |
US10037467B2 (en) | Information processing system | |
CN106708949A (en) | Identification method of harmful content of video | |
CN110852147B (en) | Security alarm method, security alarm device, server and computer readable storage medium | |
WO2005024707A1 (en) | Apparatus and method for feature recognition | |
WO2018040306A1 (en) | Method for detecting frequent passers-by in monitoring video | |
CN111797820B (en) | Video data processing method and device, electronic equipment and storage medium | |
CN105335691A (en) | Smiling face identification and encouragement system | |
CN110852306A (en) | Safety monitoring system based on artificial intelligence | |
KR101092472B1 (en) | Video indexing system using surveillance camera and the method thereof | |
CN109033476A (en) | A kind of intelligent space-time data event analysis method based on event clue network | |
CN111126411B (en) | Abnormal behavior identification method and device | |
KR101413620B1 (en) | Apparatus for video to text using video analysis | |
CN110175553B (en) | Method and device for establishing feature library based on gait recognition and face recognition | |
KR101547255B1 (en) | Object-based Searching Method for Intelligent Surveillance System | |
CN111428589B (en) | Gradual transition identification method and system | |
Das et al. | Human face detection in color images using HSV color histogram and WLD | |
CN109977891A (en) | A kind of object detection and recognition method neural network based | |
CN114241363A (en) | Process identification method, process identification device, electronic device, and storage medium | |
CN115115976A (en) | Video processing method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||