CN112989950A - Violent video recognition system oriented to multi-mode feature semantic correlation features - Google Patents
- Publication number
- CN112989950A (application number CN202110185761.9A)
- Authority
- CN
- China
- Prior art keywords
- video
- module
- violent
- information
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/60—Control of cameras or camera modules
- H04N23/698—Control of cameras or camera modules for achieving an enlarged field of view, e.g. panoramic image capture
Abstract
The invention discloses a violent video recognition system oriented to multi-modal feature semantic correlation features, which comprises a video acquisition module, a video segmentation module, a video processing module and a violent video recognition module that are connected in sequence. The system can identify violent videos through the semantic association of three modal features, namely visual, auditory and text information, and offers the advantages of accuracy and high recognition efficiency.
Description
Technical Field
The invention relates to the technical field of recognition, and in particular to a violent video recognition system oriented to multi-modal feature semantic association features.
Background
With the rapid development of video network technology, people are exposed to a wide variety of videos, including violent videos. The existence of violent videos harms people's physical and mental health, so violent videos need to be searched for and detected when people watch online videos or when staff review surveillance footage, and such searching usually requires video recognition. At present, the recognition mode for violent videos is single, the preparation of the recognition files and the recognition environment is complex, and violent videos cannot be identified accurately, which makes their recognition difficult.
Therefore, a violent video recognition system oriented to multi-modal feature semantic association features is urgently needed to solve these problems.
Disclosure of Invention
In view of the above, the present invention provides a violent video recognition system oriented to the multi-modal feature semantic association feature, so as to solve the above technical problems.
In order to achieve the purpose, the invention provides the following technical scheme:
A violent video recognition system oriented to multi-modal feature semantic association features comprises a video acquisition module, a video segmentation module, a video processing module and a violent video recognition module, which are connected in sequence;
the video acquisition module is used for acquiring video information, and the video information comprises violent video information and non-violent video information;
the video segmentation module is used for segmenting the video information obtained by the video acquisition module into a plurality of video shots according to a video segmentation technology, and extracting relevant visual features, audio features and text features from each video shot to obtain corresponding image information to be identified, audio information to be identified and text information to be identified;
the video processing module is used for performing video preprocessing on the plurality of segmented video shots, and comprises an image processing module, an audio processing module and a text processing module;
the violence video identification module is used for judging whether the video information processed by the video processing module belongs to violence video information or not, and comprises a violence audio judgment module, a violence image judgment module and a violence text judgment module.
Further, the violent video recognition system oriented to the multi-modal feature semantic association features further comprises a common violent scene module used for storing a violent scene template, and the common violent scene module is connected with the violent video recognition module.
Further, the common violent scene template comprises common violent scene audio characteristic information, common violent scene image characteristic information and common violent scene text characteristic information.
Further, the image processing module is configured to pre-process the images in each video shot, and comprises an image deduplication module, an image gray scale calculation module and an image contrast enhancement module; the image deduplication module is configured to remove overlapping information from the image information in the video shot, the gray scale calculation module is configured to calculate the gray scale values of the image, the image contrast enhancement module is configured to enhance the gray scale values of the image, and the three modules are connected in sequence.
Further, the audio processing module is configured to pre-process the audio information in each video shot, and comprises a low-frequency filtering module configured to remove the low-frequency components of the audio information in the video shot.
Further, the text processing module is used for preprocessing the text information in each video shot, and comprises a text denoising module used for removing redundant noise from the text information in the video shot.
Further, the judgment method of the violent audio judgment module is as follows: firstly, the processed audio feature information is fused with the audio features of common violent scenes to obtain fused audio feature information; secondly, a classifier compares the common violent scene audio feature information with the processed audio feature information, and the processed audio feature information that matches common violent scene audio information is marked as violent audio information. The judgment methods of the violent image judgment module and the violent text judgment module are the same as that of the violent audio judgment module.
Further, the violent video information comprises at least one of marked violent audio characteristic information, marked violent image characteristic information and marked violent text characteristic information.
Further, the violent video identification system facing the multi-modal characteristic semantic correlation characteristics further comprises a timing starting module, wherein the timing starting module is used for starting a video system at regular time to carry out violent video identification, and the timing starting module is connected with the video acquisition module.
The technical scheme can show that the invention has the advantages that:
1. the video is divided into a plurality of video shots by a video segmentation technology, each video shot comprises image information to be identified, audio information to be identified and text information to be identified, and each shot is processed and identified separately so as to achieve accurate recognition;
2. the features in each video shot are extracted by combining image, audio and text, and the multi-modal features are identified jointly, so that violent videos are identified more accurately and the practicability of the system is improved;
3. by providing the timing starting module, the violent video recognition system can be opened or closed at a set time without manual operation, realizing intelligent starting and stopping of the system.
In addition to the objects, features and advantages described above, the present invention has other objects, features and advantages, which will be described in further detail below with reference to the drawings.
Drawings
In the drawings:
fig. 1 is a schematic structural diagram of a violent video recognition system oriented to multi-modal feature semantic association features.
FIG. 2 is a step diagram of a video segmentation technique in a violent video recognition system oriented to multi-modal feature semantic related features according to the present invention.
FIG. 3 is a step diagram of acquiring clear video shots in the violent video recognition system oriented to multi-modal feature semantic relation features.
FIG. 4 is a structural diagram of the components of a video shot in the violent video identification system facing the multi-modal characteristic semantic relation characteristic.
FIG. 5 is a schematic step diagram of a violent video feature recognition system oriented to the multi-modal feature semantic relation features.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The violent video recognition system oriented to the multi-modal feature semantic association features comprises a timing starting module, a video acquisition module, a video segmentation module, a video processing module, a common violent scene module and a violent video recognition module. The timing starting module starts the video recognition system at a set time; the video segmentation module segments the video information received by the video acquisition module into video shots; the video processing module processes the received video shots; and the violent video recognition module determines whether the processed video belongs to a violent video by comparing it with the common violent scene module. The timing starting module, the video acquisition module, the video segmentation module, the video processing module and the violent video recognition module are connected in sequence, and the common violent scene module is connected with the violent video recognition module.
The timing starting module is used for starting the video acquisition module at a fixed time, and is connected with the video acquisition module.
Specifically, the timing starting module comprises a timing starter that controls the switch of the violence recognition system. When the set starting time is reached, the video acquisition module is opened to acquire video information.
The video acquisition module can adopt a plurality of panoramic cameras to simultaneously acquire video data and audio data and acquire video information, and the acquired video information comprises violent video information and non-violent video information.
Specifically, when the timing starting module is opened and a video acquisition request is received, the plurality of panoramic cameras acquire video information. In order to ensure the definition of the video, the video acquisition range of the panoramic camera is 5 m.
The video segmentation module is used for segmenting the video information obtained by the video acquisition module into a plurality of video shots according to a video segmentation technology.
Specifically, as shown in fig. 2, the implementation of video segmentation includes the following two steps:
the first step is as follows: judging the continuity of the acquired video information;
the second step is that: and dividing the shot into a plurality of video shots according to the judgment result, if the video information is continuous, the current video shot is determined, and if the video information is discontinuous, the next video shot is determined.
Generally, video information obtained from a television, a movie or an on-scene panoramic camera is filmed with a plurality of shots; in the common case, the video information within the same shot is continuous while the video information between two shots is discontinuous, so most video information is composed of a plurality of video shots.
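The two-step continuity segmentation described above can be sketched as follows. The frame-difference metric and the threshold value are illustrative assumptions; the patent states only that continuity is judged and shots are cut at discontinuities.

```python
# Illustrative sketch of the two-step segmentation: a small mean
# inter-frame difference means the current shot continues, a large
# one marks a cut and starts the next video shot.

def frame_difference(a, b):
    """Mean absolute pixel difference between two grayscale frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def segment_into_shots(frames, threshold=40.0):
    """Split a frame sequence into shots at discontinuities."""
    shots, current = [], []
    current.append(frames[0])
    for prev, frame in zip(frames, frames[1:]):
        if frame_difference(prev, frame) > threshold:  # discontinuity: cut
            shots.append(current)
            current = []
        current.append(frame)
    shots.append(current)
    return shots

# Two synthetic "shots": three dark frames followed by two bright frames.
frames = [[10] * 16] * 3 + [[200] * 16] * 2
print(len(segment_into_shots(frames)))  # prints 2
```

A production system would compute the difference on downsampled frames or color histograms rather than raw pixels, but the cut-detection logic is the same.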
Preferably, the method for judging the sharpness of the acquired video shots and extracting the video shots with higher sharpness includes the following steps, as shown in fig. 3:
step 1: and intercepting video stream information from the video lens for acquiring the video information.
Specifically, a computer may be used to intercept a continuous video stream to obtain an image, and then analyze the image.
Step 2: judging whether the video stream is in YUV format; if so, executing step 3, and if not, executing step 4.
Step 3: analyzing the image area of the intercepted video stream.
Specifically, factors of the place and time appearing in the video stream are removed, and a rectangular area in the middle part of the image is reserved.
Step 4: calculating an evaluation function of the sharpness of the video stream.
Specifically, the sum of the gradients of the sharp points and the sum of the gradients of all the pixels are calculated for the rectangular region of the remaining middle portion, and the sharpness evaluation function is determined according to the ratio of the sum of the gradients of the sharp points to the sum of the gradients of all the pixels.
Step 5: judging whether the images within a specific time are clear.
Specifically, the specific time may be one unit of time. The number of video frames with abnormal definition within the unit time is counted; if it exceeds a certain ratio of the total number of video frames in that unit time, the definition of the acquired video shot is abnormal, and otherwise it is normal.
Step 6: acquiring the video shot information with higher definition.
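Steps 4 and 5 above can be sketched as follows. The gradient threshold, minimum score and maximum bad-frame ratio are illustrative assumptions (the patent names the ratio-based evaluation function and the frame-count criterion but fixes no numbers), and for brevity the sketch scores a 1-D scanline rather than the full central rectangular region.

```python
# Step 4: sharpness = (sum of gradients at "sharp" points, i.e. points
# whose gradient exceeds a threshold) / (sum of gradients at all pixels).
# Step 5: a unit of time is clear unless too many frames score low.

def sharpness_score(row, sharp_threshold=30):
    """Gradient-ratio sharpness for a 1-D grayscale scanline."""
    grads = [abs(b - a) for a, b in zip(row, row[1:])]
    total = sum(grads)
    if total == 0:
        return 0.0
    sharp = sum(g for g in grads if g >= sharp_threshold)
    return sharp / total

def unit_is_clear(scores, min_score=0.5, max_bad_ratio=0.2):
    """Clear unless the blurry-frame ratio exceeds max_bad_ratio."""
    bad = sum(1 for s in scores if s < min_score)
    return bad / len(scores) <= max_bad_ratio

sharp_row = [0, 100, 0, 100, 0]   # strong edges -> score 1.0
blurry_row = [0, 5, 10, 15, 20]   # gentle ramp -> score 0.0
print(sharpness_score(sharp_row), sharpness_score(blurry_row))
```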
As shown in fig. 4, for each clear video shot, relevant visual features, audio features and text features are extracted through a deep-learning neural network model to obtain the corresponding image information to be recognized, audio information to be recognized and text information to be recognized. The features of the video shot are extracted in three modes, namely image, audio and text; the video shot comprises an image module, an audio module and a text module, wherein the image module is used for storing image information, the audio module is used for storing audio information, and the text module is used for storing text information.
The video processing module is used for performing video preprocessing on the plurality of segmented video shots and comprises an image processing module, an audio processing module and a text processing module.
The image processing module comprises an image deduplication module, an image gray scale calculation module and an image contrast enhancement module; the image deduplication module is used for removing overlapping information from the image information in the video shot, the gray scale calculation module is used for calculating the gray scale values of the image, and the image contrast enhancement module is used for enhancing the gray scale values of the image so as to improve its recognizability.
Specifically, the image information in each shot is de-overlapped according to image area, and the image information with the larger area is retained; the range of the gray values of the de-overlapped image is calculated by image binarization to obtain the minimum gray value min and the maximum gray value max; and the gray values of the image are stretched into the interval [0, 255] to enhance the recognizability of the image. Preprocessing the image module in this way yields relatively clear image information of relatively good quality, which can conveniently be compared with the image information in the common violent scene module.
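The min/max contrast stretch just described can be sketched in a few lines (a minimal illustration on a flat list of gray values; the flat-image guard is an added assumption, since the patent does not address the min == max case):

```python
# Find the minimum and maximum gray values, then linearly stretch
# them onto the interval [0, 255].

def stretch_contrast(pixels):
    lo, hi = min(pixels), max(pixels)
    if hi == lo:                      # flat image: nothing to stretch
        return [0 for _ in pixels]
    return [round((p - lo) * 255 / (hi - lo)) for p in pixels]

gray = [50, 100, 150]                 # narrow dynamic range
print(stretch_contrast(gray))         # prints [0, 128, 255]
```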
The audio processing module is used for processing the audio information in the video shot and comprises a low-frequency filtering module, which removes the low-frequency components of the audio information in the video shot, thereby enhancing the quality of the audio.
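The patent names the goal (removing low frequencies) but not a filter design; as one illustrative realization, a first-order high-pass filter with an assumed coefficient would look like:

```python
# First-order high-pass filter: y[n] = a * (y[n-1] + x[n] - x[n-1]).
# Slow (low-frequency) components are attenuated; fast ones pass.
# The coefficient alpha is an illustrative assumption.

def high_pass(samples, alpha=0.9):
    out = [0.0]
    for prev_x, x in zip(samples, samples[1:]):
        out.append(alpha * (out[-1] + x - prev_x))
    return out

dc_signal = [1.0] * 8                 # pure low-frequency (constant) input
print(high_pass(dc_signal))           # the constant component is removed
```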
The text processing module is used for processing the text information in the video shot and comprises a text denoising module; the text denoising module is used for removing irrelevant noise from the text information.
The common violent scene module stores the templates of violent videos against which the video recognition compares, and is connected with the violent video recognition module. The templates comprise common violent scene audio feature information, common violent scene image feature information and common violent scene text feature information.
The extracted common violent scene audio feature information comprises: audio energy feature information, short-time average energy intensity, fundamental (pitch) frequency, audio energy entropy and other feature information.
Specifically, common violent scene audio information may be defined to include audio such as screams, hoarse shouting and explosions. The audio extraction steps for common violent scenes are as follows: extract the audio signal from the common violent scene template through a high-pass filter, convert the audio information into a spectrogram, perform a forward pass through a neural network to extract the violent audio information, and use it as a comparison template.
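The spectrogram-conversion step can be sketched as follows: frame the audio and take a discrete Fourier transform of each frame. This stdlib-only DFT and the frame size are illustrative assumptions; a real system would use an FFT library and windowed, overlapping frames.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitude spectrum (bins 0..n/2) of one audio frame."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def spectrogram(samples, frame_size=8):
    """One magnitude spectrum per non-overlapping frame."""
    return [dft_magnitudes(samples[i:i + frame_size])
            for i in range(0, len(samples) - frame_size + 1, frame_size)]

# A pure tone whose period equals the frame size peaks in bin 1.
tone = [math.cos(2 * math.pi * i / 8) for i in range(16)]
spec = spectrogram(tone)
print(len(spec), max(range(len(spec[0])), key=lambda k: spec[0][k]))
```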
The extracted common violent scene image feature information comprises: average motion intensity information, gore (bloodiness) feature information, flame feature information and other feature information.
Specifically, the image information of common violent scenes can be defined to include images of bleeding, knives, guns, explosions, violent actions and the like. The image extraction steps for common violent scenes are as follows: extract the image signal from the common violence template, extract the violent image information through a forward pass of a neural network, and use it as a comparison template.
The extracted common violent scene text feature information comprises: sensitive word information, sensitive phrase information, and the like.
Specifically, the text information of common violent scenes can be defined to include words such as terror, violence, gore and blood. The extraction steps for common violent scene text comprise extracting the text signal from the common violent scene template and extracting common violent text features from the text information, where the common violent text features can be extracted with a bag-of-words model.
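The bag-of-words step can be sketched as follows; the sensitive-word vocabulary below is an illustrative stand-in for the template's word list.

```python
# Count how often each word of a fixed sensitive-word vocabulary
# appears in the shot's text, yielding a fixed-length feature vector.

def bag_of_words(text, vocabulary):
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

vocab = ["horror", "violence", "blood"]
vec = bag_of_words("Blood and more blood in a violence scene", vocab)
print(vec)  # prints [0, 1, 2]
```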
The judgment method of the violent audio judgment module is as follows: firstly, the processed audio feature information is fused with the audio features of common violent scenes to obtain the processed fused audio feature information and the fused common violent scene audio feature information; secondly, a classifier compares the fused common violent scene audio feature information with the processed audio feature information, and the processed audio feature information matching the common violent scene audio information is marked as violent audio information.
Specifically, if a certain video shot contains explosion audio, the audio information in the shot is compared with the common violent scene audio information: if the shot contains an explosion sound, it contains audio feature information corresponding to common violent scene audio information, i.e. violent audio information, and is marked as violent audio information; otherwise it is marked as non-violent audio feature information.
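The patent does not name a concrete classifier; as one illustrative stand-in, the comparison-and-marking step can be sketched as cosine-similarity matching against the common-violent-scene template, with an assumed decision threshold.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mark_shot(features, violent_template, threshold=0.8):
    """Mark the shot violent when it is close enough to the template."""
    sim = cosine_similarity(features, violent_template)
    return "violent" if sim >= threshold else "non-violent"

explosion_template = [0.9, 0.8, 0.1]       # illustrative audio template
print(mark_shot([0.85, 0.75, 0.2], explosion_template))   # violent
print(mark_shot([0.05, 0.10, 0.9], explosion_template))   # non-violent
```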
The judgment methods of the violent image judgment module and the violent text judgment module are the same as that of the violent audio judgment module.
The violent video information comprises at least one of marked violent audio feature information, marked violent image feature information and marked violent text feature information.
Preferably, the violent video recognition system oriented to the multi-modal feature semantic association features can further be provided with a violent video warning module connected with the violent video recognition module. When the system identifies video information as violent video information, the violent video warning module reminds the user and notifies that a violent video exists.
Specifically, the violent video warning module can adopt an audible and visual alarm that reminds through sound and light. This warning module draws attention to recognized violent videos, making it more convenient and intuitive for staff to review the recognition results.
As shown in fig. 5, a violent video recognition system oriented to the multi-modal feature semantic association features is implemented as follows:
s1: acquiring video information, S2: segmenting the acquired video information into a plurality of video shots, S3: processing the video information in each video shot, S4: and comparing the processed video characteristic information with the characteristic information of the common violent scenes to determine whether the processed video characteristic information is violent video information.
Step S1: video information is obtained through video acquisition modules such as a panoramic camera and the like, and the video information comprises violent video information and non-violent video information.
Step S2: the acquired video information is segmented into a plurality of video shots by a video segmentation technology, where each video shot's information comprises audio information to be identified, image information to be identified and text information to be identified.
Step S3: and preprocessing the audio information to be identified, the image information to be identified and the text information to be identified to obtain corresponding audio processing information, image processing information and text processing information.
Step S4: the audio processing information, image processing information and text processing information in each video shot are compared with the audio, image and text information in the common violent scene information; video information in which at least one of the image processing information corresponds to violent scene image feature information, the audio processing information corresponds to violent scene audio feature information, or the text processing information corresponds to violent scene text feature information is violent video information.
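The overall S1-S4 flow can be sketched end to end as follows; the per-modality matching rule and the toy features are illustrative stand-ins, with the per-modality processing assumed to have already produced comparable features.

```python
# S4 in miniature: a shot is marked violent when ANY of its three
# modalities matches the common-violent-scene template.

def identify_violent_shots(shots, templates, match):
    results = []
    for shot in shots:
        violent = any(match(shot[m], templates[m])
                      for m in ("image", "audio", "text"))
        results.append("violent" if violent else "non-violent")
    return results

# Toy modality features: 1 = matches the violent-scene template.
templates = {"image": 1, "audio": 1, "text": 1}
shots = [
    {"image": 0, "audio": 1, "text": 0},   # violent audio only
    {"image": 0, "audio": 0, "text": 0},   # nothing matches
]
print(identify_violent_shots(shots, templates, lambda f, t: f == t))
```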
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (9)
1. A violent video recognition system oriented to multi-modal feature semantic related features, comprising a video acquisition module, a video segmentation module, a video processing module and a violent video recognition module, which are connected in sequence;
the video acquisition module is used for acquiring video information, and the video information comprises violent video information and non-violent video information;
the video segmentation module is used for segmenting the video information obtained by the video acquisition module into a plurality of video shots using a video segmentation technique, extracting relevant visual features, audio features and text features from each video shot, and obtaining corresponding image information to be identified, audio information to be identified and text information to be identified;
the video processing module is used for performing video preprocessing on the plurality of segmented video shots, and comprises an image processing module, an audio processing module and a text processing module;
the violent video identification module is used for judging whether the video information processed by the video processing module belongs to violent video information, and comprises a violent image judgment module, a violent audio judgment module and a violent text judgment module.
2. The violent video recognition system oriented to multi-modal feature semantic related features of claim 1, further comprising a common violent scene module for storing violent scene templates, wherein the common violent scene module is connected with the violent video identification module.
3. The violent video recognition system oriented to multi-modal feature semantic related features of claim 2, wherein the common violent scene templates comprise common violent scene audio feature information, common violent scene image feature information and common violent scene text feature information.
4. The violent video recognition system oriented to multi-modal feature semantic related features of claim 1, wherein the image processing module is used for processing image information in the video shot and comprises an image de-duplication module, an image gray-scale calculation module and an image contrast enhancement module, which are connected in sequence; the image de-duplication module is used for removing overlapped information from the image information in the video shot, the gray-scale calculation module is used for calculating image gray-scale values, and the image contrast enhancement module is used for enhancing the contrast of the gray-scale image.
5. The violent video recognition system oriented to multi-modal feature semantic related features of claim 1, wherein the audio processing module is used for processing audio information in the video shot and comprises a low-frequency filtering module used for removing low-frequency components from the audio information in the video shot.
6. The violent video recognition system oriented to multi-modal feature semantic related features of claim 1, wherein the text processing module is used for preprocessing text information in the video shot and comprises a text denoising module used for removing noise from the text information in the video shot.
7. The violent video recognition system oriented to multi-modal feature semantic related features of claim 5, wherein the violent audio judgment module performs judgment as follows: firstly, fusing the processed audio feature information with the common violent scene audio features to obtain fused audio feature information; secondly, comparing the fused audio feature information with the common violent scene audio feature information using a classifier, and marking the processed audio feature information that matches the common violent scene audio information as violent audio information; the judgment methods of the violent image judgment module and the violent text judgment module are the same as that of the violent audio judgment module.
8. The violent video recognition system oriented to multi-modal feature semantic related features of claim 1, wherein the violent video information comprises at least one of marked violent audio feature information, marked violent image feature information and marked violent text feature information.
9. The violent video recognition system oriented to multi-modal feature semantic related features of claim 1, further comprising a timed start module used for starting the system at scheduled times to perform violent video recognition, wherein the timed start module is connected with the video acquisition module.
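The image-processing chain of claim 4 (de-duplication, then gray-scale calculation, then contrast enhancement) can be sketched as below, under assumed concrete implementations: de-duplication by dropping consecutive identical frames, gray scale via the ITU-R BT.601 luma weights, and contrast enhancement by a linear stretch to the full [0, 255] range. None of these specific choices is mandated by the claims; they merely illustrate the three modules connected in sequence.

```python
def deduplicate_frames(frames):
    """Image de-duplication module: drop frames identical to the previous
    one (assumed criterion; the claim only says 'remove overlapped
    information')."""
    out = []
    for frame in frames:
        if not out or frame != out[-1]:
            out.append(frame)
    return out


def to_grayscale(frame):
    """Gray-scale calculation module: ITU-R BT.601 luma from (R, G, B)
    pixel tuples."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in frame]


def stretch_contrast(gray):
    """Contrast enhancement module: linear stretch of gray values to
    the full [0, 255] range."""
    values = [v for row in gray for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        return gray  # flat image: nothing to stretch
    return [[round((v - lo) * 255 / (hi - lo)) for v in row] for row in gray]


def preprocess_images(frames):
    """Claim 4 chain: de-duplicate, then gray-scale, then enhance contrast."""
    return [stretch_contrast(to_grayscale(f)) for f in deduplicate_frames(frames)]
```

In a real system each stage would typically use a library implementation (for example perceptual hashing for de-duplication and histogram equalization for contrast), but the data flow between the three modules is the same.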
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110185761.9A CN112989950A (en) | 2021-02-11 | 2021-02-11 | Violent video recognition system oriented to multi-mode feature semantic correlation features |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112989950A true CN112989950A (en) | 2021-06-18 |
Family
ID=76393237
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110185761.9A Pending CN112989950A (en) | 2021-02-11 | 2021-02-11 | Violent video recognition system oriented to multi-mode feature semantic correlation features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112989950A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113673364A (en) * | 2021-07-28 | 2021-11-19 | 上海影谱科技有限公司 | Video violence detection method and device based on deep neural network |
CN114239570A (en) * | 2021-12-02 | 2022-03-25 | 北京智美互联科技有限公司 | Sensitive data identification method and system based on semantic analysis |
CN114519828A (en) * | 2022-01-17 | 2022-05-20 | 天津大学 | Video detection method and system based on semantic analysis |
CN114821385A (en) * | 2022-03-08 | 2022-07-29 | 阿里巴巴(中国)有限公司 | Multimedia information processing method, device, equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101834982A (en) * | 2010-05-28 | 2010-09-15 | 上海交通大学 | Hierarchical screening method of violent videos based on multiplex mode |
CN103218608A (en) * | 2013-04-19 | 2013-07-24 | 中国科学院自动化研究所 | Network violent video identification method |
US20170289624A1 (en) * | 2016-04-01 | 2017-10-05 | Samsung Electrônica da Amazônia Ltda. | Multimodal and real-time method for filtering sensitive media |
WO2019127659A1 (en) * | 2017-12-30 | 2019-07-04 | 惠州学院 | Method and system for identifying harmful video based on user id |
WO2019127651A1 (en) * | 2017-12-30 | 2019-07-04 | 惠州学院 | Method and system thereof for identifying malicious video |
CN112069884A (en) * | 2020-07-28 | 2020-12-11 | 中国传媒大学 | Violent video classification method, system and storage medium |
2021-02-11: application CN202110185761.9A filed in China (CN); status: Pending.
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112989950A (en) | Violent video recognition system oriented to multi-mode feature semantic correlation features | |
CN109284729B (en) | Method, device and medium for acquiring face recognition model training data based on video | |
CN110569720B (en) | Audio and video intelligent identification processing method based on audio and video processing system | |
CN110704682B (en) | Method and system for intelligently recommending background music based on video multidimensional characteristics | |
CN110795595B (en) | Video structured storage method, device, equipment and medium based on edge calculation | |
CN109766779B (en) | Loitering person identification method and related product | |
US10037467B2 (en) | Information processing system | |
CN106708949A (en) | Identification method of harmful content of video | |
CN110852147B (en) | Security alarm method, security alarm device, server and computer readable storage medium | |
WO2005024707A1 (en) | Apparatus and method for feature recognition | |
WO2018040306A1 (en) | Method for detecting frequent passers-by in monitoring video | |
CN111797820B (en) | Video data processing method and device, electronic equipment and storage medium | |
CN105335691A (en) | Smiling face identification and encouragement system | |
CN110852306A (en) | Safety monitoring system based on artificial intelligence | |
KR101092472B1 (en) | Video indexing system using surveillance camera and the method thereof | |
CN109033476A (en) | A kind of intelligent space-time data event analysis method based on event clue network | |
CN111126411B (en) | Abnormal behavior identification method and device | |
KR101413620B1 (en) | Apparatus for video to text using video analysis | |
CN110175553B (en) | Method and device for establishing feature library based on gait recognition and face recognition | |
KR101547255B1 (en) | Object-based Searching Method for Intelligent Surveillance System | |
CN111428589B (en) | Gradual transition identification method and system | |
Das et al. | Human face detection in color images using HSV color histogram and WLD | |
CN109977891A (en) | A kind of object detection and recognition method neural network based | |
CN114241363A (en) | Process identification method, process identification device, electronic device, and storage medium | |
CN115115976A (en) | Video processing method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||