CN112766035A - Bus-oriented system and method for recognizing violent behavior of passenger on driver - Google Patents

Bus-oriented system and method for recognizing violent behavior of passenger on driver

Info

Publication number
CN112766035A
CN112766035A
Authority
CN
China
Prior art keywords
model
voice
recognition
information
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011388004.3A
Other languages
Chinese (zh)
Other versions
CN112766035B (en)
Inventor
王熙柱
李梓平
舒琳
黄毓敏
陈楚钧
陈柏伶
赵源发
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202011388004.3A priority Critical patent/CN112766035B/en
Publication of CN112766035A publication Critical patent/CN112766035A/en
Application granted granted Critical
Publication of CN112766035B publication Critical patent/CN112766035B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/59Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/18Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a bus-oriented system and method for recognizing violent behavior of passengers toward the driver. The system comprises a processor module, a video acquisition module and a voice acquisition module; the processor module comprises a video and voice data acquisition model, a video stream recognition model, a voice stream recognition model and a decision fusion model, and is respectively connected with the video acquisition module and the voice acquisition module. By using both video and voice to perform multi-modal recognition of the scene, the system improves the accuracy of scene recognition.

Description

Bus-oriented system and method for recognizing violent behavior of passenger on driver
Technical Field
The invention relates to the technical field of multi-modal recognition, and in particular to a bus-oriented system and method for recognizing violent behavior of passengers toward drivers.
Background
In recent years, bus accidents caused by violent behavior of passengers toward drivers have become common. Existing on-board safety facilities are not adequate, and no complete detection and prevention system is available to reduce such incidents. It is therefore necessary to design an artificial-intelligence-based system for recognizing driver-passenger conflict on buses, so that conflicts between passengers and the driver can be detected in time, the driving safety of the bus can be ensured, and such incidents can be reduced.
Disclosure of Invention
To solve the above technical problems in the prior art, the invention provides a bus-oriented system and method for recognizing violent behavior of passengers toward drivers, which use video and voice to perform multi-modal recognition of the scene, improve the accuracy of scene recognition, evaluate the recognition result of the whole scene, monitor the scene condition and ensure the driving safety of the vehicle.
The system of the invention is realized by adopting the following technical scheme: a bus-oriented system for recognizing violent behavior of passengers on drivers comprises a processor module, a video acquisition module and a voice acquisition module; the processor module is respectively connected with the video acquisition module and the voice acquisition module;
the processor module comprises a video voice data acquisition model, a video stream identification model, a voice stream identification model and a decision fusion model; the video voice data acquisition model is respectively connected with the video stream identification model and the voice stream identification model; the video stream identification model and the voice stream identification model are respectively connected with the decision fusion model;
the video acquisition module and the voice acquisition module acquire video stream information and voice stream information, which are input into the processor module for scene recognition;
the video and voice data acquisition model is used for acquiring video stream information and voice stream information and locally storing the video stream information and the voice stream information in the processor module;
the video stream identification model is used for processing and identifying the stored video stream information and obtaining a result;
the voice stream recognition model is used for processing and recognizing the stored voice stream information and obtaining a result;
and the decision fusion model is used for fusing the recognition result of the video stream recognition model and the recognition result of the voice stream recognition model to obtain a final recognition result.
The method of the invention is realized by adopting the following technical scheme: a bus-oriented method for recognizing violent behavior of passengers toward the driver, which mainly comprises the following steps:
s1, synchronously acquiring video stream information and voice stream information through a video voice data acquisition model;
s2, processing the collected video stream information by using a video stream feature extraction model in the video stream identification model, extracting optical flow and RGB feature information, acquiring a double-flow graph with time and space information, inputting the double-flow graph into an action identification model in the video stream identification model for identification, and acquiring an action identification result;
s3, carrying out voice detection on the collected voice stream information by using an audio stream feature extraction model in the voice stream recognition model, carrying out preprocessing for converting an audio waveform into a spectrogram, extracting spectral features including the spectrogram and a Mel frequency cepstrum coefficient, inputting the spectral features into the voice recognition model in the voice stream recognition model, and obtaining a voice recognition result;
s4, inputting the action recognition result and the voice recognition result into the decision fusion model to obtain the recognition result of the whole scene;
s5, judging the recognition result of the whole scene, and if an abnormal condition occurs, giving an alarm; if the situation is normal, the process returns to step S1 to continue monitoring the scene status.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The system for recognizing violent behavior of bus passengers toward the driver uses both video and voice to perform multi-modal recognition of the scene, which improves the accuracy of scene recognition.
2. The invention uses a late-fusion decision method in the decision fusion model: the fusion is performed by two fully connected layers, and the weights are obtained by training the network on a decision fusion data set, which reduces the arbitrariness of manually set weights and keeps the weights reasonable.
3. The video stream recognition model adopts an improved two-stream network: the RGB image and the optical flow stack are fused into a single picture and input into one neural network, so one network is saved compared with existing two-stream methods. The whole model is therefore lighter, runs faster and occupies fewer hardware resources, which makes it easier to port to a module with an embedded CPU.
4. The action recognition data set is shot with several volunteers simulating the relevant actions against a real bus background, so it fits the application scene well.
Drawings
FIG. 1 is a block diagram of the bus-oriented system for recognizing violent behavior of passengers toward the driver of the present invention;
FIG. 2 is a block diagram of the bus-oriented method for recognizing violent behavior of passengers toward drivers.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
As shown in FIG. 1, the bus-oriented system for recognizing violent behavior of passengers toward the driver comprises a processor module, a video acquisition module and a voice acquisition module; the processor module is respectively connected with the video acquisition module and the voice acquisition module;
the processor module comprises a video voice data acquisition model, a video stream identification model, a voice stream identification model and a decision fusion model; the video voice data acquisition model is respectively connected with the video stream identification model and the voice stream identification model; the video stream identification model and the voice stream identification model are respectively connected with the decision fusion model;
the video acquisition module and the voice acquisition module acquire video stream information and voice stream information, which are input into the processor module for scene recognition;
the video and voice data acquisition model is used for acquiring video stream information and voice stream information and locally storing the video stream information and the voice stream information in the processor module;
the video stream identification model comprises a video stream characteristic extraction model and an action identification model and is used for processing and identifying the stored video stream information and obtaining a result; specifically, dense optical flow information is acquired between every two continuous frames of images through a dense optical flow method, the acquired optical flow information is superposed to form an optical flow stack, and the acquired optical flow stack is superposed on a last frame gray image to acquire a fused dual-flow graph;
the voice stream recognition model comprises an audio stream feature extraction model and a voice recognition model and is used for processing and recognizing the stored voice stream information and obtaining a result;
the decision fusion model is used for fusing the recognition result of the video stream recognition model and the recognition result of the voice stream recognition model to obtain a final recognition result; specifically, using a late-fusion method, the output results of the action recognition model and the voice recognition model are combined into a one-dimensional matrix, which is fed into two fully connected layers to fuse the two models.
In this embodiment, the processor module may be a module processor built around the Broadcom BCM2837; the video acquisition module and the voice acquisition module are wired devices. The video acquisition module works stably at an acquisition frame rate of 30 fps and may be a wired USB 3.0 camera based on a Sony IMX291 sensor; the voice acquisition module is a wired USB microphone with a signal-to-noise ratio exceeding -67 dB.
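For illustration only, the following is a minimal sketch of how the video and voice data acquisition model might grab one synchronized video/audio segment on hardware of this kind, assuming Python with OpenCV and the sounddevice library; the device index, sampling rate, clip length and helper name are assumptions for the example, not details taken from the patent.

```python
import cv2
import sounddevice as sd

FPS = 30             # camera acquisition frame rate stated in the embodiment
CLIP_FRAMES = 10     # frames used for one action judgment (see step S2)
SAMPLE_RATE = 16000  # assumed microphone sampling rate

def capture_clip(camera_index=0):
    """Synchronously grab a short video clip and the matching audio segment."""
    cap = cv2.VideoCapture(camera_index)
    cap.set(cv2.CAP_PROP_FPS, FPS)

    duration = CLIP_FRAMES / FPS
    # Start a non-blocking audio recording so it runs while frames are grabbed.
    audio = sd.rec(int(duration * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)

    frames = []
    for _ in range(CLIP_FRAMES):
        ok, frame = cap.read()
        if ok:
            frames.append(frame)

    sd.wait()        # wait until the audio segment is complete
    cap.release()
    return frames, audio
```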
Based on the same inventive concept, the inventor also proposes a bus-oriented method for recognizing violent behavior of passengers toward the driver, as shown in FIG. 2, comprising the following steps:
s1, continuously and synchronously acquiring video stream information and voice stream information through a video and voice data acquisition model;
s2, processing the collected video stream information by using a video stream feature extraction model in the video stream identification model, extracting optical flow and RGB feature information, acquiring a double-flow graph with time and space information, inputting the double-flow graph into an action identification model in the video stream identification model for identification, and acquiring an action identification result;
s3, carrying out voice detection on the collected voice stream information by using an audio stream feature extraction model in the voice stream recognition model, carrying out preprocessing for converting an audio waveform into a spectrogram, extracting spectral features such as a chromagram and Mel-frequency cepstral coefficients, inputting the spectral features into the voice recognition model in the voice stream recognition model, and obtaining a voice recognition result;
s4, inputting the action recognition result and the voice recognition result into the decision fusion model to obtain the recognition result of the whole scene;
s5, judging the recognition result of the whole scene, and if an abnormal condition occurs, giving an alarm; if the situation is normal, the process returns to step S1 to continue monitoring the scene status.
In this embodiment, the video stream information of the video and voice data acquisition model in step S1 is collected from a driver-centered viewing angle.
In this embodiment, the dual-flow graph with temporal and spatial information in step S2 is obtained by computing dense optical flow between every two consecutive frames. The number of frames used for one action judgment is 10, so optical flow is computed between frame 1 and frame 2, frame 2 and frame 3, and so on, up to frame 9 and frame 10. Computing from the first frame to the 10th frame of the video yields 9 optical flow fields, which are superposed into an optical flow stack containing the temporal and spatial information of the video. The optical flow stack is then superposed on the last-frame gray image, which contains the spatial information, to obtain a fused dual-flow graph containing the spatial and temporal information of the scene.
Specifically, the process of superimposing optical flow information into an optical flow stack is as follows: a group of optical flow information is computed from every two adjacent frames and describes the change at each position of the image. Because the interval between two adjacent frames at the 30 frames/s sampling rate is short, a change value delta is computed for the moving region, while the change value of non-moving regions is 0. If the motion is in the horizontal direction, the optical flow information obtained from adjacent frames has the form M = [0, 0, ..., delta_m, ..., 0], where delta_m is a real number indicating that motion occurs at position m. In this embodiment the motions are fast, generally 8-15 frames per motion, and during this period the motion proceeds in one direction. Because of the continuity of motion, consecutive optical flow vectors are adjacent, i.e. the optical flow information following M = [0, 0, ..., delta_m, ..., 0] is M_next = [0, 0, ..., delta_{m+1}, ..., 0], where delta_{m+1} is a real number indicating motion at position m + 1; there are thus 9 groups of optical flow information between the 10 frames. Stacking them gives the optical flow stack:
[0, 0, ..., delta_m,     ..., 0]
[0, 0, ..., delta_{m+1}, ..., 0]
            ...
[0, 0, ..., delta_{m+8}, ..., 0]
which reflects the time trace of a complete motion.
Further, the optical flow stack is superimposed on the gray image to form the dual-flow graph used for recognition. The superposition is performed as follows: the two-dimensional single-channel gray-scale map (299 × 299 × 1) and the two-dimensional optical flow map (299 × 299 × 1) are superimposed into a two-channel image (299 × 299 × 2).
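As an illustration of the dual-flow graph construction described above, the following Python sketch uses OpenCV; the patent only specifies "a dense optical flow method", so the choice of the Farneback algorithm, the collapsing of the nine flow fields into a single magnitude map and the 299 × 299 target size are assumptions made for the example.

```python
import cv2
import numpy as np

def build_dual_flow_graph(frames, size=299):
    """Fuse 10 consecutive frames into a two-channel dual-flow graph.

    Channel 0: last-frame gray image (spatial information).
    Channel 1: accumulated dense optical flow magnitude (temporal information).
    """
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]

    flow_stack = np.zeros(grays[0].shape, dtype=np.float32)
    for prev, nxt in zip(grays[:-1], grays[1:]):
        # Dense optical flow between every two consecutive frames (Farneback).
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        magnitude, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        flow_stack += magnitude            # accumulate the 9 flow fields

    gray_last = cv2.resize(grays[-1], (size, size)).astype(np.float32) / 255.0
    flow_map = cv2.resize(flow_stack, (size, size))
    flow_map /= (flow_map.max() + 1e-6)    # normalise to [0, 1]

    return np.dstack([gray_last, flow_map])  # shape: size x size x 2
```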
In this embodiment, the voice detection and the preprocessing that converts the audio waveform into a spectrogram in step S3 work as follows: the audio stream feature extraction model recognizes a sequence of voice feature vectors to obtain a speech emotion recognition result and the corresponding confidence for the current time period. The audio stream feature extraction model computes the spectrogram of the voice stream information with a windowing method and then integrates over short time periods; that is, when the spectrogram is acquired, the voice stream signal is divided into several segments of time-domain and frequency-domain information obtained by short-time fast Fourier transform, and processing these yields a voice detection confidence and a sequence of multi-dimensional voice feature vectors such as the chromagram. Because the video stream information and the voice stream are collected at the same time, the duration of one judgment of the voice stream information is consistent with that of the video stream information.
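A minimal sketch of this audio feature extraction, assuming the librosa library; the STFT parameters, the energy-based stand-in for the voice detection confidence and the reading of the features as spectrogram, MFCC and chroma (chromagram) values are assumptions made for the example.

```python
import numpy as np
import librosa

def extract_voice_features(wav_path, sr=16000, n_mfcc=13):
    """Windowed short-time features for one judgment segment of the voice stream."""
    y, sr = librosa.load(wav_path, sr=sr)

    # Spectrogram via a windowed short-time Fourier transform.
    stft = librosa.stft(y, n_fft=512, hop_length=256, window="hann")
    spectrogram = np.abs(stft) ** 2

    # Mel-frequency cepstral coefficients and chroma features.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    chroma = librosa.feature.chroma_stft(S=spectrogram, sr=sr)

    # Crude stand-in for a voice detection confidence: the fraction of frames
    # whose RMS energy exceeds a fixed threshold (illustrative assumption only).
    rms = librosa.feature.rms(y=y)[0]
    voice_confidence = float(np.mean(rms > 0.02))

    return spectrogram, mfcc, chroma, voice_confidence
```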
In this embodiment, the decision fusion model in step S4 uses late fusion to combine the action recognition result and the speech recognition result. The action recognition model classifies actions into 5 classes according to their degree of violence, and the speech recognition model classifies speech into 8 classes according to the degree of emotion, each class carrying a different score. The decision fusion model then assigns one of 5 scene levels according to the score of the action-voice combination; the higher the level, the greater the danger.
In this embodiment, the action recognition model, the voice recognition model and the decision fusion model are all neural network models. The action recognition model uses transfer learning: an Inception-v3 network is retrained starting from the Inception-v3 initial weights. The speech recognition model uses an RNN. The decision fusion model uses a late-fusion method: the action recognition model outputs A_lever = [p_0, p_1, p_2, p_3, p_4], where p_i is the confidence of the i-th category in action recognition, i.e. the action recognition result is the i-th action whose p_i is the maximum; the speech recognition model outputs V_lever = [q_0, q_1, q_2, q_3, q_4, q_5, q_6, q_7], where q_i is the confidence of the i-th category in speech recognition. The output results of the action recognition model and the voice recognition model are combined into a 13 × 1 one-dimensional matrix:
[A_lever, V_lever] = [p_0, p_1, p_2, p_3, p_4, q_0, q_1, q_2, q_3, q_4, q_5, q_6, q_7]
where [A_lever, V_lever] is the input signal of the fusion layer;
[A_lever, V_lever] is fed into two fully connected layers to fuse the action recognition model and the voice recognition model and to output 5 scene levels, i.e. a 5 × 1 output matrix o_lever = [o_0, o_1, o_2, o_3, o_4], where o_i is the confidence of the i-th class in the final recognition.
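A minimal sketch of this two-fully-connected-layer fusion network, assuming Keras (TensorFlow); the hidden-layer width and activation functions are assumptions, since the patent only specifies two fully connected layers with a 13-dimensional input and 5 scene-level outputs.

```python
import numpy as np
from tensorflow import keras

N_ACTION, N_VOICE, N_LEVELS = 5, 8, 5   # class counts stated in the embodiment

def build_fusion_model(hidden=32):
    """Two fully connected layers mapping [A_lever, V_lever] to 5 scene levels."""
    inputs = keras.Input(shape=(N_ACTION + N_VOICE,))               # 13 x 1 input
    x = keras.layers.Dense(hidden, activation="relu")(inputs)       # first FC layer
    outputs = keras.layers.Dense(N_LEVELS, activation="softmax")(x) # second FC layer
    return keras.Model(inputs, outputs)

# Usage: concatenate the two recognition outputs and predict the scene level.
a_lever = np.random.dirichlet(np.ones(N_ACTION))   # placeholder action confidences
v_lever = np.random.dirichlet(np.ones(N_VOICE))    # placeholder voice confidences
fusion_input = np.concatenate([a_lever, v_lever])[None, :]
o_lever = build_fusion_model().predict(fusion_input)   # 5 scene-level confidences
```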
In this embodiment, the data sets for the neural network models include a speech recognition data set, an action recognition data set and a decision fusion data set. The speech recognition data set adopts a general clear recognition data set, while the action recognition data set is a self-made data set.
Specifically, the action recognition data set is made as follows: a school bus is borrowed, and N volunteers simulate on the bus videos of violent actions by passengers toward the driver. The actions are varied so as to cover the scene actions as fully as possible, and calm actions are also collected; together these form the original video data set. The videos are split into frames, continuous action frames are extracted, optical flow stacks are extracted from the continuous action frames and superimposed on the last frame, and the resulting dual-flow graph data set is used as the action recognition data set.
Specifically, the decision fusion data set is made as follows: the voice recognition model and the action recognition model are trained first; the speech recognition data set is fed into the trained voice recognition model to obtain voice classification probability outputs, and the action recognition data set is fed into the trained action recognition model to obtain action classification probability outputs. Because different classes carry different probability outputs, the n classes of action classification probability outputs and the m classes of voice classification probability outputs are combined and divided into 5 levels, which yields the labels of the decision fusion data set and thus the decision fusion data set itself.
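A minimal sketch of how such a decision fusion data set might be assembled from the outputs of the two trained models; the per-class severity scores and the score-to-level mapping below are illustrative assumptions, since the patent only states that each class carries a different score and that the combinations are divided into 5 levels.

```python
import numpy as np

# Per-class severity scores (illustrative assumptions only).
ACTION_SCORES = np.array([0, 1, 2, 3, 4])          # 5 action classes
VOICE_SCORES = np.array([0, 0, 1, 1, 2, 2, 3, 3])  # 8 voice classes

def label_scene_level(p_action, q_voice, n_levels=5):
    """Map one pair of class-probability outputs to a scene-level label (0..4)."""
    score = ACTION_SCORES[np.argmax(p_action)] + VOICE_SCORES[np.argmax(q_voice)]
    max_score = ACTION_SCORES.max() + VOICE_SCORES.max()
    return int(round(score / max_score * (n_levels - 1)))

def build_fusion_dataset(action_probs, voice_probs):
    """Pair action/voice model outputs into (13-dim input, level label) samples."""
    inputs, labels = [], []
    for p, q in zip(action_probs, voice_probs):
        inputs.append(np.concatenate([p, q]))
        labels.append(label_scene_level(p, q))
    return np.array(inputs), np.array(labels)
```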
In this embodiment, the bus-oriented method for recognizing violent behavior of passengers toward the driver can be deployed on processor-based platforms such as Windows 10, Ubuntu and Raspbian, and ran stably for more than 12 hours in a test on a Raspbian platform based on the Broadcom BCM2837.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. A bus-oriented system for recognizing violent behavior of passengers on drivers is characterized by comprising a processor module, a video acquisition module and a voice acquisition module; the processor module is respectively connected with the video acquisition module and the voice acquisition module;
the processor module comprises a video voice data acquisition model, a video stream identification model, a voice stream identification model and a decision fusion model; the video voice data acquisition model is respectively connected with the video stream identification model and the voice stream identification model; the video stream identification model and the voice stream identification model are respectively connected with the decision fusion model;
the video acquisition module and the voice acquisition module acquire video stream information and voice stream information, which are input into the processor module for scene recognition;
the video and voice data acquisition model is used for acquiring video stream information and voice stream information and locally storing the video stream information and the voice stream information in the processor module;
the video stream identification model is used for processing and identifying the stored video stream information and obtaining a result;
the voice stream recognition model is used for processing and recognizing the stored voice stream information and obtaining a result;
and the decision fusion model is used for fusing the recognition result of the video stream recognition model and the recognition result of the voice stream recognition model to obtain a final recognition result.
2. The violent behavior recognition system of claim 1, wherein the video stream recognition model comprises a video stream feature extraction model and an action recognition model; the voice stream recognition model comprises an audio stream feature extraction model and a voice recognition model.
3. Violent behavior recognition system according to claim 2,
the video stream identification model acquires dense optical flow information by a dense optical flow method between every two continuous frames of images, superposes the acquired optical flow information into an optical flow stack, and superposes the acquired optical flow stack on the last frame of gray image to acquire a fused dual-flow image;
and the decision fusion model combines the output results of the action recognition model and the voice recognition model into a one-dimensional matrix by using a late-fusion method, and inputs it into two fully connected layers to fuse the action recognition model and the voice recognition model.
4. The violent behavior recognizing system of claim 1, wherein the processor module is a module processor carrying a Broadcom BCM2837, and the video acquisition module and the voice acquisition module are wired devices.
5. The violent behavior recognizing method based on the violent behavior recognizing system of claim 1, comprising the steps of:
s1, synchronously acquiring video stream information and voice stream information through a video voice data acquisition model;
s2, processing the collected video stream information by using a video stream feature extraction model in the video stream identification model, extracting optical flow and RGB feature information, acquiring a double-flow graph with time and space information, inputting the double-flow graph into an action identification model in the video stream identification model for identification, and acquiring an action identification result;
s3, carrying out voice detection on the collected voice stream information by using an audio stream feature extraction model in the voice stream recognition model, carrying out preprocessing for converting an audio waveform into a spectrogram, extracting spectral features including a chromagram and Mel-frequency cepstral coefficients, inputting the spectral features into the voice recognition model in the voice stream recognition model, and obtaining a voice recognition result;
s4, inputting the action recognition result and the voice recognition result into the decision fusion model to obtain the recognition result of the whole scene;
s5, judging the recognition result of the whole scene, and if an abnormal condition occurs, giving an alarm; if the situation is normal, the process returns to step S1 to continue monitoring the scene status.
6. The violent behavior recognizing method of claim 5, wherein the angle of acquisition of the video stream information of the video/audio data acquisition model in step S1 is a driver-centered angle.
7. The violent behavior recognition method of claim 5, wherein the step S2 of acquiring the dual-flow graph with the time-space information comprises the following specific steps:
s21, obtaining dense optical flow information by a dense optical flow method between every two continuous frames of images;
s22, superposing the acquired optical flow information into an optical flow stack containing time information; the process of superposing the optical flow information into the optical flow stack is specifically as follows: a group of optical flow information is calculated from every two adjacent frames, a change value delta is calculated for the moving region, and the change value of the non-moving region is 0; if the motion is in the horizontal direction, the optical flow information obtained from adjacent frames has the form M = [0, 0, ..., delta_m, ..., 0], where delta_m is a real number indicating that motion occurred at position m; consecutive optical flow vectors are adjacent, i.e. the optical flow information following M = [0, 0, ..., delta_m, ..., 0] is M_next = [0, 0, ..., delta_{m+1}, ..., 0], where delta_{m+1} is a real number indicating motion at position m + 1; that is, there are n-1 groups of optical flow information between n frames; the optical flow stack after superposition is:
[0, 0, ..., delta_m,       ..., 0]
[0, 0, ..., delta_{m+1},   ..., 0]
              ...
[0, 0, ..., delta_{m+n-2}, ..., 0]
and S23, superposing the acquired optical flow stack on the final frame gray image containing the spatial information, and acquiring a fused double-flow image containing the spatial and temporal information of the scene.
8. The violent behavior recognizing method according to claim 5, wherein the human voice detection and the preprocessing of converting the audio waveform into the spectrogram in step S3 are specifically performed as follows:
s31, acquiring a voice emotion recognition result and a corresponding confidence coefficient of the current time period by recognizing the sequence of the voice feature vectors by using the audio stream feature extraction model;
s32, calculating a spectrogram of the voice stream information by using the audio stream feature extraction model with a windowing method, then integrating the time-domain and frequency-domain information, and processing to obtain a human voice detection confidence and a sequence of multi-dimensional human voice feature vectors including the chromagram.
9. The violent behavior recognition method of claim 5, wherein the decision fusion model in step S4 fuses the action recognition result and the voice recognition result by means of late fusion, wherein the action recognition model divides actions into a plurality of levels according to the degree of violence, the voice recognition model divides voices into a plurality of levels according to the degree of emotion, and the decision fusion model divides the scene into a plurality of scene levels according to the score of the action-voice combination.
10. The violent behavior recognizing method according to claim 5, wherein the action recognizing model, the voice recognizing model and the decision fusion model are neural network models;
the action recognition model utilizes a transfer learning method and is retrained by using an Inception-v3 network and the initial weights of Inception-v3;
the speech recognition model utilizes an RNN neural network;
the decision fusion model uses a late-fusion method; the action recognition model outputs A_lever = [p_0, p_1, p_2, p_3, ..., p_i], where p_i is the confidence of the i-th category in action recognition; the speech recognition model outputs V_lever = [q_0, q_1, q_2, q_3, q_4, q_5, q_6, ..., q_i], where q_i is the confidence of the i-th category in speech recognition; the output results of the action recognition model and the voice recognition model are combined into a one-dimensional matrix:
[A_lever, V_lever] = [p_0, p_1, p_2, p_3, p_4, q_0, q_1, q_2, q_3, q_4, q_5, q_6, ..., q_i]
where [A_lever, V_lever] is the input signal of the fusion layer;
[A_lever, V_lever] is input into two fully connected layers to fuse the action recognition model and the voice recognition model and output a plurality of scene levels, i.e. an output matrix o_lever = [o_0, o_1, o_2, o_3, ..., o_i], where o_i is the confidence of the i-th class in the final recognition.
CN202011388004.3A 2020-12-01 2020-12-01 System and method for identifying violence behaviors of passengers on drivers facing buses Active CN112766035B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011388004.3A CN112766035B (en) 2020-12-01 2020-12-01 System and method for identifying violence behaviors of passengers on drivers facing buses

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011388004.3A CN112766035B (en) 2020-12-01 2020-12-01 System and method for identifying violence behaviors of passengers on drivers facing buses

Publications (2)

Publication Number Publication Date
CN112766035A true CN112766035A (en) 2021-05-07
CN112766035B CN112766035B (en) 2023-06-23

Family

ID=75693755

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011388004.3A Active CN112766035B (en) 2020-12-01 2020-12-01 System and method for identifying violence behaviors of passengers on drivers facing buses

Country Status (1)

Country Link
CN (1) CN112766035B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113507627A (en) * 2021-07-08 2021-10-15 北京的卢深视科技有限公司 Video generation method and device, electronic equipment and storage medium
CN115601714A (en) * 2022-12-16 2023-01-13 广东汇通信息科技股份有限公司(Cn) Campus violent behavior identification method based on multi-mode data analysis
KR20230042926A (en) * 2021-09-23 2023-03-30 주식회사 소이넷 Apparatus and Method for Detecting Violence, Smart Violence Monitoring System having the same

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107292266A (en) * 2017-06-21 2017-10-24 吉林大学 A kind of vehicle-mounted pedestrian area estimation method clustered based on light stream
CN107301375A (en) * 2017-05-26 2017-10-27 天津大学 A kind of video image smog detection method based on dense optical flow
CN108564597A (en) * 2018-03-05 2018-09-21 华南理工大学 A kind of video foreground target extraction method of fusion gauss hybrid models and H-S optical flow methods
CN110555368A (en) * 2019-06-28 2019-12-10 西安理工大学 Fall-down behavior identification method based on three-dimensional convolutional neural network
CN115331289A (en) * 2022-08-09 2022-11-11 西安理工大学 Micro-expression recognition method based on video motion amplification and optical flow characteristics
CN115838062A (en) * 2022-08-01 2023-03-24 广州华方智能科技有限公司 Dense light stream-based belt conveyor carrier roller non-rotation detection system and method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107301375A (en) * 2017-05-26 2017-10-27 天津大学 A kind of video image smog detection method based on dense optical flow
CN107292266A (en) * 2017-06-21 2017-10-24 吉林大学 A kind of vehicle-mounted pedestrian area estimation method clustered based on light stream
CN108564597A (en) * 2018-03-05 2018-09-21 华南理工大学 A kind of video foreground target extraction method of fusion gauss hybrid models and H-S optical flow methods
CN110555368A (en) * 2019-06-28 2019-12-10 西安理工大学 Fall-down behavior identification method based on three-dimensional convolutional neural network
CN115838062A (en) * 2022-08-01 2023-03-24 广州华方智能科技有限公司 Dense light stream-based belt conveyor carrier roller non-rotation detection system and method
CN115331289A (en) * 2022-08-09 2022-11-11 西安理工大学 Micro-expression recognition method based on video motion amplification and optical flow characteristics

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113507627A (en) * 2021-07-08 2021-10-15 北京的卢深视科技有限公司 Video generation method and device, electronic equipment and storage medium
KR20230042926A (en) * 2021-09-23 2023-03-30 주식회사 소이넷 Apparatus and Method for Detecting Violence, Smart Violence Monitoring System having the same
KR102648004B1 (en) * 2021-09-23 2024-03-18 주식회사 소이넷 Apparatus and Method for Detecting Violence, Smart Violence Monitoring System having the same
CN115601714A (en) * 2022-12-16 2023-01-13 广东汇通信息科技股份有限公司(Cn) Campus violent behavior identification method based on multi-mode data analysis
CN115601714B (en) * 2022-12-16 2023-03-10 广东汇通信息科技股份有限公司 Campus violent behavior identification method based on multi-modal data analysis

Also Published As

Publication number Publication date
CN112766035B (en) 2023-06-23

Similar Documents

Publication Publication Date Title
CN112766035B (en) System and method for identifying violence behaviors of passengers on drivers facing buses
CN110827837B (en) Whale activity audio classification method based on deep learning
KR101969504B1 (en) Sound event detection method using deep neural network and device using the method
Thomas et al. Analyzing convolutional neural networks for speech activity detection in mismatched acoustic conditions
CN108319909B (en) Driving behavior analysis method and system
CN110706700B (en) In-vehicle disturbance prevention alarm method and device, server and storage medium
CN109559736B (en) Automatic dubbing method for movie actors based on confrontation network
CN111601074A (en) Security monitoring method and device, robot and storage medium
CN111091044B (en) Network appointment-oriented in-vehicle dangerous scene identification method
CN111613240B (en) Camouflage voice detection method based on attention mechanism and Bi-LSTM
CN114333070A (en) Examinee abnormal behavior detection method based on deep learning
CN112420033A (en) Vehicle-mounted device and method for processing words
CN113723292A (en) Driver-ride abnormal behavior recognition method and device, electronic equipment and medium
CN108323209A (en) Information processing method, system, cloud processing device and computer program product
CN116129405A (en) Method for identifying anger emotion of driver based on multi-mode hybrid fusion
CN115393968A (en) Audio-visual event positioning method fusing self-supervision multi-mode features
CN112541533A (en) Modified vehicle identification method based on neural network and feature fusion
CN112861762B (en) Railway crossing abnormal event detection method and system based on generation countermeasure network
CN116503631A (en) YOLO-TGB vehicle detection system and method
JP2019192201A (en) Learning object image extraction device and method for autonomous driving
CN115359464A (en) Motor vehicle driver dangerous driving behavior detection method based on deep learning
CN114463844A (en) Fall detection method based on self-attention double-flow network
CN113191209A (en) Intelligent early warning method based on deep learning
CN113393643B (en) Abnormal behavior early warning method and device, vehicle-mounted terminal and medium
Sathyanarayana et al. Automatic maneuver boundary detection system for naturalistic driving massive corpora

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant