WO2023216609A1 - Target behavior recognition method and apparatus based on visual-audio feature fusion, and application - Google Patents


Info

Publication number
WO2023216609A1
Authority
WO
WIPO (PCT)
Prior art keywords
visual
features
auditory
feature
audio
Application number
PCT/CN2022/141314
Other languages
French (fr)
Chinese (zh)
Inventor
毛云青
王国梁
齐韬
陈思瑶
葛俊
Original Assignee
城云科技(中国)有限公司
Application filed by 城云科技(中国)有限公司
Publication of WO2023216609A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06F 18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/08 Learning methods

Definitions

  • the present application relates to the field of intelligent security technology, and in particular to a target behavior recognition method, device and application of audio-visual feature fusion.
  • the methods of judging fights through artificial intelligence algorithms include: detecting pictures and making behavioral classification judgments, or detecting the positions of key human body points in multiple frames of pictures and making behavioral judgments.
  • the publication numbers CN112733629A and CN111401296A only disclose the use of image information to determine abnormal behavior.
  • in actual scenarios, the above algorithm will identify some large-movement labor operations, such as cleaning by multiple people, or physical exercise by multiple people, such as playing ball, as fighting behavior; in addition, the usual algorithm judgment uses only image information, so its accuracy needs to be improved.
  • Semantic consistency is of great significance in multi-modal information fusion, especially visual and auditory information fusion.
  • when multi-modal information is semantically consistent, the information is complementary; otherwise the modalities interfere with each other, as in the well-known "McGurk effect".
  • human hearing is strongly influenced by vision, which can lead to mishearing: when a sound does not match the visual signal, people may perceive a third, different sound, so simply fusing sound and video signals may even produce the opposite effect.
  • Embodiments of this application provide a target behavior recognition method, device and application for audio-visual feature fusion.
  • this solution uses a feature-level fusion method to fuse audio-visual information, which can improve the accuracy of abnormal behavior recognition.
  • embodiments of the present application provide a target behavior recognition method using audio-visual feature fusion.
  • the method includes: obtaining an audio-video segment of a preset duration to be recognized; collecting visual input information and auditory input information from the audio-video segment to be recognized; inputting the visual input information and the auditory input information together into a target behavior model, where the target behavior model includes a dual-branch channel feature extraction network, an autoencoding network and a fully connected layer recognition module; extracting features from the visual input information and the auditory input information respectively with the feature extraction network to obtain visual features and auditory features; using the autoencoding network to map the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fusion features; and inputting the fusion features into the fully connected layer recognition module for recognition to obtain the target behavior.
  • "using the auto-encoding network to map the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fusion features” includes: using the encoder of the auto-encoding network The visual features and the auditory features are mapped to the same subspace to obtain the auditory mapping features corresponding to the auditory features and the visual mapping features corresponding to the visual features; according to the decoder of the autoencoding network, all the visual mapping features and All the auditory mapping features are mapped into a multi-modal space, and each modality obtains the visual compensation features of other modal spaces as visual shared features and the auditory compensation features of other modalities as auditory shared features; splicing the visual The shared features, the auditory shared features, the visual features and the auditory features are used to obtain fused features.
  • the autoencoding network includes an encoder and a decoder, where the encoder includes a first fully connected layer, a second fully connected layer and an encoder layer connected in sequence; the visual features and the auditory features are jointly input into the encoder and pass in turn through the first fully connected layer, the second fully connected layer and the encoder layer, whose output gives the auditory mapping features corresponding to the auditory features and the visual mapping features corresponding to the visual features; the decoder includes two branches, each consisting of two fully connected layers; one branch takes the auditory mapping features as input, and its two fully connected layers map all auditory mapping features into the multi-modal space to obtain the visual compensation features corresponding to the auditory mapping features.
  • the other branch takes visual mapping features as input, and two fully connected layers map all visual mapping features into the multi-modal space to obtain auditory compensation features corresponding to the visual mapping features.
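  • For concreteness, the following is a minimal PyTorch-style sketch of such an autoencoding network. The layer widths, the use of one encoder shared by both modalities, and the module names are illustrative assumptions; the text above only fixes the layer counts (two fully connected layers plus an encoder layer in the encoder, two fully connected layers per decoder branch).

```python
# Hypothetical sketch of the autoencoding network described above; dimensions are assumptions.
import torch
import torch.nn as nn

class AudioVisualAutoencoder(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256, subspace_dim=128):
        super().__init__()
        # Encoder: first FC layer, second FC layer, then an encoder layer,
        # mapping segment-level features of either modality into the same subspace.
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, subspace_dim),
        )
        # Decoder branch 1: auditory mapping features -> visual compensation (shared) features.
        self.audio_to_visual = nn.Sequential(
            nn.Linear(subspace_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),
        )
        # Decoder branch 2: visual mapping features -> auditory compensation (shared) features.
        self.visual_to_audio = nn.Sequential(
            nn.Linear(subspace_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, f_visual, f_audio):
        g_visual = self.encoder(f_visual)                  # visual mapping feature
        h_audio = self.encoder(f_audio)                    # auditory mapping feature
        f_visual_shared = self.audio_to_visual(h_audio)    # visual shared feature
        f_audio_shared = self.visual_to_audio(g_visual)    # auditory shared feature
        # Splice original and shared features to obtain the fusion feature.
        fusion = torch.cat([f_visual, f_visual_shared, f_audio, f_audio_shared], dim=-1)
        return fusion, f_visual_shared, f_audio_shared
```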
  • the visual features and the auditory features input to the autoencoding network are marked with semantic mapping labels, where a semantic mapping label marks visual input information and auditory input information that describe the same semantic content; when the visual features or auditory features input to the autoencoding network carry a semantic mapping label, the loss function is the algebraic sum of the auditory average error value and the visual average error value; when they carry no semantic mapping label, the loss function is the difference between 1 and the algebraic sum of the auditory average error value and the visual average error value; the auditory average error value is the average of the absolute differences between all auditory features and all auditory shared features, and the visual average error value is the average of the absolute differences between all visual features and all visual shared features.
  • the loss function is obtained by the formula given in the detailed description, where y_autocoder is the loss function, N is the number of features, f_audio is the auditory feature, f'_audio is the auditory shared feature, f_visual is the visual feature, and f'_visual is the visual shared feature.
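  • A minimal sketch of how this loss could be computed follows, assuming the features are PyTorch tensors and that a boolean flag indicates whether the semantic mapping label (L_corr) is present; the function name and signature are assumptions.

```python
# Hypothetical implementation of the piecewise loss described above.
import torch

def autocoder_loss(f_audio, f_audio_shared, f_visual, f_visual_shared, has_semantic_label):
    # Auditory / visual average error: mean absolute difference between
    # the extracted features and their shared (compensation) counterparts.
    audio_err = torch.mean(torch.abs(f_audio - f_audio_shared))
    visual_err = torch.mean(torch.abs(f_visual - f_visual_shared))
    total = audio_err + visual_err
    if has_semantic_label:      # L_corr = 1: matched semantics, minimize the error sum
        return total
    return 1.0 - total          # L_corr = -1: unmatched semantics, push the errors apart

# Training sketch: the loss is backpropagated to update the autoencoder weights.
# loss = autocoder_loss(f_a, f_a_shared, f_v, f_v_shared, has_semantic_label=True)
# loss.backward()
```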
  • "labeling the visual features and the auditory features input to the autoencoding network with semantic mapping labels" includes: semantically tagging the acoustic anomaly information of the auditory input information and the visual anomaly information of the visual input information separately; if it is determined that both the auditory input information and the visual anomaly information carry the semantic tag, the semantic mapping label is assigned to that pair of auditory input information and visual anomaly information.
  • "collecting visual input information in the audio and video segment to be recognized” includes: collecting the difference between every two adjacent image frames from the audio and video segment to be recognized to obtain a difference sequence, The difference sequence is used as visual input information.
  • "collecting auditory input information in the audio and video segments to be recognized” includes: obtaining the original audio waveform corresponding to the audio and video segment to be recognized, and sampling from the original audio waveform at a preset sampling interval. Acoustic signals are collected to obtain auditory input information.
  • the auditory input information is represented as waveform data with time as the horizontal axis and acoustic signal as the vertical axis, wherein the auditory input information and the visual input information use the time domain as a unified scale.
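  • A sketch of sampling the acoustic signal from the raw waveform at a preset interval so that the samples line up with the video frames on the time axis; the 0.04 s default interval (matching a 25 fps video) is an assumption, not a value stated here.

```python
# Hypothetical waveform sampling at a preset time interval.
import numpy as np

def sample_waveform(waveform: np.ndarray, sample_rate: int, interval_s: float = 0.04):
    """Return (times, values): one acoustic sample every interval_s seconds.

    Time (horizontal axis) and acoustic signal (vertical axis) share the time
    domain with the visual frame sequence, so the two modalities use one scale.
    """
    step = max(1, int(round(interval_s * sample_rate)))
    idx = np.arange(0, len(waveform), step)
    return idx / sample_rate, waveform[idx]
```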
  • the dual-branch channel feature extraction network includes an auditory feature extraction network and a visual feature extraction network, where the auditory feature extraction network includes an AFEN module and an LSTM module: the multi-frame waveforms of the original audio waveform are input into the AFEN module to obtain the corresponding auditory frame-level features, and the auditory frame-level features are fused by the LSTM module to output auditory segment-level features.
  • the AFEN network includes 5 convolutional layers, 3 pooling layers and 3 sequentially connected fully connected layers, where a pooling layer follows the first, the second and the fifth convolutional layer respectively; each convolutional layer includes a ReLU activation function, which makes the activation pattern of the AFEN module sparser, and each pooling layer includes a local response normalization operation to avoid vanishing gradients.
  • the visual feature extraction network shares the AFEN module with the auditory feature extraction network.
  • in the visual feature extraction network, a ConvLSTM module is used instead of the LSTM module to fuse the multiple visual frame-level features output by the AFEN module and output visual segment-level features.
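  • The text fixes only the layer counts of AFEN and the use of an LSTM (or ConvLSTM) head, so the following sketch of the auditory branch fills in the rest with assumptions: 1-D convolutions over waveform frames, arbitrary channel counts and kernel sizes, and adaptive pooling before the fully connected layers.

```python
# Hypothetical sketch of the auditory branch: AFEN backbone + LSTM fusion head.
import torch
import torch.nn as nn

class AFEN(nn.Module):
    """Assumed frame-level extractor: 5 conv layers (each with ReLU), pooling with
    local response normalization after conv1, conv2 and conv5, then 3 FC layers."""
    def __init__(self, out_dim=512):
        super().__init__()
        def block(cin, cout, pool):
            layers = [nn.Conv1d(cin, cout, kernel_size=3, padding=1), nn.ReLU()]
            if pool:
                layers += [nn.MaxPool1d(2), nn.LocalResponseNorm(2)]
            return layers
        self.conv = nn.Sequential(
            *block(1, 32, True), *block(32, 64, True),
            *block(64, 96, False), *block(96, 96, False), *block(96, 64, True),
            nn.AdaptiveAvgPool1d(16), nn.Flatten(),
        )
        self.fc = nn.Sequential(
            nn.Linear(64 * 16, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, out_dim),
        )

    def forward(self, x):                     # x: (batch, 1, samples_per_frame)
        return self.fc(self.conv(x))

class AuditoryBranch(nn.Module):
    """Frame-level AFEN features fused over time by an LSTM into a segment-level feature."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.afen = AFEN(feat_dim)
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)

    def forward(self, frames):                # frames: (batch, T, samples_per_frame)
        b, t, n = frames.shape
        f = self.afen(frames.reshape(b * t, 1, n)).reshape(b, t, -1)
        _, (h, _) = self.lstm(f)              # last hidden state = auditory segment-level feature
        return h[-1]
```

  • In this sketch the visual branch would reuse the same convolutional structure on inter-frame differences, with a ConvLSTM head in place of the LSTM (see the ConvLSTM sketch further below).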
  • embodiments of the present application provide a target behavior recognition device with audio-visual feature fusion, including: an acquisition module for acquiring an audio-video segment of a preset duration to be recognized; and an information collection module for collecting the visual input information and the auditory input information in the audio-video segment to be recognized.
  • a feature extraction module is used to input the visual input information and the auditory input information together into the target behavior model, where the target behavior model includes a dual-branch channel feature extraction network, an autoencoding network and a fully connected layer recognition module; features are extracted from the visual input information and the auditory input information respectively with the feature extraction network to obtain visual features and auditory features, and the autoencoding network maps the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fusion features; a behavior recognition module is used to input the fusion features into the fully connected layer recognition module for recognition to obtain the target behavior.
  • embodiments of the present application provide an electronic device, including a memory and a processor.
  • a computer program is stored in the memory, and the processor is configured to run the computer program to execute the audio-visual feature fusion target behavior recognition method of any one of the first aspects.
  • embodiments of the present application provide a readable storage medium in which a computer program is stored, the computer program including program code for controlling a process to execute a process, and the process includes the audio-visual feature fusion target behavior recognition method according to any one of the first aspects.
  • This solution uses an autoencoding network to represent the shared semantic subspace mapping.
  • vision and hearing use time as a unified measure, so visual and auditory inputs that express the same semantics can be mapped to each other, thereby capturing the complementary information and high-level semantics between the different modalities and realizing feature fusion at the semantic level.
  • This solution designs a dual-branch channel feature extraction network.
  • the LSTM network is selected to process the temporal relationship between audio frame-level features, thereby obtaining auditory segment-level features;
  • the ConvLSTM network is selected to process the temporal relationships between video frame-level features to obtain visual segment-level features;
  • the feature heterogeneity between the visual segment-level features and the auditory segment-level features is then eliminated in the autoencoding network to achieve feature fusion.
  • This solution uses video inter-frame difference information to extract visual features, which can better reflect the characteristics of abnormal behavior.
  • the sound features are extracted from the original audio waveform rather than with spectrum-analysis-based methods such as MFCC or LPC, so they can be unified with the visual features on the time axis.
  • the image frames and the waveform are sampled separately at fixed intervals and unified into the time domain, which solves the problem of inconsistent audio and video feature processing during audio-visual information fusion.
  • Figure 1 is a flow chart of the main steps of a target behavior recognition method based on audio-visual feature fusion according to the first embodiment of the present application.
  • Figure 2 is a schematic diagram of the abnormal behavior auditory feature extraction network structure.
  • Figure 3 is a schematic diagram of the abnormal behavior recognition network structure based on autoencoding network mapping audio-visual feature fusion.
  • Figure 4 is a schematic structural diagram of the autoencoding network.
  • Figure 5 is a structural block diagram of a target behavior recognition device for audio-visual feature fusion according to the second embodiment of the present application.
  • FIG. 6 is a schematic diagram of the hardware structure of an electronic device according to the third embodiment of the present application.
  • the steps of the corresponding method are not necessarily performed in the order shown and described in this specification.
  • methods may include more or fewer steps than described in this specification.
  • a single step described in this specification may be broken down into multiple steps for description in other embodiments; and multiple steps described in this specification may also be combined into a single step for description in other embodiments.
  • Embodiments of the present application provide a target behavior recognition method based on audio-visual feature fusion, which uses the audio-visual feature fusion target behavior recognition scheme to make judgments on abnormal event behaviors.
  • Figure 1 is a flow chart of the main steps of a target behavior recognition method based on audio-visual feature fusion according to the first embodiment of the present application.
  • the target behavior recognition method of audio-visual feature fusion mainly includes the following steps 101 to 106.
  • Step 101 Obtain the audio and video segments to be identified of a preset duration.
  • Step 102 Collect visual input information and auditory input information in the audio and video segments to be recognized.
  • Step 103 Input the visual input information and the auditory input information together into the target behavior model, where the target behavior model includes a dual-branch channel feature extraction network, an autoencoding network and a fully connected layer recognition module.
  • Step 104 Extract features from the visual input information and the auditory input information respectively according to the feature extraction network to obtain visual features and auditory features.
  • Step 105 Use the autoencoding network to map the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fusion features.
  • Step 106 Input the fused features into the fully connected layer recognition module for recognition to obtain the target behavior.
  • in this solution, not only visual input information but also auditory input information is collected; both are used as inputs to the behavior model, and the model outputs the target behavior classification result.
  • this solution uses sound features, and the information of sound features and image features can complement each other to more accurately judge the target behavior, so it has better feature expression effect.
  • visual input information and auditory input information are input into the model for processing at the same time.
  • the advantage of parallel processing is that feature-level fusion can be performed in the feature extraction stage to better capture the relationship between the modalities. Compared with decision-level fusion methods, which separately fuse the results of visual recognition and auditory recognition, this solution can take the consistency between vision and hearing into account, so the multi-modal feature information can complement each other and achieve better performance.
  • the target behavior in this scheme can be normal behavior or abnormal behavior.
  • if the input is visual samples and auditory samples together with normal behaviors annotated from the samples,
  • the trained model will identify normal behaviors from the visual input information and auditory input information.
  • if the input to the model is samples in which abnormal behaviors are labeled,
  • the trained model will identify abnormal behaviors from the visual input information and auditory input information.
  • for example, visual samples and auditory samples describing violent scenes, together with the features describing those violent scenes annotated from the samples, are used as the input of the model, so that the trained model can identify violent behaviors such as fights.
  • an autoencoding network is used to achieve feature fusion.
  • the autoencoding network designed by the present invention can represent audio-visual data under the same metric, and can match visual information and auditory information that express the same semantics under that metric, that is, visual-auditory semantic consistency is achieved.
  • the embodiment of the present invention establishes a shared semantic subspace by mapping the visual features and the auditory features to the same subspace, thereby eliminating the feature heterogeneity between the video and audio modalities and capturing the complementary information and high-level semantics between the visual modality and the sound modality, achieving feature fusion at the semantic level.
  • that is, in this solution the encoder of the autoencoding network maps the visual features and the auditory features to the same subspace to obtain visual mapping features and auditory mapping features; the decoder of the autoencoding network maps the visual mapping features and the auditory mapping features into the multi-modal space to obtain the compensation features of the other modality as visual shared features and auditory shared features; and the visual shared features, the auditory shared features, the visual features and the auditory features are spliced to obtain the fusion features.
  • the extracted visual feature is f_visual; g() is the function that maps the visual feature f_visual to the same subspace, and inputting f_visual into g() yields the visual mapping feature.
  • H() is the function that maps the visual mapping feature g(f_visual) into the multi-modal space of shared semantics; inputting g(f_visual) into H() yields the compensation feature of the auditory modality as an auditory shared feature.
  • the extracted auditory feature is f_audio; h() is the function that maps the auditory feature f_audio to the same subspace, and inputting f_audio into h() yields the auditory mapping feature; G() is the function that maps the auditory mapping feature h(f_audio) into the multi-modal space of shared semantics, and inputting h(f_audio) into G() yields the compensation feature of the visual modality as a visual shared feature.
  • the autoencoding network includes an encoder and a decoder.
  • the encoder includes a first fully connected layer, a second fully connected layer and an encoder layer connected in sequence; the visual features and the auditory features are jointly input into the encoder and pass in turn through the first fully connected layer, the second fully connected layer and the encoder layer, whose output gives the auditory mapping features corresponding to the auditory features and the visual mapping features corresponding to the visual features;
  • the decoder includes two branches, each consisting of two fully connected layers; one branch takes the auditory mapping features as input, and its two fully connected layers map all auditory mapping features into the multi-modal space,
  • so that the visual compensation features corresponding to the auditory mapping features are obtained.
  • the other branch takes the visual mapping features as input, and uses two fully connected layers to map all the visual mapping features into the multi-modal space to obtain the auditory compensation features corresponding to the visual mapping features.
  • formula (3) is used to splice visual features, visual compensation features, auditory features, and auditory compensation features to obtain fusion features.
  • each modal space receives information from its inter-modal neighbors and intra-modal neighbors, and shares its own information at the same time.
  • the inter-modal neighbor information obtained by any modal space can make up for the loss of its own information
  • the obtained inter-modal neighbor information is used as a supplementary feature to enhance the expressive ability of the fused feature.
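  • Formula (3) itself is not reproduced in this text; in the notation introduced above, the splicing step presumably amounts to a concatenation of the four vectors, for example:

$$f_{\mathrm{fusion}} = \mathrm{concat}\big(f_{\mathrm{visual}},\; G(h(f_{\mathrm{audio}})),\; f_{\mathrm{audio}},\; H(g(f_{\mathrm{visual}}))\big)$$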
  • the error of the autoencoding network includes two parts: the error of the acoustic decoder and the error of the visual decoder, and the sum of the two is the total error; the error can be backpropagated to update the weights of the autoencoding network.
  • Introducing semantic labels to calculate the autoencoding network error includes: labeling the visual features and the auditory features input to the autoencoding network with semantic mapping labels, where a semantic mapping label marks visual input information and auditory input information that describe the same semantic content; when the visual features or auditory features input to the autoencoding network carry a semantic mapping label, the loss function is the algebraic sum of the auditory average error value and the visual average error value;
  • the loss function is the difference between 1 and the algebraic sum of the auditory average error value and the visual average error value
  • the auditory average error value is characterized as the average of the absolute differences between all auditory features and all auditory shared features
  • the visual average error value is characterized as the average of the absolute differences between all visual features and all visual shared features.
  • the loss function is obtained by the formula given in the detailed description, where y_autocoder is the loss function, N is the number of features, f_audio is the auditory feature, f'_audio is the auditory shared feature, f_visual is the visual feature, and f'_visual is the visual shared feature.
  • this solution designs a new loss function that allows the model to learn the bias information of semantic mapping on the timeline.
  • this solution reduces the interference of blind splicing features, enhances the model's ability to distinguish abnormal video semantic correspondence, and is more conducive to eliminating interference between non-corresponding features.
  • semantic embedding learning can be regarded as a form of regularization, which helps to enhance the generalization ability of the model and prevent overfitting.
  • the semantic mapping label is obtained by semantically tagging the acoustic anomaly information of the auditory input information and the visual anomaly information of the visual input information separately; if it is determined that both the auditory input information and the visual anomaly information carry the semantic tag, the semantic mapping label is assigned to that pair of auditory input information and visual anomaly information.
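  • A sketch of this label-assignment rule, assuming the per-segment anomaly annotations are available as booleans; the function and its signature are illustrative.

```python
# Hypothetical rule for assigning the semantic mapping label (L_corr).
def assign_semantic_mapping_label(audio_has_anomaly_tag: bool, visual_has_anomaly_tag: bool) -> int:
    """+1 if both the auditory and the visual anomaly information carry the semantic
    tag (semantic mapping label assigned to the pair), otherwise -1 (no label)."""
    return 1 if (audio_has_anomaly_tag and visual_has_anomaly_tag) else -1
```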
  • this solution maps the original audio waveform to a two-dimensional field, that is, the x-axis of the sound data is time and the y-axis is the waveform.
  • This solution uses the surveillance video screen as the original video data.
  • the original video data has approximately 30 to 60 frames of images in 1 second. That is, the x-axis of the visual data is also time, and the y-axis is the image frame.
  • the two can be aligned on the time axis based on whether the semantics are the same to achieve correspondence between visual features and auditory features, thereby achieving semantic consistency between vision and hearing.
  • the information of the original audio and video segments to be identified is also processed.
  • the present invention collects the difference between every two adjacent image frames from the audio and video segments to be identified to obtain a difference sequence,
  • the difference sequence is used as visual input information.
  • the objects of abnormal behavior recognition are violent behaviors such as punching someone. A single frame showing a fist on the opponent's chest or at a person's own waist does not by itself accurately indicate the behavior of the people in the video; but if the fist is still at the person's waist in the first few frames and is on the opponent's chest in the following frames, the person in the video has performed the abnormal behavior of punching. It can be seen that the differences between video frames can extract the required information more accurately than the video frames themselves, so choosing the difference between adjacent video frames as the input of the network model gives a better feature expression than inputting the video frames themselves into the model.
  • the original audio waveform is used as the auditory information to be collected, so "collecting the auditory input information in the audio-video segment to be recognized" includes: obtaining the original audio waveform corresponding to the audio-video segment to be recognized, and sampling acoustic signals from the original audio waveform at a preset sampling interval to obtain the auditory input information.
  • the sound extracted by this solution is based on the original audio waveform, so it can be expressed in a unified metric with the visual features; that is, the auditory input information is represented as waveform data with time as the horizontal axis and the acoustic signal as the vertical axis, where the auditory input information and the visual input information use the time domain as a unified scale.
  • the feature extraction network for extracting visual and auditory features uses a dual-branch channel, that is, visual input information and auditory input information can be input into the feature extraction network at the same time, and feature extraction is performed separately to output visual features and auditory features.
  • the dual-branch channel feature extraction network includes an auditory feature extraction network and a visual feature extraction network, wherein the auditory feature extraction network includes an AFEN module and an LSTM module, and the multi-frame waveforms in the original audio waveform are input into the AFEN module , obtain multiple corresponding auditory frame-level features, fuse the multiple auditory frame-level features through the LSTM module, and output auditory segment-level features.
  • the abnormal behavior auditory feature extraction network structure AFEN structure includes 5 convolutional layers, 3 pooling layers and 3 sequentially connected fully connected layers.
  • the output of the last fully connected layer passes through the SoftMax layer.
  • the pooling layer is connected after the first convolution layer, the second convolution layer and the fifth convolution layer respectively.
  • the black rectangle represents each convolution layer
  • the white rectangle represents the pooling layer
  • the three rectangles immediately after the last pooling layer represent the three fully connected layers
  • the black rectangle after the fully connected layer represents the LSTM Structure
  • the LSTM network is selected to process the temporal relationship between audio frame-level features and obtain segment-level features.
  • the convolutional layer contains the ReLU activation function, making the activation pattern of the network sparser.
  • the pooling layer contains a local response normalization operation to avoid gradient disappearance and improve the training speed of the network.
  • the timing information is summarized through the LSTM network.
  • This method can be adapted to the entire surveillance video mechanism and has no rigid requirements in terms of audio and video length, sampling rate, etc. This solves the problem of feature timeline alignment.
  • this model also greatly reduces the complexity of visual and auditory feature fusion and improves the stability of the model.
  • the abnormal behavior visual feature extraction network structure is the same as the AFEN convolution structure shown in Figure 2, and the convolutional LSTM (ConvLSTM) module is used to replace the last LSTM module, and the original input signal is changed to the inter-image frame difference.
  • the difference is that, compared with auditory feature extraction, visual features depend more on spatiotemporal relationships for action recognition; ConvLSTM captures spatiotemporal relationships better than LSTM and can solve spatiotemporal sequence prediction problems such as video classification and action recognition.
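  • PyTorch has no built-in ConvLSTM module, so the sketch below hand-rolls a ConvLSTM cell of the usual form (LSTM gates computed with 2-D convolutions so spatial structure is preserved); it is an illustrative stand-in, not the exact module used here.

```python
# Hypothetical minimal ConvLSTM cell.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        self.hid_ch = hid_ch
        # One convolution produces all four gates (input, forget, output, candidate).
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size, padding=kernel_size // 2)

    def forward(self, x, state=None):          # x: (batch, in_ch, H, W)
        b, _, hgt, wid = x.shape
        if state is None:
            h = x.new_zeros(b, self.hid_ch, hgt, wid)
            c = x.new_zeros(b, self.hid_ch, hgt, wid)
        else:
            h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g                      # cell state carries temporal information
        h = o * torch.tanh(c)                  # hidden state keeps the spatial layout
        return h, (h, c)
```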
  • the present invention designs an abnormal behavior recognition model based on audio-visual information fusion using autoencoding network.
  • the model structure is shown in Figure 3.
  • the model includes four parts: visual feature extraction, auditory feature extraction, autoencoding network and fully connected recognition model.
  • Visual and acoustic feature extraction uses a dual-channel feature extraction method.
  • the network structure is based on a deep convolutional network.
  • for visual features, the difference between video frames is used as the original input, and a deep convolutional network plus a ConvLSTM network is used to extract segment-level visual features.
  • for auditory features, audio waveforms are used as the network input, and a deep convolutional network plus an LSTM network is used to extract segment-level auditory features.
  • the autoencoding network shown in Figure 4 is used to construct a shared semantic subspace to eliminate the semantic bias between visual and auditory features, the CONCAT method is used to combine the visual and auditory features, and finally a fully connected model is used to identify abnormal behaviors. The target behavior recognition method of this scheme therefore relies on the shared semantic subspace of the autoencoding network to integrate auditory features into visual features to supplement visual information, and to integrate visual features into auditory features to supplement auditory information, achieving a complementary effect between the different modalities. The fused features obtained through feature fusion thus carry richer semantic expressions, so the classification results obtained when the model uses the fused features to classify behaviors are also more accurate. On this basis, this solution improves recognition accuracy and reduces the missed detection rate.
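  • Putting the four parts together, the following sketch shows one possible forward pass (visual branch, auditory branch, autoencoding fusion, fully connected recognizer); it reuses the hypothetical modules sketched earlier in this text, and the classifier width and class count are assumptions.

```python
# Hypothetical end-to-end composition of the four parts of the model.
import torch.nn as nn

class AbnormalBehaviorRecognizer(nn.Module):
    def __init__(self, visual_branch, auditory_branch, autoencoder, feat_dim=512, num_classes=2):
        super().__init__()
        self.visual_branch = visual_branch      # inter-frame differences -> visual segment feature
        self.auditory_branch = auditory_branch  # waveform frames -> auditory segment feature
        self.autoencoder = autoencoder          # shared-subspace mapping + CONCAT fusion
        self.classifier = nn.Sequential(        # fully connected recognition module
            nn.Linear(4 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, frame_diffs, waveform_frames):
        f_visual = self.visual_branch(frame_diffs)
        f_audio = self.auditory_branch(waveform_frames)
        fusion, _, _ = self.autoencoder(f_visual, f_audio)  # fusion of original + shared features
        return self.classifier(fusion)                      # target behavior scores
```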
  • this solution provides a target behavior recognition device with audio-visual feature fusion.
  • the device uses the above target behavior recognition method with audio-visual feature fusion to identify the target behavior.
  • the device includes:
  • the acquisition module 501 is used to acquire the audio and video segments to be recognized of a preset duration.
  • the information collection module 502 is used to collect visual input information and auditory input information in the audio and video segments to be recognized.
  • Feature extraction module 503 is used to input the visual input information and the auditory input information into the target behavior model together, where the target behavior model includes a dual-branch channel feature extraction network, an autoencoding network and a fully connected layer recognition module.
  • features are respectively extracted from the visual input information and the auditory input information to obtain visual features and auditory features.
  • the autoencoding network is used to map the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fusion features.
  • the behavior recognition module 504 is used to input the fused features into the fully connected layer recognition module for recognition to obtain the target behavior.
  • an electronic device includes a memory 604 and a processor 602.
  • the memory 604 stores a computer program
  • the processor 602 is configured to run the computer program to perform the steps in any of the above method embodiments.
  • the above-mentioned processor 602 may include a central processing unit (CPU) or an application-specific integrated circuit (ASIC), or may be configured as one or more integrated circuits that implement the embodiments of the present application.
  • memory 604 may include mass storage for data or instructions.
  • the memory 604 may include a hard disk drive (Hard Disk Drive, HDD for short), floppy disk drive, Solid State Drive (Solid State Drive, SSD for short), flash memory, optical disk, magneto-optical disk, magnetic tape, or Universal Serial Bus (Universal Serial Bus, Referred to as USB) drive or a combination of two or more of these.
  • Memory 604 may include removable or non-removable (or fixed) media, where appropriate.
  • Memory 604 may be internal or external to the data processing device, where appropriate.
  • memory 604 is Non-Volatile memory.
  • the memory 604 includes read-only memory (ROM) and random access memory (RAM).
  • the ROM may be a mask-programmed ROM, a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), an electrically alterable ROM (EAROM) or flash memory (FLASH), or a combination of two or more of these.
  • the RAM can be static random access memory (SRAM) or dynamic random access memory (DRAM), where the DRAM may be fast page mode dynamic random access memory (FPM DRAM), extended data out dynamic random access memory (EDO DRAM), synchronous dynamic random access memory (SDRAM), etc.
  • Memory 604 may be used to store or cache various data files required for processing and/or communication, as well as possibly computer program instructions executed by processor 602.
  • the processor 602 reads and executes the computer program instructions stored in the memory 604 to implement any of the audio-visual feature fusion target behavior recognition methods in the above embodiments.
  • the above-mentioned electronic device may also include a transmission device 606 and an input-output device 608, wherein the transmission device 606 is connected to the above-mentioned processor 602, and the input-output device 608 is connected to the above-mentioned processor 602.
  • Transmission device 606 may be used to receive or send data via a network.
  • Specific examples of the above-mentioned network may include a wired or wireless network provided by a communication provider of the electronic device.
  • the transmission device includes a network adapter (Network Interface Controller, NIC for short), which can be connected to other network devices through a base station to communicate with the Internet.
  • the transmission device 606 may be a radio frequency (Radio Frequency, RF for short) module, which is used to communicate with the Internet wirelessly.
  • Input and output devices 608 are used to input or output information.
  • the input information may be audio and video segments to be recognized, etc.
  • the output information may be the target behavior to be recognized, etc.
  • the above-mentioned processor 602 can be configured to perform the following steps through a computer program:
  • S102 Collect visual input information and auditory input information in the audio and video segments to be recognized.
  • various embodiments may be implemented in hardware or special purpose circuitry, software, logic, or any combination thereof. Some aspects of the invention may be implemented in hardware, while other aspects may be implemented in firmware or software executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. Although various aspects of the invention may be shown and described as block diagrams, flow diagrams, or using some other graphical representation, it is to be understood that, as non-limiting examples, the blocks, devices, systems, techniques, or methods described herein may be implemented in hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controllers or other computing devices, or some combination thereof.
  • Embodiments of the invention may be implemented by computer software executable by a data processor of the mobile device, such as in a processor entity, or by hardware, or by a combination of software and hardware.
  • computer software or programs are also referred to as program products.
  • a computer program product may include one or more computer-executable components that are configured to perform embodiments when the program is executed.
  • One or more computer-executable components may be at least one software code or a portion thereof.
  • any block of the logic flow in the figures may represent program steps, or interconnected logic circuits, blocks, and functions, or a combination of program steps and logic circuits, blocks, and functions.
  • Software may be stored on physical media such as memory chips or memory blocks implemented within a processor, magnetic media such as hard or floppy disks, and optical media such as, for example, DVD and its data variants, CDs.
  • Physical media are non-transient media.

Abstract

A target behavior recognition method and apparatus based on visual-audio feature fusion, and an application, which relate to the technical field of intelligent security protection. In the method, visual information and audio information are input into a specified algorithm network, a visual feature and an audio feature are extracted via different feature extraction networks of two branches, and timing features are calculated via an LSTM network; and a shared semantic sub-space is constructed by means of an auto-encoding network, a semantic bias between the visual feature and the audio feature is eliminated, and finally, the visual feature and the audio feature are fused, such that a target behavior can be recognized on the basis of a fused feature. The method can improve the accuracy of abnormal behavior recognition.

Description

Target behavior recognition method, device and application based on audio-visual feature fusion
Technical Field
The present application relates to the field of intelligent security technology, and in particular to a target behavior recognition method, device and application based on audio-visual feature fusion.
Background Art
In the fields of urban management and safety management, monitoring emergencies and raising real-time alarms are very important for public safety management. In real life, fighting between people is one of the most common abnormal emergencies. Traditional alarm methods mainly include learning that a fight has occurred when someone calls the police, which is obviously a lagging form of reporting, or having security personnel watch the surveillance footage to detect abnormalities, which consumes labor.
Therefore, relevant technologies already exist that use surveillance cameras to monitor around the clock and then judge and report fighting behavior with artificial intelligence algorithms, which can greatly improve the timeliness and accuracy of alarms for sudden abnormal events such as fights.
In the existing technology, the methods of judging fights with artificial intelligence algorithms include detecting a picture and making a behavior classification judgment, or detecting the positions of key human body points in multiple frames and making a behavior judgment. For example, publications CN112733629A and CN111401296A only disclose the use of image information to determine abnormal behavior. In actual scenarios, such algorithms will identify large-movement labor operations, such as several people cleaning, or group physical exercise such as playing ball, as fighting; in addition, the usual algorithm judgment uses only image information, so its accuracy needs to be improved.
In addition, existing technologies such as CN104243894A and CN102098492A judge abnormal behavior in two separate steps and then fuse the results at the decision level. However, decision-level fusion has limited effect on improving abnormal-behavior video recognition performance: it can only fuse the scores produced after each branch modality has made its own decision, without considering the semantic consistency of the information in each modality, so it cannot solve the problems of time misalignment and semantic inconsistency between video and sound.
Semantic consistency is of great significance in multi-modal information fusion, especially the fusion of visual and auditory information. When multi-modal information is semantically consistent, the information is complementary; otherwise the modalities interfere with each other, as in the well-known "McGurk effect". Human hearing is strongly influenced by vision, which can lead to mishearing: when a sound does not match the visual signal, people may perceive a third, different sound, so simply fusing sound and video signals may even produce the opposite effect.
Therefore, when the semantics of multi-modal information are inconsistent, feature fusion between modalities without any common measure not only fails to achieve information complementarity between the modalities but may also degrade algorithm performance. Owing to the particular way abnormal behavior occurs, the semantic inconsistency of audio-visual information is mainly reflected in two points: first, the audio-visual data may not be aligned on the timeline, for example the sound features may appear later than the visual features; second, there is semantic expression bias between vision and hearing. These are all problems that need to be solved in the process of multi-modal feature fusion.
Based on this, no effective solution has yet been proposed for the problem that abnormal behavior recognition algorithms cannot properly integrate audio-visual features to accurately determine whether abnormal behavior exists.
Summary of the Invention
Embodiments of this application provide a target behavior recognition method, device and application based on audio-visual feature fusion. In view of the existing abnormal behavior recognition algorithms, this solution uses a feature-level fusion method to fuse audio-visual information, which can improve the accuracy of abnormal behavior recognition.
In a first aspect, embodiments of the present application provide a target behavior recognition method based on audio-visual feature fusion. The method includes: obtaining an audio-video segment of a preset duration to be recognized; collecting visual input information and auditory input information from the audio-video segment to be recognized; inputting the visual input information and the auditory input information together into a target behavior model, where the target behavior model includes a dual-branch channel feature extraction network, an autoencoding network and a fully connected layer recognition module; extracting features from the visual input information and the auditory input information respectively with the feature extraction network to obtain visual features and auditory features; using the autoencoding network to map the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fusion features; and inputting the fusion features into the fully connected layer recognition module for recognition to obtain the target behavior.
In some of these embodiments, "using the autoencoding network to map the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fusion features" includes: mapping the visual features and the auditory features to the same subspace with the encoder of the autoencoding network to obtain the auditory mapping features corresponding to the auditory features and the visual mapping features corresponding to the visual features; mapping all the visual mapping features and all the auditory mapping features into a multi-modal space with the decoder of the autoencoding network, so that each modality obtains the visual compensation features of the other modal space as visual shared features and the auditory compensation features of the other modality as auditory shared features; and splicing the visual shared features, the auditory shared features, the visual features and the auditory features to obtain the fusion features.
In some of these embodiments, the autoencoding network includes an encoder and a decoder, where the encoder includes a first fully connected layer, a second fully connected layer and an encoder layer connected in sequence; the visual features and the auditory features are jointly input into the encoder and pass in turn through the first fully connected layer, the second fully connected layer and the encoder layer, whose output gives the auditory mapping features corresponding to the auditory features and the visual mapping features corresponding to the visual features. The decoder includes two branches, each consisting of two fully connected layers; one branch takes the auditory mapping features as input, and its two fully connected layers map all auditory mapping features into the multi-modal space to obtain the visual compensation features corresponding to the auditory mapping features; the other branch takes the visual mapping features as input, and its two fully connected layers map all visual mapping features into the multi-modal space to obtain the auditory compensation features corresponding to the visual mapping features.
In some of these embodiments, the visual features and the auditory features input to the autoencoding network are marked with semantic mapping labels, where a semantic mapping label marks visual input information and auditory input information that describe the same semantic content; when the visual features or auditory features input to the autoencoding network carry a semantic mapping label, the loss function is the algebraic sum of the auditory average error value and the visual average error value; when they carry no semantic mapping label, the loss function is the difference between 1 and the algebraic sum of the auditory average error value and the visual average error value; the auditory average error value is the average of the absolute differences between all auditory features and all auditory shared features, and the visual average error value is the average of the absolute differences between all visual features and all visual shared features. The loss function is obtained by the following formula:
$$y_{\mathrm{autocoder}} = \begin{cases} \dfrac{1}{N}\displaystyle\sum_{i=1}^{N}\Big(\big|f_{\mathrm{audio},i}-f'_{\mathrm{audio},i}\big| + \big|f_{\mathrm{visual},i}-f'_{\mathrm{visual},i}\big|\Big), & L_{\mathrm{corr}}=1 \\[2ex] 1-\dfrac{1}{N}\displaystyle\sum_{i=1}^{N}\Big(\big|f_{\mathrm{audio},i}-f'_{\mathrm{audio},i}\big| + \big|f_{\mathrm{visual},i}-f'_{\mathrm{visual},i}\big|\Big), & L_{\mathrm{corr}}=-1 \end{cases}$$
y autocoder为损失函数,N为特征数量,faudio为听觉特征,f’ audio为听觉共享特征,f visual为视觉特征,f’ visual为视觉共享特征,L corr=1表示存在语义映射标签,L corr=-1表示不存在语义映射标签。 y autocoder is the loss function, N is the number of features, faudio is the auditory feature, f' audio is the auditory shared feature, f visual is the visual feature, f' visual is the visual shared feature, L corr = 1 means there is a semantic mapping label, L corr =-1 indicates that there is no semantic mapping tag.
在其中一些实施例中,“对输入所述自编码网络的所述视觉特征和所述听觉特征采用语义映射标签进行标记”包括:分别对所述听觉输入信息的声学异常信息及所述视觉输入信息的视觉异常信息进行语义标记,若判断出所述听觉输入信息与所述视觉异常信息都具有所述语义标记,则为该组所述听觉输入信息与所述视觉异常信息分配所述语义映射标签。In some embodiments, "labeling the visual features and the auditory features input to the autoencoding network using semantic mapping labels" includes: separately labeling the acoustic anomaly information of the auditory input information and the visual input The visual abnormality information of the information is semantically tagged. If it is determined that both the auditory input information and the visual abnormality information have the semantic tags, the semantic mapping is assigned to the set of auditory input information and the visual abnormality information. Label.
在其中一些实施例中,“采集所述待识别音视频段中的视觉输入信息”包括:从所述待识别音视频段中采集每相邻两帧图像帧的差值,得到差值序列,将所述差值序列作为视觉输入信息。In some of the embodiments, "collecting visual input information in the audio and video segment to be recognized" includes: collecting the difference between every two adjacent image frames from the audio and video segment to be recognized to obtain a difference sequence, The difference sequence is used as visual input information.
在其中一些实施例中,“采集所述待识别音视频段中的听觉输入信息”包括:获取所述待识别音视频段对应的原始音频波形,从所述原始音频波形中以预设采样间隔采集声学信号,得到听觉输入信息。In some embodiments, "collecting auditory input information in the audio and video segments to be recognized" includes: obtaining the original audio waveform corresponding to the audio and video segment to be recognized, and sampling from the original audio waveform at a preset sampling interval. Acoustic signals are collected to obtain auditory input information.
在其中一些实施例中,所述听觉输入信息表征为以时间作为横轴,以声学信号作为纵轴的波形数据,其中所述听觉输入信息与视觉输入信息以时域为统一尺度。In some embodiments, the auditory input information is represented as waveform data with time as the horizontal axis and acoustic signal as the vertical axis, wherein the auditory input information and the visual input information use the time domain as a unified scale.
In some embodiments, the dual-branch feature extraction network includes an auditory feature extraction network and a visual feature extraction network. The auditory feature extraction network includes an AFEN module and an LSTM module: the multi-frame waveforms of the original audio waveform are input into the AFEN module to obtain the corresponding multiple auditory frame-level features, and the multiple auditory frame-level features are fused by the LSTM module to output auditory segment-level features.
In some embodiments, the AFEN network includes 5 convolutional layers, 3 pooling layers and 3 sequentially connected fully connected layers, where the pooling layers are connected after the first, second and fifth convolutional layers respectively; each convolutional layer includes a ReLU activation function that makes the activation pattern of the AFEN module sparser, and each pooling layer includes a local response normalization operation that avoids gradient vanishing.
In some embodiments, the visual feature extraction network shares the AFEN module with the auditory feature extraction network, and in the visual feature extraction network a ConvLSTM module replaces the LSTM module to fuse the multiple visual frame-level features output by the AFEN module and output visual segment-level features.
In a second aspect, embodiments of the present application provide a target behavior recognition apparatus based on audio-visual feature fusion, including: an acquisition module, configured to acquire an audio-video segment to be recognized of a preset duration; an information collection module, configured to collect visual input information and auditory input information in the audio-video segment to be recognized; a feature extraction module, configured to input the visual input information and the auditory input information together into a target behavior model, where the target behavior model includes a dual-branch feature extraction network, an autoencoding network and a fully connected layer recognition module, to extract features from the visual input information and the auditory input information respectively according to the feature extraction network to obtain visual features and auditory features, and to use the autoencoding network to map the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fused features; and a behavior recognition module, configured to input the fused features into the fully connected layer recognition module for recognition to obtain the target behavior.
In a third aspect, embodiments of the present application provide an electronic apparatus, including a memory and a processor, where a computer program is stored in the memory and the processor is configured to run the computer program to perform the target behavior recognition method based on audio-visual feature fusion according to any one of the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium in which a computer program is stored, the computer program including program code for controlling a process to execute the process, where the process includes the target behavior recognition method based on audio-visual feature fusion according to any one of the first aspect.
本申请实施例的主要贡献和创新点如下:The main contributions and innovations of the embodiments of this application are as follows:
This solution uses an autoencoding network to represent the shared semantic subspace mapping. In the autoencoding network, vision and hearing use time as a unified measure, so that visual and auditory data representing the same semantics are mapped to each other, thereby capturing the complementary information and high-level semantics between the different modalities and achieving feature fusion at the semantic level.
This solution designs a dual-branch feature extraction network. In this network, an LSTM network is selected to process the temporal relationship between audio frame-level features to obtain auditory segment-level features, and a ConvLSTM network is selected to process the temporal relationship between video frame-level features to obtain visual segment-level features; the feature heterogeneity between the visual segment-level features and the auditory segment-level features is then eliminated in the autoencoding network to achieve feature fusion.
This solution uses inter-frame difference information of the video to extract visual features, which better reflects the characteristics of abnormal behavior, and extracts sound features from the original audio waveform rather than from spectrum-analysis-based representations such as MFCC or LPC. Images and waveforms can therefore both be sampled at time intervals, and unifying them in the time domain solves the problem of inconsistent audio and video feature processing during audio-visual information fusion.
本申请的一个或多个实施例的细节在以下附图和描述中提出,以使本申请的其他特征、目的和优点更加简明易懂。The details of one or more embodiments of the present application are set forth in the following drawings and description to make other features, objects, and advantages of the present application more concise and understandable.
附图说明Description of the drawings
此处所说明的附图用来提供对本申请的进一步理解,构成本申请的一部分,本申请的示意性实施例及其说明用于解释本申请,并不构成对本申请的不当限定。在附图中:The drawings described here are used to provide a further understanding of the present application and constitute a part of the present application. The illustrative embodiments of the present application and their descriptions are used to explain the present application and do not constitute an improper limitation of the present application. In the attached picture:
图1是根据本申请第一实施例的视听特征融合的目标行为识别方法的主要步骤流程图。Figure 1 is a flow chart of the main steps of a target behavior recognition method based on audio-visual feature fusion according to the first embodiment of the present application.
图2是异常行为听觉特征提取网络结构示意图。Figure 2 is a schematic diagram of the abnormal behavior auditory feature extraction network structure.
图3是基于自编码网络映射视听特征融合的异常行为识别网络结构示意图。Figure 3 is a schematic diagram of the abnormal behavior recognition network structure based on autoencoding network mapping audio-visual feature fusion.
图4是自编码网络的结构示意图。Figure 4 is a schematic structural diagram of the autoencoding network.
图5是根据本申请第二实施例的视听特征融合的目标行为识别装置的结构框图。Figure 5 is a structural block diagram of a target behavior recognition device for audio-visual feature fusion according to the second embodiment of the present application.
图6是根据本申请第三实施例的电子装置的硬件结构示意图。FIG. 6 is a schematic diagram of the hardware structure of an electronic device according to the third embodiment of the present application.
具体实施方式Detailed ways
这里将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本说明书一个或多个实施例相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本说明书一个或多个实施例的一些方面相一致的装置和方法的例子。Exemplary embodiments will be described in detail herein, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of this specification. Rather, they are merely examples of apparatus and methods consistent with some aspects of one or more embodiments of this specification as detailed in the appended claims.
需要说明的是:在其他实施例中并不一定按照本说明书示出和描述的顺序来执行相应方法的步骤。在一些其他实施例中,其方法所包括的步骤可以比本说明书所描述的更多或更少。此外,本说明书中所描述的单个步骤,在其他实施例中可能被分解为多个步骤进行描述;而本说明书中所描述的多个步骤,在其他实施例中也可能被合并为单个步骤进行描述。It should be noted that in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, methods may include more or fewer steps than described in this specification. In addition, a single step described in this specification may be broken down into multiple steps for description in other embodiments; and multiple steps described in this specification may also be combined into a single step in other embodiments. describe.
本申请实施例提供一种视听特征融合的目标行为识别方法,该方法采用所述视听特征融合目标行为识别方案对异常事件行为作出判断。首先,对本申请方法所适用的使用场景进行说明:Embodiments of the present application provide a target behavior recognition method based on audio-visual feature fusion, which uses the audio-visual feature fusion target behavior recognition scheme to make judgments on abnormal event behaviors. First, the usage scenarios applicable to this application method are explained:
First, the surveillance video sound and picture are accessed, and a fixed-length (for example, 1 second) video and audio clip is extracted as a window for behavior judgment;
the inter-frame differences of the video within that period are then computed (each pair of adjacent image frames is differenced; there are roughly 30 to 60 image frames in 1 second, so multiple results along the time axis are obtained), and the resulting difference sequence is used as the visual input information;
the sound waveform within the 1 second, sampled at 16 kHz, is then used as the auditory input information;
the visual information and the auditory information are then input into the designated algorithm network: the two different feature-extraction branches extract visual features and auditory features, temporal features are obtained through the LSTM network, a shared semantic subspace is constructed through the autoencoding network to eliminate the semantic deviation between the visual and auditory features, and finally the visual and auditory features are fused and the fused features are input into the classification network branch to obtain the abnormal-behavior classification probability;
finally, whether the behavior is abnormal is judged according to the classification probability threshold, as sketched below.
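For illustration only, the windowed judgment flow above can be outlined as follows. This is a minimal sketch under stated assumptions: the names `frame_differences`, `classify_window` and `fusion_model`, the NumPy frame layout and the 0.5 threshold are illustrative and not part of the disclosed implementation.

```python
import numpy as np

WINDOW_SECONDS = 1.0
AUDIO_RATE = 16000          # 16 kHz waveform sampling, as described above
ANOMALY_THRESHOLD = 0.5     # illustrative classification-probability threshold

def frame_differences(frames: np.ndarray) -> np.ndarray:
    """Difference of every two adjacent frames; frames has shape (T, H, W, C)."""
    frames = frames.astype(np.float32)
    return frames[1:] - frames[:-1]

def classify_window(frames: np.ndarray, waveform: np.ndarray, fusion_model) -> bool:
    """Return True if the 1-second window is judged abnormal by the (assumed) fusion model."""
    visual_input = frame_differences(frames)                        # difference sequence
    auditory_input = waveform[: int(WINDOW_SECONDS * AUDIO_RATE)]   # raw 16 kHz samples
    prob = fusion_model(visual_input, auditory_input)               # abnormal-behavior probability
    return prob >= ANOMALY_THRESHOLD
```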
图1是根据本申请第一实施例的视听特征融合的目标行为识别方法的主要步骤流程图。Figure 1 is a flow chart of the main steps of a target behavior recognition method based on audio-visual feature fusion according to the first embodiment of the present application.
为实现该目的,如图1所示,视听特征融合的目标行为识别方法主要包括如下的步骤101至步骤106。In order to achieve this goal, as shown in Figure 1, the target behavior recognition method of audio-visual feature fusion mainly includes the following steps 101 to 106.
步骤101、获取预设时长的待识别音视频段。Step 101: Obtain the audio and video segments to be identified of a preset duration.
步骤102、采集所述待识别音视频段中的视觉输入信息及听觉输入信息。Step 102: Collect visual input information and auditory input information in the audio and video segments to be recognized.
步骤103、将所述视觉输入信息及所述听觉输入信息一同输入目标行为模型中,其中所述目标行为模型包括双分支通道的特征提取网络、自编码网络及全连接层识别模块。Step 103: Input the visual input information and the auditory input information together into the target behavior model, where the target behavior model includes a dual-branch channel feature extraction network, an autoencoding network and a fully connected layer recognition module.
步骤104、根据所述特征提取网络分别从所述视觉输入信息、所述听觉输入信息中提取特征,得到视觉特征、听觉特征。Step 104: Extract features from the visual input information and the auditory input information respectively according to the feature extraction network to obtain visual features and auditory features.
步骤105、采用所述自编码网络将所述视觉特征、所述听觉特征映射到同一子空间中进行视听信息融合,得到融合特征。Step 105: Use the autoencoding network to map the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fusion features.
步骤106、将所述融合特征输入所述全连接层识别模块进行识别,得到目标行为。Step 106: Input the fused features into the fully connected layer recognition module for recognition to obtain the target behavior.
具体地,在本方案中不仅使用了视觉输入信息,还采集了听觉输入信息,二者作为行为模型的输入,并由模型输出目标行为分类结果。与仅采用图像信息判断目标行为而言,本方案利用了声音特征,而声音特征和图像特征的信息能够互补从而更准确判断出目标行为,故具有更好的特征表达效果。此外,在本方案中视觉输入信息、听觉输入信息一同输入模型中进行处理,并行处理的好处是能够在特征提取阶段就进行特征级融合从而更好地捕捉出各模态之间的关系,相比于分别对视觉和听觉特征信息识别出的结果进行融合的决策及融合方式,本方案能够考虑视觉和听觉之间的一致性,因此多模态特征信息能够实现互补,达到更好的表现效果。Specifically, in this solution, not only visual input information is used, but also auditory input information is collected. Both are used as inputs to the behavior model, and the model outputs the target behavior classification results. Compared with only using image information to judge the target behavior, this solution uses sound features, and the information of sound features and image features can complement each other to more accurately judge the target behavior, so it has better feature expression effect. In addition, in this solution, visual input information and auditory input information are input into the model for processing at the same time. The advantage of parallel processing is that feature-level fusion can be performed in the feature extraction stage to better capture the relationship between each modality. Compared with the decision-making and fusion methods of separately fusing the results of visual and auditory feature information recognition, this solution can consider the consistency between vision and hearing, so multi-modal feature information can complement each other and achieve better performance results. .
需要说明的是,在本方案中目标行为可以是正常行为或者异常行为。例如,当训练模型时输入的是视觉样本和听觉样本以及从样本中标注出的正常行为时,训练出的模型会从视觉输入信息、听觉输入信息中识别正常行为。反之,当训练模型时输入的是对异常行为标注的标注样本时,训练出的模型会从视觉输入信息、听觉输入信息中识别异常行为。举例而非限制,在本方案中以描述暴力场景的视觉样本和听觉样本以及从样本中标注出的描述暴力场景的特征作为模型的输入,使得训练出的模型能对打架斗殴等暴力行为进行识别。It should be noted that the target behavior in this scheme can be normal behavior or abnormal behavior. For example, when training the model, the input is visual samples and auditory samples and normal behaviors marked from the samples, the trained model will identify normal behaviors from the visual input information and auditory input information. On the contrary, when the input to the model is labeled samples that label abnormal behaviors, the trained model will identify abnormal behaviors from visual input information and auditory input information. By way of example, but not limitation, in this solution, visual samples and auditory samples describing violent scenes and features describing violent scenes annotated from the samples are used as the input of the model, so that the trained model can identify violent behaviors such as fights. .
Moreover, this solution uses an autoencoding network to achieve feature fusion. Unlike the prior art, the autoencoding network designed in the present invention can represent the audio-visual data on the same measure and match visual information and auditory information that represent the same semantics under that measure, that is, it achieves semantic consistency between vision and hearing.
Specifically, after the visual features and the auditory features are obtained, these features need to be fused. Therefore, the embodiment of the present invention establishes a shared semantic subspace by mapping the visual features and the auditory features into the same subspace, thereby eliminating the feature heterogeneity between the video and audio modalities, capturing the complementary information and high-level semantics between the visual modality and the sound modality, and achieving feature fusion at the semantic level. That is, in this solution, the encoder of the autoencoding network maps the visual features and the auditory features into the same subspace to obtain visual mapping features and auditory mapping features; the decoder of the autoencoding network maps the visual mapping features and the auditory mapping features into the multi-modal space to obtain the compensation features of the other modality, which serve as visual shared features and auditory shared features; and the visual shared features, the auditory shared features, the visual features and the auditory features are concatenated to obtain the fused features.
For example, as shown in formula (1), the extracted visual feature is f_visual; g() is a function that maps the visual feature f_visual into the same subspace, and inputting f_visual into g() yields the visual mapping feature; H() is a function that maps the visual mapping feature g(f_visual) into the multi-modal space of shared semantics, and inputting g(f_visual) into H() yields the compensation feature of the auditory modality, which serves as the auditory shared feature.
Similarly, as shown in formula (2), the extracted auditory feature is f_audio; h() is a function that maps the auditory feature f_audio into the same subspace, and inputting f_audio into h() yields the auditory mapping feature; G() is a function that maps the auditory mapping feature h(f_audio) into the multi-modal space of shared semantics, and inputting h(f_audio) into G() yields the compensation feature of the visual modality, which serves as the visual shared feature.
f′_audio = H(g(f_visual))   Formula (1)
f′_visual = G(h(f_audio))   Formula (2)
As shown in Figure 4, the autoencoding network includes an encoder and a decoder. The encoder includes a first fully connected layer, a second fully connected layer and an encoder layer connected in sequence; the visual features and the auditory features are jointly input into the encoder and pass through the first fully connected layer, the second fully connected layer and the encoder layer in sequence, yielding the auditory mapping features corresponding to the auditory features and the visual mapping features corresponding to the visual features.
The decoder includes two branches, each consisting of two fully connected layers. One branch takes the auditory mapping features as input, and its two fully connected layers map all auditory mapping features into the multi-modal space to obtain the visual compensation features corresponding to the auditory mapping features; the other branch takes the visual mapping features as input, and its two fully connected layers map all visual mapping features into the multi-modal space to obtain the auditory compensation features corresponding to the visual mapping features. Finally, formula (3) is used to concatenate the visual features, the visual compensation features, the auditory features and the auditory compensation features to obtain the fused features.
In this solution, each modality space receives information from its inter-modal neighbors and intra-modal neighbors while sharing its own information. When the inter-modal neighbor information obtained by any modality space can compensate for the loss of its own information, the obtained inter-modal neighbor information is used as a supplementary feature to enhance the expressive ability of the fused features.
f_fusion = CONCAT(f_visual + f′_visual + f_audio + f′_audio)   Formula (3)
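A minimal PyTorch sketch of the mapping and concatenation described by formulas (1)-(3) is given below. The layer sizes and the names `SharedSubspaceAutoencoder`, `feat_dim` and `hidden_dim` are illustrative assumptions, not the disclosed implementation; the sketch only mirrors the structure described above (a shared encoder of two fully connected layers plus an encoder layer, two decoder branches of two fully connected layers each, and CONCAT fusion).

```python
import torch
import torch.nn as nn

class SharedSubspaceAutoencoder(nn.Module):
    """Maps visual/auditory features into one subspace and produces the shared (compensation) features."""

    def __init__(self, feat_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        # Encoder: first FC layer, second FC layer, encoder layer (jointly used by both modalities).
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Decoder branch 1: auditory mapping feature -> visual compensation feature f'_visual.
        self.audio_to_visual = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),
        )
        # Decoder branch 2: visual mapping feature -> auditory compensation feature f'_audio.
        self.visual_to_audio = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, f_visual: torch.Tensor, f_audio: torch.Tensor):
        g_visual = self.encoder(f_visual)                   # visual mapping feature g(f_visual)
        h_audio = self.encoder(f_audio)                     # auditory mapping feature h(f_audio)
        f_audio_shared = self.visual_to_audio(g_visual)     # f'_audio = H(g(f_visual)), formula (1)
        f_visual_shared = self.audio_to_visual(h_audio)     # f'_visual = G(h(f_audio)), formula (2)
        # f_fusion = CONCAT(...), formula (3)
        f_fusion = torch.cat([f_visual, f_visual_shared, f_audio, f_audio_shared], dim=-1)
        return f_fusion, f_audio_shared, f_visual_shared
```

In this sketch a single encoder is shared by the two modalities, matching the joint input into the encoder described above; the dimensions would be tuned in practice.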
It should be noted that when the input consists of visual and auditory features with semantic consistency, the error of the autoencoding network includes two parts: the error of the acoustic decoder and the error of the visual decoder, and their sum is the total error. The error can be back-propagated to update the weights of the autoencoding network.
具体地,对于相同的视频,视觉和听觉信息会存在时间轴偏差,并在语义上存在不一致,因此对视觉信息融合提出了挑战。为了解决这一问题,本发明提出了一种新的标签“语义映射”,该标签用于描述同一视频的视听数据是否包含相同的语义信息。例如,含有血液、身体暴力等的视频数据被认为是视觉异常。包含 尖叫声和哭喊声的声音被认为是声学异常。音频和视频数据分开做标记,防止相互干扰。如果视频的视觉语义标签与音频语义标签相同,则认为音视频具有语义对应L corr=1。否则,不存在语义对应L corr=-1。语义标记为构建具有不同模态特征的共享子空间提供度量。 Specifically, for the same video, visual and auditory information will have timeline deviations and semantic inconsistencies, thus posing challenges to visual information fusion. In order to solve this problem, the present invention proposes a new tag "semantic mapping", which is used to describe whether the audio-visual data of the same video contains the same semantic information. For example, video data containing blood, physical violence, etc. are considered visual anomalies. Sounds that include screams and cries are considered acoustic anomalies. Audio and video data are marked separately to prevent mutual interference. If the visual semantic label of the video is the same as the audio semantic label, the audio and video are considered to have semantic correspondence L corr =1. Otherwise, there is no semantic correspondence L corr =-1. Semantic tagging provides metrics for constructing shared subspaces with different modal characteristics.
引入语义标签计算自编码网络误差包括:对输入所述自编码网络的所述视觉特征和所述听觉特征采用语义映射标签进行标记,其中,语义映射标签表征为描述相同语义内容的所述视觉输入信息和所述听觉输入信息的标记标签;当输入自编码网络的视觉特征或听觉特征存在语义映射标签时,损失函数为听觉平均误差值和视觉平均误差值的代数和;Introducing semantic labels to calculate the auto-encoding network error includes: labeling the visual features and the auditory features input to the auto-encoding network using semantic mapping labels, where the semantic mapping labels are characterized as the visual input describing the same semantic content information and the mark label of the auditory input information; when the visual feature or auditory feature input to the autoencoding network has a semantic mapping label, the loss function is the algebraic sum of the auditory average error value and the visual average error value;
当输入自编码网络的视觉特征或听觉特征不存在语义映射标签时,损失函数为1与听觉平均误差值和视觉平均误差值的代数和的差值;When there is no semantic mapping label for the visual features or auditory features input to the autoencoder network, the loss function is the difference between 1 and the algebraic sum of the auditory average error value and the visual average error value;
听觉平均误差值表征为所有听觉特征与所有听觉共享特征的绝对差值的平均值,视觉平均误差值表征为所有视觉特征与所有视觉共享特征的绝对差值的平均值。The auditory average error value is characterized as the average of the absolute differences between all auditory features and all auditory shared features, and the visual average error value is characterized as the average of the absolute differences between all visual features and all visual shared features.
其中,损失函数由下列公式得到:Among them, the loss function is obtained by the following formula:
y_autocoder = (1/N)·Σ_{i=1}^{N} |f_audio − f′_audio| + (1/N)·Σ_{i=1}^{N} |f_visual − f′_visual|,  when L_corr = 1
y_autocoder = 1 − [(1/N)·Σ_{i=1}^{N} |f_audio − f′_audio| + (1/N)·Σ_{i=1}^{N} |f_visual − f′_visual|],  when L_corr = −1
where y_autocoder is the loss function, N is the number of features, f_audio is an auditory feature, f′_audio is an auditory shared feature, f_visual is a visual feature, f′_visual is a visual shared feature, L_corr = 1 indicates that a semantic mapping label exists, and L_corr = −1 indicates that no semantic mapping label exists.
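Under the same assumed names as the autoencoder sketch, the semantic-mapping loss above can be written as a short function; the mean-absolute-error form and the use of `l_corr` follow the textual definition, while the batching details are illustrative.

```python
import torch

def autocoder_loss(f_audio, f_audio_shared, f_visual, f_visual_shared, l_corr: int):
    """Loss of the autoencoding network with semantic mapping labels.

    l_corr = 1  -> the pair carries a semantic mapping label;
    l_corr = -1 -> the pair carries no semantic mapping label.
    """
    audio_average_error = torch.mean(torch.abs(f_audio - f_audio_shared))
    visual_average_error = torch.mean(torch.abs(f_visual - f_visual_shared))
    total = audio_average_error + visual_average_error
    return total if l_corr == 1 else 1.0 - total
```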
故本方案设计了一种新的损耗函数,能够让模型学习到时间轴上语义映射的偏差信息。本方案通过将语义标签引入损耗函数的计算中,减少了盲拼接特征的干扰,增强了模型对异常视频语义对应的判别能力,更有利于消除非对应特征之间的干扰。此外,这样的语义嵌入学习可以看作是正则化的一种形式,有助于增强模型的泛化能力,防止过拟合。Therefore, this solution designs a new loss function that allows the model to learn the bias information of semantic mapping on the timeline. By introducing semantic labels into the calculation of the loss function, this solution reduces the interference of blind splicing features, enhances the model's ability to distinguish abnormal video semantic correspondence, and is more conducive to eliminating interference between non-corresponding features. In addition, such semantic embedding learning can be regarded as a form of regularization, which helps to enhance the generalization ability of the model and prevent overfitting.
Specifically, the semantic mapping label is obtained as follows: the acoustic anomaly information of the auditory input information and the visual anomaly information of the visual input information are semantically tagged separately; if it is determined that both the auditory input information and the visual anomaly information carry the semantic tag, the semantic mapping label is assigned to that group of auditory input information and visual anomaly information.
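For illustration only, the assignment of L_corr from the two independently assigned semantic tags could look like the following; the tag names are hypothetical.

```python
def semantic_mapping_label(audio_tag: str, visual_tag: str) -> int:
    """Return L_corr = 1 when the separately assigned semantic tags agree, otherwise -1."""
    return 1 if audio_tag == visual_tag else -1

# e.g. a clip tagged "anomaly" visually (physical violence) and "anomaly" acoustically (screams)
# receives the semantic mapping label: semantic_mapping_label("anomaly", "anomaly") == 1
```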
It is also worth mentioning that, in order for the visual features and the auditory features to express the same semantics in this solution, the visual information and the auditory information need to be represented as data on the same measure, so time is used as the unified measure. Specifically, this solution maps the original audio waveform to a two-dimensional field, that is, the x-axis of the sound data is time and the y-axis is the waveform. This solution uses the surveillance video frames as the original video data, and the original video data contains roughly 30 to 60 image frames per second, that is, the x-axis of the visual data is also time and the y-axis is the image frame.
当视觉数据和听觉数据在同一时间度量下,则可以根据语义是否相同在时间轴上将二者对齐,实现视觉特征和听觉特征的对应,从而能够在语义上达成视觉和听觉的一致性。When visual data and auditory data are measured at the same time, the two can be aligned on the time axis based on whether the semantics are the same to achieve correspondence between visual features and auditory features, thereby achieving semantic consistency between vision and hearing.
In addition, this solution also processes the information of the original audio-video segment to be recognized. Specifically, the present invention collects the difference between every two adjacent image frames from the audio-video segment to be recognized to obtain a difference sequence, and uses the difference sequence as the visual input information.
That is, this embodiment fully considers that the objects of abnormal behavior recognition are behaviors with intense motion, such as throwing a punch. A fist resting on the opponent's chest or at one's own waist does not by itself accurately indicate the behavior of the person in the video; however, if the fist is still at the waist in the first few frames and is on the opponent's chest in the following frames, the person in the video has performed the abnormal behavior of throwing a punch. It can be seen that the inter-frame differences of the video extract the required information more accurately than the video frames themselves, so choosing the differences between adjacent video frames as the input of the network model gives a better feature-expression effect than inputting the video frames themselves into the model.
In this solution the original audio waveform is used as the auditory information to be collected, so "collecting the auditory input information in the audio-video segment to be recognized" includes: obtaining the original audio waveform corresponding to the audio-video segment to be recognized, and collecting acoustic signals from the original audio waveform at a preset sampling interval to obtain the auditory input information. Compared with spectrum-analysis methods such as MFCC or LPC, the sound features extracted by this solution are based on the original audio waveform and can therefore be expressed on a unified measure with the visual features, that is, the auditory input information is represented as waveform data with time as the horizontal axis and the acoustic signal as the vertical axis, where the auditory input information and the visual input information use the time domain as a unified scale.
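Complementing the frame-difference sketch given earlier, the audio side of the input preparation can be illustrated as below. The function name, the 16 kHz default and the nearest-sample decimation are assumptions for illustration; a production pipeline would more likely use band-limited resampling.

```python
import numpy as np

def auditory_input_from_waveform(waveform: np.ndarray, orig_rate: int,
                                 sample_rate: int = 16000) -> np.ndarray:
    """Auditory input: raw waveform re-sampled on a fixed time grid (no MFCC/LPC features)."""
    duration = len(waveform) / orig_rate
    n_samples = int(duration * sample_rate)
    # Simple nearest-sample pick at the preset interval; keeps the signal in the time domain,
    # so it shares the time axis with the visual frame-difference sequence.
    idx = np.minimum(np.arange(n_samples) * orig_rate // sample_rate, len(waveform) - 1)
    return waveform[idx]
```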
在本方案中提取视觉和听觉特征的特征提取网络采用的是双分支通道,即视觉输入信息、听觉输入信息能够同时输入特征提取网络中,并分别进行特征提取,输出视觉特征、听觉特征。具体地,所述双分支通道的特征提取网络包括听觉特征提取网络及视觉特征提取网络,其中,所述听觉特征提取网络包括AFEN模块及LSTM模块,将原始音频波形中的多帧波形输入AFEN模块中,获取对应的多个听觉帧级特征,通过LSTM模块对多个所述听觉帧级特征进行融合,输出听觉段级特征。In this solution, the feature extraction network for extracting visual and auditory features uses a dual-branch channel, that is, visual input information and auditory input information can be input into the feature extraction network at the same time, and feature extraction is performed separately to output visual features and auditory features. Specifically, the dual-branch channel feature extraction network includes an auditory feature extraction network and a visual feature extraction network, wherein the auditory feature extraction network includes an AFEN module and an LSTM module, and the multi-frame waveforms in the original audio waveform are input into the AFEN module , obtain multiple corresponding auditory frame-level features, fuse the multiple auditory frame-level features through the LSTM module, and output auditory segment-level features.
如图2所示,异常行为听觉特征提取网络结构AFEN结构包括5个卷积层、3个池化层和3个依次连接的全连接层,最后一个全连接层的输出经过SoftMax层。其中池化层分别连接于第一个卷积层、第二个卷积层及第五个卷积层之后。As shown in Figure 2, the abnormal behavior auditory feature extraction network structure AFEN structure includes 5 convolutional layers, 3 pooling layers and 3 sequentially connected fully connected layers. The output of the last fully connected layer passes through the SoftMax layer. The pooling layer is connected after the first convolution layer, the second convolution layer and the fifth convolution layer respectively.
In this solution, the black rectangles represent the convolutional layers and the white rectangles represent the pooling layers; the three rectangles immediately following the last pooling layer represent the three fully connected layers, and the black rectangle after the fully connected layers represents the LSTM structure. Because abnormal behavior is continuous along the time axis, the LSTM network is selected to process the temporal relationship between audio frame-level features and obtain segment-level features. The convolutional layers contain the ReLU activation function, which makes the activation pattern of the network sparser. The pooling layers contain a local response normalization operation to avoid gradient vanishing and improve the training speed of the network. As can be seen in Figure 2, after a segment of acoustic signal passes through the auditory feature extraction network, multiple frame-level features are first extracted, the multiple frame-level features are then fused based on their temporal relationship, and finally the segment-level feature corresponding to that segment of acoustic signal is obtained.
在本发明的模型中,在视觉和听觉特征处理的最后阶段,通过LSTM网络总结时序信息,这种方法可以适应整个监控视频机制,在音频和视频的长度、采样率等方面没有硬性要求。从而解决了特征时间轴对准问题。另一方面,该模型也大大降低了视觉和听觉特征融合的复杂度,提高了模型的稳定性。In the model of the present invention, in the final stage of visual and auditory feature processing, the timing information is summarized through the LSTM network. This method can be adapted to the entire surveillance video mechanism and has no rigid requirements in terms of audio and video length, sampling rate, etc. This solves the problem of feature timeline alignment. On the other hand, this model also greatly reduces the complexity of visual and auditory feature fusion and improves the stability of the model.
Similarly, in this solution the network structure for visual feature extraction of abnormal behavior is the same as the AFEN convolutional structure shown in Figure 2, except that a convolutional LSTM (ConvLSTM) module replaces the final LSTM module and the original input signal is changed to the differences between image frames. The difference is that, compared with auditory feature extraction, visual features place more emphasis on action recognition over spatio-temporal relationships; that is, ConvLSTM captures spatio-temporal relationships better than LSTM and can solve spatio-temporal sequence prediction problems such as video classification and action recognition.
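A condensed PyTorch sketch of the dual-branch idea follows. The channel counts, kernel sizes and the 2-D layout assumed for each waveform frame are not specified in the text and are assumptions; only the layer pattern (5 convolutional layers, pooling with local response normalization after conv 1, 2 and 5, 3 fully connected layers, then an LSTM over the frame-level features) mirrors the description.

```python
import torch
import torch.nn as nn

class AFEN(nn.Module):
    """Frame-level extractor: 5 conv layers, pooling + LRN after conv 1, 2 and 5, 3 FC layers."""

    def __init__(self, in_channels: int = 1, feat_dim: int = 512):
        super().__init__()
        def conv(cin, cout, pool):
            layers = [nn.Conv2d(cin, cout, kernel_size=3, padding=1), nn.ReLU()]
            if pool:
                layers += [nn.MaxPool2d(2), nn.LocalResponseNorm(size=5)]
            return layers
        self.features = nn.Sequential(
            *conv(in_channels, 64, pool=True),     # conv1 + pool/LRN
            *conv(64, 128, pool=True),             # conv2 + pool/LRN
            *conv(128, 256, pool=False),           # conv3
            *conv(256, 256, pool=False),           # conv4
            *conv(256, 256, pool=True),            # conv5 + pool/LRN
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.fc = nn.Sequential(                   # 3 fully connected layers
            nn.Linear(256 * 4 * 4, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, feat_dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, T, C, H, W) -> frame-level features of shape (batch, T, feat_dim)
        b, t = frames.shape[:2]
        x = self.features(frames.flatten(0, 1))
        x = self.fc(x.flatten(1))
        return x.view(b, t, -1)

class AudioBranch(nn.Module):
    """AFEN frame-level features fused over time by an LSTM into a segment-level feature."""

    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.afen = AFEN(in_channels=1, feat_dim=feat_dim)
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)

    def forward(self, audio_frames: torch.Tensor) -> torch.Tensor:
        frame_feats = self.afen(audio_frames)      # (batch, T, feat_dim)
        _, (h_n, _) = self.lstm(frame_feats)
        return h_n[-1]                             # segment-level auditory feature
```

The visual branch would reuse the same AFEN on the frame-difference sequence and replace the `nn.LSTM` with a ConvLSTM cell; since core PyTorch does not ship a ConvLSTM, that cell is left out of this sketch.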
In summary, the present invention designs an abnormal behavior recognition model based on audio-visual information fusion through an autoencoding network. The model structure is shown in Figure 3. The model includes four parts: visual feature extraction, auditory feature extraction, the autoencoding network and the fully connected recognition model. Visual and acoustic feature extraction uses a dual-channel feature extraction method whose network structure is based on a deep convolutional network. For visual features, the differences between video frames are used as the raw input, and deep convolution plus a ConvLSTM network extracts segment-level visual features. For auditory features, the audio waveform is used as the network input, and deep convolution plus an LSTM network extracts segment-level auditory features. Then, the autoencoding network shown in Figure 4 is used to construct the shared semantic subspace, eliminating the semantic deviation between visual and auditory features, and the CONCAT method is used to combine the visual and auditory features; finally, the fully connected model is used to recognize abnormal behavior. The target behavior recognition method of this solution therefore fuses auditory features into visual features to supplement the visual information, and fuses visual features into auditory features to supplement the auditory features, on the basis of the shared semantic subspace of the autoencoding network, thereby achieving the complementary effect of the different modalities. The fused features obtained through feature fusion thus have richer semantic expression, and the classification results obtained by the model when classifying behaviors with the fused features are more accurate. On this basis, this solution improves recognition accuracy and reduces the missed detection rate.
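Tying the pieces together, a hypothetical end-to-end forward pass might look like the sketch below, reusing the previously sketched `AudioBranch`, a corresponding visual branch and `SharedSubspaceAutoencoder`; all module names, the classifier width and the two-class output are assumptions rather than the disclosed implementation.

```python
import torch
import torch.nn as nn

class AbnormalBehaviorModel(nn.Module):
    """Visual branch + auditory branch -> shared-subspace autoencoder -> fully connected classifier."""

    def __init__(self, visual_branch: nn.Module, audio_branch: nn.Module,
                 autoencoder: nn.Module, feat_dim: int = 512, num_classes: int = 2):
        super().__init__()
        self.visual_branch = visual_branch       # ConvLSTM-based segment-level visual features
        self.audio_branch = audio_branch         # LSTM-based segment-level auditory features
        self.autoencoder = autoencoder           # shared semantic subspace + CONCAT fusion
        self.classifier = nn.Sequential(         # fully connected recognition module
            nn.Linear(4 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, frame_diffs: torch.Tensor, audio_frames: torch.Tensor) -> torch.Tensor:
        f_visual = self.visual_branch(frame_diffs)
        f_audio = self.audio_branch(audio_frames)
        f_fusion, _, _ = self.autoencoder(f_visual, f_audio)
        return self.classifier(f_fusion).softmax(dim=-1)    # behavior class probabilities
```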
另外,如图5所示,本方案提供一种视听特征融合的目标行为识别装置,该装置利用上述视听特征融合的目标行为识别方法识别出目标行为,该装置包括:In addition, as shown in Figure 5, this solution provides a target behavior recognition device with audio-visual feature fusion. The device uses the above target behavior recognition method with audio-visual feature fusion to identify the target behavior. The device includes:
获取模块501,用于获取预设时长的待识别音视频段。The acquisition module 501 is used to acquire the audio and video segments to be recognized of a preset duration.
信息采集模块502,用于采集所述待识别音视频段中的视觉输入信息及听觉输入信息。The information collection module 502 is used to collect visual input information and auditory input information in the audio and video segments to be recognized.
特征提取模块503,用于将所述视觉输入信息及所述听觉输入信息一同输入目标行为模型中,其中所述目标行为模型包括双分支通道的特征提取网络、自编码网络及全连接层识别模块。 Feature extraction module 503 is used to input the visual input information and the auditory input information into the target behavior model together, wherein the target behavior model includes a dual-branch channel feature extraction network, an autoencoding network and a fully connected layer recognition module. .
根据所述特征提取网络分别从所述视觉输入信息、所述听觉输入信息中提取特征,得到视觉特征、听觉特征。According to the feature extraction network, features are respectively extracted from the visual input information and the auditory input information to obtain visual features and auditory features.
采用所述自编码网络将所述视觉特征、所述听觉特征映射到同一子空间中进行视听信息融合,得到融合特征。The autoencoding network is used to map the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fusion features.
行为识别模块504,用于将所述融合特征输入所述全连接层识别模块进行识别,得到目标行为。The behavior recognition module 504 is used to input the fused features into the fully connected layer recognition module for recognition to obtain the target behavior.
如图6所示,本申请一个实施例的电子装置,包括存储器604和处理器602,该存储器604中存储有计算机程序,该处理器602被设置为运行计算机程序以执行上述任一项方法实施例中的步骤。As shown in Figure 6, an electronic device according to one embodiment of the present application includes a memory 604 and a processor 602. The memory 604 stores a computer program, and the processor 602 is configured to run the computer program to perform any of the above methods. The steps in the example.
具体地,上述处理器602可以包括中央处理器(CPU),或者特定集成电路(ApplicationSpecificIntegratedCircuit,简称为ASIC),或者可以被配置成实施本申请实施例的一个或多个集成电路。Specifically, the above-mentioned processor 602 may include a central processing unit (CPU), or an application specific integrated circuit (Application Specific Integrated Circuit, referred to as ASIC), or may be configured to implement one or more integrated circuits according to the embodiments of the present application.
其中,存储器604可以包括用于数据或指令的大容量存储器604。举例来说而非限制,存储器604可包括硬盘驱动器(HardDiskDrive,简称为HDD)、软盘驱动器、固态驱动器(SolidStateDrive,简称为SSD)、闪存、光盘、磁光盘、磁带或通用串行总线(UniversalSerialBus,简称为USB)驱动器或者两个或更多个以上这些的组合。在合适的情况下,存储器604可包括可移除或不可移除(或固定)的介质。在合适的情况下,存储器604可在数据处理装置的内部或外部。在特定实施例中,存储器604是非易失性(Non-Volatile)存储器。在特定实施例中,存储器604包括只读存储器(Read-OnlyMemory,简称为ROM)和随机存取存储器(RandomAccessMemory,简称为RAM)。在合适的情况下,该ROM可以是掩模编程的ROM、可编程ROM(ProgrammableRead-OnlyMemory,简称为PROM)、可擦除PROM(ErasableProgrammableRead-OnlyMemory,简称为EPROM)、电可擦除PROM(ElectricallyErasableProgrammableRead-OnlyMemory,简称为EEPROM)、电可改写ROM(ElectricallyAlterableRead-OnlyMemory,简称为EAROM)或闪存(FLASH)或者两个或更多个以上这些的组合。在合适的情况下,该RAM可以是静态随机存取存储器(StaticRandom-AccessMemory,简称为SRAM)或动态随机存取存储器(DynamicRandomAccessMemory,简称为DRAM),其中,DRAM可以是快速页模式动态随机存取存储器604(FastPageModeDynamicRandomAccessMemory,简称为FPMDRAM)、扩展数据输出动态随机存取存储器(ExtendedDateOutDynamicRandomAccessMemory,简称为EDODRAM)、同步动态随机存取内存(SynchronousDynamicRandom-AccessMemory,简称SDRAM)等。Among others, memory 604 may include mass storage 604 for data or instructions. By way of example and not limitation, the memory 604 may include a hard disk drive (Hard Disk Drive, HDD for short), floppy disk drive, Solid State Drive (Solid State Drive, SSD for short), flash memory, optical disk, magneto-optical disk, magnetic tape, or Universal Serial Bus (Universal Serial Bus, Referred to as USB) drive or a combination of two or more of these. Memory 604 may include removable or non-removable (or fixed) media, where appropriate. Memory 604 may be internal or external to the data processing device, where appropriate. In certain embodiments, memory 604 is Non-Volatile memory. In a specific embodiment, the memory 604 includes read-only memory (Read-OnlyMemory, ROM for short) and random access memory (RandomAccessMemory, RAM for short). Under appropriate circumstances, the ROM can be a mask-programmed ROM, programmable ROM (ProgrammableRead-OnlyMemory, referred to as PROM), erasable PROM (ErasableProgrammableRead-OnlyMemory, referred to as EPROM), electrically erasable PROM (Electrically ErasableProgrammableRead -OnlyMemory, referred to as EEPROM), electrically rewritable ROM (Electrically Alterable Read-OnlyMemory, referred to as EAROM) or flash memory (FLASH) or a combination of two or more of these. Under appropriate circumstances, the RAM can be static random access memory (StaticRandom-AccessMemory, referred to as SRAM) or dynamic random access memory (DynamicRandomAccessMemory, referred to as DRAM), wherein the DRAM can be fast page mode dynamic random access Memory 604 (FastPageModeDynamicRandomAccessMemory, referred to as FPMDRAM), extended data output dynamic random access memory (ExtendedDateOutDynamicRandomAccessMemory, referred to as EDODRAM), synchronous dynamic random access memory (SynchronousDynamicRandom-AccessMemory, referred to as SDRAM), etc.
存储器604可以用来存储或者缓存需要处理和/或通信使用的各种数据文件,以及处理器602所执行的可能的计算机程序指令。 Memory 604 may be used to store or cache various data files required for processing and/or communication, as well as possibly computer program instructions executed by processor 602.
处理器602通过读取并执行存储器604中存储的计算机程序指令,以实现上述实施例中的任意一种视听特征融合的目标行为识别方法。The processor 602 reads and executes the computer program instructions stored in the memory 604 to implement any of the audio-visual feature fusion target behavior recognition methods in the above embodiments.
可选地,上述电子装置还可以包括传输设备606以及输入输出设备608,其 中,该传输设备606和上述处理器602连接,该输入输出设备608和上述处理器602连接。Optionally, the above-mentioned electronic device may also include a transmission device 606 and an input-output device 608, wherein the transmission device 606 is connected to the above-mentioned processor 602, and the input-output device 608 is connected to the above-mentioned processor 602.
传输设备606可以用来经由一个网络接收或者发送数据。上述的网络具体实例可包括电子装置的通信供应商提供的有线或无线网络。在一个实例中,传输设备包括一个网络适配器(Network Interface Controller,简称为NIC),其可通过基站与其他网络设备相连从而可与互联网进行通讯。在一个实例中,传输设备606可以为射频(Radio Frequency,简称为RF)模块,其用于通过无线方式与互联网进行通讯。 Transmission device 606 may be used to receive or send data via a network. Specific examples of the above-mentioned network may include a wired or wireless network provided by a communication provider of the electronic device. In one example, the transmission device includes a network adapter (Network Interface Controller, NIC for short), which can be connected to other network devices through a base station to communicate with the Internet. In one example, the transmission device 606 may be a radio frequency (Radio Frequency, RF for short) module, which is used to communicate with the Internet wirelessly.
输入输出设备608用于输入或输出信息。在本实施例中,输入的信息可以是待识别音视频段等,输出的信息可以识别出的目标行为等。Input and output devices 608 are used to input or output information. In this embodiment, the input information may be audio and video segments to be recognized, etc., and the output information may be the target behavior to be recognized, etc.
可选地,在本实施例中,上述处理器602可以被设置为通过计算机程序执行以下步骤:Optionally, in this embodiment, the above-mentioned processor 602 can be configured to perform the following steps through a computer program:
S101、获取预设时长的待识别音视频段。S101. Obtain the audio and video segments to be recognized with a preset duration.
S102、采集所述待识别音视频段中的视觉输入信息及听觉输入信息。S102. Collect visual input information and auditory input information in the audio and video segments to be recognized.
S103、将所述视觉输入信息及所述听觉输入信息一同输入目标行为模型中,其中所述目标行为模型包括双分支通道的特征提取网络、自编码网络及全连接层识别模块。S103. Input the visual input information and the auditory input information into the target behavior model together, where the target behavior model includes a dual-branch channel feature extraction network, an autoencoding network and a fully connected layer recognition module.
S104、根据所述特征提取网络分别从所述视觉输入信息、所述听觉输入信息中提取特征,得到视觉特征、听觉特征。S104. Extract features from the visual input information and the auditory input information respectively according to the feature extraction network to obtain visual features and auditory features.
S105、采用所述自编码网络将所述视觉特征、所述听觉特征映射到同一子空间中进行视听信息融合,得到融合特征。S105. Use the autoencoding network to map the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fusion features.
S106、将所述融合特征输入所述全连接层识别模块进行识别,得到目标行为。S106. Input the fusion feature into the fully connected layer recognition module for recognition to obtain the target behavior.
需要说明的是,本实施例中的具体示例可以参考上述实施例及可选实施方式中所描述的示例,本实施例在此不再赘述。It should be noted that for specific examples in this embodiment, reference may be made to the examples described in the above-mentioned embodiments and optional implementations, and the details of this embodiment will not be repeated here.
通常,各种实施例可以以硬件或专用电路、软件、逻辑或其任何组合来实现。本发明的一些方面可以以硬件来实现,而其他方面可以以可以由控制器、微处理器或其他计算设备执行的固件或软件来实现,但是本发明不限于此。尽管本发明的各个方面可以被示出和描述为框图、流程图或使用一些其他图形表示,但是应当理解,作为非限制性示例,本文中描述的这些框、装置、系统、技术或方法可以以硬件、软件、固件、专用电路或逻辑、通用硬件或控制器或其他计算设备或其某种组合来实现。Generally, various embodiments may be implemented in hardware or special purpose circuitry, software, logic, or any combination thereof. Some aspects of the invention may be implemented in hardware, while other aspects may be implemented in firmware or software that may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. Although various aspects of the invention may be shown and described as block diagrams, flow diagrams, or using some other graphical representation, it is to be understood that, by way of non-limiting example, the blocks, devices, systems, techniques, or methods described herein may be Hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controllers or other computing devices, or some combination thereof.
本发明的实施例可以由计算机软件来实现,该计算机软件由移动设备的数据处理器诸如在处理器实体中可执行,或者由硬件来实现,或者由软件和硬件的组合来实现。包括软件例程、小程序和/或宏的计算机软件或程序(也称为程序产品)可以存储在任何装置可读数据存储介质中,并且它们包括用于执行特定任务的程序指令。计算机程序产品可以包括当程序运行时被配置为执行实施例的一个或多 个计算机可执行组件。一个或多个计算机可执行组件可以是至少一个软件代码或其一部分。另外,在这一点上,应当注意,如图中的逻辑流程的任何框可以表示程序步骤、或者互连的逻辑电路、框和功能、或者程序步骤和逻辑电路、框和功能的组合。软件可以存储在诸如存储器芯片或在处理器内实现的存储块等物理介质、诸如硬盘或软盘等磁性介质、以及诸如例如DVD及其数据变体、CD等光学介质上。物理介质是非瞬态介质。Embodiments of the invention may be implemented by computer software executable by a data processor of the mobile device, such as in a processor entity, or by hardware, or by a combination of software and hardware. Computer software or programs (also referred to as program products), including software routines, applets, and/or macros, may be stored on any device-readable data storage medium, and they include program instructions for performing specific tasks. A computer program product may include one or more computer-executable components that are configured to perform embodiments when the program is executed. One or more computer-executable components may be at least one software code or a portion thereof. Additionally, at this point, it should be noted that any block of the logic flow in the figures may represent program steps, or interconnected logic circuits, blocks, and functions, or a combination of program steps and logic circuits, blocks, and functions. Software may be stored on physical media such as memory chips or memory blocks implemented within a processor, magnetic media such as hard or floppy disks, and optical media such as, for example, DVD and its data variants, CDs. Physical media are non-transient media.
本领域的技术人员应该明白,以上实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。Those skilled in the art should understand that the technical features of the above embodiments can be combined in any way. To simplify the description, not all possible combinations of the technical features in the above embodiments are described. However, as long as these technical features There is no contradiction in the combinations, and they should be considered to be within the scope of this manual.
以上实施例仅表达了本申请的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对本申请范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请的保护范围应以所附权利要求为准。The above embodiments only express several implementation modes of the present application, and their descriptions are relatively specific and detailed, but should not be construed as limiting the scope of the present application. It should be noted that, for those of ordinary skill in the art, several modifications and improvements can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the scope of protection of this application should be determined by the appended claims.

Claims (11)

  1. 一种视听特征融合的目标行为识别方法,其特征在于,包括以下步骤:A target behavior recognition method based on audio-visual feature fusion, which is characterized by including the following steps:
    获取预设时长的待识别音视频段;Obtain the audio and video segments to be recognized with a preset duration;
    采集所述待识别音视频段中的视觉输入信息及听觉输入信息;Collect visual input information and auditory input information in the audio and video segments to be recognized;
    将所述视觉输入信息及所述听觉输入信息一同输入目标行为模型中,其中所述目标行为模型包括双分支通道的特征提取网络、自编码网络及全连接层识别模块;The visual input information and the auditory input information are input into the target behavior model together, wherein the target behavior model includes a dual-branch channel feature extraction network, an autoencoding network and a fully connected layer recognition module;
    根据所述特征提取网络分别从所述视觉输入信息、所述听觉输入信息中提取特征,得到视觉特征、听觉特征;According to the feature extraction network, features are extracted from the visual input information and the auditory input information respectively to obtain visual features and auditory features;
    the encoder of the autoencoding network maps the visual features and the auditory features into the same subspace to obtain auditory mapping features corresponding to the auditory features and visual mapping features corresponding to the visual features; the decoder of the autoencoding network maps all the visual mapping features and all the auditory mapping features into a multi-modal space, and each modality obtains the visual compensation features of the other modality space as visual shared features and the auditory compensation features of the other modality as auditory shared features; the visual shared features, the auditory shared features, the visual features and the auditory features are concatenated to obtain fused features;
    wherein the autoencoding network includes an encoder and a decoder, the encoder including a first fully connected layer, a second fully connected layer and an encoder layer connected in sequence; the visual features and the auditory features are jointly input into the encoder and pass through the first fully connected layer, the second fully connected layer and the encoder layer in sequence to output the auditory mapping features corresponding to the auditory features and the visual mapping features corresponding to the visual features;
    wherein the decoder includes two branches, each branch consisting of two fully connected layers; one branch takes the auditory mapping features as input, and its two fully connected layers map all the auditory mapping features into the multi-modal space to obtain the visual compensation features corresponding to the auditory mapping features; the other branch takes the visual mapping features as input, and its two fully connected layers map all the visual mapping features into the multi-modal space to obtain the auditory compensation features corresponding to the visual mapping features;
    将所述融合特征输入所述全连接层识别模块进行识别,得到目标行为。The fused features are input into the fully connected layer recognition module for recognition, and the target behavior is obtained.
  2. The target behavior recognition method based on audio-visual feature fusion according to claim 1, wherein the visual features and the auditory features input to the autoencoding network are marked with semantic mapping labels, where a semantic mapping label marks the visual input information and the auditory input information that describe the same semantic content;
    当输入自编码网络的视觉特征或听觉特征存在语义映射标签时,损失函数为听觉平均误差值和视觉平均误差值的代数和;When the visual features or auditory features input to the autoencoder network have semantic mapping labels, the loss function is the algebraic sum of the auditory average error value and the visual average error value;
    当输入自编码网络的视觉特征或听觉特征不存在语义映射标签时,损失函数为1与听觉平均误差值和视觉平均误差值的代数和的差值;When there is no semantic mapping label for the visual features or auditory features input to the autoencoder network, the loss function is the difference between 1 and the algebraic sum of the auditory average error value and the visual average error value;
    听觉平均误差值表征为所有听觉特征与所有听觉共享特征的绝对差值的平均值,视觉平均误差值表征为所有视觉特征与所有视觉共享特征的绝对差值的平均值;The auditory average error value is characterized as the average of the absolute differences between all auditory features and all auditory shared features, and the visual average error value is characterized as the average of the absolute differences between all visual features and all visual shared features;
    其中,损失函数由下列公式得到:Among them, the loss function is obtained by the following formula:
    y_autocoder = (1/N)·Σ_{i=1}^{N} |f_audio − f′_audio| + (1/N)·Σ_{i=1}^{N} |f_visual − f′_visual|,  when L_corr = 1
    y_autocoder = 1 − [(1/N)·Σ_{i=1}^{N} |f_audio − f′_audio| + (1/N)·Σ_{i=1}^{N} |f_visual − f′_visual|],  when L_corr = −1
    where y_autocoder is the loss function, N is the number of features, f_audio is an auditory feature, f′_audio is an auditory shared feature, f_visual is a visual feature, f′_visual is a visual shared feature, L_corr = 1 indicates that a semantic mapping label exists, and L_corr = −1 indicates that no semantic mapping label exists.
  3. The target behavior recognition method based on audio-visual feature fusion according to claim 2, wherein "marking the visual features and the auditory features input to the autoencoding network with semantic mapping labels" includes: semantically tagging the acoustic anomaly information of the auditory input information and the visual anomaly information of the visual input information separately, and, if it is determined that both the auditory input information and the visual anomaly information carry the semantic tag, assigning the semantic mapping label to the auditory input information and the visual anomaly information.
  4. 根据权利要求1所述的视听特征融合的目标行为识别方法,其特征在于,“采集所述待识别音视频段中的视觉输入信息”包括:The target behavior recognition method of audio-visual feature fusion according to claim 1, characterized in that "collecting the visual input information in the audio and video segments to be recognized" includes:
    从所述待识别音视频段中采集每相邻两帧图像帧的差值,得到差值序列,将所述差值序列作为视觉输入信息。The difference between every two adjacent image frames is collected from the audio and video segment to be recognized to obtain a difference sequence, and the difference sequence is used as visual input information.
  5. The target behavior recognition method based on audio-visual feature fusion according to claim 1, wherein "collecting the auditory input information in the audio-video segment to be recognized" comprises:
    obtaining the original audio waveform corresponding to the audio-video segment to be recognized, and sampling acoustic signals from the original audio waveform at a preset sampling interval to obtain the auditory input information.
  6. The target behavior recognition method based on audio-visual feature fusion according to claim 5, wherein the auditory input information is represented as waveform data with time as the horizontal axis and the acoustic signal as the vertical axis, and the auditory input information and the visual input information use the time domain as a unified scale.
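    A minimal sketch of collecting such time-domain auditory input follows; librosa, the 16 kHz rate and the 10 ms interval are assumptions, not values taken from the claims.

```python
import numpy as np
import librosa

def sample_waveform(audio_path, sample_rate=16000, interval_s=0.01):
    """Sample the raw audio waveform at a preset interval and return
    (time, amplitude) pairs: time on the horizontal axis, signal on the vertical axis."""
    waveform, sr = librosa.load(audio_path, sr=sample_rate, mono=True)
    step = max(1, int(interval_s * sr))           # preset sampling interval in samples
    samples = waveform[::step]
    times = np.arange(len(samples)) * step / sr   # shared time-domain scale with the video frames
    return times, samples
```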
  7. The target behavior recognition method based on audio-visual feature fusion according to claim 1, wherein the dual-branch channel feature extraction network comprises an auditory feature extraction network and a visual feature extraction network,
    wherein the auditory feature extraction network comprises an AFEN module and an LSTM module; multiple frame waveforms of the original audio waveform are input into the AFEN module to obtain the corresponding multiple auditory frame-level features, and the multiple auditory frame-level features are fused by the LSTM module to output auditory segment-level features,
    the AFEN network comprises 5 convolutional layers, 3 pooling layers and 3 sequentially connected fully connected layers, wherein the pooling layers follow the first, the second and the fifth convolutional layer respectively, each convolutional layer includes a ReLU activation function that makes the activation pattern of the AFEN module sparser, and each pooling layer includes a local response normalization operation to avoid vanishing gradients.
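    The sketch below follows the layer counts stated in claim 7 (5 convolutional layers with ReLU, pooling with local response normalization after the first, second and fifth convolutional layers, 3 fully connected layers, then an LSTM over frame-level features). The channel and kernel sizes, the 2-D (spectrogram-like) per-frame input and the PyTorch framework are assumptions, since the claim does not specify whether AFEN convolves over the raw 1-D waveform or a 2-D time-frequency representation.

```python
import torch
import torch.nn as nn

class AFEN(nn.Module):
    """Rough sketch of the AFEN module: 5 conv layers (each with ReLU),
    pooling with local response normalization after conv1, conv2 and conv5,
    followed by 3 fully connected layers."""
    def __init__(self, feat_dim=256):
        super().__init__()
        def pool():  # pooling stage with local response normalization
            return nn.Sequential(nn.MaxPool2d(3, stride=2), nn.LocalResponseNorm(5))
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 7, stride=2, padding=3), nn.ReLU(), pool(),   # conv1 + pool1
            nn.Conv2d(64, 128, 5, padding=2), nn.ReLU(), pool(),           # conv2 + pool2
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),                  # conv3
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),                  # conv4
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(), pool(),          # conv5 + pool3
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),
            nn.Linear(256 * 16, 1024), nn.ReLU(),                          # fc1
            nn.Linear(1024, 512), nn.ReLU(),                               # fc2
            nn.Linear(512, feat_dim),                                      # fc3
        )

    def forward(self, x):           # x: (batch, 1, H, W) per-frame input
        return self.features(x)

class AuditoryBranch(nn.Module):
    """AFEN per frame, then an LSTM fuses frame-level features into a segment-level feature."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.afen = AFEN(feat_dim)
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)

    def forward(self, frames):      # frames: (batch, T, 1, H, W)
        b, t = frames.shape[:2]
        frame_feats = self.afen(frames.flatten(0, 1)).view(b, t, -1)
        _, (h_n, _) = self.lstm(frame_feats)
        return h_n[-1]              # auditory segment-level feature
```

    Per claim 8, the visual branch would reuse the same AFEN module but replace the LSTM with a ConvLSTM to fuse visual frame-level features; ConvLSTM is not a built-in PyTorch module and would need a separate implementation.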
  8. The target behavior recognition method based on audio-visual feature fusion according to claim 7, wherein the visual feature extraction network shares the AFEN module with the auditory feature extraction network, and in the visual feature extraction network a ConvLSTM module replaces the LSTM module to fuse the multiple visual frame-level features output by the AFEN module and to output visual segment-level features.
  9. A target behavior recognition apparatus based on audio-visual feature fusion, comprising:
    an acquisition module, configured to acquire an audio-video segment to be recognized of a preset duration;
    an information collection module, configured to collect visual input information and auditory input information from the audio-video segment to be recognized;
    a feature extraction module, configured to input the visual input information and the auditory input information together into a target behavior model, wherein the target behavior model comprises a dual-branch channel feature extraction network, an autoencoder network and a fully connected layer recognition module;
    extract features from the visual input information and the auditory input information respectively by the feature extraction network to obtain visual features and auditory features;
    map the visual features and the auditory features to the same subspace by the encoder of the autoencoder network to obtain auditory mapping features corresponding to the auditory features and visual mapping features corresponding to the visual features; map all the visual mapping features and all the auditory mapping features into a multi-modal space by the decoder of the autoencoder network, each modality obtaining the visual compensation features of the other modality space as visual shared features and the auditory compensation features of the other modality as auditory shared features; and concatenate the visual shared features, the auditory shared features, the visual features and the auditory features to obtain fused features;
    wherein the autoencoder network comprises an encoder and a decoder, the encoder comprising a first fully connected layer, a second fully connected layer and an encoder layer connected in sequence; the visual features and the auditory features are jointly input into the encoder and pass in sequence through the first fully connected layer, the second fully connected layer and the encoder layer, outputting the auditory mapping features corresponding to the auditory features and the visual mapping features corresponding to the visual features;
    wherein the decoder comprises two branches, each branch consisting of two fully connected layers; one branch takes the auditory mapping features as input, and its two fully connected layers map all the auditory mapping features into the multi-modal space to obtain the visual compensation features corresponding to the auditory mapping features; the other branch takes the visual mapping features as input, and its two fully connected layers map all the visual mapping features into the multi-modal space to obtain the auditory compensation features corresponding to the visual mapping features (a code sketch of this encoder-decoder structure follows this claim);
    a behavior recognition module, configured to input the fused features into the fully connected layer recognition module for recognition to obtain the target behavior.
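    As a sketch only of the autoencoder described in this claim; the feature dimensions, activation functions and the PyTorch framework are assumptions, not part of the claim.

```python
import torch
import torch.nn as nn

class CrossModalAutoencoder(nn.Module):
    """Shared encoder (two fully connected layers plus an encoder layer) maps visual and
    auditory features into one subspace; the decoder has two branches of two fully
    connected layers each, producing the cross-modal compensation (shared) features."""
    def __init__(self, feat_dim=256, hidden_dim=128, code_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),   # first fully connected layer
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), # second fully connected layer
            nn.Linear(hidden_dim, code_dim),              # encoder layer
        )
        def branch():  # two fully connected layers back to the multi-modal feature space
            return nn.Sequential(nn.Linear(code_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, feat_dim))
        self.audio_to_visual = branch()   # auditory mapping -> visual compensation feature
        self.visual_to_audio = branch()   # visual mapping  -> auditory compensation feature

    def forward(self, f_visual, f_audio):
        z_visual = self.encoder(f_visual)              # visual mapping feature
        z_audio = self.encoder(f_audio)                # auditory mapping feature
        visual_shared = self.audio_to_visual(z_audio)  # visual shared feature
        audio_shared = self.visual_to_audio(z_visual)  # auditory shared feature
        # fused feature: concatenation of shared and original features
        return torch.cat([visual_shared, audio_shared, f_visual, f_audio], dim=-1)
```

    The fused feature returned here would then be passed to the fully connected layer recognition module by the behavior recognition module.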
  10. An electronic device, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to run the computer program to perform the target behavior recognition method based on audio-visual feature fusion according to any one of claims 1 to 8.
  11. A readable storage medium, wherein a computer program is stored in the readable storage medium, the computer program comprising program code for controlling a process to execute a process, and the process comprises the target behavior recognition method based on audio-visual feature fusion according to any one of claims 1 to 8.
PCT/CN2022/141314 2022-05-09 2022-12-23 Target behavior recognition method and apparatus based on visual-audio feature fusion, and application WO2023216609A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210496197.7 2022-05-09
CN202210496197.7A CN114581749B (en) 2022-05-09 2022-05-09 Audio-visual feature fusion target behavior identification method and device and application

Publications (1)

Publication Number Publication Date
WO2023216609A1

Family

ID=81768993

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/141314 WO2023216609A1 (en) 2022-05-09 2022-12-23 Target behavior recognition method and apparatus based on visual-audio feature fusion, and application

Country Status (2)

Country Link
CN (1) CN114581749B (en)
WO (1) WO2023216609A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114581749B (en) * 2022-05-09 2022-07-26 城云科技(中国)有限公司 Audio-visual feature fusion target behavior identification method and device and application

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102098492A (en) * 2009-12-11 2011-06-15 上海弘视通信技术有限公司 Audio and video conjoint analysis-based fighting detection system and detection method thereof
CN103854014A (en) * 2014-02-25 2014-06-11 中国科学院自动化研究所 Terror video identification method and device based on sparse representation of context
US9697833B2 (en) * 2015-08-25 2017-07-04 Nuance Communications, Inc. Audio-visual speech recognition with scattering operators
CN108200483B (en) * 2017-12-26 2020-02-28 中国科学院自动化研究所 Dynamic multi-modal video description generation method
CN109509484A (en) * 2018-12-25 2019-03-22 科大讯飞股份有限公司 A kind of prediction technique and device of baby crying reason
CN111640424B (en) * 2019-03-01 2024-02-13 北京搜狗科技发展有限公司 Voice recognition method and device and electronic equipment
CN109961789B (en) * 2019-04-30 2023-12-01 张玄武 Service equipment based on video and voice interaction
CN112328830A (en) * 2019-08-05 2021-02-05 Tcl集团股份有限公司 Information positioning method based on deep learning and related equipment
US11244696B2 (en) * 2019-11-06 2022-02-08 Microsoft Technology Licensing, Llc Audio-visual speech enhancement
CN111461235B (en) * 2020-03-31 2021-07-16 合肥工业大学 Audio and video data processing method and system, electronic equipment and storage medium
CN111754992B (en) * 2020-06-30 2022-10-18 山东大学 Noise robust audio/video bimodal speech recognition method and system
CN112866586B (en) * 2021-01-04 2023-03-07 北京中科闻歌科技股份有限公司 Video synthesis method, device, equipment and storage medium
CN113255556A (en) * 2021-06-07 2021-08-13 斑马网络技术有限公司 Multi-mode voice endpoint detection method and device, vehicle-mounted terminal and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190341025A1 (en) * 2018-04-18 2019-11-07 Sony Interactive Entertainment Inc. Integrated understanding of user characteristics by multimodal processing
CN109344781A (en) * 2018-10-11 2019-02-15 上海极链网络科技有限公司 Expression recognition method in a kind of video based on audio visual union feature
CN110647804A (en) * 2019-08-09 2020-01-03 中国传媒大学 Violent video identification method, computer system and storage medium
CN111460889A (en) * 2020-02-27 2020-07-28 平安科技(深圳)有限公司 Abnormal behavior identification method, device and equipment based on voice and image characteristics
CN112287893A (en) * 2020-11-25 2021-01-29 广东技术师范大学 Sow lactation behavior identification method based on audio and video information fusion
CN114581749A (en) * 2022-05-09 2022-06-03 城云科技(中国)有限公司 Audio-visual feature fusion target behavior identification method and device and application

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TIANLIANG LIU, QIAO QINGWEI; WAN JUNWEI; DAI XIUBIN; LUO JIEBO: "Human Action Recognition via Spatio-temporal Dual Network Flow and Visual Attention Fusion", JOURNAL OF ELECTRONICS & INFORMATION TECHNOLOGY, ZHONGGUO KEXUEYUAN DIANZIXUE YANJIUSUO,CHINESE ACADEMY OF SCIENCES, INSTITUTE OF ELECTRONICS, CN, vol. 40, no. 10, 15 August 2018 (2018-08-15), CN , pages 2395 - 2401, XP093107315, ISSN: 1009-5896, DOI: 10.11999/JEIT171116 *
XIAO-YU WU, GU CHAO-NAN; WANG SHENG-JIN: "Special video classification based on multitask learning and multimodal feature fusion", OPTICS AND PRECISION ENGINEERING, vol. 28, no. 5, 13 May 2020 (2020-05-13), pages 1177 - 1186, XP093107318 *

Also Published As

Publication number Publication date
CN114581749A (en) 2022-06-03
CN114581749B (en) 2022-07-26

Similar Documents

Publication Publication Date Title
WO2020103676A1 (en) Image identification method and apparatus, system, and storage medium
WO2022007193A1 (en) Weak supervision video behavior detection method and system based on iterative learning
CN107943837A (en) A kind of video abstraction generating method of foreground target key frame
CN111653368A (en) Artificial intelligence epidemic situation big data prevention and control early warning system
CN109360584A (en) Cough monitoring method and device based on deep learning
US20200034739A1 (en) Method and device for estimating user's physical condition
CN110931112B (en) Brain medical image analysis method based on multi-dimensional information fusion and deep learning
CN112587153B (en) End-to-end non-contact atrial fibrillation automatic detection system and method based on vPPG signal
WO2023216609A1 (en) Target behavior recognition method and apparatus based on visual-audio feature fusion, and application
Ahmedt-Aristizabal et al. A hierarchical multimodal system for motion analysis in patients with epilepsy
WO2023273629A1 (en) System and apparatus for configuring neural network model in edge server
Gao et al. Deep model-based semi-supervised learning way for outlier detection in wireless capsule endoscopy images
CN111814588B (en) Behavior detection method, related equipment and device
Hsu et al. Hierarchical Network for Facial Palsy Detection.
CN109460717A (en) Alimentary canal Laser scanning confocal microscope lesion image-recognizing method and device
WO2023285898A1 (en) Screening of individuals for a respiratory disease using artificial intelligence
Mohan et al. Non-invasive technique for real-time myocardial infarction detection using faster R-CNN
You et al. Automatic cough detection from realistic audio recordings using C-BiLSTM with boundary regression
Wang et al. A YOLO-based Method for Improper Behavior Predictions
CN107920224A (en) A kind of abnormality alarming method, equipment and video monitoring system
CN116416678A (en) Method for realizing motion capture and intelligent judgment by using artificial intelligence technology
US20210177307A1 (en) Repetitive human activities abnormal motion detection
Postawka Real-time monitoring system for potentially dangerous activities detection
Goldstein et al. Chest area segmentation in 3D images of sleeping patients
Mantini et al. A Day on Campus-An Anomaly Detection Dataset for Events in a Single Camera

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22941542

Country of ref document: EP

Kind code of ref document: A1