WO2023216609A1 - Target behavior recognition method and apparatus based on visual-audio feature fusion, and application - Google Patents


Info

Publication number
WO2023216609A1
Authority
WO
WIPO (PCT)
Prior art keywords
visual
features
auditory
feature
audio
Application number
PCT/CN2022/141314
Other languages
French (fr)
Chinese (zh)
Inventor
毛云青
王国梁
齐韬
陈思瑶
葛俊
Original Assignee
城云科技(中国)有限公司
Application filed by 城云科技(中国)有限公司
Publication of WO2023216609A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06F 18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/08 Learning methods

Definitions

  • the present application relates to the field of intelligent security technology, and in particular to a target behavior recognition method, device and application of audio-visual feature fusion.
  • the methods of judging fights through artificial intelligence algorithms include: detecting pictures and making behavioral classification judgments, or detecting the positions of key human body points in multiple frames of pictures and making behavioral judgments.
  • the publication numbers CN112733629A and CN111401296A only disclose the use of image information to determine abnormal behavior.
  • in actual scenarios, the above algorithm will identify some large-movement labor operations, such as cleaning by multiple people, or physical exercise by multiple people, such as playing ball, as fighting behavior; in addition, the usual algorithm judgment uses only image information, so its accuracy needs to be improved.
  • Semantic consistency is of great significance in multi-modal information fusion, especially visual and auditory information fusion.
  • when multi-modal information is semantically consistent, the information is complementary; otherwise the modalities interfere with each other, as in the well-known "McGurk effect".
  • human hearing is strongly influenced by vision, which can lead to mishearing: when a sound does not match the visual signal, people may perceive a third, different sound, so simply fusing sound and video signals may even produce the opposite effect.
  • Embodiments of this application provide a target behavior recognition method, device and application for audio-visual feature fusion.
  • this solution uses a feature-level fusion method to fuse audio-visual information, which can improve the accuracy of abnormal behavior recognition.
  • embodiments of the present application provide a target behavior recognition method using audio-visual feature fusion.
  • the method includes: obtaining an audio-video segment of a preset duration to be recognized; collecting visual input information and auditory input information from the audio-video segment to be recognized; inputting the visual input information and the auditory input information together into a target behavior model, where the target behavior model includes a dual-branch channel feature extraction network, an autoencoding network and a fully connected layer recognition module; extracting features from the visual input information and the auditory input information respectively with the feature extraction network to obtain visual features and auditory features; using the autoencoding network to map the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fusion features; and inputting the fusion features into the fully connected layer recognition module for recognition to obtain the target behavior.
  • "using the auto-encoding network to map the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fusion features” includes: using the encoder of the auto-encoding network The visual features and the auditory features are mapped to the same subspace to obtain the auditory mapping features corresponding to the auditory features and the visual mapping features corresponding to the visual features; according to the decoder of the autoencoding network, all the visual mapping features and All the auditory mapping features are mapped into a multi-modal space, and each modality obtains the visual compensation features of other modal spaces as visual shared features and the auditory compensation features of other modalities as auditory shared features; splicing the visual The shared features, the auditory shared features, the visual features and the auditory features are used to obtain fused features.
  • the autoencoding network includes an encoder and a decoder, where the encoder includes a first fully connected layer, a second fully connected layer and an encoder layer connected in sequence; the visual features and the auditory features are jointly input into the encoder and pass in turn through the first fully connected layer, the second fully connected layer and the encoder layer, whose output gives the auditory mapping features corresponding to the auditory features and the visual mapping features corresponding to the visual features; the decoder includes two branches, each consisting of two fully connected layers; one branch takes the auditory mapping features as input, and its two fully connected layers map all auditory mapping features into the multi-modal space to obtain the visual compensation features corresponding to the auditory mapping features.
  • the other branch takes visual mapping features as input, and two fully connected layers map all visual mapping features into the multi-modal space to obtain auditory compensation features corresponding to the visual mapping features.
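  • For concreteness, the following is a minimal PyTorch-style sketch of such an autoencoding network. The layer widths, the use of one encoder shared by both modalities, and the module names are illustrative assumptions; the text above only fixes the layer counts (two fully connected layers plus an encoder layer in the encoder, two fully connected layers per decoder branch).

```python
# Hypothetical sketch of the autoencoding network described above; dimensions are assumptions.
import torch
import torch.nn as nn

class AudioVisualAutoencoder(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256, subspace_dim=128):
        super().__init__()
        # Encoder: first FC layer, second FC layer, then an encoder layer,
        # mapping segment-level features of either modality into the same subspace.
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, subspace_dim),
        )
        # Decoder branch 1: auditory mapping features -> visual compensation (shared) features.
        self.audio_to_visual = nn.Sequential(
            nn.Linear(subspace_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),
        )
        # Decoder branch 2: visual mapping features -> auditory compensation (shared) features.
        self.visual_to_audio = nn.Sequential(
            nn.Linear(subspace_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, f_visual, f_audio):
        g_visual = self.encoder(f_visual)                  # visual mapping feature
        h_audio = self.encoder(f_audio)                    # auditory mapping feature
        f_visual_shared = self.audio_to_visual(h_audio)    # visual shared feature
        f_audio_shared = self.visual_to_audio(g_visual)    # auditory shared feature
        # Splice original and shared features to obtain the fusion feature.
        fusion = torch.cat([f_visual, f_visual_shared, f_audio, f_audio_shared], dim=-1)
        return fusion, f_visual_shared, f_audio_shared
```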
  • the visual features and the auditory features input to the autoencoding network are marked with semantic mapping labels, where a semantic mapping label marks visual input information and auditory input information that describe the same semantic content; when the visual features or auditory features input to the autoencoding network carry a semantic mapping label, the loss function is the algebraic sum of the auditory average error value and the visual average error value; when they carry no semantic mapping label, the loss function is the difference between 1 and the algebraic sum of the auditory average error value and the visual average error value; the auditory average error value is the average of the absolute differences between all auditory features and all auditory shared features, and the visual average error value is the average of the absolute differences between all visual features and all visual shared features.
  • the loss function is obtained by the formula given in the detailed description, where y_autocoder is the loss function, N is the number of features, f_audio is the auditory feature, f'_audio is the auditory shared feature, f_visual is the visual feature, and f'_visual is the visual shared feature.
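  • A minimal sketch of how this loss could be computed follows, assuming the features are PyTorch tensors and that a boolean flag indicates whether the semantic mapping label (L_corr) is present; the function name and signature are assumptions.

```python
# Hypothetical implementation of the piecewise loss described above.
import torch

def autocoder_loss(f_audio, f_audio_shared, f_visual, f_visual_shared, has_semantic_label):
    # Auditory / visual average error: mean absolute difference between
    # the extracted features and their shared (compensation) counterparts.
    audio_err = torch.mean(torch.abs(f_audio - f_audio_shared))
    visual_err = torch.mean(torch.abs(f_visual - f_visual_shared))
    total = audio_err + visual_err
    if has_semantic_label:      # L_corr = 1: matched semantics, minimize the error sum
        return total
    return 1.0 - total          # L_corr = -1: unmatched semantics, push the errors apart

# Training sketch: the loss is backpropagated to update the autoencoder weights.
# loss = autocoder_loss(f_a, f_a_shared, f_v, f_v_shared, has_semantic_label=True)
# loss.backward()
```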
  • "labeling the visual features and the auditory features input to the autoencoding network with semantic mapping labels" includes: semantically tagging the acoustic anomaly information of the auditory input information and the visual anomaly information of the visual input information separately; if it is determined that both the auditory input information and the visual anomaly information carry the semantic tag, the semantic mapping label is assigned to that pair of auditory input information and visual anomaly information.
  • "collecting visual input information in the audio and video segment to be recognized” includes: collecting the difference between every two adjacent image frames from the audio and video segment to be recognized to obtain a difference sequence, The difference sequence is used as visual input information.
  • "collecting auditory input information in the audio and video segments to be recognized” includes: obtaining the original audio waveform corresponding to the audio and video segment to be recognized, and sampling from the original audio waveform at a preset sampling interval. Acoustic signals are collected to obtain auditory input information.
  • the auditory input information is represented as waveform data with time as the horizontal axis and acoustic signal as the vertical axis, wherein the auditory input information and the visual input information use the time domain as a unified scale.
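  • A sketch of sampling the acoustic signal from the raw waveform at a preset interval so that the samples line up with the video frames on the time axis; the 0.04 s default interval (matching a 25 fps video) is an assumption, not a value stated here.

```python
# Hypothetical waveform sampling at a preset time interval.
import numpy as np

def sample_waveform(waveform: np.ndarray, sample_rate: int, interval_s: float = 0.04):
    """Return (times, values): one acoustic sample every interval_s seconds.

    Time (horizontal axis) and acoustic signal (vertical axis) share the time
    domain with the visual frame sequence, so the two modalities use one scale.
    """
    step = max(1, int(round(interval_s * sample_rate)))
    idx = np.arange(0, len(waveform), step)
    return idx / sample_rate, waveform[idx]
```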
  • the dual-branch channel feature extraction network includes an auditory feature extraction network and a visual feature extraction network, where the auditory feature extraction network includes an AFEN module and an LSTM module: the multi-frame waveforms of the original audio waveform are input into the AFEN module to obtain the corresponding auditory frame-level features, and the auditory frame-level features are fused by the LSTM module to output auditory segment-level features.
  • the AFEN network includes 5 convolutional layers, 3 pooling layers and 3 sequentially connected fully connected layers, where a pooling layer follows the first, the second and the fifth convolutional layer respectively; each convolutional layer includes a ReLU activation function, which makes the activation pattern of the AFEN module sparser, and each pooling layer includes a local response normalization operation to avoid vanishing gradients.
  • the visual feature extraction network shares the AFEN module with the auditory feature extraction network.
  • in the visual feature extraction network, a ConvLSTM module is used instead of the LSTM module to fuse the multiple visual frame-level features output by the AFEN module and output visual segment-level features.
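  • The text fixes only the layer counts of AFEN and the use of an LSTM (or ConvLSTM) head, so the following sketch of the auditory branch fills in the rest with assumptions: 1-D convolutions over waveform frames, arbitrary channel counts and kernel sizes, and adaptive pooling before the fully connected layers.

```python
# Hypothetical sketch of the auditory branch: AFEN backbone + LSTM fusion head.
import torch
import torch.nn as nn

class AFEN(nn.Module):
    """Assumed frame-level extractor: 5 conv layers (each with ReLU), pooling with
    local response normalization after conv1, conv2 and conv5, then 3 FC layers."""
    def __init__(self, out_dim=512):
        super().__init__()
        def block(cin, cout, pool):
            layers = [nn.Conv1d(cin, cout, kernel_size=3, padding=1), nn.ReLU()]
            if pool:
                layers += [nn.MaxPool1d(2), nn.LocalResponseNorm(2)]
            return layers
        self.conv = nn.Sequential(
            *block(1, 32, True), *block(32, 64, True),
            *block(64, 96, False), *block(96, 96, False), *block(96, 64, True),
            nn.AdaptiveAvgPool1d(16), nn.Flatten(),
        )
        self.fc = nn.Sequential(
            nn.Linear(64 * 16, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, out_dim),
        )

    def forward(self, x):                     # x: (batch, 1, samples_per_frame)
        return self.fc(self.conv(x))

class AuditoryBranch(nn.Module):
    """Frame-level AFEN features fused over time by an LSTM into a segment-level feature."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.afen = AFEN(feat_dim)
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)

    def forward(self, frames):                # frames: (batch, T, samples_per_frame)
        b, t, n = frames.shape
        f = self.afen(frames.reshape(b * t, 1, n)).reshape(b, t, -1)
        _, (h, _) = self.lstm(f)              # last hidden state = auditory segment-level feature
        return h[-1]
```

  • In this sketch the visual branch would reuse the same convolutional structure on inter-frame differences, with a ConvLSTM head in place of the LSTM (see the ConvLSTM sketch further below).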
  • embodiments of the present application provide a target behavior recognition device with audio-visual feature fusion, including: an acquisition module for acquiring an audio-video segment of a preset duration to be recognized; and an information collection module for collecting the visual input information and the auditory input information in the audio-video segment to be recognized.
  • a feature extraction module is used to input the visual input information and the auditory input information together into the target behavior model, where the target behavior model includes a dual-branch channel feature extraction network, an autoencoding network and a fully connected layer recognition module; features are extracted from the visual input information and the auditory input information respectively with the feature extraction network to obtain visual features and auditory features, and the autoencoding network maps the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fusion features; a behavior recognition module is used to input the fusion features into the fully connected layer recognition module for recognition to obtain the target behavior.
  • embodiments of the present application provide an electronic device, including a memory and a processor.
  • a computer program is stored in the memory, and the processor is configured to run the computer program to execute the audio-visual feature fusion target behavior recognition method of any one of the first aspects.
  • embodiments of the present application provide a readable storage medium in which a computer program is stored, the computer program including program code for controlling a process to execute a process, and the process includes the audio-visual feature fusion target behavior recognition method according to any one of the first aspects.
  • This solution uses an autoencoding network to represent the shared semantic subspace mapping.
  • vision and hearing use time as a unified measure, so visual and auditory inputs that express the same semantics can be mapped to each other, thereby capturing the complementary information and high-level semantics between the different modalities and realizing feature fusion at the semantic level.
  • This solution designs a dual-branch channel feature extraction network.
  • the LSTM network is selected to process the temporal relationship between audio frame-level features, thereby obtaining auditory segment-level features;
  • the ConvLSTM network is selected to process the temporal relationships between video frame-level features to obtain visual segment-level features;
  • the feature heterogeneity between the visual segment-level features and the auditory segment-level features is then eliminated in the autoencoding network to achieve feature fusion.
  • This solution uses video inter-frame difference information to extract visual features, which can better reflect the characteristics of abnormal behavior.
  • the sound features are extracted from the original audio waveform rather than with spectrum-analysis-based methods such as MFCC or LPC, so they can be unified with the visual features on the time axis.
  • the image frames and the waveform are sampled separately at fixed intervals and unified into the time domain, which solves the problem of inconsistent audio and video feature processing during audio-visual information fusion.
  • Figure 1 is a flow chart of the main steps of a target behavior recognition method based on audio-visual feature fusion according to the first embodiment of the present application.
  • Figure 2 is a schematic diagram of the abnormal behavior auditory feature extraction network structure.
  • Figure 3 is a schematic diagram of the abnormal behavior recognition network structure based on autoencoding network mapping audio-visual feature fusion.
  • Figure 4 is a schematic structural diagram of the autoencoding network.
  • Figure 5 is a structural block diagram of a target behavior recognition device for audio-visual feature fusion according to the second embodiment of the present application.
  • FIG. 6 is a schematic diagram of the hardware structure of an electronic device according to the third embodiment of the present application.
  • the steps of the corresponding method are not necessarily performed in the order shown and described in this specification.
  • methods may include more or fewer steps than described in this specification.
  • a single step described in this specification may be broken down into multiple steps for description in other embodiments; and multiple steps described in this specification may also be combined into a single step for description in other embodiments.
  • Embodiments of the present application provide a target behavior recognition method based on audio-visual feature fusion, which uses the audio-visual feature fusion target behavior recognition scheme to make judgments on abnormal event behaviors.
  • Figure 1 is a flow chart of the main steps of a target behavior recognition method based on audio-visual feature fusion according to the first embodiment of the present application.
  • the target behavior recognition method of audio-visual feature fusion mainly includes the following steps 101 to 106.
  • Step 101 Obtain the audio and video segments to be identified of a preset duration.
  • Step 102 Collect visual input information and auditory input information in the audio and video segments to be recognized.
  • Step 103 Input the visual input information and the auditory input information together into the target behavior model, where the target behavior model includes a dual-branch channel feature extraction network, an autoencoding network and a fully connected layer recognition module.
  • Step 104 Extract features from the visual input information and the auditory input information respectively according to the feature extraction network to obtain visual features and auditory features.
  • Step 105 Use the autoencoding network to map the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fusion features.
  • Step 106 Input the fused features into the fully connected layer recognition module for recognition to obtain the target behavior.
  • in this solution, not only visual input information but also auditory input information is collected; both are used as inputs to the behavior model, and the model outputs the target behavior classification result.
  • this solution uses sound features, and the information of sound features and image features can complement each other to more accurately judge the target behavior, so it has better feature expression effect.
  • visual input information and auditory input information are input into the model for processing at the same time.
  • the advantage of parallel processing is that feature-level fusion can be performed in the feature extraction stage to better capture the relationship between the modalities. Compared with decision-level fusion methods, which separately fuse the results of visual recognition and auditory recognition, this solution can take the consistency between vision and hearing into account, so the multi-modal feature information can complement each other and achieve better performance.
  • the target behavior in this scheme can be normal behavior or abnormal behavior.
  • if the input is visual samples and auditory samples together with normal behaviors annotated from the samples,
  • the trained model will identify normal behaviors from the visual input information and auditory input information.
  • if the input to the model is samples in which abnormal behaviors are labeled,
  • the trained model will identify abnormal behaviors from the visual input information and auditory input information.
  • for example, visual samples and auditory samples describing violent scenes, together with the features describing those violent scenes annotated from the samples, are used as the input of the model, so that the trained model can identify violent behaviors such as fights.
  • an autoencoding network is used to achieve feature fusion.
  • the autoencoding network designed by the present invention can represent audio-visual data under the same metric, and can match visual information and auditory information that express the same semantics under that metric, that is, visual-auditory semantic consistency is achieved.
  • the embodiment of the present invention establishes a shared semantic subspace by mapping the visual features and the auditory features to the same subspace, thereby eliminating the feature heterogeneity between the video and audio modalities and capturing the complementary information and high-level semantics between the visual modality and the sound modality, achieving feature fusion at the semantic level.
  • that is, in this solution the encoder of the autoencoding network maps the visual features and the auditory features to the same subspace to obtain visual mapping features and auditory mapping features; the decoder of the autoencoding network maps the visual mapping features and the auditory mapping features into the multi-modal space to obtain the compensation features of the other modality as visual shared features and auditory shared features; and the visual shared features, the auditory shared features, the visual features and the auditory features are spliced to obtain the fusion features.
  • the extracted visual feature is f_visual; g() is the function that maps the visual feature f_visual to the same subspace, and inputting f_visual into g() yields the visual mapping feature.
  • H() is the function that maps the visual mapping feature g(f_visual) into the multi-modal space of shared semantics; inputting g(f_visual) into H() yields the compensation feature of the auditory modality as an auditory shared feature.
  • the extracted auditory feature is f_audio; h() is the function that maps the auditory feature f_audio to the same subspace, and inputting f_audio into h() yields the auditory mapping feature; G() is the function that maps the auditory mapping feature h(f_audio) into the multi-modal space of shared semantics, and inputting h(f_audio) into G() yields the compensation feature of the visual modality as a visual shared feature.
  • the autoencoding network includes an encoder and a decoder.
  • the encoder includes a first fully connected layer, a second fully connected layer and an encoder layer connected in sequence; the visual features and the auditory features are jointly input into the encoder and pass in turn through the first fully connected layer, the second fully connected layer and the encoder layer, whose output gives the auditory mapping features corresponding to the auditory features and the visual mapping features corresponding to the visual features;
  • the decoder includes two branches, each consisting of two fully connected layers; one branch takes the auditory mapping features as input, and its two fully connected layers map all auditory mapping features into the multi-modal space,
  • so that the visual compensation features corresponding to the auditory mapping features are obtained.
  • the other branch takes the visual mapping features as input, and uses two fully connected layers to map all the visual mapping features into the multi-modal space to obtain the auditory compensation features corresponding to the visual mapping features.
  • formula (3) is used to splice visual features, visual compensation features, auditory features, and auditory compensation features to obtain fusion features.
  • each modal space receives information from its inter-modal neighbors and intra-modal neighbors, and shares its own information at the same time.
  • the inter-modal neighbor information obtained by any modal space can make up for the loss of its own information
  • the obtained inter-modal neighbor information is used as a supplementary feature to enhance the expressive ability of the fused feature.
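  • Formula (3) itself is not reproduced in this text; in the notation introduced above, the splicing step presumably amounts to a concatenation of the four vectors, for example:

$$f_{\mathrm{fusion}} = \mathrm{concat}\big(f_{\mathrm{visual}},\; G(h(f_{\mathrm{audio}})),\; f_{\mathrm{audio}},\; H(g(f_{\mathrm{visual}}))\big)$$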
  • the error of the autoencoding network includes two parts: the error of the acoustic decoder and the error of the visual decoder, and the sum of the two is the total error; the error can be backpropagated to update the weights of the autoencoding network.
  • Introducing semantic labels to calculate the autoencoding network error includes: labeling the visual features and the auditory features input to the autoencoding network with semantic mapping labels, where a semantic mapping label marks visual input information and auditory input information that describe the same semantic content; when the visual features or auditory features input to the autoencoding network carry a semantic mapping label, the loss function is the algebraic sum of the auditory average error value and the visual average error value;
  • the loss function is the difference between 1 and the algebraic sum of the auditory average error value and the visual average error value
  • the auditory average error value is characterized as the average of the absolute differences between all auditory features and all auditory shared features
  • the visual average error value is characterized as the average of the absolute differences between all visual features and all visual shared features.
  • the loss function is obtained by the formula given in the detailed description, where y_autocoder is the loss function, N is the number of features, f_audio is the auditory feature, f'_audio is the auditory shared feature, f_visual is the visual feature, and f'_visual is the visual shared feature.
  • this solution designs a new loss function that allows the model to learn the bias information of semantic mapping on the timeline.
  • this solution reduces the interference of blind splicing features, enhances the model's ability to distinguish abnormal video semantic correspondence, and is more conducive to eliminating interference between non-corresponding features.
  • semantic embedding learning can be regarded as a form of regularization, which helps to enhance the generalization ability of the model and prevent overfitting.
  • the semantic mapping label is obtained by semantically tagging the acoustic anomaly information of the auditory input information and the visual anomaly information of the visual input information separately; if it is determined that both the auditory input information and the visual anomaly information carry the semantic tag, the semantic mapping label is assigned to that pair of auditory input information and visual anomaly information.
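  • A sketch of this label-assignment rule, assuming the per-segment anomaly annotations are available as booleans; the function and its signature are illustrative.

```python
# Hypothetical rule for assigning the semantic mapping label (L_corr).
def assign_semantic_mapping_label(audio_has_anomaly_tag: bool, visual_has_anomaly_tag: bool) -> int:
    """+1 if both the auditory and the visual anomaly information carry the semantic
    tag (semantic mapping label assigned to the pair), otherwise -1 (no label)."""
    return 1 if (audio_has_anomaly_tag and visual_has_anomaly_tag) else -1
```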
  • this solution maps the original audio waveform to a two-dimensional field, that is, the x-axis of the sound data is time and the y-axis is the waveform.
  • This solution uses the surveillance video screen as the original video data.
  • the original video data has approximately 30 to 60 frames of images in 1 second. That is, the x-axis of the visual data is also time, and the y-axis is the image frame.
  • the two can be aligned on the time axis based on whether the semantics are the same to achieve correspondence between visual features and auditory features, thereby achieving semantic consistency between vision and hearing.
  • the information of the original audio and video segments to be identified is also processed.
  • the present invention collects the difference between every two adjacent image frames from the audio and video segments to be identified to obtain a difference sequence,
  • the difference sequence is used as visual input information.
  • the objects of abnormal behavior recognition are violent behaviors such as punching someone. A single frame showing a fist on the opponent's chest or at a person's own waist does not by itself accurately indicate the behavior of the people in the video; but if the fist is still at the person's waist in the first few frames and is on the opponent's chest in the following frames, the person in the video has performed the abnormal behavior of punching. It can be seen that the differences between video frames can extract the required information more accurately than the video frames themselves, so choosing the difference between adjacent video frames as the input of the network model gives a better feature expression than inputting the video frames themselves into the model.
  • the original audio waveform is used as the auditory information to be collected, so "collecting the auditory input information in the audio-video segment to be recognized" includes: obtaining the original audio waveform corresponding to the audio-video segment to be recognized, and sampling acoustic signals from the original audio waveform at a preset sampling interval to obtain the auditory input information.
  • the sound extracted by this solution is based on the original audio waveform, so it can be expressed in a unified metric with the visual features; that is, the auditory input information is represented as waveform data with time as the horizontal axis and the acoustic signal as the vertical axis, where the auditory input information and the visual input information use the time domain as a unified scale.
  • the feature extraction network for extracting visual and auditory features uses a dual-branch channel, that is, visual input information and auditory input information can be input into the feature extraction network at the same time, and feature extraction is performed separately to output visual features and auditory features.
  • the dual-branch channel feature extraction network includes an auditory feature extraction network and a visual feature extraction network, wherein the auditory feature extraction network includes an AFEN module and an LSTM module, and the multi-frame waveforms in the original audio waveform are input into the AFEN module , obtain multiple corresponding auditory frame-level features, fuse the multiple auditory frame-level features through the LSTM module, and output auditory segment-level features.
  • the abnormal behavior auditory feature extraction network structure AFEN structure includes 5 convolutional layers, 3 pooling layers and 3 sequentially connected fully connected layers.
  • the output of the last fully connected layer passes through the SoftMax layer.
  • the pooling layer is connected after the first convolution layer, the second convolution layer and the fifth convolution layer respectively.
  • the black rectangle represents each convolution layer
  • the white rectangle represents the pooling layer
  • the three rectangles immediately after the last pooling layer represent the three fully connected layers
  • the black rectangle after the fully connected layer represents the LSTM Structure
  • the LSTM network is selected to process the temporal relationship between audio frame-level features and obtain segment-level features.
  • the convolutional layer contains the ReLU activation function, making the activation pattern of the network sparser.
  • the pooling layer contains a local response normalization operation to avoid gradient disappearance and improve the training speed of the network.
  • the timing information is summarized through the LSTM network.
  • This method can be adapted to the entire surveillance video mechanism and has no rigid requirements in terms of audio and video length, sampling rate, etc. This solves the problem of feature timeline alignment.
  • this model also greatly reduces the complexity of visual and auditory feature fusion and improves the stability of the model.
  • the abnormal behavior visual feature extraction network structure is the same as the AFEN convolution structure shown in Figure 2, and the convolutional LSTM (ConvLSTM) module is used to replace the last LSTM module, and the original input signal is changed to the inter-image frame difference.
  • the difference is that, compared with auditory feature extraction, visual features depend more on spatiotemporal relationships for action recognition; ConvLSTM captures spatiotemporal relationships better than LSTM and can solve spatiotemporal sequence prediction problems such as video classification and action recognition.
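  • PyTorch has no built-in ConvLSTM module, so the sketch below hand-rolls a ConvLSTM cell of the usual form (LSTM gates computed with 2-D convolutions so spatial structure is preserved); it is an illustrative stand-in, not the exact module used here.

```python
# Hypothetical minimal ConvLSTM cell.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        self.hid_ch = hid_ch
        # One convolution produces all four gates (input, forget, output, candidate).
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size, padding=kernel_size // 2)

    def forward(self, x, state=None):          # x: (batch, in_ch, H, W)
        b, _, hgt, wid = x.shape
        if state is None:
            h = x.new_zeros(b, self.hid_ch, hgt, wid)
            c = x.new_zeros(b, self.hid_ch, hgt, wid)
        else:
            h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g                      # cell state carries temporal information
        h = o * torch.tanh(c)                  # hidden state keeps the spatial layout
        return h, (h, c)
```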
  • the present invention designs an abnormal behavior recognition model based on audio-visual information fusion using autoencoding network.
  • the model structure is shown in Figure 3.
  • the model includes four parts: visual feature extraction, auditory feature extraction, autoencoding network and fully connected recognition model.
  • Visual and acoustic feature extraction uses a dual-channel feature extraction method.
  • the network structure is based on a deep convolutional network.
  • for visual features, the difference between video frames is used as the original input, and a deep convolutional network plus a ConvLSTM network is used to extract segment-level visual features.
  • for auditory features, audio waveforms are used as the network input, and a deep convolutional network plus an LSTM network is used to extract segment-level auditory features.
  • the autoencoding network shown in Figure 4 is used to construct a shared semantic subspace to eliminate the semantic bias between visual and auditory features, the CONCAT method is used to combine the visual and auditory features, and finally a fully connected model is used to identify abnormal behaviors. The target behavior recognition method of this scheme therefore relies on the shared semantic subspace of the autoencoding network to integrate auditory features into visual features to supplement visual information, and to integrate visual features into auditory features to supplement auditory information, achieving a complementary effect between the different modalities. The fused features obtained through feature fusion thus carry richer semantic expressions, so the classification results obtained when the model uses the fused features to classify behaviors are also more accurate. On this basis, this solution improves recognition accuracy and reduces the missed detection rate.
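  • Putting the four parts together, the following sketch shows one possible forward pass (visual branch, auditory branch, autoencoding fusion, fully connected recognizer); it reuses the hypothetical modules sketched earlier in this text, and the classifier width and class count are assumptions.

```python
# Hypothetical end-to-end composition of the four parts of the model.
import torch.nn as nn

class AbnormalBehaviorRecognizer(nn.Module):
    def __init__(self, visual_branch, auditory_branch, autoencoder, feat_dim=512, num_classes=2):
        super().__init__()
        self.visual_branch = visual_branch      # inter-frame differences -> visual segment feature
        self.auditory_branch = auditory_branch  # waveform frames -> auditory segment feature
        self.autoencoder = autoencoder          # shared-subspace mapping + CONCAT fusion
        self.classifier = nn.Sequential(        # fully connected recognition module
            nn.Linear(4 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, frame_diffs, waveform_frames):
        f_visual = self.visual_branch(frame_diffs)
        f_audio = self.auditory_branch(waveform_frames)
        fusion, _, _ = self.autoencoder(f_visual, f_audio)  # fusion of original + shared features
        return self.classifier(fusion)                      # target behavior scores
```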
  • this solution provides a target behavior recognition device with audio-visual feature fusion.
  • the device uses the above target behavior recognition method with audio-visual feature fusion to identify the target behavior.
  • the device includes:
  • the acquisition module 501 is used to acquire the audio and video segments to be recognized of a preset duration.
  • the information collection module 502 is used to collect visual input information and auditory input information in the audio and video segments to be recognized.
  • Feature extraction module 503 is used to input the visual input information and the auditory input information into the target behavior model together, where the target behavior model includes a dual-branch channel feature extraction network, an autoencoding network and a fully connected layer recognition module.
  • features are respectively extracted from the visual input information and the auditory input information to obtain visual features and auditory features.
  • the autoencoding network is used to map the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fusion features.
  • the behavior recognition module 504 is used to input the fused features into the fully connected layer recognition module for recognition to obtain the target behavior.
  • an electronic device includes a memory 604 and a processor 602.
  • the memory 604 stores a computer program
  • the processor 602 is configured to run the computer program to perform the steps in any of the above method embodiments.
  • the above-mentioned processor 602 may include a central processing unit (CPU) or an application-specific integrated circuit (ASIC), or may be configured as one or more integrated circuits that implement the embodiments of the present application.
  • memory 604 may include mass storage for data or instructions.
  • the memory 604 may include a hard disk drive (Hard Disk Drive, HDD for short), floppy disk drive, Solid State Drive (Solid State Drive, SSD for short), flash memory, optical disk, magneto-optical disk, magnetic tape, or Universal Serial Bus (Universal Serial Bus, Referred to as USB) drive or a combination of two or more of these.
  • Memory 604 may include removable or non-removable (or fixed) media, where appropriate.
  • Memory 604 may be internal or external to the data processing device, where appropriate.
  • memory 604 is Non-Volatile memory.
  • the memory 604 includes read-only memory (ROM) and random access memory (RAM).
  • the ROM may be a mask-programmed ROM, a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), an electrically alterable ROM (EAROM) or flash memory (FLASH), or a combination of two or more of these.
  • the RAM can be static random access memory (SRAM) or dynamic random access memory (DRAM), where the DRAM may be fast page mode dynamic random access memory (FPM DRAM), extended data out dynamic random access memory (EDO DRAM), synchronous dynamic random access memory (SDRAM), etc.
  • Memory 604 may be used to store or cache various data files required for processing and/or communication, as well as possibly computer program instructions executed by processor 602.
  • the processor 602 reads and executes the computer program instructions stored in the memory 604 to implement any of the audio-visual feature fusion target behavior recognition methods in the above embodiments.
  • the above-mentioned electronic device may also include a transmission device 606 and an input-output device 608, wherein the transmission device 606 is connected to the above-mentioned processor 602, and the input-output device 608 is connected to the above-mentioned processor 602.
  • Transmission device 606 may be used to receive or send data via a network.
  • Specific examples of the above-mentioned network may include a wired or wireless network provided by a communication provider of the electronic device.
  • the transmission device includes a network adapter (Network Interface Controller, NIC for short), which can be connected to other network devices through a base station to communicate with the Internet.
  • the transmission device 606 may be a radio frequency (Radio Frequency, RF for short) module, which is used to communicate with the Internet wirelessly.
  • Input and output devices 608 are used to input or output information.
  • the input information may be audio and video segments to be recognized, etc.
  • the output information may be the target behavior to be recognized, etc.
  • the above-mentioned processor 602 can be configured to perform the following steps through a computer program:
  • S102 Collect visual input information and auditory input information in the audio and video segments to be recognized.
  • various embodiments may be implemented in hardware or special purpose circuitry, software, logic, or any combination thereof. Some aspects of the invention may be implemented in hardware, while other aspects may be implemented in firmware or software executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. Although various aspects of the invention may be shown and described as block diagrams, flow diagrams, or using some other graphical representation, it is to be understood that, as non-limiting examples, the blocks, devices, systems, techniques, or methods described herein may be implemented in hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controllers or other computing devices, or some combination thereof.
  • Embodiments of the invention may be implemented by computer software executable by a data processor of the mobile device, such as in a processor entity, or by hardware, or by a combination of software and hardware.
  • computer software or programs are also referred to as program products.
  • a computer program product may include one or more computer-executable components that are configured to perform embodiments when the program is executed.
  • One or more computer-executable components may be at least one software code or a portion thereof.
  • any block of the logic flow in the figures may represent program steps, or interconnected logic circuits, blocks, and functions, or a combination of program steps and logic circuits, blocks, and functions.
  • Software may be stored on physical media such as memory chips or memory blocks implemented within a processor, magnetic media such as hard or floppy disks, and optical media such as, for example, DVD and its data variants, CDs.
  • Physical media are non-transient media.

Abstract

A target behavior recognition method and apparatus based on visual-audio feature fusion, and an application, which relate to the technical field of intelligent security protection. In the method, visual information and audio information are input into a specified algorithm network, a visual feature and an audio feature are extracted via different feature extraction networks of two branches, and timing features are calculated via an LSTM network; and a shared semantic sub-space is constructed by means of an auto-encoding network, a semantic bias between the visual feature and the audio feature is eliminated, and finally, the visual feature and the audio feature are fused, such that a target behavior can be recognized on the basis of a fused feature. The method can improve the accuracy of abnormal behavior recognition.

Description

Target behavior recognition method, device and application based on audio-visual feature fusion
Technical Field
The present application relates to the field of intelligent security technology, and in particular to a target behavior recognition method, device and application based on audio-visual feature fusion.
Background Art
In the fields of urban management and safety management, monitoring emergencies and raising real-time alarms are very important for public safety management. In real life, fighting between people is one of the most common abnormal emergencies. Traditional alarm methods mainly include learning that a fight has occurred when someone calls the police, which is obviously a lagging form of reporting, or having security personnel watch the surveillance footage to detect abnormalities, which consumes labor.
Therefore, relevant technologies already exist that use surveillance cameras to monitor around the clock and then judge and report fighting behavior with artificial intelligence algorithms, which can greatly improve the timeliness and accuracy of alarms for sudden abnormal events such as fights.
In the existing technology, the methods of judging fights with artificial intelligence algorithms include detecting a picture and making a behavior classification judgment, or detecting the positions of key human body points in multiple frames and making a behavior judgment. For example, publications CN112733629A and CN111401296A only disclose the use of image information to determine abnormal behavior. In actual scenarios, such algorithms will identify large-movement labor operations, such as several people cleaning, or group physical exercise such as playing ball, as fighting; in addition, the usual algorithm judgment uses only image information, so its accuracy needs to be improved.
In addition, existing technologies such as CN104243894A and CN102098492A judge abnormal behavior in two separate steps and then fuse the results at the decision level. However, decision-level fusion has limited effect on improving abnormal-behavior video recognition performance: it can only fuse the scores produced after each branch modality has made its own decision, without considering the semantic consistency of the information in each modality, so it cannot solve the problems of time misalignment and semantic inconsistency between video and sound.
Semantic consistency is of great significance in multi-modal information fusion, especially the fusion of visual and auditory information. When multi-modal information is semantically consistent, the information is complementary; otherwise the modalities interfere with each other, as in the well-known "McGurk effect". Human hearing is strongly influenced by vision, which can lead to mishearing: when a sound does not match the visual signal, people may perceive a third, different sound, so simply fusing sound and video signals may even produce the opposite effect.
Therefore, when the semantics of multi-modal information are inconsistent, feature fusion between modalities without any common measure not only fails to achieve information complementarity between the modalities but may also degrade algorithm performance. Owing to the particular way abnormal behavior occurs, the semantic inconsistency of audio-visual information is mainly reflected in two points: first, the audio-visual data may not be aligned on the timeline, for example the sound features may appear later than the visual features; second, there is semantic expression bias between vision and hearing. These are all problems that need to be solved in the process of multi-modal feature fusion.
Based on this, no effective solution has yet been proposed for the problem that abnormal behavior recognition algorithms cannot properly integrate audio-visual features to accurately determine whether abnormal behavior exists.
Summary of the Invention
Embodiments of this application provide a target behavior recognition method, device and application based on audio-visual feature fusion. In view of the existing abnormal behavior recognition algorithms, this solution uses a feature-level fusion method to fuse audio-visual information, which can improve the accuracy of abnormal behavior recognition.
In a first aspect, embodiments of the present application provide a target behavior recognition method based on audio-visual feature fusion. The method includes: obtaining an audio-video segment of a preset duration to be recognized; collecting visual input information and auditory input information from the audio-video segment to be recognized; inputting the visual input information and the auditory input information together into a target behavior model, where the target behavior model includes a dual-branch channel feature extraction network, an autoencoding network and a fully connected layer recognition module; extracting features from the visual input information and the auditory input information respectively with the feature extraction network to obtain visual features and auditory features; using the autoencoding network to map the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fusion features; and inputting the fusion features into the fully connected layer recognition module for recognition to obtain the target behavior.
In some of these embodiments, "using the autoencoding network to map the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fusion features" includes: mapping the visual features and the auditory features to the same subspace with the encoder of the autoencoding network to obtain the auditory mapping features corresponding to the auditory features and the visual mapping features corresponding to the visual features; mapping all the visual mapping features and all the auditory mapping features into a multi-modal space with the decoder of the autoencoding network, so that each modality obtains the visual compensation features of the other modal space as visual shared features and the auditory compensation features of the other modality as auditory shared features; and splicing the visual shared features, the auditory shared features, the visual features and the auditory features to obtain the fusion features.
In some of these embodiments, the autoencoding network includes an encoder and a decoder, where the encoder includes a first fully connected layer, a second fully connected layer and an encoder layer connected in sequence; the visual features and the auditory features are jointly input into the encoder and pass in turn through the first fully connected layer, the second fully connected layer and the encoder layer, whose output gives the auditory mapping features corresponding to the auditory features and the visual mapping features corresponding to the visual features. The decoder includes two branches, each consisting of two fully connected layers; one branch takes the auditory mapping features as input, and its two fully connected layers map all auditory mapping features into the multi-modal space to obtain the visual compensation features corresponding to the auditory mapping features; the other branch takes the visual mapping features as input, and its two fully connected layers map all visual mapping features into the multi-modal space to obtain the auditory compensation features corresponding to the visual mapping features.
In some of these embodiments, the visual features and the auditory features input to the autoencoding network are marked with semantic mapping labels, where a semantic mapping label marks visual input information and auditory input information that describe the same semantic content; when the visual features or auditory features input to the autoencoding network carry a semantic mapping label, the loss function is the algebraic sum of the auditory average error value and the visual average error value; when they carry no semantic mapping label, the loss function is the difference between 1 and the algebraic sum of the auditory average error value and the visual average error value; the auditory average error value is the average of the absolute differences between all auditory features and all auditory shared features, and the visual average error value is the average of the absolute differences between all visual features and all visual shared features. The loss function is obtained by the following formula:
$$y_{\mathrm{autocoder}} = \begin{cases} \dfrac{1}{N}\displaystyle\sum_{i=1}^{N}\Big(\big|f_{\mathrm{audio},i}-f'_{\mathrm{audio},i}\big| + \big|f_{\mathrm{visual},i}-f'_{\mathrm{visual},i}\big|\Big), & L_{\mathrm{corr}}=1 \\[2ex] 1-\dfrac{1}{N}\displaystyle\sum_{i=1}^{N}\Big(\big|f_{\mathrm{audio},i}-f'_{\mathrm{audio},i}\big| + \big|f_{\mathrm{visual},i}-f'_{\mathrm{visual},i}\big|\Big), & L_{\mathrm{corr}}=-1 \end{cases}$$
y autocoder为损失函数,N为特征数量,faudio为听觉特征,f’ audio为听觉共享特征,f visual为视觉特征,f’ visual为视觉共享特征,L corr=1表示存在语义映射标签,L corr=-1表示不存在语义映射标签。 y autocoder is the loss function, N is the number of features, faudio is the auditory feature, f' audio is the auditory shared feature, f visual is the visual feature, f' visual is the visual shared feature, L corr = 1 means there is a semantic mapping label, L corr =-1 indicates that there is no semantic mapping tag.
在其中一些实施例中,“对输入所述自编码网络的所述视觉特征和所述听觉特征采用语义映射标签进行标记”包括:分别对所述听觉输入信息的声学异常信息及所述视觉输入信息的视觉异常信息进行语义标记,若判断出所述听觉输入信息与所述视觉异常信息都具有所述语义标记,则为该组所述听觉输入信息与所述视觉异常信息分配所述语义映射标签。In some embodiments, "labeling the visual features and the auditory features input to the autoencoding network using semantic mapping labels" includes: separately labeling the acoustic anomaly information of the auditory input information and the visual input The visual abnormality information of the information is semantically tagged. If it is determined that both the auditory input information and the visual abnormality information have the semantic tags, the semantic mapping is assigned to the set of auditory input information and the visual abnormality information. Label.
在其中一些实施例中,“采集所述待识别音视频段中的视觉输入信息”包括:从所述待识别音视频段中采集每相邻两帧图像帧的差值,得到差值序列,将所述差值序列作为视觉输入信息。In some of the embodiments, "collecting visual input information in the audio and video segment to be recognized" includes: collecting the difference between every two adjacent image frames from the audio and video segment to be recognized to obtain a difference sequence, The difference sequence is used as visual input information.
在其中一些实施例中,“采集所述待识别音视频段中的听觉输入信息”包括:获取所述待识别音视频段对应的原始音频波形,从所述原始音频波形中以预设采样间隔采集声学信号,得到听觉输入信息。In some embodiments, "collecting auditory input information in the audio and video segments to be recognized" includes: obtaining the original audio waveform corresponding to the audio and video segment to be recognized, and sampling from the original audio waveform at a preset sampling interval. Acoustic signals are collected to obtain auditory input information.
在其中一些实施例中,所述听觉输入信息表征为以时间作为横轴,以声学信号作为纵轴的波形数据,其中所述听觉输入信息与视觉输入信息以时域为统一尺度。In some embodiments, the auditory input information is represented as waveform data with time as the horizontal axis and acoustic signal as the vertical axis, wherein the auditory input information and the visual input information use the time domain as a unified scale.
In some embodiments, the dual-branch feature extraction network includes an auditory feature extraction network and a visual feature extraction network. The auditory feature extraction network includes an AFEN module and an LSTM module: the multi-frame waveforms of the original audio waveform are input into the AFEN module to obtain the corresponding multiple auditory frame-level features, and the multiple auditory frame-level features are fused by the LSTM module to output auditory segment-level features.
In some embodiments, the AFEN network includes 5 convolutional layers, 3 pooling layers and 3 sequentially connected fully connected layers, where the pooling layers are connected after the first, second and fifth convolutional layers respectively; each convolutional layer includes a ReLU activation function that makes the activation pattern of the AFEN module sparser, and each pooling layer includes a local response normalization operation that avoids gradient vanishing.
In some embodiments, the visual feature extraction network shares the AFEN module with the auditory feature extraction network, and in the visual feature extraction network a ConvLSTM module replaces the LSTM module to fuse the multiple visual frame-level features output by the AFEN module and output visual segment-level features.
In a second aspect, embodiments of the present application provide a target behavior recognition apparatus based on audio-visual feature fusion, including: an acquisition module, configured to acquire an audio-video segment to be recognized of a preset duration; an information collection module, configured to collect visual input information and auditory input information in the audio-video segment to be recognized; a feature extraction module, configured to input the visual input information and the auditory input information together into a target behavior model, where the target behavior model includes a dual-branch feature extraction network, an autoencoding network and a fully connected layer recognition module, to extract features from the visual input information and the auditory input information respectively according to the feature extraction network to obtain visual features and auditory features, and to use the autoencoding network to map the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fused features; and a behavior recognition module, configured to input the fused features into the fully connected layer recognition module for recognition to obtain the target behavior.
In a third aspect, embodiments of the present application provide an electronic apparatus, including a memory and a processor, where a computer program is stored in the memory and the processor is configured to run the computer program to perform the target behavior recognition method based on audio-visual feature fusion according to any one of the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium in which a computer program is stored, the computer program including program code for controlling a process to execute the process, where the process includes the target behavior recognition method based on audio-visual feature fusion according to any one of the first aspect.
本申请实施例的主要贡献和创新点如下:The main contributions and innovations of the embodiments of this application are as follows:
This solution uses an autoencoding network to represent the shared semantic subspace mapping. In the autoencoding network, vision and hearing use time as a unified measure, so that visual and auditory data representing the same semantics are mapped to each other, thereby capturing the complementary information and high-level semantics between the different modalities and achieving feature fusion at the semantic level.
This solution designs a dual-branch feature extraction network. In this network, an LSTM network is selected to process the temporal relationship between audio frame-level features to obtain auditory segment-level features, and a ConvLSTM network is selected to process the temporal relationship between video frame-level features to obtain visual segment-level features; the feature heterogeneity between the visual segment-level features and the auditory segment-level features is then eliminated in the autoencoding network to achieve feature fusion.
This solution uses inter-frame difference information of the video to extract visual features, which better reflects the characteristics of abnormal behavior, and extracts sound features from the original audio waveform rather than from spectrum-analysis-based representations such as MFCC or LPC. Images and waveforms can therefore both be sampled at time intervals, and unifying them in the time domain solves the problem of inconsistent audio and video feature processing during audio-visual information fusion.
本申请的一个或多个实施例的细节在以下附图和描述中提出,以使本申请的其他特征、目的和优点更加简明易懂。The details of one or more embodiments of the present application are set forth in the following drawings and description to make other features, objects, and advantages of the present application more concise and understandable.
附图说明Description of the drawings
此处所说明的附图用来提供对本申请的进一步理解,构成本申请的一部分,本申请的示意性实施例及其说明用于解释本申请,并不构成对本申请的不当限定。在附图中:The drawings described here are used to provide a further understanding of the present application and constitute a part of the present application. The illustrative embodiments of the present application and their descriptions are used to explain the present application and do not constitute an improper limitation of the present application. In the attached picture:
图1是根据本申请第一实施例的视听特征融合的目标行为识别方法的主要步骤流程图。Figure 1 is a flow chart of the main steps of a target behavior recognition method based on audio-visual feature fusion according to the first embodiment of the present application.
图2是异常行为听觉特征提取网络结构示意图。Figure 2 is a schematic diagram of the abnormal behavior auditory feature extraction network structure.
图3是基于自编码网络映射视听特征融合的异常行为识别网络结构示意图。Figure 3 is a schematic diagram of the abnormal behavior recognition network structure based on autoencoding network mapping audio-visual feature fusion.
图4是自编码网络的结构示意图。Figure 4 is a schematic structural diagram of the autoencoding network.
图5是根据本申请第二实施例的视听特征融合的目标行为识别装置的结构框图。Figure 5 is a structural block diagram of a target behavior recognition device for audio-visual feature fusion according to the second embodiment of the present application.
图6是根据本申请第三实施例的电子装置的硬件结构示意图。FIG. 6 is a schematic diagram of the hardware structure of an electronic device according to the third embodiment of the present application.
具体实施方式Detailed ways
这里将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本说明书一个或多个实施例相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本说明书一个或多个实施例的一些方面相一致的装置和方法的例子。Exemplary embodiments will be described in detail herein, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of this specification. Rather, they are merely examples of apparatus and methods consistent with some aspects of one or more embodiments of this specification as detailed in the appended claims.
需要说明的是:在其他实施例中并不一定按照本说明书示出和描述的顺序来执行相应方法的步骤。在一些其他实施例中,其方法所包括的步骤可以比本说明书所描述的更多或更少。此外,本说明书中所描述的单个步骤,在其他实施例中可能被分解为多个步骤进行描述;而本说明书中所描述的多个步骤,在其他实施例中也可能被合并为单个步骤进行描述。It should be noted that in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, methods may include more or fewer steps than described in this specification. In addition, a single step described in this specification may be broken down into multiple steps for description in other embodiments; and multiple steps described in this specification may also be combined into a single step in other embodiments. describe.
本申请实施例提供一种视听特征融合的目标行为识别方法,该方法采用所述视听特征融合目标行为识别方案对异常事件行为作出判断。首先,对本申请方法所适用的使用场景进行说明:Embodiments of the present application provide a target behavior recognition method based on audio-visual feature fusion, which uses the audio-visual feature fusion target behavior recognition scheme to make judgments on abnormal event behaviors. First, the usage scenarios applicable to this application method are explained:
First, the surveillance video sound and picture are accessed, and a fixed-length (for example, 1 second) video and audio clip is extracted as a window for behavior judgment;
the inter-frame differences of the video within that period are then computed (each pair of adjacent image frames is differenced; there are roughly 30 to 60 image frames in 1 second, so multiple results along the time axis are obtained), and the resulting difference sequence is used as the visual input information;
the sound waveform within the 1 second, sampled at 16 kHz, is then used as the auditory input information;
the visual information and the auditory information are then input into the designated algorithm network: the two different feature-extraction branches extract visual features and auditory features, temporal features are obtained through the LSTM network, a shared semantic subspace is constructed through the autoencoding network to eliminate the semantic deviation between the visual and auditory features, and finally the visual and auditory features are fused and the fused features are input into the classification network branch to obtain the abnormal-behavior classification probability;
finally, whether the behavior is abnormal is judged according to the classification probability threshold, as sketched below.
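For illustration only, the windowed judgment flow above can be outlined as follows. This is a minimal sketch under stated assumptions: the names `frame_differences`, `classify_window` and `fusion_model`, the NumPy frame layout and the 0.5 threshold are illustrative and not part of the disclosed implementation.

```python
import numpy as np

WINDOW_SECONDS = 1.0
AUDIO_RATE = 16000          # 16 kHz waveform sampling, as described above
ANOMALY_THRESHOLD = 0.5     # illustrative classification-probability threshold

def frame_differences(frames: np.ndarray) -> np.ndarray:
    """Difference of every two adjacent frames; frames has shape (T, H, W, C)."""
    frames = frames.astype(np.float32)
    return frames[1:] - frames[:-1]

def classify_window(frames: np.ndarray, waveform: np.ndarray, fusion_model) -> bool:
    """Return True if the 1-second window is judged abnormal by the (assumed) fusion model."""
    visual_input = frame_differences(frames)                        # difference sequence
    auditory_input = waveform[: int(WINDOW_SECONDS * AUDIO_RATE)]   # raw 16 kHz samples
    prob = fusion_model(visual_input, auditory_input)               # abnormal-behavior probability
    return prob >= ANOMALY_THRESHOLD
```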
图1是根据本申请第一实施例的视听特征融合的目标行为识别方法的主要步骤流程图。Figure 1 is a flow chart of the main steps of a target behavior recognition method based on audio-visual feature fusion according to the first embodiment of the present application.
为实现该目的,如图1所示,视听特征融合的目标行为识别方法主要包括如下的步骤101至步骤106。In order to achieve this goal, as shown in Figure 1, the target behavior recognition method of audio-visual feature fusion mainly includes the following steps 101 to 106.
步骤101、获取预设时长的待识别音视频段。Step 101: Obtain the audio and video segments to be identified of a preset duration.
步骤102、采集所述待识别音视频段中的视觉输入信息及听觉输入信息。Step 102: Collect visual input information and auditory input information in the audio and video segments to be recognized.
步骤103、将所述视觉输入信息及所述听觉输入信息一同输入目标行为模型中,其中所述目标行为模型包括双分支通道的特征提取网络、自编码网络及全连接层识别模块。Step 103: Input the visual input information and the auditory input information together into the target behavior model, where the target behavior model includes a dual-branch channel feature extraction network, an autoencoding network and a fully connected layer recognition module.
步骤104、根据所述特征提取网络分别从所述视觉输入信息、所述听觉输入信息中提取特征,得到视觉特征、听觉特征。Step 104: Extract features from the visual input information and the auditory input information respectively according to the feature extraction network to obtain visual features and auditory features.
步骤105、采用所述自编码网络将所述视觉特征、所述听觉特征映射到同一子空间中进行视听信息融合,得到融合特征。Step 105: Use the autoencoding network to map the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fusion features.
步骤106、将所述融合特征输入所述全连接层识别模块进行识别,得到目标行为。Step 106: Input the fused features into the fully connected layer recognition module for recognition to obtain the target behavior.
具体地,在本方案中不仅使用了视觉输入信息,还采集了听觉输入信息,二者作为行为模型的输入,并由模型输出目标行为分类结果。与仅采用图像信息判断目标行为而言,本方案利用了声音特征,而声音特征和图像特征的信息能够互补从而更准确判断出目标行为,故具有更好的特征表达效果。此外,在本方案中视觉输入信息、听觉输入信息一同输入模型中进行处理,并行处理的好处是能够在特征提取阶段就进行特征级融合从而更好地捕捉出各模态之间的关系,相比于分别对视觉和听觉特征信息识别出的结果进行融合的决策及融合方式,本方案能够考虑视觉和听觉之间的一致性,因此多模态特征信息能够实现互补,达到更好的表现效果。Specifically, in this solution, not only visual input information is used, but also auditory input information is collected. Both are used as inputs to the behavior model, and the model outputs the target behavior classification results. Compared with only using image information to judge the target behavior, this solution uses sound features, and the information of sound features and image features can complement each other to more accurately judge the target behavior, so it has better feature expression effect. In addition, in this solution, visual input information and auditory input information are input into the model for processing at the same time. The advantage of parallel processing is that feature-level fusion can be performed in the feature extraction stage to better capture the relationship between each modality. Compared with the decision-making and fusion methods of separately fusing the results of visual and auditory feature information recognition, this solution can consider the consistency between vision and hearing, so multi-modal feature information can complement each other and achieve better performance results. .
需要说明的是,在本方案中目标行为可以是正常行为或者异常行为。例如,当训练模型时输入的是视觉样本和听觉样本以及从样本中标注出的正常行为时,训练出的模型会从视觉输入信息、听觉输入信息中识别正常行为。反之,当训练模型时输入的是对异常行为标注的标注样本时,训练出的模型会从视觉输入信息、听觉输入信息中识别异常行为。举例而非限制,在本方案中以描述暴力场景的视觉样本和听觉样本以及从样本中标注出的描述暴力场景的特征作为模型的输入,使得训练出的模型能对打架斗殴等暴力行为进行识别。It should be noted that the target behavior in this scheme can be normal behavior or abnormal behavior. For example, when training the model, the input is visual samples and auditory samples and normal behaviors marked from the samples, the trained model will identify normal behaviors from the visual input information and auditory input information. On the contrary, when the input to the model is labeled samples that label abnormal behaviors, the trained model will identify abnormal behaviors from visual input information and auditory input information. By way of example, but not limitation, in this solution, visual samples and auditory samples describing violent scenes and features describing violent scenes annotated from the samples are used as the input of the model, so that the trained model can identify violent behaviors such as fights. .
Moreover, this solution uses an autoencoding network to achieve feature fusion. Unlike the prior art, the autoencoding network designed in the present invention can represent the audio-visual data on the same measure and match visual information and auditory information that represent the same semantics under that measure, that is, it achieves semantic consistency between vision and hearing.
Specifically, after the visual features and the auditory features are obtained, these features need to be fused. Therefore, the embodiment of the present invention establishes a shared semantic subspace by mapping the visual features and the auditory features into the same subspace, thereby eliminating the feature heterogeneity between the video and audio modalities, capturing the complementary information and high-level semantics between the visual modality and the sound modality, and achieving feature fusion at the semantic level. That is, in this solution, the encoder of the autoencoding network maps the visual features and the auditory features into the same subspace to obtain visual mapping features and auditory mapping features; the decoder of the autoencoding network maps the visual mapping features and the auditory mapping features into the multi-modal space to obtain the compensation features of the other modality, which serve as visual shared features and auditory shared features; and the visual shared features, the auditory shared features, the visual features and the auditory features are concatenated to obtain the fused features.
For example, as shown in formula (1), the extracted visual feature is f_visual; g() is a function that maps the visual feature f_visual into the same subspace, and inputting f_visual into g() yields the visual mapping feature; H() is a function that maps the visual mapping feature g(f_visual) into the multi-modal space of shared semantics, and inputting g(f_visual) into H() yields the compensation feature of the auditory modality, which serves as the auditory shared feature.
Similarly, as shown in formula (2), the extracted auditory feature is f_audio; h() is a function that maps the auditory feature f_audio into the same subspace, and inputting f_audio into h() yields the auditory mapping feature; G() is a function that maps the auditory mapping feature h(f_audio) into the multi-modal space of shared semantics, and inputting h(f_audio) into G() yields the compensation feature of the visual modality, which serves as the visual shared feature.
f′_audio = H(g(f_visual))   Formula (1)
f′_visual = G(h(f_audio))   Formula (2)
As shown in Figure 4, the autoencoding network includes an encoder and a decoder. The encoder includes a first fully connected layer, a second fully connected layer and an encoder layer connected in sequence; the visual features and the auditory features are jointly input into the encoder and pass through the first fully connected layer, the second fully connected layer and the encoder layer in sequence, yielding the auditory mapping features corresponding to the auditory features and the visual mapping features corresponding to the visual features.
The decoder includes two branches, each consisting of two fully connected layers. One branch takes the auditory mapping features as input, and its two fully connected layers map all auditory mapping features into the multi-modal space to obtain the visual compensation features corresponding to the auditory mapping features; the other branch takes the visual mapping features as input, and its two fully connected layers map all visual mapping features into the multi-modal space to obtain the auditory compensation features corresponding to the visual mapping features. Finally, formula (3) is used to concatenate the visual features, the visual compensation features, the auditory features and the auditory compensation features to obtain the fused features.
In this solution, each modality space receives information from its inter-modal neighbors and intra-modal neighbors while sharing its own information. When the inter-modal neighbor information obtained by any modality space can compensate for the loss of its own information, the obtained inter-modal neighbor information is used as a supplementary feature to enhance the expressive ability of the fused features.
f_fusion = CONCAT(f_visual + f′_visual + f_audio + f′_audio)   Formula (3)
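A minimal PyTorch sketch of the mapping and concatenation described by formulas (1)-(3) is given below. The layer sizes and the names `SharedSubspaceAutoencoder`, `feat_dim` and `hidden_dim` are illustrative assumptions, not the disclosed implementation; the sketch only mirrors the structure described above (a shared encoder of two fully connected layers plus an encoder layer, two decoder branches of two fully connected layers each, and CONCAT fusion).

```python
import torch
import torch.nn as nn

class SharedSubspaceAutoencoder(nn.Module):
    """Maps visual/auditory features into one subspace and produces the shared (compensation) features."""

    def __init__(self, feat_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        # Encoder: first FC layer, second FC layer, encoder layer (jointly used by both modalities).
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Decoder branch 1: auditory mapping feature -> visual compensation feature f'_visual.
        self.audio_to_visual = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),
        )
        # Decoder branch 2: visual mapping feature -> auditory compensation feature f'_audio.
        self.visual_to_audio = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, f_visual: torch.Tensor, f_audio: torch.Tensor):
        g_visual = self.encoder(f_visual)                   # visual mapping feature g(f_visual)
        h_audio = self.encoder(f_audio)                     # auditory mapping feature h(f_audio)
        f_audio_shared = self.visual_to_audio(g_visual)     # f'_audio = H(g(f_visual)), formula (1)
        f_visual_shared = self.audio_to_visual(h_audio)     # f'_visual = G(h(f_audio)), formula (2)
        # f_fusion = CONCAT(...), formula (3)
        f_fusion = torch.cat([f_visual, f_visual_shared, f_audio, f_audio_shared], dim=-1)
        return f_fusion, f_audio_shared, f_visual_shared
```

In this sketch a single encoder is shared by the two modalities, matching the joint input into the encoder described above; the dimensions would be tuned in practice.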
It should be noted that when the input consists of visual and auditory features with semantic consistency, the error of the autoencoding network includes two parts: the error of the acoustic decoder and the error of the visual decoder, and their sum is the total error. The error can be back-propagated to update the weights of the autoencoding network.
具体地,对于相同的视频,视觉和听觉信息会存在时间轴偏差,并在语义上存在不一致,因此对视觉信息融合提出了挑战。为了解决这一问题,本发明提出了一种新的标签“语义映射”,该标签用于描述同一视频的视听数据是否包含相同的语义信息。例如,含有血液、身体暴力等的视频数据被认为是视觉异常。包含 尖叫声和哭喊声的声音被认为是声学异常。音频和视频数据分开做标记,防止相互干扰。如果视频的视觉语义标签与音频语义标签相同,则认为音视频具有语义对应L corr=1。否则,不存在语义对应L corr=-1。语义标记为构建具有不同模态特征的共享子空间提供度量。 Specifically, for the same video, visual and auditory information will have timeline deviations and semantic inconsistencies, thus posing challenges to visual information fusion. In order to solve this problem, the present invention proposes a new tag "semantic mapping", which is used to describe whether the audio-visual data of the same video contains the same semantic information. For example, video data containing blood, physical violence, etc. are considered visual anomalies. Sounds that include screams and cries are considered acoustic anomalies. Audio and video data are marked separately to prevent mutual interference. If the visual semantic label of the video is the same as the audio semantic label, the audio and video are considered to have semantic correspondence L corr =1. Otherwise, there is no semantic correspondence L corr =-1. Semantic tagging provides metrics for constructing shared subspaces with different modal characteristics.
引入语义标签计算自编码网络误差包括:对输入所述自编码网络的所述视觉特征和所述听觉特征采用语义映射标签进行标记,其中,语义映射标签表征为描述相同语义内容的所述视觉输入信息和所述听觉输入信息的标记标签;当输入自编码网络的视觉特征或听觉特征存在语义映射标签时,损失函数为听觉平均误差值和视觉平均误差值的代数和;Introducing semantic labels to calculate the auto-encoding network error includes: labeling the visual features and the auditory features input to the auto-encoding network using semantic mapping labels, where the semantic mapping labels are characterized as the visual input describing the same semantic content information and the mark label of the auditory input information; when the visual feature or auditory feature input to the autoencoding network has a semantic mapping label, the loss function is the algebraic sum of the auditory average error value and the visual average error value;
当输入自编码网络的视觉特征或听觉特征不存在语义映射标签时,损失函数为1与听觉平均误差值和视觉平均误差值的代数和的差值;When there is no semantic mapping label for the visual features or auditory features input to the autoencoder network, the loss function is the difference between 1 and the algebraic sum of the auditory average error value and the visual average error value;
听觉平均误差值表征为所有听觉特征与所有听觉共享特征的绝对差值的平均值,视觉平均误差值表征为所有视觉特征与所有视觉共享特征的绝对差值的平均值。The auditory average error value is characterized as the average of the absolute differences between all auditory features and all auditory shared features, and the visual average error value is characterized as the average of the absolute differences between all visual features and all visual shared features.
其中,损失函数由下列公式得到:Among them, the loss function is obtained by the following formula:
y_autocoder = (1/N)·Σ_{i=1}^{N} |f_audio − f′_audio| + (1/N)·Σ_{i=1}^{N} |f_visual − f′_visual|,  when L_corr = 1
y_autocoder = 1 − [(1/N)·Σ_{i=1}^{N} |f_audio − f′_audio| + (1/N)·Σ_{i=1}^{N} |f_visual − f′_visual|],  when L_corr = −1
where y_autocoder is the loss function, N is the number of features, f_audio is an auditory feature, f′_audio is an auditory shared feature, f_visual is a visual feature, f′_visual is a visual shared feature, L_corr = 1 indicates that a semantic mapping label exists, and L_corr = −1 indicates that no semantic mapping label exists.
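Under the same assumed names as the autoencoder sketch, the semantic-mapping loss above can be written as a short function; the mean-absolute-error form and the use of `l_corr` follow the textual definition, while the batching details are illustrative.

```python
import torch

def autocoder_loss(f_audio, f_audio_shared, f_visual, f_visual_shared, l_corr: int):
    """Loss of the autoencoding network with semantic mapping labels.

    l_corr = 1  -> the pair carries a semantic mapping label;
    l_corr = -1 -> the pair carries no semantic mapping label.
    """
    audio_average_error = torch.mean(torch.abs(f_audio - f_audio_shared))
    visual_average_error = torch.mean(torch.abs(f_visual - f_visual_shared))
    total = audio_average_error + visual_average_error
    return total if l_corr == 1 else 1.0 - total
```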
故本方案设计了一种新的损耗函数,能够让模型学习到时间轴上语义映射的偏差信息。本方案通过将语义标签引入损耗函数的计算中,减少了盲拼接特征的干扰,增强了模型对异常视频语义对应的判别能力,更有利于消除非对应特征之间的干扰。此外,这样的语义嵌入学习可以看作是正则化的一种形式,有助于增强模型的泛化能力,防止过拟合。Therefore, this solution designs a new loss function that allows the model to learn the bias information of semantic mapping on the timeline. By introducing semantic labels into the calculation of the loss function, this solution reduces the interference of blind splicing features, enhances the model's ability to distinguish abnormal video semantic correspondence, and is more conducive to eliminating interference between non-corresponding features. In addition, such semantic embedding learning can be regarded as a form of regularization, which helps to enhance the generalization ability of the model and prevent overfitting.
Specifically, the semantic mapping label is obtained as follows: the acoustic anomaly information of the auditory input information and the visual anomaly information of the visual input information are semantically tagged separately; if it is determined that both the auditory input information and the visual anomaly information carry the semantic tag, the semantic mapping label is assigned to that group of auditory input information and visual anomaly information.
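For illustration only, the assignment of L_corr from the two independently assigned semantic tags could look like the following; the tag names are hypothetical.

```python
def semantic_mapping_label(audio_tag: str, visual_tag: str) -> int:
    """Return L_corr = 1 when the separately assigned semantic tags agree, otherwise -1."""
    return 1 if audio_tag == visual_tag else -1

# e.g. a clip tagged "anomaly" visually (physical violence) and "anomaly" acoustically (screams)
# receives the semantic mapping label: semantic_mapping_label("anomaly", "anomaly") == 1
```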
It is also worth mentioning that, in order for the visual features and the auditory features to express the same semantics in this solution, the visual information and the auditory information need to be represented as data on the same measure, so time is used as the unified measure. Specifically, this solution maps the original audio waveform to a two-dimensional field, that is, the x-axis of the sound data is time and the y-axis is the waveform. This solution uses the surveillance video frames as the original video data, and the original video data contains roughly 30 to 60 image frames per second, that is, the x-axis of the visual data is also time and the y-axis is the image frame.
当视觉数据和听觉数据在同一时间度量下,则可以根据语义是否相同在时间轴上将二者对齐,实现视觉特征和听觉特征的对应,从而能够在语义上达成视觉和听觉的一致性。When visual data and auditory data are measured at the same time, the two can be aligned on the time axis based on whether the semantics are the same to achieve correspondence between visual features and auditory features, thereby achieving semantic consistency between vision and hearing.
In addition, this solution also processes the information of the original audio-video segment to be recognized. Specifically, the present invention collects the difference between every two adjacent image frames from the audio-video segment to be recognized to obtain a difference sequence, and uses the difference sequence as the visual input information.
That is, this embodiment fully considers that the objects of abnormal behavior recognition are behaviors with intense motion, such as throwing a punch. A fist resting on the opponent's chest or at one's own waist does not by itself accurately indicate the behavior of the person in the video; however, if the fist is still at the waist in the first few frames and is on the opponent's chest in the following frames, the person in the video has performed the abnormal behavior of throwing a punch. It can be seen that the inter-frame differences of the video extract the required information more accurately than the video frames themselves, so choosing the differences between adjacent video frames as the input of the network model gives a better feature-expression effect than inputting the video frames themselves into the model.
In this solution the original audio waveform is used as the auditory information to be collected, so "collecting the auditory input information in the audio-video segment to be recognized" includes: obtaining the original audio waveform corresponding to the audio-video segment to be recognized, and collecting acoustic signals from the original audio waveform at a preset sampling interval to obtain the auditory input information. Compared with spectrum-analysis methods such as MFCC or LPC, the sound features extracted by this solution are based on the original audio waveform and can therefore be expressed on a unified measure with the visual features, that is, the auditory input information is represented as waveform data with time as the horizontal axis and the acoustic signal as the vertical axis, where the auditory input information and the visual input information use the time domain as a unified scale.
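Complementing the frame-difference sketch given earlier, the audio side of the input preparation can be illustrated as below. The function name, the 16 kHz default and the nearest-sample decimation are assumptions for illustration; a production pipeline would more likely use band-limited resampling.

```python
import numpy as np

def auditory_input_from_waveform(waveform: np.ndarray, orig_rate: int,
                                 sample_rate: int = 16000) -> np.ndarray:
    """Auditory input: raw waveform re-sampled on a fixed time grid (no MFCC/LPC features)."""
    duration = len(waveform) / orig_rate
    n_samples = int(duration * sample_rate)
    # Simple nearest-sample pick at the preset interval; keeps the signal in the time domain,
    # so it shares the time axis with the visual frame-difference sequence.
    idx = np.minimum(np.arange(n_samples) * orig_rate // sample_rate, len(waveform) - 1)
    return waveform[idx]
```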
在本方案中提取视觉和听觉特征的特征提取网络采用的是双分支通道,即视觉输入信息、听觉输入信息能够同时输入特征提取网络中,并分别进行特征提取,输出视觉特征、听觉特征。具体地,所述双分支通道的特征提取网络包括听觉特征提取网络及视觉特征提取网络,其中,所述听觉特征提取网络包括AFEN模块及LSTM模块,将原始音频波形中的多帧波形输入AFEN模块中,获取对应的多个听觉帧级特征,通过LSTM模块对多个所述听觉帧级特征进行融合,输出听觉段级特征。In this solution, the feature extraction network for extracting visual and auditory features uses a dual-branch channel, that is, visual input information and auditory input information can be input into the feature extraction network at the same time, and feature extraction is performed separately to output visual features and auditory features. Specifically, the dual-branch channel feature extraction network includes an auditory feature extraction network and a visual feature extraction network, wherein the auditory feature extraction network includes an AFEN module and an LSTM module, and the multi-frame waveforms in the original audio waveform are input into the AFEN module , obtain multiple corresponding auditory frame-level features, fuse the multiple auditory frame-level features through the LSTM module, and output auditory segment-level features.
如图2所示,异常行为听觉特征提取网络结构AFEN结构包括5个卷积层、3个池化层和3个依次连接的全连接层,最后一个全连接层的输出经过SoftMax层。其中池化层分别连接于第一个卷积层、第二个卷积层及第五个卷积层之后。As shown in Figure 2, the abnormal behavior auditory feature extraction network structure AFEN structure includes 5 convolutional layers, 3 pooling layers and 3 sequentially connected fully connected layers. The output of the last fully connected layer passes through the SoftMax layer. The pooling layer is connected after the first convolution layer, the second convolution layer and the fifth convolution layer respectively.
In this solution, the black rectangles represent the convolutional layers and the white rectangles represent the pooling layers; the three rectangles immediately following the last pooling layer represent the three fully connected layers, and the black rectangle after the fully connected layers represents the LSTM structure. Because abnormal behavior is continuous along the time axis, the LSTM network is selected to process the temporal relationship between audio frame-level features and obtain segment-level features. The convolutional layers contain the ReLU activation function, which makes the activation pattern of the network sparser. The pooling layers contain a local response normalization operation to avoid gradient vanishing and improve the training speed of the network. As can be seen in Figure 2, after a segment of acoustic signal passes through the auditory feature extraction network, multiple frame-level features are first extracted, the multiple frame-level features are then fused based on their temporal relationship, and finally the segment-level feature corresponding to that segment of acoustic signal is obtained.
在本发明的模型中,在视觉和听觉特征处理的最后阶段,通过LSTM网络总结时序信息,这种方法可以适应整个监控视频机制,在音频和视频的长度、采样率等方面没有硬性要求。从而解决了特征时间轴对准问题。另一方面,该模型也大大降低了视觉和听觉特征融合的复杂度,提高了模型的稳定性。In the model of the present invention, in the final stage of visual and auditory feature processing, the timing information is summarized through the LSTM network. This method can be adapted to the entire surveillance video mechanism and has no rigid requirements in terms of audio and video length, sampling rate, etc. This solves the problem of feature timeline alignment. On the other hand, this model also greatly reduces the complexity of visual and auditory feature fusion and improves the stability of the model.
Similarly, in this solution the network structure for visual feature extraction of abnormal behavior is the same as the AFEN convolutional structure shown in Figure 2, except that a convolutional LSTM (ConvLSTM) module replaces the final LSTM module and the original input signal is changed to the differences between image frames. The difference is that, compared with auditory feature extraction, visual features place more emphasis on action recognition over spatio-temporal relationships; that is, ConvLSTM captures spatio-temporal relationships better than LSTM and can solve spatio-temporal sequence prediction problems such as video classification and action recognition.
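A condensed PyTorch sketch of the dual-branch idea follows. The channel counts, kernel sizes and the 2-D layout assumed for each waveform frame are not specified in the text and are assumptions; only the layer pattern (5 convolutional layers, pooling with local response normalization after conv 1, 2 and 5, 3 fully connected layers, then an LSTM over the frame-level features) mirrors the description.

```python
import torch
import torch.nn as nn

class AFEN(nn.Module):
    """Frame-level extractor: 5 conv layers, pooling + LRN after conv 1, 2 and 5, 3 FC layers."""

    def __init__(self, in_channels: int = 1, feat_dim: int = 512):
        super().__init__()
        def conv(cin, cout, pool):
            layers = [nn.Conv2d(cin, cout, kernel_size=3, padding=1), nn.ReLU()]
            if pool:
                layers += [nn.MaxPool2d(2), nn.LocalResponseNorm(size=5)]
            return layers
        self.features = nn.Sequential(
            *conv(in_channels, 64, pool=True),     # conv1 + pool/LRN
            *conv(64, 128, pool=True),             # conv2 + pool/LRN
            *conv(128, 256, pool=False),           # conv3
            *conv(256, 256, pool=False),           # conv4
            *conv(256, 256, pool=True),            # conv5 + pool/LRN
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.fc = nn.Sequential(                   # 3 fully connected layers
            nn.Linear(256 * 4 * 4, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, feat_dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, T, C, H, W) -> frame-level features of shape (batch, T, feat_dim)
        b, t = frames.shape[:2]
        x = self.features(frames.flatten(0, 1))
        x = self.fc(x.flatten(1))
        return x.view(b, t, -1)

class AudioBranch(nn.Module):
    """AFEN frame-level features fused over time by an LSTM into a segment-level feature."""

    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.afen = AFEN(in_channels=1, feat_dim=feat_dim)
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)

    def forward(self, audio_frames: torch.Tensor) -> torch.Tensor:
        frame_feats = self.afen(audio_frames)      # (batch, T, feat_dim)
        _, (h_n, _) = self.lstm(frame_feats)
        return h_n[-1]                             # segment-level auditory feature
```

The visual branch would reuse the same AFEN on the frame-difference sequence and replace the `nn.LSTM` with a ConvLSTM cell; since core PyTorch does not ship a ConvLSTM, that cell is left out of this sketch.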
In summary, the present invention designs an abnormal behavior recognition model based on audio-visual information fusion through an autoencoding network. The model structure is shown in Figure 3. The model includes four parts: visual feature extraction, auditory feature extraction, the autoencoding network and the fully connected recognition model. Visual and acoustic feature extraction uses a dual-channel feature extraction method whose network structure is based on a deep convolutional network. For visual features, the differences between video frames are used as the raw input, and deep convolution plus a ConvLSTM network extracts segment-level visual features. For auditory features, the audio waveform is used as the network input, and deep convolution plus an LSTM network extracts segment-level auditory features. Then, the autoencoding network shown in Figure 4 is used to construct the shared semantic subspace, eliminating the semantic deviation between visual and auditory features, and the CONCAT method is used to combine the visual and auditory features; finally, the fully connected model is used to recognize abnormal behavior. The target behavior recognition method of this solution therefore fuses auditory features into visual features to supplement the visual information, and fuses visual features into auditory features to supplement the auditory features, on the basis of the shared semantic subspace of the autoencoding network, thereby achieving the complementary effect of the different modalities. The fused features obtained through feature fusion thus have richer semantic expression, and the classification results obtained by the model when classifying behaviors with the fused features are more accurate. On this basis, this solution improves recognition accuracy and reduces the missed detection rate.
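Tying the pieces together, a hypothetical end-to-end forward pass might look like the sketch below, reusing the previously sketched `AudioBranch`, a corresponding visual branch and `SharedSubspaceAutoencoder`; all module names, the classifier width and the two-class output are assumptions rather than the disclosed implementation.

```python
import torch
import torch.nn as nn

class AbnormalBehaviorModel(nn.Module):
    """Visual branch + auditory branch -> shared-subspace autoencoder -> fully connected classifier."""

    def __init__(self, visual_branch: nn.Module, audio_branch: nn.Module,
                 autoencoder: nn.Module, feat_dim: int = 512, num_classes: int = 2):
        super().__init__()
        self.visual_branch = visual_branch       # ConvLSTM-based segment-level visual features
        self.audio_branch = audio_branch         # LSTM-based segment-level auditory features
        self.autoencoder = autoencoder           # shared semantic subspace + CONCAT fusion
        self.classifier = nn.Sequential(         # fully connected recognition module
            nn.Linear(4 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, frame_diffs: torch.Tensor, audio_frames: torch.Tensor) -> torch.Tensor:
        f_visual = self.visual_branch(frame_diffs)
        f_audio = self.audio_branch(audio_frames)
        f_fusion, _, _ = self.autoencoder(f_visual, f_audio)
        return self.classifier(f_fusion).softmax(dim=-1)    # behavior class probabilities
```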
另外,如图5所示,本方案提供一种视听特征融合的目标行为识别装置,该装置利用上述视听特征融合的目标行为识别方法识别出目标行为,该装置包括:In addition, as shown in Figure 5, this solution provides a target behavior recognition device with audio-visual feature fusion. The device uses the above target behavior recognition method with audio-visual feature fusion to identify the target behavior. The device includes:
获取模块501,用于获取预设时长的待识别音视频段。The acquisition module 501 is used to acquire the audio and video segments to be recognized of a preset duration.
信息采集模块502,用于采集所述待识别音视频段中的视觉输入信息及听觉输入信息。The information collection module 502 is used to collect visual input information and auditory input information in the audio and video segments to be recognized.
特征提取模块503,用于将所述视觉输入信息及所述听觉输入信息一同输入目标行为模型中,其中所述目标行为模型包括双分支通道的特征提取网络、自编码网络及全连接层识别模块。 Feature extraction module 503 is used to input the visual input information and the auditory input information into the target behavior model together, wherein the target behavior model includes a dual-branch channel feature extraction network, an autoencoding network and a fully connected layer recognition module. .
根据所述特征提取网络分别从所述视觉输入信息、所述听觉输入信息中提取特征,得到视觉特征、听觉特征。According to the feature extraction network, features are respectively extracted from the visual input information and the auditory input information to obtain visual features and auditory features.
采用所述自编码网络将所述视觉特征、所述听觉特征映射到同一子空间中进行视听信息融合,得到融合特征。The autoencoding network is used to map the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fusion features.
行为识别模块504,用于将所述融合特征输入所述全连接层识别模块进行识别,得到目标行为。The behavior recognition module 504 is used to input the fused features into the fully connected layer recognition module for recognition to obtain the target behavior.
如图6所示,本申请一个实施例的电子装置,包括存储器604和处理器602,该存储器604中存储有计算机程序,该处理器602被设置为运行计算机程序以执行上述任一项方法实施例中的步骤。As shown in Figure 6, an electronic device according to one embodiment of the present application includes a memory 604 and a processor 602. The memory 604 stores a computer program, and the processor 602 is configured to run the computer program to perform any of the above methods. The steps in the example.
具体地,上述处理器602可以包括中央处理器(CPU),或者特定集成电路(ApplicationSpecificIntegratedCircuit,简称为ASIC),或者可以被配置成实施本申请实施例的一个或多个集成电路。Specifically, the above-mentioned processor 602 may include a central processing unit (CPU), or an application specific integrated circuit (Application Specific Integrated Circuit, referred to as ASIC), or may be configured to implement one or more integrated circuits according to the embodiments of the present application.
其中,存储器604可以包括用于数据或指令的大容量存储器604。举例来说而非限制,存储器604可包括硬盘驱动器(HardDiskDrive,简称为HDD)、软盘驱动器、固态驱动器(SolidStateDrive,简称为SSD)、闪存、光盘、磁光盘、磁带或通用串行总线(UniversalSerialBus,简称为USB)驱动器或者两个或更多个以上这些的组合。在合适的情况下,存储器604可包括可移除或不可移除(或固定)的介质。在合适的情况下,存储器604可在数据处理装置的内部或外部。在特定实施例中,存储器604是非易失性(Non-Volatile)存储器。在特定实施例中,存储器604包括只读存储器(Read-OnlyMemory,简称为ROM)和随机存取存储器(RandomAccessMemory,简称为RAM)。在合适的情况下,该ROM可以是掩模编程的ROM、可编程ROM(ProgrammableRead-OnlyMemory,简称为PROM)、可擦除PROM(ErasableProgrammableRead-OnlyMemory,简称为EPROM)、电可擦除PROM(ElectricallyErasableProgrammableRead-OnlyMemory,简称为EEPROM)、电可改写ROM(ElectricallyAlterableRead-OnlyMemory,简称为EAROM)或闪存(FLASH)或者两个或更多个以上这些的组合。在合适的情况下,该RAM可以是静态随机存取存储器(StaticRandom-AccessMemory,简称为SRAM)或动态随机存取存储器(DynamicRandomAccessMemory,简称为DRAM),其中,DRAM可以是快速页模式动态随机存取存储器604(FastPageModeDynamicRandomAccessMemory,简称为FPMDRAM)、扩展数据输出动态随机存取存储器(ExtendedDateOutDynamicRandomAccessMemory,简称为EDODRAM)、同步动态随机存取内存(SynchronousDynamicRandom-AccessMemory,简称SDRAM)等。Among others, memory 604 may include mass storage 604 for data or instructions. By way of example and not limitation, the memory 604 may include a hard disk drive (Hard Disk Drive, HDD for short), floppy disk drive, Solid State Drive (Solid State Drive, SSD for short), flash memory, optical disk, magneto-optical disk, magnetic tape, or Universal Serial Bus (Universal Serial Bus, Referred to as USB) drive or a combination of two or more of these. Memory 604 may include removable or non-removable (or fixed) media, where appropriate. Memory 604 may be internal or external to the data processing device, where appropriate. In certain embodiments, memory 604 is Non-Volatile memory. In a specific embodiment, the memory 604 includes read-only memory (Read-OnlyMemory, ROM for short) and random access memory (RandomAccessMemory, RAM for short). Under appropriate circumstances, the ROM can be a mask-programmed ROM, programmable ROM (ProgrammableRead-OnlyMemory, referred to as PROM), erasable PROM (ErasableProgrammableRead-OnlyMemory, referred to as EPROM), electrically erasable PROM (Electrically ErasableProgrammableRead -OnlyMemory, referred to as EEPROM), electrically rewritable ROM (Electrically Alterable Read-OnlyMemory, referred to as EAROM) or flash memory (FLASH) or a combination of two or more of these. Under appropriate circumstances, the RAM can be static random access memory (StaticRandom-AccessMemory, referred to as SRAM) or dynamic random access memory (DynamicRandomAccessMemory, referred to as DRAM), wherein the DRAM can be fast page mode dynamic random access Memory 604 (FastPageModeDynamicRandomAccessMemory, referred to as FPMDRAM), extended data output dynamic random access memory (ExtendedDateOutDynamicRandomAccessMemory, referred to as EDODRAM), synchronous dynamic random access memory (SynchronousDynamicRandom-AccessMemory, referred to as SDRAM), etc.
存储器604可以用来存储或者缓存需要处理和/或通信使用的各种数据文件,以及处理器602所执行的可能的计算机程序指令。 Memory 604 may be used to store or cache various data files required for processing and/or communication, as well as possibly computer program instructions executed by processor 602.
处理器602通过读取并执行存储器604中存储的计算机程序指令,以实现上述实施例中的任意一种视听特征融合的目标行为识别方法。The processor 602 reads and executes the computer program instructions stored in the memory 604 to implement any of the audio-visual feature fusion target behavior recognition methods in the above embodiments.
可选地,上述电子装置还可以包括传输设备606以及输入输出设备608,其 中,该传输设备606和上述处理器602连接,该输入输出设备608和上述处理器602连接。Optionally, the above-mentioned electronic device may also include a transmission device 606 and an input-output device 608, wherein the transmission device 606 is connected to the above-mentioned processor 602, and the input-output device 608 is connected to the above-mentioned processor 602.
传输设备606可以用来经由一个网络接收或者发送数据。上述的网络具体实例可包括电子装置的通信供应商提供的有线或无线网络。在一个实例中,传输设备包括一个网络适配器(Network Interface Controller,简称为NIC),其可通过基站与其他网络设备相连从而可与互联网进行通讯。在一个实例中,传输设备606可以为射频(Radio Frequency,简称为RF)模块,其用于通过无线方式与互联网进行通讯。 Transmission device 606 may be used to receive or send data via a network. Specific examples of the above-mentioned network may include a wired or wireless network provided by a communication provider of the electronic device. In one example, the transmission device includes a network adapter (Network Interface Controller, NIC for short), which can be connected to other network devices through a base station to communicate with the Internet. In one example, the transmission device 606 may be a radio frequency (Radio Frequency, RF for short) module, which is used to communicate with the Internet wirelessly.
输入输出设备608用于输入或输出信息。在本实施例中,输入的信息可以是待识别音视频段等,输出的信息可以识别出的目标行为等。Input and output devices 608 are used to input or output information. In this embodiment, the input information may be audio and video segments to be recognized, etc., and the output information may be the target behavior to be recognized, etc.
可选地,在本实施例中,上述处理器602可以被设置为通过计算机程序执行以下步骤:Optionally, in this embodiment, the above-mentioned processor 602 can be configured to perform the following steps through a computer program:
S101、获取预设时长的待识别音视频段。S101. Obtain the audio and video segments to be recognized with a preset duration.
S102、采集所述待识别音视频段中的视觉输入信息及听觉输入信息。S102. Collect visual input information and auditory input information in the audio and video segments to be recognized.
S103、将所述视觉输入信息及所述听觉输入信息一同输入目标行为模型中,其中所述目标行为模型包括双分支通道的特征提取网络、自编码网络及全连接层识别模块。S103. Input the visual input information and the auditory input information into the target behavior model together, where the target behavior model includes a dual-branch channel feature extraction network, an autoencoding network and a fully connected layer recognition module.
S104、根据所述特征提取网络分别从所述视觉输入信息、所述听觉输入信息中提取特征,得到视觉特征、听觉特征。S104. Extract features from the visual input information and the auditory input information respectively according to the feature extraction network to obtain visual features and auditory features.
S105、采用所述自编码网络将所述视觉特征、所述听觉特征映射到同一子空间中进行视听信息融合,得到融合特征。S105. Use the autoencoding network to map the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fusion features.
S106、将所述融合特征输入所述全连接层识别模块进行识别,得到目标行为。S106. Input the fusion feature into the fully connected layer recognition module for recognition to obtain the target behavior.
需要说明的是,本实施例中的具体示例可以参考上述实施例及可选实施方式中所描述的示例,本实施例在此不再赘述。It should be noted that for specific examples in this embodiment, reference may be made to the examples described in the above-mentioned embodiments and optional implementations, and the details of this embodiment will not be repeated here.
通常,各种实施例可以以硬件或专用电路、软件、逻辑或其任何组合来实现。本发明的一些方面可以以硬件来实现,而其他方面可以以可以由控制器、微处理器或其他计算设备执行的固件或软件来实现,但是本发明不限于此。尽管本发明的各个方面可以被示出和描述为框图、流程图或使用一些其他图形表示,但是应当理解,作为非限制性示例,本文中描述的这些框、装置、系统、技术或方法可以以硬件、软件、固件、专用电路或逻辑、通用硬件或控制器或其他计算设备或其某种组合来实现。Generally, various embodiments may be implemented in hardware or special purpose circuitry, software, logic, or any combination thereof. Some aspects of the invention may be implemented in hardware, while other aspects may be implemented in firmware or software that may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. Although various aspects of the invention may be shown and described as block diagrams, flow diagrams, or using some other graphical representation, it is to be understood that, by way of non-limiting example, the blocks, devices, systems, techniques, or methods described herein may be Hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controllers or other computing devices, or some combination thereof.
本发明的实施例可以由计算机软件来实现,该计算机软件由移动设备的数据处理器诸如在处理器实体中可执行,或者由硬件来实现,或者由软件和硬件的组合来实现。包括软件例程、小程序和/或宏的计算机软件或程序(也称为程序产品)可以存储在任何装置可读数据存储介质中,并且它们包括用于执行特定任务的程序指令。计算机程序产品可以包括当程序运行时被配置为执行实施例的一个或多 个计算机可执行组件。一个或多个计算机可执行组件可以是至少一个软件代码或其一部分。另外,在这一点上,应当注意,如图中的逻辑流程的任何框可以表示程序步骤、或者互连的逻辑电路、框和功能、或者程序步骤和逻辑电路、框和功能的组合。软件可以存储在诸如存储器芯片或在处理器内实现的存储块等物理介质、诸如硬盘或软盘等磁性介质、以及诸如例如DVD及其数据变体、CD等光学介质上。物理介质是非瞬态介质。Embodiments of the invention may be implemented by computer software executable by a data processor of the mobile device, such as in a processor entity, or by hardware, or by a combination of software and hardware. Computer software or programs (also referred to as program products), including software routines, applets, and/or macros, may be stored on any device-readable data storage medium, and they include program instructions for performing specific tasks. A computer program product may include one or more computer-executable components that are configured to perform embodiments when the program is executed. One or more computer-executable components may be at least one software code or a portion thereof. Additionally, at this point, it should be noted that any block of the logic flow in the figures may represent program steps, or interconnected logic circuits, blocks, and functions, or a combination of program steps and logic circuits, blocks, and functions. Software may be stored on physical media such as memory chips or memory blocks implemented within a processor, magnetic media such as hard or floppy disks, and optical media such as, for example, DVD and its data variants, CDs. Physical media are non-transient media.
本领域的技术人员应该明白,以上实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。Those skilled in the art should understand that the technical features of the above embodiments can be combined in any way. To simplify the description, not all possible combinations of the technical features in the above embodiments are described. However, as long as these technical features There is no contradiction in the combinations, and they should be considered to be within the scope of this manual.
以上实施例仅表达了本申请的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对本申请范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请的保护范围应以所附权利要求为准。The above embodiments only express several implementation modes of the present application, and their descriptions are relatively specific and detailed, but should not be construed as limiting the scope of the present application. It should be noted that, for those of ordinary skill in the art, several modifications and improvements can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the scope of protection of this application should be determined by the appended claims.

Claims (11)

  1. 一种视听特征融合的目标行为识别方法,其特征在于,包括以下步骤:A target behavior recognition method based on audio-visual feature fusion, which is characterized by including the following steps:
    获取预设时长的待识别音视频段;Obtain the audio and video segments to be recognized with a preset duration;
    采集所述待识别音视频段中的视觉输入信息及听觉输入信息;Collect visual input information and auditory input information in the audio and video segments to be recognized;
    将所述视觉输入信息及所述听觉输入信息一同输入目标行为模型中,其中所述目标行为模型包括双分支通道的特征提取网络、自编码网络及全连接层识别模块;The visual input information and the auditory input information are input into the target behavior model together, wherein the target behavior model includes a dual-branch channel feature extraction network, an autoencoding network and a fully connected layer recognition module;
    根据所述特征提取网络分别从所述视觉输入信息、所述听觉输入信息中提取特征,得到视觉特征、听觉特征;According to the feature extraction network, features are extracted from the visual input information and the auditory input information respectively to obtain visual features and auditory features;
    the encoder of the autoencoding network maps the visual features and the auditory features into the same subspace to obtain auditory mapping features corresponding to the auditory features and visual mapping features corresponding to the visual features; the decoder of the autoencoding network maps all the visual mapping features and all the auditory mapping features into a multi-modal space, and each modality obtains the visual compensation features of the other modality space as visual shared features and the auditory compensation features of the other modality as auditory shared features; the visual shared features, the auditory shared features, the visual features and the auditory features are concatenated to obtain fused features;
    wherein the autoencoding network includes an encoder and a decoder, the encoder including a first fully connected layer, a second fully connected layer and an encoder layer connected in sequence; the visual features and the auditory features are jointly input into the encoder and pass through the first fully connected layer, the second fully connected layer and the encoder layer in sequence to output the auditory mapping features corresponding to the auditory features and the visual mapping features corresponding to the visual features;
    wherein the decoder includes two branches, each branch consisting of two fully connected layers; one branch takes the auditory mapping features as input, and its two fully connected layers map all the auditory mapping features into the multi-modal space to obtain the visual compensation features corresponding to the auditory mapping features; the other branch takes the visual mapping features as input, and its two fully connected layers map all the visual mapping features into the multi-modal space to obtain the auditory compensation features corresponding to the visual mapping features;
    将所述融合特征输入所述全连接层识别模块进行识别,得到目标行为。The fused features are input into the fully connected layer recognition module for recognition, and the target behavior is obtained.
  2. The target behavior recognition method based on audio-visual feature fusion according to claim 1, wherein the visual features and the auditory features input to the autoencoding network are marked with semantic mapping labels, where a semantic mapping label marks the visual input information and the auditory input information that describe the same semantic content;
    当输入自编码网络的视觉特征或听觉特征存在语义映射标签时,损失函数为听觉平均误差值和视觉平均误差值的代数和;When the visual features or auditory features input to the autoencoder network have semantic mapping labels, the loss function is the algebraic sum of the auditory average error value and the visual average error value;
    当输入自编码网络的视觉特征或听觉特征不存在语义映射标签时,损失函数为1与听觉平均误差值和视觉平均误差值的代数和的差值;When there is no semantic mapping label for the visual features or auditory features input to the autoencoder network, the loss function is the difference between 1 and the algebraic sum of the auditory average error value and the visual average error value;
    听觉平均误差值表征为所有听觉特征与所有听觉共享特征的绝对差值的平均值,视觉平均误差值表征为所有视觉特征与所有视觉共享特征的绝对差值的平均值;The auditory average error value is characterized as the average of the absolute differences between all auditory features and all auditory shared features, and the visual average error value is characterized as the average of the absolute differences between all visual features and all visual shared features;
    其中,损失函数由下列公式得到:Among them, the loss function is obtained by the following formula:
    y_autocoder = (1/N)·Σ_{i=1}^{N} |f_audio − f′_audio| + (1/N)·Σ_{i=1}^{N} |f_visual − f′_visual|,  when L_corr = 1
    y_autocoder = 1 − [(1/N)·Σ_{i=1}^{N} |f_audio − f′_audio| + (1/N)·Σ_{i=1}^{N} |f_visual − f′_visual|],  when L_corr = −1
    where y_autocoder is the loss function, N is the number of features, f_audio is an auditory feature, f′_audio is an auditory shared feature, f_visual is a visual feature, f′_visual is a visual shared feature, L_corr = 1 indicates that a semantic mapping label exists, and L_corr = −1 indicates that no semantic mapping label exists.
  3. The target behavior recognition method based on audio-visual feature fusion according to claim 2, wherein "marking the visual features and the auditory features input to the autoencoding network with semantic mapping labels" includes: semantically tagging the acoustic anomaly information of the auditory input information and the visual anomaly information of the visual input information separately, and, if it is determined that both the auditory input information and the visual anomaly information carry the semantic tag, assigning the semantic mapping label to the auditory input information and the visual anomaly information.
  4. 根据权利要求1所述的视听特征融合的目标行为识别方法,其特征在于,“采集所述待识别音视频段中的视觉输入信息”包括:The target behavior recognition method of audio-visual feature fusion according to claim 1, characterized in that "collecting the visual input information in the audio and video segments to be recognized" includes:
    从所述待识别音视频段中采集每相邻两帧图像帧的差值,得到差值序列,将所述差值序列作为视觉输入信息。The difference between every two adjacent image frames is collected from the audio and video segment to be recognized to obtain a difference sequence, and the difference sequence is used as visual input information.
  5. The target behavior recognition method based on audio-visual feature fusion according to claim 1, wherein "collecting the auditory input information in the audio-video segment to be recognized" comprises:
    obtaining the original audio waveform corresponding to the audio-video segment to be recognized, and sampling acoustic signals from the original audio waveform at a preset sampling interval to obtain the auditory input information.
  6. The target behavior recognition method based on audio-visual feature fusion according to claim 5, wherein the auditory input information is represented as waveform data with time as the horizontal axis and the acoustic signal as the vertical axis, and the auditory input information and the visual input information use the time domain as a unified scale.
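    A minimal sketch of collecting such time-domain auditory input follows; librosa, the 16 kHz rate and the 10 ms interval are assumptions, not values taken from the claims.

```python
import numpy as np
import librosa

def sample_waveform(audio_path, sample_rate=16000, interval_s=0.01):
    """Sample the raw audio waveform at a preset interval and return
    (time, amplitude) pairs: time on the horizontal axis, signal on the vertical axis."""
    waveform, sr = librosa.load(audio_path, sr=sample_rate, mono=True)
    step = max(1, int(interval_s * sr))           # preset sampling interval in samples
    samples = waveform[::step]
    times = np.arange(len(samples)) * step / sr   # shared time-domain scale with the video frames
    return times, samples
```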
  7. The target behavior recognition method based on audio-visual feature fusion according to claim 1, wherein the dual-branch channel feature extraction network comprises an auditory feature extraction network and a visual feature extraction network,
    wherein the auditory feature extraction network comprises an AFEN module and an LSTM module; multiple frame waveforms of the original audio waveform are input into the AFEN module to obtain the corresponding multiple auditory frame-level features, and the multiple auditory frame-level features are fused by the LSTM module to output auditory segment-level features,
    the AFEN network comprises 5 convolutional layers, 3 pooling layers and 3 sequentially connected fully connected layers, wherein the pooling layers follow the first, the second and the fifth convolutional layer respectively, each convolutional layer includes a ReLU activation function that makes the activation pattern of the AFEN module sparser, and each pooling layer includes a local response normalization operation to avoid vanishing gradients.
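    The sketch below follows the layer counts stated in claim 7 (5 convolutional layers with ReLU, pooling with local response normalization after the first, second and fifth convolutional layers, 3 fully connected layers, then an LSTM over frame-level features). The channel and kernel sizes, the 2-D (spectrogram-like) per-frame input and the PyTorch framework are assumptions, since the claim does not specify whether AFEN convolves over the raw 1-D waveform or a 2-D time-frequency representation.

```python
import torch
import torch.nn as nn

class AFEN(nn.Module):
    """Rough sketch of the AFEN module: 5 conv layers (each with ReLU),
    pooling with local response normalization after conv1, conv2 and conv5,
    followed by 3 fully connected layers."""
    def __init__(self, feat_dim=256):
        super().__init__()
        def pool():  # pooling stage with local response normalization
            return nn.Sequential(nn.MaxPool2d(3, stride=2), nn.LocalResponseNorm(5))
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 7, stride=2, padding=3), nn.ReLU(), pool(),   # conv1 + pool1
            nn.Conv2d(64, 128, 5, padding=2), nn.ReLU(), pool(),           # conv2 + pool2
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),                  # conv3
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),                  # conv4
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(), pool(),          # conv5 + pool3
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),
            nn.Linear(256 * 16, 1024), nn.ReLU(),                          # fc1
            nn.Linear(1024, 512), nn.ReLU(),                               # fc2
            nn.Linear(512, feat_dim),                                      # fc3
        )

    def forward(self, x):           # x: (batch, 1, H, W) per-frame input
        return self.features(x)

class AuditoryBranch(nn.Module):
    """AFEN per frame, then an LSTM fuses frame-level features into a segment-level feature."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.afen = AFEN(feat_dim)
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)

    def forward(self, frames):      # frames: (batch, T, 1, H, W)
        b, t = frames.shape[:2]
        frame_feats = self.afen(frames.flatten(0, 1)).view(b, t, -1)
        _, (h_n, _) = self.lstm(frame_feats)
        return h_n[-1]              # auditory segment-level feature
```

    Per claim 8, the visual branch would reuse the same AFEN module but replace the LSTM with a ConvLSTM to fuse visual frame-level features; ConvLSTM is not a built-in PyTorch module and would need a separate implementation.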
  8. The target behavior recognition method based on audio-visual feature fusion according to claim 7, wherein the visual feature extraction network shares the AFEN module with the auditory feature extraction network, and in the visual feature extraction network a ConvLSTM module replaces the LSTM module to fuse the multiple visual frame-level features output by the AFEN module and to output visual segment-level features.
  9. A target behavior recognition apparatus based on audio-visual feature fusion, comprising:
    an acquisition module, configured to acquire an audio-video segment to be recognized of a preset duration;
    an information collection module, configured to collect visual input information and auditory input information from the audio-video segment to be recognized;
    a feature extraction module, configured to input the visual input information and the auditory input information together into a target behavior model, wherein the target behavior model comprises a dual-branch channel feature extraction network, an autoencoder network and a fully connected layer recognition module;
    extract features from the visual input information and the auditory input information respectively by the feature extraction network to obtain visual features and auditory features;
    map the visual features and the auditory features to the same subspace by the encoder of the autoencoder network to obtain auditory mapping features corresponding to the auditory features and visual mapping features corresponding to the visual features; map all the visual mapping features and all the auditory mapping features into a multi-modal space by the decoder of the autoencoder network, each modality obtaining the visual compensation features of the other modality space as visual shared features and the auditory compensation features of the other modality as auditory shared features; and concatenate the visual shared features, the auditory shared features, the visual features and the auditory features to obtain fused features;
    wherein the autoencoder network comprises an encoder and a decoder, the encoder comprising a first fully connected layer, a second fully connected layer and an encoder layer connected in sequence; the visual features and the auditory features are jointly input into the encoder and pass in sequence through the first fully connected layer, the second fully connected layer and the encoder layer, outputting the auditory mapping features corresponding to the auditory features and the visual mapping features corresponding to the visual features;
    wherein the decoder comprises two branches, each branch consisting of two fully connected layers; one branch takes the auditory mapping features as input, and its two fully connected layers map all the auditory mapping features into the multi-modal space to obtain the visual compensation features corresponding to the auditory mapping features; the other branch takes the visual mapping features as input, and its two fully connected layers map all the visual mapping features into the multi-modal space to obtain the auditory compensation features corresponding to the visual mapping features (a code sketch of this encoder-decoder structure follows this claim);
    a behavior recognition module, configured to input the fused features into the fully connected layer recognition module for recognition to obtain the target behavior.
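    As a sketch only of the autoencoder described in this claim; the feature dimensions, activation functions and the PyTorch framework are assumptions, not part of the claim.

```python
import torch
import torch.nn as nn

class CrossModalAutoencoder(nn.Module):
    """Shared encoder (two fully connected layers plus an encoder layer) maps visual and
    auditory features into one subspace; the decoder has two branches of two fully
    connected layers each, producing the cross-modal compensation (shared) features."""
    def __init__(self, feat_dim=256, hidden_dim=128, code_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),   # first fully connected layer
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), # second fully connected layer
            nn.Linear(hidden_dim, code_dim),              # encoder layer
        )
        def branch():  # two fully connected layers back to the multi-modal feature space
            return nn.Sequential(nn.Linear(code_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, feat_dim))
        self.audio_to_visual = branch()   # auditory mapping -> visual compensation feature
        self.visual_to_audio = branch()   # visual mapping  -> auditory compensation feature

    def forward(self, f_visual, f_audio):
        z_visual = self.encoder(f_visual)              # visual mapping feature
        z_audio = self.encoder(f_audio)                # auditory mapping feature
        visual_shared = self.audio_to_visual(z_audio)  # visual shared feature
        audio_shared = self.visual_to_audio(z_visual)  # auditory shared feature
        # fused feature: concatenation of shared and original features
        return torch.cat([visual_shared, audio_shared, f_visual, f_audio], dim=-1)
```

    The fused feature returned here would then be passed to the fully connected layer recognition module by the behavior recognition module.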
  10. An electronic device, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to run the computer program to perform the target behavior recognition method based on audio-visual feature fusion according to any one of claims 1 to 8.
  11. A readable storage medium, wherein a computer program is stored in the readable storage medium, the computer program comprising program code for controlling a process to execute a process, and the process comprises the target behavior recognition method based on audio-visual feature fusion according to any one of claims 1 to 8.
PCT/CN2022/141314 2022-05-09 2022-12-23 Target behavior recognition method and apparatus based on visual-audio feature fusion, and application WO2023216609A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210496197.7 2022-05-09
CN202210496197.7A CN114581749B (en) 2022-05-09 2022-05-09 Audio-visual feature fusion target behavior identification method and device and application

Publications (1)

Publication Number Publication Date
WO2023216609A1

Family

ID=81768993

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/141314 WO2023216609A1 (en) 2022-05-09 2022-12-23 Target behavior recognition method and apparatus based on visual-audio feature fusion, and application

Country Status (2)

Country Link
CN (1) CN114581749B (en)
WO (1) WO2023216609A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114581749B (en) * 2022-05-09 2022-07-26 城云科技(中国)有限公司 Audio-visual feature fusion target behavior identification method and device and application

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102098492A (en) * 2009-12-11 2011-06-15 上海弘视通信技术有限公司 Audio and video conjoint analysis-based fighting detection system and detection method thereof
CN103854014A (en) * 2014-02-25 2014-06-11 中国科学院自动化研究所 Terror video identification method and device based on sparse representation of context
US9697833B2 (en) * 2015-08-25 2017-07-04 Nuance Communications, Inc. Audio-visual speech recognition with scattering operators
CN108200483B (en) * 2017-12-26 2020-02-28 中国科学院自动化研究所 Dynamic multi-modal video description generation method
CN109509484A (en) * 2018-12-25 2019-03-22 科大讯飞股份有限公司 A kind of prediction technique and device of baby crying reason
CN111640424B (en) * 2019-03-01 2024-02-13 北京搜狗科技发展有限公司 Voice recognition method and device and electronic equipment
CN109961789B (en) * 2019-04-30 2023-12-01 张玄武 Service equipment based on video and voice interaction
CN112328830A (en) * 2019-08-05 2021-02-05 Tcl集团股份有限公司 Information positioning method based on deep learning and related equipment
US11244696B2 (en) * 2019-11-06 2022-02-08 Microsoft Technology Licensing, Llc Audio-visual speech enhancement
CN111461235B (en) * 2020-03-31 2021-07-16 合肥工业大学 Audio and video data processing method and system, electronic equipment and storage medium
CN111754992B (en) * 2020-06-30 2022-10-18 山东大学 Noise robust audio/video bimodal speech recognition method and system
CN112866586B (en) * 2021-01-04 2023-03-07 北京中科闻歌科技股份有限公司 Video synthesis method, device, equipment and storage medium
CN113255556A (en) * 2021-06-07 2021-08-13 斑马网络技术有限公司 Multi-mode voice endpoint detection method and device, vehicle-mounted terminal and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190341025A1 (en) * 2018-04-18 2019-11-07 Sony Interactive Entertainment Inc. Integrated understanding of user characteristics by multimodal processing
CN109344781A (en) * 2018-10-11 2019-02-15 上海极链网络科技有限公司 Expression recognition method in a kind of video based on audio visual union feature
CN110647804A (en) * 2019-08-09 2020-01-03 中国传媒大学 Violent video identification method, computer system and storage medium
CN111460889A (en) * 2020-02-27 2020-07-28 平安科技(深圳)有限公司 Abnormal behavior identification method, device and equipment based on voice and image characteristics
CN112287893A (en) * 2020-11-25 2021-01-29 广东技术师范大学 Sow lactation behavior identification method based on audio and video information fusion
CN114581749A (en) * 2022-05-09 2022-06-03 城云科技(中国)有限公司 Audio-visual feature fusion target behavior identification method and device and application

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TIANLIANG LIU, QIAO QINGWEI; WAN JUNWEI; DAI XIUBIN; LUO JIEBO: "Human Action Recognition via Spatio-temporal Dual Network Flow and Visual Attention Fusion", JOURNAL OF ELECTRONICS & INFORMATION TECHNOLOGY, ZHONGGUO KEXUEYUAN DIANZIXUE YANJIUSUO,CHINESE ACADEMY OF SCIENCES, INSTITUTE OF ELECTRONICS, CN, vol. 40, no. 10, 15 August 2018 (2018-08-15), CN , pages 2395 - 2401, XP093107315, ISSN: 1009-5896, DOI: 10.11999/JEIT171116 *
XIAO-YU WU, GU CHAO-NAN; WANG SHENG-JIN: "Special video classification based on multitask learning and multimodal feature fusion", OPTICS AND PRECISION ENGINEERING, vol. 28, no. 5, 13 May 2020 (2020-05-13), pages 1177 - 1186, XP093107318 *

Also Published As

Publication number Publication date
CN114581749A (en) 2022-06-03
CN114581749B (en) 2022-07-26

Similar Documents

Publication Publication Date Title
WO2020103676A1 (en) Image identification method and apparatus, system, and storage medium
WO2022007193A1 (en) Weak supervision video behavior detection method and system based on iterative learning
CN107943837A (en) A kind of video abstraction generating method of foreground target key frame
CN111653368A (en) Artificial intelligence epidemic situation big data prevention and control early warning system
CN109360584A (en) Cough monitoring method and device based on deep learning
US20200034739A1 (en) Method and device for estimating user's physical condition
CN110931112B (en) Brain medical image analysis method based on multi-dimensional information fusion and deep learning
CN112587153B (en) End-to-end non-contact atrial fibrillation automatic detection system and method based on vPPG signal
WO2023216609A1 (en) Target behavior recognition method and apparatus based on visual-audio feature fusion, and application
Ahmedt-Aristizabal et al. A hierarchical multimodal system for motion analysis in patients with epilepsy
WO2023273629A1 (en) System and apparatus for configuring neural network model in edge server
Gao et al. Deep model-based semi-supervised learning way for outlier detection in wireless capsule endoscopy images
CN111814588B (en) Behavior detection method, related equipment and device
Hsu et al. Hierarchical Network for Facial Palsy Detection.
CN109460717A (en) Alimentary canal Laser scanning confocal microscope lesion image-recognizing method and device
WO2023285898A1 (en) Screening of individuals for a respiratory disease using artificial intelligence
Mohan et al. Non-invasive technique for real-time myocardial infarction detection using faster R-CNN
You et al. Automatic cough detection from realistic audio recordings using C-BiLSTM with boundary regression
Wang et al. A YOLO-based Method for Improper Behavior Predictions
CN107920224A (en) A kind of abnormality alarming method, equipment and video monitoring system
CN116416678A (en) Method for realizing motion capture and intelligent judgment by using artificial intelligence technology
US20210177307A1 (en) Repetitive human activities abnormal motion detection
Postawka Real-time monitoring system for potentially dangerous activities detection
Goldstein et al. Chest area segmentation in 3D images of sleeping patients
Mantini et al. A Day on Campus-An Anomaly Detection Dataset for Events in a Single Camera

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22941542

Country of ref document: EP

Kind code of ref document: A1