CN114581749B - Audio-visual feature fusion target behavior identification method and device and application - Google Patents

Publication number
CN114581749B
CN114581749B (application CN202210496197.7A)
Authority
CN
China
Prior art keywords
visual
auditory
features
mapping
feature
Prior art date
Legal status
Active
Application number
CN202210496197.7A
Other languages
Chinese (zh)
Other versions
CN114581749A (en)
Inventor
毛云青
王国梁
齐韬
陈思瑶
葛俊
Current Assignee
CCI China Co Ltd
Original Assignee
CCI China Co Ltd
Priority date
Filing date
Publication date
Application filed by CCI China Co Ltd
Priority to CN202210496197.7A
Publication of CN114581749A
Application granted
Publication of CN114581749B
Priority to PCT/CN2022/141314 (WO2023216609A1)
Legal status: Active

Classifications

    • G06F18/253: Fusion techniques of extracted features (G Physics; G06F Electric digital data processing; G06F18/00 Pattern recognition; G06F18/20 Analysing; G06F18/25 Fusion techniques)
    • G06F18/254: Fusion techniques of classification results, e.g. of results related to same input data
    • G06N3/044: Recurrent networks, e.g. Hopfield networks (G06N Computing arrangements based on specific computational models; G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks; G06N3/04 Architecture, e.g. interconnection topology)
    • G06N3/08: Learning methods (G06N3/02 Neural networks)

Abstract

The application provides a method, a device and an application for identifying target behaviors with audio-visual feature fusion, relating to the technical field of intelligent security. The method inputs visual information and auditory information into a specified algorithm network, extracts visual features and auditory features through the different feature extraction networks of two branches, and obtains features on the time sequence through an LSTM network. A shared semantic subspace is constructed through a self-coding network to eliminate the semantic deviation between visual and auditory features, the visual features and the auditory features are fused, and target behaviors can be identified based on the fused features. With the method and the device, the accuracy of abnormal behavior identification can be improved.

Description

Audio-visual feature fusion target behavior identification method and device and application
Technical Field
The application relates to the technical field of intelligent security, in particular to a method, a device and application for identifying a target behavior with audio-visual feature fusion.
Background
In the fields of city management and safety management, monitoring emergencies and raising alarms in real time are very important for public safety management. In real life, fighting is one of the most common abnormal emergencies. Traditional alarm modes mainly include: a person calls the police by telephone after learning that a fight has occurred, which obviously lags behind the event; alternatively, security personnel stare at the monitoring pictures to find abnormalities, but this consumes labor cost.
Therefore, in the prior art, a monitoring camera monitors continuously for 24 hours, and fighting behavior is then judged and an alarm is raised by an artificial intelligence algorithm, which can greatly improve the real-time performance and accuracy of alarms for sudden abnormal events such as fighting.
In the prior art, the ways of judging fighting through an artificial intelligence algorithm include: detecting the pictures and performing behavior classification, or detecting the positions of human body key points in multiple frames and judging the behavior. Publications CN112733629A and CN111401296A, for example, only disclose the use of image information to determine abnormal behavior. In actual scenes, such algorithms can misidentify large-motion labor, such as several people cleaning, or physical exercise, such as playing ball, as fighting; moreover, because these general algorithms use only image information, their accuracy still needs to be improved.
In addition, prior art such as publications CN104243894A and CN102098492A determines abnormal behaviors in two separate steps and then performs fusion at the decision level. However, decision-level fusion has a limited effect on improving the video recognition performance for abnormal behaviors: it can only fuse the scores produced after the two branch modalities have made their decisions, without considering the semantic consistency of the information in each modality, so the problems of audio-video time misalignment and semantic inconsistency cannot be solved.
Semantic consistency is of great significance in multimodal information fusion, especially the fusion of visual and auditory information. When multimodal information is semantically consistent, the modalities are complementary; otherwise they interfere with each other, as in the well-known "McGurk effect". Human hearing is sometimes significantly affected by vision, which may lead to mishearing: when a sound does not match the visual signal, a person may perceive a third, different sound, so simply blending the sound and the video signal may even have the opposite effect.
Therefore, when the semantics of the multimodal information are inconsistent, feature fusion between modalities without any metric cannot realize information complementation and may even degrade the performance of the algorithm. Due to the particularity of abnormal behaviors, the inconsistency of audiovisual semantics is mainly embodied as follows: first, the audiovisual data may not be aligned on the time axis, and the sound features may appear later than the visual features; in addition, vision and hearing have deviations in semantic expression. These are problems that need to be solved in the multimodal feature fusion process.
Based on this, no effective solution has yet been proposed for the problem that audio-visual features cannot be well integrated into an abnormal behavior recognition algorithm to accurately judge whether abnormal behavior exists.
Disclosure of Invention
The embodiments of the application provide a method, a device and an application for identifying target behaviors with audio-visual feature fusion. Compared with existing abnormal behavior identification algorithms, the scheme fuses audio-visual information with a feature-level fusion method and can improve the accuracy of abnormal behavior identification.
In a first aspect, an embodiment of the present application provides an audiovisual feature fusion target behavior identification method, where the method includes: acquiring an audio-video segment to be identified with preset duration; collecting visual input information and auditory input information in the audio and video segment to be identified; inputting the visual input information and the auditory input information into a target behavior model together, wherein the target behavior model comprises a feature extraction network, a self-coding network and a full connection layer identification module of a double-branch channel; extracting features from the visual input information and the auditory input information respectively according to the feature extraction network to obtain visual features and auditory features; mapping the visual features and the auditory features to the same subspace by adopting the self-coding network to perform audio-visual information fusion to obtain fusion features; and inputting the fusion characteristics into the full-connection layer identification module for identification to obtain target behaviors.
In some embodiments, the "mapping the visual features and the auditory features into the same subspace using the self-coding network for audio-visual information fusion to obtain fusion features" includes: mapping the visual features and the auditory features to the same subspace by an encoder of the self-coding network to obtain auditory mapping features corresponding to the auditory features and visual mapping features corresponding to the visual features; mapping all the visual mapping characteristics and all the auditory mapping characteristics into a multi-mode space according to a decoder of the self-coding network, wherein each mode obtains visual compensation characteristics of other mode spaces as visual sharing characteristics and obtains auditory compensation characteristics of other modes as auditory sharing characteristics; and splicing the visual sharing feature, the auditory sharing feature, the visual feature and the auditory feature to obtain a fusion feature.
In some embodiments, the self-coding network comprises an encoder and a decoder, wherein the encoder comprises a first fully-connected layer, a second fully-connected layer and an encoder layer which are connected in sequence; inputting the visual features and the auditory features into an encoder together, and outputting the visual features and the auditory features through a first full-connection layer, a second full-connection layer and an encoder layer in sequence to obtain auditory mapping features corresponding to the auditory features and visual mapping features corresponding to the visual features; the decoder comprises two branches, wherein each branch consists of two full-connection layers; one branch takes the auditory mapping characteristics as input, the two full-connection layers map all the auditory mapping characteristics into the multi-mode space to obtain the visual compensation characteristics corresponding to the auditory mapping characteristics, the other branch takes the visual mapping characteristics as input, and the two full-connection layers map all the visual mapping characteristics into the multi-mode space to obtain the auditory compensation characteristics corresponding to the visual mapping characteristics.
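For illustration only, the encoder and one decoder branch described above could be sketched as follows in PyTorch; the layer widths and the module names SharedEncoder and DecoderBranch are assumptions, since the embodiment specifies only the layer counts and the wiring.

```python
# Hypothetical sketch of the self-coding network described above (PyTorch assumed).
# Layer widths are illustrative; the patent does not disclose exact dimensions.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Two fully-connected layers followed by an encoder layer, fed by both modalities."""
    def __init__(self, in_dim=512, hidden_dim=256, code_dim=128):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)              # first fully-connected layer
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)          # second fully-connected layer
        self.encoder_layer = nn.Linear(hidden_dim, code_dim)  # encoder layer

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.encoder_layer(x)                          # mapping feature in the shared subspace

class DecoderBranch(nn.Module):
    """One decoder branch: two fully-connected layers mapping one modality's mapping
    features into the multimodal space to obtain the other modality's compensation
    (shared) features."""
    def __init__(self, code_dim=128, hidden_dim=256, out_dim=512):
        super().__init__()
        self.fc1 = nn.Linear(code_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def forward(self, z):
        return self.fc2(torch.relu(self.fc1(z)))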
In some of these embodiments, the visual and auditory features input into the self-encoding network are tagged with semantic mapping tags, wherein a semantic mapping tag is characterized as a tag of the visual and auditory input information that describes the same semantic content; when the visual features or the auditory features input from the coding network have semantic mapping labels, the loss function is the algebraic sum of the auditory average error value and the visual average error value; when the visual features or the auditory features input into the self-coding network do not have semantic mapping labels, the loss function is the difference value between 1 and the algebraic sum of the auditory average error value and the visual average error value; the hearing mean error value is characterized as an average of the absolute differences of all the hearing features and all the hearing shared features, and the vision mean error value is characterized as an average of the absolute differences of all the vision features and all the vision shared features, wherein the loss function is obtained by the following formula:
$$
y_{\mathrm{autocoder}} =
\begin{cases}
\dfrac{1}{N}\displaystyle\sum_{i=1}^{N}\Bigl(\bigl|f_{\mathrm{audio}}-f'_{\mathrm{audio}}\bigr| + \bigl|f_{\mathrm{visual}}-f'_{\mathrm{visual}}\bigr|\Bigr), & L_{\mathrm{corr}}=1\\[3mm]
1-\dfrac{1}{N}\displaystyle\sum_{i=1}^{N}\Bigl(\bigl|f_{\mathrm{audio}}-f'_{\mathrm{audio}}\bigr| + \bigl|f_{\mathrm{visual}}-f'_{\mathrm{visual}}\bigr|\Bigr), & L_{\mathrm{corr}}=-1
\end{cases}
$$
where y_autocoder is the loss function, N is the number of features, f_audio is the auditory feature, f'_audio is the auditory shared feature, f_visual is the visual feature, f'_visual is the visual shared feature, L_corr = 1 indicates that a semantic mapping label is present, and L_corr = -1 indicates that no semantic mapping label is present.
In some of these embodiments, "labeling the visual features and the auditory features input to the self-encoding network with semantic mapping labels" comprises: and respectively performing semantic marking on the acoustic abnormal information of the auditory input information and the visual abnormal information of the visual input information, and if the auditory input information and the visual abnormal information are judged to have the semantic marking, allocating the semantic mapping labels to the group of the auditory input information and the visual abnormal information.
In some embodiments, the step of "collecting visual input information in the audio and video segment to be identified" comprises: and acquiring the difference value of every two adjacent image frames from the audio and video segment to be identified to obtain a difference value sequence, and taking the difference value sequence as visual input information.
In some embodiments, the "capturing auditory input information in the audio-video segment to be identified" comprises: and acquiring an original audio waveform corresponding to the audio and video segment to be identified, and acquiring acoustic signals from the original audio waveform at a preset sampling interval to obtain auditory input information.
In some of these embodiments, the auditory input information is characterized as waveform data having time as a horizontal axis and an acoustic signal as a vertical axis, wherein the auditory input information and the visual input information are uniformly scaled in the time domain.
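As an illustration of this time-domain representation, the sketch below picks acoustic values from an already decoded waveform array at a preset interval; the 16 kHz target rate is taken from the embodiment described later, while the nearest-sample indexing is an assumption.

```python
# Illustrative sketch: sample the original audio waveform at a preset interval so that
# the auditory input (time vs. acoustic signal) shares the time axis with the visual input.
import numpy as np

def sample_waveform(waveform, orig_rate, target_rate=16000):
    """Pick acoustic samples from the original waveform at a preset sampling interval."""
    duration = len(waveform) / orig_rate
    t = np.arange(0.0, duration, 1.0 / target_rate)            # instants on the common time axis
    idx = np.minimum((t * orig_rate).astype(int), len(waveform) - 1)
    return t, waveform[idx]   # horizontal axis: time, vertical axis: acoustic signal
```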
In some embodiments, the feature extraction network of the dual-branch channel includes an auditory feature extraction network and a visual feature extraction network, wherein the auditory feature extraction network includes an AFEN module and an LSTM module, and inputs multiple frames of waveforms in an original audio waveform into the AFEN module to obtain corresponding multiple auditory frame-level features, and fuses the multiple auditory frame-level features through the LSTM module to output auditory segment-level features.
In some of these embodiments, the AFEN network includes 5 convolutional layers, 3 pooling layers, and 3 fully-connected layers connected in series, where the pooling layers are connected after the first convolutional layer, the second convolutional layer, and the fifth convolutional layer, respectively, each convolutional layer includes a ReLU activation function that makes the activation pattern of the AFEN module more sparse, and each pooling layer includes a local response normalization operation that avoids gradient vanishing.
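As a rough illustration, the described 5-conv / 3-pool / 3-FC arrangement could be sketched as below, assuming 1-D convolutions over raw waveform frames; all channel counts, kernel sizes and the adaptive pooling are assumptions, since the embodiment fixes only the layer counts, the ReLU activations and the local response normalization.

```python
# Hypothetical AFEN sketch: 5 conv layers, pooling after conv1, conv2 and conv5,
# local response normalization in the pooling stages, then 3 fully-connected layers.
import torch.nn as nn

class AFEN(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        def conv(cin, cout):  # conv layer with ReLU (sparser activations)
            return nn.Sequential(nn.Conv1d(cin, cout, kernel_size=9, stride=2, padding=4),
                                 nn.ReLU())
        pool = lambda: nn.Sequential(nn.MaxPool1d(2), nn.LocalResponseNorm(5))
        self.features = nn.Sequential(
            conv(1, 32),  pool(),      # conv1 + pool1
            conv(32, 64), pool(),      # conv2 + pool2
            conv(64, 96),              # conv3
            conv(96, 96),              # conv4
            conv(96, 64), pool(),      # conv5 + pool3
            nn.AdaptiveAvgPool1d(8),
        )
        self.classifier = nn.Sequential(          # 3 fully-connected layers
            nn.Linear(64 * 8, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, feat_dim),
        )

    def forward(self, x):                         # x: (batch, 1, samples), one audio frame
        x = self.features(x)
        return self.classifier(x.flatten(1))      # frame-level feature
```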
In some embodiments, the visual feature extraction network and the auditory feature extraction network share an AFEN module, and a ConvLSTM module replaces an LSTM module in the visual feature extraction network to fuse a plurality of visual frame-level features output by the AFEN module and output a visual segment-level feature.
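For the audio branch, a minimal sketch of fusing frame-level features into a segment-level feature with an LSTM is given below (hidden size and batch-first layout are assumptions); in the visual branch a ConvLSTM cell would play the corresponding role, typically via a custom or third-party implementation, so it is not sketched here.

```python
# Minimal sketch: fuse a sequence of auditory frame-level features into one
# segment-level feature with an LSTM (hidden size is an assumption).
import torch.nn as nn

class SegmentLevelLSTM(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frame_feats):          # (batch, T, feat_dim) frame-level features
        _, (h_n, _) = self.lstm(frame_feats)
        return h_n[-1]                       # (batch, hidden_dim) segment-level feature
```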
In a second aspect, an embodiment of the present application provides an audio-visual feature fused target behavior recognition apparatus, including: the acquisition module is used for acquiring the audio and video segment to be identified with preset duration; the information acquisition module is used for acquiring visual input information and auditory input information in the audio-video band to be identified; the characteristic extraction module is used for inputting the visual input information and the auditory input information into a target behavior model together, wherein the target behavior model comprises a characteristic extraction network, a self-coding network and a full connection layer identification module of a double-branch channel; extracting features from the visual input information and the auditory input information respectively according to the feature extraction network to obtain visual features and auditory features; mapping the visual features and the auditory features to the same subspace by adopting the self-coding network to perform audio-visual information fusion to obtain fusion features; and the behavior identification module is used for inputting the fusion characteristics into the full-connection layer identification module for identification to obtain target behaviors.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and the processor is configured to execute the computer program to perform the method for identifying an audio-visual feature-fused target behavior according to any one of the first aspect.
In a fourth aspect, the present application provides a readable storage medium, in which a computer program is stored, the computer program including program code for controlling a process to execute the process, the process including the audio-visual feature fusion target behavior identification method according to any one of the first aspect.
The main contributions and innovation points of the embodiment of the application are as follows:
The scheme adopts a self-coding network to represent the mapping into the shared semantic subspace. Within the self-coding network, vision and hearing adopt time as a unified measurement, so that visual and auditory data representing the same semantics can be mapped to each other, complementary information and high-level semantics between different modalities can be captured, and feature fusion at the semantic level is realized.
The scheme designs a feature extraction network with two branch channels. In the extraction network, an LSTM network is selected to process the temporal relationship between audio frame-level features to obtain auditory segment-level features, and a ConvLSTM network is selected to process the temporal relationship between video frame-level features to obtain visual segment-level features; the feature heterogeneity between the visual segment-level features and the auditory segment-level features is then eliminated in the self-coding network to realize feature fusion.
The scheme uses video inter-frame difference information for visual feature extraction, which better reflects the characteristics of abnormal behaviors, and uses the original audio waveform for sound feature extraction instead of a method based on spectral analysis such as MFCC or LPC, so that images and waveforms can each be sampled uniformly at time intervals; by unifying them in the time domain, the problem of inconsistent audio and video feature processing in the audio-visual information fusion process is solved.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a flowchart of main steps of a method for identifying an audio-visual feature-fused target behavior according to a first embodiment of the present application.
Fig. 2 is a schematic diagram of an abnormal behavior auditory feature extraction network structure.
FIG. 3 is a schematic diagram of an abnormal behavior recognition network structure based on self-coding network mapping audio-visual feature fusion.
Fig. 4 is a schematic diagram of the structure of a self-coding network.
Fig. 5 is a block diagram of a target behavior recognition apparatus for audio-visual feature fusion according to a second embodiment of the present application.
Fig. 6 is a schematic hardware configuration diagram of an electronic device according to a third embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of one or more embodiments of the specification, as detailed in the claims which follow.
It should be noted that: in other embodiments, the steps of the corresponding methods are not necessarily performed in the order shown and described herein. In some other embodiments, the method may include more or fewer steps than those described herein. Moreover, a single step described in this specification may be broken down into multiple steps in other embodiments; multiple steps described in this specification may be combined into a single step in other embodiments.
The embodiment of the application provides an audio-visual characteristic fusion target behavior identification method, which adopts the audio-visual characteristic fusion target behavior identification scheme to judge abnormal event behaviors. First, a use scenario to which the method of the present application is applied will be explained:
firstly, accessing monitoring video sound and pictures, and extracting video and audio clips with fixed length (such as 1 second) as a window for behavior judgment;
then calculating the video frame-to-frame difference in the period of time (two adjacent frames of images are calculated, about 30-60 frames of images exist in 1 second, and a plurality of results on a time sequence can be obtained after calculation), and using the calculated difference value sequence as visual input information;
then sampling 16kHz sound waveform in 1 second as auditory input information;
inputting the visual information and the auditory information into a specified algorithm network, extracting the visual characteristic and the auditory characteristic through different characteristic extraction networks of two branches, and obtaining the characteristic on the time sequence through calculation of an LSTM network; a shared semantic subspace is constructed through a self-coding network, semantic deviation of visual and auditory characteristics is eliminated, the visual characteristics and the auditory characteristics are fused, and the fused characteristics are input into a classification network branch to obtain abnormal behavior classification probability;
and finally, judging whether the behaviors belong to abnormal behaviors or not according to the classification probability threshold.
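The run-time flow listed above could be organised roughly as in the sketch below, reusing the frame_difference_sequence and sample_waveform helpers sketched earlier in this description; the window decoding, the model interface and the 0.5 threshold are assumptions rather than values from the patent.

```python
# Illustrative inference loop over 1-second windows; each window is assumed to be a
# (frames, waveform, sample_rate) tuple produced by an upstream decoder.
import torch

def monitor(windows, model, threshold=0.5):
    for frames, waveform, rate in windows:
        visual_in = frame_difference_sequence(frames)          # difference sequence
        _, audio_in = sample_waveform(waveform, rate, 16000)   # 16 kHz acoustic samples
        with torch.no_grad():
            prob = model(torch.as_tensor(visual_in).float().unsqueeze(0),
                         torch.as_tensor(audio_in).float().unsqueeze(0))
        if prob.item() > threshold:                            # classification probability
            yield frames, prob.item()                          # flag abnormal behaviour
```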
Fig. 1 is a flowchart of main steps of an audiovisual feature fusion target behavior identification method according to a first embodiment of the present application.
To achieve this, as shown in fig. 1, the method for identifying target behaviors by audio-visual feature fusion mainly includes the following steps 101 to 106.
Step 101, obtaining an audio and video segment to be identified with preset duration.
And 102, collecting visual input information and auditory input information in the audio-video band to be identified.
And 103, inputting the visual input information and the auditory input information into a target behavior model together, wherein the target behavior model comprises a feature extraction network, a self-coding network and a full connection layer identification module of a double-branch channel.
And 104, respectively extracting features from the visual input information and the auditory input information according to the feature extraction network to obtain visual features and auditory features.
And 105, mapping the visual features and the auditory features to the same subspace by adopting the self-coding network to perform audio-visual information fusion to obtain fusion features.
And 106, inputting the fusion characteristics into the full-link layer identification module for identification to obtain target behaviors.
Specifically, the scheme not only uses visual input information but also collects auditory input information; both are used as input to the behavior model, which outputs a target behavior classification result. Compared with judging the target behavior from image information alone, the method utilizes sound features, and the information in the sound features and the image features can complement each other so that the target behavior is judged more accurately, giving a better feature expression effect. In addition, the visual input information and the auditory input information are input into the model together for processing. The advantage of this parallel processing is that feature-level fusion can be carried out in the feature extraction stage to better capture the relationship between the modalities. Compared with a decision-level fusion mode that separately fuses the results recognized from visual and auditory feature information, this scheme can take into account the consistency between the visual and auditory feature information, so that the multimodal feature information is complementary and a better performance effect is achieved.
It should be noted that, in the present solution, the target behavior may be a normal behavior or an abnormal behavior. For example, when a visual sample and an auditory sample and normal behaviors marked from the samples are input when training a model, the trained model recognizes the normal behaviors from visual input information and auditory input information. On the contrary, when the labeled sample labeled to the abnormal behavior is input during the model training, the trained model can identify the abnormal behavior from the visual input information and the auditory input information. By way of example and not limitation, the visual sample and the auditory sample describing the violent scenes and the characteristics, marked out from the samples, describing the violent scenes are used as input of the model, so that the trained model can identify violent behaviors such as fighting.
Moreover, the self-coding network is adopted in the scheme to realize the fusion of the characteristics, and different from the prior art, the self-coding network designed by the invention can represent the audio-visual data by the same measurement and match the visual information and the auditory information which represent the same semantic under the same measurement, namely, the semantic consistency of the visual sense and the auditory sense is realized.
Specifically, after the visual features and the auditory features are acquired, the features need to be fused. Therefore, the embodiment of the invention establishes a shared semantic subspace by mapping the visual features and the auditory features to the same subspace, thereby eliminating the feature heterogeneity between the video and audio modalities, further capturing the complementary information and high-level semantics between the visual and audio modalities, and realizing feature fusion at the semantic level. In the scheme, the encoder of the self-coding network maps the visual features and the auditory features to the same subspace to obtain visual mapping features and auditory mapping features; the decoder of the self-coding network then maps the visual mapping features and the auditory mapping features into a multimodal space to obtain the compensation features of the other modality as the visual shared features and the auditory shared features; finally, the visual shared feature, the auditory shared feature, the visual feature and the auditory feature are spliced to obtain the fusion feature.
Illustratively, as shown in formula (1), the extracted visual feature is f_visual; g() is the function that maps the visual feature f_visual into the shared subspace, so inputting f_visual into g() yields the visual mapping feature; H() is the function that maps the visual mapping feature g(f_visual) into the multimodal space of shared semantics, so inputting g(f_visual) into H() yields the compensation feature of the auditory modality, which serves as the auditory shared feature.
Similarly, as shown in formula (2), the extracted auditory feature is f_audio; h() is the function that maps the auditory feature f_audio into the shared subspace, so inputting f_audio into h() yields the auditory mapping feature; G() is the function that maps the auditory mapping feature h(f_audio) into the multimodal space of shared semantics, so inputting h(f_audio) into G() yields the compensation feature of the visual modality, which serves as the visual shared feature.
$$ f'_{\mathrm{audio}} = H\bigl(g(f_{\mathrm{visual}})\bigr) \tag{1} $$
$$ f'_{\mathrm{visual}} = G\bigl(h(f_{\mathrm{audio}})\bigr) \tag{2} $$
As shown in fig. 4, the self-coding network includes an encoder and a decoder, wherein the encoder includes a first fully-connected layer, a second fully-connected layer, and an encoder layer that are connected in sequence; inputting the visual features and the auditory features into an encoder together, and outputting the visual features and the auditory features through a first full-connection layer, a second full-connection layer and an encoder layer in sequence to obtain auditory mapping features corresponding to the auditory features and visual mapping features corresponding to the visual features;
the decoder comprises two branches, wherein each branch consists of two full-connection layers; one branch takes the auditory mapping characteristics as input, the two full-connection layers map all the auditory mapping characteristics into the multimode space to obtain the visual compensation characteristics corresponding to the auditory mapping characteristics, the other branch takes the visual mapping characteristics as input, and the two full-connection layers map all the visual mapping characteristics into the multimode space to obtain the auditory compensation characteristics corresponding to the visual mapping characteristics. And finally, splicing the visual characteristic, the visual compensation characteristic, the auditory characteristic and the auditory compensation characteristic by using a formula (3) to obtain a fusion characteristic.
In the scheme, each modal space receives information from an inter-mode neighbor and an intra-mode neighbor thereof, and shares own information, and when the inter-mode neighbor information acquired by any modal space can make up for the loss of own information, the acquired inter-mode neighbor information is used as a supplementary feature to enhance the expression capability of the fusion feature.
$$ f_{\mathrm{fusion}} = \mathrm{concat}\bigl(f_{\mathrm{visual}},\, f'_{\mathrm{visual}},\, f_{\mathrm{audio}},\, f'_{\mathrm{audio}}\bigr) \tag{3} $$
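Using the hypothetical SharedEncoder and DecoderBranch modules sketched earlier, formulas (1) to (3) correspond to a forward pass like the one below; whether g() and h() actually share encoder weights is an assumption based on the shared encoder layers described above.

```python
# Sketch of the fusion step of formulas (1)-(3), reusing the hypothetical
# SharedEncoder / DecoderBranch modules from the earlier sketch.
import torch

def fuse(encoder, audio_decoder, visual_decoder, f_visual, f_audio):
    g_v = encoder(f_visual)                  # visual mapping feature  g(f_visual)
    h_a = encoder(f_audio)                   # auditory mapping feature h(f_audio)
    f_audio_shared = audio_decoder(g_v)      # f'_audio = H(g(f_visual)), formula (1)
    f_visual_shared = visual_decoder(h_a)    # f'_visual = G(h(f_audio)), formula (2)
    # formula (3): concatenate original and shared (compensation) features
    return torch.cat([f_visual, f_visual_shared, f_audio, f_audio_shared], dim=-1)
```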
It should be noted that when the input is visual and auditory with semantic consistency, the error of the self-coding network includes two parts. One is the error of the acoustic decoder and the other is the error of the visual decoder, and the sum of the two is the total error. The error may be propagated backwards to update the weights from the coding network.
In particular, the visual and auditory information of the same video can have time-axis deviations and be semantically inconsistent, which poses a challenge to audio-visual information fusion. To solve this problem, the invention proposes a new "semantic mapping" label that describes whether the audio and visual data of the same video contain the same semantic information. For example, video data containing blood, physical violence, etc. is considered a visual anomaly, and sound containing screaming and crying is considered an acoustic anomaly. The audio and video data are marked separately to prevent interference. If the visual semantic tag of the video is the same as the audio semantic tag, the audio and video are considered to have semantic correspondence and L_corr = 1; otherwise there is no semantic correspondence and L_corr = -1. Semantic labeling provides a metric for constructing the shared subspace from the different modal features.
The method for calculating the self-coding network error by introducing the semantic tag comprises the following steps: marking the visual features and the auditory features input into the self-coding network by adopting semantic mapping labels, wherein the semantic mapping labels are characterized by the marking labels of the visual input information and the auditory input information which describe the same semantic content; when the visual features or the auditory features input into the self-coding network have semantic mapping labels, the loss function is the algebraic sum of the auditory average error value and the visual average error value;
when the visual feature or the auditory feature input from the coding network does not have a semantic mapping label, the loss function is the difference value between 1 and the algebraic sum of the auditory average error value and the visual average error value;
the auditory mean error value is characterized as the average of the absolute differences of all the auditory features and all the auditory shared features, and the visual mean error value is characterized as the average of the absolute differences of all the visual features and all the visual shared features.
Wherein the loss function is derived from the following equation:
$$
y_{\mathrm{autocoder}} =
\begin{cases}
\dfrac{1}{N}\displaystyle\sum_{i=1}^{N}\Bigl(\bigl|f_{\mathrm{audio}}-f'_{\mathrm{audio}}\bigr| + \bigl|f_{\mathrm{visual}}-f'_{\mathrm{visual}}\bigr|\Bigr), & L_{\mathrm{corr}}=1\\[3mm]
1-\dfrac{1}{N}\displaystyle\sum_{i=1}^{N}\Bigl(\bigl|f_{\mathrm{audio}}-f'_{\mathrm{audio}}\bigr| + \bigl|f_{\mathrm{visual}}-f'_{\mathrm{visual}}\bigr|\Bigr), & L_{\mathrm{corr}}=-1
\end{cases}
$$
where y_autocoder is the loss function, N is the number of features, f_audio is the auditory feature, f'_audio is the auditory shared feature, f_visual is the visual feature, f'_visual is the visual shared feature, L_corr = 1 indicates that a semantic mapping label is present, and L_corr = -1 indicates that no semantic mapping label is present.
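A direct reading of this loss could be implemented as in the following sketch (PyTorch assumed); taking the mean over all elements is one interpretation of the "average of the absolute differences".

```python
# Sketch of the self-coding loss y_autocoder with the semantic mapping label L_corr.
import torch

def autocoder_loss(f_audio, f_audio_shared, f_visual, f_visual_shared, l_corr):
    """l_corr = 1 when the semantic mapping label is present, -1 when it is absent."""
    audio_err = torch.mean(torch.abs(f_audio - f_audio_shared))     # auditory average error
    visual_err = torch.mean(torch.abs(f_visual - f_visual_shared))  # visual average error
    total = audio_err + visual_err
    return total if l_corr == 1 else 1.0 - total
```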
Therefore, a new loss function is designed so that the model can learn the deviation information of the semantic mapping on the time axis. By introducing the semantic label into the calculation of the loss function, the scheme reduces the interference of blindly spliced features, enhances the model's ability to discriminate the semantic correspondence of abnormal videos, and eliminates the interference among non-corresponding features. In addition, such semantic embedding learning can be regarded as a form of regularization, which helps to enhance the generalization ability of the model and prevent overfitting.
Specifically, the semantic mapping label is obtained as follows: semantic marking is performed on the acoustic anomaly information of the auditory input information and on the visual anomaly information of the visual input information respectively, and if both the auditory input information and the visual input information are judged to carry the semantic marking, the semantic mapping label is allocated to that group of auditory input information and visual input information.
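Phrased as a small helper, the rule above reduces to the following sketch; the boolean anomaly marks are assumed to come from the separate acoustic and visual annotation steps.

```python
# Sketch of deciding whether a semantic mapping label is assigned to one
# auditory/visual input pair, following the rule in the paragraph above.
def has_semantic_mapping_label(audio_marked: bool, visual_marked: bool) -> bool:
    """True when both the acoustic and the visual anomaly information carry a semantic mark."""
    return audio_marked and visual_marked
```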
It should be noted that, in order for the visual features and the auditory features to represent the same semantics, the visual information and the auditory information need to be represented under the same measurement. In this solution, time is used as the unified measurement: the original audio waveform is mapped to a two-dimensional plane, that is, the x-axis of the sound data is time and the y-axis is the waveform amplitude. The scheme takes the pictures of a monitoring video as the original video data, which comprise about 30-60 frames of images per second, so the x-axis of the visual data is also time and the y-axis is the image frame.
When the visual data and the auditory data are measured at the same time, the visual data and the auditory data can be aligned on a time axis according to whether the semantics are the same, so that the visual characteristic and the auditory characteristic correspond to each other, and the consistency of the vision and the auditory can be realized semantically.
In addition, the method and the device also process the information of the original audio-video segment to be identified, and particularly, the method and the device collect the difference value of every two adjacent image frames from the audio-video segment to be identified to obtain a difference value sequence, and the difference value sequence is used as visual input information.
That is, this embodiment fully considers that the object of abnormal behavior recognition is behavior with violent motion, such as swinging a fist: a single frame in which a fist is in front of the opponent's chest or at the opponent's waist cannot accurately indicate that the person in the video is engaged in the behavior in question. It can be seen that the inter-frame difference of the video expresses the required information more accurately than the video frames themselves. Selecting the difference between adjacent frames of the video as the input to the network model therefore has a better feature expression effect than inputting the video frames themselves into the model.
In the scheme, the original audio waveform is taken as the auditory information to be collected, so that the step of collecting the auditory input information in the audio and video segment to be identified comprises the following steps: and acquiring an original audio waveform corresponding to the audio and video segment to be identified, and acquiring acoustic signals from the original audio waveform at a preset sampling interval to obtain auditory input information. Compared with a method adopting spectral analysis such as MFCC or LPC, the method extracts the sound based on original audio waveform, and therefore the sound and the visual features can be represented by unified measurement, namely the auditory input information is characterized by waveform data with time as a horizontal axis and acoustic signals as a vertical axis, wherein the auditory input information and the visual input information are unified in scale with a time domain.
The feature extraction network for extracting the visual and auditory features in the scheme adopts a double-branch channel, namely visual input information and auditory input information can be simultaneously input into the feature extraction network, and respectively extracted to output the visual features and the auditory features. Specifically, the feature extraction network of the dual-branch channel comprises an auditory feature extraction network and a visual feature extraction network, wherein the auditory feature extraction network comprises an AFEN module and an LSTM module, multiple frame waveforms in an original audio waveform are input into the AFEN module, corresponding multiple auditory frame level features are obtained, the multiple auditory frame level features are fused through the LSTM module, and auditory segment level features are output.
As shown in fig. 2, the abnormal behavior auditory feature extraction network structure AFEN structure includes 5 convolution layers, 3 pooling layers, and 3 full connection layers connected in sequence, and the output of the last full connection layer passes through the SoftMax layer. Wherein the pooling layer is connected behind the first convolution layer, the second convolution layer, and the fifth convolution layer, respectively.
In fig. 2, the black rectangles represent the convolutional layers, the white rectangles represent the pooling layers, the three rectangles immediately following the last pooling layer represent the three fully-connected layers, and the black rectangle after the fully-connected layers represents the LSTM structure. Owing to the continuity of abnormal behaviors on the time axis, the LSTM network is selected to process the temporal relationship between audio frame-level features and obtain segment-level features. Each convolution layer contains a ReLU activation function, which makes the activation pattern of the network sparser. Each pooling layer includes a local response normalization operation to avoid gradient vanishing and increase the training speed of the network. As can be seen from fig. 2, after a segment of acoustic signal passes through the auditory feature extraction network, a plurality of frame-level features are first extracted, and these frame-level features are then fused based on their temporal relationship to finally obtain the segment-level feature corresponding to that acoustic signal.
In the model of the invention, timing information is summarized through the LSTM network in the final stage of visual and auditory feature processing, and the method can adapt to the whole monitoring video mechanism and has no hard requirements on the length, sampling rate and the like of audio and video. Thereby solving the characteristic time axis alignment problem. On the other hand, the model also greatly reduces the complexity of fusion of visual and auditory characteristics and improves the stability of the model.
Similarly, in the present solution, the visual feature extraction network for abnormal behaviors uses the same convolutional structure as the AFEN shown in fig. 2, with a convolutional LSTM (ConvLSTM) module used instead of the final LSTM module and the original input signal changed to the difference values between image frames. The difference is that, compared with auditory feature extraction, visual features pay more attention to motion recognition in the spatio-temporal relationship; ConvLSTM captures spatio-temporal relationships better than LSTM and is suited to spatio-temporal sequence prediction problems such as video classification and motion recognition.
In conclusion, the invention designs an abnormal behavior recognition model based on the audio-visual information fusion of the self-coding network. The model structure is shown in fig. 3. The model comprises four parts, namely visual feature extraction, auditory feature extraction, a self-coding network and a full-connection identification model. Visual and acoustic feature extraction adopts a double-channel feature extraction method, a network structure adopts a deep convolution network as a basis, in the aspect of visual features, video interframe difference is adopted as original input, and section-level visual features are extracted by utilizing deep convolution plus ConvLSTM network. In terms of auditory features, the audio waveform is used as a network input, and segment-level auditory features are extracted by deep convolution plus an LSTM network. Then, a shared semantic subspace is constructed by using a self-coding network shown in fig. 4, the semantic deviation of visual and auditory characteristics is eliminated, and the combination of the visual and auditory characteristics is realized by adopting a CONCAT method; and finally, identifying the abnormal behavior by using a full-connection model. Therefore, the target behavior recognition method based on the self-coding network realizes that the auditory characteristics are fused into the visual characteristics to supplement the visual information and the visual characteristics are fused into the auditory characteristics to supplement the auditory characteristics, thereby realizing the complementary effect of different modes, so that the fusion characteristics obtained by the characteristic fusion have richer semantic expression, and the classification result of the model for classifying the behaviors by the fusion characteristics is more accurate. Based on the scheme, the identification precision is improved, and the omission factor is reduced.
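Finally, the four parts of fig. 3 might be wired together as in the sketch below, reusing the hypothetical modules from the earlier sketches. The fully-connected recognition head, its dimensions, and the reuse of the 1-D AFEN sketch with a plain LSTM as a stand-in for the 2-D visual backbone with ConvLSTM are all assumptions made only to keep the sketch short.

```python
# Sketch tying the four parts together: visual feature extraction, auditory feature
# extraction, the self-coding network and the fully-connected recognition head.
import torch
import torch.nn as nn

class AudioVisualBehaviorModel(nn.Module):
    def __init__(self, feat_dim=512, num_classes=2):
        super().__init__()
        self.audio_afen = AFEN(feat_dim)               # auditory branch (waveform input)
        self.audio_lstm = SegmentLevelLSTM(feat_dim)   # LSTM over auditory frame features
        self.visual_backbone = AFEN(feat_dim)          # placeholder for the visual backbone;
        self.visual_lstm = SegmentLevelLSTM(feat_dim)  # stand-in for ConvLSTM in this sketch
        self.encoder = SharedEncoder(feat_dim)         # shared semantic subspace
        self.audio_decoder = DecoderBranch(out_dim=feat_dim)
        self.visual_decoder = DecoderBranch(out_dim=feat_dim)
        self.head = nn.Sequential(nn.Linear(4 * feat_dim, 256), nn.ReLU(),
                                  nn.Linear(256, num_classes))  # fully-connected recognition

    def forward(self, visual_frames, audio_frames):
        # visual_frames: (B, Tv, 1, L) flattened frame differences; audio_frames: (B, Ta, 1, L)
        f_visual = self.visual_lstm(torch.stack(
            [self.visual_backbone(x) for x in visual_frames.unbind(1)], dim=1))
        f_audio = self.audio_lstm(torch.stack(
            [self.audio_afen(x) for x in audio_frames.unbind(1)], dim=1))
        fused = fuse(self.encoder, self.audio_decoder, self.visual_decoder, f_visual, f_audio)
        return torch.softmax(self.head(fused), dim=-1)  # behavior classification probability
```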
As shown in fig. 5, the present invention provides an audio-visual feature fused target behavior recognition apparatus for recognizing a target behavior by using the audio-visual feature fused target behavior recognition method, the apparatus including:
the obtaining module 501 is configured to obtain an audio and video segment to be identified with a preset duration.
The information collecting module 502 is configured to collect visual input information and auditory input information in the audio-video segment to be identified.
The feature extraction module 503 is configured to input the visual input information and the auditory input information into a target behavior model together, where the target behavior model includes a feature extraction network of a dual-branch channel, a self-coding network, and a full link layer identification module.
And respectively extracting features from the visual input information and the auditory input information according to the feature extraction network to obtain visual features and auditory features.
And mapping the visual features and the auditory features to the same subspace by adopting the self-coding network to perform audio-visual information fusion to obtain fusion features.
And a behavior identification module 504, configured to input the fusion feature into the full-connection layer identification module for identification, so as to obtain a target behavior.
As shown in fig. 6, the electronic device according to an embodiment of the present application includes a memory 604 and a processor 602, where the memory 604 stores a computer program, and the processor 602 is configured to execute the computer program to perform the steps in any of the method embodiments described above.
Specifically, the processor 602 may include a Central Processing Unit (CPU) or an Application Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
Memory 604 may include, among other things, mass storage for data or instructions. By way of example, and not limitation, memory 604 may include a hard disk drive (HDD), a floppy disk drive, a solid state drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 604 may include removable or non-removable (or fixed) media, where appropriate. The memory 604 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 604 is a non-volatile memory. In particular embodiments, memory 604 includes read-only memory (ROM) and random access memory (RAM). The ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically rewritable ROM (EAROM), or flash memory, or a combination of two or more of these, where appropriate. The RAM may be static random access memory (SRAM) or dynamic random access memory (DRAM), where the DRAM may be fast page mode DRAM (FPM DRAM), extended data out DRAM (EDO DRAM), synchronous DRAM (SDRAM), or the like.
The memory 604 may be used to store or cache various data files that need to be processed and/or used for communication, as well as possibly computer program instructions, executed by the processor 602.
The processor 602 reads and executes the computer program instructions stored in the memory 604 to implement any of the above-described embodiments of the method for identifying an audio-visual feature-fused target behavior.
Optionally, the electronic apparatus may further include a transmission device 606 and an input/output device 608, where the transmission device 606 is connected to the processor 602, and the input/output device 608 is connected to the processor 602.
The transmitting device 606 may be used to receive or transmit data via a network. Specific examples of the network described above may include a wired or wireless network provided by a communication provider of the electronic device. In one example, the transmission device includes a Network adapter (NIC) that can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmitting device 606 can be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
The input/output device 608 is used for inputting or outputting information. In this embodiment, the input information may be an audio/video segment to be recognized, and the output information may be a target behavior recognized.
Optionally, in this embodiment, the processor 602 may be configured to execute the following steps by a computer program:
s101, obtaining the audio and video frequency band to be identified with preset time length.
S102, collecting visual input information and auditory input information in the audio-video frequency band to be identified.
S103, inputting the visual input information and the auditory input information into a target behavior model together, wherein the target behavior model comprises a feature extraction network, a self-coding network and a full connection layer identification module of a double-branch channel.
And S104, respectively extracting characteristics from the visual input information and the auditory input information according to the characteristic extraction network to obtain visual characteristics and auditory characteristics.
And S105, mapping the visual features and the auditory features to the same subspace by adopting the self-coding network to perform audio-visual information fusion to obtain fusion features.
And S106, inputting the fusion characteristics into the full-connection layer identification module for identification to obtain target behaviors.
It should be noted that, for specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiment and optional implementation manners, and details of this embodiment are not described herein again.
In general, the various embodiments may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. Some aspects of the invention may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
Embodiments of the invention may be implemented by computer software executable by a data processor of the mobile device, such as in a processor entity, or by hardware, or by a combination of software and hardware. Computer software or programs (also called program products) including software routines, applets and/or macros can be stored in any device-readable data storage medium and they include program instructions for performing particular tasks. The computer program product may comprise one or more computer-executable components configured to perform embodiments when the program is run. The one or more computer-executable components may be at least one software code or a portion thereof. Further in this regard it should be noted that any block of the logic flow as in the figures may represent a program step, or an interconnected logic circuit, block and function, or a combination of a program step and a logic circuit, block and function. The software may be stored on physical media such as memory chips or memory blocks implemented within the processor, magnetic media such as hard or floppy disks, and optical media such as, for example, DVDs and data variants thereof, CDs. The physical medium is a non-transitory medium.
It should be understood by those skilled in the art that various features of the above embodiments can be combined arbitrarily, and for the sake of brevity, all possible combinations of the features in the above embodiments are not described, but should be considered as within the scope of the present disclosure as long as there is no contradiction between the combinations of the features.
The above examples are merely illustrative of several embodiments of the present application, and the description is more specific and detailed, but not to be construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, and these are all within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (11)

1. An audio-visual feature fused target behavior identification method is characterized by comprising the following steps:
acquiring an audio-video segment to be identified with preset duration;
collecting visual input information and auditory input information in the audio and video band to be identified;
inputting the visual input information and the auditory input information into a target behavior model together, wherein the target behavior model comprises a feature extraction network, a self-coding network and a full connection layer identification module of a double-branch channel;
extracting features from the visual input information and the auditory input information respectively according to the feature extraction network to obtain visual features and auditory features;
mapping the visual features and the auditory features to the same subspace by an encoder of the self-coding network to obtain auditory mapping features corresponding to the auditory features and visual mapping features corresponding to the visual features; mapping all the visual mapping characteristics and all the auditory mapping characteristics into a multi-mode space according to a decoder of the self-coding network, wherein each mode obtains visual compensation characteristics of other mode spaces as visual sharing characteristics and obtains auditory compensation characteristics of other modes as auditory sharing characteristics; splicing the visual sharing feature, the auditory sharing feature, the visual feature and the auditory feature to obtain a fusion feature;
the self-coding network comprises an encoder and a decoder, wherein the encoder comprises a first full connection layer, a second full connection layer and an encoder layer which are sequentially connected; inputting the visual features and the auditory features into an encoder together, and outputting the visual features and the auditory features through a first full-connection layer, a second full-connection layer and an encoder layer in sequence to obtain auditory mapping features corresponding to the auditory features and visual mapping features corresponding to the visual features;
the decoder comprises two branches, wherein each branch consists of two full connection layers; one branch takes the auditory mapping features as input, and its two full connection layers map all the auditory mapping features into a multi-mode space to obtain visual compensation features corresponding to the auditory mapping features; the other branch takes the visual mapping features as input, and its two full connection layers map all the visual mapping features into the multi-mode space to obtain auditory compensation features corresponding to the visual mapping features;
and inputting the fusion feature into the full connection layer identification module for identification to obtain the target behavior.
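For orientation only, the encoder-decoder arrangement recited in claim 1 can be sketched roughly as follows, assuming a PyTorch implementation; the layer widths, the 512-dimensional feature size and the class count are illustrative assumptions rather than values taken from the patent, and sharing a single encoder across the two modalities is only one way to read "mapping ... to the same subspace".

    # Sketch of the two-branch self-coding fusion of claim 1 (hypothetical sizes).
    import torch
    import torch.nn as nn

    class SharedAutoEncoderFusion(nn.Module):
        def __init__(self, feat_dim=512, shared_dim=256, num_classes=8):
            super().__init__()
            # Encoder: first fully connected layer, second fully connected layer, encoder layer.
            self.encoder = nn.Sequential(
                nn.Linear(feat_dim, 512), nn.ReLU(),
                nn.Linear(512, 256), nn.ReLU(),
                nn.Linear(256, shared_dim),
            )
            # Decoder: two branches, each built from two fully connected layers.
            self.audio_to_visual = nn.Sequential(   # auditory mapping -> visual compensation
                nn.Linear(shared_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))
            self.visual_to_audio = nn.Sequential(   # visual mapping -> auditory compensation
                nn.Linear(shared_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))
            # Fully connected recognition module over the spliced (fused) features.
            self.classifier = nn.Linear(4 * feat_dim, num_classes)

        def forward(self, f_visual, f_audio):
            v_map = self.encoder(f_visual)                    # visual mapping features
            a_map = self.encoder(f_audio)                     # auditory mapping features
            f_visual_shared = self.audio_to_visual(a_map)     # visual sharing features
            f_audio_shared = self.visual_to_audio(v_map)      # auditory sharing features
            fused = torch.cat([f_visual_shared, f_audio_shared, f_visual, f_audio], dim=-1)
            return self.classifier(fused), f_visual_shared, f_audio_shared

The two sharing features are returned alongside the logits so that the self-coding loss of claim 2 (sketched after that claim) could be applied during training.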
2. An audiovisual-feature-fused target behavior recognition method according to claim 1, characterized in that the visual features and the auditory features input into the self-coding network are labeled with semantic mapping labels, wherein a semantic mapping label is characterized as a label of the visual input information and the auditory input information describing the same semantic content;
when the visual features or the auditory features input into the self-coding network have semantic mapping labels, the loss function is the algebraic sum of the auditory average error value and the visual average error value;
when the visual feature or the auditory feature input from the coding network does not have a semantic mapping label, the loss function is the difference value between 1 and the algebraic sum of the auditory average error value and the visual average error value;
the auditory average error value is characterized as the average of the absolute differences between all the auditory features and all the auditory sharing features, and the visual average error value is characterized as the average of the absolute differences between all the visual features and all the visual sharing features;
wherein the loss function is derived from the following equation:
$$ y_{\mathrm{autocoder}} = L_{\mathrm{corr}}\left(\frac{1}{N}\sum_{i=1}^{N}\bigl|f_{\mathrm{audio}}^{(i)}-f_{\mathrm{audio}}'^{(i)}\bigr| + \frac{1}{N}\sum_{i=1}^{N}\bigl|f_{\mathrm{visual}}^{(i)}-f_{\mathrm{visual}}'^{(i)}\bigr|\right) + \bigl(1-L_{\mathrm{corr}}\bigr)\left(1 - \frac{1}{N}\sum_{i=1}^{N}\bigl|f_{\mathrm{audio}}^{(i)}-f_{\mathrm{audio}}'^{(i)}\bigr| - \frac{1}{N}\sum_{i=1}^{N}\bigl|f_{\mathrm{visual}}^{(i)}-f_{\mathrm{visual}}'^{(i)}\bigr|\right) $$
where y_autocoder is the loss function, N is the number of features, f_audio is an auditory feature, f'_audio is the corresponding auditory sharing feature, f_visual is a visual feature, f'_visual is the corresponding visual sharing feature, L_corr = 1 denotes that a semantic mapping label is present, and L_corr = 0 denotes that no semantic mapping label is present.
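As a reading aid, the loss of claim 2 can be written as a small function; this is a sketch under the assumption that the features arrive as PyTorch tensors of equal shape.

    # Sketch of the claim 2 loss (tensors of equal shape assumed).
    import torch

    def autocoder_loss(f_audio, f_audio_shared, f_visual, f_visual_shared, l_corr):
        # Auditory / visual average error: mean absolute difference between a modality's
        # features and its sharing (compensation) features.
        audio_err = (f_audio - f_audio_shared).abs().mean()
        visual_err = (f_visual - f_visual_shared).abs().mean()
        err_sum = audio_err + visual_err
        # l_corr = 1 when the pair carries a semantic mapping label, 0 when it does not.
        return l_corr * err_sum + (1 - l_corr) * (1 - err_sum)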
3. An audiovisual-feature-fused target behavior recognition method according to claim 2, characterized in that "labeling the visual features and the auditory features input into the self-coding network with semantic mapping labels" comprises: performing semantic marking respectively on the acoustic abnormal information of the auditory input information and on the visual abnormal information of the visual input information, and, if the acoustic abnormal information and the visual abnormal information are judged to carry the same semantic marking, assigning the semantic mapping label to the auditory input information and the visual input information.
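A hypothetical labelling helper illustrating claim 3; the anomaly tags (for example strings such as "fight") are an assumed annotation format, not part of the patent.

    # Hypothetical labelling helper for claim 3 (the tag vocabulary is assumed).
    def semantic_mapping_label(audio_anomaly_tag, visual_anomaly_tag):
        # Assign the label only when both modalities were marked with the same semantic content.
        if audio_anomaly_tag is not None and audio_anomaly_tag == visual_anomaly_tag:
            return 1
        return 0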
4. An audio-visual feature-fused target behavior recognition method according to claim 1, wherein the step of collecting visual input information in the audio-visual segment to be recognized comprises the steps of:
and acquiring the difference value of every two adjacent image frames from the audio and video segment to be identified to obtain a difference value sequence, and taking the difference value sequence as visual input information.
5. The audio-visual feature-fused target behavior recognition method according to claim 1, wherein the step of collecting the audio input information in the audio-visual segment to be recognized comprises:
and acquiring an original audio waveform corresponding to the audio and video segment to be identified, and acquiring acoustic signals from the original audio waveform at a preset sampling interval to obtain auditory input information.
6. An audiovisual feature fused target behavior recognition method according to claim 5, characterized in that the auditory input information is characterized by waveform data with time as horizontal axis and acoustic signal as vertical axis, wherein the auditory input information and visual input information are unified in scale in time domain.
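A rough sketch of claims 5 and 6, assuming the raw waveform has already been separated from the clip (for example with ffmpeg); the number of samples per video frame and the uniform sampling interval are assumptions used only to keep the auditory input and the frame-difference sequence on the same time scale.

    # Sketch of claims 5-6: sample the raw waveform at a preset interval so that the auditory
    # input covers the clip on the same time scale as the frame-difference sequence.
    import numpy as np

    def sample_waveform(waveform, num_video_frames, samples_per_frame=400):
        """waveform: 1-D acoustic signal; returns (num_video_frames, samples_per_frame) blocks."""
        needed = num_video_frames * samples_per_frame
        step = max(1, len(waveform) // needed)          # preset (uniform) sampling interval
        sampled = np.asarray(waveform, dtype=np.float32)[::step][:needed]
        sampled = np.pad(sampled, (0, needed - len(sampled)))
        return sampled.reshape(num_video_frames, samples_per_frame)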
7. The audio-visual feature-fused target behavior recognition method according to claim 1, wherein the feature extraction networks of the dual-branch channel comprise an auditory feature extraction network and a visual feature extraction network,
wherein the auditory feature extraction network comprises an AFEN module and an LSTM module: multi-frame waveforms taken from the original audio waveform are input into the AFEN module to obtain a plurality of corresponding auditory frame-level features, and the plurality of auditory frame-level features are fused by the LSTM module to output an auditory segment-level feature;
the AFEN module comprises 5 convolutional layers, 3 pooling layers and 3 fully connected layers connected in sequence, wherein a pooling layer is connected behind the first, the second and the fifth convolutional layer respectively, each convolutional layer comprises a ReLU activation function which keeps the activation pattern of the AFEN module sparse, and each pooling layer comprises a local response normalization operation which avoids the vanishing gradient problem.
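The AFEN-plus-LSTM auditory branch of claim 7 might look roughly like the following; the use of 1-D convolutions over waveform frames, the channel counts and the kernel sizes are assumptions, and only the layer ordering (five convolutional layers, pooling with local response normalization after the first, second and fifth, three fully connected layers, a ReLU per convolution) follows the claim.

    # Sketch of the AFEN + LSTM auditory branch of claim 7 (1-D layout, channel counts and
    # kernel sizes assumed; only the layer ordering follows the claim).
    import torch
    import torch.nn as nn

    def _conv(in_ch, out_ch, k):
        # Every convolutional layer carries a ReLU activation, as recited in the claim.
        return nn.Sequential(nn.Conv1d(in_ch, out_ch, k, padding=k // 2), nn.ReLU())

    def _pool():
        # Every pooling stage carries a local response normalization, as recited in the claim.
        return nn.Sequential(nn.MaxPool1d(2), nn.LocalResponseNorm(2))

    class AFEN(nn.Module):
        def __init__(self, frame_len=400, feat_dim=512):
            super().__init__()
            self.features = nn.Sequential(
                _conv(1, 64, 7), _pool(),      # conv1 + pool1
                _conv(64, 128, 5), _pool(),    # conv2 + pool2
                _conv(128, 256, 3),            # conv3
                _conv(256, 256, 3),            # conv4
                _conv(256, 256, 3), _pool(),   # conv5 + pool3
            )
            self.fc = nn.Sequential(           # three fully connected layers
                nn.Linear(256 * (frame_len // 8), 1024), nn.ReLU(),
                nn.Linear(1024, 1024), nn.ReLU(),
                nn.Linear(1024, feat_dim),
            )

        def forward(self, frames):             # frames: (batch, num_frames, frame_len)
            b, t, l = frames.shape
            x = self.features(frames.reshape(b * t, 1, l))
            return self.fc(x.flatten(1)).reshape(b, t, -1)   # auditory frame-level features

    class AuditoryBranch(nn.Module):
        def __init__(self, feat_dim=512):
            super().__init__()
            self.afen = AFEN(feat_dim=feat_dim)
            self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)

        def forward(self, frames):
            frame_feats = self.afen(frames)
            _, (h_n, _) = self.lstm(frame_feats)             # fuse frame-level features over time
            return h_n[-1]                                   # auditory segment-level feature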
8. An audiovisual feature fused target behavior identification method according to claim 7, wherein the visual feature extraction network and the auditory feature extraction network share an AFEN module, and a ConvLSTM module is used in the visual feature extraction network to replace an LSTM module to fuse a plurality of visual frame-level features output by the AFEN module and output a visual segment-level feature.
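Claim 8 replaces the LSTM with a ConvLSTM on the visual branch. PyTorch has no built-in ConvLSTM, so the cell below is a standard hand-rolled sketch with an assumed kernel size and hidden-channel count; running it over the per-frame feature maps and taking the final hidden map plays the same fusing role that the LSTM plays on the auditory branch.

    # Hypothetical ConvLSTM cell for the visual branch of claim 8
    # (PyTorch has no built-in ConvLSTM; kernel size and hidden channels assumed).
    import torch
    import torch.nn as nn

    class ConvLSTMCell(nn.Module):
        def __init__(self, in_ch, hid_ch, kernel=3):
            super().__init__()
            # One convolution produces the input, forget, output and candidate gates at once.
            self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel, padding=kernel // 2)

        def forward(self, x, state):
            h, c = state                                   # hidden / cell maps: (B, hid_ch, H, W)
            i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
            i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
            c = f * c + i * torch.tanh(g)                  # convolutional cell update
            h = o * torch.tanh(c)
            return h, c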
9. An audio-visual feature fused target behavior recognition device, comprising:
the acquisition module is used for acquiring the audio and video segment to be identified with preset duration;
the information acquisition module is used for acquiring visual input information and auditory input information in the audio-video segment to be identified;
the characteristic extraction module is used for inputting the visual input information and the auditory input information into a target behavior model together, wherein the target behavior model comprises a characteristic extraction network, a self-coding network and a full connection layer identification module of a double-branch channel;
extracting features from the visual input information and the auditory input information respectively according to the feature extraction network to obtain visual features and auditory features;
mapping the visual features and the auditory features to the same subspace by an encoder of the self-coding network to obtain auditory mapping features corresponding to the auditory features and visual mapping features corresponding to the visual features; mapping all the visual mapping features and all the auditory mapping features into a multi-mode space by a decoder of the self-coding network, wherein each mode obtains visual compensation features from the other mode's space as visual sharing features and obtains auditory compensation features from the other mode as auditory sharing features; and splicing the visual sharing features, the auditory sharing features, the visual features and the auditory features to obtain a fusion feature;
the self-coding network comprises an encoder and a decoder, wherein the encoder comprises a first full connection layer, a second full connection layer and an encoder layer which are sequentially connected; inputting the visual features and the auditory features into an encoder together, and outputting the visual features and the auditory features sequentially through a first full-connection layer, a second full-connection layer and an encoder layer to obtain auditory mapping features corresponding to the auditory features and visual mapping features corresponding to the visual features;
the decoder comprises two branches, and each branch consists of two full connection layers; one branch takes the auditory mapping features as input, and its two full connection layers map all the auditory mapping features into a multi-mode space to obtain visual compensation features corresponding to the auditory mapping features; the other branch takes the visual mapping features as input, and its two full connection layers map all the visual mapping features into the multi-mode space to obtain auditory compensation features corresponding to the visual mapping features;
and the behavior identification module is used for inputting the fusion feature into the full connection layer identification module for identification to obtain the target behavior.
10. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the computer program to perform the method for audiovisual feature fusion target behavior recognition according to any of claims 1-8.
11. A readable storage medium, characterized in that a computer program is stored therein, the computer program comprising program code for controlling a process to execute a process, the process comprising the method for audiovisual feature fusion target behavior recognition according to any one of claims 1 to 8.
CN202210496197.7A 2022-05-09 2022-05-09 Audio-visual feature fusion target behavior identification method and device and application Active CN114581749B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210496197.7A CN114581749B (en) 2022-05-09 2022-05-09 Audio-visual feature fusion target behavior identification method and device and application
PCT/CN2022/141314 WO2023216609A1 (en) 2022-05-09 2022-12-23 Target behavior recognition method and apparatus based on visual-audio feature fusion, and application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210496197.7A CN114581749B (en) 2022-05-09 2022-05-09 Audio-visual feature fusion target behavior identification method and device and application

Publications (2)

Publication Number Publication Date
CN114581749A CN114581749A (en) 2022-06-03
CN114581749B true CN114581749B (en) 2022-07-26

Family

ID=81768993

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210496197.7A Active CN114581749B (en) 2022-05-09 2022-05-09 Audio-visual feature fusion target behavior identification method and device and application

Country Status (2)

Country Link
CN (1) CN114581749B (en)
WO (1) WO2023216609A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114581749B (en) * 2022-05-09 2022-07-26 城云科技(中国)有限公司 Audio-visual feature fusion target behavior identification method and device and application

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9697833B2 (en) * 2015-08-25 2017-07-04 Nuance Communications, Inc. Audio-visual speech recognition with scattering operators
US20190341025A1 (en) * 2018-04-18 2019-11-07 Sony Interactive Entertainment Inc. Integrated understanding of user characteristics by multimodal processing
CN109961789B (en) * 2019-04-30 2023-12-01 张玄武 Service equipment based on video and voice interaction
US11244696B2 (en) * 2019-11-06 2022-02-08 Microsoft Technology Licensing, Llc Audio-visual speech enhancement
CN111460889B (en) * 2020-02-27 2023-10-31 平安科技(深圳)有限公司 Abnormal behavior recognition method, device and equipment based on voice and image characteristics
CN111461235B (en) * 2020-03-31 2021-07-16 合肥工业大学 Audio and video data processing method and system, electronic equipment and storage medium
CN111754992B (en) * 2020-06-30 2022-10-18 山东大学 Noise robust audio/video bimodal speech recognition method and system
CN114581749B (en) * 2022-05-09 2022-07-26 城云科技(中国)有限公司 Audio-visual feature fusion target behavior identification method and device and application

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102098492A (en) * 2009-12-11 2011-06-15 上海弘视通信技术有限公司 Audio and video conjoint analysis-based fighting detection system and detection method thereof
CN103854014A (en) * 2014-02-25 2014-06-11 中国科学院自动化研究所 Terror video identification method and device based on sparse representation of context
CN108200483A (en) * 2017-12-26 2018-06-22 中国科学院自动化研究所 Dynamically multi-modal video presentation generation method
CN109344781A (en) * 2018-10-11 2019-02-15 上海极链网络科技有限公司 Expression recognition method in a kind of video based on audio visual union feature
CN109509484A (en) * 2018-12-25 2019-03-22 科大讯飞股份有限公司 A kind of prediction technique and device of baby crying reason
CN111640424A (en) * 2019-03-01 2020-09-08 北京搜狗科技发展有限公司 Voice recognition method and device and electronic equipment
CN112328830A (en) * 2019-08-05 2021-02-05 Tcl集团股份有限公司 Information positioning method based on deep learning and related equipment
CN110647804A (en) * 2019-08-09 2020-01-03 中国传媒大学 Violent video identification method, computer system and storage medium
CN112287893A (en) * 2020-11-25 2021-01-29 广东技术师范大学 Sow lactation behavior identification method based on audio and video information fusion
CN112866586A (en) * 2021-01-04 2021-05-28 北京中科闻歌科技股份有限公司 Video synthesis method, device, equipment and storage medium
CN113255556A (en) * 2021-06-07 2021-08-13 斑马网络技术有限公司 Multi-mode voice endpoint detection method and device, vehicle-mounted terminal and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition; George Sterpu et al.; arXiv; 2019-05-02; pp. 1-6 *
End-to-End Bloody Video Recognition by Audio-Visual Feature Fusion; Congcong Hou et al.; PRCV 2018; 2018-12-31; pp. 501-510 *
Should we hard-code the recurrence concept or learn it instead? Exploring the Transformer architecture for Audio-Visual Speech Recognition; George Sterpu et al.; arXiv; 2020-05-20; pp. 1-4 *
A multi-task learning based cross-modal video sentiment analysis method; Miao Yuqing et al.; Computer Engineering and Applications; 2022-04-24; pp. 1-8 *
A model-level fusion dimensional emotion recognition method based on a multi-head attention mechanism; Dong Yongfeng et al.; Journal of Signal Processing; 2021-05; Vol. 37, No. 5; pp. 885-892 *

Also Published As

Publication number Publication date
CN114581749A (en) 2022-06-03
WO2023216609A1 (en) 2023-11-16

Similar Documents

Publication Publication Date Title
CN110175527B (en) Pedestrian re-identification method and device, computer equipment and readable medium
CN111027378B (en) Pedestrian re-identification method, device, terminal and storage medium
CN112016500A (en) Group abnormal behavior identification method and system based on multi-scale time information fusion
CN110718235B (en) Abnormal sound detection method, electronic device and storage medium
Masurekar et al. Real time object detection using YOLOv3
CN109360584A (en) Cough monitoring method and device based on deep learning
CN112580523A (en) Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium
CN113283403B (en) Counterfeited face video detection method based on counterstudy
CN114581749B (en) Audio-visual feature fusion target behavior identification method and device and application
CN112149615A (en) Face living body detection method, device, medium and electronic equipment
CN110807117B (en) User relation prediction method and device and computer readable storage medium
CN111814588B (en) Behavior detection method, related equipment and device
CN116386081A (en) Pedestrian detection method and system based on multi-mode images
CN113689382B (en) Tumor postoperative survival prediction method and system based on medical images and pathological images
CN114120242A (en) Monitoring video behavior analysis method, system and terminal based on time sequence characteristics
CN114898416A (en) Face recognition method and device, electronic equipment and readable storage medium
CN114359618A (en) Training method of neural network model, electronic equipment and computer program product
CN111488887B (en) Image processing method and device based on artificial intelligence
CN112967733A (en) Method and device for intelligently identifying crying category of baby
CN111814738A (en) Human face recognition method, human face recognition device, computer equipment and medium based on artificial intelligence
CN107958434A (en) Intelligence nurse method, apparatus, electronic equipment and storage medium
CN115620242B (en) Multi-line human target re-identification method, device and application
CN115203471B (en) Attention mechanism-based multimode fusion video recommendation method
CN116416678A (en) Method for realizing motion capture and intelligent judgment by using artificial intelligence technology
CN113114986B (en) Early warning method based on picture and sound synchronization and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant