WO2023216609A1 - Target behavior recognition method and apparatus based on audio-visual feature fusion, and application - Google Patents

Target behavior recognition method and apparatus based on audio-visual feature fusion, and application

Info

Publication number
WO2023216609A1
WO2023216609A1 PCT/CN2022/141314 CN2022141314W WO2023216609A1 WO 2023216609 A1 WO2023216609 A1 WO 2023216609A1 CN 2022141314 W CN2022141314 W CN 2022141314W WO 2023216609 A1 WO2023216609 A1 WO 2023216609A1
Authority
WO
WIPO (PCT)
Prior art keywords
visual
features
auditory
feature
audio
Prior art date
Application number
PCT/CN2022/141314
Other languages
English (en)
Chinese (zh)
Inventor
毛云青
王国梁
齐韬
陈思瑶
葛俊
Original Assignee
城云科技(中国)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 城云科技(中国)有限公司
Publication of WO2023216609A1

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • The present application relates to the field of intelligent security technology, and in particular to a target behavior recognition method, device and application based on audio-visual feature fusion.
  • Existing methods for judging fights with artificial intelligence algorithms include classifying behavior from detected images, or detecting the positions of human-body keypoints across multiple frames and judging the behavior from their motion.
  • For example, the publications CN112733629A and CN111401296A only disclose the use of image information to determine abnormal behavior.
  • Such algorithms tend to identify large-movement labor operations, such as several people cleaning, or group physical exercise, such as playing ball, as fighting behaviors; in addition, because the usual algorithms make the judgment from image information alone, their accuracy still needs to be improved.
  • Semantic consistency is of great significance in multi-modal information fusion, especially visual and auditory information fusion.
  • Only when the visual and auditory information are semantically consistent is the information complementary; otherwise the two modalities interfere with each other, as in the well-known "McGurk effect".
  • Human hearing is clearly affected by vision, which can lead to mishearing: when a sound does not match the visual signal, people perceive a third, different sound. Simply fusing sound and video signals may therefore produce the opposite of the intended effect.
  • Embodiments of this application provide a target behavior recognition method, device and application for audio-visual feature fusion.
  • This solution uses a feature-level fusion method to fuse audio-visual information, which can improve the accuracy of abnormal behavior recognition.
  • embodiments of the present application provide a target behavior recognition method using audio-visual feature fusion.
  • The method includes: obtaining an audio-video segment of a preset duration to be recognized; collecting visual input information and auditory input information from the segment; and inputting both together into the target behavior model, where the target behavior model includes a dual-branch channel feature extraction network, an autoencoding network and a fully connected layer recognition module. The feature extraction network extracts features from the visual input information and the auditory input information respectively to obtain visual features and auditory features; the autoencoding network maps the visual features and the auditory features into the same subspace, where the audio-visual information is fused to obtain fusion features; and the fusion features are input into the fully connected layer recognition module for recognition to obtain the target behavior.
  • "using the auto-encoding network to map the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fusion features” includes: using the encoder of the auto-encoding network The visual features and the auditory features are mapped to the same subspace to obtain the auditory mapping features corresponding to the auditory features and the visual mapping features corresponding to the visual features; according to the decoder of the autoencoding network, all the visual mapping features and All the auditory mapping features are mapped into a multi-modal space, and each modality obtains the visual compensation features of other modal spaces as visual shared features and the auditory compensation features of other modalities as auditory shared features; splicing the visual The shared features, the auditory shared features, the visual features and the auditory features are used to obtain fused features.
  • The autoencoding network includes an encoder and a decoder, where the encoder includes a first fully connected layer, a second fully connected layer and an encoder layer connected in sequence; the visual features and the auditory features are jointly input into the encoder and pass through the first fully connected layer, the second fully connected layer and the encoder layer in sequence to obtain the auditory mapping features corresponding to the auditory features and the visual mapping features corresponding to the visual features. The decoder includes two branches, each consisting of two fully connected layers: one branch takes the auditory mapping features as input, and its two fully connected layers map all auditory mapping features into the multi-modal space to obtain the visual compensation features corresponding to the auditory mapping features;
  • the other branch takes the visual mapping features as input, and its two fully connected layers map all visual mapping features into the multi-modal space to obtain the auditory compensation features corresponding to the visual mapping features.
  • The visual features and the auditory features input to the autoencoding network are marked with semantic mapping labels, where a semantic mapping label marks visual input information and auditory input information that describe the same semantic content. When the visual features and auditory features input to the autoencoding network carry a semantic mapping label, the loss function is the algebraic sum of the auditory average error value and the visual average error value; when they carry no semantic mapping label, the loss function is the difference between 1 and that algebraic sum. The auditory average error value is the average of the absolute differences between all auditory features and all auditory shared features, and the visual average error value is the average of the absolute differences between all visual features and all visual shared features.
  • the loss function is obtained by the following formula:
  • y_autocoder is the loss function
  • N is the number of features
  • f_audio is the auditory feature
  • f'_audio is the auditory shared feature
  • f_visual is the visual feature
  • f'_visual is the visual shared feature
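The formula referred to above is not reproduced in this text. A plausible reconstruction from the verbal description of the loss (the sum of the auditory and visual mean absolute errors, or that sum subtracted from 1 when no semantic mapping label is present) would be:

```latex
y_{\mathrm{autocoder}} =
\begin{cases}
\dfrac{1}{N}\sum_{i=1}^{N}\bigl|f_{\mathrm{audio},i}-f'_{\mathrm{audio},i}\bigr|
 + \dfrac{1}{N}\sum_{i=1}^{N}\bigl|f_{\mathrm{visual},i}-f'_{\mathrm{visual},i}\bigr|,
 & \text{with semantic mapping label}\\[2ex]
1-\left(\dfrac{1}{N}\sum_{i=1}^{N}\bigl|f_{\mathrm{audio},i}-f'_{\mathrm{audio},i}\bigr|
 + \dfrac{1}{N}\sum_{i=1}^{N}\bigl|f_{\mathrm{visual},i}-f'_{\mathrm{visual},i}\bigr|\right),
 & \text{without semantic mapping label}
\end{cases}
```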
  • Labeling the visual features and the auditory features input to the autoencoding network with semantic mapping labels includes: separately attaching semantic tags to the acoustic anomaly information of the auditory input information and to the visual anomaly information of the visual input information; if it is determined that both the auditory input information and the visual anomaly information carry the semantic tag, the semantic mapping label is assigned to the pair formed by the auditory input information and the visual anomaly information.
  • "collecting visual input information in the audio and video segment to be recognized” includes: collecting the difference between every two adjacent image frames from the audio and video segment to be recognized to obtain a difference sequence, The difference sequence is used as visual input information.
  • "collecting auditory input information in the audio and video segments to be recognized” includes: obtaining the original audio waveform corresponding to the audio and video segment to be recognized, and sampling from the original audio waveform at a preset sampling interval. Acoustic signals are collected to obtain auditory input information.
  • the auditory input information is represented as waveform data with time as the horizontal axis and acoustic signal as the vertical axis, wherein the auditory input information and the visual input information use the time domain as a unified scale.
  • The dual-branch channel feature extraction network includes an auditory feature extraction network and a visual feature extraction network. The auditory feature extraction network includes an AFEN module and an LSTM module: the multi-frame waveforms in the original audio waveform are input into the AFEN module to obtain multiple corresponding auditory frame-level features, and the multiple auditory frame-level features are fused by the LSTM module to output auditory segment-level features.
  • The AFEN network includes 5 convolutional layers, 3 pooling layers and 3 sequentially connected fully connected layers, where a pooling layer is connected after the first, second and fifth convolutional layers respectively.
  • Each convolutional layer includes a ReLU activation function, which makes the activation pattern of the AFEN module sparser, and each pooling layer includes a local response normalization operation to avoid gradient vanishing.
  • The visual feature extraction network shares the AFEN module with the auditory feature extraction network, and a ConvLSTM module is used instead of the LSTM module to fuse the multiple visual frame-level features output by the AFEN module and output visual segment-level features.
  • Embodiments of the present application provide a target behavior recognition device with audio-visual feature fusion, including: an acquisition module for acquiring the audio-video segment to be recognized of a preset duration; an information collection module for collecting visual input information and auditory input information from the audio-video segment to be recognized;
  • a feature extraction module for inputting the visual input information and the auditory input information together into the target behavior model, wherein the target behavior model includes a dual-branch channel feature extraction network, an autoencoding network and a fully connected layer recognition module, for extracting features from the visual input information and the auditory input information respectively with the feature extraction network to obtain visual features and auditory features, and for using the autoencoding network to map the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fused features; and a behavior recognition module for inputting the fused features into the fully connected layer recognition module for recognition to obtain the target behavior.
  • embodiments of the present application provide an electronic device, including a memory and a processor.
  • A computer program is stored in the memory, and the processor is configured to run the computer program to execute the target behavior recognition method of audio-visual feature fusion according to any one of the first aspects.
  • Embodiments of the present application provide a readable storage medium in which a computer program is stored, the computer program including program code for controlling a process to execute a process, and the process includes the target behavior recognition method of audio-visual feature fusion according to any one of the first aspects.
  • This solution uses an autoencoding network to represent the shared semantic subspace mapping.
  • Vision and hearing use time as a unified measure, so visual and auditory information that represent the same semantics can be mapped to each other, thereby capturing the complementary information and high-level semantics between the different modalities and realizing feature fusion at the semantic level.
  • This solution designs a dual-branch channel feature extraction network.
  • the LSTM network is selected to process the temporal relationship between audio frame-level features, thereby obtaining auditory segment-level features;
  • The ConvLSTM network is selected to process the temporal relationships between video frame-level features and thereby obtain visual segment-level features; the feature heterogeneity between the visual segment-level features and the auditory segment-level features is then eliminated in the autoencoding network to achieve feature fusion.
  • This solution uses video inter-frame difference information to extract visual features, which can better reflect the characteristics of abnormal behavior.
  • The sound features are extracted from the original audio waveform rather than with spectrum-analysis-based methods such as MFCC or LPC, so audio and video can be unified on a time basis: the images and the waveform are sampled separately at intervals and unified into the time domain, which solves the problem of inconsistent audio and video feature processing during the audio-visual information fusion process.
  • Figure 1 is a flow chart of the main steps of a target behavior recognition method based on audio-visual feature fusion according to the first embodiment of the present application.
  • Figure 2 is a schematic diagram of the abnormal behavior auditory feature extraction network structure.
  • Figure 3 is a schematic diagram of the abnormal behavior recognition network structure based on autoencoding network mapping audio-visual feature fusion.
  • Figure 4 is a schematic structural diagram of the autoencoding network.
  • Figure 5 is a structural block diagram of a target behavior recognition device for audio-visual feature fusion according to the second embodiment of the present application.
  • FIG. 6 is a schematic diagram of the hardware structure of an electronic device according to the third embodiment of the present application.
  • the steps of the corresponding method are not necessarily performed in the order shown and described in this specification.
  • methods may include more or fewer steps than described in this specification.
  • A single step described in this specification may be broken down into multiple steps for description in other embodiments, and multiple steps described in this specification may also be combined into a single step for description in other embodiments.
  • Embodiments of the present application provide a target behavior recognition method based on audio-visual feature fusion, which uses the audio-visual feature fusion target behavior recognition scheme to make judgments on abnormal event behaviors.
  • Figure 1 is a flow chart of the main steps of a target behavior recognition method based on audio-visual feature fusion according to the first embodiment of the present application.
  • the target behavior recognition method of audio-visual feature fusion mainly includes the following steps 101 to 106.
  • Step 101 Obtain the audio and video segments to be identified of a preset duration.
  • Step 102 Collect visual input information and auditory input information in the audio and video segments to be recognized.
  • Step 103 Input the visual input information and the auditory input information together into the target behavior model, where the target behavior model includes a dual-branch channel feature extraction network, an autoencoding network and a fully connected layer recognition module.
  • Step 104 Extract features from the visual input information and the auditory input information respectively according to the feature extraction network to obtain visual features and auditory features.
  • Step 105 Use the autoencoding network to map the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fusion features.
  • Step 106 Input the fused features into the fully connected layer recognition module for recognition to obtain the target behavior.
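As a reading aid, the following is a minimal, self-contained sketch of how steps 101 to 106 could be wired together in PyTorch. The module names, feature dimensions, the two-class output and the toy stand-ins for the dual-branch extractor and the auto-encoding fusion are illustrative assumptions, not the implementation disclosed in this application; more detailed sketches of the individual parts follow further below.

```python
import torch
import torch.nn as nn

class ToyBranch(nn.Module):
    """Stand-in for one branch of the dual-branch feature extraction network."""
    def __init__(self, in_dim, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
    def forward(self, x):
        return self.net(x)

class ToyFusion(nn.Module):
    """Stand-in for the auto-encoding fusion; here it only concatenates the features."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.fused_dim = 2 * feat_dim
    def forward(self, f_visual, f_audio):
        return torch.cat([f_visual, f_audio], dim=-1)

class TargetBehaviorModel(nn.Module):
    def __init__(self, visual_branch, audio_branch, fusion, num_classes=2):
        super().__init__()
        self.visual_branch = visual_branch
        self.audio_branch = audio_branch
        self.fusion = fusion
        self.classifier = nn.Sequential(                 # fully connected recognition module
            nn.Linear(fusion.fused_dim, 128), nn.ReLU(), nn.Linear(128, num_classes))

    def forward(self, visual_input, audio_input):
        f_visual = self.visual_branch(visual_input)      # step 104: visual features
        f_audio = self.audio_branch(audio_input)         # step 104: auditory features
        fused = self.fusion(f_visual, f_audio)           # step 105: feature-level fusion
        return self.classifier(fused)                    # step 106: target behavior logits

# steps 101-103: a preset-length clip yields visual and auditory inputs fed in together
model = TargetBehaviorModel(ToyBranch(in_dim=1024), ToyBranch(in_dim=800), ToyFusion())
logits = model(torch.randn(4, 1024), torch.randn(4, 800))
```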
  • In this solution, not only is visual input information used, but auditory input information is also collected; both are used as inputs to the behavior model, and the model outputs the target behavior classification result.
  • Because this solution uses sound features, the information in the sound features and the image features can complement each other to judge the target behavior more accurately, so it has a better feature expression effect.
  • visual input information and auditory input information are input into the model for processing at the same time.
  • The advantage of parallel processing is that feature-level fusion can be performed in the feature extraction stage to better capture the relationships between the modalities. Compared with decision-level fusion methods that separately recognize visual and auditory feature information and then fuse the results, this solution can consider the consistency between vision and hearing, so the multi-modal feature information can complement itself and achieve better performance.
  • the target behavior in this scheme can be normal behavior or abnormal behavior.
  • If the input is visual samples and auditory samples together with normal behaviors annotated from those samples,
  • the trained model will identify normal behaviors from the visual input information and auditory input information.
  • If the input to the model is labeled samples that annotate abnormal behaviors,
  • the trained model will identify abnormal behaviors from the visual input information and auditory input information.
  • For example, visual samples and auditory samples describing violent scenes, together with the violence-related features annotated from those samples, are used as the input of the model, so that the trained model can identify violent behaviors such as fights.
  • an autoencoding network is used to achieve feature fusion.
  • The autoencoding network designed by the present invention can represent audio-visual data with the same metric and can match visual information and auditory information that represent the same semantics under that metric, that is, it achieves visual-auditory semantic consistency.
  • By mapping the visual features and the auditory features into the same subspace, the embodiment of the present invention establishes a shared semantic subspace, thereby eliminating the feature heterogeneity between the video and audio modalities and capturing the complementary information and high-level semantics between the visual and sound modalities, which achieves feature fusion at the semantic level. That is, in this solution the encoder of the auto-encoding network maps the visual features and the auditory features into the same subspace to obtain visual mapping features and auditory mapping features; the decoder of the auto-encoding network maps the visual mapping features and the auditory mapping features into the multi-modal space to obtain the compensation features of the other modality as visual shared features and auditory shared features; and the visual shared features, the auditory shared features, the visual features and the auditory features are spliced to obtain the fusion features.
  • The extracted visual feature is f_visual;
  • g() is the function that maps the visual feature f_visual into the same subspace, and inputting f_visual into g() yields the visual mapping feature g(f_visual);
  • H() is the function that maps the visual mapping feature g(f_visual) into the multi-modal space of shared semantics, and inputting g(f_visual) into H() yields the compensation feature of the auditory modality as an auditory shared feature.
  • The extracted auditory feature is f_audio;
  • h() is the function that maps the auditory feature f_audio into the same subspace, and inputting f_audio into h() yields the auditory mapping feature h(f_audio); G() is the function that maps the auditory mapping feature h(f_audio) into the multi-modal space of shared semantics, and inputting h(f_audio) into G() yields the compensation feature of the visual modality as a visual shared feature.
  • the autoencoding network includes an encoder and a decoder.
  • The encoder includes a first fully connected layer, a second fully connected layer and an encoder layer connected in sequence; the visual features and the auditory features are jointly input into the encoder and pass through the first fully connected layer, the second fully connected layer and the encoder layer in sequence to output the auditory mapping features corresponding to the auditory features and the visual mapping features corresponding to the visual features.
  • The decoder includes two branches, each consisting of two fully connected layers: one branch takes the auditory mapping features as input, and its two fully connected layers map all the auditory mapping features into the multi-modal space to obtain the visual compensation features corresponding to the auditory mapping features;
  • the other branch takes the visual mapping features as input, and its two fully connected layers map all the visual mapping features into the multi-modal space to obtain the auditory compensation features corresponding to the visual mapping features.
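A minimal sketch of the auto-encoding fusion just described, assuming PyTorch and plain linear layers for the "encoder layer" and the decoder branches; all layer sizes are illustrative and not taken from the application. The small loss helper at the end follows the verbal loss description given earlier.

```python
import torch
import torch.nn as nn

class AutoencoderFusion(nn.Module):
    """Encoder (two FC layers plus an encoder layer) shared by both modalities, and a
    decoder with two branches of two FC layers each; dimensions are assumptions."""
    def __init__(self, feat_dim=256, sub_dim=128, multi_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),   # first fully connected layer
            nn.Linear(256, 256), nn.ReLU(),        # second fully connected layer
            nn.Linear(256, sub_dim))               # encoder layer -> shared subspace
        # branch 1: auditory mapping features -> visual compensation (visual shared) features
        self.dec_audio_to_visual = nn.Sequential(
            nn.Linear(sub_dim, multi_dim), nn.ReLU(), nn.Linear(multi_dim, feat_dim))
        # branch 2: visual mapping features -> auditory compensation (auditory shared) features
        self.dec_visual_to_audio = nn.Sequential(
            nn.Linear(sub_dim, multi_dim), nn.ReLU(), nn.Linear(multi_dim, feat_dim))
        self.fused_dim = 4 * feat_dim

    def forward(self, f_visual, f_audio):
        g_visual = self.encoder(f_visual)                    # visual mapping features g(f_visual)
        h_audio = self.encoder(f_audio)                      # auditory mapping features h(f_audio)
        visual_shared = self.dec_audio_to_visual(h_audio)    # G(h(f_audio))
        audio_shared = self.dec_visual_to_audio(g_visual)    # H(g(f_visual))
        # splice original and shared features into the fused feature
        fused = torch.cat([f_visual, visual_shared, f_audio, audio_shared], dim=-1)
        return fused, visual_shared, audio_shared

def autoencoder_loss(f_audio, audio_shared, f_visual, visual_shared, has_semantic_label):
    """Mean absolute errors summed, or subtracted from 1 when no semantic mapping label."""
    err = (f_audio - audio_shared).abs().mean() + (f_visual - visual_shared).abs().mean()
    return err if has_semantic_label else 1.0 - err

fusion = AutoencoderFusion()
f_visual, f_audio = torch.randn(4, 256), torch.randn(4, 256)
fused, visual_shared, audio_shared = fusion(f_visual, f_audio)
loss = autoencoder_loss(f_audio, audio_shared, f_visual, visual_shared, has_semantic_label=True)
```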
  • formula (3) is used to splice visual features, visual compensation features, auditory features, and auditory compensation features to obtain fusion features.
  • each modal space receives information from its inter-modal neighbors and intra-modal neighbors, and shares its own information at the same time.
  • the inter-modal neighbor information obtained by any modal space can make up for the loss of its own information
  • The inter-modal neighbor information obtained in this way is used as a supplementary feature to enhance the expressive ability of the fused feature.
  • The error of the autoencoding network includes two parts, the error of the acoustic decoder and the error of the visual decoder, and their sum is the total error; the error can be backpropagated to update the weights of the autoencoding network.
  • Introducing semantic labels to calculate the auto-encoding network error includes: marking the visual features and the auditory features input to the auto-encoding network with semantic mapping labels, where a semantic mapping label marks visual input information and auditory input information that describe the same semantic content; when the visual features and auditory features input to the autoencoding network carry a semantic mapping label, the loss function is the algebraic sum of the auditory average error value and the visual average error value;
  • when they carry no semantic mapping label, the loss function is the difference between 1 and the algebraic sum of the auditory average error value and the visual average error value;
  • the auditory average error value is characterized as the average of the absolute differences between all auditory features and all auditory shared features
  • the visual average error value is characterized as the average of the absolute differences between all visual features and all visual shared features.
  • the loss function is obtained by the following formula:
  • y_autocoder is the loss function
  • N is the number of features
  • f_audio is the auditory feature
  • f'_audio is the auditory shared feature
  • f_visual is the visual feature
  • f'_visual is the visual shared feature
  • this solution designs a new loss function that allows the model to learn the bias information of semantic mapping on the timeline.
  • This solution reduces the interference of blindly spliced features, enhances the model's ability to distinguish semantic correspondences in abnormal audio and video, and is more conducive to eliminating interference between non-corresponding features.
  • semantic embedding learning can be regarded as a form of regularization, which helps to enhance the generalization ability of the model and prevent overfitting.
  • The semantic mapping label is obtained by separately semantically labeling the acoustic abnormality information of the auditory input information and the visual abnormality information of the visual input information; if it is determined that both the auditory input information and the visual abnormality information carry the semantic tag, the semantic mapping label is assigned to the pair formed by the auditory input information and the visual abnormality information.
  • this solution maps the original audio waveform to a two-dimensional field, that is, the x-axis of the sound data is time and the y-axis is the waveform.
  • This solution uses the surveillance video screen as the original video data.
  • the original video data has approximately 30 to 60 frames of images in 1 second. That is, the x-axis of the visual data is also time, and the y-axis is the image frame.
  • the two can be aligned on the time axis based on whether the semantics are the same to achieve correspondence between visual features and auditory features, thereby achieving semantic consistency between vision and hearing.
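A minimal sketch, under an assumed interval representation, of how the semantic mapping label assignment described above could be performed: an auditory segment and a visual segment that are both tagged as abnormal and overlap on the shared time axis receive one semantic mapping label as a pair.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_s, end_s) on the common time axis

def assign_semantic_mapping_labels(audio_abnormal: List[Interval],
                                   visual_abnormal: List[Interval]) -> List[Tuple[Interval, Interval]]:
    """Pair every abnormal audio interval with every abnormal visual interval
    that overlaps it in time; each pair carries one semantic mapping label."""
    pairs = []
    for a_start, a_end in audio_abnormal:
        for v_start, v_end in visual_abnormal:
            if min(a_end, v_end) > max(a_start, v_start):   # overlap in time => same semantics
                pairs.append(((a_start, a_end), (v_start, v_end)))
    return pairs

# e.g. shouting at 3.0-6.5 s and a visible fight at 4.0-7.0 s form one labeled pair
labels = assign_semantic_mapping_labels([(3.0, 6.5)], [(4.0, 7.0)])
```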
  • In addition, the information in the original audio and video segment to be identified is preprocessed.
  • the present invention collects the difference between every two adjacent image frames from the audio and video segments to be identified to obtain a difference sequence,
  • the difference sequence is used as visual input information.
  • The objects of abnormal behavior recognition are typically violent behaviors, such as punching someone. A single image in which a fist is at the opponent's chest or at one's own waist does not by itself accurately indicate the behavior of the people in the video; but if the fist is still at the person's waist in the first few frames and is at the opponent's chest in the following frames, the person in the video has performed the abnormal behavior of punching. The inter-frame difference of the video can therefore extract the required information more accurately than the video frames themselves, so choosing the difference between adjacent frames as the input of the network model gives a better feature expression effect than inputting the video frames themselves.
  • The original audio waveform is used as the auditory information to be collected, so "collecting the auditory input information from the audio-video segment to be identified" includes: obtaining the original audio waveform corresponding to the audio-video segment to be identified, and sampling acoustic signals from the original audio waveform at a preset sampling interval to obtain the auditory input information.
  • The sound extracted by this solution is based on the original audio waveform, so it can be expressed in a unified metric with the visual features: the auditory input information is represented as waveform data with time as the horizontal axis and the acoustic signal as the vertical axis, where the auditory input information and the visual input information use the time domain as a unified scale.
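A small sketch of the preprocessing described above, assuming NumPy arrays; the frame rate, audio sampling rate and sampling interval are illustrative only.

```python
import numpy as np

def frame_difference_sequence(frames: np.ndarray) -> np.ndarray:
    """frames: (T, H, W, C) array of consecutive video frames.
    Returns the (T-1, H, W, C) sequence of differences between adjacent frames,
    used here as the visual input information."""
    frames = frames.astype(np.float32)
    return frames[1:] - frames[:-1]

def sample_waveform(waveform: np.ndarray, source_rate: int, sample_interval_s: float) -> np.ndarray:
    """waveform: 1-D array of the original audio samples at source_rate Hz.
    Picks one acoustic value every sample_interval_s seconds, giving waveform data
    on a time axis that can be aligned with the video frames."""
    step = max(1, int(round(source_rate * sample_interval_s)))
    return waveform[::step]

# Illustrative values only: 2 s of 25 fps video and 16 kHz audio,
# with the audio sampled on the same time grid as the frame differences.
video = np.random.rand(50, 224, 224, 3)
audio = np.random.randn(32000)
visual_input = frame_difference_sequence(video)          # 49 difference "frames"
auditory_input = sample_waveform(audio, 16000, 1 / 25)   # 50 samples, one per frame period
```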
  • the feature extraction network for extracting visual and auditory features uses a dual-branch channel, that is, visual input information and auditory input information can be input into the feature extraction network at the same time, and feature extraction is performed separately to output visual features and auditory features.
  • The dual-branch channel feature extraction network includes an auditory feature extraction network and a visual feature extraction network, wherein the auditory feature extraction network includes an AFEN module and an LSTM module: the multi-frame waveforms in the original audio waveform are input into the AFEN module to obtain multiple corresponding auditory frame-level features, which are fused by the LSTM module to output auditory segment-level features.
  • the abnormal behavior auditory feature extraction network structure AFEN structure includes 5 convolutional layers, 3 pooling layers and 3 sequentially connected fully connected layers.
  • the output of the last fully connected layer passes through the SoftMax layer.
  • the pooling layer is connected after the first convolution layer, the second convolution layer and the fifth convolution layer respectively.
  • the black rectangle represents each convolution layer
  • the white rectangle represents the pooling layer
  • the three rectangles immediately after the last pooling layer represent the three fully connected layers
  • the black rectangle after the fully connected layers represents the LSTM structure
  • the LSTM network is selected to process the temporal relationship between audio frame-level features and obtain segment-level features.
  • the convolutional layer contains the ReLU activation function, making the activation pattern of the network sparser.
  • the pooling layer contains a local response normalization operation to avoid gradient disappearance and improve the training speed of the network.
  • the timing information is summarized through the LSTM network.
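A sketch of the auditory branch (AFEN plus LSTM) under the structure described above; the channel counts, kernel sizes, input size and the treatment of each audio frame as a two-dimensional time-amplitude input are assumptions, since the figure is not reproduced here. The SoftMax head mentioned for the figure is omitted because only the frame-level features feed the LSTM.

```python
import torch
import torch.nn as nn

class AFEN(nn.Module):
    """Frame-level extractor: 5 conv layers with ReLU, pooling plus local response
    normalization after conv1, conv2 and conv5, then 3 fully connected layers."""
    def __init__(self, in_channels=3, feat_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.MaxPool2d(3, stride=2), nn.LocalResponseNorm(5),          # after conv1
            nn.Conv2d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2), nn.LocalResponseNorm(5),          # after conv2
            nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(3, stride=2), nn.LocalResponseNorm(5),          # after conv5
            nn.AdaptiveAvgPool2d((6, 6)),
        )
        self.fc = nn.Sequential(                                          # 3 fully connected layers
            nn.Linear(256 * 6 * 6, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, feat_dim),
        )

    def forward(self, x):                  # x: (batch, channels, H, W), one frame at a time
        return self.fc(self.features(x).flatten(1))   # frame-level feature

class AuditoryBranch(nn.Module):
    """AFEN over per-frame inputs, then an LSTM fuses the frame-level features
    into a segment-level auditory feature."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.afen = AFEN(in_channels=1, feat_dim=feat_dim)
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)

    def forward(self, frames):             # frames: (batch, T, 1, H, W)
        b, t = frames.shape[:2]
        frame_feats = self.afen(frames.flatten(0, 1)).view(b, t, -1)
        _, (h_n, _) = self.lstm(frame_feats)
        return h_n[-1]                      # segment-level auditory feature

# toy usage: 8 "frames" of a 2-D time-amplitude rendering of the waveform
segment_feature = AuditoryBranch()(torch.randn(2, 8, 1, 224, 224))
```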
  • This method can be adapted to the entire surveillance video mechanism and has no rigid requirements in terms of audio and video length, sampling rate, etc. This solves the problem of feature timeline alignment.
  • this model also greatly reduces the complexity of visual and auditory feature fusion and improves the stability of the model.
  • The abnormal behavior visual feature extraction network has the same convolutional structure as the AFEN shown in Figure 2, except that a convolutional LSTM (ConvLSTM) module replaces the final LSTM module and the original input signal is changed to the inter-frame image difference.
  • ConvLSTM convolutional LSTM
  • The difference is that, compared with auditory feature extraction, visual features pay more attention to action recognition over spatiotemporal relationships; ConvLSTM captures spatiotemporal relationships better than LSTM and can solve spatiotemporal sequence prediction problems such as video classification and action recognition.
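For reference, a minimal ConvLSTM cell of the usual textbook form is sketched below; it is not the specific ConvLSTM used in this application, but it shows why the hidden state retains spatial structure, which is what makes ConvLSTM better suited than LSTM for fusing visual frame-level feature maps. The kernel size and gate layout are assumptions.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """ConvLSTM cell: the LSTM gates are computed with convolutions, so the
    hidden and cell states keep their spatial (H, W) layout."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        self.hidden_channels = hidden_channels
        self.gates = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state=None):
        # x: (batch, in_channels, H, W)
        if state is None:
            b, _, h, w = x.shape
            zeros = x.new_zeros(b, self.hidden_channels, h, w)
            state = (zeros, zeros)
        h_prev, c_prev = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h_prev], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)

# Fuse a sequence of visual frame-level feature maps into a segment-level map.
cell = ConvLSTMCell(in_channels=256, hidden_channels=256)
state = None
for feature_map in torch.randn(8, 1, 256, 14, 14):   # 8 time steps, batch of 1
    h, state = cell(feature_map, state)
# h is the segment-level visual representation, shape (1, 256, 14, 14)
```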
  • the present invention designs an abnormal behavior recognition model based on audio-visual information fusion using autoencoding network.
  • the model structure is shown in Figure 3.
  • the model includes four parts: visual feature extraction, auditory feature extraction, autoencoding network and fully connected recognition model.
  • Visual and acoustic feature extraction uses a dual-channel feature extraction method.
  • the network structure is based on a deep convolutional network.
  • For visual features, the difference between video frames is used as the original input, and a deep convolutional network plus a ConvLSTM network is used to extract segment-level visual features.
  • For auditory features, audio waveforms are used as the network input, and a deep convolutional network plus an LSTM network is used to extract segment-level auditory features.
  • The autoencoding network shown in Figure 4 is used to construct a shared semantic subspace to eliminate the semantic bias between visual and auditory features, and the CONCAT method is used to combine the visual and auditory features; finally, a fully connected model is used to identify abnormal behaviors. The target behavior recognition method of this scheme therefore relies on the shared semantic subspace of the autoencoding network to integrate auditory features into visual features to supplement the visual information, and visual features into auditory features to supplement the auditory information, achieving complementarity between the different modalities. The fused features obtained through feature fusion thus carry richer semantic expression, and the classification results produced by the model from the fused features are correspondingly more accurate. On this basis, the solution improves recognition accuracy and reduces the missed detection rate.
  • this solution provides a target behavior recognition device with audio-visual feature fusion.
  • the device uses the above target behavior recognition method with audio-visual feature fusion to identify the target behavior.
  • the device includes:
  • the acquisition module 501 is used to acquire the audio and video segments to be recognized of a preset duration.
  • the information collection module 502 is used to collect visual input information and auditory input information in the audio and video segments to be recognized.
  • the feature extraction module 503 is used to input the visual input information and the auditory input information together into the target behavior model, wherein the target behavior model includes a dual-branch channel feature extraction network, an autoencoding network and a fully connected layer recognition module.
  • features are respectively extracted from the visual input information and the auditory input information to obtain visual features and auditory features.
  • the autoencoding network is used to map the visual features and the auditory features into the same subspace for audio-visual information fusion to obtain fusion features.
  • the behavior recognition module 504 is used to input the fused features into the fully connected layer recognition module for recognition to obtain the target behavior.
  • an electronic device includes a memory 604 and a processor 602.
  • the memory 604 stores a computer program
  • the processor 602 is configured to run the computer program to perform the steps in any of the above method embodiments.
  • the above-mentioned processor 602 may include a central processing unit (CPU), or an application specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), or may be configured as one or more integrated circuits that implement the embodiments of the present application.
  • CPU central processing unit
  • ASIC Application Specific Integrated Circuit
  • memory 604 may include mass storage for data or instructions.
  • the memory 604 may include a hard disk drive (Hard Disk Drive, HDD for short), a floppy disk drive, a solid state drive (Solid State Drive, SSD for short), flash memory, an optical disk, a magneto-optical disk, magnetic tape, or a Universal Serial Bus (USB for short) drive, or a combination of two or more of these.
  • Memory 604 may include removable or non-removable (or fixed) media, where appropriate.
  • Memory 604 may be internal or external to the data processing device, where appropriate.
  • memory 604 is Non-Volatile memory.
  • the memory 604 includes read-only memory (Read-Only Memory, ROM for short) and random access memory (Random Access Memory, RAM for short).
  • the ROM can be a mask-programmed ROM, a programmable ROM (Programmable Read-Only Memory, PROM for short), an erasable PROM (Erasable Programmable Read-Only Memory, EPROM for short), an electrically erasable PROM (Electrically Erasable Programmable Read-Only Memory, EEPROM for short), an electrically rewritable ROM (Electrically Alterable Read-Only Memory, EAROM for short) or flash memory (FLASH), or a combination of two or more of these.
  • PROM programmable ROM
  • EPROM erasable PROM
  • EPROM ErasableProgrammableRead-OnlyMemory
  • EEPROM Electrically ErasableProgrammable
  • the RAM can be static random access memory (Static Random-Access Memory, SRAM for short) or dynamic random access memory (Dynamic Random Access Memory, DRAM for short), where the DRAM can be fast page mode dynamic random access memory (Fast Page Mode Dynamic Random Access Memory, FPMDRAM for short), extended data out dynamic random access memory (Extended Data Out Dynamic Random Access Memory, EDODRAM for short), synchronous dynamic random access memory (Synchronous Dynamic Random-Access Memory, SDRAM for short), etc.
  • SRAM static random access memory
  • DRAM dynamic random access memory
  • FPMDRAM fast page mode dynamic random access memory
  • EDODRAM Extended Data output dynamic random access memory
  • SDRAM synchronous dynamic random access memory
  • Memory 604 may be used to store or cache various data files required for processing and/or communication, as well as possibly computer program instructions executed by processor 602.
  • the processor 602 reads and executes the computer program instructions stored in the memory 604 to implement any of the audio-visual feature fusion target behavior recognition methods in the above embodiments.
  • the above-mentioned electronic device may also include a transmission device 606 and an input-output device 608, wherein the transmission device 606 is connected to the above-mentioned processor 602, and the input-output device 608 is connected to the above-mentioned processor 602.
  • Transmission device 606 may be used to receive or send data via a network.
  • Specific examples of the above-mentioned network may include a wired or wireless network provided by a communication provider of the electronic device.
  • the transmission device includes a network adapter (Network Interface Controller, NIC for short), which can be connected to other network devices through a base station to communicate with the Internet.
  • the transmission device 606 may be a radio frequency (Radio Frequency, RF for short) module, which is used to communicate with the Internet wirelessly.
  • RF Radio Frequency
  • Input and output devices 608 are used to input or output information.
  • the input information may be audio and video segments to be recognized, etc.
  • the output information may be the target behavior to be recognized, etc.
  • the above-mentioned processor 602 can be configured to perform the following steps through a computer program:
  • S102 Collect visual input information and auditory input information in the audio and video segments to be recognized.
  • The various embodiments may be implemented in hardware or special purpose circuitry, software, logic, or any combination thereof. Some aspects of the invention may be implemented in hardware, while other aspects may be implemented in firmware or software that may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. Although various aspects of the invention may be shown and described as block diagrams, flow diagrams, or using some other graphical representation, it is to be understood that, as non-limiting examples, the blocks, devices, systems, techniques, or methods described herein may be implemented in hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controllers or other computing devices, or some combination thereof.
  • Embodiments of the invention may be implemented by computer software executable by a data processor of the mobile device, such as in a processor entity, or by hardware, or by a combination of software and hardware.
  • Computer software or programs also referred to as program products
  • a computer program product may include one or more computer-executable components that are configured to perform embodiments when the program is executed.
  • One or more computer-executable components may be at least one software code or a portion thereof.
  • any block of the logic flow in the figures may represent program steps, or interconnected logic circuits, blocks, and functions, or a combination of program steps and logic circuits, blocks, and functions.
  • Software may be stored on physical media such as memory chips or memory blocks implemented within a processor, magnetic media such as hard or floppy disks, and optical media such as, for example, DVD and its data variants, CDs.
  • Physical media are non-transient media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

A target behavior recognition method and apparatus based on audio-visual feature fusion, and an application, relating to the technical field of intelligent security protection. In the method, visual information and audio information are input into a specified algorithm network, a visual feature and an audio feature are extracted through different two-branch feature extraction networks, and timing features are calculated through an LSTM network; a shared semantic subspace is constructed by means of an auto-encoding network, the semantic bias between the visual feature and the audio feature is eliminated, and finally the visual feature and the audio feature are fused, so that a target behavior can be recognized on the basis of the fused feature. The method can improve the accuracy of abnormal behavior recognition.
PCT/CN2022/141314 2022-05-09 2022-12-23 Target behavior recognition method and apparatus based on audio-visual feature fusion, and application WO2023216609A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210496197.7 2022-05-09
CN202210496197.7A CN114581749B (zh) 2022-05-09 2022-05-09 视听特征融合的目标行为识别方法、装置及应用

Publications (1)

Publication Number Publication Date
WO2023216609A1 true WO2023216609A1 (fr) 2023-11-16

Family

ID=81768993

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/141314 WO2023216609A1 (fr) 2022-05-09 2022-12-23 Target behavior recognition method and apparatus based on audio-visual feature fusion, and application

Country Status (2)

Country Link
CN (1) CN114581749B (fr)
WO (1) WO2023216609A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114581749B (zh) * 2022-05-09 2022-07-26 城云科技(中国)有限公司 视听特征融合的目标行为识别方法、装置及应用

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344781A (zh) * 2018-10-11 2019-02-15 上海极链网络科技有限公司 一种基于声音视觉联合特征的视频内表情识别方法
US20190341025A1 (en) * 2018-04-18 2019-11-07 Sony Interactive Entertainment Inc. Integrated understanding of user characteristics by multimodal processing
CN110647804A (zh) * 2019-08-09 2020-01-03 中国传媒大学 一种暴力视频识别方法、计算机系统和存储介质
CN111460889A (zh) * 2020-02-27 2020-07-28 平安科技(深圳)有限公司 一种基于语音及图像特征的异常行为识别方法、装置及设备
CN112287893A (zh) * 2020-11-25 2021-01-29 广东技术师范大学 一种基于音视频信息融合的母猪哺乳行为识别方法
CN114581749A (zh) * 2022-05-09 2022-06-03 城云科技(中国)有限公司 视听特征融合的目标行为识别方法、装置及应用

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102098492A (zh) * 2009-12-11 2011-06-15 上海弘视通信技术有限公司 音视频联合分析的打架斗殴检测系统及其检测方法
CN103854014A (zh) * 2014-02-25 2014-06-11 中国科学院自动化研究所 一种基于上下文稀疏表示的恐怖视频识别方法及装置
US9697833B2 (en) * 2015-08-25 2017-07-04 Nuance Communications, Inc. Audio-visual speech recognition with scattering operators
CN108200483B (zh) * 2017-12-26 2020-02-28 中国科学院自动化研究所 动态多模态视频描述生成方法
CN109509484A (zh) * 2018-12-25 2019-03-22 科大讯飞股份有限公司 一种婴儿啼哭原因的预测方法及装置
CN111640424B (zh) * 2019-03-01 2024-02-13 北京搜狗科技发展有限公司 一种语音识别方法、装置和电子设备
CN109961789B (zh) * 2019-04-30 2023-12-01 张玄武 一种基于视频及语音交互服务设备
CN112328830A (zh) * 2019-08-05 2021-02-05 Tcl集团股份有限公司 一种基于深度学习的信息定位方法及相关设备
US11244696B2 (en) * 2019-11-06 2022-02-08 Microsoft Technology Licensing, Llc Audio-visual speech enhancement
CN111461235B (zh) * 2020-03-31 2021-07-16 合肥工业大学 音视频数据处理方法、系统、电子设备及存储介质
CN111754992B (zh) * 2020-06-30 2022-10-18 山东大学 一种噪声鲁棒的音视频双模态语音识别方法及系统
CN112866586B (zh) * 2021-01-04 2023-03-07 北京中科闻歌科技股份有限公司 一种视频合成方法、装置、设备及存储介质
CN113255556A (zh) * 2021-06-07 2021-08-13 斑马网络技术有限公司 多模态语音端点检测方法及装置、车载终端、存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190341025A1 (en) * 2018-04-18 2019-11-07 Sony Interactive Entertainment Inc. Integrated understanding of user characteristics by multimodal processing
CN109344781A (zh) * 2018-10-11 2019-02-15 上海极链网络科技有限公司 一种基于声音视觉联合特征的视频内表情识别方法
CN110647804A (zh) * 2019-08-09 2020-01-03 中国传媒大学 一种暴力视频识别方法、计算机系统和存储介质
CN111460889A (zh) * 2020-02-27 2020-07-28 平安科技(深圳)有限公司 一种基于语音及图像特征的异常行为识别方法、装置及设备
CN112287893A (zh) * 2020-11-25 2021-01-29 广东技术师范大学 一种基于音视频信息融合的母猪哺乳行为识别方法
CN114581749A (zh) * 2022-05-09 2022-06-03 城云科技(中国)有限公司 视听特征融合的目标行为识别方法、装置及应用

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TIANLIANG LIU, QIAO QINGWEI; WAN JUNWEI; DAI XIUBIN; LUO JIEBO: "Human Action Recognition via Spatio-temporal Dual Network Flow and Visual Attention Fusion", JOURNAL OF ELECTRONICS & INFORMATION TECHNOLOGY, ZHONGGUO KEXUEYUAN DIANZIXUE YANJIUSUO,CHINESE ACADEMY OF SCIENCES, INSTITUTE OF ELECTRONICS, CN, vol. 40, no. 10, 15 August 2018 (2018-08-15), CN , pages 2395 - 2401, XP093107315, ISSN: 1009-5896, DOI: 10.11999/JEIT171116 *
XIAO-YU WU, GU CHAO-NAN; WANG SHENG-JIN: "Special video classification based on multitask learning and multimodal feature fusion", OPTICS AND PRECISION ENGINEERING, vol. 28, no. 5, 13 May 2020 (2020-05-13), pages 1177 - 1186, XP093107318 *

Also Published As

Publication number Publication date
CN114581749A (zh) 2022-06-03
CN114581749B (zh) 2022-07-26

Similar Documents

Publication Publication Date Title
WO2020103676A1 (fr) Procédé et appareil d'identification d'image, terminal et support de stockage
WO2022007193A1 (fr) Procédé et système de détection de comportement vidéo de faible supervision basés sur un apprentissage itératif
CN107943837A (zh) 一种前景目标关键帧化的视频摘要生成方法
CN111653368A (zh) 一种人工智能疫情大数据防控预警系统
Ren et al. Deep video anomaly detection: Opportunities and challenges
CN109360584A (zh) 基于深度学习的咳嗽监测方法及装置
CN110931112B (zh) 一种基于多维信息融合和深度学习的脑部医学影像分析方法
WO2023216609A1 (fr) Procédé et appareil de reconnaissance de comportement cible basés sur une fusion de caractéristiques audiovisuelles, et application
WO2023273629A1 (fr) Système et appareil pour configurer un modèle de réseau neuronal dans un serveur périphérique
Hsu et al. Hierarchical Network for Facial Palsy Detection.
US20220104725A9 (en) Screening of individuals for a respiratory disease using artificial intelligence
Gao et al. Deep model-based semi-supervised learning way for outlier detection in wireless capsule endoscopy images
CN109460717A (zh) 消化道共聚焦激光显微内镜病变图像识别方法及装置
Mohan et al. Non-invasive technique for real-time myocardial infarction detection using faster R-CNN
CN111814588B (zh) 行为检测方法以及相关设备、装置
Wang et al. A YOLO-based Method for Improper Behavior Predictions
US20200034739A1 (en) Method and device for estimating user's physical condition
CN116416678A (zh) 一种运用人工智能技术实现动作捕捉及智能评判的方法
CN115272967A (zh) 一种跨摄像机行人实时跟踪识别方法、装置及介质
Adewole et al. Graph convolutional neural network for weakly supervised abnormality localization in long capsule endoscopy videos
US11224359B2 (en) Repetitive human activities abnormal motion detection
Postawka Real-time monitoring system for potentially dangerous activities detection
CN113995379B (zh) 一种基于目标检测框架的睡眠呼吸暂停低通气综合症评估方法及装置
CN110598592A (zh) 适用于护理场所的智能实时视频监控方法
US20240153269A1 (en) Identifying variation in surgical approaches

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22941542

Country of ref document: EP

Kind code of ref document: A1