CN117633604A - Audio and video intelligent processing method and device, storage medium and electronic equipment - Google Patents

Audio and video intelligent processing method and device, storage medium and electronic equipment Download PDF

Info

Publication number
CN117633604A
CN117633604A (application number CN202311754364.4A)
Authority
CN
China
Prior art keywords
video
audio
feature
spectrogram
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311754364.4A
Other languages
Chinese (zh)
Inventor
李修贤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Weifang Yabaiwen Network Technology Co ltd
Original Assignee
Weifang Yabaiwen Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Weifang Yabaiwen Network Technology Co ltd filed Critical Weifang Yabaiwen Network Technology Co ltd
Priority to CN202311754364.4A priority Critical patent/CN117633604A/en
Publication of CN117633604A publication Critical patent/CN117633604A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The application relates to the technical field of intelligent monitoring, and particularly discloses an audio and video intelligent processing method and device, a storage medium and an electronic device. The method collects an animal park monitoring video, extracts audio data from the video, uses deep learning to extract key semantic features from the video data and the audio data respectively, and judges whether a tourist is committing a violation based on the fusion of the video semantic features and the audio semantic features. In this way, tourist violations in the animal park can be monitored and recognized in real time, improving the supervision effect and saving a large amount of human resources.

Description

Audio and video intelligent processing method and device, storage medium and electronic equipment
Technical Field
The application relates to the technical field of intelligent monitoring, and more particularly, to an audio and video intelligent processing method, an audio and video intelligent processing device, a storage medium and electronic equipment.
Background
In recent years, the safari park industry in China has developed rapidly. As places where people can closely observe and learn about animals, safari parks have become increasingly popular. However, as visitor numbers grow, supervising the park becomes progressively harder, and tourist violations such as feeding animals and climbing over enclosures occur from time to time. These violations not only endanger the safety and health of the animals, but may also provoke aggressive animal behavior, creating a potential hazard to the tourists themselves. Monitoring and preventing tourist violations in animal parks is therefore particularly important. Existing monitoring of tourist violations in animal parks mainly relies on manual observation, which is prone to negligence and fatigue and is inefficient.
Accordingly, an audio/video intelligent processing method and device, a storage medium and an electronic device are desired.
Disclosure of Invention
The present application has been made in order to solve the above technical problems. The embodiments of the application provide an audio and video intelligent processing method and device, a storage medium and an electronic device: an animal park monitoring video is first collected and audio data is extracted from it; deep learning is then used to extract key semantic features from the video data and the audio data respectively; and whether a tourist is committing a violation is judged based on the fusion of the video semantic features and the audio semantic features. In this way, tourist violations in the animal park can be monitored and recognized in real time, improving the supervision effect and saving a large amount of human resources.
Accordingly, according to one aspect of the present application, there is provided an audio/video intelligent processing method, which includes:
acquiring an animal park monitoring video in real time, and extracting audio data from the animal park monitoring video;
the animal park monitoring video passes through a ViT model containing an embedded layer to obtain a video context semantic association feature vector;
extracting a logarithmic mel spectrogram, a cochlear spectrogram and a constant Q transformation spectrogram of the audio data, and arranging the logarithmic mel spectrogram, the cochlear spectrogram and the constant Q transformation spectrogram into a multichannel sound spectrogram;
the multichannel sound spectrogram passes through a sound feature extractor based on a convolutional neural network model to obtain an audio feature vector, wherein a channel attention mechanism is used in the convolutional neural network model of the sound feature extractor;
fusing the audio feature vector and the video context semantic association feature vector based on a Gaussian density map to obtain an audio-video association feature matrix;
and passing the audio and video association feature matrix through a classifier to obtain a classification result, wherein the classification result is used for indicating whether a tourist is committing a violation at the current time point.
According to another aspect of the present application, there is provided an audio/video intelligent processing device, including:
the audio and video acquisition module is used for acquiring the animal park monitoring video in real time and extracting audio data from the animal park monitoring video;
the video semantic coding module is used for enabling the animal park monitoring video to pass through a ViT model containing an embedded layer to obtain video context semantic association feature vectors;
the sound spectrogram extraction module is used for extracting a logarithmic Mel spectrogram, a cochlear spectrogram and a constant Q transformation spectrogram of the audio data and arranging the logarithmic Mel spectrogram, the cochlear spectrogram and the constant Q transformation spectrogram into a multi-channel sound spectrogram;
the audio feature coding module is used for enabling the multichannel sound spectrogram to pass through a sound feature extractor based on a convolutional neural network model to obtain an audio feature vector, wherein a channel attention mechanism is used in the convolutional neural network model of the sound feature extractor;
the Gaussian fusion module is used for fusing the audio feature vector and the video context semantic association feature vector based on a Gaussian density chart to obtain an audio-video association feature matrix;
and the analysis result generation module is used for passing the audio and video association feature matrix through a classifier to obtain a classification result, wherein the classification result is used for indicating whether a tourist is committing a violation at the current time point.
According to still another aspect of the present application, there is provided an electronic device, including: a memory for storing instructions; a processor coupled to the memory, the processor configured to execute instructions stored in the memory to implement the audio video intelligent processing method as described above.
According to still another aspect of the present application, there is provided a storage medium having stored thereon an audio/video intelligent processing program which, when executed by a processor, implements the audio/video intelligent processing method as described above.
Compared with the prior art, the audio and video intelligent processing method and device, storage medium and electronic device provided by the application first collect an animal park monitoring video and extract audio data from it, then use deep learning to extract key semantic features from the video data and the audio data respectively, and judge whether a tourist is committing a violation based on the fusion of the video semantic features and the audio semantic features. In this way, tourist violations in the animal park can be monitored and recognized in real time, improving the supervision effect and saving a large amount of human resources.
Drawings
The foregoing and other objects, features and advantages of the present application will become more apparent from the following more particular description of embodiments of the present application, as illustrated in the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the application and are incorporated in and constitute a part of this specification, illustrate the application and not constitute a limitation to the application. In the drawings, like reference numerals generally refer to like parts or steps.
Fig. 1 is a flowchart of an audio/video intelligent processing method according to an embodiment of the present application.
Fig. 2 is a schematic architecture diagram of an audio/video intelligent processing method according to an embodiment of the application.
Fig. 3 is a flowchart of the method for intelligent audio/video processing according to an embodiment of the present application, in which the zoo park monitoring video is passed through a ViT model including an embedded layer to obtain a video context semantic association feature vector.
Fig. 4 is a flowchart of a method for processing audio and video according to an embodiment of the present application, in which the plurality of video image frames are respectively passed through a ViT model including an embedded layer to obtain a plurality of image frame semantic feature vectors.
Fig. 5 is a flowchart of a converter module for inputting the plurality of image block embedded vectors into the ViT model to perform conversion encoding to obtain the image frame semantic feature vector in the audio/video intelligent processing method according to the embodiment of the application.
Fig. 6 is a block diagram of an audio/video intelligent processing device according to an embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application and not all of the embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
Fig. 1 is a flowchart of an audio/video intelligent processing method according to an embodiment of the present application. Fig. 2 is a schematic architecture diagram of an audio/video intelligent processing method according to an embodiment of the application. As shown in fig. 1 and fig. 2, the audio/video intelligent processing method according to the embodiment of the application includes the steps of: S110, acquiring an animal park monitoring video in real time, and extracting audio data from the animal park monitoring video; S120, passing the animal park monitoring video through a ViT model containing an embedded layer to obtain a video context semantic association feature vector; S130, extracting a logarithmic Mel spectrogram, a cochlear spectrogram and a constant Q transform spectrogram of the audio data, and arranging them into a multi-channel sound spectrogram; S140, passing the multi-channel sound spectrogram through a sound feature extractor based on a convolutional neural network model to obtain an audio feature vector, wherein a channel attention mechanism is used in the convolutional neural network model of the sound feature extractor; S150, fusing the audio feature vector and the video context semantic association feature vector based on a Gaussian density map to obtain an audio-video association feature matrix; and S160, passing the audio-video association feature matrix through a classifier to obtain a classification result, wherein the classification result is used for indicating whether a tourist is committing a violation at the current time point.
In the above audio/video intelligent processing method, in step S110, an animal park monitoring video is acquired in real time, and audio data is extracted from the animal park monitoring video. As described in the background, zoos are places for people to watch animals, yet tourist violations occur from time to time; for example, some tourists bang on or climb over railings, or feed prohibited food. The railings and enclosures of a zoo are provided for the safety of both tourists and animals, and any action beyond these limits may cause an accident. Banging on and crossing railings not only frightens and stresses the animals, but may also lead to injury or abnormal behavior. Many tourists also attempt to feed the animals, while zoos usually prohibit feeding or allow only specific foods; feeding unsuitable food may damage an animal's digestive system and cause health problems. In addition, zoo keepers carefully control the animals' diet, and any unauthorized feeding may interfere with the feeding program and negatively affect animal health. Furthermore, noise, quarreling, chasing or other uncivilized actions by tourists during their visit can degrade the experience of other tourists.
In order to protect the animals and maintain the order and safety of the zoo, zoos typically supervise and guide tourist activities by installing cameras and adding patrol personnel. However, adding patrol personnel requires additional human resources and economic cost, and hiring and training enough patrol personnel may strain the zoo's operating budget. Moreover, patrol personnel cannot watch every tourist and every corner of the zoo at all times, and even with cameras, considerable manpower and time may be required to monitor and respond to violations in real time. Accordingly, an automatic tourist violation identification scheme is desired.
Correspondingly, in the technical scheme of the application, an animal park monitoring video is first acquired and audio data is extracted from it; deep learning is then used to extract key semantic features from the video data and the audio data respectively; and whether a tourist is committing a violation is judged based on the fusion of the video semantic features and the audio semantic features. In this way, tourist violations in the animal park can be monitored and recognized in real time, improving the supervision effect and saving a large amount of human resources.
In the above audio/video intelligent processing method, in step S120, the animal park monitoring video is passed through a ViT model containing an embedded layer to obtain a video context semantic association feature vector. It should be appreciated that the ViT model is a Transformer-based vision model. A pretrained ViT model can reuse knowledge learned from large-scale datasets to produce visual feature representations without training from scratch, saving training time and resources and improving the performance and generalization of the model. In the technical scheme of the application, the ViT model is used to model the video sequence so as to capture the context information in the video. Specifically, the video is decomposed into a series of frames or time steps, which are input into the ViT model respectively, so that a feature representation of each time step is obtained; high-level semantic features are thereby extracted from the monitoring video to capture the contextual relationship between tourist behavior and the zoo environment, which can then be represented more effectively. In addition, the ViT model converts the video sequence into fixed-length feature vectors, reducing the data dimensionality and producing a more compact, higher-level feature representation that better captures the relationship between tourist behavior and the zoo environment and improves the detection and recognition of violations.
Correspondingly, because the differences between adjacent image frames in the animal park monitoring video are very small, the video contains a large amount of redundant information; semantically encoding the video directly would greatly increase the computational load and interfere with the extraction of video semantic features. Therefore, in order to save computational resources and streamline the semantic encoding process, the animal park monitoring video is first sampled to obtain a plurality of video image frames, the video image frames are semantically encoded by the ViT model respectively to obtain an image frame semantic feature vector for each frame, and the image frame semantic feature vectors of all the video image frames are then integrated to obtain the video context semantic association feature vector.
Fig. 3 is a flowchart of the method for intelligent audio/video processing according to an embodiment of the present application, in which the zoo park monitoring video is passed through a ViT model including an embedded layer to obtain a video context semantic association feature vector. As shown in fig. 3, the step S120 includes: s210, sampling the monitoring video of the animal park at a preset sampling frequency to obtain a plurality of video image frames; s220, the plurality of video image frames are respectively passed through a ViT model containing an embedded layer to obtain a plurality of image frame semantic feature vectors; and S230, cascading the plurality of image frame semantic feature vectors to obtain the video context semantic association feature vector.
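As a concrete illustration of steps S210-S230, the following Python sketch samples frames at a fixed rate with OpenCV, encodes each sampled frame with a pretrained ViT backbone, and concatenates the per-frame vectors. The choice of timm's vit_base_patch16_224 as the backbone and the sampling interval are assumptions; the application does not name a specific ViT variant.

```python
# Minimal sketch of steps S210-S230: sample frames at a fixed rate, encode each
# sampled frame with a pretrained ViT backbone, and concatenate the per-frame
# semantic vectors into a single video-level context vector.
import cv2
import torch
import timm
from PIL import Image
from timm.data import resolve_data_config
from timm.data.transforms_factory import create_transform

vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
vit.eval()
preprocess = create_transform(**resolve_data_config({}, model=vit))

def video_context_vector(video_path: str, sample_every_n: int = 30) -> torch.Tensor:
    cap = cv2.VideoCapture(video_path)
    frame_vectors, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every_n == 0:                     # S210: fixed sampling frequency
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            x = preprocess(Image.fromarray(rgb)).unsqueeze(0)
            with torch.no_grad():
                frame_vectors.append(vit(x).squeeze(0))   # S220: per-frame semantic vector
        idx += 1
    cap.release()
    return torch.cat(frame_vectors, dim=0)                # S230: cascade into one vector
```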
Fig. 4 is a flowchart of a method for processing audio and video according to an embodiment of the present application, in which the plurality of video image frames are respectively passed through a ViT model including an embedded layer to obtain a plurality of image frame semantic feature vectors. As shown in fig. 4, the step S220 includes: s310, performing image blocking processing on the video image frame to obtain a sequence of image blocks; s320, using an embedding layer of the ViT model to respectively carry out embedding coding on each image block in the sequence of the image blocks so as to obtain a plurality of image block embedding vectors; s330, the plurality of image block embedded vectors are input into a converter module of the ViT model for conversion coding so as to obtain the image frame semantic feature vectors.
Fig. 5 is a flowchart of a converter module for inputting the plurality of image block embedded vectors into the ViT model to perform conversion encoding to obtain the image frame semantic feature vector in the audio/video intelligent processing method according to the embodiment of the application. As shown in fig. 5, the step S330 includes: s410, arranging the plurality of image block embedded vectors in one dimension to obtain an image frame global embedded vector; s420, calculating the product between the global embedded vector of the image frame and the transpose vector of each image block embedded vector in the plurality of image block embedded vectors to obtain a plurality of self-attention association matrixes; s430, respectively carrying out standardization processing on each self-attention association matrix in the plurality of self-attention association matrices to obtain a plurality of standardized self-attention association matrices; s440, each normalized self-attention correlation matrix in the normalized self-attention correlation matrices is activated to obtain a plurality of probability values; s450, weighting each image block embedded vector in the image block embedded vectors by taking each probability value in the plurality of probability values as a weight to obtain a plurality of image block feature vectors; s460, cascading the plurality of image block feature vectors to obtain the image frame semantic feature vector.
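Read literally, steps S410-S460 amount to scoring each image block embedding against the frame-level global embedding and using the activated scores as weights. The NumPy sketch below follows that reading; the z-score standardization, the sigmoid activation, and the reduction of each association matrix to a single scalar are assumptions made to turn the description into runnable code.

```python
# Literal sketch of S410-S460: score every image block embedding against the
# frame-level global embedding, normalize and activate the scores, weight each
# block by its score, and concatenate the weighted blocks.
import numpy as np

def frame_semantic_vector(patches: np.ndarray) -> np.ndarray:
    """patches: (num_patches, dim) array of image block embedding vectors."""
    global_vec = patches.reshape(-1)                           # S410: one-dimensional arrangement
    weighted_blocks = []
    for p in patches:
        assoc = np.outer(global_vec, p)                        # S420: self-attention association matrix
        assoc = (assoc - assoc.mean()) / (assoc.std() + 1e-8)  # S430: standardization
        prob = 1.0 / (1.0 + np.exp(-assoc.mean()))             # S440: activation -> probability value
        weighted_blocks.append(prob * p)                       # S450: weight the block embedding
    return np.concatenate(weighted_blocks)                     # S460: cascade into the frame vector
```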
In the above audio/video intelligent processing method, in step S130, a logarithmic Mel spectrogram, a cochlear spectrogram, and a constant Q transform spectrogram of the audio data are extracted and arranged into a multi-channel sound spectrogram. It will be appreciated that the logarithmic Mel spectrogram (Log Mel Spectrogram) is a widely used feature representation in audio signal processing: the audio signal is divided into short time windows, a Fourier transform is applied to each window to obtain spectral information, a Mel filter bank then maps the spectrum onto the Mel scale, converting the frequency axis to Mel frequency, and a logarithmic transform finally scales the Mel spectrum to obtain the logarithmic Mel spectrogram. It provides the energy distribution of the audio signal over frequency and time, and is commonly used for speech recognition, audio classification, sound event detection and similar tasks. The cochlear spectrogram (Cochleagram) is a feature representation that simulates the human auditory system: by emulating the response of cochlear filters and decomposing the audio signal into multiple frequency bands, it better captures the energy distribution and auditory perception characteristics of the signal at different frequencies, and is widely used in speech recognition, audio signal processing and auditory modeling. The constant Q transform spectrogram (Constant Q Transform Spectrogram) is a spectral representation based on the constant Q transform, which decomposes the audio signal into frequency bands whose widths are proportional to their center frequencies, so that resolution is higher in the low-frequency range and lower in the high-frequency range; it captures fine detail in audio signals and is widely used in music signal processing, audio synthesis and audio feature extraction. In other words, the logarithmic Mel spectrogram, the cochlear spectrogram and the constant Q transform spectrogram are commonly used audio feature representations that capture different types of spectral characteristics. Arranging them into a multi-channel sound spectrogram allows these different feature types to be considered jointly, providing a more comprehensive, richer and higher-dimensional audio feature representation, improving feature expressiveness and discriminability, and thereby improving the accuracy of detecting and classifying tourist violations and strengthening the analysis of the audio signal.
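A minimal sketch of building the multi-channel spectrogram with librosa is shown below. The log-Mel and constant-Q channels use standard librosa calls; a true cochleagram would come from a gammatone filterbank, which librosa does not provide, so a second Mel spectrogram with different settings stands in for it here. The sample rate, hop length and 64 bins are assumptions.

```python
# Sketch of the multi-channel sound spectrogram: stack a log-Mel channel, a
# constant-Q channel, and a cochleagram-like channel into a (3, bins, frames) array.
import numpy as np
import librosa

def multichannel_spectrogram(wav_path: str, sr: int = 22050,
                             n_bins: int = 64, hop: int = 512) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=sr)

    log_mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_bins, hop_length=hop))

    cqt = librosa.amplitude_to_db(
        np.abs(librosa.cqt(y, sr=sr, n_bins=n_bins, hop_length=hop)))

    # Stand-in for the cochleagram channel (a gammatone filterbank is not shown).
    cochlea_like = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_bins, hop_length=hop,
                                       fmin=50, htk=True))

    frames = min(log_mel.shape[1], cqt.shape[1], cochlea_like.shape[1])
    return np.stack([log_mel[:, :frames], cqt[:, :frames],
                     cochlea_like[:, :frames]], axis=0)
```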
In the above audio/video intelligent processing method, in step S140, the multi-channel sound spectrogram is passed through a sound feature extractor based on a convolutional neural network model to obtain an audio feature vector. It should be appreciated that the convolutional neural network is a powerful deep learning model capable of automatically learning feature representations. By feeding the multi-channel sound spectrogram into the CNN model, the network can automatically learn and extract the relevant features in the audio data. Compared with manually designed feature extraction methods, a CNN better captures the complex characteristics of audio data and improves the expressiveness and discriminability of the features. Audio signals typically exhibit continuity in the time domain and local correlation in the frequency domain, and the CNN model can capture these characteristics through convolution and pooling operations, thereby extracting more discriminative audio features. A convolutional neural network model is typically composed of multiple convolutional layers and pooling layers, extracting features at different levels of abstraction layer by layer: the lower convolutional layers capture low-level local features, while the higher convolutional layers capture more abstract, semantic features. By stacking multiple convolutional layers, richer and higher-level audio features can be progressively extracted. In addition, the convolutional neural network model uses parameter sharing and local receptive fields, allowing it to handle the temporal structure and local correlation of the audio data efficiently.
In particular, a channel attention mechanism is used in the convolutional neural network model of the acoustic feature extractor. The channel attention mechanism is an attention mechanism applied in convolutional neural networks for enhancing the attention of the network to different channel characteristics. The channel attention mechanism can automatically learn the importance weight of each channel, so that the network can pay more attention to the characteristic channels contributing to the task. Key information in the audio data is better captured by enhancing the response of the important feature channels, and the distinguishing capability of the features is improved. Meanwhile, the channel attention mechanism can also inhibit the response of redundant characteristic channels, reduce the attention to irrelevant information, further reduce the number of parameters and the computational complexity of the network, and improve the generalization capability of the model. That is, the channel attention mechanism can adaptively select the feature channel according to different situations of input data, so that the network can flexibly select and utilize the proper feature channel in different audio data sets and scenes, thereby capturing key features in the audio data more accurately, helping the network to focus on useful information better in the feature extraction stage, and improving the performance and effect of the model.
In a specific example, the step S140 includes: performing, by each layer of the sound feature extractor based on the convolutional neural network model, the following forward pass on the input data: performing convolution processing on the input data with a three-dimensional convolution kernel to obtain a convolution feature map; performing global mean pooling on each feature matrix of the convolution feature map along the channel dimension to obtain a channel feature vector; calculating the ratio of the feature value at each position in the channel feature vector to the weighted sum of the feature values at all positions of the channel feature vector to obtain a channel weighted feature vector; weighting the feature matrices of the convolution feature map along the channel dimension, using the feature value at each position of the channel weighted feature vector as a weight, to obtain a channel attention feature map; performing global pooling on each feature matrix of the channel attention feature map along the channel dimension to obtain a pooled feature map; and performing activation processing on the pooled feature map to generate an activation feature map; wherein the output of the last layer of the sound feature extractor is the audio feature vector, and the input of the first layer of the sound feature extractor is the multi-channel sound spectrogram.
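The PyTorch sketch below follows the described forward pass for one layer. A 2-D convolution over the (channels, frequency, time) spectrogram is used for simplicity even though the text mentions a three-dimensional kernel, the plain sum is used as the "weighted sum" in the channel ratio, and max pooling plus ReLU stand in for the pooling and activation steps; all of these are assumptions.

```python
# Sketch of one layer of the sound feature extractor with the described channel
# attention: convolution, per-channel global mean pooling, ratio-based channel
# weights, channel weighting, pooling, and activation.
import torch
import torch.nn as nn

class ChannelAttentionConvBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.conv(x)                                       # convolution feature map
        chan = feat.mean(dim=(2, 3))                              # global mean pool per channel
        weights = chan / (chan.sum(dim=1, keepdim=True) + 1e-8)   # ratio to the channel sum
        attended = feat * weights.unsqueeze(-1).unsqueeze(-1)     # channel attention feature map
        pooled = self.pool(attended)                              # pooled feature map
        return self.act(pooled)                                   # activation feature map

# Stacking a few blocks and flattening the last output yields the audio feature vector.
extractor = nn.Sequential(
    ChannelAttentionConvBlock(3, 16),
    ChannelAttentionConvBlock(16, 32),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
```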
In the above audio/video intelligent processing method, step S150 fuses the audio feature vector and the video context semantic association feature vector based on a Gaussian density map to obtain an audio-video association feature matrix. Since both the audio data and the video data carry information that is important for detecting violations, and the audio feature vector and the video context semantic association feature vector extracted from them have different representational strengths, fusing the two allows the multi-modal information of audio and video to be considered jointly, making full use of their complementarity, improving the detection and recognition of violations, and describing the relationship between tourist behavior and the zoo environment more comprehensively. In the technical scheme of the application, the audio feature vector and the video context semantic association feature vector are fused by constructing a Gaussian density map. It should be understood that by constructing the Gaussian density map of the audio feature vector and the video context semantic association feature vector, the high-dimensional features are mapped into a two-dimensional space, the features at each position are modeled with a Gaussian kernel, and the Gaussian distribution at each position is computed according to distance, so as to capture the spatio-temporal association between the audio and video features. The resulting audio-video association feature matrix therefore reflects the degree of association between the audio data and the video data more accurately, strengthens the expressiveness of the features, improves the discrimination of violations, better describes the dynamic changes of tourist behavior and the zoo environment, and improves the temporal analysis and recognition of violations.
In a specific example, the step S150 includes: constructing a Gaussian density map of the audio feature vector and the video context semantic association feature vector with the following Gaussian formula:
p(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))
wherein μ is the mean vector of the audio feature vector and the video context semantic association feature vector, σ is the variance between the feature values of each pair of positions in the audio feature vector and the video context semantic association feature vector, p(·) denotes the Gaussian density probability function, and x denotes the variable of the Gaussian density map; and performing Gaussian discretization on the Gaussian distribution of each position in the Gaussian density map to obtain the audio-video association feature matrix.
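One plausible reading of this fusion step is sketched below: entry (i, j) of the association matrix is the Gaussian density of the j-th video feature value under a Gaussian whose mean comes from the per-position mean vector and whose spread comes from the (i, j) feature pair. Both the evaluation point and the spread are assumptions; the filing only gives the general Gaussian form.

```python
# Heavily hedged sketch of the Gaussian-density-map fusion: build an n x n
# audio-video association matrix from the per-position mean vector and a
# pairwise spread, then read the Gaussian density as the discretized value.
import numpy as np

def gaussian_fusion(audio_vec: np.ndarray, video_vec: np.ndarray) -> np.ndarray:
    assert audio_vec.shape == video_vec.shape, "sketch assumes equal lengths"
    mu = 0.5 * (audio_vec + video_vec)                              # mean vector of the two vectors
    n = mu.shape[0]
    fused = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sigma = abs(audio_vec[i] - video_vec[j]) / 2.0 + 1e-6   # pairwise spread (assumed)
            fused[i, j] = (np.exp(-(video_vec[j] - mu[i]) ** 2 / (2.0 * sigma ** 2))
                           / (sigma * np.sqrt(2.0 * np.pi)))
    return fused
```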
In the above audio/video intelligent processing method, in step S160, the audio-video association feature matrix is passed through a classifier to obtain a classification result, where the classification result is used to indicate whether a tourist is committing a violation at the current time point. A classifier is a machine learning algorithm for assigning input data to different categories or labels; it learns a decision boundary or rule from the characteristics of the input data so that data instances of different categories are assigned to the corresponding classes. It should be understood that the audio-video association feature matrix encodes the correlations between the multi-modal features of the audio data and the video data. Using a classifier, the feature representation of the audio-video association feature matrix can be learned automatically, and normal behavior can be distinguished from violations according to the learned patterns and rules, thereby realizing real-time violation detection and recognition. This helps zoo managers monitor tourist behavior better, so that corresponding measures can be taken in time to protect the safety of tourists and animals, and improves the accuracy and practicality of the monitoring.
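A minimal sketch of such a classifier head is shown below: the association matrix is flattened and mapped to two classes (violation / no violation). The hidden width and the use of a simple fully connected head are assumptions; the filing does not specify the classifier architecture.

```python
# Minimal classifier head over the flattened audio-video association feature matrix.
import torch.nn as nn

def make_classifier(matrix_dim: int, hidden: int = 256) -> nn.Module:
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(matrix_dim * matrix_dim, hidden),
        nn.ReLU(),
        nn.Linear(hidden, 2),   # logits; argmax (or softmax) yields the classification result
    )
```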
It will be appreciated that training of the ViT model containing embedded layers, the acoustic feature extractor based on the convolutional neural network model, and the classifier is required before the neural network model described above is utilized. That is, the audio/video intelligent processing method further comprises a training stage for training the ViT model including the embedded layer, the acoustic feature extractor based on the convolutional neural network model and the classifier.
Specifically, the training phase comprises: acquiring a training animal park monitoring video, and extracting training audio data from the training animal park monitoring video; passing the training animal park monitoring video through a ViT model containing an embedded layer to obtain a training video context semantic association feature vector; extracting a logarithmic Mel spectrogram, a cochlear spectrogram and a constant Q transform spectrogram of the training audio data, and arranging them into a training multi-channel sound spectrogram; passing the training multi-channel sound spectrogram through the sound feature extractor based on the convolutional neural network model to obtain a training audio feature vector; fusing the training audio feature vector and the training video context semantic association feature vector based on a Gaussian density map to obtain a training audio-video association feature matrix; calculating an order-prior-based geometric non-rigid consistency factor between the training audio feature vector and the training video context semantic association feature vector; passing the training audio-video association feature matrix through the classifier to obtain a classification loss function value; and training the ViT model containing the embedded layer, the sound feature extractor based on the convolutional neural network model, and the classifier with a weighted sum of the classification loss function value and the order-prior-based geometric non-rigid consistency factor as the loss function value.
In particular, in the technical scheme of the application, the training audio feature vector and the training video context semantic association feature vector represent the key features of the audio data and the video data respectively. Further improving the mutual responsiveness between the audio and video information allows the two modalities to be synthesized better and the associations and interactions between audio and video to be captured, which improves the expressiveness of the audio-video association feature matrix and describes the semantic correlation between the audio and video data more accurately. At the same time, audio and video are different modalities that carry different types of information; strengthening the responsiveness between the training audio feature vector and the training video context semantic association feature vector enhances the cross-modal expression and better captures the correspondence and semantic consistency between audio and video, which improves the accuracy and robustness of the training audio-video association feature matrix. Since the training audio-video association feature matrix is used to indicate whether a tourist is committing a violation at the current time point, improving this responsiveness also improves the accuracy and reliability of violation detection, helping to identify and classify violations and improving the performance and effectiveness of the monitoring. Therefore, in the technical scheme of the application, the responsiveness between the training audio feature vector and the training video context semantic association feature vector is improved by calculating the order-prior-based geometric non-rigid consistency factor between them.
Further, in order to strengthen the correlation between the feature distributions and the predicted result under the class probability, the order-prior-based geometric non-rigid consistency factor is used as a loss function value to measure the similarity and difference between the training audio feature vector and the training video context semantic association feature vector. In particular, the factor reduces the difficulty of mutually representing the feature distributions of the two vectors, thereby enhancing the expressiveness and discriminability of the feature vectors. At the same time, the order-prior-based geometric non-rigid consistency factor also takes the order of the feature vectors into account: the arrangement and transformation of the feature vectors should keep a certain order and regularity, consistent with the internal structure and logic of the data. Accordingly, by introducing this factor, the parameter updates of the model optimize the correlation responsiveness between the training audio feature vector and the training video context semantic association feature vector, improving the generalization and robustness of the model.
Specifically, calculating the order-prior-based geometric non-rigid consistency factor between the training audio feature vector and the training video context semantic association feature vector comprises: calculating the order-prior-based geometric non-rigid consistency factor between the training audio feature vector and the training video context semantic association feature vector with a geometric non-rigid consistency factor calculation formula;
wherein, in the geometric non-rigid consistency factor calculation formula, V₁ represents the training audio feature vector, V₂ represents the training video context semantic association feature vector, ‖·‖₂ represents the two-norm of a feature vector, cos(V₁, V₂) represents the cosine distance between the training audio feature vector and the training video context semantic association feature vector, and Loss represents the order-prior-based geometric non-rigid consistency factor.
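The training step can then be sketched as the classification loss plus a weighted consistency term. The exact formula of the factor is not reproduced in this text, so consistency_factor() below is only a placeholder built from the quantities named above (two-norms and the cosine distance between the two training feature vectors); treat it, the model output layout, and the weight lam as assumptions.

```python
# Sketch of one training step with the combined loss: classification loss plus a
# weighted order-prior-based geometric non-rigid consistency term (placeholder).
import torch
import torch.nn.functional as F

def consistency_factor(audio_vec: torch.Tensor, video_vec: torch.Tensor) -> torch.Tensor:
    cos = F.cosine_similarity(audio_vec, video_vec, dim=-1)
    norm_gap = (audio_vec.norm(dim=-1) - video_vec.norm(dim=-1)).abs()
    return ((1.0 - cos) + norm_gap).mean()             # placeholder combination

def training_step(model, batch, optimizer, lam: float = 0.1) -> float:
    video, audio, labels = batch                       # assumed batch layout
    logits, audio_vec, video_vec = model(video, audio) # assumed model outputs
    loss = F.cross_entropy(logits, labels) + lam * consistency_factor(audio_vec, video_vec)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```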
In summary, the audio/video intelligent processing method according to the embodiment of the application has been explained: an animal park monitoring video is collected, audio data is extracted from it, deep learning is used to extract key semantic features from the video data and the audio data respectively, and whether a tourist is committing a violation is judged based on the fusion of the video semantic features and the audio semantic features. In this way, tourist violations in the animal park can be monitored and recognized in real time, improving the supervision effect and saving a large amount of human resources.
Fig. 6 is a block diagram of an audio/video intelligent processing device according to an embodiment of the present application. As shown in fig. 6, an audio/video intelligent processing device 100 according to an embodiment of the present application includes: the audio and video acquisition module 110, configured to acquire an animal park monitoring video in real time and extract audio data from the animal park monitoring video; the video semantic coding module 120, configured to pass the animal park monitoring video through a ViT model containing an embedded layer to obtain a video context semantic association feature vector; the sound spectrogram extraction module 130, configured to extract a logarithmic Mel spectrogram, a cochlear spectrogram, and a constant Q transform spectrogram of the audio data and arrange them into a multi-channel sound spectrogram; the audio feature encoding module 140, configured to pass the multi-channel sound spectrogram through a sound feature extractor based on a convolutional neural network model to obtain an audio feature vector, where a channel attention mechanism is used in the convolutional neural network model of the sound feature extractor; the Gaussian fusion module 150, configured to fuse the audio feature vector and the video context semantic association feature vector based on a Gaussian density map to obtain an audio-video association feature matrix; and the analysis result generation module 160, configured to pass the audio-video association feature matrix through a classifier to obtain a classification result, where the classification result is used to indicate whether a tourist is committing a violation at the current time point.
Here, it will be understood by those skilled in the art that the specific operations of the respective steps in the above-described audio/video intelligent processing apparatus have been described in detail in the above description of the audio/video intelligent processing method with reference to fig. 1 to 5, and thus, repetitive descriptions thereof will be omitted.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In several embodiments provided in the present application, the disclosed methods and apparatuses may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and the division of the units is merely a logical function division, and there may be another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another apparatus, or some features may be omitted, or not performed.
Those skilled in the art will appreciate that the present application is not limited to the particular embodiments described herein, but is capable of numerous obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the present application. Thus, while the present application has been described in terms of the foregoing embodiments, the present application is not limited to the foregoing embodiments, but may include many other equivalent embodiments without departing from the spirit of the present application, all of which fall within the scope of the present application.

Claims (10)

1. An audio and video intelligent processing method is characterized by comprising the following steps:
acquiring an animal park monitoring video in real time, and extracting audio data from the animal park monitoring video;
the animal park monitoring video passes through a ViT model containing an embedded layer to obtain a video context semantic association feature vector;
extracting a logarithmic mel spectrogram, a cochlear spectrogram and a constant Q transformation spectrogram of the audio data, and arranging the logarithmic mel spectrogram, the cochlear spectrogram and the constant Q transformation spectrogram into a multichannel sound spectrogram;
the multichannel sound spectrogram passes through a sound feature extractor based on a convolutional neural network model to obtain an audio feature vector, wherein a channel attention mechanism is used in the convolutional neural network model of the sound feature extractor;
fusing the audio feature vector and the video context semantic association feature vector based on a Gaussian density map to obtain an audio-video association feature matrix;
and passing the audio and video association feature matrix through a classifier to obtain a classification result, wherein the classification result is used for indicating whether a tourist is committing a violation at the current time point.
2. The audio-video intelligent processing method according to claim 1, wherein the step of passing the animal park monitoring video through a ViT model containing an embedded layer to obtain a video context semantic association feature vector comprises the steps of:
sampling the monitoring video of the animal park at a preset sampling frequency to obtain a plurality of video image frames;
passing the plurality of video image frames through a ViT model comprising an embedded layer to obtain a plurality of image frame semantic feature vectors;
and cascading the plurality of image frame semantic feature vectors to obtain the video context semantic association feature vector.
3. The method of intelligent audio/video processing according to claim 2, wherein passing the plurality of video image frames through a ViT model including an embedded layer to obtain a plurality of image frame semantic feature vectors, respectively, comprises:
performing image blocking processing on the video image frames to obtain a sequence of image blocks;
using the embedding layer of the ViT model to respectively carry out embedded coding on each image block in the sequence of the image blocks so as to obtain a plurality of image block embedded vectors;
and inputting the plurality of image block embedded vectors into a converter module of the ViT model for conversion coding to obtain the image frame semantic feature vectors.
4. The method of claim 3, wherein passing the multi-channel sound spectrogram through the sound feature extractor based on the convolutional neural network model to obtain the audio feature vector comprises: performing, by each layer of the sound feature extractor based on the convolutional neural network model, the following forward pass on the input data:
performing convolution processing on the input data with a three-dimensional convolution kernel to obtain a convolution feature map;
performing global mean pooling on each feature matrix of the convolution feature map along the channel dimension to obtain a channel feature vector;
calculating the ratio of the feature value at each position in the channel feature vector to the weighted sum of the feature values at all positions of the channel feature vector to obtain a channel weighted feature vector;
weighting the feature matrices of the convolution feature map along the channel dimension, using the feature value at each position in the channel weighted feature vector as a weight, to obtain a channel attention feature map;
performing global pooling on each feature matrix of the channel attention feature map along the channel dimension to obtain a pooled feature map;
performing activation processing on the pooled feature map to generate an activation feature map;
wherein the output of the last layer of the sound feature extractor is the audio feature vector, and the input of the first layer of the sound feature extractor is the multi-channel sound spectrogram.
5. The method of claim 4, wherein fusing the audio feature vector and the video context semantic association feature vector based on a gaussian density map to obtain an audio-video association feature matrix, comprises:
constructing a Gaussian density map of the audio feature vector and the video context semantic association feature vector with the following Gaussian formula:
p(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))
wherein μ is the mean vector of the audio feature vector and the video context semantic association feature vector, σ is the variance between the feature values of each pair of positions in the audio feature vector and the video context semantic association feature vector, p(·) denotes the Gaussian density probability function, and x denotes the variable of the Gaussian density map;
and carrying out Gaussian discretization on the Gaussian distribution of each position in the Gaussian density map to obtain the audio and video association characteristic matrix.
6. The audio/video intelligent processing method according to claim 5, further comprising: a training phase for training the ViT model including the embedded layer, the convolutional neural network model-based acoustic feature extractor, and the classifier;
wherein the training phase comprises:
acquiring a training animal park monitoring video, and extracting training audio data from the training animal park monitoring video;
passing the training animal park monitoring video through a ViT model containing an embedded layer to obtain a training video context semantic association feature vector;
extracting a logarithmic mel spectrogram, a cochlear spectrogram and a constant Q transformation spectrogram of the training audio data, and arranging the logarithmic mel spectrogram, the cochlear spectrogram and the constant Q transformation spectrogram into a training multichannel sound spectrogram;
the training multichannel sound spectrogram passes through a sound feature extractor based on a convolutional neural network model to obtain a training audio feature vector;
fusing the training audio feature vector and the training video context semantic association feature vector based on a Gaussian density chart to obtain a training audio-video association feature matrix;
calculating geometric non-rigid consistency factors based on order priors between the training audio feature vectors and the training video context semantic association feature vectors;
the training audio and video association characteristic matrix passes through a classifier to obtain a classification loss function value;
training the ViT model containing the embedded layer, the sound feature extractor based on the convolutional neural network model, and the classifier with a weighted sum of the classification loss function value and the order-prior-based geometric non-rigid consistency factor as the loss function value.
7. The audio/video intelligent processing method according to claim 6, wherein calculating the order-prior-based geometric non-rigid consistency factor between the training audio feature vector and the training video context semantic association feature vector comprises: calculating the order-prior-based geometric non-rigid consistency factor between the training audio feature vector and the training video context semantic association feature vector with a geometric non-rigid consistency factor calculation formula;
wherein, in the geometric non-rigid consistency factor calculation formula, V₁ represents the training audio feature vector, V₂ represents the training video context semantic association feature vector, ‖·‖₂ represents the two-norm of a feature vector, cos(V₁, V₂) represents the cosine distance between the training audio feature vector and the training video context semantic association feature vector, and Loss represents the order-prior-based geometric non-rigid consistency factor.
8. An audio/video intelligent processing device, which is characterized by comprising:
the audio and video acquisition module is used for acquiring the animal park monitoring video in real time and extracting audio data from the animal park monitoring video;
the video semantic coding module is used for enabling the animal park monitoring video to pass through a ViT model containing an embedded layer to obtain video context semantic association feature vectors;
the sound spectrogram extraction module is used for extracting a logarithmic Mel spectrogram, a cochlear spectrogram and a constant Q transformation spectrogram of the audio data and arranging the logarithmic Mel spectrogram, the cochlear spectrogram and the constant Q transformation spectrogram into a multi-channel sound spectrogram;
the audio feature coding module is used for enabling the multichannel sound spectrogram to pass through a sound feature extractor based on a convolutional neural network model to obtain an audio feature vector, wherein a channel attention mechanism is used in the convolutional neural network model of the sound feature extractor;
the Gaussian fusion module is used for fusing the audio feature vector and the video context semantic association feature vector based on a Gaussian density chart to obtain an audio-video association feature matrix;
and the analysis result generation module is used for passing the audio and video association feature matrix through a classifier to obtain a classification result, wherein the classification result is used for indicating whether a tourist is committing a violation at the current time point.
9. An electronic device, comprising:
a memory for storing instructions;
a processor coupled to the memory, the processor being configured to execute the instructions stored in the memory to implement the audio/video intelligent processing method according to any one of claims 1-7.
10. A storage medium, wherein an audio/video intelligent processing program is stored on the storage medium, and when the audio/video intelligent processing program is executed by a processor, the audio/video intelligent processing method according to any one of claims 1 to 7 is implemented.
CN202311754364.4A 2023-12-19 2023-12-19 Audio and video intelligent processing method and device, storage medium and electronic equipment Pending CN117633604A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311754364.4A CN117633604A (en) 2023-12-19 2023-12-19 Audio and video intelligent processing method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311754364.4A CN117633604A (en) 2023-12-19 2023-12-19 Audio and video intelligent processing method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN117633604A true CN117633604A (en) 2024-03-01

Family

ID=90019939

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311754364.4A Pending CN117633604A (en) 2023-12-19 2023-12-19 Audio and video intelligent processing method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117633604A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination