CN115700880A - Behavior monitoring method and device, electronic equipment and storage medium

Publication number
CN115700880A
Authority
CN
China
Prior art keywords
audio
behavior
monitoring
video
spectrogram
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110829669.1A
Other languages
Chinese (zh)
Inventor
夏艺菲
苗海委
陈建
周剑
李泽源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Chengdu ICT Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Chengdu ICT Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by China Mobile Communications Group Co Ltd, China Mobile Chengdu ICT Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202110829669.1A priority Critical patent/CN115700880A/en
Publication of CN115700880A publication Critical patent/CN115700880A/en
Pending legal-status Critical Current

Landscapes

  • Emergency Alarm Devices (AREA)

Abstract

The application discloses a behavior monitoring method and device, electronic equipment and a storage medium. The method comprises the following steps: extracting at least one second audio from a first audio, where the first audio characterizes sound emitted by at least two monitored objects and each second audio characterizes the sound emitted by one of the at least two monitored objects; inputting each second audio and its corresponding second video into a first set model to obtain a first behavior feature corresponding to each of the at least two monitored objects; and matching the first behavior feature corresponding to each of the at least two monitored objects with a first set behavior feature to obtain a first behavior monitoring result, where the second video represents a video in which the corresponding monitored object is captured. The method can accurately and quickly locate a monitored object with abnormal behavior and improves locating efficiency.

Description

Behavior monitoring method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of monitoring technologies, and in particular, to a behavior monitoring method and apparatus, an electronic device, and a storage medium.
Background
In the breeding industry, the health condition and breeding efficiency of pigs are important indexes for measuring a farm's breeding technology. The health condition is particularly important: once an epidemic disease occurs in a herd, the farm's income is seriously affected and the economic loss may be immeasurable. With the development of science and technology, monitoring the vocal behavior of pigs through remote monitoring has become a difficult and key point of current research.
In the related art, the health monitoring of pigs mainly includes the following three approaches:
1. and (5) manually patrolling. And (4) carrying out patrol inspection on the health condition of the pigs in the pigsty by an observer, paying attention to abnormal sound production behaviors of the pigs, marking the pigs and recording the information of the pigs if the pigs with the abnormal sound production behaviors are found. Although the method is complete, complete and reliable in inspection, the method has the disadvantages of consuming time and labor, depending on the experience of observers, being only suitable for small-scale cultivation scenes and not suitable for large-scale cultivation scenes with efficient management.
2. Matching an envelope template with pig sounds to identify pigs with abnormal vocal behaviors. Specifically, abnormal sounds of various pigs are collected in advance and envelope templates of the abnormal sounds are established; sound data of the pigs to be monitored are then collected and matched with the envelope templates to judge whether abnormal vocal behaviors exist. The drawback of this approach is that some sounds similar to abnormal sounds fail to match the envelope template, so the monitoring accuracy is low, and abnormal vocal behaviors of multiple pigs cannot be monitored simultaneously.
3. Collecting audio data of the pigs with audio equipment, and monitoring and classifying abnormal sounds of the herd by machine learning and deep learning. Specifically, audio data of the herd are collected and cough and non-cough audio data are manually separated; audio features such as Mel Frequency Cepstral Coefficients (MFCC) or spectrograms extracted from the herd audio are used as input data of an abnormal sound classification model; the model is trained with machine learning or deep learning methods such as Dynamic Time Warping (DTW), Vector Quantization (VQ), Fuzzy C-Means clustering (FCM), Hidden Markov Models (HMM), Artificial Neural Networks (ANN) and convolutional neural networks; and abnormal sounds of the herd are classified by the trained model. The collected herd audio is input into the abnormal sound classification model, and if the model judges that abnormal sound exists in the audio, pigs with abnormal vocal behaviors are found manually in combination with the position of the abnormal herd. This approach usually collects audio of the whole herd, so it can only conclude that some pig in the herd has abnormal vocal behavior and cannot accurately locate the target pig; the target pig therefore has to be found manually, making its determination inefficient. Secondly, audio features such as MFCC, Power Spectral Density (PSD) and Linear Predictive Cepstral Coefficients (LPCC) are generally adopted as model input, so the trained model poorly distinguishes abnormal sounds such as coughing, screaming and metal gnawing, and the classification precision is poor. In addition, the abnormal sound classification model is trained only with audio data, which makes it difficult to accurately judge abnormal vocal behaviors of the pigs.
That is, in the related art, determining pigs with abnormal vocal behaviors suffers from both low accuracy and low efficiency.
Disclosure of Invention
In view of this, a main object of the embodiments of the present application is to provide a behavior monitoring method, a behavior monitoring device, an electronic device, and a storage medium, so as to solve the problems in the related art that the determination accuracy and the determination efficiency of a pig with abnormal behavior are low.
In order to achieve the above purpose, the technical solution of the embodiment of the present application is implemented as follows:
the embodiment of the application provides a behavior monitoring method, which comprises the following steps:
extracting at least one second audio from the first audio; the first audio characterizing sound emitted by at least two monitored objects; each second audio in the at least one second audio characterizing a sound emitted by one of the at least two monitored objects;
inputting each second audio in the at least one second audio and the corresponding second video into a first set model to obtain a first behavior characteristic corresponding to each monitoring object in the at least two monitoring objects;
matching the first behavior characteristic corresponding to each of the at least two monitoring objects with a first set behavior characteristic to obtain a first behavior monitoring result; wherein,
the second video represents a video in which the corresponding monitoring object is captured.
In the foregoing solution, the matching the first behavior feature corresponding to each of the at least two monitoring objects with the first set behavior feature includes:
determining that the first behavior feature matches the first set behavior feature if the first behavior feature satisfies at least one of the following conditions:
voice signals with amplitude values larger than a set threshold value exist in the first spectrogram; the first spectrogram is a spectrogram of a second audio corresponding to the monitoring object corresponding to the first behavior feature;
and judging that the corresponding monitored object generates a set behavior based on the second video corresponding to the monitored object corresponding to the first behavior characteristic.
In the above scheme, the method further comprises:
under the condition that the first behavior monitoring result represents that the first behavior characteristics corresponding to the monitored object are matched with the first set behavior characteristics, inputting corresponding second audio and corresponding second video into a second set model to obtain second behavior characteristics corresponding to the monitored object;
matching the obtained second behavior characteristic with a second set behavior characteristic to obtain a second behavior monitoring result of the corresponding monitoring object; wherein,
and the second set behavior characteristics represent abnormal behaviors of the corresponding monitoring object.
In the foregoing solution, the matching the obtained second behavior feature with the second set behavior feature includes:
determining that the second behavior feature matches a second set behavior feature if the obtained second behavior feature satisfies at least one of the following conditions:
the time interval of the occurrence of the voice signals with the amplitude larger than the set threshold value in the second spectrogram is smaller than the set time interval; the second spectrogram is a spectrogram of a second audio corresponding to the monitoring object corresponding to the second behavior characteristic;
the duration of the voice signals with the amplitude larger than the set threshold value in the second spectrogram is larger than the set duration.
In the foregoing solution, the method further includes:
and under the condition that the second behavior monitoring result represents that the second behavior characteristic is matched with the second set behavior characteristic, determining the monitoring object corresponding to the second behavior characteristic based on the audio coding of the second audio of the monitoring object corresponding to the second behavior characteristic.
In the above solution, after at least one second audio is extracted from the first audio, the method further includes:
determining a monitoring object corresponding to a second audio based on the audio coding of the second audio;
and acquiring a second video corresponding to the monitored object.
In the above solution, before the extracting at least one second audio from the first audio, the method further includes:
respectively inputting the sound emitted by each monitored object into a set voice coder to obtain the audio coding of the sound emitted by each monitored object;
and storing the corresponding relation between each monitoring object and the audio coding of the emitted sound.
The embodiment of the present application further provides a model training method, where the method is used to train a first setting model in any one of the behavior monitoring methods, and the method includes:
acquiring an audio sample and a video sample of a monitored object; the audio sample characterizes sound emitted by the monitored object; the video sample represents a video which is acquired simultaneously with the audio sample and shoots the monitored object;
inputting the audio characteristics corresponding to the audio sample and the video sample into a first set model to obtain a first output result; the first output result represents a first behavior characteristic corresponding to the monitored object;
calculating a loss value based on the first output result, and updating a weight parameter of the first set model based on the loss value; wherein,
the audio features corresponding to the audio samples comprise cross-correlation coefficient matrix features, and the cross-correlation coefficient matrix features represent correlation coefficients between two adjacent frames in a spectrogram corresponding to the audio samples.
In the above scheme, the audio characteristics corresponding to the audio samples further include at least one of:
a spectrogram corresponding to the audio sample;
a mel frequency cepstrum characteristic corresponding to the audio sample;
a first order difference feature corresponding to the audio sample;
and the second-order difference characteristic corresponding to the audio sample.
An embodiment of the present application further provides a behavior monitoring device, the device includes:
an extracting unit for extracting at least one second audio from the first audio; the first audio characterizing sound emitted by at least two monitored objects; each second audio in the at least one second audio characterizing a sound emitted by one of the at least two monitored objects;
the input unit is used for inputting each second audio and the corresponding second video in the at least one second audio into a first set model to obtain a first behavior characteristic corresponding to each monitoring object in the at least two monitoring objects;
the matching unit is used for matching the first behavior characteristic corresponding to each of the at least two monitored objects with a first set behavior characteristic to obtain a first behavior monitoring result; wherein,
the second video represents a video in which the corresponding monitored object is captured.
An embodiment of the present application further provides a model training device, the device includes:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring an audio sample and a video sample of a monitored object; the audio sample characterizes sound emitted by the monitored object; the video sample represents a video which is acquired simultaneously with the audio sample and shoots the monitoring object;
the input unit is used for inputting the audio characteristics corresponding to the audio samples and the video samples into a first set model to obtain a first output result; the first output result represents a first behavior characteristic corresponding to the monitored object;
a calculation unit configured to calculate a loss value based on the first output result and update a weight parameter of the first set model based on the loss value; wherein,
the audio features corresponding to the audio samples comprise cross-correlation coefficient matrix features, and the cross-correlation coefficient matrix features represent correlation coefficients between two adjacent frames in a spectrogram corresponding to the audio samples.
An embodiment of the present application further provides an electronic device, including: a processor and a memory for storing a computer program capable of running on the processor, wherein,
the processor is adapted to perform the steps of any of the above methods when running the computer program.
Embodiments of the present application further provide a storage medium on which a computer program is stored, where the computer program is executed by a processor to implement the steps of any one of the above methods.
In an embodiment of the application, at least one second audio is extracted from a first audio, where the first audio represents sound emitted by at least two monitored objects and each second audio represents the sound emitted by one of the at least two monitored objects; each second audio and its corresponding second video are input into the first set model to obtain a first behavior feature corresponding to each of the at least two monitored objects; the first behavior feature corresponding to each of the at least two monitored objects is matched with a first set behavior feature to obtain a first behavior monitoring result; and the second video represents a video in which the corresponding monitored object is captured. In this way, the individual audio of each monitored object can be extracted from the mixed audio formed by the sounds of multiple monitored objects, and behavior monitoring is performed in combination with the video corresponding to each monitored object. The multi-modal approach improves the accuracy of behavior monitoring, and because the audio of a single monitored object is extracted from the mixed audio for behavior monitoring, a monitored object exhibiting abnormal behavior can be located accurately and quickly, improving the efficiency of locating the monitored object.
Drawings
Fig. 1 is a schematic flow chart illustrating an implementation of a behavior monitoring method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a training process of a speech separation model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a spectrogram provided in an embodiment of the present application;
fig. 4 is a schematic diagram of behavior monitoring performed by a second setting model according to an embodiment of the present disclosure;
fig. 5 is a schematic flow chart of an implementation of a behavior monitoring method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an implementation flow of a model training method provided in the embodiment of the present application;
FIG. 7 is a schematic diagram of audio data processing provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of extracting features of a video sample for training according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of extracting audio features of an audio sample for training according to an embodiment of the present application;
fig. 10 is a schematic view of a behavior monitoring device provided in an embodiment of the present application;
FIG. 11 is a schematic diagram of a model training apparatus according to an embodiment of the present application;
fig. 12 is a schematic diagram of a hardware component structure of an electronic device according to an embodiment of the present application.
Detailed Description
In the related art, determining pigs with abnormal vocal behaviors suffers from both low accuracy and low efficiency.
Based on this, the embodiments of the application provide a behavior monitoring method, a behavior monitoring device, an electronic device and a storage medium. At least one second audio is extracted from a first audio, where the first audio represents sound emitted by at least two monitored objects and each second audio represents the sound emitted by one of the at least two monitored objects; each second audio and its corresponding second video are input into the first set model to obtain a first behavior feature corresponding to each of the at least two monitored objects; the first behavior feature corresponding to each of the at least two monitored objects is matched with a first set behavior feature to obtain a first behavior monitoring result; and the second video represents a video in which the corresponding monitored object is captured. In this way, the individual audio of each monitored object can be extracted from the mixed audio formed by the sounds of multiple monitored objects, and behavior monitoring is performed in combination with the video corresponding to each monitored object. The multi-modal approach improves the accuracy of behavior monitoring, and because the audio of a single monitored object is extracted from the mixed audio for behavior monitoring, a monitored object exhibiting abnormal behavior can be located accurately and quickly, improving the efficiency of locating the monitored object.
The present application will be described in further detail with reference to the drawings and examples. For convenience of understanding, the behavior monitoring method provided by the present application is described in detail in the embodiments of the present application by taking the monitored target as a pig as an example.
Fig. 1 is a schematic implementation flow diagram of a behavior monitoring method provided in an embodiment of the present application. As shown in fig. 1, the method includes:
step 101: extracting at least one second audio from the first audio; the first audio characterizing sound emitted by at least two monitored objects; each of the at least one second audio corresponds to a sound that characterizes one of the at least two monitored objects.
Here, for more accurate behavior monitoring, at least one second audio is first extracted from the first audio, where the first audio characterizes the sound emitted by at least two monitored objects and each second audio characterizes the sound emitted by one monitored object. When the monitored objects are pigs, the first audio represents the mixed audio formed by the sounds emitted by multiple pigs, and a second audio represents the sound emitted by any one of those pigs. Extracting the audio of a single pig from the mixed audio eliminates the interference caused by the audio of the other pigs and facilitates behavior monitoring of each individual pig.
In practical applications, at least one second audio may be extracted from the first audio by a speech separation model. The speech separation model includes two parts, an audio encoder and an audio filter.
For the audio encoder, in the data acquisition stage, the audio of each pig can be collected separately to improve the precision of model training; specifically, each pig is placed alone in a pigsty and its audio is recorded. The log-Mel cepstral energy features of the collected audio of each pig are extracted and input into a three-layer Long Short-Term Memory (LSTM) model to obtain an audio vector (pig-vector) for each pig, with a dimensionality of 256. The audio vector characterizes the timbre of each pig and can uniquely identify the pig.
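The patent gives no code for this encoder; the following is a minimal sketch of the idea, assuming PyTorch for the three-layer LSTM and librosa for the log-Mel features. The class and function names, and all sizes other than the 256-dimensional output, are illustrative assumptions.

```python
# A minimal sketch of the audio encoder described above (assumed libraries:
# PyTorch, librosa, NumPy). Layer sizes other than the 256-dim output are guesses.
import librosa
import numpy as np
import torch
import torch.nn as nn

class PigVectorEncoder(nn.Module):
    """Three-layer LSTM mapping log-Mel features to a 256-dim pig-vector."""
    def __init__(self, n_mels=40, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_mels, hidden_size=hidden,
                            num_layers=3, batch_first=True)

    def forward(self, logmel):                       # logmel: (batch, frames, n_mels)
        _, (h, _) = self.lstm(logmel)
        vec = h[-1]                                  # final hidden state of last layer
        return vec / vec.norm(dim=-1, keepdim=True)  # unit-length pig-vector

def logmel_features(wav_path, sr=16000, n_mels=40):
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return np.log(mel + 1e-6).T                      # (frames, n_mels)
```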
For the audio filter, the mixed audio formed by the collected sounds of multiple pigs is used as the input of the speech separation model, and a filter network over the time and frequency domains is trained using the pig-vector of each pig, with the audio of the pig corresponding to that pig-vector as the label. That is, the input is the pig-vector of a single pig together with the mixed audio formed by the sounds of multiple pigs; after training, the filter network can remove the interfering audio of the other pigs and output the audio of the single pig corresponding to the pig-vector.
For ease of understanding, the audio separated by the speech separation model is taken as the audio of the target pig.
Fig. 2 is a schematic diagram of a training process of a speech separation model according to an embodiment of the present application, as shown in fig. 2:
firstly, inputting the audio of the target pig into a three-layer LSTM model to obtain the pig-vector corresponding to the target pig.
A Short-Time Fourier Transform (STFT) is applied to the noisy mixed audio composed of the sounds of multiple pigs to obtain the corresponding spectrogram. The magnitude spectrum of the spectrogram and the pig-vector of the target pig are input into the filter network, which outputs a soft mask feature.
The soft mask feature is multiplied with the spectrogram corresponding to the noisy mixed audio to obtain an enhanced magnitude spectrum. The original magnitude spectrum of the noisy mixed audio's spectrogram and the enhanced magnitude spectrum are combined to obtain a spectrogram mask.
An inverse STFT is applied to the spectrogram mask to obtain the enhanced audio.
The collected audio of the target pig is denoised to obtain the clean audio of the target pig, and an STFT is applied to the clean audio to obtain the corresponding spectrogram. A loss value is calculated from the difference between the spectrogram mask and the magnitude spectrum of the spectrogram corresponding to the clean audio of the target pig, and the parameters of the speech separation model are updated based on the loss value.
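As an illustration only, a single training step under the description above might look like the following sketch (PyTorch assumed). The STFT parameters and the magnitude-spectrogram MSE loss are assumptions rather than details from the patent, and `filter_net` stands for any module that maps a mixture magnitude spectrogram and a pig-vector to a soft mask; a sketch of such a filter network follows Table 1 below.

```python
# A minimal, assumed training step for the separation model (PyTorch).
import torch
import torch.nn.functional as F

def separation_step(filter_net, optimizer, mixture, clean, pig_vec,
                    n_fft=512, hop=128):
    # Magnitude spectrograms of the noisy mixture and the clean target audio.
    mix_mag = torch.stft(mixture, n_fft, hop, return_complex=True).abs()
    clean_mag = torch.stft(clean, n_fft, hop, return_complex=True).abs()

    # The filter network predicts a soft mask conditioned on the pig-vector.
    mask = filter_net(mix_mag, pig_vec)           # values in (0, 1) via sigmoid
    enhanced_mag = mask * mix_mag                 # enhanced magnitude spectrum

    loss = F.mse_loss(enhanced_mag, clean_mag)    # compare against clean magnitude
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```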
Illustratively, the filter network consists of 8 convolutional (CNN) layers, 1 LSTM layer and 2 fully-connected (FC) layers; the activation functions of all layers except the last are rectified linear units (ReLU), and the activation function of the last layer is a sigmoid function. At each layer, the pig-vector of the target pig is repeatedly concatenated with the output of the previous convolutional layer, and the concatenated value is taken as the input of the next layer. The detailed parameters of each layer are shown in Table 1:
Table 1 (detailed parameters of each layer; provided as an image in the original document)
In Table 1, Width denotes the convolution kernel width, Dilation denotes the dilation factor, time denotes the value in the time domain, freq denotes the value in the frequency domain, and Filters/Nodes denotes the number of filters or nodes.
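Because Table 1 is only available as an image, the exact kernel widths, dilations and filter counts are not reproducible here. The sketch below (PyTorch assumed) therefore only illustrates the described structure: eight convolutional layers with the pig-vector re-concatenated before each one, an LSTM, two FC layers, ReLU activations and a final sigmoid; all sizes are placeholders.

```python
# A minimal sketch of the filter network structure; sizes are assumptions.
import torch
import torch.nn as nn

class FilterNet(nn.Module):
    """Predicts a soft mask over the mixture magnitude spectrogram."""
    def __init__(self, freq_bins=257, emb_dim=256, channels=64, hidden=400):
        super().__init__()
        self.convs = nn.ModuleList()
        in_ch = 1
        for _ in range(8):                           # 8 convolutional layers
            self.convs.append(nn.Conv2d(in_ch + emb_dim, channels,
                                        kernel_size=3, padding=1))
            in_ch = channels
        self.lstm = nn.LSTM(channels * freq_bins, hidden, batch_first=True)
        self.fc1 = nn.Linear(hidden, hidden)
        self.fc2 = nn.Linear(hidden, freq_bins)

    def forward(self, mix_mag, pig_vec):
        # mix_mag: (batch, freq, time); pig_vec: (batch, emb_dim)
        x = mix_mag.unsqueeze(1)                     # (B, 1, F, T)
        b, _, f, t = x.shape
        emb = pig_vec[:, :, None, None].expand(b, -1, f, t)
        for conv in self.convs:
            # Re-concatenate the pig-vector before every convolutional layer.
            x = torch.relu(conv(torch.cat([x, emb], dim=1)))
        x = x.permute(0, 3, 1, 2).reshape(b, t, -1)  # (B, T, C*F) for the LSTM
        x, _ = self.lstm(x)
        x = torch.relu(self.fc1(x))
        mask = torch.sigmoid(self.fc2(x))            # values in (0, 1)
        return mask.transpose(1, 2)                  # (B, F, T), same shape as mix_mag
```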
Step 102: inputting each second audio in the at least one second audio and the corresponding second video into a first set model to obtain a first behavior characteristic corresponding to each monitoring object in the at least two monitoring objects; and the second video representation shoots a video of the corresponding monitored object.
Here, after at least one second audio is extracted from the first audio, each second audio and the corresponding second video, which represents the video in which the corresponding monitored object is captured, are input into the first set model. For example, if a second audio represents the sound made by pig 1, the corresponding second video is a video capturing pig 1.
It should be noted that the capturing time point of the second audio is the same as the capturing time point of the second video. In practical application, the video stream information of a plurality of pigs can be obtained in a mode that a terminal shoots videos, and the corresponding audio and the corresponding videos of the plurality of pigs are extracted from the video stream information.
And inputting the second audio and the corresponding second video into the first set model to obtain the first behavior characteristic corresponding to the corresponding monitored object.
When the monitored subject is a pig, the first set model can be used to identify whether the pig has cough behavior characteristics.
Step 103: and matching the first behavior characteristic corresponding to each of the at least two monitoring objects with a first set behavior characteristic to obtain a first behavior monitoring result.
After the first behavior feature of each monitored object is obtained, the first behavior feature of each monitored object is matched with the first set behavior feature, and a first behavior monitoring result is obtained.
In practical applications, the first set behavior feature characterizes the cough behavior of a pig. The first behavior monitoring result can be used to indicate whether the pig corresponding to the first behavior feature has the cough behavior feature.
In an embodiment, the matching the first behavior feature corresponding to each of the at least two monitoring objects with the first set behavior feature includes:
determining that the first behavior feature matches the first set behavior feature if the first behavior feature satisfies at least one of the following conditions:
a voice signal with the amplitude value larger than a set threshold value exists in the first spectrogram; the first spectrogram is a spectrogram of a second audio corresponding to the monitoring object corresponding to the first behavior feature;
and judging that the corresponding monitored object generates a set behavior based on the second video corresponding to the monitored object corresponding to the first behavior characteristic.
Here, the first set behavior feature characterizes the cough behavior of the monitored object. A cough is a process in which contraction of the abdominal muscles generates subglottic pressure and the glottis opens suddenly several times, producing a strong airflow impact in the vocal tract accompanied by a typical sound. The amplitude of the speech signal in the spectrogram of a pig's audio differs greatly between coughing and normal vocalisation: the amplitude is small when the pig vocalises normally and large when the pig coughs, so whether cough behavior occurs can be judged from the amplitude of the speech signal in the spectrogram. Specifically, if a speech signal with an amplitude greater than a set threshold exists in the spectrogram, the pig corresponding to the second audio is considered to have the cough behavior feature. The set threshold represents the amplitude of the speech signal in the audio of normal vocalisation.
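A minimal sketch of this amplitude check, assuming the spectrogram is available as a NumPy magnitude array; the function name is an assumption, and the example threshold of 0.05 only echoes the figure discussion below.

```python
# A minimal, assumed implementation of the amplitude-threshold check.
import numpy as np

def has_cough_amplitude(spectrogram, amp_threshold=0.05):
    """Return True if any frame contains a signal whose amplitude exceeds
    the threshold associated with normal (non-cough) vocalisation."""
    frame_peaks = np.max(np.abs(spectrogram), axis=0)  # peak amplitude per frame
    return bool(np.any(frame_peaks > amp_threshold))
```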
Fig. 3 is a schematic diagram of a spectrogram provided in the embodiment of the present application, as shown in fig. 3:
In graph a, the amplitude of the speech signal in time period I is significantly larger than that in time period II. If the set threshold is 0.05, the amplitude of the speech signal in time period I is around 0.1, which is greater than the threshold, indicating that cough behavior occurs in time period I, that is, the first behavior feature matches the first set behavior feature in time period I. The amplitude of the speech signal in time period II is less than 0.05, indicating that no cough behavior occurs in time period II, so the first behavior feature does not match the first set behavior feature in time period II.
In graph b, if the set threshold is 0.1, the amplitudes of the speech signals in time periods I and III are both significantly greater than the threshold, indicating that cough behavior occurs in time periods I and III, that is, the first behavior feature matches the first set behavior feature in those periods. The amplitude of the speech signal in time period II is less than 0.1, indicating that no cough behavior occurs in time period II, so the first behavior feature does not match the first set behavior feature in time period II.
A pig usually exhibits typical accompanying behaviors, such as body trembling, back arching and hind-limb shaking, only when it coughs. Therefore, it can be determined that the first behavior feature matches the first set behavior feature by judging, based on the second video corresponding to the monitored object corresponding to the first behavior feature, that the corresponding monitored object exhibits a set behavior. The set behavior may be body trembling, back arching or hind-limb shaking. If it is judged, based on the second video of the monitored object corresponding to the first behavior feature, that the corresponding monitored object exhibits the set behavior, the first behavior feature matches the first set behavior feature.
In practical application, after the second video corresponding to a second audio is obtained, the second video is first split into frames and effective frames are extracted. Features are typically extracted from the second video frame by frame using OpenCV. Apparent features are extracted from each frame with a CNN, and an LSTM then learns the temporal features, yielding the vector output of the second video. When extracting effective frames, frames showing the behaviors that accompany a pig's cough, such as body trembling, back arching and hind-limb shaking, are mainly selected.
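A minimal sketch of this frame-level video pipeline, assuming OpenCV for framing and PyTorch/torchvision for the per-frame CNN and the LSTM; the choice of ResNet-18 as the per-frame CNN, the hidden size and all names are assumptions for illustration only.

```python
# A minimal, assumed video pipeline: OpenCV framing, CNN per frame, LSTM over time.
import cv2
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T

def read_frames(video_path, size=224):
    cap = cv2.VideoCapture(video_path)
    to_tensor = T.Compose([T.ToTensor(), T.Resize((size, size))])
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(to_tensor(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        ok, frame = cap.read()
    cap.release()
    return torch.stack(frames)                   # (frames, 3, H, W)

class VideoEncoder(nn.Module):
    """CNN appearance features per frame followed by an LSTM over time."""
    def __init__(self, hidden=256):
        super().__init__()
        cnn = models.resnet18(weights=None)      # ResNet-18 is an assumed choice
        cnn.fc = nn.Identity()                   # 512-dim appearance feature per frame
        self.cnn, self.lstm = cnn, nn.LSTM(512, hidden, batch_first=True)

    def forward(self, frames):                   # frames: (T, 3, H, W)
        feats = self.cnn(frames).unsqueeze(0)    # (1, T, 512)
        _, (h, _) = self.lstm(feats)
        return h[-1]                             # vector output of the video
```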
By the two ways of judgment, whether the first behavior characteristic of the corresponding monitored object is matched with the first set behavior characteristic or not can be judged by combining the audio characteristic and the video characteristic, and the accuracy of the judgment result is improved.
In an embodiment, the method further comprises:
under the condition that the first behavior monitoring result represents that the first behavior characteristics corresponding to the monitored object are matched with the first set behavior characteristics, inputting corresponding second audio and corresponding second video into a second set model to obtain second behavior characteristics corresponding to the monitored object;
matching the obtained second behavior characteristic with a second set behavior characteristic to obtain a second behavior monitoring result of the corresponding monitoring object; wherein, the first and the second end of the pipe are connected with each other,
and the second set behavior characteristics represent abnormal behaviors of the corresponding monitoring object.
Here, if the first behavior monitoring result indicates that the first behavior feature corresponding to a monitored object matches the first set behavior feature, the first set model has determined that the corresponding pig exhibits cough behavior. However, the cough may be caused by drinking or playing with water rather than by illness. To further determine whether the cough behavior is caused by illness, the corresponding second audio and second video are input into the second set model to obtain the second behavior feature of the corresponding monitored object. The second set model judges, based on the speech features and video features of the monitored object, whether the monitored object has the second set behavior feature. The second set behavior feature characterizes abnormal behavior of the corresponding monitored object; in practical application, it characterizes the diseased behavior of the monitored object.
And the output result of the second setting model is a second behavior characteristic corresponding to the monitored object, and after the second behavior characteristic is obtained, the second behavior characteristic is matched with the second setting behavior characteristic to obtain a second behavior monitoring result of the corresponding monitored object.
When the first behavior monitoring result indicates a match, the second audio and the corresponding second video are input into the second set model, and the second behavior feature is matched with the second set behavior feature to obtain a second behavior monitoring result. This further judges whether the corresponding monitored object has the second set behavior feature and thus improves the accuracy of behavior monitoring of the monitored object.
In an embodiment, the matching the obtained second behavior feature with the second set behavior feature includes:
determining that the second behavior feature matches a second set behavior feature if the obtained second behavior feature satisfies at least one of the following conditions:
the time interval of the occurrence of the voice signals with the amplitude larger than the set threshold value in the second spectrogram is smaller than the set time interval; the second spectrogram is a spectrogram of a second audio corresponding to the monitoring object corresponding to the second behavior characteristic;
the duration of the voice signals with the amplitude larger than the set threshold value in the second spectrogram is larger than the set duration.
Here, if the obtained second behavior feature satisfies at least one of the following conditions, it is determined that the second behavior feature matches a second set behavior feature, specifically, a time interval during which a speech signal with an amplitude greater than a set threshold value in the second spectrogram appears is smaller than a set time interval; and/or the duration of the voice signal with the amplitude larger than the set threshold value in the second spectrogram is larger than the set duration. And the second spectrogram is a spectrogram of a second audio corresponding to the monitored object corresponding to the second behavior characteristic.
Because the amplitude of the speech signal in the spectrogram varies little when a pig vocalises normally and varies greatly when the pig coughs, a speech signal with an amplitude greater than the set threshold in the spectrogram indicates that the pig has the cough behavior feature. A pig with the diseased behavior feature generally coughs repeatedly within a short time, and a single cough lasts longer. Therefore, if the time interval between speech signals with amplitudes greater than the set threshold in the second spectrogram is smaller than the set time interval, the cough behavior occurs frequently and the pig has the diseased behavior feature, so the second behavior feature matches the second set behavior feature in this case. If the duration of a speech signal with an amplitude greater than the set threshold in the second spectrogram is greater than the set duration, a single cough lasts long and the pig has the diseased behavior feature, so the second behavior feature also matches the second set behavior feature in this case.
Judging whether the second behavior feature matches the second set behavior feature from the time interval and the duration of speech signals with amplitudes greater than the set threshold improves the accuracy of the judgment result.
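A minimal sketch of these two checks, assuming the above-threshold speech signals have already been collected as (start, end) time pairs in seconds; the concrete threshold values are placeholders, since the patent does not specify them.

```python
# A minimal, assumed implementation of the interval/duration checks.
def matches_sick_behavior(cough_events, max_interval=10.0, min_duration=1.0):
    """cough_events: list of (start, end) times of above-threshold signals."""
    # Condition 1: coughs repeat within less than the set time interval.
    for (_, prev_end), (next_start, _) in zip(cough_events, cough_events[1:]):
        if next_start - prev_end < max_interval:
            return True
    # Condition 2: a single cough lasts longer than the set duration.
    return any(end - start > min_duration for start, end in cough_events)
```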
In an embodiment, the matching the obtained second behavior feature with a second set behavior feature further includes:
and judging that the corresponding monitoring object generates a first set behavior based on a second video corresponding to the monitoring object corresponding to the second behavior characteristic, and determining that the second behavior characteristic is matched with the second set behavior characteristic.
Here, only a pig with the diseased behavior feature exhibits typical behaviors when coughing, such as open-mouth breathing, salivation at the mouth and nose, a dog-sitting posture, and abdominal breathing. Therefore, it can be determined that the second behavior feature matches the second set behavior feature by judging, based on the second video corresponding to the monitored object corresponding to the second behavior feature, that the corresponding monitored object exhibits a first set behavior. The first set behavior may be one or more of open-mouth breathing, salivation at the mouth and nose, a dog-sitting posture, and abdominal breathing. If it is judged, based on the second video of the monitored object corresponding to the second behavior feature, that the corresponding monitored object exhibits the first set behavior, the second behavior feature matches the second set behavior feature.
It should be noted that, in the embodiments of the present application, the feature extraction for the second video in the second set model is the same as the feature extraction for the second video in the first set model described above; the difference is that the two models extract different image features.
Since coughs in normal living scenes can be caused by drinking or playing with water, features of such coughs are also extracted when OpenCV is used to extract features from the video frames, so that coughs caused by drinking or playing can be better distinguished from coughs caused by the diseased behavior feature.
By determining whether the second behavior feature matches the second set behavior feature by using whether the first set behavior occurs in the second video, the behavior monitoring result of the monitored object can be further accurately determined from a video perspective.
Fig. 4 is a schematic diagram of behavior monitoring performed by a second setting model according to an embodiment of the present application, as shown in fig. 4:
and (3) performing frame processing on the video of the pig, and extracting the characteristics in each frame of image through CNN and LSTM to obtain a video output result. And extracting cough time interval characteristics and single cough duration characteristics in the audio, and obtaining a voice output result through two FC layers. And splicing the video output result and the voice output result, obtaining a final behavior monitoring result through two FC layers and one softmax layer, and judging whether the pigs have the characteristics of the diseased behaviors or not.
In an embodiment, the method further comprises:
and under the condition that the second behavior monitoring result represents that the second behavior characteristic is matched with the second set behavior characteristic, determining the monitoring object corresponding to the second behavior characteristic based on the audio coding of the second audio of the monitoring object corresponding to the second behavior characteristic.
Here, if the second behavior feature matches the second set behavior feature, the second set behavior represents an abnormal behavior feature, which indicates that the monitored object corresponding to the second behavior feature has an abnormal behavior feature, and at this time, the monitored object having the abnormal behavior feature needs to be determined. Specifically, the monitoring object is determined based on the audio coding of the second audio of the monitoring object corresponding to the second behavior feature. Since each monitoring object corresponds to a unique audio code, the audio code can uniquely identify the monitoring object, that is, the audio code and the monitoring object are in a one-to-one correspondence relationship, when the audio code of the second audio is determined, the monitoring object corresponding to the audio code can be determined according to the correspondence relationship between the audio code and the monitoring object.
Under the condition that the second behavior monitoring result represents that the abnormal behavior feature exists, the corresponding monitoring object is determined based on the audio coding of the second audio corresponding to the monitoring object corresponding to the second behavior feature, and the monitoring object with the abnormal behavior feature can be accurately determined based on the audio coding of the second audio, so that the monitoring object with the abnormal behavior feature can be accurately positioned, and the efficiency of finding out the monitoring object with the abnormal behavior feature is improved.
In an embodiment, after extracting at least one second audio from the first audio, the method further comprises:
determining a monitoring object corresponding to a second audio based on the audio coding of the second audio;
and acquiring a second video corresponding to the monitored object.
Here, after at least one second audio is extracted from the first audio, the behavior monitoring method further includes determining a monitoring object corresponding to the second audio based on an audio code of the second audio, and since the audio code of the second audio can uniquely identify the corresponding monitoring object, the monitoring object corresponding to the second audio can be determined based on the audio code of the second audio. And after the monitoring object is determined, acquiring a second video corresponding to the monitoring object. Wherein, the second video and the second audio have the same acquisition time point.
After the second audio is extracted, the corresponding second video is determined based on the audio coding of the second audio, so that the behavior of the monitored object can be monitored conveniently based on the audio and the video of the same monitored object, and the accuracy of behavior monitoring is improved.
In an embodiment, before said extracting at least one second audio from the first audio, the method further comprises:
respectively inputting the sound emitted by each monitored object into a set voice coder to obtain the audio coding of the sound emitted by each monitored object;
and storing the corresponding relation between each monitoring object and the audio coding of the emitted sound.
Here, in order to improve the accuracy of training the speech separation model, the sound emitted by each monitored object is collected separately; and in order to better locate a specific monitored object based on its sound, the sound emitted by each monitored object is input into a set speech coder to obtain the audio code of that sound. In practice, the set speech coder may be a three-layer LSTM model.
And after the audio coding of the sound emitted by each monitoring object is obtained, storing the corresponding relation between each monitoring object and the audio coding of the emitted sound. In practical application, each monitoring object is numbered, so that the corresponding relation between the number corresponding to the monitoring object and the audio coding of the emitted sound can be stored.
By obtaining the audio codes of the sounds emitted by each monitoring object and storing the corresponding relations between the monitoring objects and the audio codes, the corresponding monitoring objects can be conveniently and accurately determined based on the audio codes, and the efficiency and the accuracy of determining the monitoring objects are improved.
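A minimal sketch of storing and querying this correspondence, assuming the audio code is the pig-vector produced by the encoder sketch and that nearest-neighbour matching by cosine similarity is an acceptable rule (the patent does not specify the matching rule).

```python
# A minimal, assumed registry mapping pig numbers to their audio codes.
import numpy as np

registry = {}                                   # pig number -> stored audio code

def enroll(pig_id, audio_code):
    registry[pig_id] = audio_code / np.linalg.norm(audio_code)

def identify(audio_code):
    """Return the pig number whose stored code is most similar (cosine)."""
    query = audio_code / np.linalg.norm(audio_code)
    return max(registry, key=lambda pid: float(registry[pid] @ query))
```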
Fig. 5 is a schematic flow chart of an implementation of the behavior monitoring method provided in the application embodiment of the present application, as shown in fig. 5:
The audio vector (pig-vector) corresponding to a target pig and the mixed audio composed of the sounds of multiple pigs are input into the speech separation model, which extracts the audio of the target pig from the mixed audio. The audio of the target pig and the video of the target pig are input into the first set model to obtain the output result of the audio part and the output result of the video part; the two output results are concatenated and passed through an FC layer and a softmax layer to obtain a judgment of whether the target pig exhibits cough behavior. If the judgment indicates that the target pig exhibits cough behavior, the cough time interval and the single cough duration are extracted from the audio of the target pig as the input of the audio-part model, and the behavior video of the target pig is used as the input of the video-part model; the output results of the two models are concatenated and passed through an FC layer and a softmax layer to obtain the final judgment. If the final judgment indicates that the pig exhibits diseased behavior, the number of the pig with the diseased behavior is accurately located based on the audio of the target pig.
In an embodiment of the application, at least one second audio is extracted from a first audio, where the first audio represents sound emitted by at least two monitored objects and each second audio represents the sound emitted by one of the at least two monitored objects; each second audio and its corresponding second video are input into the first set model to obtain a first behavior feature corresponding to each of the at least two monitored objects; the first behavior feature corresponding to each of the at least two monitored objects is matched with a first set behavior feature to obtain a first behavior monitoring result; and the second video represents a video in which the corresponding monitored object is captured. In this way, the individual audio of each monitored object can be extracted from the mixed audio formed by the sounds of multiple monitored objects, and behavior monitoring is performed in combination with the video corresponding to each monitored object. The multi-modal approach improves the accuracy of behavior monitoring, and because the audio of a single monitored object is extracted from the mixed audio for behavior monitoring, a monitored object exhibiting abnormal behavior can be located accurately and quickly, improving the efficiency of locating the monitored object.
The embodiment of the present application further provides a model training method, and fig. 6 is a schematic diagram illustrating an implementation flow of the model training method provided in the embodiment of the present application. As shown in fig. 6, the method includes:
Step 601: acquiring an audio sample and a video sample of a monitored object; the audio sample characterizes sound emitted by the monitored object; the video sample represents a video, acquired simultaneously with the audio sample, in which the monitored object is captured.
Here, an audio sample and a video sample of the monitoring object are obtained, wherein the audio sample represents a sound emitted by the monitoring object, the video sample represents a video in which the monitoring object is shot, and a collection time point is the same as a collection time point of the audio sample.
The model training method is explained only by taking the monitoring object as the pig as an example.
Illustratively, the monitored subjects may be five-and-a-half-month-old pigs with an average body weight of 60 kg, including pigs exhibiting diseased behavior. 30 pigs were used as monitored subjects, each with a unique number.
The audio samples and video samples can be collected during seasonal transitions such as late winter and early summer, because pigs tend to exhibit diseased behaviors during seasonal changes.
The pigsty housing the pigs is 27.5 m long, 13.7 m wide and 3.2 m high. It contains 30 pens with an average of 1 pig per pen, for a total of 30 pigs. Each pen is enclosed by a 1.1 m tall iron fence.
The recording equipment is a microphone with a frequency range of 100 Hz-16 kHz, arranged in different pens of the pigsty and connected to the sound card of a notebook computer; recording is performed with recording software on the notebook computer. The microphone is fixed 1.4 m above the ground and approximately 0.8 m from the backs of the pigs. The sampling rate of the notebook's sound card is 44.1 kHz and the resolution is 16 bits.
The video shooting equipment is a camera. Mixed audio data and video data are collected for the pigs in multiple pens; for example, with 5 pigs per pen, the 30 pigs are divided into 6 pens, and the mixed audio data and video data of the pigs in the 6 pens are collected. The acquisition time may be 3 days.
To train the first set model, the audio data of each pig needs to be collected separately; specifically, the 30 pigs may each be placed alone in a pigsty, and the acquisition time for the audio of a single pig may also be 3 days. After the audio data of each pig is collected, a data set S1 is generated from the audio data of each pig; the audio data of the different pigs are denoised to obtain clean audio data, and a data set S11 is generated from the clean audio data. The denoising method is spectral subtraction. Noise generally comes from two sources: ambient noise and noise generated by the recording device itself. Spectral subtraction treats the speech signal in the pigsty environment as the superposition of a clean sound signal and a noise signal, so the average noise energy can be estimated from a silent part of the whole signal, and the stationary noise component is then removed from the sound signal to obtain the clean sound signal.
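A minimal sketch of spectral subtraction as described above, assuming librosa/NumPy; estimating the noise spectrum from an assumed silent leading segment and flooring the result at zero are common practical choices, not details taken from the patent.

```python
# A minimal, assumed spectral-subtraction denoiser.
import librosa
import numpy as np

def spectral_subtraction(y, sr, noise_seconds=0.5, n_fft=512, hop=128):
    spec = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(spec), np.angle(spec)
    # Estimate average noise magnitude from an (assumed) silent leading segment.
    noise_frames = int(noise_seconds * sr / hop)
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - noise_mag, 0.0)        # subtract, floor at zero
    return librosa.istft(clean_mag * np.exp(1j * phase), hop_length=hop)
```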
To better train the first set model, the audio data of the 5 pigs in the same pen are selected from data set S1 and fused. Since 6 pens are recorded for 3 days, 6 segments of 72 h audio data are obtained; these are split into 15 s segments, segments containing sound are selected, and a mixed audio data set S2 is generated from them. Each piece of data in S2 has a corresponding piece of data in S11 that serves as its label for training. The data in S11 are manually separated into cough parts and non-cough parts, which are likewise split into 15 s segments and stored as a cough data set S12 and a non-cough data set S13; S12 and S13 also contain other abnormal sounds, and only cough sounds are distinguished here. All data sets are associated with the pig numbers and acquisition times. The video of the corresponding duration is extracted as the video sample according to the audio acquisition time.
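A minimal sketch of cutting a long recording into 15 s segments and keeping only those containing sound, assuming a simple RMS-energy test as the "with sound" criterion (the patent does not state how such segments are selected).

```python
# A minimal, assumed 15-second segmentation with an RMS-energy voicing test.
import numpy as np

def voiced_segments(y, sr, seg_seconds=15, rms_threshold=0.01):
    seg_len = seg_seconds * sr
    segments = []
    for start in range(0, len(y) - seg_len + 1, seg_len):
        seg = y[start:start + seg_len]
        if np.sqrt(np.mean(seg ** 2)) > rms_threshold:  # keep segments with sound
            segments.append(seg)
    return segments
```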
Fig. 7 is a schematic diagram of audio data processing provided in an embodiment of the present application, as shown in fig. 7:
the audio of a target pig is input into a set speech coder to generate a corresponding pig-vector; the audio of the target pig is denoised to obtain the clean audio of the target pig, which is used as the training label of the speech separation model; the clean audio of the target pig is mixed with the noisy audio of other pigs to form a mixed audio data set, which is input into the speech separation model for training.
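The pipeline of Fig. 7 could be prototyped roughly as below; the encoder architecture, feature type and embedding size are assumptions, and the speech separation model itself is omitted. This is a sketch, not the implementation used in this application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PigVectorEncoder(nn.Module):
    """Hypothetical stand-in for the set speech coder: maps a reference recording
    of one pig (as a mel-feature sequence) to a fixed-length pig-vector."""
    def __init__(self, n_mels=40, emb_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, 256, num_layers=2, batch_first=True)
        self.proj = nn.Linear(256, emb_dim)

    def forward(self, mel):                        # mel: (batch, frames, n_mels)
        out, _ = self.lstm(mel)
        return F.normalize(self.proj(out[:, -1]), dim=-1)

def make_separation_pair(clean_target, other_audio):
    """Mix the denoised target-pig audio with other pigs' audio; the clean target
    stays as the training label for the speech separation model."""
    mixture = clean_target + sum(other_audio)
    return mixture, clean_target
```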
Step 602, inputting the audio features corresponding to the audio samples and the video samples into a first set model to obtain a first output result; the first output result represents a first behavior characteristic corresponding to the monitored object. The audio features corresponding to the audio samples comprise cross-correlation coefficient matrix features, and the cross-correlation coefficient matrix features represent correlation coefficients between two adjacent frames in a spectrogram corresponding to the audio samples.
Here, the audio characteristics corresponding to the audio sample and the video sample are input into the first setting model, so as to obtain a first output result, and the first output result represents the first behavior characteristics corresponding to the monitoring object.
Specifically, the speech features of the cough data set S12 and the non-cough data set S13 are extracted separately. The speech features corresponding to S12 and the video corresponding to the cough data set are input into the first set model, and training is performed with the corresponding cough-time speech features as labels to obtain a first output result; the speech features corresponding to S13 and the video corresponding to the non-cough data set are input into the first set model, and training is performed with the corresponding non-cough-time speech features as labels to obtain a first output result. The audio features further comprise cross-correlation coefficient matrix features, which characterize the correlation coefficients between two adjacent frames in the spectrogram corresponding to the audio sample. Because the amplitude variation of the sound signal in the spectrogram of a coughing pig differs greatly from that of a normally vocalizing pig, similarity features between adjacent frames of the spectrogram corresponding to the audio sample are extracted for analysis.
After the spectral energy corresponding to the audio sample is obtained, the entire frequency domain range is equally divided into M frequency bands on the mel scale, with overlap between adjacent bands: the center frequency of the previous band is the start frequency of the next band. The cross-correlation coefficient of the corresponding band in two adjacent frames of the spectrogram is then computed, and the M cross-correlation coefficients obtained are taken as the dynamic features of one frame of the input signal.
Assuming that s (n, k) represents the spectral energy corresponding to the kth point after the nth frame spectrogram is subjected to Fast Fourier Transform (FFT), the calculation formula of the cross-correlation coefficient cc (n, m) of the mth frequency band of the nth frame is:
$$cc(n,m)=\frac{\sum_{k=k_{mi}}^{k_{mh}} s(n,k)\,s(n+1,k)}{\sqrt{\sum_{k=k_{mi}}^{k_{mh}} s^{2}(n,k)\,\sum_{k=k_{mi}}^{k_{mh}} s^{2}(n+1,k)}}$$

where $k_{mi}$ and $k_{mh}$ are respectively the starting and ending frequency points of the m-th frequency band of the n-th frame of the spectrogram after FFT, N is the total number of frames, and M is the total number of frequency bands. An N × M cross-correlation coefficient matrix is obtained by this calculation.
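A sketch of this computation in Python, assuming librosa for the spectral energy and mel-spaced band edges; the FFT size, hop length and number of bands are illustrative choices, not values given by this application.

```python
import numpy as np
import librosa

def cross_corr_matrix(y, sr=44100, n_fft=1024, hop=512, n_bands=21):
    """Cross-correlation coefficient of each mel-scale band between adjacent
    spectrogram frames; bands overlap so each band starts at the centre of the
    previous one."""
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2   # spectral energy
    edges = librosa.mel_frequencies(n_mels=n_bands + 2, fmin=0, fmax=sr / 2)
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    n_frames = S.shape[1]
    cc = np.zeros((n_frames - 1, n_bands))          # one row per pair of adjacent frames
    for m in range(n_bands):
        lo, hi = edges[m], edges[m + 2]             # previous centre = next start
        band = (freqs >= lo) & (freqs <= hi)
        a, b = S[band, :-1], S[band, 1:]
        num = (a * b).sum(axis=0)
        den = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0)) + 1e-10
        cc[:, m] = num / den
    return cc
```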
For the video sample part, the video sample is framed using OpenCV, effective pictures are extracted, a Convolutional Neural Network (CNN) is used to extract the apparent features of each picture, and a Long Short-Term Memory (LSTM) network is then used to learn the temporal features.
Fig. 8 is a schematic diagram of extracting features of a video sample for training according to an embodiment of the present application, as shown in fig. 8:
a video sample of a pig is input into the first set model; multiple frames of images of the pig are obtained from the video sample; apparent features are extracted from each frame image using the CNN; the CNN output is input into the LSTM to learn temporal features; and the result is output as an image vector.
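A rough sketch of this video branch follows, assuming a ResNet-18 backbone as the CNN and every 10th frame as an "effective picture"; both choices are assumptions rather than details from this application.

```python
import cv2
import torch
import torch.nn as nn
from torchvision import models

class VideoBranch(nn.Module):
    """Per-frame CNN appearance features fed to an LSTM that learns temporal features."""
    def __init__(self, hidden=256, out_dim=128):
        super().__init__()
        backbone = models.resnet18(weights=None)   # backbone choice is an assumption
        backbone.fc = nn.Identity()                # 512-d appearance feature per frame
        self.cnn, self.lstm = backbone, nn.LSTM(512, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, out_dim)

    def forward(self, frames):                     # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.fc(out[:, -1])                 # image vector

def sample_frames(video_path, every_n=10, size=224):
    """Frame the video with OpenCV and keep every n-th frame as an effective picture."""
    cap, frames, i = cv2.VideoCapture(video_path), [], 0
    while True:
        ok, img = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            img = cv2.resize(cv2.cvtColor(img, cv2.COLOR_BGR2RGB), (size, size))
            frames.append(torch.from_numpy(img).permute(2, 0, 1).float() / 255.0)
        i += 1
    cap.release()
    return torch.stack(frames)                     # (time, 3, size, size)
```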
Step 603, calculating a loss value based on the first output result, and updating the weight parameter of the first setting model based on the loss value.
Here, the loss value is calculated based on the first output result, and the weight parameter of the first set model is updated based on the loss value. The loss value between the first output result and the corresponding label is calculated; if the loss value is too large, the first set model fits poorly and the result it outputs has a large error.
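A generic training step matching steps 602-603 could look like the following; the loss function and optimizer are assumptions, not specified by this application.

```python
import torch

def train_step(model, optimizer, loss_fn, audio_feats, video_frames, labels):
    """Forward pass, loss against the labels, backpropagation, weight update."""
    optimizer.zero_grad()
    first_output = model(audio_feats, video_frames)   # first output result
    loss = loss_fn(first_output, labels)              # loss value vs. corresponding label
    loss.backward()
    optimizer.step()                                  # update the weight parameters
    return loss.item()
```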
In one embodiment, the audio characteristics corresponding to the audio samples further include at least one of:
a spectrogram corresponding to the audio sample;
a mel frequency cepstrum characteristic corresponding to the audio sample;
a first order difference feature corresponding to the audio sample;
and the second-order difference characteristic corresponding to the audio sample.
Here, in addition to the cross-correlation coefficient matrix feature, the audio features of the audio sample include at least one of the following: the spectrogram corresponding to the audio sample, the mel frequency cepstrum feature corresponding to the audio sample, the first-order difference feature corresponding to the audio sample, and the second-order difference feature corresponding to the audio sample.
The spectrogram is obtained by framing and windowing the original audio signal, applying a fast Fourier transform to each frame to convert the time-domain signal into a frequency-domain signal, and stacking the per-frame frequency-domain signals in time. The spectrogram fully captures the time-domain and frequency-domain characteristics of the audio sample and presents them in image form. After the two-dimensional spectrogram of the audio sample is obtained, it is saved as a 227 × 227 × 3 RGB color picture.
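A sketch of producing and saving such a spectrogram image is shown below; the FFT size, hop length and the figure size that yields roughly a 227-pixel square are assumptions.

```python
import numpy as np
import librosa
import matplotlib.pyplot as plt

def save_spectrogram_png(wav_path, out_png, n_fft=1024, hop=256):
    """Frame + window + FFT each frame, stack the frames in time, save as an RGB image."""
    y, sr = librosa.load(wav_path, sr=44100)
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop, window="hann"))
    S_db = librosa.amplitude_to_db(S, ref=np.max)
    fig = plt.figure(figsize=(2.27, 2.27), dpi=100)   # about 227 x 227 pixels
    plt.axis("off")
    plt.pcolormesh(S_db)
    plt.savefig(out_png, bbox_inches="tight", pad_inches=0)
    plt.close(fig)
```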
The mel frequency cepstrum feature represents the short-time power spectrum of a sound signal. It is obtained by a linear cosine transform of the log power spectrum on a nonlinear mel scale of frequency, and is mainly used to extract the static features of the audio sample and to reduce the computational dimensionality. MFCCs are generally obtained through pre-emphasis, framing, windowing, FFT, a mel filter bank and a Discrete Cosine Transform (DCT); the 2nd to 13th coefficients of the result are retained, and these 12 coefficients are the MFCCs.
The first order difference (Deltas) feature, also known as the differential coefficient, is used to describe the dynamics of the audio sample.
Second order difference (Deltas-Deltas) features, also known as acceleration coefficients, are used to characterize the dynamics of the audio sample.
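These static and dynamic features can, for example, be computed with librosa as a stand-in for the pipeline described above (keeping coefficients 2-13); this is a sketch, not the implementation of this application.

```python
import librosa

def mfcc_with_dynamics(wav_path):
    """MFCCs (12 retained coefficients), first-order deltas and second-order deltas."""
    y, sr = librosa.load(wav_path, sr=44100)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)[1:]   # drop the 1st, keep the 2nd-13th
    delta = librosa.feature.delta(mfcc)                      # differential coefficients
    delta2 = librosa.feature.delta(mfcc, order=2)            # acceleration coefficients
    return mfcc, delta, delta2
```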
Fig. 9 is a schematic diagram of extracting audio features of an audio sample for training according to an embodiment of the present application, as shown in fig. 9:
and extracting voice features of the audio sample, such as a spectrogram, MFCC, first-order difference, second-order difference and cross-correlation coefficient matrix, and inputting the voice features into the first set model.
For the spectrogram, a Convolutional Recurrent Neural Network (CRNN) is trained, and a speech vector is obtained through a fully connected (FC) layer. The cross-correlation coefficient matrix is combined with the MFCC, first-order difference and second-order difference features to obtain the HFS features, which are input into three FC layers for training to generate a static-and-dynamic feature vector. Finally, the speech vector and the static-and-dynamic feature vector are concatenated to obtain the output of the speech part.
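A rough PyTorch sketch of this speech branch follows; the layer sizes and the exact CRNN structure are assumptions, and only the feature-fusion idea is illustrated.

```python
import torch
import torch.nn as nn

class SpeechBranch(nn.Module):
    """Small CRNN over the spectrogram image yields a speech vector; the HFS
    features (MFCC + deltas + cross-correlation matrix) pass through three FC
    layers; the two vectors are concatenated as the speech-part output."""
    def __init__(self, hfs_dim, out_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d((1, None)),
        )
        self.rnn = nn.GRU(32, 64, batch_first=True)
        self.fc_speech = nn.Linear(64, out_dim)
        self.fc_hfs = nn.Sequential(
            nn.Linear(hfs_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, spec_img, hfs):            # spec_img: (B, 3, H, W); hfs: (B, hfs_dim)
        c = self.conv(spec_img).squeeze(2).transpose(1, 2)   # (B, W', 32)
        _, h = self.rnn(c)
        speech_vec = self.fc_speech(h[-1])
        hfs_vec = self.fc_hfs(hfs)
        return torch.cat([speech_vec, hfs_vec], dim=-1)      # output of the speech part
```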
In order to implement the method according to the embodiment of the present application, an embodiment of the present application further provides a behavior monitoring device, fig. 10 is a schematic diagram of the behavior monitoring device according to the embodiment of the present application, please refer to fig. 10, where the behavior monitoring device includes:
an extracting unit 1001, configured to extract at least one second audio from the first audio; the first audio characterizing sound emitted by at least two monitored objects; each of the at least one second audio characterizing a sound emitted by one of the at least two monitored objects;
an input unit 1002, configured to input each second audio in the at least one second audio and the corresponding second video into a first set model, so as to obtain a first behavior feature corresponding to each monitored object in the at least two monitored objects;
a matching unit 1003, configured to match a first behavior feature corresponding to each of the at least two monitoring objects with a first set behavior feature, so as to obtain a first behavior monitoring result; wherein,
the second video characterizes a video captured of the corresponding monitored object.
In an embodiment, the matching unit 1003 is further configured to determine that the first behavior feature matches the first set behavior feature if the first behavior feature satisfies at least one of the following conditions:
voice signals with amplitude values larger than a set threshold value exist in the first spectrogram; the first spectrogram is a spectrogram of a second audio corresponding to the monitoring object corresponding to the first behavior feature;
and judging that the corresponding monitored object generates a set behavior based on the second video corresponding to the monitored object corresponding to the first behavior characteristic.
In one embodiment, the apparatus further comprises: the second matching unit is used for inputting a corresponding second audio and a corresponding second video into a second set model under the condition that the first behavior monitoring result represents that the first behavior characteristics corresponding to the monitored object are matched with the first set behavior characteristics, so as to obtain second behavior characteristics corresponding to the monitored object;
matching the obtained second behavior characteristic with a second set behavior characteristic to obtain a second behavior monitoring result of the corresponding monitoring object; wherein,
and the second set behavior characteristics represent the behavior abnormity of the corresponding monitored object.
In an embodiment, the second matching unit is further configured to determine that the second behavior feature matches the second set behavior feature when the obtained second behavior feature satisfies at least one of the following conditions:
the time interval of the occurrence of the voice signals with the amplitude larger than the set threshold value in the second spectrogram is smaller than the set time interval; the second spectrogram is a spectrogram of a second audio corresponding to the monitoring object corresponding to the second behavior characteristic;
the duration of the voice signals with the amplitude larger than the set threshold value in the second spectrogram is larger than the set duration.
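For illustration, these two conditions could be checked on a magnitude spectrogram as follows; the amplitude threshold, the set time interval and the set duration are values to be chosen, not given by this application.

```python
import numpy as np

def abnormal_by_spectrogram(mag, sr, hop, amp_thresh, max_gap_s, min_dur_s):
    """Flag frames whose peak amplitude exceeds the threshold, then test whether
    (a) successive loud events occur closer together than the set interval, or
    (b) any loud event lasts longer than the set duration."""
    frame_dt = hop / sr
    loud = np.max(mag, axis=0) > amp_thresh          # per-frame "loud" flag
    idx = np.flatnonzero(loud)
    if idx.size == 0:
        return False
    gaps = np.diff(idx) * frame_dt                   # (a) interval between loud frames
    close_events = bool(gaps.size) and np.any(gaps < max_gap_s)
    runs = np.split(idx, np.where(np.diff(idx) > 1)[0] + 1)
    longest = max(len(r) for r in runs) * frame_dt   # (b) longest contiguous loud run
    return close_events or longest > min_dur_s
```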
In one embodiment, the apparatus further comprises: and the second determining unit is used for determining the monitoring object corresponding to the second behavior characteristic based on the audio coding of the second audio of the monitoring object corresponding to the second behavior characteristic under the condition that the second behavior monitoring result represents that the second behavior characteristic is matched with the second set behavior characteristic.
In one embodiment, the apparatus further comprises: the acquisition unit is used for determining a monitoring object corresponding to a second audio based on the audio coding of the second audio;
and acquiring a second video corresponding to the monitored object.
In one embodiment, the apparatus further comprises: the storage unit is used for respectively inputting the sound emitted by each monitored object into the set voice coder to obtain the audio coding of the sound emitted by each monitored object;
and storing the corresponding relation between each monitoring object and the audio coding of the emitted sound.
In practical applications, the extracting unit 1001, the input unit 1002, the matching unit 1003, the second matching unit, the second determining unit, the obtaining unit, and the storing unit may be implemented by a processor in a terminal, such as a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Micro Control Unit (MCU), or a Field Programmable Gate Array (FPGA).
It should be noted that: in the behavior monitoring device provided in the above embodiment, when displaying information, only the division of the program modules is exemplified, and in practical applications, the processing distribution may be completed by different program modules according to needs, that is, the internal structure of the device may be divided into different program modules to complete all or part of the processing described above. In addition, the behavior monitoring device and the behavior monitoring method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments in detail and are not described herein again.
To implement the method according to the embodiment of the present application, a model training apparatus is further provided according to the embodiment of the present application, fig. 11 is a schematic diagram of the model training apparatus provided according to the embodiment of the present application, please refer to fig. 11, and the apparatus includes:
an acquisition unit 1101 configured to acquire an audio sample and a video sample of a monitoring target; the audio sample characterizes sound emitted by the monitored object; the video sample represents a video which is acquired simultaneously with the audio sample and shoots the monitored object;
an input unit 1102, configured to input the audio features corresponding to the audio samples and the video samples into a first setting model, so as to obtain a first output result; the first output result represents a first behavior characteristic corresponding to the monitored object;
a calculating unit 1103, configured to calculate a loss value based on the first output result, and update a weight parameter of the first set model based on the loss value; wherein,
the audio features corresponding to the audio samples comprise cross-correlation coefficient matrix features, and the cross-correlation coefficient matrix features represent correlation coefficients between two adjacent frames in a spectrogram corresponding to the audio samples.
In an embodiment, the audio characteristics corresponding to the audio samples further include at least one of:
a spectrogram corresponding to the audio sample;
a mel frequency cepstrum characteristic corresponding to the audio sample;
a first order difference feature corresponding to the audio sample;
and the second-order difference characteristic corresponding to the audio sample.
Based on the hardware implementation of the program module, in order to implement the method of the embodiment of the present application, an embodiment of the present application further provides an electronic device. Fig. 12 is a schematic diagram of a hardware component structure of an electronic device according to an embodiment of the present application, where as shown in fig. 12, the electronic device includes:
a communication interface 1201 capable of performing information interaction with other devices such as a network device and the like;
the processor 1202 is connected to the communication interface 1201 to implement information interaction with other devices, and is configured to execute a method provided by one or more technical solutions of the terminal side when running a computer program. And the computer program is stored on the memory 1203.
Specifically, the processor 1202 is configured to extract at least one second audio from a first audio; the first audio characterizing sound emitted by at least two monitored objects; each of the at least one second audio characterizing a sound emitted by one of the at least two monitored objects; input each second audio of the at least one second audio and the corresponding second video into a first set model to obtain a first behavior characteristic corresponding to each of the at least two monitored objects; and match the first behavior characteristic corresponding to each of the at least two monitored objects with a first set behavior characteristic to obtain a first behavior monitoring result; the second video characterizing a video captured of the corresponding monitored object.
In an embodiment, the processor 1202 is further configured to determine that the first behavior feature matches the first set behavior feature if the first behavior feature satisfies at least one of the following conditions:
voice signals with amplitude values larger than a set threshold value exist in the first spectrogram; the first spectrogram is a spectrogram of a second audio corresponding to the monitoring object corresponding to the first behavior feature;
and judging that the corresponding monitored object generates a set behavior based on the second video corresponding to the monitored object corresponding to the first behavior characteristic.
In an embodiment, the processor 1202 is further configured to, when the first behavior monitoring result indicates that there is a match between the first behavior feature corresponding to the monitoring object and the first set behavior feature, input a corresponding second audio and a corresponding second video into the second set model to obtain a second behavior feature corresponding to the corresponding monitoring object;
matching the obtained second behavior characteristic with a second set behavior characteristic to obtain a second behavior monitoring result of the corresponding monitoring object; wherein,
and the second set behavior characteristics represent abnormal behaviors of the corresponding monitoring object.
In an embodiment, the processor 1202 is further configured to determine that the second behavior feature matches the second set behavior feature if the obtained second behavior feature satisfies at least one of the following conditions:
the time interval of the occurrence of the voice signals with the amplitude larger than the set threshold value in the second spectrogram is smaller than the set time interval; the second spectrogram is a spectrogram of a second audio corresponding to the monitoring object corresponding to the second behavior characteristic;
the duration of the voice signals with the amplitude larger than the set threshold value in the second spectrogram is larger than the set duration.
In an embodiment, the processor 1202 is further configured to determine, if the second behavior monitoring result indicates that the second behavior feature matches the second set behavior feature, the monitored object corresponding to the second behavior feature based on audio coding of a second audio of the monitored object corresponding to the second behavior feature.
In an embodiment, after at least one second audio is extracted from the first audio, the processor 1202 is further configured to determine a monitoring object corresponding to the second audio based on audio coding of the second audio;
and acquiring a second video corresponding to the monitored object.
In an embodiment, before the at least one second audio is extracted from the first audio, the processor 1202 is further configured to input the sound emitted by each monitored object into a set speech coder, respectively, to obtain an audio code of the sound emitted by each monitored object;
and storing the corresponding relation between each monitoring object and the audio coding of the emitted sound.
In an embodiment, the processor 1202 is further configured to obtain an audio sample and a video sample of the monitored object; the audio sample characterizes sound emitted by the monitored object; the video sample represents a video which is acquired simultaneously with the audio sample and shoots the monitored object;
inputting the audio characteristics corresponding to the audio sample and the video sample into a first set model to obtain a first output result; the first output result represents a first behavior characteristic corresponding to the monitored object;
calculating a loss value based on the first output result, and updating a weight parameter of a first set model based on the loss value; wherein,
the audio features corresponding to the audio samples comprise cross-correlation coefficient matrix features, and the cross-correlation coefficient matrix features represent correlation coefficients between two adjacent frames in a spectrogram corresponding to the audio samples.
In one embodiment, the audio characteristics corresponding to the audio samples further include at least one of:
a spectrogram corresponding to the audio sample;
a mel frequency cepstrum characteristic corresponding to the audio sample;
first order difference features corresponding to the audio samples;
and the second-order difference characteristic corresponding to the audio sample.
Of course, in practice, the various components in the electronic device are coupled together by a bus system 1204. It is understood that the bus system 1204 is used to enable connective communication between these components. The bus system 1204 includes a power bus, a control bus, and a status signal bus, in addition to a data bus. For clarity of illustration, however, the various buses are designated as bus system 1204 in figure 12.
The memory 1203 in the embodiment of the present application is used for storing various types of data to support the operation of the electronic device. Examples of such data include: any computer program for operating on an electronic device.
It will be appreciated that the memory 1203 may be volatile memory, non-volatile memory, or both. The non-volatile memory may be a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 1203 described in the embodiments herein is intended to comprise, without being limited to, these and any other suitable types of memory.
The methods disclosed in the embodiments of the present application may be applied to the processor 1202 or implemented by the processor 1202. The processor 1202 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by instructions in the form of hardware integrated logic circuits or software in the processor 1202. The processor 1202 described above may be a general purpose processor, a DSP, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. Processor 1202 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium located in the memory 1203, and the processor 1202 reads the program in the memory 1203 to implement the steps of the foregoing method in conjunction with its hardware.
When the processor 1202 executes the program, the corresponding flow in each method of the embodiment of the present application is implemented.
In an exemplary embodiment, the present application further provides a storage medium, i.e., a computer storage medium, specifically a computer readable storage medium, for example, including a memory 1203 storing a computer program, which can be executed by a processor 1202 to implement the steps of the foregoing method. The computer readable storage medium may be Memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash Memory, magnetic surface Memory, optical disk, or CD-ROM.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, terminal and method may be implemented in other manners. The above-described device embodiments are only illustrative, for example, the division of the unit is only one logical function division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media capable of storing program code.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application or portions thereof that contribute to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling an electronic device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (13)

1. A method of behavioral monitoring, the method comprising:
extracting at least one second audio from the first audio; the first audio characterizing sound emitted by at least two monitored objects; each of the at least one second audio corresponds to a sound emitted by one of the at least two monitored objects;
inputting each second audio in the at least one second audio and the corresponding second video into a first set model to obtain a first behavior characteristic corresponding to each monitored object in the at least two monitored objects;
matching the first behavior characteristic corresponding to each of the at least two monitored objects with a first set behavior characteristic to obtain a first behavior monitoring result; wherein,
the second video characterizes a video captured of the corresponding monitored object.
2. The behavior monitoring method according to claim 1, wherein the matching the first behavior feature corresponding to each of the at least two monitoring objects with a first set behavior feature comprises:
determining that the first behavior feature matches the first set behavior feature if the first behavior feature satisfies at least one of the following conditions:
a voice signal with the amplitude value larger than a set threshold value exists in the first spectrogram; the first spectrogram is a spectrogram of a second audio corresponding to the monitoring object corresponding to the first behavior feature;
and judging that the corresponding monitored object generates a set behavior based on the second video corresponding to the monitored object corresponding to the first behavior characteristic.
3. The behavior monitoring method according to claim 1, further comprising:
under the condition that the first behavior monitoring result represents that the first behavior characteristics corresponding to the monitored object are matched with the first set behavior characteristics, inputting corresponding second audio and corresponding second video into a second set model to obtain second behavior characteristics corresponding to the monitored object;
matching the obtained second behavior characteristic with a second set behavior characteristic to obtain a second behavior monitoring result of the corresponding monitoring object; wherein,
and the second set behavior characteristics represent abnormal behaviors of the corresponding monitoring object.
4. A method as claimed in claim 3, wherein said matching the obtained second behaviour characteristic with a second set behaviour characteristic comprises:
determining that the second behavior feature matches a second set behavior feature if the obtained second behavior feature satisfies at least one of the following conditions:
the time interval of the occurrence of the voice signals with the amplitude larger than the set threshold value in the second spectrogram is smaller than the set time interval; the second spectrogram is a spectrogram of a second audio corresponding to the monitoring object corresponding to the second behavior characteristic;
the duration of the voice signals with the amplitude larger than the set threshold value in the second spectrogram is larger than the set duration.
5. The behavior monitoring method according to claim 3, further comprising:
and under the condition that the second behavior monitoring result represents that the second behavior characteristic is matched with the second set behavior characteristic, determining the monitoring object corresponding to the second behavior characteristic based on the audio coding of the second audio of the monitoring object corresponding to the second behavior characteristic.
6. The behavior monitoring method according to claim 1, wherein after extracting at least one second audio from the first audio, the method further comprises:
determining a monitoring object corresponding to a second audio based on the audio coding of the second audio;
and acquiring a second video corresponding to the monitored object.
7. A method as claimed in claim 5 or 6, wherein prior to said extracting at least one second audio from the first audio, the method further comprises:
respectively inputting the sound emitted by each monitored object into a set voice coder to obtain the audio coding of the sound emitted by each monitored object;
and storing the corresponding relation between each monitoring object and the audio coding of the emitted sound.
8. A method of model training for training a first set model in a method of behavioral monitoring according to any one of claims 1 to 7, the method comprising:
acquiring an audio sample and a video sample of a monitored object; the audio sample characterizes sound emitted by the monitored object; the video sample represents a video which is acquired simultaneously with the audio sample and shoots the monitoring object;
inputting the audio characteristics corresponding to the audio sample and the video sample into a first set model to obtain a first output result; the first output result represents a first behavior characteristic corresponding to the monitored object;
calculating a loss value based on the first output result, and updating a weight parameter of a first set model based on the loss value; wherein,
the audio features corresponding to the audio samples comprise cross-correlation coefficient matrix features, and the cross-correlation coefficient matrix features represent correlation coefficients between two adjacent frames in a spectrogram corresponding to the audio samples.
9. The model training method of claim 8, wherein the audio features corresponding to the audio samples further comprise at least one of:
a spectrogram corresponding to the audio sample;
a mel frequency cepstrum characteristic corresponding to the audio sample;
a first order difference feature corresponding to the audio sample;
and the second-order difference characteristic corresponding to the audio sample.
10. A performance monitoring device, the device comprising:
an extracting unit for extracting at least one second audio from the first audio; the first audio characterizing sound emitted by at least two monitored objects; each of the at least one second audio corresponds to a sound emitted by one of the at least two monitored objects;
the input unit is used for inputting each second audio and the corresponding second video in the at least one second audio into a first set model to obtain a first behavior characteristic corresponding to each monitoring object in the at least two monitoring objects;
the matching unit is used for matching the first behavior characteristic corresponding to each of the at least two monitored objects with a first set behavior characteristic to obtain a first behavior monitoring result; wherein,
the second video characterizes a video captured of the corresponding monitored object.
11. A model training apparatus, the apparatus comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring an audio sample and a video sample of a monitored object; the audio sample characterizes sound emitted by the monitored object; the video sample represents a video which is acquired simultaneously with the audio sample and shoots the monitored object;
the input unit is used for inputting the audio characteristics corresponding to the audio samples and the video samples into a first set model to obtain a first output result; the first output result represents a first behavior characteristic corresponding to the monitored object;
a calculation unit configured to calculate a loss value based on the first output result and update a weight parameter of a first set model based on the loss value; wherein,
the audio features corresponding to the audio samples comprise cross-correlation coefficient matrix features, and the cross-correlation coefficient matrix features represent correlation coefficients between two adjacent frames in a spectrogram corresponding to the audio samples.
12. An electronic device, comprising: a processor and a memory for storing a computer program capable of running on the processor, wherein,
the processor, when being configured to execute the computer program, is configured to perform the steps of the method of any one of claims 1 to 7 or 8 to 9.
13. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the method of any one of claims 1-7 or 8-9.
CN202110829669.1A 2021-07-22 2021-07-22 Behavior monitoring method and device, electronic equipment and storage medium Pending CN115700880A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110829669.1A CN115700880A (en) 2021-07-22 2021-07-22 Behavior monitoring method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110829669.1A CN115700880A (en) 2021-07-22 2021-07-22 Behavior monitoring method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115700880A true CN115700880A (en) 2023-02-07

Family

ID=85120842

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110829669.1A Pending CN115700880A (en) 2021-07-22 2021-07-22 Behavior monitoring method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115700880A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117746871A (en) * 2024-02-21 2024-03-22 南方科技大学 Cloud-based bird song detection method and system


Similar Documents

Publication Publication Date Title
Roy et al. Learning words from sights and sounds: A computational model
CN108520753B (en) Voice lie detection method based on convolution bidirectional long-time and short-time memory network
CN110047510A (en) Audio identification methods, device, computer equipment and storage medium
CN108922559A (en) Recording terminal clustering method based on voice time-frequency conversion feature and integral linear programming
CN109817227B (en) Abnormal sound monitoring method and system for farm
CN112669820B (en) Examination cheating recognition method and device based on voice recognition and computer equipment
CN110120223A (en) A kind of method for recognizing sound-groove based on time-delay neural network TDNN
Ting Yuan et al. Frog sound identification system for frog species recognition
CN109192224A (en) A kind of speech evaluating method, device, equipment and readable storage medium storing program for executing
CN115048984A (en) Sow oestrus recognition method based on deep learning
Sahidullah et al. Robust speaker recognition with combined use of acoustic and throat microphone speech
CN115700880A (en) Behavior monitoring method and device, electronic equipment and storage medium
US8145483B2 (en) Speech recognition method for all languages without using samples
Murugaiya et al. Probability enhanced entropy (PEE) novel feature for improved bird sound classification
CN114330454A (en) Live pig cough sound identification method based on DS evidence theory fusion characteristics
CN109686365A (en) A kind of audio recognition method and speech recognition system
Tong Automatic assessment of dysarthric severity level using audio-video cross-modal approach in deep learning
CN107197404B (en) Automatic sound effect adjusting method and device and recording and broadcasting system
KR20170086233A (en) Method for incremental training of acoustic and language model using life speech and image logs
CN114512134A (en) Method and device for voiceprint information extraction, model training and voiceprint recognition
Li et al. Research on environmental sound classification algorithm based on multi-feature fusion
CN115170942B (en) Fish behavior recognition method with multi-stage fusion of sound and vision
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
Wang et al. Identification of oestrus cows based on vocalisation characteristics and machine learning technique using a dual-channel-equipped acoustic tag
CN115985310A (en) Dysarthria voice recognition method based on multi-stage audio-visual fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination