CN116631443B - Infant crying type detection method, device and equipment based on vibration spectrum comparison - Google Patents

Infant crying type detection method, device and equipment based on vibration spectrum comparison

Info

Publication number
CN116631443B
CN116631443B (application CN202310590550.2A)
Authority
CN
China
Prior art keywords
crying
vibration
infant
class
similarity value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310590550.2A
Other languages
Chinese (zh)
Other versions
CN116631443A (en)
Inventor
陈辉
张智
雷奇文
艾伟
胡国湖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Xingxun Intelligent Technology Co ltd
Original Assignee
Wuhan Xingxun Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Xingxun Intelligent Technology Co ltd filed Critical Wuhan Xingxun Intelligent Technology Co ltd
Priority to CN202310590550.2A
Publication of CN116631443A
Application granted
Publication of CN116631443B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention belongs to the technical field of voice recognition and solves the technical problem of low accuracy when infant crying is judged through conventional voice recognition. It provides a method, device and equipment for detecting the type of infant crying based on vibration spectrum comparison. The method obtains a threshold value for each infant crying category; compares the vibration spectrum of the electric signal collected from the infant's vocal cord vibration during crying with the standard vibration spectra in a database to obtain a similarity value between the cry and each crying category; finds the crying category corresponding to the maximum similarity value; compares that maximum similarity value with the threshold value representing that crying category; and outputs the crying category. The invention also comprises a device and equipment for executing the method. The invention can accurately detect sounding differences caused by individual differences among infants, and improves the accuracy of infant crying type detection based on vibration spectrum comparison.

Description

Infant crying type detection method, device and equipment based on vibration spectrum comparison
The present application relates to a method, a device and equipment for identifying the type of infant crying, and is a divisional application of the invention patent application No. 202110218126.6, filed on February 26, 2021.
Technical Field
The invention relates to the technical field of voice recognition, in particular to a method, a device and equipment for detecting baby crying type based on vibration spectrum comparison.
Background
With the development of speech recognition technology, speech recognition is applied in more and more fields, such as recognizing the various cries of an infant to determine the infant's condition. For infant cry recognition, the prior art generally proceeds as follows: the cry is collected with a sound acquisition technique; the collected sound is matched against a preset infant cry to determine whether it is an infant cry; the confirmed infant cry is then matched against preset crying categories, and after a successful match the category of the collected cry is determined, finally yielding the concrete meaning of the cry. However, because of differences between individual infants, the same cry often expresses different needs; in particular, when an infant's sounding is abnormal, for example hoarse, the collected audio information clearly cannot be used to judge the crying category. Therefore, when speech recognition technology alone is used to recognize infant crying, accuracy and precision are low, and the user experience suffers.
Disclosure of Invention
In view of the above, the embodiments of the present invention provide a method, an apparatus, and a device for detecting the infant crying class based on vibration spectrum comparison, which are used to solve the technical problem of low accuracy when infant crying is judged through voice recognition.
The technical scheme adopted by the invention is as follows:
the invention provides a method for detecting the type of baby crying based on vibration spectrum comparison, which is characterized by comprising the following steps:
obtaining a threshold value corresponding to each infant crying type;
comparing the vibration spectrum corresponding to the electric signal collected from the vibration of the infant's vocal cords while crying with the standard vibration spectra in a database to obtain a similarity value between the cry and each crying category, and taking all the similarity values as a similarity value group;
and finding the crying class corresponding to the maximum similarity value in the similarity value group, comparing the maximum similarity value with the threshold value representing that crying class, and outputting the crying class.
Preferably, before comparing the vibration spectrum corresponding to the collected electric signal corresponding to the vocal cord vibration of the infant when the infant cries with the standard vibration spectrum in the database, the method further comprises:
acquiring an electric signal corresponding to the vibration of the vocal cords of the infant when the infant cries;
Segmenting the electric signal according to preset time length to obtain a plurality of continuous electric signal segments;
performing short-time Fourier transform on the plurality of continuous electric signal segments to output the vibration spectrum.
Preferably, before the acquiring of the electric signal corresponding to the infant's vocal cord vibration while crying, the method further comprises:
Acquiring a detected audio signal;
processing the audio signal, extracting MFCC characteristics of the audio signal;
And inputting the MFCC characteristics into a preset infant crying recognition model, and determining whether the infant is crying currently.
Preferably, the processing the audio signal, extracting MFCC characteristics of the audio signal includes:
Performing Fourier transform on each frame of collected audio to obtain the bandwidth of an audio signal, and determining a target bandwidth;
Filtering according to the target bandwidth to obtain a Mel frequency cepstrum coefficient;
And amplifying the Mel-frequency cepstrum coefficients by logarithmic transformation, and extracting discrete values of the Mel-frequency cepstrum coefficients using discrete cosine transform as the Mel-frequency cepstrum coefficient (MFCC) features.
Preferably, the comparing the vibration spectrum corresponding to the collected electric signal corresponding to the vibration of the vocal cords of the infant when the infant cries with the standard vibration spectrum in the database to obtain similarity values of the cry and each cry category, and taking all the similarity values as a similarity value group includes:
comparing the vibration spectrum with each standard vibration spectrum according to the formula

$\rho(X,Y)=\frac{1}{Q}\sum_{i=1}^{Q}\frac{(X_i-\mu_X)(Y_i-\mu_Y)}{\sigma_X\,\sigma_Y}$

and outputting the similarity value group;
wherein X is the vibration spectrum, Y is a standard vibration spectrum in the database, $X_i$ is the value of the i-th signal segment of the vibration spectrum, and $Y_i$ is the value of the i-th signal segment of the standard vibration spectrum; $\mu_X$ and $\mu_Y$ are the mean values of the electric signal segments in X and in Y respectively, $\sigma_X$ and $\sigma_Y$ are the standard deviations of the electric signal segments in X and in Y respectively, and Q is the length of the electric signal collected from the vocal cord vibration.
Preferably, the finding a crying class corresponding to a maximum similarity value from the similarity value group, comparing the maximum similarity value with a threshold value representing the crying class corresponding to the maximum similarity value, and outputting the crying class includes:
comparing the maximum similarity value with a threshold value representing the crying class corresponding to the maximum similarity value;
And counting the number of times the maximum similarity value is greater than the crying class threshold, and outputting the crying class if the maximum similarity value exceeds the threshold k consecutive times.
Preferably, the finding a crying class corresponding to a maximum similarity value from the similarity value group, comparing the maximum similarity value with a threshold value representing the crying class corresponding to the maximum similarity value, and outputting the crying class further includes:
if the crying class corresponding to the maximum similarity value changes, resetting the count;
and if the maximum similarity value falls below the crying class threshold during counting, resetting the count.
The invention also provides a device for detecting the type of the baby crying sound based on vibration spectrum comparison, which is characterized by comprising the following steps:
the crying type threshold value acquisition module is used for acquiring the threshold value corresponding to each baby crying type;
the vibration spectrum comparison and analysis module, used for comparing the vibration spectrum corresponding to the electric signal collected from the infant's vocal cord vibration while crying with the standard vibration spectra in the database to obtain a similarity value between the cry and each crying category, and taking all the similarity values as a similarity value group;
And the crying class output module is used for finding out the crying class corresponding to the maximum similarity value from the similarity value group, comparing the maximum similarity value with a threshold value representing the crying class corresponding to the maximum similarity value, and outputting the crying class.
The invention also provides an electronic device, comprising: at least one processor, at least one memory, and computer program instructions stored in the memory, which when executed by the processor, implement the method of any of the above.
The invention also provides a medium having stored thereon computer program instructions which when executed by a processor implement a method as claimed in any one of the preceding claims.
In summary, the beneficial effects of the invention are as follows:
According to the infant crying type detection method, device and equipment based on vibration spectrum comparison, an electric signal of the infant's vocal cord vibration during crying is obtained, converted into a corresponding vibration spectrum, and compared with the standard vibration spectra of a database to obtain the crying category corresponding to the vibration spectrum. Judging the infant's crying category from the vibration spectrum of the vocal cords makes it possible to accurately detect sounding differences caused by individual differences among infants, as well as abnormal crying caused by conditions such as hoarseness, and improves the accuracy of infant crying type detection based on vibration spectrum comparison.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required by the embodiments are briefly described below; other drawings obtained from these drawings without inventive effort by a person skilled in the art remain within the scope of the present invention.
Fig. 1 is a flow chart of a method for detecting infant crying type based on vibration spectrum comparison in example 1 according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of acquiring a vibration spectrum in example 1 according to a first embodiment of the present invention;
fig. 3 is a flowchart illustrating a process of acquiring a vibration spectrum by fourier transform in example 1 according to the first embodiment of the present invention;
Fig. 4 is a schematic flow chart of obtaining a vibration spectrum by normalizing an electrical signal in example 1 according to a first embodiment of the present invention;
Fig. 5 is a flow chart of the similarity determination output crying class in example 1 according to the first embodiment of the present invention;
fig. 6 is a flow chart of crying detection in example 1 according to the first embodiment of the invention;
fig. 7 is a flow chart showing a method for identifying a crying class of an infant by combining vocal cord vibration and posture in example 2 according to the first embodiment of the present invention;
fig. 8 is a flowchart illustrating the process of acquiring audio features according to embodiment 2 of the present invention;
Fig. 9 is a flowchart illustrating a process of acquiring a vibration spectrum of vocal cord vibration in example 2 according to the first embodiment of the present invention;
fig. 10 is a schematic flow chart of acquiring fusion features in embodiment 2 of the first embodiment of the present invention;
Fig. 11 is a flow chart illustrating the fusion of vibration spectrum and audio in the embodiment 2 of the present invention;
Fig. 12 is a flowchart illustrating the process of obtaining the encoding feature vector in embodiment 2 according to the first embodiment of the present invention;
Fig. 13 is a flow chart of obtaining a cry category according to a cry threshold in example 2 according to the first embodiment of the invention;
fig. 14 is a flow chart of a method for identifying a baby crying class by multi-feature fusion in example 3 according to an embodiment of the present invention;
fig. 15 is a schematic flow chart of acquiring motion characteristics of a gesture in example 3 according to the first embodiment of the present invention;
fig. 16 is a flowchart illustrating the determination of the motion feature by the standard motion feature value in the database in example 3 according to the first embodiment of the present invention;
fig. 17 is a flowchart illustrating a process of acquiring a vibration spectrum according to example 3 of the first embodiment of the present invention;
Fig. 18 is a flowchart illustrating the audio feature acquisition by the mel filter in embodiment 3 according to the first embodiment of the present invention;
fig. 19 is a schematic flow chart of multi-feature fusion in example 3 according to the first embodiment of the present invention;
fig. 20 is a schematic flow chart of multi-feature fusion at vibration frequency in example 3 according to the first embodiment of the present invention;
Fig. 21 is a flowchart illustrating the process of obtaining the encoding feature vector in embodiment 3 according to the first embodiment of the present invention;
Fig. 22 is a schematic structural diagram of a device for continuously optimizing camera effect in embodiment 4 of the second embodiment of the present invention;
fig. 23 is a block diagram of a device for selecting confidence level threshold of a sample of an intelligent camera according to embodiment 5 of the second embodiment of the present invention;
FIG. 24 is a schematic structural diagram of a device for self-training of a smart camera model according to embodiment 6 of the second embodiment of the present invention;
Fig. 25 is a schematic structural diagram of an electronic device in a third embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. In the description of the present application, it should be understood that terms such as "center," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," and "outer" indicate orientations or positional relationships based on those shown in the drawings, merely to facilitate and simplify the description; they do not indicate or imply that the devices or elements referred to must have a specific orientation or be configured and operated in a specific orientation, and thus should not be construed as limiting the present application. Moreover, the terms "comprises," "comprising," and any variations thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises it. Provided they do not conflict, the embodiments of the present application and the features of the embodiments may be combined with each other, and such combinations all fall within the protection scope of the present application.
Embodiment one
Example 1
Referring to fig. 1, fig. 1 is a flow chart of a method for detecting infant crying type based on vibration spectrum comparison in embodiment 1 of the present invention; the method comprises the following steps:
S10: acquiring an electric signal corresponding to the vibration of the vocal cords of the infant when the infant cries;
Specifically, when it is determined that the infant is crying, an electric signal generated by vocal cord vibration is obtained. The electric signal may be converted from a vibration parameter of the vocal cords, or converted from an optical image signal; the vibration signal is continuous and non-stationary. It should be noted that the vocal cord vibration can be acquired through a piezoelectric sensor, or the vocal cord vibration parameters can be acquired through other optical components, such as infrared, radar waves, or video captured by a camera.
S11: outputting a vibration frequency spectrum corresponding to the vibration of the vocal cords of the infant when crying according to the electric signal;
Specifically, audio information is acquired in real time and input into a sound detection model for sound recognition, where the sound detection model is a gated recurrent unit (GRU) network. When the audio information is detected to contain infant crying, the electric signal corresponding to the infant's vocal cord vibration during crying is acquired; the electric signal is a non-stationary electric signal. The electric signal is subjected to a short-time Fourier transform, and the vibration spectrum is output.
In one embodiment, referring to fig. 2, the step S11 includes:
s111: segmenting the electric signal according to preset time length to obtain a plurality of continuous electric signal segments;
Specifically, the electrical signal of the vocal cord vibration is a continuous signal with respect to time; dividing the vibration signal into a plurality of electrical signal segments at equal time intervals; in one application embodiment, the electrical signal is a non-stationary electrical signal detected by a piezoelectric sensor.
S112: performing short-time Fourier transform on a plurality of continuous electric signal fragments to output the vibration frequency spectrum;
In one embodiment, referring to fig. 3, the step S112 includes:
s1121: acquiring a window function;
S1122: according to the formula Performing short-time Fourier transform on a plurality of continuous electric signal segments, and outputting the vibration frequency spectrum corresponding to each electric signal segment;
Wherein X is the vibration spectrum corresponding to the acquired signal, X is the acquired time domain signal, f is the frequency, the window function is w (t-tau), tau is the window movement variable, and t is the time.
Specifically, a window function is added in the Fourier transform to prevent spectral leakage; this improves the accuracy of the vibration spectrum.
In an embodiment, in the step S1122, the window function is:

$w(n)=a-b\cos\left(\frac{2\pi n}{N-1}\right),\quad 0\le n\le N-1$

wherein a and b are constants, n is the window function variable, and N is the window length, a positive integer greater than 1.
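As a minimal sketch of steps S111 to S1122, the following Python code segments the electric signal and computes the windowed short-time Fourier transform. The 16 kHz sample rate, 512-point segment length, and Hamming-style constants a = 0.54, b = 0.46 are illustrative assumptions, not values fixed by the patent:

```python
import numpy as np
from scipy.signal import stft

def vibration_spectrum(signal, fs=16000, seg_len=512):
    # w(n) = a - b*cos(2*pi*n/(N-1)): the two-constant cosine window above;
    # a = 0.54, b = 0.46 (a Hamming window) is an assumed choice of constants
    n = np.arange(seg_len)
    a, b = 0.54, 0.46
    window = a - b * np.cos(2 * np.pi * n / (seg_len - 1))
    # STFT over consecutive equal-length segments; the window limits leakage
    freqs, times, spec = stft(signal, fs=fs, window=window, nperseg=seg_len)
    return freqs, times, np.abs(spec)   # magnitude spectrum per segment
```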
In one embodiment, referring to fig. 4, the step S112 includes:
S1123: acquiring each peak value in the electric signal, and finding out the maximum peak value from each peak value;
S1124: normalizing the electric signal by dividing each peak value by the maximum peak value to obtain the vibration spectrum;
Wherein the peak value is the wave peak value and/or the wave trough value of the electric signal.
Specifically, different cries correspond to different needs. The electric signal values generated by vocal cord vibration while the infant cries are collected, each wave crest and/or wave trough among all the electric signal values is extracted, and the collected crest and trough values in each period are then normalized to obtain the vibration spectrum of the electric signal; this ensures the stability of the data.
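A possible implementation of the normalization variant S1123 to S1124 is sketched below; the use of scipy.signal.find_peaks to locate crests and troughs is an assumption about the peak detector, not something the patent specifies:

```python
import numpy as np
from scipy.signal import find_peaks

def normalize_by_max_peak(signal):
    crests, _ = find_peaks(signal)        # wave crest indices
    troughs, _ = find_peaks(-signal)      # wave trough indices
    peaks = np.concatenate([signal[crests], signal[troughs]])
    if peaks.size == 0:                   # no peaks found: leave signal unchanged
        return signal
    max_peak = np.max(np.abs(peaks))      # the maximum peak value
    return signal / max_peak              # each value divided by the maximum peak
```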
S12: and comparing the vibration spectrum with each standard vibration spectrum of the database, and outputting the crying sound type corresponding to the vibration spectrum.
In one embodiment, referring to fig. 5, the step S12 includes:
s121: acquiring a threshold value corresponding to each crying type;
Specifically, a threshold is set to decide whether the crying class judged by the neural network is output; crying classes meeting the threshold requirement are output, and crying classes not meeting it are discarded and not output.
S122: according to the formulaComparing the vibration spectrum with each standard vibration spectrum, and outputting a similarity value group formed by a plurality of similarity values;
specifically, comparing the vibration spectrum corresponding to the collected electric signals with the standard vibration spectrum in the database to obtain similarity values of the cry and each cry category at the moment, and taking all the similarity values as a similarity value group.
S123: finding out the crying class corresponding to the maximum similarity value from the similarity value group;
specifically, the maximum similarity value is found out from the similarity value group, and then the crying class corresponding to the maximum similarity value is used as the current result.
S124: comparing the maximum similarity value with the threshold value representing the crying class corresponding to the maximum similarity value, and outputting the crying class;
wherein X is the vibration spectrum, Y is a standard vibration spectrum in the database, $X_i$ is the value of the i-th signal segment of the vibration spectrum, and $Y_i$ is the value of the i-th signal segment of the standard vibration spectrum; $\mu_X$ and $\mu_Y$ are the mean values of the electric signal segments in X and in Y respectively, $\sigma_X$ and $\sigma_Y$ are the standard deviations of the electric signal segments in X and in Y respectively, and Q is the length of the electric signal collected from the vocal cord vibration.
Specifically, the crying category corresponding to the maximum similarity value is taken as the category to be output this time; the threshold corresponding to that category is then compared with the maximum similarity value. If the similarity is greater than the threshold, the corresponding crying category is output; if it is smaller, no crying category is output and the vibration spectrum is treated as invalid. In an application embodiment, the number of times the maximum similarity value is greater than the crying class threshold may be counted, and the crying class is output only if the maximum similarity value is greater than the threshold k consecutive times; if the crying class corresponding to the maximum similarity value changes, the count is reset; if the maximum similarity value falls below the threshold during counting, the count is also reset. This accumulation improves detection accuracy.
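The comparison and accumulation logic of S121 to S124 can be sketched as follows; the similarity function is the correlation formula above, while the class names, reference spectra, and the value of k are placeholders, not part of the patent:

```python
import numpy as np

def similarity(x, y):
    # (1/Q) * sum_i (X_i - mu_X)(Y_i - mu_Y) / (sigma_X * sigma_Y)
    q = len(x)
    return np.sum((x - x.mean()) * (y - y.mean())) / (q * x.std() * y.std())

def classify(spectrum, standards, thresholds, state, k=3):
    # standards and thresholds are dicts keyed by crying class (placeholders)
    sims = {c: similarity(spectrum, ref) for c, ref in standards.items()}
    best = max(sims, key=sims.get)
    if sims[best] > thresholds[best] and best == state.get("cls"):
        state["count"] = state.get("count", 0) + 1  # same class exceeded again
    elif sims[best] > thresholds[best]:
        state.update(cls=best, count=1)   # class changed: restart the count
    else:
        state.update(cls=None, count=0)   # below threshold: clear the count
    return best if state["count"] >= k else None  # output after k consecutive hits
```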
In one embodiment, referring to fig. 6, before S10, the method further includes:
s1: acquiring a detected audio signal;
Specifically, after sound is detected, each frame of audio is acquired. For example, at a sampling rate of 16 kHz and a quantization precision of 16 bits, every 512 sampling points form a frame and adjacent frames overlap by 256 sampling points, i.e., frames are collected with a frame length of 32 ms and a frame shift of 16 ms, yielding each audio frame.
S2: processing the audio signal, extracting MFCC characteristics of the audio signal;
Specifically, a Fourier transform is performed on each collected audio frame to obtain the bandwidth of the audio signal, and a target bandwidth is determined; the Mel filter bank then performs filtering according to the target bandwidth to obtain the Mel-frequency cepstrum coefficients, which are amplified by logarithmic transformation so that the features become more distinct; discrete cosine transform is used to extract discrete values of the Mel-frequency cepstrum coefficients as the Mel-frequency cepstrum coefficient (MFCC) features.
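A compact sketch of this MFCC front end follows; librosa is used for brevity, and the 13-coefficient count and library choice are assumptions, while the 512-point frame and 256-point hop follow the framing described above:

```python
import librosa

def extract_mfcc(audio, sr=16000):
    # 512-sample frames, 256-sample overlap -> 32 ms frame length, 16 ms shift;
    # librosa applies the mel filter bank, log transform, and DCT internally
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                                n_fft=512, hop_length=256)
```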
S3: and inputting the MFCC characteristics into a preset infant crying recognition model, and determining whether the infant is crying currently.
Specifically, the MFCC features are input into the infant crying recognition model to judge whether the audio signal is infant crying; if it is, crying detection is started, and the crying category is obtained from the detection result.
By adopting the infant crying type detection method based on vibration spectrum comparison of this example, the electric signal of the infant's vocal cord vibration during crying is obtained, converted into the corresponding vibration spectrum, and compared with the standard vibration spectra of the database to obtain the crying category corresponding to the vibration spectrum. Judging the infant's crying category from the vibration spectrum of the vocal cords makes it possible to accurately detect sounding differences caused by individual differences among infants, as well as abnormal crying caused by conditions such as hoarseness, and improves the accuracy of infant crying type detection based on vibration spectrum comparison.
Example 2
In Example 1, the crying category is determined from the vibration parameters of the vocal cords. Because an infant's vocal cords are at an early stage of development, differences in vocal cord vibration are small, the accuracy of the collected vibration parameters is low, and the accuracy of crying type detection ultimately suffers. Therefore, Example 2 of the present invention further analyses the audio signal produced by the infant's crying on the basis of Example 1; referring to fig. 7, the method includes:
s20: acquiring the audio characteristics of the baby crying sound and the vibration frequency spectrum corresponding to the baby vocal cord vibration;
Specifically, when an infant cries, an audio signal containing the cry and the vibration parameters of the corresponding vocal cords are collected; the audio features are obtained by processing the audio signal, and the vibration parameters are processed to obtain the vibration spectrum.
In one embodiment, referring to fig. 8, the step S20 includes:
s201: acquiring an audio signal corresponding to the baby crying;
S202: extracting the characteristics of the audio signal by using a Mel filter to obtain the audio characteristics;
wherein the audio features are mel-frequency cepstrum coefficient, MFCC, features.
Specifically, at a sampling rate of 16 kHz and a quantization precision of 16 bits, every 512 sampling points form a frame and adjacent frames overlap by 256 sampling points, i.e., collection uses a frame length of 32 ms and a frame shift of 16 ms, yielding each audio frame. A Fourier transform is performed on each collected frame to convert the audio signal from a time-domain signal to a frequency-domain signal and obtain its bandwidth, and a target bandwidth is determined; the Mel filter bank then performs filtering according to the target bandwidth to obtain the Mel-frequency cepstrum coefficients, which are amplified by logarithmic transformation so that the features become more distinct; discrete cosine transform is used to extract discrete values of the Mel-frequency cepstrum coefficients as the Mel-frequency cepstrum coefficient (MFCC) features.
In one embodiment, referring to fig. 9, the step S20 includes:
S203: acquiring an electric signal corresponding to vocal cord vibration when the infant cries;
Specifically, when the infant cries, the vibration parameters of the vocal cord vibration and/or the optical image signal corresponding to the vocal cord vibration are collected, and the electric signal of the vocal cord vibration is then obtained; the vibration parameter and the optical image signal are obtained by at least one of the following: an image sensor, infrared, radar waves, and a piezoelectric sensor.
S204: segmenting the electric signal according to the time length of each frame of audio in the audio signal to obtain a plurality of continuous electric signal segments;
Specifically, the electrical signal of the vocal cord vibration is a continuous signal with respect to time; dividing the electrical signal into a plurality of segments with a length of time corresponding to each frame of audio in the audio signal; wherein the initial electrical signal generated by the vocal cord vibration is a non-stationary signal.
S205: performing short-time Fourier transform on a plurality of continuous electric signal fragments to output the vibration frequency spectrum;
s21: performing feature fusion on the audio features and the vibration frequency spectrum, and outputting fused fusion features;
In an embodiment, referring to fig. 10, the step S21 includes:
S211: performing principal component analysis (principal component) dimension reduction processing on the MFCC characteristics of each frame of audio and the vibration frequency spectrum of each electric signal segment, and outputting the MFCC characteristics of each frame of audio and each electric signal segment in the dimension-reduced audio signal;
Specifically, PCA dimension reduction effectively extracts the key components of the signals and reduces the complexity of the data. It should be noted that the dimension reduction may be applied uniformly to the MFCC features and corresponding vibration spectrum of the entire audio signal, or separately to each audio frame and to the vibration spectrum of the electric signal corresponding to that frame.
S212: and carrying out feature fusion on the MFCC features of each frame of audio after the dimension reduction and the vibration frequency spectrums of the electric signals corresponding to each frame of audio to obtain each fusion feature.
Specifically, PCA dimension reduction effectively extracts the key components of each audio frame and reduces the complexity of the data; the key components in the MFCC features of each frame are then feature-fused with the key components of the electric signal segments corresponding to that frame, which eliminates redundant information in the data and improves its accuracy.
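One way to realize S211 to S212 is sketched below, assuming scikit-learn's PCA, per-frame feature matrices, and an illustrative component count of 8:

```python
import numpy as np
from sklearn.decomposition import PCA

def fuse_features(mfcc_frames, spectrum_segments, n_components=8):
    # mfcc_frames: (n_frames, n_mfcc); spectrum_segments: (n_frames, n_bins)
    mfcc_red = PCA(n_components).fit_transform(mfcc_frames)
    spec_red = PCA(n_components).fit_transform(spectrum_segments)
    # concatenate the reduced vectors frame by frame: one fusion feature per frame
    return np.hstack([mfcc_red, spec_red])
```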
In one embodiment, referring to fig. 11, the step S212 includes:
S2121: acquiring a frequency change threshold of the vibration spectrum and vibration frequencies of the vibration spectrum corresponding to each frame of audio;
Specifically, the vibration frequency of the vibration spectrum corresponding to each audio frame is obtained, and the frequency change threshold between the vibration frequencies of adjacent frames is set.
S2122: segmenting each vibration frequency by utilizing the frequency change threshold value to obtain a plurality of continuous frequency segments;
Specifically, the vibration frequencies in the vibration spectra of adjacent audio frames are compared, and the change in vibration frequency is judged against the frequency change threshold: if the change between adjacent frames is greater than the threshold, the two frames belong to different frequency segments; if it is less than or equal to the threshold, they belong to the same frequency segment. The vibration spectrum is thereby divided into a plurality of continuous frequency segments.
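The segmentation rule of S2121 to S2122 can be sketched as follows; the use of one dominant vibration frequency per frame and the 50 Hz threshold are assumptions:

```python
def split_by_frequency_change(frame_freqs, change_threshold=50.0):
    segments, current = [], [0]
    for i in range(1, len(frame_freqs)):
        # a jump above the threshold closes the current segment
        if abs(frame_freqs[i] - frame_freqs[i - 1]) > change_threshold:
            segments.append(current)
            current = []
        current.append(i)
    segments.append(current)
    return segments   # lists of frame indices, one list per frequency segment
```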
S2123: and carrying out feature fusion on the vibration frequency spectrum corresponding to each frequency segment and the MFCC features of all the frame audios corresponding to each frequency segment respectively to obtain the fusion features corresponding to each frequency segment.
Specifically, the audio information is divided into corresponding audio segments according to the length of each frequency segment, and the MFCC features of the audio frames corresponding to the vibration spectrum of each frequency segment are then feature-fused. This improves the detection accuracy for abnormal sounding under the same crying need; for example, when an infant becomes hoarse from prolonged, agitated crying, all vibration frequencies of the corresponding vibration spectrum are marked into the same frequency segment before fusion, which guarantees the reliability of the fusion features and improves detection accuracy.
S22: inputting the fusion characteristic into a preset neural network, and outputting a coding characteristic vector corresponding to a crying state;
Specifically, the obtained fusion features are input into the neural network and convolved with the convolution kernels; the convolved features are converted into a one-dimensional vector for output, and the encoding feature vector, itself a one-dimensional vector, is then obtained through a gated recurrent unit (GRU) network.
In one embodiment, referring to fig. 12, the step S22 includes:
s221: acquiring the feature matrix capacity of the neural network;
Specifically, the feature matrix capacity is the number of features required to determine the crying category represented by the infant's cry at a given moment; that is, the neural network outputs the corresponding crying class according to all the encoding feature vectors in the feature matrix, and whenever an encoding feature vector in the feature matrix is updated, the neural network outputs a new crying class.
S222: performing convolution calculation on the fusion characteristics and convolution kernels, and outputting coding characteristic vectors corresponding to the electric signal fragments;
S223: and obtaining each coding feature vector in the current feature matrix according to the feature matrix capacity and each coding feature vector.
Specifically, each fusion feature is convolved with the convolution kernel in turn, and each encoding feature vector is output. Before a fusion feature enters the feature matrix, the last row of the matrix is deleted and the remaining rows are shifted down one row as a whole, with the newest fusion feature entering the first row. The two-dimensional fusion feature is reshaped into a one-dimensional vector by the convolution calculation; the one-dimensional vector is then transformed into an encoding feature vector by a gated recurrent unit (GRU) network. At the same time, the last row of encoding feature vectors is deleted, the other rows are moved down as a whole, and the newly obtained encoding feature vector is placed in the first row, completing the update of the encoding feature vectors. All updated encoding feature vectors are weighted-averaged, the final encoding feature vector is output, and the probability of each crying category is then output through an activation function.
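A minimal sketch of the rolling feature matrix update described above; the capacity and the uniform weighting are assumptions, since the patent only specifies a weighted average:

```python
import numpy as np

class FeatureMatrix:
    def __init__(self, capacity, dim):
        self.rows = np.zeros((capacity, dim))

    def update(self, encoding):
        # shift every row down one place; the former last row is overwritten
        self.rows = np.roll(self.rows, 1, axis=0)
        self.rows[0] = encoding           # newest encoding vector enters row one
        return self.rows.mean(axis=0)     # uniform average fed to the activation
```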
S23: and outputting the crying type of the crying state according to the coding feature vector.
In one embodiment, referring to fig. 13, the step S23 includes:
s231: obtaining a crying class threshold;
Specifically, the crying class output each time is counted, a threshold is set for the number of consecutive occurrences of the same crying class, and when the threshold is reached, the crying class is output.
S232: comparing the first crying category corresponding to the current coding feature vector with the second crying category corresponding to the previous coding feature vector, and outputting a category comparison result;
S233: if the comparison results are the same, counting and adding 1; otherwise, counting 0;
s234: outputting the crying category when the counted value is equal to the crying category threshold value.
Specifically, the previous crying category is compared with the current crying category. If the crying categories output on two consecutive occasions are the same, the internal counter increments by 1; if they differ, the count value of the counter is cleared. When the number of consecutive occurrences of the same crying category reaches the crying class threshold, that category is output as the current crying category.
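The output gate of S231 to S234 reduces to a small counter; the threshold value of 3 is a placeholder, and this sketch counts the first occurrence of a new class as 1:

```python
def gated_output(pred, state, cry_threshold=3):
    if pred == state.get("last"):
        state["count"] += 1               # same class as the previous output
    else:
        state.update(last=pred, count=1)  # class changed: restart the count
    # emit the class only once it has occurred cry_threshold consecutive times
    return pred if state["count"] >= cry_threshold else None
```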
In one embodiment, the crying category comprises at least one of: hunger, pain, distraction and discomfort.
In an embodiment, the preset neural network includes at least one sub-neural network of a scene, the scene including at least one of: night, daytime, outdoor, sunny, cloudy, rainy, indoor, etc. corresponding to each season.
In an embodiment, the S22 includes:
S224: obtaining corresponding time information and environment information when the baby crys;
Specifically, when the infant cries, the time at that moment is considered, such as breakfast time, morning, lunch time, afternoon, dinner time, or night; the environment information includes at least one of: indoor, outdoor, sunny, rainy, etc.
S225: determining the corresponding sub-neural network as a target neural network according to the time information and the environment information;
Specifically, the sub-neural network used to perform the convolution calculation on the fusion features is determined according to the period in which the infant cries and the crying environment, and is recorded as the target neural network.
S226: and carrying out convolution calculation on the fusion characteristic by using the target neural network, and outputting a coding characteristic vector corresponding to the crying state.
By adopting this method for intelligently identifying the infant crying category, the audio features and the vocal cord vibration spectrum corresponding to the infant's cry are obtained; the audio features and the vibration spectrum are feature-fused, the fused features are converted into the corresponding encoding feature vectors through the preset neural network, and the probability of each crying category is output to obtain the crying category. By combining the audio features with the vibration features generated by the vocal cords, crying recognition accuracy is improved.
Example 3
In Examples 1 and 2, the crying category is determined from the vocal cord vibration parameters and the crying audio signal. Because an infant's vocal cords are at an early and imperfect stage of development, vocal cord vibration and cries express needs over only a small range, so the matched samples are limited, which ultimately causes misjudgment. Therefore, on the basis of Example 1, posture information corresponding to the infant's crying state is introduced for further improvement; referring to fig. 14, the method includes:
S30: acquiring the audio characteristics of the baby crying sounds, the action characteristics corresponding to the gesture actions and the vibration frequency spectrum corresponding to the vocal cord vibrations;
Specifically, when infant crying is detected, a video stream containing the cry and the vibration parameters of the infant's vocal cord vibration are obtained; the audio features and motion features are extracted from the video stream, together with the vibration spectrum corresponding to the vocal cord vibration; the motion features include limb movements and facial micro-expressions.
In one embodiment, referring to fig. 15, the step S30 includes:
S301: obtaining video stream of infant crying;
s302: extracting motion characteristic values of each frame of image in the video stream;
Specifically, the video stream is split into multiple frames of images; an action is formed from a plurality of successive images, and the motion feature value of each action in each frame is extracted. In an application embodiment, each frame of the video stream is filtered by Kalman filtering to eliminate background interference before the motion feature values are extracted; Kalman filtering removes slow background changes between images, mainly changes of light and shadow, improving detection efficiency and the accuracy of the detection result.
It should be noted that, at a sampling rate of 16 kHz and a quantization precision of 16 bits, every 512 sampling points form a frame and adjacent frames overlap by 256 sampling points, i.e., collection uses a frame length of 32 ms and a frame shift of 16 ms, yielding each frame of image.
S303: comparing each motion characteristic value with an action behavior database, and converting the motion characteristic value of each frame of image into the corresponding motion standard characteristic value in the action behavior database to obtain the motion characteristic.
Specifically, the motion feature value of each action in each frame image is compared with the action behavior database, and the matched motion feature value in the database is output as the actual motion feature value of each action, which is used as its motion feature. Representing the actually collected motion feature values by the standard values of the action behavior database ensures the stability of the data used for feature fusion.
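S303 can be sketched as a nearest-neighbour lookup against the database of standard motion feature values; the database layout and the Euclidean distance metric are assumptions:

```python
import numpy as np

def standardize_motion(feature, database):
    # database: dict mapping an action name to its standard feature vector
    best = min(database, key=lambda a: np.linalg.norm(feature - database[a]))
    return best, database[best]   # matched action and its standard feature value
```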
In one embodiment, referring to fig. 16, the step S303 includes:
S3031: collecting image sample sets corresponding to a plurality of actions of the infant;
s3032: extracting motion characteristic values of all images in the image sample set;
Specifically, the motion feature of each frame image of each action is extracted to obtain the motion feature value of each frame image, and the variation interval of the motion feature value of the action is output from the per-frame motion feature values.
S3033: correlating the motion characteristic value of each action with crying states of each category; outputting an action behavior database;
specifically, crying information corresponding to each action is obtained, a motion characteristic value range of each action is associated with the corresponding crying information, and an action behavior database is established.
S3034: comparing each motion characteristic value with an action behavior database, and outputting the motion standard characteristic value to obtain the motion characteristic.
In one embodiment, referring to fig. 17, the step S30 includes:
S304: acquiring an electric signal generated by vibration of a vocal cord of an infant when crying;
Specifically, when the infant cries, the vibration parameters of the vocal cord vibration and/or the optical image signal corresponding to the vocal cord vibration are collected, and the electric signal of the vocal cord vibration is then obtained; the vibration parameters and the optical image signal are obtained in the manner described in Example 1 and are not repeated here.
S305: segmenting the electric signal according to the time length of each frame of image to obtain a plurality of continuous electric signal segments;
Specifically, the electric signal of the vocal cord vibration is a continuous signal with respect to time, and is divided into a plurality of segments whose duration corresponds to that of each frame; the electric signal is a non-stationary signal.
S306: and performing short-time Fourier transform on a plurality of continuous electric signal fragments to output the vibration frequency spectrum.
In one embodiment, referring to fig. 18, the step S30 includes:
S307: acquiring an audio signal of a crying sound of an infant;
s308: extracting the characteristics of the audio signal by using a Mel filter to obtain the audio characteristics;
wherein the audio features are mel-frequency cepstrum coefficient, MFCC, features.
Specifically, short-time fourier transform processing is performed on each frame of audio signals, so that the audio signals are converted from time domain signals to frequency domain signals, after the audio signals of the frequency domain signals are obtained through fourier transform, filtering processing is performed on each frame of audio signals through a mel filter bank, and after logarithmic transform and discrete cosine transform, mel frequency cepstrum coefficient MFCC characteristics are extracted.
S31: performing feature fusion on the audio features, the action features and the vibration spectrum, and outputting fused fusion features;
in one embodiment, referring to fig. 19, the step S31 includes:
S311: performing principal component analysis (principal component) dimension reduction processing on the MFCC characteristics of each frame of audio, the action characteristics of each frame of image and the vibration frequency spectrums of each electric signal segment, and outputting the MFCC characteristics of each frame of audio, the action characteristics of each frame of image and the vibration frequency spectrums of each electric signal segment after dimension reduction;
Specifically, the main component analysis method is adopted for dimension reduction treatment, so that key components in signals can be effectively extracted, and the complexity of data is reduced; it should be noted that: the dimension reduction processing is unified processing of the MFCC characteristics of the whole audio signal, the action characteristics of the video stream and the corresponding vibration spectrum, or is processing of each frame of audio, the vibration spectrum of the electric signal corresponding to each frame of audio, and the image corresponding to each frame of audio.
S312: and carrying out feature fusion on the motion features of the corresponding images of each frame after dimension reduction, the MFCC features of the corresponding audio signals and the vibration frequency spectrums of the corresponding electric signals to obtain each fusion feature.
Specifically, PCA dimension reduction effectively extracts the key components of the signals and reduces the complexity of the data; the key components in the MFCC features of each frame, the motion features of the corresponding image frames, and the key components of the corresponding electric signal segments are then feature-fused, which eliminates redundant information in the data and improves its accuracy.
In an embodiment, referring to fig. 20, the step S31 includes:
s313: acquiring a frequency change threshold of the vibration spectrum and vibration frequencies of the vibration spectrum corresponding to each frame of audio information;
specifically, the vibration frequency of the vibration spectrum corresponding to the audio information of each frame is obtained, and the frequency change threshold of the vibration frequency of the adjacent frame is set.
S314: segmenting each vibration frequency by utilizing the frequency change threshold value to obtain a plurality of continuous frequency segments;
Specifically, the vibration frequencies corresponding to adjacent audio frames are compared, and the change in vibration frequency is judged against the frequency change threshold: if the change between adjacent frames is greater than the threshold, the two frames belong to different frequency segments; if it is less than or equal to the threshold, they belong to the same frequency segment. The vibration spectrum is thereby divided into a plurality of continuous frequency segments.
S315: and carrying out feature fusion on the vibration frequency spectrum corresponding to each frequency segment, the dynamic motion features corresponding to all frame images and the MFCC features of all frame audios to obtain the fusion features corresponding to each frequency segment.
Specifically, the vibration spectrum of each frequency segment is fused with the MFCC features of the corresponding audio frames and the motion features of the corresponding image frames, which improves detection accuracy for abnormal sounding under the same crying need. For example, when an infant becomes hoarse from prolonged, agitated crying, all vibration frequencies of the corresponding vibration spectrum are marked into the same frequency segment although the sound information of that segment contains abnormal crying; multi-feature fusion with the motion features of the infant's face and limbs over the corresponding period weights the cry with the motion features, guaranteeing the reliability of the fusion features and improving detection accuracy. As another example, if an infant suddenly cries at night with large limb movements of short duration, the motion-feature weighting biases the fusion feature toward a nightmare scene, so that after the fusion feature is input into the neural network for convolution calculation, the nightmare probability in the output crying categories is the largest and the guardian can take soothing actions.
S32: inputting the fusion characteristic into a preset neural network, and outputting a coding characteristic vector corresponding to a crying state;
in one embodiment, referring to fig. 21, the step S32 includes:
s321: acquiring the feature matrix capacity of the neural network;
Specifically, the feature matrix capacity is the number of features required to determine the crying category represented by the infant's cry at a given moment; that is, the neural network outputs the corresponding crying class according to all the encoding feature vectors in the feature matrix, and whenever an encoding feature vector in the feature matrix is updated, the neural network outputs a new crying class.
S322: performing convolution calculation on the fusion characteristics and convolution kernels, and outputting coding characteristic vectors corresponding to the electric signal fragments;
s323: and obtaining each coding feature vector in the current feature matrix according to the feature matrix capacity and each coding feature vector.
Specifically, each fusion feature is convolved in turn with the convolution kernel, and a coding feature vector is output for each. Referring to fig. x, before a fusion feature enters the feature matrix, the last row of the matrix is deleted and the remaining rows are shifted down by one row as a whole, so that the newest fusion feature can enter the first row. The two-dimensional fusion feature is deformed into a one-dimensional vector through convolution calculation; the one-dimensional vector is then transformed into a coding feature vector by a gated recurrent unit (GRU) network. At the same time, the last row of coding feature vectors is deleted, the remaining rows of coding feature vectors are shifted down as a whole, and the newly obtained coding feature vector is placed in the first row, completing the update of the coding feature vectors. All updated coding feature vectors are then weighted and averaged to output the final coding feature vector; an activation function outputs the probability corresponding to each crying class, and the class probabilities are compared to output the crying class.
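A minimal PyTorch sketch of the update cycle just described: convolution to a one-dimensional vector, GRU encoding, a row-shifting feature matrix, an averaged coding vector, and per-class probabilities via an activation function. All layer sizes are assumptions, the two-dimensional-to-one-dimensional deformation is simplified to a one-dimensional convolution, and the weighted average is taken with equal weights:

```python
import torch

class CryEncoder(torch.nn.Module):
    """Sketch of the shifting feature matrix, GRU encoding and per-class
    probabilities described above; every size here is an illustrative assumption."""

    def __init__(self, feat_dim=85, code_dim=32, n_classes=5, capacity=8):
        super().__init__()
        self.conv = torch.nn.Conv1d(1, 1, kernel_size=3, padding=1)    # fusion feature -> 1-D vector
        self.gru = torch.nn.GRU(feat_dim, code_dim, batch_first=True)  # 1-D vector -> coding vector
        self.fc = torch.nn.Linear(code_dim, n_classes)
        self.matrix = torch.zeros(capacity, code_dim)   # feature matrix, newest row first

    def forward(self, fused):                                # fused: (feat_dim,)
        x = self.conv(fused.view(1, 1, -1)).flatten()        # convolution calculation
        _, h = self.gru(x.view(1, 1, -1))                    # encode with the GRU
        self.matrix = torch.roll(self.matrix, 1, dims=0)     # rows shift down; the old last row
        self.matrix[0] = h.flatten().detach()                # is overwritten by the newest vector
        final = self.matrix.mean(dim=0)                      # equal-weight average of all rows
        return torch.softmax(self.fc(final), dim=0)          # probability per crying class

probs = CryEncoder()(torch.rand(85))
print(probs.argmax().item())   # index of the most probable crying class
```

The equal-weight mean is the simplest reading of the "weighted and averaged" step; a learned weight vector could replace it without changing the rest of the cycle.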
S33: outputting the crying class of the crying state according to the coding feature vector.
By adopting the multi-feature-fusion method of this embodiment for identifying the infant crying type, the audio features, vocal cord vibration features and gesture features produced while the infant cries are fused, the fused features are input into a neural network to analyze the infant's crying class, and the corresponding crying class is output. Combining the gesture features observed during crying compensates for the limits of the sound signal and the vocal cord vibration spectrum in judging the cry: the gesture features reinforce the expression of the infant's demand, reducing wrong judgments and improving the accuracy of crying detection.
Second embodiment
Example 4
Embodiment 4 of the present invention provides a device for identifying the infant crying type by vocal cord vibration, based on the methods of embodiments 1 to 3; referring to fig. 22, the device comprises:
A signal acquisition module: used for acquiring the electrical signal corresponding to the vibration of the infant's vocal cords while the infant cries;
A signal processing module: used for outputting, according to the electrical signal, the vibration spectrum of the infant's vocal cord vibration during crying;
A crying category module: used for comparing the vibration spectrum with each standard vibration spectrum in the database and outputting the crying class corresponding to the vibration spectrum.
By adopting the device of this embodiment for identifying the infant crying type through vocal cord vibration, the electrical signal of the vocal cord vibration during crying is acquired, converted into the corresponding vibration spectrum, and compared with the standard vibration spectra in the database to obtain the crying class corresponding to the vibration spectrum. Judging the crying class from the vibration spectrum of the infant's vocal cords makes it possible to accurately detect sounding differences caused by individual differences between infants, or abnormal crying caused by conditions such as hoarseness, improving the accuracy of infant crying type detection based on vibration spectrum comparison.
It should be noted that the device further includes the other technical solutions described in embodiments 1 to 3, which are not repeated here.
Example 5
In embodiment 4, the infant's crying class is determined from the vibration parameters of the vocal cord vibration. Because an infant's vocal cords are at an early stage of development, the differences between vocal cord vibrations are small and the accuracy of the collected vibration parameters is low, which ultimately affects the accuracy of crying type detection. The crying audio signal is therefore introduced on the basis of embodiment 4 as a further improvement; referring to fig. 23, the device comprises:
A parameter acquisition module: used for acquiring the audio features of the infant's crying sound and the vibration spectrum corresponding to the vocal cord vibration;
A feature fusion module: used for carrying out feature fusion on the audio features and the vibration spectrum and outputting the fused fusion feature;
A neural network module: used for inputting the fusion feature into a preset neural network and outputting the coding feature vector corresponding to the crying state;
A category output module: used for outputting the crying class of the crying state according to the coding feature vector.
By adopting the device of this embodiment for intelligently identifying the infant crying type, the audio features and the vocal cord vibration spectrum corresponding to the infant's cry are acquired; the audio features and the vibration spectrum are fused, the fused feature is converted into the corresponding coding feature vector by a preset neural network, and the probability of each crying class is output to obtain the crying class. Combining the audio features generated by vocal cord vibration with the vibration features improves the accuracy of crying recognition.
It should be noted that the device further includes the other technical solutions described in embodiment 4, which are not repeated here.
Example 6
In embodiments 4 and 5, the infant's crying class is determined from the vibration parameters of the vocal cord vibration and the crying audio signal. Because an infant's vocal cords are at an early stage of development and imperfectly developed, vocal cord vibration and crying express the infant's demands over only a small range, so the matched samples are limited and misjudgments ultimately occur. The gesture information corresponding to the infant's cry is therefore introduced on the basis of embodiments 4 and 5 as a further improvement; referring to fig. 24, the device comprises:
A feature acquisition module: used for acquiring the audio features of the infant's crying sound, the action features corresponding to the gesture actions, and the vibration spectrum corresponding to the vocal cord vibration;
A fusion feature output module: used for carrying out feature fusion on the audio features, the action features and the vibration spectrum and outputting the fused fusion feature;
A coding feature output module: used for inputting the fusion feature into a preset neural network and outputting the coding feature vector corresponding to the crying state;
A crying type output module: used for outputting the crying class of the crying state according to the coding feature vector.
By adopting the multi-feature-fusion device of this embodiment for identifying the infant crying type, the audio features, vocal cord vibration features and gesture features of the infant's cry are fused, the fused features are input into a neural network to analyze the crying class, and the corresponding crying class is output. Combining the gesture features observed during crying compensates for the limits of the sound signal and the vocal cord vibration spectrum in judging the cry: the gesture features reinforce the expression of the infant's demand, reducing wrong judgments and improving the accuracy of crying detection.
It should be noted that the device further includes the other technical solutions described in embodiment 4 and/or embodiment 5, which are not repeated here.
Third embodiment
The present invention provides an electronic device and a storage medium; as shown in fig. 25, the electronic device comprises at least one processor, at least one memory, and computer program instructions stored in the memory.
In particular, the processor may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing embodiments of the present invention; the electronic device includes at least one of: a camera, a mobile device with a camera, and a wearable device with a camera.
The memory may include mass storage for data or instructions. By way of example, and not limitation, the memory may comprise a Hard Disk Drive (HDD), floppy disk drive, flash memory, optical disk, magneto-optical disk, magnetic tape, or Universal Serial Bus (USB) drive, or a combination of two or more of these. The memory may include removable or non-removable (or fixed) media, where appropriate. The memory may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory is a non-volatile solid-state memory. In a particular embodiment, the memory includes Read-Only Memory (ROM). The ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically Alterable ROM (EAROM), or flash memory, or a combination of two or more of these, where appropriate.
The processor reads and executes the computer program instructions stored in the memory to implement any one of the methods of the first embodiment: the infant crying type detection method based on vibration spectrum comparison, the method for identifying the infant crying type by combining vocal cord vibration and gesture, and the method for identifying the infant crying type by multi-feature fusion.
In one example, the electronic device may also include a communication interface and a bus. The processor, the memory and the communication interface are connected through a bus and complete communication with each other.
The communication interface is mainly used for realizing communication among the modules, the devices, the units and/or the equipment in the embodiment of the invention.
The bus includes hardware, software, or both that couple the components of the electronic device to each other. By way of example, and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association local (VLB) bus, or another suitable bus, or a combination of two or more of these. The bus may include one or more buses, where appropriate. Although embodiments of the invention have been described and illustrated with respect to a particular bus, the invention contemplates any suitable bus or interconnect.
In summary, the embodiments of the present invention provide a method for detecting the infant crying type based on vibration spectrum comparison, a method for identifying the infant crying type by combining vocal cord vibration and gesture, and a method for identifying the infant crying type by multi-feature fusion, together with the corresponding devices, equipment and storage medium.
It should be understood that the invention is not limited to the particular arrangements and instrumentalities described above and shown in the drawings. For the sake of brevity, a detailed description of known methods is omitted here. In the above embodiments, several specific steps are described and shown as examples. The method processes of the present invention are not limited to the specific steps described and shown; those skilled in the art may make various changes, modifications and additions, or change the order between steps, after appreciating the spirit of the present invention.
The functional blocks shown in the above-described structural block diagrams may be implemented in hardware, software, firmware, or a combination thereof. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, a plug-in, a function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine readable medium or transmitted over transmission media or communication links by a data signal carried in a carrier wave. A "machine-readable medium" may include any medium that can store or transfer information. Examples of machine-readable media include electronic circuitry, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, radio Frequency (RF) links, and the like. The code segments may be downloaded via computer networks such as the internet, intranets, etc.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (9)

1. A method for detecting a class of infant crying based on vibration spectrum comparison, the method comprising:
obtaining a threshold value corresponding to each infant crying class;
comparing the vibration spectrum corresponding to the collected electrical signal of the infant's vocal cord vibration while the infant cries with the standard vibration spectra in the database to obtain a similarity value between the current cry and each crying class, and taking all the similarity values as a similarity value group;
finding the crying class corresponding to the maximum similarity value in the similarity value group, comparing the maximum similarity value with the threshold value of the crying class corresponding to the maximum similarity value, and outputting the crying class;
wherein finding the crying class corresponding to the maximum similarity value in the similarity value group, comparing the maximum similarity value with the threshold value of the crying class corresponding to the maximum similarity value, and outputting the crying class comprises:
comparing the maximum similarity value with the threshold value of the crying class corresponding to the maximum similarity value;
counting the number of times the maximum similarity value is larger than the crying class threshold value, and outputting the crying class if the maximum similarity value is larger than the crying class threshold value k times in succession, wherein k is a positive integer.
2. The method for detecting the class of infant crying based on vibration spectrum comparison according to claim 1, wherein before comparing the vibration spectrum corresponding to the collected electrical signal of the infant's vocal cord vibration while the infant cries with the standard vibration spectra in the database, the method further comprises:
acquiring the electrical signal corresponding to the vibration of the infant's vocal cords while the infant cries;
segmenting the electrical signal according to a preset time length to obtain a plurality of continuous electrical signal segments;
performing short-time Fourier transform on the plurality of continuous electrical signal segments to output the vibration spectrum.
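For illustration only, a minimal sketch of the segmentation-plus-STFT step in claim 2, using scipy; the sampling rate, segment length and FFT size are assumptions, and the random array stands in for the real vocal-cord electrical signal:

```python
import numpy as np
from scipy.signal import stft

# Hypothetical parameters: 1 kHz sensor sampling, 0.5 s segments
fs, seg_len = 1000, 500
signal = np.random.randn(5 * fs)                  # stand-in vocal-cord electrical signal

segments = [signal[i:i + seg_len]                 # split into continuous equal-length segments
            for i in range(0, len(signal) - seg_len + 1, seg_len)]
spectra = [np.abs(stft(seg, fs=fs, nperseg=128)[2])   # short-time Fourier transform per segment
           for seg in segments]
print(len(spectra), spectra[0].shape)
```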
3. The method for detecting the class of infant crying based on vibration spectrum comparison according to claim 2, wherein before acquiring the electrical signal corresponding to the vibration of the infant's vocal cords while the infant cries, the method further comprises:
acquiring the detected audio signal;
processing the audio signal and extracting the MFCC features of the audio signal;
inputting the MFCC features into a preset infant crying recognition model and determining whether the infant is currently crying.
4. The method for detecting the class of infant crying based on vibration spectrum comparison according to claim 3, wherein processing the audio signal to extract the MFCC features of the audio signal comprises:
performing Fourier transform on each collected audio frame to obtain the bandwidth of the audio signal, and determining a target bandwidth;
filtering according to the target bandwidth to obtain the mel-frequency cepstrum coefficients;
amplifying the mel-frequency cepstrum coefficients by logarithmic transformation, and extracting discrete values of the mel-frequency cepstrum coefficients by discrete cosine transform as the mel-frequency cepstral coefficient (MFCC) features.
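As a hedged sketch of the MFCC pipeline in claim 4 (Fourier transform, mel filtering, logarithmic amplification, then DCT), built from librosa and scipy primitives; the sampling rate, filter count, coefficient count and random audio are illustrative assumptions:

```python
import numpy as np
import scipy.fftpack
import librosa

sr = 16000
y = np.random.randn(sr)                                    # stand-in for one second of cry audio
spec = np.abs(librosa.stft(y, n_fft=512)) ** 2             # Fourier transform of each frame
mel = librosa.feature.melspectrogram(S=spec, sr=sr, n_mels=26)  # filter with the mel filterbank
log_mel = np.log(mel + 1e-10)                              # logarithmic amplification
mfcc = scipy.fftpack.dct(log_mel, axis=0, norm="ortho")[:13]    # DCT keeps 13 discrete values
print(mfcc.shape)                                          # (13, n_frames)
```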
5. The method for detecting the class of infant crying based on vibration spectrum comparison according to claim 1, wherein comparing the vibration spectrum corresponding to the collected electrical signal of the infant's vocal cord vibration while the infant cries with the standard vibration spectra in the database to obtain a similarity value between the cry and each crying class, and taking all the similarity values as a similarity value group, comprises:
according to the formula

$$\mathrm{sim}(X,Y)=\frac{1}{Q}\sum_{i=1}^{Q}\frac{(X_i-\mu_X)(Y_i-\mu_Y)}{\sigma_X\,\sigma_Y}$$

comparing the vibration spectrum with each standard vibration spectrum and outputting the similarity value group;
wherein X is the vibration spectrum, Y is a standard vibration spectrum in the database, X_i is the value of the i-th signal segment of the vibration spectrum, and Y_i is the value of the i-th signal segment of the standard vibration spectrum; μ_X and μ_Y are respectively the mean values of the electrical signal segments in X and in Y, σ_X and σ_Y are respectively the standard deviations of the electrical signal segments in X and in Y, and Q is the length, in electrical signal segments, of the collected vocal cord vibration signal.
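Taking the reconstructed formula above at face value (it matches a per-segment Pearson correlation), a minimal numpy sketch of building the similarity value group; the class names and random arrays are placeholders:

```python
import numpy as np

def spectrum_similarity(X, Y):
    """Similarity of vibration spectrum X to standard spectrum Y over Q
    segments: mean-centred product normalised by both standard deviations."""
    Q = len(X)
    return np.sum((X - X.mean()) * (Y - Y.mean())) / (Q * X.std() * Y.std())

# one standard spectrum per crying class (placeholder data)
standards = {"hunger": np.random.rand(100), "pain": np.random.rand(100)}
X = np.random.rand(100)                      # measured vibration spectrum
similarity_group = {c: spectrum_similarity(X, Y) for c, Y in standards.items()}
print(max(similarity_group, key=similarity_group.get))  # class of the maximum similarity
```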
6. The method for detecting the class of infant crying based on vibration spectrum comparison according to claim 1, wherein finding the crying class corresponding to the maximum similarity value in the similarity value group, comparing the maximum similarity value with the threshold value of the crying class corresponding to the maximum similarity value, and outputting the crying class further comprises:
if the crying class corresponding to the maximum similarity value changes, resetting the count;
if the maximum similarity value is smaller than the crying class threshold value during counting, resetting the count.
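A minimal sketch of the counting logic in claims 1 and 6; the function shape, the state tuple and k = 3 are assumptions, and only the consecutive-k rule and the two reset conditions come from the claims:

```python
def update_counter(state, cry_class, similarity, threshold, k=3):
    """Output the class only after the maximum similarity exceeds its class
    threshold k times in a row; reset when the class changes or the value
    drops to or below the threshold."""
    last_class, count = state
    if cry_class != last_class or similarity <= threshold:
        return (cry_class, 1 if similarity > threshold else 0), None  # reset the count
    count += 1
    return (cry_class, count), (cry_class if count >= k else None)

state, out = ("hunger", 0), None              # pre-seeded state for the example
for sim in (0.82, 0.85, 0.88):                # three consecutive wins for "hunger"
    state, out = update_counter(state, "hunger", sim, threshold=0.8)
print(out)                                    # "hunger"
```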
7. A device for detecting the class of infant crying based on vibration spectrum comparison, comprising:
a crying class threshold acquisition module, used for acquiring the threshold value corresponding to each infant crying class;
a vibration spectrum comparison and analysis module, used for comparing the vibration spectrum corresponding to the collected electrical signal of the infant's vocal cord vibration while the infant cries with the standard vibration spectra in the database to obtain a similarity value between the current cry and each crying class, and taking all the similarity values as a similarity value group;
a crying class output module, used for finding the crying class corresponding to the maximum similarity value in the similarity value group, comparing the maximum similarity value with the threshold value of the crying class corresponding to the maximum similarity value, and outputting the crying class;
wherein finding the crying class corresponding to the maximum similarity value in the similarity value group, comparing the maximum similarity value with the threshold value of the crying class corresponding to the maximum similarity value, and outputting the crying class comprises:
comparing the maximum similarity value with the threshold value of the crying class corresponding to the maximum similarity value;
counting the number of times the maximum similarity value is larger than the crying class threshold value, and outputting the crying class if the maximum similarity value is larger than the crying class threshold value k times in succession, wherein k is a positive integer.
8. An electronic device, comprising: at least one processor, at least one memory, and computer program instructions stored in the memory, which when executed by the processor, implement the method of any one of claims 1-6.
9. A medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1-6.
CN202310590550.2A 2021-02-26 2021-02-26 Infant crying type detection method, device and equipment based on vibration spectrum comparison Active CN116631443B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310590550.2A CN116631443B (en) 2021-02-26 2021-02-26 Infant crying type detection method, device and equipment based on vibration spectrum comparison

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110218126.6A CN113012716B (en) 2021-02-26 2021-02-26 Infant crying type identification method, device and equipment
CN202310590550.2A CN116631443B (en) 2021-02-26 2021-02-26 Infant crying type detection method, device and equipment based on vibration spectrum comparison

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202110218126.6A Division CN113012716B (en) 2021-02-26 2021-02-26 Infant crying type identification method, device and equipment

Publications (2)

Publication Number Publication Date
CN116631443A CN116631443A (en) 2023-08-22
CN116631443B true CN116631443B (en) 2024-05-07

Family

ID=76387285

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202310590550.2A Active CN116631443B (en) 2021-02-26 2021-02-26 Infant crying type detection method, device and equipment based on vibration spectrum comparison
CN202110218126.6A Active CN113012716B (en) 2021-02-26 2021-02-26 Infant crying type identification method, device and equipment

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202110218126.6A Active CN113012716B (en) 2021-02-26 2021-02-26 Infant crying type identification method, device and equipment

Country Status (1)

Country Link
CN (2) CN116631443B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116386671B (en) * 2023-03-16 2024-05-07 宁波星巡智能科技有限公司 Infant crying type identification method, device, equipment and storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006293661A (en) * 2005-04-11 2006-10-26 Nippon Telegr & Teleph Corp <Ntt> Change region extracting method and program for the same
WO2010122056A2 (en) * 2009-04-24 2010-10-28 Thales System and method for detecting abnormal audio events
WO2011042502A1 (en) * 2009-10-08 2011-04-14 Telefonica, S.A. Method for the detection of speech segments
CN103280220A (en) * 2013-04-25 2013-09-04 北京大学深圳研究生院 Real-time recognition method for baby cry
CN105895078A (en) * 2015-11-26 2016-08-24 乐视致新电子科技(天津)有限公司 Speech recognition method used for dynamically selecting speech model and device
WO2017111386A1 (en) * 2015-12-22 2017-06-29 경상대학교산학협력단 Apparatus for extracting feature parameters of input signal, and speaker recognition apparatus using same
CN108334577A (en) * 2018-01-24 2018-07-27 东北大学 A kind of Secure numeric type record matching method
CN109119094A (en) * 2018-07-25 2019-01-01 苏州大学 A kind of voice classification method using vocal cords modeling inversion
CN208422408U (en) * 2018-04-17 2019-01-22 深圳市科曼医疗设备有限公司 The voice activated control of incubator
CN111354375A (en) * 2020-02-25 2020-06-30 咪咕文化科技有限公司 Cry classification method, device, server and readable storage medium
CN111933188A (en) * 2020-09-14 2020-11-13 电子科技大学 Sound event detection method based on convolutional neural network

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN2847456Y (en) * 2005-10-08 2006-12-13 陈财明 Analyzer for baby crying
WO2014036263A1 (en) * 2012-08-29 2014-03-06 Brown University An accurate analysis tool and method for the quantitative acoustic assessment of infant cry
CN104700843A (en) * 2015-02-05 2015-06-10 海信集团有限公司 Method and device for identifying ages
CN107274892A (en) * 2017-04-24 2017-10-20 乐视控股(北京)有限公司 Method for distinguishing speek person and device
CN107767874B (en) * 2017-09-04 2020-08-28 南方医科大学南方医院 Infant crying recognition prompting method and system
CN110322898A (en) * 2019-05-28 2019-10-11 平安科技(深圳)有限公司 Vagitus detection method, device and computer readable storage medium
CN111025015B (en) * 2019-12-30 2023-05-23 广东电网有限责任公司 Harmonic detection method, device, equipment and storage medium
CN111967361A (en) * 2020-08-07 2020-11-20 盐城工学院 Emotion detection method based on baby expression recognition and crying

Also Published As

Publication number Publication date
CN116631443A (en) 2023-08-22
CN113012716A (en) 2021-06-22
CN113012716B (en) 2023-08-04

Similar Documents

Publication Publication Date Title
CN112967733B (en) Method and device for intelligently identifying crying type of baby
CN113035241B (en) Method, device and equipment for identifying crying type of baby by multi-feature fusion
CN104795064B (en) The recognition methods of sound event under low signal-to-noise ratio sound field scape
US11355138B2 (en) Audio scene recognition using time series analysis
US9177203B2 (en) Target detection device and target detection method
US11386916B2 (en) Segmentation-based feature extraction for acoustic scene classification
AU2013204156B2 (en) Classification apparatus and program
Ting Yuan et al. Frog sound identification system for frog species recognition
CN113205820B (en) Method for generating voice coder for voice event detection
US11133022B2 (en) Method and device for audio recognition using sample audio and a voting matrix
CN116631443B (en) Infant crying type detection method, device and equipment based on vibration spectrum comparison
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
CN117711436B (en) Far-field sound classification method and device based on multi-sensor fusion
CN111262637A (en) Human body behavior identification method based on Wi-Fi channel state information CSI
CN112418173A (en) Abnormal sound identification method and device and electronic equipment
Seresht et al. Environmental sound classification with low-complexity convolutional neural network empowered by sparse salient region pooling
CN107993666B (en) Speech recognition method, speech recognition device, computer equipment and readable storage medium
CN112397073B (en) Audio data processing method and device
Ghiurcau et al. About classifying sounds in protected environments
CN109935234B (en) Method for identifying source equipment of sound recording
CN108520755B (en) Detection method and device
CN113692618A (en) Voice command recognition method and device
US20230317102A1 (en) Sound Event Detection
CN116774198A (en) Underwater sound multi-target identification method, device and computer readable storage medium
Abdulla et al. Speech-background classification by using SVM technique

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant