CN113035241A - Method, device and equipment for identifying baby cry class through multi-feature fusion - Google Patents

Method, device and equipment for identifying baby cry class through multi-feature fusion

Info

Publication number: CN113035241A (application CN202110218120.9A; granted as CN113035241B)
Authority: CN (China)
Prior art keywords: crying, baby, feature, characteristic value, audio
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN113035241B (en)
Inventors: 陈辉, 张智, 谢鹏, 雷奇文, 艾伟, 胡国湖
Current assignee: Wuhan Xingxun Intelligent Technology Co., Ltd.
Original assignee: Wuhan Xingxun Intelligent Technology Co., Ltd.
Application filed by Wuhan Xingxun Intelligent Technology Co., Ltd.; priority to CN202110218120.9A

Classifications

    • G10L25/63: Speech or voice analysis techniques specially adapted for estimating an emotional state
    • G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/253: Fusion techniques of extracted features
    • G06N3/045: Neural network architectures; combinations of networks
    • G06N3/08: Neural network learning methods
    • G10L17/02: Speaker identification or verification; preprocessing operations, pattern representation or modelling, feature selection or extraction
    • G10L25/03: Speech or voice analysis characterised by the type of extracted parameters
    • G10L25/18: Extracted parameters being spectral information of each sub-band
    • G10L25/24: Extracted parameters being the cepstrum
    • G10L25/30: Analysis technique using neural networks
    • G10L25/45: Characterised by the type of analysis window
    • G10L25/51: Speech or voice analysis specially adapted for comparison or discrimination
    • G10L25/57: Comparison or discrimination for processing of video signals
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention belongs to the technical field of voice recognition, solves the technical problem of low accuracy when baby crying is judged through voice recognition alone, and provides a method, a device and equipment for recognizing baby cry categories through multi-feature fusion. The method acquires the audio features of the sound, the action characteristic values of the posture actions and the vibration frequency spectrum of vocal cord vibration when the baby cries; converts the action characteristic values into standard characteristic values in a database; performs feature fusion on the audio features and the vibration frequency spectrum based on the standard characteristic values; and inputs the fused features into a preset neural network, obtaining the crying category of the baby from the encoding feature vector output by the network. Each standard characteristic value is the probability of each crying category represented by the corresponding posture action. The invention also includes a device and equipment for executing the method. By using posture features to reinforce the interpretation of the baby's needs, the invention reduces misjudgments and improves the accuracy of crying detection.

Description

Method, device and equipment for identifying baby cry class through multi-feature fusion
Technical Field
The invention relates to the technical field of voice recognition, in particular to a method, a device and equipment for recognizing baby cry by multi-feature fusion.
Background
With the development of speech recognition technology, speech recognition is applied in more and more fields, for example recognizing the various cries of an infant to determine the infant's corresponding condition. Baby crying is generally identified as follows: crying is collected with a voice acquisition device, the collected sound is matched against preset baby crying to confirm that it is indeed a baby cry, the confirmed cry is then matched against preset crying categories, and once a match succeeds the category of the collected cry, and thus the specific meaning of the baby's crying, is determined. However, individual babies differ, and the same need may be expressed by different cries; in particular, when the baby's voice is abnormal, for example hoarse or obstructed by a foreign object, the collected audio alone cannot reliably determine the crying category. Therefore, when only voice recognition is used to recognize baby crying, the accuracy and precision are low, and the user experience suffers.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method, an apparatus, and a device for multi-feature fusion recognition of baby cry categories, so as to solve the technical problem of low accuracy in determining baby cry by voice recognition.
The technical scheme adopted by the invention is as follows:
the invention provides a method for identifying the crying category of an infant through multi-feature fusion, which comprises the following steps:
s30: acquiring the audio characteristics of the sound of the baby during crying, the motion characteristic value of the posture motion of the baby in the image and the vibration frequency spectrum of vocal cord vibration;
s31: comparing the action characteristic value with a standard characteristic value of a database, and outputting a standard characteristic value corresponding to the action characteristic value;
s32: performing feature fusion on the audio features and the vibration frequency spectrum based on the standard feature value to obtain fusion features corresponding to the baby cry;
s33: inputting the fusion characteristics into a preset neural network, and outputting a coding characteristic vector corresponding to the baby cry;
s34, outputting the crying classification of the infant when crying according to the encoding feature vector;
and the standard characteristic value is a probability value of each crying category represented by the corresponding gesture motion.
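For orientation, the following minimal sketch strings steps S30 to S34 together. The three callables stand in for the database lookup (S31), the feature fusion (S32) and the preset neural network (S33); they are hypothetical placeholders rather than the claimed implementation.

```python
from typing import Callable, Sequence
import numpy as np

def classify_cry(audio_features: np.ndarray,
                 motion_feature_values: Sequence[np.ndarray],
                 vibration_spectrum: np.ndarray,
                 match_standard_value: Callable[[np.ndarray], np.ndarray],
                 fuse: Callable[..., np.ndarray],
                 encode: Callable[[np.ndarray], np.ndarray],
                 cry_categories: Sequence[str]) -> str:
    """Illustrative composition of steps S30-S34; all callables are assumed helpers."""
    standard_values = [match_standard_value(v) for v in motion_feature_values]  # S31
    fused = fuse(audio_features, vibration_spectrum, standard_values)           # S32
    encoding_vector = encode(fused)                                             # S33
    return cry_categories[int(np.argmax(encoding_vector))]                      # S34
```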
Preferably, the S31 includes:
s311: acquiring a video stream of the baby during crying;
s312: extracting the motion characteristic value of each frame of image in the video stream;
s313: and comparing each motion characteristic value with each standard characteristic value of the database, and outputting each standard characteristic value matched with the motion characteristic value of each frame image.
Preferably, the S313 includes:
s3131: collecting image sample sets corresponding to a plurality of posture actions of the baby;
s3132: extracting a standard characteristic value of each gesture action in the image sample set;
s3133: establishing the database corresponding to each standard characteristic value and each crying category;
s3134: and comparing each motion characteristic value of each frame of image when the baby cries with each standard characteristic value of the database, and outputting each standard characteristic value matched with each motion characteristic value.
Preferably, the S30 includes:
s301: acquiring an electric signal generated by vocal cord vibration when the baby cries;
s302: segmenting the electric signal according to the time length of each frame image to obtain a plurality of continuous electric signal segments;
s303: and carrying out short-time Fourier transform on a plurality of continuous electric signal segments, and outputting the vibration frequency spectrum.
Preferably, the S30 includes:
s304: acquiring audio signals of crying sounds of the baby;
s305: carrying out feature extraction on the audio signal by using a Mel filter to obtain the audio features;
wherein the audio features are Mel Frequency Cepstrum Coefficients (MFCC) features.
Preferably, the S32 includes:
S321: performing principal component analysis on the MFCC features of each frame of audio, the action characteristic value of each frame of image and the vibration spectrum of each electric signal segment for dimensionality reduction, and outputting the dimension-reduced MFCC features, action characteristic values and vibration spectra;
s322: and performing feature fusion on the MFCC features corresponding to the frames of audio after the dimension reduction and the vibration frequency spectrum of the electric signal segment corresponding to the frames of audio based on the standard feature values corresponding to the motion feature values of the frames of images to obtain the fusion features.
Preferably, the S32 includes:
s323: acquiring a frequency change threshold of the vibration frequency spectrum and the vibration frequency of the vibration frequency spectrum corresponding to each frame of audio information;
s324: segmenting each vibration frequency by using the frequency change threshold value to obtain a plurality of continuous frequency segments;
s325: and performing feature fusion on the vibration frequency spectrum corresponding to each frequency segment and the MFCC features of all the corresponding frame audios based on each standard feature value corresponding to each frequency segment to obtain the fusion features corresponding to each frequency segment.
The invention also provides a device for identifying the crying category of the baby by multi-feature fusion, which comprises:
a feature acquisition module, configured to acquire the audio features of the sound when the baby cries, the action characteristic values of the baby's posture actions in the images, and the vibration frequency spectrum of vocal cord vibration;
a data conversion module, configured to compare the action characteristic values with the standard characteristic values of a database and to output the standard characteristic value corresponding to each action characteristic value;
a fusion feature output module, configured to perform feature fusion on the audio features and the vibration frequency spectrum based on the standard characteristic values, obtaining the fusion features corresponding to the baby cry;
an encoding feature output module, configured to input the fusion features into a preset neural network and to output the encoding feature vector corresponding to the baby cry;
a crying category output module, configured to output the crying category of the baby according to the encoding feature vector;
wherein each standard characteristic value is the probability of each crying category represented by the corresponding posture action.
The present invention also provides an electronic device, comprising: at least one processor, at least one memory, and computer program instructions stored in the memory that, when executed by the processor, implement the method of any of the above.
The invention also provides a medium having stored thereon computer program instructions which, when executed by a processor, implement the method of any of the above.
In conclusion, the beneficial effects of the invention are as follows:
the invention provides a method, a device and equipment for identifying baby cry categories through multi-feature fusion, which are characterized in that audio features, vibration frequency spectrums of vocal cord vibration and action feature values of attitude actions during baby cry are obtained, the action feature values of the attitude actions are converted into standard feature values of attitude actions corresponding to the various cry categories, the standard feature values represent the probability of the various cry categories, then the audio features and the vibration frequency spectrums of the vocal cord vibration are fused based on the standard feature values, the weighing processing of the fusion of the cry features is realized, then the fused fusion features are input into a neural network to perform baby cry category analysis, and the corresponding cry categories are output; by weighting the posture characteristics of the baby during crying, the limit of crying discrimination of the sound signal and vocal cord vibration frequency spectrum can be made up, the wrong crying detection judgment during abnormal baby sound is reduced, and the crying detection accuracy is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below, and for those skilled in the art, without any creative effort, other drawings may be obtained according to the drawings, and these drawings are all within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a method for identifying a crying category of a baby in example 1 according to a first embodiment of the present invention;
fig. 2 is a schematic flow chart of obtaining a vibration spectrum in example 1 according to a first embodiment of the present invention;
fig. 3 is a schematic flow chart of obtaining a vibration spectrum by fourier transform in example 1 according to the first embodiment of the present invention;
fig. 4 is a schematic flow chart of obtaining a vibration spectrum by normalizing an electrical signal in example 1 according to the first embodiment of the present invention;
fig. 5 is a schematic flow chart of the process of determining the output cry class according to the similarity in example 1 according to the first embodiment of the present invention;
fig. 6 is a schematic flow chart of crying detection in example 1 according to the first embodiment of the present invention;
FIG. 7 is a flowchart illustrating a method for recognizing baby cry by combining vocal cord vibration and posture in example 2 according to a first embodiment of the present invention;
fig. 8 is a schematic flowchart of acquiring audio features in example 2 according to a first embodiment of the present invention;
fig. 9 is a schematic flow chart of acquiring a vibration spectrum of vocal cord vibration in example 2 according to the first embodiment of the present invention;
fig. 10 is a schematic flow chart of acquiring a fusion feature in example 2 according to the first embodiment of the present invention;
fig. 11 is a schematic flow chart of the fusion of the vibration spectrum and the audio frequency in example 2 according to the first embodiment of the present invention;
fig. 12 is a schematic flowchart of acquiring a coding feature vector in embodiment 2 of the first embodiment of the present invention;
fig. 13 is a schematic flow chart of acquiring crying categories according to crying thresholds in example 2 according to the first embodiment of the present invention;
FIG. 14 is a flowchart illustrating a method for identifying baby cry classes through multi-feature fusion in example 3 according to a first embodiment of the present invention;
fig. 15 is a schematic flow chart illustrating a process of acquiring a motion characteristic of a gesture in example 3 according to a first embodiment of the present invention;
fig. 16 is a schematic flowchart of determining motion characteristics from standard motion characteristic values in a database in example 3 according to a first embodiment of the present invention;
fig. 17 is a schematic flow chart of acquiring a vibration spectrum in example 3 according to the first embodiment of the present invention;
fig. 18 is a schematic flow chart illustrating the process of obtaining audio features through a mel filter in example 3 according to the first embodiment of the present invention;
fig. 19 is a schematic flowchart of multi-feature fusion in example 3 according to the first embodiment of the present invention;
fig. 20 is a schematic flow chart of multi-feature fusion with vibration frequency in example 3 according to the first embodiment of the present invention;
fig. 21 is a schematic structural diagram of an apparatus for continuously optimizing a camera effect according to embodiment 4 of the second embodiment of the present invention;
fig. 22 is a block diagram illustrating a structure of an apparatus for selecting confidence threshold of an intelligent camera sample in embodiment 5 according to a second embodiment of the present invention;
fig. 23 is a schematic structural diagram of an apparatus for self-training an intelligent camera model in embodiment 6 according to a second embodiment of the present invention;
fig. 24 is a schematic structural diagram of an electronic device in a third embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. In the description of the present invention, it is to be understood that the terms "center", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience and simplicity of description; they do not indicate or imply that the referenced devices or elements must have a particular orientation or be constructed and operated in a particular orientation, and thus are not to be construed as limiting the present invention.
Also, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In case of conflict, the embodiments of the present invention and the individual features of the embodiments may be combined with each other within the scope of the present invention.
Embodiment One
Example 1
Referring to fig. 1, fig. 1 is a schematic flow chart of a method for identifying a crying category of an infant in embodiment 1 of the present invention; the method comprises the following steps:
S10: acquiring an electric signal corresponding to the vocal cord vibration of the baby when the baby cries;
specifically, when it is determined that the baby is crying, the electric signal generated by vocal cord vibration is obtained, where the electric signal may be converted from the vibration parameters of the vocal cords or from an optical image signal; the vibration signal is continuous and non-stationary. It should be noted that vocal cord vibration can be collected with a piezoelectric sensor, and the vibration parameters can also be obtained with other optical means, such as infrared, radar waves, or video captured by a camera.
S11: outputting a vibration frequency spectrum corresponding to vocal cord vibration when the baby cries according to the electric signal;
specifically, audio information is acquired in real time and input into a sound detection model for sound recognition, where the sound detection model is a gated recurrent unit (GRU) network; when the audio information is detected to contain baby crying, the electric signal corresponding to vocal cord vibration during the crying is obtained, the electric signal being non-stationary; short-time Fourier transform is then performed on the electric signal and the vibration frequency spectrum is output.
In one embodiment, referring to fig. 2, the S11 includes:
s111: segmenting the electric signal according to preset time duration to obtain a plurality of continuous electric signal segments;
specifically, the electrical signal of vocal cord vibration is a continuous signal with respect to time; dividing the vibration signal into a plurality of electrical signal segments at equal time intervals; in one embodiment, the electrical signal is a non-stationary electrical signal detected by a piezoelectric transducer.
S112: carrying out short-time Fourier transform on a plurality of continuous electric signal segments and outputting the vibration frequency spectrum;
in an embodiment, referring to fig. 3, the S112 includes:
s1121: acquiring a window function;
s1122: according to the formula
X(τ, f) = ∫ x(t) · w(t - τ) · e^(-j2πft) dt
performing short-time Fourier transform on the plurality of continuous electric signal segments, and outputting the vibration frequency spectrum corresponding to each electric signal segment;
wherein X is the vibration frequency spectrum corresponding to the collected signal, x is the collected time-domain signal, f is the frequency, w(t - τ) is the window function, τ is the window shift variable, and t is time.
Specifically, a window function is added in Fourier change to prevent frequency spectrum leakage; the accuracy of the vibration spectrum can be improved.
In an embodiment, in S1031, the window function is:
w(n) = a - b · cos(2πn / (N - 1)),  0 ≤ n ≤ N - 1
wherein a and b are constants, n is the window function variable, and N is a positive integer greater than 1.
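As a rough illustration of S111 and S112, the sketch below splits the vocal-cord electric signal into equal-length segments and applies a windowed FFT to each. The Hamming-style window constants a = 0.54 and b = 0.46 and the segment length are assumptions for illustration only; the patent does not fix their values.

```python
import numpy as np

def vibration_spectrum(signal, segment_len, a=0.54, b=0.46):
    """Split the vocal-cord electric signal into equal segments and apply a
    windowed short-time Fourier transform to each (sketch of S111-S112).
    The window w(n) = a - b*cos(2*pi*n/(N-1)) follows the generalized
    Hamming form; a and b are illustrative assumptions."""
    n = np.arange(segment_len)
    window = a - b * np.cos(2 * np.pi * n / (segment_len - 1))
    num_segments = len(signal) // segment_len
    spectra = []
    for k in range(num_segments):
        segment = signal[k * segment_len:(k + 1) * segment_len]
        spectra.append(np.abs(np.fft.rfft(segment * window)))
    return np.array(spectra)   # one spectrum per electric-signal segment
```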
In an embodiment, referring to fig. 4, the S112 includes:
s1123: acquiring each peak value in the electric signal, and finding out the maximum peak value from each peak value;
s1124: normalizing the electric signals by dividing each peak value by the maximum peak value to obtain the vibration frequency spectrum;
wherein, the peak value is the wave peak value and/or the wave trough value of the electric signal.
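A minimal sketch of the peak normalization of S1123 and S1124, assuming SciPy's peak finder is available; how crests and troughs are combined is an illustrative choice.

```python
import numpy as np
from scipy.signal import find_peaks

def normalize_by_max_peak(signal):
    """Normalize the electric signal by its largest peak (sketch of S1123-S1124)."""
    crest_idx, _ = find_peaks(signal)       # wave crest positions
    trough_idx, _ = find_peaks(-signal)     # wave trough positions
    peaks = np.abs(np.concatenate([signal[crest_idx], signal[trough_idx]]))
    max_peak = peaks.max()
    return signal / max_peak                # normalized signal used for the spectrum
```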
Specifically, the corresponding requirements of different crying sounds are different, electric signal values generated by vocal cord vibration when the baby cryes are collected, each peak and/or trough in all the electric signal values are extracted, and then the collected wave peak values and trough values in each period are normalized to obtain the vibration frequency spectrum of the electric signals; and the stability of data is ensured.
S12: and comparing the vibration frequency spectrum with each standard vibration frequency spectrum of a database, and outputting the crying category corresponding to the vibration frequency spectrum.
In one embodiment, referring to fig. 5, the S12 includes:
s121: obtaining a threshold value corresponding to each crying category;
specifically, a threshold value for judging whether to output the crying categories judged by the neural network is set, the crying categories meeting the threshold value requirement are output, and the crying categories not meeting the threshold value requirement are omitted and are not output.
S122: according to the formula
r(X, Y) = (1/Q) · Σ_{i=1..Q} (X_i - μ_x)(Y_i - μ_y) / (σ_x · σ_y)
comparing the vibration frequency spectrum with each standard vibration frequency spectrum, and outputting a similarity value group consisting of a plurality of similarity values;
specifically, the vibration frequency spectrum corresponding to the acquired electric signal is compared with the standard vibration frequency spectrum in the database to obtain the similarity value of the current cry and each cry class, and all the similarity values are used as a similarity value group.
S123: finding out the crying category corresponding to the maximum similarity value from the similarity value group;
specifically, the maximum similarity value is found from the similarity value group, and then the crying category corresponding to the maximum similarity value is used as the result.
S124: comparing the maximum similarity value with a threshold value representing the crying category corresponding to the maximum similarity value, and outputting the crying category;
wherein X is the vibration frequency spectrum, Y is the standard vibration frequency spectrum in the database, X_i is the value of the i-th signal segment of the vibration frequency spectrum, Y_i is the value of the i-th signal segment of the standard vibration frequency spectrum; μ_x and μ_y are respectively the means of the electric signal segments in X and in Y, σ_x and σ_y are respectively the standard deviations of the electric signal segments in X and in Y, and Q is the length of the electric signal corresponding to the collected vocal cord vibration.
Specifically, the crying category corresponding to the maximum similarity value is taken as the candidate crying category, and the maximum similarity is compared with the threshold of that category: if it is greater than the threshold, the crying category is output; if it is smaller than the threshold, no crying category is output and the vibration frequency spectrum is considered invalid. In one application embodiment, a count is kept of how many times in succession the maximum similarity value exceeds the threshold of its crying category, and the crying category is output only after k consecutive times; if the crying category corresponding to the next maximum similarity value differs, the count is reset, and if a maximum similarity value falls below the threshold during counting, the count is also cleared. This accumulation improves the accuracy of detection.
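The comparison of S121 to S124 might be sketched as follows; the dictionary layout of the standard spectra and thresholds is an assumption, and the consecutive-count step described above is omitted for brevity.

```python
import numpy as np

def spectrum_similarity(x, y):
    """Normalized correlation between a measured spectrum x and a standard
    spectrum y, following the formula in S122 (both of length Q)."""
    q = len(x)
    return np.sum((x - x.mean()) * (y - y.mean())) / (q * x.std() * y.std())

def pick_category(x, standard_spectra, thresholds):
    """Return (category, similarity) for the best-matching standard spectrum,
    or None if the best similarity does not reach that category's threshold."""
    sims = {cat: spectrum_similarity(x, ref) for cat, ref in standard_spectra.items()}
    best = max(sims, key=sims.get)
    return (best, sims[best]) if sims[best] > thresholds[best] else None
```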
In an embodiment, referring to fig. 6, before the step S10, the method further includes:
S1: acquiring a detected audio signal;
specifically, after sound is detected, each frame of audio is acquired, for example at a 16 kHz sampling rate with 16-bit quantization precision, taking every 512 sampling points as a frame with an overlap of 256 sampling points between frames, i.e. a frame length of 32 ms and a frame shift of 16 ms, thereby obtaining each frame of audio.
S2: processing the audio signal and extracting MFCC characteristics of the audio signal;
specifically, Fourier transform is performed on each collected frame of audio to obtain the bandwidth of the audio signal and determine a target bandwidth; a Mel filter bank then performs filtering according to the target bandwidth to obtain the Mel frequency cepstrum coefficients, which are amplified by logarithmic transformation so that the features become more distinct; finally, discrete values of the Mel frequency cepstrum coefficients are extracted as the Mel frequency cepstrum coefficient (MFCC) features using the discrete cosine transform.
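A sketch of this MFCC extraction with the framing parameters given above (16 kHz, 512-sample frames, 256-sample hop), implemented here with librosa for brevity; the number of coefficients is an illustrative assumption, not a value fixed by the patent.

```python
import librosa

def extract_mfcc(path, n_mfcc=13):
    """MFCC features per frame (sketch of S2); 13 coefficients is an assumed choice."""
    y, sr = librosa.load(path, sr=16000)          # 16 kHz source audio
    # 512-sample frames with a 256-sample hop -> 32 ms frame length, 16 ms shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=256)
    return mfcc.T                                  # one MFCC vector per frame
```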
S3: and inputting the MFCC characteristics into a preset infant crying identification model to determine whether the infant is crying currently.
Specifically, the MFCC characteristics are input into a baby cry recognition model, and whether an audio signal is a baby cry is judged; if the baby cry is detected, starting the baby cry detection, and obtaining the cry category according to the cry detection result.
By adopting the above method for identifying the crying category of the baby, the electric signal of vocal cord vibration when the baby cries is obtained, converted into the corresponding vibration frequency spectrum, and compared with the standard vibration frequency spectra of the database to obtain the crying category corresponding to the vibration frequency spectrum. Judging the crying type from the vibration frequency spectrum of the baby's vocal cord vibration can correctly handle vocalization differences caused by individual differences between babies, as well as abnormal crying caused by conditions such as hoarseness, and improves the accuracy of baby crying category identification.
Example 2
In embodiment 1, the crying category of the baby is determined by the vibration parameters corresponding to vocal cords vibration, and since the vocal cords of the baby are in the early development stage, the difference of the vocal cords vibration is small, the accuracy of the acquired vibration parameters is low, and the accuracy of the crying category detection is finally affected. Therefore, the embodiment 2 of the invention further analyzes the audio signal generated by the baby crying on the basis of the embodiment 1; referring to fig. 7, the method includes:
s20: acquiring audio characteristics of sound when the baby cries and a vibration frequency spectrum corresponding to the vibration of the vocal cords of the baby;
specifically, when the baby cries, the method comprises the steps of collecting an audio signal containing the crying and vibration parameters of corresponding vocal cords; the audio signal is processed to obtain audio characteristics, the vibration parameters are processed to obtain a vibration frequency spectrum, and the information collected when the baby cries can also be the motion characteristic value of posture and body movement, the respiratory frequency, the face color information, the face temperature information and the like.
In one embodiment, referring to fig. 8, the S20 includes:
s201: acquiring an audio signal corresponding to the baby cry;
s202: carrying out feature extraction on the audio signal by using a Mel filter to obtain the audio features;
wherein the audio features are Mel Frequency Cepstrum Coefficients (MFCC) features.
Specifically, a frequency of 16KHz, a quantization progress of 16 bits, and each 512 sampling points are used as a frame, and each frame is overlapped with 256 sampling points, namely, the frame length is 32ms, and the frame shift is 16ms for collection, so as to obtain each frame of audio; carrying out Fourier transform on each frame of collected audio, so that the audio signal is converted into a frequency domain signal and the bandwidth of the audio signal from a time domain signal, and determining a target bandwidth; then, a Mel filter bank carries out filtering processing according to the target bandwidth to obtain a Mel frequency cepstrum coefficient, and then logarithmic transformation is carried out to amplify, so that the characteristics are more obvious; discrete values of the Mel frequency cepstrum coefficients are extracted as Mel Frequency Cepstrum Coefficient (MFCC) features using the discrete cosine change.
In one embodiment, referring to fig. 9, the S20 includes:
s203: acquiring an electric signal corresponding to vocal cord vibration when the baby cries;
specifically, when the baby cry is determined, collecting vibration parameters corresponding to vocal cord vibration and/or optical image signals corresponding to vocal cord vibration, and then obtaining electric signals of vocal cord vibration; the vibration parameters and the optical image signal are acquired in a manner including at least one of: image sensors, infrared, radar, and piezoelectric sensors.
S204: segmenting the electric signal according to the time length of each frame of audio in the audio signal to obtain a plurality of continuous electric signal segments;
specifically, the electrical signal of vocal cord vibration is a continuous signal with respect to time; dividing the electrical signal into a plurality of segments for a time length corresponding to each frame of audio in the audio signal; wherein the initial electrical signal generated by the vocal cord vibration is a non-stationary signal.
S205: carrying out short-time Fourier transform on a plurality of continuous electric signal segments and outputting the vibration frequency spectrum;
s21: performing feature fusion on the audio features and the vibration frequency spectrum, and outputting fused features;
in an embodiment, referring to fig. 10, the S21 includes:
s211: performing principal component analysis (MFCC) feature of each frame of audio and the vibration spectrum of each electric signal segment to reduce dimension, and outputting the MFCC feature of each frame of audio in the audio signal and each electric signal segment after dimension reduction;
specifically, the key components in the signals can be effectively extracted by adopting the principal component analysis method for dimensionality reduction, and the complexity of data is reduced; it should be noted that: the dimension reduction processing is the unified processing of the MFCC characteristics of the whole audio signal and the corresponding vibration spectrum, or the independent processing of each frame of audio and the vibration spectrum of the electric signal corresponding to each frame of audio.
S212: and performing feature fusion on the MFCC features of each frame of audio subjected to the dimension reduction and the vibration frequency spectrum of the electric signal corresponding to each frame of audio to obtain each fusion feature.
Specifically, the key components of each frame of audio in the audio signal can be effectively extracted by adopting principal component analysis method dimension reduction processing, so that the complexity of data is reduced; and then, the key components in the MFCC features of each frame of audio and the key components of each electric signal segment corresponding to each frame of audio are subjected to feature fusion, so that redundant information in data can be eliminated, and the data accuracy is improved.
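A minimal sketch of S211 and S212 using scikit-learn's PCA; the reduced dimensionality and the concatenation-style fusion rule are assumptions, since the patent does not fix either.

```python
import numpy as np
from sklearn.decomposition import PCA

def fuse_audio_and_vibration(mfcc_frames, vibration_segments, n_components=8):
    """Reduce each modality with PCA, then fuse frame-by-frame by concatenation
    (sketch of S211-S212; the exact fusion rule is an assumption)."""
    mfcc_reduced = PCA(n_components=n_components).fit_transform(mfcc_frames)
    vib_reduced = PCA(n_components=n_components).fit_transform(vibration_segments)
    frames = min(len(mfcc_reduced), len(vib_reduced))
    return np.hstack([mfcc_reduced[:frames], vib_reduced[:frames]])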
In an embodiment, referring to fig. 11, the step S212 includes:
s2121: acquiring a frequency change threshold of the vibration frequency spectrum and the vibration frequency of the vibration frequency spectrum corresponding to each frame of audio;
specifically, the vibration frequency corresponding to the vibration frequency spectrum and each frame of audio is obtained, and the frequency change threshold of the vibration frequency of the adjacent frame is set.
S2122: segmenting each vibration frequency by using the frequency change threshold value to obtain a plurality of continuous frequency segments;
specifically, the vibration frequency in the vibration frequency spectrum corresponding to the adjacent frame of audio is compared, the relationship between the vibration frequency change and the frequency change threshold is judged, if the vibration frequency change corresponding to the adjacent frame of audio is greater than the frequency change threshold, the two adjacent frames of audio belong to different frequency bands, and if the vibration frequency change corresponding to the adjacent frame of audio is less than or equal to the frequency change threshold, the two adjacent frames of audio belong to the same frequency band, so that the vibration frequency spectrum is divided into a plurality of continuous frequency bands.
S2123: and performing feature fusion on the vibration frequency spectrum corresponding to each frequency segment and the MFCC features of all the frame audios corresponding to each frequency segment respectively to obtain the fusion features corresponding to each frequency segment.
Specifically, according to the length of each frequency segment, dividing the audio information into corresponding audio segments, and then performing feature fusion on the MFCC features of each frame of audio corresponding to the vibration spectrum of each frequency segment, thereby ensuring that the detection accuracy of sound abnormality is improved under the same cry requirement, such as: the infant is excited in emotion and cry for a long time to cause hoarseness, in the process, all vibration frequencies corresponding to the vibration frequency spectrum are divided into the same frequency section, then feature fusion is carried out, the reliability of fusion features is guaranteed, and the detection accuracy is improved.
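The segmentation of S2121 and S2122 can be sketched as follows, assuming one vibration-frequency value per audio frame; the grouping convention is illustrative.

```python
import numpy as np

def split_by_frequency_change(vibration_freqs, change_threshold):
    """Group consecutive frames into frequency segments: a new segment starts
    whenever the vibration frequency jumps by more than the threshold
    (sketch of S2121-S2122)."""
    segments, current = [], [0]
    for i in range(1, len(vibration_freqs)):
        if abs(vibration_freqs[i] - vibration_freqs[i - 1]) > change_threshold:
            segments.append(current)
            current = [i]
        else:
            current.append(i)
    segments.append(current)
    return segments        # lists of frame indices, one list per frequency segment
```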
S22: inputting the fusion characteristics into a preset neural network, and outputting encoding characteristic vectors corresponding to the crying state;
specifically, the obtained vibration frequency spectrum is input into the neural network and convolved with a convolution kernel; the convolved features are converted into a one-dimensional vector for output; a gated recurrent unit (GRU) network is then used to obtain the encoding feature vector, which is a one-dimensional vector.
In one embodiment, referring to fig. 12, the S22 includes:
s221: acquiring the characteristic matrix capacity of the neural network;
specifically, the capacity of the feature matrix is the number of features required for judging the crying category represented by the crying of the baby at a certain moment; that is to say, the neural network outputs corresponding cry categories according to all the coded feature vectors in the feature matrix; when the coded feature vector in the feature matrix is updated, the neural network outputs a new cry class.
S222: performing convolution calculation on the fusion characteristics and a convolution kernel, and outputting coding characteristic vectors corresponding to the electric signal segments;
s223: and obtaining each coding feature vector in the current feature matrix according to the feature matrix capacity and each coding feature vector.
Specifically, each fusion feature is convolved with the convolution kernel in turn and each encoding feature vector is output. Before a fused feature enters the feature matrix, the last row of the feature matrix is deleted, the remaining rows are moved down by one row as a whole, and the newest fused feature enters the first row of the feature matrix. The two-dimensional fusion feature is transformed into a one-dimensional vector by the convolution calculation, and this one-dimensional vector is turned into the encoding feature vector by a gated recurrent unit (GRU) network; at the same time, the last row of encoding feature vectors is deleted, the remaining rows of encoding feature vectors are moved down as a whole, and the newly obtained encoding feature vector is placed in the first row, thereby completing the update of the encoding feature vectors. All updated encoding feature vectors are then weighted and averaged to output the final encoding feature vector, after which the probability corresponding to each crying category is output through an activation function.
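A PyTorch-flavoured sketch of the convolution-plus-GRU encoding and the rolling feature-matrix update of S221 to S223; the layer sizes, the number of classes and the matrix length are illustrative assumptions rather than the patent's architecture.

```python
import torch
import torch.nn as nn

class CryEncoder(nn.Module):
    """Convolution + GRU encoder producing one encoding feature vector per
    fused feature (sketch of S221-S223; all sizes are assumptions)."""
    def __init__(self, in_dim=16, hidden=32, n_classes=4):
        super().__init__()
        self.conv = nn.Conv1d(1, 8, kernel_size=3, padding=1)
        self.gru = nn.GRU(8 * in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, fused):                      # fused: (batch, in_dim)
        x = self.conv(fused.unsqueeze(1))          # (batch, 8, in_dim)
        x = x.flatten(1).unsqueeze(1)              # one-dimensional vector per feature
        _, h = self.gru(x)                         # h: (1, batch, hidden)
        encoding = h[-1]                           # encoding feature vector
        return encoding, torch.softmax(self.head(encoding), dim=-1)

def update_feature_matrix(matrix, encoding):
    """Drop the oldest row and insert the newest encoding in row 0.
    matrix: (rows, hidden); encoding: (hidden,)."""
    return torch.cat([encoding.unsqueeze(0), matrix[:-1]], dim=0)
```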
S23: and outputting the crying classification when the infant cryes according to the encoding feature vector.
In one embodiment, referring to fig. 13, the S23 includes:
s231: obtaining a crying category threshold;
specifically, the crying categories output each time are counted, a crying category threshold value continuously appearing in the same crying category is set, and the crying categories are output when the crying category threshold value is reached.
S232: comparing a first cry class corresponding to the current encoding characteristic vector with a second cry class corresponding to the previous encoding characteristic vector, and outputting a class comparison result;
s233: if the comparison results are the same, counting and adding 1; otherwise, counting is clear 0;
s234: outputting the cry class when the value of the count equals the cry class threshold.
Specifically, the previous crying category is compared with the current crying category; if two successive crying category outputs are the same category, the internal counter is incremented by 1; if two adjacent crying categories differ, the counter is reset; when the number of consecutive occurrences of the same crying category reaches the crying category threshold, that crying category is output as the current crying category.
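The counting logic of S231 to S234 might be sketched as a small stateful counter; the exact reset convention of the patent is simplified here.

```python
class CryCategoryCounter:
    """Output a crying category only after it has been produced k times in a
    row (sketch of S231-S234; k is the crying-category threshold)."""
    def __init__(self, k):
        self.k = k
        self.last = None
        self.count = 0

    def update(self, category):
        # Increment on a repeat of the previous category, otherwise restart.
        self.count = self.count + 1 if category == self.last else 1
        self.last = category
        return category if self.count >= self.k else None
```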
In one embodiment, the category of crying includes at least one of: hunger, pain, distraction, and discomfort.
In one embodiment, the preset neural network includes at least one sub-neural network for a scene, the scene including at least one of: night, day, indoor, outdoor, sunny, cloudy, rainy, the current season, and the like.
In one embodiment, the S22 includes:
s224: acquiring time information and environment information corresponding to crying of the baby;
specifically, when the infant cries, the time information refers to, for example: breakfast time, morning, lunch time, afternoon, dinner time, night, and the like, and the environment information includes at least one of the following: indoor, outdoor, sunny, rainy, and the like.
S225: determining the corresponding sub-neural network as a target neural network according to the time information and the environment information;
specifically, according to the time period and the environment in which the infant cries, the sub-neural network used to perform convolution calculation on the fusion features is determined and marked as the target neural network.
S226: and performing convolution calculation on the fusion characteristics by using the target neural network, and outputting encoding characteristic vectors corresponding to the crying states.
By adopting the method for intelligently identifying the crying category of the baby, the audio characteristics corresponding to the crying of the baby and the vibration frequency spectrum of the vocal cords are obtained; performing feature fusion on the audio features and the vibration frequency spectrum, and converting the fused fusion features into corresponding coding feature vectors through a preset neural network; thereby outputting the corresponding probability of each crying category to obtain the crying category; the accuracy of crying recognition is improved by acquiring the audio features generated by vocal cord vibration and combining the vibration features.
Example 3
In embodiment 1 and embodiment 2, the crying category of the baby is determined from the crying audio signal and the vibration parameters of vocal cord vibration. Because the vocal cords of a baby are at an early stage of development and not yet mature, and vocal cord vibration and crying have only a small range for expressing different needs, the samples that can be matched are limited, which ultimately causes misjudgments. Therefore, posture information corresponding to the baby's crying is introduced for further improvement on the basis of embodiment 1; referring to fig. 14, the method includes:
s30: acquiring the audio characteristics of the sound of the baby during crying, the motion characteristic value of the posture motion of the baby in the image and the vibration frequency spectrum of vocal cord vibration;
specifically, when the baby cry is detected, a video stream containing the cry and vibration parameters of the baby vocal cords vibration are obtained; extracting audio features and action features in the video stream; and a vibration spectrum corresponding to vocal cord vibration; the motion characteristics comprise limb motions and facial micro-expressions, and the information collected when the baby cries can also be respiratory rate, facial color information, facial temperature information and the like.
S31: comparing the action characteristic value with a standard characteristic value of a database, and outputting a standard characteristic value corresponding to the action characteristic value;
specifically, the motion characteristic values of the baby posture actions are compared with the standard characteristic values of the database, the motion characteristic values are converted into the standard characteristic values of the corresponding cry categories, the probabilities of the cry categories represented by the standard characteristic values are different, and the cry categories of the baby at least comprise one of the following categories: sleepiness, pain, hunger and fear, the standard characteristic value a representing sleepiness of 50%, pain of 55%, hunger of 60%, fear of 52%, the standard characteristic value B representing sleepiness of 55%, pain of 53%, hunger of 58%, fear of 75%, etc.
In one embodiment, referring to fig. 15, the S31 includes:
s311: acquiring a video stream of the baby during crying;
s312: extracting the motion characteristic value of each frame of image in the video stream;
specifically, a video stream is split into multiple frames of images; composing an action from a plurality of successive images; extracting the motion characteristic value of each action in each frame image; in an application embodiment, filtering each frame of image in a video stream by adopting a Kalman filtering method to eliminate background interference of the image, and then extracting a motion characteristic value in the image; by adopting the Kalman filtering method, the slow background transformation between images can be eliminated, and mainly the light and shadow change is eliminated; the detection efficiency and the accuracy of the detection result are improved.
It should be noted that the collection uses a 16 kHz sampling rate with 16-bit quantization precision, taking every 512 sampling points as a frame with an overlap of 256 sampling points between frames, i.e. a frame length of 32 ms and a frame shift of 16 ms, thereby obtaining each frame of image.
S313: and comparing each motion characteristic value with each standard characteristic value of the database, and outputting each standard characteristic value matched with the motion characteristic value of each frame image.
Specifically, the motion characteristic value of each action in each frame image is compared with a database, and a standard characteristic value matched with each gesture action in the database is output as the actual motion characteristic value of each gesture action; the standard characteristic values of the database are used for representing the probability of each crying category for weighting, so that the data stability of characteristic fusion can be ensured.
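The matching of S313 against the database of standard characteristic values might look like the following nearest-neighbour sketch; the database layout and the distance metric are assumptions for illustration.

```python
import numpy as np

def match_standard_value(motion_feature, database):
    """Return the standard characteristic value (per-category probability vector)
    whose reference motion feature is closest to the observed one
    (sketch of S313; nearest-neighbour matching is an assumption)."""
    best_key = min(database,
                   key=lambda k: np.linalg.norm(motion_feature - database[k]["reference"]))
    return database[best_key]["probabilities"]   # e.g. {"hunger": 0.60, "pain": 0.55, ...}
```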
In an embodiment, referring to fig. 16, the S313 includes:
s3131: collecting image sample sets corresponding to a plurality of posture actions of the baby;
s3132: extracting a standard characteristic value of each gesture action in the image sample set;
s3133: establishing the database corresponding to each standard characteristic value and each crying category;
specifically, motion characteristic value extraction is carried out on each frame image of each action to obtain the motion characteristic value of each frame image; and outputting the probability of each posture action corresponding to each motion characteristic value according to the motion characteristic value of each frame image, so as to fit a standard characteristic value corresponding to each action, and then establishing a database of the crying classification corresponding to the standard characteristic value.
S3134: comparing each motion characteristic value with the action behavior database, and outputting the matched standard characteristic values to obtain the motion features.
In one embodiment, referring to fig. 17, the S30 includes:
s301: acquiring an electric signal generated by vocal cord vibration when the baby cries;
specifically, when the baby cry is determined, collecting vibration parameters corresponding to vocal cord vibration and/or optical image signals corresponding to vocal cord vibration, and then obtaining electric signals of vocal cord vibration; the vibration parameters and the optical image signals are obtained in the manner described in example 1, and the details are not repeated here.
S302: segmenting the electric signal according to the time length of each frame image to obtain a plurality of continuous electric signal segments;
specifically, the electrical signal of vocal cord vibration is a continuous signal with respect to time; dividing the electrical signal into a plurality of segments for a length of time corresponding to each frame in the audio signal; the electrical signal is a non-stationary signal.
S303: and carrying out short-time Fourier transform on a plurality of continuous electric signal segments, and outputting the vibration frequency spectrum.
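A brief sketch of S302–S303 using scipy's short-time Fourier transform; aligning the segment length with the 32 ms / 16 ms audio framing is an assumption.

```python
import numpy as np
from scipy.signal import stft

def vibration_spectrum(electrical_signal, fs=16000, frame_len=512, frame_shift=256):
    """Short-time Fourier transform of the vocal-cord electric signal.

    The segment length mirrors the audio framing; the exact parameters are an
    assumption, the application only requires the non-stationary signal to be
    analysed segment by segment.
    """
    freqs, times, Zxx = stft(electrical_signal, fs=fs,
                             nperseg=frame_len, noverlap=frame_len - frame_shift)
    return freqs, times, np.abs(Zxx)   # magnitude spectrum per segment
```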
In one embodiment, referring to fig. 18, the S30 includes:
s304: acquiring audio signals of crying sounds of the baby;
s305: carrying out feature extraction on the audio signal by using a Mel filter to obtain the audio features;
wherein the audio features are Mel Frequency Cepstrum Coefficients (MFCC) features.
Specifically, a short-time Fourier transform is performed on each frame of the audio signal to convert it from a time-domain signal into a frequency-domain signal; each frame of the frequency-domain signal is then filtered by a Mel filter bank, and after a logarithmic transform and a discrete cosine transform, the Mel-frequency cepstrum coefficient (MFCC) features are extracted.
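A compact sketch of the MFCC extraction pipeline in S304–S305; librosa is used here for brevity, and the number of coefficients and the frame parameters are illustrative assumptions.

```python
import librosa

def extract_mfcc(wav_path, n_mfcc=13):
    """MFCC features per frame: STFT -> mel filter bank -> log -> DCT."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=256)
    return mfcc.T   # shape: (n_frames, n_mfcc)
```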
S32: performing feature fusion on the audio features and the vibration frequency spectrum based on the standard feature value to obtain fusion features corresponding to the baby cry;
specifically, feature fusion is carried out on the audio features and the vibration frequency spectrum based on the probability of each crying category represented by the standard characteristic value, and the fused features are used for analyzing the crying category of the baby.
In one embodiment, referring to fig. 19, the S32 includes:
s321: performing principal component analysis (MFCC) feature of each frame of audio, the action feature value of each frame of image and the vibration spectrum of each electric signal segment to perform dimensionality reduction, and outputting the MFCC feature of each frame of audio, the action feature value of each frame of image and the vibration spectrum of each electric signal segment after dimensionality reduction;
specifically, the principal component analysis effectively extracts the key components in the signals for dimensionality reduction and reduces the complexity of the data; it should be noted that the dimensionality reduction may be applied jointly to the MFCC features of the whole audio signal, the action features of the video stream and the corresponding vibration spectrum, or applied independently to each frame of audio, the vibration spectrum of the electric signal segment corresponding to that frame and the image corresponding to that frame.
S322: and performing feature fusion on the MFCC features corresponding to the frames of audio after the dimension reduction and the vibration frequency spectrum of the electric signal segment corresponding to the frames of audio based on the standard feature values corresponding to the motion feature values of the frames of images to obtain the fusion features.
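A minimal sketch of S321–S322, assuming scikit-learn PCA for the dimensionality reduction and a simple probability-scaled concatenation as the fusion rule; the exact fusion formula is not fixed by this application.

```python
import numpy as np
from sklearn.decomposition import PCA

def fuse_frame_features(mfcc, vib_spec, category_probs, n_components=8):
    """PCA dimensionality reduction followed by weighted concatenation.

    mfcc:           (n_frames, n_mfcc) audio features
    vib_spec:       (n_frames, n_bins) vibration spectrum per electric-signal segment
    category_probs: standard characteristic value of the matched posture action
    Scaling the concatenated vector by the strongest posture-derived probability
    is one plausible reading of "fusion based on the standard feature value".
    """
    mfcc_red = PCA(n_components=n_components).fit_transform(mfcc)
    vib_red = PCA(n_components=n_components).fit_transform(vib_spec)
    fused = np.concatenate([mfcc_red, vib_red], axis=1)   # (n_frames, 2*n_components)
    weight = category_probs.max()                          # strongest posture evidence
    return weight * fused
```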
In one embodiment, referring to fig. 20, the S32 includes:
s323: acquiring a frequency change threshold of the vibration frequency spectrum and the vibration frequency of the vibration frequency spectrum corresponding to each frame of audio information;
specifically, the vibration frequency of the vibration frequency spectrum corresponding to each frame of audio information is obtained, and a frequency change threshold for the vibration frequencies of adjacent frames is set.
S324: segmenting each vibration frequency by using the frequency change threshold value to obtain a plurality of continuous frequency segments;
specifically, the vibration frequencies corresponding to adjacent frames of audio are compared, and the change in vibration frequency is judged against the frequency change threshold: if the change between adjacent frames is greater than the frequency change threshold, the two frames belong to different frequency segments; if the change is less than or equal to the threshold, the two frames belong to the same frequency segment. The vibration frequency spectrum is thus divided into a plurality of continuous frequency segments, as sketched below.
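A short sketch of the segmentation rule just described; the frame-level vibration frequencies and the threshold value are inputs supplied by the caller.

```python
def segment_by_frequency_change(frame_freqs, threshold):
    """Group consecutive audio frames into frequency segments.

    Adjacent frames whose vibration-frequency change exceeds the threshold
    start a new segment; otherwise they stay in the same segment.
    Returns a list of (start_frame, end_frame) index pairs.
    """
    segments, start = [], 0
    for i in range(1, len(frame_freqs)):
        if abs(frame_freqs[i] - frame_freqs[i - 1]) > threshold:
            segments.append((start, i - 1))
            start = i
    segments.append((start, len(frame_freqs) - 1))
    return segments
```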
S325: and performing feature fusion on the vibration frequency spectrum corresponding to each frequency segment and the MFCC features of all the corresponding frame audios based on each standard feature value corresponding to each frequency segment to obtain the fusion features corresponding to each frequency segment.
Specifically, based on each standard characteristic value corresponding to each frequency segment, the vibration frequency spectrum of each frequency segment is fused with the MFCC features of all frames of audio corresponding to that segment, which improves the accuracy of detecting abnormal sound under the same crying requirement. For example, when the baby is emotionally agitated and cries for a long time until the voice becomes hoarse, all vibration frequencies of the vibration frequency spectrum in this process fall into the same frequency segment, and the sound information in that segment contains abnormal crying; multi-feature fusion with the motion features of the baby's face, limbs and the like in the corresponding time span weights the crying accordingly, ensures the reliability of the fused features and improves the detection accuracy. For another example, when the baby suddenly cries violently at night, accompanied by large body movements of short duration, the weighting of the motion features biases the fused features toward a nightmare scenario; after the fused features are input into the neural network for convolution calculation, the nightmare probability is the largest among the output crying categories, and the guardian can then take soothing actions.
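A minimal sketch of the per-frequency-segment fusion in S325; mean pooling within a segment and weighting by the largest posture-derived category probability are assumptions, since the application does not fix the pooling or weighting formula.

```python
import numpy as np

def fuse_per_segment(segments, mfcc, vib_spec, standard_probs):
    """Per-frequency-segment fusion sketch.

    segments:       list of (start_frame, end_frame) pairs
    mfcc:           (n_frames, n_mfcc) MFCC features
    vib_spec:       (n_frames, n_bins) vibration spectrum
    standard_probs: one posture-derived probability vector per segment
    """
    fused = []
    for (s, e), probs in zip(segments, standard_probs):
        seg_mfcc = mfcc[s:e + 1].mean(axis=0)   # pool MFCCs of all frames in segment
        seg_vib = vib_spec[s:e + 1].mean(axis=0)  # pool vibration spectrum likewise
        fused.append(np.max(probs) * np.concatenate([seg_mfcc, seg_vib]))
    return np.stack(fused)
```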
S33: inputting the fusion characteristics into a preset neural network, and outputting a coding characteristic vector corresponding to the baby cry;
s34: outputting the crying category when the baby cries according to the encoding feature vectors;
and the standard characteristic value is a probability value of each crying category represented by the corresponding gesture motion.
In one embodiment, the S34 includes:
s341: acquiring the characteristic matrix capacity of the neural network;
specifically, the capacity of the feature matrix is the number of features required for judging the crying category represented by the crying of the baby at a certain moment; that is to say, the neural network outputs corresponding cry categories according to all the coded feature vectors in the feature matrix; when the coded feature vector in the feature matrix is updated, the neural network outputs a new cry class.
S342: performing convolution calculation on the fusion characteristics and a convolution kernel, and outputting coding characteristic vectors corresponding to the electric signal segments;
s343: and obtaining each coding feature vector in the current feature matrix according to the feature matrix capacity and each coding feature vector.
Specifically, each fusion feature is sequentially subjected to convolution calculation with a convolution kernel, and each encoding feature vector is output. Before a fusion feature enters the feature matrix, the last row of the feature matrix is deleted, the remaining rows are shifted down by one row, and the latest fusion feature enters the first row of the feature matrix. The two-dimensional fusion feature is transformed into a one-dimensional vector through the convolution calculation, and this one-dimensional vector is turned into an encoding feature vector by a gated recurrent unit (GRU) network. Meanwhile, the last row of encoding feature vectors is deleted, the remaining rows are shifted down, and the newly obtained encoding feature vector is placed in the first row, thereby completing the update of the encoding feature vectors. A weighted average of all updated encoding feature vectors gives the final encoding feature vector, from which the probability of each crying category is output through an activation function; the probabilities of the crying categories are compared and the crying category is output. A minimal sketch of this update follows.
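The following PyTorch sketch illustrates the rolling feature-matrix update just described; the matrix capacity, layer sizes and the plain mean used in place of a learned weighted average are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CryCategoryHead(nn.Module):
    """Sketch of: conv -> 1-D vector -> GRU coding vector -> FIFO matrix update
    -> average -> activation -> per-category probabilities (inference-style)."""

    def __init__(self, in_channels=1, feat_dim=64, n_categories=5, capacity=10):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, feat_dim, kernel_size=3, padding=1)
        self.gru = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.fc = nn.Linear(feat_dim, n_categories)
        self.register_buffer("matrix", torch.zeros(capacity, feat_dim))

    def forward(self, fused_feature):
        # fused_feature: (1, in_channels, length) fusion feature of one segment
        x = self.conv(fused_feature)                      # -> (1, feat_dim, length)
        x = x.mean(dim=2, keepdim=True).transpose(1, 2)   # -> (1, 1, feat_dim) 1-D vector
        _, h = self.gru(x)                                # coding vector (1, 1, feat_dim)
        code = h.squeeze(0)                               # (1, feat_dim)
        # FIFO update: drop the oldest row, shift, put the newest code first
        self.matrix = torch.cat([code, self.matrix[:-1]], dim=0)
        final_code = self.matrix.mean(dim=0)              # average of all coding vectors
        return torch.softmax(self.fc(final_code), dim=-1) # probability per cry category
```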
By adopting the method for identifying the crying category of a baby through multi-feature fusion, the audio features, the vibration frequency spectrum of vocal cord vibration and the action characteristic values of posture actions during crying are obtained; the action characteristic values are converted into the standard characteristic values of the posture actions corresponding to the crying categories, and these standard characteristic values represent the probability of each crying category. The audio features and the vibration frequency spectrum of vocal cord vibration are then fused based on the standard characteristic values, realizing a weighted processing of the crying feature fusion; the fused features are input into a neural network to analyze the crying category of the baby, and the corresponding crying category is output. By weighting with the posture features of the baby during crying, the limitations of cry discrimination based on the sound signal and the vocal cord vibration spectrum alone can be compensated, erroneous crying detection judgments when the baby's voice is abnormal are reduced, and the accuracy of cry detection is improved.
Second embodiment
Example 4
Embodiment 4 of the present invention provides a device for identifying the crying category of a baby through vocal cord vibration, based on the method for identifying the crying category of a baby in embodiment 1; referring to fig. 22, the device includes:
the signal acquisition module: the device is used for acquiring an electric signal corresponding to the vocal cord vibration of the baby during the crying of the baby;
the signal processing module: used for outputting the vibration frequency spectrum corresponding to vocal cord vibration when the baby cries according to the electric signal;
crying category module: and the system is used for comparing the vibration frequency spectrum with each standard vibration frequency spectrum in a database and outputting the crying category corresponding to the vibration frequency spectrum.
By adopting the device for identifying the crying category of a baby through vocal cord vibration, the electric signal of vocal cord vibration when the baby cries is obtained and converted into the corresponding vibration frequency spectrum, which is compared with the standard vibration frequency spectra of the database to obtain the crying category corresponding to the vibration frequency spectrum. Judging the crying category of the baby from the vibration frequency spectrum of vocal cord vibration can accurately detect vocalization differences caused by individual differences between babies, or abnormal crying caused by abnormal conditions such as a hoarse voice, and improves the accuracy of identifying the baby's crying category.
It should be noted that the apparatus further includes the remaining technical solutions described in embodiment 1, and details are not described herein.
Example 5
In embodiment 4, the crying category of the baby is determined by the vibration parameters corresponding to vocal cord vibration; since the vocal cords of a baby are at an early stage of development, the differences in vocal cord vibration are small, the accuracy of the acquired vibration parameters is low, and the accuracy of crying category detection is ultimately affected. Therefore, the audio signal of the crying is introduced on the basis of embodiment 4 for further improvement; please refer to fig. 23, which includes:
a parameter acquisition module: the device is used for acquiring the audio characteristics of the sound of the baby during crying and the vibration frequency spectrum corresponding to the vocal cord vibration of the baby;
a feature fusion module: the system is used for performing feature fusion on the audio features and the vibration frequency spectrum and outputting fused features;
a neural network module: used for inputting the fusion feature into a preset neural network and outputting the encoding feature vector corresponding to the crying state;
a category output module: and the crying classification is used for outputting the crying classification of the infant when the infant cries according to the encoding feature vector.
The device for identifying the crying category of the baby by combining vocal cord vibration and posture is adopted to obtain the audio features corresponding to the crying of the baby and the vibration frequency spectrum of the vocal cords; feature fusion is performed on the audio features and the vibration frequency spectrum, and the fused features are converted into corresponding encoding feature vectors through a preset neural network, so that the probability corresponding to each crying category is output and the crying category is obtained. The accuracy of cry recognition is improved by acquiring the audio features produced by vocal cord vibration and combining them with the vibration features.
It should be noted that the apparatus further includes the remaining technical solutions described in embodiment 4, and details are not described herein.
Example 6
In embodiments 4 and 5, the crying category of the baby is determined from the vibration parameters corresponding to vocal cord vibration and the audio signal of the crying; since the vocal cords of a baby are at an early stage of development and not yet well developed, the range that vocal cord vibration and crying can express is small, the samples that can be matched are limited, and misjudgment may ultimately result. Therefore, posture information corresponding to the baby's crying is introduced on the basis of embodiments 4 and 5 for further improvement; please refer to fig. 24, which includes:
a characteristic acquisition module: the method comprises the steps of acquiring the audio characteristics of the sound of the baby during crying, the motion characteristic values of the posture motion of the baby in an image and the vibration frequency spectrum of vocal cord vibration;
the data conversion module: the standard characteristic value is used for comparing the action characteristic value with a standard characteristic value of a database and outputting the standard characteristic value corresponding to the action characteristic value;
a fusion feature output module: the standard characteristic value is used for carrying out characteristic fusion on the audio characteristic and the vibration frequency spectrum based on the standard characteristic value to obtain fusion characteristics corresponding to the baby cry;
the encoding characteristic output module: the fusion feature is used for inputting the fusion feature into a preset neural network and outputting a coding feature vector corresponding to the baby cry;
crying category output module: the encoding feature vector is used for outputting crying categories when the infant cries;
and the standard characteristic value is a probability value of each crying category represented by the corresponding gesture motion.
By adopting the device for identifying the crying category of a baby through multi-feature fusion, the audio features, vocal cord vibration features and posture features when the baby cries are fused, the fused features are input into a neural network to analyze the crying category of the baby, and the corresponding crying category is output. By combining the posture features of the baby during crying, the limitations of cry discrimination based on the sound signal and the vocal cord vibration spectrum can be compensated; the posture features reinforce the expression of the baby's needs, erroneous judgments are reduced, and the accuracy of cry detection is improved.
It should be noted that the apparatus further includes the remaining technical solutions described in embodiment 4 and/or embodiment 5, and details are not described herein.
The third embodiment is as follows:
the present invention provides an electronic device and storage medium, as shown in FIG. 24, comprising at least one processor, at least one memory, and computer program instructions stored in the memory.
Specifically, the processor may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention; the electronic device includes at least one of the following: a camera, a mobile device with a camera, and a wearable device with a camera.
The memory may include mass storage for data or instructions. By way of example, and not limitation, memory may include a Hard Disk Drive (HDD), floppy Disk Drive, flash memory, optical Disk, magneto-optical Disk, magnetic tape, or Universal Serial Bus (USB) Drive or a combination of two or more of these. The memory may include removable or non-removable (or fixed) media, where appropriate. The memory may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory is non-volatile solid-state memory. In a particular embodiment, the memory includes Read Only Memory (ROM). Where appropriate, the ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), electrically rewritable ROM (EAROM), or flash memory or a combination of two or more of these.
The processor reads and executes the computer program instructions stored in the memory to implement any one of the method for identifying the crying category of the baby, the method for identifying the crying category of the baby by combining vocal cord vibration and posture, and the method for identifying the crying category of the baby by multi-feature fusion in the first embodiment.
In one example, the electronic device may also include a communication interface and a bus. The processor, the memory and the communication interface are connected through a bus and complete mutual communication.
The communication interface is mainly used for realizing communication among modules, devices, units and/or equipment in the embodiment of the invention.
A bus comprises hardware, software, or both that couple components of an electronic device to one another. By way of example, and not limitation, a bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or another suitable bus or a combination of two or more of these. A bus may include one or more buses, where appropriate. Although specific buses have been described and shown in the embodiments of the invention, any suitable buses or interconnects are contemplated by the invention.
In summary, embodiments of the present invention provide methods, devices, equipment and storage media for identifying the crying category of a baby through vocal cord vibration and through multi-feature fusion.
It is to be understood that the invention is not limited to the specific arrangements and instrumentality described above and shown in the drawings. A detailed description of known methods is omitted herein for the sake of brevity. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications and additions or change the order between the steps after comprehending the spirit of the present invention.
The functional blocks shown in the above-described structural block diagrams may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, Erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the internet, intranet, etc.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for identifying a crying category of an infant through multi-feature fusion, the method comprising:
s30: acquiring the audio characteristics of the sound of the baby during crying, the motion characteristic value of the posture motion of the baby in the image and the vibration frequency spectrum of vocal cord vibration;
s31: comparing the action characteristic value with a standard characteristic value of a database, and outputting a standard characteristic value corresponding to the action characteristic value;
s32: performing feature fusion on the audio features and the vibration frequency spectrum based on the standard feature value to obtain fusion features corresponding to the baby cry;
s33: inputting the fusion characteristics into a preset neural network, and outputting a coding characteristic vector corresponding to the baby cry;
s34: outputting crying categories when the infant cries according to the encoding feature vectors;
and the standard characteristic value is a probability value of each crying category represented by the corresponding gesture motion.
2. The method for identifying baby crying category through multi-feature fusion as claimed in claim 1, wherein the S31 comprises:
s311: acquiring a video stream of the baby during crying;
s312: extracting the motion characteristic value of each frame of image in the video stream;
s313: and comparing each motion characteristic value with each standard characteristic value of the database, and outputting each standard characteristic value matched with the motion characteristic value of each frame image.
3. The method for multi-feature fusion identification of baby crying category according to claim 2, wherein said S313 comprises:
s3131: collecting image sample sets corresponding to a plurality of posture actions of the baby;
s3132: extracting a standard characteristic value of each gesture action in the image sample set;
s3133: establishing the database corresponding to each standard characteristic value and each crying category;
s3134: and comparing each motion characteristic value of each frame of image when the baby cries with each standard characteristic value of the database, and outputting each standard characteristic value matched with each motion characteristic value.
4. The method for identifying baby crying category through multi-feature fusion as claimed in claim 3, wherein the S30 comprises:
s301: acquiring an electric signal generated by vocal cord vibration when the baby cries;
s302: segmenting the electric signal according to the time length of each frame image to obtain a plurality of continuous electric signal segments;
s303: and carrying out short-time Fourier transform on a plurality of continuous electric signal segments, and outputting the vibration frequency spectrum.
5. The method for identifying baby crying category through multi-feature fusion as claimed in claim 4, wherein the S30 comprises:
s304: acquiring audio signals of crying sounds of the baby;
s305: carrying out feature extraction on the audio signal by using a Mel filter to obtain the audio features;
wherein the audio features are Mel Frequency Cepstrum Coefficients (MFCC) features.
6. The method for identifying baby crying category through multi-feature fusion as claimed in claim 5, wherein the S32 comprises:
s321: performing principal component analysis (MFCC) feature of each frame of audio, the action feature value of each frame of image and the vibration spectrum of each electric signal segment to perform dimensionality reduction, and outputting the MFCC feature of each frame of audio, the action feature value of each frame of image and the vibration spectrum of each electric signal segment after dimensionality reduction;
s322: and performing feature fusion on the MFCC features corresponding to the frames of audio after the dimension reduction and the vibration frequency spectrum of the electric signal segment corresponding to the frames of audio based on the standard feature values corresponding to the motion feature values of the frames of images to obtain the fusion features.
7. The method for identifying baby crying category through multi-feature fusion as claimed in claim 6, wherein said S32 comprises:
s323: acquiring a frequency change threshold of the vibration frequency spectrum and the vibration frequency of the vibration frequency spectrum corresponding to each frame of audio information;
s324: segmenting each vibration frequency by using the frequency change threshold value to obtain a plurality of continuous frequency segments;
s325: and performing feature fusion on the vibration frequency spectrum corresponding to each frequency segment and the MFCC features of all the corresponding frame audios based on each standard feature value corresponding to each frequency segment to obtain the fusion features corresponding to each frequency segment.
8. An apparatus for identifying baby cry category through multi-feature fusion, comprising:
a characteristic acquisition module: the method comprises the steps of acquiring the audio characteristics of the sound of the baby during crying, the motion characteristic values of the posture motion of the baby in an image and the vibration frequency spectrum of vocal cord vibration;
the data conversion module: the standard characteristic value is used for comparing the action characteristic value with a standard characteristic value of a database and outputting the standard characteristic value corresponding to the action characteristic value;
a fusion feature output module: the standard characteristic value is used for carrying out characteristic fusion on the audio characteristic and the vibration frequency spectrum based on the standard characteristic value to obtain fusion characteristics corresponding to the baby cry;
the encoding characteristic output module: the fusion feature is used for inputting the fusion feature into a preset neural network and outputting a coding feature vector corresponding to the baby cry;
crying category output module: the encoding feature vector is used for outputting crying categories when the infant cries;
and the standard characteristic value is a probability value of each crying category represented by the corresponding gesture motion.
9. An electronic device, comprising: at least one processor, at least one memory, and computer program instructions stored in the memory that, when executed by the processor, implement the method of any of claims 1-7.
10. A medium having stored thereon computer program instructions, which, when executed by a processor, implement the method of any one of claims 1-7.
CN202110218120.9A 2021-02-26 2021-02-26 Method, device and equipment for identifying crying type of baby by multi-feature fusion Active CN113035241B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110218120.9A CN113035241B (en) 2021-02-26 2021-02-26 Method, device and equipment for identifying crying type of baby by multi-feature fusion


Publications (2)

Publication Number Publication Date
CN113035241A true CN113035241A (en) 2021-06-25
CN113035241B CN113035241B (en) 2023-08-08

Family

ID=76461764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110218120.9A Active CN113035241B (en) 2021-02-26 2021-02-26 Method, device and equipment for identifying crying type of baby by multi-feature fusion

Country Status (1)

Country Link
CN (1) CN113035241B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150265206A1 (en) * 2012-08-29 2015-09-24 Brown University Accurate analysis tool and method for the quantitative acoustic assessment of infant cry
CN107767874A (en) * 2017-09-04 2018-03-06 南方医科大学南方医院 A kind of baby crying sound identification reminding method and system
CN111563422A (en) * 2020-04-17 2020-08-21 五邑大学 Service evaluation obtaining method and device based on bimodal emotion recognition network

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113436650A (en) * 2021-08-25 2021-09-24 深圳市北科瑞声科技股份有限公司 Baby cry identification method and device, electronic equipment and storage medium
CN113436650B (en) * 2021-08-25 2021-11-16 深圳市北科瑞声科技股份有限公司 Baby cry identification method and device, electronic equipment and storage medium
CN114582355A (en) * 2021-11-26 2022-06-03 华南师范大学 Audio and video fusion-based infant crying detection method and device
CN114582355B (en) * 2021-11-26 2024-07-12 华南师范大学 Infant crying detection method and device based on audio and video fusion
CN114863950A (en) * 2022-07-07 2022-08-05 深圳神目信息技术有限公司 Baby crying detection and network establishment method and system based on anomaly detection
CN116386671A (en) * 2023-03-16 2023-07-04 宁波星巡智能科技有限公司 Infant crying type identification method, device, equipment and storage medium
CN116386671B (en) * 2023-03-16 2024-05-07 宁波星巡智能科技有限公司 Infant crying type identification method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113035241B (en) 2023-08-08

Similar Documents

Publication Publication Date Title
CN113035241B (en) Method, device and equipment for identifying crying type of baby by multi-feature fusion
CN112967733B (en) Method and device for intelligently identifying crying type of baby
US9177203B2 (en) Target detection device and target detection method
CN104795064B (en) The recognition methods of sound event under low signal-to-noise ratio sound field scape
CN111103976B (en) Gesture recognition method and device and electronic equipment
CN107688790B (en) Human behavior recognition method and device, storage medium and electronic equipment
CN110772700B (en) Automatic sleep-aiding music pushing method and device, computer equipment and storage medium
CN107767874B (en) Infant crying recognition prompting method and system
US20210104230A1 (en) Method of recognising a sound event
Ting Yuan et al. Frog sound identification system for frog species recognition
CN113205820B (en) Method for generating voice coder for voice event detection
US20210125628A1 (en) Method and device for audio recognition
AU2013204156A1 (en) Classification apparatus and program
CN111262637B (en) Human body behavior identification method based on Wi-Fi channel state information CSI
CN113012716B (en) Infant crying type identification method, device and equipment
CN117711436B (en) Far-field sound classification method and device based on multi-sensor fusion
CN112466284B (en) Mask voice identification method
Eyobu et al. A real-time sleeping position recognition system using IMU sensor motion data
CN117577133A (en) Crying detection method and system based on deep learning
CN117275525A (en) Cough sound detection and extraction method
CN117270081A (en) Meteorological prediction generation method and device, equipment and storage medium
CN113436650B (en) Baby cry identification method and device, electronic equipment and storage medium
CN114998731A (en) Intelligent terminal navigation scene perception identification method
CN114764580A (en) Real-time human body gesture recognition method based on no-wearing equipment
CN114999501A (en) Pet voice recognition method and system based on neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant