CN111883174A - Voice recognition method and device, storage medium and electronic device - Google Patents

Voice recognition method and device, storage medium and electronic device

Info

Publication number
CN111883174A
Authority
CN
China
Prior art keywords
model
sound information
demand
information
target object
Prior art date
Legal status
Pending
Application number
CN201910562749.8A
Other languages
Chinese (zh)
Inventor
屈奇勋
胡雯
张磊
石瑗璐
李宛庭
沈凌浩
郑汉城
Current Assignee
Shenzhen Icarbonx Intelligent Digital Life Health Management Co ltd
Shenzhen Digital Life Institute
Original Assignee
Shenzhen Icarbonx Intelligent Digital Life Health Management Co ltd
Shenzhen Digital Life Institute
Priority date
Filing date
Publication date
Application filed by Shenzhen Icarbonx Intelligent Digital Life Health Management Co ltd, Shenzhen Digital Life Institute filed Critical Shenzhen Icarbonx Intelligent Digital Life Health Management Co ltd
Priority to CN201910562749.8A priority Critical patent/CN111883174A/en
Priority to PCT/CN2020/087072 priority patent/WO2020259057A1/en
Publication of CN111883174A publication Critical patent/CN111883174A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/26 Recognition of special voice characteristics, e.g. for use in lie detectors; recognition of animal voices
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, characterised by the type of extracted parameters
    • G10L 25/18 Extracted parameters being spectral information of each sub-band
    • G10L 25/24 Extracted parameters being the cepstrum
    • G10L 25/27 Characterised by the analysis technique
    • G10L 25/48 Specially adapted for particular use
    • G10L 25/51 Specially adapted for particular use for comparison or discrimination
    • G10L 25/66 Specially adapted for comparison or discrimination for extracting parameters related to health condition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a sound recognition method and apparatus, a storage medium and an electronic apparatus. The method comprises: collecting sound information emitted by a target object; determining whether the collected sound information emitted by the target object is crying; if so, inputting the sound information into a pre-trained sound model, wherein the sound model comprises a first-level model and a second-level model, the first-level model is used to identify the demand type that the sound information represents for the target object, and the second-level model is used to identify the demand state of the sound information within that demand type; and identifying, through the first-level model and the second-level model, the specific demand of the target object corresponding to the sound information. The invention solves the problem in the prior art that infant crying can only be interpreted from personal experience, which easily leads to recognition errors.

Description

Voice recognition method and device, storage medium and electronic device
Technical Field
The invention relates to the field of computers, in particular to a voice recognition method and device, a storage medium and an electronic device.
Background
Crying is one of the most important means of expression for an infant, and correctly identifying a cry in order to understand the infant's needs is essential to infant care. The sense of security a newborn develops during its first few months has a profound impact on its later life and is very likely to accompany and affect it for a lifetime. Therefore, correctly identifying an infant's cry and meeting the corresponding need helps the infant grow up healthily.
Crying is relatively complex, and the information it conveys is ambiguous: it may indicate hunger, tiredness, loneliness and so on. Distinguishing what an infant's cry means in a timely and effective way is not easy even for an experienced caregiver, let alone for new parents. In the related art, therefore, infant cries are identified purely from personal experience; such experience varies from person to person, and subjective judgment easily leads to recognition errors.
In view of the above problems in the related art, no effective solution exists at present.
Disclosure of Invention
The embodiments of the invention provide a sound recognition method and apparatus, a storage medium and an electronic apparatus, so as at least to solve the problem in the related art that infant crying can only be interpreted from personal experience, which easily leads to recognition errors.
According to an embodiment of the present invention, a sound recognition method is provided, comprising: collecting sound information emitted by a target object; determining whether the collected sound information emitted by the target object is crying; if so, inputting the sound information into a pre-trained sound model, wherein the sound model is obtained by training an initial sound model on a training set composed of multiple pieces of crying information and comprises a first-level model and a second-level model, the first-level model being used to identify the demand type that the sound information represents for the target object and the second-level model being used to identify the demand state of the sound information within that demand type; and identifying, through the first-level model and the second-level model, the specific demand of the target object corresponding to the sound information.
According to another embodiment of the present invention, a sound recognition apparatus is provided, comprising: an acquisition module for collecting sound information emitted by a target object; a judgment module for determining whether the collected sound information emitted by the target object is crying; an input module for inputting the sound information into a pre-trained sound model when the determination is positive, wherein the sound model is obtained by training an initial sound model on a training set composed of multiple pieces of crying information and comprises a first-level model and a second-level model, the first-level model being used to identify the demand type that the sound information represents for the target object and the second-level model being used to identify the demand state of the sound information within that demand type; and an identification module for identifying, through the first-level model and the second-level model, the specific demand of the target object corresponding to the sound information.
According to a further embodiment of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
According to the invention, when the collected sound information emitted by the target object is determined to be crying, the demand type of the sound information and the demand state under that type can be further identified by the first-level model and the second-level model of the sound model. The current demand state of the target object is thus identified from the cry by the sound model, rather than inferred from personal experience. This solves the problem in the related art that infant crying can only be interpreted from personal experience, which easily leads to recognition errors, and improves the accuracy of identifying the demand state expressed by a cry.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a block diagram of a hardware configuration of a terminal of a voice recognition method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for recognizing sounds according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a hierarchical UBM-GMM model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a training process for a UBM-GMM model according to an embodiment of the present invention;
fig. 5 is a block diagram of a structure of a voice recognition apparatus according to an embodiment of the present invention;
FIG. 6 is a block diagram of an alternative configuration of a voice recognition apparatus according to an embodiment of the present invention;
fig. 7 is a block diagram of an alternative configuration of a voice recognition apparatus according to an embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Example 1
The method provided by the first embodiment of the present application may be executed in a terminal, a computer terminal, or a similar computing device. Taking the example of the operation on the terminal, fig. 1 is a hardware structure block diagram of the terminal of the voice recognition method according to the embodiment of the present invention. As shown in fig. 1, the terminal 10 may include one or more (only one shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data, and optionally may also include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and is not intended to limit the structure of the terminal. For example, the terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program and a module of application software, such as a computer program corresponding to the voice recognition method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network or a wired network provided by a communication provider of the terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
It should be noted that whether the transmission device 106 is used by the method steps of the present application depends on the specific scheme: if the scheme involves interaction, the transmission device 106 is needed; if all method steps can be executed inside the terminal 10, the transmission device 106 is not required.
In this embodiment, a method for recognizing a voice running on the terminal is provided, and fig. 2 is a flowchart of a method for recognizing a voice according to an embodiment of the present invention, as shown in fig. 2, the flowchart includes the following steps:
step S202, collecting sound information emitted by a target object;
step S204, judging whether the collected sound information emitted by the target object is crying information or not;
step S206, if the result of the determination is yes, inputting the voice information into a pre-trained voice model, wherein the voice model is obtained by training an initial voice model according to a training set composed of a plurality of pieces of crying information, and the pre-trained voice model includes: a first level model and a second level model; the first-level model is used for identifying the demand type of the sound information for representing the demand of the target object, and the second-level model is used for identifying the demand state of the sound information in the demand type;
and step S208, identifying specific requirements corresponding to the sound information and used for representing the target object through the first-level model and the second-level model.
It should be noted that, in the present invention, the pre-trained sound model is a multi-level model, which may be a two-level model (a first-level model and a second-level model), or a three-level, four-level or higher-level model. Correspondingly, the specific demand of the target object may be obtained directly by running the first-level model and the second-level model in sequence, or, for a three-level model, by running a third-level model on the result of the first two levels, or, for a four-level model, by running a fourth-level model on the result of the third level, and so on.
In an embodiment of the present invention, when the sound model pre-trained in step S206 contains only a first-level model and a second-level model, the demand state identified by the second-level model is the specific demand of the target object. In other embodiments, the pre-trained sound model in step S206 may be a three-level, four-level or other multi-level model. When it is composed of three models of different levels, the first-level model identifies a first demand type representing the demand of the target object, the second-level model identifies a second demand-state type of the sound information within the first demand type, and the third-level model identifies the specific demand state of the sound information within the second demand-state type; that specific demand state is the specific demand of the target object.
For example, when the pre-trained sound model is composed of three models of different levels, the first-level model identifies a first demand type representing the demand of the target object, the first demand type being either "physiological" or "non-physiological". The second demand types of the second-level model corresponding to "physiological" are: physiological response, physiological need and emotional need. The demand states of the third-level model corresponding to "physiological response" are: hiccups, abdominal pain and other discomfort; those corresponding to "physiological need" are: hungry, cold or hot, and sleepy; those corresponding to "emotional need" are: fear and loneliness. The second demand types of the second-level model corresponding to "non-physiological" are: pain, breathing difficulty, and weakness. The demand states of the third-level model corresponding to "pain" are: abdominal pain, headache, etc.; those corresponding to "breathing difficulty" are: nasal congestion, etc.; those corresponding to "weakness" are: asthenia. These mappings are summarized in the sketch below.
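For readability, the three-level example above can be written as a nested data structure (first demand type, then second demand type, then demand states). The Python sketch below is purely illustrative: the class names are taken from the example above, while the dictionary form itself is an assumption of this illustration, not part of the claimed method.

```python
# Illustrative only: the three-level demand taxonomy from the example above,
# written as a nested dictionary (first demand type -> second demand type
# -> demand states).
DEMAND_TAXONOMY = {
    "physiological": {
        "physiological response": ["hiccups", "abdominal pain", "other discomfort"],
        "physiological need": ["hungry", "cold or hot", "sleepy"],
        "emotional need": ["fear", "loneliness"],
    },
    "non-physiological": {
        "pain": ["abdominal pain", "headache"],
        "breathing difficulty": ["nasal congestion"],
        "weakness": ["asthenia"],
    },
}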
Through steps S202 to S208, when the collected sound information emitted by the target object is determined to be crying, the demand type of the sound information and the demand state under that type can be further identified by the first-level model and the second-level model of the sound model. The current demand state of the target object is thus identified from the cry through the sound model, rather than inferred from personal experience. This solves the problem in the related art that infant crying can only be interpreted from personal experience, which easily leads to recognition errors, and improves the accuracy of identifying the demand state expressed by a cry.
It should be noted that the target object in the present application is preferably an infant, but may also be a young child or an animal. The present application does not limit the specific object, which may be set according to the actual situation.
In an optional implementation manner of this embodiment, the manner of determining whether the collected sound information emitted by the target object is crying information in step S204 of this application may be implemented by the following manner:
step S204-11, transcoding the collected sound information into a specified format;
In a preferred mode of the application, the specified format is the wav format and the audio sampling rate is 8000 Hz. Of course, in other application scenarios the format may also be 3gp, aac, amr, caf, flac, mp3, ogg, aiff, etc., and accordingly the sampling frequency (in Hz) may be 8000, 11025, 12000, 16000, 22050, 24000, 32000, 40000, 44100, 47250, 48000, etc.; this is not limited here.
It should be noted that the input audio (sound information) needs to be unified in format (transcoded) and in sampling frequency so that actual use is more convenient. If the input audio were not transcoded into a uniform format, every format would have to be read separately, which is cumbersome; and if the sampling frequency were not uniform, audio of the same length would contain different amounts of data, which would affect subsequent feature extraction and model training. The audio is therefore pre-processed first. In current practice, the input audio is converted into the wav format (or any other format from which the audio data can be read) and the sampling frequency is unified to 8000 Hz, although other sampling frequencies are also possible. In addition, in the present application the tool used to transcode the sound information is preferably FFmpeg.
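As a concrete illustration of this preprocessing step, the sketch below calls FFmpeg from Python to convert an input file to mono 8000 Hz wav; the helper function and file names are assumptions made for illustration, not the patented implementation.

```python
# A minimal sketch of the transcoding step, assuming the ffmpeg CLI is installed.
# File names are placeholders.
import subprocess

def transcode_to_wav_8k(src_path: str, dst_path: str) -> None:
    """Convert an input audio file to mono, 8000 Hz WAV using FFmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src_path, "-ar", "8000", "-ac", "1", dst_path],
        check=True,
    )

transcode_to_wav_8k("cry_recording.3gp", "cry_recording.wav")
```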
Step S204-12, segmenting the transcoded sound information into audio segments and extracting spectral features from each segment; wherein adjacent audio segments partially overlap;
it should be noted that, in the actual use process, because the lengths of the audios uploaded by the users are not uniform, it is preferable to convert the audios with different lengths into the audios with fixed lengths. If the input audio with indefinite length is directly converted into the audio with definite length by methods such as interpolation and the like, a lot of information of the audio can be lost; by using the segments in the manner of step S204-12, and overlapping the segments, the complete audio information and the association between the segments are preserved. In actual use, the input audio is segmented, for example, the segment length is 3 seconds, and two adjacent segments of audio overlap for 1 second. Of course, the segment length may be 4 seconds, and two adjacent audio segments overlap for 1.5 seconds, or the segment length may be 5 seconds, two adjacent audio segments overlap for 2 seconds, and so on, and the corresponding setting may be performed according to the actual situation.
Step S204-13, detecting the spectral features of each audio segment with the classification model to determine whether the sound information is crying.
In the preferred embodiment of the present application, the features used are preferably the mel-frequency cepstral coefficients and their first-order gradient, which are frequency-domain features of the audio. In order to learn more features and judge the sound information more reliably, in a more preferred embodiment the features used are the mel-frequency cepstral coefficients together with their first-order and second-order gradients.
The calculation of the mel-frequency cepstral coefficients is as follows: 1) window the input audio (e.g. with a window length of 50 milliseconds), with overlap between adjacent windows (e.g. an overlap of 20 milliseconds); 2) apply a Fourier transform to the audio signal of each window to obtain its spectrum; 3) for the spectrum of each window, apply a bank of mel filters (e.g. 20 filters) to obtain the mel scales (here 20); 4) take the logarithm of each mel scale to obtain the log energy; 5) apply an inverse discrete Fourier transform (or inverse discrete cosine transform) to the log energies of the mel scales to obtain the cepstrum; 6) the amplitudes of the resulting cepstrum (20 of them, the same number as the mel filters used) are the mel-frequency cepstral coefficients. The first-order and second-order gradients of the mel-frequency cepstral coefficients are then calculated.
The parameter ranges for extracting the mel-frequency cepstral coefficients are: the audio window length preferably ranges from 30 to 50 milliseconds; the overlap between adjacent windows preferably ranges from 10 to 20 milliseconds; and the number of mel filters used is preferably 20 to 40.
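A feature-extraction sketch consistent with the steps and parameter ranges above is given below using librosa. The specific window, hop and filter values are picked from the stated ranges and are not necessarily the values used in the patent's experiments.

```python
# A sketch of MFCC + first/second-order gradient extraction with librosa.
# 50 ms windows with 20 ms overlap (30 ms hop); 20 mel filters; 8000 Hz audio.
import numpy as np
import librosa

def mfcc_with_deltas(segment: np.ndarray, sr: int = 8000, n_mfcc: int = 20) -> np.ndarray:
    n_fft = int(0.050 * sr)                    # 50 ms window
    hop_length = int(0.030 * sr)               # 30 ms hop (20 ms overlap)
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length, n_mels=20)
    d1 = librosa.feature.delta(mfcc, order=1)  # first-order gradient
    d2 = librosa.feature.delta(mfcc, order=2)  # second-order gradient
    return np.vstack([mfcc, d1, d2])           # shape: (3 * n_mfcc, n_frames)
```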
As for the classification model in step S204-13 above, the classification model in the present application may be a gradient boosting tree, a support vector machine, a multi-layer perceptron, a statistical probability model and/or a deep learning model. In a preferred embodiment of the present application, the classification model consists of a gradient boosting tree, a support vector machine and a multi-layer perceptron: the audio features are input into the three classifiers, each classifier produces its own classification result, the results are counted, and the result receiving the most votes is taken as the detection result, i.e. whether or not the sound is the crying of the target object.
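The majority-vote scheme of the three classifiers can be sketched as below using scikit-learn and xgboost; the hyper-parameters and the fixed-length feature vectors are assumptions for illustration rather than the tuned values of Table 1.

```python
# A sketch of cry detection by majority vote over three classifiers.
# X_train: fixed-length feature vectors (e.g. flattened or averaged MFCC features),
# y_train: 1 for cry, 0 for non-cry. Hyper-parameters are placeholders.
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

def train_cry_detectors(X_train: np.ndarray, y_train: np.ndarray):
    models = [
        XGBClassifier(n_estimators=200),           # gradient boosting tree
        SVC(kernel="rbf"),                         # support vector machine
        MLPClassifier(hidden_layer_sizes=(64,)),   # multi-layer perceptron
    ]
    for model in models:
        model.fit(X_train, y_train)
    return models

def is_cry(models, x: np.ndarray) -> bool:
    votes = [int(m.predict(x.reshape(1, -1))[0]) for m in models]
    return sum(votes) >= 2                         # at least 2 of 3 classifiers agree
```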
The classification model mentioned in the above application needs to be trained in advance, and therefore, in an optional implementation manner of this embodiment, before the step S202 collects the sound information emitted by the target object, the method of this example further includes:
step S101, a first data set is obtained, wherein the first data set comprises a plurality of sound information which is crying information;
step S102, extracting the frequency spectrum characteristics of the sound information in the first data set;
step S103, selecting partial data from the first data set as a training set of the initial classification model, and training the initial statistical probability model based on the spectrum characteristics in the training set to determine the parameters of the classification model.
For the above steps S101 to S103, in a specific application scenario, the infant is taken as a target object, and the classification model is a gradient lifting tree, a support vector machine and a multi-layer perceptron, then the specific training process may be:
the first data set: the first data set can be derived from other data sets such as a data set donatacry-corpus and comprises 2467 sections of baby crying audio; the data set ESC-50 comprises 50 types of audios, each type of audio contains 40 samples, one type of the 50 types of audio is baby crying, and the other 49 types of audio are non-baby crying audios, wherein the classes comprise animal crying, natural environment sounds, human voices, indoor sounds and urban noises; thus, there are 2507 samples of baby cry audio samples and 1960 samples of non-baby cry samples. The data set is divided into 20% test set and 80% training set.
Further, mel-frequency cepstral coefficients and their first-order and second-order gradient features are extracted for each audio segment as in step S204-13 above. A gradient boosting tree (XGBoost), a support vector machine (SVM) and a multi-layer perceptron (MLP) are trained separately using the training set with cross-validation, and the optimal parameters of each classifier are determined. On the test set, each sample is classified by the trained gradient boosting tree, support vector machine and multi-layer perceptron respectively, and the classification results of the three models are voted to produce the final classification result. The classification results of the test-set samples are counted to evaluate the training effect of the model; the finally determined model parameters are shown in Table 1:
Table 1 (classifier parameters; reproduced as an image in the original publication)
In addition, in another optional implementation manner of this embodiment, the acoustic model also needs to be trained, that is, before the step S202 collects the acoustic information emitted by the target object, the method of this embodiment further includes:
step S111, acquiring a second data set; wherein the sound information in the second data set is divided into sound information of a plurality of demand types; each demand type comprises sound information used for representing the demand state of the demand of the target object;
step S112, extracting the frequency spectrum characteristics of the sound information in the second data set;
and S113, selecting partial data from the second data set as a training set of the initial sound model, and training the initial first-stage model and the initial second-stage model in the initial sound model based on the spectral features in the training set to determine the parameters of the first-stage model and the second-stage model in the sound model.
In a specific application scenario, taking the target object as an infant as an example, if the sound model is a hierarchical UBM-GMM, the steps S111 to S113 may be:
the source of the second data set may be other data sets such as data donative-corpus, including: 2467 baby cry audio comprises 8 categories, including 740 times hungry, 468 times tired, 232 times solitary, 161 times burping, 268 times belly pain, 115 times cold or hot, 149 times fear, and 334 times uncomfortable. Wherein, 20% of the second data set is divided into a test set, and 80% is divided into a training set.
Further, as in step S204-13, mel-frequency cepstral coefficients and their first-order and second-order gradient features are extracted from each audio segment;
FIG. 3 is a schematic diagram of a hierarchical UBM-GMM model according to an embodiment of the present invention. Based on FIG. 3, the model is trained using the training set of the second data set with cross-validation: UBM-GMM1 is trained first and divides the input audio into 3 major classes; then, for each major class, UBM-GMM2, UBM-GMM3 and UBM-GMM4 are trained to classify that major class into its subclasses. According to the different needs of infants, crying is divided into three demand types, namely "physiological response", "physiological need" and "emotional need", and each demand type is divided into several demand-state subclasses: physiological response: hiccups, abdominal pain, other discomfort; physiological need: hungry, sleepy, cold or hot; emotional need: fear and loneliness.
The reasons for using a hierarchical UBM-GMM are: (1) the amounts of data in the categories of the second data set differ greatly; with only a single UBM-GMM, categories with much data are easy to identify while categories with little data are hard to identify; merging the subclasses into major classes reduces the data imbalance between classes and improves classification accuracy; (2) the causes of an infant's crying are not always single, and subdividing subclasses within the major classes helps cover all possible factors that cause the infant to cry.
The training process of each UBM-GMM model is shown by the solid lines in FIG. 4: first a GMM, called the UBM, is trained using all the training data; then the GMM is further trained with the data of each category to obtain a model CN-GMM for each category, which completes the training. The classification of new input data by a single UBM-GMM is shown by the dotted lines in FIG. 4: the input features are fed into the GMM of each category and, in combination with the UBM, maximum a posteriori estimation is performed to obtain a score on each category model; the category with the highest score is the category to which the input belongs. The parameters used to train each UBM-GMM model are shown in Table 2:
Table 2 (UBM-GMM training parameters; reproduced as an image in the original publication)
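To make the training and scoring flow of FIG. 4 concrete, a simplified sketch using scikit-learn GaussianMixture is given below. Full MAP adaptation of the UBM to each class is not shown: the per-class GMMs are merely warm-started from the UBM means, which is an approximation of the procedure described above, and all parameter values are placeholders rather than the values of Table 2.

```python
# Simplified UBM-GMM sketch: one GMM over all data (the UBM), one GMM per class,
# and a likelihood-ratio style score of class GMM vs. UBM for classification.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm_gmm(features_by_class: dict, n_components: int = 16):
    all_feats = np.vstack(list(features_by_class.values()))
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=0)
    ubm.fit(all_feats)                                    # universal background model
    class_gmms = {}
    for label, feats in features_by_class.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                              means_init=ubm.means_,      # warm-start from the UBM
                              random_state=0)
        gmm.fit(feats)
        class_gmms[label] = gmm
    return ubm, class_gmms

def classify_segment(ubm, class_gmms, feats: np.ndarray) -> str:
    # feats: (n_frames, n_features) for one audio segment
    scores = {label: gmm.score(feats) - ubm.score(feats)  # relative log-likelihood
              for label, gmm in class_gmms.items()}
    return max(scores, key=scores.get)
```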
In yet another optional implementation of this embodiment, identifying the demand state representing the demand of the target object corresponding to the sound information through the first-level model and the second-level model, as referred to in step S208, may be implemented as follows:
step S208-11, inputting the frequency spectrum characteristics of the sound information into a first-level model to obtain probability values of the sound information which are respectively of a plurality of requirement types;
step S208-12, selecting the demand type with the maximum probability value from the probability values of the demand types;
step S208-13, inputting the frequency spectrum characteristics of the sound information into a second-level model to obtain the probability value of the demand state corresponding to the demand type with the maximum selected probability value;
and step S208-14, taking the demand state with the maximum probability value as the demand state of the sound information.
In this embodiment, the pre-trained model is a two-level model: the first-level model identifies the demand type that the sound information represents for the target object, and the second-level model identifies the demand state of the sound information within that demand type, which is the specific demand of the target object. Therefore, in step S208-11 the spectral features of the sound information are input into the first-level model to obtain the probability that the sound information belongs to each of the demand types, and in step S208-13 the spectral features of the sound information are input into the second-level model corresponding to the demand type with the highest probability to obtain the probability of each demand state under that type.
In addition, the demand types in the present application are preferably physiological response, emotional need and physiological need; of course, other demand types, such as psychological response, may be added according to the actual situation. The demand states of physiological response include hiccups, abdominal pain, other discomfort, etc.; those of physiological need include hungry, cold or hot, sleepy, etc.; and those of emotional need include fear, loneliness, etc. In other words, the present application first classifies the cry into a major class and then divides each major class into subclasses. Correspondingly, when the models are trained, the sample data of all subclasses under the same major class can be merged as training samples for that major class's model, while the sample data of each subclass is used to train the subclass model. Compared with the prior-art approach of training a model directly on the subclass samples, the first-level and second-level models trained in this way avoid the inaccuracy caused by the imbalance between the amounts of subclass training data, thereby improving recognition accuracy. Moreover, because the cause of an infant's crying is not always single, identifying first the major class of the cry and then the subclass within it effectively captures all possible factors (specific demands) behind the cry. A sketch of this two-level identification is given below.
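A minimal sketch of steps S208-11 to S208-14, under assumed model interfaces, is shown below; here each model is assumed to expose a predict_proba method returning a dictionary of class probabilities, which is an illustrative convention rather than the patent's implementation.

```python
# Two-level identification sketch: pick the most probable demand type with the
# first-level model, then pick the demand state with that type's second-level model.
# `predict_proba` is an assumed interface returning {class_name: probability}.
def identify_demand(first_level, second_level_by_type, features):
    type_probs = first_level.predict_proba(features)         # e.g. {"physiological need": 0.8, ...}
    demand_type = max(type_probs, key=type_probs.get)         # step S208-12
    state_probs = second_level_by_type[demand_type].predict_proba(features)
    demand_state = max(state_probs, key=state_probs.get)      # step S208-14
    return demand_type, demand_state
```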
The present application will be described below by way of example with reference to specific embodiments thereof;
1) Data set preprocessing:
In this embodiment, the second data set is derived from data sets such as donateacry-corpus and contains 2467 segments of infant crying audio, divided into 3 demand types and 8 demand states:
the requirement type one: physiological responses including burp, abdominal pain, other discomfort, etc. 3 demanding states;
the requirement type II: physiological needs including 3 demand states of hungry, cold or hot, sleepy, etc.;
the requirement type three: emotional requirements, including 2 demand states such as fear, lonely, etc.
20% of the samples in the second data set are used as the test set and 80% as the training set. The audio samples in the training set are transcoded into 8000 Hz wav audio and segmented into 3-second segments with a 1-second overlap; mel-frequency cepstral coefficients and their first-order and second-order gradient features are extracted from each segment, and the features extracted from the segmented training set are used to train the single-level UBM-GMM model and the multi-level UBM-GMM model. The audio samples in the test set are transcoded into 8000 Hz wav audio and segmented in the same way; the features extracted from the segmented test set are used to evaluate the trained single-level and multi-level UBM-GMM models.
2) Training and evaluating a multi-level UBM-GMM model;
wherein, training a multi-level UBM-GMM model:
the multi-level UBM-GMM model refers to firstly dividing input samples into three categories by using a first-level UBM-GMM model; and then according to the classification result, selecting a second-stage UBM-GMM model corresponding to different classes to classify the input sample into a sub-class of the class.
The classification category of the first-level UBM-GMM model is as follows: physiological responses, physiological needs and emotional needs;
the classification category of the second-level UBM-GMM model corresponding to the physiological response category is as follows: hiccups, abdominal pain, other discomfort;
the classification categories of the second-level UBM-GMM model corresponding to the category of the physiological requirement are as follows: hungry, cold or hot, stranded;
the classification categories of the second-level UBM-GMM model corresponding to the category of the emotion requirement are as follows: fear and lonely.
Using the features extracted from the segmented training set together with cross-validation, the first-level UBM-GMM model is trained first and its hyper-parameters are tuned to the optimum, the hyper-parameters being the numbers of mixture components of the first-level UBM and of each first-level class GMM; then the 3 second-level UBM-GMM models are trained with the training-set features of the corresponding classes and their hyper-parameters are tuned to the optimum, the hyper-parameters being the numbers of mixture components of the second-level UBM and of each second-level class GMM.
Evaluation of a multilevel UBM-GMM model:
and evaluating the trained multi-stage UBM-GMM model by using the features extracted after the test set is segmented. The process is as follows: for a complete test set sample, respectively inputting the characteristics of the segmented audio frequency into a trained multi-stage UBM-GMM model, obtaining the classification result of each segment, counting the classification results of all the segments, and obtaining the probability of each classification, wherein the class with the highest probability is the predicted result of the complete test sample. The results show that using the multi-level UBM-GMM model has a more accurate identification of the audio to be tested.
3) Training and evaluating a single-stage UBM-GMM model
It should be noted that the single-stage UBM-GMM model is a conventionally used model, i.e., a comparative example.
Training a single-stage UBM-GMM model:
the single-stage UBM-GMM model refers to 8 classification of input samples by using the single UBM-GMM model, and the classification classes are as follows: hungry, tired, solitary, belching, upset stomach, cold or hot, fear, and other discomfort.
And training a single-stage UBM-GMM model by using the characteristics extracted after the training set is segmented and combining cross validation, and adjusting related hyper-parameters to be optimal, wherein the hyper-parameters comprise the mixed component quantity of the UBM and the GMM of each type.
Evaluation of a single-stage UBM-GMM model:
and evaluating the trained single-stage UBM-GMM model by using the features extracted after the test set is segmented. The process is as follows: for a complete test set sample, respectively inputting the characteristics of the segmented audio into a trained single-stage UBM-GMM model to obtain the classification result of each segment, and counting the classification results of all the segments to obtain the probability of each classification, wherein the class with the highest probability is the predicted result of the complete test sample. And classifying each complete sample in the test set by using the trained single-stage UBM-GMM model, wherein the sample classification accuracy of the test set is 38%.
The single-level and multi-level UBM-GMM models described above were tested using a segment of "hungry" audio:
multilevel UBM-GMM model: firstly, classifying the audio features of the test sample segments by using a 1 st-level UBM-GMM model to obtain the classification result of each segment of the test sample. After the classification of the 1 st-level UBM-GMM model, the classification result of the input test sample is that the classification probability of the 'physiological demand' is 0.8, and the classification probability of the 'physiological response' is 0.2, so that the classification of the input test sample is the 'physiological demand'; and then, classifying by using a 2 nd-level UBM-GMM model corresponding to the physiological requirement, wherein the classification result of the input test sample is that the classification is hungry, the class probability is 0.8, and the class probability is 0.2, so that the final classification class of the input test sample is hungry.
Similarly, using the same test sample, classification with the single-level UBM-GMM model gives "hungry" with probability 0.4, "fear" with probability 0.2, "sleepy" with probability 0.2 and "abdominal pain" with probability 0.2. The final result is also "hungry", but because the probability of "hungry" is higher with the multi-level UBM-GMM model, its classification result is better than that of the single-level model.
A segment of "abdominal pain" audio was also used to test the single-level and multi-level UBM-GMM models described above.
Multi-level UBM-GMM model: the audio features of the test-sample segments are first classified by the level-1 UBM-GMM model to obtain a classification result for each segment. After this classification, the input test sample is assigned to "physiological response" with probability 0.8 and to "physiological need" with probability 0.2, so it is classified as "physiological response". The level-2 UBM-GMM model corresponding to "physiological response" then classifies the sample as "abdominal pain" with probability 0.8 and "hiccups" with probability 0.2, so the final classification of the input test sample is "abdominal pain".
Similarly, using the same test sample, classification with the single-level UBM-GMM model gives "sleepy" with probability 0.4, "fear" with probability 0.2, "hiccups" with probability 0.2 and "abdominal pain" with probability 0.2, so its final result is "sleepy".
As these results show, the test audio belongs to the "abdominal pain" category: the single-level UBM-GMM model misclassifies it as "sleepy", whereas the multi-level UBM-GMM model correctly classifies it as "abdominal pain" with a probability of 0.8.
Therefore, recognizing the sound through the multi-level UBM-GMM model yields an accurate result.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
In this embodiment, a sound recognition apparatus is further provided, and the apparatus is used to implement the foregoing embodiments and preferred embodiments, and the description that has been already made is omitted. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 5 is a block diagram of a voice recognition apparatus according to an embodiment of the present invention, as shown in fig. 5, the apparatus including: the acquisition module 52 is used for acquiring sound information emitted by the target object; the judging module 54 is coupled with the collecting module 52 and is used for judging whether the collected sound information emitted by the target object is crying information; an input module 56, coupled to the determining module 54, configured to input the voice information into a pre-trained voice model if the determination result is yes, where the voice model is obtained by training an initial voice model according to a training set composed of multiple pieces of cry information, and the voice model includes a first-level model and a second-level model; the first-level model is used for identifying the demand type of the sound information for representing the demand of the target object, and the second-level model is used for identifying the demand state of the sound information in the demand type; and the identification module 58 is coupled with the input module 56 and is used for identifying the specific requirement corresponding to the sound information and used for representing the target object through the first-stage model and the second-stage model.
It should be noted that the target object in the present application is preferably an infant, but may also be a young child or an animal. The present application does not limit the specific object, which may be set according to the actual situation.
Optionally, the determining module 54 in this embodiment further includes: a transcoding unit for transcoding the collected sound information into a specified format; a processing unit for segmenting the transcoded sound information into audio segments and extracting spectral features from each segment, wherein adjacent audio segments partially overlap; and a judging unit for detecting the spectral features of each audio segment with the classification model to determine whether the sound information is crying.
In a preferred mode of the application, the specified format is preferably wav format, and the audio sampling rates are both 8000 Hz; of course in other application scenarios the following format is possible: 3gp, aac, amr, caf, flac, mp3, ogg, aiff, etc., on the basis of which the following sampling frequencies (in Hz) are available: 8000, 11025, 12000, 16000, 22050, 24000, 32000, 40000, 44100, 47250, 48000, etc.
It should be noted that unifying the format (transcoding) and the sampling frequency of the input audio (sound information) is mainly for convenience in actual use: without transcoding, a separate reading method would be needed for each format, which is cumbersome, and without a uniform sampling frequency, audio of the same length would contain different amounts of data, which would affect subsequent feature extraction and model training. The audio is therefore pre-processed first. In current practice, the input audio is converted into the wav format (or any other format from which the audio data can be read) and the sampling frequency is unified to 8000 Hz, although other sampling frequencies are also possible. In addition, in the present application the tool used to transcode the sound information is preferably FFmpeg.
Furthermore, in a preferred embodiment of the present application, the features used are preferably mel-frequency cepstral coefficients and first and second order gradients of mel-frequency cepstral coefficients, both of which belong to frequency features of audio.
The calculation of the mel-frequency cepstral coefficients is as follows: 1) window the input audio (with a window length of 50 milliseconds), with overlap between adjacent windows (an overlap of 20 milliseconds); 2) apply a Fourier transform to the audio signal of each window to obtain its spectrum; 3) for the spectrum of each window, apply a bank of mel filters (20 are used) to obtain the mel scales (20); 4) take the logarithm of each mel scale to obtain the log energy; 5) apply an inverse discrete Fourier transform (or inverse discrete cosine transform) to the log energies of the mel scales to obtain the cepstrum; 6) the amplitudes of the resulting cepstrum (20 of them, the same number as the mel filters used) are the mel-frequency cepstral coefficients. The first-order and second-order gradients of the mel-frequency cepstral coefficients are then calculated.
Fig. 6 is a block diagram showing an alternative structure of a voice recognition apparatus according to an embodiment of the present invention, and as shown in fig. 6, the apparatus further includes: the first acquiring module 62 is configured to acquire a first data set before acquiring sound information emitted by a target object, where the first data set includes a plurality of pieces of sound information that are crying information; a first extracting module 64 coupled to the first obtaining module 62, configured to extract a spectral feature of the sound information in the first data set; a first training module 66, coupled to the first extraction module 64, is configured to select a portion of the data from the first data set as a training set of the initial classification model, and train the initial statistical probability model based on spectral features in the training set to determine parameters of the classification model.
In a specific application scenario, an infant is taken as a target object, and classification models are a gradient lifting tree, a support vector machine and a multilayer perceptron, so that a specific training process may be as follows:
The first data set: the first data set can be derived from other data sets. For example, the data set donateacry-corpus comprises 2467 segments of baby crying audio; the data set ESC-50 comprises 50 classes of audio with 40 samples per class, of which one class is baby crying and the other 49 classes are non-baby-crying audio (the classes include animal sounds, natural environment sounds, human voices, indoor sounds and urban noises). Thus, there are 2507 baby-crying audio samples and 1960 non-baby-crying samples. The data set is divided into a 20% test set and an 80% training set.
Further, the mel-frequency cepstral coefficients and their first- and second-order gradient features are extracted for each audio segment; a gradient boosting tree (XGBoost), a support vector machine (SVM) and a multilayer perceptron (MLP) are each trained on the training set with cross validation to determine the optimal parameters of each classifier model; on the test set, each sample is classified by the trained gradient boosting tree, support vector machine and multilayer perceptron respectively, and the classification results of the three models are voted on to produce the final classification result; the sample classification results on the test set are then counted to evaluate the training effect of the models.
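One possible realization of this three-classifier scheme, using scikit-learn together with the xgboost package and hard (majority) voting, is sketched below; the per-segment feature vectors, file names and hyperparameter values are illustrative assumptions rather than settings taken from the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import VotingClassifier
from xgboost import XGBClassifier

# X: (n_samples, n_features) per-segment feature vectors; y: 1 = baby cry, 0 = non-cry
X, y = np.load("features.npy"), np.load("labels.npy")   # hypothetical files
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

xgb = XGBClassifier(n_estimators=200, max_depth=4)       # gradient boosting tree
svm = make_pipeline(StandardScaler(), SVC(C=1.0, kernel="rbf"))
mlp = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500))

# Cross-validation on the training set to gauge each classifier before ensembling
for name, clf in [("xgb", xgb), ("svm", svm), ("mlp", mlp)]:
    print(name, cross_val_score(clf, X_tr, y_tr, cv=5).mean())

# Majority vote over the three trained classifiers, evaluated on the held-out test set
ensemble = VotingClassifier([("xgb", xgb), ("svm", svm), ("mlp", mlp)], voting="hard")
ensemble.fit(X_tr, y_tr)
print("test accuracy:", ensemble.score(X_te, y_te))
```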
Fig. 7 is a block diagram of an alternative structure of a voice recognition apparatus according to an embodiment of the present invention, and as shown in fig. 7, the apparatus further includes: a second obtaining module 72, configured to obtain a second data set before acquiring sound information emitted by the target object; wherein the sound information in the second data set is divided into sound information of a plurality of demand types; each demand type comprises sound information used for representing the demand state of the demand of the target object; a second extracting module 74, coupled to the second obtaining module 72, configured to extract a spectral feature of the sound information in the second data set; and a second training module 76, coupled to the second extraction module 74, for selecting a portion of the data from the second data set as a training set of the initial acoustic model, and training the initial first-stage model and the initial second-stage model in the initial acoustic model based on spectral features in the training set to determine parameters of the first-stage model and the second-stage model in the acoustic model.
In a specific application scenario, for example where the target object is an infant and the acoustic model is a hierarchical UBM-GMM, the training process may be as follows:
The second data set may be derived from other data sets such as the donateacry-corpus, which includes 2467 segments of baby crying audio in 8 categories: 740 hungry, 468 tired, 232 lonely, 161 needing to burp, 268 belly pain, 115 cold or hot, 149 scared, and 334 otherwise uncomfortable. 20% of the second data set is divided into a test set and 80% into a training set.
Further, the mel-frequency cepstral coefficients and their first- and second-order gradient features are extracted for each audio segment;
Based on fig. 3, the hierarchical UBM-GMMs are trained using the training set from the second data set above, with cross validation: UBM-GMM1 is trained first and divides the input audio into 3 major classes; then, for each major class, UBM-GMM2, UBM-GMM3 and UBM-GMM4 are trained to classify that major class into its subclasses. According to the different requirements of babies, crying is divided into three demand types, namely 'physiological response', 'physiological requirement' and 'emotional requirement'; each demand type is further divided into several demand-state subclasses, namely physiological response (hiccups, abdominal pain, other discomfort), physiological requirement (hungry, sleepy, cold or hot), and emotional requirement (fear and loneliness).
The reasons for using hierarchical UBM-GMMs are: (1) the amount of data in each category of the second data set differs greatly; if only a single UBM-GMM were used, the categories with a large amount of data would be easy to recognize while the categories with little data would be difficult to recognize; by using a hierarchical method, the demand states are grouped into demand types, which reduces the data imbalance between categories and improves the classification accuracy; (2) the causes of baby crying are not always single, so subdividing each major type into subclasses helps cover all of the possible factors that cause the baby to cry.
For the training process of each UBM-GMM model, as shown in FIG. 4: first, a single GMM, called the UBM, is trained using all of the training data; then, the UBM is further trained with the data of each category respectively to obtain the model CN-GMM of each category; this completes the training process.
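A simplified sketch of this UBM-GMM training and scoring, based on scikit-learn's GaussianMixture, is given below; re-fitting each category GMM from the UBM's parameters is used here as a rough stand-in for the adaptation step, since the text does not specify the adaptation method, and the function names and number of mixture components are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm_gmm(frames_by_class: dict, n_components: int = 16):
    """frames_by_class maps a class name to an (n_frames, n_features) array of MFCC frames."""
    all_frames = np.vstack(list(frames_by_class.values()))
    # Step 1: train one GMM (the UBM) on all training data
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag").fit(all_frames)

    class_gmms = {}
    for label, frames in frames_by_class.items():
        # Step 2: re-fit each class GMM starting from the UBM parameters
        # (a rough stand-in for the per-category adaptation described in the text)
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                              weights_init=ubm.weights_, means_init=ubm.means_,
                              precisions_init=ubm.precisions_)
        class_gmms[label] = gmm.fit(frames)
    return ubm, class_gmms

def classify(frames: np.ndarray, class_gmms: dict) -> str:
    """Assign the segment to the class whose GMM gives the highest average log-likelihood."""
    return max(class_gmms, key=lambda label: class_gmms[label].score(frames))
```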
Optionally, the identification module 58 in this embodiment may further include: a first input unit, configured to input the spectral features of the sound information into the first-level model to obtain the probability values that the sound information belongs to each of a plurality of demand types; a selection unit, configured to select the demand type with the maximum probability value from the probability values of the demand types; a second input unit, configured to input the spectral features of the sound information into the second-level model to obtain the probability values of the demand states corresponding to the selected demand type with the maximum probability value; and an identification unit, configured to take the demand state with the maximum probability value as the demand state of the sound information.
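Putting the two levels together, the cascade described by these units could be sketched as follows; the demand-type hierarchy is taken from the embodiment above, while the model handles (one scoring model per demand type and per demand state, each exposing a score() method as in the previous sketch) are assumptions for illustration.

```python
import numpy as np

# Hypothetical hierarchy taken from the embodiment above
HIERARCHY = {
    "physiological response":    ["hiccups", "abdominal pain", "other discomfort"],
    "physiological requirement": ["hungry", "sleepy", "cold or hot"],
    "emotional requirement":     ["fear", "loneliness"],
}

def recognize_demand(frames: np.ndarray, type_gmms: dict, state_gmms: dict):
    """Two-stage inference: pick the most likely demand type, then the most likely demand state.

    type_gmms maps each demand type to its first-level model; state_gmms maps each demand
    state to its second-level model (e.g. the per-category GMMs from the sketch above).
    """
    # Stage 1: score each demand type and keep the most likely one
    demand_type = max(type_gmms, key=lambda t: type_gmms[t].score(frames))
    # Stage 2: score only the demand states that belong to the selected demand type
    candidates = HIERARCHY[demand_type]
    demand_state = max(candidates, key=lambda s: state_gmms[s].score(frames))
    return demand_type, demand_state
```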
It should be noted that the above modules may be implemented by software or by hardware; in the latter case, this may be achieved in, but is not limited to, the following ways: the modules are all located in the same processor; alternatively, the modules are located in different processors in any combination.
Example 3
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
S1, collecting sound information emitted by the target object;
S2, judging whether the collected sound information emitted by the target object is crying information or not;
S3, if yes, inputting the sound information into a pre-trained sound model, wherein the sound model is obtained by training an initial sound model according to a training set composed of a plurality of pieces of crying information, and the sound model includes a first-level model and a second-level model; the first-level model is used for identifying the demand type of the sound information for representing the demand of the target object, and the second-level model is used for identifying the demand state of the sound information in the demand type;
and S4, identifying, through the first-level model and the second-level model, the specific demand of the target object that the sound information characterizes.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing a computer program, such as a USB flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to perform, through a computer program, the following steps:
S1, collecting sound information emitted by the target object;
S2, judging whether the collected sound information emitted by the target object is crying information or not;
S3, if yes, inputting the sound information into a pre-trained sound model, wherein the sound model is obtained by training an initial sound model according to a training set composed of a plurality of pieces of crying information, and the sound model includes a first-level model and a second-level model; the first-level model is used for identifying the demand type of the sound information for representing the demand of the target object, and the second-level model is used for identifying the demand state of the sound information in the demand type;
and S4, identifying, through the first-level model and the second-level model, the specific demand of the target object that the sound information characterizes.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A method for recognizing a sound, comprising:
collecting sound information emitted by a target object;
judging whether the collected sound information emitted by the target object is crying information or not;
if the judgment result is yes, inputting the sound information into a pre-trained sound model, wherein the pre-trained sound model is obtained by training an initial sound model according to a training set consisting of a plurality of pieces of crying information, and comprises a first-stage model and a second-stage model; the first-stage model is used for identifying a demand type of the sound information for representing the demand of the target object, and the second-stage model is used for identifying a demand state of the sound information in the demand type;
identifying specific requirements corresponding to the sound information for characterizing the target object through the first-level model and the second-level model.
2. The method of claim 1, wherein determining whether the collected sound information of the target object is crying information comprises:
transcoding the collected sound information into a specified format;
segmenting the audio of the transcoded sound information, and extracting spectral characteristics from each audio segment; wherein two adjacent audio segments partially overlap;
and detecting the spectral characteristics of each section of audio through a classification model to judge whether the sound information is crying information.
3. The method of claim 2, wherein prior to collecting the acoustic information emitted by the target object, the method further comprises:
acquiring a first data set, wherein the first data set comprises a plurality of sound information which are crying information;
extracting the spectral characteristics of the sound information in the first data set;
selecting a part of data from the first data set as a training set of an initial classification model, and training an initial statistical probability model based on spectral features in the training set to determine parameters of the classification model.
4. The method of claim 1, wherein prior to collecting the acoustic information emitted by the target object, the method further comprises:
acquiring a second data set; wherein the sound information in the second data set is divided into sound information of a plurality of demand types; each demand type comprises sound information used for representing the demand state of the demand of the target object;
extracting the spectral characteristics of the sound information in the second data set;
selecting a portion of data from the second data set as a training set of initial acoustic models, and training initial first-stage and second-stage models of the initial acoustic models based on spectral features in the training set to determine parameters of the first-stage and second-stage models of the acoustic models.
5. The method of claim 1 or 4, wherein identifying, by the first-level model and the second-level model, a demand state corresponding to the sound information for characterizing the demand of the target object comprises:
inputting the frequency spectrum characteristics of the sound information into the first-level model to obtain probability values of the sound information which are respectively of a plurality of requirement types;
selecting a demand type with the maximum probability value from the probability values of the demand types;
inputting the frequency spectrum characteristics of the sound information into the second-level model to obtain the probability value of the demand state corresponding to the demand type with the maximum selected probability value;
and taking the demand state with the maximum probability value as the demand state of the sound information.
6. An apparatus for recognizing a sound, comprising:
the acquisition module is used for acquiring sound information emitted by a target object;
the judgment module is used for judging whether the acquired sound information emitted by the target object is crying information or not;
the input module is used for inputting the voice information into a pre-trained voice model under the condition that the judgment result is yes, wherein the voice model is obtained by training an initial voice model according to a training set consisting of a plurality of pieces of crying information and comprises a first-stage model and a second-stage model; the first-stage model is used for identifying a demand type of the sound information for representing the demand of the target object, and the second-stage model is used for identifying a demand state of the sound information in the demand type;
and the identification module is used for identifying the specific requirements corresponding to the sound information and used for representing the target object through the first-level model and the second-level model.
7. The apparatus of claim 6, wherein the determining module comprises:
the transcoding unit is used for transcoding the acquired sound information into a specified format;
the processing unit is used for segmenting the audio of the transcoded sound information and extracting spectral characteristics from each audio segment, wherein two adjacent audio segments partially overlap;
and the judging unit is used for detecting the spectral characteristics of each section of audio through the classification model so as to judge whether the sound information is crying information.
8. The apparatus of claim 7, further comprising:
the first acquisition module is used for acquiring a first data set before acquiring sound information emitted by a target object, wherein the first data set comprises a plurality of pieces of sound information which are crying information;
the first extraction module is used for extracting the frequency spectrum characteristics of the sound information in the first data set;
and the first training module is used for selecting partial data from the first data set as a training set of an initial classification model and training the initial statistical probability model based on the spectral features in the training set to determine the parameters of the classification model.
9. The apparatus of claim 6, further comprising:
the second acquisition module is used for acquiring a second data set before acquiring the sound information emitted by the target object; wherein the sound information in the second data set is divided into sound information of a plurality of demand types; each demand type comprises sound information used for representing the demand state of the demand of the target object;
the second extraction module is used for extracting the frequency spectrum characteristics of the sound information in the second data set;
and the second training module is used for selecting partial data from the second data set as a training set of an initial sound model, and training an initial first-stage model and an initial second-stage model in the initial sound model based on the spectral features in the training set to determine the parameters of the first-stage model and the second-stage model in the sound model.
10. The apparatus of claim 6 or 9, wherein the identification module comprises:
the first input unit is used for inputting the frequency spectrum characteristics of the sound information into the first-level model to obtain probability values of the sound information which are respectively of a plurality of requirement types;
the selection unit is used for selecting the demand type with the maximum probability value from the probability values of the demand types;
the second input unit is used for inputting the frequency spectrum characteristics of the sound information into the second-level model to obtain the probability value of the demand state corresponding to the selected demand type with the maximum probability value;
and the identification unit is used for taking the demand state with the maximum probability value as the demand state of the sound information.
11. A storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any of claims 1 to 5 when executed.
12. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 5.
CN201910562749.8A 2019-06-26 2019-06-26 Voice recognition method and device, storage medium and electronic device Pending CN111883174A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910562749.8A CN111883174A (en) 2019-06-26 2019-06-26 Voice recognition method and device, storage medium and electronic device
PCT/CN2020/087072 WO2020259057A1 (en) 2019-06-26 2020-04-26 Sound identification method, device, storage medium, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910562749.8A CN111883174A (en) 2019-06-26 2019-06-26 Voice recognition method and device, storage medium and electronic device

Publications (1)

Publication Number Publication Date
CN111883174A true CN111883174A (en) 2020-11-03

Family

ID=73153876

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910562749.8A Pending CN111883174A (en) 2019-06-26 2019-06-26 Voice recognition method and device, storage medium and electronic device

Country Status (2)

Country Link
CN (1) CN111883174A (en)
WO (1) WO2020259057A1 (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130070928A1 (en) * 2011-09-21 2013-03-21 Daniel P. W. Ellis Methods, systems, and media for mobile audio event recognition
CN104347066B (en) * 2013-08-09 2019-11-12 上海掌门科技有限公司 Recognition method for baby cry and system based on deep-neural-network
CN107808658A (en) * 2016-09-06 2018-03-16 深圳声联网科技有限公司 Based on real-time baby's audio serial behavior detection method under domestic environment
CN107591162B (en) * 2017-07-28 2021-01-12 南京邮电大学 Cry recognition method based on pattern matching and intelligent nursing system
CN107818779A (en) * 2017-09-15 2018-03-20 北京理工大学 A kind of infant's crying sound detection method, apparatus, equipment and medium
CN108461091A (en) * 2018-03-14 2018-08-28 南京邮电大学 Intelligent crying detection method towards domestic environment

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101807396A (en) * 2010-04-02 2010-08-18 陕西师范大学 Device and method for automatically recording crying of babies
US20150073306A1 (en) * 2012-03-29 2015-03-12 The University Of Queensland Method and apparatus for processing patient sounds
CN103258532A (en) * 2012-11-28 2013-08-21 河海大学常州校区 Method for recognizing Chinese speech emotions based on fuzzy support vector machine
CN103280220A (en) * 2013-04-25 2013-09-04 北京大学深圳研究生院 Real-time recognition method for baby cry
US20160364963A1 (en) * 2015-06-12 2016-12-15 Google Inc. Method and System for Detecting an Audio Event for Smart Home Devices
CN109903780A (en) * 2019-02-22 2019-06-18 宝宝树(北京)信息技术有限公司 Crying cause model method for building up, system and crying reason discriminating conduct
CN111354375A (en) * 2020-02-25 2020-06-30 咪咕文化科技有限公司 Cry classification method, device, server and readable storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488077A (en) * 2021-09-07 2021-10-08 珠海亿智电子科技有限公司 Method and device for detecting baby crying in real scene and readable medium
CN113488077B (en) * 2021-09-07 2021-12-07 珠海亿智电子科技有限公司 Method and device for detecting baby crying in real scene and readable medium

Also Published As

Publication number Publication date
WO2020259057A1 (en) 2020-12-30

Similar Documents

Publication Publication Date Title
CN112750465B (en) Cloud language ability evaluation system and wearable recording terminal
US11655622B2 (en) Smart toilet and electric appliance system
US10157619B2 (en) Method and device for searching according to speech based on artificial intelligence
CN110047512B (en) Environmental sound classification method, system and related device
CN106725532A (en) Depression automatic evaluation system and method based on phonetic feature and machine learning
CN103730130A (en) Detection method and system for pathological voice
CN106653059A (en) Automatic identification method and system for infant crying cause
CN109979486B (en) Voice quality assessment method and device
CN115410711B (en) White feather broiler health monitoring method based on sound signal characteristics and random forest
CN114708964B (en) Vertigo auxiliary analysis statistical method and system based on intelligent feature classification
CN113870239A (en) Vision detection method and device, electronic equipment and storage medium
Kulkarni et al. Child cry classification-an analysis of features and models
CN106710588B (en) Speech data sentence recognition method, device and system
CN109935241A (en) Voice information processing method
CN103578480A (en) Negative emotion detection voice emotion recognition method based on context amendment
CN111883174A (en) Voice recognition method and device, storage medium and electronic device
CN114037018A (en) Medical data classification method and device, storage medium and electronic equipment
CN111611781B (en) Data labeling method, question answering device and electronic equipment
CN117219127A (en) Cognitive state recognition method and related equipment
US20240023877A1 (en) Detection of cognitive impairment
CN114822557A (en) Method, device, equipment and storage medium for distinguishing different sounds in classroom
CN114743619A (en) Questionnaire quality evaluation method and system for disease risk prediction
CN113057588A (en) Disease early warning method, device, equipment and medium
Feier et al. Newborns' cry analysis classification using signal processing and data mining
Ribeiro et al. Early Dyslexia Evidences using Speech Features.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20201103)