WO2020259057A1 - Sound recognition method and device, storage medium and electronic device - Google Patents

Sound recognition method and device, storage medium and electronic device

Info

Publication number
WO2020259057A1
WO2020259057A1 (international application PCT/CN2020/087072; application number CN2020087072W)
Authority
WO
WIPO (PCT)
Prior art keywords
model
sound information
sound
demand
information
Prior art date
Application number
PCT/CN2020/087072
Other languages
English (en)
French (fr)
Inventor
屈奇勋
胡雯
张磊
石瑗璐
李宛庭
沈凌浩
郑汉城
Original Assignee
深圳数字生命研究院
深圳碳云智能数字生命健康管理有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳数字生命研究院, 深圳碳云智能数字生命健康管理有限公司 filed Critical 深圳数字生命研究院
Publication of WO2020259057A1 publication Critical patent/WO2020259057A1/zh

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/26 Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition

Definitions

  • This application relates to the computer field, and in particular to a method and device for sound recognition, a storage medium and an electronic device.
  • A baby's crying is relatively complex, and the information it conveys is relatively vague, such as hunger, tiredness, or loneliness.
  • In the related art, the recognition of baby crying is based on human experience; human experience is often inconsistent, and subjective judgment easily leads to recognition errors.
  • The embodiments of the present application provide a sound recognition method and device, a storage medium, and an electronic device, so as to at least solve the problem in the related art that a baby's cry can only be recognized based on human experience, which easily leads to recognition errors.
  • A sound recognition method is provided, including: collecting sound information emitted by a target object; judging whether the collected sound information emitted by the target object is crying information; if the judgment result is yes, inputting the sound information into a pre-trained sound model, where the sound model is obtained by training an initial sound model on a training set composed of multiple pieces of crying information, and the sound model includes a first-level model and a second-level model; the first-level model is used to identify the demand type of the sound information that characterizes the target object's need, and the second-level model is used to identify the demand state of the sound information within that demand type; and identifying, through the first-level model and the second-level model, the specific need of the target object corresponding to the sound information.
  • A sound recognition device is provided, which includes: a collection module configured to collect sound information emitted by a target object; a judgment module configured to determine whether the collected sound information emitted by the target object is crying information; an input module configured to input the sound information into a pre-trained sound model if the judgment result is yes, where the sound model is obtained by training an initial sound model on a training set composed of multiple pieces of crying information, and the sound model includes a first-level model and a second-level model, the first-level model being used to identify the demand type of the sound information that characterizes the target object's need and the second-level model being used to identify the demand state of the sound information within that demand type; and a recognition module configured to identify, through the first-level model and the second-level model, the specific need of the target object corresponding to the sound information.
  • A computer-readable storage medium is provided, in which a computer program is stored, and the computer program is configured to execute the steps in any of the above method embodiments when run.
  • An electronic device is provided, including a memory and a processor; the memory stores a computer program, and the processor is configured to run the computer program to execute the steps in any of the above method embodiments.
  • When it is determined that the collected sound information emitted by the target object is crying information, the demand type of the sound information and the demand state under that demand type can further be identified by the first-level model and the second-level model in the sound model. The current demand state of the target object can thus be identified from the crying information by the sound model, instead of judging the demand state represented by the cry from human experience. This solves the problem in the related art that recognizing a baby's cry only from human experience easily leads to recognition errors, and improves the accuracy of recognizing the demand state represented by the cry.
  • FIG. 1 is a block diagram of the hardware structure of a terminal for a sound recognition method according to an embodiment of the present application;
  • FIG. 2 is a flowchart of a sound recognition method according to an embodiment of the present application;
  • FIG. 3 is a schematic diagram of a hierarchical UBM-GMM model according to an embodiment of the present application;
  • FIG. 4 is a schematic diagram of the training process of a UBM-GMM model according to an embodiment of the present application;
  • FIG. 5 is a structural block diagram of a sound recognition device according to an embodiment of the present application;
  • FIG. 6 is a first optional structural block diagram of a sound recognition device according to an embodiment of the present application;
  • FIG. 7 is a second optional structural block diagram of a sound recognition device according to an embodiment of the present application.
  • FIG. 1 is a block diagram of the hardware structure of a terminal for a sound recognition method according to an embodiment of the present application.
  • As shown in FIG. 1, the terminal may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 configured to store data.
  • Optionally, the terminal may also include a transmission device 106 and an input/output device 108 configured for communication functions.
  • The terminal may include more or fewer components than shown in FIG. 1, or have a configuration different from that shown in FIG. 1.
  • The memory 104 may be configured to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the sound recognition method in the embodiments of the present application. The processor 102 runs the computer programs stored in the memory 104 to perform various functional applications and data processing, that is, to implement the above method.
  • the memory 104 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • The memory 104 may further include memories provided remotely relative to the processor 102, and these remote memories may be connected to the terminal through a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
  • the transmission device 106 is configured to receive or transmit data via a network.
  • the aforementioned specific examples of the network may include a wireless network or a wired network provided by the communication provider of the terminal.
  • the transmission device 106 includes a network adapter (Network Interface Controller, NIC for short), which can be connected to other network devices through a base station to communicate with the Internet.
  • the transmission device 106 may be a radio frequency (Radio Frequency, referred to as RF) module, which is configured to communicate with the Internet in a wireless manner.
  • Whether the transmission device 106 is used for the method steps in this application depends on the solution of this application itself. For example, if the method steps of this application are interactive, the transmission device 106 is needed; if all the method steps in this application can be executed inside the aforementioned terminal, the transmission device 106 is not needed.
  • FIG. 2 is a flowchart of a sound recognition method according to an embodiment of the present application. As shown in FIG. 2, the process includes the following steps:
  • Step S202: collect sound information emitted by the target object;
  • Step S204: determine whether the collected sound information emitted by the target object is crying information;
  • Step S206: if the judgment result is yes, input the sound information into a pre-trained sound model, where the sound model is obtained by training an initial sound model on a training set composed of multiple pieces of crying information, and the pre-trained sound model includes a first-level model and a second-level model; the first-level model is used to identify the demand type of the sound information that characterizes the target object's need, and the second-level model is used to identify the demand state of the sound information within that demand type;
  • Step S208: identify, through the first-level model and the second-level model, the specific need of the target object corresponding to the sound information.
  • The pre-trained sound model is composed of multiple levels of models. It may be a two-level model (a first-level model and a second-level model), or it may be composed of three, four, or more levels of models. Correspondingly, the specific need that characterizes the target object's need may be identified directly by the first-level model and the second-level model in turn, or, based on the results identified in turn by the first-level model and the second-level model, it may be identified by a third-level model (for a three-level composition), or by a fourth-level model based on the results of the third-level model (for a four-level composition), and so on.
  • When the pre-trained sound model in step S206 includes only the first-level model and the second-level model, the demand state of the target object obtained by the second-level model is the specific need. In other implementations, the pre-trained sound model in step S206 may be composed of a three-level model, a four-level model, or models of other depths. When the pre-trained sound model is composed of three levels of models, the first-level model is used to identify the first demand type of the sound information that characterizes the target object's need, the second-level model is used to identify the second demand-state type of the sound information within the first demand type, and the third-level model is used to identify the specific demand state of the target object within the second demand-state type; the specific need of the target object is this demand state.
  • For example, when the pre-trained sound model is composed of three levels of models, the first-level model identifies the first demand type of the sound information that characterizes the target object's need, and the first demand type includes "physiological" and "non-physiological".
  • The second demand types of the second-level model corresponding to "physiological" are: physiological response, physiological need, and emotional need.
  • The demand states of the third-level model corresponding to "physiological response" are: hiccups, stomach ache, and other discomfort; the demand states corresponding to "physiological need" are: hungry, cold or hot, and sleepy; the demand states corresponding to "emotional need" are: scared and lonely.
  • The second demand types of the second-level model corresponding to "non-physiological" are: pain, poor breathing, and weakness.
  • The demand states of the third-level model corresponding to "pain" are: abdominal pain, headache, and so on; the demand state corresponding to "poor breathing" is: nasal congestion, and so on; the demand state corresponding to "weakness" is: weak and feeble.
  • Through steps S202 to S208, when it is determined that the collected sound information emitted by the target object is crying information, the demand type of the sound information and the demand state under that demand type can further be identified by the first-level model and the second-level model in the sound model. The current demand state of the target object can thus be identified from the crying information by the sound model, instead of judging the demand state represented by the cry from human experience, which solves the problem in the related art that recognizing a baby's cry only from human experience easily leads to recognition errors, and improves the accuracy of recognizing the demand state represented by the cry.
  • The target object involved in this application is preferably a baby; of course, it can also be a child of a few years old, or an animal. This application does not limit the specific object, and the corresponding setting can be made according to the actual situation.
  • In an optional implementation of this embodiment, the determination in step S204 of whether the collected sound information emitted by the target object is crying information may be implemented as follows:
  • Step S204-11: transcode the collected sound information into a specified format;
  • In the preferred implementation of this application, the specified format is the wav format and the audio sampling rate is 8000 Hz; of course, in other application scenarios the format can also be 3gp, aac, amr, caf, flac, mp3, ogg, aiff, and so on, and any of the following sampling rates (in Hz) can be selected: 8000, 11025, 12000, 16000, 22050, 24000, 32000, 40000, 44100, 47250, 48000, etc.; no specific limitation is made here.
  • The input audio needs a unified format (by transcoding) and a unified sampling rate so that actual use is more convenient and efficient. If the input audio is not transcoded into a unified format, each format has to be read separately, which makes the operation cumbersome; and if the sampling rate is not unified, audio clips of the same length contain different amounts of data, which affects subsequent feature extraction and model training, so the audio must be preprocessed first. In current practice the input audio is converted to the wav format (other formats are possible as long as the audio data can be read) and the sampling rate is unified to 8000 Hz, although other sampling rates are also possible.
  • In this application, the preferred tool for transcoding the sound information is FFmpeg.
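As a rough illustration only (the patent does not prescribe any tooling beyond naming FFmpeg), a transcoding step like the one described above could be scripted by calling the FFmpeg command line from Python; the file layout and the mono downmix are assumptions:

```python
import subprocess
from pathlib import Path

def transcode_to_wav(src: str, dst_dir: str = "wav8k", sample_rate: int = 8000) -> Path:
    """Convert an arbitrary audio file to mono WAV at the target sample rate via FFmpeg."""
    dst = Path(dst_dir) / (Path(src).stem + ".wav")
    dst.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", str(sample_rate), str(dst)],
        check=True, capture_output=True,
    )
    return dst

# Example: transcode_to_wav("cry_sample.3gp") would write wav8k/cry_sample.wav
```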
  • Step S204-12: segment the audio of the transcoded sound information and extract spectral features from each audio segment, where two adjacent segments partially overlap;
  • In actual use the length of the audio uploaded by the user is not uniform, so it is preferable to convert variable-length audio into fixed-length audio. If variable-length input audio is converted directly into fixed-length audio by methods such as interpolation, a lot of information in the audio itself is lost. Step S204-12 instead uses segmentation with overlap between segments, which preserves the complete audio information while retaining the correlation between segments. In practice the input audio is segmented with, for example, a segment length of 3 seconds and an overlap of 1 second between adjacent segments; a segment length of 4 seconds with an overlap of 1.5 seconds, or a segment length of 5 seconds with an overlap of 2 seconds, and so on, are also possible and can be set according to the actual situation (see the sketch below).
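A minimal sketch of this overlapping segmentation, assuming the waveform is already an 8000 Hz mono NumPy array (the 3 s segment length and 1 s overlap are just the example values from the text):

```python
import numpy as np

def segment_audio(samples: np.ndarray, sr: int = 8000,
                  seg_sec: float = 3.0, overlap_sec: float = 1.0) -> list:
    """Split a 1-D waveform into fixed-length segments with overlap between neighbours."""
    seg_len = int(seg_sec * sr)
    hop = seg_len - int(overlap_sec * sr)   # stride between segment starts (2 s here)
    if len(samples) < seg_len:              # clip shorter than one segment: pad with zeros
        return [np.pad(samples, (0, seg_len - len(samples)))]
    return [samples[start:start + seg_len]
            for start in range(0, len(samples) - seg_len + 1, hop)]
```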
  • Step S204-13: detect the spectral features of each audio segment with a classification model to determine whether the sound information is crying information.
  • In the preferred implementation of this application, the features used are the Mel-frequency cepstral coefficients (MFCCs) and their first-order deltas; both are frequency-domain features of the audio.
  • To learn more features and obtain a better judgment of the sound information, a more preferred implementation uses the MFCCs together with their first-order and second-order deltas.
  • The preferred parameter ranges for extracting the MFCCs are: an analysis window length of 30 ms to 50 ms, an overlap of 10 ms to 20 ms between adjacent windows, and 20 to 40 mel filters.
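Purely as an illustrative sketch (the patent does not name a feature-extraction library), these features could be computed with librosa; the 13-coefficient count and the time-averaging into one vector per segment are assumptions, not requirements from the text:

```python
import librosa
import numpy as np

def mfcc_features(segment: np.ndarray, sr: int = 8000, win_ms: int = 50,
                  overlap_ms: int = 20, n_mels: int = 20, n_mfcc: int = 13) -> np.ndarray:
    """MFCCs plus their first- and second-order deltas for one audio segment."""
    win = int(sr * win_ms / 1000)                  # 50 ms analysis window
    hop = win - int(sr * overlap_ms / 1000)        # 20 ms overlap between adjacent windows
    mfcc = librosa.feature.mfcc(y=segment.astype(float), sr=sr, n_mfcc=n_mfcc,
                                n_fft=win, hop_length=hop, n_mels=n_mels)
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    # Stack the three feature sets and average over time to get one vector per segment.
    return np.concatenate([mfcc, d1, d2]).mean(axis=1)
```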
  • The classification model in this application may be a gradient boosting tree, a support vector machine, a multilayer perceptron, a statistical probability model and/or a deep learning model. In a preferred implementation of this application, the classification models are a gradient boosting tree, a support vector machine and a multilayer perceptron: the audio features are input into the three classifiers, each classifier produces its own classification result, the results are tallied, and the result occurring most often is taken as the detection result, i.e. whether or not the sound is the cry of the target object (a voting sketch is given below).
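A minimal sketch of the described majority vote over three already-fitted classifiers (for example an XGBClassifier, an SVC and an MLPClassifier); the names are placeholders:

```python
from collections import Counter
import numpy as np

def vote_is_cry(feature_vec: np.ndarray, classifiers) -> int:
    """Majority vote of several binary cry/non-cry classifiers (1 = cry, 0 = not cry)."""
    votes = [int(clf.predict(feature_vec.reshape(1, -1))[0]) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# classifiers = [xgb_model, svm_model, mlp_model]
```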
  • The classification models involved above need to be trained in advance. Therefore, in an optional implementation of this embodiment, before the sound information emitted by the target object is collected in step S202, the method of this example further includes:
  • Step S101: obtain a first data set, where the first data set includes multiple pieces of sound information that are crying information;
  • Step S102: extract the spectral features of the sound information in the first data set;
  • Step S103: select part of the data from the first data set as the training set of the initial classification model, and train the initial classification model based on the spectral features in the training set to determine the parameters of the classification model.
  • For steps S101 to S103, in a specific application scenario where the baby is the target object and the classification models are a gradient boosting tree, a support vector machine and a multilayer perceptron, the training process may be as follows. The first data set may be derived from the donateacry-corpus and other data sets, which contain 2467 clips of baby crying; the ESC-50 data set contains 50 classes of audio with 40 samples per class, one of which is baby crying, while the remaining 49 classes are non-crying audio, including animal calls, natural environment sounds, human voices, indoor sounds and urban noise. In total there are therefore 2507 baby-cry audio samples and 1960 non-cry samples. 20% of the data set is divided into the test set and 80% into the training set.
  • Then, the Mel-frequency cepstral coefficients and their first- and second-order deltas are extracted for each audio segment. The training set, with cross-validation, is used to train the gradient boosting tree (XGBoost), the support vector machine (SVM) and the multilayer perceptron (MLP) separately, to determine the best parameters of each classifier; on the test set, each sample is classified with the trained gradient boosting tree, support vector machine and multilayer perceptron, and the classification results of the three models are voted to produce the final classification result. The classification results on the test-set samples are used to evaluate the training effect of the models, as sketched below.
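An illustrative, non-authoritative sketch of this training stage with scikit-learn and XGBoost; the hyperparameter grids are placeholders, since the values of the patent's Table 1 are not reproduced here:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

def train_cry_detectors(X: np.ndarray, y: np.ndarray):
    """Fit XGBoost, SVM and MLP cry/non-cry detectors with cross-validated grids."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
    searches = {
        "xgb": GridSearchCV(XGBClassifier(eval_metric="logloss"),
                            {"n_estimators": [100, 300], "max_depth": [3, 6]}, cv=5),
        "svm": GridSearchCV(SVC(), {"C": [1, 10], "kernel": ["rbf"]}, cv=5),
        "mlp": GridSearchCV(MLPClassifier(max_iter=1000),
                            {"hidden_layer_sizes": [(64,), (128, 64)]}, cv=5),
    }
    models = {name: gs.fit(X_tr, y_tr).best_estimator_ for name, gs in searches.items()}
    # Majority vote on the held-out test set to gauge the ensemble.
    votes = np.stack([m.predict(X_te) for m in models.values()])
    ensemble_pred = (votes.sum(axis=0) >= 2).astype(int)
    print("ensemble accuracy:", (ensemble_pred == y_te).mean())
    return models
```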
  • the final model parameters are shown in Table 1:
  • The sound model also requires training; that is, before the sound information emitted by the target object is collected in step S202, the method of this embodiment further includes:
  • Step S111: obtain a second data set, where the sound information in the second data set is divided into sound information of multiple demand types, and each demand type includes sound information characterizing a demand state of the target object's need;
  • Step S112: extract the spectral features of the sound information in the second data set;
  • Step S113: select part of the data from the second data set as the training set of the initial sound model, and train the initial first-level model and the initial second-level model in the initial sound model based on the spectral features in the training set, so as to determine the parameters of the first-level model and the second-level model in the sound model.
  • In a specific application scenario, again taking a baby as the target object, the sound model is a hierarchical UBM-GMM. The second data set may come from the donateacry-corpus and other data sets and includes 2467 clips of baby crying, divided into 8 categories: hungry (740 clips), tired (468), lonely (232), needs to burp (161), stomach ache (268), cold or hot (115), scared (149), and other discomfort (334). 20% of the second data set is divided into the test set and 80% into the training set.
  • Figure 3 is a schematic diagram of the hierarchical UBM-GMM model according to an embodiment of the present application. Based on Figure 3, the hierarchical UBM-GMM is trained with the training set of the above second data set and cross-validation: UBM-GMM1 is trained first and divides the input audio into 3 major categories; for each major category, UBM-GMM2, UBM-GMM3 and UBM-GMM4 are trained to further classify the major categories into subcategories.
  • The reasons for using a hierarchical UBM-GMM are: (1) the amount of data in each category of the second data set varies greatly; if only a single UBM-GMM were used, categories with a large amount of data would be easy to identify while categories with little data would be hard to identify; the hierarchical approach, which first merges the subcategories into major categories, reduces the imbalance in the amount of data between categories and improves classification accuracy; (2) the reason for a baby's crying is not always single, and subdividing the major categories into subcategories helps obtain all possible factors that cause the baby to cry.
  • For the training process of each UBM-GMM model, as shown by the solid lines in Figure 4, a GMM is first trained with all the training data and is called the UBM; then the GMM is trained with the data of each category to obtain each category's model, CN-GMM; with this, the training process is complete.
  • The process of classifying new input data with a single UBM-GMM is shown by the dotted lines in Figure 4: the input features are fed into each category's GMM model, and together with the UBM model a maximum a posteriori estimate is made to obtain the input's score on each category model; the category with the highest score is the category the input belongs to. The parameters used to train each UBM-GMM model are shown in Table 2. A sketch of this training and scoring flow is given below.
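A rough sketch of a UBM-GMM training and scoring flow using scikit-learn's GaussianMixture. The patent does not name a library, the 16-component count is an assumption, and a full implementation would use MAP adaptation of the UBM rather than the simple re-fit from the UBM initialisation shown here:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm_gmm(train_feats: dict, n_components: int = 16):
    """Train a UBM on all data, then a per-category GMM initialised from the UBM.

    train_feats maps category name -> (n_samples, n_features) feature matrix.
    NOTE: re-fitting from the UBM initialisation is a simplification of true
    MAP adaptation, used here only to sketch the overall flow.
    """
    all_feats = np.vstack(list(train_feats.values()))
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=0).fit(all_feats)
    class_gmms = {}
    for name, feats in train_feats.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                              weights_init=ubm.weights_, means_init=ubm.means_,
                              precisions_init=ubm.precisions_, random_state=0)
        class_gmms[name] = gmm.fit(feats)
    return ubm, class_gmms

def score_ubm_gmm(feats: np.ndarray, ubm, class_gmms) -> str:
    """Score one segment's frames against each category GMM, normalised by the UBM."""
    scores = {name: gmm.score(feats) - ubm.score(feats)   # mean log-likelihood ratio
              for name, gmm in class_gmms.items()}
    return max(scores, key=scores.get)
```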
  • The identification in step S208 of the demand state of the target object corresponding to the sound information, through the first-level model and the second-level model, may be implemented as follows:
  • Step S208-11: input the spectral features of the sound information into the first-level model to obtain the probability values of the sound information belonging to each of the multiple demand types;
  • Step S208-12: select the demand type with the largest probability value from the probability values of the multiple demand types;
  • Step S208-13: input the spectral features of the sound information into the second-level model to obtain the probability values of the demand states corresponding to the selected demand type with the largest probability value;
  • Step S208-14: take the demand state with the largest probability value as the demand state of the sound information.
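A minimal sketch of this two-level decision; `predict_proba` here is an assumed wrapper that returns a dictionary of category probabilities for the corresponding UBM-GMM level, not the scikit-learn method of the same name:

```python
def predict_need(feature_vec, level1, level2_by_type):
    """Two-level decision: pick the demand type first, then the demand state within it."""
    type_probs = level1.predict_proba(feature_vec)              # {demand_type: probability}
    demand_type = max(type_probs, key=type_probs.get)           # step S208-12
    state_probs = level2_by_type[demand_type].predict_proba(feature_vec)
    demand_state = max(state_probs, key=state_probs.get)        # step S208-14
    return demand_type, demand_state
```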
  • In this embodiment the pre-trained model is a two-level model, so the first-level model is used to identify the demand type of the sound information that characterizes the target object's need, and the second-level model is used to identify the demand state of the sound information within that demand type; this demand state is the specific need of the target object. Therefore, step S208-11 inputs the spectral features of the sound information into the first-level model to obtain the probability values of the multiple demand types, and step S208-13 inputs the spectral features into the second-level model to obtain the probability values of the demand states corresponding to the demand type with the largest probability.
  • The demand types in this application are preferably physiological response, emotional need and physiological need; of course, other demand types, such as psychological response, can also be added according to the actual situation.
  • The physiological responses include hiccups, stomach ache, other discomfort, and so on; the physiological needs include hungry, cold/hot, sleepy, and so on; the emotional needs include scared, lonely, and so on. That is, this application uses a hierarchical approach that first divides the cries into major categories and then divides each major category into subcategories, so that when the models are trained, the sample data of all subcategories under the same major category can be combined as the training samples of the major-category model, while the sample data of each subcategory is used as the training samples of that subcategory's model.
  • Compared with directly training a model on the sample data of each subcategory, as in the prior art, the first-level and second-level models trained in this way avoid the inaccuracy caused by the imbalance between the amounts of training data of the subcategories, thereby improving recognition accuracy. In addition, because the reason for a baby's crying is not always single, first identifying the major category corresponding to the cry and then identifying the subcategory within that major category effectively obtains all possible factors (specific needs) behind the crying.
  • the source of the second data set in this specific embodiment is the data set donateacry-corpus and other data sets.
  • Demand type 1: physiological response, including 3 demand states: hiccups, stomach ache, and other discomfort;
  • Demand type 2: physiological need, including 3 demand states: hungry, cold or hot, and sleepy;
  • Demand type 3: emotional need, including 2 demand states: scared and lonely.
  • The multi-level UBM-GMM model means that the first-level UBM-GMM model divides the input sample into three major categories; then, according to that classification result, the second-level UBM-GMM model corresponding to the chosen category is selected to classify the input sample into a subcategory of that category.
  • The classification categories of the first-level UBM-GMM model are: physiological response, physiological need, and emotional need;
  • the classification categories of the second-level UBM-GMM model corresponding to the "physiological response" category are: hiccups, stomach ache, and other discomfort;
  • the classification categories of the second-level UBM-GMM model corresponding to the "physiological need" category are: hungry, cold or hot, and sleepy;
  • the classification categories of the second-level UBM-GMM model corresponding to the "emotional need" category are: scared and lonely.
  • The first-level UBM-GMM model is trained with the training-set features and its hyperparameters are tuned to the optimum; these hyperparameters include the number of mixture components of the first-level UBM and of each category GMM at the first level. Then, the training-set features of the corresponding categories are used to train the three second-level UBM-GMM models, and the related hyperparameters are likewise tuned to the optimum; these include the number of mixture components of the second-level UBM and of each category GMM at the second level.
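As an illustration of tuning the number of mixture components (the hyperparameter named above), a simple cross-validated log-likelihood sweep could look like the following; the candidate values are placeholders:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

def pick_n_components(feats: np.ndarray, candidates=(4, 8, 16, 32)) -> int:
    """Choose the GMM mixture size by average held-out log-likelihood over 5 folds."""
    scores = {}
    for n in candidates:
        fold_scores = []
        for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(feats):
            gmm = GaussianMixture(n_components=n, covariance_type="diag",
                                  random_state=0).fit(feats[train_idx])
            fold_scores.append(gmm.score(feats[val_idx]))
        scores[n] = np.mean(fold_scores)
    return max(scores, key=scores.get)
```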
  • The trained multi-level UBM-GMM model is then evaluated using the features extracted from the segmented test set.
  • The process is: for a complete test-set sample, the features of its audio segments are input into the trained multi-level UBM-GMM model to obtain the classification result of each segment; the classification results of all segments are then counted to obtain the probability of each category, and the category with the highest probability is the prediction for this complete test sample (see the aggregation sketch below).
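A small illustrative sketch of this segment-level aggregation; `classify_segment` stands for whatever per-segment classifier the model provides and is assumed to return a single category label:

```python
from collections import Counter

def classify_clip(segments, classify_segment) -> str:
    """Classify every segment of one clip, then report the most frequent category."""
    labels = [classify_segment(seg) for seg in segments]
    counts = Counter(labels)
    probs = {label: n / len(labels) for label, n in counts.items()}  # per-category frequency
    return max(probs, key=probs.get)
```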
  • the results show that using the multi-level UBM-GMM model can more accurately identify the audio to be tested.
  • The single-level UBM-GMM model is a traditional, commonly used model and serves as a comparative example. It refers to classifying the input samples into 8 categories with a single UBM-GMM model. The classification categories are: hungry, tired, lonely, hiccups, stomach ache, cold or hot, scared, and other discomfort.
  • Its hyperparameters include the number of mixture components of the UBM and of each category GMM.
  • The trained single-level UBM-GMM model is evaluated using the features extracted from the segmented test set. The process is: for a complete test-set sample, the features of its audio segments are input into the trained single-level UBM-GMM model to obtain the classification result of each segment; the classification results of all segments are counted to obtain the probability of each category, and the category with the highest probability is the prediction for this complete test sample. The resulting classification accuracy on the test set is 38%.
  • With the multi-level UBM-GMM model, the first-level UBM-GMM model first classifies the audio features of the test sample's segments and obtains the classification result of each segment. Suppose that after the first-level classification the result for an input test sample is: "physiological need" with probability 0.8 and "physiological response" with probability 0.2; the category of the input test sample is then "physiological need". Next, the second-level UBM-GMM model corresponding to "physiological need" classifies the sample, with the result: "hungry" with probability 0.8 and "sleepy" with probability 0.2; the final category of the input test sample is therefore "hungry". Using the single-level UBM-GMM model on the same sample, the classification result is: "hungry" 0.4, "scared" 0.2, "sleepy" 0.2, "stomach ache" 0.2; the final classification is likewise "hungry". It can be seen that the classification result of the multi-level UBM-GMM model is better than that of the single-level UBM-GMM model, because the probability of "hungry" in the multi-level model is higher.
  • In another example, the first-level UBM-GMM model of the multi-level model first classifies the audio features of the test sample's segments and obtains the classification result of each segment. Suppose the result for the input test sample is: "physiological response" with probability 0.8 and "physiological need" with probability 0.2; the category of the input test sample is then "physiological response". The second-level UBM-GMM model corresponding to "physiological response" then gives: "stomach ache" with probability 0.8 and "hiccups" with probability 0.2, so the final category of the test sample is "stomach ache". Using the single-level UBM-GMM model on the same sample, the result is: "sleepy" 0.4, "scared" 0.2, "hiccups" 0.2, "stomach ache" 0.2, and the final result is "sleepy". Since the test audio actually belongs to the "stomach ache" category, the hierarchical UBM-GMM model identifies it as "stomach ache" with a high probability of 0.8, whereas the single-level UBM-GMM model incorrectly classifies it as "sleepy".
  • The methods according to the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform; of course, they can also be implemented by hardware, but in many cases the former is the better implementation. Based on this, the technical solution of this application, in essence or in the part contributing to the existing technology, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disc) and includes several instructions that enable a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present application.
  • In this embodiment, a sound recognition device is also provided, which is used to implement the above embodiments and preferred implementations; what has already been described will not be repeated.
  • As used below, the term "module" may be a combination of software and/or hardware that implements predetermined functions. Although the devices described in the following embodiments are preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and conceivable.
  • FIG. 5 is a structural block diagram of a sound recognition device according to an embodiment of the present application.
  • As shown in FIG. 5, the device includes: a collection module 52, configured to collect sound information emitted by a target object; a judgment module 54, coupled to the collection module 52 and configured to determine whether the collected sound information of the target object is crying information; an input module 56, coupled to the judgment module 54 and configured to input the sound information into a pre-trained sound model if the judgment result is yes, where the sound model is obtained by training an initial sound model on a training set composed of multiple pieces of crying information, and the sound model includes a first-level model and a second-level model, the first-level model being used to identify the demand type of the sound information that characterizes the target object's need and the second-level model being used to identify the demand state of the sound information within that demand type; and a recognition module 58, coupled to the input module 56 and configured to identify, through the first-level model and the second-level model, the specific need of the target object corresponding to the sound information.
  • The target object involved in this application is preferably a baby; of course, it can also be a child of a few years old, or an animal. This application does not limit the specific object, and the corresponding setting can be made according to the actual situation.
  • Optionally, the judgment module 54 in this embodiment may further include: a transcoding unit configured to transcode the collected sound information into a specified format; a processing unit configured to segment the audio of the transcoded sound information and extract spectral features from each audio segment, where two adjacent segments partially overlap; and a judgment unit configured to detect the spectral features of each audio segment with the classification model to determine whether the sound information is crying information.
  • The specified format is preferably the wav format with an audio sampling rate of 8000 Hz; of course, in other application scenarios formats such as 3gp, aac, amr, caf, flac, mp3, ogg or aiff can also be used, with any of the following sampling rates (in Hz): 8000, 11025, 12000, 16000, 22050, 24000, 32000, 40000, 44100, 47250, 48000, etc.
  • The input audio is given a unified format (by transcoding) and a unified sampling rate mainly for convenience in actual use: without transcoding, each format would have to be read separately, which is cumbersome, and without a unified sampling rate, audio of the same length would contain different amounts of data, affecting subsequent feature extraction and model training; the audio must therefore be preprocessed first.
  • The input audio is converted to the wav format (or another readable format), and the sampling rate is unified to 8000 Hz, although other sampling rates are also possible.
  • In this application, the preferred tool for transcoding the sound information is FFmpeg.
  • The features used are preferably the Mel-frequency cepstral coefficients and their first- and second-order deltas, which are frequency-domain features of the audio.
  • FIG. 6 is a first optional structural block diagram of a sound recognition device according to an embodiment of the present application. As shown in FIG. 6, the device further includes: a first acquisition module 62, configured to acquire a first data set before the sound information emitted by the target object is collected, where the first data set includes multiple pieces of sound information that are crying information; a first extraction module 64, coupled to the first acquisition module 62 and configured to extract the spectral features of the sound information in the first data set; and a first training module 66, coupled to the first extraction module 64 and configured to select part of the data from the first data set as the training set of the initial classification model and to train the initial classification model based on the spectral features in the training set to determine the parameters of the classification model.
  • In a specific application scenario where babies are the target object and the classification models are a gradient boosting tree, a support vector machine and a multilayer perceptron, the specific training process may be as follows. The first data set may be derived from the donateacry-corpus and other data sets, which contain 2467 clips of baby crying; the ESC-50 data set contains 50 classes of audio with 40 samples per class, one of which is baby crying, while the remaining 49 classes are non-crying audio, including animal calls, natural environment sounds, human voices, indoor sounds and urban noise. In total there are 2507 baby-cry audio samples and 1960 non-cry samples; 20% of the data set is divided into the test set and 80% into the training set. The Mel-frequency cepstral coefficients and their first- and second-order deltas are extracted for each audio segment, and the training set, with cross-validation, is used to train the gradient boosting tree (XGBoost), the support vector machine (SVM) and the multilayer perceptron (MLP) separately to determine the best parameters of each classifier. On the test set each sample is classified with the trained gradient boosting tree, support vector machine and multilayer perceptron, the classification results of the three models are voted to produce the final classification result, and the classification results on the test-set samples are used to evaluate the training effect of the models.
  • FIG. 7 is a second optional structural block diagram of the sound recognition device according to an embodiment of the present application. As shown in FIG. 7, the device further includes: a second acquisition module 72, configured to acquire a second data set before the sound information emitted by the target object is collected, where the sound information in the second data set is divided into sound information of multiple demand types and each demand type includes sound information characterizing a demand state of the target object's need; a second extraction module 74, coupled to the second acquisition module 72 and configured to extract the spectral features of the sound information in the second data set; and a second training module 76, coupled to the second extraction module 74 and configured to select part of the data from the second data set as the training set of the initial sound model and to train the initial first-level model and the initial second-level model in the initial sound model based on the spectral features in the training set to determine the parameters of the first-level model and the second-level model in the sound model.
  • In a specific application scenario, again taking a baby as the target object, the sound model is a hierarchical UBM-GMM. The second data set may come from the donateacry-corpus and other data sets and includes 2467 clips of baby crying, divided into 8 categories: hungry (740 clips), tired (468), lonely (232), needs to burp (161), stomach ache (268), cold or hot (115), scared (149), and other discomfort (334). 20% of the second data set is divided into the test set and 80% into the training set.
  • The reasons for using a hierarchical UBM-GMM are: (1) the amount of data in each category of the second data set varies greatly; if only a single UBM-GMM were used, categories with a large amount of data would be easy to identify while categories with little data would be hard to identify; merging the demand states into demand types first, using the hierarchical method, reduces the imbalance in the amount of data between categories and improves classification accuracy; (2) the reason for a baby's crying is not always single, and subdividing the major categories into subcategories helps obtain all possible factors that cause the baby to cry.
  • For the training of each UBM-GMM model, as shown in Figure 4, a GMM is first trained with all the training data and is called the UBM; then the GMM is trained with the data of each category to obtain each category's model, CN-GMM; with this, the training process is complete.
  • Optionally, the recognition module 58 in this embodiment may further include: a first input unit, configured to input the spectral features of the sound information into the first-level model to obtain the probability values of the sound information belonging to the multiple demand types; a selection unit, configured to select the demand type with the largest probability value from the probability values of the multiple demand types; a second input unit, configured to input the spectral features of the sound information into the second-level model to obtain the probability values of the demand states corresponding to the selected demand type with the largest probability value; and an identification unit, configured to take the demand state with the largest probability value as the demand state of the sound information.
  • It should be noted that each of the above modules can be implemented by software or hardware. For the latter, this can be done in, but is not limited to, the following manner: the above modules are all located in the same processor, or the above modules are located in different processors in any combination.
  • the embodiments of the present application also provide a computer-readable storage medium in which a computer program is stored, wherein the computer program is configured to execute the steps in any one of the foregoing method embodiments when running.
  • Optionally, in this embodiment, the foregoing computer-readable storage medium may be configured to store a computer program for executing the following steps:
  • S1: collect sound information emitted by a target object;
  • S2: determine whether the collected sound information emitted by the target object is crying information;
  • S3: if so, input the sound information into a pre-trained sound model, where the sound model includes a first-level model and a second-level model;
  • S4: identify, through the first-level model and the second-level model, the specific need of the target object corresponding to the sound information.
  • Optionally, in this embodiment, the above computer-readable storage medium may include, but is not limited to: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store a computer program.
  • the embodiment of the present application also provides an electronic device, including a memory and a processor, the memory is stored with a computer program, and the processor is configured to run the computer program to execute the steps in any of the foregoing method embodiments.
  • the aforementioned electronic device may further include a transmission device and an input-output device, wherein the transmission device is connected to the aforementioned processor, and the input-output device is connected to the aforementioned processor.
  • Optionally, in this embodiment, the foregoing processor may be configured to execute the following steps through the computer program:
  • S1: collect sound information emitted by a target object;
  • S2: determine whether the collected sound information emitted by the target object is crying information;
  • S3: if so, input the sound information into a pre-trained sound model, where the sound model includes a first-level model and a second-level model;
  • S4: identify, through the first-level model and the second-level model, the specific need of the target object corresponding to the sound information.
  • The above modules or steps of this application can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network composed of multiple computing devices. They can be implemented with program code executable by the computing device, so that they can be stored in a storage device and executed by the computing device, and in some cases the steps can be executed in an order different from that described here.
  • As described above, the sound recognition method and device, storage medium, and electronic device provided by the embodiments of the present application have the following beneficial effect: they solve the problem in the related art that recognizing a baby's cry only from human experience easily leads to recognition errors, thereby improving the accuracy of recognizing the demand state represented by the cry.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

A sound recognition method and device, a storage medium, and an electronic device. The method includes: collecting sound information emitted by a target object (S202); determining whether the collected sound information emitted by the target object is crying information (S204); if so, inputting the sound information into a pre-trained sound model (S206), where the sound model includes a first-level model and a second-level model, the first-level model is used to identify the demand type of the sound information that characterizes the target object's need, and the second-level model is used to identify the demand state of the sound information within that demand type; and identifying, through the first-level model and the second-level model, the specific need of the target object corresponding to the sound information (S208). The method solves the problem in the related art that a baby's cry can only be recognized based on human experience, which easily leads to recognition errors.

Description

Sound recognition method and device, storage medium, and electronic device
Technical Field
This application relates to the field of computers, and in particular to a sound recognition method and device, a storage medium, and an electronic device.
Background
Crying is one of a baby's main means of expression, and correctly recognizing the cry and understanding the baby's needs is very important for raising the baby. The sense of security a newborn acquires in the first few months of life has a very important influence on its later life and is very likely to accompany and affect it for a lifetime. Therefore, being able to correctly recognize a baby's cry and satisfy its needs is more conducive to the baby's healthy growth.
Relatively speaking, crying is rather complex, and the information it conveys is rather vague, for example hunger, tiredness, or loneliness. Even for an experienced childcare professional it is not easy to distinguish, promptly and effectively, the needs contained in a baby's cry, let alone for young parents who have just become a mother or father. It can be seen that in the related art the recognition of baby crying relies on human experience; human experience is often inconsistent, and subjective judgment easily leads to recognition errors.
No effective solution has yet been proposed for the above problems in the related art.
Summary
The embodiments of the present application provide a sound recognition method and device, a storage medium, and an electronic device, so as to at least solve the problem in the related art that a baby's cry can only be recognized based on human experience, which easily leads to recognition errors.
According to one embodiment of this application, a sound recognition method is provided, including: collecting sound information emitted by a target object; determining whether the collected sound information emitted by the target object is crying information; if the judgment result is yes, inputting the sound information into a pre-trained sound model, where the sound model is obtained by training an initial sound model on a training set composed of multiple pieces of crying information, and the sound model includes a first-level model and a second-level model; the first-level model is used to identify the demand type of the sound information that characterizes the target object's need, and the second-level model is used to identify the demand state of the sound information within that demand type; and identifying, through the first-level model and the second-level model, the specific need of the target object corresponding to the sound information.
According to another embodiment of this application, a sound recognition device is provided, including: a collection module configured to collect sound information emitted by a target object; a judgment module configured to determine whether the collected sound information emitted by the target object is crying information; an input module configured to input the sound information into a pre-trained sound model if the judgment result is yes, where the sound model is obtained by training an initial sound model on a training set composed of multiple pieces of crying information, and the sound model includes a first-level model and a second-level model, the first-level model being used to identify the demand type of the sound information that characterizes the target object's need and the second-level model being used to identify the demand state of the sound information within that demand type; and a recognition module configured to identify, through the first-level model and the second-level model, the specific need of the target object corresponding to the sound information.
According to yet another embodiment of this application, a computer-readable storage medium is also provided, in which a computer program is stored, and the computer program is configured to execute the steps in any of the above method embodiments when run.
According to yet another embodiment of this application, an electronic device is also provided, including a memory and a processor; the memory stores a computer program, and the processor is configured to run the computer program so as to execute the steps in any of the above method embodiments.
Through this application, when it is determined that the collected sound information emitted by the target object is crying information, the demand type of the sound information and the demand state under that demand type can further be identified by the first-level model and the second-level model in the sound model, so that the current demand state of the target object can be identified from the crying information by the sound model instead of judging the demand state represented by the cry from human experience. This solves the problem in the related art that recognizing a baby's cry only from human experience easily leads to recognition errors, and improves the accuracy of recognizing the demand state represented by the cry.
Brief Description of the Drawings
FIG. 1 is a block diagram of the hardware structure of a terminal for a sound recognition method according to an embodiment of the present application;
FIG. 2 is a flowchart of a sound recognition method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a hierarchical UBM-GMM model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the training process of a UBM-GMM model according to an embodiment of the present application;
FIG. 5 is a structural block diagram of a sound recognition device according to an embodiment of the present application;
FIG. 6 is a first optional structural block diagram of a sound recognition device according to an embodiment of the present application;
FIG. 7 is a second optional structural block diagram of a sound recognition device according to an embodiment of the present application.
Detailed Description
The present application will be described in detail below with reference to the drawings and in conjunction with the embodiments. It should be noted that, where there is no conflict, the embodiments in this application and the features in the embodiments can be combined with each other.
It should be noted that the terms "first", "second", and so on in the specification, the claims, and the above drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific sequence or order.
The method embodiments provided in this application can be executed in a terminal, a computer terminal, or a similar computing device. Taking running on a terminal as an example, FIG. 1 is a block diagram of the hardware structure of a terminal for a sound recognition method according to an embodiment of the present application. As shown in FIG. 1, the terminal may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 configured to store data. Optionally, the terminal may also include a transmission device 106 and an input/output device 108 configured for communication functions. A person of ordinary skill in the art can understand that the structure shown in FIG. 1 is only schematic and does not limit the structure of the above terminal. For example, the terminal may also include more or fewer components than shown in FIG. 1, or have a configuration different from that shown in FIG. 1.
The memory 104 may be configured to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the sound recognition method in the embodiments of the present application. The processor 102 runs the computer programs stored in the memory 104 to perform various functional applications and data processing, that is, to implement the above method. The memory 104 may include a high-speed random access memory and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memories provided remotely relative to the processor 102, and these remote memories may be connected to the terminal through a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is configured to receive or send data via a network. Specific examples of the above network may include a wireless or wired network provided by the communication provider of the terminal. In one example, the transmission device 106 includes a network interface controller (NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In one example, the transmission device 106 may be a radio frequency (RF) module, which is configured to communicate with the Internet wirelessly.
In addition, it should be noted that whether the above transmission device 106 is used for the method steps in this application depends on the solution of this application itself. For example, if the method steps of this application are interactive, the transmission device 106 is needed; if all the method steps in this application can be executed inside the above terminal, the transmission device 106 is not needed.
This embodiment provides a sound recognition method running on the above terminal. FIG. 2 is a flowchart of a sound recognition method according to an embodiment of the present application. As shown in FIG. 2, the process includes the following steps:
Step S202: collect sound information emitted by the target object;
Step S204: determine whether the collected sound information emitted by the target object is crying information;
Step S206: if the judgment result is yes, input the sound information into a pre-trained sound model, where the sound model is obtained by training an initial sound model on a training set composed of multiple pieces of crying information, and the pre-trained sound model includes a first-level model and a second-level model; the first-level model is used to identify the demand type of the sound information that characterizes the target object's need, and the second-level model is used to identify the demand state of the sound information within that demand type;
Step S208: identify, through the first-level model and the second-level model, the specific need of the target object corresponding to the sound information.
It should be noted that in this application the pre-trained sound model is composed of multiple levels of models. It may be a two-level model (a first-level model and a second-level model), or it may be composed of three, four, or more levels of models. Correspondingly, the specific need that characterizes the target object's need may be identified directly by the first-level model and the second-level model in turn, or, based on the results obtained in turn by the first-level model and the second-level model, it may be identified by a third-level model (for a three-level composition), or by a fourth-level model based on the results of the third-level model (for a four-level composition), and so on.
In one implementation of this application, when the pre-trained sound model in step S206 includes only the first-level model and the second-level model, the demand state of the target object obtained by the second-level model is the specific need of the target object. In other implementations, the pre-trained sound model in step S206 may be composed of a three-level model, a four-level model, or models of other depths. When the pre-trained sound model is composed of three levels of models, the first-level model is used to identify the first demand type of the sound information that characterizes the target object's need, the second-level model is used to identify the second demand-state type of the sound information within the first demand type, and the third-level model is used to identify the specific demand state of the target object within the second demand-state type of the sound information; the specific need of the target object is this demand state.
For example, when the pre-trained sound model is composed of three levels of models, the first-level model identifies the first demand type of the sound information that characterizes the target object's need, and the first demand type includes "physiological" and "non-physiological". The second demand types of the second-level model corresponding to "physiological" are: physiological response, physiological need, and emotional need; the demand states of the third-level model corresponding to "physiological response" are: hiccups, stomach ache, and other discomfort; the demand states of the third-level model corresponding to "physiological need" are: hungry, cold or hot, and sleepy; the demand states of the third-level model corresponding to "emotional need" are: scared and lonely. The second demand types of the second-level model corresponding to "non-physiological" are: pain, poor breathing, weakness, and the like; the demand states of the third-level model corresponding to "pain" are: abdominal pain, headache, and so on; the demand state corresponding to "poor breathing" is: nasal congestion, and so on; the demand state corresponding to "weakness" is: weak and feeble.
Through the above steps S202 to S208, when it is determined that the collected sound information emitted by the target object is crying information, the demand type of the sound information and the demand state under that demand type can further be identified by the first-level model and the second-level model in the sound model, so that the current demand state of the target object can be identified from the crying information by the sound model instead of judging the demand state represented by the cry from human experience. This solves the problem in the related art that recognizing a baby's cry only from human experience easily leads to recognition errors, and improves the accuracy of recognizing the demand state represented by the cry.
It should be noted that the target object involved in this application is preferably a baby; of course, it can also be a child of a few years old, or an animal. This application does not limit the specific object, and the corresponding setting can be made according to the actual situation.
在本实施例的可选的实施方式中,对于本申请步骤S204中涉及到的判断采集到的目标对象发出的声音信息是否为哭声信息的方式,可以是通过如下方式来实现:
步骤S204-11,将采集到的声音信息转码为指定格式;
其中,在本申请的优选方式中该指定格式优选为wav格式,音频采样率均为8000Hz;当然在其他应用场景中也可以是3gp,aac,amr,caf,flac,mp3,ogg,aiff等格式,基于此,可选择以下采样频率(单位Hz)8000,11025,12000,16000,22050,24000,32000,40000,44100,47250,48000等,具体在此不做限定。
需要说明的是,对输入音频(声音信息)统一格式(转码)并统一采样频率,是为了使实际使用过程更加方便快捷;如果不将输入音频转码为统一格式,就需要对每一种格式分别实现读取,操作繁琐;如果不统一采样频率,相同长度的音频会包含不同数量的数据,会影响后续的特征提取和模型训练,所以要先对音频进行预处理。在目前的实际使用中,将输入音频转换至wav格式,也可以转换成其他格式,只要能读取到音频数据即可,音频采样频率统一为8000Hz,当然也可以是其他采样频率。另外,在本申请中对声音信息进行转码的工具优选为FFMpeg。
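为便于理解上述转码预处理过程,下面给出一个借助FFMpeg进行统一转码的Python示意代码,其中的函数名、文件路径等均为本文为举例所假设,并非本申请的实际实现,仅为一种可行的参考写法:

import subprocess
from pathlib import Path

def transcode_to_wav(src_path, dst_dir, sample_rate=8000):
    # 调用FFMpeg将任意格式的输入音频统一转码为单声道、指定采样率的wav文件
    dst_path = str(Path(dst_dir) / (Path(src_path).stem + ".wav"))
    subprocess.run(
        ["ffmpeg", "-y",           # 覆盖已存在的输出文件
         "-i", src_path,           # 输入音频,mp3、aac、3gp等格式均可
         "-ac", "1",               # 统一为单声道
         "-ar", str(sample_rate),  # 统一采样率,例如8000Hz
         dst_path],
        check=True)
    return dst_path

# 用法示例(路径仅为示意):transcode_to_wav("cry_sample.mp3", "./wav_8k")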
步骤S204-12,对转码后的声音信息的音频进行分段,并从每一段音频中提取出频谱特征;其中,相邻两段音频相互重叠部分音频;
其中,需要说明的是,在实际使用过程中,由于用户上传的音频长度是不统一的,优选将不定长的音频转化为定长音频。若直接将不定长的输入音频通过例如插值等方法,转换为定长音频,会丢失很多音频本身的信息;通过上述步骤S204-12的方式使用分段,且分段间有重叠,既可保留音频完整的信息,又保留了分段间的关联性。实际使用中,对输入音频进行分段,例如分段长度为3秒,相邻两段音频重叠1秒。当然也可以是分段长度为4秒,相邻两段音频重叠1.5秒,或者分段长度为5秒,相邻两段音频重叠2秒等等,可以根据实际情况进行相应的设置。
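下面给出一个对音频按固定长度分段并保留重叠的Python示意代码,分段长度与重叠长度仅以3秒、1秒为例,函数名与变量名为本文为说明目的所假设:

import numpy as np

def segment_audio(samples, sr=8000, seg_len_s=3.0, overlap_s=1.0):
    # 将不定长音频按固定长度分段,相邻分段之间保留重叠
    seg_len = int(seg_len_s * sr)            # 每段的采样点数,例如3秒
    hop = int((seg_len_s - overlap_s) * sr)  # 相邻分段起点的间隔,例如2秒
    segments = []
    for start in range(0, max(len(samples) - seg_len, 0) + 1, hop):
        segments.append(samples[start:start + seg_len])
    return segments  # 返回各分段组成的列表

# 用法示例:segments = segment_audio(np.asarray(wav_samples))  # wav_samples为已读入的音频采样序列(假设)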
步骤S204-13,通过分类模型对每一段音频的频谱特征进行检测以判断声音信息是否为哭声信息。
其中,在本申请的优选实施方式中,使用的特征优选为梅尔频率倒谱系数及梅尔频率倒谱系数的一阶梯度,这两个特征属于音频的频率特征。为了学习更多的特征、取得更好的判断效果,本申请在更优选的实施方式中,使用的特征为梅尔频率倒谱系数及梅尔频率倒谱系数的一阶梯度和二阶梯度。
下面对梅尔频率倒谱系数的计算过程进行介绍:1)对输入音频加窗(例如加窗长度为50毫秒),相邻窗之间有叠加(例如叠加长度为20毫秒);2)对每个窗的音频信号进行傅里叶变换,得到频率谱;3)对每个窗的频率谱,使用若干个梅尔滤波器(比如使用20个梅尔滤波器),获得梅尔刻度(那么获得20个梅尔刻度);4)对每个梅尔刻度取对数,获得能量;5)对每个梅尔刻度对数能量做离散傅里叶反变换(或离散余弦反变换),得到倒频谱;6)得到的若干个倒频谱(20个,与使用的梅尔滤波器个数相同)的幅值即为梅尔频率倒谱系数。然后计算梅尔频率倒谱系数的一阶梯度和二阶梯度。
其中,提取梅尔频率倒谱系数的相关参数范围:音频加窗长度范围优选为30毫秒至50毫秒;相邻窗叠加长度范围优选为10毫秒至20毫秒;使用的梅尔滤波器个数优选为20至40个。
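作为参考,下面给出一个使用开源库librosa提取梅尔频率倒谱系数及其一阶、二阶梯度的Python示意代码,其中加窗长度、叠加长度、梅尔滤波器个数等参数取值仅为上述范围内的一组示例,函数名为本文假设,并非本申请的实际实现:

import numpy as np
import librosa

def extract_mfcc_features(wav_path, sr=8000, win_ms=50, overlap_ms=20, n_mels=20):
    # 提取梅尔频率倒谱系数及其一阶、二阶梯度,并在特征维度上拼接
    y, _ = librosa.load(wav_path, sr=sr)
    win_length = int(sr * win_ms / 1000)                   # 加窗长度,例如50毫秒
    hop_length = win_length - int(sr * overlap_ms / 1000)  # 相邻窗叠加20毫秒
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mels, n_fft=win_length,
                                win_length=win_length, hop_length=hop_length,
                                n_mels=n_mels)             # 梅尔频率倒谱系数
    d1 = librosa.feature.delta(mfcc, order=1)              # 一阶梯度
    d2 = librosa.feature.delta(mfcc, order=2)              # 二阶梯度
    return np.concatenate([mfcc, d1, d2], axis=0)          # 形状为(3*n_mels, 帧数)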
因此,对于上述步骤S204-13中的分类模型,本申请中的分类模型可以是梯度提升树、支持向量机、多层感知机、统计概率模型和/或深度学习模型,本申请的一个优选实施方式中,分类模型为梯度提升树、支持向量机和多层感知机,即将音频特征分别输入该三个分类器,该三个分类器分别判断获得各自的分类结果,再统计各分类结果,并将数量最多的相同结果作为最终的检测结果,即是或不是目标对象的哭声。
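下面给出一个将梯度提升树、支持向量机和多层感知机的分类结果按多数投票合并的Python示意代码,假定三个分类器均已在训练集上拟合完毕,标签1表示哭声、0表示非哭声,相关函数名与变量名均为本文为说明目的所假设:

from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

classifiers = [XGBClassifier(),   # 梯度提升树
               SVC(),             # 支持向量机
               MLPClassifier()]   # 多层感知机
# 假定上述三个分类器均已调用fit(X_train, y_train)在训练集上拟合完毕

def is_cry(feature_vec, clfs):
    # feature_vec为一段音频的频谱特征(已展平为一维numpy向量)
    votes = [int(clf.predict(feature_vec.reshape(1, -1))[0]) for clf in clfs]
    return sum(votes) >= 2        # 多数投票:至少两个分类器判为哭声(1)即认定为哭声

# 用法示例:is_cry(features, classifiers) 返回True或False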
上述本申请中涉及到的分类模型是需要预先训练的,因此,在本实施例的可选实施方式,在步骤S202采集目标对象发出的声音信息之前,本示例的方法还包括:
步骤S101,获取第一数据集,其中,第一数据集中包括多个为哭声信息的声音信息;
步骤S102,提取第一数据集中声音信息的频谱特征;
步骤S103,从第一数据集中选择部分数据作为初始分类模型的训练集,并基于训练集中的频谱特征对初始统计概率模型进行训练以确定分类模型的参数。
对于上述步骤S101至步骤S103,在具体的应用场景中以婴儿为目标对象,分类模型为梯度提升树、支持向量机和多层感知机,则具体的训练过程可以是:
第一数据集:第一数据集可来源于数据集donateacry-corpus等其他数据集,有2467段宝宝哭声音频;数据集ESC-50,包含50类音频,每一类音频均含有40个样本,50类中有一类为宝宝哭声,其余49类为非宝宝哭声音频,包括的类别有动物叫声、自然环境声、人声、室内声及城市噪音;因此,宝宝哭声音频样本共有2507段,非宝宝哭声样本共有1960段。将数据集20%划分为测试集,80%划分为训练集。
进而,通过上述步骤S204-13,对每段音频提取梅尔频率倒谱系数及其一阶、二阶梯度特征;使用训练集并结合交叉验证,分别训练梯度提升树(XGBoost)、支持向量机(SVM)及多层感知机(MLP),确定分类器模型最佳参数;使用测试集,对某一样本分别使用训练好的梯度提升树、支持向量机及多层感知机进行分类,三个模型的分类结果投票产生最终分类结果;统计测试集样本分类结果,用于评价模型的训练效果,最后确定的模型参数如表1所示:
Figure PCTCN2020087072-appb-000001
表1
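作为上述“使用训练集并结合交叉验证确定分类器最佳参数”的一个示意,下面以支持向量机为例给出基于网格搜索的交叉验证调参Python代码,参数网格中的取值仅为举例,实际取值应结合表1及具体数据确定,变量名为本文假设:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# 以支持向量机为例:在训练集上结合5折交叉验证搜索较优超参数
param_grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"], "gamma": ["scale", "auto"]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
# search.fit(X_train, y_train)        # X_train为训练集特征,y_train为是否哭声的标签
# best_svm = search.best_estimator_   # 取出交叉验证得分最高的模型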
此外,在本实施例的另一个可选实施方式中,对于声音模型也是需要训练的,即在步骤S202采集目标对象发出的声音信息之前,本实施例的方法还包括:
步骤S111,获取第二数据集;其中,第二数据集中的声音信息被划分为多个需求类型的声音信息;每个需求类型中包括用于表征目标对象需求的需求状态的声音信息;
步骤S112,提取第二数据集中声音信息的频谱特征;
步骤S113,从第二数据集中选择部分数据作为初始声音模型的训练集,并基于训练集中的频谱特征对初始声音模型中的初始第一级模型和初始第二级模型进行训练以确定声音模型中第一级模型和第二级模型的参数。
在具体的应用场景中仍以婴儿为目标对象为例,声音模型为分级的UBM-GMM,则上述步骤S111至步骤S113在具体应用场景中可以是:
该第二数据集的来源可以是数据集donateacry-corpus等其他数据集,包括:2467段宝宝哭声音频,分为8类,分别是饿了740段、累了468段、孤独232段、要打嗝161段、肚子痛268段、冷了或热了115段、害怕149段及其他不舒服334段。其中,将该第二数据集中的20%划分为测试集,80%划分为训练集。
进而通过上述步骤S204-13,对每段音频提取梅尔频率倒谱系数及其一阶、二阶梯度特征;
图3是根据本申请实施例的分级的UBM-GMM模型示意图,基于图3,使用上述第二数据集中的训练集并结合交叉验证,训练分级的UBM-GMM:首先训练UBM-GMM1,将输入音频分为3个大类,对于每个大类,训练UBM-GMM2、UBM-GMM3及UBM-GMM4,再将大类分类成小类。根据宝宝不同的需求,将哭声分为三个需求类型大类,分别是“生理反应”“生理需求”及“情感需求”;再将三个需求类型分成若干需求状态小类,生理反应:打嗝、肚子痛、其他不舒服;生理需求:饿了、困了、冷了热了;情感需求:害怕、孤单。
使用分级的UBM-GMM的原因是:(1)第二数据集中各类别数据量差异大;若只使用单个UBM-GMM,会造成数据量多的类别很容易被识别,但数据量少的类别却很难被识别;使用分级的方法,将小类合并成大类,首先就降低了类别间数据量的不均衡性,提升了分类的准确率;(2)婴儿哭的原因并不总是单一的,在大的类别中再分小类,有利于获得造成婴儿哭的所有可能的因素。
对每一个UBM-GMM模型的训练过程,如图4中的实线部分,首先使用所有训练数据训练一个GMM,称为UBM;然后,分别使用每个类别的数据训练GMM,获得每个类别的模型CN-GMM;这样,训练过程就完成了。使用单个UBM-GMM对新的输入数据的分类过程如图4中虚线所示,首先将此输入的特征分别输入到各类别GMM模型中,同时结合UBM模型做最大后验概率估计,获得输入在每个类别模型上的得分,得分最大的类别即为输入所属的类别;训练每个UBM-GMM模型的参数如表2所示:
Figure PCTCN2020087072-appb-000002
表2
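为便于理解单个UBM-GMM模型的训练与打分过程,下面给出一个基于sklearn中GaussianMixture的简化Python示意代码。需要说明的是,该代码以“用UBM参数初始化各类别GMM后再分别拟合”近似代替严格的最大后验概率(MAP)自适应,混合成分数量等超参数取值仅为举例,实际应以表2及交叉验证结果为准,并非本申请的实际实现:

import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm_gmm(features_by_class, n_components=16):
    # features_by_class形如{类别名: 该类别所有帧特征组成的(帧数, 特征维度)数组}
    all_feats = np.vstack(list(features_by_class.values()))
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type="diag").fit(all_feats)   # 用全部训练数据训练UBM
    class_gmms = {}
    for name, feats in features_by_class.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                              weights_init=ubm.weights_,
                              means_init=ubm.means_)               # 以UBM参数作为初值
        class_gmms[name] = gmm.fit(feats)                          # 用该类别数据拟合CN-GMM
    return ubm, class_gmms

def classify(feats, ubm, class_gmms):
    # 以“类别GMM平均对数似然减去UBM平均对数似然”作为该类别得分,得分最大者为预测类别
    scores = {name: gmm.score(feats) - ubm.score(feats) for name, gmm in class_gmms.items()}
    return max(scores, key=scores.get), scores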
在本实施例的再一个可选实施方式中,步骤S208中涉及到的通过第一级模型和第二级模型识别出与声音信息对应的用于表征目标对象需求的需求状态的方式,可以通过如下方式来实现:
步骤S208-11,将声音信息的频谱特征输入到第一级模型中,得到声音信息分别为多个需求类型的概率值;
步骤S208-12,从多个需求类型的概率值中选择出概率值最大的需求类型;
步骤S208-13,将声音信息的频谱特征输入到第二级模型中,得到与选择出的概率值最大的需求类型对应的需求状态的概率值;
步骤S208-14,将概率值最大的需求状态作为声音信息的需求状态。
本实施例中该预先训练的模型为两级模型,那么第一级模型用于识别出声音信息的用于表征目标对象需求的需求类型,第二级模型用于识别出声音信息在该需求类型中的需求状态,其中,这里的需求状态即为目标对象的具体需求。因此上述步骤S208-11为将声音信息的频谱特征输入第一级模型中,得到声音信息分别为多个需求类型的概率值;而步骤S208-13为将声音信息的频谱特征输入第二级模型中,得到与选择出的概率值最大的需求类型对应的需求状态的概率值。
此外,在本申请中需求类型优选为生理反应、情感需求、生理需求;当然也可以根据实际情况增加其他需求类型,例如心理反应等等。而生理反应的需求状态包括:打嗝、肚子疼、不舒服等等;生理需求包括:饿了、冷了/热了、困了等等;情感需求:害怕、孤单等等。也就是说,在本申请中用分级的方法,先将哭声分成大类,再把各个大类分别分成各个小类,这样,相应地,在模型训练时,同一大类下的各小类的样本数据可合并作为该大类的训练模型训练样本,各小类的样本数据作为该小类的训练模型样本,通过这种方法训练出来的第一级模型和第二级模型,与现有技术中直接以各个小类的样本数据进行模型训练所获得的模型相比,能够避免因为各小类训练的样本数据量之间的不均衡性而导致的识别不准确的问题,从而提升了识别的准确率;另外,因为宝宝哭的原因并不总是单一的,因此通过先识别出宝宝哭声对应的大类,再从大类中识别出小类,能够有效地获得宝宝哭的所有可能的因素(具体需求)。
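下面给出一个两级识别流程的Python示意代码:先由第一级各需求类型模型打分选出得分(概率)最大的需求类型,再由该需求类型对应的第二级模型在其需求状态中选出得分最大者作为具体需求。其中的打分方式采用“类别GMM平均对数似然减去UBM平均对数似然”的简化做法,模型对象、函数名均为本文假设,并非本申请的实际实现:

def gmm_score(feats, ubm, gmm):
    # 简化打分:类别GMM的平均对数似然减去UBM的平均对数似然
    return gmm.score(feats) - ubm.score(feats)

def hierarchical_predict(feats, ubm1, type_gmms, level2_models):
    # type_gmms形如{需求类型: 该类型的GMM}
    # level2_models形如{需求类型: (该类型对应的第二级UBM, {需求状态: GMM})}
    need_type = max(type_gmms,
                    key=lambda t: gmm_score(feats, ubm1, type_gmms[t]))      # 第一级:需求类型
    ubm2, state_gmms = level2_models[need_type]
    need_state = max(state_gmms,
                     key=lambda s: gmm_score(feats, ubm2, state_gmms[s]))    # 第二级:需求状态
    return need_type, need_state  # 需求状态即为目标对象的具体需求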
下面结合本申请的具体实施方式,对本申请进行举例说明:
1),进行数据集预处理:
本具体实施方式中的第二数据集的来源为数据集donateacry-corpus等其他数据集,有2467段宝宝哭声音频,分为3类需求类型,8类需求状态:
需求类型一:生理反应,包含打嗝、肚子痛、其他不舒服等3种需求状态;
需求类型二:生理需求,包含饿了、冷了或热了、困了等3种需求状态;
需求类型三:情感需求,包含害怕、孤单等2种需求状态。
将第二数据集20%的样本划分为测试集,80%的样本划分为训练集;接着,将训练集中的音频样本转码为8000Hz的wav格式音频;将转码后的音频以长度3秒重叠1秒的方式进行分段,对每个音频分段提取梅尔频率倒谱系数及其一阶、二阶梯度特征,使用训练集分段后提取的特征训练单级的UBM-GMM模型及多级的UBM-GMM模型。进而将测试集中的音频样本转码为8000Hz的wav格式音频;将转码后的音频以长度3秒重叠1秒的方式进行分段,对每个音频分段提取梅尔频率倒谱系数及其一阶、二阶梯度特征,使用测试集分段后提取的特征对训练出的单级的UBM-GMM模型及多级的UBM-GMM模型进行评价。
2),训练及评价多级的UBM-GMM模型:
其中,训练多级的UBM-GMM模型:
多级的UBM-GMM模型指的是首先使用第一级UBM-GMM模型将输入样本分为三个类别;然后根据此分类结果,选择使用不同类别对应的第二级UBM-GMM模型将输入样本分类为此类别的子类别。
其中,第一级UBM-GMM模型的分类类别为:生理反应,生理需求和情感需求;
“生理反应”类别对应的第二级UBM-GMM模型的分类类别为:打嗝、肚子痛、其他不舒服;
“生理需求”类别对应的第二级UBM-GMM模型的分类类别为:饿了、冷了或热了、困了;
“情感需求”类别对应的第二级UBM-GMM模型的分类类别为:害怕、孤单。
使用训练集分段后提取的特征,结合交叉验证,首先训练第一级UBM-GMM模型并调整相关超参数至最优,超参数包含第一级UBM及第一级每一类GMM的混合成分数量;然后,分别使用对应类别的训练集特征训练3个第二级UBM-GMM模型,调整相关超参数至最优,超参数包含第二级UBM及第二级每一类GMM的混合成分数量。
评价多级的UBM-GMM模型:
使用测试集分段后提取的特征评价训练出的多级的UBM-GMM模型。过程为:对于一个完整的测试集样本,分别将其分段音频的特征输入训练出的多级的UBM-GMM模型,获得其每个分段的分类结果,统计所有分段的分类结果,获得每个分类的概率,其中,概率最高的类别即为此完整测试样本的预测结果。结果显示,使用多级UBM-GMM模型能够更准确地识别待测音频。
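下面给出统计一个完整测试样本所有分段分类结果、并取占比最高类别作为最终预测的Python示意代码,函数名与示例数据均为本文为说明目的所假设:

from collections import Counter

def aggregate_segment_results(segment_labels):
    # segment_labels为一个完整样本所有分段的分类结果,例如["饿了", "饿了", "困了"]
    counts = Counter(segment_labels)
    probs = {label: n / len(segment_labels) for label, n in counts.items()}  # 各类别占比(概率)
    best = max(probs, key=probs.get)                                         # 占比最高的类别
    return best, probs

# 用法示例:aggregate_segment_results(["饿了", "饿了", "困了", "饿了"]) 返回("饿了", {"饿了": 0.75, "困了": 0.25})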
3),训练及评价单级的UBM-GMM模型:
需要说明的是,该单级的UBM-GMM模型为传统常用的模型,即对比例。
训练单级的UBM-GMM模型:
单级的UBM-GMM模型指的是使用单个UBM-GMM模型对输入样本进行8分类,分类类别为:饿了、累了、孤独、要打嗝、肚子痛、冷了或热了、害怕及其他不舒服。
使用训练集分段后提取的特征,结合交叉验证,训练单级的UBM-GMM模型并调整相关超参数至最优,超参数包含UBM及每一类的GMM的混合成分数量。
评价单级的UBM-GMM模型:
使用测试集分段后提取的特征评价训练出的单级的UBM-GMM模型。过程为:对于一个完整的测试集样本,分别将其分段音频的特征输入训练出的单级的UBM-GMM模型,获得其每个分段的分类结果,统计所有分段的分类结果,获得每个分类的概率,其中,概率最高的类别即为此完整测试样本的预测结果。使用训练出的单级的UBM-GMM模型,对测试集中的每个完整样本进行分类,统计测试集样本分类准确率为38%。
下面采用一段“饿了”的音频对上述单级和多级UBM-GMM模型进行测试:
多级UBM-GMM模型:首先使用第1级UBM-GMM模型对测试样本分段音频特征进行分类,获得测试样本每个分段的分类结果。经过第1级UBM-GMM模型分类,输入测试样本的分类结果为,“生理需求”类别概率为0.8,“生理反应”类别概率为0.2,则此输入测试样本的类别为“生理需求”;进而,使用“生理需求”对应的第2级UBM-GMM模型进行分类,输入测试样本的分类结果为,“饿了”类别概率0.8,“困了”类别概率0.2,则此输入测试样本的最终分类类别为“饿了”。
同样地,使用上述同样的测试样本,使用单级的UBM-GMM模型进行分类,得到的分类结果为“饿了”类别概率为0.4,“害怕”类别概率为0.2,“困了”类别概率为0.2,“肚子痛”类别概率为0.2;可见,最终的分类结果也是“饿了”,使用多级UBM-GMM模型的分类结果要优于使用单级UBM-GMM模型的分类结果,因为多级UBM-GMM模型中“饿了”的概率更高。
下面再采用一段“肚子疼”的音频对上述单级和多级UBM-GMM模型进行测试。
多级UBM-GMM模型:首先使用第1级UBM-GMM模型对测试样本分段音频特征进行分类,获得测试样本每个分段的分类结果。经过第1级UBM-GMM模型分类,输入测试样本的分类结果为,“生理反应”类别概率为0.8,“生理需求”类别概率为0.2,则此输入测试样本的类别为“生理反应”;进而,使用“生理反应”对应的第2级UBM-GMM模型进行分类,输入测试样本的分类结果为,“肚子痛”类别概率0.8,“打嗝”类别概率0.2,则此输入测试样本的最终分类类别为“肚子痛”。
同样地,使用上述同样的测试样本,使用单级的UBM-GMM模型分类,分类的结果是:“困了”的概率是0.4,“害怕”的概率是0.2,“打嗝”的概率是0.2,“肚子疼”的概率是0.2,最终结果是“困了”。
通过上述多级UBM-GMM模型和单级UBM-GMM模型进行分类的结果可知,测试音频为“肚子痛”类别的音频,采用分级的UBM-GMM模型测试时以0.8的较大概率识别为“肚子痛”,而使用单级UBM-GMM模型则被错误地分类为“困了”。
由此可见,通过本申请的多级UBM-GMM模型对声音进行识别,能够准确的识别出结果。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到根据上述实施例的方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中,包括若干指令用以使得一台终端设备(可以是手机,计算机,服务器,或者网络设备等)执行本申请各个实施例所述的方法。
在本实施例中还提供了一种声音的识别装置,该装置用于实现上述实施例及优选实施方式,已经进行过说明的不再赘述。如以下所使用的,术语“模块”可以实现预定功能的软件和/或硬件的组合。尽管以下实施例所描述的装置较佳地以软件来实现,但是硬件,或者软件和硬件的组合的实现也是可能并被构想的。
图5是根据本申请实施例的声音的识别装置的结构框图,如图5所示,该装置包括:采集模块52,设置为采集目标对象发出的声音信息;判断模块54,与采集模块52耦合连接,设置为判断采集到的目标对象发出的声音信息是否为哭声信息;输入模块56,与判断模块54耦合连接,设置为在判断结果为是的情况下,将声音信息输入预先训练的声音模型,其中,声音模型是根据由多个哭声信息组成的训练集对初始声音模型进行训练得到的,且声音模型包括第一级模型和第二级模型;第一级模型用于识别出声音信息的用于表征目标对象需求的需求类型,第二级模型用于识别出声音信息在需求类型中的需求状态;识别模块58,与输入模块56耦合连接,设置为通过第一级模型和第二级模型识别出与声音信息对应的用于表征目标对象的具体需求。
需要说明的是,本申请中涉及到的目标对象优选为婴儿,当然也可以是几岁的小朋友,或者是动物。在本申请对此并不限定具体的对象,可以根据实际的情况进行相应的设置。
可选地,本实施例中的判断模块54进一步可以包括:转码单元,设置为将采集到的声音信息转码为指定格式;处理单元,设置为对转码后的声音信息的音频进行分段,并从每一段音频中提取出频谱特征;其中,相邻两段音频相互重叠部分音频;判断单元,设置为通过分类模型对每一段音频的频谱特征进行检测以判断声音信息是否为哭声信息。
其中,在本申请的优选方式中该指定格式优选为wav格式,音频采样率均为8000Hz;当然在其他应用场景中也可以是以下格式:3gp,aac,amr,caf,flac,mp3,ogg,aiff等格式,基于此,以下采样频率(单位Hz)都可以:8000,11025,12000,16000,22050,24000,32000,40000,44100,47250,48000等。
需要说明的是,对输入音频(声音信息)统一格式(转码)、统一采样频率主要是为了实际使用过程中的方便,因为如果不转码,那就需要对每一种格式分别实现读取,操作会很繁琐;不统一采样频率,则相同长度的音频会包含不同数量的数据,影响后续的特征提取、模型训练。所以要先对音频进行预处理。在目前的实际使用中,将输入音频转换至wav格式,也可以转换成其他格式,只要能读取到音频数据即可,音频采样频率统一为8000Hz,当然也可以是其他采样频率。另外,在本申请中对声音信息进行转码的工具优选为FFMpeg。
此外,在本申请的优选实施方式中,使用的特征优选为梅尔频率倒谱系数及梅尔频率倒谱系数的一阶梯度和二阶梯度,这些特征属于音频的频率特征。
下面对梅尔频率倒谱系数的计算过程进行介绍:1)对输入音频加窗(长度为50毫秒),相邻窗之间有叠加(叠加长度为20毫秒);2)对每个窗的音频信号进行傅里叶变换,得到频率谱;3)对每个窗的频率谱,使用若干个梅尔滤波器(使用20个),获得梅尔刻度(20个);4)对每个梅尔刻度取对数,获得能量;5)对每个梅尔刻度对数能量做离散傅里叶反变换(或离散余弦反变换),得到倒频谱;6)得到的若干个倒频谱(20个,与使用的梅尔滤波器个数相同)的幅值即为梅尔频率倒谱系数。然后计算梅尔频率倒谱系数的一阶梯度和二阶梯度。
图6是根据本申请实施例的声音的识别装置的可选结构框图一,如图6所示,装置还包括:第一获取模块62,设置为在采集目标对象发出的声音信息之前,获取第一数据集,其中,第一数据集中包括多个为哭声信息的声音信息;第一提取模块64,与第一获取模块62耦合连接,设置为提取第一数据集中声音信息的频谱特征;第一训练模块66,与第一提取模块64耦合连接,设置为从第一数据集中选择部分数据作为初始分类模型的训练集,并基于训练集中的频谱特征对初始统计概率模型进行训练以确定分类模型的参数。
在具体的应用场景中以婴儿为目标对象,分类模型为梯度提升树、支持向量机和多层感知机,则具体的训练过程可以是:
第一数据集:第一数据集可来源于数据集donateacry-corpus等其他数据集,有2467段宝宝哭声音频;数据集ESC-50,包含50类音频,每一类音频均含有40个样本,50类中有一类为宝宝哭声,其余49类为非宝宝哭声音频,包括的类别有动物叫声、自然环境声、人声、室内声及城市噪音;因此,宝宝哭声音频样本共有2507段,非宝宝哭声样本共有1960段。将数据集20%划分为测试集,80%划分为训练集。
进而,对每段音频提取梅尔频率倒谱系数及其一阶、二阶梯度特征;使用训练集并结合交叉验证,分别训练梯度提升树(XGBoost)、支持向量机(SVM)及多层感知机(MLP),确定分类器模型最佳参数;使用测试集,对某一样本分别使用训练好的梯度提升树、支持向量机及多层感知机进行分类,三个模型的分类结果投票产生最终分类结果;统计测试集样本分类结果,用于评价模型的训练效果。
图7是根据本申请实施例的声音的识别装置的可选结构框图二,如图7所示,装置还包括:第二获取模块72,设置为在采集目标对象发出的声音信息之前,获取第二数据集;其中,第二数据集中的声音信息被划分为多个需求类型的声音信息;每个需求类型中包括用于表征目标对象需求的需求状态的声音信息;第二提取模块74,与第二获取模块72耦合连接,设置为提取第二数据集中声音信息的频谱特征;第二训练模块76,与第二提取模块74耦合连接,设置为从第二数据集中选择部分数据作为初始声音模型的训练集,并基于训练集中的频谱特征对初始声音模型中的初始第一级模型和初始第二级模型进行训练以确定声音模型中第一级模型和第二级模型的参数。
在具体的应用场景中仍以婴儿为目标对象为例,声音模型为分级的UBM-GMM,则上述训练过程在具体应用场景中可以是:
该第二数据集的来源可以是数据集donateacry-corpus等其他数据集,包括:2467段宝宝哭声音频,分为8类,分别是饿了740段、累了468段、孤独232段、要打嗝161段、肚子痛268段、冷了或热了115段、害怕149段及其他不舒服334段。其中,将该第二数据集中的20%划分为测试集,80%划分为训练集。
进而对每段音频提取梅尔频率倒谱系数及其一阶、二阶梯度特征;
基于图3,使用上述第二数据集中的训练集并结合交叉验证,训练分级的UBM-GMM:首先训练UBM-GMM1,将输入音频分为3个大类,对于每个大类,训练UBM-GMM2、UBM-GMM3及UBM-GMM4,再将大类分类成小类。根据宝宝不同的需求,将哭声分为三个需求类型大类,分别是“生理反应”“生理需求”及“情感需求”;再将三个需求类型分成若干需求状态小类,生理反应:打嗝、肚子痛、其他不舒服;生理需求:饿了、困了、冷了热了;情感需求:害怕、孤单。
使用分级的UBM-GMM的原因是:(1)第二数据集中各类别数据量差异大;若只使用单个UBM-GMM,会造成数据量多的类别很容易被识别,但数据量少的类别却很难被识别;使用分级的方法,将需求状态合并成需求类型,首先就降低了类别间数据量的不均衡性,提升了分类的准确率;(2)婴儿哭的原因并不总是单一的,在大的类别中再分小类,有利于获得造成婴儿哭的所有可能的因素。
对每一个UBM-GMM模型的训练过程,如图4所示,首先使用所有训练数据训练一个GMM,称为UBM;然后,分别使用每个类别的数据训练GMM,获得每个类别的模型CN-GMM;这样,训练过程就完成了。
可选地,本实施例中的识别模块58进一步可以包括:第一输入单元,设置为将声音信息的频谱特征输入到第一级模型中,得到声音信息分别为多个需求类型的概率值;选择单元,设置为从多个需求类型的概率值中选择出概率值最大的需求类型;第二输入单元,设置为将声音信息的频谱特征输入到第二级模型中,得到与选择出的概率值最大的需求类型对应的需求状态的概率值;识别单元,设置为将概率值最大的需求状态作为声音信息的需求状态。
需要说明的是,上述各个模块是可以通过软件或硬件来实现的,对于后者,可以通过以下方式实现,但不限于此:上述模块均位于同一处理器中;或者,上述各个模块以任意组合的形式分别位于不同的处理器中。
本申请的实施例还提供了一种计算机可读存储介质,该计算机可读存储介质中存储有计算机程序,其中,该计算机程序被设置为运行时执行上述任一项方法实施例中的步骤。
可选地,在本实施例中,上述计算机可读存储介质可以被设置为存储用于执行以下步骤的计算机程序:
S1,采集目标对象发出的声音信息;
S2,判断采集到的目标对象发出的声音信息是否为哭声信息;
S3,在判断结果为是的情况下,将声音信息输入预先训练的声音模型,其中,该声音模型是根据由多个哭声信息组成的训练集对初始声音模型进行训练得到的,且声音模型包括:第一级模型和第二级模型;第一级模型用于识别出声音信息的用于表征目标对象需求的需求类型,第二级模型用于识别出声音信息在需求类型中的需求状态;
S4,通过第一级模型和第二级模型识别出与声音信息对应的用于表征目标对象的具体需求。
可选地,在本实施例中,上述计算机可读存储介质可以包括但不限于:U盘、只读存储器(Read-Only Memory,简称为ROM)、随机存取存储器(Random Access Memory,简称为RAM)、移动硬盘、磁碟或者光盘等各种可以存储计算机程序的介质。
本申请的实施例还提供了一种电子装置,包括存储器和处理器,该存储器中存储有计算机程序,该处理器被设置为运行计算机程序以执行上述任一项方法实施例中的步骤。
可选地,上述电子装置还可以包括传输设备以及输入输出设备,其中,该传输设备和上述处理器连接,该输入输出设备和上述处理器连接。
可选地,在本实施例中,上述处理器可以被设置为通过计算机程序执行以下步骤:
S1,采集目标对象发出的声音信息;
S2,判断采集到的目标对象发出的声音信息是否为哭声信息;
S3,在判断结果为是的情况下,将声音信息输入预先训练的声音模型,其中,该声音模型是根据由多个哭声信息组成的训练集对初始声音模型进行训练得到的,且声音模型包括:第一级模型和第二级模型;第一级模型用于识别出声音信息的用于表征目标对象需求的需求类型,第二级模型用于识别出声音信息在需求类型中的需求状态;
S4,通过第一级模型和第二级模型识别出与声音信息对应的用于表征目标对象的具体需求。
可选地,本实施例中的具体示例可以参考上述实施例及可选实施方式中所描述的示例,本实施例在此不再赘述。
显然,本领域的技术人员应该明白,上述的本申请的各模块或各步骤可以用通用的计算装置来实现,它们可以集中在单个的计算装置上,或者分布在多个计算装置所组成的网络上,可选地,它们可以用计算装置可执行的程序代码来实现,从而,可以将它们存储在存储装置中由计算装置来执行,并且在某些情况下,可以以不同于此处的顺序执行所示出或描述的步骤,或者将它们分别制作成各个集成电路模块,或者将它们中的多个模块或步骤制作成单个集成电路模块来实现。这样,本申请不限制于任何特定的硬件和软件结合。
以上所述仅为本申请的优选实施例而已,并不用于限制本申请,对于本领域的技术人员来说,本申请可以有各种更改和变化。凡在本申请的原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。
工业实用性
如上所述,本申请实施例提供的一种声音的识别方法及装置、存储介质和电子装置具有以下有益效果:解决了相关技术中只能根据人的经验对婴儿的哭声进行识别容易导致识别失误的问题,提高了对哭声所表征的需求状态的识别准确率。

Claims (12)

  1. 一种声音的识别方法,包括:
    采集目标对象发出的声音信息;
    判断采集到的目标对象发出的声音信息是否为哭声信息;
    在判断结果为是的情况下,将所述声音信息输入预先训练的声音模型,其中,所述预先训练的声音模型是根据由多个哭声信息组成的训练集对初始声音模型进行训练得到的,且所述预先训练的声音模型包括第一级模型和第二级模型;所述第一级模型用于识别出所述声音信息的用于表征所述目标对象需求的需求类型,所述第二级模型用于识别出所述声音信息在所述需求类型中的需求状态;
    通过所述第一级模型和所述第二级模型识别出与所述声音信息对应的用于表征所述目标对象的具体需求。
  2. 根据权利要求1所述的方法,其中,判断采集到的目标对象发出的声音信息是否为哭声信息,包括:
    将采集到的所述声音信息转码为指定格式;
    对转码后的声音信息的音频进行分段,并从每一段音频中提取出频谱特征;其中,相邻两段音频相互重叠部分音频;
    通过分类模型对每一段音频的频谱特征进行检测以判断所述声音信息是否为哭声信息。
  3. 根据权利要求2所述的方法,其中,在采集目标对象发出的声音信息之前,所述方法还包括:
    获取第一数据集,其中,所述第一数据集中包括多个为哭声信息的声音信息;
    提取所述第一数据集中声音信息的频谱特征;
    从所述第一数据集中选择部分数据作为初始分类模型的训练集,并基于所述训练集中的频谱特征对初始统计概率模型进行训练以确定所述分类模型的参数。
  4. 根据权利要求1所述的方法,其中,在采集目标对象发出的声音信息之前,所述方法还包括:
    获取第二数据集;其中,所述第二数据集中的声音信息被划分为多个需求类型的声音信息;每个需求类型中包括用于表征所述目标对象需求的需求状态的声音信息;
    提取所述第二数据集中声音信息的频谱特征;
    从所述第二数据集中选择部分数据作为初始声音模型的训练集,并基于所述训练集中的频谱特征对所述初始声音模型中的初始第一级模型和初始第二级模型进行训练以确定所述声音模型中所述第一级模型和所述第二级模型的参数。
  5. 根据权利要求1或4所述的方法,其中,通过所述第一级模型和所述第二级模型识别出与所述声音信息对应的用于表征所述目标对象需求的需求状态,包括:
    将所述声音信息的频谱特征输入到所述第一级模型中,得到所述声音信息分别为多个需求类型的概率值;
    从多个所述需求类型的概率值中选择出概率值最大的需求类型;
    将所述声音信息的频谱特征输入到所述第二级模型中,得到与选择出的概率值最大的需求类型对应的需求状态的概率值;
    将概率值最大的需求状态作为所述声音信息的需求状态。
  6. 一种声音的识别装置,包括:
    采集模块,设置为采集目标对象发出的声音信息;
    判断模块,设置为判断采集到的目标对象发出的声音信息是否为哭声信息;
    输入模块,设置为在判断结果为是的情况下,将所述声音信息输入预先训练的声音模型,其中,所述声音模型是根据由多个哭声信息组成的训练集对初始声音模型进行训练得到的,且所述声音模型包括第一级模型和第二级模型;所述第一级模型用于识别出所述声音信息的用于表征所述目标对象需求的需求类型,所述第二级模型用于识别出所述声音信息在所述需求类型中的需求状态;
    识别模块,设置为通过所述第一级模型和所述第二级模型识别出与所述声音信息对应的用于表征所述目标对象的具体需求。
  7. 根据权利要求6所述的装置,其中,所述判断模块包括:
    转码单元,设置为将采集到的所述声音信息转码为指定格式;
    处理单元,设置为对转码后的声音信息的音频进行分段,并从每一段音频中提取出频谱特征;其中,相邻两段音频相互重叠部分音频;
    判断单元,设置为通过分类模型对每一段音频的频谱特征进行检测以判断所述声音信息是否为哭声信息。
  8. 根据权利要求7所述的装置,其中,所述装置还包括:
    第一获取模块,设置为在采集目标对象发出的声音信息之前,获取第一数据集,其中,所述第一数据集中包括多个为哭声信息的声音信息;
    第一提取模块,设置为提取所述第一数据集中声音信息的频谱特征;
    第一训练模块,设置为从所述第一数据集中选择部分数据作为初始分类模型的训练集,并基于所述训练集中的频谱特征对初始统计概率模型进行训练以确定所述分类模型的参数。
  9. 根据权利要求6所述的装置,其中,所述装置还包括:
    第二获取模块,设置为在采集目标对象发出的声音信息之前,获取第二数据集;其中,所述第二数据集中的声音信息被划分为多个需求类型的声音信息;每个需求类型中包括用于表征所述目标对象需求的需求状态的声音信息;
    第二提取模块,设置为提取所述第二数据集中声音信息的频谱特征;
    第二训练模块,设置为从所述第二数据集中选择部分数据作为初始声音模型的训练集,并基于所述训练集中的频谱特征对所述初始声音模型中的初始第一级模型和初始第二级模型进行训练以确定所述声音模型中所述第一级模型和所述第二级模型的参数。
  10. 根据权利要求6或9所述的装置,其中,所述识别模块包括:
    第一输入单元,设置为将所述声音信息的频谱特征输入到所述第一级模型中,得到所述声音信息分别为多个需求类型的概率值;
    选择单元,设置为从多个所述需求类型的概率值中选择出概率值最大的需求类型;
    第二输入单元,设置为将所述声音信息的频谱特征输入到所述第二级模型中,得到与选择出的概率值最大的需求类型对应的需求状态的概率值;
    识别单元,设置为将概率值最大的需求状态作为所述声音信息的需求状态。
  11. 一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机程序,其中,所述计算机程序被设置为运行时执行所述权利要求1至5任一项中所述的方法。
  12. 一种电子装置,包括存储器和处理器,所述存储器中存储有计算机程序,所述处理器被设置为运行所述计算机程序以执行所述权利要求1至5任一项中所述的方法。
PCT/CN2020/087072 2019-06-26 2020-04-26 声音的识别方法及装置、存储介质和电子装置 WO2020259057A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910562749.8A CN111883174A (zh) 2019-06-26 2019-06-26 声音的识别方法及装置、存储介质和电子装置
CN201910562749.8 2019-06-26

Publications (1)

Publication Number Publication Date
WO2020259057A1 true WO2020259057A1 (zh) 2020-12-30

Family

ID=73153876

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/087072 WO2020259057A1 (zh) 2019-06-26 2020-04-26 声音的识别方法及装置、存储介质和电子装置

Country Status (2)

Country Link
CN (1) CN111883174A (zh)
WO (1) WO2020259057A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488077B (zh) * 2021-09-07 2021-12-07 珠海亿智电子科技有限公司 真实场景下的婴儿哭声检测方法、装置及可读介质

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130070928A1 (en) * 2011-09-21 2013-03-21 Daniel P. W. Ellis Methods, systems, and media for mobile audio event recognition
CN103280220A (zh) * 2013-04-25 2013-09-04 北京大学深圳研究生院 一种实时的婴儿啼哭声识别方法
CN104347066A (zh) * 2013-08-09 2015-02-11 盛乐信息技术(上海)有限公司 基于深层神经网络的婴儿啼哭声识别方法及系统
CN107591162A (zh) * 2017-07-28 2018-01-16 南京邮电大学 基于模式匹配的哭声识别方法及智能看护系统
CN107808658A (zh) * 2016-09-06 2018-03-16 深圳声联网科技有限公司 基于家居环境下实时的婴儿音频系列行为检测方法
CN107818779A (zh) * 2017-09-15 2018-03-20 北京理工大学 一种婴幼儿啼哭声检测方法、装置、设备及介质
CN108461091A (zh) * 2018-03-14 2018-08-28 南京邮电大学 面向家居环境的智能哭声检测方法
CN109903780A (zh) * 2019-02-22 2019-06-18 宝宝树(北京)信息技术有限公司 哭声原因模型建立方法、系统及哭声原因辨别方法

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101807396A (zh) * 2010-04-02 2010-08-18 陕西师范大学 婴儿哭闹自动记录装置及方法
EP4241676A3 (en) * 2012-03-29 2023-10-18 The University of Queensland A method and apparatus for processing sound recordings of a patient
CN103258532B (zh) * 2012-11-28 2015-10-28 河海大学常州校区 一种基于模糊支持向量机的汉语语音情感识别方法
US9965685B2 (en) * 2015-06-12 2018-05-08 Google Llc Method and system for detecting an audio event for smart home devices
CN111354375A (zh) * 2020-02-25 2020-06-30 咪咕文化科技有限公司 一种哭声分类方法、装置、服务器和可读存储介质

Also Published As

Publication number Publication date
CN111883174A (zh) 2020-11-03

Similar Documents

Publication Publication Date Title
CN110556129B (zh) 双模态情感识别模型训练方法及双模态情感识别方法
Schuller et al. The interspeech 2017 computational paralinguistics challenge: Addressee, cold & snoring
Pramono et al. A cough-based algorithm for automatic diagnosis of pertussis
CN112750465B (zh) 一种云端语言能力评测系统及可穿戴录音终端
Arias-Londoño et al. On combining information from modulation spectra and mel-frequency cepstral coefficients for automatic detection of pathological voices
US20210071401A1 (en) Smart toilet and electric appliance system
Hariharan et al. Objective evaluation of speech dysfluencies using wavelet packet transform with sample entropy
Wu et al. Investigation and evaluation of glottal flow waveform for voice pathology detection
WO2022012777A1 (en) A computer-implemented method of providing data for an automated baby cry assessment
Bhagatpatil et al. An automatic infant’s cry detection using linear frequency cepstrum coefficients (LFCC)
WO2020259057A1 (zh) 声音的识别方法及装置、存储介质和电子装置
Kulkarni et al. Child cry classification-an analysis of features and models
CN103578480A (zh) 负面情绪检测中的基于上下文修正的语音情感识别方法
US11475876B2 (en) Semantic recognition method and semantic recognition device
Aggarwal et al. A machine learning approach to classify biomedical acoustic features for baby cries
Richards et al. The LENATM automatic vocalization assessment
Messaoud et al. A cry-based babies identification system
Milani et al. A real-time application to detect human voice disorders
CN116153298A (zh) 一种认知功能障碍筛查用的语音识别方法和装置
CN113488077B (zh) 真实场景下的婴儿哭声检测方法、装置及可读介质
Rosen et al. Infant mood prediction and emotion classification with different intelligent models
US20240023877A1 (en) Detection of cognitive impairment
Kokot et al. Classification of child vocal behavior for a robot-assisted autism diagnostic protocol
Xu et al. Differential Time-frequency Log-mel Spectrogram Features for Vision Transformer Based Infant Cry Recognition}}
Motlagh et al. Using general sound descriptors for early autism detection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20831594

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20831594

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 17/05/2022)

122 Ep: pct application non-entry in european phase

Ref document number: 20831594

Country of ref document: EP

Kind code of ref document: A1