WO2020259057A1 - Sound recognition method and device, storage medium and electronic device - Google Patents
Sound recognition method and device, storage medium and electronic device
- Publication number
- WO2020259057A1 (PCT/CN2020/087072)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- model
- sound information
- sound
- demand
- information
- Prior art date
Links
- 238000000034 method Methods 0.000 title claims abstract description 61
- 206010011469 Crying Diseases 0.000 claims abstract description 52
- 238000012549 training Methods 0.000 claims description 74
- 230000003595 spectral effect Effects 0.000 claims description 22
- 238000013145 classification model Methods 0.000 claims description 18
- 238000004590 computer program Methods 0.000 claims description 18
- 230000015654 memory Effects 0.000 claims description 18
- 238000001228 spectrum Methods 0.000 claims description 12
- 238000000605 extraction Methods 0.000 claims description 8
- 238000012545 processing Methods 0.000 claims description 4
- 238000012360 testing method Methods 0.000 description 34
- 230000008569 process Effects 0.000 description 15
- 230000006461 physiological response Effects 0.000 description 14
- 238000005070 sampling Methods 0.000 description 12
- 206010000087 Abdominal pain upper Diseases 0.000 description 11
- 208000031361 Hiccup Diseases 0.000 description 11
- 206010041349 Somnolence Diseases 0.000 description 11
- 238000010586 diagram Methods 0.000 description 11
- 230000002996 emotional effect Effects 0.000 description 11
- 230000005540 biological transmission Effects 0.000 description 9
- 230000011218 segmentation Effects 0.000 description 8
- 238000012706 support-vector machine Methods 0.000 description 8
- 208000002193 Pain Diseases 0.000 description 7
- 230000036407 pain Effects 0.000 description 7
- 206010037180 Psychiatric symptoms Diseases 0.000 description 6
- 238000002790 cross-validation Methods 0.000 description 6
- 238000005516 engineering process Methods 0.000 description 6
- 206010013082 Discomfort Diseases 0.000 description 4
- 241001465754 Metazoa Species 0.000 description 4
- 210000001015 abdomen Anatomy 0.000 description 4
- 230000000694 effects Effects 0.000 description 3
- 238000011156 evaluation Methods 0.000 description 3
- 238000000528 statistical test Methods 0.000 description 3
- 208000004998 Abdominal Pain Diseases 0.000 description 2
- 238000004364 calculation method Methods 0.000 description 2
- 238000004891 communication Methods 0.000 description 2
- 230000006870 function Effects 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 230000003287 optical effect Effects 0.000 description 2
- 230000005236 sound signal Effects 0.000 description 2
- 208000036119 Frailty Diseases 0.000 description 1
- 108010089143 GMM2 Proteins 0.000 description 1
- 206010019233 Headaches Diseases 0.000 description 1
- 206010028735 Nasal congestion Diseases 0.000 description 1
- 206010003549 asthenia Diseases 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000000052 comparative effect Effects 0.000 description 1
- 238000013136 deep learning model Methods 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 231100000869 headache Toxicity 0.000 description 1
- 235000003642 hunger Nutrition 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 230000002452 interceptive effect Effects 0.000 description 1
- 238000010295 mobile communication Methods 0.000 description 1
- 230000008447 perception Effects 0.000 description 1
- 238000007781 pre-processing Methods 0.000 description 1
- 230000029058 respiratory gaseous exchange Effects 0.000 description 1
- 230000004044 response Effects 0.000 description 1
- 230000000717 retained effect Effects 0.000 description 1
- 230000003867 tiredness Effects 0.000 description 1
- 208000016255 tiredness Diseases 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/26—Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/66—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
Definitions
- This application relates to the computer field, and in particular to a method and device for sound recognition, a storage medium and an electronic device.
- A baby's crying is relatively complex, and the information conveyed by the crying is also relatively vague; the cry may indicate hunger, tiredness, loneliness, and so on.
- In the related art, the recognition of baby crying is based on human experience; such experience is often inconsistent, and subjective judgments can easily lead to recognition errors.
- The embodiments of the present application provide a sound recognition method and device, a storage medium, and an electronic device, so as to at least solve the problem in the related art that a baby's cry can only be recognized based on human experience, which easily leads to recognition errors.
- A sound recognition method is provided, including: collecting sound information emitted by a target object; judging whether the collected sound information emitted by the target object is crying information; if the judgment result is yes, inputting the sound information into a pre-trained sound model, where the sound model is obtained by training an initial sound model on a training set composed of multiple pieces of crying information, and the sound model includes a first-level model and a second-level model; the first-level model is used to identify the demand type of the sound information that characterizes the needs of the target object, and the second-level model is used to identify the demand state of the sound information within the demand type; and identifying, through the first-level model and the second-level model, the specific demand of the target object corresponding to the sound information.
- A sound recognition device is provided, which includes: a collection module configured to collect sound information emitted by a target object; a judgment module configured to determine whether the collected sound information emitted by the target object is crying information; an input module configured to input the sound information into a pre-trained sound model if the judgment result is yes, where the sound model is obtained by training an initial sound model on a training set composed of multiple pieces of crying information, and the sound model includes a first-level model and a second-level model; the first-level model is used to identify the demand type of the sound information that characterizes the needs of the target object, and the second-level model is used to identify the demand state of the sound information within the demand type; and a recognition module configured to identify, through the first-level model and the second-level model, the specific demand of the target object corresponding to the sound information.
- A computer-readable storage medium is provided, in which a computer program is stored, where the computer program is configured to execute the steps in any of the above method embodiments when run.
- An electronic device is provided, including a memory and a processor, where a computer program is stored in the memory and the processor is configured to run the computer program to execute the steps in any of the above method embodiments.
- Through the embodiments of the present application, when it is determined that the collected sound information emitted by the target object is crying information, the demand type of the sound information and the demand state under that demand type can further be identified according to the first-level model and the second-level model in the sound model. In this way, the current demand state of the target object can be identified from the crying information through the sound model, instead of judging the demand represented by the cry based on human experience. This solves the problem in the related art that a baby's cry can only be recognized based on human experience, which easily leads to recognition errors, and improves the accuracy of recognizing the demand state represented by the cry.
- FIG. 1 is a block diagram of the hardware structure of a terminal of a voice recognition method according to an embodiment of the present application
- Fig. 2 is a flowchart of a voice recognition method according to an embodiment of the present application
- Figure 3 is a schematic diagram of a hierarchical UBM-GMM model according to an embodiment of the present application.
- FIG. 4 is a schematic diagram of the training process of the UBM-GMM model according to an embodiment of the present application.
- Figure 5 is a structural block diagram of a voice recognition device according to an embodiment of the present application.
- Fig. 6 is an optional structural block diagram 1 of a voice recognition device according to an embodiment of the present application.
- Fig. 7 is a second optional structural block diagram of a voice recognition device according to an embodiment of the present application.
- FIG. 1 is a hardware structure block diagram of a terminal of a voice recognition method according to an embodiment of the present application.
- The terminal may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 configured to store data.
- The aforementioned terminal may also include a transmission device 106 and an input/output device 108 configured for communication functions.
- The terminal may also include more or fewer components than shown in FIG. 1, or have a different configuration from that shown in FIG. 1.
- The memory 104 may be configured to store computer programs, for example, software programs and modules of application software, such as the computer programs corresponding to the sound recognition method in the embodiments of the present application.
- The processor 102 runs the computer programs stored in the memory 104 to perform various functional applications and data processing, that is, to realize the above-mentioned method.
- the memory 104 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
- the memory 104 may further include a memory remotely provided with respect to the processor 102, and these remote memories may be connected to the terminal through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
- the transmission device 106 is configured to receive or transmit data via a network.
- the aforementioned specific examples of the network may include a wireless network or a wired network provided by the communication provider of the terminal.
- the transmission device 106 includes a network adapter (Network Interface Controller, NIC for short), which can be connected to other network devices through a base station to communicate with the Internet.
- the transmission device 106 may be a radio frequency (Radio Frequency, referred to as RF) module, which is configured to communicate with the Internet in a wireless manner.
- Whether the transmission device 106 is used for the method steps in this application depends on the specific solution. For example, if the solution involves interaction with other devices, the transmission device 106 needs to be used; if all the method steps in this application can be executed inside the aforementioned terminal, the transmission device 106 does not need to be used.
- FIG. 2 is a flowchart of a voice recognition method according to an embodiment of the present application. As shown in FIG. 2, the process includes the following steps:
- Step S202: collect sound information emitted by the target object;
- Step S204: determine whether the collected sound information emitted by the target object is crying information;
- Step S206: if the result of the judgment is yes, input the sound information into a pre-trained sound model, where the sound model is obtained by training an initial sound model on a training set composed of multiple pieces of crying information, and the pre-trained sound model includes a first-level model and a second-level model; the first-level model is used to identify the demand type of the sound information that characterizes the needs of the target object, and the second-level model is used to identify the demand state of the sound information within that demand type;
- Step S208: identify, through the first-level model and the second-level model, the specific demand of the target object corresponding to the sound information.
- In the embodiments of the present application, the pre-trained sound model is composed of multiple levels of models. It may be a two-level model (a first-level model and a second-level model), or it may be composed of three, four, or more levels of models. Correspondingly, the specific demand characterizing the needs of the target object may be identified directly by the first-level model and the second-level model; or, based on the results identified in turn by the first-level model and the second-level model, it may be identified by a third-level model (for a three-level sound model), or further by a fourth-level model based on the results of the third-level model (for a four-level sound model), and so on.
- When the pre-trained sound model in step S206 only includes the first-level model and the second-level model, the demand state of the target object obtained by the second-level model is the specific demand.
- The pre-trained sound model in step S206 may also be composed of a three-level model, a four-level model, or models of other levels.
- Taking a three-level model as an example: the first-level model is used to identify the first demand type of the sound information that characterizes the needs of the target object; the second-level model is used to identify the second demand-state type of the sound information within the first demand type; the third-level model is used to identify the specific demand state of the target object within the second demand-state type of the sound information; and this specific demand state is the specific demand of the target object.
- For example, the first-level model is used to identify the first demand type of the sound information that characterizes the needs of the target object, and the first demand type includes "physiological" and "non-physiological".
- The second demand types of the second-level model corresponding to "physiological" are: physiological response, physiological need, and emotional need.
- The demand states of the third-level model corresponding to "physiological response" are: hiccups, stomach pain, and other discomfort; the demand states of the third-level model corresponding to "physiological need" are: hungry, cold or hot, sleepy; the demand states of the third-level model corresponding to "emotional need" are: scared, lonely.
- The second demand types of the second-level model corresponding to "non-physiological" are: pain, poor breathing, and weakness.
- The demand states of the third-level model corresponding to "pain" are: abdominal pain, headache, etc.; the demand states of the third-level model corresponding to "poor breathing" are: nasal congestion, etc.; the demand state of the third-level model corresponding to "weakness" is: weak and feeble.
- In this way, after the sound information is determined to be crying information, the demand type of the sound information and the demand state under that demand type can be further identified according to the first-level model and the second-level model in the sound model, so that the current demand state of the target object can be identified from the crying information through the sound model, instead of judging the demand represented by the cry based on human experience. This solves the problem in the related art that a baby's cry can only be recognized based on human experience, which easily leads to recognition errors, and improves the accuracy of recognizing the demand state represented by the cry.
- the target object involved in this application is preferably a baby, of course, it can also be a child of a few years old, or an animal.
- This application does not limit the specific objects, and the corresponding settings can be made according to the actual situation.
- the method of determining whether the collected sound information emitted by the target object is crying information involved in step S204 of the present application may be implemented in the following manner:
- Step S204-11: transcode the collected sound information into a specified format.
- In this application, the specified format is preferably the wav format, with an audio sampling rate of 8000 Hz; of course, in other application scenarios, formats such as 3gp, aac, amr, caf, flac, mp3, ogg, and aiff can also be used, and the following sampling frequencies (in Hz) can be selected: 8000, 11025, 12000, 16000, 22050, 24000, 32000, 40000, 44100, 47250, 48000, etc.; this is not specifically limited here.
- The format and sampling frequency need to be unified to make actual use more convenient. If the input audio were not transcoded into a unified format, each format would have to be read separately, which is cumbersome; and if the sampling frequency were not unified, audio of the same length would contain different amounts of data, which would affect subsequent feature extraction and model training. Therefore the audio must first be preprocessed. In current practical use, the input audio is converted to the wav format (other formats are also possible, as long as the audio data can be read), and the audio sampling frequency is unified to 8000 Hz; of course, other sampling frequencies are also possible.
- the tool for transcoding sound information in this application is preferably FFMpeg.
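- As an illustrative sketch only (not part of the original disclosure), the transcoding step above could be scripted as follows; the file paths, the mono downmix, and the 8000 Hz rate are assumptions taken from the preferred embodiment.

```python
import subprocess
from pathlib import Path

def transcode_to_wav(src: str, dst_dir: str = "wav_8k", sample_rate: int = 8000) -> str:
    """Convert an arbitrary audio file (3gp/aac/mp3/...) to mono wav at the unified sampling rate."""
    dst = Path(dst_dir) / (Path(src).stem + ".wav")
    dst.parent.mkdir(parents=True, exist_ok=True)
    # -ar sets the output sampling rate, -ac 1 downmixes to one channel, -y overwrites existing files.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ar", str(sample_rate), "-ac", "1", str(dst)],
        check=True, capture_output=True,
    )
    return str(dst)

# Example (hypothetical file name): transcode_to_wav("cry_sample.3gp") -> "wav_8k/cry_sample.wav"
```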
- Step S204-12: segment the audio of the transcoded sound information, and extract spectral features from each audio segment; wherein two adjacent audio segments partially overlap each other.
- Since the length of the audio uploaded by users is not uniform, it is preferable to convert the variable-length audio into fixed-length segments. If variable-length input audio were directly converted to a fixed length by methods such as interpolation, much of the information in the audio itself would be lost. The segmentation used in step S204-12, with overlap between adjacent segments, retains the complete audio information as well as the correlation between segments.
- For example, the input audio is segmented with a segment length of 3 seconds and an overlap of 1 second between two adjacent segments. Of course, a segment length of 4 seconds with a 1.5-second overlap, or a segment length of 5 seconds with a 2-second overlap, is also possible; this can be set according to the actual situation.
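- A minimal sketch of such overlapping segmentation (illustrative only; the 3-second window and 1-second overlap are the example values above):

```python
import numpy as np

def segment_with_overlap(samples: np.ndarray, sr: int = 8000,
                         seg_sec: float = 3.0, overlap_sec: float = 1.0):
    """Split a 1-D waveform into fixed-length segments where adjacent segments overlap."""
    seg_len = int(seg_sec * sr)
    hop = int((seg_sec - overlap_sec) * sr)   # 2-second hop -> 1 second shared between neighbours
    if len(samples) < seg_len:                # audio shorter than one segment: zero-pad to full length
        samples = np.pad(samples, (0, seg_len - len(samples)))
    return [samples[start:start + seg_len]
            for start in range(0, len(samples) - seg_len + 1, hop)]
```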
- Step S204-13: the spectral features of each audio segment are detected through the classification model to determine whether the sound information is crying information.
- In this application, the features used are preferably the Mel-frequency cepstral coefficients and their first-order difference; both belong to the frequency-domain characteristics of the audio.
- In other embodiments, the features used are the Mel-frequency cepstral coefficients together with their first-order and second-order differences.
- For extracting the Mel-frequency cepstral coefficients, the window length is preferably 30 to 50 milliseconds, the overlap between adjacent windows is preferably 10 to 20 milliseconds, and the number of Mel filters used is preferably 20 to 40.
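- A hedged sketch of this feature extraction using librosa (the 40 ms window, 15 ms overlap, 26 Mel filters, and 13 coefficients are illustrative values within the preferred ranges above, not values disclosed by the application):

```python
import librosa
import numpy as np

def extract_mfcc_features(wav_path: str, sr: int = 8000,
                          win_ms: float = 40.0, overlap_ms: float = 15.0,
                          n_mels: int = 26, n_mfcc: int = 13) -> np.ndarray:
    """Return MFCCs plus first- and second-order differences, shape (frames, 3 * n_mfcc)."""
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(sr * win_ms / 1000)                    # analysis window length in samples
    hop_length = n_fft - int(sr * overlap_ms / 1000)   # adjacent windows overlap by overlap_ms
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    delta1 = librosa.feature.delta(mfcc)               # first-order difference
    delta2 = librosa.feature.delta(mfcc, order=2)      # second-order difference
    return np.vstack([mfcc, delta1, delta2]).T
```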
- The classification model in this application may be a gradient boosting tree, a support vector machine, a multi-layer perceptron, a statistical probability model and/or a deep learning model.
- In a preferred implementation of this application, the classification models are a gradient boosting tree, a support vector machine, and a multi-layer perceptron. That is, the audio features are input into the three classifiers, each classifier makes its own judgment to obtain its own classification result, the classification results are then counted, and the result that occurs most often is taken as the detection result, namely whether or not the audio is the cry of the target object.
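- A minimal sketch of this three-classifier majority vote (illustrative only; the hyperparameters shown are assumptions, not the parameters of Table 1):

```python
from collections import Counter
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

def train_cry_detectors(X_train: np.ndarray, y_train: np.ndarray):
    """Train the three segment-level detectors (label 1 = cry, 0 = non-cry)."""
    models = [XGBClassifier(n_estimators=200, max_depth=4),
              SVC(kernel="rbf", C=1.0),
              MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)]
    for m in models:
        m.fit(X_train, y_train)
    return models

def is_cry(models, segment_features: np.ndarray) -> bool:
    """Majority vote of the three classifiers on one segment's feature vector."""
    votes = [int(m.predict(segment_features.reshape(1, -1))[0]) for m in models]
    return Counter(votes).most_common(1)[0][0] == 1
```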
- Before the sound information emitted by the target object is collected in step S202, the method of this example further includes:
- Step S101: obtain a first data set, where the first data set includes multiple pieces of sound information that are crying information;
- Step S102: extract the spectral features of the sound information in the first data set;
- Step S103: select part of the data from the first data set as the training set of the initial classification model, and train the initial statistical probability model based on the spectral features in the training set to determine the parameters of the classification model.
- In a specific embodiment, the baby is the target object, and the classification models are a gradient boosting tree, a support vector machine, and a multi-layer perceptron.
- The first data set can be derived from data sets such as donateacry-corpus, which contains 2467 baby-cry audio recordings, and ESC-50, which contains 50 types of audio with 40 samples per type; one of the 50 categories is baby crying and the remaining 49 categories are non-crying audio, including animal calls, natural environment sounds, human voices, indoor sounds, and urban noise. Therefore there are 2507 baby-cry audio segments in total and 1960 non-cry segments. 20% of the data set is divided into the test set and 80% into the training set.
- The Mel-frequency cepstral coefficients and their first-order and second-order differences are extracted for each audio segment. The training set, with cross-validation, is used to train the gradient boosting tree (XGBoost), the support vector machine (SVM), and the multilayer perceptron (MLP) respectively, so as to determine the best parameters of each classifier. For a test-set sample, the trained gradient boosting tree, support vector machine, and multilayer perceptron each classify it, and the classification results of the three models are voted on to produce the final classification result; the classification results over the test-set samples are used to evaluate the training effect of the model.
- the final model parameters are shown in Table 1:
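- One way to realize the cross-validated parameter search described above is sketched below; the search grids are illustrative assumptions, not the values of Table 1.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

def tune_classifiers(X_train: np.ndarray, y_train: np.ndarray, cv_folds: int = 5):
    """Grid-search each segment-level classifier with cross-validation and return the best estimators."""
    search_spaces = {
        "xgboost": (XGBClassifier(), {"n_estimators": [100, 200], "max_depth": [3, 5]}),
        "svm": (SVC(), {"C": [0.5, 1.0, 2.0], "kernel": ["rbf"]}),
        "mlp": (MLPClassifier(max_iter=500), {"hidden_layer_sizes": [(64,), (64, 32)]}),
    }
    best = {}
    for name, (estimator, grid) in search_spaces.items():
        search = GridSearchCV(estimator, grid, cv=cv_folds)
        search.fit(X_train, y_train)
        best[name] = search.best_estimator_
    return best
```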
- the sound model also requires training, that is, before collecting the sound information emitted by the target object in step S202, the method of this embodiment further includes:
- Step S111 acquiring a second data set; wherein the sound information in the second data set is divided into sound information of multiple demand types; each demand type includes sound information used to characterize the demand status of the target object's demand;
- Step S112 extract the frequency spectrum characteristics of the sound information in the second data set
- Step S113: select part of the data from the second data set as the training set of the initial sound model, and train the initial first-level model and the initial second-level model in the initial sound model based on the spectral features in the training set to determine the parameters of the first-level model and the second-level model in the sound model.
- In a specific embodiment, taking a baby as the target object, the sound model is a hierarchical UBM-GMM.
- The source of this second data set can be data sets such as donateacry-corpus, including 2467 baby-cry audio recordings divided into 8 categories: 740 segments when hungry, 468 segments when tired, 232 segments when lonely, 161 segments with hiccups, 268 segments with stomach pain, 115 segments when cold or hot, 149 segments when scared, and 334 segments with other discomfort.
- 20% of the second data set is divided into the test set, and 80% is divided into the training set.
- Figure 3 is a schematic diagram of a hierarchical UBM-GMM model according to an embodiment of the present application. Based on Figure 3, the hierarchical UBM-GMM is trained using the training set in the above second data set with cross-validation: first, UBM-GMM1 is trained to divide the input audio into 3 major categories; then, for each major category, UBM-GMM2, UBM-GMM3, and UBM-GMM4 are trained to classify the major categories into their sub-categories.
- The reasons for using a hierarchical UBM-GMM are: (1) the amount of data in each category of the second data set varies greatly; if only a single UBM-GMM were used, categories with a large amount of data would be easy to identify while categories with a small amount of data would be difficult to identify; merging small categories into large categories first, through the hierarchical method, reduces the imbalance in the amount of data between categories and improves the accuracy of classification; (2) the reason for a baby's crying is not always single, so sub-classifying within a major category helps to obtain all possible factors that cause the baby to cry.
- For the training process of each UBM-GMM model, as shown by the solid lines in Figure 4, first all the training data are used to train a GMM, called the UBM; then the data of each category are used to train a GMM to obtain the model of each category, CN-GMM; at this point the training process is complete.
- the process of using a single UBM-GMM to classify new input data is shown by the dotted line in Figure 4.
- The input features are input into each category's GMM model, and the UBM model is used in a maximum a posteriori estimation to obtain the input's score on each category model; the category with the highest score is the category the input belongs to. The parameters for training each UBM-GMM model are shown in Table 2:
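- A simplified sketch of one UBM-GMM stage using scikit-learn (illustrative only: the per-category GMMs are initialized from the UBM parameters as a simple stand-in for MAP adaptation, scoring is done as a log-likelihood ratio against the UBM, and the 8 mixture components are an assumed hyperparameter, not a value from Table 2):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm_gmm(features_by_class: dict, n_components: int = 8):
    """Train a UBM on all data, then one GMM per category initialized from the UBM parameters."""
    all_feats = np.vstack(list(features_by_class.values()))
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag").fit(all_feats)
    class_gmms = {}
    for label, feats in features_by_class.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                              weights_init=ubm.weights_, means_init=ubm.means_,
                              precisions_init=ubm.precisions_)
        class_gmms[label] = gmm.fit(feats)
    return ubm, class_gmms

def classify_segment(ubm, class_gmms, feats: np.ndarray) -> str:
    """Score a segment's frame features against every category model relative to the UBM."""
    scores = {label: gmm.score(feats) - ubm.score(feats) for label, gmm in class_gmms.items()}
    return max(scores, key=scores.get)
```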
- Step S208, identifying the demand state of the target object corresponding to the sound information through the first-level model and the second-level model, can be realized as follows:
- Step S208-11: input the spectral features of the sound information into the first-level model to obtain the probability values of the sound information belonging to each of multiple demand types;
- Step S208-12: select the demand type with the largest probability value from the probability values of the multiple demand types;
- Step S208-13: input the spectral features of the sound information into the second-level model to obtain the probability values of the demand states corresponding to the selected demand type with the largest probability value;
- Step S208-14: take the demand state with the largest probability value as the demand state of the sound information.
- In this embodiment, the pre-trained model is a two-level model: the first-level model is used to identify the demand type of the sound information that characterizes the needs of the target object, and the second-level model is used to identify the demand state of the sound information within that demand type, where this demand state is the specific demand of the target object. Therefore, step S208-11 above inputs the spectral features of the sound information into the first-level model to obtain the probability values of the sound information belonging to each of multiple demand types, and step S208-13 inputs the spectral features of the sound information into the second-level model to obtain the probability values of the demand states corresponding to the selected demand type with the largest probability value.
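- A sketch of this two-level decision, reusing the UBM-GMM helper sketched above (illustrative structure; the type and state names follow the embodiment described below):

```python
def recognize_demand(features, level1, level2_by_type):
    """level1 = (ubm, type_gmms); level2_by_type maps each demand type to its (ubm, state_gmms)."""
    ubm1, type_gmms = level1
    type_scores = {t: g.score(features) - ubm1.score(features) for t, g in type_gmms.items()}
    best_type = max(type_scores, key=type_scores.get)        # e.g. "physiological need"

    ubm2, state_gmms = level2_by_type[best_type]
    state_scores = {s: g.score(features) - ubm2.score(features) for s, g in state_gmms.items()}
    best_state = max(state_scores, key=state_scores.get)     # e.g. "hungry"
    return best_type, best_state
```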
- The demand types in this application are preferably physiological response, emotional need, and physiological need; of course, other demand types, such as psychological response, can also be added according to the actual situation.
- The physiological responses include: hiccups, stomach pain, other discomfort, etc.; the physiological needs include: hungry, cold or hot, sleepy, etc.; the emotional needs include: scared, lonely, etc. That is to say, this application uses a hierarchical method to first divide the crying into major categories and then divide each major category into its sub-categories. Correspondingly, when the model is trained, the sample data of all sub-categories under the same major category can be combined as the training samples of the major-category model, and the sample data of each sub-category is used as the training samples of that sub-category's model.
- Compared with the prior-art approach of directly training a model on the sample data of each sub-category, the first-level model and the second-level model trained in this way avoid the inaccurate recognition caused by the imbalance between the amounts of sample data of the sub-categories, thereby improving recognition accuracy. In addition, because the reason for a baby's crying is not always single, first identifying the major category corresponding to the cry and then identifying the sub-category within that major category makes it possible to effectively obtain all possible factors (specific demands) behind the crying.
- the source of the second data set in this specific embodiment is the data set donateacry-corpus and other data sets.
- Demand type 1: physiological response, including 3 demand states: hiccups, stomach pain, and other discomfort;
- Demand type 2: physiological need, including 3 demand states: hungry, cold or hot, and sleepy;
- Demand type 3: emotional need, including 2 demand states: scared and lonely. This hierarchy is also written out in the sketch below.
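- For reference, the label hierarchy above can be written down directly (an illustrative data structure only, not part of the original disclosure):

```python
# Demand-type -> demand-state hierarchy used by the two-level model in this embodiment.
DEMAND_HIERARCHY = {
    "physiological response": ["hiccups", "stomach pain", "other discomfort"],
    "physiological need": ["hungry", "cold or hot", "sleepy"],
    "emotional need": ["scared", "lonely"],
}
```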
- The multi-level UBM-GMM model means that the first-level UBM-GMM model first divides the input sample into the three major categories; then, according to the classification result, the second-level UBM-GMM model corresponding to that category is selected to classify the input sample into a sub-category of that category.
- the classification categories of the first-level UBM-GMM model are: physiological response, physiological needs and emotional needs;
- the classification categories of the second-level UBM-GMM model corresponding to the "Physiological Response” category are: hiccups, stomach pains, and other discomforts;
- the classification categories of the second-level UBM-GMM model corresponding to the "physiological needs" category are: hungry, cold or hot, sleepy;
- the classification categories of the second-level UBM-GMM model corresponding to the category of "emotional needs" are: fear and loneliness.
- The training-set features are used to train the first-level UBM-GMM model, and the related hyperparameters are adjusted to the optimum; these hyperparameters include the number of mixture components of the first-level UBM and of each category GMM at the first level. Then the training-set features of the corresponding categories are used to train the three second-level UBM-GMM models and their hyperparameters are likewise adjusted to the optimum; these include the number of mixture components of the second-level UBMs and of each category GMM at the second level.
- The trained multi-level UBM-GMM model is then evaluated using the features extracted from the segmented test set.
- The process is: for a complete test-set sample, the features of its segmented audio are input into the trained multi-level UBM-GMM model to obtain the classification result of each segment; the classification results of all segments are counted to obtain the probability of each category, and the category with the highest probability is the prediction for this complete test sample.
- the results show that using the multi-level UBM-GMM model can more accurately identify the audio to be tested.
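- A sketch of this segment-level voting for one complete test sample (illustrative only; `recognize_demand` is the hypothetical helper sketched earlier):

```python
from collections import Counter

def predict_sample(segment_feature_list, level1, level2_by_type):
    """Classify every segment of one sample, then take the most frequent (type, state) result."""
    votes = Counter(
        recognize_demand(feats, level1, level2_by_type) for feats in segment_feature_list
    )
    total = sum(votes.values())
    probabilities = {label: count / total for label, count in votes.items()}
    best_label = max(probabilities, key=probabilities.get)
    return best_label, probabilities
```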
- the single-stage UBM-GMM model is a traditional commonly used model, that is, a comparative example.
- the single-level UBM-GMM model refers to the use of a single UBM-GMM model to classify the input samples into 8 categories.
- The classification categories are: hungry, tired, lonely, hiccups, stomach pain, cold or hot, scared, and other discomfort.
- The hyperparameters include the number of mixture components of the UBM and of each category GMM.
- The trained single-level UBM-GMM model is evaluated using the features extracted from the segmented test set.
- The process is: for a complete test-set sample, the features of its segmented audio are input into the trained single-level UBM-GMM model to obtain the classification result of each segment; the classification results of all segments are counted to obtain the probability of each category, and the category with the highest probability is the prediction for this complete test sample.
- The classification accuracy over the test-set samples is 38%.
- Multi-level UBM-GMM model: first, the first-level UBM-GMM model is used to classify the audio features of the test sample's segments to obtain each segment's classification result. Suppose that, after the first-level UBM-GMM classification, the result for the input test sample is a probability of 0.8 for the "physiological need" category and 0.2 for the "physiological response" category; then the category of the input test sample is "physiological need". Next, the second-level UBM-GMM model corresponding to "physiological need" is used for classification, and the result for the input test sample is: "hungry" with probability 0.8 and "sleepy" with probability 0.2; then the final classification of the test sample is "hungry".
- For the same sample, the single-level UBM-GMM model gives: "hungry" with probability 0.4, "scared" with probability 0.2, "sleepy" with probability 0.2, and "stomach pain" with probability 0.2; it can be seen that the final classification result is also "hungry".
- The classification result obtained with the multi-level UBM-GMM model is better than that obtained with the single-level UBM-GMM model, because the probability of "hungry" is higher in the multi-level model.
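- As one hedged quantitative reading of this comparison (not stated explicitly in the original text), even treating the two levels as chained conditional probabilities leaves the multi-level estimate well above the single-level one:

```latex
P(\text{hungry}) = P(\text{physiological need}) \cdot P(\text{hungry} \mid \text{physiological need})
                 = 0.8 \times 0.8 = 0.64 > 0.4
```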
- In another example, the multi-level UBM-GMM model again first uses the first-level UBM-GMM model to classify the audio features of the test sample's segments and obtain each segment's classification result.
- Suppose the classification result for the input test sample is a probability of 0.8 for the "physiological response" category and 0.2 for the "physiological need" category; then the category of the input test sample is "physiological response".
- Next, the second-level UBM-GMM model corresponding to "physiological response" gives: "stomach pain" with probability 0.8 and "hiccups" with probability 0.2; the final classification of the test sample is therefore "stomach pain".
- For the same sample, the single-level UBM-GMM model gives: "sleepy" with probability 0.4, "scared" with probability 0.2, "hiccups" with probability 0.2, and "stomach pain" with probability 0.2, so its final result is "sleepy".
- The test audio actually belongs to the "stomach pain" category.
- Using the hierarchical UBM-GMM model, it is identified as "stomach pain" with a high probability of 0.8, whereas the classification result of the single-level UBM-GMM model is incorrectly "sleepy".
- Through the description of the above embodiments, the method according to the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform; of course, it can also be implemented by hardware, but in many cases the former is the better implementation.
- Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions to enable a terminal device (which can be a mobile phone, a computer, a server, or a network device, etc.) to execute the method described in each embodiment of the present application.
- In this embodiment, a sound recognition device is also provided, which is used to implement the above-mentioned embodiments and preferred implementations; what has already been described will not be repeated.
- As used below, the term "module" can implement a combination of software and/or hardware with predetermined functions.
- The devices described in the following embodiments are preferably implemented by software, but implementations in hardware, or in a combination of software and hardware, are also possible and conceived.
- FIG. 5 is a structural block diagram of a sound recognition device according to an embodiment of the present application.
- The device includes: a collection module 52 configured to collect sound information emitted by a target object; a judgment module 54, coupled to the collection module 52, configured to determine whether the collected sound information emitted by the target object is crying information; an input module 56, coupled to the judgment module 54, configured to input the sound information into a pre-trained sound model if the judgment result is yes, where the sound model is obtained by training an initial sound model on a training set composed of multiple pieces of crying information, and the sound model includes a first-level model and a second-level model; the first-level model is used to identify the demand type of the sound information that characterizes the needs of the target object, and the second-level model is used to identify the demand state of the sound information within that demand type; and a recognition module 58, coupled to the input module 56, configured to identify, through the first-level model and the second-level model, the specific demand of the target object corresponding to the sound information.
- the target object involved in this application is preferably a baby, of course, it can also be a child of a few years old, or an animal.
- This application does not limit the specific objects, and the corresponding settings can be made according to the actual situation.
- The judgment module 54 in this embodiment may further include: a transcoding unit configured to transcode the collected sound information into a specified format; a processing unit configured to segment the audio of the transcoded sound information and extract spectral features from each audio segment, wherein two adjacent audio segments partially overlap each other; and a judgment unit configured to detect the spectral features of each audio segment through the classification model to determine whether the sound information is crying information.
- The specified format is preferably the wav format, with an audio sampling rate of 8000 Hz; of course, in other application scenarios, formats such as 3gp, aac, amr, caf, flac, mp3, ogg, and aiff can also be used, and on this basis the following sampling frequencies (in Hz) are all available: 8000, 11025, 12000, 16000, 22050, 24000, 32000, 40000, 44100, 47250, 48000, etc.
- The unified format (transcoding) and sampling frequency of the input audio are mainly for convenience in actual use: without transcoding, each format would need to be read separately, which would be very cumbersome; and without a unified sampling frequency, audio of the same length would contain different amounts of data, which would affect subsequent feature extraction and model training. Therefore the audio must first be preprocessed.
- the input audio is converted to wav format or other formats.
- the audio sampling frequency is unified to 8000 Hz, of course, other sampling frequencies are also possible.
- the tool for transcoding sound information in this application is preferably FFMpeg.
- The features used are preferably the Mel-frequency cepstral coefficients and their first-order and second-order differences; these features belong to the frequency-domain characteristics of the audio.
- Fig. 6 is an optional structural block diagram 1 of a sound recognition device according to an embodiment of the present application.
- The device further includes: a first acquisition module 62 configured to acquire a first data set before the sound information emitted by the target object is collected, wherein the first data set includes multiple pieces of sound information that are crying information; a first extraction module 64, coupled to the first acquisition module 62, configured to extract the spectral features of the sound information in the first data set; and a first training module 66, coupled to the first extraction module 64, configured to select part of the data from the first data set as the training set of the initial classification model, and to train the initial statistical probability model based on the spectral features in the training set to determine the parameters of the classification model.
- In a specific embodiment, babies are the target object, and the classification models are gradient boosting trees, support vector machines, and multi-layer perceptrons.
- The specific training process can be as follows:
- The first data set can be derived from data sets such as donateacry-corpus, which contains 2467 baby-cry audio recordings, and ESC-50, which contains 50 types of audio with 40 samples per type; one of the 50 categories is baby crying and the remaining 49 categories are non-crying audio, including animal calls, natural environment sounds, human voices, indoor sounds, and urban noise. Therefore there are 2507 baby-cry audio segments in total and 1960 non-cry segments. 20% of the data set is divided into the test set and 80% into the training set.
- The Mel-frequency cepstral coefficients and their first-order and second-order differences are extracted for each audio segment. The training set, with cross-validation, is used to train the gradient boosting tree (XGBoost), the support vector machine (SVM), and the multilayer perceptron (MLP) respectively, so as to determine the best parameters of each classifier. For a test-set sample, the trained gradient boosting tree, support vector machine, and multilayer perceptron each classify it.
- The classification results of the three models are voted on to produce the final classification result; the classification results over the test-set samples are used to evaluate the training effect of the model.
- FIG. 7 is a second optional structural block diagram of the sound recognition device according to the embodiment of the present application.
- The device further includes: a second acquisition module 72 configured to acquire a second data set before the sound information emitted by the target object is collected, wherein the sound information in the second data set is divided into sound information of multiple demand types, and each demand type includes sound information characterizing the demand state of the target object's needs; a second extraction module 74, coupled to the second acquisition module 72, configured to extract the spectral features of the sound information in the second data set; and a second training module 76, coupled to the second extraction module 74, configured to select part of the data from the second data set as the training set of the initial sound model, and to train the initial first-level model and the initial second-level model in the initial sound model based on the spectral features in the training set to determine the parameters of the first-level model and the second-level model in the sound model.
- In a specific embodiment, taking a baby as the target object, the sound model is a hierarchical UBM-GMM.
- The source of this second data set can be data sets such as donateacry-corpus, including 2467 baby-cry audio recordings divided into 8 categories: 740 segments when hungry, 468 segments when tired, 232 segments when lonely, 161 segments with hiccups, 268 segments with stomach pain, 115 segments when cold or hot, 149 segments when scared, and 334 segments with other discomfort. Among them, 20% of the second data set is divided into the test set, and 80% is divided into the training set.
- The reasons for using a hierarchical UBM-GMM are: (1) the amount of data in each category of the second data set varies greatly; if only a single UBM-GMM were used, categories with a large amount of data would be easy to identify while categories with a small amount of data would be difficult to identify; merging the demand states into demand types first, through the hierarchical method, reduces the imbalance in the amount of data between categories and improves the accuracy of classification; (2) the reason for a baby's crying is not always single, so sub-classifying within a major category helps to obtain all possible factors that cause the baby to cry.
- For each UBM-GMM model, as shown in Figure 4, first all the training data are used to train a GMM, called the UBM; then the data of each category are used to train a GMM to obtain the model of each category, CN-GMM; at this point the training process is complete.
- The recognition module 58 in this embodiment may further include: a first input unit configured to input the spectral features of the sound information into the first-level model to obtain the probability values of the sound information belonging to each of multiple demand types;
- a selection unit configured to select the demand type with the largest probability value from the probability values of the multiple demand types;
- a second input unit configured to input the spectral features of the sound information into the second-level model to obtain the probability values of the demand states corresponding to the selected demand type with the largest probability value;
- a recognition unit configured to take the demand state with the largest probability value as the demand state of the sound information.
- Each of the above modules can be implemented by software or hardware.
- For the latter, this can be implemented in the following manner, but is not limited to it: the above modules are all located in the same processor; or the above modules, in any combination, are located in different processors.
- the embodiments of the present application also provide a computer-readable storage medium in which a computer program is stored, wherein the computer program is configured to execute the steps in any one of the foregoing method embodiments when running.
- the foregoing computer-readable storage medium may be configured to store a computer program for executing the following steps:
- S1: collect sound information emitted by the target object;
- S2: determine whether the collected sound information emitted by the target object is crying information;
- S3: if the judgment result is yes, input the sound information into the pre-trained sound model, where the sound model is obtained by training an initial sound model on a training set composed of multiple pieces of crying information, and the sound model includes a first-level model and a second-level model; the first-level model is used to identify the demand type of the sound information that characterizes the needs of the target object, and the second-level model is used to identify the demand state of the sound information within that demand type;
- S4: identify, through the first-level model and the second-level model, the specific demand of the target object corresponding to the sound information.
- The above-mentioned computer-readable storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disc, and various other media that can store a computer program.
- the embodiment of the present application also provides an electronic device, including a memory and a processor, the memory is stored with a computer program, and the processor is configured to run the computer program to execute the steps in any of the foregoing method embodiments.
- the aforementioned electronic device may further include a transmission device and an input-output device, wherein the transmission device is connected to the aforementioned processor, and the input-output device is connected to the aforementioned processor.
- the foregoing processor may be configured to execute the following steps through a computer program:
- S1: collect sound information emitted by the target object;
- S2: determine whether the collected sound information emitted by the target object is crying information;
- S3: if the judgment result is yes, input the sound information into the pre-trained sound model, where the sound model is obtained by training an initial sound model on a training set composed of multiple pieces of crying information, and the sound model includes a first-level model and a second-level model; the first-level model is used to identify the demand type of the sound information that characterizes the needs of the target object, and the second-level model is used to identify the demand state of the sound information within that demand type;
- S4: identify, through the first-level model and the second-level model, the specific demand of the target object corresponding to the sound information.
- modules or steps of this application can be implemented by a general computing device, and they can be concentrated on a single computing device or distributed in a network composed of multiple computing devices.
- they can be implemented with program codes executable by the computing device, so that they can be stored in the storage device for execution by the computing device, and in some cases, can be executed in a different order than here.
- The sound recognition method and device, storage medium, and electronic device provided by the embodiments of the present application have the following beneficial effects: they solve the problem in the related art that recognizing a baby's cry only on the basis of human experience easily leads to recognition errors, and they improve the accuracy of recognizing the demand state represented by the cry.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Epidemiology (AREA)
- General Health & Medical Sciences (AREA)
- Public Health (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
Abstract
Description
Claims (12)
- 1. A sound recognition method, comprising: collecting sound information emitted by a target object; judging whether the collected sound information emitted by the target object is crying information; if the judgment result is yes, inputting the sound information into a pre-trained sound model, wherein the pre-trained sound model is obtained by training an initial sound model on a training set composed of multiple pieces of crying information, and the pre-trained sound model comprises a first-level model and a second-level model; the first-level model is used to identify the demand type of the sound information that characterizes the needs of the target object, and the second-level model is used to identify the demand state of the sound information within the demand type; and identifying, through the first-level model and the second-level model, the specific demand of the target object corresponding to the sound information.
- 2. The method according to claim 1, wherein judging whether the collected sound information emitted by the target object is crying information comprises: transcoding the collected sound information into a specified format; segmenting the audio of the transcoded sound information and extracting spectral features from each audio segment, wherein two adjacent audio segments partially overlap each other; and detecting the spectral features of each audio segment through a classification model to judge whether the sound information is crying information.
- 3. The method according to claim 2, wherein before collecting the sound information emitted by the target object, the method further comprises: acquiring a first data set, wherein the first data set includes multiple pieces of sound information that are crying information; extracting spectral features of the sound information in the first data set; and selecting part of the data from the first data set as a training set of an initial classification model, and training an initial statistical probability model based on the spectral features in the training set to determine parameters of the classification model.
- 4. The method according to claim 1, wherein before collecting the sound information emitted by the target object, the method further comprises: acquiring a second data set, wherein the sound information in the second data set is divided into sound information of multiple demand types, and each demand type includes sound information characterizing the demand state of the target object's needs; extracting spectral features of the sound information in the second data set; and selecting part of the data from the second data set as a training set of the initial sound model, and training an initial first-level model and an initial second-level model in the initial sound model based on the spectral features in the training set to determine parameters of the first-level model and the second-level model in the sound model.
- 5. The method according to claim 1 or 4, wherein identifying, through the first-level model and the second-level model, the demand state corresponding to the sound information that characterizes the needs of the target object comprises: inputting the spectral features of the sound information into the first-level model to obtain probability values of the sound information belonging to each of multiple demand types; selecting the demand type with the largest probability value from the probability values of the multiple demand types; inputting the spectral features of the sound information into the second-level model to obtain probability values of the demand states corresponding to the selected demand type with the largest probability value; and taking the demand state with the largest probability value as the demand state of the sound information.
- 6. A sound recognition device, comprising: a collection module configured to collect sound information emitted by a target object; a judgment module configured to judge whether the collected sound information emitted by the target object is crying information; an input module configured to, if the judgment result is yes, input the sound information into a pre-trained sound model, wherein the sound model is obtained by training an initial sound model on a training set composed of multiple pieces of crying information, and the sound model comprises a first-level model and a second-level model; the first-level model is used to identify the demand type of the sound information that characterizes the needs of the target object, and the second-level model is used to identify the demand state of the sound information within the demand type; and a recognition module configured to identify, through the first-level model and the second-level model, the specific demand of the target object corresponding to the sound information.
- 7. The device according to claim 6, wherein the judgment module comprises: a transcoding unit configured to transcode the collected sound information into a specified format; a processing unit configured to segment the audio of the transcoded sound information and extract spectral features from each audio segment, wherein two adjacent audio segments partially overlap each other; and a judgment unit configured to detect the spectral features of each audio segment through a classification model to judge whether the sound information is crying information.
- 8. The device according to claim 7, wherein the device further comprises: a first acquisition module configured to acquire a first data set before the sound information emitted by the target object is collected, wherein the first data set includes multiple pieces of sound information that are crying information; a first extraction module configured to extract spectral features of the sound information in the first data set; and a first training module configured to select part of the data from the first data set as a training set of an initial classification model, and to train an initial statistical probability model based on the spectral features in the training set to determine parameters of the classification model.
- 9. The device according to claim 6, wherein the device further comprises: a second acquisition module configured to acquire a second data set before the sound information emitted by the target object is collected, wherein the sound information in the second data set is divided into sound information of multiple demand types, and each demand type includes sound information characterizing the demand state of the target object's needs; a second extraction module configured to extract spectral features of the sound information in the second data set; and a second training module configured to select part of the data from the second data set as a training set of the initial sound model, and to train an initial first-level model and an initial second-level model in the initial sound model based on the spectral features in the training set to determine parameters of the first-level model and the second-level model in the sound model.
- 10. The device according to claim 6 or 9, wherein the recognition module comprises: a first input unit configured to input the spectral features of the sound information into the first-level model to obtain probability values of the sound information belonging to each of multiple demand types; a selection unit configured to select the demand type with the largest probability value from the probability values of the multiple demand types; a second input unit configured to input the spectral features of the sound information into the second-level model to obtain probability values of the demand states corresponding to the selected demand type with the largest probability value; and a recognition unit configured to take the demand state with the largest probability value as the demand state of the sound information.
- 11. A computer-readable storage medium in which a computer program is stored, wherein the computer program is configured to execute, when run, the method according to any one of claims 1 to 5.
- 12. An electronic device, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to run the computer program to execute the method according to any one of claims 1 to 5.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910562749.8A CN111883174A (zh) | 2019-06-26 | 2019-06-26 | 声音的识别方法及装置、存储介质和电子装置 |
CN201910562749.8 | 2019-06-26 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020259057A1 true WO2020259057A1 (zh) | 2020-12-30 |
Family
ID=73153876
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/087072 WO2020259057A1 (zh) | 2019-06-26 | 2020-04-26 | 声音的识别方法及装置、存储介质和电子装置 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111883174A (zh) |
WO (1) | WO2020259057A1 (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113488077B (zh) * | 2021-09-07 | 2021-12-07 | 珠海亿智电子科技有限公司 | 真实场景下的婴儿哭声检测方法、装置及可读介质 |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130070928A1 (en) * | 2011-09-21 | 2013-03-21 | Daniel P. W. Ellis | Methods, systems, and media for mobile audio event recognition |
CN103280220A (zh) * | 2013-04-25 | 2013-09-04 | 北京大学深圳研究生院 | 一种实时的婴儿啼哭声识别方法 |
CN104347066A (zh) * | 2013-08-09 | 2015-02-11 | 盛乐信息技术(上海)有限公司 | 基于深层神经网络的婴儿啼哭声识别方法及系统 |
CN107591162A (zh) * | 2017-07-28 | 2018-01-16 | 南京邮电大学 | 基于模式匹配的哭声识别方法及智能看护系统 |
CN107808658A (zh) * | 2016-09-06 | 2018-03-16 | 深圳声联网科技有限公司 | 基于家居环境下实时的婴儿音频系列行为检测方法 |
CN107818779A (zh) * | 2017-09-15 | 2018-03-20 | 北京理工大学 | 一种婴幼儿啼哭声检测方法、装置、设备及介质 |
CN108461091A (zh) * | 2018-03-14 | 2018-08-28 | 南京邮电大学 | 面向家居环境的智能哭声检测方法 |
CN109903780A (zh) * | 2019-02-22 | 2019-06-18 | 宝宝树(北京)信息技术有限公司 | 哭声原因模型建立方法、系统及哭声原因辨别方法 |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101807396A (zh) * | 2010-04-02 | 2010-08-18 | 陕西师范大学 | 婴儿哭闹自动记录装置及方法 |
EP4241676A3 (en) * | 2012-03-29 | 2023-10-18 | The University of Queensland | A method and apparatus for processing sound recordings of a patient |
CN103258532B (zh) * | 2012-11-28 | 2015-10-28 | 河海大学常州校区 | 一种基于模糊支持向量机的汉语语音情感识别方法 |
US9965685B2 (en) * | 2015-06-12 | 2018-05-08 | Google Llc | Method and system for detecting an audio event for smart home devices |
CN111354375A (zh) * | 2020-02-25 | 2020-06-30 | 咪咕文化科技有限公司 | 一种哭声分类方法、装置、服务器和可读存储介质 |
-
2019
- 2019-06-26 CN CN201910562749.8A patent/CN111883174A/zh active Pending
-
2020
- 2020-04-26 WO PCT/CN2020/087072 patent/WO2020259057A1/zh active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN111883174A (zh) | 2020-11-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110556129B (zh) | 双模态情感识别模型训练方法及双模态情感识别方法 | |
Schuller et al. | The interspeech 2017 computational paralinguistics challenge: Addressee, cold & snoring | |
Pramono et al. | A cough-based algorithm for automatic diagnosis of pertussis | |
CN112750465B (zh) | 一种云端语言能力评测系统及可穿戴录音终端 | |
Arias-Londoño et al. | On combining information from modulation spectra and mel-frequency cepstral coefficients for automatic detection of pathological voices | |
US20210071401A1 (en) | Smart toilet and electric appliance system | |
Hariharan et al. | Objective evaluation of speech dysfluencies using wavelet packet transform with sample entropy | |
Wu et al. | Investigation and evaluation of glottal flow waveform for voice pathology detection | |
WO2022012777A1 (en) | A computer-implemented method of providing data for an automated baby cry assessment | |
Bhagatpatil et al. | An automatic infant’s cry detection using linear frequency cepstrum coefficients (LFCC) | |
WO2020259057A1 (zh) | 声音的识别方法及装置、存储介质和电子装置 | |
Kulkarni et al. | Child cry classification-an analysis of features and models | |
CN103578480A (zh) | 负面情绪检测中的基于上下文修正的语音情感识别方法 | |
US11475876B2 (en) | Semantic recognition method and semantic recognition device | |
Aggarwal et al. | A machine learning approach to classify biomedical acoustic features for baby cries | |
Richards et al. | The LENATM automatic vocalization assessment | |
Messaoud et al. | A cry-based babies identification system | |
Milani et al. | A real-time application to detect human voice disorders | |
CN116153298A (zh) | 一种认知功能障碍筛查用的语音识别方法和装置 | |
CN113488077B (zh) | 真实场景下的婴儿哭声检测方法、装置及可读介质 | |
Rosen et al. | Infant mood prediction and emotion classification with different intelligent models | |
US20240023877A1 (en) | Detection of cognitive impairment | |
Kokot et al. | Classification of child vocal behavior for a robot-assisted autism diagnostic protocol | |
Xu et al. | Differential Time-frequency Log-mel Spectrogram Features for Vision Transformer Based Infant Cry Recognition | |
Motlagh et al. | Using general sound descriptors for early autism detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20831594 Country of ref document: EP Kind code of ref document: A1 |
NENP | Non-entry into the national phase |
Ref country code: DE |
122 | Ep: pct application non-entry in european phase |
Ref document number: 20831594 Country of ref document: EP Kind code of ref document: A1 |
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 17/05/2022) |