CN111862950A - Interactive multifunctional elderly care robot recognition system - Google Patents

Interactive multifunctional elderly care robot recognition system

Info

Publication number
CN111862950A
CN111862950A (application CN202010768423.3A)
Authority
CN
China
Prior art keywords
module
model
voice
specific
robot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010768423.3A
Other languages
Chinese (zh)
Inventor
彭志峰
彭水平
孙伟红
刘少科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen As Technology Co ltd
Original Assignee
Shenzhen As Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen As Technology Co ltd filed Critical Shenzhen As Technology Co ltd
Priority to CN202010768423.3A priority Critical patent/CN111862950A/en
Publication of CN111862950A publication Critical patent/CN111862950A/en
Legal status: Pending


Classifications

    • G10L17/04: Speaker identification or verification; training, enrolment or model building
    • G10L15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L15/063: Creation of reference templates; training of speech recognition systems
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L21/0208: Speech enhancement; noise filtering
    • G10L25/87: Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses an interactive multifunctional elderly care robot recognition system, relating to the technical field of robots. The system comprises a specific-person recognition subsystem, a non-specific-person recognition subsystem and a voice acquisition and processing module; the two recognition subsystems share the same voice acquisition and preliminary processing module. The voice acquisition and processing module comprises an endpoint detection module and a feature extraction module, and the specific-person recognition subsystem comprises a model training and recognition module and a model updating module. The endpoint detection module determines the start point and end point of speech within a sound signal containing speech, distinguishing speech signals from non-speech signals. The invention controls specific and non-specific user groups with separate subsystems, recognizing the speech of specific persons with trained speaker models and the speech of non-specific persons by comparison against a vocabulary library, which greatly improves the accuracy of voice recognition.

Description

Interactive multifunctional elderly care robot recognition system
Technical Field
The invention relates to the technical field of robots, in particular to an interactive multifunctional elderly care robot recognition system.
Background
At present, the problem of population aging in China is increasingly severe. Against this background, elderly care robots are urgently needed in social life: they can markedly improve the quality of life of the elderly, help relieve the labor shortage caused by an aging population, and reduce the care burden on sons and daughters.
In the prior art, the recognition system of an elderly care robot has certain defects. In actual use, both specific persons (the elderly) and non-specific persons (family members or visitors) interact with the robot. A recognition system must therefore ensure, for the specific population, that the sender of a command is accurately identified as a registered user, and, for the non-specific population, that differing voiceprint characteristics are recognized with sufficient accuracy.
Disclosure of Invention
The invention aims to solve the defects in the prior art and provides an interactive multifunctional elderly care robot recognition system.
In order to achieve the purpose, the invention adopts the following technical scheme:
An interactive multifunctional elderly care robot recognition system comprises a specific-person recognition subsystem, a non-specific-person recognition subsystem and a voice acquisition and processing module; the specific-person recognition subsystem and the non-specific-person recognition subsystem share the same voice acquisition and preliminary processing module; the voice acquisition and processing module comprises an endpoint detection module and a feature extraction module; and the specific-person recognition subsystem comprises a model training and recognition module and a model updating module.
Preferably: the endpoint detection module performs endpoint detection, determining the start point and end point of speech within a sound signal containing speech and distinguishing speech signals from non-speech signals.
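The patent does not specify the endpoint-detection algorithm itself. The following minimal Python sketch locates the start and end points with a short-time-energy threshold, reusing the frame sizes given in step A1 below; the noise-floor estimate and the threshold ratio are assumptions.

```python
import numpy as np

def detect_endpoints(signal, frame_len=512, frame_shift=128, ratio=4.0):
    """Return (start, end) sample indices of the detected speech span,
    or None if no frame exceeds the energy threshold."""
    n = 1 + (len(signal) - frame_len) // frame_shift
    energy = np.array([np.sum(signal[i*frame_shift:i*frame_shift + frame_len]
                              .astype(float) ** 2) for i in range(n)])
    # assumed rule: threshold = ratio * mean energy of the quietest 10% of frames
    floor = np.sort(energy)[:max(n // 10, 1)].mean()
    voiced = np.flatnonzero(energy > ratio * floor)
    if voiced.size == 0:
        return None
    return voiced[0] * frame_shift, voiced[-1] * frame_shift + frame_len
```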
Preferably: the feature extraction module comprises three stages, front-end processing, feature extraction and feature normalization, carried out in the following specific steps (a pipeline sketch follows step A7):
A1: framing: the robot microphone collects the voice signal and the voice stream is divided into frames, with a frame length of 512 sampling points and a frame shift of 128 sampling points;
A2: pre-emphasis: to compensate for the suppression of high-frequency sounds by the mouth, each frame is processed as y(0) = 0.03×s(1), y(n) = s(n) - 0.97×s(n-1), n = 1, 2, …, 511, where s(n) denotes the (n+1)-th sample point within a frame;
A3: windowing: a Hamming window attenuates the discontinuities introduced at the frame boundaries by framing: w(n) = 0.54 - 0.46×cos(2πn/(T-1)), y(n) ← y(n)×w(n), where n = 0, 1, …, T-1 and T = 512.
A4: fast Fourier transform: a radix-2 discrete Fourier transform converts the time-domain energy to frequency-domain energy: X(k) = Σ_{n=0..N-1} y(n)×e^(-j2πnk/N), where k = 0, 1, …, N-1 and N = 512; squaring then gives the energy in the real domain: E(k) = |X(k)|².
A5: MEL energy: 40-dimensional mel-frequency sub-band energies filt(i) are obtained through a bank of 40 MEL filters;
A6: mel logarithmic energy: the logarithm of each mel-frequency sub-band energy is taken: mel(i) = ln(filt(i)), i = 1, 2, …, 40;
A7: discrete cosine transform: c(d) = Σ_{i=0..M-1} mel(i)×cos(π×d×(i+0.5)/M), where d = 0, 1, …, D-1, M = 40 and D = 13, yielding 13 cepstral coefficients per frame.
Preferably: the model training and recognition module adopts a Gaussian mixture-universal background model (GMM-UBM): it fits the human vocal system through unsupervised learning, constructs a distinct specific-person model from each person's voice characteristics, and adaptively trains a pre-trained background model with a small amount of voice data from the user to be recognized. The specific process is as follows:
B1: data preparation: about 3 hours of voice data are prepared;
B2: primary clustering of the data with the K-Means algorithm;
B3: expectation maximization, further optimizing the result of step B2;
B4: specific-person target modeling;
B5: voice recognition.
Preferably: the voice data in B1 covers 80-100 specific persons and is divided into two groups by gender, with feature extraction performed on each group separately; in B2 the number of classes is 1024, and after 5 iterations 1024 classes with class centers and variances are obtained, forming the components of the Gaussian mixture model (a training sketch follows).
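A Python sketch of B1-B4 under the stated parameters (1024 components, 5 iterations; diagonal covariances assumed). scikit-learn's GaussianMixture performs the K-Means initialisation of B2 and the EM refinement of B3; the MAP mean-adaptation rule of B4 and its relevance factor r are standard GMM-UBM practice assumed here, since the patent does not spell them out.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(pooled_feats):
    """B1-B3: K-Means initialisation refined by EM on pooled 13-dim
    features from 80-100 speakers; returns (weights, means, variances)."""
    ubm = GaussianMixture(n_components=1024, covariance_type="diag",
                          init_params="kmeans", max_iter=5, random_state=0)
    ubm.fit(pooled_feats)
    return ubm.weights_, ubm.means_, ubm.covariances_

def adapt_speaker(ubm, spk_feats, r=16.0):
    """B4: derive a specific-person model by MAP adaptation of the
    UBM means; the relevance factor r is an assumption."""
    w, mu, var = ubm
    # per-frame posterior responsibility of each Gaussian component
    logp = (np.log(w) - 0.5 * np.log(2 * np.pi * var).sum(axis=1)
            - 0.5 * (((spk_feats[:, None, :] - mu) ** 2) / var).sum(axis=2))
    post = np.exp(logp - logp.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    cnt = post.sum(axis=0)                        # soft frame counts
    ex = post.T @ spk_feats / np.maximum(cnt[:, None], 1e-8)
    alpha = (cnt / (cnt + r))[:, None]            # data-dependent weight
    return w, alpha * ex + (1 - alpha) * mu, var  # adapted means
```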
Preferably: in B5, when a user issues a command to the robot or communicates with it by voice, the robot computes the score of the collected voice data against each model in the specific-person model library; after the necessary normalization of the scores, the specific person to whom the highest-scoring model belongs is the recognition result of the system. The score is computed as:
Score = (1/N) × Σ_{i=1..N} [ln P(x_i|λ_spk) - ln P(x_i|λ_UBM)]
where ln P(x_i|λ_spk) is the log-probability of the i-th speech frame on a specific-person model, ln P(x_i|λ_UBM) is the log-probability of the i-th frame on the universal background model (UBM), and N is the number of frames of the speech to be recognized.
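With the diagonal-GMM parameter tuples of the sketch above, the B5 score can be evaluated directly from the reconstructed formula; model_library is a hypothetical name for the dictionary of specific-person models.

```python
import numpy as np
from scipy.special import logsumexp

def gmm_loglik(feats, model):
    """Per-frame ln P(x_i | λ) under a diagonal-covariance GMM."""
    w, mu, var = model
    logp = (np.log(w) - 0.5 * np.log(2 * np.pi * var).sum(axis=1)
            - 0.5 * (((feats[:, None, :] - mu) ** 2) / var).sum(axis=2))
    return logsumexp(logp, axis=1)

def speaker_score(spk, ubm, feats):
    """Score = (1/N) * sum_i [ln P(x_i|λ_spk) - ln P(x_i|λ_UBM)]."""
    return float(np.mean(gmm_loglik(feats, spk) - gmm_loglik(feats, ubm)))

# recognition result: the specific person whose model scores highest
# best = max(model_library, key=lambda p: speaker_score(model_library[p], ubm, feats))
```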
Preferably: the model updating module updates the specific-person models in the recognition system. Each time the system confirms a speaker's identity with a high success rate, the voice data is stored in the database together with the identity and time information; after a period of time, the system retrains the specific-person models on the recently collected data and updates the database.
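The update policy can be sketched as a store-and-retrain loop. The confidence threshold and the retraining interval are assumptions (the patent states only "a high success rate" and "after a period of time"), and adapt_speaker is the B4 sketch above.

```python
import time
import numpy as np

SCORE_THRESHOLD = 0.5          # assumed proxy for "high success rate"
RETRAIN_INTERVAL = 7 * 86400   # assumed "period of time": one week

def store_confirmed(db, identity, feats, score):
    """Keep confirmed voice data with identity and time information."""
    if score > SCORE_THRESHOLD:
        db.append({"identity": identity, "time": time.time(), "feats": feats})

def retrain_if_due(db, ubm, models, last_trained):
    """Retrain each specific-person model on recently collected data."""
    if time.time() - last_trained < RETRAIN_INTERVAL:
        return last_trained
    for identity in {rec["identity"] for rec in db}:
        feats = np.concatenate([rec["feats"] for rec in db
                                if rec["identity"] == identity])
        models[identity] = adapt_speaker(ubm, feats)  # B4 sketch
    return time.time()
```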
Preferably: the non-specific-person recognition subsystem performs non-specific-person voice recognition with hidden Markov models. In actual use, a vocabulary library is first established, for which the parameters to be given are the number of states N, the number of observation symbols M, and three probability distributions, namely the state-transition probability matrix A, the observation-symbol probability matrix B and the initial-state probability vector π. A specific model λ = (π, A, B) is thereby established, and for given N, M, A, B and π the model can generate an observation sequence O = (o_1, o_2, …, o_T), where T is the number of observation symbols, i.e. the length of the observation sequence. The sequence is generated as follows (a sampling and scoring sketch follows step C6):
C1: select an initial state S_i according to the initial-state distribution π;
C2: set t = 1;
C3: select an output symbol o_t according to the observation-symbol probability distribution of state S_i;
C4: transition to a new state S_j according to the transition probabilities of state S_i;
C5: set t = t + 1; if t < T, jump to C3, otherwise end;
C6: score each model against the observation sequence; the recognition result is the word in the vocabulary library corresponding to the highest-scoring model.
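A Python sketch of the C1-C5 generation procedure, together with a C6 scorer; the patent does not name the scoring method, so the standard scaled forward algorithm is assumed.

```python
import numpy as np

def generate(pi, A, B, T, seed=0):
    """C1-C5: draw an observation sequence O = (o_1, ..., o_T) from the
    HMM λ = (π, A, B); A is N x N, B is N x M, π has length N."""
    rng = np.random.default_rng(seed)
    s = rng.choice(len(pi), p=pi)                   # C1: initial state
    obs = []
    for _ in range(T):                              # C2, C5: t = 1 .. T
        obs.append(rng.choice(B.shape[1], p=B[s]))  # C3: emit a symbol
        s = rng.choice(A.shape[1], p=A[s])          # C4: new state
    return obs

def forward_loglik(obs, pi, A, B):
    """C6: log P(O | λ) via the scaled forward algorithm (assumed)."""
    alpha = pi * B[:, obs[0]]
    c = alpha.sum(); alpha = alpha / c; ll = np.log(c)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        c = alpha.sum(); alpha = alpha / c; ll += np.log(c)
    return ll

# recognition: the vocabulary word whose model scores the sequence highest
# word = max(vocab, key=lambda w: forward_loglik(obs, *vocab[w]))
```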
Preferably: the voice acquisition and processing module further comprises a post-processing module that effectively controls the influence of environmental noise on voice recognition, through the following steps (a sketch follows step D4):
D1: based on the insensitivity of human auditory perception to slow changes in the excitation source, an empirical filter H(z) = 0.1×(2 + z⁻¹ - z⁻³ - 2z⁻⁴)/z⁴ × (1 - 0.98z⁻¹) is applied to reduce the proportion of energy outside the human voice frequency range;
D2: high-order differences: first- and second-order differences are computed between adjacent frames of the 13-dimensional features, yielding improved 39-dimensional features;
D3: cepstral mean subtraction: the cepstral mean is removed and the original signal correspondingly adjusted, denoising the features;
D4: Gaussian regularization of the speech features.
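Steps D2-D4 act on the 13-dimensional cepstra from step A7. In the following minimal sketch the difference operator and the Gaussianization step (approximated by mean and variance normalisation) are assumptions.

```python
import numpy as np

def postprocess(ceps):
    """D2-D4 on a (n_frames, 13) cepstral matrix -> (n_frames, 39)."""
    # D2: first- and second-order differences between adjacent frames
    d1 = np.gradient(ceps, axis=0)
    d2 = np.gradient(d1, axis=0)
    feats = np.hstack([ceps, d1, d2])
    # D3: subtract the cepstral mean over the utterance (denoising)
    feats = feats - feats.mean(axis=0, keepdims=True)
    # D4: Gaussian regularization, approximated here by scaling each
    # dimension to unit variance (full histogram Gaussianization omitted)
    feats = feats / np.maximum(feats.std(axis=0, keepdims=True), 1e-8)
    return feats
```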
Preferably: the system further comprises an autonomous function subsystem comprising a help-request module, a danger alarm module and a notification module. When the robot becomes stuck, trips or gets lost while travelling, the help-request module, after judgment by the upper computer, sends a corresponding request for help; the danger alarm module, combining the robot's various sensors, raises an alarm on detecting abnormal temperature or abnormal air composition; and the notification module reminds the user of the schedules, times and events the user has set. The system further comprises an entertainment function subsystem comprising a playing module and a wireless communication module: when the system detects an instruction such as 'play music' or 'tell a joke', the upper computer directs the wireless communication module to connect to the internet and perform the corresponding search; after searching and downloading, text, audio and video are played through the playing module.
The invention has the beneficial effects that:
1. The invention controls specific and non-specific user groups with separate subsystems, recognizing the speech of specific persons with trained speaker models and the speech of non-specific persons by comparison against a vocabulary library, which greatly improves the accuracy of voice recognition.
2. The invention determines the start and end points of speech by endpoint detection on the voice signal before extracting feature points, which greatly reduces the computational load on the upper computer and improves the response speed of the whole system.
3. The invention removes residual noise from the voice data by post-processing the extracted features, and periodically updates the specific-person model library, further improving the accuracy of voice recognition.
Drawings
Fig. 1 is a schematic diagram of the feature extraction process in an interactive multifunctional elderly care robot recognition system according to the present invention.
Detailed Description
The technical solution of the present patent will be described in further detail with reference to the following embodiments.
In the description of this patent, it should be noted that, unless otherwise expressly specified or limited, the terms "mounted", "connected" and "disposed" are to be understood broadly: for example, as fixed connection or arrangement, detachable connection or arrangement, or integral connection or arrangement. The specific meaning of these terms in this patent can be understood by those of ordinary skill in the art according to the specific circumstances.
Example 1:
An interactive multifunctional elderly care robot recognition system comprising a specific-person recognition subsystem, a non-specific-person recognition subsystem and a voice acquisition and processing module, the two recognition subsystems sharing the same voice acquisition and preliminary processing module. The voice acquisition and processing module comprises the endpoint detection module and the feature extraction module, and the specific-person recognition subsystem comprises the model training and recognition module and the model updating module. Endpoint detection, feature extraction (steps A1-A7), specific-person model training, recognition and scoring (steps B1-B5) and non-specific-person recognition with hidden Markov models (steps C1-C6) are carried out exactly as described in the disclosure above.
Example 2:
As in Example 1, except that the voice acquisition and processing module further comprises the post-processing module described above, which controls the influence of environmental noise on voice recognition through steps D1-D4.
Example 3:
As in Example 2, with the further addition of the autonomous function subsystem comprising the help-request module, the danger alarm module and the notification module: when the robot becomes stuck, trips or gets lost while travelling, the help-request module, after judgment by the upper computer, sends a corresponding request for help; the danger alarm module, combining the robot's various sensors, raises an alarm on detecting abnormal temperature or abnormal air composition; and the notification module reminds the user of the schedules, times and events the user has set.
Example 4:
As in Example 3, with the further addition of the entertainment function subsystem comprising a playing module and a wireless communication module: when the system detects an instruction such as 'play music' or 'tell a joke', the upper computer directs the wireless communication module to connect to the internet and perform the corresponding search; after searching and downloading, text, audio and video are played through the playing module.
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto; any equivalent replacement or modification of the technical solution and its inventive concept, made by a person skilled in the art within the technical scope disclosed by the present invention, shall be covered by the scope of protection of the present invention.

Claims (10)

1. An interactive multifunctional elderly care robot recognition system, characterized in that it comprises a specific-person recognition subsystem, a non-specific-person recognition subsystem and a voice acquisition and processing module, the specific-person recognition subsystem and the non-specific-person recognition subsystem sharing the same voice acquisition and preliminary processing module; the voice acquisition and processing module comprises an endpoint detection module and a feature extraction module, and the specific-person recognition subsystem comprises a model training and recognition module and a model updating module.
2. The system of claim 1, wherein the endpoint detection module determines the start point and end point of speech within a sound signal containing speech and distinguishes speech signals from non-speech signals.
3. The system of claim 1, wherein the feature extraction module comprises three stages, front-end processing, feature extraction and feature normalization, with the following specific steps:
A1: framing: the robot microphone collects the voice signal and the voice stream is divided into frames, with a frame length of 512 sampling points and a frame shift of 128 sampling points;
A2: pre-emphasis: to compensate for the suppression of high-frequency sounds by the mouth, each frame is processed as y(0) = 0.03×s(1), y(n) = s(n) - 0.97×s(n-1), n = 1, 2, …, 511, where s(n) denotes the (n+1)-th sample point within a frame;
A3: windowing: a Hamming window attenuates the discontinuities introduced at the frame boundaries by framing: w(n) = 0.54 - 0.46×cos(2πn/(T-1)), y(n) ← y(n)×w(n), where n = 0, 1, …, T-1 and T = 512.
A4: fast Fourier transform: a radix-2 discrete Fourier transform converts the time-domain energy to frequency-domain energy: X(k) = Σ_{n=0..N-1} y(n)×e^(-j2πnk/N), where k = 0, 1, …, N-1 and N = 512; squaring then gives the energy in the real domain: E(k) = |X(k)|².
A5: MEL energy: 40-dimensional mel-frequency sub-band energies filt(i) are obtained through a bank of 40 MEL filters;
A6: mel logarithmic energy: the logarithm of each mel-frequency sub-band energy is taken: mel(i) = ln(filt(i)), i = 1, 2, …, 40;
A7: discrete cosine transform: c(d) = Σ_{i=0..M-1} mel(i)×cos(π×d×(i+0.5)/M), where d = 0, 1, …, D-1, M = 40 and D = 13, yielding 13 cepstral coefficients per frame.
4. The system of claim 1, wherein the model training and recognition module adopts a Gaussian mixture-universal background model that fits the human vocal system through unsupervised learning, constructs a distinct specific-person model from each person's voice characteristics, and adaptively trains a pre-trained background model with a small amount of voice data from the user to be recognized, the specific process being as follows:
B1: data preparation: about 3 hours of voice data are prepared;
B2: primary clustering of the data with the K-Means algorithm;
B3: expectation maximization, further optimizing the result of step B2;
B4: specific-person target modeling;
B5: voice recognition.
5. The system of claim 4, wherein the voice data in B1 covers 80-100 specific persons and is divided into two groups by gender, feature extraction being performed on each group separately; and wherein in B2 the number of classes is 1024, and after 5 iterations 1024 classes with class centers and variances are obtained, forming the components of the Gaussian mixture model.
6. The system of claim 5, wherein in B5, when a user issues a command to the robot or communicates with it by voice, the robot computes the score of the collected voice data against each model in the specific-person model library; after the necessary normalization of the scores, the specific person to whom the highest-scoring model belongs is the recognition result of the system, the score being computed as:
Score = (1/N) × Σ_{i=1..N} [ln P(x_i|λ_spk) - ln P(x_i|λ_UBM)]
where ln P(x_i|λ_spk) is the log-probability of the i-th speech frame on a specific-person model, ln P(x_i|λ_UBM) is the log-probability of the i-th frame on the universal background model, and N is the number of frames of the speech to be recognized.
7. The system of claim 1, wherein the model updating module updates the specific-person models in the system: each time the system confirms a speaker's identity with a high success rate, the voice data is stored in the database together with the identity and time information, and after a period of time the system retrains the specific-person models on the recently collected data and updates the database.
8. The system of claim 1, wherein the non-specific-person recognition subsystem performs non-specific-person voice recognition with hidden Markov models: in actual use a vocabulary library is first established, for which the parameters to be given are the number of states N, the number of observation symbols M, and three probability distributions, namely the state-transition probability matrix A, the observation-symbol probability matrix B and the initial-state probability vector π; a specific model λ = (π, A, B) is thereby established, and for given N, M, A, B and π the model generates an observation sequence O = (o_1, o_2, …, o_T), where T is the number of observation symbols, i.e. the length of the observation sequence, generated as follows:
C1: select an initial state S_i according to the initial-state distribution π;
C2: set t = 1;
C3: select an output symbol o_t according to the observation-symbol probability distribution of state S_i;
C4: transition to a new state S_j according to the transition probabilities of state S_i;
C5: set t = t + 1; if t < T, jump to C3, otherwise end;
C6: score each model against the observation sequence; the recognition result is the word in the vocabulary library corresponding to the highest-scoring model.
9. The system of claim 1, wherein the voice acquisition and processing module further comprises a post-processing module that effectively controls the influence of environmental noise on voice recognition through the following steps:
D1: based on the insensitivity of human auditory perception to slow changes in the excitation source, an empirical filter H(z) = 0.1×(2 + z⁻¹ - z⁻³ - 2z⁻⁴)/z⁴ × (1 - 0.98z⁻¹) is applied to reduce the proportion of energy outside the human voice frequency range;
D2: high-order differences: first- and second-order differences are computed between adjacent frames of the 13-dimensional features, yielding improved 39-dimensional features;
D3: cepstral mean subtraction: the cepstral mean is removed and the original signal correspondingly adjusted, denoising the features;
D4: Gaussian regularization of the speech features.
10. The system of claim 1, further comprising an autonomous function subsystem and an entertainment function subsystem, wherein the autonomous function subsystem comprises a help-request module, a danger alarm module and a notification module: the help-request module sends a corresponding request for help, after judgment by the upper computer, when the robot becomes stuck, trips or gets lost while travelling; the danger alarm module, combining the robot's various sensors, raises an alarm on detecting abnormal temperature or abnormal air composition; and the notification module reminds the user of the schedules, times and events the user has set; and wherein the entertainment function subsystem comprises a playing module and a wireless communication module: when the system detects an instruction such as 'play music' or 'tell a joke', the upper computer directs the wireless communication module to connect to the internet and perform the corresponding search, and, after searching and downloading, text, audio and video are played through the playing module.
CN202010768423.3A 2020-08-03 2020-08-03 Interactive multifunctional elderly care robot recognition system Pending CN111862950A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010768423.3A CN111862950A (en) 2020-08-03 2020-08-03 Interactive multifunctional elderly care robot recognition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010768423.3A CN111862950A (en) 2020-08-03 2020-08-03 Interactive multifunctional elderly care robot recognition system

Publications (1)

Publication Number Publication Date
CN111862950A true CN111862950A (en) 2020-10-30

Family

ID=72953675

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010768423.3A Pending CN111862950A (en) 2020-08-03 2020-08-03 Interactive multifunctional elderly care robot recognition system

Country Status (1)

Country Link
CN (1) CN111862950A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1293428A (en) * 2000-11-10 2001-05-02 清华大学 Information check method based on speech recognition
CN101042868A (en) * 2006-03-20 2007-09-26 富士通株式会社 Clustering system, clustering method, clustering program and attribute estimation system using clustering system
CN102324232A (en) * 2011-09-12 2012-01-18 辽宁工业大学 Method for recognizing sound-groove and system based on gauss hybrid models
CN108470567A (en) * 2018-03-15 2018-08-31 青岛海尔科技有限公司 A kind of voice interactive method, device, storage medium and computer equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
李燕萍: "Speaker-independent speech recognition system", China Excellent Master's Theses Full-text Database, Information Science and Technology *
梁永立: "Research on speech recognition for service robots", China Excellent Master's Theses Full-text Database, Information Science and Technology *
武宁 et al.: "Speaker recognition system for home robots", Computer Engineering *

Similar Documents

Publication Publication Date Title
CN110310623B (en) Sample generation method, model training method, device, medium, and electronic apparatus
Kandali et al. Emotion recognition from Assamese speeches using MFCC features and GMM classifier
KR100826875B1 (en) On-line speaker recognition method and apparatus for thereof
CN105206271A (en) Intelligent equipment voice wake-up method and system for realizing method
CN111667818B (en) Method and device for training wake-up model
CN106548775B (en) Voice recognition method and system
CN108922541A (en) Multidimensional characteristic parameter method for recognizing sound-groove based on DTW and GMM model
Samantaray et al. A novel approach of speech emotion recognition with prosody, quality and derived features using SVM classifier for a class of North-Eastern Languages
Ge et al. Deep neural network based wake-up-word speech recognition with two-stage detection
CN112687291B (en) Pronunciation defect recognition model training method and pronunciation defect recognition method
CN110189746A (en) A kind of method for recognizing speech applied to earth-space communication
CN111798846A (en) Voice command word recognition method and device, conference terminal and conference terminal system
Hidayat et al. Wavelet detail coefficient as a novel wavelet-mfcc features in text-dependent speaker recognition system
Ons et al. A self learning vocal interface for speech-impaired users
Loh et al. Speech recognition interactive system for vehicle
Aggarwal Improving hindi speech recognition using filter bank optimization and acoustic model refinement
CN114155882B (en) Method and device for judging emotion of road anger based on voice recognition
Balaji et al. Waveform analysis and feature extraction from speech data of dysarthric persons
CN111862950A (en) Interactive multifunctional elderly care robot recognition system
CN111833869B (en) Voice interaction method and system applied to urban brain
Li et al. An auditory system-based feature for robust speech recognition
CN112216270B (en) Speech phoneme recognition method and system, electronic equipment and storage medium
Boril et al. Data-driven design of front-end filter bank for Lombard speech recognition
Kuppusamy et al. Speaker recognition system based on age-related features using convolutional and deep neural networks
CN114171009A (en) Voice recognition method, device, equipment and storage medium for target equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201030