CN108806723A - Baby's audio recognition method and device - Google Patents

Baby's audio recognition method and device

Info

Publication number
CN108806723A
CN108806723A (application CN201810490697.3A)
Authority
CN
China
Prior art keywords
voice
infant
baby
acoustic model
microphone array
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810490697.3A
Other languages
Chinese (zh)
Other versions
CN108806723B (en)
Inventor
许仿珍
向勇阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Waterward Information Co Ltd
Original Assignee
Shenzhen Water World Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Water World Co Ltd filed Critical Shenzhen Water World Co Ltd
Priority to CN201810490697.3A priority Critical patent/CN108806723B/en
Publication of CN108806723A publication Critical patent/CN108806723A/en
Application granted granted Critical
Publication of CN108806723B publication Critical patent/CN108806723B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161: Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166: Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Machine Translation (AREA)

Abstract

The present invention discloses an infant voice recognition method and device. The infant voice recognition method includes: receiving a voice signal acquired by a microphone array; separating the infant voice from the voice signal; and inputting the infant voice into a pre-trained acoustic model to recognize it. By recognizing infant voice as language that adults can understand, the present invention can better help adults judge the wishes and needs of an infant.

Description

Baby voice recognition method and device
Technical Field
The invention relates to the technical field of voice recognition, in particular to a method and a device for recognizing a baby voice.
Background
An infant is too young to speak and can only produce babbling sounds to express its will and needs, but inexperienced parents generally do not know exactly what the infant means by these babbling sounds and cannot understand the infant's will and needs in time.
Therefore, there is a need for a method or apparatus for recognizing the babbling voice of an infant, which can accurately recognize the information conveyed by the infant's voice, so that an adult can accurately know the condition and needs of the infant and provide better care.
Disclosure of Invention
The invention mainly aims to provide an infant voice recognition method that recognizes infant voice as language understood by adults, thereby better helping adults judge the wishes and needs of infants.
The invention provides a baby voice recognition method, which comprises the following steps:
receiving a voice signal acquired by a microphone array;
separating the baby voice in the voice signal;
and inputting the infant voice into a pre-trained acoustic model, and recognizing the infant voice.
Preferably, the step of separating the baby voice in the voice signal comprises:
and separating the baby voice in the voice signal by adopting a FastICA algorithm.
Preferably, the step of receiving the voice signal acquired by the microphone array is preceded by:
acquiring the intensity variation of the voice signal through the microphone array, and judging whether the microphone array is oriented to face the position of the infant;
if not, adjusting the orientation of the microphone array to face the position of the infant.
Preferably, the step of inputting the infant speech into a pre-trained acoustic model and recognizing the infant speech includes:
inputting the infant speech into the acoustic model;
obtaining at least one tag sequence corresponding to the infant voice output by the acoustic model;
obtaining a label sequence corresponding to the maximum probability by comparing the correct mapping probability of each label sequence;
and outputting the label sequence corresponding to the maximum probability.
Preferably, after the step of inputting the infant speech into a pre-trained acoustic model and recognizing the infant speech, the method includes:
and transmitting the identified baby voice to the corresponding terminal equipment of the guardian.
The invention also provides a baby voice recognition device, comprising:
the receiving module is used for receiving the voice signals acquired by the microphone array;
the separation module is used for separating the baby voice in the voice signal;
and the recognition module is used for inputting the infant voice into a pre-trained acoustic model and recognizing the infant voice.
Preferably, the separation module comprises:
and the separation unit is used for separating the baby voice in the voice signal by adopting a FastICA algorithm.
Preferably, the infant speech recognition apparatus includes:
the judgment module is used for acquiring the intensity variation of the voice signal through the microphone array and judging whether the microphone array is oriented to face the position of the infant;
and the adjusting module is used for adjusting the orientation of the microphone array to face the position of the infant if it is not.
Preferably, the identification module includes:
an input unit for inputting the infant speech into the acoustic model;
the acquisition unit is used for acquiring at least one label sequence corresponding to the infant voice output by the acoustic model;
the comparison unit is used for obtaining the label sequence corresponding to the maximum probability by comparing the correct mapping probability of each label sequence;
and the output unit is used for outputting the label sequence corresponding to the maximum probability.
Preferably, the infant speech recognition apparatus includes:
and the sending module is used for transmitting the identified baby voice to the corresponding terminal equipment of the guardian.
The infant voice recognition method and device provided by the invention have the following beneficial effects: a voice signal is acquired through a microphone array, the infant voice to be recognized is separated from the voice signal, and the infant voice is input into a pre-trained acoustic model for recognition. Recognizing the infant's babbling voice as language an adult can understand helps the adult better judge the infant's intentions and needs, know the infant's condition accurately and in time, and give corresponding care.
Drawings
FIG. 1 is a flow chart of a baby speech recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an infant speech recognition apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a separation module according to another embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an infant speech recognition apparatus according to another embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an identification module according to another embodiment of the present invention;
FIG. 6 is a schematic diagram of an optimized structure of an infant speech recognition apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an optimized structure of an infant speech recognition apparatus according to another embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, an infant speech recognition method according to an embodiment of the present invention includes:
s1: a speech signal acquired by a microphone array is received.
A microphone array is a system consisting of a certain number of acoustic sensors (generally microphones) used to sample and process the spatial characteristics of a sound field. In the embodiment of the present invention, a microphone array formed by a plurality of microphones acquires the speech signal; processing methods such as beamforming and noise reduction then filter the signal according to the phase differences of the sound waves arriving at the microphones, reducing the influence of environmental noise on subsequent processing and improving speech quality.
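As an illustrative sketch (not part of the original disclosure), the following Python code shows one common beamforming technique, delay-and-sum, which aligns the per-microphone signals according to their arrival delays for a chosen steering direction and averages them so that sound from that direction adds coherently. The array geometry, sampling rate, and sign convention for the delays are assumptions made for the example.

```python
import numpy as np

def delay_and_sum(signals, mic_positions, direction, fs, c=343.0):
    """Minimal delay-and-sum beamformer sketch.

    signals:       (num_mics, num_samples) array of recorded samples
    mic_positions: (num_mics, 3) microphone coordinates in meters
    direction:     unit vector pointing toward the target source
    fs:            sampling rate in Hz
    c:             speed of sound in m/s
    """
    num_mics, num_samples = signals.shape
    # Relative propagation delay of a plane wave at each microphone
    # (sign convention assumed; it depends on how `direction` is defined).
    delays = mic_positions @ direction / c
    delays -= delays.min()                    # make all delays non-negative
    out = np.zeros(num_samples)
    for m in range(num_mics):
        shift = int(round(delays[m] * fs))    # delay in whole samples
        # Advance each channel by its delay so the wavefronts line up.
        out[: num_samples - shift] += signals[m, shift:]
    return out / num_mics

# Hypothetical usage: four microphones in a 4 cm square, steered broadside.
fs = 16000
mics = np.array([[0, 0, 0], [0.04, 0, 0], [0, 0.04, 0], [0.04, 0.04, 0]])
x = np.random.randn(4, fs)                    # stand-in for real recordings
enhanced = delay_and_sum(x, mics, np.array([0.0, 0.0, 1.0]), fs)
```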
S2: separating the baby voice in the voice signal.
Since the speech signal is mixed with other sounds besides the baby speech, the baby speech required by the embodiment of the present invention needs to be accurately separated from the speech signal collected by the microphone array, for example, by some separation algorithm.
S3: and inputting the infant voice into a pre-trained acoustic model, and recognizing the infant voice.
The pre-trained acoustic model has been trained multiple times using a certain number of infant voice samples as input, and can output the label sequence of maximum probability corresponding to the infant voice. Therefore, the pre-trained acoustic model can be used to recognize the input infant voice to be recognized.
A voice signal is acquired through the microphone array, and the infant voice separated from the voice signal is input into a pre-trained acoustic model, so that the infant voice can be recognized. Because an infant is too young, it can generally only produce babbling sounds rather than language an adult can understand; through the infant voice recognition method of the embodiment of the present invention, the infant's babbling voice is recognized as language an adult can understand, making it convenient for the adult to truly know the infant's condition and needs.
Preferably, in another embodiment of the present invention, the step S2 includes:
and separating the baby voice in the voice signal by adopting a FastICA algorithm.
The FastICA algorithm is based on fixed-point iteration, is suitable for any type of data, makes it feasible to apply ICA to high-dimensional data, and, thanks to its fixed-point optimization, converges quickly and stably. FastICA is essentially a neural-network-style method for minimizing the mutual information of the estimated components; it approximates negentropy using the maximum-entropy principle and optimizes it with a suitable nonlinear function g. Among neural algorithms it has many advantages: it is parallel, distributed, computationally simple, and has small memory requirements. In addition, the FastICA algorithm separates the infant voice within the voice signal into sequences.
In the embodiment of the invention, FastICA accurately separates the infant voice from the voice signal without leaving much interference, which improves the accuracy of the subsequent recognition steps.
The embodiment of the invention adopts a FastICA algorithm based on maximum negentropy to separate the infant voice from the voice signal. The specific process is as follows:
First, suppose there are N sound source signals s_1, s_2, s_3, …, s_N, A is the unknown mixing matrix, and there are m microphones with m > N. Let H be the signals received by the microphones; then H = As. Define the infant voice sequence to be separated as x, let x = U^T H, where U is a vector of parameters, and let the length of x be T.
(1) Zero-mean x, i.e., center x so that its mean is 0.
(2) Seek a linear transformation of x such that, after projection into the new subspace, x becomes the whitened vector z.
Whitening the zero-meaned data removes the correlation among the observed signals, simplifies the subsequent independent-component extraction, and gives the algorithm good convergence.
(3) Set the number of infant voice sequences x to be estimated to n, and set the iteration index p = 1.
(4) Select a random initialization vector U_p.
(5) Update the value of U_p: U_p = E{z g(U_p^T z)} − E{g′(U_p^T z)} U_p,
where E{z g(U_p^T z)} and E{g′(U_p^T z)} U_p denote mean operations, and g(y) = tanh(ay) is a nonlinear function with 1 ≤ a ≤ 2, where a generally takes the value 1.
(6) Orthogonalize U_p against the components already found: U_p = U_p − Σ_{j=1}^{p−1} (U_p^T U_j) U_j.
(7) After each iteration, normalize U_p: U_p = U_p / ||U_p||.
(8) If U_p has not converged, return to step (5) and repeat steps (5)–(7) until U_p converges.
(9) Set p = p + 1 and return to step (4), repeating steps (4)–(8) until p = n, at which point the U_p of all independent components have been computed.
(10) From x = U^T H, solve for all n infant voice sequences x, which serve as the input to the pre-trained acoustic model.
In addition, the separation algorithm employed in the present embodiment includes, but is not limited to, the FastICA algorithm, and other algorithms that achieve similar effects to the FastICA algorithm may also be used.
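For illustration only, the following Python sketch implements the deflation-based FastICA iteration described in steps (1)-(10) above, with g = tanh (a = 1). The synthetic sources and mixing matrix in the usage example are assumptions, not data from the patent.

```python
import numpy as np

def fastica_deflation(H, n_components, max_iter=200, tol=1e-6):
    """Sketch of negentropy-based FastICA with g = tanh (a = 1)."""
    m, T = H.shape
    # (1) Zero-mean the observations.
    Hc = H - H.mean(axis=1, keepdims=True)
    # (2) Whiten: z = V @ Hc has identity covariance.
    cov = Hc @ Hc.T / T
    d, E = np.linalg.eigh(cov)
    V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T
    z = V @ Hc
    W = np.zeros((n_components, m))
    for p in range(n_components):                  # (3)-(4)
        U = np.random.randn(m)
        U /= np.linalg.norm(U)
        for _ in range(max_iter):
            U_old = U
            g = np.tanh(U @ z)                     # g(U^T z)
            g_prime = 1.0 - g ** 2                 # g'(U^T z)
            # (5) Fixed-point update: E{z g(U^T z)} - E{g'(U^T z)} U
            U = (z * g).mean(axis=1) - g_prime.mean() * U
            # (6) Deflation: orthogonalize against found components.
            U -= W[:p].T @ (W[:p] @ U)
            U /= np.linalg.norm(U)                 # (7) normalize
            if abs(abs(U @ U_old) - 1.0) < tol:    # (8) convergence check
                break
        W[p] = U                                   # (9) next component
    return W @ z                                   # (10) recovered sources

# Hypothetical usage: two synthetic sources mixed onto three "microphones".
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 8000)
s = np.vstack([np.sin(2 * np.pi * 440 * t),
               np.sign(np.sin(2 * np.pi * 3 * t))])
A = rng.standard_normal((3, 2))                    # unknown mixing matrix
sources = fastica_deflation(A @ s, n_components=2)
```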
Preferably, in another embodiment of the present invention, before step S1, the method includes:
S10: acquiring the intensity variation of the voice signal through the microphone array, and judging whether the microphone array is oriented to face the position of the infant.
S11: if not, adjusting the orientation of the microphone array to face the position of the infant.
Orienting the microphone array to directly face the infant makes the collected voice signal stronger and gives a better result. In addition, the number and geometry of the microphones in the array can be varied in many ways, for example four microphones in a square arrangement or six microphones in a circular arrangement.
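The patent does not specify how the orientation is adjusted; as a hedged sketch, one simple realization sweeps the array through candidate orientations and keeps the one with the highest received signal intensity. The capture_frame and set_orientation callables below are hypothetical hardware interfaces assumed for the example.

```python
import numpy as np

def rms(frame):
    """Root-mean-square intensity of one captured frame."""
    return float(np.sqrt(np.mean(np.square(frame))))

def face_infant(capture_frame, set_orientation, candidate_angles):
    """Rotate the array to the angle with the strongest voice signal."""
    best_angle, best_level = None, -1.0
    for angle in candidate_angles:
        set_orientation(angle)            # hypothetical rotation interface
        level = rms(capture_frame())      # signal intensity at this angle
        if level > best_level:
            best_angle, best_level = angle, level
    set_orientation(best_angle)           # face the loudest direction
    return best_angle
```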
Preferably, in another embodiment of the present invention, step S3 includes:
s30: inputting the infant speech into the acoustic model.
S31: and acquiring at least one tag sequence corresponding to the baby voice output by the acoustic model.
S32: and obtaining the label sequence corresponding to the maximum probability by comparing the correct mapping probability of each label sequence.
S33: and outputting the label sequence corresponding to the maximum probability.
The acoustic model is a BLSTM (Bidirectional Long Short-Term Memory) network; the trained acoustic model is obtained after training the BLSTM using CTC (Connectionist Temporal Classification) as the training criterion.
An RNN (Recurrent Neural Network) is a neural network with fixed weights, external inputs, and internal states; its recurrent structure allows information to persist, and its behavior can be regarded as the dynamics of the internal states, with the weights and external inputs as parameters.
A BLSTM is a special recurrent neural network that can learn long-term dependencies. Through its input, output, and forget gates, a BLSTM controls the flow of information well and has a long-term memory capability; compared with a plain RNN, it alleviates the problems of vanishing and exploding gradients to a certain extent. Compared with a unidirectional LSTM (Long Short-Term Memory) network, it also takes reverse temporal information into account, i.e., the influence of future time steps on the current one, which improves the accuracy of speech recognition to a certain extent.
CTC (Connectionist Temporal Classification) is used to transform the network output of the BLSTM acoustic model into a conditional probability distribution over label sequences; CTC then completes the classification by selecting the most likely label sequence for a given input sequence. Training the acoustic model with CTC means obtaining the most representative model parameters from a large number of known utterances according to certain criteria. With CTC there is only one neural network model from speech features (input) to text strings (output), i.e., an "end-to-end" model.
The specific process of step S3 is as follows:
(1) Consider a BLSTM with m inputs, n outputs, and weight matrix W, viewed as a continuous map N_W: (R^m)^T → (R^n)^T. Let y = N_W(x) be the output sequence of the BLSTM, where x denotes the input infant voice sequence, and let y_k^t denote the activation of output unit k at the t-th time step (time t), i.e., the probability that the output at time step t is k. For example, when the output sequence is (a-ab-), y_a^3 represents the probability that the letter output at time step 3 (time 3) is a.
(2) Define a character set L′ = L ∪ {blank} and regard sequences over L′ as label sequences, where L is the set of labels and blank is a blank label. In CTC the output probability of each label resembles a spike, and blank serves to separate the labels from one another; for example, the labels may be the 26 English letters plus the blank label.
(3) Regard the elements of L′^T as paths, denoted π. Assuming the label probabilities output at each time step (moment) are independent of one another, p(π|x) denotes the probability that the output path is π when the input is x:
p(π|x) = ∏_{t=1}^{T} y_{π_t}^t
This formula can be understood as the product, over the time steps (moments), of the probabilities of the labels along the output path π.
For example, if the label sequence is day, the sequence output during training may be dd-aaa-yy-.
(4) Define a many-to-one mapping β: L′^T → L^{≤T} representing the transformation that maps an output path π to a label sequence l, where L^{≤T} is the set of possible label sequences.
The mapping function β merges consecutive repeated labels and removes all blank labels from the output path; for example, β(b-a-c) = bac.
In addition, many possible paths are converted into the same label sequence by the mapping β, and the total probability equals the sum of all those path probabilities.
(5) The probability that the output sequence is the label sequence l can therefore be expressed as the sum of the probabilities of all output paths π that map to l under β:
p(l|x) = ∑_{π ∈ β^{−1}(l)} p(π|x)
where p(l|x) denotes the probability that the output sequence is l when the input is x. A label sequence l can generally be obtained from several paths, i.e., l corresponds to several π; for example, the different paths β(aa-a-bb-), β(a-aab-), and β(a-aaa-b-) all yield the label sequence (aab), where - denotes the blank label mentioned above.
(6) Recognition, also called decoding, aims to output the label sequence of maximum probability for an input sequence x: given the distribution p(l|x) over output sequences, the one with the maximum probability is chosen as the output. This logic can be expressed as
l* = argmax_{l ∈ L^{≤T}} p(l|x)
The best-path decoding method or the prefix-search decoding method, both prior art and not described here, may be used to find the label sequence of highest probability. Finding the label sequence of highest probability completes the recognition of the input infant voice sequence x; that is, the input infant voice sequence is converted into language an adult can understand, for example a text sequence, making it convenient for adults to know the infant's wishes and needs.
Preferably, in another embodiment of the present invention, after step S3, the method includes:
S4: transmitting the recognized infant voice to the terminal device of the corresponding guardian.
The infant's babbling voice is recognized as language an adult can understand and then transmitted to the terminal device of the corresponding guardian, so that the guardian can better judge the infant's wishes and needs according to the recognized infant voice and respond to the infant accordingly.
Preferably, in another embodiment of the present invention, after step S2, the method includes:
S21: training the acoustic model, with a BLSTM as the acoustic model and CTC as the training criterion, to obtain the pre-trained acoustic model.
Combining a BLSTM with the CTC training criterion directly optimizes the correspondence between the input voice sequence and the output label sequence, thereby realizing the recognition of infant voice.
The specific process of step S21 is as follows:
(1) Define α_t(s) as the forward variable and β_t(s) as the backward variable. Taking the negative logarithm of the probability p(z|x) as the objective function and training it by maximum likelihood gives
O^{ML}(S, N_W) = −∑_{(x,z)∈S} ln p(z|x)
where S denotes the training set, each sample in S is a sequence pair (x, z), z is the target sequence, x is the input sequence, p(z|x) denotes the probability of outputting the sequence z given the input sequence x, and the target sequence is shorter than the input sequence. Training the objective function by maximum likelihood simultaneously maximizes the log probability of correct classification over the whole training set, i.e., minimizes the objective function.
(2) Compute the gradient of the objective function, i.e., take its partial derivative with respect to the outputs of the BLSTM:
∂O^{ML}({(x,z)}, N_W) / ∂y_k^t = −∂ ln p(z|x) / ∂y_k^t
where y_k^t denotes the probability that the output of the output sequence at the t-th time step (moment) is k, and N_W is the bidirectional long short-term memory network given by the continuous map with weight matrix W.
(3) Then, deriving with the forward-backward algorithm yields
∂p(z|x) / ∂y_k^t = (1 / (y_k^t)^2) ∑_{s ∈ lab(l,k)} α_t(s) β_t(s)
where the target sequence z corresponds to the label sequence l, and lab(l,k) denotes the set of positions in l at which label k occurs.
(4) Having obtained ∂p(z|x)/∂y_k^t, perform gradient back-propagation to obtain the gradients with respect to the weights, update the parameter set w of the acoustic model's weight matrix, and finish training to obtain the pre-trained acoustic model.
The gradient back-propagation in step (4) uses the back-propagation algorithm, whose steps are as follows:
1) Compute the output value of each neuron in the forward direction.
2) Compute the error term of each neuron in the backward direction. The backward propagation of BLSTM error terms proceeds in two directions: one is back-propagation through time, i.e., computing the error term at each earlier moment starting from the current time t; the other is propagating the error term up to the previous layer.
3) Compute the gradient of each weight from the corresponding error terms.
In addition, when training the acoustic model with CTC as the training criterion, a certain number of infant voice sequence samples are used as the input of the acoustic model, so as to determine the parameter set w of the weight matrix corresponding to the infant voice sequence samples; the trained acoustic model thereby acquires associative memory and prediction capability, and the pre-trained acoustic model can effectively recognize input infant voice sequences.
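For illustration only, here is a minimal PyTorch sketch of the BLSTM-plus-CTC training setup described above; the feature dimension, label set size, batch shapes, and training data are all assumptions for the example, not values from the patent.

```python
import torch
import torch.nn as nn

class BLSTMAcousticModel(nn.Module):
    """Bidirectional LSTM mapping feature frames to per-frame label log-probs."""
    def __init__(self, n_feats=40, n_hidden=128, n_labels=27):  # 26 letters + blank
        super().__init__()
        self.blstm = nn.LSTM(n_feats, n_hidden, batch_first=True,
                             bidirectional=True)
        self.proj = nn.Linear(2 * n_hidden, n_labels)

    def forward(self, x):                      # x: (batch, T, n_feats)
        h, _ = self.blstm(x)
        return self.proj(h).log_softmax(-1)    # (batch, T, n_labels)

model = BLSTMAcousticModel()
ctc = nn.CTCLoss(blank=0)                      # index 0 is the blank label
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Hypothetical batch: 4 utterances of 100 frames, target sequences of length 8.
x = torch.randn(4, 100, 40)
targets = torch.randint(1, 27, (4, 8))
in_lens = torch.full((4,), 100)
tgt_lens = torch.full((4,), 8)

log_probs = model(x).transpose(0, 1)           # CTCLoss expects (T, batch, labels)
loss = ctc(log_probs, targets, in_lens, tgt_lens)  # -ln p(z|x), the CTC objective
loss.backward()                                # gradient back-propagation
opt.step()                                     # update the parameter set w
```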
Referring to fig. 2, an infant speech recognition apparatus according to an embodiment of the present invention includes:
the receiving module 1 is used for receiving the voice signals acquired by the microphone array.
A microphone array is a system consisting of a certain number of acoustic sensors (generally microphones) used to sample and process the spatial characteristics of a sound field. In the embodiment of the present invention, a microphone array formed by a plurality of microphones acquires the speech signal; processing methods such as beamforming and noise reduction then filter the signal according to the phase differences of the sound waves arriving at the microphones, reducing the influence of environmental noise on subsequent processing and improving speech quality.
And the separation module 2 is used for separating the baby voice in the voice signal.
Since the speech signal is mixed with other sounds besides the baby speech, the baby speech required by the embodiment of the present invention needs to be accurately separated from the speech signal collected by the microphone array, for example, by some separation algorithm.
And the recognition module 3 is used for inputting the infant voice into a pre-trained acoustic model and recognizing the infant voice.
The pre-trained acoustic model has been trained multiple times using a certain number of infant voice samples as input, and can output the label sequence of maximum probability corresponding to the infant voice. Therefore, the pre-trained acoustic model can be used to recognize the input infant voice to be recognized.
A voice signal is acquired through the microphone array, and the infant voice separated from the voice signal is input into a pre-trained acoustic model, so that the infant voice can be recognized. Because an infant is too young, it can generally only produce babbling sounds rather than language an adult can understand; through the infant voice recognition apparatus of the embodiment of the present invention, the infant's babbling voice is recognized as language an adult can understand, making it convenient for the adult to truly know the infant's condition and needs.
Referring to fig. 3, in another embodiment of the present invention, the separation module 2 includes:
and a separation unit 20 for separating the baby voice in the voice signal by using a FastICA algorithm.
The FastICA algorithm is based on fixed-point iteration, is suitable for any type of data, makes it feasible to apply ICA to high-dimensional data, and, thanks to its fixed-point optimization, converges quickly and stably. FastICA is essentially a neural-network-style method for minimizing the mutual information of the estimated components; it approximates negentropy using the maximum-entropy principle and optimizes it with a suitable nonlinear function g. Among neural algorithms it has many advantages: it is parallel, distributed, computationally simple, and has small memory requirements. In addition, the FastICA algorithm separates the infant voice within the voice signal into sequences.
In the embodiment of the invention, FastICA accurately separates the infant voice from the voice signal without leaving much interference, which improves the accuracy of the subsequent recognition steps.
The embodiment of the invention adopts a FastICA algorithm based on maximum negentropy to separate the infant voice from the voice signal. The specific process is as follows:
First, suppose there are N sound source signals s_1, s_2, s_3, …, s_N, A is the unknown mixing matrix, and there are m microphones with m > N. Let H be the signals received by the microphones; then H = As. Define the infant voice sequence to be separated as x, let x = U^T H, where U is a vector of parameters, and let the length of x be T.
(1) Zero-mean x, i.e., center x so that its mean is 0.
(2) Seek a linear transformation of x such that, after projection into the new subspace, x becomes the whitened vector z.
Whitening the zero-meaned data removes the correlation among the observed signals, simplifies the subsequent independent-component extraction, and gives the algorithm good convergence.
(3) Set the number of infant voice sequences x to be estimated to n, and set the iteration index p = 1.
(4) Select a random initialization vector U_p.
(5) Update the value of U_p: U_p = E{z g(U_p^T z)} − E{g′(U_p^T z)} U_p,
where E{z g(U_p^T z)} and E{g′(U_p^T z)} U_p denote mean operations, and g(y) = tanh(ay) is a nonlinear function with 1 ≤ a ≤ 2, where a generally takes the value 1.
(6) Orthogonalize U_p against the components already found: U_p = U_p − Σ_{j=1}^{p−1} (U_p^T U_j) U_j.
(7) After each iteration, normalize U_p: U_p = U_p / ||U_p||.
(8) If U_p has not converged, return to step (5) and repeat steps (5)–(7) until U_p converges.
(9) Set p = p + 1 and return to step (4), repeating steps (4)–(8) until p = n, at which point the U_p of all independent components have been computed.
(10) From x = U^T H, solve for all n infant voice sequences x, which serve as the input to the pre-trained acoustic model.
In addition, the separation algorithm employed in the present embodiment includes, but is not limited to, the FastICA algorithm, and other algorithms that achieve similar effects to the FastICA algorithm may also be used.
Referring to fig. 4, in another embodiment of the present invention, the infant speech recognition apparatus includes:
The judgment module 10, configured to acquire the intensity variation of the voice signal through the microphone array and judge whether the microphone array is oriented to face the position of the infant.
The adjusting module 11, configured to adjust the orientation of the microphone array to face the position of the infant if it is not.
Orienting the microphone array to directly face the infant makes the collected voice signal stronger and gives a better result. In addition, the number and geometry of the microphones in the array can be varied in many ways, for example four microphones in a square arrangement or six microphones in a circular arrangement.
Referring to fig. 5, in another embodiment of the present invention, the identification module 3 includes:
an input unit 30 for inputting the infant speech into the acoustic model.
An obtaining unit 31, configured to obtain at least one tag sequence corresponding to the infant speech output by the acoustic model.
And a comparing unit 32, configured to obtain a tag sequence corresponding to the maximum probability by comparing the correct mapping probabilities of the tag sequences.
And an output unit 33, configured to output the tag sequence corresponding to the maximum probability.
The acoustic model is a BLSTM (Bidirectional Long Short-Term Memory) network; the trained acoustic model is obtained after training the BLSTM using CTC (Connectionist Temporal Classification) as the training criterion.
An RNN (Recurrent Neural Network) is a neural network with fixed weights, external inputs, and internal states; its recurrent structure allows information to persist, and its behavior can be regarded as the dynamics of the internal states, with the weights and external inputs as parameters.
A BLSTM is a special recurrent neural network that can learn long-term dependencies. Through its input, output, and forget gates, a BLSTM controls the flow of information well and has a long-term memory capability; compared with a plain RNN, it alleviates the problems of vanishing and exploding gradients to a certain extent. Compared with a unidirectional LSTM (Long Short-Term Memory) network, it also takes reverse temporal information into account, i.e., the influence of future time steps on the current one, which improves the accuracy of speech recognition to a certain extent.
CTC (Connectionist Temporal Classification) is used to transform the network output of the BLSTM acoustic model into a conditional probability distribution over label sequences; CTC then completes the classification by selecting the most likely label sequence for a given input sequence. Training the acoustic model with CTC means obtaining the most representative model parameters from a large number of known utterances according to certain criteria. With CTC there is only one neural network model from speech features (input) to text strings (output), i.e., an "end-to-end" model.
The specific working process of the identification module 3 is as follows:
(1) Consider a BLSTM with m inputs, n outputs, and weight matrix W, viewed as a continuous map N_W: (R^m)^T → (R^n)^T. Let y = N_W(x) be the output sequence of the BLSTM, where x denotes the input infant voice sequence, and let y_k^t denote the activation of output unit k at the t-th time step (time t), i.e., the probability that the output at time step t is k. For example, when the output sequence is (a-ab-), y_a^3 represents the probability that the letter output at time step 3 (time 3) is a.
(2) Define a character set L′ = L ∪ {blank} and regard sequences over L′ as label sequences, where L is the set of labels and blank is a blank label. In CTC the output probability of each label resembles a spike, and blank serves to separate the labels from one another; for example, the labels may be the 26 English letters plus the blank label.
(3) Regard the elements of L′^T as paths, denoted π. Assuming the label probabilities output at each time step (moment) are independent of one another, p(π|x) denotes the probability that the output path is π when the input is x:
p(π|x) = ∏_{t=1}^{T} y_{π_t}^t
This formula can be understood as the product, over the time steps (moments), of the probabilities of the labels along the output path π.
For example, if the label sequence is day, the sequence output during training may be dd-aaa-yy-.
(4) Define a many-to-one mapping β: L′^T → L^{≤T} representing the transformation that maps an output path π to a label sequence l, where L^{≤T} is the set of possible label sequences.
The mapping function β merges consecutive repeated labels and removes all blank labels from the output path; for example, β(b-a-c) = bac.
In addition, many possible paths are converted into the same label sequence by the mapping β, and the total probability equals the sum of all those path probabilities.
(5) The probability that the output sequence is the label sequence l can therefore be expressed as the sum of the probabilities of all output paths π that map to l under β:
p(l|x) = ∑_{π ∈ β^{−1}(l)} p(π|x)
where p(l|x) denotes the probability that the output sequence is l when the input is x. A label sequence l can generally be obtained from several paths, i.e., l corresponds to several π; for example, the different paths β(aa-a-bb-), β(a-aab-), and β(a-aaa-b-) all yield the label sequence (aab), where - denotes the blank label mentioned above.
(6) Recognition, also called decoding, aims to output the label sequence of maximum probability for an input sequence x: given the distribution p(l|x) over output sequences, the one with the maximum probability is chosen as the output. This logic can be expressed as
l* = argmax_{l ∈ L^{≤T}} p(l|x)
The best-path decoding method or the prefix-search decoding method, both prior art and not described here, may be used to find the label sequence of highest probability. Finding the label sequence of highest probability completes the recognition of the input infant voice sequence x; that is, the input infant voice sequence is converted into language an adult can understand, for example a text sequence, making it convenient for adults to know the infant's wishes and needs.
Referring to fig. 6, an infant speech recognition apparatus according to an embodiment of the present invention includes:
and the sending module 4 is used for transmitting the identified baby voice to the corresponding terminal equipment of the guardian.
The infant's babbling voice is recognized as language an adult can understand and then transmitted to the terminal device of the corresponding guardian, so that the guardian can better judge the infant's wishes and needs according to the recognized infant voice and take corresponding care of the infant.
Referring to fig. 7, in another embodiment of the present invention, the infant speech recognition apparatus includes:
and the training module 21 is configured to train the acoustic model by using BLSTM as the acoustic model and CTC as a training criterion, so as to obtain the acoustic model trained in advance.
The corresponding relation between the input voice sequence and the output label sequence can be directly optimized by combining BLSTM with CTC training criterion, so that the recognition of the baby voice is realized.
The specific working process of the training module 21 is as follows:
(1) Define α_t(s) as the forward variable and β_t(s) as the backward variable. Taking the negative logarithm of the probability p(z|x) as the objective function and training it by maximum likelihood gives
O^{ML}(S, N_W) = −∑_{(x,z)∈S} ln p(z|x)
where S denotes the training set, each sample in S is a sequence pair (x, z), z is the target sequence, x is the input sequence, p(z|x) denotes the probability of outputting the sequence z given the input sequence x, and the target sequence is shorter than the input sequence. Training the objective function by maximum likelihood simultaneously maximizes the log probability of correct classification over the whole training set, i.e., minimizes the objective function.
(2) Compute the gradient of the objective function, i.e., take its partial derivative with respect to the outputs of the BLSTM:
∂O^{ML}({(x,z)}, N_W) / ∂y_k^t = −∂ ln p(z|x) / ∂y_k^t
where y_k^t denotes the probability that the output of the output sequence at the t-th time step (moment) is k, and N_W is the bidirectional long short-term memory network given by the continuous map with weight matrix W.
(3) Then, deriving with the forward-backward algorithm yields
∂p(z|x) / ∂y_k^t = (1 / (y_k^t)^2) ∑_{s ∈ lab(l,k)} α_t(s) β_t(s)
where the target sequence z corresponds to the label sequence l, and lab(l,k) denotes the set of positions in l at which label k occurs.
(4) Having obtained ∂p(z|x)/∂y_k^t, perform gradient back-propagation to obtain the gradients with respect to the weights, update the parameter set w of the acoustic model's weight matrix, and finish training to obtain the pre-trained acoustic model.
The gradient back-propagation in step (4) uses the back-propagation algorithm, whose steps are as follows:
1) Compute the output value of each neuron in the forward direction.
2) Compute the error term of each neuron in the backward direction. The backward propagation of BLSTM error terms proceeds in two directions: one is back-propagation through time, i.e., computing the error term at each earlier moment starting from the current time t; the other is propagating the error term up to the previous layer.
3) Compute the gradient of each weight from the corresponding error terms.
In addition, when training the acoustic model with CTC as the training criterion, a certain number of infant voice sequence samples are used as the input of the acoustic model, so as to determine the parameter set w of the weight matrix corresponding to the infant voice sequence samples; the trained acoustic model thereby acquires associative memory and prediction capability, and the pre-trained acoustic model can effectively recognize input infant voice sequences.
Compared with the prior art, the infant voice recognition method provided by the invention has the following beneficial effects: the invention acquires a voice signal through a microphone array, separates the infant voice to be recognized from the voice signal, and inputs the infant voice into a pre-trained acoustic model for recognition. Recognizing the infant's babbling voice as language an adult can understand helps the adult better judge the infant's intentions and needs, know the infant's condition accurately and in time, and give corresponding care.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. An infant speech recognition method, comprising:
receiving a voice signal acquired by a microphone array;
separating the baby voice in the voice signal;
and inputting the infant voice into a pre-trained acoustic model, and recognizing the infant voice.
2. The infant speech recognition method of claim 1, wherein the step of separating the infant speech from the speech signal comprises:
and separating the baby voice in the voice signal by adopting a FastICA algorithm.
3. The infant speech recognition method of claim 1, wherein the step of receiving the speech signal obtained by the microphone array is preceded by:
acquiring the intensity variation of the voice signal through the microphone array, and judging whether the microphone array is oriented to face the position of the infant;
if not, adjusting the orientation of the microphone array to face the position of the infant.
4. The infant speech recognition method according to claim 1, wherein the step of inputting the infant speech into a pre-trained acoustic model to recognize the infant speech comprises:
inputting the infant speech into the acoustic model;
obtaining at least one tag sequence corresponding to the infant voice output by the acoustic model;
obtaining a label sequence corresponding to the maximum probability by comparing the correct mapping probability of each label sequence;
and outputting the label sequence corresponding to the maximum probability.
5. The infant speech recognition method according to claim 1, wherein after the step of inputting the infant speech into a pre-trained acoustic model and recognizing the infant speech, the method comprises:
and transmitting the identified baby voice to the corresponding terminal equipment of the guardian.
6. An infant speech recognition apparatus, comprising:
the receiving module is used for receiving the voice signals acquired by the microphone array;
the separation module is used for separating the baby voice in the voice signal;
and the recognition module is used for inputting the infant voice into a pre-trained acoustic model and recognizing the infant voice.
7. The infant speech recognition device of claim 6, wherein the separation module comprises:
and the separation unit is used for separating the baby voice in the voice signal by adopting a FastICA algorithm.
8. The infant speech recognition device of claim 6, comprising:
the judgment module is used for acquiring the intensity variation of the voice signal through the microphone array and judging whether the microphone array is oriented to face the position of the infant;
and the adjusting module is used for adjusting the orientation of the microphone array to face the position of the infant if it is not.
9. The infant speech recognition device of claim 6, wherein the recognition module comprises:
an input unit for inputting the infant speech into the acoustic model;
the acquisition unit is used for acquiring at least one label sequence corresponding to the infant voice output by the acoustic model;
the comparison unit is used for obtaining the label sequence corresponding to the maximum probability by comparing the correct mapping probability of each label sequence;
and the output unit is used for outputting the label sequence corresponding to the maximum probability.
10. The infant speech recognition device of claim 6, comprising:
and the sending module is used for transmitting the identified baby voice to the corresponding terminal equipment of the guardian.
CN201810490697.3A 2018-05-21 2018-05-21 Baby voice recognition method and device Active CN108806723B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810490697.3A CN108806723B (en) 2018-05-21 2018-05-21 Baby voice recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810490697.3A CN108806723B (en) 2018-05-21 2018-05-21 Baby voice recognition method and device

Publications (2)

Publication Number Publication Date
CN108806723A true CN108806723A (en) 2018-11-13
CN108806723B CN108806723B (en) 2021-08-17

Family

ID=64091330

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810490697.3A Active CN108806723B (en) 2018-05-21 2018-05-21 Baby voice recognition method and device

Country Status (1)

Country Link
CN (1) CN108806723B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110232114A (en) * 2019-05-06 2019-09-13 平安科技(深圳)有限公司 Sentence intension recognizing method, device and computer readable storage medium
CN111428596A (en) * 2020-03-16 2020-07-17 重庆邮电大学 Grinding sound signal detection method based on three sound pickups
CN111540344A (en) * 2020-04-21 2020-08-14 北京字节跳动网络技术有限公司 Acoustic network model training method and device and electronic equipment
CN112466056A (en) * 2020-12-01 2021-03-09 上海旷日网络科技有限公司 Self-service cabinet pickup system and method based on voice recognition
CN113763988A (en) * 2020-06-01 2021-12-07 中车株洲电力机车研究所有限公司 Time synchronization method and system for locomotive cab monitoring information and LKJ monitoring information

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104347066A (en) * 2013-08-09 2015-02-11 盛乐信息技术(上海)有限公司 Deep neural network-based baby cry identification method and system
CN104882140A (en) * 2015-02-05 2015-09-02 宇龙计算机通信科技(深圳)有限公司 Voice recognition method and system based on blind signal extraction algorithm
CN105512692A (en) * 2015-11-30 2016-04-20 华南理工大学 BLSTM-based online handwritten mathematical expression symbol recognition method
CN106653059A (en) * 2016-11-17 2017-05-10 沈晓明 Automatic identification method and system for infant crying cause
US20170154630A1 (en) * 2015-11-27 2017-06-01 Fu Tai Hua Industry (Shenzhen) Co., Ltd. Electronic device and method for interpreting baby language
CN107105095A (en) * 2017-04-25 2017-08-29 努比亚技术有限公司 A kind of sound processing method and mobile terminal
CN107886953A (en) * 2017-11-27 2018-04-06 四川长虹电器股份有限公司 A kind of vagitus translation system based on expression and speech recognition

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104347066A (en) * 2013-08-09 2015-02-11 盛乐信息技术(上海)有限公司 Deep neural network-based baby cry identification method and system
CN104882140A (en) * 2015-02-05 2015-09-02 宇龙计算机通信科技(深圳)有限公司 Voice recognition method and system based on blind signal extraction algorithm
US20170154630A1 (en) * 2015-11-27 2017-06-01 Fu Tai Hua Industry (Shenzhen) Co., Ltd. Electronic device and method for interpreting baby language
CN105512692A (en) * 2015-11-30 2016-04-20 华南理工大学 BLSTM-based online handwritten mathematical expression symbol recognition method
CN106653059A (en) * 2016-11-17 2017-05-10 沈晓明 Automatic identification method and system for infant crying cause
CN107105095A (en) * 2017-04-25 2017-08-29 努比亚技术有限公司 A kind of sound processing method and mobile terminal
CN107886953A (en) * 2017-11-27 2018-04-06 四川长虹电器股份有限公司 A kind of vagitus translation system based on expression and speech recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王庆楠 (Wang Qingnan): "Tibetan speech recognition based on end-to-end technology" (基于端到端技术的藏语语音识别), Pattern Recognition and Artificial Intelligence (《模式识别与人工智能》) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110232114A (en) * 2019-05-06 2019-09-13 平安科技(深圳)有限公司 Sentence intension recognizing method, device and computer readable storage medium
CN111428596A (en) * 2020-03-16 2020-07-17 重庆邮电大学 Grinding sound signal detection method based on three sound pickups
CN111540344A (en) * 2020-04-21 2020-08-14 北京字节跳动网络技术有限公司 Acoustic network model training method and device and electronic equipment
CN111540344B (en) * 2020-04-21 2022-01-21 北京字节跳动网络技术有限公司 Acoustic network model training method and device and electronic equipment
CN113763988A (en) * 2020-06-01 2021-12-07 中车株洲电力机车研究所有限公司 Time synchronization method and system for locomotive cab monitoring information and LKJ monitoring information
CN113763988B (en) * 2020-06-01 2024-05-28 中车株洲电力机车研究所有限公司 Time synchronization method and system for locomotive cab monitoring information and LKJ monitoring information
CN112466056A (en) * 2020-12-01 2021-03-09 上海旷日网络科技有限公司 Self-service cabinet pickup system and method based on voice recognition
CN112466056B (en) * 2020-12-01 2022-04-05 上海旷日网络科技有限公司 Self-service cabinet pickup system and method based on voice recognition

Also Published As

Publication number Publication date
CN108806723B (en) 2021-08-17

Similar Documents

Publication Publication Date Title
CN108806723B (en) Baby voice recognition method and device
Coucke et al. Efficient keyword spotting using dilated convolutions and gating
US10643602B2 (en) Adversarial teacher-student learning for unsupervised domain adaptation
US10937438B2 (en) Neural network generative modeling to transform speech utterances and augment training data
Xu et al. Convolutional gated recurrent neural network incorporating spatial features for audio tagging
Fraccaro et al. Sequential neural models with stochastic layers
Deng et al. Ensemble deep learning for speech recognition
US10417329B2 (en) Dialogue act estimation with learning model
CN105139864B (en) Audio recognition method and device
Rajamani et al. A novel attention-based gated recurrent unit and its efficacy in speech emotion recognition
US10580432B2 (en) Speech recognition using connectionist temporal classification
Bae et al. End-to-End Speech Command Recognition with Capsule Network.
CN111179911A (en) Target voice extraction method, device, equipment, medium and joint training method
CN112183107B (en) Audio processing method and device
CN110895935B (en) Speech recognition method, system, equipment and medium
CN111461173A (en) Attention mechanism-based multi-speaker clustering system and method
Saeidi et al. Uncertain LDA: Including observation uncertainties in discriminative transforms
KR20210052036A (en) Apparatus with convolutional neural network for obtaining multiple intent and method therof
JP2019215500A (en) Voice conversion learning device, voice conversion device, method, and program
WO2019094562A1 (en) Neural network based blind source separation
CN113808581B (en) Chinese voice recognition method based on acoustic and language model training and joint optimization
Huang et al. Deep graph random process for relational-thinking-based speech recognition
Maheswari et al. A hybrid model of neural network approach for speaker independent word recognition
CN112180318B (en) Sound source direction of arrival estimation model training and sound source direction of arrival estimation method
Ansari et al. Toward growing modular deep neural networks for continuous speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210927

Address after: 518000 201, No.26, yifenghua Innovation Industrial Park, Xinshi community, Dalang street, Longhua District, Shenzhen City, Guangdong Province

Patentee after: Shenzhen waterward Information Co.,Ltd.

Address before: 518000, block B, huayuancheng digital building, 1079 Nanhai Avenue, Shekou, Nanshan District, Shenzhen City, Guangdong Province

Patentee before: SHENZHEN WATER WORLD Co.,Ltd.