WO2017177629A1 - 远讲语音识别方法及装置 - Google Patents

远讲语音识别方法及装置 (Far-talking speech recognition method and device)

Info

Publication number
WO2017177629A1
Authority
WO
WIPO (PCT)
Prior art keywords
far
speech
input
talk
talking
Prior art date
Application number
PCT/CN2016/101053
Other languages
English (en)
French (fr)
Inventor
那兴宇
Original Assignee
乐视控股(北京)有限公司
乐视致新电子科技(天津)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 乐视控股(北京)有限公司, 乐视致新电子科技(天津)有限公司 filed Critical 乐视控股(北京)有限公司
Publication of WO2017177629A1 publication Critical patent/WO2017177629A1/zh

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/065: Adaptation
    • G10L15/08: Speech classification or search

Definitions

  • the present invention relates to the field of speech recognition technology, and in particular, to a remote speech recognition method and apparatus.
  • In recent years, speech recognition technology has made remarkable progress and is increasingly moving from the laboratory into the market and into people's lives. The application of speech-recognition dictation machines in certain fields was rated by the US press as one of the ten major events in computer development in 1997, and within the next ten years speech recognition technology is expected to enter industry, home appliances, communications, automotive electronics, medical care, home services, consumer electronics and other fields.
  • the areas covered by speech recognition technology include: signal processing, pattern recognition, probability theory and information theory, vocal mechanism and auditory mechanism, artificial intelligence, and so on.
  • Communicating with machines by voice, so that the machine understands the purpose of people's speech, can greatly improve the quality of life for those of us living in a mechanized era.
  • The method of remotely controlling the TV by installing an APP on a smartphone is cumbersome, and for the elderly and children who cannot operate a smartphone it brings no obvious advantage. As for remotely controlling the TV through a remote control with a built-in sound pickup device, everyday experience shows that many TV users put the remote control down wherever convenient; in families with children, a child may hide it as a prank, so the remote control often cannot be found. For elderly people with limited mobility or poor memory, controlling the TV through a remote control is even less convenient.
  • If, without a remote control, the sound pickup device is instead embedded in the TV to collect the user's voice commands, the sound signal is reverberated by reflections from the walls of the room, and the surrounding environment inevitably contains noise, so the recognition rate of long-distance speech is low and the user experience is poor.
  • Embodiments of the present invention provide a remote speech recognition method and device, which are used to solve the defect in the prior art that far-talking speech recognition is susceptible to environmental influences and has a low recognition rate, thereby improving the accuracy of far-talking speech recognition.
  • Embodiments of the present invention also provide a remote speech recognition electronic device, a non-transitory computer storage medium, and a computer program product.
  • An embodiment of the invention provides a remote speech recognition method, including:
  • acquiring a test far-talking speech frame of the user's far-talking voice input, and calling a pre-trained near-talking speech model to recognize the test far-talking speech frame and obtain a preliminary recognition result;
  • calculating, according to the preliminary recognition result, an environment feature mapping matrix between the far-talking voice input and near-talking voice input in the current environment;
  • when the user's far-talking voice input is detected, mapping the far-talking voice input to a corresponding approximate near-talking voice input according to the environment feature mapping matrix;
  • calling the pre-trained near-talking speech model to recognize the approximate near-talking voice input to obtain a far-talking speech recognition result.
  • the embodiment of the invention provides a remote speech recognition device, comprising:
  • a signal acquisition module configured to acquire a test far-talking speech frame of the user's far-talking voice input, and to call a pre-trained near-talking speech model to recognize the test far-talking speech frame and obtain a preliminary recognition result;
  • a training module configured to calculate, according to the preliminary recognition result, an environment feature mapping matrix between the far-talking voice input and near-talking voice input in the current environment;
  • a mapping module configured to, when the user's far-talking voice input is detected, map the far-talking voice input to a corresponding approximate near-talking voice input according to the environment feature mapping matrix; and
  • a recognition module configured to call the pre-trained near-talking speech model to recognize the approximate near-talking voice input to obtain a far-talking speech recognition result.
  • An embodiment of the present invention further provides a non-transitory computer storage medium storing computer-executable instructions for performing any of the above remote speech recognition methods of the present application.
  • An embodiment of the present invention provides a remote speech recognition electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform any of the above remote speech recognition methods of the present application.
  • An embodiment of the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any of the above remote speech recognition methods of the present application.
  • In the remote speech recognition method and device provided by the embodiments of the present invention, the user's far-talking input is recognized with the pre-trained near-talking speech model to obtain a preliminary recognition result, and the environment mapping relationship between far-talking input and near-talking input in the current environment is then calculated from that result. This addresses the low recognition accuracy caused in the prior art by sound waves reflecting in the environment and by environmental noise during far-talking speech recognition, and achieves a high recognition rate for far-talking speech.
  • FIG. 1 is a technical flowchart of Embodiment 1 of the present application;
  • FIG. 2-1 and FIG. 2-2 are technical flowcharts of Embodiment 2 of the present application;
  • FIG. 3 is a schematic structural diagram of a device according to Embodiment 3 of the present application;
  • FIG. 4 is a schematic structural diagram of a remote speech recognition electronic device according to Embodiment 4 of the present application.
  • FIG. 1 is a technical flowchart of Embodiment 1 of the present application.
  • a remote speech recognition method of the present application can be implemented by the following steps:
  • Step S110: Acquire a test far-talking speech frame of the user's far-talking voice input, and call a pre-trained near-talking speech model to recognize the test far-talking speech frame and obtain a preliminary recognition result;
  • Step S120: Calculate, according to the preliminary recognition result, an environment feature mapping matrix between the far-talking voice input and near-talking voice input in the current environment;
  • Step S130: When the user's far-talking voice input is detected, map the far-talking voice input to a corresponding approximate near-talking voice input according to the environment feature mapping matrix;
  • Step S140: Call the pre-trained near-talking speech model to recognize the approximate near-talking voice input to obtain a far-talking speech recognition result.
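  • The four steps above can be pictured as a small end-to-end pipeline. The sketch below is illustrative only and is not the patented implementation: `recognize`, `estimate_transform` and `apply_transform` are hypothetical callables standing in for decoding with the pre-trained near-talking model and for the MLLR-style mapping described later in this document.

```python
def far_talking_recognition(test_far_frames, far_frames,
                            recognize, estimate_transform, apply_transform):
    """Illustrative sketch of steps S110-S140 (not the patented implementation).

    The three callables are hypothetical stand-ins:
      recognize(frames)                -> decode frames with the pre-trained
                                          near-talking speech model
      estimate_transform(frames, text) -> MLLR-style environment matrix W
      apply_transform(W, frames)       -> map far-talking frames to
                                          approximate near-talking frames
    """
    # S110: recognize the test far-talking frames with the near-talking model
    preliminary_result = recognize(test_far_frames)
    # S120: estimate the environment feature mapping matrix for this room
    W = estimate_transform(test_far_frames, preliminary_result)
    # S130: map a newly detected far-talking input to an approximate near-talking one
    approx_near_frames = apply_transform(W, far_frames)
    # S140: recognize the approximate near-talking input with the same model
    return recognize(approx_near_frames), W
```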
  • The remote speech recognition method of the embodiment of the present application corresponds to a remote speech recognition device that can be built into a TV, an in-vehicle device or another apparatus that does not rely on a remote controller, and is used to recognize long-distance voice input signals.
  • the following part will be exemplified by a television, but it should be understood that the application of the technical solution of the embodiment of the present application is not limited thereto.
  • Specifically, in step S110, the user sends a voice command directly to the television, for example "I want to watch The Legend of Mi Yue" (芈月传). Because "I want to watch" consists of common words, it can still be recognized fairly well even under heavy reverberation and noise, whereas the rarer title "芈月传" may be difficult to recognize.
  • However, there is a certain distance between the user and the television, so the sound wave may be attenuated to some extent during transmission. In addition, the TV is constrained by its environment: in the living room of the user's home, for example, walls and various pieces of furniture strongly reflect sound waves, so the sound reaching the TV carries considerable reverberation and noise.
  • Because the speech signal is a quasi-stationary signal, it is usually divided into frames during processing, each about 20 ms to 30 ms long; within such an interval the speech signal can be regarded as a stationary signal. Since only stationary information can be processed, the signal must first be divided into frames.
  • In the embodiment of the present application, the voice signal may be divided into frames by a speech framing function such as enframe.
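  • A minimal framing routine along these lines is sketched below; the 25 ms / 10 ms frame parameters are illustrative defaults for 16 kHz audio, not values mandated by the application.

```python
import numpy as np

def enframe(signal, frame_len=400, hop=160, window=np.hamming):
    """Split a 1-D speech signal into overlapping, windowed frames.

    With a 16 kHz sampling rate, frame_len=400 and hop=160 give 25 ms frames
    with a 10 ms shift, inside the 20-30 ms range over which speech is
    treated as quasi-stationary.
    """
    signal = np.asarray(signal, dtype=np.float64)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.empty((n_frames, frame_len))
    win = window(frame_len)
    for i in range(n_frames):
        start = i * hop
        frames[i] = signal[start:start + frame_len] * win
    return frames
```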
  • The near-talking speech model is trained in advance on a certain number of near-talking speech signals, i.e. close-range speech input signals, whose distortion is small and which contain little noise; a speech model trained on near-talking speech samples is therefore almost free of environmental factors.
  • If, instead, a far-talking speech model were trained on samples of far-talking speech input, there would be the problem that each user's speaking environment is different and interferes with the speech signal differently; collecting far-talking samples in a single input environment would make it hard to improve the recognition rate when the trained far-talking model faces different speaking environments.
  • Therefore, in the embodiment of the present application, a speech model without noise or attenuation interference, i.e. the near-talking speech model, is trained in advance, and its model parameters are then corrected using the speech signals produced by each user in his or her own speaking environment, yielding a speech model adapted to the user's speaking environment. Because this model incorporates the factors of the user's speaking environment, it can greatly improve the accuracy of far-talking speech recognition.
  • Specifically, the near-talking speech model may be trained with a Gaussian mixture model method or a hidden Markov model method. In the embodiment of the present invention, the training of the near-talking speech model may use HMM, GMM-HMM, DNN-HMM, and the like.
  • HMM stands for Hidden Markov Model. An HMM is a kind of Markov chain whose states cannot be observed directly but only through a sequence of observation vectors; each observation vector is expressed in the various states through certain probability density distributions, and each observation vector is generated by a state sequence with a corresponding probability density distribution. A hidden Markov model is therefore a doubly stochastic process: a hidden Markov chain with a certain number of states, together with a set of observable random functions. Since the 1980s, HMMs have been applied to speech recognition with great success.
  • The HMM speech model λ(π, A, B) is determined by three parameters: the initial state probabilities (π), the state transition probabilities (A) and the observation sequence probabilities (B). π reveals the topology of the HMM, A describes how the speech signal varies over time, and B gives the statistical properties of the observation sequence.
  • GMM denotes a Gaussian mixture model and DNN a deep neural network model. Both GMM-HMM and DNN-HMM are variants based on the HMM. Since these three models are mature prior art and are not the focus of protection of the embodiments of the present invention, they are not described further here.
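  • For readers who want to prototype such a near-talking acoustic model, the open-source hmmlearn package provides GMM-HMMs. The snippet below is a rough, illustrative training sketch, not the model actually used in the application; it assumes `train_features` holds MFCC matrices extracted from near-talking recordings of one recognition unit, and the hyper-parameters are arbitrary.

```python
import numpy as np
from hmmlearn import hmm   # pip install hmmlearn

def train_near_talking_unit_model(train_features, n_states=5, n_mix=4):
    """Train one GMM-HMM on close-range ('near-talking') feature sequences.

    train_features: list of (n_frames, n_dims) MFCC arrays for one
    recognition unit (e.g. a word), taken from near-talking recordings.
    The hyper-parameters are illustrative, not values from the application.
    """
    X = np.vstack(train_features)
    lengths = [f.shape[0] for f in train_features]
    model = hmm.GMMHMM(n_components=n_states, n_mix=n_mix,
                       covariance_type="diag", n_iter=20)
    model.fit(X, lengths)   # Baum-Welch re-estimation of pi, A and B
    return model

# Recognition then scores an utterance against each unit's model and keeps
# the unit with the highest log-likelihood, e.g.:
#   best_unit = max(models, key=lambda u: models[u].score(features))
```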
  • Based on the near-talking speech model trained as above, the embodiment of the present application obtains a preliminary recognition result from the user's test far-talking voice input in a specific environment. The test far-talking voice input may be solicited by the device prompting the user when the speech recognition device is used for the first time, or may be acquired when the user issues a power-on instruction. The purpose of acquiring the user's test far-talking voice input is to learn from it the environment of the user who issued the voice input, and to take this environmental factor into account during far-talking speech recognition, improving the environmental adaptivity of far-talking speech recognition.
  • step S120 includes: calculating an environment feature mapping matrix of the far-talking voice input and the near-talk voice input in the current environment according to the initial recognition result;
  • the environment feature mapping matrix of the far-talking voice input and the near-talk voice input is calculated by using a maximum likelihood linear regression method according to the initial recognition result of the user's far-talking voice input in a specific environment.
  • The MLLR (maximum likelihood linear regression) method obtains a set of linear transforms that maximize the likelihood function of the adaptation data. For example, in an HMM system the parameters transformed by MLLR are generally the means of the state-level GMMs; in a stochastic segment model the parameter to be transformed is the mean vector of the domain model.
  • The transformation process can be expressed simply as:
  • û = Au + b = Wξ
  • where u denotes the D-dimensional mean vector of the domain model before adaptation, û is the adapted mean vector, ξ = [1, u′]′ is the extended vector of u, and W is the desired D × (D+1) linear transformation matrix.
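  • The formula above transforms model means; the application then uses the resulting matrix to map the far-talking input itself, which is closer in spirit to a feature-space (constrained) MLLR transform. The sketch below shows one simple way such a W could be estimated and applied; it is a least-squares stand-in under strong simplifying assumptions (unit covariances, equal state occupancy), not the patented computation, and `far_stats` is a hypothetical set of statistics aligned via the preliminary recognition result.

```python
import numpy as np

def estimate_environment_matrix(far_stats, mu_near):
    """Least-squares sketch of the environment feature mapping matrix W,
    shape (D, D+1), such that near_hat = W @ [1, x']' for a far-talking
    feature vector x.

    far_stats: (N, D) far-talking frame averages aligned, via the preliminary
               recognition result, to the model states they were decoded as
               (hypothetical stand-in for full MLLR statistics)
    mu_near:   (N, D) corresponding mean vectors of the near-talking model

    Real MLLR/fMLLR would also weight the statistics by state posteriors and
    inverse covariances; with those simplifications removed, maximum
    likelihood reduces to this least-squares fit.
    """
    n = far_stats.shape[0]
    xi = np.hstack([np.ones((n, 1)), far_stats])    # extended vectors [1, x']'
    sol, *_ = np.linalg.lstsq(xi, mu_near, rcond=None)
    return sol.T                                    # W, shape (D, D+1)

def apply_transform(W, frames):
    """Map far-talking feature frames to approximate near-talking frames."""
    xi = np.hstack([np.ones((frames.shape[0], 1)), frames])
    return xi @ W.T
```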
  • Specifically, in step S130, the user's far-talking voice input is mapped to the corresponding approximate near-talking input according to the environment feature mapping matrix trained in the previous step.
  • Specifically, in step S140, the approximate near-talking input obtained in the previous step is recognized using the near-talking speech model.
  • After step S140, an optional step S150 may further be included:
  • Step S150 Iteratively update the environment mapping matrix.
  • the trained environment feature mapping matrix is further iteratively trained to obtain a more stable and more adaptive mapping relationship of the user language environment, thereby further ensuring the correctness of the speech recognition.
  • the specific algorithm for iterative training is as follows:
  • S151: When the user's far-talking voice input is detected, call the environment feature mapping matrix to map the far-talking voice input to the corresponding approximate near-talking voice input;
  • S152: Call the pre-trained near-talking speech model to recognize the approximate near-talking voice input and obtain a preliminary recognition result;
  • S153: According to the preliminary recognition result, calculate the environment mapping relationship between the far-talking voice input and near-talking voice input by maximum likelihood linear regression, and update the environment feature mapping matrix according to that mapping relationship.
  • Each time the user's far-talking voice input is detected, the environment feature mapping matrix is updated in this way, until the matrix tends to be stable.
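  • Step S150 might be sketched as the loop below. The callables are hypothetical stand-ins for decoding with the pre-trained near-talking model and for re-running the MLLR-style estimation, and the stopping rule (matrix change below a tolerance) is only one illustrative reading of "until the matrix tends to be stable".

```python
import numpy as np

def iteratively_update_matrix(W, far_frames, recognize, reestimate_transform,
                              apply_transform, tol=1e-3, max_iter=10):
    """Sketch of optional step S150 (sub-steps S151-S153), assumptions as above."""
    for _ in range(max_iter):
        approx_near = apply_transform(W, far_frames)            # S151
        preliminary = recognize(approx_near)                    # S152
        W_new = reestimate_transform(far_frames, preliminary)   # S153
        if np.linalg.norm(W_new - W) < tol:                     # matrix has stabilised
            return W_new
        W = W_new
    return W
```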
  • In this embodiment, the user's far-talking input is recognized with the pre-trained near-talking speech model to obtain a preliminary recognition result, and the environment mapping relationship between far-talking input and near-talking input in the current environment is then calculated from that result. This addresses the low recognition accuracy caused in the prior art by sound waves reflecting in the environment and by environmental noise during far-talking speech recognition, and achieves a high recognition rate for far-talking speech.
  • FIG. 2-1 and FIG. 2-2 are technical flowcharts of Embodiment 2 of the present application. With reference to FIG. 2-1, the remote speech recognition method of the embodiment of the present application further has the following optional implementation steps:
  • Step S210 extracting an acoustic feature of the user, and determining an acoustic group to which the user belongs;
  • Step S220 Calling an attribute feature mapping matrix of the acoustic group that is pre-trained to map the far-talk voice input to a corresponding approximate near-talk voice input;
  • Step S230: Call the pre-trained near-talking speech model to recognize the approximate near-talking voice input to obtain a far-talking speech recognition result.
  • Specifically, in step S210, after the acoustic features of the user are extracted, they are matched against pre-classified acoustic groups to determine the acoustic group to which the user belongs, so that a different attribute feature mapping matrix can be invoked for each acoustic group, achieving higher-accuracy speech recognition.
  • In step S220, the acoustic group to which the user belongs is obtained from the previous step, and the environment feature mapping matrix of the corresponding group is invoked according to the grouping result. It should be noted that this mapping matrix is specific to a given acoustic group: it combines the acoustic environment in which the user speaks with the acoustic characteristics of the user's speech, which further improves the environment adaptivity and speaker adaptivity of the pre-trained near-talking speech model.
  • Specifically, as shown in FIG. 2-2, the feature mapping matrix is trained by the following steps:
  • Step S231 Acquire a test far-talking speech frame of the user's far-talking voice input, and call the pre-trained near-talking voice model to identify the test far-talking speech frame and obtain the initial recognition result;
  • Step S232 Calculate, according to the initial recognition result, the environment feature mapping matrix of the far-talking voice input and the near-talk voice input in the current environment;
  • Step S233 When detecting the far-talking voice input of the user, extracting the user acoustic feature, and dividing the user into different acoustic groups according to the acoustic feature;
  • Step S234: In each of the acoustic groups, call the environment feature mapping matrix to map the far-talking voice input to the corresponding approximate near-talking voice input;
  • Step S235: Call the pre-trained near-talking speech model to recognize the approximate near-talking voice input and obtain a preliminary recognition result;
  • Step S236: According to the preliminary recognition result, calculate the mapping relationship between the far-talking voice input and near-talking voice input by maximum likelihood linear regression, and update the environment feature mapping matrix according to that relationship to obtain the attribute feature mapping matrix of each acoustic group, which is then kept updated.
  • Steps S231 and S232 are the same as steps S110 and S120 of Embodiment 1 and are not described again here.
  • Specifically, in step S233, the user is divided into different acoustic groups according to acoustic features. This may be done by computing the MFCC (Mel-frequency cepstral coefficients) of the speech feature parameters, or by extracting the fundamental frequency of the voice input.
  • The Mel frequency scale is based on the auditory characteristics of the human ear and is nonlinearly related to frequency in Hz; Mel-frequency cepstral coefficients (MFCC) are spectral features computed by exploiting this relationship. The overall MFCC computation is as follows: first the signal is pre-processed (pre-emphasis, frame blocking, windowing); for a speech signal sampled at, say, 8 kHz, which is considered stationary over 10-30 ms, the frame length may be set to 80-240 samples with a frame shift of half the frame length. Next, an FFT (fast Fourier transform) is applied to each frame to obtain its amplitude spectrum; a Mel filter bank is then applied to the amplitude spectrum; finally, the logarithm of all filter outputs is taken, and a discrete cosine transform (DCT) yields the MFCCs.
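  • A compact, illustrative implementation of that pipeline (operating on frames produced by a routine such as the enframe sketch earlier) is shown below; the filter-bank construction is simplified and the parameter values are common defaults rather than values taken from the application.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(frames, sample_rate=16000, n_fft=512, n_filt=26, n_ceps=13):
    """Compute MFCCs from windowed frames: FFT magnitude -> Mel filter bank
    -> log -> DCT (illustrative sketch)."""
    mag = np.abs(np.fft.rfft(frames, n_fft))                 # amplitude spectrum
    # triangular Mel filter bank
    mel_max = 2595 * np.log10(1 + sample_rate / 2 / 700)
    mel_pts = np.linspace(0, mel_max, n_filt + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sample_rate).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for m in range(1, n_filt + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    feat = np.log(mag @ fbank.T + 1e-10)                     # log filter-bank energies
    return dct(feat, type=2, axis=1, norm="ortho")[:, :n_ceps]
```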
  • During voiced sounds, airflow through the glottis causes the vocal cords to vibrate in a relaxation-oscillation manner, producing a quasi-periodic pulsed airflow. This airflow excites the vocal tract and produces voiced speech, which carries most of the energy in the voice; the vibration frequency of the vocal cords is called the fundamental frequency.
  • A time-domain algorithm and/or a frequency-domain algorithm may be used to extract the fundamental frequency of the user's speech input. The time-domain algorithms include the autocorrelation function method and the average magnitude difference function method; the frequency-domain algorithms include cepstral analysis and the discrete wavelet transform.
  • The autocorrelation function method exploits the quasi-periodicity of the voiced signal and detects the fundamental frequency by comparing the similarity between the original signal and shifted copies of it. The principle is that the autocorrelation function of a voiced signal produces a peak whenever the time delay equals an integer multiple of the pitch period, whereas the autocorrelation function of an unvoiced signal has no significant peak. The fundamental frequency of the speech can therefore be estimated by detecting the peak positions of the autocorrelation function of the speech signal.
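  • An illustrative autocorrelation-based estimator for a single voiced frame might look like this; the 60-500 Hz search range is an assumption covering the male, female and child ranges quoted later in the text.

```python
import numpy as np

def f0_autocorrelation(frame, sample_rate=16000, f0_min=60, f0_max=500):
    """Estimate F0 of one voiced frame: the strongest autocorrelation peak
    away from zero lag sits at the pitch period (illustrative sketch)."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(ac) - 1)
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sample_rate / lag
```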
  • The average magnitude difference function method detects the fundamental frequency on the basis that voiced speech is quasi-periodic: for a fully periodic signal, sample values separated by a multiple of the period are equal, so their difference is zero. If the pitch period is P, the average magnitude difference function of a voiced segment shows valleys; the distance between two valleys is the pitch period, and its reciprocal is the fundamental frequency.
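  • A corresponding AMDF sketch, under the same illustrative assumptions:

```python
import numpy as np

def f0_amdf(frame, sample_rate=16000, f0_min=60, f0_max=500):
    """Average magnitude difference function: a voiced frame shows a valley
    at the pitch period; its reciprocal is the fundamental frequency."""
    frame = frame - np.mean(frame)
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    amdf = np.array([np.mean(np.abs(frame[lag:] - frame[:-lag]))
                     for lag in range(lag_min, lag_max)])
    lag = lag_min + np.argmin(amdf)          # valley -> pitch period in samples
    return sample_rate / lag
```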
  • Cepstral analysis is a method of spectral analysis in which the output is the inverse Fourier transform of the logarithm of the amplitude spectrum of the Fourier transform. The method rests on the observation that the amplitude spectrum of a signal with a fundamental frequency contains equally spaced peaks representing the harmonic structure of the signal; taking the logarithm of the amplitude spectrum reduces these peaks to a usable range. The logarithmic amplitude spectrum is itself a periodic signal in the frequency domain, and the period of this frequency-domain signal (a frequency value) can be regarded as the fundamental frequency of the original signal, so taking its inverse Fourier transform produces a peak at the pitch period of the original signal.
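  • A cepstrum-based sketch of the same estimate, again with illustrative parameters:

```python
import numpy as np

def f0_cepstrum(frame, sample_rate=16000, f0_min=60, f0_max=500):
    """Cepstrum method: the inverse FFT of the log magnitude spectrum of a
    voiced frame shows a peak at the pitch period (quefrency)."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    cepstrum = np.fft.irfft(np.log(spectrum + 1e-10))
    q_min = int(sample_rate / f0_max)
    q_max = min(int(sample_rate / f0_min), len(cepstrum) // 2)
    peak = q_min + np.argmax(cepstrum[q_min:q_max])
    return sample_rate / peak
```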
  • Discrete wavelet transform is a powerful tool that allows signals to be decomposed into high frequency components and low frequency components on a continuous scale. It is a local transformation of time and frequency that effectively extracts information from the signal. Compared with the fast Fourier transform, the main advantage of the discrete wavelet transform is that it can achieve good time resolution in the high frequency part and good frequency resolution in the low frequency part.
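  • Using the PyWavelets package, a wavelet decomposition of a frame into low- and high-frequency bands can be sketched as follows (illustrative only; the wavelet choice and depth are assumptions, not values from the application):

```python
import numpy as np
import pywt   # pip install PyWavelets

def dwt_bands(frame, wavelet="db4", level=4):
    """Decompose one speech frame into an approximation (low-frequency) part
    and detail (high-frequency) parts with the discrete wavelet transform;
    pitch trackers typically keep refining the low-frequency approximation,
    where the fundamental lives."""
    coeffs = pywt.wavedec(np.asarray(frame, dtype=float), wavelet, level=level)
    approx, details = coeffs[0], coeffs[1:]
    return approx, details
```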
  • The fundamental frequency depends on the size, thickness and tension of the vocal cords and on the air-pressure difference across the glottis. As the vocal cords are stretched longer, tighter and thinner, the glottis becomes more slender and the cords may not close completely, and the corresponding fundamental frequency is higher. The fundamental frequency thus varies with the speaker's gender, age and condition: it is generally lower for older men and higher for women and children. Tests show that, in general, the male fundamental frequency lies roughly between 80 Hz and 200 Hz, the female range roughly between 200 Hz and 350 Hz, and the children's range roughly between 350 Hz and 500 Hz.
  • When the user's far-talking voice input is detected, its fundamental frequency is extracted and compared against these threshold ranges to determine the characteristics of the user who produced the input, and the user is classified according to this feature. When different users provide voice input, different acoustic groups, and the environment-adaptive speech model corresponding to each acoustic group, can thus be obtained from their acoustic characteristics.
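  • A trivial grouping rule based on the quoted ranges might look like the following; the exact cut-offs are illustrative, not values specified by the application.

```python
def acoustic_group_from_f0(f0_hz):
    """Assign a speaker to an acoustic group from the fundamental frequency,
    using the rough ranges quoted in the text (male ~80-200 Hz,
    female ~200-350 Hz, children ~350-500 Hz); boundaries are illustrative."""
    if f0_hz < 200:
        return "male"
    if f0_hz < 350:
        return "female"
    return "child"

# Each group then keeps its own attribute feature mapping matrix, e.g.:
#   group = acoustic_group_from_f0(median_f0_of_utterance)
#   W_group = attribute_matrices[group]
```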
  • Specifically, in step S234, within each acoustic group, the environment feature mapping matrix obtained in step S232 is invoked for the user's far-talking voice input to first obtain an approximate near-talking voice input.
  • the initial recognition result is a recognition result that eliminates the influence of the environment of the user, but does not eliminate the influence of each user's speaking feature on the speech recognition result.
  • Specifically, in step S236, the environment mapping matrix trained in step S232 is further updated to obtain an attribute mapping matrix that incorporates the user's acoustic attributes. It should be noted that this attribute feature mapping matrix also needs to be iteratively trained, so as to obtain a more stable user-attribute mapping relationship better adapted to the user's speaking environment and to further ensure the correctness of far-talking speech recognition for a specific user.
  • The iterative training again uses maximum likelihood linear regression. Each time the user's far-talking voice input is detected, the user's acoustic features are extracted and the user is assigned to the corresponding acoustic group; the attribute feature mapping matrix is called to map the far-talking voice input to the corresponding approximate near-talking voice input; the pre-trained near-talking speech model is called to recognize the approximate near-talking voice input and obtain a preliminary recognition result; and, from that preliminary result, maximum likelihood linear regression is used to recalculate the attribute feature mapping matrix between the far-talking and near-talking voice inputs, thereby updating the attribute feature mapping matrix.
  • In this embodiment, the acoustic features of the user's far-talking voice input are acquired, and environment-adaptive and user-adaptive training is performed on that input according to the acoustic features. This yields a personalized mapping relationship that better fits the user's pronunciation characteristics and speaking environment, greatly improving the efficiency of far-talking speech recognition and the user experience.
  • FIG. 3 is a schematic structural diagram of a device according to Embodiment 3 of the present application.
  • an embodiment of the present application provides a remote speech recognition device, which includes the following modules:
  • the signal acquisition module 310 is configured to acquire a test far-talking speech frame of the user's far-talking voice input, and to call a pre-trained near-talking speech model to recognize the test far-talking speech frame and obtain a preliminary recognition result;
  • the training module 320 is configured to calculate, according to the preliminary recognition result, an environment feature mapping matrix between the far-talking voice input and near-talking voice input in the current environment;
  • the mapping module 330 is configured to, when the user's far-talking voice input is detected, map the far-talking voice input to a corresponding approximate near-talking voice input according to the environment feature mapping matrix;
  • the recognition module 340 is configured to call the pre-trained near-talking speech model to recognize the approximate near-talking voice input to obtain a far-talking speech recognition result.
  • The training module 320 is specifically configured to: calculate, from the far-talking speech frame and the preliminary recognition result, the environment feature mapping matrix between the far-talking voice input and the corresponding near-talking voice input by maximum likelihood linear regression, and to iteratively update the environment mapping matrix.
  • The training module 320 is further configured to: when the user's far-talking voice input is detected, call the environment feature mapping matrix to map the far-talking voice input to the corresponding approximate near-talking voice input; call the pre-trained near-talking speech model to recognize the approximate near-talking voice input and obtain a preliminary recognition result; and, according to the preliminary recognition result, calculate the environment mapping relationship between the far-talking and near-talking voice inputs by maximum likelihood linear regression and update the environment feature mapping matrix according to that relationship.
  • The mapping module 330 is further configured to: extract the user's acoustic features and determine the acoustic group to which the user belongs; and call the pre-trained attribute feature mapping matrix of that acoustic group to map the far-talking voice input to the corresponding approximate near-talking voice input;
  • the recognition module 340 is further configured to call the pre-trained near-talking speech model to recognize the approximate near-talking voice input to obtain a far-talking speech recognition result.
  • The training module 320 is further configured to: when the user's far-talking voice input is detected, extract the user's acoustic features and divide the user into different acoustic groups according to those features; in each acoustic group, call the environment feature mapping matrix to map the far-talking voice input to the corresponding approximate near-talking voice input; call the pre-trained near-talking speech model to recognize the approximate near-talking voice input and obtain a preliminary recognition result; and, according to the preliminary recognition result, calculate the mapping relationship between the far-talking and near-talking voice inputs by maximum likelihood linear regression and update the environment feature mapping matrix according to that relationship to obtain the attribute feature mapping matrix of each acoustic group, which is then kept updated.
  • The training module 320 is further specifically configured to: when the user's far-talking voice input is detected, extract the user's acoustic features and assign the user to the corresponding acoustic group according to those features; for that far-talking voice input, call the attribute feature mapping matrix to map it to the corresponding approximate near-talking voice input; call the pre-trained near-talking speech model to recognize the approximate near-talking voice input and obtain a preliminary recognition result; and, according to the preliminary recognition result, calculate the attribute feature mapping matrix between the far-talking and near-talking voice inputs by maximum likelihood linear regression, thereby updating the attribute feature mapping matrix.
  • the apparatus shown in FIG. 3 can perform the method of the embodiment shown in FIG. 1 and FIG. 2, and the implementation principle and technical effects are referred to the embodiment shown in FIG. 1 and FIG. 2, and details are not described herein again.
  • FIG. 4 is a schematic structural diagram of a remote speech recognition electronic device according to Embodiment 3 of the present application.
  • the device in this embodiment may be a part of a remote speech recognition server or a remote speech recognition server, and the device may include:
  • one or more processors 401 and a memory 402; one processor 401 is taken as an example in FIG. 4.
  • the far-end speech recognition electronic device may further include: an input device 403 and an output device 404.
  • the processor 401, the memory 402, the input device 403, and the output device 404 can be connected by a bus or other means.
  • the memory 402 is used as a non-transitory computer readable storage medium, and can be used for storing non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions corresponding to the remote speech recognition method in the embodiment of the present application.
  • The processor 401 executes the various functional applications and data processing of the server, i.e. implements the remote speech recognition method of the above method embodiments, by running the non-transitory software programs, instructions and modules stored in the memory 402.
  • the memory 402 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application required for at least one function; the storage data area may store data created according to the use of the voice recognition device, and the like.
  • The memory 402 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device.
  • the memory 402 can optionally include a memory remotely located relative to the processor 401 that can be connected to the processing device for speech recognition via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the input device 403 can receive the input digital or character information and generate a key signal input related to user settings and function control of the remote speech recognition device.
  • Output device 404 can include a display device such as a display screen.
  • the one or more modules are stored in the memory 402, and when executed by the one or more processors 401, perform the remote speech recognition method in any of the above method embodiments.
  • the electronic device of the embodiment of the present application exists in various forms, including but not limited to:
  • Mobile communication devices: these devices are characterized by mobile communication functions and are mainly aimed at providing voice and data communication. Such terminals include smartphones (e.g. the iPhone), multimedia phones, feature phones and low-end phones.
  • Ultra-mobile personal computer devices: these belong to the category of personal computers, have computing and processing functions, and generally also offer mobile Internet access. Such terminals include PDAs, MIDs and UMPC devices such as the iPad.
  • Portable entertainment devices: these devices can display and play multimedia content. They include audio and video players (e.g. the iPod), handheld game consoles, e-book readers, smart toys and portable in-car navigation devices.
  • Servers: devices that provide computing services. A server consists of a processor, hard disk, memory, system bus and so on. Its architecture is similar to that of a general-purpose computer, but because it must provide highly reliable services it has higher requirements in terms of processing power, stability, reliability, security, scalability and manageability.
  • Other electronic apparatuses having data interaction functions.
  • the embodiment of the present application further provides a computer readable storage medium, where the program for executing the method of the foregoing embodiment is stored.
  • the apparatus of the embodiments of the present application is applied to a smart television.
  • the user buys the TV and places it in his living room.
  • the built-in voice recognition module of the TV can accurately recognize the user's near-talk voice input.
  • the user starts the TV and issues the control password from a long distance.
  • the voice recognition module acquires the user's control password and performs frame processing on it. According to the obtained speech frame, the pre-trained near-talk speech recognition model is called to identify the password issued by the user, and a rough recognition result is obtained.
  • Based on this rough recognition result, maximum likelihood linear regression is used to recalculate the environment mapping relationship between the control commands the user issues from far away and near-talking voice input. Through this mapping relationship, the TV's built-in near-talking speech model becomes a speech model adapted to the environment of the user's living room.
  • the user can control the smart TV by issuing voice commands from a distance at home, for example, program search, application or service startup, power on/off, and the like.
  • In another application scenario, the household may include elderly people, children, men and women, and a generic environment-adaptive model may not fully meet every user's needs. Therefore, after collecting the user's far-talking voice input several times, the speech recognition device determines, from the users' acoustic characteristics, whether the collected inputs share the same acoustic features. When two or more kinds are found, the voice inputs are classified accordingly, for example into children and adults. For the child category, speech frames of children's far-talking voice input are used repeatedly: based on the previously trained environment mapping relationship, the children's far-talking voice input is first mapped to an environment-adaptive approximate near-talking voice input, and the generic environment mapping relationship is then updated by maximum likelihood linear regression to obtain the child-type feature mapping relationship. For the adult category, speech frames of adults' far-talking voice input are likewise used separately: the adults' far-talking voice input is mapped to an environment-adaptive approximate near-talking input, and the generic environment mapping relationship is updated by maximum likelihood linear regression to obtain the adult-type feature mapping relationship.
  • When a voice input from the user is detected again, the device first determines from the voice features whether the user is a child, an adult or an elderly person. If the user is judged to be a child, the child-type feature mapping relationship is invoked to adapt the child's voice input to both the environment and the user attributes. At the same time, the child-type feature mapping relationship continues to be iteratively trained with the child's voice input so as to reach a more stable result.
  • The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, and those of ordinary skill in the art can understand and implement it without creative effort.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

A far-talking speech recognition method and device. The method includes: acquiring a test far-talking speech frame of the user's far-talking voice input, and calling a pre-trained near-talking speech model to recognize the test far-talking speech frame and obtain a preliminary recognition result (110); calculating, according to the preliminary recognition result, an environment feature mapping matrix between the far-talking voice input and near-talking voice input in the current environment (120); when the user's far-talking voice input is detected, mapping the far-talking voice input to a corresponding approximate near-talking voice input according to the environment feature mapping matrix (130); and calling the pre-trained near-talking speech model to recognize the approximate near-talking voice input to obtain a far-talking speech recognition result (140). High-accuracy far-talking speech recognition is thereby achieved.

Description

远讲语音识别方法及装置
交叉引用
本申请引用于2016年04月11日递交的名称为“远讲语音识别方法及装置”的第201610219407.2号中国专利申请,其通过引用被全部并入本申请。
技术领域
本发明涉及语音识别技术领域,尤其涉及一种远讲语音识别方法及装置。
背景技术
近些年来,语音识别技术取得了显著进步,并且越来越多的从实验室走向市场,走进人们的生活。语音识别听写机在一些领域的应用被美国新闻界评为1997年计算机发展十件大事之一。未来10年内,语音识别技术将进入工业、家电、通信、汽车电子、医疗、家庭服务、消费电子产品等各个领域。
语音识别技术所涉及的领域包括:信号处理、模式识别、概率论和信息论、发声机理和听觉机理、人工智能等等。与机器进行语音交流,让机器明白人们的说话目的,这对于生活在机械化时代的我们而言,能够大幅提升生活质量。
目前,市场上出现了许多智能的能够通过语音进行控制的电视。一种方式是在智能手机上安装APP,然后将指令发送到特定的遥控器,遥控器再将指令转换成红外遥控信号,这种方式可以实现对普通电视的遥控。还有一种方式是在遥控器内置一个收音的设备,它可以收录用户发出的语音命令,然后将用户的语音命令发送至电视进行语义解析,然后通过语义解析的结果控制电视机的各种服务。
然而,对于在智能手机上安装APP对电视进行遥控的方法,其步骤繁琐,尤其对于不会操控智能手机的老人和孩子而言,这种方式并没有带来明显的 优势;对于在遥控器内置一个收音设备对电视进行遥控的方法,就生活体验而言,很多电视用户都是遥控器随手放置,对于有儿童的家庭更是如此,小孩子也许恶作剧藏起遥控器导致遥控器,从而导致经常找不到遥控器去了哪里。对于行动不便和健忘的老人而言,通过遥控器控制电视更加显得不方便。
若是不使用遥控器,将收音设备内嵌在电视内部采集用户发出的语音命令,则由于声波信号在室内遇到墙壁易发生反射造成混响,且周围环境难免会有噪声,导致远距离讲话语音识别的正确率低,用户体验不佳。
综上,一种新的语音识别的方法及装置亟待提出。
发明内容
本发明实施例提供一种远讲语音识别方法及装置,用以解决现有技术中远讲语音识别易受环境影响而识别率低缺陷,提高了远讲语音识别的正确率。
本发明实施例还提供一种远讲语音识别电子设备、一种非暂态计算机存储介质以及一种计算机程序产品。
本发明实施例提供一种远讲语音识别方法,包括:
获取用户远讲语音输入的测试远讲语音帧,调用预先训练的近讲语音模型识别所述测试远讲语音帧并得到初识结果;
根据所述初识结果计算当前环境下所述远讲语音输入与近讲语音输入的环境特征映射矩阵;
检测到用户的远讲语音输入时,根据所述环境特征映射矩阵将所述远讲语音输入映射至对应的近似近讲语音输入;
调用预先训练的所述近讲语音模型识别所述近似近讲语音输入得到远讲语音识别结果。
本发明实施例提供一种远讲语音识别装置,包括:
信号获取模块,用于获取用户远讲语音输入的测试远讲语音帧,调用预先训练的近讲语音模型识别所述测试远讲语音帧并得到初识结果;
训练模块,用于根据所述初识结果计算当前环境下所述远讲语音输入与近讲语音输入的环境特征映射矩阵;
映射模块,用于检测到用户的远讲语音输入时,根据所述环境特征映射矩阵将所述远讲语音输入映射至对应的近似近讲语音输入;
识别模块,用于调用预先训练的所述近讲语音模型识别所述近似近讲语音输入得到远讲语音识别结果。
本发明实施例还提供一种非暂态计算机存储介质,存储有计算机可执行指令,所述计算机可执行指令用于执行本申请上述任一项远讲语音识别方法方法。
本发明实施例提供一种远讲语音识别电子设备,包括:至少一个处理器;以及,
与所述至少一个处理器通信连接的存储器;其中,
所述存储器存储有可被所述一个处理器执行的指令,所述指令被被所述至少一个处理器执行,以使所述至少一个处理器能够执行本申请上述任一项远讲语音识别方法方法。
本发明实施例还提供了一种计算机程序产品,所述计算机程序产品包括存储在非暂态计算机可读存储介质上的计算程序,所述计算机程序包括程序指令,当所述程序指令被计算机执行时,使所述计算机执行本申请上述任一项远讲语音识别方法方法。
本发明实施例提供的远讲语音识别方法及装置,根据预先训练得到的近讲语音模型对用户的远讲输入进行识别得到初步的识别结果,再根据初步的识别结果计算得到当前环境下远讲输入与近讲输入的环境映射关系,改变了现有技术中进行远讲语音识别时,声波在环境中进行反射以及环境噪声引起的语音识别正确率低的问题,实现远讲语音的高识别率。
附图说明
为了更清楚地说明本发明实施例或现有技术中的技术方案,下面将对实 施例或现有技术描述中所需要使用的附图作一简单地介绍,显而易见地,下面描述中的附图是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为本申请实施例一的技术流程图;
图2-1为本申请实施例二的技术流程图;
图2-2为本申请实施例二的另一技术流程图;
图3为本申请实施例三的装置结构示意图;
图4为本申请实施例四的远讲语音识别电子设备的结构示意图。
具体实施方式
为使本发明实施例的目的、技术方案和优点更加清楚,下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。
图1是本申请实施例一的技术流程图,结合图1,本申请一种远讲语音识别方法,可由如下步骤实现:
步骤S110:获取用户远讲语音输入的测试远讲语音帧,调用预先训练的近讲语音模型识别所述测试远讲语音帧并得到初识结果;
步骤S120:根据所述初识结果计算当前环境下所述远讲语音输入与近讲语音输入的环境特征映射矩阵;
步骤S130:检测到用户的远讲语音输入时,根据所述环境特征映射矩阵将所述远讲语音输入映射至对应的近似近讲语音输入;
步骤S140:调用预先训练的所述近讲语音模型识别所述近似近讲语音输入得到远讲语音识别结果。
本申请实施例的远讲语音识别方法,其对应的远讲语音识别设备可以内 置于不依托于遥控器的电视、车载设备等,用于实现远距离语音输入信号的识别。以下部分,将以电视进行举例,但是应当理解,本申请实施例的技术方案的应用并不仅限于此。
具体的,在步骤S110中,用户直接对着电视进行语音命令的发送,例如:我想看芈月传。但是,用户和电视之间存在一定的距离,声波在传输的过程中可能会有一定程度上的衰减;另外,受限于电视所处的环境,例如,用户家里的客厅,有墙壁以及各种家具对声波有较强的反射,从而造成到达电视的声音混响和噪声比较大。因此,对于用户的语音指令“我想看芈月传”,“我想看”三个字在汉语习惯中出现较多,因此即使在混响和噪声大的情境下语音识别率也比较高,但是“芈月传”三个字较为生僻,可能存在识别困难。
因为语音信号是准稳态信号,在处理时常把信号分帧,每帧长度约20ms-30ms,在这一区间内把语音信号看作为稳态信号。只有稳态的信息才能进行信号处理,所以要先分帧。本申请实施例中,可采用语音分帧的函数将语音信号分帧,例enframe等。
本申请实施例中,所述近讲语音模型是预先通过采集一定数量的近讲语音信号进行训练的,所述近讲语音信号,即近距离语音输入信号,其信号失真度小且包含的噪声数据较小,采用近讲语音样本训练出的语音模型几乎不参杂环境因素。然而,若是采集远讲语音输入的样本训练远讲语音模型,将面临这样一个问题,即,每个用户说话的环境不同,对语音信号的干扰是不同的,若是采用同样一个语音输入环境采集远讲语音样本会导致训练出的远讲语音模型在面临不同的说话环境时,语音识别率难以提高。因此,本申请实施例中,预先训练一个不带噪声且不带衰减干扰的语音模型,即近讲语音模型,再通过每个用户在不同说话环境中发出的语音信号来修正所述近讲语音模型的模型参数,从而得到一个能够自适应用户说话环境的语音模型。这个语音模型包含了用户说话环境的因素,因此,能极大提高远讲语音识别的正确率。
具体的,所述近讲语音模型的训练可以采用混合高斯模型法或者隐马尔 科夫模型法。本发明实施例中,近讲语音模型的训练可以采用HMM,GMM-HMM,DNN-HMM等。
HMM(Hidden Markov Model),即隐马尔可夫模型。HMM是马尔可夫链的一种,它的状态不能直接观察到,但能通过观测向量序列观察到,每个观测向量都是通过某些概率密度分布表现为各种状态,每一个观测向量是由一个具有相应概率密度分布的状态序列产生。所以,隐马尔可夫模型是一个双重随机过程----具有一定状态数的隐马尔可夫链和显示随机函数集。自20世纪80年代以来,HMM被应用于语音识别,取得重大成功。HMM语音模型λ(π,A,B)由起始状态概率(π)、状态转移概率(A)和观测序列概率(B)三个参数决定。π揭示了HMM的拓扑结构,A描述了语音信号随时间的变化情况,B给出了观测序列的统计特性。
GMM为混合高斯模型,DNN为深度神经网络模型。GMM-HMM和DNN-HMM都是基于HMM的变形,由于这三种模型都是非常成熟的现有技术且并非本发明实施例保护重点,此处将不再赘述。
基于上述已经训练好的近讲语音模型,本申请实施例根据用户在特定环境下的测试远讲语音输入,得到一个初识结果。其中,所述测试远讲语音输入可以是用户在第一次使用语音识别设备时,由设备向用户提示输入的,也可以是用户发起开机指令时获取的。获取用户的测试远讲语音输入,其目的在于,从测试远讲语音输入中,获取发起语音输入的用户所在的环境,并将这一环境因素考虑到远讲语音识别的过程中,提高远讲语音识别的环境自适应性。
具体的,步骤S120包括:根据所述初识结果计算当前环境下所述远讲语音输入与近讲语音输入的环境特征映射矩阵;
本申请实施例根据用户在特定环境下的远讲语音输入的初识结果,采用最大似然线性回归法计算所述远讲语音输入与近讲语音输入的环境特征映射矩阵。
最大似然线性回归法MLLR(Mxium Likelihood Linear Regression)的方法是求得一组线性变换,通过这组变换,使自适应数据的似然函数最大化。例 如,在HMM系统中,MLLR方法待变换的参数一般是状态层的GMM的均值;在随机段模型中待变换的参数是域模型的均值向量。变换过程可以简单地表示如下:
u^=Au+b=Wξ
其中,u表示域模型自适应前的维数为D的均值向量,u^为自适应后的均值向量,ξ是u的扩展向量[1,u’]’,W即为所求的D×(D+1)线性变换矩阵。
由于最大似然线性回归法是成熟的现有技术,本步骤中不再赘述。
具体的,在步骤S130中,根据上一步骤中训练得到的环境特征映射矩阵,将用户的远讲语音输入映射至相应的近似近讲输入。
具体的,在步骤S140中,根据上一步骤中获取的近似近讲语音输入,采用近讲语音模型进行识别。
本申请实施例中,在步骤S140之后,进一步还包括可选步骤S150:
步骤S150:对所述环境映射矩阵进行迭代更新。
本步骤中,进一步对训练出的所述环境特征映射矩阵进行迭代训练,从而得到更加稳定、更加适应用户语言环境的映射关系,从而进一步保证远讲语音识别的正确性。迭代训练的具体算法如下所述:
S151:检测到用户的远讲语音输入时,调用所述环境特征映射矩阵将所述远讲语音输入映射至对应的所述近似近讲语音输入;
S152:调用预先训练的所述近讲语音模型识别所述近似近讲语音输入得到初识结果;
S153:根据所述初识结果,采用最大似然线性回归法计算所述远讲语音输入与近讲语音输入之间的环境映射关系,并根据所述映射关系更新所述环境特征映射矩阵。
每一次检测到用户的远讲语音输入后,都进行一次环境特征映射矩阵更新,直至所述环境特征映射矩阵趋于稳定。
本实施例中,根据预先训练得到的近讲语音模型对用户的远讲输入进行 识别得到初步的识别结果,再根据初步的识别结果计算得到当前环境下远讲输入与近讲输入的环境映射关系,改变了现有技术中进行远讲语音识别时,声波在环境中进行反射以及环境噪声引起的语音识别正确率低的问题,实现远讲语音的高识别率。
图2-1以及图2-2是本申请实施例二的技术流程图,结合图2-1,本申请实施例一种远讲语音识别方法还有如下可选的实施步骤:
步骤S210:提取所述用户的声学特征,判断所述用户所属的声学分组;
步骤S220:调用预先训练的所述声学分组的属性特征映射矩阵将所述远讲语音输入映射至对应的近似近讲语音输入;
步骤S230:调用预先训练的所述近讲语音模型识别所述近似近讲语音输入得到远讲语音识别结果。
具体的,在步骤S210中,提取到用户的声学特征后,与预先分类好的声学分组进行匹配,判断用户所属的声学分组,从而,从而根据不同的声学分组,调用不同的所述属性特征映射矩阵,实现更高准确率的语音识别。
在步骤S220中,获取上一步骤中用户所属的声学分组,并根据所属声学分组的结果调用相应分组中的环境特征映射矩阵。需要说明的是,所述环境特征映射矩阵,是某种声学分组特有的,是结合用户说话的语音环境和用户说话的声学特征得到的映射关系,进一步提高了预先训练的所述近讲语音模型的环境自适应和用户特征的自适应性。
具体,如图2-2所示,所述特征映射矩阵的训练方法由如下步骤实现:
步骤S231:获取用户远讲语音输入的测试远讲语音帧,调用预先训练的近讲语音模型识别所述测试远讲语音帧并得到初识结果;
步骤S232:根据所述初识结果计算,当前环境下所述远讲语音输入与近讲语音输入的环境特征映射矩阵;
步骤S233:检测到用户的远讲语音输入时,提取用户声学特征,根据所述声学特征将所述用户划分至不同声学分组;
步骤S234:在每个所述声学分组中,调用所述环境特征映射矩阵将所述 远讲语音输入映射至对应的所述近似近讲语音输入;
步骤S235:调用预先训练的所述近讲语音模型识别所述近似近讲语音输入得到初识结果;
步骤S236:根据所述初识结果,采用最大似然线性回归法计算所述远讲语音输入与近讲语音输入的的映射关系,根据所述映射关系更新所述环境特征映射矩阵得到每个所述声学分组的所述属性特征映射矩阵,并对所述属性特征映射矩阵进行更新。
具体的,步骤S231和步骤S232如实施例一的步骤S110和步骤S120,此处不再赘述。
具体的,在步骤S233中,根据所述声学特征将所述用户划分至不同声学分组,可以通过计算语音特征参数的MFCC(即Mel频率倒谱系数的缩写),也可以采用提取语音输入的基频实现。
Mel频率是基于人耳听觉特性提出来的,它与Hz频率成非线性对应关系。Mel频率倒谱系数(MFCC)则是利用它们之间的这种关系,计算得到的Hz频谱特征。MFCC计算总体流程如下首先是信号的预处理,包括预加重(Preemphasis),分帧(Frame Blocking),加窗(Windowing)。假设语音信号的采样频率fs=8KHz.由于语音信号在10-30ms认为是稳定的,则可设置帧长为80~240点。帧移可以设置为帧长的1/2;其次,对每一帧进行FFT(快速傅里叶)变换,求频谱,进而求得幅度谱;再者,对幅度谱加Mel滤波器组;最后,对所有的滤波器输出做对数运算(Logarlithm),再进一步做离散余弦变换DCT可得MFCC。
在浊音的发音过程中,气流通过声门使得声带产生张弛振荡式的振动,产生一股准周期脉冲气流,这一气流激励声道就产生浊音,它携带了语音中的大部分能量,其中声带的振动频率就称为基频。
可以采用基于时域的算法和/或基于空域的算法提取用户语音输入的基频,其中,所述基于时域的算法包括自相关函数算法和平均幅度差函数算法,所述基于空域的算法包括倒普分析法和离散小波变换法。
自相关函数法是利用了浊音信号的准周期性,通过对比原始信号和它的位移后信号之间的类似性来进行基频的检测,其原理是浊音信号的自相关函数在时延等于基音周期整数倍的地方产生一个峰值,而清音信号的自相关函数无明显的峰值。因此通过检测语音信号的自相关函数的峰值位置,就可以估计语音的基频。
平均幅度差函数法检测基频的依据为:语音的浊音具有准周期性,完全周期信号在相距为周期的倍数的幅值点上的幅值是相等的,从而差值为零。假设基音周期为P,则在浊音段,则平均幅度差函数将出现谷底,则两个谷底之间的距离即为基音周期,其倒数则为基频。
倒谱分析是谱分析的一种方法,输出是傅里叶变换的幅度谱取对数后做傅里叶逆变换的结果。该方法所依据的理论是,一个具有基频的信号的傅立叶变换的幅度谱有一些等距离分布的峰值,代表信号中的谐波结构,当对幅度谱取对数之后,这些峰值被削弱到一个可用的范围。幅度谱取对数后得到的结果是在频域的一个周期信号,而这个频域信号的周期(是频率值)可以认为就是原始信号的基频,所以对这个信号做傅里叶逆变换就可以在原始信号的基音周期处得到一个峰值。
离散小波变换是一个强大的工具,它允许在连续的尺度上把信号分解为高频成分和低频成分,它是时间和频率的局部变换,能有效地从信号中提取信息。与快速傅里叶变换相比,离散小波变换的主要好处在于,在高频部分它可以取得好的时间分辨率,在低频部分可以取得好的频率分辨率。
基频取决于声带的大小、厚薄、松弛程度以及声门上下之间的气压差的效应等。当声带被拉得越长、越紧、越薄,声门的形状就变得越细长,而且这时声带在闭合时也未必是完全的闭合,相应的基频就越高。基频随着发音人的性别,年龄及具体情况而定,总体来说,老年男性偏低,女性和儿童偏高。经测试,一般地,男性的基频范围大概在80Hz到200Hz之间,女性的基频范围大概在200-350HZ之间,而儿童的基频范围大概在350-500Hz之间。
当检测到用户的远讲语音输入时,提取其基频,并判断其所述的阈值范围,即可判断输入语音的来源的用户特征,并根据这一特征将用户进行分类。 当有不同的用户进行语音输入时,便可根据其声学特征得到不同声学分组以及每个声学分组对应的所述环境自适应语音模型。
具体的,在步骤S234中,在每个声学分组中,针对用户的远讲语音输入,调用步骤S232中得到的所述环境特征映射矩阵先得到的一个近似近讲语音输入。
具体的,在步骤S235中,所述初识结果是摒除用户所处环境影响的识别结果,但是并没有消除每个用户说话特征对语音识别结果的影响。
具体的,在步骤S236中,将步骤S232中训练得到的所述环境映射矩阵进行进一步的更新,得到包含用户声学属性的属性映射矩阵。
需要说明的是,本步骤中,还需进一步对训练出的所述属性特征映射矩阵进行迭代训练,从而得到更加稳定、更加适应用户语言环境的用户属性映射关系,从而进一步保证特定用户远讲语音识别的正确性。
迭代训练的具体算法同样采用最大似然线性回归法,每一次检测到用户的远讲语音输入时,提取所述用户的声学特征并根据所述声学特征将所述用户划分至所属的声学分组;根据所述远讲语音输入,调用所述属性特征映射矩阵将所述远讲语音输入映射至对应的所述近似近讲语音输入;调用预先训练的所述近讲语音模型识别所述近似近讲语音输入得到初识结果;根据所述初识结果,采用最大似然线性回归法计算所述远讲语音输入与近讲语音输入的的属性特征映射矩阵,从而实现所述属性特征映射矩阵的更新。
本实施例中,根据用户输入的远讲语音输入,获取其声学特征,并根据所述声学特征对用户输入的远讲语音进行环境自适应和用户自适应的训练,得到了更加贴合用户发音特征以及语音环境的个性化映射关系,极大提高了远讲语音识别的效率,提升了用户体验。
图3是本申请实施例三的装置结构示意图,结合图3,本申请实施例一种一种远讲语音识别装置,包括如下的模块:
信号获取模块310,用于获取用户远讲语音输入的测试远讲语音帧,调用预先训练的近讲语音模型识别所述测试远讲语音帧并得到初识结果;
训练模块320,用于根据所述初识结果计算当前环境下所述远讲语音输入与近讲语音输入的环境特征映射矩阵;
映射模块330,用于检测到用户的远讲语音输入时,根据所述环境特征映射矩阵将所述远讲语音输入映射至对应的近似近讲语音输入;
识别模块340,用于调用预先训练的所述近讲语音模型识别所述近似近讲语音输入得到远讲语音识别结果。
其中,所述训练模块320,具体用于:根据所述远讲语音帧与所述初识结果,采用最大似然线性回归法计算所述远讲语音输入与对应的所述近讲语音输入之间的环境特征映射矩阵并对所述环境映射矩阵进行迭代更新。
其中,所述训练模块320,具体还用于:检测到用户的远讲语音输入时调用所述环境特征映射矩阵将所述远讲语音输入映射至对应的所述近似近讲语音输入;调用预先训练的所述近讲语音模型识别所述近似近讲语音输入得到初识结果;根据所述初识结果,采用最大似然线性回归法计算所述远讲语音输入与近讲语音输入之间的环境映射关系,并根据所述映射关系更新所述环境特征映射矩阵。
其中,所述映射模块330还用于:提取所述用户的声学特征,判断所述用户所属的声学分组;调用预先训练的所述声学分组的属性特征映射矩阵将所述远讲语音输入映射至对应的近似近讲语音输入;
所述识别模块340,还用于调用预先训练的所述近讲语音模型识别所述近似近讲语音输入得到远讲语音识别结果。
其中,所述训练模块320,还用于:检测到用户的远讲语音输入时,提取用户声学特征,根据所述声学特征将所述用户划分至不同声学分组;
在每个所述声学分组中,调用所述环境特征映射矩阵将所述远讲语音输入映射至对应的所述近似近讲语音输入;调用预先训练的所述近讲语音模型识别所述近似近讲语音输入得到初识结果;根据所述初识结果,采用最大似然线性回归法计算所述远讲语音输入与近讲语音输入的的映射关系,根据所述映射关系更新所述环境特征映射矩阵得到每个所述声学分组的所述属性 特征映射矩阵,并对所述属性特征映射矩阵进行更新。
其中,所述训练模块330,具体还用于:检测到用户的远讲语音输入时,提取所述用户的声学特征并根据所述声学特征将所述用户划分至所属的声学分组;根据所述远讲语音输入,调用所述属性特征映射矩阵将所述远讲语音输入映射至对应的所述近似近讲语音输入;调用预先训练的所述近讲语音模型识别所述近似近讲语音输入得到初识结果;根据所述初识结果,采用最大似然线性回归法计算所述远讲语音输入与近讲语音输入的的属性特征映射矩阵,从而实现所述属性特征映射矩阵的更新。
图3所示装置可以执行图1以及图2所示实施例的方法,实现原理和技术效果参考图1以及图2所示实施例,不再赘述。
图4为本申请实施例三的远讲语音识别电子设备的结构示意图,本实施例所述设备可以为远讲语音识别服务器或远讲语音识别服务器中的一部分,该设备可以包括:
一个或多个处理器401以及存储器402,图4中以一个处理器401为例。
远讲语音识别电子设备还可以包括:输入装置403和输出装置404。
处理器401、存储器402、输入装置403和输出装置404可以通过总线或者其他方式连接。
存储器402作为一种非暂态计算机可读存储介质,可用于存储非暂态软件程序、非暂态计算机可执行程序以及模块,如本申请实施例中的远讲语音识别方法对应的程序指令/模块。处理器401通过运行存储在存储器602中的非暂态软件程序、指令以及模块,从而执行服务器的各种功能应用以及数据处理,即实现上述方法实施例远讲语音识别方法。
存储器402可以包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需要的应用程序;存储数据区可存储根据远讲语音识别装置的使用所创建的数据等。此外,存储器402可以包括高速随机存取存储器,还可以包括非暂态存储器,例如至少一个磁盘存储器件、闪存 器件、或其他非暂态固态存储器件。在一些实施例中,存储器402可选包括相对于处理器401远程设置的存储器,这些远程存储器可以通过网络连接至远讲语音识别的处理装置。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。
输入装置403可接收输入的数字或字符信息,以及产生与远讲语音识别装置的用户设置以及功能控制有关的键信号输入。输出装置404可包括显示屏等显示设备。
所述一个或者多个模块存储在所述存储器402中,当被所述一个或者多个处理器401执行时,执行上述任意方法实施例中的远讲语音识别方法。
上述产品可执行本申请实施例所提供的方法,具备执行方法相应的功能模块和有益效果。未在本实施例中详尽描述的技术细节,可参见本申请实施例所提供的方法。
本申请实施例的电子设备以多种形式存在,包括但不限于:
(1)移动通信设备:这类设备的特点是具备移动通信功能,并且以提供话音、数据通信为主要目标。这类终端包括:智能手机(例如iPhone)、多媒体手机、功能性手机,以及低端手机等。
(2)超移动个人计算机设备:这类设备属于个人计算机的范畴,有计算和处理功能,一般也具备移动上网特性。这类终端包括:PDA、MID和UMPC设备等,例如iPad。
(3)便携式娱乐设备:这类设备可以显示和播放多媒体内容。该类设备包括:音频、视频播放器(例如iPod),掌上游戏机,电子书,以及智能玩具和便携式车载导航设备。
(4)服务器:提供计算服务的设备,服务器的构成包括处理器、硬盘、内存、系统总线等,服务器和通用的计算机架构类似,但是由于需要提供高可靠的服务,因此在处理能力、稳定性、可靠性、安全性、可扩展性、可管理性等方面要求较高。
(5)其他具有数据交互功能的电子装置。
相应地,本申请实施例还提供一种计算机可读存储介质,所述计算机可读存储介质中存储有用于执行上述实施例方法的程序。
应用实例
在一种可能的应用场景中,本申请实施例的装置被应用于智能电视。用户购买电视放在自家客厅。根据预先训练的近讲语音模型,电视内置的语音识别模块可以准确地识别用户的近讲语音输入。
用户启动电视,远距离发布控制口令,语音识别模块获取用户的控制口令,并对其进行分帧处理。根据得到的语音帧,调用预先训练出来的近讲语音识别模型对用户发布的口令进行识别,得到一个粗糙的识别结果。
根据这个粗糙的识别结果,采用最大似然线性回归法重新计算用户远讲发布的控制口令和近讲语音输入的环境映射关系。通过这一映射关系电视内置的近讲语音模型就能够自适应用户家客厅环境的语音模型。如此一来,用户在家可以通过远距离发布语音指令来控制智能电视,例如,节目搜索,应用或服务的启动,开关机等。
在另一种应用场景下,用户家里有老人、孩子、男性或者女性,通用的环境自适应模型可能并不能够完全满足用户的需求。因此,所述语音识别设备在采集了多次用户的远讲语音输入之后,根据用户的声学特征,判断多次采集的语音输入结果是否具有同一种声学特征。当判断结果为两种以上时,将这两种语音输入进行分类,例如儿童和成年人。在儿童这一类中,多次采用儿童的远讲语音输入语音帧,根据之前训练得到的环境映射关系,先将儿童的远讲语音输入映射成环境自适应的近似近讲语音输入,根据最大似然线性回归法更新通用的环境映射关系,得到儿童类型的特征映射关系;在成人这一类中,多次单独采用成人的远讲语音输入语音帧,根据之前训练得到的环境映射关系,先将成人的远讲语音输入映射成环境自适应的近似近讲语音输入,根据最大似然线性回归法更新通用的环境映射关系,得到成人类型的特征映射关系。
当再一次检测到用户有语音输入时,首先根据用户的语音特征,判断用户是儿童、成年人还是老人。若是判断为儿童,则调用儿童类型的特征映射关系对儿童的语音输入进行环境以及用户属性的自适应。与此同时,还需要用儿童的语音输入对儿童类型的特征映射关系不断的迭代训练,从而达到一个较稳定的结果。
以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性的劳动的情况下,即可以理解并实施。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到各实施方式可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件。基于这样的理解,上述技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品可以存储在计算机可读存储介质中,如ROM/RAM、磁碟、光盘等,包括若干指令用以使得一台计算机装置(可以是个人计算机,服务器,或者网络装置等)执行各个实施例或者实施例的某些部分所述的方法。
最后应说明的是:以上实施例仅用以说明本发明的技术方案,而非对其限制;尽管参照前述实施例对本发明进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。

Claims (15)

  1. 一种远讲语音识别方法,其特征在于,包括如下的步骤:
    获取用户远讲语音输入的测试远讲语音帧,调用预先训练的近讲语音模型识别所述测试远讲语音帧并得到初识结果;
    根据所述初识结果计算当前环境下所述远讲语音输入与近讲语音输入的环境特征映射矩阵;
    检测到用户的远讲语音输入时,根据所述环境特征映射矩阵将所述远讲语音输入映射至对应的近似近讲语音输入;
    调用预先训练的所述近讲语音模型识别所述近似近讲语音输入得到远讲语音识别结果。
  2. 根据权利要求1所述的方法,其特征在于,根据所述初识结果计算当前环境下所述远讲语音输入与近讲语音输入的环境特征映射矩阵,具体包括:
    根据所述远讲语音帧与所述初识结果,采用最大似然线性回归法计算所述远讲语音输入与对应的所述近讲语音输入之间的环境特征映射矩阵并对所述环境映射矩阵进行迭代更新。
  3. 根据权利要求2所述的方法,其特征在于,对所述环境映射矩阵进行迭代更新,具体包括:
    检测到用户的远讲语音输入时调用所述环境特征映射矩阵将所述远讲语音输入映射至对应的所述近似近讲语音输入;
    调用预先训练的所述近讲语音模型识别所述近似近讲语音输入得到初识结果;
    根据所述初识结果,采用最大似然线性回归法计算所述远讲语音输入与近讲语音输入之间的环境映射关系,并根据所述映射关系更新所述环境特征映射矩阵。
  4. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    提取所述用户的声学特征,判断所述用户所属的声学分组;
    调用预先训练的所述声学分组的属性特征映射矩阵将所述远讲语音输入映射至对应的近似近讲语音输入;
    调用预先训练的所述近讲语音模型识别所述近似近讲语音输入得到远讲语音识别结果。
  5. 根据权利要求4所述的方法,其特征在于,所述方法还包括:
    检测到用户的远讲语音输入时,提取用户声学特征,根据所述声学特征将所述用户划分至不同声学分组;
    在每个所述声学分组中,调用所述环境特征映射矩阵将所述远讲语音输入映射至对应的所述近似近讲语音输入;
    调用预先训练的所述近讲语音模型识别所述近似近讲语音输入得到初识结果;
    根据所述初识结果,采用最大似然线性回归法计算所述远讲语音输入与近讲语音输入的的映射关系,根据所述映射关系更新所述环境特征映射矩阵得到每个所述声学分组的所述属性特征映射矩阵,并对所述属性特征映射矩阵进行更新。
  6. 根据权利要求5所述的方法,其特征在于,对所述属性特征映射矩阵进行更新,具体包括:
    检测到用户的远讲语音输入时,提取所述用户的声学特征并根据所述声学特征将所述用户划分至所属的声学分组;
    根据所述远讲语音输入,调用所述属性特征映射矩阵将所述远讲语音输入映射至对应的所述近似近讲语音输入;
    采用最大似然线性回归法计算所述远讲语音输入与对应的近讲语音输入之间的属性特征映射矩阵,从而实现所述属性特征映射矩阵的更新。
  7. 一种远讲语音识别装置,其特征在于,包括如下的模块:
    信号获取模块,用于获取用户远讲语音输入的测试远讲语音帧,调用预先训练的近讲语音模型识别所述测试远讲语音帧并得到初识结果;
    训练模块,用于根据所述初识结果计算当前环境下所述远讲语音输入与近讲语音输入的环境特征映射矩阵;
    映射模块,用于检测到用户的远讲语音输入时,根据所述环境特征映射矩阵将所述远讲语音输入映射至对应的近似近讲语音输入;
    识别模块,用于调用预先训练的所述近讲语音模型识别所述近似近讲语音输入得到远讲语音识别结果。
  8. 根据权利要求7所述的装置,其特征在于,所述训练模块,具体用于:
    根据所述远讲语音帧与所述初识结果,采用最大似然线性回归法计算所述远讲语音输入与对应的所述近讲语音输入之间的环境特征映射矩阵并对所述环境映射矩阵进行迭代更新。
  9. 根据权利要求8所述的装置,其特征在于,所述训练模块,具体还用于:
    检测到用户的远讲语音输入时调用所述环境特征映射矩阵将所述远讲语音输入映射至对应的所述近似近讲语音输入;
    调用预先训练的所述近讲语音模型识别所述近似近讲语音输入得到初识结果;
    根据所述初识结果,采用最大似然线性回归法计算所述远讲语音输入与近讲语音输入之间的环境映射关系,并根据所述映射关系更新所述环境特征映射矩阵。
  10. 根据权利要求7所述的装置,其特征在于,所述映射模块还用于:提取所述用户的声学特征,判断所述用户所属的声学分组;
    调用预先训练的所述声学分组的属性特征映射矩阵将所述远讲语音输入映射至对应的近似近讲语音输入;
    所述识别模块,还用于调用预先训练的所述近讲语音模型识别所述近似近讲语音输入得到远讲语音识别结果。
  11. 根据权利要求10所述的装置,其特征在于,所述训练模块,还用 于:
    检测到用户的远讲语音输入时,提取用户声学特征,根据所述声学特征将所述用户划分至不同声学分组;
    在每个所述声学分组中,调用所述环境特征映射矩阵将所述远讲语音输入映射至对应的所述近似近讲语音输入;
    调用预先训练的所述近讲语音模型识别所述近似近讲语音输入得到初识结果;
    根据所述初识结果,采用最大似然线性回归法计算所述远讲语音输入与近讲语音输入的的映射关系,根据所述映射关系更新所述环境特征映射矩阵得到每个所述声学分组的所述属性特征映射矩阵,并对所述属性特征映射矩阵进行更新。
  12. 根据权利要求11所述的装置,其特征在于,所述训练模块,具体还用于:
    检测到用户的远讲语音输入时,提取所述用户的声学特征并根据所述声学特征将所述用户划分至所属的声学分组;
    根据所述远讲语音输入,调用所述属性特征映射矩阵将所述远讲语音输入映射至对应的所述近似近讲语音输入;
    调用预先训练的所述近讲语音模型识别所述近似近讲语音输入得到初识结果;
    根据所述初识结果,采用最大似然线性回归法计算所述远讲语音输入与近讲语音输入的的属性特征映射矩阵,从而实现所述属性特征映射矩阵的更新。
  13. 一种远讲语音识别电子设备,其特征在于,包括:
    至少一个处理器;以及,
    与所述至少一个处理器通信连接的存储器;其中,
    所述存储器存储有可被所述一个处理器执行的指令,所述指令被所述至少一个处理器执行,以使所述至少一个处理器能够执行权利要求1-6任一 所述方法。
  14. 一种非暂态计算机可读存储介质,其特征在于,所述非暂态计算机可读存储介质存储计算机指令,所述计算机指令用于使所述计算机执行权利要求1-6任一所述方法。
  15. 一种计算机程序产品,所述计算机程序产品包括存储在非暂态计算机可读存储介质上的计算程序,所述计算机程序包括程序指令,当所述程序指令被计算机执行时,使所述计算机执行权利要求1-6任一所述方法。
PCT/CN2016/101053 2016-04-11 2016-09-30 远讲语音识别方法及装置 WO2017177629A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610219407.2A CN105845131A (zh) 2016-04-11 2016-04-11 远讲语音识别方法及装置
CN201610219407.2 2016-04-11

Publications (1)

Publication Number Publication Date
WO2017177629A1 true WO2017177629A1 (zh) 2017-10-19

Family

ID=56598055

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/101053 WO2017177629A1 (zh) 2016-04-11 2016-09-30 远讲语音识别方法及装置

Country Status (2)

Country Link
CN (1) CN105845131A (zh)
WO (1) WO2017177629A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105845131A (zh) * 2016-04-11 2016-08-10 乐视控股(北京)有限公司 远讲语音识别方法及装置
CN108836574A (zh) * 2018-06-20 2018-11-20 广州智能装备研究院有限公司 一种利用颈部振动的人工智能发声系统及其发声方法
CN108959627B (zh) * 2018-07-23 2021-12-17 北京光年无限科技有限公司 基于智能机器人的问答交互方法及系统
CN112771608A (zh) * 2018-11-20 2021-05-07 深圳市欢太科技有限公司 语音信息的处理方法、装置、存储介质及电子设备

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6389393B1 (en) * 1998-04-28 2002-05-14 Texas Instruments Incorporated Method of adapting speech recognition models for speaker, microphone, and noisy environment
US7457745B2 (en) * 2002-12-03 2008-11-25 Hrl Laboratories, Llc Method and apparatus for fast on-line automatic speaker/environment adaptation for speech/speaker recognition in the presence of changing environments
CN103258533A (zh) * 2013-05-27 2013-08-21 重庆邮电大学 远距离语音识别中的模型域补偿新方法
CN104025188A (zh) * 2011-12-29 2014-09-03 英特尔公司 声学信号修改
CN104810021A (zh) * 2015-05-11 2015-07-29 百度在线网络技术(北京)有限公司 应用于远场识别的前处理方法和装置
CN104952450A (zh) * 2015-05-15 2015-09-30 百度在线网络技术(北京)有限公司 远场识别的处理方法和装置
CN105845131A (zh) * 2016-04-11 2016-08-10 乐视控股(北京)有限公司 远讲语音识别方法及装置

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104078041B (zh) * 2014-06-26 2018-03-13 美的集团股份有限公司 语音识别方法及系统

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6389393B1 (en) * 1998-04-28 2002-05-14 Texas Instruments Incorporated Method of adapting speech recognition models for speaker, microphone, and noisy environment
US7457745B2 (en) * 2002-12-03 2008-11-25 Hrl Laboratories, Llc Method and apparatus for fast on-line automatic speaker/environment adaptation for speech/speaker recognition in the presence of changing environments
CN104025188A (zh) * 2011-12-29 2014-09-03 英特尔公司 声学信号修改
CN103258533A (zh) * 2013-05-27 2013-08-21 重庆邮电大学 远距离语音识别中的模型域补偿新方法
CN104810021A (zh) * 2015-05-11 2015-07-29 百度在线网络技术(北京)有限公司 应用于远场识别的前处理方法和装置
CN104952450A (zh) * 2015-05-15 2015-09-30 百度在线网络技术(北京)有限公司 远场识别的处理方法和装置
CN105845131A (zh) * 2016-04-11 2016-08-10 乐视控股(北京)有限公司 远讲语音识别方法及装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KATSAMANIS, A. ET AL.: "ROBUST FAR-FIELD SPOKEN COMMAND RECOGNITION FOR HOME AUTOMATION COMBINING ADAPTATION AND MUlTICHANNEL PROCESSING", 2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTIC, SPEECH AND SIGNAL PROCESSING (ICASSP, 9 May 2014 (2014-05-09), pages 5547 - 5551, XP032617715 *

Also Published As

Publication number Publication date
CN105845131A (zh) 2016-08-10

Similar Documents

Publication Publication Date Title
US10373609B2 (en) Voice recognition method and apparatus
US12033632B2 (en) Context-based device arbitration
US11875820B1 (en) Context driven device arbitration
US11138977B1 (en) Determining device groups
US20170154640A1 (en) Method and electronic device for voice recognition based on dynamic voice model selection
WO2017084360A1 (zh) 一种用于语音识别方法及系统
WO2014114048A1 (zh) 一种语音识别的方法、装置
CN108922525B (zh) 语音处理方法、装置、存储介质及电子设备
WO2014114049A1 (zh) 一种语音识别的方法、装置
WO2018223727A1 (zh) 识别声纹的方法、装置、设备及介质
US10685664B1 (en) Analyzing noise levels to determine usability of microphones
CN104700843A (zh) 一种年龄识别的方法及装置
CN104575504A (zh) 采用声纹和语音识别进行个性化电视语音唤醒的方法
WO2017177629A1 (zh) 远讲语音识别方法及装置
US11250854B2 (en) Method and apparatus for voice interaction, device and computer-readable storage medium
CN102945673A (zh) 一种语音指令范围动态变化的连续语音识别方法
KR20230107860A (ko) 실제 노이즈를 사용한 음성 개인화 및 연합 트레이닝
CN113643693A (zh) 以声音特征为条件的声学模型
CN110268471A (zh) 具有嵌入式降噪的asr的方法和设备
CN107393539A (zh) 一种声音密码控制方法
WO2023142409A1 (zh) 调整播放音量的方法、装置、设备以及存储介质
CN112017662B (zh) 控制指令确定方法、装置、电子设备和存储介质
CN114512133A (zh) 发声对象识别方法、装置、服务器及存储介质
US11887602B1 (en) Audio-based device locationing
US12125483B1 (en) Determining device groups

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16898442

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 16898442

Country of ref document: EP

Kind code of ref document: A1