WO2021098153A1 - Detection method, system, electronic device and storage medium for a target user change - Google Patents


Info

Publication number
WO2021098153A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
target
probability
speech
signal
Prior art date
Application number
PCT/CN2020/087744
Other languages
English (en)
French (fr)
Inventor
陆成
叶顺舟
康力
巴莉芳
Original Assignee
锐迪科微电子科技(上海)有限公司
Priority date
Filing date
Publication date
Application filed by 锐迪科微电子科技(上海)有限公司
Publication of WO2021098153A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification techniques
    • G10L 17/06 - Decision making techniques; Pattern matching strategies
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 - Sound input; Sound output
    • G06F 3/165 - Management of the audio stream, e.g. setting of volume, audio stream path
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use for comparison or discrimination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78 - Detection of presence or absence of voice signals
    • G10L 25/87 - Detection of discrete points within a voice signal

Definitions

  • the present invention relates to the technical field of artificial intelligence equipment, and in particular to a method, system, electronic device and storage medium for detecting a target user change in voice interaction.
  • With the rapid development of artificial intelligence technology, smart speakers have emerged. Smart speakers mostly use a microphone array to pick up sound; the device is activated when the user speaks a wake-up word (such as "Hello Xiaorui"), and executes the corresponding control command after waking up.
  • VAD (Voice Activity Detection)
  • After the voice interaction function of a device or product is awakened, the main function of VAD is to determine the start point and end point of the voice recorded by the microphone; without a VAD algorithm, the recording can only be controlled by manual intervention with the device. In practical application scenarios, however, the following situation arises: the device wakes up successfully, the recording function is turned on, and the target user starts speaking.
  • After the target user finishes speaking, if other non-target users nearby suddenly start to speak, the microphone will still continue to pick up their voice; that is, the VAD algorithm cannot accurately detect the end point of the target user's recording. Undesired voices are then recorded, causing errors in subsequent speech recognition and semantic understanding and degrading the user experience.
  • The technical problem to be solved by the present invention is to overcome the inability of the prior art to prevent the voice data of non-target users from being recorded during the recording process, which causes errors in subsequent speech recognition and semantic understanding and reduces the user experience.
  • the present invention provides a method for detecting a target user change in voice interaction, and the detecting method includes:
  • After the recording device starts recording, it is determined whether the target user is detected to start inputting a voice signal; if so, the first voice signal of a first set number of frames input by the target user is acquired;
  • the recording device is controlled to stop recording.
  • the step of obtaining the first pitch period corresponding to each frame of the first speech signal includes:
  • Processing the first intermediate speech signal by using a waveform estimation method, an autocorrelation processing method or a cepstrum method to obtain the first pitch period corresponding to each frame of the first speech signal; and/or,
  • the step of obtaining the second pitch period corresponding to the second speech signal of each frame includes:
  • the second intermediate speech signal is processed by using a waveform estimation method, an autocorrelation processing method or a cepstrum method to obtain the second pitch period corresponding to the second speech signal in each frame.
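The pitch-period extraction described in these steps can be sketched as follows, using the center-clipping plus autocorrelation variant. This is an illustrative implementation only; the sampling rate, frame length, clipping ratio, and pitch search range (60-400 Hz) are assumptions, not values specified by the publication.

```python
import numpy as np

def center_clip(frame, ratio=0.6):
    """Center clipping: zero out samples whose magnitude is below a
    clipping level, which suppresses formant structure before the
    autocorrelation (the ratio 0.6 is an illustrative choice)."""
    c = ratio * np.max(np.abs(frame))
    return np.where(frame > c, frame - c,
                    np.where(frame < -c, frame + c, 0.0))

def pitch_period_autocorr(frame, fs=16000, fmin=60, fmax=400):
    """Estimate the pitch period (in samples) of one voiced frame by
    locating the autocorrelation peak inside a plausible pitch range."""
    x = center_clip(frame)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0..len-1
    lo, hi = fs // fmax, fs // fmin                    # lag search range
    return lo + int(np.argmax(ac[lo:hi]))              # pitch period in samples
```

For a 200 Hz voiced frame sampled at 16 kHz, the returned lag should be close to 80 samples (16000 / 200).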
  • the step of calculating the similarity between the first pitch period sequence and the second pitch period sequence includes:
  • the Euclidean distance is negatively correlated with the similarity.
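The dynamic-time-warping comparison of the two pitch-period sequences might look like the following sketch. The mapping from distance to similarity (here 1/(1 + distance)) is an illustrative choice that satisfies the stated negative correlation; the publication does not give a specific formula.

```python
import numpy as np

def dtw_distance(seq1, seq2):
    """Dynamic time warping distance between two pitch-period sequences
    (local cost: absolute difference, i.e. 1-D Euclidean distance)."""
    n, m = len(seq1), len(seq2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(seq1[i - 1] - seq2[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def similarity(seq1, seq2):
    """Map the DTW distance to a similarity in (0, 1]; larger distance
    yields smaller similarity, matching the stated negative correlation."""
    return 1.0 / (1.0 + dtw_distance(seq1, seq2))
```

DTW tolerates the small length differences between the two sequences (n is chosen near m in the embodiment), which is why a plain frame-by-frame distance is not used.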
  • the step of judging whether it is detected that the target user starts to input a voice signal includes:
  • when the first set number of frames is N, the target input signal includes the current frame i and the first input signals of the N-1 frames before the current frame, where i ≥ N > 1 and i and N are integers;
  • the method further includes:
  • the step of determining whether it is detected that the target user starts to input a voice signal according to the target voice probability and the target non-voice probability of the first input signal in each frame of the target input signal includes:
  • if the first total probability is greater than or equal to the second total probability, the first frame number is greater than or equal to the fourth set threshold, and the current frame is a speech frame, it is determined that the target user is detected to start inputting a voice signal.
  • the method further includes:
  • the step of determining whether it is detected that the target user ends the voice signal input, and if so, the step of controlling the recording device to stop recording includes:
  • if the second total probability is greater than the first total probability, the second frame number is greater than or equal to the fifth set threshold, and the current frame is a non-speech frame, it is determined that the target user is detected to end the voice signal input, and the recording device is controlled to stop recording.
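The start- and end-of-input decisions described above can be sketched as follows, assuming each frame in the target input signal carries a (speech, non-speech) probability pair; the threshold arguments are stand-ins for the publication's "fourth set threshold" and "fifth set threshold", whose values it does not fix.

```python
def detect_speech_start(probs, n_speech_thresh):
    """Decide whether the target user has started speaking, given
    (speech_prob, non_speech_prob) pairs for the last N frames."""
    speech_frames = sum(1 for p, q in probs if p > q)     # "first frame number"
    total_speech = sum(p for p, q in probs)               # first total probability
    total_non = sum(q for p, q in probs)                  # second total probability
    cur_is_speech = probs[-1][0] > probs[-1][1]           # current frame is speech
    return (total_speech >= total_non
            and speech_frames >= n_speech_thresh
            and cur_is_speech)

def detect_speech_end(probs, n_nonspeech_thresh):
    """Mirror decision for the end of the target user's input."""
    nonspeech_frames = sum(1 for p, q in probs if p <= q)  # "second frame number"
    total_speech = sum(p for p, q in probs)
    total_non = sum(q for p, q in probs)
    cur_is_nonspeech = probs[-1][0] <= probs[-1][1]
    return (total_non > total_speech
            and nonspeech_frames >= n_nonspeech_thresh
            and cur_is_nonspeech)
```

When neither condition fires, the device would acquire the next frame and slide the N-frame window forward, as the surrounding steps describe.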
  • when it is determined that the target user has not started to input a voice signal, the detection method further includes:
  • determining, according to the target speech probability and the target non-speech probability of each frame of the first input signal in the new target input signal, whether the target user is detected to start inputting a voice signal; and/or,
  • the step of acquiring the first voice signal of a first set number of frames input by the target user includes:
  • the step of obtaining the target speech probability and target non-speech probability corresponding to the first input signal in each frame includes:
  • the target speech probability and the target non-speech probability corresponding to each frame of the first input signal are obtained according to the energy corresponding to the first input signal of each frame.
  • the step of obtaining the target speech probability and the target non-speech probability corresponding to the first input signal in each frame according to the energy includes:
  • when a deep neural network algorithm is used to obtain the target speech probability and the target non-speech probability corresponding to the first input signal in each frame, the steps include:
  • the signal types include voice signals and non-voice signals
  • the target non-speech probability corresponding to the first input signal of each frame is calculated according to the target speech probability.
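As a minimal stand-in for the deep-neural-network probability model trained on the target user's historical input, the sketch below uses a single-layer logistic classifier over per-frame feature vectors. A real system would use a deeper network and richer features; the learning rate, epoch count, and feature dimension here are illustrative assumptions, not values from the publication.

```python
import numpy as np

class FrameSpeechModel:
    """Toy per-frame speech/non-speech probability model: a logistic
    classifier trained on historical frames labelled speech (1) or
    non-speech (0), standing in for the publication's DNN model."""

    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.b = 0.0
        self.lr = lr

    def _sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit(self, frames, labels, epochs=200):
        """SGD on the log loss over (frame features, label) pairs."""
        for _ in range(epochs):
            for x, y in zip(frames, labels):
                p = self._sigmoid(self.w @ x + self.b)
                g = p - y                      # gradient of the log loss
                self.w -= self.lr * g * x
                self.b -= self.lr * g

    def speech_prob(self, frame):
        """Return (target speech probability, target non-speech
        probability); the latter is 1 minus the former, as in the claim."""
        p = self._sigmoid(self.w @ frame + self.b)
        return p, 1.0 - p
```

The last line mirrors the claim above: the target non-speech probability is derived from the predicted target speech probability rather than modelled separately.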
  • when both the energy corresponding to the first input signal and a deep neural network algorithm are used to obtain the target speech probability and the target non-speech probability corresponding to the first input signal in each frame, the steps include:
  • the signal types include voice signals and non-voice signals
  • a weighted average method is used to process the first speech probability and the second speech probability of the first input signal in the same frame, and the first non-speech probability and the second non-speech probability of the first input signal in the same frame, to obtain the target speech probability and the target non-speech probability corresponding to the first input signal in each frame.
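The weighted-average fusion of the energy-based probabilities (first speech/non-speech probability) and the DNN-based probabilities (second speech/non-speech probability) reduces to a convex combination per frame. The weight below is an illustrative choice; the publication does not fix one.

```python
def combine_probabilities(p1, q1, p2, q2, w=0.5):
    """Weighted average of the energy-based pair (p1, q1) and the
    DNN-based pair (p2, q2) for one frame, yielding the target speech
    and target non-speech probabilities.  w is an assumed weight."""
    speech = w * p1 + (1.0 - w) * p2
    non_speech = w * q1 + (1.0 - w) * q2
    return speech, non_speech
```

If each input pair sums to 1, the fused pair also sums to 1, so the result remains a valid speech/non-speech probability pair.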
  • the step of obtaining the average energy value corresponding to the first input signal in each frame within a set frequency range includes:
  • the step of obtaining the target speech probability and the target non-speech probability corresponding to each frame of the first input signal according to the average energy value includes:
  • when the average energy value lies between the third set threshold and the second set threshold, the second probability that the current frame is speech is determined according to the average energy value, the second set threshold, and the third set threshold;
  • the first probability, the second probability, and the third probability are in descending order;
  • the target speech probability and the target non-speech probability corresponding to the second input signal of each frame are determined according to the first probability, the second probability, or the third probability.
  • the calculation formula corresponding to the step of determining the second probability that the current frame is speech according to the average energy value, the second set threshold and the third set threshold is as follows:
  • Prob_energy = (energy - A) / (B - A)
  • Prob_energy represents the second probability
  • energy represents the average energy value
  • A represents the third set threshold
  • B represents the second set threshold
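Putting the three cases together, the energy-to-probability mapping can be sketched as a piecewise function. The middle branch is the formula above; the endpoint probabilities of 1.0 and 0.0 for the confidently-speech and confidently-non-speech regions are illustrative assumptions, since the publication only states that the three probabilities are in descending order.

```python
def speech_prob_from_energy(energy, B, A):
    """Map the average subband energy to a speech probability.
    B is the second set threshold, A the third set threshold (A < B);
    the 1.0/0.0 endpoints are assumed, the middle branch follows
    Prob_energy = (energy - A) / (B - A)."""
    if energy > B:
        return 1.0                        # first probability: clearly speech
    if energy < A:
        return 0.0                        # third probability: clearly non-speech
    return (energy - A) / (B - A)         # second probability: linear ramp
```

The frame's target non-speech probability would then be the complement of this value.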
  • the detection method further includes:
  • a wake-up word is used to wake up the recording device.
  • the present invention also provides a detection system for target user changes in voice interaction.
  • the detection system includes a first judgment module, a first voice signal acquisition module, a first pitch period acquisition module, a first period sequence acquisition module, a second voice signal acquisition module, a second pitch period acquisition module, a second period sequence acquisition module, a similarity calculation module, and a second judgment module.
  • the first judgment module is used to judge, after the recording device starts recording, whether the target user is detected to start inputting a voice signal, and if so, to call the first voice signal acquisition module to acquire the first voice signal of the first set number of frames input by the target user;
  • the first pitch period acquisition module is configured to acquire the first pitch period corresponding to the first speech signal in each frame
  • the first periodic sequence obtaining module is configured to obtain a first pitch period sequence corresponding to the first set number of frames according to the first pitch period;
  • the second voice signal acquiring module is configured to acquire the second voice signal of the second set number of frames input by the current user into the recording device after the set time length;
  • the second pitch period acquisition module is configured to acquire the second pitch period corresponding to the second speech signal of each frame
  • the second periodic sequence obtaining module is configured to obtain a second pitch period sequence corresponding to the second set number of frames according to the second pitch period;
  • the similarity calculation module is used to calculate the similarity between the first pitch period sequence and the second pitch period sequence
  • the second judgment module is configured to judge whether the similarity is greater than a first set threshold; if so, it is determined that the current user inputting the second voice signal is the target user, the recording device is controlled to continue recording, and the second voice signal acquisition module is called again;
  • the recording device is controlled to stop recording.
  • the first pitch period acquisition module includes a first preprocessing unit, a first short-term energy processing unit, a first center clipping processing unit, and a first pitch period acquisition unit;
  • the first preprocessing unit is configured to preprocess the first speech signal of each frame
  • the first short-term energy processing unit is configured to use short-term energy to process each frame of the first speech signal after preprocessing, to obtain the first voiced signal in each frame of the first speech signal;
  • the first center clipping processing unit is configured to process the first voiced sound signal by using a center clipping method to obtain a first intermediate speech signal;
  • the first pitch period acquisition unit is configured to process the first intermediate speech signal by using a waveform estimation method, an autocorrelation processing method, or a cepstrum method, and obtain the first pitch period corresponding to each frame of the first speech signal;
  • the first period sequence acquisition module is configured to form the first pitch period sequence according to the first pitch period corresponding to the first speech signal in each frame of the first set number of frames; and/or,
  • the second pitch period acquisition module includes a second preprocessing unit, a second short-term energy processing unit, a second center clipping processing unit, and a second pitch period acquisition unit;
  • the second preprocessing unit is configured to preprocess the second speech signal of each frame
  • the second short-term energy processing unit is configured to use short-term energy to process each frame of the second speech signal after preprocessing, to obtain a second voiced signal in each frame of the second speech signal;
  • the second center clipping processing unit is used to process the second voiced sound signal by using a center clipping method to obtain a second intermediate speech signal;
  • the second pitch period acquisition unit is configured to process the second intermediate speech signal by using a waveform estimation method, an autocorrelation processing method, or a cepstrum method, and obtain the second pitch period corresponding to each frame of the second speech signal;
  • the second period sequence acquisition module is configured to form the second pitch period sequence according to the second pitch period corresponding to each frame of the second speech signal in the second set number of frames.
  • the similarity calculation module includes a Euclidean distance calculation unit and a similarity determination unit;
  • the Euclidean distance calculation unit is used to calculate the Euclidean distance between the first pitch period sequence and the second pitch period sequence by using a dynamic time warping algorithm
  • the similarity determination unit is configured to determine the similarity between the first pitch period sequence and the second pitch period sequence according to the Euclidean distance
  • the Euclidean distance is negatively correlated with the similarity.
  • the first judgment module includes a first input signal acquisition unit, a target probability acquisition unit, a target input signal acquisition unit, and a signal input determination unit;
  • the first input signal acquisition unit is configured to sequentially acquire the first input signal of each frame after the recording device starts recording;
  • the target probability acquisition unit is configured to acquire the target speech probability and the target non-speech probability corresponding to the first input signal in each frame;
  • the target input signal obtaining unit is configured to obtain the target input signal when the total number of frames of the first input signal obtained is greater than or equal to the first set number of frames;
  • when the first set number of frames is N, the target input signal includes the current frame i and the first input signals of the N-1 frames before the current frame, where i ≥ N > 1 and i and N are integers;
  • the signal input determining unit is configured to determine whether it is detected that the target user starts to input a voice signal according to the target voice probability and the target non-voice probability of the first input signal in each frame of the target input signal.
  • the first judgment module further includes a voice frame determination unit, a frame number acquisition unit, and a total probability calculation unit;
  • the speech frame determining unit is used to judge whether the target speech probability is greater than the target non-speech probability; if so, the current frame is determined to be a speech frame; if not, the current frame is determined to be a non-speech frame;
  • the frame number obtaining unit is configured to obtain a first frame number corresponding to a voice frame and a second frame number corresponding to a non-voice frame in the target input signal;
  • the total probability calculation unit is configured to calculate the sum of the target speech probabilities of each frame of the first input signal in the target input signal to obtain a first total probability, and to calculate the sum of the target non-speech probabilities of each frame of the first input signal in the target input signal to obtain a second total probability;
  • the signal input determining unit is configured to determine that the target user is detected to start inputting a voice signal when the first total probability is greater than or equal to the second total probability, the first frame number is greater than or equal to a fourth set threshold, and the current frame is a speech frame.
  • the first judgment module is also used to judge whether the target user is detected to end the voice signal input; if so, to control the recording device to stop recording; if not, to continue to call the second voice signal acquisition module.
  • the signal input determining unit is further configured to determine that the target user is detected to end the voice signal input, and to control the recording device to stop recording, when the second total probability is greater than the first total probability, the second frame number is greater than or equal to a fifth set threshold, and the current frame is a non-speech frame.
  • the first input signal acquisition unit is further configured to continue to acquire the first input signal in the next frame;
  • the target input signal obtaining unit is further configured to obtain the new target input signal according to the first input signal of the next frame;
  • the signal input determining unit is further configured to determine, according to the target speech probability and the target non-speech probability of each frame of the first input signal in the new target input signal, whether the target user is detected to start inputting a voice signal; and/or,
  • the first voice signal acquisition module is configured to acquire, starting from the (i-N+1)-th frame, the first voice signal of the first set number of frames input by the target user.
  • the target probability acquisition unit is configured to obtain the target speech probability and the target non-speech probability corresponding to the first input signal in each frame according to the energy corresponding to the first input signal of each frame and/or by using a deep neural network algorithm.
  • the target probability acquisition unit includes an energy value acquisition subunit and a target probability acquisition subunit;
  • the energy value obtaining subunit is configured to obtain the average energy value corresponding to the first input signal in each frame within a set frequency range
  • the target probability obtaining subunit is configured to obtain the target speech probability and the target non-speech probability corresponding to the first input signal in each frame according to the average energy value.
  • the target probability acquisition unit includes a historical signal acquisition subunit, a model establishment subunit, and a target probability acquisition subunit;
  • the historical signal acquisition subunit is configured to acquire each frame of historical input signal input into the recording device by the target user within a historical setting time and the signal type corresponding to each frame of the historical input signal;
  • the signal types include voice signals and non-voice signals
  • the model establishment subunit is used to take the historical input signal as an input and the signal type as an output, and use a deep neural network to establish a probability model for predicting that the input signal of each frame is a speech signal;
  • the target probability obtaining subunit is configured to input the first input signal of each frame into the prediction model, and obtain the target speech probability corresponding to the first input signal of each frame;
  • the target probability acquisition subunit is further configured to calculate the target non-speech probability corresponding to the first input signal in each frame according to the target speech probability.
  • the target probability acquisition unit includes an energy value acquisition subunit, a target probability acquisition subunit, a historical signal acquisition subunit, a model establishment subunit, and a weighted calculation subunit;
  • the energy value obtaining subunit is configured to obtain the average energy value corresponding to the first input signal in each frame within a set frequency range
  • the target probability obtaining subunit is configured to obtain the first speech probability and the first non-speech probability corresponding to the first input signal in each frame according to the average energy value;
  • the historical signal acquisition subunit is configured to acquire each frame of historical input signal input into the recording device by the target user within a historical setting time and the signal type corresponding to each frame of the historical input signal;
  • the signal types include voice signals and non-voice signals
  • the model establishment subunit is used to take the historical input signal as an input and the signal type as an output, and use a deep neural network to establish a probability model for predicting that the input signal of each frame is a speech signal;
  • the target probability obtaining subunit is configured to input the first input signal of each frame into the prediction model, and obtain the second speech probability corresponding to the first input signal of each frame;
  • the target probability acquisition subunit is further configured to calculate a second non-speech probability corresponding to the first input signal in each frame according to the second speech probability;
  • the weighted calculation subunit is used to process, by a weighted average method, the first speech probability and the second speech probability of the first input signal in the same frame, and the first non-speech probability and the second non-speech probability of the first input signal in the same frame, to obtain the target speech probability and the target non-speech probability corresponding to the first input signal in each frame.
  • the energy value obtaining subunit is used to convert the first input signal corresponding to each frame in the time domain into a second input signal corresponding to the frequency domain;
  • the energy value obtaining subunit is further configured to calculate the sub-band energy value corresponding to each frequency band in the set frequency range of the second input signal in each frame;
  • the energy value obtaining subunit is further configured to obtain the average energy value corresponding to the second input signal in each frame according to the subband energy value; and/or,
  • the target probability acquisition subunit is configured to determine the first probability that the current frame is speech when the average energy value is greater than a second set threshold
  • when the average energy value lies between the third set threshold and the second set threshold, the second probability that the current frame is speech is determined according to the average energy value, the second set threshold, and the third set threshold;
  • the first probability, the second probability, and the third probability are in descending order;
  • the target probability acquisition subunit is further configured to determine the target speech probability and the target non-speech probability corresponding to the second input signal in each frame according to the first probability, the second probability, or the third probability.
  • the calculation formula by which the target probability acquisition subunit determines the second probability that the current frame is speech, according to the average energy value, the second set threshold, and the third set threshold, is as follows:
  • Prob_energy = (energy - A) / (B - A)
  • Prob_energy represents the second probability
  • energy represents the average energy value
  • A represents the third set threshold
  • B represents the second set threshold
  • the detection system also includes a wake-up module
  • the wake-up module is used to wake up the recording device with a wake-up word before the recording device starts recording.
  • the present invention also provides an electronic device, including a memory, a processor, and a computer program stored on the memory and capable of running on the processor.
  • when the processor executes the computer program, the above-described method for detecting a target user change in voice interaction is implemented.
  • the present invention also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps of the method for detecting the change of the target user in the voice interaction are realized.
  • The present invention can quickly and effectively detect that a non-target user's speech content continues to be recorded after the target user has stopped speaking, and control the recording device to stop recording in time. This shortens the response time of voice interaction, ensures a timely response to the target user's request, avoids errors in subsequent speech recognition and semantic understanding, improves the accuracy of speech processing results, and improves the user experience. In addition, it reduces the resource occupation of the recording device and avoids occupying too many resources.
  • The target speech probability and target non-speech probability corresponding to each frame of the first input signal are obtained through the energy corresponding to the first input signal of each frame and a deep neural network algorithm, which effectively improves the accuracy of the VAD detection of the start point and end point of the input speech and ensures the integrity of the entered data, while further reducing the resource occupation of the recording device.
  • Fig. 1 is a flowchart of a method for detecting a target user change in voice interaction according to Embodiment 1 of the present invention.
  • Fig. 2 is a first flowchart of a method for detecting a target user change in voice interaction according to Embodiment 2 of the present invention.
  • Fig. 3 is a second flow chart of the method for detecting a target user change in voice interaction according to Embodiment 2 of the present invention.
  • Fig. 4 is a flowchart of a method for detecting a target user change in voice interaction according to Embodiment 3 of the present invention.
  • FIG. 5 is a schematic diagram of modules of a detection system for a target user change in voice interaction according to Embodiment 4 of the present invention.
  • FIG. 6 is a schematic diagram of modules of a detection system for a target user change in voice interaction according to Embodiment 5 of the present invention.
  • FIG. 7 is a schematic diagram of the first judging module in the system for detecting the change of the target user in the voice interaction according to Embodiment 6 of the present invention.
  • FIG. 8 is a schematic structural diagram of an electronic device that implements a method for detecting a change of a target user in a voice interaction in Embodiment 7 of the present invention.
  • the method for detecting the change of the target user in the voice interaction of this embodiment includes:
  • S101 Determine whether it is detected that the target user starts to input a voice signal, and if so, obtain a first voice signal input by the target user of a first set number of frames;
  • After step S104 determines that the current user inputting the second voice signal is the target user, continuously monitor whether the target user ends the voice signal input; if so, control the recording device to stop recording; if not, continue to perform step S104.
  • The first m frames of the first voice signal are automatically obtained, and the corresponding first pitch period sequence is obtained and stored in the first buffer storage area of the recording device, while it is continuously detected whether the target user has finished voice signal input. When the target user has not ended the voice signal input, the second voice signal of n frames input by the current user is obtained at regular intervals (such as every 100 ms), and the corresponding second pitch period sequence is obtained and saved to the second buffer storage area of the recording device; considering that the pitch period does not have strict periodicity, n can be randomly selected in the range [m-5, m+5]. The second pitch period sequence obtained each time is then compared with the first pitch period sequence to determine whether the speaker has changed. If there is a change, it means that the target user has stopped talking and the device is continuing to record a non-target user's speech content; at this time, the recording needs to be stopped in time.
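  • The monitoring flow described above can be sketched in Python as follows (a minimal sketch: the function names, the 0.5 similarity threshold, and the frame source are illustrative assumptions, not the patented implementation):

```python
import random

def monitor_recording(get_frames, pitch_sequence, similarity, m=30, threshold=0.5):
    # First m frames of the target user -> first pitch period sequence (buffer 1)
    first_seq = pitch_sequence(get_frames(m))
    while True:
        # The pitch period is not strictly periodic, so n is drawn from [m-5, m+5]
        n = random.randint(m - 5, m + 5)
        # Every set duration (e.g. 100 ms), n frames from the current user (buffer 2)
        second_seq = pitch_sequence(get_frames(n))
        if similarity(first_seq, second_seq) <= threshold:
            return "stop recording"  # speaker changed: stop recording in time
        # otherwise the target user is still speaking; keep recording
```

  • In the device the two sequences would live in the two buffer storage areas; here they are simply local variables.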
  • In this embodiment, the first pitch period sequence is obtained from the set number of frames of the voice signal input by the target user, and the second pitch period sequence corresponding to the currently input voice signal is then obtained every set duration. By comparing the two pitch period sequences, the continued recording of a non-target user's speech after the target user has stopped speaking can be detected quickly and effectively, and the recording device can be controlled to stop recording in time. This ensures a timely response to the target user's request, shortens the response time of the voice interaction, avoids errors in subsequent speech recognition and semantic understanding, improves the accuracy of the speech processing results, and enhances the user experience; in addition, it reduces the resource occupation of the recording device and avoids occupying excessive resources.
  • the method for detecting the change of the target user in the voice interaction of this embodiment is a further improvement of Embodiment 1. Specifically:
  • step S102 includes:
  • methods such as waveform estimation method, autocorrelation processing method, or cepstrum method are used to process the first intermediate speech signal, and the first pitch period corresponding to the first speech signal of each frame is obtained.
  • Step S103 includes:
  • step S105 includes:
  • S1054. Process the second intermediate speech signal to obtain a second pitch period corresponding to each frame of the second speech signal.
  • the waveform estimation method, the autocorrelation processing method, or the cepstrum method are used to process the second intermediate speech signal to obtain the second pitch period corresponding to the second speech signal of each frame.
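  • As one concrete reading of the center-clipping-plus-autocorrelation route named above, the following sketch estimates a pitch period from a single frame (the 0.3 clipping ratio and the 20-200 sample lag search range are illustrative assumptions):

```python
import numpy as np

def pitch_period_autocorr(frame, clip_ratio=0.3, lag_lo=20, lag_hi=200):
    frame = np.asarray(frame, dtype=float)
    c = clip_ratio * np.max(np.abs(frame))
    # Center clipping: zero everything in [-c, c] and shift the rest toward zero,
    # which suppresses formant structure and sharpens the pitch peaks
    clipped = np.where(frame > c, frame - c, np.where(frame < -c, frame + c, 0.0))
    # Autocorrelation of the intermediate (clipped) signal
    ac = np.correlate(clipped, clipped, mode="full")[len(clipped) - 1:]
    # Pitch period = lag of the strongest peak inside the search range
    return lag_lo + int(np.argmax(ac[lag_lo:lag_hi + 1]))

# A synthetic voiced frame with a 50-sample pitch period
t = np.arange(400)
print(pitch_period_autocorr(np.sin(2 * np.pi * t / 50)))  # → 50
```

  • The waveform estimation and cepstrum methods mentioned above would replace only the last two steps; the clipping stage is shared.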
  • Step S106 includes:
  • Step S107 includes:
  • Euclidean distance is negatively correlated with similarity.
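  • The dynamic time warping comparison can be sketched as follows (a textbook DTW with |a - b| as the local cost; the exact cost and the mapping from distance to similarity are assumptions, chosen only so that a larger distance yields a smaller similarity, consistent with the negative correlation stated above):

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    # Classic dynamic-programming DTW over two pitch period sequences
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(seq_a[i - 1] - seq_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def pitch_similarity(seq_a, seq_b):
    # Distance is negatively correlated with similarity; this particular
    # monotone mapping is an illustrative choice
    return 1.0 / (1.0 + dtw_distance(seq_a, seq_b))
```

  • DTW tolerates the small length difference between the two sequences (n drawn from [m-5, m+5]), which is why it is preferred here over a plain frame-by-frame distance.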
  • In this embodiment, the first pitch period sequence is obtained from the set number of frames of the voice signal input by the target user, and the second pitch period sequence corresponding to the currently input voice signal is then obtained every set duration. By comparing the two pitch period sequences, the continued recording of a non-target user's speech after the target user has stopped speaking can be detected quickly and effectively, and the recording device can be controlled to stop recording in time. This ensures a timely response to the target user's request, shortens the response time of the voice interaction, avoids errors in subsequent speech recognition and semantic understanding, improves the accuracy of the speech processing results, and enhances the user experience; in addition, it reduces the resource occupation of the recording device and avoids occupying excessive resources.
  • the method for detecting the change of the target user in the voice interaction of this embodiment is a further improvement of the second embodiment, specifically:
  • step S101 includes:
  • the target speech probability and target non-speech probability corresponding to the first input signal of each frame are obtained according to the energy corresponding to the first input signal of each frame and/or using a deep neural network algorithm.
  • the step of obtaining the average energy value corresponding to the first input signal of each frame within the set frequency range includes:
  • When the average energy value is less than or equal to the second set threshold and greater than the third set threshold, the second probability that the current frame is speech is determined according to the average energy value, the second set threshold, and the third set threshold;
  • the first probability, the second probability, and the third probability are sorted in descending order;
  • the target speech probability and target non-speech probability corresponding to the second input signal of each frame are determined according to the first probability, the second probability, or the third probability.
  • the first probability is 1, and the third probability is 0.
  • the calculation formula corresponding to the step of determining the second probability that the current frame is speech according to the average energy value, the second set threshold and the third set threshold is as follows:
  • Prob_energy = (energy - A)/(B - A)
  • Prob_energy represents the second probability
  • energy represents the average energy value
  • A represents the third set threshold
  • B represents the second set threshold
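  • The three-branch mapping from average energy to speech probability (first probability 1 above B, second probability (energy - A)/(B - A) between A and B, third probability 0 at or below A) can be written directly; the threshold values used in the test are illustrative:

```python
def speech_probability_from_energy(energy, second_thr, third_thr):
    # second_thr is B, third_thr is A, with B > A
    if energy > second_thr:
        return 1.0                                              # first probability
    if energy > third_thr:
        return (energy - third_thr) / (second_thr - third_thr)  # second probability
    return 0.0                                                  # third probability
```

  • The function is continuous at both thresholds, so small energy fluctuations near A or B do not cause probability jumps.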
  • The second set threshold and the third set threshold are set according to actual experience and can also be adjusted according to actual conditions. Or,
  • When the deep neural network algorithm is used to obtain the target speech probability and the target non-speech probability corresponding to the first input signal of each frame, the steps include:
  • the signal types include voice signals and non-voice signals
  • The steps of using the deep neural network algorithm to obtain the target speech probability and the target non-speech probability corresponding to the first input signal of each frame include:
  • the signal types include voice signals and non-voice signals
  • The weighted average method is used to process the first speech probability and the second speech probability of the first input signal of the same frame, and to process the first non-speech probability and the second non-speech probability of the first input signal of the same frame, to obtain the target speech probability and the target non-speech probability corresponding to the first input signal of each frame.
  • Prob = a*prob_energy1 + (1-a)*prob_dnnspeech
  • Prob represents the target speech probability
  • a represents the weighting coefficient (such as 0.7)
  • prob_energy1 represents the first speech probability of the first input signal of the frame
  • prob_dnnspeech represents the second speech probability of the first input signal of the frame.
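  • The weighted-average fusion can be sketched as follows (returning 1 - Prob as the target non-speech probability follows the document's statement that the non-speech probability is calculated from the speech probability; treating them as complementary is an assumption):

```python
def fuse_probabilities(prob_energy1, prob_dnnspeech, a=0.7):
    # Prob = a*prob_energy1 + (1-a)*prob_dnnspeech, with a = 0.7 as in the text
    prob = a * prob_energy1 + (1 - a) * prob_dnnspeech
    # (target speech probability, target non-speech probability)
    return prob, 1.0 - prob
```

  • With a = 0.7 the energy cue dominates, while the DNN cue still pulls the estimate when the two disagree.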
  • When the first set number of frames is N, the target input signal includes the current frame i and the first input signals corresponding to the N-1 frames before the current frame, where i ≥ N > 1 and i is an integer;
  • S1017 Determine whether the target user is detected to start inputting the voice signal according to the target voice probability and the target non-voice probability of the first input signal in each frame of the target input signal.
  • When the first total probability is greater than or equal to the second total probability, the first frame number is greater than or equal to the fourth set threshold, and the current frame is a speech frame, it is determined that the target user is detected to start inputting the voice signal; when the second total probability is greater than the first total probability, the second frame number is greater than or equal to the fifth set threshold, and the current frame is a non-speech frame, it is determined that the target user is detected to end the voice signal input, and the recording device is controlled to stop recording.
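  • The start/end decision over a target input signal can be sketched as follows (the window is a list of per-frame (target speech probability, target non-speech probability) pairs whose last entry is the current frame; the return labels are illustrative):

```python
def detect_boundary(window, fourth_thr, fifth_thr):
    total_speech = sum(p for p, _ in window)            # first total probability
    total_non = sum(q for _, q in window)               # second total probability
    speech_frames = sum(1 for p, q in window if p > q)  # first frame number
    non_frames = len(window) - speech_frames            # second frame number
    cur_is_speech = window[-1][0] > window[-1][1]       # current frame type
    if total_speech >= total_non and speech_frames >= fourth_thr and cur_is_speech:
        return "start"      # target user detected to start inputting speech
    if total_non > total_speech and non_frames >= fifth_thr and not cur_is_speech:
        return "end"        # end of speech input: stop recording
    return "undecided"      # keep recording and slide the window
```

  • Requiring both the total-probability comparison and a minimum frame count makes the decision robust to a few misclassified frames inside the window.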
  • the detection method of this embodiment further includes:
  • According to the target speech probability and the target non-speech probability of the first input signal in each frame of the new target input signal, it is determined whether the target user is detected to start inputting the voice signal.
  • obtaining the first voice signal of the first set number of frames input by the target user in step S101 specifically includes:
  • the current frame is i
  • the first set number of frames is N.
  • When i < N, continue to record the first input signal of each frame in sequence; when i ≥ N, obtain and extract the current frame i and the N-1 frames of the first input signal before the current frame.
  • If the target speech probability of a frame is greater than its target non-speech probability, the frame is determined to be a speech frame; otherwise, it is determined to be a non-speech frame;
  • If the target input signal formed by the first input signals from frames 11 to 40 indicates that the target user has not started to input a voice signal, the new target input signal formed by the first input signals from frames 12 to 41 continues to be extracted, and the above steps 4)-7) are re-executed until a target input signal is extracted from which it can be determined that the target user is detected to start inputting the voice signal.
  • the first input signal of 30 consecutive frames in the input of the target user is used as the first voice signal
  • When the target input signal formed by the first input signals from the 11th frame to the 40th frame indicates that the target user has ended the voice signal input, the recording is stopped, and the voice signal recording stops at the i-th frame.
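  • The frame-by-frame sliding of the target input signal in the example above (frames 11-40, then 12-41, and so on) can be sketched as follows (N and the fourth set threshold take illustrative values in the test):

```python
def find_speech_start(frame_probs, N, fourth_thr):
    # frame_probs: per-frame (target speech prob, target non-speech prob) pairs
    for i in range(N - 1, len(frame_probs)):
        window = frame_probs[i - N + 1:i + 1]  # current frame i and N-1 before it
        total_speech = sum(p for p, _ in window)
        total_non = sum(q for _, q in window)
        speech_frames = sum(1 for p, q in window if p > q)
        cur_speech = window[-1][0] > window[-1][1]
        if total_speech >= total_non and speech_frames >= fourth_thr and cur_speech:
            return i        # frames [i-N+1, i] become the first voice signal
    return None             # not detected yet; keep recording and sliding
```

  • On a stream this loop would consume frames as they arrive; the list form here is just for illustration.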
  • The recording device can be controlled to stop recording in time, thereby shortening the response time of voice interaction, ensuring a timely response to the target user's request, avoiding errors in subsequent speech recognition and semantic understanding, and improving the accuracy of the speech processing results, thereby improving the user experience; in addition, the resource occupation of the recording device is reduced, avoiding the occupation of excessive resources.
  • The target speech probability and the target non-speech probability corresponding to the first input signal of each frame are obtained through the energy corresponding to the first input signal of each frame and the deep neural network algorithm, which effectively improves the accuracy of VAD detection of the start point and end point of the input speech, ensures the integrity of the data entry, and further reduces the resource occupation of the recording device.
  • The detection system for target user change in voice interaction of this embodiment includes a wake-up module 1, a first judgment module 2, a first voice signal acquisition module 3, a first pitch period acquisition module 4, a first period sequence acquisition module 5, a second voice signal acquisition module 6, a second pitch period acquisition module 7, a second period sequence acquisition module 8, a similarity calculation module 9, and a second judgment module 10.
  • the wake-up module 1 is used to wake up the recording device using a wake-up word; wherein the recording device automatically enters the recording state after being awakened, and uses a microphone to pick up the sound.
  • the first judging module 2 is used to judge whether it is detected that the target user starts to input a voice signal, and if so, call the first voice signal acquisition module 3 to acquire the first voice signal of the first set number of frames input by the target user;
  • the first pitch period acquisition module 4 is configured to acquire the first pitch period corresponding to each frame of the first speech signal
  • the first period sequence obtaining module 5 is configured to obtain the first pitch period sequence corresponding to the first set number of frames according to the first pitch period;
  • the second voice signal acquisition module 6 is configured to acquire the second set number of frames of the second voice signal input by the current user into the recording device after the set duration;
  • the second pitch period acquisition module 7 is configured to acquire the second pitch period corresponding to each frame of the second speech signal
  • the second period sequence obtaining module 8 is configured to obtain a second pitch period sequence corresponding to the second set number of frames according to the second pitch period;
  • the similarity calculation module 9 is used to calculate the similarity between the first pitch period sequence and the second pitch period sequence
  • The second judging module 10 is used to judge whether the similarity is greater than the first set threshold; if so, it is determined that the current user inputting the second voice signal is the target user, the recording device is controlled to continue recording, and the second voice signal acquisition module 6 is called again; if not, the recording device is controlled to stop recording.
  • The first m frames of the first voice signal are automatically obtained, and the corresponding first pitch period sequence is obtained and stored in the first buffer storage area of the recording device, while it is continuously detected whether the target user has finished voice signal input. When the target user has not ended the voice signal input, the second voice signal of n frames input by the current user is obtained at regular intervals (such as every 100 ms), and the corresponding second pitch period sequence is obtained and saved to the second buffer storage area of the recording device; considering that the pitch period does not have strict periodicity, n can be randomly selected in the range [m-5, m+5]. The second pitch period sequence obtained each time is then compared with the first pitch period sequence to determine whether the speaker has changed. If there is a change, it means that the target user has stopped talking and the device is continuing to record a non-target user's speech content; at this time, the recording needs to be stopped in time.
  • In this embodiment, the first pitch period sequence is obtained from the set number of frames of the voice signal input by the target user, and the second pitch period sequence corresponding to the currently input voice signal is then obtained every set duration. By comparing the two pitch period sequences, the continued recording of a non-target user's speech after the target user has stopped speaking can be detected quickly and effectively, and the recording device can be controlled to stop recording in time. This ensures a timely response to the target user's request, shortens the response time of the voice interaction, avoids errors in subsequent speech recognition and semantic understanding, improves the accuracy of the speech processing results, and enhances the user experience; in addition, it reduces the resource occupation of the recording device and avoids occupying excessive resources.
  • the detection system for target user change in voice interaction of this embodiment is a further improvement of Embodiment 4. Specifically:
  • the first pitch period acquisition module 4 includes a first preprocessing unit 11, a first short-term energy processing unit 12, a first center clipping processing unit 13 and a first pitch period acquisition unit 14.
  • the first preprocessing unit 11 is configured to preprocess the first speech signal of each frame
  • the first short-term energy processing unit 12 is configured to use short-term energy to process the preprocessed first speech signal of each frame, and obtain the first voiced voice signal in the first speech signal of each frame;
  • the first center clipping processing unit 13 is configured to process the first voiced sound signal by using the center clipping method to obtain the first intermediate speech signal;
  • the first pitch period obtaining unit 14 is configured to process the first intermediate speech signal, and obtain the first pitch period corresponding to each frame of the first speech signal;
  • a method such as a waveform estimation method, an autocorrelation processing method, or a cepstrum method is used to process the first intermediate speech signal, and the first pitch period corresponding to the first speech signal of each frame is obtained.
  • the first period sequence acquisition module 5 is configured to form a first pitch period sequence according to the first pitch period corresponding to each frame of the first speech signal in the first set number of frames.
  • the second pitch period acquisition module 7 includes a second preprocessing unit 15, a second short-term energy processing unit 16, a second center clipping processing unit 17 and a second pitch period acquisition unit 18.
  • the second preprocessing unit 15 is configured to preprocess the second speech signal of each frame
  • the second short-term energy processing unit 16 is configured to use short-term energy to process the preprocessed second speech signal of each frame, and obtain the second voiced voice signal in each frame of the second speech signal;
  • the second center clipping processing unit 17 is configured to process the second voiced sound signal by using the center clipping method to obtain the second intermediate speech signal;
  • the second pitch period acquiring unit 18 is configured to process the second intermediate voice signal, and acquire the second pitch period corresponding to each frame of the second voice signal;
  • a waveform estimation method, an autocorrelation processing method, or a cepstrum method is used to process the second intermediate speech signal, and the second pitch period corresponding to each frame of the second speech signal is obtained.
  • the second period sequence acquisition module 8 is configured to form a second pitch period sequence according to the second pitch period corresponding to each frame of the second speech signal in the second set number of frames.
  • the similarity calculation module 9 includes a Euclidean distance calculation unit 19 and a similarity determination unit 20.
  • the Euclidean distance calculation unit 19 is used to calculate the Euclidean distance between the first pitch period sequence and the second pitch period sequence using a dynamic time warping algorithm
  • the similarity determination unit 20 is configured to determine the similarity between the first pitch period sequence and the second pitch period sequence according to the Euclidean distance;
  • Euclidean distance is negatively correlated with similarity.
  • In this embodiment, the first pitch period sequence is obtained from the set number of frames of the voice signal input by the target user, and the second pitch period sequence corresponding to the currently input voice signal is then obtained every set duration. By comparing the two pitch period sequences, the continued recording of a non-target user's speech after the target user has stopped speaking can be detected quickly and effectively, and the recording device can be controlled to stop recording in time. This ensures a timely response to the target user's request, shortens the response time of the voice interaction, avoids errors in subsequent speech recognition and semantic understanding, improves the accuracy of the speech processing results, and enhances the user experience; in addition, it reduces the resource occupation of the recording device and avoids occupying excessive resources.
  • The detection system for target user change in voice interaction of this embodiment is a further improvement of Embodiment 5. Specifically:
  • the first judgment module 2 includes a first input signal acquisition unit 21, a target probability acquisition unit 22, a speech frame determination unit 23, a frame number acquisition unit 24, a total probability calculation unit 25, and a target input signal acquisition unit 26 And the signal input determination unit 27.
  • the first input signal acquisition unit 21 is configured to sequentially acquire the first input signal of each frame after the recording device starts recording;
  • the target probability acquisition unit 22 is configured to acquire the target speech probability and the target non-speech probability corresponding to the first input signal of each frame;
  • the target probability obtaining unit 22 is configured to obtain the target speech probability and the target non-speech probability corresponding to the first input signal of each frame according to the energy corresponding to the first input signal of each frame and/or using a deep neural network algorithm.
  • The target probability acquisition unit 22 includes an energy value acquisition subunit and a target probability acquisition subunit;
  • the energy value obtaining subunit is used to obtain the average energy value corresponding to the first input signal of each frame within the set frequency range;
  • the target probability acquisition subunit is used to acquire the target speech probability and the target non-speech probability corresponding to the first input signal of each frame according to the average energy value.
  • the energy value obtaining subunit is used to convert the first input signal of each frame corresponding to the time domain into a second input signal corresponding to the frequency domain;
  • the energy value obtaining sub-unit is also used to calculate the sub-band energy value corresponding to each frequency band in the set frequency range of the second input signal of each frame;
  • the energy value obtaining subunit is further configured to obtain the average energy value corresponding to the second input signal of each frame according to the subband energy value.
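  • The three steps performed by the energy value acquisition subunit can be sketched as follows (the 16 kHz sample rate and 300-3400 Hz set frequency range are illustrative assumptions, not values from the patent):

```python
import numpy as np

def average_band_energy(frame, sample_rate=16000, f_lo=300.0, f_hi=3400.0):
    spectrum = np.fft.rfft(frame)                  # time domain -> frequency domain
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    band = (freqs >= f_lo) & (freqs <= f_hi)       # bins in the set frequency range
    sub_band_energy = np.abs(spectrum[band]) ** 2  # sub-band energy values
    return float(np.mean(sub_band_energy))         # average energy value
```

  • Restricting the average to a speech-dominated band makes the energy cue less sensitive to low- and high-frequency noise.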
  • the target probability acquisition subunit is used to determine the first probability that the current frame is speech when the average energy value is greater than the second set threshold;
  • The target probability acquisition subunit is used to determine, when the average energy value is less than or equal to the second set threshold and greater than the third set threshold, the second probability that the current frame is speech according to the average energy value, the second set threshold, and the third set threshold;
  • the first probability, the second probability, and the third probability are sorted in descending order;
  • the target probability acquisition subunit is also used to determine the target speech probability and the target non-speech probability corresponding to the second input signal of each frame according to the first probability, the second probability, or the third probability.
  • the first probability is 1, and the third probability is 0.
  • the calculation formula corresponding to the step of determining the second probability that the current frame is speech according to the average energy value, the second set threshold and the third set threshold is as follows:
  • Prob_energy = (energy - A)/(B - A)
  • Prob_energy represents the second probability
  • energy represents the average energy value
  • A represents the third set threshold
  • B represents the second set threshold
  • The second set threshold and the third set threshold are set according to actual experience and can also be adjusted according to actual conditions. Or,
  • the target probability acquisition unit 22 includes a historical signal acquisition subunit, a model establishment subunit, and a target probability acquisition subunit ;
  • the historical signal acquisition subunit is used to acquire each frame of historical input signal input into the recording device by the target user within the historical setting time and the signal type corresponding to each frame of historical input signal;
  • the signal types include voice signals and non-voice signals
  • the model building subunit is used to take historical input signals as input and signal types as output, and use a deep neural network to establish a probability model for predicting that each frame of input signal is a speech signal;
  • the target probability obtaining subunit is used to input the first input signal of each frame into the prediction model, and obtain the target speech probability corresponding to the first input signal of each frame;
  • the target probability acquisition subunit is also used to calculate the target non-speech probability corresponding to the first input signal of each frame according to the target speech probability.
  • The target probability acquisition unit 22 includes an energy value acquisition subunit, a target probability acquisition subunit, a historical signal acquisition subunit, a model establishment subunit, and a weighted calculation subunit.
  • the energy value obtaining subunit is used to obtain the average energy value corresponding to the first input signal of each frame within the set frequency range;
  • the target probability acquisition subunit is used to acquire the first speech probability and the first non-speech probability corresponding to the first input signal of each frame according to the average energy value;
  • the historical signal acquisition subunit is used to acquire each frame of historical input signal input into the recording device by the target user within the historical setting time and the signal type corresponding to each frame of historical input signal;
  • the signal types include voice signals and non-voice signals
  • the model building subunit is used to take historical input signals as input and signal types as output, and use a deep neural network to establish a probability model for predicting that each frame of input signal is a speech signal;
  • the target probability acquisition subunit is used to input the first input signal of each frame into the prediction model, and acquire the second speech probability corresponding to the first input signal of each frame;
  • the target probability acquisition subunit is further configured to calculate the second non-speech probability corresponding to the first input signal of each frame according to the second speech probability;
  • The weighted calculation subunit is used to process, by the weighted average method, the first speech probability and the second speech probability of the first input signal of the same frame, and the first non-speech probability and the second non-speech probability of the first input signal of the same frame, to obtain the target speech probability and the target non-speech probability corresponding to the first input signal of each frame.
  • Prob = a*prob_energy1 + (1-a)*prob_dnnspeech
  • Prob represents the target speech probability
  • a represents the weighting coefficient (such as 0.7)
  • prob_energy1 represents the first speech probability of the first input signal of the frame
  • prob_dnnspeech represents the second speech probability of the first input signal of the frame.
  • the target input signal acquiring unit 26 is configured to acquire the target input signal when the total number of frames of the acquired first input signal is greater than or equal to the first set number of frames;
  • When the first set number of frames is N, the target input signal includes the current frame i and the first input signals corresponding to the N-1 frames before the current frame, where i ≥ N > 1 and i is an integer;
  • the speech frame determining unit 23 is used to determine whether the target speech probability is greater than the target non-speech probability, if yes, determine the current frame as a speech frame; if not, determine the current frame as a non-speech frame;
  • the frame number acquiring unit 24 is configured to acquire the first frame number corresponding to the speech frame and the second frame number corresponding to the non-speech frame in the target input signal;
  • the total probability calculation unit 25 is used to calculate the sum of the target speech probabilities of the first input signal of each frame in the target input signal to obtain the first total probability, and calculate the sum of the target non-speech probability of the first input signal of each frame in the target input signal to obtain Second total probability;
  • the signal input determining unit 27 is configured to determine whether the target user is detected to start inputting a voice signal according to the target voice probability and the target non-voice probability of the first input signal in each frame of the target input signal.
  • the signal input determining unit 27 is configured to determine that when the first total probability is greater than or equal to the second total probability, the first frame number is greater than or equal to the fourth set threshold, and the current frame is a voice frame, it is determined that the target user is detected to start Input the voice signal.
  • the signal input determining unit 27 is further configured to determine that when the second total probability is greater than the first total probability, the second frame number is greater than or equal to the fifth set threshold, and the current frame is a non-speech frame, it is determined that the target user's end speech is detected Signal input, and control the recording equipment to stop recording.
  • the first input signal obtaining unit 21 is further configured to continue to obtain the first input signal of the next frame;
  • the target input signal obtaining unit 26 is further configured to obtain a new target input signal according to the first input signal of the next frame;
  • the signal input determining unit 27 is further configured to determine whether the target user is detected to start inputting a voice signal according to the target voice probability and the target non-voice probability of the first input signal in each frame of the new target input signal.
  • the first voice signal acquisition module 3 is configured to acquire the first input signal of the first set number of frames input by the target user from the i-N+1th frame.
  • the current frame is i
  • the first set number of frames is N.
  • When i < N, continue to record the first input signal of each frame in sequence; when i ≥ N, obtain and extract the current frame i and the N-1 frames of the first input signal before the current frame.
  • If the target speech probability of a frame is greater than its target non-speech probability, the frame is determined to be a speech frame; otherwise, it is determined to be a non-speech frame;
  • If the target input signal formed by the first input signals from frames 11 to 40 indicates that the target user has not started to input a voice signal, the new target input signal formed by the first input signals from frames 12 to 41 continues to be extracted, and the above steps 4)-7) are re-executed until a target input signal is extracted from which it can be determined that the target user is detected to start inputting the voice signal.
  • the first input signal of 30 consecutive frames in the input of the target user is used as the first voice signal
  • When the target input signal formed by the first input signals from the 11th frame to the 40th frame indicates that the target user has ended the voice signal input, the recording is stopped, and the voice signal recording stops at the i-th frame.
  • The recording device can be controlled to stop recording in time, thereby shortening the response time of voice interaction, ensuring a timely response to the target user's request, avoiding errors in subsequent speech recognition and semantic understanding, and improving the accuracy of the speech processing results, thereby improving the user experience; in addition, the resource occupation of the recording device is reduced, avoiding the occupation of excessive resources.
  • The target speech probability and the target non-speech probability corresponding to the first input signal of each frame are obtained through the energy corresponding to the first input signal of each frame and the deep neural network algorithm, which effectively improves the accuracy of VAD detection of the start point and end point of the input speech, ensures the integrity of the data entry, and further reduces the resource occupation of the recording device.
  • FIG. 8 is a schematic structural diagram of an electronic device according to Embodiment 7 of the present invention.
  • the electronic device includes a memory, a processor, and a computer program that is stored on the memory and can be run on the processor.
  • The processor implements the method for detecting the change of the target user in voice interaction in any one of Embodiments 1 to 3 when executing the program.
  • the electronic device 30 shown in FIG. 8 is only an example, and should not impose any limitation on the function or scope of application of the embodiments of the present invention.
  • the electronic device 30 may be in the form of a general-purpose computing device, for example, it may be a server device.
  • the components of the electronic device 30 may include, but are not limited to: the above-mentioned at least one processor 31, the above-mentioned at least one memory 32, and a bus 33 connecting different system components (including the memory 32 and the processor 31).
  • the bus 33 includes a data bus, an address bus, and a control bus.
  • the memory 32 may include a volatile memory, such as a random access memory (RAM) 321 and/or a cache memory 322, and may further include a read-only memory (ROM) 323.
  • the memory 32 may also include a program/utility 325 having a set of (at least one) program modules 324.
  • program modules 324 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
  • the processor 31 executes various functional applications and data processing by running a computer program stored in the memory 32, such as the method for detecting changes in the target user in voice interaction in any one of Embodiments 1 to 3 of the present invention.
  • the electronic device 30 may also communicate with one or more external devices 34 (such as keyboards, pointing devices, etc.). This communication can be performed through an input/output (I/O) interface 35.
  • the electronic device 30 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 36. As shown in FIG. 8, the network adapter 36 communicates with the other modules of the electronic device 30 through the bus 33.
  • This embodiment provides a computer-readable storage medium on which a computer program is stored.
  • when the program is executed by a processor, the steps of the method for detecting a change of the target user in voice interaction of any one of Embodiments 1 to 3 are implemented.
  • more specifically, the readable storage medium may include, but is not limited to: a portable disk, hard disk, random access memory, read-only memory, erasable programmable read-only memory, optical storage device, magnetic storage device, or any suitable combination of the above.
  • the present invention can also be implemented in the form of a program product, which includes program code; when the program product runs on a terminal device, the program code causes the terminal device to execute the steps of the method for detecting a change of the target user in voice interaction of any one of Embodiments 1 to 3.
  • the program code for executing the present invention can be written in any combination of one or more programming languages.
  • the program code can be executed entirely on the user's device, partly on the user's device, as an independent software package, partly on the user's device and partly on a remote device, or entirely on a remote device.


Abstract

A detection method, system, electronic device and storage medium for a change of the target user in voice interaction. The detection method comprises: after a recording device starts recording, acquiring a first speech signal of a first set number of frames when it is detected that the target user has started inputting a speech signal; acquiring the first pitch periods and a first pitch-period sequence; after a set duration, acquiring a second speech signal of a second set number of frames; acquiring the second pitch periods and a second pitch-period sequence; and, when it is determined from the first and second pitch-period sequences that the current user inputting the second speech signal is not the target user, controlling the recording device to stop recording. The method stops recording in time when the target user has stopped speaking and a non-target user's speech is being recorded, so as to avoid errors in subsequent speech recognition and semantic understanding; it effectively improves the accuracy with which VAD detects the start point and end point of recorded speech, improving the user experience.

Description

Detection method, system, electronic device and storage medium for a change of the target user
This application claims priority to Chinese patent application 2019111265954 filed on 2019/11/18, the full text of which is incorporated herein by reference.
Technical Field
The present invention relates to the technical field of artificial intelligence devices, and in particular to a detection method, system, electronic device and storage medium for a change of the target user in voice interaction.
Background
With the rapid development of artificial intelligence technology, smart speakers have emerged. Most smart speakers pick up sound through a microphone array and are activated when the user speaks a wake word (such as "你好小锐"); after waking up, they execute the corresponding control instructions.
In speech signal processing, a VAD (Voice Activity Detection) algorithm is mainly used to distinguish the regions of a speech signal that contain speech from those that do not, so that speech processing algorithms can concentrate on the useful part of the signal, which both reduces computation and avoids degrading the performance of some algorithms. In existing speech products, the role of VAD is mainly to determine the start point and end point of the speech recorded by the microphone after the voice interaction function of the device or product has been woken up; without a VAD algorithm, the recording can only be controlled by manual intervention. In practical scenarios, however, the following situation arises: after the device has been woken up and recording has started, the target user begins speaking; once the target user finishes, if another, non-target user nearby suddenly starts talking, the microphone keeps picking up sound. That is, the VAD algorithm cannot accurately detect the end point of the target user's recording, so unintended speech is recorded, causing errors in subsequent speech recognition and semantic understanding and degrading the user experience.
At present, this problem is addressed mainly in two ways: 1) abandoning the VAD algorithm and starting and stopping the recording manually, a recording method that greatly degrades the user experience; 2) using a highly complex VAD algorithm to obtain a better detection result; but since mobile terminals and embedded devices have limited computing resources and strict power constraints, this not only makes real-time operation hard to achieve but also drains power too quickly, so a complex VAD algorithm cannot solve the problem either.
Summary of the Invention
The technical problem to be solved by the present invention is to overcome the defect in the prior art that speech data of non-target users may be recorded during recording, causing errors in subsequent speech recognition and semantic understanding and degrading the user experience; to this end, a detection method, system, electronic device and storage medium for a change of the target user in voice interaction are provided.
The present invention solves the above technical problem through the following technical solutions:
The present invention provides a detection method for a change of the target user in voice interaction, the detection method comprising:
after a recording device starts recording, judging whether it is detected that the target user has started inputting a speech signal and, if so, acquiring a first speech signal of a first set number of frames input by the target user;
acquiring a first pitch period corresponding to each frame of the first speech signal;
acquiring, from the first pitch periods, a first pitch-period sequence corresponding to the first set number of frames;
after a set duration, acquiring a second speech signal of a second set number of frames input into the recording device by the current user;
acquiring a second pitch period corresponding to each frame of the second speech signal;
acquiring, from the second pitch periods, a second pitch-period sequence corresponding to the second set number of frames;
computing the similarity between the first pitch-period sequence and the second pitch-period sequence;
judging whether the similarity is greater than a first set threshold; if so, determining that the current user inputting the second speech signal is the target user, controlling the recording device to continue recording, and re-executing the step of acquiring a second speech signal of a second set number of frames after a set duration;
if not, determining that the current user inputting the second speech signal is not the target user, and controlling the recording device to stop recording.
Preferably, the step of acquiring the first pitch period corresponding to each frame of the first speech signal comprises:
preprocessing each frame of the first speech signal;
processing each preprocessed frame of the first speech signal with short-time energy to obtain the first voiced signal in each frame of the first speech signal;
processing the first voiced signal with the center-clipping method to obtain a first intermediate speech signal;
processing the first intermediate speech signal with waveform estimation, autocorrelation or the cepstrum method to obtain the first pitch period corresponding to each frame of the first speech signal; and/or,
the step of acquiring the second pitch period corresponding to each frame of the second speech signal comprises:
preprocessing each frame of the second speech signal;
processing each preprocessed frame of the second speech signal with short-time energy to obtain the second voiced signal in each frame of the second speech signal;
processing the second voiced signal with the center-clipping method to obtain a second intermediate speech signal;
processing the second intermediate speech signal with waveform estimation, autocorrelation or the cepstrum method to obtain the second pitch period corresponding to each frame of the second speech signal.
Preferably, the step of computing the similarity between the first pitch-period sequence and the second pitch-period sequence comprises:
computing the Euclidean distance between the first pitch-period sequence and the second pitch-period sequence using DTW (dynamic time warping);
determining the similarity between the first pitch-period sequence and the second pitch-period sequence from the Euclidean distance;
wherein the Euclidean distance is negatively correlated with the similarity.
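The DTW comparison of two pitch-period sequences described above can be sketched as follows. This is an illustrative sketch, not the claimed implementation: the absolute-difference local cost, the distance threshold of 10.0, and the example pitch values are all assumptions.

```python
def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two pitch-period sequences.

    Classic O(len_a * len_b) dynamic program; a smaller accumulated
    distance means the two sequences (and hence the two speakers) are
    more similar, matching the negative correlation stated in the text.
    """
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    acc = [[INF] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(seq_a[i - 1] - seq_b[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]

def same_speaker(seq_a, seq_b, dist_threshold=10.0):
    """A distance below the (assumed) threshold counts as 'same speaker'."""
    return dtw_distance(seq_a, seq_b) < dist_threshold

# A sequence stays close to a slightly time-shifted copy of itself,
# while a sequence from a different pitch range ends up far away.
base = [80, 82, 85, 84, 80, 78]
shifted = [80, 80, 82, 85, 84, 78]
other = [140, 150, 155, 150, 145, 140]
```

Because DTW warps the time axis before accumulating costs, small timing jitter between the two windows (the text notes the pitch period is not strictly periodic) does not inflate the distance.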
Preferably, the step of judging, after the recording device starts recording, whether it is detected that the target user has started inputting a speech signal comprises:
after the recording device starts recording, acquiring each frame of a first input signal in sequence;
acquiring a target speech probability and a target non-speech probability corresponding to each frame of the first input signal;
when the total number of acquired frames of the first input signal is greater than or equal to the first set number of frames, acquiring a target input signal;
wherein, when the first set number of frames is N, the target input signal comprises the first input signal corresponding to the current frame i and the N-1 frames preceding the current frame, with i ≥ N > 1, both being integers;
determining, from the target speech probability and the target non-speech probability of each frame of the first input signal in the target input signal, whether it is detected that the target user has started inputting a speech signal.
Preferably, after the step of acquiring the target speech probability and target non-speech probability corresponding to each frame of the first input signal, and before the step of determining from these probabilities whether the target user has started inputting a speech signal, the method further comprises:
for the same frame of the first input signal, judging whether the target speech probability is greater than the target non-speech probability; if so, determining the current frame to be a speech frame; if not, determining it to be a non-speech frame;
acquiring a first frame count corresponding to the speech frames and a second frame count corresponding to the non-speech frames in the target input signal;
summing the target speech probabilities of every frame of the first input signal in the target input signal to obtain a first total probability, and summing the target non-speech probabilities to obtain a second total probability;
the step of determining, from the target speech probability and target non-speech probability of each frame of the first input signal in the target input signal, whether the target user has started inputting a speech signal comprises:
when the first total probability is greater than or equal to the second total probability, the first frame count is greater than or equal to a fourth set threshold, and the current frame is a speech frame, determining that the target user is detected to have started inputting a speech signal.
Preferably, after the step of controlling the recording device to continue recording, and before the step of acquiring a second speech signal of a second set number of frames after a set duration, the method further comprises:
judging whether it is detected that the target user has ended speech signal input; if so, controlling the recording device to stop recording; if not, continuing with the step of acquiring a second speech signal of a second set number of frames after a set duration.
Preferably, the step of judging whether it is detected that the target user has ended speech signal input and, if so, controlling the recording device to stop recording comprises:
when the second total probability is greater than the first total probability, the second frame count is greater than or equal to a fifth set threshold, and the current frame is a non-speech frame, determining that the target user is detected to have ended speech signal input, and controlling the recording device to stop recording.
Preferably, when it is determined that the target user has not started inputting a speech signal, the detection method further comprises:
continuing to acquire the next frame of the first input signal;
acquiring a new target input signal from the next frame of the first input signal;
determining, from the target speech probability and target non-speech probability of each frame of the first input signal in the new target input signal, whether it is detected that the target user has started inputting a speech signal; and/or,
when it is detected that the target user has started inputting a speech signal, the step of acquiring the first speech signal of the first set number of frames input by the target user comprises:
acquiring the first input signal of the first set number of frames input by the target user, starting from frame i-N+1.
Preferably, the step of acquiring the target speech probability and target non-speech probability corresponding to each frame of the first input signal comprises:
acquiring the target speech probability and target non-speech probability corresponding to each frame of the first input signal from the energy corresponding to each frame of the first input signal and/or using a DNN (deep neural network) algorithm.
Preferably, when the target speech probability and target non-speech probability of each frame of the first input signal are obtained from the energy of each frame of the first input signal, this step comprises:
acquiring the average energy value of each frame of the first input signal within a set frequency range;
acquiring the target speech probability and target non-speech probability corresponding to each frame of the first input signal from the average energy value.
Preferably, when a deep neural network algorithm is used to obtain the target speech probability and target non-speech probability of each frame of the first input signal, this step comprises:
acquiring each frame of historical input signal input by the target user into the recording device within a historical set period, together with the signal type corresponding to each frame of the historical input signal;
wherein the signal types comprise speech signal and non-speech signal;
with the historical input signals as input and the signal types as output, building, with a deep neural network, a probability model for predicting the probability that each frame of input signal is a speech signal;
inputting each frame of the first input signal into the prediction model to obtain the target speech probability corresponding to each frame of the first input signal;
computing the target non-speech probability corresponding to each frame of the first input signal from the target speech probability.
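The train-on-history, predict-per-frame workflow described above can be sketched as follows. To stay self-contained, a tiny logistic model stands in for the deep neural network of the text (same interface: labeled historical frames in, a per-frame speech probability out); the single normalized-energy feature, the learning rate, and the toy data are all assumptions.

```python
import math

def train_speech_model(features, labels, lr=0.5, epochs=500):
    """Fit p(speech | frame features) on historical labeled frames.

    A one-layer logistic stand-in for the DNN probability model:
    plain gradient descent on the log-loss, no external libraries.
    """
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y                     # gradient of the log-loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def speech_probability(model, x):
    """Predicted probability that one frame is speech."""
    w, b = model
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy "historical" frames: one feature (normalized frame energy),
# label 1 = speech frame, label 0 = non-speech frame.
frames = [[0.9], [0.8], [0.85], [0.1], [0.05], [0.15]]
labels = [1, 1, 1, 0, 0, 0]
model = train_speech_model(frames, labels)
p_speech = speech_probability(model, [0.9])
p_nonspeech = 1.0 - p_speech        # non-speech probability, as in the text
```

A real deployment would replace the logistic stand-in with the deep network of the text, trained on spectral features rather than a single energy value; the surrounding workflow is unchanged.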
Preferably, when the target speech probability and target non-speech probability of each frame of the first input signal are obtained both from the energy of each frame of the first input signal and using a deep neural network algorithm, this step comprises:
acquiring the average energy value of each frame of the first input signal within the set frequency range;
acquiring a first speech probability and a first non-speech probability corresponding to each frame of the first input signal from the average energy value;
and, for the deep neural network part, the step comprises:
acquiring each frame of historical input signal input by the target user into the recording device within a historical set period, together with the signal type corresponding to each frame of the historical input signal;
wherein the signal types comprise speech signal and non-speech signal;
with the historical input signals as input and the signal types as output, building, with a deep neural network, a probability model for predicting the probability that each frame of input signal is a speech signal;
inputting each frame of the first input signal into the prediction model to obtain a second speech probability corresponding to each frame of the first input signal;
computing a second non-speech probability corresponding to each frame of the first input signal from the second speech probability;
processing, with the weighted-average method, the first speech probability and the second speech probability of the same frame of the first input signal, and likewise the first non-speech probability and the second non-speech probability of that frame, to obtain the target speech probability and target non-speech probability corresponding to each frame of the first input signal.
Preferably, the step of acquiring the average energy value of each frame of the first input signal within the set frequency range comprises:
converting each time-domain frame of the first input signal into a frequency-domain second input signal;
computing, for each frame of the second input signal, the sub-band energy value corresponding to each frequency band within the set frequency range;
acquiring the average energy value of each frame of the second input signal from the sub-band energy values; and/or,
the step of acquiring the target speech probability and target non-speech probability of each frame of the first input signal from the average energy value comprises:
when the average energy value is greater than a second set threshold, determining a first probability that the current frame is speech;
when the average energy value is less than or equal to the second set threshold and greater than a third set threshold, determining, from the average energy value, the second set threshold and the third set threshold, a second probability that the current frame is speech;
when the average energy value is less than or equal to the third set threshold, determining a third probability that the current frame is speech;
wherein the first probability, the second probability and the third probability are ordered from largest to smallest;
determining, from the first probability, the second probability or the third probability, the target speech probability and target non-speech probability corresponding to each frame of the second input signal.
Preferably, the calculation formula corresponding to the step of determining, from the average energy value, the second set threshold and the third set threshold, the second probability that the current frame is speech is as follows:
Prob_energy=(energy-A)/(B-A)
where Prob_energy denotes the second probability, energy denotes the average energy value, A denotes the third set threshold, and B denotes the second set threshold; and/or,
before the recording device starts recording, the detection method further comprises:
waking up the recording device with a wake word.
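The piecewise energy-to-probability mapping above, with linear interpolation Prob_energy = (energy - A)/(B - A) between the two thresholds, can be sketched as follows. The text's later embodiments take the first probability as 1 and the third as 0; the dB-style default thresholds here are illustrative assumptions, since the text only says they are set from practical experience.

```python
def speech_prob_from_energy(energy, B=-40.0, A=-60.0):
    """Map a frame's average (sub-band) energy to a speech probability.

    B is the upper ("surely speech") threshold and A the lower
    ("surely not speech") threshold; between them the probability is
    interpolated linearly as (energy - A) / (B - A).
    """
    if energy > B:
        return 1.0                     # first probability
    if energy <= A:
        return 0.0                     # third probability
    return (energy - A) / (B - A)      # second probability

def nonspeech_prob_from_energy(energy, B=-40.0, A=-60.0):
    """The complementary non-speech probability of the same frame."""
    return 1.0 - speech_prob_from_energy(energy, B, A)
```

An energy exactly midway between A and B thus maps to 0.5, and the mapping is continuous at both thresholds, so small energy fluctuations never cause probability jumps.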
The present invention also provides a detection system for a change of the target user in voice interaction, the detection system comprising a first judgment module, a first speech signal acquisition module, a first pitch-period acquisition module, a first period-sequence acquisition module, a second speech signal acquisition module, a second pitch-period acquisition module, a second period-sequence acquisition module, a similarity computation module and a second judgment module;
the first judgment module is configured to judge, after the recording device starts recording, whether it is detected that the target user has started inputting a speech signal and, if so, to invoke the first speech signal acquisition module to acquire the first speech signal of the first set number of frames input by the target user;
the first pitch-period acquisition module is configured to acquire the first pitch period corresponding to each frame of the first speech signal;
the first period-sequence acquisition module is configured to acquire, from the first pitch periods, the first pitch-period sequence corresponding to the first set number of frames;
the second speech signal acquisition module is configured to acquire, after a set duration, the second speech signal of the second set number of frames input into the recording device by the current user;
the second pitch-period acquisition module is configured to acquire the second pitch period corresponding to each frame of the second speech signal;
the second period-sequence acquisition module is configured to acquire, from the second pitch periods, the second pitch-period sequence corresponding to the second set number of frames;
the similarity computation module is configured to compute the similarity between the first pitch-period sequence and the second pitch-period sequence;
the second judgment module is configured to judge whether the similarity is greater than the first set threshold and, if so, to determine that the current user inputting the second speech signal is the target user, to control the recording device to continue recording, and to re-invoke the second speech signal acquisition module;
if not, to determine that the current user inputting the second speech signal is not the target user, and to control the recording device to stop recording.
Preferably, the first pitch-period acquisition module comprises a first preprocessing unit, a first short-time-energy processing unit, a first center-clipping processing unit and a first pitch-period acquisition unit;
the first preprocessing unit is configured to preprocess each frame of the first speech signal;
the first short-time-energy processing unit is configured to process each preprocessed frame of the first speech signal with short-time energy to obtain the first voiced signal in each frame of the first speech signal;
the first center-clipping processing unit is configured to process the first voiced signal with the center-clipping method to obtain the first intermediate speech signal;
the first pitch-period acquisition unit is configured to process the first intermediate speech signal with waveform estimation, autocorrelation or the cepstrum method to obtain the first pitch period corresponding to each frame of the first speech signal;
the first period-sequence acquisition module is configured to form the first pitch-period sequence from the first pitch period corresponding to each frame of the first speech signal within the first set number of frames; and/or,
the second pitch-period acquisition module comprises a second preprocessing unit, a second short-time-energy processing unit, a second center-clipping processing unit and a second pitch-period acquisition unit;
the second preprocessing unit is configured to preprocess each frame of the second speech signal;
the second short-time-energy processing unit is configured to process each preprocessed frame of the second speech signal with short-time energy to obtain the second voiced signal in each frame of the second speech signal;
the second center-clipping processing unit is configured to process the second voiced signal with the center-clipping method to obtain the second intermediate speech signal;
the second pitch-period acquisition unit is configured to process the second intermediate speech signal with waveform estimation, autocorrelation or the cepstrum method to obtain the second pitch period corresponding to each frame of the second speech signal;
the second period-sequence acquisition module is configured to form the second pitch-period sequence from the second pitch period corresponding to each frame of the second speech signal within the second set number of frames.
Preferably, the similarity computation module comprises a Euclidean-distance computation unit and a similarity determination unit;
the Euclidean-distance computation unit is configured to compute, with dynamic time warping, the Euclidean distance between the first pitch-period sequence and the second pitch-period sequence;
the similarity determination unit is configured to determine the similarity between the two sequences from the Euclidean distance;
wherein the Euclidean distance is negatively correlated with the similarity.
Preferably, the first judgment module comprises a first input signal acquisition unit, a target probability acquisition unit, a target input signal acquisition unit and a signal-input determination unit;
the first input signal acquisition unit is configured to acquire each frame of the first input signal in sequence after the recording device starts recording;
the target probability acquisition unit is configured to acquire the target speech probability and target non-speech probability corresponding to each frame of the first input signal;
the target input signal acquisition unit is configured to acquire the target input signal when the total number of acquired frames of the first input signal is greater than or equal to the first set number of frames;
wherein, when the first set number of frames is N, the target input signal comprises the first input signal corresponding to the current frame i and the N-1 frames preceding the current frame, with i ≥ N > 1, both being integers;
the signal-input determination unit is configured to determine, from the target speech probability and target non-speech probability of each frame of the first input signal in the target input signal, whether it is detected that the target user has started inputting a speech signal.
Preferably, the first judgment module further comprises a speech-frame determination unit, a frame-count acquisition unit and a total-probability computation unit;
for the same frame of the first input signal, the speech-frame determination unit is configured to judge whether the target speech probability is greater than the target non-speech probability; if so, to determine the current frame to be a speech frame; if not, to determine it to be a non-speech frame;
the frame-count acquisition unit is configured to acquire the first frame count corresponding to the speech frames and the second frame count corresponding to the non-speech frames in the target input signal;
the total-probability computation unit is configured to sum the target speech probabilities of every frame of the first input signal in the target input signal to obtain the first total probability, and to sum the target non-speech probabilities to obtain the second total probability;
the signal-input determination unit is configured to determine that the target user is detected to have started inputting a speech signal when the first total probability is greater than or equal to the second total probability, the first frame count is greater than or equal to the fourth set threshold, and the current frame is a speech frame.
Preferably, when the second judgment module controls the recording device to continue recording, the first judgment module is further configured to judge whether it is detected that the target user has ended speech signal input; if so, to control the recording device to stop recording; if not, to continue invoking the second speech signal acquisition module.
Preferably, the signal-input determination unit is further configured to determine that the target user is detected to have ended speech signal input, and to control the recording device to stop recording, when the second total probability is greater than the first total probability, the second frame count is greater than or equal to the fifth set threshold, and the current frame is a non-speech frame.
Preferably, when it is determined that the target user has not started inputting a speech signal, the first input signal acquisition unit is further configured to continue acquiring the next frame of the first input signal;
the target input signal acquisition unit is further configured to acquire a new target input signal from the next frame of the first input signal;
the signal-input determination unit is further configured to determine, from the target speech probability and target non-speech probability of each frame of the first input signal in the new target input signal, whether it is detected that the target user has started inputting a speech signal; and/or,
when it is detected that the target user has started inputting a speech signal, the first speech signal acquisition module is configured to acquire the first input signal of the first set number of frames input by the target user, starting from frame i-N+1.
Preferably, the target probability acquisition unit is configured to acquire the target speech probability and target non-speech probability corresponding to each frame of the first input signal from the energy of each frame of the first input signal and/or using a deep neural network algorithm.
Preferably, when the target speech probability and target non-speech probability of each frame of the first input signal are obtained from the frame energy, the target probability acquisition unit comprises an energy-value acquisition sub-unit and a target-probability acquisition sub-unit;
the energy-value acquisition sub-unit is configured to acquire the average energy value of each frame of the first input signal within the set frequency range;
the target-probability acquisition sub-unit is configured to acquire the target speech probability and target non-speech probability of each frame of the first input signal from the average energy value.
Preferably, when a deep neural network algorithm is used to obtain the target speech probability and target non-speech probability of each frame of the first input signal, the target probability acquisition unit comprises a historical-signal acquisition sub-unit, a model-building sub-unit and a target-probability acquisition sub-unit;
the historical-signal acquisition sub-unit is configured to acquire each frame of historical input signal input by the target user into the recording device within the historical set period, together with the signal type corresponding to each frame of the historical input signal;
wherein the signal types comprise speech signal and non-speech signal;
the model-building sub-unit is configured to build, with the historical input signals as input and the signal types as output, a deep neural network probability model for predicting the probability that each frame of input signal is a speech signal;
the target-probability acquisition sub-unit is configured to input each frame of the first input signal into the prediction model to obtain the target speech probability corresponding to each frame of the first input signal;
the target-probability acquisition sub-unit is further configured to compute the target non-speech probability of each frame of the first input signal from the target speech probability.
Preferably, when the target speech probability and target non-speech probability of each frame of the first input signal are obtained both from the frame energy and using a deep neural network algorithm, the target probability acquisition unit comprises an energy-value acquisition sub-unit, a target-probability acquisition sub-unit, a historical-signal acquisition sub-unit, a model-building sub-unit and a weighting computation sub-unit;
the energy-value acquisition sub-unit is configured to acquire the average energy value of each frame of the first input signal within the set frequency range;
the target-probability acquisition sub-unit is configured to acquire the first speech probability and first non-speech probability of each frame of the first input signal from the average energy value;
the historical-signal acquisition sub-unit is configured to acquire each frame of historical input signal input by the target user into the recording device within the historical set period, together with the signal type corresponding to each frame of the historical input signal;
wherein the signal types comprise speech signal and non-speech signal;
the model-building sub-unit is configured to build, with the historical input signals as input and the signal types as output, a deep neural network probability model for predicting the probability that each frame of input signal is a speech signal;
the target-probability acquisition sub-unit is configured to input each frame of the first input signal into the prediction model to obtain the second speech probability corresponding to each frame of the first input signal;
the target-probability acquisition sub-unit is further configured to compute the second non-speech probability of each frame of the first input signal from the second speech probability;
the weighting computation sub-unit is configured to process, with the weighted-average method, the first speech probability and the second speech probability of the same frame of the first input signal, and likewise the first non-speech probability and the second non-speech probability of that frame, to obtain the target speech probability and target non-speech probability corresponding to each frame of the first input signal.
Preferably, the energy-value acquisition sub-unit is configured to convert each time-domain frame of the first input signal into a frequency-domain second input signal;
the energy-value acquisition sub-unit is further configured to compute, for each frame of the second input signal, the sub-band energy value corresponding to each frequency band within the set frequency range;
the energy-value acquisition sub-unit is further configured to acquire the average energy value of each frame of the second input signal from the sub-band energy values; and/or,
the target-probability acquisition sub-unit is configured to determine a first probability that the current frame is speech when the average energy value is greater than the second set threshold;
when the average energy value is less than or equal to the second set threshold and greater than the third set threshold, to determine, from the average energy value, the second set threshold and the third set threshold, a second probability that the current frame is speech;
when the average energy value is less than or equal to the third set threshold, to determine a third probability that the current frame is speech;
wherein the first probability, the second probability and the third probability are ordered from largest to smallest;
the target-probability acquisition sub-unit is further configured to determine, from the first probability, the second probability or the third probability, the target speech probability and target non-speech probability corresponding to each frame of the second input signal.
Preferably, the calculation formula by which the target-probability acquisition sub-unit determines, from the average energy value, the second set threshold and the third set threshold, the second probability that the current frame is speech is as follows:
Prob_energy=(energy-A)/(B-A)
where Prob_energy denotes the second probability, energy denotes the average energy value, A denotes the third set threshold, and B denotes the second set threshold; and/or,
the detection system further comprises a wake-up module;
the wake-up module is configured to wake up the recording device with a wake word before the recording device starts recording.
The present invention also provides an electronic device comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor; when executing the computer program, the processor implements the above detection method for a change of the target user in voice interaction.
The present invention also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the above detection method for a change of the target user in voice interaction.
The positive and progressive effects of the present invention are:
In the present invention, the situation in which recording continues on a non-target user's speech after the target user has stopped speaking can be detected quickly and effectively, and the recording device is controlled to stop recording in time, which shortens the response time of voice interaction, ensures that the target user's request is answered promptly, avoids errors in subsequent speech recognition and semantic understanding, and improves the accuracy of the speech processing results and thus the user experience; it also reduces the resource occupation of the recording device and avoids excessive resource usage. In addition, obtaining the target speech probability and target non-speech probability of each frame of the first input signal from the frame energy and a deep neural network algorithm effectively improves the accuracy with which VAD detects the start point and end point of the recorded speech, guaranteeing the integrity of the recorded data while further reducing the resource occupation of the recording device.
Brief Description of the Drawings
Fig. 1 is a flowchart of the detection method for a change of the target user in voice interaction of Embodiment 1 of the present invention.
Fig. 2 is a first flowchart of the detection method for a change of the target user in voice interaction of Embodiment 2 of the present invention.
Fig. 3 is a second flowchart of the detection method for a change of the target user in voice interaction of Embodiment 2 of the present invention.
Fig. 4 is a flowchart of the detection method for a change of the target user in voice interaction of Embodiment 3 of the present invention.
Fig. 5 is a block diagram of the detection system for a change of the target user in voice interaction of Embodiment 4 of the present invention.
Fig. 6 is a block diagram of the detection system for a change of the target user in voice interaction of Embodiment 5 of the present invention.
Fig. 7 is a block diagram of the first judgment module in the detection system for a change of the target user in voice interaction of Embodiment 6 of the present invention.
Fig. 8 is a schematic structural diagram of the electronic device implementing the detection method for a change of the target user in voice interaction in Embodiment 7 of the present invention.
Detailed Description
The present invention is further illustrated below by way of embodiments, but is not thereby limited to the scope of the described embodiments.
Embodiment 1
As shown in Fig. 1, the detection method for a change of the target user in voice interaction of this embodiment comprises:
S100: wake up the recording device with a wake word; after being woken up, the recording device automatically enters the recording state and picks up sound with a microphone.
S101: judge whether it is detected that the target user has started inputting a speech signal; if so, acquire the first speech signal of the first set number of frames input by the target user;
S102: acquire the first pitch period corresponding to each frame of the first speech signal;
S103: acquire, from the first pitch periods, the first pitch-period sequence corresponding to the first set number of frames;
S104: after a set duration, acquire the second speech signal of the second set number of frames input into the recording device by the current user;
S105: acquire the second pitch period corresponding to each frame of the second speech signal;
S106: acquire, from the second pitch periods, the second pitch-period sequence corresponding to the second set number of frames;
S107: compute the similarity between the first pitch-period sequence and the second pitch-period sequence;
S108: judge whether the similarity is greater than the first set threshold; if so, execute step S109; if not, execute step S1010;
S109: determine that the current user inputting the second speech signal is the target user, control the recording device to continue recording, and re-execute step S104;
S1010: determine that the current user inputting the second speech signal is not the target user, and control the recording device to stop recording.
In addition, after determining that the current user inputting the second speech signal is the target user, continuously monitor whether the target user has ended speech signal input; if so, control the recording device to stop recording; if not, continue with step S104.
For example, once speech input is detected, the first speech signal of the initial m frames is acquired automatically, its first pitch-period sequence is obtained and saved to a first buffer of the recording device, while it is continuously checked whether the target user has ended speech signal input. While the target user has not ended speech input, at every interval (e.g. 100 ms) the second speech signal of n frames input by the current user is acquired, and its second pitch-period sequence is obtained and saved to a second buffer of the recording device; since the pitch period is not strictly periodic, n may be chosen at random within [m-5, m+5]. Each newly acquired second pitch-period sequence is then compared with the first pitch-period sequence to determine whether the speaker has changed; a change means that the target user has stopped speaking and the speech of a non-target user is being recorded, in which case recording must be stopped promptly.
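The monitoring loop just described (reference sequence from the first m frames, a fresh n-frame window every interval with n drawn from [m-5, m+5]) can be sketched as follows. This is an offline sketch over a pre-extracted stream of per-frame pitch values, not the real-time implementation; the inverse-distance similarity placeholder and the 0.5 threshold are assumptions standing in for the DTW-based measure described elsewhere in the text.

```python
import random

def monitor_speaker(pitch_stream, m=30, check_every=10, sim_threshold=0.5):
    """Walk a stream of per-frame pitch periods and report where the
    speaker appears to change.

    - the first m frames form the reference pitch sequence;
    - every `check_every` frames a window of n frames (n random in
      [m-5, m+5], since pitch is not strictly periodic) is compared
      against the reference;
    - returns the frame index at which recording should stop, or None
      if the speaker never changes.
    """
    def similarity(a, b):
        # Placeholder: inverse of the mean absolute pitch difference.
        k = min(len(a), len(b))
        d = sum(abs(x - y) for x, y in zip(a[:k], b[:k])) / k
        return 1.0 / (1.0 + d)

    reference = pitch_stream[:m]
    i = m
    while i < len(pitch_stream):
        n = random.randint(m - 5, m + 5)
        window = pitch_stream[i:i + n]
        if len(window) < m - 5:
            break                    # stream exhausted
        if similarity(reference, window) <= sim_threshold:
            return i                 # speaker changed: stop recording here
        i += check_every
    return None

# Target user's pitch hovers near 80 samples; a second speaker near 150
# takes over at frame 60 of the second stream below.
steady = [80] * 120
changed = [80] * 60 + [150] * 120
```

With `changed`, the first window whose leading frames include the new speaker's pitch fails the similarity test, so the loop reports a stop index shortly before the hand-off frame is fully inside the window.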
In this embodiment, after the target user starts inputting a speech signal, the first pitch-period sequence is obtained from the speech signal of the set number of frames input by the target user, and the second pitch-period sequence of the currently input speech signal is then checked at every set duration. Comparing the two pitch-period sequences makes it possible to quickly and effectively detect that recording has continued on a non-target user's speech after the target user has stopped speaking, and to control the recording device to stop recording in time. This ensures that the target user's request is answered promptly, shortens the response time of voice interaction, avoids errors in subsequent speech recognition and semantic understanding, and improves the accuracy of the speech processing results and the user experience; it also reduces the resource occupation of the recording device and avoids excessive resource usage.
Embodiment 2
The detection method for a change of the target user in voice interaction of this embodiment further improves on Embodiment 1; specifically:
As shown in Fig. 2, step S102 comprises:
S1021: preprocess each frame of the first speech signal;
S1022: process each preprocessed frame of the first speech signal with short-time energy to obtain the first voiced signal in each frame of the first speech signal;
S1023: process the first voiced signal with the center-clipping method to obtain the first intermediate speech signal;
S1024: process the first intermediate speech signal to obtain the first pitch period corresponding to each frame of the first speech signal;
wherein the first intermediate speech signal is processed by waveform estimation, autocorrelation, the cepstrum method or a similar method to obtain the first pitch period corresponding to each frame of the first speech signal.
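The center-clipping plus autocorrelation route of steps S1023-S1024 can be sketched as follows for a single voiced frame. The clipping ratio, lag range, and synthetic 100 Hz test tone are illustrative assumptions; a real system would first isolate voiced frames with the short-time energy step of S1022.

```python
import math

def estimate_pitch_period(frame, clip_ratio=0.3, min_lag=20, max_lag=200):
    """Estimate the pitch period (in samples) of one voiced frame.

    Mirrors the text: center-clip the frame to suppress formant detail,
    then pick the lag in [min_lag, max_lag] that maximizes the
    autocorrelation of the clipped signal.
    """
    peak = max(abs(s) for s in frame)
    cl = clip_ratio * peak
    # Center clipping: zero out small samples, shift the rest toward zero.
    clipped = [0.0 if abs(s) <= cl else (s - cl if s > 0 else s + cl)
               for s in frame]
    best_lag, best_corr = min_lag, float("-inf")
    for lag in range(min_lag, min(max_lag, len(frame) - 1) + 1):
        corr = sum(clipped[i] * clipped[i + lag]
                   for i in range(len(frame) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag

# A synthetic 100 Hz voiced frame sampled at 8 kHz: true period = 80 samples.
fs, f0 = 8000, 100
frame = [math.sin(2 * math.pi * f0 * n / fs) for n in range(400)]
period = estimate_pitch_period(frame)
```

Because the clipped signal keeps the periodicity of the original while flattening its fine structure, the autocorrelation peak at the true period dominates the peaks at its multiples (which have fewer overlapping terms).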
Step S103 comprises:
S1031: form the first pitch-period sequence from the first pitch period corresponding to each frame of the first speech signal within the first set number of frames.
As shown in Fig. 3, step S105 comprises:
S1051: preprocess each frame of the second speech signal;
S1052: process each preprocessed frame of the second speech signal with short-time energy to obtain the second voiced signal in each frame of the second speech signal;
S1053: process the second voiced signal with the center-clipping method to obtain the second intermediate speech signal;
S1054: process the second intermediate speech signal to obtain the second pitch period corresponding to each frame of the second speech signal;
wherein the second intermediate speech signal is processed by waveform estimation, autocorrelation, the cepstrum method or a similar method to obtain the second pitch period corresponding to each frame of the second speech signal.
Step S106 comprises:
S1061: form the second pitch-period sequence from the second pitch period corresponding to each frame of the second speech signal within the second set number of frames.
Step S107 comprises:
S1071: compute, with dynamic time warping, the Euclidean distance between the first pitch-period sequence and the second pitch-period sequence;
S1072: determine the similarity between the two sequences from the Euclidean distance;
wherein the Euclidean distance is negatively correlated with the similarity.
In this embodiment, after the target user starts inputting a speech signal, the first pitch-period sequence is obtained from the speech signal of the set number of frames input by the target user, and the second pitch-period sequence of the currently input speech signal is then checked at every set duration. Comparing the two pitch-period sequences makes it possible to quickly and effectively detect that recording has continued on a non-target user's speech after the target user has stopped speaking, and to control the recording device to stop recording in time. This ensures that the target user's request is answered promptly, shortens the response time of voice interaction, avoids errors in subsequent speech recognition and semantic understanding, and improves the accuracy of the speech processing results and the user experience; it also reduces the resource occupation of the recording device and avoids excessive resource usage.
Embodiment 3
The detection method for a change of the target user in voice interaction of this embodiment further improves on Embodiment 2; specifically:
As shown in Fig. 4, step S101 comprises:
S1011: after the recording device starts recording, acquire each frame of the first input signal in sequence;
S1012: acquire the target speech probability and target non-speech probability corresponding to each frame of the first input signal;
wherein the target speech probability and target non-speech probability of each frame of the first input signal are obtained from the energy of each frame of the first input signal and/or with a deep neural network algorithm. Specifically:
(1) when the target speech probability and target non-speech probability of each frame of the first input signal are obtained from the frame energy, this step comprises:
acquiring the average energy value of each frame of the first input signal within the set frequency range;
acquiring the target speech probability and target non-speech probability of each frame of the first input signal from the average energy value.
The step of acquiring the average energy value of each frame of the first input signal within the set frequency range comprises:
converting each time-domain frame of the first input signal into a frequency-domain second input signal;
computing, for each frame of the second input signal, the sub-band energy value corresponding to each frequency band within the set frequency range;
acquiring the average energy value of each frame of the second input signal from the sub-band energy values.
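The time-domain-to-frequency-domain conversion and sub-band averaging just described can be sketched as follows. The 300-3400 Hz "speech band", the eight equal sub-bands, and the plain DFT are all assumptions; the text does not fix the frequency range, the band split, or the transform.

```python
import cmath, math

def average_band_energy(frame, fs=16000, band_hz=(300.0, 3400.0), n_bands=8):
    """Average sub-band energy of one frame inside a set frequency range.

    The frame is moved to the frequency domain with a plain DFT, the
    set range is split into n_bands equal sub-bands, and the per-band
    energies are averaged.
    """
    n = len(frame)
    # DFT magnitudes for the positive-frequency bins.
    spectrum = [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                        for t in range(n)))
                for k in range(n // 2)]
    bin_hz = fs / n
    lo, hi = band_hz
    edges = [lo + (hi - lo) * b / n_bands for b in range(n_bands + 1)]
    band_energies = []
    for b in range(n_bands):
        e = sum(mag * mag for k, mag in enumerate(spectrum)
                if edges[b] <= k * bin_hz < edges[b + 1])
        band_energies.append(e)
    return sum(band_energies) / n_bands

# A 1 kHz tone carries far more in-band energy than near-silence.
fs = 16000
tone = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(256)]
quiet = [1e-4] * 256
```

Restricting the average to the set frequency range discards DC offset and out-of-band noise, so a loud hum outside the band does not masquerade as speech energy. A production system would use an FFT instead of the quadratic DFT above.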
When the average energy value is greater than the second set threshold, a first probability that the current frame is speech is determined;
when the average energy value is less than or equal to the second set threshold and greater than the third set threshold, a second probability that the current frame is speech is determined from the average energy value, the second set threshold and the third set threshold;
when the average energy value is less than or equal to the third set threshold, a third probability that the current frame is speech is determined;
wherein the first probability, the second probability and the third probability are ordered from largest to smallest;
the target speech probability and target non-speech probability of each frame of the second input signal are then determined from the first probability, the second probability or the third probability.
Specifically, the first probability is 1 and the third probability is 0.
The calculation formula corresponding to the step of determining the second probability from the average energy value, the second set threshold and the third set threshold is as follows:
Prob_energy=(energy-A)/(B-A)
where Prob_energy denotes the second probability, energy denotes the average energy value, A denotes the third set threshold, and B denotes the second set threshold.
The second and third set thresholds are chosen from practical experience and may be adjusted to the actual situation. Or,
(2) when a deep neural network algorithm is used to obtain the target speech probability and target non-speech probability of each frame of the first input signal, this step comprises:
acquiring each frame of historical input signal input by the target user into the recording device within the historical set period, together with the signal type corresponding to each frame of the historical input signal;
wherein the signal types comprise speech signal and non-speech signal;
with the historical input signals as input and the signal types as output, building, with a deep neural network, a probability model for predicting the probability that each frame of input signal is a speech signal;
inputting each frame of the first input signal into the prediction model to obtain the target speech probability corresponding to each frame of the first input signal;
computing the target non-speech probability of each frame of the first input signal from the target speech probability. Or,
(3) when the target speech probability and target non-speech probability of each frame of the first input signal are obtained both from the frame energy and using a deep neural network algorithm, this step comprises:
acquiring the average energy value of each frame of the first input signal within the set frequency range;
acquiring the first speech probability and first non-speech probability of each frame of the first input signal from the average energy value;
and, for the deep neural network part:
acquiring each frame of historical input signal input by the target user into the recording device within the historical set period, together with the signal type corresponding to each frame of the historical input signal;
wherein the signal types comprise speech signal and non-speech signal;
with the historical input signals as input and the signal types as output, building, with a deep neural network, a probability model for predicting the probability that each frame of input signal is a speech signal;
inputting each frame of the first input signal into the prediction model to obtain the second speech probability corresponding to each frame of the first input signal;
computing the second non-speech probability of each frame of the first input signal from the second speech probability;
processing, with the weighted-average method, the first speech probability and second speech probability of the same frame of the first input signal, and likewise the first non-speech probability and second non-speech probability of that frame, to obtain the target speech probability and target non-speech probability corresponding to each frame of the first input signal.
That is, the target speech probability of a given frame of the first input signal is computed as follows:
Prob=a*prob_energy1+(1-a)*prob_dnnspeech
where Prob denotes the target speech probability, a denotes a weighting coefficient (e.g. 0.7), prob_energy1 denotes the first speech probability of that frame of the first input signal, and prob_dnnspeech denotes the second speech probability of that frame.
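The weighted fusion Prob = a*prob_energy1 + (1-a)*prob_dnnspeech can be sketched as follows; the default weight a = 0.7 follows the example in the text, and the helper name is an illustrative choice.

```python
def fuse_speech_prob(p_energy, p_dnn, a=0.7):
    """Weighted average of the energy-based and DNN-based speech
    probabilities for one frame.

    Returns (target speech probability, target non-speech probability);
    the two always sum to 1.
    """
    p = a * p_energy + (1 - a) * p_dnn
    return p, 1.0 - p
```

Because both inputs lie in [0, 1] and the weights sum to 1, the fused value is always a valid probability; raising `a` trusts the cheap energy cue more, lowering it trusts the learned model more.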
S1013: when the total number of acquired frames of the first input signal is greater than or equal to the first set number of frames, acquire the target input signal;
wherein, when the first set number of frames is N, the target input signal comprises the first input signal corresponding to the current frame i and the N-1 frames preceding the current frame, with i ≥ N > 1, both being integers;
S1014: for the same frame of the first input signal, judge whether the target speech probability is greater than the target non-speech probability; if so, determine the current frame to be a speech frame; if not, determine it to be a non-speech frame;
S1015: acquire the first frame count corresponding to the speech frames and the second frame count corresponding to the non-speech frames in the target input signal;
S1016: sum the target speech probabilities of every frame of the first input signal in the target input signal to obtain the first total probability, and sum the target non-speech probabilities to obtain the second total probability;
S1017: determine, from the target speech probability and target non-speech probability of each frame of the first input signal in the target input signal, whether it is detected that the target user has started inputting a speech signal.
Specifically, when the first total probability is greater than or equal to the second total probability, the first frame count is greater than or equal to the fourth set threshold, and the current frame is a speech frame, it is determined that the target user is detected to have started inputting a speech signal.
In addition, when the second total probability is greater than the first total probability, the second frame count is greater than or equal to the fifth set threshold, and the current frame is a non-speech frame, it is determined that the target user is detected to have ended speech signal input, and the recording device is controlled to stop recording.
When it is determined that the target user has not started inputting a speech signal, the detection method of this embodiment further comprises:
continuing to acquire the next frame of the first input signal;
acquiring a new target input signal from the next frame of the first input signal;
determining, from the target speech probability and target non-speech probability of each frame of the first input signal in the new target input signal, whether it is detected that the target user has started inputting a speech signal.
When it is detected that the target user has started inputting a speech signal, acquiring the first speech signal of the first set number of frames input by the target user in step S101 specifically comprises:
acquiring the first input signal of the first set number of frames input by the target user, starting from frame i-N+1.
A concrete example:
1) after the recording device is woken up and starts recording, record each frame of the first input signal in sequence;
2) obtain the target speech probability p1 and target non-speech probability (1-p1) of each frame of the first input signal from the frame energy and the deep neural network algorithm;
3) let the current frame be i and the first set number of frames be N; while i < N, continue recording each frame of the first input signal in sequence; once i ≥ N, extract the first input signal corresponding to the current frame i and the preceding N-1 frames to form the target input signal. For example, with i = 40 and N = 30, the first input signal of frames 11 to 40 is extracted to form the target input signal (corresponding to 30 frames).
4) for a given frame of the first input signal, when the target speech probability p1 is greater than the target non-speech probability (1-p1), the frame is determined to be a speech frame; otherwise, a non-speech frame;
5) compute the first frame count N1 of speech frames and the second frame count N2 of non-speech frames in the target input signal;
6) compute P1, the sum of the target speech probabilities p1 of the 30 frames of the first input signal in the target input signal, and P2, the sum of their target non-speech probabilities (1-p1);
7) when P1 ≥ P2, N1 > C, and frame i is a speech frame, it is determined that the target user is detected to have started inputting a speech signal, C denoting the fourth set threshold;
when P2 > P1, N2 > D, and frame i is a non-speech frame, it is determined that the target user is detected to have ended speech signal input, D denoting the fifth set threshold.
In addition, when the target input signal formed by the first input signal of frames 11 to 40 determines that the target user has not started inputting a speech signal, continue by extracting the new target input signal formed by the first input signal of frames 12 to 41 and re-execute steps 4)-7) above, until a target input signal is extracted from which it can be determined that the target user has started inputting a speech signal.
8) if the target input signal formed by the first input signal of frames 11 to 40 determines that the target user is detected to have started inputting a speech signal, the first input signal of 30 consecutive frames of the target user's input, starting from frame i-N+1 = 40-30+1 = 11, is acquired as the first speech signal;
here, acquisition may also start a few frames before frame 11 (e.g. at frame 8) to guarantee the integrity of the speech signal input by the target user and hence the accuracy of subsequent speech processing results.
If the target input signal formed by the first input signal of frames 11 to 40 determines that the target user has ended speech signal input, recording is stopped, and speech signal recording stops at frame i.
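The sliding-window start/end decision of steps 4)-7) can be sketched as follows; the window size N = 30 follows the text's example, while the fourth and fifth thresholds C and D, and the assumption that a frame counts as speech when its probability exceeds 0.5 (equivalent to p1 > 1-p1), are illustrative.

```python
def detect_boundary(probs, i, N=30, C=20, D=20):
    """Decide, at frame i, whether a start or end of target speech is seen.

    probs[k] is the target speech probability of frame k; the window is
    frames i-N+1 .. i. Returns "start", "end", or None.
    """
    if i + 1 < N:
        return None                      # not enough frames yet
    window = probs[i - N + 1:i + 1]
    speech = [p > 0.5 for p in window]   # p1 > 1 - p1  <=>  p1 > 0.5
    n1 = sum(speech)                     # speech-frame count N1
    n2 = N - n1                          # non-speech-frame count N2
    P1 = sum(window)                     # sum of p1
    P2 = N - P1                          # sum of (1 - p1)
    if P1 >= P2 and n1 >= C and speech[-1]:
        return "start"
    if P2 > P1 and n2 >= D and not speech[-1]:
        return "end"
    return None

# 40 frames: 10 quiet frames followed by 30 confident speech frames,
# so the window ending at frame 39 covers frames 10..39.
probs = [0.1] * 10 + [0.9] * 30
```

Requiring all three conditions at once (total probability, frame count, and the state of the current frame) keeps isolated noisy frames from flipping the decision, which is why the example in the text only fires once 30 consecutive confident frames are in view.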
In this embodiment, the situation in which recording continues on a non-target user's speech after the target user has stopped speaking can be detected quickly and effectively, and the recording device is controlled to stop recording in time, which shortens the response time of voice interaction, ensures that the target user's request is answered promptly, avoids errors in subsequent speech recognition and semantic understanding, and improves the accuracy of the speech processing results and thus the user experience; it also reduces the resource occupation of the recording device and avoids excessive resource usage. In addition, obtaining the target speech probability and target non-speech probability of each frame of the first input signal from the frame energy and a deep neural network algorithm effectively improves the accuracy with which VAD detects the start point and end point of the recorded speech, guaranteeing the integrity of the recorded data while further reducing the resource occupation of the recording device.
Embodiment 4
As shown in Fig. 5, the detection system for a change of the target user in voice interaction of this embodiment comprises a wake-up module 1, a first judgment module 2, a first speech signal acquisition module 3, a first pitch-period acquisition module 4, a first period-sequence acquisition module 5, a second speech signal acquisition module 6, a second pitch-period acquisition module 7, a second period-sequence acquisition module 8, a similarity computation module 9 and a second judgment module 10.
The wake-up module 1 is configured to wake up the recording device with a wake word; after being woken up, the recording device automatically enters the recording state and picks up sound with a microphone.
The first judgment module 2 is configured to judge whether it is detected that the target user has started inputting a speech signal and, if so, to invoke the first speech signal acquisition module 3 to acquire the first speech signal of the first set number of frames input by the target user;
the first pitch-period acquisition module 4 is configured to acquire the first pitch period corresponding to each frame of the first speech signal;
the first period-sequence acquisition module 5 is configured to acquire, from the first pitch periods, the first pitch-period sequence corresponding to the first set number of frames;
the second speech signal acquisition module 6 is configured to acquire, after a set duration, the second speech signal of the second set number of frames input into the recording device by the current user;
the second pitch-period acquisition module 7 is configured to acquire the second pitch period corresponding to each frame of the second speech signal;
the second period-sequence acquisition module 8 is configured to acquire, from the second pitch periods, the second pitch-period sequence corresponding to the second set number of frames;
the similarity computation module 9 is configured to compute the similarity between the first pitch-period sequence and the second pitch-period sequence;
the second judgment module 10 is configured to judge whether the similarity is greater than the first set threshold and, if so, to determine that the current user inputting the second speech signal is the target user, to control the recording device to continue recording, and to re-invoke the second speech signal acquisition module 6;
if not, to determine that the current user inputting the second speech signal is not the target user, and to control the recording device to stop recording.
For example, once speech input is detected, the first speech signal of the initial m frames is acquired automatically, its first pitch-period sequence is obtained and saved to a first buffer of the recording device, while it is continuously checked whether the target user has ended speech signal input. While the target user has not ended speech input, at every interval (e.g. 100 ms) the second speech signal of n frames input by the current user is acquired, and its second pitch-period sequence is obtained and saved to a second buffer of the recording device; since the pitch period is not strictly periodic, n may be chosen at random within [m-5, m+5]. Each newly acquired second pitch-period sequence is then compared with the first pitch-period sequence to determine whether the speaker has changed; a change means that the target user has stopped speaking and the speech of a non-target user is being recorded, in which case recording must be stopped promptly.
In this embodiment, after the target user starts inputting a speech signal, the first pitch-period sequence is obtained from the speech signal of the set number of frames input by the target user, and the second pitch-period sequence of the currently input speech signal is then checked at every set duration. Comparing the two pitch-period sequences makes it possible to quickly and effectively detect that recording has continued on a non-target user's speech after the target user has stopped speaking, and to control the recording device to stop recording in time. This ensures that the target user's request is answered promptly, shortens the response time of voice interaction, avoids errors in subsequent speech recognition and semantic understanding, and improves the accuracy of the speech processing results and the user experience; it also reduces the resource occupation of the recording device and avoids excessive resource usage.
Embodiment 5
As shown in Fig. 6, the detection system for a change of the target user in voice interaction of this embodiment further improves on Embodiment 4; specifically:
The first pitch-period acquisition module 4 comprises a first preprocessing unit 11, a first short-time-energy processing unit 12, a first center-clipping processing unit 13 and a first pitch-period acquisition unit 14.
The first preprocessing unit 11 is configured to preprocess each frame of the first speech signal;
the first short-time-energy processing unit 12 is configured to process each preprocessed frame of the first speech signal with short-time energy to obtain the first voiced signal in each frame of the first speech signal;
the first center-clipping processing unit 13 is configured to process the first voiced signal with the center-clipping method to obtain the first intermediate speech signal;
the first pitch-period acquisition unit 14 is configured to process the first intermediate speech signal to obtain the first pitch period corresponding to each frame of the first speech signal;
specifically, the first intermediate speech signal is processed by waveform estimation, autocorrelation, the cepstrum method or a similar method to obtain the first pitch period corresponding to each frame of the first speech signal.
The first period-sequence acquisition module 5 is configured to form the first pitch-period sequence from the first pitch period corresponding to each frame of the first speech signal within the first set number of frames.
The second pitch-period acquisition module 7 comprises a second preprocessing unit 15, a second short-time-energy processing unit 16, a second center-clipping processing unit 17 and a second pitch-period acquisition unit 18.
The second preprocessing unit 15 is configured to preprocess each frame of the second speech signal;
the second short-time-energy processing unit 16 is configured to process each preprocessed frame of the second speech signal with short-time energy to obtain the second voiced signal in each frame of the second speech signal;
the second center-clipping processing unit 17 is configured to process the second voiced signal with the center-clipping method to obtain the second intermediate speech signal;
the second pitch-period acquisition unit 18 is configured to process the second intermediate speech signal to obtain the second pitch period corresponding to each frame of the second speech signal;
specifically, the second intermediate speech signal is processed by waveform estimation, autocorrelation, the cepstrum method or a similar method to obtain the second pitch period corresponding to each frame of the second speech signal.
The second period-sequence acquisition module 8 is configured to form the second pitch-period sequence from the second pitch period corresponding to each frame of the second speech signal within the second set number of frames.
The similarity computation module 9 comprises a Euclidean-distance computation unit 19 and a similarity determination unit 20.
The Euclidean-distance computation unit 19 is configured to compute, with dynamic time warping, the Euclidean distance between the first pitch-period sequence and the second pitch-period sequence;
the similarity determination unit 20 is configured to determine the similarity between the two sequences from the Euclidean distance;
wherein the Euclidean distance is negatively correlated with the similarity.
In this embodiment, after the target user starts inputting a speech signal, the first pitch-period sequence is obtained from the speech signal of the set number of frames input by the target user, and the second pitch-period sequence of the currently input speech signal is then checked at every set duration. Comparing the two pitch-period sequences makes it possible to quickly and effectively detect that recording has continued on a non-target user's speech after the target user has stopped speaking, and to control the recording device to stop recording in time. This ensures that the target user's request is answered promptly, shortens the response time of voice interaction, avoids errors in subsequent speech recognition and semantic understanding, and improves the accuracy of the speech processing results and the user experience; it also reduces the resource occupation of the recording device and avoids excessive resource usage.
Embodiment 6
The detection system for a change of the target user in voice interaction of this embodiment further improves on Embodiment 5; specifically:
As shown in Fig. 7, the first judgment module 2 comprises a first input signal acquisition unit 21, a target probability acquisition unit 22, a speech-frame determination unit 23, a frame-count acquisition unit 24, a total-probability computation unit 25, a target input signal acquisition unit 26 and a signal-input determination unit 27.
The first input signal acquisition unit 21 is configured to acquire each frame of the first input signal in sequence after the recording device starts recording;
the target probability acquisition unit 22 is configured to acquire the target speech probability and target non-speech probability corresponding to each frame of the first input signal;
wherein the target probability acquisition unit 22 obtains the target speech probability and target non-speech probability of each frame of the first input signal from the energy of each frame of the first input signal and/or with a deep neural network algorithm.
Specifically, (1) when the target speech probability and target non-speech probability of each frame of the first input signal are obtained from the frame energy, the target probability acquisition unit 22 comprises an energy-value acquisition sub-unit and a target-probability acquisition sub-unit;
the energy-value acquisition sub-unit is configured to acquire the average energy value of each frame of the first input signal within the set frequency range;
the target-probability acquisition sub-unit is configured to acquire the target speech probability and target non-speech probability of each frame of the first input signal from the average energy value.
The energy-value acquisition sub-unit is configured to convert each time-domain frame of the first input signal into a frequency-domain second input signal;
the energy-value acquisition sub-unit is further configured to compute, for each frame of the second input signal, the sub-band energy value corresponding to each frequency band within the set frequency range;
the energy-value acquisition sub-unit is further configured to acquire the average energy value of each frame of the second input signal from the sub-band energy values.
The target-probability acquisition sub-unit is configured to determine a first probability that the current frame is speech when the average energy value is greater than the second set threshold;
when the average energy value is less than or equal to the second set threshold and greater than the third set threshold, to determine, from the average energy value, the second set threshold and the third set threshold, a second probability that the current frame is speech;
when the average energy value is less than or equal to the third set threshold, to determine a third probability that the current frame is speech;
wherein the first probability, the second probability and the third probability are ordered from largest to smallest;
the target-probability acquisition sub-unit is further configured to determine, from the first probability, the second probability or the third probability, the target speech probability and target non-speech probability corresponding to each frame of the second input signal.
Specifically, the first probability is 1 and the third probability is 0.
The calculation formula corresponding to the step of determining the second probability from the average energy value, the second set threshold and the third set threshold is as follows:
Prob_energy=(energy-A)/(B-A)
where Prob_energy denotes the second probability, energy denotes the average energy value, A denotes the third set threshold, and B denotes the second set threshold.
The second and third set thresholds are chosen from practical experience and may be adjusted to the actual situation. Or,
(2) when a deep neural network algorithm is used to obtain the target speech probability and target non-speech probability of each frame of the first input signal, the target probability acquisition unit 22 comprises a historical-signal acquisition sub-unit, a model-building sub-unit and a target-probability acquisition sub-unit;
the historical-signal acquisition sub-unit is configured to acquire each frame of historical input signal input by the target user into the recording device within the historical set period, together with the signal type corresponding to each frame of the historical input signal;
wherein the signal types comprise speech signal and non-speech signal;
the model-building sub-unit is configured to build, with the historical input signals as input and the signal types as output, a deep neural network probability model for predicting the probability that each frame of input signal is a speech signal;
the target-probability acquisition sub-unit is configured to input each frame of the first input signal into the prediction model to obtain the target speech probability corresponding to each frame of the first input signal;
the target-probability acquisition sub-unit is further configured to compute the target non-speech probability of each frame of the first input signal from the target speech probability. Or,
(3) when the target speech probability and target non-speech probability of each frame of the first input signal are obtained both from the frame energy and using a deep neural network algorithm, the target probability acquisition unit 22 comprises an energy-value acquisition sub-unit, a target-probability acquisition sub-unit, a historical-signal acquisition sub-unit, a model-building sub-unit and a weighting computation sub-unit.
The energy-value acquisition sub-unit is configured to acquire the average energy value of each frame of the first input signal within the set frequency range;
the target-probability acquisition sub-unit is configured to acquire the first speech probability and first non-speech probability of each frame of the first input signal from the average energy value;
the historical-signal acquisition sub-unit is configured to acquire each frame of historical input signal input by the target user into the recording device within the historical set period, together with the signal type corresponding to each frame of the historical input signal;
wherein the signal types comprise speech signal and non-speech signal;
the model-building sub-unit is configured to build, with the historical input signals as input and the signal types as output, a deep neural network probability model for predicting the probability that each frame of input signal is a speech signal;
the target-probability acquisition sub-unit is configured to input each frame of the first input signal into the prediction model to obtain the second speech probability corresponding to each frame of the first input signal;
the target-probability acquisition sub-unit is further configured to compute the second non-speech probability of each frame of the first input signal from the second speech probability;
the weighting computation sub-unit is configured to process, with the weighted-average method, the first speech probability and second speech probability of the same frame of the first input signal, and likewise the first non-speech probability and second non-speech probability of that frame, to obtain the target speech probability and target non-speech probability corresponding to each frame of the first input signal.
That is, the target speech probability of a given frame of the first input signal is computed as follows:
Prob=a*prob_energy1+(1-a)*prob_dnnspeech
where Prob denotes the target speech probability, a denotes a weighting coefficient (e.g. 0.7), prob_energy1 denotes the first speech probability of that frame of the first input signal, and prob_dnnspeech denotes the second speech probability of that frame.
The target input signal acquisition unit 26 is configured to acquire the target input signal when the total number of acquired frames of the first input signal is greater than or equal to the first set number of frames;
wherein, when the first set number of frames is N, the target input signal comprises the first input signal corresponding to the current frame i and the N-1 frames preceding the current frame, with i ≥ N > 1, both being integers;
for the same frame of the first input signal, the speech-frame determination unit 23 is configured to judge whether the target speech probability is greater than the target non-speech probability; if so, to determine the current frame to be a speech frame; if not, to determine it to be a non-speech frame;
the frame-count acquisition unit 24 is configured to acquire the first frame count corresponding to the speech frames and the second frame count corresponding to the non-speech frames in the target input signal;
the total-probability computation unit 25 is configured to sum the target speech probabilities of every frame of the first input signal in the target input signal to obtain the first total probability, and to sum the target non-speech probabilities to obtain the second total probability;
the signal-input determination unit 27 is configured to determine, from the target speech probability and target non-speech probability of each frame of the first input signal in the target input signal, whether it is detected that the target user has started inputting a speech signal.
Specifically, the signal-input determination unit 27 is configured to determine that the target user is detected to have started inputting a speech signal when the first total probability is greater than or equal to the second total probability, the first frame count is greater than or equal to the fourth set threshold, and the current frame is a speech frame.
In addition, the signal-input determination unit 27 is further configured to determine that the target user is detected to have ended speech signal input, and to control the recording device to stop recording, when the second total probability is greater than the first total probability, the second frame count is greater than or equal to the fifth set threshold, and the current frame is a non-speech frame.
When it is determined that the target user has not started inputting a speech signal, the first input signal acquisition unit 21 is further configured to continue acquiring the next frame of the first input signal;
the target input signal acquisition unit 26 is further configured to acquire a new target input signal from the next frame of the first input signal;
the signal-input determination unit 27 is further configured to determine, from the target speech probability and target non-speech probability of each frame of the first input signal in the new target input signal, whether it is detected that the target user has started inputting a speech signal.
When it is detected that the target user has started inputting a speech signal, the first speech signal acquisition module 3 is configured to acquire the first input signal of the first set number of frames input by the target user, starting from frame i-N+1.
A concrete example:
1) after the recording device is woken up and starts recording, record each frame of the first input signal in sequence;
2) obtain the target speech probability p1 and target non-speech probability (1-p1) of each frame of the first input signal from the frame energy and the deep neural network algorithm;
3) let the current frame be i and the first set number of frames be N; while i < N, continue recording each frame of the first input signal in sequence; once i ≥ N, extract the first input signal corresponding to the current frame i and the preceding N-1 frames to form the target input signal. For example, with i = 40 and N = 30, the first input signal of frames 11 to 40 is extracted to form the target input signal (corresponding to 30 frames).
4) for a given frame of the first input signal, when the target speech probability p1 is greater than the target non-speech probability (1-p1), the frame is determined to be a speech frame; otherwise, a non-speech frame;
5) compute the first frame count N1 of speech frames and the second frame count N2 of non-speech frames in the target input signal;
6) compute P1, the sum of the target speech probabilities p1 of the 30 frames of the first input signal in the target input signal, and P2, the sum of their target non-speech probabilities (1-p1);
7) when P1 ≥ P2, N1 > C, and frame i is a speech frame, it is determined that the target user is detected to have started inputting a speech signal, C denoting the fourth set threshold;
when P2 > P1, N2 > D, and frame i is a non-speech frame, it is determined that the target user is detected to have ended speech signal input, D denoting the fifth set threshold.
In addition, when the target input signal formed by the first input signal of frames 11 to 40 determines that the target user has not started inputting a speech signal, continue by extracting the new target input signal formed by the first input signal of frames 12 to 41 and re-execute steps 4)-7) above, until a target input signal is extracted from which it can be determined that the target user has started inputting a speech signal.
8) if the target input signal formed by the first input signal of frames 11 to 40 determines that the target user is detected to have started inputting a speech signal, the first input signal of 30 consecutive frames of the target user's input, starting from frame i-N+1 = 40-30+1 = 11, is acquired as the first speech signal;
here, acquisition may also start a few frames before frame 11 (e.g. at frame 8) to guarantee the integrity of the speech signal input by the target user and hence the accuracy of subsequent speech processing results.
If the target input signal formed by the first input signal of frames 11 to 40 determines that the target user has ended speech signal input, recording is stopped, and speech signal recording stops at frame i.
In this embodiment, the situation in which recording continues on a non-target user's speech after the target user has stopped speaking can be detected quickly and effectively, and the recording device is controlled to stop recording in time, which shortens the response time of voice interaction, ensures that the target user's request is answered promptly, avoids errors in subsequent speech recognition and semantic understanding, and improves the accuracy of the speech processing results and thus the user experience; it also reduces the resource occupation of the recording device and avoids excessive resource usage. In addition, obtaining the target speech probability and target non-speech probability of each frame of the first input signal from the frame energy and a deep neural network algorithm effectively improves the accuracy with which VAD detects the start point and end point of the recorded speech, guaranteeing the integrity of the recorded data while further reducing the resource occupation of the recording device.
Embodiment 7
Fig. 8 is a schematic structural diagram of an electronic device provided by Embodiment 7 of the present invention. The electronic device comprises a memory, a processor, and a computer program stored on the memory and runnable on the processor; when executing the program, the processor implements the detection method for a change of the target user in voice interaction of any one of Embodiments 1 to 3. The electronic device 30 shown in Fig. 8 is merely an example and imposes no limitation on the function or scope of application of the embodiments of the present invention.
As shown in Fig. 8, the electronic device 30 may take the form of a general-purpose computing device, for example a server device. The components of the electronic device 30 may include, but are not limited to: the above at least one processor 31, the above at least one memory 32, and a bus 33 connecting the different system components (including the memory 32 and the processor 31).
The bus 33 includes a data bus, an address bus and a control bus.
The memory 32 may include volatile memory, such as random access memory (RAM) 321 and/or cache memory 322, and may further include read-only memory (ROM) 323.
The memory 32 may also include a program/utility 325 having a set of (at least one) program modules 324, such program modules 324 including but not limited to: an operating system, one or more application programs, other program modules and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
The processor 31 executes various functional applications and data processing by running the computer program stored in the memory 32, for example the detection method for a change of the target user in voice interaction of any one of Embodiments 1 to 3 of the present invention.
The electronic device 30 may also communicate with one or more external devices 34 (e.g. a keyboard, a pointing device, etc.); such communication takes place through an input/output (I/O) interface 35. The electronic device 30 may further communicate with one or more networks (e.g. a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter 36. As shown in Fig. 8, the network adapter 36 communicates with the other modules of the electronic device 30 through the bus 33. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 30, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, data backup storage systems, and the like.
It should be noted that, although several units/modules or sub-units/modules of the electronic device are mentioned in the detailed description above, this division is merely exemplary and not mandatory. Indeed, according to embodiments of the present invention, the features and functions of two or more of the units/modules described above may be embodied in a single unit/module; conversely, the features and functions of one unit/module described above may be further divided and embodied by multiple units/modules.
Embodiment 8
This embodiment provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the detection method for a change of the target user in voice interaction of any one of Embodiments 1 to 3 are implemented.
More specifically, the readable storage medium may include, but is not limited to: a portable disk, hard disk, random access memory, read-only memory, erasable programmable read-only memory, optical storage device, magnetic storage device, or any suitable combination of the above.
In a possible implementation, the present invention may also be implemented as a program product comprising program code; when the program product runs on a terminal device, the program code causes the terminal device to execute the steps of the detection method for a change of the target user in voice interaction of any one of Embodiments 1 to 3.
The program code for carrying out the present invention may be written in any combination of one or more programming languages, and may execute entirely on the user's device, partly on the user's device, as an independent software package, partly on the user's device and partly on a remote device, or entirely on a remote device.
Although specific embodiments of the present invention are described above, those skilled in the art should understand that these are merely illustrative, and that various changes or modifications may be made to these embodiments without departing from the principle and essence of the present invention. The scope of protection of the present invention is therefore defined by the appended claims.

Claims (30)

  1. A detection method for a change of the target user in voice interaction, characterized in that the detection method comprises:
    after a recording device starts recording, judging whether it is detected that the target user has started inputting a speech signal and, if so, acquiring a first speech signal of a first set number of frames input by the target user;
    acquiring a first pitch period corresponding to each frame of the first speech signal;
    acquiring, from the first pitch periods, a first pitch-period sequence corresponding to the first set number of frames;
    after a set duration, acquiring a second speech signal of a second set number of frames input into the recording device by the current user;
    acquiring a second pitch period corresponding to each frame of the second speech signal;
    acquiring, from the second pitch periods, a second pitch-period sequence corresponding to the second set number of frames;
    computing the similarity between the first pitch-period sequence and the second pitch-period sequence;
    judging whether the similarity is greater than a first set threshold; if so, determining that the current user inputting the second speech signal is the target user, controlling the recording device to continue recording, and re-executing the step of acquiring a second speech signal of a second set number of frames after a set duration;
    if not, determining that the current user inputting the second speech signal is not the target user, and controlling the recording device to stop recording.
  2. The detection method for a change of the target user in voice interaction according to claim 1, characterized in that the step of acquiring the first pitch period corresponding to each frame of the first speech signal comprises:
    preprocessing each frame of the first speech signal;
    processing each preprocessed frame of the first speech signal with short-time energy to obtain the first voiced signal in each frame of the first speech signal;
    processing the first voiced signal with the center-clipping method to obtain a first intermediate speech signal;
    processing the first intermediate speech signal with waveform estimation, autocorrelation or the cepstrum method to obtain the first pitch period corresponding to each frame of the first speech signal; and/or,
    the step of acquiring the second pitch period corresponding to each frame of the second speech signal comprises:
    preprocessing each frame of the second speech signal;
    processing each preprocessed frame of the second speech signal with short-time energy to obtain the second voiced signal in each frame of the second speech signal;
    processing the second voiced signal with the center-clipping method to obtain a second intermediate speech signal;
    processing the second intermediate speech signal with waveform estimation, autocorrelation or the cepstrum method to obtain the second pitch period corresponding to each frame of the second speech signal.
  3. The detection method for a change of the target user in voice interaction according to claim 1 or 2, characterized in that the step of computing the similarity between the first pitch-period sequence and the second pitch-period sequence comprises:
    computing, with dynamic time warping, the Euclidean distance between the first pitch-period sequence and the second pitch-period sequence;
    determining the similarity between the first pitch-period sequence and the second pitch-period sequence from the Euclidean distance;
    wherein the Euclidean distance is negatively correlated with the similarity.
  4. The detection method for a change of the target user in voice interaction according to claim 2 or 3, characterized in that the step of judging, after the recording device starts recording, whether it is detected that the target user has started inputting a speech signal comprises:
    after the recording device starts recording, acquiring each frame of a first input signal in sequence;
    acquiring a target speech probability and a target non-speech probability corresponding to each frame of the first input signal;
    when the total number of acquired frames of the first input signal is greater than or equal to the first set number of frames, acquiring a target input signal;
    wherein, when the first set number of frames is N, the target input signal comprises the first input signal corresponding to the current frame i and the N-1 frames preceding the current frame, with i ≥ N > 1, both being integers;
    determining, from the target speech probability and target non-speech probability of each frame of the first input signal in the target input signal, whether it is detected that the target user has started inputting a speech signal.
  5. The detection method for a change of the target user in voice interaction according to at least one of claims 1-4, characterized in that, after the step of acquiring the target speech probability and target non-speech probability corresponding to each frame of the first input signal, and before the step of determining from these probabilities whether the target user has started inputting a speech signal, the method further comprises:
    for the same frame of the first input signal, judging whether the target speech probability is greater than the target non-speech probability; if so, determining the current frame to be a speech frame; if not, determining it to be a non-speech frame;
    acquiring a first frame count corresponding to the speech frames and a second frame count corresponding to the non-speech frames in the target input signal;
    summing the target speech probabilities of every frame of the first input signal in the target input signal to obtain a first total probability, and summing the target non-speech probabilities to obtain a second total probability;
    the step of determining, from the target speech probability and target non-speech probability of each frame of the first input signal in the target input signal, whether the target user has started inputting a speech signal comprises:
    when the first total probability is greater than or equal to the second total probability, the first frame count is greater than or equal to a fourth set threshold, and the current frame is a speech frame, determining that the target user is detected to have started inputting a speech signal.
  6. The detection method for a change of the target user in voice interaction according to claim 5, characterized in that, after the step of controlling the recording device to continue recording, and before the step of acquiring a second speech signal of a second set number of frames after a set duration, the method further comprises:
    judging whether it is detected that the target user has ended speech signal input; if so, controlling the recording device to stop recording; if not, continuing with the step of acquiring a second speech signal of a second set number of frames after a set duration.
  7. The detection method for a change of the target user in voice interaction according to claim 5 or 6, characterized in that the step of judging whether it is detected that the target user has ended speech signal input and, if so, controlling the recording device to stop recording comprises:
    when the second total probability is greater than the first total probability, the second frame count is greater than or equal to a fifth set threshold, and the current frame is a non-speech frame, determining that the target user is detected to have ended speech signal input, and controlling the recording device to stop recording.
  8. The detection method for a change of the target user in voice interaction according to at least one of claims 4-7, characterized in that, when it is determined that the target user has not started inputting a speech signal, the detection method further comprises:
    continuing to acquire the next frame of the first input signal;
    acquiring a new target input signal from the next frame of the first input signal;
    determining, from the target speech probability and target non-speech probability of each frame of the first input signal in the new target input signal, whether it is detected that the target user has started inputting a speech signal; and/or,
    when it is detected that the target user has started inputting a speech signal, the step of acquiring the first speech signal of the first set number of frames input by the target user comprises:
    acquiring the first input signal of the first set number of frames input by the target user, starting from frame i-N+1.
  9. 如权利要求4-7中至少一项所述的语音交互中目标用户改变的检测方法,其特征在于,所述获取每帧所述第一输入信号对应的目标语音概率和目标非语音概率的步骤包括:
    根据每帧所述第一输入信号对应的能量和/或采用深度神经网络算法获取每帧所述第一输入信号对应的目标语音概率和目标非语音概率。
  10. 如权利要求9所述的语音交互中目标用户改变的检测方法,其特征在于,当根据每帧所述第一输入信号对应的能量获取每帧所述第一输入信号对应的目标语音概率和目标非语音概率时,所述根据每帧所述第一输入信号对应的能量获取每帧所述第一输入信号对应的目标语音概率和目标非语音概率的步骤包括:
    获取每帧所述第一输入信号在设定频率范围内对应的平均能量值;
    根据所述平均能量值获取每帧所述第一输入信号对应的所述目标语音概率和所述目标非语音概率。
  11. 如权利要求9或10所述的语音交互中目标用户改变的检测方法,其特征在于,当采用深度神经网络算法获取每帧所述第一输入信号对应的目标语音概率和目标非语音概率时,所述采用深度神经网络算法获取每帧所述第一输入信号对应的目标语音概率和目标非语音概率的步骤包括:
    获取历史设定时间内所述目标用户输入至所述录音设备中的每帧历史输入信号以及与每帧所述历史输入信号对应的信号类型;
    其中,所述信号类型包括语音信号和非语音信号;
    将所述历史输入信号作为输入,所述信号类型作为输出,采用深度神经网络建立用于预测每帧输入信号为语音信号的概率模型;
    将每帧所述第一输入信号分别输入至所述预测模型,获取每帧所述第一输入信号对应的所述目标语音概率;
    根据所述目标语音概率计算每帧所述第一输入信号对应的所述目标非语音概率。
  12. The method for detecting a target user change in voice interaction according to at least one of claims 9-11, wherein, when the target speech probability and the target non-speech probability corresponding to each frame of the first input signal are obtained both according to the energy corresponding to each frame of the first input signal and by using a deep neural network algorithm, the step of obtaining them according to the energy comprises:
    obtaining an average energy value corresponding to each frame of the first input signal within a set frequency range;
    obtaining a first speech probability and a first non-speech probability corresponding to each frame of the first input signal according to the average energy value;
    and the step of obtaining them by using the deep neural network algorithm comprises:
    obtaining each frame of historical input signal input by the target user into the recording device within a historical set time, and a signal type corresponding to each frame of the historical input signal;
    wherein the signal type includes a speech signal and a non-speech signal;
    using the historical input signals as input and the signal types as output, establishing, by means of a deep neural network, a probability model for predicting the probability that each frame of input signal is a speech signal;
    inputting each frame of the first input signal into the prediction model respectively, to obtain a second speech probability corresponding to each frame of the first input signal;
    calculating a second non-speech probability corresponding to each frame of the first input signal according to the second speech probability;
    processing, by a weighted average method, the first speech probability and the second speech probability of the same frame of the first input signal, as well as the first non-speech probability and the second non-speech probability of the same frame of the first input signal, to obtain the target speech probability and the target non-speech probability corresponding to each frame of the first input signal.
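The weighted-average fusion in the last step above can be sketched as follows; the weight value is an assumption, since the claim specifies the weighted average method but not the weights themselves.

```python
def fuse_probs(p1_speech, p1_non, p2_speech, p2_non, w=0.5):
    """Weighted average of the energy-based probabilities (p1_*) and the
    deep-neural-network probabilities (p2_*) for the same frame.
    The weight w is illustrative; the claim does not fix its value."""
    target_speech = w * p1_speech + (1 - w) * p2_speech
    target_non = w * p1_non + (1 - w) * p2_non
    return target_speech, target_non
```

Because both input pairs sum to one, the fused target probabilities also sum to one for any weight in [0, 1].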
  13. The method for detecting a target user change in voice interaction according to at least one of claims 10-12, wherein the step of obtaining the average energy value corresponding to each frame of the first input signal within the set frequency range comprises:
    converting each frame of the first input signal in the time domain into a second input signal in the frequency domain;
    calculating a sub-band energy value corresponding to each frequency band within the set frequency range for each frame of the second input signal;
    obtaining the average energy value corresponding to each frame of the second input signal according to the sub-band energy values; and/or, the step of obtaining the target speech probability and the target non-speech probability corresponding to each frame of the first input signal according to the average energy value comprises:
    when the average energy value is greater than a second set threshold, determining that the current frame is speech with a first probability;
    when the average energy value is less than or equal to the second set threshold and greater than a third set threshold, determining, according to the average energy value, the second set threshold and the third set threshold, a second probability that the current frame is speech;
    when the average energy value is less than or equal to the third set threshold, determining that the current frame is speech with a third probability;
    wherein the first probability, the second probability and the third probability are ranked in descending order;
    determining the target speech probability and the target non-speech probability corresponding to each frame of the second input signal according to the first probability, the second probability or the third probability.
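The sub-band averaging step above can be sketched as follows, assuming the frame has already been converted to per-bin power values in the frequency domain; the function name and the band-edge representation are illustrative.

```python
def average_band_energy(spectrum, band_edges):
    """spectrum: per-bin power values of one frame (already in the
    frequency domain); band_edges: (lo, hi) bin ranges defining the
    sub-bands of the set frequency range.

    Returns the average of the sub-band energy values, i.e. the
    per-frame average energy value used by the threshold tests."""
    subband_energies = [sum(spectrum[lo:hi]) for lo, hi in band_edges]
    return sum(subband_energies) / len(subband_energies)
```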
  14. The method for detecting a target user change in voice interaction according to claim 13, wherein the calculation formula corresponding to the step of determining, according to the average energy value, the second set threshold and the third set threshold, the second probability that the current frame is speech is as follows:
    Prob_energy=(energy-A)/(B-A)
    wherein Prob_energy denotes the second probability, energy denotes the average energy value, A denotes the third set threshold, and B denotes the second set threshold; and/or,
    before the recording device starts recording, the detection method further comprises:
    waking up the recording device with a wake-up word.
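The full piecewise mapping of claims 13-14, including the Prob_energy formula, can be sketched as follows. The values chosen for the first and third probabilities are assumptions; the claims only require that the three probabilities be in descending order.

```python
def speech_prob_from_energy(energy, B, A, p_high=1.0, p_low=0.0):
    """Piecewise speech probability from the average energy value.
    B is the second set threshold and A the third set threshold (A < B).
    p_high and p_low stand in for the first and third probabilities."""
    if energy > B:
        return p_high                   # first probability (highest)
    if energy > A:
        return (energy - A) / (B - A)   # Prob_energy = (energy-A)/(B-A)
    return p_low                        # third probability (lowest)
```

Between the two thresholds the probability interpolates linearly from 0 at A to 1 at B, which keeps the three branches in descending order.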
  15. A detection system for a target user change in voice interaction, wherein the detection system comprises a first judgment module, a first speech signal acquisition module, a first pitch period acquisition module, a first period sequence acquisition module, a second speech signal acquisition module, a second pitch period acquisition module, a second period sequence acquisition module, a similarity calculation module and a second judgment module;
    the first judgment module is configured to judge, after a recording device starts recording, whether a target user is detected to start inputting a speech signal, and if so, to call the first speech signal acquisition module to obtain a first speech signal of a first set frame number input by the target user;
    the first pitch period acquisition module is configured to obtain a first pitch period corresponding to each frame of the first speech signal;
    the first period sequence acquisition module is configured to obtain, according to the first pitch period, a first pitch period sequence corresponding to the first set frame number;
    the second speech signal acquisition module is configured to obtain, after a set duration, a second speech signal of a second set frame number input into the recording device by a current user;
    the second pitch period acquisition module is configured to obtain a second pitch period corresponding to each frame of the second speech signal;
    the second period sequence acquisition module is configured to obtain, according to the second pitch period, a second pitch period sequence corresponding to the second set frame number;
    the similarity calculation module is configured to calculate a similarity between the first pitch period sequence and the second pitch period sequence;
    the second judgment module is configured to judge whether the similarity is greater than a first set threshold, and if so, to determine that the current user inputting the second speech signal is the target user, control the recording device to continue recording, and call the second speech signal acquisition module again;
    if not, to determine that the current user inputting the second speech signal is not the target user, and control the recording device to stop recording.
  16. The detection system for a target user change in voice interaction according to claim 15, wherein the first pitch period acquisition module comprises a first pre-processing unit, a first short-time energy processing unit, a first center-clipping processing unit and a first pitch period acquisition unit;
    the first pre-processing unit is configured to pre-process each frame of the first speech signal;
    the first short-time energy processing unit is configured to process each pre-processed frame of the first speech signal using short-time energy, to obtain a first voiced signal in each frame of the first speech signal;
    the first center-clipping processing unit is configured to process the first voiced signal using a center-clipping method, to obtain a first intermediate speech signal;
    the first pitch period acquisition unit is configured to process the first intermediate speech signal using a waveform estimation method, an autocorrelation method or a cepstrum method, to obtain the first pitch period corresponding to each frame of the first speech signal; and/or,
    the second pitch period acquisition module comprises a second pre-processing unit, a second short-time energy processing unit, a second center-clipping processing unit and a second pitch period acquisition unit;
    the second pre-processing unit is configured to pre-process each frame of the second speech signal;
    the second short-time energy processing unit is configured to process each pre-processed frame of the second speech signal using short-time energy, to obtain a second voiced signal in each frame of the second speech signal;
    the second center-clipping processing unit is configured to process the second voiced signal using a center-clipping method, to obtain a second intermediate speech signal;
    the second pitch period acquisition unit is configured to process the second intermediate speech signal using a waveform estimation method, an autocorrelation method or a cepstrum method, to obtain the second pitch period corresponding to each frame of the second speech signal.
  17. The detection system for a target user change in voice interaction according to claim 15 or 16, wherein the similarity calculation module comprises a Euclidean distance calculation unit and a similarity determination unit;
    the Euclidean distance calculation unit is configured to calculate the Euclidean distance between the first pitch period sequence and the second pitch period sequence using a dynamic time warping algorithm;
    the similarity determination unit is configured to determine the similarity between the first pitch period sequence and the second pitch period sequence according to the Euclidean distance;
    wherein the Euclidean distance is negatively correlated with the similarity.
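The dynamic time warping distance between the two pitch period sequences can be sketched with the classic cumulative-cost recurrence; the distance-to-similarity mapping shown is one assumed choice, since the claim only requires that distance and similarity be negatively correlated.

```python
def dtw_distance(seq1, seq2):
    """Dynamic time warping distance between two pitch period sequences:
    the cumulative per-frame distance along the best alignment path."""
    n, m = len(seq1), len(seq2)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(seq1[i - 1] - seq2[j - 1])
            # Extend the cheapest of the three admissible predecessors.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def similarity(distance):
    """Assumed mapping: larger distance -> smaller similarity."""
    return 1.0 / (1.0 + distance)
```

Because DTW aligns sequences of different lengths, a repeated or stretched pitch contour from the same speaker still yields a small distance and hence a high similarity.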
  18. The detection system for a target user change in voice interaction according to claim 16 or 17, wherein the first judgment module comprises a first input signal acquisition unit, a target probability acquisition unit, a target input signal acquisition unit and a signal input determination unit;
    the first input signal acquisition unit is configured to obtain each frame of a first input signal in sequence after the recording device starts recording;
    the target probability acquisition unit is configured to obtain a target speech probability and a target non-speech probability corresponding to each frame of the first input signal;
    the target input signal acquisition unit is configured to obtain a target input signal when the total number of frames of the obtained first input signal is greater than or equal to the first set frame number;
    wherein, when the first set frame number is N, the target input signal includes the first input signals corresponding to the current frame i and the N-1 frames before the current frame, where i ≥ N > 1 and both are integers;
    the signal input determination unit is configured to determine, according to the target speech probability and the target non-speech probability of each frame of the first input signal in the target input signal, whether the target user is detected to start inputting a speech signal.
  19. The detection system for a target user change in voice interaction according to at least one of claims 15-18, wherein the first judgment module further comprises a speech frame determination unit, a frame number acquisition unit and a total probability calculation unit;
    for the same frame of the first input signal, the speech frame determination unit is configured to judge whether the target speech probability is greater than the target non-speech probability, and if so, to determine that the current frame is a speech frame; if not, that the current frame is a non-speech frame;
    the frame number acquisition unit is configured to obtain a first frame number corresponding to speech frames and a second frame number corresponding to non-speech frames in the target input signal;
    the total probability calculation unit is configured to sum the target speech probabilities of each frame of the first input signal in the target input signal to obtain a first total probability, and to sum the target non-speech probabilities of each frame of the first input signal in the target input signal to obtain a second total probability;
    the signal input determination unit is configured to determine that the target user is detected to start inputting a speech signal when the first total probability is greater than or equal to the second total probability, the first frame number is greater than or equal to a fourth set threshold, and the current frame is a speech frame.
  20. The detection system for a target user change in voice interaction according to claim 19, wherein, when the second judgment module controls the recording device to continue recording, the first judgment module is further configured to judge whether the target user is detected to finish inputting the speech signal, and if so, to control the recording device to stop recording; if not, to continue calling the second speech signal acquisition module.
  21. The detection system for a target user change in voice interaction according to claim 19 or 20, wherein the signal input determination unit is further configured to determine that the target user is detected to finish inputting the speech signal, and to control the recording device to stop recording, when the second total probability is greater than the first total probability, the second frame number is greater than or equal to a fifth set threshold, and the current frame is a non-speech frame.
  22. The detection system for a target user change in voice interaction according to at least one of claims 18-21, wherein, when it is determined that the target user has not started inputting a speech signal, the first input signal acquisition unit is further configured to continue obtaining the next frame of the first input signal;
    the target input signal acquisition unit is further configured to obtain a new target input signal according to the next frame of the first input signal;
    the signal input determination unit is further configured to determine, according to the target speech probability and the target non-speech probability of each frame of the first input signal in the new target input signal, whether the target user is detected to start inputting a speech signal; and/or,
    when it is detected that the target user starts inputting a speech signal, the first speech signal acquisition module is configured to obtain the first input signals of the first set frame number input by the target user, starting from frame i-N+1.
  23. The detection system for a target user change in voice interaction according to at least one of claims 18-21, wherein the target probability acquisition unit is configured to obtain the target speech probability and the target non-speech probability corresponding to each frame of the first input signal according to the energy corresponding to each frame of the first input signal and/or by using a deep neural network algorithm.
  24. The detection system for a target user change in voice interaction according to claim 23, wherein, when the target speech probability and the target non-speech probability corresponding to each frame of the first input signal are obtained according to the energy corresponding to each frame of the first input signal, the target probability acquisition unit comprises an energy value acquisition sub-unit and a target probability acquisition sub-unit;
    the energy value acquisition sub-unit is configured to obtain an average energy value corresponding to each frame of the first input signal within a set frequency range;
    the target probability acquisition sub-unit is configured to obtain the target speech probability and the target non-speech probability corresponding to each frame of the first input signal according to the average energy value.
  25. The detection system for a target user change in voice interaction according to claim 23 or 24, wherein, when a deep neural network algorithm is used to obtain the target speech probability and the target non-speech probability corresponding to each frame of the first input signal, the target probability acquisition unit comprises a historical signal acquisition sub-unit, a model establishment sub-unit and a target probability acquisition sub-unit;
    the historical signal acquisition sub-unit is configured to obtain each frame of historical input signal input by the target user into the recording device within a historical set time, and a signal type corresponding to each frame of the historical input signal;
    wherein the signal type includes a speech signal and a non-speech signal;
    the model establishment sub-unit is configured to use the historical input signals as input and the signal types as output, and to establish, by means of a deep neural network, a probability model for predicting the probability that each frame of input signal is a speech signal;
    the target probability acquisition sub-unit is configured to input each frame of the first input signal into the prediction model respectively, to obtain the target speech probability corresponding to each frame of the first input signal;
    the target probability acquisition sub-unit is further configured to calculate the target non-speech probability corresponding to each frame of the first input signal according to the target speech probability.
  26. The detection system for a target user change in voice interaction according to at least one of claims 23-25, wherein, when the target speech probability and the target non-speech probability corresponding to each frame of the first input signal are obtained both according to the energy corresponding to each frame of the first input signal and by using a deep neural network algorithm, the target probability acquisition unit comprises an energy value acquisition sub-unit, a target probability acquisition sub-unit, a historical signal acquisition sub-unit, a model establishment sub-unit and a weighted calculation sub-unit;
    the energy value acquisition sub-unit is configured to obtain an average energy value corresponding to each frame of the first input signal within a set frequency range;
    the target probability acquisition sub-unit is configured to obtain a first speech probability and a first non-speech probability corresponding to each frame of the first input signal according to the average energy value;
    the historical signal acquisition sub-unit is configured to obtain each frame of historical input signal input by the target user into the recording device within a historical set time, and a signal type corresponding to each frame of the historical input signal;
    wherein the signal type includes a speech signal and a non-speech signal;
    the model establishment sub-unit is configured to use the historical input signals as input and the signal types as output, and to establish, by means of a deep neural network, a probability model for predicting the probability that each frame of input signal is a speech signal;
    the target probability acquisition sub-unit is configured to input each frame of the first input signal into the prediction model respectively, to obtain a second speech probability corresponding to each frame of the first input signal;
    the target probability acquisition sub-unit is further configured to calculate a second non-speech probability corresponding to each frame of the first input signal according to the second speech probability;
    the weighted calculation sub-unit is configured to process, by a weighted average method, the first speech probability and the second speech probability of the same frame of the first input signal, as well as the first non-speech probability and the second non-speech probability of the same frame of the first input signal, to obtain the target speech probability and the target non-speech probability corresponding to each frame of the first input signal.
  27. The detection system for a target user change in voice interaction according to at least one of claims 24-26, wherein the energy value acquisition sub-unit is configured to convert each frame of the first input signal in the time domain into a second input signal in the frequency domain;
    the energy value acquisition sub-unit is further configured to calculate a sub-band energy value corresponding to each frequency band within the set frequency range for each frame of the second input signal;
    the energy value acquisition sub-unit is further configured to obtain the average energy value corresponding to each frame of the second input signal according to the sub-band energy values; and/or,
    the target probability acquisition sub-unit is configured to determine that the current frame is speech with a first probability when the average energy value is greater than a second set threshold;
    to determine, according to the average energy value, the second set threshold and a third set threshold, a second probability that the current frame is speech, when the average energy value is less than or equal to the second set threshold and greater than the third set threshold;
    and to determine that the current frame is speech with a third probability when the average energy value is less than or equal to the third set threshold;
    wherein the first probability, the second probability and the third probability are ranked in descending order;
    the target probability acquisition sub-unit is further configured to determine the target speech probability and the target non-speech probability corresponding to each frame of the second input signal according to the first probability, the second probability or the third probability.
  28. The detection system for a target user change in voice interaction according to claim 27, wherein the calculation formula by which the target probability acquisition sub-unit determines, according to the average energy value, the second set threshold and the third set threshold, the second probability that the current frame is speech is as follows:
    Prob_energy=(energy-A)/(B-A)
    wherein Prob_energy denotes the second probability, energy denotes the average energy value, A denotes the third set threshold, and B denotes the second set threshold; and/or,
    the detection system further comprises a wake-up module;
    the wake-up module is configured to wake up the recording device with a wake-up word before the recording device starts recording.
  29. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein, when the processor executes the computer program, the method for detecting a target user change in voice interaction according to any one of claims 1-14 is implemented.
  30. A computer-readable storage medium on which a computer program is stored, wherein, when the computer program is executed by a processor, the steps of the method for detecting a target user change in voice interaction according to any one of claims 1-14 are implemented.
PCT/CN2020/087744 2019-11-18 2020-04-29 Detection method and system for target user change, electronic device and storage medium WO2021098153A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911126595.4 2019-11-18
CN201911126595.4A CN110838296B (zh) 2019-11-18 Control method and system for recording process, electronic device and storage medium

Publications (1)

Publication Number Publication Date
WO2021098153A1 true WO2021098153A1 (zh) 2021-05-27

Family

ID=69576589

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/087744 WO2021098153A1 (zh) 2019-11-18 2020-04-29 Detection method and system for target user change, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN110838296B (zh)
WO (1) WO2021098153A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4080382A3 (en) * 2021-06-11 2022-11-30 Apollo Intelligent Connectivity (Beijing) Technology Co., Ltd. Method and apparatus for generating reminder audio, electronic device and storage medium

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110838296B (zh) 2019-11-18 2022-04-29 锐迪科微电子科技(上海)有限公司 Control method and system for recording process, electronic device and storage medium
CN111916076B (zh) * 2020-07-10 2024-06-07 北京搜狗智能科技有限公司 Recording method and apparatus, and electronic device
CN111968686B (zh) * 2020-08-06 2022-09-30 维沃移动通信有限公司 Recording method and apparatus, and electronic device
CN112579040B (zh) * 2020-12-25 2023-03-14 展讯半导体(成都)有限公司 Recording method for embedded device and related product
CN114582365B (zh) * 2022-05-05 2022-09-06 阿里巴巴(中国)有限公司 Audio processing method and apparatus, storage medium and electronic device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4720862A (en) * 1982-02-19 1988-01-19 Hitachi, Ltd. Method and apparatus for speech signal detection and classification of the detected signal into a voiced sound, an unvoiced sound and silence
CN108735205A (zh) * 2018-04-17 2018-11-02 上海康斐信息技术有限公司 Control method for smart speaker, and smart speaker
CN109087669A (zh) * 2018-10-23 2018-12-25 腾讯科技(深圳)有限公司 Audio similarity detection method and apparatus, storage medium, and computer device
CN109378002A (zh) * 2018-10-11 2019-02-22 平安科技(深圳)有限公司 Voiceprint verification method and apparatus, computer device, and storage medium
CN110415699A (zh) * 2019-08-30 2019-11-05 北京声智科技有限公司 Voice wake-up judgment method and apparatus, and electronic device
CN110428810A (zh) * 2019-08-30 2019-11-08 北京声智科技有限公司 Voice wake-up recognition method and apparatus, and electronic device
CN110838296A (zh) * 2019-11-18 2020-02-25 锐迪科微电子科技(上海)有限公司 Control method and system for recording process, electronic device and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006017900A (ja) * 2004-06-30 2006-01-19 Mitsubishi Electric Corp Time-stretch processing device
CN108346428B (zh) * 2017-09-13 2020-10-02 腾讯科技(深圳)有限公司 Voice activity detection and model establishment method, apparatus, device, and storage medium
CN108922541B (zh) * 2018-05-25 2023-06-02 南京邮电大学 Multi-dimensional feature parameter voiceprint recognition method based on DTW and GMM models
CN109065026B (zh) * 2018-09-14 2021-08-31 海信集团有限公司 Recording control method and apparatus
CN109473123B (zh) * 2018-12-05 2022-05-31 百度在线网络技术(北京)有限公司 Voice activity detection method and apparatus


Also Published As

Publication number Publication date
CN110838296B (zh) 2022-04-29
CN110838296A (zh) 2020-02-25

Similar Documents

Publication Publication Date Title
WO2021098153A1 (zh) Detection method and system for target user change, electronic device and storage medium
CN106940998B (zh) Execution method and apparatus for set operation
US20230409102A1 (en) Low-power keyword spotting system
EP3577645B1 (en) End of query detection
CN107767863B (zh) 语音唤醒方法、系统及智能终端
US10325598B2 (en) Speech recognition power management
CN110428810B (zh) 一种语音唤醒的识别方法、装置及电子设备
US9026444B2 (en) System and method for personalization of acoustic models for automatic speech recognition
US9437186B1 (en) Enhanced endpoint detection for speech recognition
CN105741838A (zh) 语音唤醒方法及装置
CN111880856A (zh) 语音唤醒方法、装置、电子设备及存储介质
EP3989217B1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
CN110223687B (zh) 指令执行方法、装置、存储介质及电子设备
US11670299B2 (en) Wakeword and acoustic event detection
CN114981887A (zh) 用于减少语音识别延迟的自适应帧批处理
CN114708856A (zh) 一种语音处理方法及其相关设备
JP7436757B2 (ja) 自動音声認識処理結果の減衰
CN112669818B (zh) 语音唤醒方法及装置、可读存储介质、电子设备
CN111179944A (zh) 语音唤醒及年龄检测方法、装置及计算机可读存储介质
CN116601598A (zh) 基于检测序列的热门短语触发
WO2023168713A1 (zh) Interactive speech signal processing method, related device and system
US11205433B2 (en) Method and apparatus for activating speech recognition
JP2023553994A (ja) ホットワード特性に基づいた自動音声認識パラメータの適応
CN115910049A (zh) Voiceprint-based voice control method and system, electronic device and storage medium
US20200310523A1 (en) User Request Detection and Execution

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20890620

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20890620

Country of ref document: EP

Kind code of ref document: A1
