CN108172242A

CN108172242A - An improved voice-interaction endpoint detection method for a Bluetooth smart cloud speaker

Info

Publication number: CN108172242A
Application number: CN201810014999.3A
Authority: CN
Inventors: 鲁霖
Original assignee: Shenzhen Xinzhongxin Technology Co Ltd
Current assignee: Shenzhen Xinzhongxin Technology Co Ltd
Priority date: 2018-01-08
Filing date: 2018-01-08
Publication date: 2018-06-15
Anticipated expiration: 2038-01-08
Also published as: CN108172242B

Abstract

The present invention relates to an improved voice-interaction endpoint detection method for a Bluetooth smart cloud speaker, involving a smart cloud speaker, a smart device, a data analysis and processing app, and a Bluetooth module. The smart device is a mobile phone, tablet computer, or the like, and includes the Bluetooth module and the data analysis and processing app; the smart cloud speaker includes a cloud server. The app is installed on the smart device. The Bluetooth module establishes an audio channel connection with the Bluetooth smart cloud speaker. Through the Bluetooth module, the app establishes a control-command connection with the speaker, enabling control-data exchange between the app and the speaker. The beneficial effect of the invention is that it solves the poor recognition rate and endpoint misjudgment caused by environmental differences in the related art, improving the efficiency and user experience of human-machine voice interaction.

Description

An improved voice-interaction endpoint detection method for a Bluetooth smart cloud speaker

Technical field

The present invention relates to the field of Bluetooth Low Energy technology applications, and in particular to an improved voice-interaction endpoint detection method for a Bluetooth smart cloud speaker.

Background technology

In the field of human-computer interaction, voice activity detection (VAD) is critical work: the quality of the algorithm directly determines, to a certain extent, the success or failure of the whole voice interaction system. For a complete voice interaction system, the final result depends not only on the recognition algorithm; many related factors directly affect whether the application succeeds. The purpose of endpoint detection is to distinguish speech from non-speech in the signal stream of a complex application environment and to determine the start and end of the speech signal. A good endpoint detection method can alleviate problems found in speech recognition software, such as unsatisfactory detection and low recognition rates. High-precision endpoint detection ensures that the input is a valid, complete speech signal, making recognition faster and more accurate.

The traditional endpoint detection method is a double-threshold comparison using short-time energy and zero-crossing rate. A first discrimination is made on the short-time energy of the audio, where a high threshold can be chosen for a coarse decision; a second discrimination is then made using the average zero-crossing rate. Although double-threshold endpoint detection is computationally cheap and achieves a good recognition rate in quiet environments, it has many shortcomings. For example, the threshold must be set from experience and is a fixed parameter. In continuous voice interaction, scenes that involve contextual pauses are also easily misjudged, which makes the human-machine interaction unsatisfactory.
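To make the fixed-threshold limitation concrete, the following is a minimal sketch of a simplified per-frame two-stage check using short-time energy and zero-crossing rate. It is not taken from the patent: the class name, method names, and all threshold parameters are hypothetical, and practical double-threshold detectors usually apply the two stages across sequences of frames rather than to a single frame.

```java
// Illustrative sketch only; names and thresholds are assumptions, not from the source.
final class DoubleThresholdVad {

    /** Sum of squared samples in one frame (short-time energy). */
    static long shortTimeEnergy(short[] frame) {
        long sum = 0;
        for (short s : frame) {
            sum += (long) s * s;
        }
        return sum;
    }

    /** Fraction of adjacent sample pairs whose signs differ. */
    static double zeroCrossingRate(short[] frame) {
        if (frame.length < 2) {
            return 0.0;
        }
        int crossings = 0;
        for (int i = 1; i < frame.length; i++) {
            boolean prevNonNegative = frame[i - 1] >= 0;
            boolean curNonNegative = frame[i] >= 0;
            if (prevNonNegative != curNonNegative) {
                crossings++;
            }
        }
        return (double) crossings / (frame.length - 1);
    }

    /**
     * Stage 1: energy at or above the high threshold counts as speech.
     * Stage 2: otherwise, energy at or above the low threshold counts as
     * speech only if the zero-crossing rate also reaches its threshold.
     * All three thresholds are fixed, which is the weakness described above.
     */
    static boolean isSpeech(short[] frame, long highEnergy, long lowEnergy, double zcrThreshold) {
        long energy = shortTimeEnergy(frame);
        if (energy >= highEnergy) {
            return true;
        }
        return energy >= lowEnergy && zeroCrossingRate(frame) >= zcrThreshold;
    }
}
```

Because the thresholds are constants chosen in advance, the same values are applied in a quiet room and in a noisy one, which is the behaviour the invention sets out to improve.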

Therefore, in everyday human-machine interaction, how to accurately detect the endpoint position of an audio signal is a problem that those skilled in the art urgently need to solve.

Summary of the invention

The technical problem to be solved by the invention is to provide an improved voice-interaction endpoint detection method for a Bluetooth smart cloud speaker that overcomes the poor recognition rate and endpoint misjudgment caused by environmental differences in the related art, and improves the efficiency and experience of human-machine voice interaction.

To solve the above technical problem, the invention provides an improved voice-interaction endpoint detection method for a Bluetooth smart cloud speaker, involving a smart cloud speaker, a smart device, a data analysis and processing app, and a Bluetooth module. The smart device is a mobile phone, tablet computer, or the like; the smart device includes the Bluetooth module and the data analysis and processing app; the smart cloud speaker includes a cloud server.

The data analysis and processing app is installed on the smart device.

The Bluetooth module establishes an audio channel connection with the Bluetooth smart cloud speaker.

As a further refinement, the app on the smart device establishes a control-command connection with the Bluetooth smart cloud speaker through the Bluetooth module, enabling control-data exchange between the app and the speaker.

As a further refinement, the app is normally in standby mode. When the smart device wakes up voice interaction, the app starts the Bluetooth module connection, begins recording, and acquires the audio signal, while also establishing a data transmission channel with the cloud server of the Bluetooth smart cloud speaker.

As a further refinement, the app sets a silence guard time whose length is agreed between the app and the cloud server. When voice interaction is woken up, there is a 3-second silent acquisition period even if nobody speaks, which prevents the whole system from deciding to stop before the user has had time to speak. In addition, operating the Bluetooth module's synchronous connection-oriented (SCO) link too frequently within a very short time can cause system-level exceptions; the silence guard time also limits how often the SCO link is operated within a very short time.

As a further refinement, the app on the smart device continuously extracts each frame of the audio signal; the app sets the duration of each audio frame to 10 ms.

As a further refinement, the app on the smartphone calculates the short-time energy of each audio frame. The short-time energy is calculated as: [formula not reproduced in the source text].

As a further refinement, the app on the smart device dynamically judges whether each audio frame is a speech frame. Short-time energy directly reflects the energy and amplitude of the speech signal, so voiced and unvoiced segments are distinguished according to short-time energy. The app dynamically tracks the maximum energy value over the current frame and all earlier audio frames. Whenever a later audio frame falls below the maximum frame energy multiplied by the threshold (M), that is, whenever the current short-time energy is small, the threshold is dynamically lowered. When the volume attenuation is too large, the frame is classified as a non-speech frame and non-speech counting starts. When 200 consecutive non-speech frames have been counted, equivalent to a 2-second pause, speech is deemed to have ended. If a speech frame arrives in between, the counter is reset and counting starts again.

The adaptive threshold is calculated as: [formula not reproduced in the source text].

As a further refinement, the app on the smart device makes the valid-endpoint decision.

As a further refinement, the app on the smart device notifies the cloud server that acquisition has ended, and speech recognition starts. Based on the end-of-capture result, the app stops recording and sends an acquisition-complete command to the cloud server, which starts speech recognition. In a large number of voice-interaction tests on the Bluetooth smart cloud speaker, the speech endpoint was judged accurately.

As a further refinement, the working steps of the improved voice-interaction endpoint detection method for a Bluetooth smart cloud speaker are:

A. The data analysis and processing app on the smart device establishes a connection with the Bluetooth smart cloud speaker;

B. The smart device wakes up voice interaction;

C. The app on the smart device starts the silence guard time counter;

D. The app on the smart device continuously extracts each frame of the audio signal;

E. The app on the smart device calculates the short-time energy of each audio frame;

F. The app on the smart device dynamically judges whether each audio frame is a speech frame;

H. The app on the smart device makes the valid-endpoint decision;

I. The app on the smart device notifies the cloud server that acquisition has ended, and speech recognition starts.

With the above technical solution, the beneficial effects of the invention are as follows:

Compared with the prior art, the solution provides an improved voice-interaction endpoint detection method for a Bluetooth smart cloud speaker that solves the poor recognition rate and endpoint misjudgment caused by environmental differences in the related art, and improves the efficiency and user experience of human-machine voice interaction.

Description of the drawings

Fig. 1 is a working module diagram of the improved voice-interaction endpoint detection method for a Bluetooth smart cloud speaker.

Fig. 2 is a work flow diagram of the improved voice-interaction endpoint detection method for a Bluetooth smart cloud speaker.

Specific embodiment

The invention is described in detail below with reference to Fig. 1, Fig. 2, and specific embodiments, which are not to be taken as limiting the invention.

As shown in Fig. 1 and Fig. 2, an improved voice-interaction endpoint detection method for a Bluetooth smart cloud speaker involves a smart cloud speaker, a smart device, a data analysis and processing app, and a Bluetooth module. The smart device is a mobile phone, tablet computer, or the like; the smart device includes the Bluetooth module and the app; the smart cloud speaker includes a cloud server. The app is installed on the smart device. The Bluetooth module establishes an audio channel connection with the Bluetooth smart cloud speaker. Through the Bluetooth module, the app establishes a control-command connection with the speaker, enabling control-data exchange between the app and the speaker.

The app is normally in standby mode. When the smart device wakes up voice interaction, the app starts the Bluetooth module connection, begins recording, and acquires the audio signal, while also establishing a data transmission channel with the cloud server of the Bluetooth smart cloud speaker. The app sets a silence guard time whose length is agreed between the app and the cloud server. When voice interaction is woken up, there is a 3-second silent acquisition period even if nobody speaks, which prevents the whole system from deciding to stop before the user has had time to speak. In addition, operating the Bluetooth module's synchronous connection-oriented (SCO) link too frequently within a very short time can cause system-level exceptions; the silence guard time also limits how often the SCO link is operated within a very short time.

The app on the smart device continuously extracts each frame of the audio signal and sets the duration of each audio frame to 10 ms. The app on the smartphone calculates the short-time energy of each audio frame; the short-time energy is calculated as: [formula not reproduced in the source text].

The app on the smart device dynamically judges whether each audio frame is a speech frame. Short-time energy directly reflects the energy and amplitude of the speech signal, so voiced and unvoiced segments are distinguished according to short-time energy. The app dynamically tracks the maximum energy value over the current frame and all earlier audio frames. Whenever a later audio frame falls below the maximum frame energy multiplied by the threshold (M), that is, whenever the current short-time energy is small, the threshold is dynamically lowered. When the volume attenuation is too large, the frame is classified as a non-speech frame and non-speech counting starts. When 200 consecutive non-speech frames have been counted, equivalent to a 2-second pause, speech is deemed to have ended. If a speech frame arrives in between, the counter is reset and counting starts again.

The adaptive threshold is calculated as: [formula not reproduced in the source text].

The app on the smart device makes the valid-endpoint decision. The app then notifies the cloud server that acquisition has ended, and speech recognition starts. Based on the end-of-capture result, the app stops recording and sends an acquisition-complete command to the cloud server, which starts speech recognition. In a large number of voice-interaction tests on the Bluetooth smart cloud speaker, the speech endpoint was judged accurately.

The working steps of the improved voice-interaction endpoint detection method for a Bluetooth smart cloud speaker are:

B. The smart device wakes up voice interaction;

In embodiments of the present invention:

S101: The data analysis and processing app on the smart device establishes a connection with the Bluetooth smart cloud speaker.

First, the audio channel connection is established between the Bluetooth module in the phone system and the Bluetooth smart cloud speaker. Then the app on the smart device establishes the control-command connection with the speaker. To ensure good compatibility, the Android version connects to the device over an SPP channel, while the iOS version establishes a BLE channel connection. Either connection enables control-data exchange between the app and the Bluetooth smart cloud speaker.

S102: The smart device wakes up voice interaction.

The app is normally in standby mode. Only when the device wakes up voice interaction does it start the Bluetooth SCO connection, begin recording, and acquire the audio signal, while also establishing a data transmission channel with the cloud server.

S103: The data analysis and processing app on the smart device starts the silence guard time counter.

To give the user a better experience and to keep the system stable, the app sets a silence guard time whose specific duration is agreed with the cloud server. When voice interaction is woken up, there is a 3-second silent acquisition period even if nobody speaks, which prevents the whole system from deciding to stop before the user has had time to speak. Furthermore, operating the Bluetooth SCO link too frequently within a very short time can cause system-level exceptions.
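The guard-time behaviour can be sketched as a small timer that suppresses any stop decision until the guard window has passed. This is an illustration under assumptions: the class and method names are hypothetical, timestamps are passed in explicitly so that the logic can be exercised without a real clock, and the 3000 ms value in the usage note simply mirrors the 3-second period stated above.

```java
// Illustrative sketch only; names are assumptions, not from the source.
final class SilenceGuard {
    private final long startMillis;
    private final long guardMillis;

    SilenceGuard(long startMillis, long guardMillis) {
        this.startMillis = startMillis;
        this.guardMillis = guardMillis;
    }

    /** True once the guard window has fully elapsed. */
    boolean hasElapsed(long nowMillis) {
        return nowMillis - startMillis >= guardMillis;
    }

    /** A detected pause may end capture only after the guard window. */
    boolean mayStop(boolean pausing, long nowMillis) {
        return pausing && hasElapsed(nowMillis);
    }
}
```

For example, a guard created with `new SilenceGuard(wakeTime, 3000)` would ignore a pause reported one second after wake-up and accept one reported three seconds or more after wake-up.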

S104: The data analysis and processing app on the smart device continuously extracts each frame of the audio signal.

An audio signal is non-stationary and time-varying. To obtain more accurate calculation results, it is treated as stationary and time-invariant over a "short time" range. For this interval, the app generally sets the duration of each audio frame to 10 ms.
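As a rough illustration of 10 ms framing, the sketch below converts a frame duration into a sample count and cuts a PCM buffer into consecutive frames. The source does not state a sample rate, so the 16 kHz figure in the usage note is an assumption, as are the class and method names.

```java
// Illustrative sketch only; names and the sample rate are assumptions.
final class Framer {

    /** Number of samples in one frame of the given duration. */
    static int samplesPerFrame(int sampleRateHz, int frameMillis) {
        return sampleRateHz * frameMillis / 1000;
    }

    /** Splits PCM into consecutive non-overlapping frames; a trailing partial frame is dropped. */
    static short[][] split(short[] pcm, int frameLength) {
        if (frameLength <= 0) {
            throw new IllegalArgumentException("frameLength must be positive");
        }
        int count = pcm.length / frameLength;
        short[][] frames = new short[count][];
        for (int i = 0; i < count; i++) {
            frames[i] = java.util.Arrays.copyOfRange(pcm, i * frameLength, (i + 1) * frameLength);
        }
        return frames;
    }
}
```

At an assumed 16 kHz sample rate, `Framer.samplesPerFrame(16000, 10)` gives 160 samples per 10 ms frame.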

S105: The data analysis and processing app on the smart device calculates the short-time energy of each audio frame.

The short-time energy is calculated as:

[formula not reproduced in the source text]

where the omitted symbol denotes the energy value of the m-th sample point in the i-th frame.
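The formula itself did not survive in this text. As a hedged reconstruction, consistent with the sample code that follows (which sums squared samples without normalization), the conventional definition of short-time energy would read as below. The symbols $E_i$, $x_i(m)$, and $N$ are assumed names, not taken from the source.

```latex
E_i = \sum_{m=1}^{N} x_i(m)^2
```

Under this reading, $x_i(m)$ is the m-th sample of frame $i$ and $N$ is the number of samples per frame.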

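According to the short-time energy formula, sample APP code is as follows:

    // Sum of squared 16-bit samples in the window of `span` bytes ending at `end`.
    // Despite the name, no mean or square root is taken.
    // Assumes mRecording holds 16-bit PCM as consecutive byte pairs.
    private long getRms(int end, int span) {
        int begin = end - span;
        if (begin < 0) {
            begin = 0;
        }
        // Align the start to a two-byte sample boundary.
        if (begin % 2 != 0) {
            begin++;
        }
        long sum = 0;
        // Stop while a complete two-byte sample is still available.
        for (int i = begin; i + 1 < end; i += 2) {
            short curSample = getShort(this.mRecording[i], this.mRecording[i + 1]);
            sum += (long) curSample * curSample;
        }
        return sum;
    }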
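The listing above calls a `getShort` helper that the source does not show. One plausible form is sketched below, assuming the recording buffer stores 16-bit PCM with the low byte first (little-endian); the byte order, the class name, and the `sumOfSquares` companion method are assumptions made for illustration.

```java
// Illustrative sketch only; byte order and names are assumptions, not from the source.
final class PcmBytes {

    /** Combines two bytes (low byte first) into one signed 16-bit sample. */
    static short getShort(byte low, byte high) {
        return (short) ((low & 0xFF) | (high << 8));
    }

    /** Sum of squared samples over the byte range [begin, end). */
    static long sumOfSquares(byte[] pcm, int begin, int end) {
        long sum = 0;
        for (int i = begin; i + 1 < end; i += 2) {
            short s = getShort(pcm[i], pcm[i + 1]);
            sum += (long) s * s;
        }
        return sum;
    }
}
```

Masking the low byte with `0xFF` prevents its sign bit from corrupting the high byte when the two halves are combined.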

S106: The data analysis and processing app on the smart device dynamically judges whether each audio frame is a speech frame.

Short-time energy directly reflects the energy and amplitude of the speech signal, so it can be used to distinguish voiced segments from unvoiced segments. The app dynamically tracks the maximum energy value over the current frame and all earlier audio frames. Whenever a later audio frame falls below the maximum frame energy multiplied by the threshold (M), that is, whenever the current short-time energy is small, the threshold is dynamically lowered. When the volume attenuation is too large, the frame is classified as a non-speech frame and non-speech counting starts. When 200 consecutive non-speech frames have been counted, equivalent to a 2-second pause, speech is deemed to have ended. If a speech frame arrives in between, the counter is reset and counting starts again.

Adaptive threshold: [formula not reproduced in the source text]
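The threshold formula is missing from this text. Judging only from the sample code that follows, the per-frame test appears to compare the current energy against a fraction $M$ of the running maximum. The notation below is assumed, including the range $0 < M < 1$, and the rule by which $M$ is adjusted cannot be recovered from the source.

```latex
E_{\max}(i) = \max_{j \le i} E_j, \qquad
\text{frame } i \text{ is non-speech if } E_i < M \cdot E_{\max}(i)
```

With 10 ms frames, 200 consecutive non-speech frames correspond to $200 \times 10\,\text{ms} = 2\,\text{s}$, which matches the 2-second pause stated above.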

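Sample APP code is as follows:

    private static final int RMS_COUNT_MAX = 200; // 2s

    // Returns true once the energy has stayed below M * highestRMS
    // for more than RMS_COUNT_MAX consecutive calls.
    public boolean isPausing() {
        long rms = getRms(this.mRecordedLength, this.mOneSec);
        if (rms > this.highestRMS) {
            // New maximum energy: treat as speech and restart the count.
            this.highestRMS = rms;
            this.rmsCount = 0;
            return false;
        } else if (((double) rms) < M * ((double) this.highestRMS)) {
            // Energy fell well below the maximum: count a non-speech frame.
            if (this.rmsCount < RMS_COUNT_MAX) {
                this.rmsCount++;
                return false;
            } else {
                this.rmsCount = 0;
                return true;
            }
        } else {
            // Speech frame in between: reset the counter.
            this.rmsCount = 0;
            return false;
        }
    }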
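The listing above depends on recorder fields that the source does not show. The following self-contained sketch applies the same max-tracking and counting idea to a stream of per-frame energies so that it can be run off-device. The class name, constructor parameters, and the fixed ratio are hypothetical; the dynamic lowering of the threshold described in the text is not modelled, because its formula is not available. Unlike the listing above, which reports the pause one call after the count reaches its limit, this variant reports it on the frame that reaches the limit.

```java
// Illustrative sketch only; names and the fixed ratio are assumptions, not from the source.
final class PauseDetector {
    private final double ratio;       // plays the role of M
    private final int maxQuietFrames; // e.g. 200 frames of 10 ms for a 2 s pause
    private long highestEnergy = 0;
    private int quietFrames = 0;

    PauseDetector(double ratio, int maxQuietFrames) {
        this.ratio = ratio;
        this.maxQuietFrames = maxQuietFrames;
    }

    /** Feeds one frame energy; returns true when the quiet run reaches the limit. */
    boolean onFrame(long energy) {
        if (energy > highestEnergy) {
            // New running maximum: speech, so restart the quiet count.
            highestEnergy = energy;
            quietFrames = 0;
            return false;
        }
        if ((double) energy < ratio * (double) highestEnergy) {
            // Quiet frame relative to the running maximum.
            quietFrames++;
            if (quietFrames >= maxQuietFrames) {
                quietFrames = 0;
                return true;
            }
            return false;
        }
        // Loud enough to count as speech: reset the quiet count.
        quietFrames = 0;
        return false;
    }
}
```

Feeding it the energies 100, 10, 10, 80, 10, 10, 10 with a ratio of 0.5 and a limit of 3 reports a pause only on the last frame, because the frame with energy 80 resets the count.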

S107: The data analysis and processing app on the smart device makes the valid-endpoint decision.

Speech endpoint judgment in human-machine interaction is constrained by several factors, such as the 3-second silence guard time, the locally improved short-time-energy endpoint detection, and the stop-acquisition instruction issued by the cloud.

Sample APP code is as follows:

    // Poll the recorder every 10 ms until a valid endpoint is found.
    while (recorder != null && recorder.getState() == AudioRecorder.State.RECORDING) {
        boolean pausing = recorder.isPausing();
        // Stop only when a pause is detected and the guard duration has been reached.
        if (pausing && mRecordDurationReached) {
            if (mBtDeviceSpeechType == BT_DEVICE_SPEECH_RECOGNITION) {
                mBtDeviceSpeechType = BT_DEVICE_SPEECH_RECOGNITION_NONE;
                stopBluetoothSCO();
            }
            stopListening(true);
            break;
        }
        try {
            Thread.sleep(10);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
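The three constraints named in S107 can be combined in a single decision function, sketched below. The source does not say how the cloud's stop instruction interacts with the guard time, so letting it take effect immediately is an assumption, as are the class, enum, and parameter names.

```java
// Illustrative sketch only; names and the precedence of the cloud stop are assumptions.
final class EndpointDecision {

    enum Reason { NONE, LOCAL_PAUSE, CLOUD_STOP }

    /**
     * guardElapsed: the silence guard time has passed.
     * localPause:   the local short-time-energy detector reports a pause.
     * cloudStop:    the cloud has issued a stop-acquisition instruction.
     */
    static Reason decide(boolean guardElapsed, boolean localPause, boolean cloudStop) {
        if (cloudStop) {
            return Reason.CLOUD_STOP; // assumed to override the guard time
        }
        if (guardElapsed && localPause) {
            return Reason.LOCAL_PAUSE;
        }
        return Reason.NONE;
    }
}
```

Returning a reason rather than a plain boolean lets the caller tell a locally detected endpoint apart from one imposed by the cloud.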

S108: The data analysis and processing app on the smart device notifies the cloud that acquisition has ended, and speech recognition starts.

Based on the end-of-capture result, the app stops recording and sends an acquisition-complete command to the cloud, after which speech recognition can start. In a large number of voice-interaction tests on the Bluetooth smart cloud speaker, the speech endpoint was judged essentially accurately. The transmission and processing of non-speech frames are greatly reduced, which improves efficiency and user experience.

As is known from common technical knowledge, the technical solution can be implemented through other embodiments that do not depart from its spirit or essential features. The embodiments disclosed above are therefore, in all respects, merely illustrative and not exclusive. All changes within the scope of the invention, or within a scope equivalent to it, are encompassed by the invention.

Claims

1. An improved voice-interaction endpoint detection method for a Bluetooth smart cloud speaker, involving a smart cloud speaker, a smart device, a data analysis and processing app, and a Bluetooth module, characterized in that: the smart device is a mobile phone, tablet computer, or the like; the smart device includes the Bluetooth module and the data analysis and processing app; the smart cloud speaker includes a cloud server; the app is installed on the smart device; the Bluetooth module establishes an audio channel connection with the Bluetooth smart cloud speaker; and the app on the smart device establishes a control-command connection with the Bluetooth smart cloud speaker through the Bluetooth module, enabling control-data exchange between the app and the speaker.

2. The improved voice-interaction endpoint detection method for a Bluetooth smart cloud speaker according to claim 1, characterized in that the working steps of the method are:

B. The smart device wakes up voice interaction;

3. The improved voice-interaction endpoint detection method for a Bluetooth smart cloud speaker according to claim 2, characterized in that: the app is normally in standby mode; when the smart device wakes up voice interaction, the app starts the Bluetooth module connection, begins recording, and acquires the audio signal, while also establishing a data transmission channel with the cloud server of the Bluetooth smart cloud speaker.

4. The improved voice-interaction endpoint detection method for a Bluetooth smart cloud speaker according to claim 2, characterized in that: the app sets a silence guard time whose length is agreed between the app and the cloud server; when voice interaction is woken up, there is a 3-second silent acquisition period even if nobody speaks, which prevents the whole system from deciding to stop before the user has had time to speak; in addition, operating the Bluetooth module's synchronous connection-oriented (SCO) link too frequently within a very short time can cause system-level exceptions, and the silence guard time limits how often the SCO link is operated within a very short time.

5. The improved voice-interaction endpoint detection method for a Bluetooth smart cloud speaker according to claim 2, characterized in that: the app on the smart device continuously extracts each frame of the audio signal, and the app sets the duration of each audio frame to 10 ms.

6. The improved voice-interaction endpoint detection method for a Bluetooth smart cloud speaker according to claim 2, characterized in that: the app on the smartphone calculates the short-time energy of each audio frame, the short-time energy being calculated as: [formula not reproduced in the source text].

7. The improved voice-interaction endpoint detection method for a Bluetooth smart cloud speaker according to claim 2, characterized in that: the app on the smart device dynamically judges whether each audio frame is a speech frame; short-time energy directly reflects the energy and amplitude of the speech signal, and voiced and unvoiced segments are distinguished according to short-time energy; the app dynamically tracks the maximum energy value over the current frame and all earlier audio frames; whenever a later audio frame falls below the maximum frame energy multiplied by the threshold (M), that is, whenever the current short-time energy is small, the threshold is dynamically lowered; when the volume attenuation is too large, the frame is classified as a non-speech frame and non-speech counting starts; when 200 consecutive non-speech frames have been counted, equivalent to a 2-second pause, speech is deemed to have ended; if a speech frame arrives in between, the counter is reset and counting starts again; the adaptive threshold is calculated as: [formula not reproduced in the source text].

8. The improved voice-interaction endpoint detection method for a Bluetooth smart cloud speaker according to claim 2, characterized in that: the app on the smart device makes the valid-endpoint decision.

9. The improved voice-interaction endpoint detection method for a Bluetooth smart cloud speaker according to claim 2, characterized in that: the app on the smart device makes the valid-endpoint decision; the app notifies the cloud server that acquisition has ended, and speech recognition starts; based on the end-of-capture result, the app stops recording and sends an acquisition-complete command to the cloud server, which starts speech recognition; and in a large number of voice-interaction tests on the Bluetooth smart cloud speaker, the speech endpoint is judged accurately.