CN109859744B - Voice endpoint detection method applied to range hood

Info

Publication number: CN109859744B
Application number: CN201711229316.8A
Authority: CN
Inventors: 杜杉杉; 茅忠群; 诸永定; 方献良
Original assignee: Ningbo Fotile Kitchen Ware Co Ltd
Current assignee: Ningbo Fotile Kitchen Ware Co Ltd
Priority date: 2017-11-29
Filing date: 2017-11-29
Publication date: 2021-01-19
Anticipated expiration: 2037-11-29
Also published as: CN109859744A

Abstract

The invention relates to a voice endpoint detection method applied to a range hood, which comprises the following steps: initializing the working gear number of the range hood; initializing a first short-time energy threshold value array, a second short-time energy threshold value array and a short-time zero-crossing rate threshold value array when the range hood works at each working gear; acquiring a current working gear; collecting and acquiring a short-time signal frame of a voice signal; calculating the short-time energy and the short-time zero crossing rate of each short-time signal frame in the voice signal; calculating first start-stop time coordinate data of the voice signal according to a first energy threshold value corresponding to the current working gear of the range hood; calculating second start-stop time coordinate data of the voice signal according to a second energy threshold value corresponding to the current working gear of the range hood; and calculating third start-stop time coordinate data of the voice signal as the start-stop time coordinate of the voice signal according to the short-time zero-crossing rate threshold value corresponding to the current working gear of the range hood.

Description

Voice endpoint detection method applied to range hood

Technical Field

The invention relates to the technical field of range hoods, in particular to a voice endpoint detection method applied to a range hood.

Background

With the continuous development of intelligent technology, speech recognition has become widespread and has begun to appear in everyday appliances. Examples include the Chinese utility model patent with grant publication number CN205208686U (application number 201521083692.7), 'A voice input control range hood'; the Chinese utility model patent with grant publication number CN206113052U (application number 201620882578.9), 'An intelligent range hood based on intelligent cloud system control'; and the Chinese utility model patent with grant publication number CN206556088U (application number 201621294280.2), 'A range hood with menu broadcast system'. The range hoods disclosed in these patents all use speech recognition and control the range hood automatically according to the recognized speech, making the range hood more convenient and user-friendly to operate.

Speech recognition, however, requires removing noise from the speech signal. A range hood is noisy in operation, and how well the fan noise is handled directly affects recognition accuracy. Moreover, the noise characteristics differ from one working gear to another, so the problem to be solved is how to improve the speech recognition capability of the range hood at its different working gears.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a voice endpoint detection method applied to a range hood that helps reduce the probability of noise being misrecognized as voice at the different working gears of the range hood and reduces the amount of data stored during voice recognition.

The technical scheme adopted by the invention for solving the problems is as follows: a voice endpoint detection method applied to a range hood is characterized in that: the method comprises the following steps:

S1, initializing the number of working gears $s$ of the range hood;

initializing a first short-time energy threshold array and a second short-time energy threshold array for when the range hood works at each working gear; the first short-time energy threshold array is $[T_h(1), T_h(2), T_h(3), \ldots, T_h(i), \ldots, T_h(s)]$; the second short-time energy threshold array is $[T_l(1), T_l(2), T_l(3), \ldots, T_l(i), \ldots, T_l(s)]$, where $i$ is a natural number, $1 \le i \le s$, and $T_l(i) < T_h(i)$;

initializing a short-time zero-crossing rate threshold array for when the range hood works at each working gear:

$[T_z(1), T_z(2), T_z(3), \ldots, T_z(i), \ldots, T_z(s)]$;

S2, acquiring the current working gear data $i$ of the range hood;

S3, collecting a voice signal, and performing pre-emphasis, framing and windowing on the collected voice signal to obtain short-time signal frames of the voice signal;

S4, calculating the short-time energy and the short-time zero-crossing rate of each short-time signal frame of the voice signal, thereby obtaining the short-time energy of the voice signal as a function of time and the short-time zero-crossing rate of the voice signal as a function of time;

S5, calculating first start-stop time coordinate data $(a, b)$ of the collected voice signal according to the first energy threshold $T_h(i)$ corresponding to the current working gear $i$ of the range hood;

S6, calculating second start-stop time coordinate data $(A, B)$ of the collected voice signal according to the second energy threshold $T_l(i)$ corresponding to the current working gear $i$ of the range hood;

S7, calculating third start-stop time coordinate data $(A_0, B_0)$ of the collected voice signal according to the short-time zero-crossing rate threshold $T_z(i)$ corresponding to the current working gear $i$ of the range hood;

S8, obtaining the start-stop time coordinates of the voice signal as $(A_0, B_0)$.
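Step S4 can be illustrated with a minimal Python sketch. The patent only says that the features are computed with existing formulas, so the definitions used here are an assumption: the common textbook forms, namely the sum of squared samples for short-time energy and the fraction of sign changes between adjacent samples for the zero-crossing rate. The function names and the toy frames are hypothetical as well.

```python
def short_time_energy(frame):
    """Short-time energy of one frame: sum of squared samples (assumed definition)."""
    return sum(x * x for x in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (assumed definition)."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1
        for prev, cur in zip(frame, frame[1:])
        if (prev >= 0) != (cur >= 0)
    )
    return crossings / (len(frame) - 1)


# Toy frames (illustrative values only): silence, an alternating signal, a constant.
frames = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, -1.0, 1.0, -1.0],
    [0.5, 0.5, 0.5, 0.5],
]

# One value per frame gives each feature as a function of time (frame index).
energy_curve = [short_time_energy(f) for f in frames]
zcr_curve = [zero_crossing_rate(f) for f in frames]

print(energy_curve)  # [0.0, 4.0, 1.0]
print(zcr_curve)     # [0.0, 1.0, 0.0]
```

The two lists play the role of the energy-versus-time and zero-crossing-rate-versus-time relations that steps S5 to S7 then compare against the per-gear thresholds.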

To shorten the processing time, when the second start-stop time coordinate data $(A, B)$ is acquired in S6, the search proceeds to the left from the start time coordinate $a$ of the first start-stop time coordinate data $(a, b)$ to obtain the start time coordinate $A$ of the second start-stop time, and to the right from the end time coordinate $b$ of the first start-stop time coordinate data $(a, b)$ to obtain the end time coordinate $B$ of the second start-stop time.

To shorten the processing time, when the third start-stop time coordinate data $(A_0, B_0)$ is acquired in S7, the search proceeds to the left from the start time coordinate $A$ of the second start-stop time coordinate data $(A, B)$ to obtain the start time coordinate $A_0$ of the third start-stop time, and to the right from the end time coordinate $B$ of the second start-stop time coordinate data $(A, B)$ to obtain the end time coordinate $B_0$ of the third start-stop time.

As an improvement, the method for acquiring the first short-time energy threshold value array, the second short-time energy threshold value array and the short-time zero-crossing rate threshold value array comprises the following steps:

collecting the noise signal of the range hood operating at each working gear, calculating the short-time energy average of the noise signal at each working gear, and forming the short-time energy average array of the noise signal, whose $i$-th element represents the short-time energy average of the noise signal when the range hood works at gear $i$;

at the same time, calculating the short-time zero-crossing rate average of the noise signal at each working gear, and forming the short-time zero-crossing rate average array of the noise signal, whose $i$-th element represents the short-time zero-crossing rate average of the noise signal when the range hood works at gear $i$;

collecting voice signals while the range hood works at each working gear, calculating the short-time energy average of the voice signal at each working gear, and forming the short-time energy average array of the voice signal, whose $i$-th element represents the short-time energy average of the voice signal when the range hood works at gear $i$;

at the same time, calculating the short-time zero-crossing rate average of the voice signal at each working gear, and forming the short-time zero-crossing rate average array of the voice signal, whose $i$-th element represents the short-time zero-crossing rate average of the voice signal when the range hood works at gear $i$;

calculating the first short-time energy threshold $T_h(i)$ of the range hood at each working gear, using a coefficient $\alpha$ with $0 < \alpha < 1$, thereby obtaining the first short-time energy threshold array $[T_h(1), T_h(2), T_h(3), \ldots, T_h(i), \ldots, T_h(s)]$ of the range hood;

calculating the second short-time energy threshold $T_l(i)$ of the range hood at each working gear, using a coefficient $\beta$ with $0 < \beta < 1$ and with $T_h(i) > T_l(i)$, thereby obtaining the second short-time energy threshold array $[T_l(1), T_l(2), T_l(3), \ldots, T_l(i), \ldots, T_l(s)]$ of the range hood;

calculating the short-time zero-crossing rate threshold $T_z(i)$ of the range hood at each working gear, thereby obtaining the short-time zero-crossing rate threshold array $[T_z(1), T_z(2), T_z(3), \ldots, T_z(i), \ldots, T_z(s)]$ of the range hood.

Compared with the prior art, the invention has the following advantages: the voice endpoint detection method applied to the range hood performs voice endpoint detection with different thresholds for different working gears, so the detection result is more accurate and the influence of the differing noise characteristics of the working gears on the detection result is effectively eliminated. This reduces the probability that noise is mistaken for voice in a noisy environment, reduces the amount of data stored in the subsequent voice recognition process, and increases the speed of voice recognition. In addition, the method places low demands on hardware and is therefore suitable for a range hood, an application environment with weak hardware performance.

Drawings

Fig. 1 is a flowchart of a voice endpoint detection method applied to a range hood in the embodiment of the present invention.

Detailed Description

The invention is described in further detail below with reference to the accompanying examples.

During operation of the range hood, the fan noise increases as the gear is raised, so its effect on voice recognition accuracy differs from gear to gear. The noise in the user's kitchen comes mainly from the range hood, and because the structure of the range hood and the position of the fan are fixed, the noise of the range hood is relatively stable at a given gear. Performing voice recognition in a way that is tailored to the gear information of the range hood can therefore effectively improve the recognition rate at each working gear.

As shown in fig. 1, the voice endpoint detection method applied to the range hood in the embodiment includes the following steps:

S1, initializing the number of working gears $s$ of the range hood, where $s$ is a natural number in this embodiment. The number of working gears $s$ is stored in the control chip when the range hood leaves the factory, so that the chip can identify the working gear of the range hood.

Initializing a first short-time energy threshold array and a second short-time energy threshold array for when the range hood works at each working gear; the first short-time energy threshold array is $[T_h(1), T_h(2), T_h(3), \ldots, T_h(i), \ldots, T_h(s)]$; the second short-time energy threshold array is $[T_l(1), T_l(2), T_l(3), \ldots, T_l(i), \ldots, T_l(s)]$, where $i$ is a natural number, $1 \le i \le s$, and $T_l(i) < T_h(i)$. Initializing a short-time zero-crossing rate threshold array for when the range hood works at each working gear:

$[T_z(1), T_z(2), T_z(3), \ldots, T_z(i), \ldots, T_z(s)]$.

The first short-time energy threshold array, the second short-time energy threshold array and the short-time zero-crossing rate threshold array can be obtained by testing in a laboratory environment before the range hood leaves the factory.
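The three per-gear arrays initialized in S1 can be held in a small lookup table keyed by gear. The following Python sketch is only one possible layout: the class name `GearThresholds`, the helper `build_threshold_table` and all numeric values are hypothetical and are not taken from the patent. It enforces the stated constraint $T_l(i) < T_h(i)$ when the table is built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GearThresholds:
    """Thresholds for one working gear i (field names are hypothetical)."""
    t_high: float  # first short-time energy threshold T_h(i)
    t_low: float   # second short-time energy threshold T_l(i)
    t_zcr: float   # short-time zero-crossing rate threshold T_z(i)


def build_threshold_table(t_high, t_low, t_zcr):
    """Combine the three arrays of length s into a table keyed by gear 1..s."""
    if not (len(t_high) == len(t_low) == len(t_zcr)):
        raise ValueError("all threshold arrays must have one entry per gear")
    table = {}
    for gear, (high, low, zcr) in enumerate(zip(t_high, t_low, t_zcr), start=1):
        if not low < high:
            raise ValueError(f"gear {gear}: T_l must be smaller than T_h")
        table[gear] = GearThresholds(high, low, zcr)
    return table


# Illustrative values for a three-gear hood (s = 3); not measured data.
table = build_threshold_table(
    t_high=[8.0, 12.0, 20.0],
    t_low=[3.0, 5.0, 9.0],
    t_zcr=[0.30, 0.35, 0.40],
)

# Once the current gear i is known (step S2), selecting thresholds is a lookup.
current = table[2]
print(current.t_high, current.t_low, current.t_zcr)  # 12.0 5.0 0.35
```

Because the table is filled once, the run-time cost of adapting to a gear change is a single dictionary lookup, which fits the low hardware demands the method aims for.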

The specific acquisition method is as follows. In a laboratory environment, the range hood is set to run at each working gear in turn, and a speech processing chip collects and processes the operating noise of the range hood at each gear: the chip samples and quantizes the noise signal, applies pre-emphasis, and then performs framing and windowing. Finally, the short-time energy average of the noise signal at each working gear is calculated with the existing calculation formula, and these averages form the short-time energy average array of the noise signal, whose $i$-th element represents the short-time energy average of the noise signal when the range hood works at gear $i$.

At the same time, the short-time zero-crossing rate average of the noise signal at each working gear is calculated with the existing calculation formula, and these averages form the short-time zero-crossing rate average array of the noise signal, whose $i$-th element represents the short-time zero-crossing rate average of the noise signal when the range hood works at gear $i$.

Also in the laboratory environment, the range hood is set to run at each working gear in turn while a standard test voice is sent to the control chip of the range hood. The same speech processing chip as above collects and processes the voice signal at each gear: the chip samples and quantizes the voice signal, applies pre-emphasis, and then performs framing and windowing. Finally, the short-time energy average of the voice signal at each working gear is calculated with the existing calculation formula, and these averages form the short-time energy average array of the voice signal, whose $i$-th element represents the short-time energy average of the voice signal when the range hood works at gear $i$.

At the same time, the short-time zero-crossing rate average of the voice signal at each working gear is calculated with the existing calculation formula, and these averages form the short-time zero-crossing rate average array of the voice signal, whose $i$-th element represents the short-time zero-crossing rate average of the voice signal when the range hood works at gear $i$.
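The laboratory calibration described above amounts to averaging two per-frame features over every recording, gear by gear. A minimal Python sketch, assuming the usual textbook feature definitions (sum of squares, fraction of sign changes) and hypothetical names `average_features` and `calibrate`; the toy frames stand in for real noise recordings and are not measured data:

```python
def frame_energy(frame):
    """Short-time energy: sum of squared samples (assumed definition)."""
    return sum(x * x for x in frame)


def frame_zcr(frame):
    """Zero-crossing rate: fraction of adjacent pairs that change sign (assumed)."""
    pairs = list(zip(frame, frame[1:]))
    if not pairs:
        return 0.0
    return sum(1 for p, c in pairs if (p >= 0) != (c >= 0)) / len(pairs)


def average_features(frames):
    """Mean short-time energy and mean zero-crossing rate over all frames."""
    if not frames:
        raise ValueError("at least one frame is required")
    mean_energy = sum(frame_energy(f) for f in frames) / len(frames)
    mean_zcr = sum(frame_zcr(f) for f in frames) / len(frames)
    return mean_energy, mean_zcr


def calibrate(frames_by_gear):
    """Return (energy averages, ZCR averages), each ordered by gear 1..s."""
    features = [average_features(frames_by_gear[g]) for g in sorted(frames_by_gear)]
    return [e for e, _ in features], [z for _, z in features]


# Toy "noise recordings" for two gears; gear 2 is louder than gear 1.
noise_frames = {
    1: [[0.1, -0.1, 0.1, -0.1], [0.1, 0.1, -0.1, -0.1]],
    2: [[0.2, -0.2, 0.2, -0.2], [0.2, -0.2, 0.2, -0.2]],
}
noise_energy_avg, noise_zcr_avg = calibrate(noise_frames)
print(noise_energy_avg)
print(noise_zcr_avg)
```

The same `calibrate` call applied to frames of the standard test voice would give the corresponding voice arrays; how the averages are combined into thresholds is left out here because the source text does not reproduce those formulas.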

The first short-time energy threshold of the range hood at each working gear is then calculated, where $0 < \alpha < 1$, giving the first short-time energy threshold array $[T_h(1), T_h(2), T_h(3), \ldots, T_h(i), \ldots, T_h(s)]$ of the range hood. The value of $\alpha$ is obtained by measurement; to obtain a more accurate first short-time energy threshold array, multiple tests can be carried out when calculating the first short-time energy threshold at each working gear so as to obtain a more accurate value of $\alpha$.

The second short-time energy threshold of the range hood at each working gear is calculated, where $0 < \beta < 1$ and $T_h(i) > T_l(i)$, giving the second short-time energy threshold array $[T_l(1), T_l(2), T_l(3), \ldots, T_l(i), \ldots, T_l(s)]$ of the range hood. The value of $\beta$ is obtained by measurement; to obtain a more accurate second short-time energy threshold array, multiple tests can be carried out when calculating the second short-time energy threshold at each working gear so as to obtain a more accurate value of $\beta$.

The short-time zero-crossing rate threshold of the range hood at each working gear is also calculated, giving the short-time zero-crossing rate threshold array $[T_z(1), T_z(2), T_z(3), \ldots, T_z(i), \ldots, T_z(s)]$ of the range hood.

S2, when a user uses the range hood, the control chip in the range hood automatically detects and acquires the current working gear data i of the range hood.

S3, collecting the control voice signal of the user, and performing pre-emphasis, framing and windowing on the collected voice signal to obtain short-time signal frames of the voice signal. Because of the structure of the human vocal apparatus, speech is affected by glottal excitation and by oral and nasal radiation, so it is attenuated in the high-frequency band; pre-emphasis usually uses a high-pass filter to boost the high-frequency response of the voice signal. When the voice signal is framed and windowed, a Hamming window can be used.

S4, calculating the short-time energy and the short-time zero-crossing rate of each short-time signal frame of the voice signal, thereby obtaining the short-time energy of the voice signal as a function of time and the short-time zero-crossing rate of the voice signal as a function of time.
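The pre-processing of S3 can be sketched in a few lines of Python. The first-order high-pass filter $y[n] = x[n] - 0.97\,x[n-1]$, the frame length and the hop size are assumptions chosen for illustration (the patent only names pre-emphasis with a high-pass filter and a Hamming window); the window uses the standard Hamming coefficients 0.54 and 0.46.

```python
import math


def pre_emphasis(signal, coeff=0.97):
    """First-order high-pass filter y[n] = x[n] - coeff * x[n-1] (coeff is assumed)."""
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - coeff * signal[n - 1] for n in range(1, len(signal))
    ]


def hamming(length):
    """Standard Hamming window of the given length."""
    if length == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def frame_and_window(signal, frame_len, hop):
    """Split the signal into overlapping frames and weight each by a Hamming window."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + n] * window[n] for n in range(frame_len)])
    return frames


# Toy input: a constant signal, which pre-emphasis should mostly suppress.
samples = [1.0] * 10
emphasized = pre_emphasis(samples)
frames = frame_and_window(emphasized, frame_len=4, hop=2)

print(len(frames))     # 4
print(len(frames[0]))  # 4
```

With a constant input, every emphasized sample after the first is close to 0.03, which shows the low-frequency content being attenuated relative to any high-frequency content.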

S5, calculating first start-stop time coordinate data $(a, b)$ of the collected voice signal according to the first energy threshold $T_h(i)$ corresponding to the current working gear $i$ of the range hood; the first start-stop time coordinate data $(a, b)$ identifies the approximate start and end time points of the voice signal.
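A rough sketch of S5 in Python follows. The patent does not spell out how $(a, b)$ is derived from $T_h(i)$, so this example assumes the simplest reading: $a$ is the first frame whose short-time energy exceeds the threshold and $b$ is the last such frame. The function name and the energy values are hypothetical.

```python
def coarse_endpoints(energy, t_high):
    """Return (a, b): first and last frame index with energy above t_high.

    Returns None when no frame exceeds the threshold, i.e. no speech is found.
    """
    above = [k for k, e in enumerate(energy) if e > t_high]
    if not above:
        return None
    return above[0], above[-1]


# Illustrative short-time energy curve, one value per frame.
energy = [1, 2, 6, 15, 30, 28, 14, 5, 2, 1]

print(coarse_endpoints(energy, t_high=12.0))   # (3, 6)
print(coarse_endpoints(energy, t_high=100.0))  # None
```

Because $T_h(i)$ is the higher of the two energy thresholds, the interval found this way covers only the loudest part of the utterance, which is why the later steps widen it.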

S6, calculating second start-stop time coordinate data $(A, B)$ of the collected voice signal according to the second energy threshold $T_l(i)$ corresponding to the current working gear $i$ of the range hood; the second start-stop time coordinate data $(A, B)$ detects the start and end time points of the voiced part of the voice signal. When the second start-stop time coordinate data $(A, B)$ is acquired, the search proceeds to the left from the start time coordinate $a$ of the first start-stop time coordinate data $(a, b)$ to obtain the start time coordinate $A$ of the second start-stop time, and to the right from the end time coordinate $b$ of the first start-stop time coordinate data $(a, b)$ to obtain the end time coordinate $B$ of the second start-stop time, which saves processing time.
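The outward search of S6 can be sketched as follows in Python. The stopping rule (keep extending while the neighbouring frame's energy is still above $T_l(i)$) is an assumption, since the text only gives the search directions; the function name and sample values are hypothetical.

```python
def expand_by_energy(energy, a, b, t_low):
    """Widen the coarse interval (a, b) to (A, B) using the lower threshold.

    Searches left from a and right from b, so only the frames near the
    coarse boundaries are examined rather than the whole signal.
    """
    start = a
    while start > 0 and energy[start - 1] > t_low:
        start -= 1  # frame to the left still looks like speech
    end = b
    while end < len(energy) - 1 and energy[end + 1] > t_low:
        end += 1    # frame to the right still looks like speech
    return start, end


# Illustrative short-time energy curve and a coarse interval (a, b) = (3, 6).
energy = [1, 2, 6, 15, 30, 28, 14, 5, 2, 1]

print(expand_by_energy(energy, a=3, b=6, t_low=4.0))  # (2, 7)
```

Starting from the already known coarse boundaries is what saves processing time: the loops stop at the first frame that falls to or below the threshold.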

S7, calculating third start-stop time coordinate data $(A_0, B_0)$ of the collected voice signal according to the short-time zero-crossing rate threshold $T_z(i)$ corresponding to the current working gear $i$ of the range hood. Chinese syllables usually begin with an initial consonant, and most initials are unvoiced, so they are easily confused with ambient noise; however, the short-time zero-crossing rate of ambient noise is markedly lower than that of unvoiced sounds, so the third start-stop time coordinate data $(A_0, B_0)$ can be used directly as the start and end time points of the voice signal.

When the third start-stop time coordinate data $(A_0, B_0)$ is acquired, the search proceeds to the left from the start time coordinate $A$ of the second start-stop time coordinate data $(A, B)$ to obtain the start time coordinate $A_0$ of the third start-stop time, and to the right from the end time coordinate $B$ of the second start-stop time coordinate data $(A, B)$ to obtain the end time coordinate $B_0$ of the third start-stop time.
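Putting S5, S6 and S7 together gives a three-stage refinement, sketched below in Python. The decision rules are assumptions made for illustration: a frame counts as speech while its energy is above the energy threshold, and as unvoiced speech while its zero-crossing rate is above $T_z(i)$, which follows the statement that ambient noise has a lower zero-crossing rate than unvoiced sounds. All names and numbers are hypothetical.

```python
def expand(curve, start, end, threshold):
    """Widen [start, end] outward while neighbouring frames exceed the threshold."""
    while start > 0 and curve[start - 1] > threshold:
        start -= 1
    while end < len(curve) - 1 and curve[end + 1] > threshold:
        end += 1
    return start, end


def detect_endpoints(energy, zcr, t_high, t_low, t_zcr):
    """Return ((a, b), (A, B), (A0, B0)) as frame indices, or None if no speech."""
    above = [k for k, e in enumerate(energy) if e > t_high]
    if not above:
        return None
    a, b = above[0], above[-1]                    # S5: coarse interval
    big_a, big_b = expand(energy, a, b, t_low)    # S6: voiced part
    a0, b0 = expand(zcr, big_a, big_b, t_zcr)     # S7: include unvoiced edges
    return (a, b), (big_a, big_b), (a0, b0)


# Illustrative curves: unvoiced consonants at frames 1 and 8 have low energy
# but a high zero-crossing rate.
energy = [1, 2, 6, 15, 30, 28, 14, 5, 2, 1]
zcr = [0.10, 0.45, 0.50, 0.20, 0.15, 0.15, 0.20, 0.25, 0.40, 0.10]

stages = detect_endpoints(energy, zcr, t_high=12.0, t_low=4.0, t_zcr=0.30)
print(stages)  # ((3, 6), (2, 7), (1, 8))
```

In the toy data the final interval grows by one frame on each side, capturing the low-energy, high-zero-crossing frames that an energy threshold alone would discard.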

S8, obtaining the start-stop time coordinates of the voice signal as $(A_0, B_0)$. The start-stop time coordinates $(A_0, B_0)$ make it possible to extract the corresponding valid voice signal, and extracting features from the valid voice signal removes redundant information in the original voice. Finally, the feature-extracted voice information is matched against a trained model, so that the speech uttered by the user can be effectively recognized.

The voice endpoint detection method applied to the range hood performs voice endpoint detection with different thresholds for different working gears, so the detection result is more accurate and the influence of the differing noise characteristics of the working gears on the detection result is effectively eliminated. This reduces the probability that noise is mistaken for voice in a noisy environment, reduces the amount of data stored in the subsequent voice recognition process, and increases the speed of voice recognition. In addition, the method places low demands on hardware and is therefore suitable for a range hood, an application environment with weak hardware performance.

Claims

1. A voice endpoint detection method applied to a range hood is characterized in that: the method comprises the following steps:

S1, initializing the number of working gears $s$ of the range hood;

initializing a first short-time energy threshold array and a second short-time energy threshold array for when the range hood works at each working gear; the first short-time energy threshold array is $[T_h(1), T_h(2), T_h(3), \ldots, T_h(i), \ldots, T_h(s)]$; the second short-time energy threshold array is $[T_l(1), T_l(2), T_l(3), \ldots, T_l(i), \ldots, T_l(s)]$, where $i$ is a natural number, $1 \le i \le s$, and $T_l(i) < T_h(i)$;

initializing a short-time zero-crossing rate threshold array for when the range hood works at each working gear:

$[T_z(1), T_z(2), T_z(3), \ldots, T_z(i), \ldots, T_z(s)]$;

the method for acquiring the first short-time energy threshold value array, the second short-time energy threshold value array and the short-time zero-crossing rate threshold value array comprises the following steps:

wherein one term represents the short-time zero-crossing rate average of the noise signal when the range hood works at gear $i$;

thereby obtaining the short-time zero-crossing rate threshold array $[T_z(1), T_z(2), T_z(3), \ldots, T_z(i), \ldots, T_z(s)]$ of the range hood;

S2, acquiring the current working gear data i of the range hood;

2. The voice endpoint detection method applied to a range hood as claimed in claim 1, wherein: in S6, when the second start-stop time coordinate data $(A, B)$ is acquired, the search proceeds to the left from the start time coordinate $a$ of the first start-stop time coordinate data $(a, b)$ to obtain the start time coordinate $A$ of the second start-stop time, and to the right from the end time coordinate $b$ of the first start-stop time coordinate data $(a, b)$ to obtain the end time coordinate $B$ of the second start-stop time.

3. The voice endpoint detection method applied to a range hood as claimed in claim 1, wherein: in S7, when the third start-stop time coordinate data $(A_0, B_0)$ is acquired, the search proceeds to the left from the start time coordinate $A$ of the second start-stop time coordinate data $(A, B)$ to obtain the start time coordinate $A_0$ of the third start-stop time, and to the right from the end time coordinate $B$ of the second start-stop time coordinate data $(A, B)$ to obtain the end time coordinate $B_0$ of the third start-stop time.