CN111128169A

CN111128169A - Voice wake-up method and device

Info

Publication number: CN111128169A
Application number: CN201911402716.3A
Authority: CN
Inventors: 丁少为; 关海欣
Original assignee: Unisound Intelligent Technology Co Ltd; Xiamen Yunzhixin Intelligent Technology Co Ltd
Current assignee: Unisound Intelligent Technology Co Ltd; Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date: 2019-12-30
Filing date: 2019-12-30
Publication date: 2020-05-08

Abstract

The invention relates to a voice wake-up method and a voice wake-up device. The method comprises: determining the wake-up voice of a wake-up word received by each of a plurality of devices to be awakened; determining the noise data obtained by each device to be awakened; selecting a target awakened device from the devices to be awakened according to the wake-up voice received and the noise data obtained by each device to be awakened; and responding to the wake-up voice through the target awakened device. This technical scheme improves the accuracy with which the awakened device is determined, and therefore the wake-up accuracy, and prevents the other devices to be awakened from being mistakenly awakened as the device that actually needs to be awakened.

Description

Voice wake-up method and device

Technical Field

The present invention relates to the field of voice technologies, and in particular, to a voice wake-up method and apparatus.

Background

At present, with the popularization of intelligent voice devices, several different devices that use the same wake-up word may be present in one home environment (for example, a television, a refrigerator, an air conditioner and a washing machine that are all awakened by the same wake-up word). In such a scene, a single utterance of the wake-up word is likely to draw a response from every device. The simplest way to solve this problem is to select the closest device according to the signal energy of the wake-up word received by each device: the farther sound propagates, the more its energy attenuates, so the device closest to the user receives the wake-up word with the greatest energy. The device with the greatest energy is therefore taken to be the closest device and is the one awakened to respond to the wake-up voice, which avoids mistakenly awakening every device associated with the wake-up word.

However, this method does not distinguish between the components of the signal energy; it relies blindly on the total signal energy received by the device during the wake-up word period, so the wake-up response accuracy drops sharply in a noisy environment. For example, if a device is close to a noise source and far from the user, it receives high-energy noise while receiving the wake-up word. Its total energy may then exceed the energy received by the nearest device, so that the farther device is misjudged as the nearest device and responds to the wake-up voice.

Disclosure of Invention

The embodiment of the invention provides a voice awakening method and device. The technical scheme is as follows:

according to a first aspect of the embodiments of the present invention, there is provided a voice wake-up method, including:

determining awakening voice of an awakening word received by each piece of equipment to be awakened in a plurality of pieces of equipment to be awakened;

determining noise data obtained by each device to be awakened;

selecting target awakened equipment from the equipment to be awakened according to the awakening voice received by the equipment to be awakened and the noise data obtained by the equipment to be awakened;

responding to the wake-up voice by the target awakened device.

In an embodiment, the selecting, according to the wake-up voice received by each device to be woken up and the noise data obtained by each device to be woken up, a target device to be woken up from each device to be woken up includes:

performing framing, windowing and short-time Fourier transform on the wake-up voice received by each device to be awakened to obtain the time-frequency representation Y_k(f, n) of the wake-up voice;

performing framing, windowing and short-time Fourier transform on the noise data obtained by each device to be awakened to obtain the time-frequency representation X_k(f, n) of the noise data;

selecting the target awakened device from the devices to be awakened according to the time-frequency representation Y_k(f, n) of the wake-up voice received by each device to be awakened and the time-frequency representation X_k(f, n) of the noise data obtained by each device to be awakened.

In an embodiment, the selecting of the target awakened device from the devices to be awakened according to the time-frequency representation Y_k(f, n) of the wake-up voice received by each device to be awakened and the time-frequency representation X_k(f, n) of the noise data obtained by each device to be awakened comprises:

calculating a first average frame energy of the wake-up voice received by each device to be awakened according to the time-frequency representation Y_k(f, n) of that wake-up voice;

calculating a second average frame energy of the noise data obtained by each device to be awakened according to the time-frequency representation X_k(f, n) of that noise data;

and selecting target awakened equipment from the equipment to be awakened according to the first average frame energy of the equipment to be awakened and the second average frame energy of the equipment to be awakened.

In an embodiment, selecting a target awakened device from the devices to be awakened according to the first average frame energy of the devices to be awakened and the second average frame energy of the devices to be awakened includes:

calculating the energy difference of each device to be awakened according to the first average frame energy of each device to be awakened and the second average frame energy of each device to be awakened; the energy difference of each device to be awakened is the difference value of the first average frame energy and the second average frame energy of each device to be awakened;

and selecting the target awakened equipment from the equipment to be awakened according to the energy difference of the equipment to be awakened.

In an embodiment, the selecting the target awakened device from the devices to be awakened according to the energy difference between the devices to be awakened includes:

according to the energy difference of each device to be awakened, determining the device to be awakened with the largest energy difference from the devices to be awakened;

and determining the device to be awakened with the largest energy difference as the target awakened device.

According to a second aspect of the embodiments of the present invention, there is provided a voice wake-up apparatus, including:

a first determining module, configured to determine the wake-up voice of the wake-up word received by each of a plurality of devices to be awakened;

the second determining module is used for determining the noise data obtained by each device to be awakened;

the selection module is used for selecting target awakened equipment from the equipment to be awakened according to the awakening voice received by the equipment to be awakened and the noise data obtained by the equipment to be awakened;

and the response module is used for responding to the awakening voice through the target awakened equipment.

In one embodiment, the selection module comprises:

a first processing submodule, configured to perform framing, windowing and short-time Fourier transform on the wake-up voice received by each device to be awakened to obtain the time-frequency representation Y_k(f, n) of the wake-up voice;

a second processing submodule, configured to perform framing, windowing and short-time Fourier transform on the noise data obtained by each device to be awakened to obtain the time-frequency representation X_k(f, n) of the noise data;

a selection submodule, configured to select the target awakened device from the devices to be awakened according to the time-frequency representation Y_k(f, n) of the wake-up voice received by each device to be awakened and the time-frequency representation X_k(f, n) of the noise data obtained by each device to be awakened.

In one embodiment, the selection submodule includes:

a first computing unit, configured to calculate a first average frame energy of the wake-up voice received by each device to be awakened according to the time-frequency representation Y_k(f, n) of that wake-up voice;

a second computing unit, configured to calculate a second average frame energy of the noise data obtained by each device to be awakened according to the time-frequency representation X_k(f, n) of that noise data;

and the selection unit is used for selecting target awakened equipment from the equipment to be awakened according to the first average frame energy of the equipment to be awakened and the second average frame energy of the equipment to be awakened.

In one embodiment, the selection unit includes:

the calculating subunit is configured to calculate an energy difference between the devices to be wakened according to the first average frame energy of each device to be wakened and the second average frame energy of each device to be wakened; the energy difference of each device to be awakened is the difference value of the first average frame energy and the second average frame energy of each device to be awakened;

and the selecting subunit is configured to select the target awakened device from the devices to be awakened according to the energy difference between the devices to be awakened.

In one embodiment, the selection subunit is specifically configured to:

The technical scheme provided by the embodiment of the invention can have the following beneficial effects:

After the wake-up voice received by each device to be awakened and the noise data obtained before each device receives the wake-up voice are determined, the device that should respond to the wake-up voice (namely, the target awakened device) is selected automatically from the devices to be awakened according to both the received wake-up voice and the obtained noise data. Combining the wake-up voice with the noise data of each device improves the accuracy with which the awakened device is determined, and therefore the wake-up accuracy, and prevents the other devices to be awakened from being mistakenly awakened as the device that actually needs to be awakened.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

Fig. 1 is a flow chart illustrating a voice wake-up method according to an example embodiment.

Fig. 2 is a flow chart illustrating another voice wake-up method according to an example embodiment.

Fig. 3 is a block diagram illustrating a voice wake-up apparatus in accordance with an exemplary embodiment.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.

In order to solve the above technical problem, an embodiment of the present invention provides a voice wake-up method, which may be used in a voice wake-up program, a system or a device, and an execution subject corresponding to the method may be a terminal or a server, as shown in fig. 1, where the method includes steps S101 to S104:

in step S101, determining a wake-up voice of a wake-up word received by each of a plurality of devices to be woken up;

in step S102, determining noise data obtained by each device to be wakened;

the noise data obtained by each device to be wakened is the noise data of each device to be wakened in a period of time (such as the previous 1 second) before the wakening voice of the wakening word is received.

In step S103, selecting a target device to be awakened from the devices to be awakened according to the awakening voice received by the devices to be awakened and the noise data obtained by the devices to be awakened;

in step S104, the target awakened device responds to the wake-up voice.

In addition, when the target awakened device is selected, the noise data from a period before each device to be awakened receives the wake-up voice of the wake-up word is also taken into account, rather than relying only on the total signal energy received by each device during the wake-up word period. Compared with the prior art, this markedly improves the accuracy with which the device to be awakened is determined, and therefore the wake-up accuracy.
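As a minimal sketch of how a device might separate the two inputs used in steps S101 and S102, the following splits one recorded buffer into the noise segment that precedes the wake-up word and the wake-up word segment itself. The function name `split_noise_and_wake`, the 16 kHz sample rate and the assumption that a wake-word detector has already supplied the start and end sample indices are all hypothetical; only the "previous 1 second" window comes from the description.

```python
import numpy as np

def split_noise_and_wake(buffer, wake_start, wake_end,
                         sample_rate=16000, noise_seconds=1.0):
    """Return (noise, wake) slices of a 1-D audio buffer.

    noise is the stretch immediately before the wake-up word (at most
    noise_seconds long); wake is the stretch holding the wake-up word.
    wake_start and wake_end are sample indices assumed to come from a
    wake-word detector that is outside the scope of this sketch.
    """
    buffer = np.asarray(buffer, dtype=float)
    if not 0 <= wake_start < wake_end <= buffer.size:
        raise ValueError("wake-word bounds must lie inside the buffer")
    noise_len = int(round(noise_seconds * sample_rate))
    noise_start = max(0, wake_start - noise_len)  # clip at the buffer start
    return buffer[noise_start:wake_start], buffer[wake_start:wake_end]

# Usage: a 3 s buffer whose wake-up word occupies samples 32000..39999.
sample_rate = 16000
buffer = np.arange(3 * sample_rate, dtype=float)
noise, wake = split_noise_and_wake(buffer, wake_start=32000, wake_end=40000)
```

If less than one second of audio precedes the wake-up word, this sketch simply returns the shorter noise segment that is available.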

performing framing, windowing and short-time Fourier transform on the wake-up voice received by each device to be awakened to obtain the time-frequency representation Y_k(f, n) of the wake-up voice; Y_k(f, n) is the time-frequency spectrum of the wake-up voice.

Performing framing, windowing and short-time Fourier transform on the noise data obtained by each device to be awakened to obtain the time-frequency representation X_k(f, n) of the noise data; X_k(f, n) is the time-frequency spectrum of the noise data.

According to the time-frequency representation Y_k(f, n) of the wake-up voice received by each device to be awakened and the time-frequency representation X_k(f, n) of the noise data obtained by each device to be awakened, the target awakened device can be selected automatically and accurately from the devices to be awakened, which improves the accuracy with which the awakened device is determined and therefore the wake-up accuracy.
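The framing, windowing and short-time Fourier transform described above can be sketched with NumPy as follows. The frame length, hop size, Hann window and 16 kHz sample rate are assumptions chosen for illustration; the description does not fix any of them, and the helper name `stft` is hypothetical.

```python
import numpy as np

def stft(signal, sample_rate=16000, frame_len=512, hop=256):
    """Frame a 1-D signal, apply a Hann window and take the FFT of each frame.

    Returns (freqs, spec), where spec[f, n] is the complex value at
    frequency bin f and frame n, matching the Y_k(f, n) / X_k(f, n) indexing.
    """
    signal = np.asarray(signal, dtype=float)
    if signal.size < frame_len:
        # Zero-pad very short inputs so that at least one frame exists.
        signal = np.pad(signal, (0, frame_len - signal.size))
    n_frames = 1 + (signal.size - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1).T  # shape (frame_len // 2 + 1, n_frames)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    return freqs, spec

# Usage: one second of a 1 kHz tone stands in for a received signal.
t = np.arange(16000) / 16000.0
freqs, spec = stft(np.sin(2 * np.pi * 1000.0 * t))
peak_bin = int(np.argmax(np.abs(spec[:, 0])))
```

With these assumed settings the frequency resolution is 31.25 Hz per bin, so the 1 kHz tone falls exactly on bin 32.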

according to the time-frequency representation Y_k(f, n) of the wake-up voice received by each device to be awakened, calculating a first average frame energy of the wake-up voice received by each device to be awakened, where f represents the frequency, n represents the frame index of the wake-up voice or noise data, and k represents the kth device to be awakened.

According to the time-frequency representation X_k(f, n) of the noise data obtained by each device to be awakened, calculating a second average frame energy of the noise data obtained by each device to be awakened over the same frequency range;

the first average frame energy is the average obtained from the sum of the energies of the frames of the wake-up voice, and the second average frame energy is the average obtained from the sum of the energies of the frames of the noise data.

From the time-frequency representation Y_k(f, n) of the wake-up voice received by each device to be awakened, the first average frame energy of that wake-up voice can be calculated accurately; likewise, from the time-frequency representation X_k(f, n) of the noise data obtained by each device to be awakened, the second average frame energy of that noise data can be calculated accurately. The target awakened device can then be selected accurately according to the two average frame energies of each device, which improves the wake-up accuracy and prevents the other devices to be awakened from being mistakenly awakened as the device that actually needs to be awakened. In addition, this embodiment distinguishes the signal energy into the energy of the wake-up voice and the energy of the noise data, which, compared with the prior art, further improves the selection accuracy of the target awakened device and therefore the wake-up accuracy.
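One possible reading of the average frame energy is sketched below: the squared magnitude of the time-frequency array is summed over the frequency bins inside a chosen band for each frame, and the per-frame sums are then averaged over all frames. Using the squared magnitude as the energy measure is an assumption, as are the function name `average_frame_energy` and its parameters.

```python
import numpy as np

def average_frame_energy(spec, freqs, f_low, f_high):
    """Mean over frames of the energy summed over bins with f_low <= f <= f_high.

    spec[f, n] is a complex time-frequency array and freqs[f] gives the
    frequency of bin f in Hz.
    """
    spec = np.asarray(spec)
    freqs = np.asarray(freqs, dtype=float)
    band = (freqs >= f_low) & (freqs <= f_high)
    if not band.any() or spec.shape[1] == 0:
        raise ValueError("empty frequency band or no frames")
    frame_energy = np.sum(np.abs(spec[band, :]) ** 2, axis=0)  # one value per frame
    return float(frame_energy.mean())

# Usage on a tiny hand-made array: three bins (0, 100, 200 Hz), two frames.
spec = np.array([[1 + 0j, 1j],
                 [2 + 0j, 0j],
                 [3 + 0j, 4 + 0j]])
freqs = np.array([0.0, 100.0, 200.0])
energy = average_frame_energy(spec, freqs, f_low=100.0, f_high=200.0)
```

In this example the band keeps the 100 Hz and 200 Hz bins, so the per-frame energies are 4 + 9 = 13 and 0 + 16 = 16, and their average is 14.5. The same function would be applied to Y_k(f, n) for the first average frame energy and to X_k(f, n) for the second.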

According to the first average frame energy and the second average frame energy of each device to be awakened, the energy difference of each device to be awakened can be calculated, and the target awakened device is then selected automatically from the devices to be awakened according to the energy difference. This improves the selection accuracy of the target awakened device and the wake-up accuracy, and prevents the other devices to be awakened from being mistakenly awakened as the device that actually needs to be awakened.

According to the energy difference of each device to be awakened, the energy differences can be sorted from largest to smallest to determine the maximum energy difference, and the device to be awakened that corresponds to the maximum energy difference is automatically determined as the target awakened device. This improves the selection accuracy of the target awakened device and the wake-up accuracy, and prevents the other devices to be awakened from being mistakenly awakened as the device that actually needs to be awakened.
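The selection rule in the two paragraphs above reduces to a subtraction and a maximum. The sketch below assumes the two average frame energies have already been computed per device; the device names and energy values are invented for illustration, and the tie-breaking behaviour (the first device in dictionary order wins) is an assumption that the description does not address.

```python
def select_target_device(wake_energy, noise_energy):
    """Return (target, differences) for per-device energies given as dicts.

    differences[dev] = first average frame energy - second average frame energy;
    target is the device whose difference is largest.
    """
    if not wake_energy or wake_energy.keys() != noise_energy.keys():
        raise ValueError("both dicts must cover the same, non-empty device set")
    differences = {dev: wake_energy[dev] - noise_energy[dev]
                   for dev in wake_energy}
    target = max(differences, key=differences.get)
    return target, differences

# Usage with invented energies: the refrigerator sits next to a noise source.
wake_energy = {"tv": 9.0, "fridge": 12.0, "aircon": 4.0}
noise_energy = {"tv": 1.0, "fridge": 8.0, "aircon": 0.5}
target, differences = select_target_device(wake_energy, noise_energy)
loudest = max(wake_energy, key=wake_energy.get)  # energy-only baseline
```

Here the energy-only baseline would pick the refrigerator (12.0), whereas the energy differences are 8.0, 4.0 and 3.5, so the television is selected.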

The technical solution of the present invention will be further described in detail with reference to fig. 2:

The prior art relies only on the energy of the wake-up word and is therefore not robust in noisy environments. To address this problem, this patent provides a nearest-device response method that improves the robustness of the nearest-device response in noisy environments.

Step 1: supposing that K different intelligent devices may be awakened at the same time, each device inputs the voice data in the wake-up word period into the distributed engine and also inputs the noise data from a period before the wake-up voice of the wake-up word was received; the noise data is recorded as x_k(t) and the wake-up word data as y_k(t), where t represents the sampling time point and k represents the kth device;

Step 2: performing framing, windowing and short-time Fourier transform on the noise data of each device to obtain the time-frequency-domain form of the noise data, recorded as X_k(f, n), where f represents frequency and n represents the frame index;

Step 3: selecting a frequency range and calculating the average frame energy of the noise data within it, recorded as $\bar{E}_{x,k}$:

$$\bar{E}_{x,k} = \frac{1}{N_{x,k}} \sum_{n=1}^{N_{x,k}} \sum_{f=f_1}^{f_2} \left| X_k(f, n) \right|^2$$

where $N_{x,k}$ represents the total number of frames of the noise data of the kth device, and $f_1$ and $f_2$ represent the lower and upper frequency limits under consideration;

Step 4: performing framing, windowing and short-time Fourier transform on the wake-up word data received by each device to obtain the time-frequency-domain form of the wake-up word data, recorded as Y_k(f, n);

Step 5: calculating the average frame energy of the wake-up word data over the same frequency range as the noise data, recorded as $\bar{E}_{y,k}$:

$$\bar{E}_{y,k} = \frac{1}{N_{y,k}} \sum_{n=1}^{N_{y,k}} \sum_{f=f_1}^{f_2} \left| Y_k(f, n) \right|^2$$

where $N_{y,k}$ represents the total number of frames of the wake-up word data of the kth device;

Step 6: subtracting the average frame energy of the noise data from the average frame energy of the wake-up word data to obtain the reliable nearest-device decision energy of each device, recorded as $\hat{E}_k$, namely

$$\hat{E}_k = \bar{E}_{y,k} - \bar{E}_{x,k}$$

Step 7: selecting the device with the maximum reliable nearest-device decision energy as the nearest device, which responds to the wake-up word, namely

$$K_F = \arg\max_{k \in \{1, \dots, K\}} \hat{E}_k$$

where K_F is the number of the device that finally responds.

In this technical scheme, the noise data received before the wake-up word data serves as a penalty term. When a device is close to a noise source, the average frame energy of its wake-up word segment is high, but the average frame energy of its noise segment is also high, so the penalty applied to the average frame energy of its wake-up word segment is large. This improves the robustness of the distributed engine in noisy scenes.

Finally, it should be noted that the above embodiments can be freely combined by those skilled in the art according to actual needs.
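The effect of the penalty term can be illustrated with a small simulation of steps 1 to 7 on synthetic signals. Everything numeric here is an assumption made for the sketch: a 500 Hz tone stands in for the wake-up word, white Gaussian noise stands in for the interfering source, the band limits are 100 Hz and 4000 Hz, and the per-device gains are invented so that one device is near the user and the other is near the noise source.

```python
import numpy as np

SAMPLE_RATE = 16000
FRAME_LEN, HOP = 512, 256
F_LOW, F_HIGH = 100.0, 4000.0  # assumed band limits f1 and f2

def band_energy(signal):
    """Average frame energy of `signal` between F_LOW and F_HIGH (steps 2 to 5)."""
    n_frames = 1 + (len(signal) - FRAME_LEN) // HOP
    window = np.hanning(FRAME_LEN)
    frames = np.stack([signal[i * HOP:i * HOP + FRAME_LEN] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)  # shape (n_frames, bins)
    freqs = np.fft.rfftfreq(FRAME_LEN, d=1.0 / SAMPLE_RATE)
    band = (freqs >= F_LOW) & (freqs <= F_HIGH)
    return float(np.mean(np.sum(np.abs(spec[:, band]) ** 2, axis=1)))

rng = np.random.default_rng(0)
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
speech = np.sin(2 * np.pi * 500.0 * t)  # stand-in for the wake-up word

# Invented (speech gain, noise gain) pairs for the two devices.
gains = {"near_user": (1.0, 0.1), "near_noise": (0.5, 1.2)}

wake_energy, noise_energy = {}, {}
for dev, (g_speech, g_noise) in gains.items():
    x = g_noise * rng.standard_normal(SAMPLE_RATE)  # noise before the wake-up word
    y = g_speech * speech + g_noise * rng.standard_normal(SAMPLE_RATE)
    noise_energy[dev] = band_energy(x)
    wake_energy[dev] = band_energy(y)

# Energy-only baseline versus the penalised score of steps 6 and 7.
by_total = max(wake_energy, key=wake_energy.get)
score = {dev: wake_energy[dev] - noise_energy[dev] for dev in gains}
by_difference = max(score, key=score.get)
```

Under these assumed gains the device near the noise source has the larger total energy during the wake-up word period, so the energy-only baseline selects it, while subtracting the noise-segment energy leaves the device near the user with the larger score.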

Corresponding to the voice wake-up method provided in the embodiment of the present invention, an embodiment of the present invention further provides a voice wake-up apparatus, as shown in fig. 3, the apparatus includes:

a first determining module 301, configured to determine a wake-up voice of a wake-up word received by each device to be wakened in a plurality of devices to be wakened;

a second determining module 302, configured to determine noise data obtained by each device to be wakened;

a selecting module 303, configured to select a target device to be awakened from the devices to be awakened according to the awakening voice received by each device to be awakened and the noise data obtained by each device to be awakened;

a response module 304, configured to respond to the wake-up voice through the target awakened device.
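The four modules can be pictured as four methods of one object, as in the structural sketch below. The class and method names are hypothetical, and the energy measure used here is a plain time-domain mean square, a deliberate simplification that stands in for the band-limited average frame energy of the method embodiments so that the example stays short.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

@dataclass
class Capture:
    """Audio reported by one device: pre-wake noise and the wake-word segment."""
    noise: Sequence[float]
    wake: Sequence[float]

def mean_square(samples: Sequence[float]) -> float:
    """Simplified stand-in for the average frame energy (an assumption)."""
    return sum(s * s for s in samples) / len(samples)

class VoiceWakeApparatus:
    def __init__(self, energy: Callable[[Sequence[float]], float] = mean_square):
        self.energy = energy

    def determine_wake_voice(self, captures: Dict[str, Capture]):  # module 301
        return {dev: c.wake for dev, c in captures.items()}

    def determine_noise(self, captures: Dict[str, Capture]):  # module 302
        return {dev: c.noise for dev, c in captures.items()}

    def select(self, wake, noise) -> str:  # module 303
        score = {dev: self.energy(wake[dev]) - self.energy(noise[dev])
                 for dev in wake}
        return max(score, key=score.get)

    def respond(self, target: str) -> str:  # module 304
        return f"{target} responds"

    def run(self, captures: Dict[str, Capture]) -> str:
        wake = self.determine_wake_voice(captures)
        noise = self.determine_noise(captures)
        return self.respond(self.select(wake, noise))

# Usage with invented samples: the refrigerator is louder but far noisier.
captures = {
    "tv": Capture(noise=[0.1, -0.1, 0.1, -0.1], wake=[1.0, -1.0, 1.0, -1.0]),
    "fridge": Capture(noise=[1.0, -1.0, 1.0, -1.0], wake=[1.2, -1.2, 1.2, -1.2]),
}
result = VoiceWakeApparatus().run(captures)
```

A different energy measure, such as one based on the short-time Fourier transform, could be passed in through the `energy` argument without changing the module structure.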

In one embodiment, the selection module comprises:

a second processing submodule, configured to perform framing, windowing and short-time Fourier transform on the noise data obtained by each device to be awakened to obtain the time-frequency representation X_k(f, n) of the noise data;

In one embodiment, the selection submodule includes:

In one embodiment, the selection unit includes:

In one embodiment, the selection subunit is specifically configured to:

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims

1. A voice wake-up method, comprising:

determining noise data obtained by each device to be awakened;

responding to the wake-up voice by the target awakened device.

2. The method according to claim 1, wherein the selecting a target awakened device from the devices to be awakened according to the awakening voice received by the devices to be awakened and the noise data obtained by the devices to be awakened comprises:

performing framing, windowing and short-time Fourier transform on the wake-up voice received by each device to be awakened to obtain the time-frequency representation Y_k(f, n) of the wake-up voice;

3. The method of claim 2,

the selecting of the target awakened device from the devices to be awakened according to the time-frequency representation Y_k(f, n) of the wake-up voice received by each device to be awakened and the time-frequency representation X_k(f, n) of the noise data obtained by each device to be awakened comprises:

4. The method of claim 3,

selecting a target awakened device from the devices to be awakened according to the first average frame energy of the devices to be awakened and the second average frame energy of the devices to be awakened, including:

5. The method of claim 4,

the selecting the target awakened device from the devices to be awakened according to the energy difference of the devices to be awakened includes:

6. A voice wake-up apparatus, comprising:

7. The apparatus of claim 6, wherein the selection module comprises:

a second processing submodule, configured to perform framing, windowing and short-time Fourier transform on the noise data obtained by each device to be awakened to obtain the time-frequency representation X_k(f, n) of the noise data;

8. The apparatus of claim 7,

the selection submodule includes:

9. The apparatus of claim 8,

the selection unit includes:

10. The apparatus of claim 9,

the selection subunit is specifically configured to: