CN112820283B - Voice processing method, equipment and system

Info

Publication number: CN112820283B
Application number: CN201911129042.4A
Authority: CN
Inventors: 胡俊锋; 汪贇; 黄俊岚
Original assignee: Zhejiang Future Elf Artificial Intelligence Technology Co ltd
Current assignee: Zhejiang Future Elf Artificial Intelligence Technology Co ltd
Filing date: 2019-11-18
Publication date: 2024-07-05
Anticipated expiration: 2039-11-18

Abstract

The invention discloses a voice processing method comprising: receiving a voice signal from a peripheral device communicatively connected to a computing device, wherein the peripheral device has analyzed the voice signal and determined that it contains a wake-up word suitable for changing the operation state of the computing device; analyzing the voice signal again to determine whether it contains the wake-up word; and, upon determining that the voice signal contains the wake-up word, switching the computing device from a first operating state to a second operating state in order to process subsequent new voice signals, wherein the first operating state has a lower energy consumption than the second operating state. The invention also discloses a corresponding smart device, a peripheral device, and a smart system comprising these devices.

Description

Voice processing method, equipment and system

Technical Field

The present invention relates to the field of speech processing technologies, and in particular, to a speech processing method, device, and system for recognizing wake-up words.

Background

Over the past decade, the Internet has penetrated many areas of people's lives, and people can conveniently shop, socialize, enjoy entertainment, manage finances, and carry out other activities online. Meanwhile, to improve user experience, researchers have developed a variety of interaction schemes, such as text input, gesture input, and voice input. Among them, intelligent voice interaction has become a research hotspot for the new generation of interaction modes because it is convenient to operate.

Currently, with the rapid development of the internet of things and intellectualization, some intelligent voice devices, such as intelligent sound boxes and intelligent mobile terminals, appear on the market. In some usage scenarios, the intelligent voice device may recognize voice data input by a user through a voice recognition technology, so as to provide personalized services for the user, such as listening to various audio content, supporting home appliance control, and the like.

Smart devices such as smart speakers are typically deployed in a fixed area such as the living room of a home. To save energy, these devices stay in a low-energy state when they have been idle for a long period, and enter a normal operating state, which consumes more energy, only when operation is needed, e.g., for interaction with a user. The process of changing the device from the low-power state to the normal operating state is called a wake-up procedure.

At present, waking up an intelligent voice device by voice is a very convenient mode of user interaction. In voice wake-up processing, the intelligent voice device acquires the user's voice and performs wake-up processing when it determines that the voice contains a specific wake-up word for waking the device. However, a high-performance intelligent device must keep running at a certain level of energy consumption in order to perform wake-up processing in time, so it cannot stand by for long on battery power. Such a device is therefore usually powered from a fixed power supply and is inconvenient to carry, which limits the range within which people can wake it by voice.

However, for portable low-power-consumption voice entrance devices, the computing power of these devices is insufficient, and an excellent voice wake-up algorithm cannot be realized thereon, so that there is a problem of low wake-up success rate.

Therefore, a new speech processing scheme is needed, which can reduce the energy consumption of the intelligent device while performing wake-up processing with high accuracy.

Disclosure of Invention

Accordingly, the present invention provides a speech processing method, apparatus, and system that seeks to solve or at least mitigate at least one of the above-identified problems.

According to one aspect of the present invention there is provided a speech processing method adapted to be executed in a computing device, the method comprising the steps of: receiving a speech signal from a peripheral device communicatively coupled to the computing device, wherein the peripheral device analyzes the speech signal and determines that the speech signal contains wake-up words suitable for changing an operational state of the computing device; analyzing the voice signal to determine whether the voice signal contains a wake-up word; and switching the computing device from a first operating state to a second operating state upon determining that the speech signal contains wake words, for processing subsequent new speech signals, wherein the first operating state has a lower energy consumption than the second operating state.

Optionally, the method according to the invention further comprises the step of: upon determining that the speech signal does not contain a wake word, the operating state of the computing device is not switched.

Optionally, in the method according to the invention, the peripheral device analyzes the speech signal to determine whether the wake-up word is included with a first accuracy; the computing device analyzing the speech signal to determine if the wake-up word is included with a second accuracy; and the first accuracy is lower than the second accuracy.

Optionally, in the method according to the invention, the peripheral device analyzes the speech signal using a first neural network algorithm and the computing device analyzes the speech signal using a second neural network algorithm, and the number of parameters in the first neural network is smaller than the number of parameters in the second neural network.

Optionally, in the method according to the invention, receiving the speech signal from the peripheral device comprises: receiving voice signals from more than one peripheral device; and selecting a speech signal having the greatest sound intensity from the received speech signals as the speech signal to be analyzed.
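The selection among several peripheral devices can be illustrated with a short sketch. It is only an assumption-laden example: the text does not say how sound intensity is measured, so root-mean-square amplitude is used here as a stand-in, and the device names and sample values are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude, used here as an assumed measure of sound intensity."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def select_loudest(signals: Dict[str, Sequence[float]]) -> Tuple[str, Sequence[float]]:
    """Return (device_id, samples) for the received signal with the greatest RMS."""
    if not signals:
        raise ValueError("no voice signals received")
    device_id = max(signals, key=lambda d: rms(signals[d]))
    return device_id, signals[device_id]


# Hypothetical signals received from two peripheral devices.
received = {
    "wristband": [0.01, -0.02, 0.015, -0.01],
    "earphone": [0.20, -0.25, 0.22, -0.18],
}
chosen_id, chosen_signal = select_loudest(received)
print(chosen_id)
```

Only the selected signal would then be passed on to the wake-up word analysis.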

Optionally, the method according to the invention further comprises the step of: after the computing device switches to the second operational state, the peripheral device is instructed to receive a new voice signal, so that the received new voice signal is sent to the computing device for processing.

Optionally, in the method according to the invention, the speech signal comprises multiple channels of audio signals, and the step of receiving the speech signal from a peripheral device communicatively connected to the computing device comprises: receiving a speech signal encoded in a predetermined format, the encoded speech signal including a first portion indicating the number of audio signal channels, a second portion indicating the length of each of the audio signals, and the audio signals themselves.

Optionally, in the method according to the invention, the first portion further indicates the number of reference audio signal channels; and the encoded speech signal further comprises multiple channels of reference audio signals and a third portion indicating the length of each of the reference audio signals.

Optionally, in the method according to the invention, the peripheral device is communicatively connected to the computing device in at least one of the following ways: Bluetooth, ZigBee, Wi-Fi and mobile communication; and the peripheral device is a sound pickup device adapted to obtain a voice input.

According to another aspect of the present invention, there is provided a speech processing method adapted to be executed in a computing device, the method comprising the steps of: receiving a voice signal; analyzing the voice signal to determine whether the voice signal contains a wake-up word; and upon determining that the speech signal includes a wake word, transmitting the speech signal to a smart device communicatively coupled to the computing device, such that the smart device again analyzes the speech signal and determines that the speech signal includes a wake word suitable for changing an operational state of the smart device.

According to another aspect of the present invention, there is provided a smart device comprising: a communication unit adapted to communicate with a peripheral device to receive a voice signal from the peripheral device, wherein the peripheral device analyzes the voice signal and determines that the voice signal contains a wake-up word adapted to change an operational state of the smart device; a voice processing unit adapted to analyze the voice signal to determine whether the voice signal contains the wake-up word; and an operation state switching unit adapted to switch the smart device from a first operation state to a second operation state when the voice processing unit determines that the voice signal contains the wake-up word, so as to process subsequent new voice signals, wherein the energy consumption of the first operation state is lower than that of the second operation state.

According to a further aspect of the present invention, there is provided a computing device comprising a sound pick-up unit adapted to obtain a speech signal around the computing device; a voice analysis unit adapted to analyze the voice signal to determine whether the voice signal contains a wake-up word; and a communication unit adapted to send the voice signal to an intelligent device communicatively connected to the computing device when the voice analysis unit determines that the voice signal contains wake-up words, so that the intelligent device again analyzes the voice signal and determines that the voice signal contains wake-up words adapted to change an operational state of the intelligent device.

According to yet another aspect of the present invention, there is provided a speech processing system comprising the smart device described above and the computing device described above.

According to yet another aspect of the present invention, there is provided a computing device comprising: at least one processor; and a memory storing program instructions, wherein the program instructions are configured to be adapted to be executed by at least one processor, the program instructions comprising instructions for performing any of the methods described above.

According to yet another aspect of the present invention, there is provided a readable storage medium storing program instructions that, when read and executed by a computing device, cause the computing device to perform any of the methods described above.

According to the voice processing scheme of the invention, a preliminary wake-up word judgment can first be carried out on a sound pickup device with relatively low processing performance, such as a bracelet or an earphone. When the preliminary judgment determines that the wake-up word is present, the relevant speech is sent to the smart device with relatively high processing performance for a secondary wake-up word judgment, and the operation state of the smart device is changed to the operation state with relatively high energy consumption, in order to process subsequent speech, only when the secondary judgment determines that the wake-up word is present. With this scheme, the smart device does not need to run speech recognition for wake-up word judgment at all times, so the energy consumption of the smart device can be reduced.

In addition, according to the voice processing scheme of the present invention, the neural network-based voice recognition algorithm, which is basically the same but has different parameter numbers, can be run on the sound pickup apparatus and the smart apparatus to realize different accuracy and execution speeds on the sound pickup apparatus and the smart apparatus.

In addition, according to the scheme of the invention, the pickup device can be arranged at a position which is a certain distance away from the intelligent device, for example, a position which is closer to a person, so that the voice signal can be acquired more clearly, and the problem that the voice signal cannot be acquired clearly due to the fact that the intelligent device is far away from the person is solved.

The foregoing is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features and advantages of the invention become more readily apparent, specific embodiments of the invention are set forth below.

Drawings

To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which set forth the various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to fall within the scope of the claimed subject matter. The above, as well as additional objects, features, and advantages of the present disclosure will become more apparent from the following detailed description when read in conjunction with the accompanying drawings. Like reference numerals generally refer to like parts or elements throughout the present disclosure.

FIG. 1 illustrates a schematic diagram of a scenario of a speech processing system 100 according to one embodiment of the present invention;

FIG. 2 shows a schematic diagram of a computing device 200 according to one embodiment of the invention;

FIG. 3 shows a flow chart of a speech processing method 300 according to one embodiment of the invention;

FIG. 4 shows a flow chart of a speech processing method 400 according to another embodiment of the invention;

FIG. 5 shows a schematic diagram of a smart device 110 according to another embodiment of the invention; and

FIG. 6 shows a schematic diagram of a sound pickup apparatus 120 according to another embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Fig. 1 illustrates a schematic diagram of a scenario of a speech processing system 100 according to some embodiments of the invention. As shown in fig. 1, a system 100 includes a smart device 110 and one or more peripheral devices 120. The smart device 110 may be, for example, a smart speaker, a smart phone, a smart digital terminal, or another mobile terminal. Such a device may be deployed in a fixed location or be portable, and is communicatively coupled to a server 130 to provide various services. For example, the smart device 110 may be a smart speaker that receives voice input from the user 140, obtains weather or navigation information from the server 130, and provides it to the user as voice or video. The smart device 110 may also receive voice input from the user and send a shopping request to the server 130 to carry out an online shopping process.

The smart device 110 may present information to the user 140 in a variety of ways. For example, the smart device 110 may be a smart speaker that presents information to a user in an audio manner. The smart device 110 may also be a smart television or smart screen that presents information to a user in an audiovisual manner by presenting an interface on a screen or projection of the smart device 110.

Peripheral device 120 is communicatively coupled to smart device 110 in various ways. These include, but are not limited to, Bluetooth, Wi-Fi, ZigBee, and 4G or 5G mobile communication networks. The present invention is not limited to a particular communication mode, and all modes in which information can be communicated between the peripheral device 120 and the smart device 110 are within the scope of the present invention.

The peripheral device 120 is, for example, a sound pickup device such as a bracelet or an earphone. The pickup device 120 may acquire external audio information, particularly various voice information, and transmit the audio information to the smart device 110, so that the smart device 110 processes the voice information to implement voice interaction.

Some peripheral devices 120 may also have a limited-size display screen and may use the display screen to interact with the smart device 110 (confirming and viewing text messages, viewing short videos, etc.) while simultaneously performing voice interactions with the smart device 110.

Optionally, the peripheral device 120 may also act as an output device for the smart device 110. For example, the smart device 110 may output the processing result of the smart device 110 to the user 140 through the peripheral device 120 in an audio or vibration manner or the like.

The smart device 110 may have a variety of operational states. For example, when the smart device 110 has not interacted with the user for a long time, it may be in a sleep operation state with lower power consumption, in which other functions are not operated except for some necessary functions such as a communication function and the like, thus ensuring that the power consumption of the system is minimized. While the smart device 110 is processing and interacting with the user's voice signals, it may be in a normal operating state, where most of the functions are in operation and have relatively high power consumption.

According to one embodiment of the invention, the smart device 110 may wake up to switch from the sleep operation state to the normal operation state upon recognizing that a specific wake-up word is included in the received voice. The wake-up word may be a predetermined number of words or sentences, such as phrases "hello, xxx", "Hi, xxx".

The smart device 110 may employ various methods to identify a particular wake-up word from the voice information. For example, various neural network (DNN, CNN, LSTM, GRU, CRNN, DS-CNN, etc.) based deep learning methods can be used for speech recognition with high accuracy. The invention is not limited to a specific form of speech recognition algorithm, and all ways in which speech recognition processing can be performed on speech information to determine whether wake-up words are included are within the scope of the invention.

In addition, it should be noted that the smart device 110 may have more than two operating states, depending on actual needs. The present invention does not limit the number of operating states of the smart device 110; any arrangement in which the device has different power consumption before and after waking up is within the scope of the present invention.

The sound pickup apparatus 120 acquires external voice information and may perform preliminary processing on the voice information to determine whether or not wake-up words are included in the voice information. According to one embodiment of the present invention, the sound pickup apparatus 120 may recognize a specific wake-up word from the acquired voice information in various ways. For example, various neural network (DNN, CNN, LSTM, GRU, CRNN, DS-CNN, etc.) based deep learning methods can be employed for speech recognition. The invention is not limited to a specific form of speech recognition algorithm, and all ways in which speech recognition processing can be performed on speech information to determine whether wake-up words are included are within the scope of the invention. It should be noted that a voice recognition method of lower accuracy may be employed in view of the generally lower processing performance of the sound pickup apparatus 120. For example, in the case of a neural network-based deep learning method, a neural network having relatively fewer parameters and a simpler network structure may be employed. The accuracy of the voice recognition method thus constructed is lower than that of the voice recognition method employed in the smart device 110.

When the sound pickup apparatus 120 determines that the wake-up word is included in the acquired voice information, the sound pickup apparatus 120 may transmit the voice information to the smart apparatus 110, and speech recognition is performed again on the voice information at the smart apparatus 110 to secondarily determine whether the wake-up word is included. Subsequent voice interaction processing is started only after the smart device 110 determines that the voice information contains the wake-up word. The interaction process between the sound pickup apparatus 120 and the smart apparatus 110 will be described in detail below with reference to fig. 3.
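The two-stage decision described above can be sketched as follows. This is only an illustrative sketch: the detector callables, the score thresholds, and the toy scoring rules are assumptions made for the example and do not come from the disclosure, which leaves the recognition algorithm open.

```python
from typing import Callable, List, Sequence

# A detector maps a voice signal to a wake-up word score in [0, 1] (assumed interface).
Detector = Callable[[Sequence[float]], float]


def two_stage_wake(
    signal: Sequence[float],
    coarse: Detector,
    fine: Detector,
    coarse_threshold: float = 0.5,  # hypothetical value
    fine_threshold: float = 0.8,    # hypothetical value
) -> bool:
    """Return True only if both the pickup device and the smart device accept."""
    # Stage 1: lower-accuracy check on the sound pickup device.
    if coarse(signal) < coarse_threshold:
        return False  # the signal is never transmitted to the smart device
    # Stage 2: higher-accuracy check on the smart device.
    return fine(signal) >= fine_threshold


# Toy stand-ins for the two recognizers; real ones would be neural networks.
fine_calls: List[int] = []


def coarse_score(signal: Sequence[float]) -> float:
    return min(1.0, max(abs(s) for s in signal))


def fine_score(signal: Sequence[float]) -> float:
    fine_calls.append(1)  # record that the smart device had to do work
    return sum(1 for s in signal if abs(s) > 0.5) / len(signal)


silence = [0.0, 0.01, 0.0]        # rejected by stage 1
noise = [0.9, 0.1, 0.0, 0.1]      # passes stage 1, rejected by stage 2
wake = [0.9, 0.8, 0.7, 0.9, 0.6]  # accepted by both stages

results = [two_stage_wake(s, coarse_score, fine_score) for s in (silence, noise, wake)]
print(results, len(fine_calls))
```

In this sketch the second detector runs for only two of the three inputs, which mirrors the energy argument: the smart device analyzes only the signals that the pickup device has already accepted.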

It should be noted that the system 100 shown in fig. 1 is only an example, and those skilled in the art will understand that in practical applications, the system 100 may include a plurality of smart devices 110 and a plurality of sound pickup devices 120, and the present invention does not limit the number of smart devices 110 and sound pickup devices 120 included in the system 100.

According to an embodiment of the present invention, both the smart device 110 and the sound pickup device 120 may be implemented by the computing device 200 as described below. FIG. 2 shows a schematic diagram of a computing device 200 according to one embodiment of the invention.

As shown in FIG. 2, in a basic configuration 202, computing device 200 typically includes a system memory 206 and one or more processors 204. A memory bus 208 may be used for communication between the processor 204 and the system memory 206.

Depending on the desired configuration, the processor 204 may be any type of processor including, but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. Processor 204 may include one or more levels of cache, such as a first level cache 210 and a second level cache 212, a processor core 214, and registers 216. The example processor core 214 may include an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a digital signal processing core (DSP core), or any combination thereof. The example memory controller 218 may be used with the processor 204, or in some implementations, the memory controller 218 may be an internal part of the processor 204.

Depending on the desired configuration, system memory 206 may be any type of memory including, but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. The system memory 206 may include an operating system 220, one or more applications 222, and program data 224. In some implementations, the application 222 may be arranged to execute instructions on an operating system by the one or more processors 204 using the program data 224.

Computing device 200 may also include an interface bus 240 that facilitates communication from various interface devices (e.g., output devices 242, peripheral interfaces 244, and communication devices 246) to basic configuration 202 via bus/interface controller 230. The example output device 242 includes a graphics processing unit 248 and an audio processing unit 250. They may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more A/V ports 252. The example peripheral interface 244 may include a serial interface controller 254 and a parallel interface controller 256, which may be configured to facilitate communication, via one or more I/O ports 258, with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.). The example communication device 246 may include a network controller 260 that may be arranged to facilitate communication with one or more other computing devices 262 over a network communication link via one or more communication ports 264.

The network communication link may be one example of a communication medium. Communication media may typically embody computer readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or special purpose network, and wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR) or other wireless media. The term computer readable media as used herein may include both storage media and communication media.

In an embodiment according to the invention, when the computing device 200 is implemented as the smart device 110, it is configured to perform the speech processing method according to the invention that is executed in the smart device 110. The program data 224 of the computing device 200 contains a plurality of program instructions for performing the speech processing method according to the invention that is executed in the smart device 110.

Accordingly, when the computing device 200 is implemented as the sound pickup device 120, it is configured to perform the voice processing method according to the present invention that is executed in the sound pickup device 120. The program data 224 of the computing device 200 contains a plurality of program instructions for performing the voice processing method according to the present invention that is executed in the sound pickup device 120.

Fig. 3 illustrates a flow chart of a speech processing method 300 according to some embodiments of the invention. The processing method 300 is adapted to be executed in the smart device 110 and the sound pickup device 120 in the system 100. It should be noted that the method shown in fig. 3 requires the smart device 110 and the sound pickup device 120 to cooperate, each performing different method steps. This does not mean, however, that the smart device 110 and the sound pickup device 120 must exist as a pair. The method steps performed in the smart device 110 and those performed in the sound pickup device 120 may each constitute a separate speech processing method; that is, the smart device 110 may be communicatively connected to any other sound pickup device 120, and the sound pickup device 120 may be communicatively connected to any other smart device 110, all without departing from the scope of the invention.

As shown in fig. 3, the method 300 begins at step S310. In step S310, the sound pickup apparatus 120 listens to the surroundings and acquires, for example, a voice signal of the user 140 from the surroundings. For example, when the user 140 is speaking, a sound pickup device (e.g., a headset or wristband worn by the user, etc.) in the vicinity of the user 140 may acquire or receive a voice signal.

Subsequently, in step S312, in the sound pickup apparatus 120, the voice signal received in step S310 is analyzed to determine whether the voice signal contains a wake-up word. As described above with reference to fig. 1, the wake-up word is a specific word or phrase that is predetermined and that may wake up the smart device 110 to enter a normal operating state. Also as described above with reference to fig. 1, various speech recognition methods may be employed to determine whether a wake-up word is included in a speech signal. According to one embodiment, various neural network (DNN, CNN, LSTM, GRU, CRNN, DS-CNN, etc.) based deep learning methods may be employed for speech recognition.

When it is determined in step S312 that the voice signal contains a wake-up word, then in step S314, the sound pickup apparatus 120 transmits the voice signal to the smart apparatus 110 so that the smart apparatus 110 again analyzes the voice signal to secondarily determine whether the voice signal contains a wake-up word.

Alternatively, when it is determined in step S312 that the voice signal does not contain a wake-up word, the sound pickup apparatus 120 may not transmit the voice signal to the smart apparatus 110 in step S316, and may continue to acquire the voice signal around the sound pickup apparatus, returning to step S310 to restart processing of the newly received voice signal.

Optionally, before performing step S312, the sound pickup apparatus 120 may first determine whether its own battery level is below a predetermined threshold, for example, 20%. If the battery level is too low, performing the voice analysis would consume still more of the power of the sound pickup apparatus 120. In that case, the voice signal may be left unanalyzed, skipping step S312, and may instead be transmitted directly to the smart apparatus 110 in step S314 for analysis, in order to extend the service time of the sound pickup apparatus 120.
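The decision made on the sound pickup device, including the low-battery bypass, might look like the sketch below. The function name, the boolean detector interface, and the representation of the battery level as a fraction are assumptions for illustration; only the 20% example threshold is taken from the text.

```python
from typing import Callable, Sequence

LOW_BATTERY_THRESHOLD = 0.20  # the 20% example threshold mentioned in the text


def pickup_should_transmit(
    signal: Sequence[float],
    battery_level: float,
    local_detector: Callable[[Sequence[float]], bool],
    threshold: float = LOW_BATTERY_THRESHOLD,
) -> bool:
    """Decide whether the pickup device forwards the signal to the smart device."""
    if battery_level < threshold:
        # Battery too low: skip the local analysis (S312) and forward the
        # signal directly (S314) so the smart device performs the analysis.
        return True
    # Normal case: forward only if the local analysis finds the wake-up word
    # (S312 followed by S314); otherwise keep listening (S316).
    return local_detector(signal)


# Hypothetical detector that always rejects, to show the bypass.
def always_reject(signal: Sequence[float]) -> bool:
    return False


print(pickup_should_transmit([0.1, 0.2], 0.10, always_reject))  # bypass applies
print(pickup_should_transmit([0.1, 0.2], 0.80, always_reject))  # local analysis decides
```

With the bypass, the pickup device trades extra transmissions for lower local computation, which is the service-time argument made above.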

Accordingly, the smart device 110 receives the voice signal transmitted from the sound pickup device 120 in step S314. This speech signal has been subjected to speech analysis by the sound pickup apparatus 120 and is determined to contain a wake-up word in step S312.

Subsequently, in step S322, in the smart device 110, the voice signal received in step S314 is analyzed to determine again whether the voice signal contains a wake-up word. The smart device 110, as described above with reference to fig. 1, may employ various speech recognition methods to determine whether a wake-up word is included in a speech signal. According to one embodiment, various neural network (DNN, CNN, LSTM, GRU, CRNN, DS-CNN, etc.) based deep learning methods may be employed for speech recognition.

When the smart device 110 again determines in step S322 that the voice signal contains the wake-up word, then in step S324 the operation state of the smart device 110 is switched from the low-power state to the high-power state.

As described above with reference to fig. 1, the smart device 110 has a plurality of operating states. Most of the functions are not in operation before the smart device 110 is woken up, so the smart device 110 is in a sleep operation state with lower power consumption. After being awakened, the smart device 110 may enter a normal operating state, in which most of the functions of the device begin to operate normally and the device processes subsequent new voice or video inputs from the user with higher power consumption.

Accordingly, when the smart device 110 determines in step S322 that the voice signal does not contain the wake-up word, the smart device 110 may remain in the sleep operation state and wait to receive, from the sound pickup device 120, another voice signal that the sound pickup device 120 has determined to contain the wake-up word, for processing.

It should be noted that the smart device 110 may wake up temporarily after receiving the voice signal, and perform a voice recognition method to process the voice signal, and return to the sleep state when it is determined that the voice signal does not contain a wake-up word.

Alternatively, the smart device 110 may also provide another operation state dedicated to performing the voice recognition method. The energy consumption of this operation state may lie between that of the sleep operation state and that of the normal operation state. When the smart device 110 receives the voice signal in step S314, it switches from the sleep state to this intermediate operation state; when it is determined in step S322 that the voice signal contains the wake-up word, the device further switches to the normal operation state in step S324; and when it is determined in step S322 that the voice signal does not contain the wake-up word, the device switches back to the sleep operation state.
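The three operation states and their transitions can be sketched as a small state machine. The class, the state names, and the boolean detector are hypothetical illustrations; the sketch only mirrors the transitions named in steps S314, S322 and S324.

```python
from enum import Enum
from typing import Callable, Sequence


class State(Enum):
    SLEEP = "sleep"          # lowest energy consumption
    ANALYZING = "analyzing"  # intermediate state used only for recognition
    NORMAL = "normal"        # full operation, highest energy consumption


class SmartDevice:
    """Toy model of the smart device's operation states."""

    def __init__(self, detector: Callable[[Sequence[float]], bool]) -> None:
        self.state = State.SLEEP
        self._detector = detector  # stands in for the recognition method

    def on_voice_signal(self, signal: Sequence[float]) -> State:
        if self.state is State.NORMAL:
            return self.state           # already awake, nothing to switch
        self.state = State.ANALYZING    # S314: signal received
        if self._detector(signal):      # S322: secondary judgment
            self.state = State.NORMAL   # S324: wake up
        else:
            self.state = State.SLEEP    # no wake-up word: back to sleep
        return self.state


# Hypothetical detector: treats a large summed amplitude as the wake-up word.
device = SmartDevice(lambda signal: sum(signal) > 1.0)
first = device.on_voice_signal([0.1, 0.2])   # rejected
second = device.on_voice_signal([0.9, 0.8])  # accepted
print(first, second)
```

A real device would hold the intermediate state for the duration of the analysis; here the transition is instantaneous because the detector is a plain function call.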

Optionally, in the above steps S312 and S322, the voice signal is analyzed in the sound pickup apparatus 120 and in the smart device 110, respectively, to determine whether it contains the wake-up word. It should be noted that, considering the processing performance of the sound pickup apparatus 120 and the smart device 110 and the order in which they analyze the voice signal, the accuracy with which the sound pickup apparatus 120 determines whether the wake-up word is contained may be lower than the accuracy with which the smart device 110 does so.

According to one embodiment, when both the sound pickup apparatus 120 and the smart apparatus 110 perform voice recognition using a neural network (DNN, CNN, LSTM, GRU, CRNN, DS-CNN, etc.) based deep learning method, the neural network structure employed in the sound pickup apparatus 120 may be simpler than that employed in the smart apparatus 110. For example, the number of parameters of the neural network structure employed in the sound pickup apparatus 120 is smaller than the number of parameters of the neural network in the smart apparatus 110, or the number of network layers of the neural network structure employed in the sound pickup apparatus 120 is smaller than the number of network layers of the neural network in the smart apparatus 110.
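
To make the comparison of parameter counts and layer counts concrete, the following minimal sketch counts the weights and biases of two fully connected networks. The layer widths are invented purely for illustration and do not come from the invention; the function name dense_param_count is likewise hypothetical.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input layer first.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical layer widths, chosen only for illustration.
PICKUP_NET = [40, 64, 2]            # small model on the sound pickup device
SMART_NET = [40, 256, 256, 256, 2]  # larger model on the smart device

pickup_params = dense_param_count(PICKUP_NET)  # 2,754 parameters
smart_params = dense_param_count(SMART_NET)    # 142,594 parameters
```

Under these assumed widths the pickup-side network has both fewer layers and roughly fifty times fewer parameters than the network on the smart device.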

Optionally, the sound pickup apparatus 120 encodes the voice signal to be transmitted in a predetermined format. The voice signal comprises multiple channels of audio signals. The encoded voice signal comprises a first portion indicating the number of audio channels, a second portion indicating the length of each channel of audio signal, and the multi-channel audio signals themselves. In addition, the voice signal may also include multiple channels of reference audio signals. In this case, the first portion of the encoded voice signal also indicates the number of reference audio channels, and the encoded voice signal further comprises a third portion indicating the length of each channel of reference audio signal, together with the multi-channel reference audio signals themselves.

The specific format definition of the data message after encoding the voice signal is given below:

In the above format, the audio data in a data packet are arranged channel by channel: first the audio data picked up by the sound pickup device (i.e., mic audio), followed by the reference audio data. The first byte identifies the number of audio channels, and the following four bytes represent the length of each channel of mic audio and the length of each channel of reference audio, respectively. The audio data themselves then follow. Each data packet must contain data for all audio channels of the sound pickup apparatus 120.
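
One possible reading of this layout is sketched below. The text does not fix the bit-level details, so the sketch makes several assumptions: the first byte packs the mic channel count in its high four bits and the reference channel count in its low four bits, the following four bytes are two little-endian 16-bit lengths (bytes per mic channel, bytes per reference channel), and all channels of one kind have equal length. The function names are hypothetical.

```python
import struct

# Assumed header: 1 count byte + two little-endian uint16 lengths = 5 bytes.
HEADER = struct.Struct("<BHH")


def encode_packet(mic_channels, ref_channels):
    """Pack mic channels, then reference channels, behind the 5-byte header."""
    mic_len = len(mic_channels[0]) if mic_channels else 0
    ref_len = len(ref_channels[0]) if ref_channels else 0
    if any(len(c) != mic_len for c in mic_channels):
        raise ValueError("all mic channels must have the same length")
    if any(len(c) != ref_len for c in ref_channels):
        raise ValueError("all reference channels must have the same length")
    if len(mic_channels) > 15 or len(ref_channels) > 15:
        raise ValueError("each channel count must fit in four bits")
    counts = (len(mic_channels) << 4) | len(ref_channels)
    header = HEADER.pack(counts, mic_len, ref_len)
    return header + b"".join(mic_channels) + b"".join(ref_channels)


def decode_packet(packet):
    """Inverse of encode_packet: return (mic_channels, ref_channels)."""
    counts, mic_len, ref_len = HEADER.unpack_from(packet, 0)
    n_mic, n_ref = counts >> 4, counts & 0x0F
    body = packet[HEADER.size:]
    if len(body) != n_mic * mic_len + n_ref * ref_len:
        raise ValueError("packet length does not match its header")
    mic = [body[i * mic_len:(i + 1) * mic_len] for i in range(n_mic)]
    ref_start = n_mic * mic_len
    ref = [body[ref_start + i * ref_len:ref_start + (i + 1) * ref_len]
           for i in range(n_ref)]
    return mic, ref
```

With 16-bit length fields, each channel can carry at most 65,535 bytes per packet under these assumptions, so longer audio would have to be split across several packets, each still containing every channel.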

According to one embodiment, the sound pickup device 120 may acquire the voice signal at a 16 kHz sampling rate with a bit width of 16 bits. Each time, the sound pickup apparatus 120 may transmit a voice signal 3 seconds long to the smart device 110 for secondary confirmation.
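
As a quick check on these figures, the amount of audio data in one 3-second transmission can be worked out as follows. The sketch assumes uncompressed PCM samples and, by default, a single channel; neither is stated in the text, and the function name pcm_bytes is hypothetical.

```python
SAMPLE_RATE_HZ = 16_000  # 16 kHz sampling rate
BIT_WIDTH = 16           # bits per sample
DURATION_S = 3           # seconds of audio sent for secondary confirmation


def pcm_bytes(sample_rate_hz, bit_width, duration_s, channels=1):
    """Size in bytes of uncompressed PCM audio (assumed encoding)."""
    return sample_rate_hz * (bit_width // 8) * duration_s * channels


clip_bytes = pcm_bytes(SAMPLE_RATE_HZ, BIT_WIDTH, DURATION_S)  # 96,000 bytes
```

Under these assumptions each channel contributes 96,000 bytes per transmission, which gives a sense of the payload a wireless link such as Bluetooth would need to carry.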

Fig. 4 shows a flow chart of a speech processing method 400 according to another embodiment of the invention. The speech processing method 400 is a further embodiment of the speech processing method 300 and therefore the same or similar labels as in the method 300 are used to indicate the same or similar processing steps.

The method 400 differs from the method 300 in that, in addition to the sound pickup apparatus 120a, another sound pickup apparatus 120b also receives voice information from the surrounding environment and determines whether the voice information contains a wake-up word. Accordingly, steps S310a, S312a, S314a, and S316a performed in the sound pickup apparatus 120a, and steps S310b, S312b, S314b, and S316b performed in the sound pickup apparatus 120b, are the same as steps S310, S312, S314, and S316 in fig. 3.

The method 400 further includes step S420: in the smart device 110, when voice signals are received from the sound pickup devices 120a and 120b in steps S314a and S314b, respectively, the voice signal from one of them needs to be selected for subsequent processing. According to one embodiment of the present invention, in step S420, the voice signal having the greatest sound intensity is selected from among the received voice signals as the voice signal to be analyzed. According to other embodiments, the selection from the plurality of received voice signals may also be made under other conditions according to actual needs; for example, the voice signal with the better sound quality may be selected. Any way of selecting a high-quality voice signal is within the scope of the invention.
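
The selection in step S420 can be sketched as follows. The text does not define how sound intensity is measured; this sketch assumes root-mean-square amplitude over 16-bit little-endian mono PCM samples, and the function names and device identifiers are hypothetical.

```python
import math
import struct


def rms_intensity(pcm: bytes) -> float:
    """Root-mean-square amplitude of 16-bit little-endian mono PCM."""
    n = len(pcm) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", pcm[:2 * n])
    return math.sqrt(sum(s * s for s in samples) / n)


def select_loudest(signals: dict) -> str:
    """Return the id of the device whose signal has the highest RMS.

    `signals` maps a device id (e.g. "120a") to its PCM bytes and must
    contain at least one entry.
    """
    return max(signals, key=lambda dev: rms_intensity(signals[dev]))
```

Returning the device identifier rather than the audio itself also records which sound pickup device supplied the selected signal, which is the information needed later when one device is instructed to keep capturing speech.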

It should be noted that although two sound pickup apparatuses 120a and 120b are described above with reference to fig. 4, the present invention does not limit the number of sound pickup apparatuses communicatively connected to the smart device 110.

After the speech signal to be analyzed is determined in step S420, the subsequent processing is performed in step S322. These processes are identical to the corresponding steps in the method 300 described with reference to fig. 3 and are not described in detail.

In addition, after the smart device 110 is switched to the normal operation state for voice interaction with the user at step S324, one of the sound pickup devices is instructed by the smart device 110 to continue receiving a new voice signal for voice interaction at step S430.

According to one embodiment, the smart device 110 selects, as the device to instruct, the sound pickup device 120 whose voice signal was confirmed in the secondary determination to contain the wake-up word. For example, as shown in fig. 4, the voice signal from the sound pickup apparatus 120b is selected in step S420 for the secondary confirmation; therefore, in step S430, the sound pickup apparatus 120b is instructed to acquire a new voice signal and transmit it to the smart device 110 for voice interaction processing.

According to the voice processing scheme of the present invention, a preliminary wake-up word determination may be performed on a sound pickup device 120 having relatively low processing performance, such as a bracelet or an earphone. When the preliminary determination finds the wake-up word, the relevant voice signal is sent to the smart device 110, which has relatively high processing performance, for a secondary wake-up word determination. Only when the secondary determination also finds the wake-up word is the smart device 110 switched to an operating state with relatively high energy consumption so as to process the subsequent voice. With this scheme, the smart device 110 does not need to run the voice recognition process for wake-up word determination continuously, so its power consumption can be reduced.
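
The two-stage scheme can be summarized in a few lines. This is a minimal sketch, assuming each detector is simply a callable that returns True when it finds the wake-up word; the function name handle_utterance and the returned state labels are hypothetical.

```python
from typing import Callable

Detector = Callable[[bytes], bool]


def handle_utterance(signal: bytes,
                     pickup_detect: Detector,
                     smart_detect: Detector) -> str:
    """Run the two-stage check and report the resulting smart-device state."""
    if not pickup_detect(signal):
        return "sleep"   # signal never leaves the sound pickup device
    if not smart_detect(signal):
        return "sleep"   # secondary check rejects a false alarm
    return "normal"      # both stages found the wake-up word
```

The second detector is only called for signals that pass the first, which is where the energy saving on the smart device comes from.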

In addition, according to the present invention, the sound pickup device 120 may be placed apart from the smart device 110, for example closer to the user, so that the voice signal can be acquired more clearly. This mitigates the problem that the smart device 110 cannot acquire the voice signal clearly when it is far from the user.

Fig. 5 shows a schematic diagram of a smart device 110 according to another embodiment of the invention. Fig. 5 shows the components of the smart device 110 in a logically divided manner. It should be noted that this division may be further subdivided or recombined depending on the actual physical implementation without departing from the scope of the present invention, and any smart device 110 having the logical components shown in fig. 5 is within the scope of the present invention.

As shown in fig. 5, the smart device 110 includes a communication unit 510, a voice processing unit 520, and an operation state switching unit 530.

The communication unit 510 provides a communication function for the smart device 110 and communicates with a peripheral device such as the sound pickup device 120 to receive a voice signal from the peripheral device 120. As described above, the sound pickup apparatus 120 has analyzed the voice signal and determined that it contains the wake-up word before transmitting it to the smart device 110. The communication unit 510 may be communicatively connected to the sound pickup apparatus 120 in various manners including, but not limited to, Bluetooth, ZigBee, Wi-Fi, and mobile communication. The present invention does not limit the communication mode, and all modes in which information can be communicated between the peripheral device 120 and the smart device 110 are within the scope of the present invention.

It should be noted that according to one embodiment, the voice signal may be encoded in a predetermined format so as to be suitable for transmission between the sound pickup apparatus 120 and the smart device 110. The format of the voice signal has been described in detail above, and will not be described here again.

The voice processing unit 520 is coupled to the communication unit 510 and analyzes the voice signal received by the communication unit 510 to determine again whether the voice signal contains the wake-up word. As described above, the voice processing unit 520 may recognize a specific wake-up word from the voice information using various methods. For example, the voice processing unit 520 may employ various deep learning methods based on neural networks (DNN, CNN, LSTM, GRU, CRNN, DS-CNN, etc.) for voice signal processing.

When the voice processing unit 520 determines again that the voice signal contains the wake-up word, the operation state switching unit 530 switches the operating state of the smart device 110 from the low-power state to the high-power state. As described above with reference to fig. 1, the smart device 110 has a plurality of operating states. Before the smart device 110 is woken up, most of its functions are not running, so the smart device 110 is in a sleep operating state with lower power consumption. After being woken up, the smart device 110 may enter a normal operating state, in which most of the functions of the device begin to operate normally and subsequent new voice or video inputs from the user are processed at higher power consumption.

Accordingly, when the voice processing unit 520 determines that the voice signal does not contain the wake-up word, the operation state switching unit 530 does not switch the operating state of the smart device 110. The smart device 110 may remain in the sleep operating state and wait to receive, for processing, another voice signal that the sound pickup apparatus 120 has determined to contain the wake-up word.

It should be noted that the smart device 110 may be temporarily awakened by the operation state switching unit 530 after receiving the voice signal, perform a voice recognition method by the voice processing unit 520 to process the voice signal, and switch back to the sleep state by using the operation state switching unit 530 when it is determined that the voice signal does not contain the awakening word.

Optionally, according to an embodiment of the present invention, both the sound pickup apparatus 120 and the voice processing unit 520 analyze the voice signal to determine whether it contains the wake-up word. It should be noted that, considering the processing performance of the sound pickup apparatus 120 and the smart device 110 and the order in which they analyze the voice signal, the accuracy with which the sound pickup apparatus 120 determines whether the wake-up word is contained may be lower than the accuracy with which the voice processing unit 520 does so.

According to one embodiment, when both the sound pickup apparatus 120 and the voice processing unit 520 perform voice recognition using a deep learning method based on a neural network (DNN, CNN, LSTM, GRU, CRNN, DS-CNN, etc.), the neural network structure employed in the sound pickup apparatus 120 may be simpler than that employed in the voice processing unit 520. For example, the neural network structure employed in the sound pickup apparatus 120 has fewer parameters than the neural network in the voice processing unit 520, or has fewer network layers than the neural network in the voice processing unit 520.

Alternatively, according to one embodiment of the invention, the communication unit 510 may also receive voice signals from more than one peripheral device 120. For example, in the system 100 shown in fig. 1, when the user 140 wears the headset 120a and the bracelet 120b, the voice information of the user is received by the headset 120a and the bracelet 120b simultaneously, and is primarily determined to include the wake-up word, and is sent to the smart device 110 for reconfirmation.

At this time, the voice processing unit 520 needs to select one of the received voice signals for subsequent processing. According to one embodiment of the present invention, the voice processing unit 520 selects the voice signal having the greatest sound intensity from among the received voice signals as the voice signal to be analyzed. According to other embodiments, the selection from the plurality of received voice signals may also be made under other conditions according to actual needs; for example, the voice signal with the better sound quality may be selected. Any way of selecting a high-quality voice signal is within the scope of the invention.

In addition, after the operation state switching unit 530 switches the smart device 110 to the normal operation state for voice interaction with the user, the communication unit 510 instructs one of the sound pickup apparatuses 120 to continue receiving a new voice signal for voice interaction.

According to one embodiment, the smart device 110 selects, as the device to instruct, the sound pickup device 120 whose voice signal was confirmed in the secondary determination to contain the wake-up word. For example, as described above with reference to fig. 4, if the voice signal from the sound pickup apparatus 120b is selected for the secondary confirmation, the communication unit 510 instructs the sound pickup apparatus 120b to acquire a new voice signal and transmit it to the smart device 110 for voice interaction processing.

Fig. 6 shows a schematic diagram of a sound pickup apparatus 120 according to another embodiment of the present invention. Fig. 6 shows the individual components of the sound pickup apparatus 120 in a logically divided manner, it being noted that such division may be subdivided further or recombined in accordance with the actual physical implementation, without departing from the scope of the present invention, and any sound pickup apparatus 120 having the logical components shown in fig. 6 is within the scope of the present invention.

As shown in fig. 6, the sound pickup apparatus 120 includes a sound pickup unit 610, a voice analysis unit 620, and a communication unit 630.

The sound pickup unit 610 listens to the surroundings and acquires, for example, a voice signal of the user 140 from the surroundings. For example, when the user 140 is speaking, a sound pickup device (e.g., a headset or wristband worn by the user, etc.) in the vicinity of the user 140 may acquire or receive a voice signal.

The voice analysis unit 620 is coupled to the sound pickup unit 610, and analyzes the voice signal received by the sound pickup unit 610 to determine whether the voice signal contains a wake-up word. As described above with reference to fig. 1, the wake-up word is a specific word or phrase that is predetermined and that may wake up the smart device 110 to enter a normal operating state. Also as described above with reference to fig. 1, various speech recognition methods may be employed to determine whether a wake-up word is included in a speech signal. According to one embodiment, various neural network (DNN, CNN, LSTM, GRU, CRNN, DS-CNN, etc.) based deep learning methods may be employed for speech recognition.

When the voice analysis unit 620 determines that the voice signal contains a wake word, the communication unit 630 transmits the voice signal to the smart device 110 so that the smart device 110 again analyzes the voice signal to secondarily determine whether the voice signal contains the wake word.

Alternatively, when the voice analysis unit 620 determines that the voice signal does not include the wake-up word, the communication unit 630 does not transmit the voice signal to the smart device 110, and may continue to acquire the voice signal around the sound pickup device by the sound pickup unit 610 to resume processing of the newly received voice signal.

Further, optionally, the sound pickup apparatus 120 may first determine whether its own battery level is below a predetermined threshold, for example, 20%. If the battery level is too low, then, because analysis by the voice analysis unit 620 would consume further power of the sound pickup apparatus 120, the voice signal may be left unanalyzed and transmitted directly by the communication unit 630 to the smart device 110 for analysis, in order to extend the operating time of the sound pickup apparatus 120.
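
The battery check can be sketched as a guard placed in front of the local analysis. This is a minimal illustration: battery level is assumed to be a fraction between 0 and 1, the detector is assumed to be a callable returning True when it finds the wake-up word, and the names should_forward and LOW_BATTERY_THRESHOLD are hypothetical.

```python
LOW_BATTERY_THRESHOLD = 0.20  # the 20% example threshold from the text


def should_forward(signal, battery_level, detect,
                   threshold=LOW_BATTERY_THRESHOLD):
    """Decide whether the pickup device sends `signal` to the smart device."""
    if battery_level < threshold:
        # Low battery: skip local analysis and let the smart device decide.
        return True
    # Otherwise forward only if the local detector finds the wake-up word.
    return detect(signal)
```

Note the trade-off this implies: below the threshold every captured signal is forwarded, so the pickup device saves computation at the cost of more transmissions and more analysis work on the smart device.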

Optionally, according to an embodiment of the present invention, both the voice analysis unit 620 and the smart device 110 analyze the voice signal to determine whether it contains the wake-up word. It should be noted that, considering the processing performance of the sound pickup device 120 and the smart device 110 and the order in which they analyze the voice signal, the accuracy with which the voice analysis unit 620 determines whether the wake-up word is contained may be lower than the accuracy with which the smart device 110 does so.

According to one embodiment, when the voice analysis unit 620 and the smart device 110 both use the deep learning method based on the neural network (DNN, CNN, LSTM, GRU, CRNN, DS-CNN, etc.) for voice recognition, the neural network structure used in the voice analysis unit 620 may be simpler than that used in the smart device 110. For example, the number of parameters of the neural network structure used in the voice analysis unit 620 is smaller than the number of parameters of the neural network in the smart device 110, or the number of network layers of the neural network structure used in the voice analysis unit 620 is smaller than the number of network layers of the neural network in the smart device 110.

Optionally, during the above interaction between the sound pickup device 120 and the smart device 110, some sound pickup devices and smart devices have display interfaces, and information related to the interaction may be presented on these interfaces to help the user better understand the interaction process. For example, the interface of the sound pickup device or the smart device may provide the user with a setting for whether to perform voice signal analysis for wake-up word recognition, a setting for the power threshold, and the like. As the interaction proceeds, prompts such as "wake-up word detected, sending to the smart device 110 for secondary confirmation", "voice received from the sound pickup device, wake-up word detected", and "wake-up word detected, switching device operating state" may be displayed. All such interaction modes are within the protection scope of the invention.

The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as removable hard drives, USB flash drives, floppy diskettes, CD-ROMs, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.

In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to perform the method of the invention in accordance with instructions in said program code stored in the memory.

By way of example, and not limitation, readable media comprise readable storage media and communication media. The readable storage medium stores information such as computer readable instructions, data structures, program modules, or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of readable media.

In the description provided herein, algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with examples of the invention. The required structure for a construction of such a system is apparent from the description above. In addition, the present invention is not directed to any particular programming language. It will be appreciated that the teachings of the present invention described herein may be implemented in a variety of programming languages, and the above description of specific languages is provided for disclosure of enablement and best mode of the present invention.

In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment, or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into a plurality of sub-modules.

Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.

Furthermore, some of the embodiments are described herein as methods or combinations of method elements that may be implemented by a processor of a computer system or by other means of performing the functions. Thus, a processor with the necessary instructions for implementing the described method or method element forms a means for implementing the method or method element. Furthermore, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is for carrying out the functions performed by the elements for carrying out the objects of the invention.

As used herein, unless otherwise specified the use of the ordinal terms "first," "second," "third," etc., to describe a general object merely denote different instances of like objects, and are not intended to imply that the objects so described must have a given order, either temporally, spatially, in ranking, or in any other manner.

While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of the above description, will appreciate that other embodiments are contemplated within the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the invention, which is defined by the appended claims.

Claims

1. A speech processing method adapted to be executed in a computing device, the method comprising the steps of:

Receiving a voice signal from a peripheral device communicatively connected to the computing device when the peripheral device has analyzed the voice signal and determined that the voice signal contains a wake-up word suitable for changing an operating state of the computing device;

Analyzing the voice signal to determine whether the voice signal contains the wake-up word; and

When it is determined that the voice signal contains the wake-up word, switching the computing device from a first operating state to a second operating state so as to process a subsequent new voice signal, wherein the energy consumption of the first operating state is lower than that of the second operating state.

2. The method of claim 1, further comprising the step of:

When it is determined that the voice signal does not contain the wake-up word, not switching the operating state of the computing device.

3. The method of claim 1 or 2, wherein the peripheral device analyzes the speech signal to determine whether the wake word is included with a first accuracy;

the computing device analyzing the speech signal to determine whether the wake word is included with a second accuracy; and

The first accuracy is lower than the second accuracy.

4. The method of claim 3, wherein the peripheral device analyzes the speech signal using a first neural network algorithm and the computing device analyzes the speech signal using a second neural network algorithm, and

The first neural network has fewer parameters than the second neural network.

5. The method of any of claims 1-4, wherein receiving the voice signal from the peripheral device comprises:

receiving voice signals from more than one peripheral device; and

From the received speech signals, the speech signal with the highest sound intensity is selected as the speech signal to be analyzed.

6. The method of any of claims 1-5, further comprising the step of:

After the computing device switches to the second operational state, the peripheral device is instructed to receive a new voice signal in order to send the received new voice signal to the computing device for processing.

7. The method of any of claims 1-6, wherein the speech signal comprises a multi-channel audio signal, and the step of receiving the speech signal from the peripheral device comprises:

A speech signal encoded in a predetermined format is received, the encoded speech signal comprising a first portion indicating the number of audio channels, a second portion indicating the length of each channel of the multi-channel audio signal, and the multi-channel audio signal.

8. The method of claim 7, wherein the first portion further indicates the number of channels of a reference audio signal; and

The encoded speech signal further comprises a multi-channel reference audio signal and a third portion indicating the length of each channel of the multi-channel reference audio signal.

9. A speech processing method adapted to be executed in a computing device, the method comprising the steps of:

Receiving a voice signal;

Analyzing the voice signal to determine whether the voice signal contains a wake-up word; and

When it is determined that the voice signal contains the wake-up word, transmitting the voice signal to a smart device communicatively connected to the computing device, so that the smart device analyzes the voice signal again to determine whether the voice signal contains the wake-up word, the wake-up word being suitable for changing the operating state of the smart device, and, upon determining that the voice signal contains the wake-up word, switches the smart device from a first operating state to a second operating state so as to process a subsequent new voice signal, wherein the energy consumption of the first operating state is lower than that of the second operating state.

10. The method of claim 9, further comprising the step of:

when it is determined that the voice signal does not contain the wake-up word, not sending the voice signal to the smart device.

11. The method of claim 9 or 10, wherein the computing device analyzes the speech signal to determine whether the wake word is included with a first accuracy;

The smart device analyzes the voice signal to determine whether the wake-up word is included with a second accuracy; and

The first accuracy is lower than the second accuracy.

12. The method of claim 11, wherein the computing device analyzes the speech signal using a first neural network algorithm and the smart device analyzes the speech information using a second neural network algorithm, and

13. The method according to any one of claims 9-12, further comprising the step of:

when the battery level of the computing device is below a predetermined threshold, not analyzing the voice signal and sending it to the smart device, so that the smart device determines whether the voice signal contains the wake-up word.

14. The method of any of claims 9-13, wherein the speech signal comprises a multi-path audio signal, and the step of transmitting the speech signal to the smart device comprises:

Transmitting a speech signal encoded in a predetermined format, the encoded speech signal comprising a first portion indicative of a number of audio signal paths, a second portion indicative of a length of each of the plurality of audio signals, and the plurality of audio signals.
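One possible layout for the encoded speech signal of claim 14 is sketched below. The claim fixes only the order of the three portions (path count, per-path lengths, audio data); the field widths (one byte for the count, four bytes per length), the little-endian byte order, and the names `encode_frame` and `decode_frame` are assumptions made for illustration.

```python
import struct
from typing import List


def encode_frame(channels: List[bytes]) -> bytes:
    """Pack multi-path audio as: count | lengths | concatenated audio."""
    if len(channels) > 255:
        raise ValueError("at most 255 audio paths fit in a one-byte count")
    # First portion: number of audio signal paths (assumed: 1 unsigned byte).
    first = struct.pack("<B", len(channels))
    # Second portion: length of each path in bytes (assumed: uint32 each).
    second = struct.pack(f"<{len(channels)}I", *(len(c) for c in channels))
    # Remaining portion: the audio signals themselves, in order.
    return first + second + b"".join(channels)


def decode_frame(frame: bytes) -> List[bytes]:
    """Inverse of encode_frame."""
    (count,) = struct.unpack_from("<B", frame, 0)
    lengths = struct.unpack_from(f"<{count}I", frame, 1)
    offset = 1 + 4 * count
    channels = []
    for length in lengths:
        channels.append(frame[offset:offset + length])
        offset += length
    return channels
```

Because the lengths precede the data, the receiver can split the payload into its paths without any delimiter inside the audio itself.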

15. The method of claim 14, wherein the first portion further indicates a number of paths of a reference audio signal; and

16. A smart device, comprising:

A communication unit adapted to communicate with a peripheral device to receive a voice signal from the peripheral device when the peripheral device analyzes the voice signal and determines that the voice signal contains a wake-up word adapted to change an operational state of the smart device;

A voice processing unit adapted to analyze the voice signal to determine whether the voice signal contains the wake-up word; and

An operating state switching unit adapted to switch the smart device from a first operating state to a second operating state when the voice processing unit determines that the voice signal contains the wake-up word, so as to process a subsequent new voice signal, wherein the energy consumption of the first operating state is lower than that of the second operating state.

17. The smart device of claim 16, wherein the operating state switching unit is further adapted to not switch the operating state of the smart device when the speech processing unit determines that the speech signal does not contain the wake-up word.

18. The smart device of claim 16 or 17, wherein the peripheral device analyzes the speech signal to determine whether the wake word is included with a first accuracy;

The voice processing unit analyzes the voice signal to determine whether the wake-up word is included with a second accuracy; and

The first accuracy is lower than the second accuracy.

19. The smart device of claim 18, wherein the peripheral device uses a first neural network algorithm to analyze the speech signal and the speech processing unit uses a second neural network algorithm to analyze the speech information, and

20. A smart device according to any one of claims 16-19, wherein the communication unit is adapted to receive speech signals from more than one peripheral device; and

The speech processing unit is adapted to select, from the received speech signals, the speech signal with the greatest sound intensity as the speech information to be analyzed.
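The selection in claim 20 could, for example, be sketched as follows. The claims do not say how sound intensity is measured; root-mean-square amplitude is used here as an assumed stand-in, and the names `rms` and `select_loudest` are hypothetical.

```python
import math
from typing import Dict, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude; 0.0 for an empty signal."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def select_loudest(signals: Dict[str, Sequence[float]]) -> str:
    """Return the key of the signal with the greatest RMS amplitude.

    `signals` maps a peripheral identifier to its received samples.
    """
    if not signals:
        raise ValueError("no speech signals received")
    return max(signals, key=lambda name: rms(signals[name]))
```

For instance, given samples from a headset and a bracelet, `select_loudest` returns the identifier of whichever device captured the stronger signal, and only that signal is passed on for wake-up word analysis.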

21. A computing device, comprising:

A sound pick-up unit adapted to obtain a speech signal around the computing device;

A voice analysis unit adapted to analyze the voice signal to determine whether the voice signal contains a wake-up word; and

A communication unit adapted to send the voice signal to a smart device in communication connection with the computing device when the voice analysis unit determines that the voice signal contains the wake-up word, so that the smart device analyzes the voice signal again to determine whether the voice signal contains the wake-up word, the wake-up word being suitable for changing the operation state of the smart device, and, upon determining that the voice signal contains the wake-up word, switches from a first operation state to a second operation state so as to process a subsequent new voice signal, wherein the energy consumption of the first operation state is lower than that of the second operation state.

22. The computing device of claim 21, wherein the communication unit is further adapted to not send the voice signal to the smart device when the voice analysis unit determines that the voice signal does not contain the wake word.

23. The computing device of claim 21 or 22, wherein the speech analysis unit analyzes the speech signal to determine whether the wake word is included with a first accuracy;

The first accuracy is lower than the second accuracy.

24. The computing device of any of claims 21-23, wherein:

the voice analysis unit is adapted to not analyze the voice signal when the power of the computing device is below a predetermined threshold, and

The communication unit is adapted to send the voice signal to the smart device when the power of the computing device is below the predetermined threshold, so that the smart device determines whether the voice signal contains the wake-up word.

25. A speech processing system, comprising:

A smart device according to any one of claims 16-20; and

One or more computing devices as recited in any of claims 21-24, communicatively coupled to the smart device.

26. The speech processing system of claim 25 wherein the smart device is a smart speaker and the computing device is a headset or a bracelet.

27. A smart speaker, comprising:

a communication unit adapted to communicate with a peripheral device to receive a voice signal from the peripheral device when the peripheral device analyzes the voice signal and determines that the voice signal contains a wake-up word adapted to change an operational state of the smart speaker;

An operating state switching unit adapted to switch the smart speaker from a first operating state to a second operating state when the voice processing unit determines that the voice signal contains the wake-up word, so as to process a subsequent new voice signal, wherein the energy consumption of the first operating state is lower than that of the second operating state.

28. A computing device, comprising:

at least one processor; and

A memory storing program instructions, wherein the program instructions are adapted to be executed by the at least one processor, the program instructions comprising instructions for performing the method of any of claims 1-15.