CN112820283A - Voice processing method, device and system - Google Patents

Voice processing method, device and system

Info

Publication number
CN112820283A
CN112820283A (application CN201911129042.4A)
Authority
CN
China
Prior art keywords
voice signal
voice
word
signal
wake
Prior art date
Legal status
Pending
Application number
CN201911129042.4A
Other languages
Chinese (zh)
Inventor
胡俊锋
汪贇
黄俊岚
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201911129042.4A
Publication of CN112820283A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephone Function (AREA)

Abstract

A speech processing method receives a speech signal from a peripheral device communicatively coupled to a computing device, wherein the peripheral device has analyzed the speech signal and determined that the speech signal contains a wake-up word adapted to change an operational state of the computing device; analyzing the voice signal again to determine whether the voice signal contains a wake-up word; and switching the computing device from a first operating state to a second operating state for processing a subsequent new speech signal upon determining that the speech signal contains a wake-up word, wherein the energy consumption of the first operating state is lower than the energy consumption of the second operating state. The invention also discloses corresponding intelligent equipment, peripheral equipment and an intelligent system comprising the equipment.

Description

Voice processing method, device and system
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a speech processing method, device, and system for recognizing a wake-up word.
Background
Over the past decade, the internet has penetrated every field of people's lives; people can conveniently shop, socialize, seek entertainment, and manage finances online. Meanwhile, to improve user experience, researchers have implemented a number of interaction schemes, such as text input, gesture input, and voice input. Among them, intelligent voice interaction has become a research hotspot for the new generation of interaction modes due to its convenience of operation.
Currently, with the rapid development of the Internet of Things and smart technologies, intelligent voice devices such as smart speakers and smart mobile terminals have appeared on the market. In some usage scenarios, a smart voice device can recognize the voice input of a user through speech recognition technology and thereby provide personalized services, such as playing various audio content, smart home control, and the like.
Smart devices such as smart speakers are often deployed in fixed areas such as the living room of a home. To save energy, these devices stay in a low-energy-consumption state when idle for a long time and enter a normal operating state, which consumes more energy, when they need to operate, for example to interact with a user. The process of changing the device from the low-power-consumption state to the normal operating state is referred to as the wake-up process.
At present, waking an intelligent voice device by voice is a very convenient mode of user interaction. In voice wake-up processing, the intelligent voice device acquires the user's voice and performs wake-up processing when it determines that the voice contains the specific wake-up word of the device. However, to wake up in time, a high-performance intelligent device must keep running with a certain level of energy consumption and cannot remain on standby for long when powered by a battery. Such a device therefore generally needs mains power, is inconvenient to carry around, and limits the range over which the device can be awakened by voice.
Portable low-power voice access devices, on the other hand, lack the computing power to run an excellent voice wake-up algorithm, and therefore suffer from a low wake-up success rate.
Therefore, a new speech processing scheme is needed to reduce the energy consumption of the smart device while performing the wake-up process with high accuracy.
Disclosure of Invention
To this end, the present invention provides a speech processing method, apparatus and system in an attempt to solve or at least alleviate at least one of the problems identified above.
According to an aspect of the present invention, there is provided a speech processing method adapted to be executed in a computing device, the method comprising the steps of: receiving a voice signal from a peripheral device communicatively coupled to the computing device, wherein the peripheral device has analyzed the voice signal and determined that the voice signal contains a wake-up word adapted to change an operational state of the computing device; analyzing the voice signal again to determine whether the voice signal contains the wake-up word; and switching the computing device from a first operating state to a second operating state for processing a subsequent new speech signal upon determining that the speech signal contains the wake-up word, wherein the energy consumption of the first operating state is lower than the energy consumption of the second operating state.
Optionally, the method according to the present invention further comprises the steps of: the operational state of the computing device is not switched when it is determined that the speech signal does not contain a wake-up word.
Optionally, in the method according to the invention, the peripheral device analyzes the speech signal to determine whether it contains the wake-up word with a first accuracy; the computing device analyzes the speech signal to determine whether it contains the wake-up word with a second accuracy; and the first accuracy is lower than the second accuracy.
Optionally, in the method according to the present invention, the peripheral device analyzes the speech signal using a first neural network algorithm, the computing device analyzes the speech signal using a second neural network algorithm, and the first neural network has fewer parameters than the second neural network.
Optionally, in the method according to the present invention, receiving the voice signal from the peripheral device comprises: receiving voice signals from more than one peripheral device; and selecting a voice signal having the greatest sound intensity from the received voice signals as a voice signal to be analyzed.
Optionally, the method according to the invention further comprises the steps of: after the computing device switches to the second operating state, the peripheral device is instructed to receive a new speech signal to send the received new speech signal to the computing device for processing.
Optionally, in the method according to the present invention, the voice signal comprises multiple channels of audio signal, and the step of receiving the voice signal from a peripheral device communicatively connected to the computing device comprises: receiving a voice signal encoded in a predetermined format, the encoded voice signal including a first portion indicating the number of channels of the audio signal, a second portion indicating the length of each channel of the audio signal, and the multiple channels of the audio signal themselves.
Optionally, in the method according to the present invention, the first portion further indicates the number of channels of the reference audio signal; and the encoded speech signal further comprises the multi-channel reference audio signal and a third portion indicating the length of each channel of the reference audio signal.
Optionally, in the method according to the invention, the peripheral device is communicatively connected to the computing device in at least one of the following ways: bluetooth, ZigBee, WIFI and mobile communication; and the peripheral device is a sound pickup device adapted to acquire voice input.
According to another aspect of the present invention, there is provided a speech processing method adapted to be executed in a computing device, the method comprising the steps of: receiving a voice signal; analyzing the voice signal to determine whether the voice signal contains a wake-up word; and upon determining that the speech signal contains a wake word, sending the speech signal to a smart device communicatively coupled to the computing device such that the smart device again analyzes the speech signal and determines that the speech signal contains a wake word suitable for altering an operational state of the smart device.
According to another aspect of the present invention, there is provided a smart device including: a communication unit adapted to communicate with a peripheral device to receive a voice signal from the peripheral device, wherein the peripheral device has analyzed the voice signal and determined that the voice signal contains a wake-up word adapted to change an operating state of the smart device; a voice processing unit adapted to analyze the voice signal to determine whether the voice signal contains the wake-up word; and an operating state switching unit adapted to switch the smart device from a first operating state to a second operating state when the voice processing unit determines that the voice signal contains the wake-up word, so as to process a subsequent new voice signal, wherein the energy consumption of the first operating state is lower than that of the second operating state.
According to yet another aspect of the present invention, there is provided a computing device comprising: a sound pickup unit adapted to acquire voice signals around the computing device; a voice analysis unit adapted to analyze the voice signal to determine whether it contains a wake-up word; and a communication unit adapted to, when the voice analysis unit determines that the voice signal contains the wake-up word, send the voice signal to a smart device communicatively connected to the computing device, so that the smart device analyzes the voice signal again and determines that the voice signal contains a wake-up word adapted to change the operating state of the smart device.
According to yet another aspect of the present invention, there is provided a speech processing system comprising the above-mentioned smart device and the above-mentioned computing device.
According to yet another aspect of the present invention, there is provided a computing device comprising: at least one processor; and a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor, the program instructions comprising instructions for performing any of the methods described above.
According to yet another aspect of the present invention, there is provided a readable storage medium storing program instructions that, when read and executed by a computing device, cause the computing device to perform any of the methods described above.
According to the voice processing scheme provided by the invention, a preliminary wake-up word determination is first performed on a sound pickup device with relatively low processing performance, such as a bracelet or an earphone. When the preliminary determination finds a wake-up word, the relevant voice is sent to a smart device with relatively high processing performance for a secondary wake-up word determination. Only when the secondary determination also finds the wake-up word is the operating state of the smart device changed to an operating state with relatively high energy consumption so as to process subsequent voice. With this scheme, the smart device does not need to continuously run the speech recognition processing for wake-up word determination, so its energy consumption can be reduced.
In addition, according to the speech processing scheme of the present invention, substantially the same neural-network-based speech recognition algorithms, differing only in their number of parameters, can be run on the sound pickup device and the smart device, achieving different accuracy and execution speed on each.
In addition, according to the scheme of the invention, the sound pickup device can be deployed at some distance from the smart device, for example at a position closer to the person, so that the voice signal can be acquired more clearly; this mitigates the problem of the smart device being too far from the person to acquire the voice signal clearly.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.
FIG. 1 shows a schematic diagram of a scenario of a speech processing system 100 according to an embodiment of the invention;
FIG. 2 shows a schematic diagram of a computing device 200, according to one embodiment of the invention;
FIG. 3 shows a flow diagram of a method 300 of speech processing according to one embodiment of the invention;
FIG. 4 shows a flow diagram of a speech processing method 400 according to another embodiment of the invention;
FIG. 5 shows a schematic diagram of a smart device 110 according to another embodiment of the invention; and
fig. 6 shows a schematic diagram of a sound pickup device 120 according to another embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 1 illustrates a scene diagram of a speech processing system 100 according to some embodiments of the invention. As shown in fig. 1, the system 100 includes a smart device 110 and one or more peripheral devices 120. The smart device 110 is, for example, a smart speaker, a smart phone, a smart digital terminal, or another such terminal. It may be deployed in a fixed location or be portable, and is communicatively coupled to the server 130 to provide various services. For example, the smart device 110 may be a smart speaker that receives voice input from the user 140, obtains weather and navigation information from the server 130, and provides it to the user by voice or video. The smart device 110 may also receive a voice input from the user to send a shopping request to the server 130 to implement an online shopping process.
The smart device 110 may present information to the user 140 in a variety of ways. For example, the smart device 110 may be a smart speaker that presents information to the user in an audible manner. The smart device 110 may also be a smart television or smart screen that presents information to the user in an audio-visual manner by presenting an interface on a screen or projection of the smart device 110.
The peripheral device 120 is communicatively coupled to the smart device 110 in various ways, including but not limited to Bluetooth, WiFi, ZigBee, and 4G or 5G mobile communication networks. The present invention is not limited in this respect; all ways of communicating information between the peripheral device 120 and the smart device 110 are within its scope.
The peripheral device 120 is, for example, a sound pickup device such as a bracelet or an earphone. The sound pickup device 120 may acquire external audio information, particularly various kinds of voice information, and send the audio information to the smart device 110, so that the smart device 110 processes the voice information to implement voice interaction.
Some peripheral devices 120 may also have display screens of limited size, which can be used to interact with the smart device 110 (confirming and viewing text messages, viewing short videos, etc.) alongside the voice interaction.
Optionally, the peripheral device 120 may also serve as an output device for the smart device 110. For example, the smart device 110 may output its processing results to the user 140 in an audio or vibration form through the peripheral device 120.
The smart device 110 may have multiple operating states. For example, when the smart device 110 has not interacted with the user for a long time, it may be in a low-power sleep operating state, in which all functions except some necessary ones, such as communication, are inactive, ensuring minimal system power consumption. While the smart device 110 is processing the user's voice signals and interacting with the user, it may be in a normal operating state, in which most of its functions are active and power consumption is relatively high.
According to an embodiment of the present invention, the smart device 110 may be awakened, switching from the sleep operating state to the normal operating state, upon recognizing that the received voice contains a specific wake-up word. The wake-up word may be a predetermined word or sentence, such as the phrases "hello, xxx" or "Hi, xxx".
The smart device 110 may employ various methods to recognize a particular wake word from the voice information. For example, various neural network (DNN, CNN, LSTM, GRU, CRNN, DS-CNN, etc.) based deep learning methods can be used for speech recognition with high accuracy. The present invention is not limited to the specific form of the speech recognition algorithm, and all ways in which speech recognition processing can be performed on speech information to determine whether a wake-up word is included are within the scope of the present invention.
Additionally, it should be noted that the smart device 110 may have more than two operating states, depending on actual needs. The present invention is not limited by the number of operating states of the smart device 110; any manner of providing operating states with different energy consumption before and after waking is within the scope of the present invention.
The sound pickup device 120 acquires external voice information and may perform preliminary processing on it to determine whether it contains a wake-up word. According to one embodiment of the present invention, the sound pickup device 120 may employ various ways to recognize the specific wake-up word in the acquired voice information. For example, various deep learning methods based on neural networks (DNN, CNN, LSTM, GRU, CRNN, DS-CNN, etc.) may be employed for speech recognition. The present invention is not limited to a specific form of speech recognition algorithm; all ways of performing speech recognition on voice information to determine whether a wake-up word is contained are within its scope. It should be noted that, considering that the sound pickup device 120 generally has lower processing performance, a lower-accuracy speech recognition method may be employed; for example, in the case of a neural-network-based deep learning method, a neural network with relatively few parameters and a simpler structure may be used. The accuracy of a speech recognition method so constructed is lower than that of the method employed in the smart device 110.
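For illustration only, the following minimal sketch (assuming PyTorch and 40-dimensional log-mel input features, neither of which the patent specifies) shows how a single network family can be instantiated at two parameter budgets: a small model for the sound pickup device 120 and a larger one for the smart device 110. All names here are hypothetical.

```python
import torch.nn as nn

def make_kws_model(width: int, depth: int) -> nn.Sequential:
    """Tiny CNN wake-word detector over (batch, 40, frames) log-mel input.

    `width` and `depth` set the parameter budget; shrinking them is one
    simple way to fit the model onto a weak wearable processor.
    """
    layers = [nn.Conv1d(40, width, kernel_size=3, padding=1), nn.ReLU()]
    for _ in range(depth - 1):
        layers += [nn.Conv1d(width, width, kernel_size=3, padding=1), nn.ReLU()]
    layers += [nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(width, 1)]
    return nn.Sequential(*layers)  # output: one "wake-up word present" logit

# Same algorithm family, different budgets (cf. the parameter-count claim):
pickup_model = make_kws_model(width=16, depth=2)  # for the pickup device
smart_model = make_kws_model(width=64, depth=6)   # for the smart device

n_params = lambda m: sum(p.numel() for p in m.parameters())
assert n_params(pickup_model) < n_params(smart_model)
```

The smaller model trades accuracy for footprint, matching the claim that the first network has fewer parameters (and, here, also fewer layers) than the second.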
When the sound pickup device 120 determines that the acquired voice information contains a wake-up word, it may transmit the voice information to the smart device 110, where speech recognition is performed again on the voice information to confirm a second time whether the wake-up word is contained. Only after the smart device 110 determines that the voice information contains the wake-up word does the subsequent voice interaction process begin. The interaction process between the sound pickup device 120 and the smart device 110 will be described in detail with reference to fig. 3.
It should be noted that the system 100 shown in fig. 1 is only an example, and those skilled in the art will understand that in practical applications, the system 100 may include a plurality of smart devices 110 and a plurality of sound pickup devices 120, and the present invention does not limit the number of smart devices 110 and sound pickup devices 120 included in the system 100.
According to embodiments of the present invention, the smart device 110 and the sound pickup device 120 may each be implemented by a computing device 200 as described below. FIG. 2 shows a schematic diagram of a computing device 200, according to one embodiment of the invention.
As shown in FIG. 2, in a basic configuration 202, a computing device 200 typically includes a system memory 206 and one or more processors 204. A memory bus 208 may be used for communication between the processor 204 and the system memory 206.
Depending on the desired configuration, the processor 204 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 204 may include one or more levels of cache, such as a level one cache 210 and a level two cache 212, a processor core 214, and registers 216. The example processor core 214 may include an arithmetic logic unit (ALU), a floating point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 218 may be used with the processor 204, or in some implementations the memory controller 218 may be an internal part of the processor 204.
Depending on the desired configuration, system memory 206 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 206 may include an operating system 220, one or more applications 222, and program data 224. In some implementations, the application 222 can be arranged to execute instructions on the operating system with the program data 224 by the one or more processors 204.
Computing device 200 may also include an interface bus 240 that facilitates communication from various interface devices (e.g., output devices 242, peripheral interfaces 244, and communication devices 246) to the basic configuration 202 via the bus/interface controller 230. The example output device 242 includes a graphics processing unit 248 and an audio processing unit 250. They may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more a/V ports 252. Example peripheral interfaces 244 can include a serial interface controller 254 and a parallel interface controller 256, which can be configured to facilitate communications with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 258. An example communication device 246 may include a network controller 260, which may be arranged to facilitate communications with one or more other computing devices 262 over a network communication link via one or more communication ports 264.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or private-wired network, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.
In an embodiment according to the invention, the computing device 200, when implemented as the smart device 110, is configured to perform the speech processing method according to the invention that is executed in the smart device 110. The program data 224 of the computing device 200 contains a plurality of program instructions for executing that method.
Accordingly, when the computing device 200 is implemented as the sound pickup device 120, it performs the voice processing method according to the present invention that is executed in the sound pickup device 120. The program data 224 of the computing device 200 then contains a plurality of program instructions for executing that method.
FIG. 3 illustrates a flow diagram of a method 300 of speech processing according to some embodiments of the invention. The processing method 300 is suitable for execution in the smart device 110 and the sound pickup device 120 of the system 100. It should be noted that although the method shown in fig. 3 requires the smart device 110 and the sound pickup device 120 to cooperate and perform different method steps, this does not mean that the two must be paired: the method steps performed in the smart device 110 and those performed in the sound pickup device 120 may each constitute a separate voice processing method. That is, the smart device 110 may be communicatively connected to any other sound pickup device 120, and the sound pickup device 120 may be communicatively connected to any other smart device 110, all without departing from the scope of the present invention.
As shown in fig. 3, the method 300 begins at step S310. In step S310, the sound pickup apparatus 120 listens to the surroundings, and acquires, for example, a voice signal of the user 140 from the surroundings. For example, when the user 140 is speaking, a voice signal may be captured or received by a sound pickup device (e.g., an ear piece or a bracelet worn by the user) in the vicinity of the user 140.
Subsequently, in step S312, in the sound pickup apparatus 120, the voice signal received in step S310 is analyzed to determine whether the voice signal contains a wake-up word. As described above with reference to fig. 1, the wake-up word is a predetermined set of specific words or phrases that can wake up the smart device 110 to enter a normal operating state. Also as described above with reference to fig. 1, various speech recognition methods may be employed to determine whether a speech signal contains a wake-up word. According to one embodiment, various neural network (DNN, CNN, LSTM, GRU, CRNN, DS-CNN, etc.) based deep learning methods may be employed for speech recognition.
When it is determined in step S312 that the voice signal contains a wake-up word, then in step S314 the sound pickup device 120 sends the voice signal to the smart device 110, so that the smart device 110 analyzes the voice signal again to determine once more whether it contains the wake-up word.
Alternatively, when it is determined in step S312 that the voice signal does not contain the wake-up word, then in step S316 the sound pickup device 120 does not send the voice signal to the smart device 110 but continues to acquire voice signals from its surroundings, returning to step S310 to process newly received voice signals.
Alternatively, before performing step S312, the sound pickup device 120 may first determine whether its battery level is below a predetermined threshold, for example 20%. If the battery is too low, then since voice analysis consumes power, the sound pickup device 120 may skip analyzing the voice signal and instead transmit it directly to the smart device 110 for analysis in step S314, in order to prolong the service life of the sound pickup device 120.
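A minimal sketch of this decision, assuming a hypothetical `pickup` object exposing `battery_level()`, `detect_wake_word()`, and `send_to_smart_device()` (none of these names come from the patent):

```python
LOW_BATTERY_THRESHOLD = 0.20  # the 20% example threshold from the text

def on_voice_signal(pickup, signal) -> None:
    """One pickup-side iteration of steps S312/S314/S316."""
    if pickup.battery_level() < LOW_BATTERY_THRESHOLD:
        # Too little charge to spend on local analysis: forward the signal
        # directly to the smart device for analysis (straight to S314).
        pickup.send_to_smart_device(signal)
    elif pickup.detect_wake_word(signal):    # S312: cheap local model
        pickup.send_to_smart_device(signal)  # S314
    # else: S316 -- drop the signal and keep listening (back to S310)
```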
Accordingly, the smart device 110 receives the voice signal transmitted from the sound pickup device 120 in step S314. This voice signal has been subjected to voice analysis by the sound pickup apparatus 120 and determined to contain a wake word in step S312.
Subsequently, in step S322, in the smart device 110, the voice signal received in step S314 is analyzed to determine again whether the voice signal contains a wake-up word. The smart device 110 may employ various speech recognition methods to determine whether a speech signal contains a wake-up word as described above with reference to fig. 1. According to one embodiment, various neural network (DNN, CNN, LSTM, GRU, CRNN, DS-CNN, etc.) based deep learning methods may be employed for speech recognition.
When the smart device 110 determines again that the voice signal contains the wake-up word in step S322, the operation state of the smart device 110 is switched to switch from the low power consumption state to the high power consumption state in step S324.
As described above with reference to fig. 1, the smart device 110 has a plurality of operational states. Most of the functions are not in operation until the smart device 110 is awakened, and thus the smart device 110 is in a sleep operation state with lower power consumption. After being awakened, the smart device 110 may enter a normal operation state in which most functions of the device start to operate normally, and new inputs such as voice or video of a subsequent user are processed with higher energy consumption.
Accordingly, when the smart device 110 determines in step S322 that the voice signal does not contain the wake-up word, the smart device 110 may remain in the sleep operating state and wait to receive, for processing, another voice signal that the sound pickup device 120 has determined to contain the wake-up word.
It should be noted that the smart device 110 may wake up temporarily after receiving the voice signal and perform a voice recognition method to process the voice signal and return to the sleep state when it is determined that the voice signal does not contain a wake-up word.
Alternatively, the smart device 110 may define another operating state dedicated to running the speech recognition method, with energy consumption between that of the sleep state and the normal operating state. When the smart device 110 receives the voice signal in step S314, it switches from the sleep state to this intermediate operating state; when it determines in step S322 that the voice signal contains the wake-up word, it switches further to the normal operating state in step S324; and when it determines in step S322 that the voice signal does not contain the wake-up word, it switches back to the sleep operating state.
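The state transitions described above can be summarized in a short sketch (one reading of the text, not code from the patent; `contains_wake_word` stands in for the device's larger recognizer):

```python
from enum import Enum, auto

class OpState(Enum):
    SLEEP = auto()      # lowest power; only essentials such as communication
    VERIFYING = auto()  # optional intermediate state: recognizer only
    NORMAL = auto()     # fully awake; highest power

class SmartDevice:
    def __init__(self, recognizer):
        self.state = OpState.SLEEP
        self.recognizer = recognizer  # the larger, more accurate model

    def on_voice_from_pickup(self, signal) -> None:
        """Steps S314 -> S322 -> S324, with the intermediate-state variant."""
        self.state = OpState.VERIFYING            # S314: wake just enough
        if self.recognizer.contains_wake_word(signal):
            self.state = OpState.NORMAL           # S324: full wake-up
        else:
            self.state = OpState.SLEEP            # false alarm: back to sleep
```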
Optionally, in steps S312 and S322 above, the voice signal is analyzed in the sound pickup device 120 and the smart device 110 respectively to determine whether it contains the wake-up word. It should be noted that, considering the processing performance of the two devices and the order in which they analyze the signal, the accuracy with which the sound pickup device 120 determines whether the voice signal contains the wake-up word is lower than the accuracy with which the smart device 110 does so.
According to one embodiment, when both the sound pickup device 120 and the smart device 110 use neural-network-based deep learning methods (DNN, CNN, LSTM, GRU, CRNN, DS-CNN, etc.) for speech recognition, the neural network architecture used in the sound pickup device 120 may be simpler than that used in the smart device 110. For example, the network used in the sound pickup device 120 may have fewer parameters or fewer network layers than the network used in the smart device 110.
Optionally, the sound pickup device 120 encodes the voice signal to be transmitted in a predetermined format. The voice signal comprises multiple channels of audio. The encoded speech signal includes a first portion indicating the number of audio channels, a second portion indicating the length of each audio channel, and the multi-channel audio data itself. In addition, the speech signal may also include a multi-channel reference audio signal; in that case the number of reference channels is also indicated in the first portion, and the encoded speech signal further includes a third portion indicating the length of each reference channel, followed by the multi-channel reference audio data itself.
The following gives a specific format definition of the data message after encoding the speech signal:
[Format definition table, reproduced as Figure BDA0002277759360000121 in the original publication.]
In the above format, the data packet is arranged in the order of each audio channel, that is, the audio picked up by the sound pickup device (Mic audio), followed by the reference audio data. The first byte identifies the number of audio channels, and the following four-byte fields give the length of each Mic audio channel and of each reference audio channel, respectively. The audio data itself follows. Each packet must contain data for all audio channels of the sound pickup device 120.
According to one embodiment, the sound pickup device 120 may acquire the voice signal at a sampling rate of 16 kHz with a bit width of 16 bits. Each time, the sound pickup device 120 may send a voice signal of 3 seconds' duration to the smart device 110 for secondary confirmation.
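Putting the format description together, the following sketch encodes and decodes one such message. The patent's figure fixes the exact layout; here the grouped ordering (counts, then per-channel lengths, then payloads), one count byte per audio type, and 4-byte little-endian length fields are stated assumptions rather than the published definition.

```python
import struct

def encode_packet(mics: list[bytes], refs: list[bytes]) -> bytes:
    """Header: mic-channel count, ref-channel count (1 byte each);
    then one uint32 length per mic channel and per ref channel;
    then all mic payloads followed by all reference payloads."""
    header = struct.pack("<BB", len(mics), len(refs))
    lengths = b"".join(struct.pack("<I", len(c)) for c in mics + refs)
    return header + lengths + b"".join(mics + refs)

def decode_packet(data: bytes) -> tuple[list[bytes], list[bytes]]:
    n_mic, n_ref = struct.unpack_from("<BB", data, 0)
    sizes = struct.unpack_from(f"<{n_mic + n_ref}I", data, 2)
    offset = 2 + 4 * (n_mic + n_ref)
    chunks = []
    for size in sizes:
        chunks.append(data[offset:offset + size])
        offset += size
    return chunks[:n_mic], chunks[n_mic:]

# Two mic channels plus one reference channel, 3 s of 16 kHz 16-bit audio:
pcm = bytes(2 * 3 * 16000)  # silence placeholder, 2 bytes per sample
packet = encode_packet([pcm, pcm], [pcm])
assert decode_packet(packet) == ([pcm, pcm], [pcm])
```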
FIG. 4 shows a flow diagram of a speech processing method 400 according to another embodiment of the invention. Speech processing method 400 is a further embodiment of speech processing method 300 and therefore the same or similar reference numerals are used to indicate the same or similar processing steps as in method 300.
The method 400 differs from the method 300 in that in addition to the sound pick-up apparatus 120a, another sound pick-up apparatus 120b receives voice information from the surrounding environment and makes a determination as to whether the voice information contains a wake-up word. Thus, steps 310a, 312a, 314a, and 316a performed in the pickup apparatus 120a, and steps 310b, 312b, 314b, and 316b performed in the pickup apparatus 120b are the same as steps 310, 312, 314, and 316 in fig. 3.
The method 400 further includes step S420: when the smart device 110 receives the voice signals from the sound pickup devices 120a and 120b in steps 314a and 314b respectively, it needs to select the voice signal from one of them for subsequent processing. According to one embodiment of the present invention, in step S420 the voice signal having the greatest sound intensity is selected from the received voice signals as the voice signal to be analyzed. According to other embodiments, the selection among the received voice signals can be made on other criteria according to actual needs; for example, the voice signal with the best sound quality may be selected. Any way in which a high-quality speech signal can be selected is within the scope of the present invention.
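A sketch of the intensity-based selection in step S420, using RMS energy of the PCM samples as one reasonable proxy for "sound intensity" (the patent does not pin down the measure):

```python
import numpy as np

def pick_loudest(candidates: dict[str, np.ndarray]) -> str:
    """Return the id of the pickup device whose 16-bit PCM signal has
    the greatest RMS energy; that signal is the one analyzed in S322."""
    def rms(x: np.ndarray) -> float:
        return float(np.sqrt(np.mean(np.square(x.astype(np.float64)))))
    return max(candidates, key=lambda dev: rms(candidates[dev]))

# e.g. pick_loudest({"120a": sig_a, "120b": sig_b}) might return "120b"
```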
It should be noted that although two sound pickup devices 120a and 120b are described above in fig. 4, the present invention is not limited by the number of sound pickup devices communicatively connected to the smart device 110.
After the voice signal to be analyzed is determined in step S420, the subsequent processing is performed in step S322. These processes are the same as the corresponding steps in the method 300 described with reference to fig. 3 and are not described in detail.
In addition, after the smart device 110 is switched to the normal operation state for voice interaction with the user at step S324, one of the sound pickup devices is instructed by the smart device 110 to continue receiving new voice signals for voice interaction at step S430.
According to one embodiment, the smart device 110 instructs the sound pickup device 120 whose voice signal was confirmed the second time to contain the wake-up word. For example, as shown in fig. 4, the voice signal selected in step S420 for secondary confirmation came from the sound pickup device 120b; therefore, in step S430 the sound pickup device 120b is instructed to acquire new voice signals and transmit them to the smart device 110 for voice interaction processing.
According to the voice processing scheme of the present invention, a preliminary wake-up word determination is first performed on the sound pickup device 120 with relatively low processing performance, such as a bracelet or an earphone. When the preliminary determination finds a wake-up word, the relevant voice is sent to the smart device 110 with relatively high processing performance for a secondary wake-up word determination, and only when the secondary determination also finds the wake-up word is the operating state of the smart device 110 changed to an operating state with relatively high energy consumption so as to process subsequent voice. With this scheme, the smart device 110 does not need to continuously run the speech recognition processing for wake-up word determination, so its energy consumption can be reduced.
In addition, according to the solution of the present invention, the sound pickup device 120 may be deployed at a position away from the smart device 110, for example closer to the person, so that the voice signal can be acquired more clearly; this reduces the problem of the voice signal not being acquired clearly because the smart device 110 is far from the person.
Fig. 5 shows a schematic diagram of a smart device 110 according to another embodiment of the invention. While fig. 5 illustrates the various components of the smart device 110 as logically divided, it should be noted that such divisions may be further subdivided or recombined depending on the actual physical implementation without departing from the scope of the present invention, and any smart device 110 having the logical components illustrated in fig. 5 is within the scope of the present invention.
As shown in fig. 5, the smart device 110 includes a communication unit 510, a voice processing unit 520, and an operation state switching unit 530.
The communication unit 510 provides communication functions for the smart device 110 and communicates with peripheral devices such as the sound pickup device 120 to receive voice signals from them. As described above, before sending the voice signal to the smart device 110, the sound pickup device 120 has already analyzed it and determined that it contains the wake-up word. The communication unit 510 may be communicatively coupled to the sound pickup device 120 in various ways, including but not limited to Bluetooth, ZigBee, WiFi, and mobile communication. The present invention is not limited in this respect; all ways of communicating information between the peripheral device 120 and the smart device 110 are within its scope.
It should be noted that, according to one embodiment, the voice signals may be encoded in a predetermined format suitable for transmission between the sound pickup device 120 and the smart device 110. The format of the speech signal has been described in detail above and is not repeated here.
The voice processing unit 520 is coupled to the communication unit 510 and analyzes the voice signal received by the communication unit 510 to determine again whether the voice signal contains a wake-up word. As described above, the voice processing unit 520 may recognize a specific wake-up word from voice information in various methods. For example, the speech processing unit 520 may employ various neural network (DNN, CNN, LSTM, GRU, CRNN, DS-CNN, etc.) based deep learning methods for speech signal processing.
When the voice processing unit 520 determines again that the voice signal contains the wakeup word, the operation state switching unit 530 switches the operation state of the smart device 110 to switch from the low power consumption state to the high power consumption state. As described above with reference to fig. 1, the smart device 110 has a plurality of operational states. Most of the functions are not in operation until the smart device 110 is awakened, and thus the smart device 110 is in a sleep operation state with lower power consumption. After being awakened, the smart device 110 may enter a normal operation state in which most functions of the device start to operate normally, and new inputs such as voice or video of a subsequent user are processed with higher energy consumption.
Accordingly, when the voice processing unit 520 determines that the voice signal does not contain the wake-up word, the operating state switching unit 530 does not switch the operating state of the smart device 110; the smart device 110 may remain in the sleep operating state and wait to receive, for processing, another voice signal that the sound pickup device 120 has determined to contain the wake-up word.
It should be noted that the smart device 110 may be temporarily awakened by the operation state switching unit 530 after receiving the voice signal, perform a voice recognition method by the voice processing unit 520 to process the voice signal, and switch back to the sleep state by the operation state switching unit 530 when it is determined that the voice signal does not contain the awakening word.
Alternatively, according to one embodiment of the invention, both the sound pickup device 120 and the voice processing unit 520 analyze the voice signal to determine whether it contains a wake-up word. It should be noted that, considering the processing performance of the sound pickup device 120 and the smart device 110 and the order in which the signal is analyzed, the accuracy with which the sound pickup device 120 determines whether the voice signal contains the wake-up word is lower than the accuracy with which the voice processing unit 520 does so.
According to one embodiment, when the sound pickup device 120 and the voice processing unit 520 both use a neural-network-based deep learning method (DNN, CNN, LSTM, GRU, CRNN, DS-CNN, etc.) for speech recognition, the neural network structure used in the sound pickup device 120 may be simpler than that used in the voice processing unit 520. For example, the network used in the sound pickup device 120 may have fewer parameters or fewer network layers than the network in the voice processing unit 520.
Optionally, according to an embodiment of the present invention, the communication unit 510 may also receive voice signals from more than one peripheral device 120. For example, in the system 100 shown in fig. 1, when the user 140 wears both the headset 120a and the bracelet 120b, the user's voice is received simultaneously by the headset 120a and the bracelet 120b; each preliminarily determines that it contains the wake-up word and sends it to the smart device 110 for reconfirmation.
At this time, the voice processing unit 520 needs to select the voice signal from one of them for subsequent processing. According to an embodiment of the present invention, the voice processing unit 520 selects the voice signal having the greatest sound intensity from the received voice signals as the voice signal to be analyzed. According to other embodiments, the selection among the received voice signals can be made on other criteria according to actual needs; for example, the voice signal with the better sound quality may be selected. Any way in which a high-quality speech signal can be selected is within the scope of the present invention.
In addition, after the operation state switching unit 530 switches the smart device 110 to the normal operation state for voice interaction with the user, the communication unit 510 instructs one of the sound pickup devices 120 to continue receiving new voice signals for voice interaction.
According to one embodiment, the smart device 110 instructs the sound pickup device 120 whose voice signal was confirmed the second time to contain the wake-up word. For example, as described above with reference to fig. 4, if the voice signal from the sound pickup device 120b was selected for secondary confirmation, the communication unit 510 instructs the sound pickup device 120b to acquire new voice signals and transmit them to the smart device 110 for voice interaction processing.
Fig. 6 shows a schematic diagram of a sound pickup device 120 according to another embodiment of the present invention. Fig. 6 illustrates the various components of the sound pickup device 120 in a logically partitioned manner; such partitions may be further subdivided or recombined according to the actual physical implementation without departing from the scope of the present invention, and any sound pickup device 120 having the logical components illustrated in fig. 6 is within the scope of the present invention.
As shown in fig. 6, the sound pickup apparatus 120 includes a sound pickup unit 610, a voice analysis unit 620, and a communication unit 630.
The sound pickup unit 610 listens to the surroundings and acquires, for example, a voice signal of the user 140 from the surroundings. For example, when the user 140 is speaking, a voice signal may be captured or received by a sound pickup device (e.g., an ear piece or a bracelet worn by the user) in the vicinity of the user 140.
The voice analyzing unit 620 is coupled to the pickup unit 610, and analyzes the voice signal received by the pickup unit 610 to determine whether the voice signal contains a wake-up word. As described above with reference to fig. 1, the wake-up word is a predetermined set of specific words or phrases that can wake up the smart device 110 to enter a normal operating state. Also as described above with reference to fig. 1, various speech recognition methods may be employed to determine whether a speech signal contains a wake-up word. According to one embodiment, various neural network (DNN, CNN, LSTM, GRU, CRNN, DS-CNN, etc.) based deep learning methods may be employed for speech recognition.
When the voice analysis unit 620 determines that the voice signal contains a wake-up word, the communication unit 630 transmits the voice signal to the smart device 110, so that the smart device 110 analyzes the voice signal again to confirm a second time whether it contains the wake-up word.
Alternatively, when the voice analysis unit 620 determines that the voice signal does not contain the wake-up word, the communication unit 630 does not transmit the voice signal to the smart device 110, and may continue to acquire the voice signal around the sound pickup device by the sound pickup unit 610 to restart processing of the newly received voice signal.
Additionally, and optionally, the sound pickup device 120 may first determine whether its own battery level is below a predetermined threshold, such as 20%. If the battery is too low, then since voice analysis by the voice analysis unit 620 consumes power, the voice signal may be transmitted directly by the communication unit 630 to the smart device 110 for analysis without local voice analysis, in order to prolong the service life of the sound pickup device 120.
It should be noted that, according to one embodiment, the voice signals may be encoded in a predetermined format suitable for transmission between the sound pickup device 120 and the smart device 110. The format of the speech signal has been described in detail above and is not repeated here.
Alternatively, according to an embodiment of the present invention, both the voice analysis unit 620 and the smart device 110 analyze the voice signal to determine whether it contains the wake-up word. It should be noted that, considering the processing performance of the sound pickup device 120 and the smart device 110 and the order in which the signal is analyzed, the accuracy with which the voice analysis unit 620 determines whether the voice signal contains the wake-up word is lower than the accuracy with which the smart device 110 does so.
According to one embodiment, when the speech analysis unit 620 and the smart device 110 both use a deep learning method based on neural networks (DNN, CNN, LSTM, GRU, CRNN, DS-CNN, etc.) for speech recognition, the neural network structure used in the speech analysis unit 620 is simpler than the neural network structure used in the smart device 110. For example, the number of parameters of the neural network structure adopted in the voice analysis unit 620 is smaller than the number of parameters of the neural network in the smart device 110, or the number of network layers of the neural network structure adopted in the voice analysis unit 620 is smaller than the number of network layers of the neural network in the smart device 110.
Optionally, during the interaction between the sound pickup device 120 and the smart device 110 described above, those sound pickup devices and smart devices that have display interfaces may present information related to the interaction on them, to help the user better understand the interaction process. For example, the interface of the sound pickup device or the smart device may let the user set whether voice signal analysis for wake-up word recognition is performed, set the battery threshold, and the like. As the interaction proceeds, messages such as "wake-up word detected, sending to the smart device 110 for secondary confirmation", "voice from the sound pickup device received, performing wake-up word detection", or "wake-up word detected, switching device operating state" may be displayed. These interaction modes are all within the protection scope of the present invention.
The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as removable hard drives, USB flash drives, floppy disks, CD-ROMs, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to perform the method of the invention according to instructions in said program code stored in the memory.
By way of example, and not limitation, readable media may comprise readable storage media and communication media. Readable storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of readable media.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with the examples of this invention, and the structure required for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and the descriptions of specific languages above are provided to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purpose of carrying out the invention.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the present invention is intended to be illustrative, rather than restrictive, of the scope of the invention, which is defined by the appended claims.

Claims (28)

1. A speech processing method adapted to be executed in a computing device, the method comprising the steps of:
receiving a voice signal from a peripheral device communicatively connected to the computing device, wherein the peripheral device has analyzed the voice signal and determined that the voice signal contains a wake-up word adapted to change an operating state of the computing device;
analyzing the voice signal to determine whether the voice signal contains the wake-up word; and
switching the computing device from a first operating state to a second operating state for processing a subsequent new speech signal upon determining that the speech signal contains the wake-up word, wherein the energy consumption of the first operating state is lower than the energy consumption of the second operating state.
2. The method of claim 1, further comprising the steps of:
not switching the operating state of the computing device when it is determined that the voice signal does not contain the wake-up word.
3. The method of claim 1 or 2, wherein the peripheral device analyzes the voice signal to determine whether it contains the wake-up word with a first accuracy;
the computing device analyzes the voice signal to determine whether it contains the wake-up word with a second accuracy; and
the first accuracy is lower than the second accuracy.
4. The method of claim 3, wherein the peripheral device analyzes the voice signal using a first neural network algorithm and the computing device analyzes the voice signal using a second neural network algorithm, and
the number of parameters in the first neural network is smaller than the number of parameters in the second neural network.
5. The method of any of claims 1-4, wherein receiving the voice signal from the peripheral device comprises:
receiving voice signals from more than one peripheral device; and
selecting, from among the received voice signals, the voice signal having the greatest sound intensity as the voice signal to be analyzed.
6. The method of any one of claims 1-5, further comprising the step of:
after the computing device switches to the second operating state, instructing the peripheral device to receive a new voice signal and send the received new voice signal to the computing device for processing.
7. The method of any of claims 1-6, wherein the voice signal comprises a multi-channel audio signal, and the step of receiving the voice signal from the peripheral device communicatively connected to the computing device comprises:
receiving a voice signal encoded in a predetermined format, the encoded voice signal including a first portion indicating the number of channels of the audio signal, a second portion indicating the length of each channel of the multi-channel audio signal, and the multi-channel audio signal itself.
8. The method of claim 7, wherein the first portion further indicates the number of channels of a reference audio signal; and
the encoded voice signal further includes a multi-channel reference audio signal and a third portion indicating the length of each channel of the multi-channel reference audio signal.
9. A speech processing method adapted to be executed in a computing device, the method comprising the steps of:
receiving a voice signal;
analyzing the voice signal to determine whether the voice signal contains a wake-up word; and
when it is determined that the voice signal contains the wake-up word, sending the voice signal to a smart device communicatively connected to the computing device, so that the smart device analyzes the voice signal again and determines that the voice signal contains a wake-up word adapted to change the operating state of the smart device.
10. The method of claim 9, further comprising the step of:
upon determining that the voice signal does not include the wake-up word, not sending the voice signal to the smart device.
11. The method of claim 9 or 10, wherein the computing device analyzes the voice signal to determine whether it contains the wake-up word with a first accuracy;
the smart device analyzes the voice signal to determine whether it contains the wake-up word with a second accuracy; and
the first accuracy is lower than the second accuracy.
12. The method of claim 11, wherein the computing device analyzes the voice signal using a first neural network algorithm and the smart device analyzes the voice signal using a second neural network algorithm, and
the number of parameters in the first neural network is smaller than the number of parameters in the second neural network.
13. The method according to any of claims 9-12, further comprising the step of:
when the power of the computing device is below a predetermined threshold, skipping the analysis of the voice signal and sending the voice signal to the smart device, so that the smart device determines whether the voice signal contains the wake-up word.
14. The method of any of claims 9-13, wherein the voice signal comprises a multi-channel audio signal, and the step of transmitting the voice signal to the smart device comprises:
transmitting a voice signal encoded in a predetermined format, the encoded voice signal including a first portion indicating the number of channels of the audio signal, a second portion indicating the length of each channel of the multi-channel audio signal, and the multi-channel audio signal itself.
15. The method of claim 14, wherein the first portion further indicates the number of channels of a reference audio signal; and
the encoded voice signal further includes a multi-channel reference audio signal and a third portion indicating the length of each channel of the multi-channel reference audio signal.
16. A smart device, comprising:
a communication unit adapted to communicate with a peripheral device so as to receive a voice signal from the peripheral device, wherein the peripheral device has analyzed the voice signal and determined that the voice signal contains a wake-up word adapted to change an operating state of the smart device;
a voice processing unit adapted to analyze the voice signal to determine whether the voice signal contains the wake-up word; and
an operating state switching unit adapted to switch the smart device from a first operating state to a second operating state for processing a subsequent new voice signal when the voice processing unit determines that the voice signal contains the wake-up word, wherein the energy consumption of the first operating state is lower than that of the second operating state.
17. The smart device of claim 16, wherein the operating state switching unit is further adapted to not switch the operating state of the smart device when the voice processing unit determines that the voice signal does not contain the wake-up word.
18. The smart device of claim 16 or 17, wherein the peripheral device analyzes the voice signal to determine whether it contains the wake-up word with a first accuracy;
the voice processing unit analyzes the voice signal to determine whether it contains the wake-up word with a second accuracy; and
the first accuracy is lower than the second accuracy.
19. The smart device of claim 18, wherein the peripheral device analyzes the voice signal using a first neural network algorithm and the voice processing unit analyzes the voice signal using a second neural network algorithm, and
the number of parameters in the first neural network is smaller than the number of parameters in the second neural network.
20. The smart device of any of claims 16-19, wherein the communication unit is adapted to receive voice signals from more than one peripheral device; and
the voice processing unit is adapted to select, from the received voice signals, the voice signal with the greatest sound intensity as the voice signal to be analyzed.
21. A computing device, comprising:
a sound pickup unit adapted to capture voice signals in the vicinity of the computing device;
a voice analysis unit adapted to analyze a voice signal to determine whether the voice signal contains a wake-up word; and
a communication unit adapted to send the voice signal to a smart device communicatively connected to the computing device when the voice analysis unit determines that the voice signal contains the wake-up word, so that the smart device analyzes the voice signal again and determines that the voice signal contains a wake-up word adapted to change the operating state of the smart device.
22. The computing device of claim 21, wherein the communication unit is further adapted to not send the voice signal to the smart device when the voice analysis unit determines that the voice signal does not contain the wake-up word.
23. The computing device of claim 21 or 22, wherein the voice analysis unit analyzes the voice signal to determine whether it contains the wake-up word with a first accuracy;
the smart device analyzes the voice signal to determine whether it contains the wake-up word with a second accuracy; and
the first accuracy is lower than the second accuracy.
24. The computing device of any of claims 21-23, wherein:
the voice analysis unit is adapted to not analyze the voice signal when the power of the computing device is below a predetermined threshold, and
the communication unit is adapted to send the voice signal to the smart device when the power of the computing device is below the predetermined threshold, so that the smart device determines whether the voice signal contains the wake-up word.
25. A speech processing system, comprising:
the smart device of any one of claims 16-20; and
one or more computing devices as recited in any of claims 21-24, communicatively connected to the smart device.
26. The speech processing system of claim 25, wherein the smart device is a smart speaker and the computing device is a headset or a bracelet.
27. A smart sound box, comprising:
a communication unit adapted to communicate with a peripheral device so as to receive a voice signal from the peripheral device, wherein the peripheral device has analyzed the voice signal and determined that the voice signal contains a wake-up word adapted to change an operating state of the smart sound box;
a voice processing unit adapted to analyze the voice signal to determine whether the voice signal contains the wake-up word; and
an operating state switching unit adapted to switch the smart sound box from a first operating state to a second operating state for processing a subsequent new voice signal when the voice processing unit determines that the voice signal contains the wake-up word, wherein the energy consumption of the first operating state is lower than that of the second operating state.
28. A computing device, comprising:
at least one processor; and
a memory storing program instructions configured for execution by the at least one processor, the program instructions comprising instructions for performing the method of any of claims 1-15.
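As a purely illustrative sketch of the transmission format recited in claims 7-8 and 14-15 above, the following Python function packs a multi-channel voice signal together with optional reference channels; the field widths and little-endian byte order are assumptions made for this example, since the claims leave them open:

import struct

def encode_frame(channels: list, refs: tuple = ()) -> bytes:
    # First portion: the number of voice channels and of reference channels.
    head = struct.pack("<HH", len(channels), len(refs))
    # Second portion: the length of each voice channel.
    lens = struct.pack(f"<{len(channels)}I", *(len(c) for c in channels))
    # Third portion: the length of each reference channel.
    ref_lens = struct.pack(f"<{len(refs)}I", *(len(r) for r in refs))
    # Payload: the voice channels followed by the reference channels.
    return head + lens + ref_lens + b"".join(channels) + b"".join(refs)

# Example: two 2-byte microphone channels plus one 1-byte reference channel.
frame = encode_frame([b"\x01\x02", b"\x03\x04"], refs=(b"\x05",))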
CN201911129042.4A 2019-11-18 2019-11-18 Voice processing method, device and system Pending CN112820283A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911129042.4A CN112820283A (en) 2019-11-18 2019-11-18 Voice processing method, device and system


Publications (1)

Publication Number Publication Date
CN112820283A (en) 2021-05-18

Family

ID=75852583

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911129042.4A Pending CN112820283A (en) 2019-11-18 2019-11-18 Voice processing method, device and system

Country Status (1)

Country Link
CN (1) CN112820283A (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150106085A1 (en) * 2013-10-11 2015-04-16 Apple Inc. Speech recognition wake-up of a handheld portable electronic device
CN107770223A (en) * 2016-08-19 2018-03-06 深圳市轻生活科技有限公司 A kind of intelligent sound monitor system and method
CN106157950A (en) * 2016-09-29 2016-11-23 合肥华凌股份有限公司 Speech control system and awakening method, Rouser and household electrical appliances, coprocessor
CN107277672A (en) * 2017-06-07 2017-10-20 福州瑞芯微电子股份有限公司 A kind of method and apparatus for supporting awakening mode to automatically switch
US20190005954A1 (en) * 2017-06-30 2019-01-03 Baidu Online Network Technology (Beijing) Co., Ltd. Wake-on-voice method, terminal and storage medium
CN107704275A (en) * 2017-09-04 2018-02-16 百度在线网络技术(北京)有限公司 Smart machine awakening method, device, server and smart machine
CN109901698A (en) * 2017-12-08 2019-06-18 深圳市腾讯计算机系统有限公司 A kind of intelligent interactive method, wearable device and terminal and system
CN108320742A (en) * 2018-01-31 2018-07-24 广东美的制冷设备有限公司 Voice interactive method, smart machine and storage medium
CN108335696A (en) * 2018-02-09 2018-07-27 百度在线网络技术(北京)有限公司 Voice awakening method and device
CN108597507A (en) * 2018-03-14 2018-09-28 百度在线网络技术(北京)有限公司 Far field phonetic function implementation method, equipment, system and storage medium
US20190311720A1 (en) * 2018-04-09 2019-10-10 Amazon Technologies, Inc. Device arbitration by multiple speech processing systems
CN109545211A (en) * 2018-12-07 2019-03-29 苏州思必驰信息科技有限公司 Voice interactive method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHENG WANBO; WU YANQING; QIN WEI; ZHANG SHAOHUA; LI YUNBO; QIN HAIMING: "Discussion on Implementing the Sleep and Wake-up Function of a Mine Emergency Rescue Command Communication Device", 煤矿安全 (Safety in Coal Mines), no. 03 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117037790A (en) * 2023-10-10 2023-11-10 朗朗教育科技股份有限公司 AI interaction intelligent screen control system and method
CN117037790B (en) * 2023-10-10 2024-01-09 朗朗教育科技股份有限公司 AI interaction intelligent screen control system and method

Similar Documents

Publication Publication Date Title
KR102354275B1 (en) Speech recognition method and apparatus, and storage medium
US9852731B2 (en) Mechanism and apparatus for seamless voice wake and speaker verification
WO2020253715A1 (en) Voice data processing method, device and system
CN103632165B (en) A kind of method of image procossing, device and terminal device
CN107103906B (en) Method for waking up intelligent device for voice recognition, intelligent device and medium
CN111713141B (en) Bluetooth playing method and electronic equipment
RU2603258C1 (en) Method, device, equipment and system for control of connection and disconnection of a wireless network
CN109192208A (en) A kind of control method of electrical equipment, system, device, equipment and medium
CN111083678B (en) Playing control method and system of Bluetooth sound box and intelligent device
CN108597507A (en) Far field phonetic function implementation method, equipment, system and storage medium
CN111312235A (en) Voice interaction method, device and system
CN112230877A (en) Voice operation method and device, storage medium and electronic equipment
US11310594B2 (en) Portable smart speaker power control
CN110968353A (en) Central processing unit awakening method and device, voice processor and user equipment
CN110853644B (en) Voice wake-up method, device, equipment and storage medium
CN109151669B (en) Earphone control method, earphone control device, electronic equipment and storage medium
CN112820283A (en) Voice processing method, device and system
CN113393838A (en) Voice processing method and device, computer readable storage medium and computer equipment
JP2019185771A (en) Method, device for processing data of bluetooth speaker, and bluetooth speaker
CN109065050A (en) A kind of sound control method, device, equipment and storage medium
CN115086855B (en) Test system, method and related device
CN109511139B (en) WIFI control method and device, mobile device and computer-readable storage medium
CN102750126B (en) Pronunciation inputting method and terminal
CN110191503B (en) Audio playing method, system, storage medium and mobile terminal
CN106714281A (en) Equipment wake-up method, system and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination