CN114220430A - Multi-sound-zone voice interaction method, device, equipment and storage medium - Google Patents

Multi-sound-zone voice interaction method, device, equipment and storage medium

Info

Publication number
CN114220430A
Authority
CN
China
Prior art keywords
voice
processing
sound
recognized
voice signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111521161.1A
Other languages
Chinese (zh)
Inventor
杜春明
王丹
王永乐
徐木水
汪木金
李鹏伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111521161.1A priority Critical patent/CN114220430A/en
Publication of CN114220430A publication Critical patent/CN114220430A/en
Pending legal-status Critical Current

Classifications

    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/05 Word boundary detection
    • G10L 15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L 21/0208 Noise filtering
    • G10L 2015/223 Execution procedure of a spoken command
    • G10L 2021/02082 Noise filtering the noise being echo, reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The disclosure provides a multi-sound-zone voice interaction method, device, equipment and storage medium, and relates to the technical field of artificial intelligence, in particular to natural language processing, speech recognition, and deep learning technology. The method comprises the following steps: receiving a voice signal of at least one to-be-recognized sound zone in a plurality of to-be-recognized sound zones; determining a sound zone identifier of the at least one to-be-recognized sound zone, and processing the voice signal by adopting an audio processing thread corresponding to the sound zone identifier to obtain a processing result; and executing the operation corresponding to the processing result. The multi-sound-zone voice interaction method can recognize the voices of a plurality of sound zones in parallel without mutual interference.

Description

Multi-sound-zone voice interaction method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, in particular to natural language processing, speech recognition, and deep learning techniques, and more particularly to a multi-sound-zone voice interaction method, apparatus, device, and storage medium.
Background
With the rapid development of artificial intelligence technology, voice interaction technology has also advanced rapidly. Existing voice recognition schemes can recognize the voices of multiple users in a voice environment, but can only recognize the voice of one of those users at a time. Traditional voice interaction and recognition solutions are all built around a single target user.
For example, conventional vehicle-mounted voice recognition solutions are built around the driver, with the front microphone aimed at the driver's seat. The voice recognition technology configured in existing vehicle-mounted systems can either recognize voice signals from only a fixed sound zone in the vehicle, or support voice interaction in different in-vehicle sound zones but only one sound zone at a time, which degrades the user's voice interaction experience.
Disclosure of Invention
The disclosure provides a multi-sound-zone voice interaction method, a multi-sound-zone voice interaction apparatus, an electronic device, and a storage medium.
According to a first aspect of the present disclosure, a multi-sound-zone voice interaction method is provided, including: receiving a voice signal of at least one to-be-recognized sound zone in a plurality of to-be-recognized sound zones; determining a sound zone identifier of the at least one to-be-recognized sound zone, and processing the voice signal by adopting an audio processing thread corresponding to the sound zone identifier to obtain a processing result; and executing the operation corresponding to the processing result.
According to a second aspect of the present disclosure, there is provided a multi-sound-zone voice interaction apparatus, comprising: a receiving module configured to receive a voice signal of at least one to-be-recognized sound zone of a plurality of to-be-recognized sound zones; a processing module configured to determine a sound zone identifier of the at least one to-be-recognized sound zone, and to process the voice signal by adopting an audio processing thread corresponding to the sound zone identifier to obtain a processing result; and an execution module configured to execute the operation corresponding to the processing result.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in any one of the implementations of the first aspect.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method as described in any one of the implementations of the first aspect.
According to a fifth aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method as described in any of the implementations of the first aspect.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is an exemplary system architecture diagram in which the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of a multi-sound-zone voice interaction method according to the present disclosure;
FIG. 3 is a flow diagram of another embodiment of a multi-sound-zone voice interaction method according to the present disclosure;
FIG. 4 is a flow diagram of yet another embodiment of a multi-sound-zone voice interaction method according to the present disclosure;
FIG. 5 is a diagram of an application scenario of a multi-sound-zone voice interaction method according to the present disclosure;
FIG. 6 is a schematic block diagram of one embodiment of a multi-sound-zone voice interaction apparatus according to the present disclosure;
FIG. 7 is a block diagram of an electronic device for implementing a multi-sound-zone voice interaction method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of a multi-sound-zone voice interaction method or apparatus of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or transmit information or the like. Various client applications may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they may be installed in the above-described electronic devices and implemented as multiple pieces of software or software modules, or as a single piece of software or software module, which is not specifically limited herein.
The server 105 may provide various services. For example, the server 105 may analyze and process the voice signals acquired from the terminal apparatuses 101, 102, 103 and generate a processing result (e.g., perform an operation corresponding to the processing result).
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module, which is not specifically limited herein.
It should be noted that the multi-sound-zone voice interaction method provided by the embodiments of the present disclosure is generally executed by the server 105; accordingly, the multi-sound-zone voice interaction apparatus is generally disposed in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a multi-sound-zone voice interaction method according to the present disclosure is shown. The multi-sound-zone voice interaction method comprises the following steps:
step 201, receiving a voice signal of at least one to-be-recognized sound zone in a plurality of to-be-recognized sound zones.
In the present embodiment, an executing body (for example, the server 105 shown in fig. 1) of the multi-sound-zone voice interaction method may receive a voice signal of at least one to-be-recognized sound zone in a plurality of to-be-recognized sound zones.
On the one hand, in this embodiment the in-vehicle space is divided into a plurality of subspaces, each of which is a sound zone. For example, using a sound-zone isolation technique, the in-vehicle space may be divided into two sound zones (driver and front passenger) or into four sound zones (driver, front passenger, rear left, and rear right). A microphone device is assigned to each in-vehicle sound zone, so that the microphone device of each sound zone collects the voice signals within that zone without mutual interference.
On the other hand, in this embodiment the area surrounding the outside of the vehicle is also divided into a plurality of sound zones, such as a left rear-view mirror sound zone, a right rear-view mirror sound zone, and a rear (tail) sound zone, and a corresponding microphone device is arranged in each of these zones, for example at the left rear-view mirror, the right rear-view mirror, and the tail of the vehicle, so that the microphone device of each sound zone collects the voice signals within that zone without interfering with the others.
It should be noted that in-vehicle interaction and outside-vehicle interaction are mutually exclusive at any given time: at one moment, either the voice signals collected inside the vehicle or the voice signals collected outside the vehicle are processed, and in-vehicle voice interaction has a higher priority than outside-vehicle voice interaction.
The execution subject receives a voice signal collected by the microphone device of at least one of the sound zones to be recognized.
The number of sound zones in this embodiment may be set according to actual conditions; for example, it may be adjusted flexibly by editing a configuration file.
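The following is an illustrative sketch only (the zone names, field names, and dictionary layout are assumptions made for this description, not part of the disclosed implementation): such a configuration could enumerate the sound zones, bind a microphone device to each zone, and record the in-vehicle/outside-vehicle priority mentioned above, written here in Python.

    # Hypothetical sound-zone configuration; every name and field here is illustrative.
    ZONE_CONFIG = {
        "in_vehicle": {
            "priority": 1,  # in-vehicle interaction outranks outside-vehicle interaction
            "zones": {
                "driver":      {"mic_id": "mic_0"},
                "front_pass":  {"mic_id": "mic_1"},
                "rear_left":   {"mic_id": "mic_2"},
                "rear_right":  {"mic_id": "mic_3"},
            },
        },
        "outside_vehicle": {
            "priority": 0,
            "zones": {
                "left_mirror":  {"mic_id": "mic_4"},
                "right_mirror": {"mic_id": "mic_5"},
                "tail":         {"mic_id": "mic_6"},
            },
        },
    }

    # Microphone identifier -> sound zone identifier; used in step 202 to determine
    # the sound zone identifier of a received voice signal.
    MIC_TO_ZONE = {
        zone["mic_id"]: zone_id
        for group in ZONE_CONFIG.values()
        for zone_id, zone in group["zones"].items()
    }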
Step 202, determining a sound zone identifier of the at least one sound zone to be recognized, and processing the voice signal by adopting an audio processing thread corresponding to the sound zone identifier to obtain a processing result.
In this embodiment, the executing entity first determines the sound zone identifier of each to-be-recognized sound zone among the at least one to-be-recognized sound zone received in step 201. Since the identifiers of the microphone devices of the respective sound zones differ from one another, the executing entity may determine the sound zone identifier of a zone from the identifier of the microphone device that collected the voice signal. The executing entity then processes the voice signal with the audio processing thread corresponding to that sound zone identifier, thereby obtaining the corresponding processing result. That is, in this embodiment each sound zone corresponds to one audio processing thread, the audio processing thread of each sound zone processes the voice signals of that zone, and the audio processing threads of the plurality of sound zones run in parallel without interfering with one another.
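A minimal sketch of this per-zone threading model follows, assuming one worker thread and one input queue per sound zone; the class and function names (ZoneWorker, process_audio, execute_operation) are placeholders invented for illustration rather than names used by the disclosure.

    import queue
    import threading
    import time

    def execute_operation(result):
        """Placeholder for step 203: execute the operation corresponding to the processing result."""
        print("executing:", result)

    class ZoneWorker:
        """One audio processing thread per sound zone; workers for different zones run in parallel."""

        def __init__(self, zone_id, process_audio):
            self.zone_id = zone_id
            self.process_audio = process_audio   # e.g. denoise -> endpoint detection -> recognition
            self.inbox = queue.Queue()
            threading.Thread(target=self._run, daemon=True).start()

        def submit(self, voice_signal):
            self.inbox.put(voice_signal)         # signals from other zones are never blocked here

        def _run(self):
            while True:
                signal = self.inbox.get()
                execute_operation(self.process_audio(self.zone_id, signal))

    # One worker per sound zone; the zone identifier is derived from the microphone identifier.
    workers = {z: ZoneWorker(z, lambda zid, sig: f"{zid}: {len(sig)} samples processed")
               for z in ("driver", "front_pass", "rear_left", "rear_right")}
    workers["driver"].submit(b"\x00" * 3200)
    workers["rear_left"].submit(b"\x00" * 1600)
    time.sleep(0.2)                              # let the daemon threads drain their queues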
It should be noted that, in this embodiment, the executing entity may process the collected voice signal not only online but also offline, or even with online and offline processing running in parallel, selected according to actual conditions.
For example, the executing entity may first denoise the acquired voice signal to eliminate background noise and the interfering speech of other people; the denoising may be performed, for example, with digital signal processing (DSP) hard denoising, Baidu soft denoising, or third-party hard denoising, so as to remove noise from the voice signal and improve the accuracy of voice recognition. The executing entity then performs endpoint detection on the denoised voice signal to obtain the detected voice information, and finally recognizes the detected voice information to obtain the final recognition result.
And step 203, executing the operation corresponding to the processing result.
In this embodiment, the executing entity may execute the operation corresponding to the processing result obtained in step 202. For example, if the processing result obtained in step 202 is that the user in the front passenger sound zone said "heat the seat", the executing entity heats the front passenger seat; for another example, if the processing result is that the user in the rear-left sound zone said "play music", the executing entity turns on the multimedia device at the rear-left position and plays music for that user.
The multi-sound-zone voice interaction method provided by this embodiment of the disclosure first receives a voice signal of at least one to-be-recognized sound zone among a plurality of to-be-recognized sound zones; it then determines the sound zone identifier of the at least one to-be-recognized sound zone and processes the voice signal with the audio processing thread corresponding to that identifier to obtain a processing result; finally, it executes the operation corresponding to the processing result. In this method, the sound collection device of each sound zone collects the voice signal of that zone, the sound zone identifier corresponding to the voice signal is determined, and the voice signal is then processed by the audio processing thread corresponding to that identifier. The voice signals of multiple sound zones are thereby processed in parallel, achieving simultaneous voice interaction across multiple sound zones; and because the processing of the individual sound zones does not interfere with one another, the accuracy of multi-sound-zone voice interaction is ensured and the user's multi-sound-zone voice interaction experience is improved.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
With continued reference to FIG. 3, FIG. 3 illustrates a flow 300 of another embodiment of a multi-sound-zone voice interaction method according to the present disclosure. The multi-sound-zone voice interaction method comprises the following steps:
step 301, receiving a voice signal of at least one to-be-recognized sound zone in a plurality of to-be-recognized sound zones.
Step 302, determining the sound zone identification of at least one sound zone to be identified.
Steps 301-302 are substantially the same as the receiving and sound zone determination operations in steps 201-202 of the embodiment shown in fig. 2, and are not described in detail again here.
Step 303, performing echo denoising processing on the voice signal by adopting the audio processing thread corresponding to the sound zone identifier to obtain a denoised voice signal to be recognized.
In this embodiment, the executing entity (for example, the server 105 shown in fig. 1) of the multi-sound-zone voice interaction method may perform echo denoising processing on the voice signal by using the audio processing thread corresponding to the sound zone identifier, thereby obtaining the denoised voice signal to be recognized. Each sound zone has a corresponding noise cancellation module. Because human speech contains gaps and pauses between words, and background sounds (for example, audio from the loudspeakers or the speech of other people in the vehicle) are mixed into it, the executing entity first denoises the acquired voice signal using the noise cancellation module of the audio processing thread corresponding to the sound zone identifier, for example with DSP hard denoising, Baidu soft denoising, or third-party hard denoising, thereby removing noise from the voice signal and improving the accuracy of voice recognition. Of course, other denoising methods may also be used, which is not specifically limited in this embodiment.
For example, the executing entity may first estimate the delay between the reference (loudspeaker) signal and the voice signal collected by the microphone device, then estimate the linear echo component in the collected voice signal from the reference signal and subtract it from the microphone signal to obtain a residual signal, and finally suppress the residual echo in the residual signal through nonlinear processing, thereby obtaining the denoised voice signal to be recognized.
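A simplified, single-block sketch of that echo-cancellation sequence (delay estimation, linear echo estimation and subtraction, residual suppression) is given below. The frame length, filter order, and suppression rule are assumptions chosen for illustration, not the actual parameters of this disclosure.

    import numpy as np

    def cancel_echo(mic, ref, max_delay=1600, taps=64, frame=160):
        """Rough acoustic echo cancellation on one block of audio (float arrays of equal length)."""
        # 1. Estimate the delay between the reference (loudspeaker) signal and the microphone signal.
        corr = np.correlate(mic, ref, mode="full")
        delay = int(np.argmax(corr) - (len(ref) - 1))
        delay = max(0, min(delay, max_delay))
        aligned = np.concatenate([np.zeros(delay), ref])[:len(mic)]

        # 2. Estimate the linear echo component with a least-squares FIR fit and subtract it.
        X = np.stack([np.concatenate([np.zeros(k), aligned[:len(mic) - k]]) for k in range(taps)], axis=1)
        w, *_ = np.linalg.lstsq(X, mic, rcond=None)
        residual = mic - X @ w

        # 3. Suppress residual (nonlinear) echo, e.g. attenuate frames still dominated by the reference.
        for start in range(0, len(residual) - frame, frame):
            r = aligned[start:start + frame]
            e = residual[start:start + frame]
            if np.dot(r, r) > 10.0 * np.dot(e, e):
                residual[start:start + frame] *= 0.1
        return residual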
And 304, performing endpoint detection on the voice signal to be recognized by adopting the audio processing thread corresponding to the sound zone identifier to obtain first voice information.
In this embodiment, the execution main body may perform endpoint detection on the voice signal to be recognized (i.e., the voice signal subjected to noise cancellation) by using the audio processing thread corresponding to the sound zone identifier, so as to obtain the first voice information. The end point Detection (VAD) is to distinguish the Voice from the non-Voice area, that is, the end point Detection is to accurately locate the start point and the end point of the Voice from the Voice with noise, remove the mute part and the noise part, and thus obtain the Voice containing the real effective content. Specifically, the executing entity performs framing processing on the voice signal to be recognized, extracts features from each frame of data, trains a classifier on a data frame set of a known voice and silence signal region, classifies unknown framed data, determines whether the unknown framed data belongs to the voice signal or the silence signal, and finally removes the silence signal, thereby obtaining the first voice information.
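A toy version of this frame-level endpoint detection follows. The frame length, the two features (log energy and zero-crossing rate), and the nearest-centroid classifier are stand-ins chosen for illustration; the disclosure does not specify a particular classifier.

    import numpy as np

    def frame_features(signal, frame=160):
        """Frame the signal and extract simple per-frame features: log energy and zero-crossing rate."""
        n = len(signal) // frame
        frames = signal[:n * frame].reshape(n, frame)
        log_energy = np.log(np.sum(frames ** 2, axis=1) + 1e-9)
        zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
        return np.stack([log_energy, zcr], axis=1)

    class EndpointDetector:
        """Classifier fitted on frames from known speech and known silence regions."""

        def fit(self, speech_signal, silence_signal):
            self.speech_centroid = frame_features(speech_signal).mean(axis=0)
            self.silence_centroid = frame_features(silence_signal).mean(axis=0)
            return self

        def detect(self, signal, frame=160):
            feats = frame_features(signal, frame)
            is_speech = (np.linalg.norm(feats - self.speech_centroid, axis=1)
                         < np.linalg.norm(feats - self.silence_centroid, axis=1))
            # Drop the silence frames; what remains is the "first voice information".
            frames = signal[:len(is_speech) * frame].reshape(-1, frame)
            return frames[is_speech].reshape(-1)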
Step 305, determining harmonic features of the first speech information.
In this embodiment, the executing entity may determine the harmonic features of the first speech information obtained in step 304. The execution main body can analyze the characteristics of the first voice information to obtain harmonic characteristics of the first voice information, wherein the harmonic characteristics are waveform characteristics capable of reflecting the time dimension and the frequency dimension of the voice to be recognized.
And step 306, obtaining a recognition result corresponding to the first voice information based on the harmonic features and the pre-trained voice recognition model.
In this embodiment, the executing entity may obtain a recognition result corresponding to the first speech information based on the harmonic features determined in step 305 and a pre-trained speech recognition model. The execution subject may use the harmonic feature as input data of a pre-trained speech recognition model to obtain a recognition result corresponding to the first speech information output by the pre-trained speech recognition model. The speech recognition result can correspond to different speech recognition functions and output corresponding results. The recognition result is obtained based on the harmonic feature of the first voice information and the pre-trained voice recognition model, and the accuracy of the obtained recognition result can be improved.
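A sketch of this step is shown below under two explicit assumptions: a magnitude spectrogram stands in for the harmonic feature (it reflects both the time and frequency dimensions of the speech), and the pre-trained speech recognition model is represented only by a generic predict interface, since the disclosure does not name a concrete model.

    import numpy as np

    def harmonic_features(first_voice_info, frame=400, hop=160):
        """Stand-in harmonic feature: windowed magnitude spectrogram, shape (time, frequency)."""
        window = np.hanning(frame)
        frames = [first_voice_info[i:i + frame] * window
                  for i in range(0, len(first_voice_info) - frame, hop)]
        return np.abs(np.fft.rfft(np.stack(frames), axis=1))

    def recognize(first_voice_info, speech_recognition_model):
        """Feed the harmonic feature into a pre-trained speech recognition model (interface assumed)."""
        feats = harmonic_features(first_voice_info)
        return speech_recognition_model.predict(feats)   # e.g. the recognized text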
Step 307, determining whether the recognition result contains a preset wake-up word.
In this embodiment, the executing entity may determine whether the recognition result includes a preset wake-up word. The preset wake-up word may be one fixed by the voice interaction program, such as "Xiaodu" ("small degree"), or it may be defined by the user. After obtaining the recognition result, the executing entity determines whether it contains the preset wake-up word.
Step 308, in response to determining that the recognition result contains the preset wake-up word, starting the voice interaction program corresponding to the wake-up word.
In this embodiment, when it determines that the recognition result contains the preset wake-up word, the executing entity starts the voice interaction program corresponding to that wake-up word, thereby waking the program and carrying out voice interaction. For example, if the executing entity determines that the recognition result "Xiaodu Xiaodu" contains the preset wake-up word "Xiaodu", it starts the Xiaodu intelligent assistant corresponding to that wake-up word and performs voice interaction.
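A minimal sketch of this wake-up check is given below; the wake-up word value and the start_program callback are illustrative placeholders, not names from the disclosure.

    def handle_wake_up(recognition_result, wake_word="Xiaodu", start_program=None):
        """Start the voice interaction program only when the recognition result contains the wake-up word."""
        if wake_word in recognition_result:
            if start_program is not None:
                start_program(wake_word)      # e.g. launch the assistant bound to this wake-up word
            return True
        return False

    handle_wake_up("Xiaodu Xiaodu", start_program=lambda w: print(f"assistant for '{w}' started"))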
As can be seen from fig. 3, compared with the embodiment corresponding to fig. 2, the multi-sound-zone voice interaction method in this embodiment highlights the processing of the voice signal: it accurately locates the start point and end point of the speech within the noisy signal and removes the silence and noise, thereby obtaining speech containing genuinely useful content; it then obtains a recognition result based on the harmonic features of the first voice information and a pre-trained speech recognition model, which improves the accuracy of the recognition result; finally, when the recognition result contains the preset wake-up word, it starts the voice interaction program corresponding to that wake-up word. An efficient and intelligent multi-sound-zone voice interaction process is thus realized, improving both the user's voice interaction experience and the accuracy of multi-sound-zone voice interaction.
With continued reference to FIG. 4, FIG. 4 illustrates a flow 400 of yet another embodiment of a multi-sound-zone voice interaction method according to the present disclosure. The multi-sound-zone voice interaction method comprises the following steps:
step 401, receiving a voice signal of at least one to-be-recognized sound zone in a plurality of to-be-recognized sound zones.
Step 402, determining a sound zone identification of at least one sound zone to be identified.
Step 403, performing echo denoising processing on the voice signal by using the audio processing thread corresponding to the sound zone identifier to obtain a denoised voice signal to be recognized.
Step 404, performing endpoint detection on the voice signal to be recognized by using the audio processing thread corresponding to the sound zone identifier to obtain first voice information.
Step 405, determining harmonic features of the first voice information.
Step 406, obtaining a recognition result corresponding to the first voice information based on the harmonic features and the pre-trained speech recognition model.
Steps 401-406 are substantially the same as steps 301-306 of the embodiment shown in fig. 3 and are not described in detail again here.
Step 407, extracting keywords in the recognition result.
In this embodiment, an executing subject (e.g., the server 105 shown in fig. 1) of the multi-sound-zone voice interaction method may extract a keyword in the recognition result, and may perform a word segmentation process on the recognition result, for example, so as to determine a plurality of keywords in the recognition result based on the word segmentation result.
Step 408, in response to the keyword successfully matching a preset operation instruction, executing the operation corresponding to the successfully matched instruction.
In this embodiment, when the extracted keyword is successfully matched against a preset operation instruction, the executing entity executes the operation corresponding to that instruction. The executing entity may be configured in advance with operation instructions and the operations corresponding to them; for example, an operation instruction may be "heat the seat" or "play music". After obtaining the keyword, the executing entity matches it against the pre-configured operation instructions and, if the match succeeds, executes the operation corresponding to the matched instruction. Through these steps, multi-sound-zone voice interaction can be achieved without a wake-up word.
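A sketch of this wake-free matching follows; the instruction table, the keyword list, and the zone argument are illustrative assumptions rather than the actual configuration of this disclosure.

    # Hypothetical table of preset operation instructions and the operations they trigger.
    OPERATIONS = {
        "heat the seat": lambda zone: print(f"heating the seat in zone {zone}"),
        "play music":    lambda zone: print(f"playing music for zone {zone}"),
    }

    def execute_by_keywords(keywords, zone_id):
        """Match extracted keywords against the preset instructions; run the operation on a successful match."""
        for keyword in keywords:
            operation = OPERATIONS.get(keyword)
            if operation is not None:
                operation(zone_id)            # no wake-up word is required on this path
                return True
        return False

    execute_by_keywords(["play music"], "rear_left")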
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 3, the multi-sound-zone voice interaction method in this embodiment highlights the step of executing an operation based on the keywords in the recognition result: after obtaining the keywords, the executing entity matches them against the pre-configured operation instructions and, if the match succeeds, executes the operation corresponding to the matched instruction. In this way, multi-sound-zone voice interaction is achieved without requiring a wake-up word, which further improves the effect of multi-sound-zone voice interaction.
With continued reference to fig. 5, fig. 5 shows an application scenario of the multi-sound-zone voice interaction method of the present disclosure. The vehicle 501 in this scenario contains four sound zones: driver, front passenger, rear left, and rear right. The microphone device of each in-vehicle sound zone collects the voice signal 502 within that zone and sends it to the server 503. After receiving the voice signals of the multiple to-be-recognized sound zones, the server 503 processes them in parallel: it determines the sound zone identifier of at least one to-be-recognized sound zone, processes the collected voice signal with the audio processing thread corresponding to that identifier to obtain a processing result 504, and controls the vehicle to execute the operation corresponding to the processing result 504.
With further reference to fig. 6, as an implementation of the method shown in the above figures, the present disclosure provides an embodiment of a multi-sound-zone voice interaction apparatus. This apparatus embodiment corresponds to the method embodiment shown in fig. 2, and the apparatus can be applied in various electronic devices.
As shown in fig. 6, the multi-sound-zone voice interaction apparatus 600 of this embodiment includes: a receiving module 601, a processing module 602, and an execution module 603. The receiving module 601 is configured to receive a voice signal of at least one to-be-recognized sound zone in a plurality of to-be-recognized sound zones; the processing module 602 is configured to determine a sound zone identifier of the at least one to-be-recognized sound zone and to process the voice signal with the audio processing thread corresponding to the sound zone identifier to obtain a processing result; and the execution module 603 is configured to execute the operation corresponding to the processing result.
In this embodiment, for the specific processing of the receiving module 601, the processing module 602, and the execution module 603 of the multi-sound-zone voice interaction apparatus 600 and the technical effects thereof, reference may be made to the related descriptions of steps 201 to 203 in the embodiment corresponding to fig. 2, which are not repeated here.
In some optional implementations of this embodiment, the processing module includes: a noise cancellation sub-module configured to perform echo denoising processing on the voice signal by adopting the audio processing thread corresponding to the sound zone identifier to obtain a denoised voice signal to be recognized.
In some optional implementations of this embodiment, the processing module further includes: the detection submodule is configured to perform endpoint detection on the voice signal to be recognized by adopting an audio processing thread corresponding to the sound zone identifier to obtain first voice information; a first determining sub-module configured to determine harmonic features of the first speech information; and the recognition submodule is configured to obtain a recognition result corresponding to the first voice information based on the harmonic features and the pre-trained voice recognition model.
In some optional implementations of this embodiment, the execution module includes: a second determining submodule configured to determine whether the recognition result includes a preset wake-up word; and the starting submodule is configured to respond to the fact that the recognition result contains the preset awakening word and start the voice interaction program corresponding to the awakening word.
In some optional implementations of this embodiment, the execution module further includes: an extraction sub-module configured to extract keywords in the recognition result; and the execution sub-module is configured to respond to the fact that the matching of the keyword and the preset operation instruction is successful, and execute the operation corresponding to the successfully matched operation instruction.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM)702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the various methods and processes described above, such as the multi-sound-zone voice interaction method. For example, in some embodiments, the multi-sound-zone voice interaction method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the multi-sound-zone voice interaction method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured by any other suitable means (e.g., by means of firmware) to perform the multi-sound-zone voice interaction method.
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (13)

1. A multi-zone voice interaction method comprises the following steps:
receiving a voice signal of at least one to-be-recognized sound zone in a plurality of to-be-recognized sound zones;
determining a sound zone identifier of the at least one sound zone to be recognized, and processing the voice signal by adopting an audio processing thread corresponding to the sound zone identifier to obtain a processing result;
and executing the operation corresponding to the processing result.
2. The method according to claim 1, wherein the processing the voice signal by using the audio processing thread corresponding to the sound zone identifier to obtain a processing result comprises:
and performing echo denoising processing on the voice signal by adopting an audio processing thread corresponding to the sound zone identifier to obtain a denoised voice signal to be recognized.
3. The method according to claim 2, wherein the processing the voice signal by using the audio processing thread corresponding to the sound zone identifier to obtain a processing result further comprises:
performing endpoint detection on the voice signal to be recognized by adopting an audio processing thread corresponding to the sound zone identifier to obtain first voice information;
determining harmonic features of the first speech information;
and obtaining a recognition result corresponding to the first voice information based on the harmonic features and a pre-trained voice recognition model.
4. The method of claim 3, wherein the performing the operation corresponding to the processing result comprises:
determining whether the recognition result contains a preset awakening word or not;
and responding to the fact that the recognition result contains a preset awakening word, and starting a voice interaction program corresponding to the awakening word.
5. The method of claim 3, wherein the executing the operation corresponding to the processing result further comprises:
extracting key words in the identification result;
and responding to the keyword and the preset operation instruction which are successfully matched, and executing the operation corresponding to the successfully matched operation instruction.
6. A multi-sound-zone voice interaction apparatus, comprising:
a receiving module configured to receive a voice signal of at least one to-be-recognized sound zone of a plurality of to-be-recognized sound zones;
the processing module is configured to determine a sound zone identifier of the at least one to-be-recognized sound zone, and process the voice signal by adopting an audio processing thread corresponding to the sound zone identifier to obtain a processing result;
and the execution module is configured to execute the operation corresponding to the processing result.
7. The apparatus of claim 6, wherein the processing module comprises:
and the noise elimination sub-module is configured to perform echo noise elimination on the voice signal by adopting the audio processing thread corresponding to the sound zone identifier to obtain a voice signal to be recognized after the noise elimination.
8. The apparatus of claim 7, wherein the processing module further comprises:
the detection submodule is configured to perform endpoint detection on the voice signal to be recognized by adopting an audio processing thread corresponding to the sound zone identifier to obtain first voice information;
a first determining sub-module configured to determine harmonic features of the first speech information;
and the recognition submodule is configured to obtain a recognition result corresponding to the first voice information based on the harmonic features and a pre-trained voice recognition model.
9. The apparatus of claim 8, wherein the means for performing comprises:
a second determining submodule configured to determine whether a preset wake-up word is included in the recognition result;
and the starting sub-module is configured to respond to the fact that the recognition result contains the preset awakening word and start the voice interaction program corresponding to the awakening word.
10. The apparatus of claim 8, wherein the means for performing further comprises:
an extraction sub-module configured to extract keywords in the recognition result;
and the execution sub-module is configured to respond to the fact that the keyword is successfully matched with the preset operation instruction, and execute the operation corresponding to the successfully matched operation instruction.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-5.
13. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-5.
CN202111521161.1A 2021-12-13 2021-12-13 Multi-sound-zone voice interaction method, device, equipment and storage medium Pending CN114220430A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111521161.1A CN114220430A (en) 2021-12-13 2021-12-13 Multi-sound-zone voice interaction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111521161.1A CN114220430A (en) 2021-12-13 2021-12-13 Multi-sound-zone voice interaction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114220430A true CN114220430A (en) 2022-03-22

Family

ID=80701444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111521161.1A Pending CN114220430A (en) 2021-12-13 2021-12-13 Multi-sound-zone voice interaction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114220430A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114678026A (en) * 2022-05-27 2022-06-28 广州小鹏汽车科技有限公司 Voice interaction method, vehicle terminal, vehicle and storage medium
CN114678026B (en) * 2022-05-27 2022-10-14 广州小鹏汽车科技有限公司 Voice interaction method, vehicle terminal, vehicle and storage medium

Similar Documents

Publication Publication Date Title
US10373609B2 (en) Voice recognition method and apparatus
JP2014142627A (en) Voice identification method and device
CN113674742B (en) Man-machine interaction method, device, equipment and storage medium
CN113674746B (en) Man-machine interaction method, device, equipment and storage medium
CN111883135A (en) Voice transcription method and device and electronic equipment
CN108962226B (en) Method and apparatus for detecting end point of voice
CN113658586A (en) Training method of voice recognition model, voice interaction method and device
CN114220430A (en) Multi-sound-zone voice interaction method, device, equipment and storage medium
CN113889091A (en) Voice recognition method and device, computer readable storage medium and electronic equipment
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
CN112466327A (en) Voice processing method and device and electronic equipment
US20230186933A1 (en) Voice noise reduction method, electronic device, non-transitory computer-readable storage medium
CN113763968B (en) Method, apparatus, device, medium, and product for recognizing speech
CN114969195B (en) Dialogue content mining method and dialogue content evaluation model generation method
CN113782005B (en) Speech recognition method and device, storage medium and electronic equipment
CN112735432B (en) Audio identification method, device, electronic equipment and storage medium
CN113920987A (en) Voice recognition method, device, equipment and storage medium
JP2022028670A (en) Method, apparatus, electronic device, computer readable storage medium and computer program for determining displayed recognized text
CN114429766A (en) Method, device and equipment for adjusting playing volume and storage medium
CN114333912A (en) Voice activation detection method and device, electronic equipment and storage medium
CN115312042A (en) Method, apparatus, device and storage medium for processing audio
CN114399992A (en) Voice instruction response method, device and storage medium
CN113889073A (en) Voice processing method, device, electronic equipment and storage medium
Arcos et al. Ideal neighbourhood mask for speech enhancement
KR20230020508A (en) Remove text echo

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination