CN114598963A - Voice processing method and device, computer readable storage medium and electronic equipment - Google Patents


Info

Publication number
CN114598963A
CN114598963A (application CN202210325881.9A)
Authority
CN
China
Prior art keywords
microphone array
voice
microphone
wake
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210325881.9A
Other languages
Chinese (zh)
Inventor
程光伟
牛建伟
余凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Horizon Robotics Technology Research and Development Co Ltd
Original Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Horizon Robotics Technology Research and Development Co Ltd filed Critical Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority to CN202210325881.9A priority Critical patent/CN114598963A/en
Publication of CN114598963A publication Critical patent/CN114598963A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00Details of transducers, loudspeakers or microphones
    • H04R1/20Arrangements for obtaining desired frequency or directional characteristics
    • H04R1/32Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
    • H04R1/40Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers
    • H04R1/406Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers microphones

Landscapes

  • Health & Medical Sciences (AREA)
  • Otolaryngology (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The embodiments of the disclosure provide a voice processing method and apparatus, a voice recognition system based on a movable microphone array, a computer-readable storage medium, and an electronic device. The method includes: determining the number and positions of people in a target space; determining, based on that number and those positions, a target position to which the microphone array is to be adjusted; controlling a target microphone in the microphone array to rotate toward the target position; and extracting a voice signal from the audio signal collected by the microphone array for voice processing. Because the microphone positions are adjustable, audio signals from more occupant positions in the target space can be collected with a small number of microphones while voice signals are still recognized accurately, which reduces the cost of deploying the microphone array, improves the flexibility of voice interaction using the microphone array, and expands its application scenarios.

Description

Voice processing method and device, computer readable storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for speech processing, a speech recognition system based on an active microphone array, a computer-readable storage medium, and an electronic device.
Background
Currently, Artificial Intelligence (AI) speech recognition mainly uses multi-microphone arrays for speech pickup and sound-source localization. Classified by microphone count, such arrays include 2-microphone, 4-microphone, and 6-microphone arrays, among others. The more microphones an array has, the wider the angle over which speech can be picked up and the better the noise reduction. However, more microphones also raise hardware cost, occupy more space, and increase installation and data-processing difficulty. How to achieve better speech recognition with fewer microphones is therefore a problem to be solved.
Disclosure of Invention
The embodiment of the disclosure provides a voice processing method and device, a voice recognition system based on a movable microphone array, a computer readable storage medium and an electronic device.
An embodiment of the present disclosure provides a voice processing method, including: determining the number and positions of people in a target space; determining, based on the number and positions of the people, a target position to which the microphone array is to be adjusted; controlling a target microphone in the microphone array to rotate toward the target position; and extracting a voice signal from the audio signal collected by the microphone array for voice processing.
According to another aspect of the embodiments of the present disclosure, there is provided a voice processing apparatus, including: a first determination module for determining the number and positions of people in the target space; a second determination module for determining, based on the number and positions of the people, the target position to which the microphone array is to be adjusted; a control module for controlling a target microphone in the microphone array to rotate toward the target position; and an extraction module for extracting a voice signal from the audio signal collected by the microphone array for voice processing.
According to another aspect of the embodiments of the present disclosure, there is provided a speech recognition system based on a movable microphone array, including a microphone array and a controller. The microphone array is disposed on a movable device, and the movable device is configured to move the microphones of the microphone array under the control of the controller; the controller is configured to execute the voice processing method described above.
According to another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the above-described method of voice processing.
According to another aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing processor-executable instructions; and the processor is used for reading the executable instruction from the memory and executing the instruction to realize the voice processing method.
In the voice processing method and apparatus, the speech recognition system based on a movable microphone array, the computer-readable storage medium, and the electronic device provided by the embodiments of the present disclosure, the number and positions of people in the target space are determined; based on them, the target position to which the microphone array is to be adjusted is determined; the target microphone in the microphone array is then controlled to rotate toward the target position; and finally a voice signal is extracted from the audio signal collected by the microphone array for voice processing. Because the microphone positions are adjustable, audio signals from more occupant positions in the target space can be collected with a small number of microphones while voice signals are still recognized accurately. This reduces the cost of deploying the microphone array, improves the flexibility of voice interaction using the microphone array, and expands its application scenarios.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in more detail embodiments of the present disclosure with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 is a system diagram to which the present disclosure is applicable.
Fig. 2 is a flowchart illustrating a method for processing speech according to an exemplary embodiment of the disclosure.
Fig. 3 is a flowchart illustrating a method for processing speech according to another exemplary embodiment of the present disclosure.
Fig. 4A, 4B, and 4C are exemplary schematic diagrams of different rotational positions of a microphone array provided by an embodiment of the disclosure.
Fig. 5 is a flowchart illustrating a method for processing speech according to another exemplary embodiment of the present disclosure.
Fig. 6 is a flowchart illustrating a method for processing speech according to another exemplary embodiment of the present disclosure.
Fig. 7 is a flowchart illustrating a method for processing speech according to another exemplary embodiment of the present disclosure.
Fig. 8 is a flowchart illustrating a method for processing speech according to another exemplary embodiment of the present disclosure.
Fig. 9 is a schematic structural diagram of an apparatus for speech processing according to an exemplary embodiment of the present disclosure.
Fig. 10 is a schematic structural diagram of an apparatus for speech processing according to another exemplary embodiment of the present disclosure.
Fig. 11 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Hereinafter, example embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the present disclosure and not all embodiments of the present disclosure, and that the present disclosure is not limited by the example embodiments described herein.
It should be noted that: the relative arrangement of parts and steps, numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those of skill in the art that the terms "first," "second," and the like in the embodiments of the present disclosure are used merely to distinguish one element from another, carry no particular technical meaning, and imply no necessary logical order between the elements.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more and "at least one" may refer to one, two or more.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.
In addition, the term "and/or" in the present disclosure merely describes an association between objects and indicates that three relationships are possible; for example, "A and/or B" can mean: A alone, both A and B, or B alone. The character "/" in the present disclosure generally indicates an "or" relationship between the objects before and after it.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual scale relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be discussed further in subsequent figures.
The disclosed embodiments may be applied to electronic devices such as terminal devices, computer systems, and servers, which may act as a controller and operate with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with such electronic devices include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computer systems, and distributed cloud computing environments that include any of the above systems.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Summary of the application
In order to reduce the space a microphone array occupies and the difficulty of data processing, arrays with fewer microphones are now widely used. For example, the centralized dual-microphone array is widely applied in vehicle-mounted speech recognition because of its low hardware cost.
However, current arrays with few microphones are installed at fixed positions, and their speech recognition accuracy is low in scenes with many people or a large sound-pickup area. The embodiments of the disclosure aim to broaden the applicable range of microphone arrays with few microphones by making the array movable, thereby improving speech recognition accuracy in scenes with many people and a large sound-pickup area.
Exemplary System
Fig. 1 illustrates an exemplary system architecture 100 of a method of speech processing or an apparatus of speech processing to which embodiments of the present disclosure may be applied.
As shown in fig. 1, a system architecture 100 may include a controller 101 and a microphone array 102 located within a target space, wherein the microphone array 102 is disposed on a movable device 103, and the movable device 103 is configured to move the microphones of the microphone array 102 under the control of the controller 101. The microphone array 102 may include any number of microphones; in general, to save installation cost and reduce the complexity of speech separation, the microphone array 102 may consist of two microphones.
The target space may be any type of space, such as an interior space of a vehicle, an interior space of a room, and the like.
The movable device 103 may take various forms; for example, it may include a guide rail along which the microphones move. The movable device 103 may be placed at any desired location within the target space; for example, it may be located at a central position on the vehicle roof so as to capture the voices of more people in the vehicle.
The controller 101 may be disposed inside the target space, as shown in the figure, or outside it. The controller 101 may be of various types; for example, it may include a terminal device such as an in-vehicle terminal, a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), or a PMP (portable multimedia player), or a fixed terminal such as a digital TV or desktop computer. Various applications may be installed on the terminal device, such as audio/video players, map applications, search applications, web browsers, shopping applications, and instant messaging tools. The terminal device can receive signals from devices such as cameras and pressure sensors, identify the number of people in the target space, and acquire and recognize the audio signals collected by the microphone array 102.
The controller 101 may further include a remote server, which may determine the number and positions of people according to information sent by terminal devices in the target space, and may also receive and recognize, over the network, the audio signals sent by the microphone array 102.
It should be noted that the voice processing method provided by the embodiments of the present disclosure is executed by the controller 101, and accordingly, a voice processing apparatus may be disposed in the controller 101.
In some alternative implementations, as shown in fig. 1, the movable device 103 may include a rotatable surface on which the microphone array 102 is disposed. The controller 101 may rotate this surface, and with it the microphone array 102, so that the microphones' sound-collection direction points at the target person's position. With the microphone array 102 on a rotatable surface, rotating the array requires only a few components such as a motor and a turntable, so the movable device 103 is simpler in structure, easier and cheaper to install, and easier to control.
In some optional implementations, at least one wake-up key is disposed within the target space, each wake-up key of the at least one wake-up key corresponding to at least one person location.
For example, when the target space is the interior of a vehicle, a wake-up key may be disposed near each seat; or each rear seat may have its own wake-up key; or the rear seats may share a single wake-up key.
When the wake-up button is pressed, the controller 101 controls the voice interaction device to enter a wake-up state from a standby state. In the wake-up state, the controller 101 may obtain a voice signal of a person at a person position corresponding to the wake-up key, and recognize the voice signal, thereby implementing voice interaction between the person and the controller 101.
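The wake-key behavior described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the key identifiers, seat names, and controller interface are all assumptions invented for the example.

```python
# Hypothetical sketch of wake-key handling: pressing a key moves the
# controller from standby to the wake state and records which seat(s)
# the key corresponds to, so their voice signal can then be processed.
# All identifiers here are illustrative assumptions.

WAKE_KEY_TO_SEATS = {
    "key_driver": ["driver_seat"],
    "key_passenger": ["passenger_seat"],
    "key_rear_shared": ["rear_left", "rear_right"],  # one key shared by the rear row
}

class WakeController:
    def __init__(self):
        self.state = "standby"
        self.active_seats = []

    def on_wake_key(self, key_id):
        """Enter the wake state and record the seat(s) tied to this key."""
        self.state = "awake"
        self.active_seats = WAKE_KEY_TO_SEATS[key_id]
        return self.active_seats
```

A shared rear-row key, as in the example above, trades per-seat precision for fewer physical buttons; speech separation would still be needed to tell the two rear occupants apart.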
In this embodiment, placing wake-up keys in the target space avoids the misrecognition that voice-based wake-up can cause, makes the wake-up operation more precise, and allows the wake-up mode for each seat to be configured more flexibly in a voice interaction scenario.
Exemplary method
Fig. 2 is a flowchart illustrating a method of speech processing according to an exemplary embodiment of the present disclosure. This embodiment can be applied to the controller 101 shown in fig. 1, and as shown in fig. 2, the method includes the following steps:
Step 201, determine the number and positions of people in the target space.
In this embodiment, the controller 101 may determine the number and positions of people in the target space in various ways. As an example, a camera may be installed in the target space, and the number and positions of people may be determined by recognizing people in the images it captures. As another example, a distance sensor or a pressure sensor may be mounted on each seat in the target space; the controller 101 can then determine whether each seat is occupied from the measured seat-to-person distance or from the pressure signal, and thereby determine the number and positions of people.
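The pressure-sensor variant of step 201 can be sketched as a simple threshold test. The threshold value and seat names below are assumptions made for the example, not values from the patent.

```python
# Illustrative sketch of step 201 using per-seat pressure sensors: any
# seat whose reading exceeds a threshold is treated as occupied.

PRESSURE_THRESHOLD_N = 50.0  # newtons; an assumed cutoff for "occupied"

def detect_occupants(pressure_by_seat):
    """Return (number of occupants, list of occupied seats)."""
    occupied = [seat for seat, p in sorted(pressure_by_seat.items())
                if p >= PRESSURE_THRESHOLD_N]
    return len(occupied), occupied
```

A camera-based variant would replace the threshold test with person detection on the image, but yield the same (count, positions) output consumed by step 202.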
Step 202, determine, based on the number and positions of the people, the target position to which the microphone array is to be adjusted.
In this embodiment, the controller 101 may determine, based on the number and positions of the people, the target position to which the microphone array 102 is to be adjusted. As an example, as shown in fig. 1, the microphone array 102 is disposed on the movable device 103, and the controller 101 may determine the target position from a preset correspondence between the number of people, the positions of the people, and the position to which the microphone array 102 should be adjusted.
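The preset correspondence of step 202 can be modeled as a lookup table from the set of occupied seats to a target rotation angle. The angles and seat names below are illustrative assumptions keyed loosely to the seat pairs of Figs. 4A-4C, not values from the patent.

```python
# Minimal sketch of the preset correspondence in step 202: a table
# mapping occupied-seat sets to the movable device's target angle.
# Angles and seat names are illustrative assumptions.

TARGET_ANGLE_DEG = {
    frozenset({"driver_seat"}): 0,
    frozenset({"driver_seat", "passenger_seat"}): 0,   # cf. Fig. 4A
    frozenset({"driver_seat", "rear_left"}): 45,       # cf. Fig. 4B
    frozenset({"driver_seat", "rear_right"}): 90,      # cf. Fig. 4C
}

def target_position(occupied_seats):
    """Look up the preset target angle; default to facing the driver."""
    return TARGET_ANGLE_DEG.get(frozenset(occupied_seats), 0)
```

Using `frozenset` keys makes the lookup order-independent, which fits a correspondence defined over unordered groups of seats.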
Step 203, control the target microphone in the microphone array to rotate toward the target position.
In this embodiment, the controller 101 may control the target microphone in the microphone array 102 to rotate toward the target position. The target microphone may be a pre-designated microphone or one determined according to the number and positions of the people.
For example, with the microphone array 102 on a rotatable surface as in fig. 1, when there is one person, the target microphone may be the microphone closest to that person, and the controller 101 controls the movable device 103 to rotate so that the target microphone points at the person. As another example, when there are two or more people and the target space is a vehicle interior, the target microphone may be microphone 1022 of fig. 1, with microphone 1021 dedicated to picking up the person in the driver's seat and microphone 1022 dedicated to the passenger seats; the controller 101 may control the movable device 103 to rotate so that microphone 1022 points at the passenger's position.
Alternatively, the target microphone may be controlled to rotate independently, or the target microphone may be rotated while the other microphones are driven to rotate together.
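Step 203 ultimately reduces to commanding a rotation from the current angle to the target angle. One reasonable choice, sketched below under the assumption of a free-spinning turntable (the patent does not specify the motor interface), is to rotate along the shorter arc:

```python
# Hedged sketch of step 203: compute the signed rotation that turns the
# turntable from its current angle to the target angle along the shorter
# arc. The turntable/motor interface is a hypothetical assumption.

def rotation_command(current_deg, target_deg):
    """Return the signed rotation in degrees (positive = counterclockwise)."""
    delta = (target_deg - current_deg) % 360.0
    if delta > 180.0:
        delta -= 360.0
    return delta
```

If the device's wiring or rail limits rotation, the shorter-arc choice would instead be clamped to the allowed range.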
Step 204, extract a voice signal from the audio signal collected by the microphone array for voice processing.
In this embodiment, the controller 101 may extract a voice signal from the audio signal collected by the microphone array 102 for voice processing.
Generally, the audio signal collected by the microphone array 102 contains noise; the sound picked up by each microphone may come from multiple occupant positions, and multiple microphones may pick up the same person simultaneously. The controller 101 may therefore apply existing audio noise reduction techniques to the collected signal, and may use existing speech separation techniques to separate the multi-channel sound, obtain the voice signal, and determine the occupant position to which it corresponds.
The extracted speech signal may be further processed, for example, by speech recognition, or by wake-up recognition, etc.
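As a greatly simplified stand-in for the extraction in step 204, the sketch below gates audio frames by short-time energy, keeping only frames loud enough to plausibly contain speech. The patent delegates real noise reduction and speech separation to existing techniques; the frame size and threshold here are assumptions for illustration only.

```python
# Illustrative short-time-energy gate: split the signal into frames and
# keep those whose mean energy exceeds a threshold. Real systems would
# apply proper noise reduction and speech separation at this point.

def extract_speech_frames(samples, frame=4, energy_threshold=0.1):
    """Return the frames whose mean energy exceeds the threshold."""
    frames = [samples[i:i + frame] for i in range(0, len(samples), frame)]
    return [f for f in frames
            if sum(x * x for x in f) / len(f) > energy_threshold]
```

Energy gating alone cannot attribute speech to an occupant position; that attribution is what the multi-channel speech separation mentioned above provides.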
In the method provided by this embodiment of the disclosure, the number and positions of people in the target space are determined; based on them, the target position to which the microphone array is to be adjusted is determined; the target microphone in the microphone array is then controlled to rotate toward the target position; and finally a voice signal is extracted from the audio signal collected by the microphone array for voice processing.
In some alternative implementations, as shown in fig. 3, the step 202 may include the following sub-steps:
step 2021, determining the target position adjusted by the microphone array according to the position of the person based on the number of the persons being less than or equal to the preset number.
As an example, the preset number may be 2. When the number of persons is 1, the target position is even the target microphone is directed to the position of the person. The target microphone may be determined based on the person's location. For example, when the target space is a vehicle interior space, if the person is located at the driving position, the microphone 1021 shown in fig. 1 is a target microphone, the controller 101 may further control the movable device 103 to rotate to a target position, so that the microphone 1021 faces the driving position; if the person is located at the passenger space and the microphone 1022 shown in fig. 1 is a target microphone, the controller 101 may further control the movable device 103 to rotate to the target position, such that the microphone 1022 is directed to the passenger space where the person is located.
When the number of persons is 2, the target position can be determined based on the positions of the two persons by bringing the respective microphones as close as possible to the positions of the two persons. For example, as shown in fig. 4A, when the target space is a car interior space, if there are people in the driver seat 401 and the passenger seat 402, the target position adjusted by the microphone array 102 may be a position as shown in fig. 4A, in which the microphone 1021 is directed to the driver seat 401 and the microphone 1022 is directed to the passenger seat 402. Similarly, if there are people in the driving places 401 and 403, the target positions adjusted by the microphone array 102 are shown in fig. 4B; if there are people in the driver's seat 401 and the driver's seat 404, or there are people in the driver's seat 401 and the driver's seat 405, the target position adjusted by the microphone array 102 is as shown in fig. 4C.
It should be noted that the adjusted position of the microphone array 102 may be a preset fixed position (e.g., one of the fixed rotation angles shown in figs. 4A-4C), or it may be adjusted in real time according to a person's real-time position, which may be determined through image recognition.
In this embodiment, when the number of people is at most the preset number, the target position is determined from the people's positions, so that the microphones of the array are as close as possible to the occupants of the target space. This improves the quality of the collected audio signal, allows the correspondence between voice signals and occupant positions to be determined accurately during speech separation, and improves the accuracy of voice processing.
In some alternative implementations, as shown in fig. 3, the step 202 may include the following sub-steps:
step 2022, based on the number of the persons being greater than the preset number, responding to the waking operation of the persons in the target space, and determining the target position adjusted by the microphone array according to the position of the person who wakes up the operation.
As an example, in the scenario as shown in fig. 4A to 4C, when the number of people is greater than 2, the controller 101 may monitor the wakeup operation performed by the person in the vehicle in real time and determine the position of the person performing the wakeup operation. For example, when a person in the vehicle utters a wake-up voice (e.g., "hello"), the controller 101 may extract the voice signal from the audio signal collected by the microphone array 102 and recognize that the voice signal is a wake-up voice, wake up a voice interaction device (which may be included in the controller 101 or may be a separate device) in the vehicle, and determine the location of the person who utters the wake-up voice based on a voice separation operation on the audio signal.
Each seat in the vehicle may have a preset correspondence with the microphones of the microphone array 102. For example, when the person in the driver's seat 401 of fig. 4A performs a wake-up operation, the microphone array 102 rotates to the target position shown in fig. 4A, and microphone 1021 continues to capture that person's voice. When the person in the passenger seat 402 of fig. 4A performs a wake-up operation, the microphone array 102 also rotates to the target position of fig. 4A, and microphone 1022 continues to capture that person's voice. Similarly, when the person in seat 403 of fig. 4B performs a wake-up operation, the array rotates to the position of fig. 4B and microphone 1022 continues to pick up that person; when the person in seat 404 or 405 of fig. 4C performs a wake-up operation, the array rotates to the position of fig. 4C and microphone 1022 continues to pick up that person.
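The seat-to-microphone correspondence above can be sketched as a small configuration table. The seat names, angles, and microphone IDs are illustrative assumptions keyed loosely to Figs. 4A-4C; the patent fixes none of these values.

```python
# Sketch of the wake-up-driven adjustment: each seat maps to a preset
# target angle for the array and to the microphone that keeps picking
# up that seat afterwards. All values are illustrative assumptions.

SEAT_CONFIG = {
    "driver_seat":    {"angle": 0,  "mic": "mic_1021"},  # cf. Fig. 4A
    "passenger_seat": {"angle": 0,  "mic": "mic_1022"},  # cf. Fig. 4A
    "rear_left":      {"angle": 45, "mic": "mic_1022"},  # cf. Fig. 4B
    "rear_right":     {"angle": 90, "mic": "mic_1022"},  # cf. Fig. 4C
}

def handle_wake(seat):
    """Return (target angle, microphone) for the seat that woke the system."""
    cfg = SEAT_CONFIG[seat]
    return cfg["angle"], cfg["mic"]
```

Note how the driver and passenger seats share one target angle: a single rotation can serve both, with the microphone assignment deciding which occupant each channel tracks.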
In this embodiment, when the number of people in the target space exceeds the preset number, the target position of the microphone array is determined from the position of the person who performed the wake-up operation, so that this person's voice is collected in a targeted manner even when the space is crowded. This improves the quality of the collected audio signal in scenes with many people and thus the accuracy of voice processing.
In some alternative implementations, the type of wake operation includes at least one of: voice awakening, key awakening and touch screen awakening.
Voice wake-up means performing operations such as voice separation on the acquired audio signal, extracting the voice signal, and identifying whether it is a wake-up voice; if so, the voice interaction device is woken up. Key wake-up and touch-screen wake-up mean that a person in the target space wakes up the voice interaction device by pressing a wake-up key or tapping a wake-up icon on a touch screen.
This embodiment provides multiple wake-up modes, improving the flexibility of the wake-up operation for people in the target space. Because key wake-up and touch-screen wake-up locate the person who performs the wake-up operation more accurately, combining the multiple wake-up modes further reduces the probability of wake-up misrecognition and improves the accuracy of voice interaction.
In some alternative implementations, as shown in fig. 5, the step 2022 described above may include any one of the following two sub-steps:
in response to determining that the type of the wake-up operation is key wake-up or touch-screen wake-up, a target position of the microphone array adjustment is determined according to the position of the wake-up operation, step 20221.
The positions of the wake-up keys and of the wake-up icons on the touch screen correspond to person positions. For example, when the target space is a vehicle interior, a wake-up key or a touch screen bearing a wake-up icon may be provided near each seat, or only near certain seats. For instance, a wake-up key or a touch screen may be arranged near each of the left and right rear seats, or a touch screen may be arranged only in the front row, displaying wake-up icons corresponding to the respective wake-up positions. The controller 101 may determine the location where the wake-up operation was performed from the position of the pressed wake-up key or of the tapped wake-up icon on the touch screen.
The controller 101 may then determine the target position for the microphone array 102 adjustment based on the location of the wake-up operation. For example, in the scenario shown in fig. 4A-4C, a button or a touch screen is respectively disposed near the driving seat 401 and the passenger seat 402, and when a person in the driving seat 401 or the passenger seat 402 performs a wake-up operation, the microphone array 102 is adjusted to the target position shown in fig. 4A. When a person at the passenger seat 403 makes a wake-up operation, the microphone array 102 is adjusted to a target position as shown in fig. 4B. When a person at the passenger location 404 or 405 makes a wake-up operation, the microphone array 102 adjusts to the target position as shown in fig. 4C.
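The two-step resolution just described (wake source to person position, then person position to target position) can be sketched as follows. The key and icon identifiers are assumptions made up for the example, not values from the patent.

```python
# Hypothetical mapping from a physical wake source (a key press or a
# touch-screen icon tap) to the seat it is associated with.
WAKE_SOURCE_TO_SEAT = {
    ("key",  "btn_driver"):    "driver_401",
    ("key",  "btn_rear_left"): "rear_left_403",
    ("icon", "icon_rear_mid"): "rear_mid_405",
}

# Hypothetical mapping from the seat to the array's target angle,
# mirroring the figs. 4A-4C scenario described above.
SEAT_TO_TARGET_ANGLE = {
    "driver_401": 0, "passenger_402": 0,
    "rear_left_403": 90, "rear_right_404": 180, "rear_mid_405": 180,
}

def target_angle_for_wake(source_type: str, source_id: str) -> int:
    """Resolve the array target position from the pressed key / tapped icon."""
    seat = WAKE_SOURCE_TO_SEAT[(source_type, source_id)]
    return SEAT_TO_TARGET_ANGLE[seat]
```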
In response to determining that the type of the wake-up operation is voice wake-up and the location of the wake-up operation is located in the first area within the target space, a target location of the microphone array adjustment is determined according to the location of the wake-up operation, step 20222.
The first region may be a predetermined region. For example, in the interior space scenario shown in fig. 4A-4C, the first region may be a front row region of the vehicle, i.e., a region including a driver seat 401 and a passenger seat 402. The controller 101 may perform operations such as voice separation on the audio signal when performing wake-up recognition on the acquired audio signal, so as to determine a position where a person who utters the wake-up voice is located as a position of the wake-up operation. Then, the target position adjusted by the microphone array 102 is determined according to the position of the wake-up operation. For example, when the driver seat 401 or the passenger seat 402 as shown in fig. 4A is a location where a voice wake-up is made, the microphone array 102 is adjusted to a target location as shown in fig. 4A.
It should be noted that step 20222 is only performed when the voice wakes up and the location of the wake-up operation is in the first area. For example, the personnel in the passenger seats 403, 404, and 405 shown in fig. 4A to 4C can only perform the wake-up operation through the keys or the touch screen, so as to avoid the wake-up misrecognition caused by voice wake-up.
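The gating rule of steps 20221 and 20222 can be summarized in a small predicate: key and touch-screen wake-ups are honored from any position, while voice wake-ups are honored only from the first area. The seat names and the membership of the first area are assumptions for illustration.

```python
# Assumed first area: the front row of the vehicle (fig. 4A scenario).
FIRST_AREA = {"driver_401", "passenger_402"}

def accept_wake(wake_type: str, seat: str) -> bool:
    """Return True if this wake-up operation should adjust the array.

    Key / touch-screen wake-ups are accepted from any seat; voice
    wake-ups are accepted only from inside the first area, to avoid
    wake-up misrecognition from the rear seats.
    """
    if wake_type in ("key", "touch"):
        return True
    if wake_type == "voice":
        return seat in FIRST_AREA
    return False
```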
In the method provided by this embodiment, key wake-up and touch-screen wake-up avoid wake-up misrecognition, while voice wake-up greatly improves convenience. Therefore, any wake-up mode may be used in the first area, while voice wake-up is restricted outside it, balancing the convenience and the accuracy of the wake-up operation.
In some alternative implementations, as shown in fig. 6, step 204 may include the following sub-steps:
step 2041, performing voice separation processing based on the audio signal to obtain a voice signal.
The voice separation processing may employ existing techniques; through it, the voice signal can be extracted from the audio signal.
Step 2042, determine the microphone that collects the voice signal.
The method of determining which microphone collected the speech signal may also be an existing technique. For example, several microphones may each pick up the voice of the same person; the controller 101 may compare the volume at which each microphone collected the same voice signal and take the microphone with the largest volume as the microphone that collected the voice signal of step 2041. Alternatively, existing multi-channel adaptive filtering methods may be used to determine the microphone from which the speech signal was acquired.
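The volume-comparison approach can be sketched by comparing per-channel RMS energy of the separated voice across microphones; this is a minimal illustration, with RMS standing in for whatever volume measure an implementation actually uses.

```python
import math

def rms(samples) -> float:
    """Root-mean-square energy of one channel's samples, used here as
    a stand-in for perceived volume."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def loudest_microphone(channels: dict) -> str:
    """Given {microphone_id: samples} holding the same separated voice
    as captured by each microphone, return the microphone that captured
    it at the highest volume."""
    return max(channels, key=lambda mic: rms(channels[mic]))
```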
Step 2043, in a state where the voice interaction device has been awakened, performing voice recognition on the separated voice signal in response to determining that the voice signal is collected from the microphone corresponding to the awakening operation.
The microphone corresponding to the wake-up operation is the microphone corresponding to the position of the person who performs the wake-up operation. For example, the microphone corresponding to the wake-up operation is the microphone 1022 shown in fig. 1, and the currently extracted voice signal is also collected from the microphone 1022, which indicates that the person who performs the wake-up operation needs to perform further voice interaction with the voice interaction device, and then performs voice recognition on the voice signal.
Step 2044, in the state that the voice interaction device has been woken up, in response to determining that the voice signal is not collected from the microphone corresponding to the wake-up operation, performing wake-up recognition on the voice signal.
For example, suppose the microphone corresponding to the wake-up operation is the microphone 1022 shown in fig. 1, indicating that the person who performed the wake-up operation intends further voice interaction with the voice interaction device, but the currently extracted voice signal was collected by the microphone 1021. This indicates that another person is also trying to interact. To keep that person's voice from interfering with the interaction of the person who performed the wake-up operation, wake-up recognition is performed on the other person's voice. If wake-up recognition fails, the other person is prohibited from voice interaction. If it succeeds, the person who has just performed the wake-up operation is allowed to interact (i.e., the voice signal collected by the corresponding microphone is recognized), and the person who previously performed the wake-up operation is no longer allowed to interact (i.e., the voice signal collected by that person's microphone is no longer recognized).
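The arbitration in steps 2043 and 2044 can be sketched as a tiny session object: speech from the active microphone goes to speech recognition, speech from any other microphone must first pass wake-up recognition, and a successful wake-up hands the session over. The class name, routing labels, and the externally supplied `is_wake_word` flag are assumptions for illustration.

```python
class InteractionSession:
    """Minimal sketch of routing voice signals after wake-up.

    `active_mic` is the microphone corresponding to the most recent
    wake-up operation; `is_wake_word` represents the result of wake-up
    recognition performed by some external recognizer.
    """

    def __init__(self, active_mic: str):
        self.active_mic = active_mic

    def route(self, mic: str, is_wake_word: bool) -> str:
        if mic == self.active_mic:
            return "speech_recognition"        # same speaker: recognize speech
        if is_wake_word:
            self.active_mic = mic              # hand the session to the new speaker
            return "speech_recognition"        # the previous speaker is now muted
        return "discard"                       # other speakers' non-wake speech is ignored
```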
In this embodiment, the microphone that collected the voice signal is determined; in the woken-up state, it is checked whether this microphone matches the microphone corresponding to the wake-up operation, and speech recognition or wake-up recognition is performed according to the result. Wake-up recognition is thus repeated whenever the person currently speaking is not the person who performed the wake-up operation, which effectively reduces interference from other people's voices on the person currently interacting, lowers the probability of speech misrecognition, and further improves the accuracy of voice interaction.
In some alternative implementations, as shown in fig. 7, after step 201, the method further includes:
based on the number of persons and the position of the persons, a preset initial position of the microphone array is determined 205.
Here, the initial position may be set arbitrarily, and for example, the initial position may be a position of the microphone array 102 as shown in fig. 4A.
Alternatively, when the controller 101 determines that the voice interaction is finished, the microphone array 102 may be controlled to rotate back to the initial position. The end of the voice interaction can be determined by recognizing a voice instruction (for example, "i have finished speaking"), pressing a key, clicking a screen, and the like, or the end of the voice interaction can be determined when it is determined that no voice signal is collected within a preset time period.
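The end-of-interaction decision described here combines an explicit cue (a recognized end phrase, a key press, a screen tap) with a silence timeout. A minimal sketch, in which the 8-second default timeout is purely an assumption:

```python
def interaction_ended(last_voice_time: float, now: float,
                      timeout_s: float = 8.0,
                      explicit_end: bool = False) -> bool:
    """Decide whether the array should rotate back to its initial position.

    `explicit_end` stands for any recognized end cue (voice instruction,
    key press, screen tap); otherwise the interaction ends once no voice
    signal has been collected for `timeout_s` seconds.
    """
    return explicit_end or (now - last_voice_time) >= timeout_s
```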
In the embodiment, the initial position of the microphone array is set, so that the microphone array can preferentially collect the voice information of the target person (for example, a person in a driving position in a vehicle) under most conditions, the use of the microphone array is more matched with an actual use scene, and the accuracy of recognizing the voice of the target person is improved in a targeted manner.
In some alternative implementations, as shown in fig. 8, before step 204, the method further comprises:
step 206, enhancing the audio signal collected by the microphone array based on a first beam width corresponding to a first microphone in the microphone array and a second beam width corresponding to a second microphone in the microphone array.
The first beam width is smaller than the second beam width. Beam width is a concept commonly used in the field of sound collection, and adjusting it can be implemented with existing techniques. The first microphone and the second microphone may be designated in advance. For example, in the scenario shown in fig. 4A-4C, the microphone 1021 is the first microphone and the microphone 1022 is the second microphone. The first microphone is mainly used for collecting the voice of the person in the driver's seat 401, and the second microphone is mainly used for collecting the voice of the people in the passenger seats.
Further, step 204 includes:
step 2041, extracting a speech signal from the enhanced audio signal for speech processing.
By setting the first beam to a narrower beam (for example, using a super-directive beamforming (SDB) algorithm), this method improves the directivity of the first microphone during sound collection, suppresses sound arriving from other directions, and improves the quality with which the first microphone collects the sound emitted from a specific person position.
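To illustrate the underlying idea (not the SDB algorithm itself), here is a minimal delay-and-sum beamformer: each channel is time-shifted so that sound from the look direction is aligned across channels, then the channels are averaged. Coherent speech from the look direction is reinforced, while off-axis sound is partially cancelled; the integer sample delays are assumed to have been computed from the array geometry.

```python
def delay_and_sum(channels, delays):
    """Minimal delay-and-sum beamformer sketch.

    channels: list of per-microphone sample lists
    delays:   per-channel integer sample delays steering the beam
    Returns the averaged, time-aligned signal.
    """
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    return [sum(ch[d + i] for ch, d in zip(channels, delays)) / len(channels)
            for i in range(n)]
```

In this toy form, a narrower beam would correspond to using more microphones (a longer `channels` list) or frequency-dependent weighting, which is what a super-directive design optimizes.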
Exemplary devices
Fig. 9 is a schematic structural diagram of an apparatus for speech processing according to an exemplary embodiment of the present disclosure. This embodiment can be applied to an electronic device, and as shown in fig. 9, the speech processing apparatus includes: a first determining module 901, configured to determine the number of people and the positions of people in the target space; a second determining module 902, configured to determine an adjusted target position of the microphone array based on the number of people and the position of the people; a control module 903, configured to control a direction in which a target microphone in the microphone array rotates to a target position; and an extracting module 904, configured to extract a voice signal from the audio signal collected by the microphone array for voice processing.
In this embodiment, the first determining module 901 may determine the number of persons and the positions of persons in the target space in various ways. As an example, a camera may be provided within the target space, and the number of persons and the positions of persons may be determined by recognizing persons in an image taken by the camera. For another example, a distance sensor, a pressure sensor, etc. may be disposed on a seat in the target space, and the first determining module 901 may determine whether there is a person in each seat by the distance between the seat and the person collected by the distance sensor, or determine whether there is a person in each seat by using a pressure signal collected by the pressure sensor, so as to determine the number of persons and the position of the person.
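The pressure-sensor variant can be sketched as a simple threshold check per seat. The threshold value and seat names are assumptions; a real system would calibrate the sensors and likely fuse several signals.

```python
def occupancy(pressure_by_seat: dict, threshold: float = 5.0):
    """Infer the number of people and the occupied seats from per-seat
    pressure readings; seats at or above the (assumed) threshold are
    considered occupied."""
    occupied = [seat for seat, p in pressure_by_seat.items() if p >= threshold]
    return len(occupied), occupied
```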
In this embodiment, the second determining module 902 may determine the target position of the microphone array adjustment based on the number of persons and the position of the persons. As an example, as shown in fig. 1, the microphone array 102 is disposed on the movable apparatus 103, and the second determining module 902 may determine the target position to which the current microphone array is to be adjusted according to the corresponding relationship between the number of people, the position of people, and the position to which the microphone array is adjusted, where the corresponding relationship may be set in advance.
In this embodiment, the control module 903 may control the direction in which a target microphone of the microphone array is turned to a target position. The target microphone may be a pre-designated microphone or a microphone determined according to the number of persons and the positions of the persons.
For example, the microphone array 102 is disposed on a rotatable surface as shown in fig. 1, when the number of people is 1, the target microphone may be the microphone closest to the person, and the control module 903 controls the movable device 103 to rotate so as to point the target microphone at the person. For another example, when the number of persons is equal to or greater than 2 and the target space is an in-vehicle space, the target microphone may be a microphone 1022 as shown in fig. 1, the microphone 1021 is dedicated to picking up the sound of the person on the driver's seat, and the microphone 1022 is dedicated to picking up the sound of the person on the passenger seat. Controller 101 may control movable device 103 to rotate such that microphone 1022 is pointed at the passenger's location.
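Pointing the target microphone at a person amounts to computing the signed rotation the movable device must perform. A minimal sketch, assuming angles are measured in degrees on the rotatable surface and that the shortest rotation direction is preferred:

```python
def rotation_command(current_angle: float, target_angle: float) -> float:
    """Signed shortest-path rotation (in degrees, range (-180, 180])
    that brings the array from its current angle to the target angle."""
    return (target_angle - current_angle + 180.0) % 360.0 - 180.0
```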
In this embodiment, the extraction module 904 may extract a voice signal from the audio signal collected by the microphone array 102 for voice processing.
Generally, the audio signal collected by the microphone array 102 contains noise; the sound picked up by its microphones may come from multiple person positions, and several microphones may simultaneously pick up the same person's voice. The extracting module 904 may therefore apply existing audio noise reduction to the collected audio signal, and may further separate the multi-channel sound with existing voice separation techniques to obtain the voice signal and determine the person position corresponding to it.
The extracted speech signal may be further processed, for example, by speech recognition, or by wake-up recognition, etc.
Referring to fig. 10, fig. 10 is a schematic structural diagram of an apparatus for speech processing according to another exemplary embodiment of the present disclosure.
In some optional implementations, the second determining module 902 includes: the first determining unit 9021 is configured to determine, based on the number of people being less than or equal to the preset number, a target position of the microphone array adjustment according to the position of the people.
In some optional implementations, the second determining module 902 includes: a second determining unit 9022, configured to, based on the number of people being greater than the preset number and in response to a wake-up operation by a person in the target space, determine the target position of the microphone array adjustment according to the position of the person who performed the wake-up operation.
In some alternative implementations, the type of wake operation includes at least one of: voice awakening, key awakening and touch screen awakening.
In some optional implementations, the second determining unit 9022 includes: a first determining subunit 90221, configured to determine, in response to determining that the type of the wake-up operation is key wake-up or touch-screen wake-up, a target position adjusted by the microphone array according to the position of the wake-up operation; or, the second determining subunit 90222 is configured to, in response to determining that the type of the wake-up operation is voice wake-up and the location of the wake-up operation is located in the first area in the target space, determine the target location adjusted by the microphone array according to the location of the wake-up operation.
In some alternative implementations, the extraction module 904 includes: a separation unit 9041, configured to perform voice separation processing based on the audio signal to obtain a voice signal; a third determining unit 9042, configured to determine a microphone that collects a voice signal; the first recognition unit 9043 is configured to perform voice recognition on the separated voice signal in response to determining that the voice signal is collected from a microphone corresponding to the wakeup operation in the state where the voice interaction device is waken up; a second identifying unit 9044, configured to perform wake-up identification on the voice signal in response to determining that the voice signal is not collected from a microphone corresponding to the wake-up operation in a state where the voice interaction device has been woken up.
In some optional implementations, the apparatus further comprises: a third determining module 905 configured to determine a preset initial position of the microphone array based on the number of persons and the position of the persons.
In some optional implementations, the apparatus further comprises: an enhancing module 906, configured to enhance an audio signal acquired by the microphone array based on a first beam width corresponding to a first microphone in the microphone array and a second beam width corresponding to a second microphone in the microphone array, where the first beam width is smaller than the second beam width; the extraction module 904 is further for: and extracting a voice signal from the enhanced audio signal for voice processing.
According to the voice processing device provided by the embodiment of the disclosure, the number and the positions of the personnel in the target space are determined, the target position adjusted by the microphone array is determined based on the number and the positions of the personnel, then, the direction of the target microphone in the microphone array rotating to the target position is controlled, and finally, the voice signals are extracted from the audio signals collected by the microphone array for voice processing.
Exemplary electronic device
Next, an electronic apparatus according to an embodiment of the present disclosure is described with reference to fig. 11. The electronic device may be a controller 101 as shown in fig. 1, and the controller 101 may receive an input signal collected by a microphone, a camera, or the like.
FIG. 11 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure.
As shown in fig. 11, the electronic device 1100 includes one or more processors 1101 and memory 1102.
The processor 1101 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 1100 to perform desired functions.
Memory 1102 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, Random Access Memory (RAM), cache memory, or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on a computer-readable storage medium and executed by the processor 1101 to implement the methods of speech processing of the various embodiments of the present disclosure above and/or other desired functions. Various contents such as a voice signal may also be stored in the computer-readable storage medium.
In one example, the electronic device 1100 may further include: an input device 1103 and an output device 1104, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
The input device 1103 may be a microphone, a camera, or the like, for inputting audio signals, images, or the like.
The output device 1104 can output various information including the extracted voice signal to the outside. The output devices 1104 may include, for example, speakers, displays, printers, and communication networks and remote output devices connected thereto.
Of course, for simplicity, only some of the components of the electronic device 1100 relevant to the present disclosure are shown in fig. 11, omitting components such as buses, input/output interfaces, and the like. In addition, electronic device 1100 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the method of speech processing according to various embodiments of the present disclosure described in the "exemplary methods" section of this specification above.
The computer program product may write program code for carrying out operations for embodiments of the present disclosure in any combination of one or more programming languages, including an object oriented programming language such as Java or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform steps in a method of speech processing according to various embodiments of the present disclosure described in the "exemplary methods" section above in this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present disclosure have been described above in connection with specific embodiments, but it should be noted that advantages, effects, and the like, mentioned in the present disclosure are only examples and not limitations, and should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.
In the present specification, the embodiments are described in a progressive manner, and each embodiment focuses on differences from other embodiments, and the same or similar parts in each embodiment are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The block diagrams of devices, apparatuses, and systems referred to in this disclosure are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to."
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be implemented as a program recorded in a recording medium, the program including machine-readable instructions for implementing a method according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. Such decomposition and/or recombination should be considered as equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (13)

1. A method of speech processing, comprising:
determining the number and positions of people in the target space;
determining a target location for microphone array adjustment based on the number of people and the location of people;
controlling a direction in which a target microphone of the microphone array is rotated to the target position;
and extracting a voice signal from the audio signal collected by the microphone array for voice processing.
2. The method of claim 1, wherein the determining a target location for microphone array adjustment based on the number of people and the location of people comprises:
and determining the adjusted target position of the microphone array according to the personnel position based on the personnel number being less than or equal to the preset number.
3. The method of claim 1, wherein the determining a target location for microphone array adjustment based on the number of people and the location of people comprises:
and responding to the awakening operation of the personnel in the target space based on the fact that the number of the personnel is larger than the preset number, and determining the adjusted target position of the microphone array according to the position of the personnel in the awakening operation.
4. The method of claim 3, wherein the type of wake-up operation comprises at least one of: voice awakening, key awakening and touch screen awakening.
5. The method of claim 4, wherein the determining the adjusted target location of the microphone array from the person location of the wake-up operation in response to the wake-up operation of the person within the target space comprises:
in response to determining that the type of the wake-up operation is key wake-up or touch screen wake-up, determining an adjusted target position of the microphone array according to the position of the wake-up operation; or,
in response to determining that the type of the wake-up operation is voice wake-up and that the location of the wake-up operation is in a first area within the target space, determining an adjusted target location of the microphone array from the location of the wake-up operation.
6. The method of claim 1, wherein the extracting a speech signal from an audio signal acquired by the microphone array for speech processing comprises:
performing voice separation processing based on the audio signal to obtain the voice signal;
determining a microphone that collects the voice signal;
in a state that the voice interaction device has been woken up, in response to determining that the voice signal is collected from the microphone corresponding to the wake-up operation, performing voice recognition on the separated voice signal;
and in a state that the voice interaction equipment is awakened, responding to the fact that the voice signal is not collected from the microphone corresponding to the awakening operation, and performing awakening identification on the voice signal.
7. The method of claim 1, wherein, after the determining the number of persons and the positions of persons within the target space, the method further comprises:
determining a preset initial position of the microphone array based on the number of persons and the position of the persons.
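Claim 7 only states that the preset initial position is derived from the number and positions of the detected persons; one plausible realization is to place the array at the centroid of those positions. The sketch below makes that assumption explicit and is not the patent's prescribed rule.

```python
def initial_position(person_positions):
    """Centroid of 2D person positions as an assumed preset initial position."""
    if not person_positions:
        raise ValueError("no persons detected")
    n = len(person_positions)
    return (sum(x for x, _ in person_positions) / n,
            sum(y for _, y in person_positions) / n)

print(initial_position([(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]))  # (1.0, 1.0)
```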
8. The method of any of claims 1-7, wherein prior to the extracting speech signals from the audio signals collected by the microphone array for speech processing, the method further comprises:
enhancing audio signals acquired by the microphone array based on a first beam width corresponding to a first microphone of the microphone array and a second beam width corresponding to a second microphone of the microphone array, wherein the first beam width is smaller than the second beam width;
the extracting of the voice signal from the audio signal collected by the microphone array for voice processing includes:
and extracting a voice signal from the enhanced audio signal for voice processing.
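Claim 8 enhances the audio using a narrower beam associated with a first microphone and a wider beam associated with a second microphone (first beam width smaller than the second). The toy sketch below models the two beams as crude rectangular gain masks over direction of arrival; the rectangular pattern, the widths, and the max-based fusion rule are all assumptions for illustration, not the patent's enhancement method.

```python
def beam_gain(doa_deg, steer_deg, width_deg):
    """1.0 inside the beam, 0.0 outside: a crude rectangular beam pattern."""
    return 1.0 if abs(doa_deg - steer_deg) <= width_deg / 2 else 0.0

def enhance(frames, doas, steer_deg, narrow_deg=20.0, wide_deg=60.0):
    """Per-frame gain from a narrow (first) and wide (second) beam, fused by max."""
    assert narrow_deg < wide_deg           # first beam width < second beam width
    out = []
    for x, doa in zip(frames, doas):
        g1 = beam_gain(doa, steer_deg, narrow_deg)   # first microphone's beam
        g2 = beam_gain(doa, steer_deg, wide_deg)     # second microphone's beam
        out.append(x * max(g1, 0.5 * g2))            # assumed fusion rule
    return out

print(enhance([1.0, 1.0], doas=[5.0, 40.0], steer_deg=0.0))  # [1.0, 0.0]
```

A frame arriving on-axis passes at full gain; one inside only the wide beam passes attenuated; one outside both beams is suppressed.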
9. A speech recognition system based on a movable microphone array, comprising: a controller and a microphone array located in a target space, wherein the microphone array is arranged on a movable device, and the movable device is configured to move the microphones included in the microphone array under the control of the controller;
the controller is configured to perform the method of any one of claims 1-8.
10. The system of claim 9, wherein the movable device comprises a rotatable surface, the microphone array disposed on the rotatable surface.
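For the rotatable surface of claim 10, the controller would issue a rotation toward the target azimuth; a natural (but assumed) policy is to take the shortest arc. The function name and degree convention below are illustrative, not from the patent.

```python
def rotation_command(current_deg, target_deg):
    """Signed shortest-arc rotation in degrees, normalized to (-180, 180]."""
    delta = (target_deg - current_deg) % 360.0
    if delta > 180.0:
        delta -= 360.0
    return delta

print(rotation_command(350.0, 10.0))   # 20.0  (rotate forward through 0 deg)
print(rotation_command(10.0, 350.0))   # -20.0 (rotate backward through 0 deg)
```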
11. The system of claim 9, wherein at least one wake-up key is disposed within the target space, each of the at least one wake-up key corresponding to at least one person position.
12. A computer-readable storage medium, the storage medium storing a computer program for performing the method of any of the preceding claims 1-8.
13. An electronic device, the electronic device comprising:
a processor;
a memory for storing executable instructions of the processor;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method of any one of claims 1-8.
CN202210325881.9A 2022-03-30 2022-03-30 Voice processing method and device, computer readable storage medium and electronic equipment Pending CN114598963A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210325881.9A CN114598963A (en) 2022-03-30 2022-03-30 Voice processing method and device, computer readable storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210325881.9A CN114598963A (en) 2022-03-30 2022-03-30 Voice processing method and device, computer readable storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN114598963A true CN114598963A (en) 2022-06-07

Family

ID=81811723

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210325881.9A Pending CN114598963A (en) 2022-03-30 2022-03-30 Voice processing method and device, computer readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114598963A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117412223A * 2023-12-14 2024-01-16 深圳市声菲特科技技术有限公司 Method, device, equipment and storage medium for far-field pickup
CN117412223B * 2023-12-14 2024-06-07 深圳市声菲特科技技术有限公司 Method, device and equipment for far-field pickup
CN117854526A * 2024-03-08 2024-04-09 深圳市声扬科技有限公司 Speech enhancement method, device, electronic equipment and computer readable storage medium
CN117854526B * 2024-03-08 2024-05-24 深圳市声扬科技有限公司 Speech enhancement method, device, electronic equipment and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN108470034B (en) A kind of smart machine service providing method and system
US10353495B2 (en) Personalized operation of a mobile device using sensor signatures
US20150088515A1 (en) Primary speaker identification from audio and video data
EP3669264A1 (en) System and methods for providing unplayed content
US20220139389A1 (en) Speech Interaction Method and Apparatus, Computer Readable Storage Medium and Electronic Device
US20200219384A1 (en) Methods and systems for ambient system control
CN111694433B (en) Voice interaction method and device, electronic equipment and storage medium
CN109032345B (en) Equipment control method, device, equipment, server and storage medium
CN111696570B (en) Voice signal processing method, device, equipment and storage medium
WO2021052306A1 (en) Voiceprint feature registration
CN110673096A (en) Voice positioning method and device, computer readable storage medium and electronic equipment
US20220180859A1 (en) User speech profile management
CN109032554B (en) Audio processing method and electronic equipment
CN110033773B (en) Voice recognition method, device, system and equipment for vehicle and vehicle
WO2022206602A1 (en) Speech wakeup method and apparatus, and storage medium and system
CN110660398A (en) Voiceprint feature updating method and device, computer equipment and storage medium
CN111863005A (en) Sound signal acquisition method and device, storage medium and electronic equipment
CN114678021A (en) Audio signal processing method and device, storage medium and vehicle
WO2021212388A1 (en) Interactive communication implementation method and device, and storage medium
JP2023546703A (en) Multichannel voice activity detection
CN114598963A (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN110970020A (en) Method for extracting effective voice signal by using voiceprint
CN113593572A (en) Method and apparatus for performing sound zone localization in spatial region, device and medium
CN110428838A (en) A kind of voice information identification method, device and equipment
CN111506183A (en) Intelligent terminal and user interaction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination