
CN109712626B - Voice data processing method and device

Info

Publication number: CN109712626B
Application number: CN201910161760.3A
Authority: CN
Inventors: 张明远
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2019-03-04
Filing date: 2019-03-04
Publication date: 2021-04-30
Anticipated expiration: 2039-03-04
Also published as: CN109712626A

Abstract

Embodiments of the invention disclose a voice data processing method and device. The method comprises: responding to a first trigger operation for a microphone array, the microphone array comprising a plurality of first microphone sets respectively pointing in corresponding directions, each first microphone set being associated with a first voice pickup mode; activating at least one first microphone set associated with the first trigger operation, determining the activated first microphone set as a working microphone set, and determining a target direction range according to the direction in which the working microphone set points; and performing voice pickup on the voice signals in the target direction range through the first voice pickup mode and the working microphone set to generate a first target voice signal. Embodiments of the invention can reduce noise interference during voice data acquisition and thereby improve the accuracy of voice recognition.

Description

Voice data processing method and device

Technical Field

The present invention relates to the field of sound pickup technologies, and in particular, to a method and an apparatus for processing voice data.

Background

In the field of speech processing, the continuing spread of intelligent devices (such as smart speakers and smart televisions) places higher requirements on microphone array technology.

In current conference systems, in order to record the voice data of all speakers in a conference, an omnidirectional pickup device collects voice data from speakers in all directions, and voice processing is performed on the collected omnidirectional voice data. However, collecting voice data in all directions easily introduces substantial noise interference. For example, if other people talk while a speaker is speaking in the conference, the sound pickup device also collects voice data other than that of the speaker (i.e., noise), which lowers the accuracy of voice recognition.

Disclosure of Invention

The embodiment of the invention provides a voice data processing method and a voice data processing device, which can reduce noise interference in a voice data acquisition process so as to improve the accuracy of voice recognition.

One aspect of the present invention provides a method for processing voice data, including:

responding to a first trigger operation for the microphone array; the microphone array comprises a plurality of first microphone sets respectively pointing in corresponding directions, each first microphone set being associated with a first voice pickup mode;

activating at least one first microphone set associated with the first trigger operation, determining the activated first microphone set as a working microphone set, and determining a target direction range according to the direction pointed by the working microphone set;

and carrying out voice pickup on the voice signals in the target direction range through the first voice pickup mode and the working microphone set to generate a first target voice signal.

Wherein activating at least one first microphone set associated with the first trigger operation, determining the activated first microphone set as a working microphone set, and determining a target direction range according to the direction pointed by the working microphone set comprises:

when the first trigger operation is associated with at least two first microphone sets, activating the at least two first microphone sets, and determining the activated first microphone set as a working microphone set;

acquiring first angle information of the direction to which each working microphone set points respectively;

and if the included angle between every two adjacent working microphone sets is smaller than or equal to an angle threshold, determining the angle range between the minimum angle information and the maximum angle information in the first angle information as a target direction range.
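The range rule above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `target_direction_range` is hypothetical, angles are assumed to be plain degrees without wrap-around at 360, and returning `None` when the threshold is exceeded is an assumption, since the text does not say what happens in that case.

```python
def target_direction_range(angles_deg, angle_threshold_deg):
    """Return (min_angle, max_angle) when every pair of adjacent working
    microphone sets is within the angle threshold, otherwise None.

    Assumption: angles are degrees on a line (no wrap-around at 360).
    """
    if len(angles_deg) < 2:
        raise ValueError("at least two working microphone sets are required")
    ordered = sorted(angles_deg)
    # Compare each working set with its nearest neighbour in angle.
    for lower, upper in zip(ordered, ordered[1:]):
        if upper - lower > angle_threshold_deg:
            return None  # behaviour outside the threshold is not specified
    return (ordered[0], ordered[-1])
```

With working sets pointing at 0 and 45 degrees and a threshold of 45 degrees, the sketch yields the range from 0 to 45 degrees.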

Wherein performing voice pickup on the voice signals in the target direction range through the first voice pickup mode and the working microphone set to generate a first target voice signal comprises:

generating a voice gain signal corresponding to each first microphone set respectively through the first voice pickup mode and the at least two first microphone sets; the voice gain signal is generated for each first microphone set based on voice signals in the target direction range;

and generating the first target voice signal according to the weighting coefficient corresponding to each first microphone set and the voice gain signal corresponding to each first microphone set.
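The combination step above amounts to a weighted sum over the per-set voice gain signals. The following sketch assumes each gain signal is a list of samples of equal length and that the weighting coefficients are already known; the name `weighted_sum` and the sample layout are illustrative assumptions.

```python
def weighted_sum(gain_signals, weights):
    """Combine per-set voice gain signals into one target signal.

    gain_signals: one list of samples per first microphone set.
    weights: one weighting coefficient per first microphone set.
    """
    if len(gain_signals) != len(weights):
        raise ValueError("one weight is required per gain signal")
    length = len(gain_signals[0])
    if any(len(signal) != length for signal in gain_signals):
        raise ValueError("all gain signals must have the same length")
    # Sample n of the output is the weighted sum of sample n of every set.
    return [
        sum(w * signal[n] for w, signal in zip(weights, gain_signals))
        for n in range(length)
    ]
```

How the weighting coefficients are chosen is not stated in this passage, so the sketch takes them as inputs.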

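Optionally, performing voice pickup on the voice signals in the target direction range through the first voice pickup mode and the working microphone set to generate a first target voice signal includes:

acquiring a transfer function vector and a filter matrix corresponding to the working microphone set;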

acquiring a voice signal, and determining second angle information between the direction pointed by the working microphone set and a sound source positioning direction corresponding to the voice signal;

determining a gain vector corresponding to the working microphone set in the first voice pickup mode according to the transfer function vector, the filter matrix and the second angle information;

convolving the voice signal based on the gain vector to generate a first target voice signal; if the second angle information belongs to the gain angle range, the first target voice signal is a voice signal after voice enhancement; and if the second angle information does not belong to the gain angle range, the first target voice signal is a voice signal after voice suppression.
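The enhancement and suppression branches above can be illustrated with a toy example. The text derives the gain vector from the transfer function vector, the filter matrix and the second angle information, but does not give the formula here, so the sketch substitutes a hypothetical placeholder (`placeholder_gain_vector`) that simply returns a one-tap boost inside the gain angle range and a one-tap cut outside it. The half-width of 22.5 degrees and the boost and cut values are likewise assumptions.

```python
def convolve(signal, kernel):
    """Full discrete convolution of two sample sequences."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def placeholder_gain_vector(second_angle_deg, half_width_deg, boost=2.0, cut=0.25):
    """Hypothetical stand-in for the gain vector described in the text.

    Returns an enhancing tap when the angle between the pointing direction
    and the sound source lies in the gain angle range, else a suppressing tap.
    """
    in_gain_range = abs(second_angle_deg) <= half_width_deg
    return [boost] if in_gain_range else [cut]


def pick_up(signal, second_angle_deg, half_width_deg=22.5):
    """Convolve the voice signal with the gain vector for the given angle."""
    gain = placeholder_gain_vector(second_angle_deg, half_width_deg)
    return convolve(signal, gain)
```

A source 10 degrees off the pointing direction is amplified in this sketch, while one 60 degrees off is attenuated; a real gain vector would be a multi-tap filter computed from the array model.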

Wherein the method further comprises:

acquiring a voice signal, and determining a sound source positioning direction corresponding to the voice signal according to the difference between the times at which at least two microphones in the microphone array acquire the voice signal.
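As a rough illustration of locating a source from an arrival-time difference, the sketch below uses the standard far-field model for two microphones: the time difference equals the spacing times the cosine of the arrival angle divided by the speed of sound. The far-field assumption, the two-microphone geometry, the speed of sound of 343 m/s and the function name are all assumptions that do not come from the text.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at room temperature


def direction_from_time_difference(delta_t_s, spacing_m, c=SPEED_OF_SOUND_M_S):
    """Estimate the arrival angle in degrees, measured from the axis that
    joins the two microphones, under a far-field plane-wave assumption.

    delta_t_s: arrival-time difference between the two microphones (seconds).
    spacing_m: distance between the two microphones (metres).
    """
    if spacing_m <= 0:
        raise ValueError("microphone spacing must be positive")
    ratio = c * delta_t_s / spacing_m
    # Measurement noise can push the ratio slightly outside [-1, 1].
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.acos(ratio))
```

A single pair only resolves the angle relative to its own axis, so a practical system would combine several microphone pairs to obtain an unambiguous direction.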

Wherein the microphone array further comprises a second microphone set;

the second microphone set is associated with a second voice pickup mode for hyper-directional enhancement of the voice signal, the sound pickup distance of the second voice pickup mode being greater than the sound pickup distance of the first voice pickup mode;

when switching from the working microphone set to the second microphone set, the second microphone set is used to perform voice pickup on voice signals within the target direction range based on the second voice pickup mode.

Wherein the method further comprises:

acquiring a first volume parameter corresponding to the voice signal;

activating the second microphone set if the first volume parameter is less than a volume threshold;

rotating the second microphone set into the target direction range, and converting the first voice pickup mode in the microphone array to the second voice pickup mode;

and performing voice pickup on the voice signals in the target direction range through the second voice pickup mode and the second microphone set to generate a second target voice signal.
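The volume-based switch above can be sketched as follows. The text does not define how the volume parameter is computed, so root-mean-square amplitude is used here as an assumed stand-in, and the function names and mode labels are hypothetical.

```python
import math


def rms_volume(samples):
    """Root-mean-square amplitude, used here as an assumed volume parameter."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def select_pickup_mode(samples, volume_threshold):
    """Return "second" when the volume falls below the threshold, meaning the
    second microphone set should be activated and rotated into the target
    direction range; otherwise stay in the "first" voice pickup mode."""
    if rms_volume(samples) < volume_threshold:
        return "second"
    return "first"
```

For instance, a quiet frame with amplitude 0.01 against a threshold of 0.1 selects the second voice pickup mode in this sketch.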

Optionally, the method further includes:

responding to a second trigger operation for the microphone array;

activating a second microphone set associated with the second trigger operation;

rotating the second microphone set into the target direction range, and converting the first voice pickup mode in the microphone array to the second voice pickup mode;

and performing voice pickup on the voice signals in the target direction range through the second voice pickup mode and the second microphone set to generate a second target voice signal.

Optionally, the target direction range includes a first target direction range and a second target direction range;

the method further comprises the following steps:

acquiring a second volume parameter corresponding to the voice signal in the first target direction range, and acquiring a third volume parameter corresponding to the voice signal in the second target direction range;

if the second volume parameter is less than the volume threshold and the third volume parameter is greater than or equal to the volume threshold, activating a second set of microphones;

pausing the first microphone sets corresponding to the first target direction range and the second target direction range respectively, rotating the second microphone set to be within the first target direction range, and reactivating the first microphone set within the second target direction range in the rotated microphone array to serve as an updated microphone set;

performing voice pickup on voice signals in the second target direction range through the first voice pickup mode and the updated microphone set to generate a third target voice signal;

and performing voice pickup on the voice signals in the first target direction range through the second voice pickup mode and the second microphone set to generate a second target voice signal.
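The two-range condition above can be summarized as a small decision function. This sketch only covers the single case the text defines (quiet first range, sufficiently loud second range); the function name, the dictionary layout and the `None` result for every other combination are assumptions.

```python
def plan_two_range_pickup(second_volume, third_volume, volume_threshold):
    """Return the pickup plan for the case described in the text, else None.

    second_volume: volume parameter of the first target direction range.
    third_volume: volume parameter of the second target direction range.
    """
    if second_volume < volume_threshold and third_volume >= volume_threshold:
        return {
            # The quiet range is served by the rotated second microphone set.
            "first_target_range": "second voice pickup mode",
            # The louder range keeps the first mode via the updated set.
            "second_target_range": "first voice pickup mode",
        }
    return None  # other combinations are not specified in this passage
```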

Wherein the method further comprises:

acquiring target voice characteristics corresponding to the first target voice signal, the second target voice signal and the third target voice signal respectively;

and respectively converting the first target voice signal, the second target voice signal and the third target voice signal into text information according to the target voice characteristics, and outputting the text information.

One aspect of the present invention provides a speech data processing apparatus, including:

a response module for responding to a first trigger operation for the microphone array; the microphone array comprises a plurality of first microphone sets respectively pointing in corresponding directions, each first microphone set being associated with a first voice pickup mode;

an activation module, configured to activate at least one first microphone set associated with the first trigger operation, determine the activated first microphone set as a working microphone set, and determine a target direction range according to a direction pointed by the working microphone set;

and the generating module is used for carrying out voice pickup on the voice signals in the target direction range through the first voice pickup mode and the working microphone set to generate a first target voice signal.

Wherein the activation module comprises:

a determining unit, configured to activate at least two first microphone sets when the first trigger operation associates the at least two first microphone sets, and determine the activated first microphone set as a working microphone set;

the angle acquisition unit is used for acquiring first angle information of the direction to which each working microphone set points respectively;

and the direction range determining unit is used for determining the angle range between the minimum angle information and the maximum angle information in the first angle information as a target direction range if the included angle between each two adjacent working microphone sets is smaller than or equal to an angle threshold.

Wherein the generating module comprises:

a gain signal generating unit, configured to generate, through the first voice pickup mode and the at least two first microphone sets, a voice gain signal corresponding to each first microphone set respectively; the voice gain signal is generated for each first microphone set based on voice signals in the target direction range;

and the weighted summation unit is used for generating the first target voice signal according to the weighting coefficient corresponding to each first microphone set and the voice gain signal corresponding to each first microphone set.

Optionally, the generating module includes:

a first obtaining unit, configured to obtain a transfer function vector and a filter matrix corresponding to the working microphone set;

the angle information determining unit is used for acquiring a voice signal and determining second angle information between the direction pointed by the working microphone set and a sound source positioning direction corresponding to the voice signal;

a gain vector determination unit, configured to determine, according to the transfer function vector, the filter matrix, and the second angle information, a gain vector corresponding to the working microphone set in the first voice pickup mode;

the convolution unit is used for performing convolution on the voice signal based on the gain vector to generate a first target voice signal; if the second angle information belongs to the gain angle range, the first target voice signal is a voice signal after voice enhancement; and if the second angle information does not belong to the gain angle range, the first target voice signal is a voice signal after voice suppression.

Wherein the apparatus further comprises:

and the positioning module is used for acquiring a voice signal and determining a sound source positioning direction corresponding to the voice signal according to the time difference of the at least two microphones in the microphone array acquiring the voice signal.

Wherein the apparatus further comprises a first conversion module;

the first conversion module includes:

the second acquisition unit is used for acquiring a first volume parameter corresponding to the voice signal;

a first condition judgment unit, configured to activate a second microphone set if the first volume parameter is smaller than a volume threshold;

a first mode conversion unit, configured to rotate the second microphone set to be within the target direction range, and convert a first voice pickup mode in the microphone array to the second voice pickup mode;

and the first voice pickup unit is used for performing voice pickup on the voice signals in the target direction range through the second voice pickup mode and the second microphone set to generate a second target voice signal.

Wherein the apparatus further comprises a second conversion module;

the second conversion module includes:

a response operation unit for responding to a second trigger operation for the microphone array;

a microphone activation unit to activate a second set of microphones associated with the second trigger operation;

a second mode conversion unit, configured to rotate the second microphone set to be within the target direction range, and convert a first voice pickup mode in the microphone array into the second voice pickup mode;

and the second voice pickup unit is used for performing voice pickup on the voice signals in the target direction range through the second voice pickup mode and the second microphone set to generate a second target voice signal.

Wherein the target direction range comprises a first target direction range and a second target direction range; the apparatus further comprises a third conversion module;

the third conversion module includes:

a third obtaining unit, configured to obtain a second volume parameter corresponding to the voice signal in the first target direction range, and obtain a third volume parameter corresponding to the voice signal in the second target direction range;

a second condition determining unit, configured to activate a second microphone set if the second volume parameter is smaller than the volume threshold and the third volume parameter is greater than or equal to the volume threshold;

a rotation unit, configured to suspend the first microphone set corresponding to the first target direction range and the second target direction range, rotate the second microphone set to the first target direction range, and reactivate the first microphone set in the second target direction range in the rotated microphone array as an updated microphone set;

a third voice pickup unit, configured to perform voice pickup on the voice signal in the second target direction range through the first voice pickup mode and the updated microphone set, and generate a third target voice signal;

and the fourth voice pickup unit is used for performing voice pickup on the voice signals in the first target direction range through the second voice pickup mode and the second microphone set to generate a second target voice signal.

Wherein the apparatus further comprises a speech recognition module;

the speech recognition module comprises:

a voice feature obtaining unit, configured to obtain target voice features corresponding to the first target voice signal, the second target voice signal, and the third target voice signal, respectively;

and the text conversion unit is used for respectively converting the first target voice signal, the second target voice signal and the third target voice signal into text information according to the target voice characteristics and outputting the text information.

One aspect of the present invention provides a speech data processing apparatus, including: a processor and a memory;

the processor is connected to the memory, wherein the memory is used for storing program code, and the processor is used for calling the program code to execute the method in one aspect of the embodiments of the invention.

An aspect of an embodiment of the present invention provides a computer storage medium storing a computer program, the computer program comprising program instructions that, when executed by a processor, perform a method as in an aspect of an embodiment of the present invention.

In the embodiment of the present invention, the microphone array may include a plurality of first microphone sets respectively pointing to corresponding directions, and by responding to a first trigger operation for the microphone array, at least one first microphone set associated with the first trigger operation may be activated, and a target direction range is determined according to a direction in which the first microphone set points, and voice signals within the target direction range may be subjected to voice pickup through the first microphone sets in the first voice pickup mode. In other words, the microphone can be activated according to the triggering operation, and the voice signal in the direction pointed by the microphone can be picked up through the activated microphone, so that the interference of the voice signal in other directions can be avoided, and the accuracy of voice recognition can be improved.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic view of a scenario of a voice data processing method according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method for processing voice data according to an embodiment of the present invention;

FIG. 3 is a flow chart of another method for processing voice data according to an embodiment of the present invention;

FIGS. 4a and 4b are schematic diagrams of a voice picking method according to an embodiment of the present invention;

FIG. 5 is a flow chart illustrating another method for processing voice data according to an embodiment of the present invention;

FIG. 6 is a flow chart illustrating another method for processing voice data according to an embodiment of the present invention;

FIGS. 7a to 7c are schematic structural diagrams of a microphone array voice pick-up pattern according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a speech data processing apparatus according to an embodiment of the present invention;

FIG. 9 is a schematic structural diagram of another speech data processing apparatus according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIG. 1, FIG. 1 is a schematic view of a scenario of a voice data processing method according to an embodiment of the present invention. In some important meeting scenes, or in interrogation scenes at inspection institutions, the meeting or interrogation usually needs to be recorded, so the voice data of a speaker must be collected with a sound pickup apparatus (such as a circular microphone array or a linear microphone array). The sound pickup apparatus may include a group of omnidirectional microphones located at different positions in space and arranged in a regular shape, that is, a microphone array. As shown in FIG. 1, the user 100a is a speaker in a conference or an interrogation; when the user 100a speaks, the voice data of the user 100a may be collected through the sound pickup apparatus 500, and voice recognition may be performed on the collected voice data.

The microphone array 200 in the sound pickup apparatus 500 is shaped like a "spoon" and may include first microphone sets and a second microphone set. The first microphone sets constitute an omnidirectional sound pickup area in the microphone array, which is used to determine the direction of a voice and to perform omnidirectional voice pickup, i.e., the first voice pickup mode. The microphone array 200 may include a plurality of first microphone sets, each of which includes at least two microphones. As shown in FIG. 1, the microphone array includes 4 first microphone sets, denoted {M1, M2, M6, M10, M14}, {M1, M3, M7, M11, M15}, {M1, M4, M8, M12, M16} and {M1, M5, M9, M13, M17}. All the microphones in each first microphone set lie on a straight line, and the set may point toward both ends of that line, where voice pickup is most effective. The second microphone set performs hyper-directional enhancement for long-distance pickup and may be composed of multiple linearly arranged microphones used for hyper-directional pickup and voice recognition in a specific direction; in FIG. 1 the second microphone set is {M1, M2, M6, M10, M14, M18, M19, M20}. Here M1, M2, ..., M20 each denote a microphone.

In the sound pickup apparatus 500, each first microphone set may correspond to two direction keys 300, one at each end of the straight line on which the set lies. Taking the direction of the second microphone set {M1, M2, M6, M10, M14, M18, M19, M20} as the 0-degree direction, the microphone array consists of uniform linear groups of 5 microphones covering 8 directions: 0, 45, 90, 135, 180, 225, 270 and 315 degrees. One direction key 300 may be provided in each of these 8 directions, and when a direction key 300 is turned on, the first microphone set corresponding to that key is activated. For example, with the user 100a positioned at 45 degrees relative to the microphone array 200, turning on the direction key 300 in the 45-degree direction activates the first microphone set {M1, M3, M7, M11, M15}, which collects the voice data of the user 100a, and voice enhancement is performed on the collected voice data. Optionally, both a direction key 300 and an ultra-far distance key 400 may be arranged in the 0-degree direction: when the direction key 300 is turned on, the first microphone set {M1, M2, M6, M10, M14} is activated; when the ultra-far distance key 400 is turned on, the second microphone set {M1, M2, M6, M10, M14, M18, M19, M20} is activated. When the quality of the voice data of the user 100a collected by the sound pickup apparatus 500 is too low (e.g., the volume is low or the articulation is unclear), the microphone array 200 may be rotated automatically so that the second microphone set points in the 45-degree direction, and the first voice pickup mode in the microphone array is converted into the second voice pickup mode. This improves the quality of the collected voice data and, in turn, the accuracy of voice recognition.
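The association between keys and microphone sets described above can be sketched as a lookup table. The microphone labels follow the description of FIG. 1; the table and function names are hypothetical, and the assignment of the 90-degree and 135-degree lines to particular sets is an assumption based on the order in which the sets are listed, since the text only states the 0-degree and 45-degree cases explicitly.

```python
# Hypothetical lookup tables; microphone labels follow the FIG. 1 description.
FIRST_SETS = {
    0: ("M1", "M2", "M6", "M10", "M14"),    # stated in the text
    45: ("M1", "M3", "M7", "M11", "M15"),   # stated in the text
    90: ("M1", "M4", "M8", "M12", "M16"),   # assumed from listing order
    135: ("M1", "M5", "M9", "M13", "M17"),  # assumed from listing order
}
SECOND_SET = ("M1", "M2", "M6", "M10", "M14", "M18", "M19", "M20")


def set_for_key(direction_deg, ultra_far=False):
    """Return the microphones activated by pressing a key."""
    direction = direction_deg % 360
    if direction % 45 != 0:
        raise ValueError("direction keys exist only at multiples of 45 degrees")
    if ultra_far:
        if direction != 0:
            raise ValueError("the ultra-far distance key sits in the 0-degree direction")
        return SECOND_SET
    # Keys at opposite ends of a line share one first microphone set.
    return FIRST_SETS[direction % 180]
```

Because each first microphone set serves both ends of its line, the keys at 45 and 225 degrees resolve to the same set in this sketch.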

Referring to fig. 2, fig. 2 is a flowchart illustrating a voice data processing method according to an embodiment of the present invention. As shown in fig. 2, the method may include:

Step S101, responding to a first trigger operation for a microphone array; the microphone array comprises a plurality of first microphone sets respectively pointing in corresponding directions, each first microphone set being associated with a first voice pickup mode;

Specifically, the microphone array is a core technology of the sound pickup apparatus. The microphone array may include a plurality of microphones arranged according to a certain rule, and may include a plurality of first microphone sets respectively pointing in corresponding directions. Each first microphone set includes at least two microphones, all the microphones in each first microphone set are arranged in a line (uniformly or non-uniformly spaced), and each first microphone set is associated with the first voice pickup mode. The first trigger operation may be a contact trigger operation such as a manual key press, fingerprint recognition or manual touch, or a non-contact trigger operation such as voice or remote control. If the first trigger operation is a manual key operation, each first microphone set may be associated with two direction keys on the straight line on which the set lies. When a direction key is manually pressed, the sound pickup apparatus responds to the first trigger operation, i.e., the manual press of that direction key. For example, the microphone array may include 4 first microphone sets, each including 5 microphones, and the 4 first microphone sets may point in 8 directions evenly spaced over 360 degrees, namely 0, 45, 90, 135, 180, 225, 270 and 315 degrees, with each direction corresponding to one direction key. When the speaker is located in the 90-degree direction of the microphone array, the direction key in the 90-degree direction may be manually turned on, and the sound pickup apparatus then responds to the operation of "pressing the direction key in the 90-degree direction" (i.e., the first trigger operation); in other words, it acquires the instruction information for that operation.

Optionally, the first trigger operation may be a fingerprint recognition trigger. After the sound pickup apparatus receives input fingerprint information, it responds to the first trigger operation for the fingerprint information, that is, it matches the input fingerprint information against the enrolled fingerprint information and, if the match succeeds, may activate the first microphone set associated with the fingerprint information. Optionally, the first trigger operation may be a manual touch trigger: each first microphone set may be associated with two touch areas on the straight line on which the set lies, and when the sound pickup apparatus receives a manual touch sensing signal (i.e., responds to the first trigger operation for the microphone array), the first microphone set associated with the touch area to which the sensing signal belongs may be activated. Alternatively, the first trigger operation may be a voice trigger operation: when the sound pickup apparatus detects a voice wake-up word such as "hey", "hello" or "please record" (i.e., responds to the first trigger operation for the microphone array), the first microphone set in the direction corresponding to the voice wake-up word may be activated. Alternatively, the first trigger operation may be a remote control trigger operation: first microphone sets in different directions may be activated through a remote control device, each first microphone set being associated with different keys on the remote control device, and when a key on the remote control device is manually pressed, the sound pickup apparatus responds to the first trigger operation for that key. For example, the microphone array comprises 4 first microphone sets, which may point in 8 directions evenly spaced over 360 degrees, and each direction may correspond to one key on the remote control device, i.e., key 1, key 2, …, key 8.

Step S102, activating at least one first microphone set associated with the first trigger operation, determining the activated first microphone set as a working microphone set, and determining a target direction range according to the direction pointed by the working microphone set;

Specifically, after the sound pickup apparatus responds to the first trigger operation for the microphone array, at least one first microphone set associated with the first trigger operation may be activated, and the activated first microphone set is determined as a working microphone set. That is, the activated first microphone set can perform voice data processing (including voice enhancement and voice recognition on the collected voice data), whereas the non-activated first microphone sets are all in a non-working state: they may receive a voice signal but cannot perform voice enhancement or voice recognition on it. The target direction range for subsequent voice pickup may be determined according to the direction in which the working microphone set points. When the first trigger operation is associated with only one first microphone set, the target direction range is determined from the direction in which that set points. For example, if the direction corresponding to the first microphone set is the 0-degree direction, an angle range around 0 degrees may be determined as the target direction range (for example, the 0-degree direction itself, or the range within plus or minus 22.5 degrees of it; the angle range may be set according to the microphone arrangement in the microphone array or actual needs, and is not limited here). When the first trigger operation is associated with two first microphone sets, the target direction range may be determined according to the angle range between the directions in which the two working microphone sets point. For example, if the directions corresponding to the two working microphone sets are the 0-degree and 45-degree directions, the angle range between 0 degrees and 45 degrees may be determined as the target direction range.
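The single-set case above, with a range of plus or minus 22.5 degrees around the pointing direction, needs care near 0 degrees because the range wraps past 360. The sketch below shows one way to represent and test such a range; the function names and the convention of storing a range as a (start, end) pair read counter-clockwise from start to end are assumptions.

```python
def single_set_range(direction_deg, half_width_deg=22.5):
    """Target direction range around one working microphone set, as a
    (start, end) pair in degrees within [0, 360)."""
    start = (direction_deg - half_width_deg) % 360
    end = (direction_deg + half_width_deg) % 360
    return (start, end)


def in_range(angle_deg, direction_range):
    """Check whether an angle lies inside a (start, end) range, including
    ranges that wrap past 360 degrees."""
    start, end = direction_range
    angle = angle_deg % 360
    if start <= end:
        return start <= angle <= end
    # Wrapped range, for example (337.5, 22.5) around the 0-degree direction.
    return angle >= start or angle <= end
```

The same `in_range` check also works for the two-set example in the text, where the range is simply (0, 45).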

Step S103, performing voice pickup on the voice signals within the target direction range through the first voice pickup mode and the working microphone set, and generating a first target voice signal.

Specifically, after determining a working microphone set and a target direction range in the microphone array, a first voice pickup mode in the microphone array may be turned on, and the working microphone set is adopted to collect voice signals in the target direction range, and the collected voice signals are subjected to voice enhancement in the first voice pickup mode, that is, the voice signals in the target direction range are subjected to voice pickup to generate a first target voice signal. It can be understood that, the speaking process of the speaker is a continuous process, and the sound pickup apparatus can collect voice data of the speaker in real time by using the working microphone set and perform voice enhancement on the collected voice data, so that the working microphone set can perform voice pickup on all voice signals within the target direction range in real time.

Optionally, the first trigger operation may be associated with a plurality of direction keys, and then the first microphone sets corresponding to the plurality of direction keys may be activated, and all activated first microphone sets are determined as working microphone sets, and the target direction range is determined according to the directions corresponding to the plurality of direction keys. For example, if the direction keys operated by the first trigger operation are a direction key in a 45 degree direction, a direction key in a 135 degree direction, and a direction key in a 270 degree direction, the first microphone set corresponding to the 45 degree direction, the first microphone set corresponding to the 135 degree direction, and the first microphone set corresponding to the 270 degree direction may be activated, and the 3 activated first microphone sets may be determined as the working microphone set, and the angle ranges of about 45 degrees, about 135 degrees, and about 270 degrees may be determined as the target direction ranges, and the voice signal in the target direction range may be voice-picked up in the first voice pickup mode, so as to generate the first target voice signal.

Referring to fig. 3, fig. 3 is a flowchart illustrating another voice data processing method according to an embodiment of the present invention. As shown in fig. 3, the method may include:

Step S201, responding to a first trigger operation for a microphone array; the microphone array comprises a plurality of first microphone sets respectively pointing in corresponding directions, each first microphone set being associated with a first voice pickup mode;

For a specific implementation of step S201, reference may be made to the description of step S101 in the embodiment corresponding to fig. 2, and details are not described here again.

Step S202, when the first trigger operation is associated with at least two first microphone sets, activating the at least two first microphone sets, and determining the activated first microphone set as a working microphone set;

Specifically, when the first trigger operation is associated with at least two first microphone sets, all the first microphone sets associated with the first trigger operation are activated and determined as working microphone sets. That is, all of these first microphone sets are in a working state and can perform voice enhancement processing and voice recognition processing on the collected voice data.

Step S203, acquiring first angle information of the direction to which each working microphone set points respectively;

Specifically, each first microphone set in the microphone array points in a different direction, so after the working microphone sets are determined, the angle information (i.e., the first angle information) of the direction in which each working microphone set points can be obtained.

Step S204, if the included angle between each two adjacent working microphone sets is smaller than or equal to the angle threshold, determining the angle range between the minimum angle information and the maximum angle information in the first angle information as a target direction range;

Specifically, the included angle between every two adjacent working microphone sets may be determined from the obtained first angle information. If each of these included angles is less than or equal to an angle threshold, the angle range between the minimum angle information and the maximum angle information in the first angle information may be determined as the target direction range. The angle threshold may be the included angle between the directions pointed to by two adjacent first microphone sets in the microphone array. For example, if the first microphone sets in the microphone array divide 360 degrees evenly into 8 directions, the angle threshold may be 45 degrees.

Assume that the first microphone sets associated with the first trigger operation are first microphone set 1 in the 0-degree direction, first microphone set 2 in the 45-degree direction, and first microphone set 3 in the 90-degree direction. These three first microphone sets may be activated and determined as working microphone sets, denoted working microphone set 1, working microphone set 2, and working microphone set 3. The included angle between working microphone set 1 and working microphone set 2 is 45 degrees, equal to the angle threshold, and the included angle between working microphone set 2 and working microphone set 3 is also 45 degrees, so the angle range between 0 degrees and 90 degrees may be determined as the target direction range. Now assume instead that the associated first microphone sets are first microphone set 1 in the 0-degree direction, first microphone set 2 in the 90-degree direction, and first microphone set 3 in the 135-degree direction, determined as working microphone sets 1, 2, and 3 respectively. The included angle between working microphone set 1 and working microphone set 2 is 90 degrees, which is greater than the 45-degree angle threshold, while the included angle between working microphone set 2 and working microphone set 3 is 45 degrees, which is equal to the angle threshold. In this case, an angle range around 0 degrees (e.g., the range between minus 22.5 degrees and plus 22.5 degrees) together with the angle range between 90 degrees and 135 degrees may be determined as the target direction range.

Optionally, when the first microphone sets associated with the first trigger operation are first microphone set 1 in the 0-degree direction, first microphone set 2 in the 45-degree direction, and first microphone set 3 in the 90-degree direction, the first angle information corresponding to first microphone set 1 (i.e., working microphone set 1) may be determined as the angle range between minus 22.5 degrees and plus 22.5 degrees, the first angle information corresponding to first microphone set 2 (i.e., working microphone set 2) as the angle range between 22.5 degrees and 67.5 degrees, and the first angle information corresponding to first microphone set 3 (i.e., working microphone set 3) as the angle range between 67.5 degrees and 112.5 degrees, so that the angle range between minus 22.5 degrees and 112.5 degrees may be determined as the target direction range.
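One way to read steps S203 and S204 is as a grouping rule: neighbouring working microphone sets whose included angle does not exceed the threshold are merged into one span, and an isolated set keeps a sector around its own direction. The sketch below assumes that reading; the function name, the 22.5-degree half-width, and the decision to ignore wrap-around at 360 degrees are simplifications introduced here.

```python
def target_ranges(working_dirs_deg, threshold_deg=45.0, half_width_deg=22.5):
    """Group sorted working-set directions into target direction ranges.

    Neighbours no more than `threshold_deg` apart are merged into one span;
    an isolated direction gets a sector of +/- `half_width_deg` around it.
    Wrap-around between 315 and 0 degrees is not handled in this sketch.
    """
    dirs = sorted(working_dirs_deg)
    groups = [[dirs[0]]]
    for d in dirs[1:]:
        if d - groups[-1][-1] <= threshold_deg:
            groups[-1].append(d)
        else:
            groups.append([d])

    ranges = []
    for group in groups:
        if len(group) == 1:
            ranges.append((group[0] - half_width_deg, group[0] + half_width_deg))
        else:
            ranges.append((group[0], group[-1]))
    return ranges


print(target_ranges([0, 45, 90]))     # [(0, 90)]
print(target_ranges([45, 135, 270]))  # [(22.5, 67.5), (112.5, 157.5), (247.5, 292.5)]
```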

It should be understood that steps S202 to S204 above describe the specific process of determining the target direction range when the first trigger operation is associated with at least two first microphone sets. When the first trigger operation is associated with only one first microphone set, the target direction range may be determined directly according to the direction in which that first microphone set points, that is, the target direction range is the direction range pointed to by that first microphone set.

Step S205, acquiring a voice signal, and determining a sound source positioning direction corresponding to the voice signal according to the time difference of at least two microphones in the microphone array acquiring the voice signal;

Specifically, the voice signal can be acquired through the microphone array. Because each microphone in the microphone array is at a different position, the microphones receive the same voice signal at slightly different times, and the sound source localization direction of the sound source that produced the voice signal can be determined from this time difference and the distance between every two adjacent microphones in each first microphone set. In other words, the propagation distance between two microphones (i.e., the distance between the two microphones measured along the direction of the voice signal) can be calculated from the time difference and the propagation speed of sound in air, and the included angle between the direction of the voice signal and the straight line on which the two microphones lie can then be calculated from this propagation distance and the distance between the two microphones. For example, when the distance between two microphones is a, the time difference between the two microphones receiving the voice signal is t, and the propagation speed of sound in air is c, the propagation distance between the two microphones is c × t, the included angle θ satisfies cos θ = c × t / a, and the sound source localization direction corresponding to the voice signal can thereby be determined. When c × t = a (i.e., θ = 0), the sound source localization direction is the direction of the straight line on which the two microphones lie.
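The relation cos θ = c × t / a can be evaluated directly. The sketch below is illustrative only: the speed of sound (343 m/s), the 5 cm spacing, and the function name are assumed values, and the cosine is clamped to [-1, 1] so that small measurement errors in t do not cause a domain error.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at roughly 20 degrees Celsius


def arrival_angle_deg(spacing_m, delay_s, c=SPEED_OF_SOUND):
    """Angle between the arrival direction and the line through two microphones,
    from cos(theta) = c * t / a."""
    cos_theta = c * delay_s / spacing_m
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against measurement noise
    return math.degrees(math.acos(cos_theta))


a = 0.05  # assumed microphone spacing in metres
endfire = arrival_angle_deg(a, a / SPEED_OF_SOUND)  # c * t == a, so theta is about 0
broadside = arrival_angle_deg(a, 0.0)               # no delay, so theta is about 90
oblique = arrival_angle_deg(a, a * math.cos(math.radians(60)) / SPEED_OF_SOUND)
print(round(endfire, 3), round(broadside, 3), round(oblique, 3))
```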

Step S206, obtaining a transfer function vector and a filter matrix corresponding to the working microphone set;

Specifically, after the working microphone set is determined, when the working microphone set consists of only one first microphone set (that is, the first trigger operation is associated with only one first microphone set), the transfer function vector and the filter matrix corresponding to the working microphone set may be obtained. Since each first microphone set is a linear microphone array, the working microphone set is also a linear microphone array. In an acoustic environment, when the distance between the sound source and the microphone array is much larger than the distance between microphones, the model corresponding to the microphone array can be regarded as a far-field model. In the far-field model, the sound waves received by the microphones can be regarded as plane waves, so the transfer function corresponding to the working microphone set can be determined by a sound source orientation function. For a uniform linear microphone array, the transfer function vector can be defined as:

where θ_d in formula (1) represents the angle between the direction corresponding to the working microphone set and the desired sound source direction, M represents the number of microphones in the working microphone set, and τ₀ represents the time difference with which two adjacent microphones in the working microphone set receive the same voice signal. τ₀ can be calculated according to formula (2), in which δ represents the distance between two adjacent microphones in the working microphone set and c represents the propagation speed of sound in air.
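Formulas (1) and (2) are not reproduced in this text. Under the far-field, uniform linear array assumptions stated above, and using only the symbols defined there (θ_d, M, τ₀, δ, c), they would commonly be written in the following form. This is a reconstruction from the symbol definitions, with the sign of the exponent following the usual convention that later microphones receive the signal later; it should not be taken as the original typesetting.

```latex
\mathbf{d}(\omega, \theta_d) =
\begin{bmatrix}
1 & e^{-j\omega\tau_0\cos\theta_d} & \cdots & e^{-j(M-1)\omega\tau_0\cos\theta_d}
\end{bmatrix}^{T}
\tag{1}

\tau_0 = \frac{\delta}{c}
\tag{2}
```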

The filter matrix corresponding to the working microphone set may be a multi-microphone filter matrix, and may be represented as:

h(ω) = [H₁(ω) H₂(ω) ... H_M(ω)]^T (3)

Step S207, determining second angle information between the direction pointed by the working microphone set and the sound source positioning direction corresponding to the voice signal;

Specifically, the second angle information is the included angle between the sound source localization direction determined for the voice signal and the target direction, that is, the included angle between the sound source localization direction and the direction corresponding to the working microphone set. It is equivalent to θ_d in formula (1).

Step S208, determining a gain vector corresponding to the working microphone set in the first voice pickup mode according to the transfer function vector, the filter matrix and the second angle information;

Specifically, a gain vector corresponding to the working microphone set in the first voice pickup mode may be calculated according to the obtained transfer function vector, the multi-microphone filter matrix, and the second angle information. That is, the transfer function vector evaluated at the actually determined second angle information is multiplied by the filter matrix to obtain the gain vector corresponding to the working microphone set, which may be represented as:

where θ in formula (4) represents the second angle information, and the index m takes the values 1, 2, …, M. The larger θ is, that is, the larger the included angle between the sound source localization direction corresponding to the voice signal and the direction corresponding to the working microphone set, the smaller cos θ is, the smaller the gain vector calculated by formula (4), and the poorer the voice pickup effect on the voice signal. Conversely, the smaller θ is, the larger cos θ is, the larger the calculated gain vector, and the better the voice pickup effect on the voice signal; when θ is equal to 0, the voice pickup effect on the voice signal is the best.
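Formula (4) is not reproduced in this text, so the sketch below only illustrates the idea of multiplying a transfer function vector by a filter to obtain a direction-dependent gain. It assumes a steering vector with elements exp(-j(m-1)ωτ₀cos θ) and, as one possible choice of filter, a delay-and-sum filter steered at θ = 0. The filter choice, the function names, and the numeric parameters (5 microphones, 4 cm spacing, 2 kHz) are all assumptions of this sketch.

```python
import cmath
import math


def steering_vector(omega, theta_rad, num_mics, tau0):
    """Assumed far-field transfer function vector of a uniform linear array."""
    return [cmath.exp(-1j * m * omega * tau0 * math.cos(theta_rad))
            for m in range(num_mics)]


def gain(omega, theta_rad, num_mics, tau0):
    """Magnitude of h^H d(theta) for a delay-and-sum filter h steered at theta = 0."""
    h = [x / num_mics for x in steering_vector(omega, 0.0, num_mics, tau0)]
    d = steering_vector(omega, theta_rad, num_mics, tau0)
    return abs(sum(hm.conjugate() * dm for hm, dm in zip(h, d)))


# Assumed example parameters
M = 5
tau0 = 0.04 / 343.0             # spacing / speed of sound
omega = 2 * math.pi * 2000.0    # 2 kHz

for angle in (0.0, 22.5, 90.0, 180.0):
    print(angle, round(gain(omega, math.radians(angle), M, tau0), 3))
```

With these parameters the gain is 1 at θ = 0 and clearly lower at 90 and 180 degrees, which matches the qualitative behaviour described above. A real beam pattern also has side lobes, so the decrease with θ is not strictly monotonic.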

Step S209, convolving the voice signal based on the gain vector to generate a first target voice signal; if the second angle information belongs to the gain angle range, the first target voice signal is a voice signal after voice enhancement; and if the second angle information does not belong to the gain angle range, the first target voice signal is a voice signal after voice suppression.

Specifically, the calculated gain vector may be convolved with the acquired voice signal, and the convolved voice signal may be determined as the first target voice signal. The voice signals acquired by the microphone array are digital signals. During voice pickup, the microphones in the working microphone set can convert the received voice signals from the time domain to the frequency domain, so that the convolution of the gain vector with the voice signal in the time domain becomes a product in the frequency domain. It should be noted that the working microphone set corresponds to a gain angle range. When the second angle information belongs to the gain angle range, that is, when the included angle between the sound source localization direction corresponding to the voice signal and the direction corresponding to the working microphone set falls within the gain angle range, the gain vector has a voice gain effect, and all voice signals in the sound source localization direction can be enhanced through the gain vector. When the second angle information does not belong to the gain angle range, the gain vector has a voice suppression effect, and all voice signals in the sound source localization direction can be suppressed through the gain vector. For example, suppose the first microphone sets point in 8 directions that evenly divide 360 degrees and the target direction is 0 degrees. If the gain angle range of the first microphone set in the 0-degree direction (which can be determined as the working microphone set) is plus or minus 22.5 degrees, voice signals within plus or minus 22.5 degrees can be enhanced, and voice signals in the remaining angle ranges can be suppressed.
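The statement that a time-domain convolution becomes a frequency-domain product can be checked numerically. The sketch below uses a naive discrete Fourier transform from the standard library; the short signal and the two-tap filter are made-up values, and the filter stands in for a time-domain equivalent of the gain, which the text does not specify.

```python
import cmath


def dft(x):
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * i * k / n) for k in range(n))
            for i in range(n)]


def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[i] * cmath.exp(2j * cmath.pi * i * k / n) for i in range(n)) / n
            for k in range(n)]


def linear_convolve(x, g):
    """Direct time-domain convolution."""
    out = [0.0] * (len(x) + len(g) - 1)
    for i, xi in enumerate(x):
        for j, gj in enumerate(g):
            out[i + j] += xi * gj
    return out


def convolve_via_frequency(x, g):
    """Same result through a bin-by-bin product of the two spectra."""
    n = len(x) + len(g) - 1  # zero-pad so circular convolution equals linear convolution
    spectrum_x = dft(x + [0.0] * (n - len(x)))
    spectrum_g = dft(g + [0.0] * (n - len(g)))
    product = [a * b for a, b in zip(spectrum_x, spectrum_g)]
    return [value.real for value in idft(product)]


signal = [1.0, 2.0, 3.0]   # hypothetical voice samples
taps = [0.5, 0.5]          # hypothetical time-domain filter
print(linear_convolve(signal, taps))  # [0.5, 1.5, 2.5, 1.5]
print([round(v, 6) for v in convolve_via_frequency(signal, taps)])
```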

Please refer to fig. 4a and fig. 4b together, which are schematic diagrams illustrating a voice pickup method according to an embodiment of the present invention. As shown in fig. 4a, taking one first microphone set as an example, the working microphone set includes M microphones uniformly arranged on a straight line, namely microphone 1, microphone 2, ..., and microphone M from right to left, and the distance between every two adjacent microphones is denoted δ. When the distance between the sound source 100b and the working microphone set is far greater than the distance between two adjacent microphones, a far-field algorithm may be used to perform voice pickup on the voice signal. In the far-field model, the voice signal in the sound source localization direction can be enhanced through beamforming in the microphone array. The specific process is as follows. The sound wave generated by the sound source 100b is a plane wave, that is, the angle between the sound source localization direction and the line of the working microphone set is the same at every microphone and can be denoted θ. The distance between microphone 1 and microphone M is (M-1)δ, so the difference between the distances travelled by the sound wave from the sound source 100b to microphone 1 and to microphone M is (M-1)δ cos θ. The time difference with which microphone 1 and microphone M receive the same voice signal can be determined from this distance difference and the propagation speed of sound; that is, microphone M receives the voice signal with a time delay relative to microphone 1. The received voice signals are delay-compensated and then added, which yields the first target voice signal corresponding to the working microphone set.

The voice signal output by microphone 1 can be denoted Y₁(ω) (corresponding to the product of the transfer function vector in formula (1) and the voice signal generated by the sound source, i.e., the received voice signal after delay compensation), and the filter corresponding to microphone 1 can be denoted H₁(ω). The speech enhancement signal corresponding to microphone 1 is then H₁*(ω)Y₁(ω), where H₁*(ω) is obtained from H₁(ω) by conjugation. By analogy, the speech enhancement signal corresponding to microphone M is H_M*(ω)Y_M(ω).

Summing the speech enhancement signals corresponding to each microphone in the working microphone set gives the final voice signal (i.e., the first target voice signal) obtained after the working microphone set enhances the voice signal. It can be represented as the product of the filter h^H(ω) and the delay-compensated voice signal Y(ω), where h(ω) is given by formula (3), so that h^H(ω) can be represented as h^H(ω) = [H₁*(ω) H₂*(ω) ... H_M*(ω)].

Therefore, the enhancement of the voice signal obtained here (i.e., the first target voice signal) is the same as the voice enhancement obtained by convolving the voice signal with the gain vector in step S209 described above; only the mathematical expression differs. It should be noted that the voice signals received by microphones 1 to M are all digital signals, and for convenience of calculation the received voice signals may be converted from the time domain to the frequency domain. Fig. 4b is a directional gain diagram for a voice signal in the 0-degree direction. When the working microphone set is the first microphone set corresponding to the 0-degree direction, the gain vector corresponding to the working microphone set is as shown in fig. 4b: the first microphone set enhances voice signals in the 0-degree direction, so the gain in the 0-degree direction is much greater than the gain in the 180-degree direction. For the specific calculation of the gain vector, refer to step S208 in the embodiment corresponding to fig. 3, which is not described here again. Together, all the first microphone sets in the microphone array can cover voice signals arriving from every angle of the full 360-degree circle.
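The delay compensation and summation described for fig. 4a can be illustrated in the time domain. The sketch below is a simplified delay-and-sum example with whole-sample delays; the sample rate, the speed of sound, the spacing (chosen so that each microphone lags its neighbour by exactly one sample at θ = 0), and the function names are assumptions, and only arrival angles between 0 and 90 degrees (non-negative delays) are handled.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed
SAMPLE_RATE = 16000     # Hz, assumed


def delay_in_samples(mic_index, spacing_m, theta_rad):
    """Delay of microphone `mic_index` (0-based) relative to microphone 1,
    from the path difference mic_index * spacing * cos(theta)."""
    path_difference = mic_index * spacing_m * math.cos(theta_rad)
    return round(path_difference / SPEED_OF_SOUND * SAMPLE_RATE)


def delay_and_sum(channels, delays):
    """Advance each channel by its delay, then average the aligned samples."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [sum(ch[d + n] for ch, d in zip(channels, delays)) / len(channels)
            for n in range(length)]


spacing = SPEED_OF_SOUND / SAMPLE_RATE  # one sample of delay per microphone at theta = 0
delays = [delay_in_samples(m, spacing, 0.0) for m in range(3)]

source = [0, 1, 0, -1, 0, 1, 0, -1]
# Each microphone hears the same source, shifted by its own delay.
channels = [[0] * d + source + [0] * (delays[-1] - d) for d in delays]

aligned = delay_and_sum(channels, delays)       # compensated: source is recovered
unaligned = delay_and_sum(channels, [0, 0, 0])  # uncompensated: source is smeared
print(delays, aligned)
```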

Optionally, when the first trigger operation associates at least two first microphone sets, the generation process of the first target speech signal may include the following two steps:

Specifically, when the working microphone set consists of a plurality of first microphone sets, that is, when the first trigger operation is associated with a plurality of first microphone sets, a voice signal arriving from between the directions pointed to by two adjacent first microphone sets may be processed synchronously by both of them: each performs voice enhancement processing on the acquired voice signal, and the enhanced voice signals are weighted and summed to give the final enhancement result of the voice signal. For example, suppose the working microphone sets are the first microphone set corresponding to the 0-degree direction and the first microphone set corresponding to the 45-degree direction. For a voice signal generated by a sound source whose sound source localization direction is 30 degrees, the response Y₁ in the 0-degree direction can be obtained through the first microphone set corresponding to the 0-degree direction (i.e., the result of voice enhancement by that first microphone set), and the response Y₂ in the 45-degree direction can be obtained through the first microphone set corresponding to the 45-degree direction (i.e., the result of voice enhancement by that first microphone set). The final voice enhancement result (i.e., the first target voice signal) may then be represented as Y = w₁Y₁ + w₂Y₂, where w₁ and w₂ represent the weight coefficients for the two directions, which can be obtained by actual measurement or simulation experiments. For the calculation of Y₁ and Y₂, refer to steps S206 to S209 above, which are not described here again.
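The weighted sum Y = w₁Y₁ + w₂Y₂ is straightforward to express in code. In the sketch below, `combine` implements the sum itself, while `proximity_weights` is purely a hypothetical placeholder: the text says the weights come from measurement or simulation, so the linear-in-angle rule used here is an assumption for illustration only.

```python
def combine(y1, y2, w1, w2):
    """Weighted sum of two enhanced signals, sample by sample."""
    return [w1 * a + w2 * b for a, b in zip(y1, y2)]


def proximity_weights(source_deg, dir1_deg, dir2_deg):
    """Hypothetical weights: the set pointing closer to the source gets more weight.
    Real weights would come from measurement or simulation."""
    w2 = (source_deg - dir1_deg) / (dir2_deg - dir1_deg)
    return 1.0 - w2, w2


# Source at 30 degrees, working sets at 0 and 45 degrees (values from the example above)
w1, w2 = proximity_weights(30.0, 0.0, 45.0)
y1 = [0.2, 0.4, 0.6]  # made-up output of the 0-degree set
y2 = [0.3, 0.6, 0.9]  # made-up output of the 45-degree set
print(round(w1, 3), round(w2, 3), combine(y1, y2, w1, w2))
```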

In the embodiment of the present invention, the microphone array may include a plurality of first microphone sets respectively pointing to corresponding directions, and by responding to a first trigger operation for the microphone array, at least one first microphone set associated with the first trigger operation may be activated, and a target direction range is determined according to a direction in which the first microphone set points, and voice signals within the target direction range may be subjected to voice pickup through the first microphone sets in the first voice pickup mode. In other words, the corresponding microphone can be activated according to the trigger operation, and the activated microphone can perform voice pickup on the voice signal in the direction pointed by the microphone, so that the interference of the voice signal in other directions can be avoided, and the accuracy of voice recognition can be improved.

Referring to fig. 5, fig. 5 is a flowchart illustrating another voice data processing method according to an embodiment of the present invention. As shown in fig. 5, the method may include:

Step S301, responding to a first trigger operation for a microphone array; the microphone array comprises a plurality of first microphone sets respectively pointing in corresponding directions, each first microphone set being associated with a first voice pickup mode;

Step S302, activating at least one first microphone set associated with the first trigger operation, determining the activated first microphone set as a working microphone set, and determining a target direction range according to the direction pointed by the working microphone set;

Step S303, performing voice pickup on the voice signals in the target direction range through the first voice pickup mode and the working microphone set to generate a first target voice signal;

For a specific implementation of steps S301 to S303, reference may be made to the description of steps S101 to S103 in the embodiment corresponding to fig. 2, and details are not repeated here.

Step S304, acquiring a first volume parameter corresponding to the voice signal;

Specifically, in the process of performing voice pickup on the voice signals in the target direction range, the first volume parameter corresponding to the voice signal may be obtained at a fixed time interval (e.g., every 2 minutes), or it may be obtained in response to quality information of the received voice signal (e.g., the sound is too quiet or the voice is unclear). The first volume parameter may be expressed as decibel information corresponding to the received voice signal.
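The text does not say how the decibel information is computed. One common choice, used here as an assumption, is the root-mean-square level of a block of samples expressed in decibels relative to full scale, with samples normalised to the range [-1, 1]. The function name is illustrative.

```python
import math


def level_dbfs(samples):
    """RMS level in dB relative to full scale; silence maps to negative infinity."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")


loud = level_dbfs([1.0, -1.0, 1.0, -1.0])   # full-scale square wave, about 0 dBFS
quiet = level_dbfs([0.1, -0.1, 0.1, -0.1])  # ten times smaller, about -20 dBFS
print(round(loud, 2), round(quiet, 2))
```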

Step S305, if the first volume parameter is smaller than a volume threshold, activating a second microphone set;

Specifically, the microphone array may further include a second microphone set associated with a second voice pickup mode. The second voice pickup mode may be used for super-directional enhancement of the voice signal, and its sound pickup distance is greater than that of the first voice pickup mode. After switching from the working microphone set to the second microphone set, the second microphone set may perform voice pickup on the voice signals within the target direction range based on the second voice pickup mode. If the acquired first volume parameter is smaller than the volume threshold, the second microphone set may be activated, that is, switched from the non-working state to the working state. The number of microphones in the second microphone set is greater than the number of microphones in a first microphone set. The volume threshold is a threshold preset in the sound pickup apparatus for the voice signal pickup process: when the volume parameter of the voice signal is greater than the volume threshold, the acquired voice signal can be output clearly; when the volume parameter corresponding to the voice signal is smaller than the volume threshold, the acquired voice signal cannot be output clearly.

Step S306, rotating the second microphone set to the target direction range, and converting the first voice pickup mode in the microphone array into the second voice pickup mode;

Specifically, because the volume parameter corresponding to the voice data acquired by the working microphone set in the target direction range is smaller than the volume threshold, the activated second microphone set can be rotated into the target direction range, and the first voice pickup mode in the microphone array is converted into the second voice pickup mode, so that voice signals at a longer distance can be acquired. In other words, for voice signals at the same distance, the voice signal acquired in the second voice pickup mode is clearer and louder than the voice signal acquired in the first voice pickup mode.

Step S307, performing voice pickup on the voice signals in the target direction range through the second voice pickup mode and the second microphone set to generate a second target voice signal;

Specifically, when the second voice pickup mode in the microphone array is turned on, the second microphone set collects the voice signals in the target direction range, and the collected voice signals are enhanced in the second voice pickup mode to generate a second target voice signal. For the specific voice pickup manner, reference may be made to the description of steps S206 to S209 in the embodiment corresponding to fig. 3, which is not repeated here. It should be noted that the voice pickup in the embodiment corresponding to fig. 3 is executed based on the first voice pickup mode (i.e., the first microphone set), whereas the voice pickup in this embodiment is executed based on the second voice pickup mode (i.e., the second microphone set). The algorithm is the same, but because the second microphone set contains more microphones, the calculated result is better. In other words, after the second microphone set is rotated into the target direction range, its enhancement of the voice signals in the target direction range is better than that of the first microphone set used previously, and the enhancement is best when the second microphone set is rotated to the direction from which the voice signal arrives. For example, in a conference scenario, a speaker is located in the 90-degree direction of the microphone array, so the 90-degree direction may be determined as the target direction and the first microphone set in the 90-degree direction as the working microphone set. Suppose the first microphone set includes 5 microphones, and the second microphone set includes 8 microphones and is located in the 0-degree direction. If the volume information corresponding to the speaker's voice signal in the 90-degree direction, as acquired by the working microphone set, is less than the volume threshold, the second microphone set may be activated and rotated to the 90-degree direction, and the second microphone set may then perform voice pickup on the speaker's voice signal to generate a second target voice signal. Optionally, if the volume parameter corresponding to the voice signal in the 90-degree direction is found to be greater than the volume threshold while the microphone array is in the second voice pickup mode (for example, because the speaker in the 90-degree direction speaks louder, moves closer to the microphone array, or is replaced by another person closer to the microphone array), the second microphone set may automatically rotate back to its initial 0-degree direction, the second voice pickup mode in the microphone array may be converted back into the first voice pickup mode, and the first microphone set in the 90-degree direction may again perform voice pickup on the voice signal in the 90-degree direction to generate the first target voice signal.
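The switching behaviour of steps S305 to S307, including the optional return to the first voice pickup mode, can be sketched as a small state update. The `PickupState` class, the field names, and the decibel values in the usage lines are assumptions for illustration; the text does not specify what happens when the volume exactly equals the threshold, so this sketch leaves the state unchanged in that case.

```python
from dataclasses import dataclass


@dataclass
class PickupState:
    mode: str = "first"                # "first" or "second" voice pickup mode
    second_set_deg: float = 0.0        # current direction of the second microphone set
    second_set_home_deg: float = 0.0   # assumed resting direction of the second set


def update(state, volume_db, threshold_db, target_deg):
    """Switch modes based on the measured volume of the target-direction signal."""
    if state.mode == "first" and volume_db < threshold_db:
        # Too quiet: activate the second set and rotate it toward the target.
        state.mode = "second"
        state.second_set_deg = target_deg
    elif state.mode == "second" and volume_db > threshold_db:
        # Loud enough again: rotate back and resume the first mode.
        state.mode = "first"
        state.second_set_deg = state.second_set_home_deg
    return state


state = PickupState()
update(state, volume_db=-40.0, threshold_db=-30.0, target_deg=90.0)
print(state.mode, state.second_set_deg)  # second 90.0
update(state, volume_db=-20.0, threshold_db=-30.0, target_deg=90.0)
print(state.mode, state.second_set_deg)  # first 0.0
```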

Step S308, responding to a second triggering operation aiming at the microphone array;

Specifically, in the process of performing voice pickup on a voice signal based on the first voice pickup mode in the microphone array, the sound pickup apparatus may respond to a second trigger operation for the microphone array. The second trigger operation may be a contact trigger operation such as a manual key press, fingerprint recognition, or manual touch, or a non-contact trigger operation such as voice or remote control. For example, in a conference scenario where the second trigger operation is a manual key press, the second microphone set may be associated with an ultra-long-distance key. When a conference participant perceives that the voice signal in the target direction is too quiet to be recorded clearly, the participant may press the ultra-long-distance key, and the sound pickup apparatus responds to this key press (i.e., the second trigger operation). In other words, the sound pickup apparatus acquires the instruction information for the "press the ultra-long-distance key" operation.

Step S309, activating a second set of microphones associated with the second trigger operation;

specifically, after the sound pickup apparatus responds to a second trigger operation for the microphone array, a second microphone set corresponding to the second trigger operation may be activated, that is, the second microphone set is switched from a non-operating state to an operating state.

Step S310, rotating the second microphone set to be within the target direction range, and converting a first voice pickup mode in the microphone array into a second voice pickup mode;

specifically, after the second trigger operation for the microphone array is responded to, the activated second microphone set may be rotated into the target direction range according to the direction key previously turned on for the target direction range and the second trigger operation, and the first voice pickup mode in the microphone array may be converted into the second voice pickup mode, which may be used to acquire voice signals at a longer distance. In other words, for voice signals at the same distance, the voice signal acquired in the second voice pickup mode is clearer and louder than the voice signal acquired in the first voice pickup mode. For example, if the 90-degree direction of the microphone array is the target direction and the second microphone set is located in the 0-degree direction, then when the ultra-long-distance key is manually pressed, the second microphone set may be activated in response to the second trigger operation and rotated to the 90-degree direction according to the previously pressed direction key.
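The rotation in this example (from the 0-degree direction to the 90-degree direction selected by the direction key) can be sketched as follows. The function names, the representation of pressed direction keys as a set of angles, and the choice of the shorter of the two possible rotation directions are assumptions for illustration; the embodiment only states that the second microphone set is rotated into the target direction range.

```python
from typing import Optional, Set


def rotation_to_target(current_deg: float, target_deg: float) -> float:
    """Signed rotation in degrees, in (-180, 180], from current to target."""
    delta = (target_deg - current_deg) % 360.0
    return delta - 360.0 if delta > 180.0 else delta


def handle_second_trigger(direction_keys_on: Set[int],
                          second_set_deg: int) -> Optional[float]:
    """Rotation to apply to the second set when the ultra-long-distance key is pressed.

    Returns None unless exactly one direction key is on; the case of several
    target directions is handled by a different flow and is not modelled here.
    """
    if len(direction_keys_on) != 1:
        return None
    (target,) = direction_keys_on
    return rotation_to_target(second_set_deg, target)
```

For the example above, `handle_second_trigger({90}, 0)` yields a rotation of 90 degrees.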

Step S311, performing voice pickup on the voice signal in the target direction range through the second voice pickup mode and the second microphone set, and generating a second target voice signal;

for a specific implementation manner of generating the second target speech signal through the second speech pickup mode and the second microphone set, refer to step S307, which is not described herein again.

It should be noted that the manner of converting the first voice pickup pattern in the microphone array into the second voice pickup pattern described in the above steps S304 to S307 is parallel to the manner of converting the first voice pickup pattern in the microphone array into the second voice pickup pattern described in the above steps S308 to S311, the former is automatically converted according to the volume information corresponding to the acquired voice signal, and the latter is converted based on a human trigger operation.

Step S312, acquiring target voice characteristics corresponding to the first target voice signal and the second target voice signal respectively;

specifically, the pickup device may perform speech enhancement on the collected speech signal through the microphone array, determine the speech signal after the speech enhancement in the first speech pickup mode as a first target speech signal, determine the speech signal after the speech enhancement in the second speech pickup mode as a second target speech signal, input the first target speech signal and the second target speech signal into the speech recognition model, and may obtain target speech features corresponding to the first target speech signal and the second target speech signal, respectively. The target voice features may include voice spectrum features, semantic features, and the like in the voice signal.

Step S313, respectively converting the first target voice signal and the second target voice signal into text information according to the target voice characteristics, and outputting the text information.

Specifically, according to the target speech features, both the first target speech signal and the second target speech signal may be converted into text information by using the speech recognition model, and the converted text information may be output. The speech recognition model is trained with sample speech data in a corpus database and the text information corresponding to the sample speech data, i.e., the speech recognition model has a text conversion function. Optionally, voiceprint features of a speaker may be entered into the speech recognition model in advance, so that the model can identify the person to whom a voice signal belongs. For example, in an interrogation scenario, the voiceprint information of a suspect can be entered in advance, and when a voice signal is collected, whether the collected voice signal belongs to the suspect can be determined according to the voiceprint features recorded in advance.
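One common way to compare a collected voiceprint against voiceprints entered in advance is cosine similarity between feature vectors. The sketch below illustrates that general technique only; the embodiment does not state how voiceprints are represented or matched, so the vector representation, the function names, and the 0.8 decision threshold are all assumptions.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_speaker(features: Sequence[float],
                     enrolled: Dict[str, Sequence[float]],
                     threshold: float = 0.8) -> Optional[str]:
    """Return the enrolled name with the highest similarity at or above the threshold."""
    best_name, best_score = None, threshold
    for name, reference in enrolled.items():
        score = cosine_similarity(features, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
```

With a single enrolled voiceprint `{"suspect": [1.0, 0.0, 0.0]}`, a collected vector close to it is attributed to the suspect, while an orthogonal vector returns `None`.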

In the embodiment of the present invention, the microphone array may acquire a voice signal in a target direction range through a first microphone set (i.e., a working microphone set) and perform voice pickup on the voice signal in the target direction range. When the microphone array is in the first voice pickup mode, whether the second microphone set needs to be activated may be determined according to the volume parameter corresponding to the acquired voice signal; when the acquired volume parameter is smaller than the volume threshold, the second microphone set may be automatically rotated into the target direction range and the first voice pickup mode in the microphone array is converted into the second voice pickup mode. Alternatively, the first voice pickup mode may be converted into the second voice pickup mode by responding to a second trigger operation for the second microphone set, activating the second microphone set, and rotating it into the target direction range. Therefore, the second microphone set can achieve super-directional enhancement in a long-distance direction, i.e., unidirectional ultra-long-distance pickup is added on top of omnidirectional pickup, which can reduce the number of microphones used and save cost. Voice signals in the specific direction are picked up while voice signals in the other directions are suppressed, which achieves denoising and improves the accuracy of speech recognition.

Referring to fig. 6, fig. 6 is a flowchart illustrating another voice data processing method according to an embodiment of the present invention. As shown in fig. 6, the method may include:

step S401, responding to a first trigger operation aiming at a microphone array; the microphone array comprises a plurality of first microphone sets respectively pointing in corresponding directions, each first microphone set being associated with a first speech pickup pattern;

step S402, activating a first microphone set associated with the first trigger operation, determining the activated first microphone set as a working microphone set, and determining a target direction range according to the direction pointed by the working microphone set;

step S403, performing voice pickup on the voice signal in the target direction range through the first voice pickup mode and the working microphone set, and generating a first target voice signal;

for a specific implementation manner of the steps S401 to S403, reference may be made to the description of the steps S101 to S103 in the embodiment corresponding to fig. 2, and details are not repeated here.

Step S404, when the target direction range comprises a first target direction range and a second target direction range, acquiring a second volume parameter corresponding to the voice signal in the first target direction range and a third volume parameter corresponding to the voice signal in the second target direction range;

specifically, when the target direction range includes a first target direction range and a second target direction range, in the process of performing voice pickup on the voice signals in the target direction range, the second volume parameter corresponding to the voice signal in the first target direction range and the third volume parameter corresponding to the voice signal in the second target direction range may be acquired at a specific time interval (e.g., every 2 minutes), or may be acquired according to quality information of the received voice signal (e.g., the sound is too quiet or the voice is blurred). The second volume parameter and the third volume parameter may both be expressed as decibel information corresponding to the received voice signal.
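As a rough illustration of how a volume parameter might be expressed as decibel information, the following sketch computes the root-mean-square level of a block of samples relative to a reference amplitude. The function name `volume_db`, the sample format (floats, with full scale at 1.0), and the reference value are assumptions for illustration; the embodiment does not specify how the decibel value is computed.

```python
import math
from typing import Sequence


def volume_db(samples: Sequence[float], reference: float = 1.0) -> float:
    """Return the RMS level of `samples` in decibels relative to `reference`.

    Silence (or an empty block) is reported as negative infinity.
    """
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    rms = math.sqrt(mean_square)
    return 20.0 * math.log10(rms / reference)
```

A value obtained this way for each target direction range could then be compared against the volume threshold in the following step.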

Step S405, if the second volume parameter is smaller than the volume threshold and the third volume parameter is greater than or equal to the volume threshold, activating a second microphone set;

specifically, if the obtained second volume parameter is smaller than the volume threshold and the third volume parameter is greater than or equal to the volume threshold, the second microphone set may be activated, that is, the second microphone set may be switched from the off state to the operating state. For specific information about the second microphone set, reference may be made to step S305 in the embodiment corresponding to fig. 5, which is not described herein again.

Optionally, when the obtained third volume parameter is smaller than the volume threshold and the second volume parameter is greater than or equal to the volume threshold, the second microphone set may also be activated. The second volume parameter and the third volume parameter may represent volume parameters corresponding to the voice signals in different directions, and the second microphone set may be activated as long as the volume parameter corresponding to the voice signal in one direction is smaller than the volume threshold, which is not described in detail below.
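The activation condition of step S405 and its optional mirror case can be sketched as a single decision function. The function name and the returned labels are hypothetical. The cases in which both volume parameters are below the threshold, or both are at or above it, are not described in this passage, so the sketch simply returns `None` for them; that choice is an assumption.

```python
from typing import Optional


def range_for_second_set(second_volume_db: float, third_volume_db: float,
                         threshold_db: float) -> Optional[str]:
    """Decide which target direction range the second microphone set should serve.

    Returns "first_range" or "second_range" when exactly one range is below
    the volume threshold, otherwise None (second set not activated here).
    """
    first_low = second_volume_db < threshold_db
    second_low = third_volume_db < threshold_db
    if first_low and not second_low:
        return "first_range"
    if second_low and not first_low:
        return "second_range"
    return None
```

For instance, with a hypothetical threshold of 50 dB, levels of 40 dB and 60 dB select the first target direction range.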

Step S406, pausing the first microphone set corresponding to the first target direction range and the second target direction range, rotating the second microphone set to the first target direction range, and reactivating the first microphone set in the second target direction range in the rotated microphone array to serve as an updated microphone set;

specifically, the first microphone set corresponding to the first target direction range and the first microphone set corresponding to the second target direction range may be suspended. Since the volume parameter corresponding to the voice data acquired by the first microphone set in the first target direction range is smaller than the volume threshold, the activated second microphone set may be rotated to the first target direction range, and the first microphone set in the second target direction range may be reactivated in the rotated microphone array. Please refer to fig. 7a and fig. 7b together, which are schematic structural diagrams of a microphone array voice pickup mode according to an embodiment of the present invention. As shown in fig. 7a, if the user 100c and the user 100d are located in the 45-degree direction and the 315-degree direction of the microphone array 200, respectively, the 45-degree direction may be determined as the first target direction and the 315-degree direction may be determined as the second target direction. If the volume parameter corresponding to the voice signal of the user 100c in the 45-degree direction is less than the volume threshold and the volume parameter corresponding to the voice signal of the user 100d in the 315-degree direction is greater than or equal to the volume threshold, the second microphone set, which may be represented by {M1, M2, M6, M10, M14, M18, M19, M20}, is activated, i.e., the ultra-long-distance key 400 is turned on. The first microphone set {M1, M3, M7, M11, M15} in the 45-degree direction and the first microphone set {M1, M5, M9, M13, M17} in the 315-degree direction may be suspended, i.e., the two first microphone sets are turned off and the direction keys 300 in the 45-degree direction and the 315-degree direction are turned off. The second microphone set is then rotated to the 45-degree direction. As shown in fig. 7b, in the rotated microphone array 200, the first microphone set in the 315-degree direction is {M1, M4, M8, M12, M16}; this first microphone set is reactivated and taken as the updated microphone set.
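The bookkeeping behind this example can be sketched by tracking the direction of each microphone set while the array rotates. The sketch assumes that the whole array rotates rigidly with the second microphone set, and that the set {M1, M4, M8, M12, M16} initially points in the 270-degree direction; that initial direction is inferred from the example (it ends up at 315 degrees after a 45-degree rotation) and is not stated in the text. All names are hypothetical.

```python
from typing import Dict, Iterable, Optional


def rotate_array(directions: Dict[str, int], rotation_deg: int) -> Dict[str, int]:
    """Rotate every microphone set by the same angle (rigid rotation assumed)."""
    return {name: (angle + rotation_deg) % 360 for name, angle in directions.items()}


def set_pointing_at(directions: Dict[str, int], target_deg: int,
                    exclude: Iterable[str] = ()) -> Optional[str]:
    """Name of a microphone set that points at target_deg, or None."""
    excluded = set(exclude)
    for name, angle in directions.items():
        if name not in excluded and angle == target_deg:
            return name
    return None


# Directions before rotation (fig. 7a); the 270-degree entry is an inference.
before = {
    "second {M1,M2,M6,M10,M14,M18,M19,M20}": 0,
    "first {M1,M3,M7,M11,M15}": 45,
    "first {M1,M5,M9,M13,M17}": 315,
    "first {M1,M4,M8,M12,M16}": 270,
}
second = "second {M1,M2,M6,M10,M14,M18,M19,M20}"

# Rotate the second set from 0 degrees to the first target direction (45 degrees).
after = rotate_array(before, 45)
updated_set = set_pointing_at(after, 315, exclude=[second])
```

Under these assumptions, `updated_set` is the set {M1, M4, M8, M12, M16}, which matches the updated microphone set named in the example.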

Step S407, performing voice pickup on the voice signal within the second target direction range through the first voice pickup mode and the updated microphone set, and generating a third target voice signal;

step S408, performing voice pickup on the voice signal in the first target direction range through the second voice pickup mode and the second microphone set, and generating a second target voice signal;

for the specific voice pickup manner in step S407 and step S408, reference may be made to the description of step S206 to step S209 in the embodiment corresponding to fig. 3, and details are not repeated here. It should be noted that step S407 is performed based on the first voice pickup mode (i.e., the first microphone set) and step S408 is performed based on the second voice pickup mode (i.e., the second microphone set). The algorithm used is the same, but because the second microphone set includes more microphones, the result obtained in step S408 is better.
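A standard textbook result gives a rough sense of why more microphones help. For a simple delay-and-sum beamformer with spatially uncorrelated noise of equal power at every microphone, the signal-to-noise ratio improves by 10·log10(N) dB for N microphones. The sketch below applies that formula to the 5-microphone first set and the 8-microphone second set mentioned earlier. The embodiment uses its own gain-vector algorithm, so this is an assumption-laden estimate and not a statement about its actual performance.

```python
import math


def uncorrelated_noise_gain_db(num_microphones: int) -> float:
    """SNR improvement of delay-and-sum over one microphone, uncorrelated noise."""
    if num_microphones < 1:
        raise ValueError("at least one microphone is required")
    return 10.0 * math.log10(num_microphones)


first_set_gain = uncorrelated_noise_gain_db(5)   # 5-microphone first set
second_set_gain = uncorrelated_noise_gain_db(8)  # 8-microphone second set
advantage_db = second_set_gain - first_set_gain
```

Under these assumptions the 8-microphone set gains roughly 2 dB over the 5-microphone set.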

Step S409, acquiring target voice characteristics corresponding to the first target voice signal, the second target voice signal and the third target voice signal respectively;

step S410, respectively converting the first target voice signal, the second target voice signal, and the third target voice signal into text information according to the target voice feature, and outputting the text information.

For specific implementation of the steps S409 and S410, reference may be made to the description of the steps S312 to S313 in the embodiment corresponding to fig. 5, and details are not repeated here.

Optionally, the target direction range may include three or more directions. Referring to fig. 7c, in a court trial scenario, a foreigner, an announcer, a presiding judge, and a defense attorney are present in the court. Since these 4 persons are located in different directions and have fixed positions, the first microphone set or the second microphone set may be selected according to their positions when recording with the microphone array. If the presiding judge is farthest from the microphone array, the second microphone set is pointed in the direction of the presiding judge, which may be taken as the 0-degree direction; the defense attorney is located in the 315-degree direction, the foreigner in the 135-degree direction, and the announcer in the 225-degree direction. The ultra-long-distance key corresponding to the second microphone set is turned on, and the direction keys in the 135-degree, 225-degree, and 315-degree directions are all turned on, so that directional pickup is performed for the persons in the 4 directions. Speech recognition can be performed on the sound from the 4 directions simultaneously while sound from other directions is filtered out (i.e., suppressed); the sound from the 4 directions can be separated and recognized separately, which can improve the accuracy of speech recognition.
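The selection rule in this scenario (the second microphone set serves the farthest person and first microphone sets serve the rest) can be sketched as follows. The function name, the use of direction angles as dictionary keys, and the distances in the usage note are hypothetical; the text gives only the directions, not the distances.

```python
from typing import Dict


def assign_pickup_modes(distances_m: Dict[int, float]) -> Dict[int, str]:
    """Map each speaker direction (degrees) to the pickup mode used for it.

    The farthest direction gets the second mode (second microphone set);
    all other directions use the first mode (a first microphone set).
    """
    if not distances_m:
        return {}
    farthest = max(distances_m, key=distances_m.get)
    return {direction: ("second" if direction == farthest else "first")
            for direction in distances_m}
```

With hypothetical distances of 8.0 m at 0 degrees and 3.0 m, 3.5 m, and 4.0 m at 135, 225, and 315 degrees, the 0-degree direction is assigned the second mode and the other three directions the first mode.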

In the embodiment of the present invention, the microphone array may acquire voice signals in a plurality of target direction ranges through a plurality of first microphone sets (i.e., working microphone sets) and perform voice pickup on the voice signals within those ranges. When the microphone array is in the first voice pickup mode, whether the second microphone set needs to be activated can be determined according to the volume parameters corresponding to the acquired voice signals. When the volume parameter corresponding to the voice signal in one of the target direction ranges (i.e., the first target direction range) is smaller than the volume threshold, the second microphone set can be automatically rotated into the first target direction range to perform super-directional enhancement on the voice signal within that range, while voice pickup in the first voice pickup mode continues for the voice signals in the remaining target direction ranges. Therefore, voice data can be acquired from speakers in different directions, interference from voice signals in other directions is avoided, noise can be effectively reduced, and the accuracy of speech recognition can be improved. In addition, the second microphone set can achieve super-directional enhancement in a long-distance direction, i.e., unidirectional ultra-long-distance pickup is added on top of omnidirectional pickup, which can reduce the number of microphones used and save cost.

Referring to fig. 8, fig. 8 is a schematic structural diagram of a voice data processing apparatus according to an embodiment of the present invention. As shown in fig. 8, the voice data processing apparatus 1 may include a response module 10, an activation module 20, a generation module 30;

a response module 10 for responding to a first triggering operation for the microphone array; the microphone array comprises a plurality of first microphone sets respectively pointing in corresponding directions, each first microphone set being associated with a first speech pickup pattern;

an activation module 20, configured to activate at least one first microphone set associated with the first trigger operation, determine the activated first microphone set as a working microphone set, and determine a target direction range according to a direction pointed by the working microphone set;

and a generating module 30, configured to perform voice pickup on the voice signals in the target direction range through the first voice pickup mode and the working microphone set, and generate a first target voice signal.

The specific functional implementation manners of the response module 10, the activation module 20, and the generation module 30 may refer to steps S101 to S103 in the embodiment corresponding to fig. 2, which is not described herein again.

Referring to fig. 8, the voice data processing apparatus 1 may further include: a positioning module 40, a first conversion module 50, a second conversion module 60, a third conversion module 70, a voice recognition module 80;

the positioning module 40 is configured to acquire a voice signal, and determine a sound source positioning direction corresponding to the voice signal according to a time difference between the acquisition of the voice signal by at least two microphones in the microphone array;

a first conversion module 50, configured to convert a first voice pickup mode in the microphone array into the second voice pickup mode according to a first volume parameter of the voice signal, and generate a second target voice signal corresponding to the voice signal in the target direction range;

a second conversion module 60, configured to respond to a second trigger operation for the microphone array and convert the first voice pickup pattern in the microphone array into the second voice pickup pattern based on the second trigger operation, and generate a second target voice signal corresponding to the voice signal in the target direction range;

a third conversion module 70, configured to: when the target direction range includes a first target direction range and a second target direction range, acquire a second volume parameter corresponding to the voice signal in the first target direction range and a third volume parameter corresponding to the voice signal in the second target direction range; and, if the second volume parameter is smaller than the volume threshold and the third volume parameter is greater than or equal to the volume threshold, rotate the second microphone set into the first target direction range, reactivate the first microphone set in the second target direction range in the rotated microphone array as an updated microphone set, and generate a second target voice signal corresponding to the voice signal in the first target direction range and a third target voice signal corresponding to the voice signal in the second target direction range;

a voice recognition module 80, configured to convert the generated first target voice signal, the second target voice signal, and the third target voice signal into text data, and output the text data.
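As an illustration of how a time difference between two microphones can be turned into a direction, as the positioning module 40 does, the sketch below uses the usual far-field relation sin(θ) = c·τ/d for a microphone pair with spacing d, where θ is measured from the direction perpendicular to the line joining the two microphones. The function name, the far-field assumption, and the speed-of-sound constant are assumptions; the embodiment does not give the formula it uses.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate value in air at about 20 degrees Celsius


def arrival_angle_deg(time_difference_s: float, spacing_m: float,
                      speed_m_s: float = SPEED_OF_SOUND_M_S) -> float:
    """Far-field arrival angle for a two-microphone pair, in degrees.

    The angle is measured from the broadside direction; its sign follows the
    sign of the time difference.
    """
    if spacing_m <= 0.0:
        raise ValueError("microphone spacing must be positive")
    ratio = speed_m_s * time_difference_s / spacing_m
    # Clamp to the valid domain of asin to absorb measurement noise.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))
```

For a hypothetical spacing of 0.1 m, a time difference of about 0.146 ms corresponds to an arrival angle of 30 degrees.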

For the specific functional implementation of the positioning module 40, reference may be made to step S205 in the embodiment corresponding to fig. 3; for the first conversion module 50 and the second conversion module 60, to steps S304 to S311 in the embodiment corresponding to fig. 5; and for the third conversion module 70 and the voice recognition module 80, to steps S404 to S410 in the embodiment corresponding to fig. 6; details are not repeated here. The first conversion module 50, the second conversion module 60, and the third conversion module 70 are three parallel modules: when the first conversion module 50 is performing its operations, the second conversion module 60 and the third conversion module 70 are suspended; when the second conversion module 60 is performing its operations, the first conversion module 50 and the third conversion module 70 are suspended; and when the third conversion module 70 is performing its operations, the first conversion module 50 and the second conversion module 60 are suspended.
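The mutual exclusion among the three conversion modules can be sketched with a small arbiter that records which module is running and reports which ones are suspended. The class name `ConversionArbiter` and its interface are hypothetical; only the module numerals 50, 60, and 70 come from the description.

```python
from typing import Optional, Tuple


class ConversionArbiter:
    """Allow at most one of the conversion modules 50, 60, 70 to run at a time."""

    MODULES = (50, 60, 70)

    def __init__(self) -> None:
        self.active: Optional[int] = None

    def start(self, module: int) -> Tuple[int, ...]:
        """Mark `module` as running and return the modules that are suspended."""
        if module not in self.MODULES:
            raise ValueError(f"unknown conversion module: {module}")
        self.active = module
        return tuple(m for m in self.MODULES if m != module)
```

Starting module 60, for example, leaves modules 50 and 70 suspended.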

Referring also to fig. 8, the activation module 20 may include: a determination unit 201, an angle acquisition unit 202, a direction range determination unit 203;

a determining unit 201, configured to activate at least two first microphone sets when the first triggering operation associates the at least two first microphone sets, and determine the activated first microphone set as a working microphone set;

an angle obtaining unit 202, configured to obtain first angle information of directions to which each working microphone set points;

a direction range determining unit 203, configured to determine, as a target direction range, an angle range between the smallest angle information and the largest angle information in the first angle information if an included angle between each two adjacent working microphone sets is smaller than or equal to an angle threshold.

For specific functional implementation manners of the determining unit 201, the angle obtaining unit 202, and the direction range determining unit 203, reference may be made to steps S202 to S204 in the embodiment corresponding to fig. 3, which is not described herein again.
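The rule applied by the direction range determining unit 203 can be sketched as follows. The function name and the example angles are hypothetical. The sketch does not handle ranges that wrap around 0/360 degrees, and it returns `None` when some pair of adjacent sets exceeds the angle threshold, because that case is not described in this passage.

```python
from typing import Optional, Sequence, Tuple


def target_direction_range(angles_deg: Sequence[float],
                           angle_threshold_deg: float) -> Optional[Tuple[float, float]]:
    """Return (smallest angle, largest angle) if every pair of adjacent working
    microphone sets is separated by at most the angle threshold, else None."""
    if not angles_deg:
        return None
    ordered = sorted(angles_deg)
    for smaller, larger in zip(ordered, ordered[1:]):
        if larger - smaller > angle_threshold_deg:
            return None
    return (ordered[0], ordered[-1])
```

For working microphone sets pointing at 60, 90, and 120 degrees and a threshold of 30 degrees, the target direction range is 60 to 120 degrees.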

Referring also to fig. 8, the generating module 30 may include: a first acquisition unit 301, an angle information determination unit 302, a gain vector determination unit 303, a convolution unit 304, a gain signal generation unit 305, and a weighted summation unit 306;

a first obtaining unit 301, configured to obtain a transfer function vector and a filter matrix corresponding to the working microphone set;

an angle information determining unit 302, configured to obtain a voice signal, and determine second angle information between a direction pointed by the working microphone set and a sound source positioning direction corresponding to the voice signal;

a gain vector determining unit 303, configured to determine, according to the transfer function vector, the filter matrix, and the second angle information, a gain vector corresponding to the working microphone set in the first voice pickup mode;

a convolution unit 304, configured to convolve the speech signal based on the gain vector to generate a first target speech signal; if the second angle information belongs to the gain angle range, the first target voice signal is a voice signal after voice enhancement; if the second angle information does not belong to the gain angle range, the first target voice signal is a voice signal subjected to voice suppression;

a gain signal generating unit 305, configured to generate a voice gain signal corresponding to each first microphone set through the first voice pickup mode and the at least two first microphone sets; the voice gain signal is generated for each first microphone set based on voice signals in the target direction range;

a weighted summation unit 306, configured to generate the first target speech signal according to the weighting coefficient corresponding to each first microphone set and the speech gain signal corresponding to each first microphone set.

For specific functional implementation manners of the first obtaining unit 301, the angle information determining unit 302, the gain vector determining unit 303, and the convolution unit 304, reference may be made to step S206 to step S209 in the embodiment corresponding to fig. 3, and for specific functional implementation manners of the gain signal generating unit 305 and the weighted summing unit 306, reference may be made to a generation process of the first target speech signal when the first trigger operation is associated with at least two first microphone sets in the embodiment corresponding to fig. 3, which is not described herein again.
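The last two stages of the generating module 30 (convolving the voice signal with a gain vector, then combining the per-set voice gain signals by weighted summation) can be sketched in plain Python. The gain vector is taken as given here; deriving it from the transfer function vector, the filter matrix, and the second angle information is not shown. The function names and the numeric values in the usage note are assumptions for illustration.

```python
from typing import List, Sequence


def convolve(signal: Sequence[float], gain_vector: Sequence[float]) -> List[float]:
    """Full discrete convolution of a signal with a gain vector."""
    if not signal or not gain_vector:
        return []
    out = [0.0] * (len(signal) + len(gain_vector) - 1)
    for i, sample in enumerate(signal):
        for j, gain in enumerate(gain_vector):
            out[i + j] += sample * gain
    return out


def weighted_sum(gain_signals: Sequence[Sequence[float]],
                 weights: Sequence[float]) -> List[float]:
    """Combine one voice gain signal per microphone set into one output signal."""
    if len(gain_signals) != len(weights):
        raise ValueError("one weighting coefficient is needed per microphone set")
    if not gain_signals:
        return []
    length = len(gain_signals[0])
    if any(len(s) != length for s in gain_signals):
        raise ValueError("all voice gain signals must have the same length")
    return [sum(w * s[n] for w, s in zip(weights, gain_signals))
            for n in range(length)]
```

For example, convolving `[1.0, 2.0, 3.0]` with the gain vector `[1.0, 0.5]` gives `[1.0, 2.5, 4.0, 1.5]`, and combining the two signals `[1.0, 2.0]` and `[3.0, 4.0]` with weighting coefficients 0.25 and 0.75 gives `[2.5, 3.5]`.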

Referring to fig. 8, the first conversion module 50 may include: a second obtaining unit 501, a first condition judging unit 502, a first mode converting unit 503, a first voice picking-up unit 504;

a second obtaining unit 501, configured to obtain a first volume parameter corresponding to the voice signal;

a first condition determining unit 502, configured to activate a second microphone set if the first volume parameter is smaller than a volume threshold;

a first mode conversion unit 503, configured to rotate the second microphone set to be within the target direction range, and convert a first voice pickup mode in the microphone array into the second voice pickup mode;

a first voice pickup unit 504, configured to perform voice pickup on voice signals within the target direction range through the second voice pickup mode and the second microphone set, so as to generate a second target voice signal.

The specific functional implementation manners of the second obtaining unit 501, the first condition determining unit 502, the first mode converting unit 503 and the first voice picking unit 504 may refer to step S304 to step S307 in the embodiment corresponding to fig. 5, which is not described herein again.

Referring to fig. 8, the second conversion module 60 may include: a response operation unit 601, a microphone activation unit 602, a second mode conversion unit 603, a second voice pickup unit 604;

a response operation unit 601 for responding to a second trigger operation for the microphone array;

a microphone activation unit 602 configured to activate a second set of microphones associated with the second trigger operation;

a second mode conversion unit 603, configured to rotate the second microphone set to be within the target direction range, and convert the first voice pickup mode in the microphone array into the second voice pickup mode;

a second voice pickup unit 604, configured to perform voice pickup on the voice signals in the target direction range through the second voice pickup mode and the second microphone set, so as to generate a second target voice signal.

For specific functional implementation manners of the response operation unit 601, the microphone activation unit 602, the second mode conversion unit 603, and the second voice pickup unit 604, reference may be made to step S308 to step S311 in the embodiment corresponding to fig. 5, which is not described herein again.

Referring to fig. 8, the third converting module 70 may include: a third acquiring unit 701, a second condition judging unit 702, a rotating unit 703, a third voice pickup unit 704, and a fourth voice pickup unit 705;

a third obtaining unit 701, configured to obtain a second volume parameter corresponding to the voice signal in the first target direction range, and obtain a third volume parameter corresponding to the voice signal in the second target direction range;

a second condition determining unit 702, configured to activate a second microphone set if the second volume parameter is smaller than the volume threshold and the third volume parameter is greater than or equal to the volume threshold;

a rotating unit 703, configured to suspend the first microphone set corresponding to the first target direction range and the second target direction range, rotate the second microphone set to the first target direction range, and reactivate the first microphone set in the second target direction range in the rotated microphone array as an updated microphone set;

a third voice pickup unit 704, configured to perform voice pickup on voice signals within the second target direction range through the first voice pickup mode and the updated microphone set, and generate a third target voice signal;

a fourth voice pickup unit 705, configured to perform voice pickup on the voice signals in the first target direction range through the second voice pickup mode and the second microphone set, and generate a second target voice signal.

For specific functional implementation manners of the third obtaining unit 701, the second condition determining unit 702, the rotating unit 703, the third voice picking unit 704, and the fourth voice picking unit 705, reference may be made to step S404-step S408 in the embodiment corresponding to fig. 6, which is not described herein again.

Referring to fig. 8, the speech recognition module 80 may include: a voice feature acquisition unit 801, a text conversion unit 802;

a voice feature obtaining unit 801, configured to obtain target voice features corresponding to the first target voice signal, the second target voice signal, and the third target voice signal, respectively;

a text conversion unit 802, configured to convert the first target speech signal, the second target speech signal, and the third target speech signal into text information according to the target speech feature, and output the text information.

The specific functional implementation manners of the voice feature obtaining unit 801 and the text conversion unit 802 may refer to step S409 and step S410 in the embodiment corresponding to fig. 6, or step S312 to step S313 in the embodiment corresponding to fig. 5, which is not described herein again.

In the embodiment of the present invention, the microphone array may acquire voice signals in a plurality of target direction ranges through a plurality of first microphone sets (i.e., working microphone sets) and perform voice pickup on the voice signals within those ranges. When the microphone array is in the first voice pickup mode, whether the second microphone set needs to be activated can be determined according to the volume parameters corresponding to the acquired voice signals. When the volume parameter corresponding to the voice signal in one of the target direction ranges (i.e., the first target direction range) is smaller than the volume threshold, the second microphone set can be automatically rotated into the first target direction range to perform super-directional enhancement on the voice signal within that range, while voice pickup in the first voice pickup mode continues for the voice signals in the remaining target direction ranges. Therefore, voice data can be acquired from speakers in different specific directions, interference from voice signals in other directions is avoided, noise can be effectively reduced, and the accuracy of speech recognition can be improved. In addition, the second microphone set can achieve super-directional enhancement in a long-distance direction, i.e., unidirectional ultra-long-distance pickup is added on top of omnidirectional pickup, which can reduce the number of microphones used and save cost.

Referring to fig. 9, fig. 9 is a schematic structural diagram of another voice data processing apparatus according to an embodiment of the present invention. As shown in fig. 9, the voice data processing apparatus 1000 may include a processor 1001, a network interface 1004, and a memory 1005, and may further include a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a standard wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., at least one disk memory). The memory 1005 may optionally be at least one storage device located remotely from the processor 1001. As shown in fig. 9, the memory 1005, which is a kind of computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.

In the voice data processing apparatus 1000 shown in fig. 9, the network interface 1004 may provide a network communication function; the user interface 1003 is mainly an interface for receiving user input; and the processor 1001 may be configured to call the device control application program stored in the memory 1005, so as to implement the voice data processing method described in the embodiment corresponding to any one of fig. 2 to fig. 3 and fig. 5 to fig. 6, which is not described herein again. In addition, the beneficial effects of using the same method are not described again.

It should be understood that the voice data processing apparatus 1000 described in the embodiment of the present invention may perform the voice data processing method described in the embodiment corresponding to any one of the foregoing fig. 2 to fig. 3 and fig. 5 to fig. 6, and may also perform the functions of the speech data processing apparatus 1 described in the embodiment corresponding to the foregoing fig. 8, which is not repeated herein. In addition, the beneficial effects of using the same method are not described again.

It should further be noted that an embodiment of the present invention also provides a computer storage medium. The computer storage medium stores the computer program executed by the aforementioned speech data processing apparatus 1, and the computer program includes program instructions. When a processor executes the program instructions, it can perform the voice data processing method described in the embodiment corresponding to any one of fig. 2 to fig. 3 and fig. 5 to fig. 6, which will not be repeated herein. In addition, the beneficial effects of using the same method are not described again. For technical details not disclosed in the embodiment of the computer storage medium of the present invention, reference is made to the description of the method embodiments of the present invention.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above disclosure merely illustrates the preferred embodiments of the present invention and should not be understood as limiting the scope of the claims of the present invention.

Claims

1. A method for processing voice data, comprising:

performing voice pickup on voice signals in the target direction range through the first voice pickup mode and the working microphone set to generate a first target voice signal;

wherein the microphone array further comprises a second set of microphones; the second set of microphones is associated with a second voice pickup mode for hyper-directional enhancement of the voice signal, the second voice pickup mode having a sound pickup distance greater than the sound pickup distance of the first voice pickup mode; when switching from an active set of microphones to the second set of microphones, the second set of microphones is used for voice pickup of voice signals within the target direction range based on the second voice pickup mode;

the method further comprises the following steps:

acquiring a first volume parameter corresponding to the voice signal;

2. The method of claim 1, wherein the activating at least one first set of microphones associated with the first trigger action, determining the activated first set of microphones as a working set of microphones, and determining a target range of directions from which the working set of microphones are directed comprises:

3. The method of claim 2, wherein the voice picking up the voice signals in the target direction range through the first voice picking up mode and the working microphone set to generate a first target voice signal comprises:

4. The method of claim 1, wherein the voice picking up the voice signals in the target direction range through the first voice picking up mode and the working microphone set to generate a first target voice signal comprises:

5. The method of claim 4, further comprising:

6. The method of claim 1, further comprising:

responding to a second trigger operation for the microphone array;

7. The method of claim 1, wherein the target direction range comprises a first target direction range and a second target direction range;

the method further comprises the following steps:

if the second volume parameter is less than a volume threshold and the third volume parameter is greater than or equal to the volume threshold, activating a second set of microphones;

pausing the first microphone set corresponding to the first target direction range and the second target direction range respectively, rotating the second microphone set to be within the first target direction range, and reactivating the first microphone set corresponding to the second target direction range in the rotated microphone array to serve as an updated microphone set;

8. The method of claim 7, further comprising:

9. A speech data processing apparatus, comprising:

the generating module is used for carrying out voice pickup on the voice signals in the target direction range through the first voice pickup mode and the working microphone set to generate a first target voice signal;

the apparatus also includes a first conversion module;

the first conversion module includes:

10. A speech data processing apparatus, characterized by further comprising: a processor and a memory;

the processor is coupled to a memory, wherein the memory is configured to store program code and the processor is configured to invoke the program code to perform the method of any of claims 1-8.

11. A computer storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions which, when executed by a processor, perform the method according to any one of claims 1-8.