CN112750455A - Audio processing method and device
- Publication number: CN112750455A
- Application number: CN202011593390.XA
- Authority: CN (China)
- Prior art keywords: audio data, sound source, target, audio, information
- Prior art date: 2020-12-29
- Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L21/0272—Voice signal separating (speech enhancement, e.g. noise reduction or echo cancellation)
- G01S5/18—Position-fixing by co-ordinating two or more direction or position line determinations using ultrasonic, sonic, or infrasonic waves
- G01S5/22—Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; beamforming
Abstract
The invention discloses an audio processing method and apparatus. In the method, target azimuth information corresponding to a speaker sound source in candidate audio data is determined; whether the target azimuth information is within a set azimuth range is judged; if it is outside the set azimuth range, the candidate audio data are filtered out; and if it is within the set azimuth range, target audio data to be input to a speech recognition module are determined from the candidate audio data. Noise that is far away but in the same direction as the speaker can thus be filtered out, which makes the method well suited to near-field voice scenarios.
Description
Technical Field
The invention belongs to the technical field of voice equipment, and particularly relates to an audio processing method and device.
Background
With the development of speech technology, many speech recognition techniques have been proposed, improving the reliability of recognition results. The quality of the captured audio, however, also affects that reliability: if the noise component in the audio is large, the recognition result can be seriously degraded.
To improve the quality of collected audio data, related-art solutions divide the space into a plurality of beams using beamforming techniques, preserving signals in a specific direction while suppressing signals from other directions.
However, such techniques can only distinguish noise arriving from different directions; they cannot suppress noise that shares the target's direction. Because of side lobes, noise from other directions may also leak into the target direction, so the suppression of non-target directions is limited. When the signal and the noise are in the same direction, the noise cannot be suppressed at all, and distance cannot be distinguished. These techniques are therefore poorly suited to near-field voice devices (e.g., subway ticket vending machines and government-service kiosks).
In view of the above problems, the industry has yet to provide a better solution.
Disclosure of Invention
An embodiment of the present invention provides an audio processing method and apparatus, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides an audio processing method, including: determining target azimuth information corresponding to a speaker sound source in the candidate audio data; judging whether the target azimuth information is in a set azimuth range; if the target azimuth information is out of the set azimuth range, filtering the candidate audio data; and if the target azimuth information is in the set azimuth range, determining target audio data for inputting to a voice recognition module according to the candidate audio data.
In a second aspect, an embodiment of the present invention provides an audio processing apparatus, including: an azimuth determining unit configured to determine target azimuth information corresponding to a speaker sound source in candidate audio data; an azimuth judging unit configured to judge whether the target azimuth information is within a set azimuth range; an audio filtering unit configured to filter out the candidate audio data if the target azimuth information is outside the set azimuth range; and an audio output unit configured to determine, from the candidate audio data, target audio data for input to a speech recognition module if the target azimuth information is within the set azimuth range.
In a third aspect, an embodiment of the present invention provides an electronic device, including: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the above method.
In a fourth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the above method.
The embodiment of the invention has the beneficial effects that:
the processor extracts target azimuth information (i.e., both direction and distance information) corresponding to the speaker sound source from the candidate audio data and compares it with a set azimuth range, on which basis the candidate audio data are either filtered out or used to determine target audio data for speech recognition. Speaker audio whose direction satisfies the constraint but whose position does not (for example, because the source is too far away) can thus be filtered out as noise, which suits near-field voice devices (e.g., subway ticket vending machines and government-service kiosks).
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments are briefly introduced below. The drawings described here represent only some embodiments of the present invention; other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
FIG. 1 shows a flow diagram of an example of an audio processing method according to an embodiment of the invention;
FIG. 2 illustrates a flow diagram of one example of determining candidate audio data according to an embodiment of the invention;
FIG. 3 illustrates a flow chart of one example of determining target bearing information for a speaker's sound source in candidate audio data according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating an example of determining target audio data corresponding to candidate audio data having target azimuth information within a set azimuth range according to an embodiment of the present invention;
FIG. 5 shows a flow diagram of an example of an audio processing method according to an embodiment of the invention;
fig. 6 is a block diagram showing an example of an audio processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
As used herein, a "module," "system," and the like are intended to refer to a computer-related entity: hardware, a combination of hardware and software, or software in execution. For example, an element may be, but is not limited to, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. An application or script running on a server, or the server itself, may likewise be an element. One or more elements may reside within a process and/or thread of execution; an element may be localized on one computer and/or distributed between two or more computers, and may be operated through various computer-readable media. Elements may also communicate by way of local and/or remote processes based on a signal having one or more data packets, e.g., data from one element interacting with another element in a local system or distributed system, and/or across a network such as the internet with other systems.
Finally, it should be further noted that the terms "comprises" and "comprising," as used herein, denote a non-exclusive inclusion: a process, method, article, or device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to it. Without further limitation, an element introduced by the phrase "comprising … …" does not exclude the presence of other identical elements in the process, method, article, or device that comprises that element.
In this document, the term "azimuth information" covers both direction information and location description information; it can therefore locate an object precisely rather than merely describe its direction.
Fig. 1 shows a flowchart of an example of an audio processing method according to an embodiment of the present invention. The execution body of the method may be any of various types of processors or controllers.
As shown in FIG. 1, in step 110, the target azimuth information corresponding to the speaker sound source in the candidate audio data is determined. Here, the candidate audio data are audio data containing a speaker audio component; for example, the initial audio data received by the processor may be assumed by default to contain a speaker component, or the processor may process the initial audio data to obtain the corresponding candidate audio data.
It should be understood that the processor may employ various known or potential techniques to determine the azimuth information (e.g., direction information and location description information) corresponding to the speaker sound source; no limitation is imposed here.
In step 120, it is judged whether the target azimuth information is within the set azimuth range. Here, the set azimuth range may be configured in advance by the operator according to business requirements.
If the judgment result in step 120 indicates that the target azimuth information is outside the set azimuth range, it jumps to step 130. If the judgment result in step 120 indicates that the target azimuth information is within the set azimuth range, it jumps to step 140.
In step 130, candidate audio data is filtered out.
In step 140, target audio data for input to the speech recognition module is determined from the candidate audio data.
Through this embodiment, the processor filters out candidate audio data whose azimuth falls outside the set azimuth range and determines candidate audio data within the set azimuth range as the target audio data.
In some examples of the embodiment, the target azimuth information includes target angle information and target distance information. In step 120, the processor identifies whether the target angle information is within a set angle range and, at the same time, whether the target distance information is within a set distance range. The target azimuth information is judged to be within the set azimuth range only when both the target angle and the target distance satisfy their respective ranges.
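As a concrete illustration of this joint check, the following minimal sketch (Python; the function name, range values, and sample data are illustrative assumptions, not taken from the patent) keeps a candidate only when both conditions hold:

```python
import numpy as np

def within_set_range(angle_deg, distance_m,
                     angle_range=(-30.0, 30.0),   # illustrative set angle range
                     distance_range=(0.0, 1.5)):  # illustrative set distance range, metres
    """Step 120: the azimuth check passes only when BOTH the target
    angle and the target distance fall inside their set ranges."""
    return (angle_range[0] <= angle_deg <= angle_range[1]
            and distance_range[0] <= distance_m <= distance_range[1])

# Steps 130/140: filter out-of-range candidates, keep the rest as target audio.
candidates = [
    {"audio": np.zeros(16000), "angle": 10.0, "dist": 0.8},  # near-field speaker
    {"audio": np.zeros(16000), "angle": 12.0, "dist": 4.0},  # same direction, too far
]
target_audio = [c["audio"] for c in candidates
                if within_set_range(c["angle"], c["dist"])]
```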
Fig. 2 shows a flowchart of an example of determining candidate audio data according to an embodiment of the present invention.
As shown in fig. 2, in step 210, initial audio data is captured based on an audio capture module. Here, the audio acquisition module may be a single microphone or a microphone array consisting of a plurality of microphones, and accordingly the initial audio data may be single-channel audio or multi-channel audio.
In step 220, at least one sound source audio component corresponding to the initial audio data is determined. Here, each sound source audio component has its own corresponding sound source information.
In one example of the embodiment of the present invention, various known or potential sound source analysis techniques are used to analyze each item of sound source information in the initial audio data, and the audio component corresponding to each item is separated from the initial audio data, yielding the sound source audio components. In another example, the initial audio data consist of a plurality of audio components; sound source analysis is performed on each component separately to identify the sound source information it corresponds to, again yielding the corresponding sound source audio components.
In step 230, candidate audio data is determined based on the sound source audio component having the corresponding speaker sound source information. Illustratively, a sound source audio component corresponding to the speaker sound source information is screened from a plurality of sound source audio components as candidate audio data, and other sound source audio components are filtered.
According to the embodiment of the invention, the candidate audio data is determined by utilizing the sound source audio component corresponding to the speaker sound source information in the initial audio data, so that the noise information and the invalid information in the initial audio data can be effectively filtered, and the accuracy of the subsequent voice recognition result can be improved.
In some examples of embodiments of the invention, the audio acquisition module is a microphone array and accordingly the initial audio data is multi-channel audio data.
In some embodiments, each of the microphones of the microphone array conforms to a set azimuth arrangement, whereby the position of the different microphones of the microphone array and the corresponding signal sensing information can be utilized to determine the corresponding target azimuth information of the speaker's sound source. FIG. 3 illustrates a flow chart of one example of determining target bearing information for a speaker's sound source in candidate audio data according to an embodiment of the present invention.
As shown in fig. 3, in step 310, for the audio sub-data corresponding to each microphone of the microphone array in the candidate audio data, the signal arrival time of the audio sub-data at that microphone is obtained.
It should be understood that the candidate audio data is obtained by processing the multi-channel audio data collected by the microphone array, so that a plurality of audio sub-data respectively corresponding to the corresponding microphones or the corresponding channels exist in the candidate audio data. Furthermore, since different microphones are arranged in different orientations in the array, the transit times of sounds emitted from the same speaker's sound source to different microphones may also be different, i.e., different arrival times exist.
In step 320, the target azimuth information corresponding to the speaker sound source in the candidate audio data is determined based on the signal arrival time at each microphone and the set azimuth arrangement. Illustratively, the arrival-time differences of the sound from the speaker sound source at different microphones of the array can be calculated and, combined with the known azimuth differences between the microphones, used in a triangulation calculation to determine the target azimuth information, so that the speaker sound source can be located effectively and accurately.
It should be noted that distance resolution is positively correlated with the microphone spacing in the array: the larger the spacing, the longer the distance that can be resolved. The spacing can therefore be adjusted for different application scenarios or business requirements, e.g., different angle and distance constraints.
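A minimal sketch of such a time-difference-of-arrival (TDOA) localization step is given below; the microphone coordinates, function names, and the least-squares formulation are assumptions for illustration, not the patent's prescribed implementation:

```python
import numpy as np
from scipy.optimize import least_squares

C = 343.0  # speed of sound in air, m/s

# Four microphones of a linear array on the x-axis (coordinates illustrative).
MICS = np.array([[-0.18, 0.0], [-0.02, 0.0], [0.09, 0.0], [0.18, 0.0]])

def locate(tdoas, x0=(0.0, 1.0)):
    """Estimate a 2-D source position from the time differences of
    arrival (seconds) measured against microphone 0, by fitting the
    path-difference model in a least-squares sense."""
    def residual(p):
        d = np.linalg.norm(MICS - p, axis=1)           # source-to-mic distances
        return (d[1:] - d[0]) - C * np.asarray(tdoas)  # model minus measurement
    pos = least_squares(residual, x0).x
    angle_deg = np.degrees(np.arctan2(pos[0], pos[1]))  # 0 deg = broadside
    return pos, angle_deg, np.linalg.norm(pos)          # position, angle, distance
```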
In some embodiments, the one or more sound source audio components may be determined from the audio acquired on the different channels. Specifically, at least one item of sound source information corresponding to the initial audio data may be determined, and the sound source audio component corresponding to each item may be separated from the initial audio data. Sound source localization and audio separation are thus combined to obtain the audio component of each sound source.
In some application scenarios, the candidate audio data may contain audio components from several speaker sound sources that all satisfy the conditions: for example, when a mother holding a crying child buys a ticket by voice at a subway machine, the direction angles and distances of both the "mother" and the "child" may lie within the set ranges. To avoid filtering out valid audio components (e.g., the voice component of the "mother"), all such components (e.g., the voice components of both "mother" and "child") may be output for analysis and recognition by the subsequent speech recognition system.
Fig. 4 shows a flowchart of an example of determining target audio data from candidate audio data satisfying the range condition (i.e., whose target azimuth information is within the set azimuth range) according to an embodiment of the present invention.
As shown in fig. 4, in step 410, the number of sound source audio components in the candidate audio data is determined.
If the number of sound source audio components is single, it jumps to step 420. If the number of sound source audio components is plural, it jumps to step 430.
In step 420, target audio data is determined from the individual sound source audio components.
In step 430, the plurality of sound source audio components in the candidate audio data are fused according to their corresponding amplitude weights to determine the target audio data. Here, the amplitude weights may be set according to business requirements or adaptively adjusted according to the number of sound source audio components; both options fall within the scope of the embodiments of the present invention.
It should be noted that directly superimposing multiple sound source audio components may yield inconsistent volume in the superimposed audio. Fusing with preset amplitude weights avoids this and produces the target audio data.
In the application scenario above, when an audio component corresponding to only a single speaker sound source exists, that component can directly serve as the target audio data. When there are audio components corresponding to multiple speaker sound sources, the components need to be mixed; for example, the components corresponding to "mother" and "child" are mixed to obtain the target audio data.
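One plausible form of this amplitude-weighted fusion is sketched below; the inverse-RMS weighting is an assumption, since the patent leaves the weights to business requirements or adaptive adjustment:

```python
import numpy as np

def fuse_components(components, weights=None):
    """Fuse equal-length sound source components into one signal.
    Without explicit weights, each component is scaled by the inverse
    of its RMS amplitude so that no single source dominates the mix."""
    x = np.stack(components)                        # (n_sources, n_samples)
    if weights is None:
        inv_rms = 1.0 / (np.sqrt(np.mean(x ** 2, axis=1)) + 1e-12)
        weights = inv_rms / inv_rms.sum()           # normalise to sum to 1
    fused = np.tensordot(np.asarray(weights), x, axes=1)
    return fused / max(1.0, np.max(np.abs(fused)))  # guard against clipping
```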
According to this embodiment, the sound sources in the audio data can be separated to obtain multi-channel signals; the sources containing human voice are then selected, the angle and distance of each source are determined (distance measurement is supported), and the voice signals whose angle and distance lie within the constraint range are fused into the target audio data. Noise can thus be filtered out effectively, in particular same-direction noise from far away (for example, the speech of distant passers-by in a subway station).
Fig. 5 shows a flowchart of an example of an audio processing method according to an embodiment of the present invention.
It should be noted that, for a near-field voice interaction scene such as a subway vending machine, distance and angle need to be identified.
As shown in fig. 5, in step 510, initial audio data are acquired with the microphone array; the initial audio data are accordingly multi-channel audio data. A microphone array with large spacing may be used, for example a non-uniformly distributed 8-microphone linear array whose microphone x-coordinates, taking the center of the array aperture as the origin, are (-0.18, -0.15, -0.09, -0.02, 0.02, 0.05, 0.09, 0.18) m.
In step 520, a plurality of sound source audio components are separated from the initial audio data. For example, the audio sub-data of some or all microphones in the array may be selected, and the one or more items of sound source information in them identified by methods such as GSC (Generalized Sidelobe Canceller), BSS (Blind Source Separation), or a deep learning model.
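As a hedged example of the BSS option, the sketch below uses scikit-learn's FastICA as a stand-in separator; a GSC beamformer or a trained deep learning separator could occupy the same slot:

```python
import numpy as np
from sklearn.decomposition import FastICA

def separate_sources(multichannel, n_sources):
    """Blind source separation of a (n_samples, n_channels) recording
    into n_sources independent components (n_sources <= n_channels)."""
    ica = FastICA(n_components=n_sources, random_state=0)
    sources = ica.fit_transform(multichannel)  # (n_samples, n_sources)
    return [sources[:, i] for i in range(n_sources)]
```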
In step 530, the components requiring distance and angle estimation are selected from the separated sound source audio components. Illustratively, a sound source audio component corresponding to speaker sound source information may be selected for distance and angle estimation.
In step 540, the distance and angle corresponding to each selected sound source audio component are estimated. Illustratively, they can be calculated with methods such as GCC-PHAT (Generalized Cross-Correlation with PHAse Transform), SRP-PHAT (Steered Response Power with PHAse Transform), MUSIC (MUltiple SIgnal Classification), and ESPRIT (Estimation of Signal Parameters via Rotational Invariance Techniques).
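For reference, a compact GCC-PHAT delay estimator for one microphone pair might look like the following; this is the textbook formulation, not code from the patent:

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Delay of `sig` relative to `ref` in seconds, via generalized
    cross-correlation with the phase transform (GCC-PHAT)."""
    n = sig.size + ref.size
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.abs(R) + 1e-15                  # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs  # delay in seconds
```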
In addition, when calculating the angle and the distance, the microphone array may be split into two 4-microphone sub-arrays that produce estimates separately; the angle and distance of the sound source are then computed from the trigonometric relationship between the two bearings, which reduces the amount of computation. It should be understood that fusing the results of the two 4-microphone sub-arrays introduces some error, since the cross-correlation information between the sub-arrays is not fully used; the azimuth accuracy may therefore be lower than that of a joint computation over the whole 8-microphone array.
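The sub-array scheme can be sketched as a ray intersection; the sub-array centres below are derived from the example coordinates in step 510, and the geometry is one assumption about the "trigonometric relationship" mentioned above:

```python
import numpy as np

def triangulate(theta1_deg, theta2_deg, c1=(-0.11, 0.0), c2=(0.085, 0.0)):
    """Intersect the bearings reported by the left and right 4-mic
    sub-arrays (centres c1, c2) to recover the source position, then
    return its angle and distance relative to the array centre.
    Angles are measured from broadside (the y-axis), clockwise positive."""
    t1, t2 = np.radians([theta1_deg, theta2_deg])
    d1 = np.array([np.sin(t1), np.cos(t1)])  # unit bearing from sub-array 1
    d2 = np.array([np.sin(t2), np.cos(t2)])  # unit bearing from sub-array 2
    # Solve c1 + a*d1 == c2 + b*d2 (fails if the bearings are parallel).
    a, _ = np.linalg.solve(np.column_stack((d1, -d2)),
                           np.asarray(c2, float) - np.asarray(c1, float))
    src = np.asarray(c1, float) + a * d1
    return np.degrees(np.arctan2(src[0], src[1])), np.linalg.norm(src)
```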
In step 550, sound source audio components are selected for which the angles and distances of the corresponding sound sources are within the set range.
In step 560, it is determined whether the number of sound source audio components satisfying the range condition is 0.
If the judgment result in step 560 indicates that the number of sound source audio components is 0, it jumps to step 570. If the result of the judgment in step 560 indicates that the number of sound source audio components is not 0, it jumps to step 580.
In step 570, the output is determined to be 0, i.e., no valid speaker audio is identified.
In step 580, it is determined whether the number of sound source audio components satisfying the range condition is 1.
If the judgment result in step 580 indicates that the number of sound source audio components is 1, it jumps to step 590. If the judgment result in step 580 indicates that the number of sound source audio components is not 1, it jumps to step 5100.
In step 590, this sound source audio component is directly output.
In step 5100, the multiple sound source audio components (two or more at this point) are fused, for example according to their amplitude proportions, to obtain the voice signal: an amplitude weighting factor is computed for each component, each component is weighted by its factor, and the weighted components are summed to produce the final output.
In step 5110, the voice signal is transformed back to the time domain by an inverse Fourier transform and output.
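A minimal sketch of this last step, assuming the fused signal is held as a one-sided spectrum and written out as 16 kHz PCM (both assumptions):

```python
import numpy as np
from scipy.io import wavfile

def emit(spectrum, n_samples, fs=16000, path="output.wav"):
    """Inverse-transform the fused one-sided spectrum to the time domain
    and write it out; assumes the signal is normalised to [-1, 1]."""
    signal = np.fft.irfft(spectrum, n=n_samples)
    pcm = (np.clip(signal, -1.0, 1.0) * 32767).astype(np.int16)
    wavfile.write(path, fs, pcm)
```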
Through this embodiment, the multiple sound sources in the initial audio data can be separated and their angles and distances estimated, so that only the voice signals within the constraint range are retained, which suits near-field voice interaction scenarios.
Fig. 6 is a block diagram showing an example of an audio processing apparatus according to an embodiment of the present invention.
As shown in fig. 6, the audio processing apparatus 600 includes an azimuth determining unit 610, an azimuth judging unit 620, an audio filtering unit 630, and an audio output unit 640.
The azimuth determining unit 610 is configured to determine target azimuth information corresponding to a speaker sound source in the candidate audio data.
The azimuth judging unit 620 is configured to judge whether the target azimuth information is within a set azimuth range.
The audio filtering unit 630 is configured to filter out the candidate audio data if the target azimuth information is outside the set azimuth range.
The audio output unit 640 is configured to determine, from the candidate audio data, target audio data for input to a speech recognition module if the target azimuth information is within the set azimuth range.
The apparatus according to the above embodiment of the present invention may be used to execute the corresponding method embodiment of the present invention, and accordingly achieve the technical effect achieved by the method embodiment of the present invention, which is not described herein again.
In the embodiment of the present invention, the relevant functional module may be implemented by a hardware processor (hardware processor).
In another aspect, an embodiment of the present invention provides a storage medium having a computer program stored thereon, where the computer program is executed by a processor to perform the steps of the audio processing method as above.
The product can execute the method provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.
The electronic device of embodiments of the present invention exists in a variety of forms, including but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication capability and are primarily targeted at voice and data communication. Such terminals include smart phones (e.g., iPhones), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices, which belong to the category of personal computers, have computing and processing functions, and generally also support mobile internet access. Such terminals include PDA, MID, and UMPC devices, e.g., iPads.
(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio and video players (e.g., iPods), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.
(4) And other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a common hardware platform, or certainly by hardware alone. Based on this understanding, the above technical solutions, in essence or in the part contributing to the related art, may be embodied as a software product stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disc, including instructions for causing a computer device (a personal computer, a server, a network device, etc.) to execute the method of each embodiment or of certain parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. An audio processing method, comprising:
determining target azimuth information corresponding to a speaker sound source in the candidate audio data;
judging whether the target azimuth information is in a set azimuth range;
if the target azimuth information is out of the set azimuth range, filtering the candidate audio data;
and if the target azimuth information is in the set azimuth range, determining target audio data for inputting to a voice recognition module according to the candidate audio data.
2. The method of claim 1, wherein the target azimuth information includes target angle information and target distance information, and the judging whether the target azimuth information is in a set azimuth range comprises:
identifying whether the target angle information is within a set angle range, and identifying whether the target distance information is within a set distance range.
3. The method of claim 1, wherein, before the determining target azimuth information corresponding to a speaker sound source in the candidate audio data, the method further comprises:
acquiring initial audio data based on an audio acquisition module;
determining at least one sound source audio component corresponding to the initial audio data, wherein each sound source audio component has corresponding sound source information;
determining candidate audio data based on the sound source audio component having corresponding speaker sound source information.
4. A method as claimed in claim 3, wherein the audio acquisition module is a microphone array and accordingly the initial audio data is multi-channel audio data.
5. The method of claim 4, wherein each of the microphones of the array of microphones is in a set azimuth arrangement,
wherein, the determining the target azimuth information corresponding to the speaker sound source in the candidate audio data comprises:
for the audio sub-data in the candidate audio data corresponding to each microphone in the microphone array, acquiring a signal arrival time of the audio sub-data at the corresponding microphone;
and determining the corresponding target azimuth information of the speaker sound source in the candidate audio data based on the signal arrival time corresponding to each microphone and the set azimuth arrangement.
6. The method of claim 4, wherein the determining at least one sound source audio component to which the initial audio data corresponds comprises:
determining at least one sound source information corresponding to the initial audio data;
and respectively separating sound source audio components corresponding to the sound source information from the initial audio data.
7. The method of claim 6, wherein the determining target audio data for input to a speech recognition module from the candidate audio data comprises:
determining the number of sound source audio components in the candidate audio data;
and when the number of the sound source audio components is multiple, fusing the multiple sound source audio components in the candidate audio data according to corresponding amplitude weights so as to determine corresponding target audio data.
8. An audio processing apparatus comprising:
an azimuth determining unit configured to determine target azimuth information corresponding to a speaker sound source in candidate audio data;
an azimuth judging unit configured to judge whether the target azimuth information is in a set azimuth range;
an audio filtering unit configured to filter out the candidate audio data if the target azimuth information is out of the set azimuth range;
an audio output unit configured to determine, according to the candidate audio data, target audio data for inputting to a voice recognition module if the target azimuth information is in the set azimuth range.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-7.
10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011593390.XA | 2020-12-29 | 2020-12-29 | Audio processing method and device |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011593390.XA | 2020-12-29 | 2020-12-29 | Audio processing method and device |

Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN112750455A | 2021-05-04 |

Family: ID=75646711

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202011593390.XA | Audio processing method and device | 2020-12-29 | 2020-12-29 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN112750455A |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105280183A (en) * | 2015-09-10 | 2016-01-27 | 百度在线网络技术(北京)有限公司 | Voice interaction method and system |
CN107799118A (en) * | 2016-09-05 | 2018-03-13 | 深圳光启合众科技有限公司 | Voice directions recognition methods and apparatus and system, home controller |
CN110875056A (en) * | 2018-08-30 | 2020-03-10 | 阿里巴巴集团控股有限公司 | Voice transcription device, system, method and electronic device |
CN111009256A (en) * | 2019-12-17 | 2020-04-14 | 北京小米智能科技有限公司 | Audio signal processing method and device, terminal and storage medium |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113345462A (en) * | 2021-05-17 | 2021-09-03 | 浪潮金融信息技术有限公司 | Pickup denoising method, system and medium |
CN113345462B (en) * | 2021-05-17 | 2023-12-29 | 浪潮金融信息技术有限公司 | Pickup denoising method, system and medium |
CN113436613A (en) * | 2021-06-30 | 2021-09-24 | Oppo广东移动通信有限公司 | Voice recognition method and device, electronic equipment and storage medium |
Legal Events

| Code | Title | Description |
|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| CB02 | Change of applicant information | Applicant changed from AI SPEECH Co., Ltd. to Sipic Technology Co., Ltd.; address (unchanged): 215123 Building 14, Tengfei Innovation Park, 388 Xinping Street, Suzhou Industrial Park, Suzhou City, Jiangsu Province |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 2021-05-04 |