CN117953907A - Audio processing method and device and terminal equipment


Info

Publication number
CN117953907A
Authority
CN
China
Prior art keywords
audio
target
terminal device
target direction
determining
Prior art date
Legal status
Pending
Application number
CN202211275141.5A
Other languages
Chinese (zh)
Inventor
尚楚翔
向肖肖
赵成帅
黄传增
Current Assignee
Douyin Vision Co Ltd
Original Assignee
Douyin Vision Co Ltd
Priority date
Filing date
Publication date
Application filed by Douyin Vision Co Ltd filed Critical Douyin Vision Co Ltd
Priority to CN202211275141.5A
Publication of CN117953907A

Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract

The disclosure provides an audio processing method, an audio processing apparatus, and a terminal device. The method includes: determining a target direction in response to a manipulation of a first terminal device; acquiring a plurality of first audios acquired by a plurality of audio acquisition devices; determining a second audio of the target direction based on the plurality of first audios and the target direction; and transmitting the second audio to a second terminal device. In this way, the accuracy of audio collection is improved.

Description

Audio processing method and device and terminal equipment
Technical Field
The embodiments of the disclosure relate to the technical field of audio processing, and in particular to an audio processing method, an audio processing apparatus, and a terminal device.
Background
The terminal device may collect audio in the space and send the audio to other terminal devices. For example, the terminal device may collect real-time speech of the user and send the speech to the server so that the server forwards the speech to other terminal devices.
Currently, the terminal device may collect audio in the space through a microphone. For example, in a live-streaming scenario, a microphone in the terminal device may capture real-time speech uttered by the user. However, when the space contains a lot of interfering audio (e.g., in a live-streaming scene the space may include both the host's voice and interfering noise from other staff), the audio collected by the microphone is noisy, which results in poor audio-collection accuracy.
Disclosure of Invention
The disclosure provides an audio processing method, an audio processing apparatus, and a terminal device, which are used to solve the technical problem of poor audio-collection accuracy in the prior art.
In a first aspect, the present disclosure provides an audio processing method, the method comprising:
determining a target direction in response to a manipulation of a first terminal device;
acquiring a plurality of first audios acquired by a plurality of audio acquisition devices;
determining a second audio of the target direction based on the plurality of first audios and the target direction; and
transmitting the second audio to a second terminal device.
In a second aspect, the present disclosure provides an audio processing apparatus, including a response module, an acquisition module, a determination module, and a transmission module, wherein:
the response module is used for responding to the operation of the first terminal equipment and determining a target direction;
the acquisition module is used for acquiring a plurality of first audios acquired by a plurality of audio acquisition devices;
the determining module is used for determining second audio of the target direction based on the plurality of first audio and the target direction;
the sending module is used for sending the second audio to a second terminal device.
In a third aspect, an embodiment of the present disclosure provides a terminal device, including: a processor and a memory;
The memory stores computer-executable instructions;
The processor executes the computer-executable instructions stored in the memory, causing the processor to perform the audio processing method described in the first aspect and the various possible designs of the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable storage medium having stored therein computer-executable instructions which, when executed by a processor, implement the audio processing method as described in the first aspect and the various possible aspects of the first aspect above.
In a fifth aspect, embodiments of the present disclosure provide a computer program product comprising a computer program which, when executed by a processor, implements the audio processing method as described above in the first aspect and the various possible aspects of the first aspect.
The disclosure provides an audio processing method, an audio processing apparatus, and a terminal device. In the method, a target direction is determined in response to a manipulation of a first terminal device; a plurality of first audios acquired by a plurality of audio acquisition devices are acquired; a second audio of the target direction is determined based on the plurality of first audios and the target direction; and the second audio is sent to a second terminal device. Because the first terminal device can flexibly determine the target direction based on the user's manipulation of the first terminal device, the flexibility of audio collection is improved. Moreover, because the first terminal device can suppress, based on the plurality of first audios and the target direction, the audio signals from other directions in the plurality of first audios, the audio signals from non-target directions in the second audio are weaker, which improves the accuracy of audio collection.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the solutions in the prior art, a brief description will be given below of the drawings that are needed in the embodiments or the description of the prior art, it being obvious that the drawings in the following description are some embodiments of the present disclosure, and that other drawings may be obtained from these drawings without inventive effort to a person of ordinary skill in the art.
Fig. 1 is a schematic diagram of an application scenario provided in an embodiment of the present disclosure;
fig. 2 is a schematic flow chart of an audio processing method according to an embodiment of the disclosure;
FIG. 3A is a schematic diagram of a process for determining a target direction according to an embodiment of the present disclosure;
FIG. 3B is a schematic diagram of another process for determining a target direction according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of acquiring a plurality of first audio signals according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a method for determining predicted audio according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a process for determining a plurality of first phase differences according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a process for determining target weights according to an embodiment of the present disclosure;
fig. 8 is a process schematic diagram of an audio processing method according to an embodiment of the disclosure;
fig. 9 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the disclosure; and
Fig. 10 is a schematic structural diagram of a terminal device according to an embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
In order to facilitate understanding, concepts related to the embodiments of the present disclosure are described below.
Terminal device: a device with wireless transceiving capability. Terminal devices may be deployed on land, including indoors, outdoors, hand-held, wearable, or vehicle-mounted; they may also be deployed on the water surface (e.g., on a ship). A terminal device may be a mobile phone, a tablet computer (Pad), a computer with wireless transceiving capability, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, a wireless terminal in industrial control, a vehicle-mounted terminal device, a wireless terminal in self-driving, a wireless terminal in remote medicine, a wireless terminal in a smart grid, a wireless terminal in transportation safety, a wireless terminal in a smart city, a wireless terminal in a smart home, a wearable terminal device, or the like. The terminal device according to the embodiments of the present disclosure may also be referred to as a terminal, user equipment (UE), an access terminal device, a vehicle terminal, an industrial control terminal, a UE unit, a UE station, a mobile station, a remote terminal device, a mobile device, a UE terminal device, a wireless communication device, a UE agent, or a UE apparatus, etc. The terminal device may be fixed or mobile.
In the related art, a terminal device may collect audio in a space and transmit the audio to other terminal devices. For example, the terminal device may collect the real-time speech of a user and send the speech to a server, so that the server forwards the speech to other terminal devices. Currently, the terminal device may collect audio in the space through a microphone; for example, in a live-streaming scenario, a microphone in the terminal device may capture real-time speech uttered by the user. However, when there is a lot of interfering audio in the space, the audio collected by the microphone is noisy. For example, in a live-streaming scene, the space may contain both the host's voice and the voices of other staff (interfering noise), and the microphone in the terminal device will pick up both, which results in poor audio-collection accuracy.
To solve the technical problems in the related art, an embodiment of the present disclosure provides an audio processing method: a target direction is determined in response to a touch operation on a display page of a first terminal device; a plurality of first audios acquired by a plurality of audio acquisition devices are acquired; a predicted audio of the target direction is determined based on the plurality of first audios and the target direction, where the predicted audio is associated with target phase information; and a second audio of the target direction is determined based on the plurality of first audios and the predicted audio. In this way, the first terminal device can determine the target direction in response to a touch operation at any position of the display page, which improves the flexibility of audio collection. Moreover, because the target phase information indicates the proportion of the sound source in the target direction in each first audio, the first terminal device can accurately determine the second audio of the target direction based on the plurality of first audios and the predicted audio associated with the target phase information; the audio signals from non-target directions in the second audio are weaker, which improves the accuracy of audio collection.
Next, an application scenario of the embodiment of the present disclosure will be described with reference to fig. 1.
Fig. 1 is a schematic diagram of an application scenario provided in an embodiment of the present disclosure. Referring to fig. 1, the scenario includes terminal device A, terminal device B, and a user. Terminal device A and terminal device B may be connected via a communication connection (e.g., a video connection, a voice connection, etc.), and terminal device A may record a video of the user. When the user utters voice A, the space also contains noise A and noise B, and the 2 microphones in terminal device A collect voice A, noise A, and noise B.
Referring to fig. 1, in response to a touch operation on the position of the user in the screen of terminal device A, terminal device A may determine the target direction to be the direction of the user. Terminal device A may then determine a predicted audio for the user's direction from the audio collected by the 1st microphone, the audio collected by the 2nd microphone, and the user's direction; obtain voice A of the user's direction from the audio collected by the 1st microphone, the audio collected by the 2nd microphone, and the predicted audio; and send voice A to terminal device B. In this way, based on the audio collected by the 2 microphones and the predicted audio, the terminal device can suppress the audio signals from non-target directions and thereby obtain voice A of the target direction; the audio signals from non-target directions in voice A are weaker, so the accuracy of audio collection can be improved.
It should be noted that fig. 1 is only an exemplary illustration of the application scenario of the embodiments of the present disclosure, and is not limited to the application scenario of the embodiments of the present disclosure.
The following describes the technical solutions of the present disclosure and how the technical solutions of the present disclosure solve the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present disclosure will be described below with reference to the accompanying drawings.
Fig. 2 is a flow chart of an audio processing method according to an embodiment of the disclosure. Referring to fig. 2, the method may include:
s201, in response to the manipulation of the first terminal device, determining a target direction.
The execution body of the embodiment of the disclosure may be the first terminal device, or may be an audio processing apparatus provided in the first terminal device. The audio processing device may be implemented by software, or the audio processing device may be implemented by a combination of software and hardware, which is not limited in the embodiments of the present disclosure.
Alternatively, the target direction may be the direction in which audio is to be collected. For example, the target direction may be directly in front of, to the left of, or to the right of the terminal device; the target direction may be any direction in space, which is not limited by the embodiments of the present disclosure.
Alternatively, the first terminal device may be a device for recording audio. For example, the first terminal device may be a device for recording voice of a user, and the first terminal device may collect audio in a space through the audio collection device.
Alternatively, the manipulation of the first terminal device may be a touch operation on a display page of the first terminal device. For example, when the user records a video with the terminal device and performs a touch operation on area A in the display page of the first terminal device, the first terminal device determines the direction of the spatial position corresponding to area A as the target direction.
Alternatively, the manipulation of the first terminal device may be a voice operation on the first terminal device. For example, when the user records a video with the terminal device and says "capture the voice in front" to the first terminal device, the first terminal device may determine that the target direction is directly in front of the first terminal device.
It should be noted that, in a conference scenario, if the user says "user A is speaking" to the first terminal device, the first terminal device may determine the position of user A in the space and determine that position as the target direction; the first terminal device may also determine the target direction according to other voices, which is not limited by the embodiments of the present disclosure.
Next, a process of determining the target direction will be described with reference to fig. 3A to 3B.
Fig. 3A is a schematic diagram of a process for determining a target direction according to an embodiment of the disclosure. Referring to fig. 3A, the figure includes a terminal device and a user. When the terminal device records a video that includes the user, the user appears in the display page of the terminal device. When the user touches the head region of the user shown in the display page, the terminal device may determine the target direction to be the direction of the user's head, i.e., the direction pointing from the terminal device to the user's head in the space. In this way, the user can collect audio from any direction, which improves the accuracy and flexibility of audio collection.
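As one way to picture how a touch position can become a direction, the sketch below linearly maps the horizontal touch coordinate on a full-screen camera preview to an azimuth across the camera's horizontal field of view. This is a minimal illustration, not the disclosure's method; the helper name, the field-of-view value, and the purely horizontal geometry are all assumptions.

```python
def touch_to_azimuth(touch_x: float, screen_width: float,
                     horizontal_fov_deg: float = 70.0) -> float:
    """Hypothetical helper: map a touch x-coordinate on a camera preview to
    an azimuth in degrees, with 0 pointing straight ahead of the device."""
    # Normalize the touch position to [-0.5, 0.5] around the screen center.
    normalized = touch_x / screen_width - 0.5
    # Interpolate linearly across the assumed horizontal field of view.
    return normalized * horizontal_fov_deg

# A touch three quarters of the way across a 1080-px-wide preview
# maps to an azimuth of about 17.5 degrees to the right.
print(touch_to_azimuth(810, 1080))
```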
Fig. 3B is a schematic diagram of another process for determining a target direction according to an embodiment of the disclosure. Referring to fig. 3B, the figure includes a terminal device and a user. The user says "front" to the terminal device; after receiving the voice, the terminal device may determine the direction directly in front of itself as the target direction.
It should be noted that, in the embodiment shown in fig. 3B, because the terminal device can identify the content in the recorded video, the voice uttered by the user may refer to an object in the video. For example, if the video recorded by the terminal device includes user A and the user says "user A is speaking," the terminal device may determine the target direction based on the spatial position of user A, which is not limited by the embodiments of the present disclosure.
S202, acquiring a plurality of first audios acquired by a plurality of audio acquisition devices.
Optionally, the audio acquisition devices are configured to collect audio in the space. For example, an audio acquisition device may be a microphone in the first terminal device: a mobile phone may include 2 microphones and collect audio in the space through the 2 microphones.
Optionally, a first audio is the audio collected by one audio acquisition device. For example, in a live-streaming scene, the live-streaming device may include 2 microphones, each of which collects one first audio. Alternatively, the first terminal device may acquire the plurality of first audios through other terminal devices: the first terminal device may be communicatively connected to other terminal devices, and when the microphones in those devices collect the plurality of first audios, the other terminal devices send them to the first terminal device.
It should be noted that, because each audio acquisition device occupies a different position in the first terminal device (for example, 2 microphones at the two ends of the bottom of a mobile phone), the first audios collected by different audio acquisition devices differ even though the audio sources are the same. For example, when the space includes one sound source, if the first terminal device includes 1 audio acquisition device, that device collects 1 first audio for the sound source; if the first terminal device includes 2 audio acquisition devices, the 2 devices collect 2 first audios for the sound source.
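To make the note above concrete, the following sketch simulates one source arriving at two microphones with a small time offset caused by the path difference; the sampling rate, microphone spacing, source angle, and test tone are all assumed for illustration.

```python
import numpy as np

fs = 16000                # assumed sampling rate, Hz
c = 343.0                 # speed of sound, m/s
d = 0.1                   # assumed microphone spacing, m
theta = np.deg2rad(60)    # assumed source direction relative to the mic axis

t = np.arange(fs) / fs
source = np.sin(2 * np.pi * 440 * t)   # a 440 Hz test tone as the sound source

# The path difference d*cos(theta) delays the wavefront at the second mic.
delay_samples = int(round(d * np.cos(theta) / c * fs))
first_audio_a = source
first_audio_b = np.roll(source, delay_samples)  # same source, shifted in time

print(delay_samples)  # a few samples of offset between the two captures
```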
Next, a procedure in which the first terminal device acquires a plurality of first audio frequencies acquired by a plurality of audio acquisition devices will be described with reference to fig. 4.
Fig. 4 is a schematic diagram of acquiring a plurality of first audios according to an embodiment of the disclosure. Referring to fig. 4, the figure includes a terminal device. The terminal device includes microphone A and microphone B, and the space contains audio A, audio B, and audio C. The terminal device may collect the audio in the space through microphone A and microphone B to obtain first audio A and first audio B. First audio A consists of audio A, audio B, and audio C as collected by microphone A; first audio B consists of audio A, audio B, and audio C as collected by microphone B.
It should be noted that, in fig. 4, because microphone A and microphone B are disposed at the bottom of the terminal device at different positions, the spectrogram of first audio A and the spectrogram of first audio B also differ.
S203, determining second audio of the target direction based on the plurality of first audio and the target direction.
Alternatively, the second audio may be audio of the target direction. For example, if the first terminal device determines that the target direction is the front side of the first terminal device, the second audio is the audio of the front side of the first terminal device, and if the first terminal device determines that the target direction is the left side of the first terminal device, the second audio is the audio of the left side of the first terminal device.
Optionally, the first terminal device may determine the second audio of the target direction based on the following possible implementation manner: a predicted audio for the target direction is determined based on the plurality of first audio and the target direction, and a second audio for the target direction is determined based on the plurality of first audio and the predicted audio.
Optionally, the predicted audio is associated with target phase information. The target phase information may include phase information of the plurality of first audios and phase information associated with the target direction: the phase information of the first audios may include the phase differences between the plurality of first audios, and the phase information associated with the target direction may include a phase difference associated with the target direction.
Optionally, the target phase information indicates the proportion of the sound source in the target direction in each first audio. For example, the predicted audio may be the audio associated with the plurality of first audios in the target direction. If the first audios acquired by the first terminal device include first audio A and first audio B, the target phase information indicates the proportion of the audio signal associated with the target direction in first audio A and the proportion of the audio signal associated with the target direction in first audio B.
It should be noted that, because the audio acquisition devices occupy different positions, the spectrograms of the first audios they collect also differ, so the audio source in the target direction has a corresponding proportion in each first audio.
Alternatively, the first terminal device may determine the predicted audio of the target direction as follows: determine the target phase information based on the plurality of first audios and the target direction, and determine the predicted audio based on the target phase information and the plurality of first audios.
Alternatively, the first terminal device may determine the target phase information from the phase differences between the plurality of first audios and the phase difference corresponding to the target direction. For example, the phase difference of the target direction is determined from the path difference with which a sound source in the target direction reaches the microphones, and a plurality of phase differences between the channels of the plurality of first audios are determined; the target phase information is then obtained from the cosine similarities between the phase difference of the target direction and the plurality of inter-channel phase differences.
Alternatively, because the target phase information indicates the proportion of the sound source in the target direction in each first audio, the predicted audio of the target direction may be determined from the plurality of first audios and the target phase information. For example, the first terminal device may use the target phase information to determine the audio signal in the target direction within each first audio, and then superimpose the plurality of audio signals to obtain the predicted audio of the target direction.
Optionally, the second audio of the target direction is determined based on the plurality of first audios and the predicted audio as follows: the second audio of the target direction is determined based on a first model, the plurality of first audios, and the predicted audio. Optionally, the first model is trained on a plurality of sample groups, each group including a plurality of sample first audios, a sample predicted audio corresponding to the plurality of sample first audios, and a sample second audio. For example, a plurality of sample first audios 1, a sample predicted audio 1 of a sample direction, and a sample second audio 1 of the sample direction corresponding to the plurality of sample first audios 1 are obtained, yielding one sample group that includes the plurality of sample first audios 1, the sample predicted audio 1, and the sample second audio 1.
It should be noted that, before inputting the plurality of first audios and the predicted audio into the first model, the first terminal device may concatenate the plurality of first audios and the predicted audio along the channel dimension and input the concatenated audio signal into the first model, and the first model may output the second audio of the target direction.
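A minimal sketch of this channel-dimension concatenation, assuming the audios are PyTorch tensors of shape [channels, samples]; the one-layer convolution stands in for the first model, whose real architecture the disclosure does not specify here.

```python
import torch
import torch.nn as nn

n_mics, n_samples = 2, 16000
first_audios = torch.randn(n_mics, n_samples)   # the plurality of first audios
predicted_audio = torch.randn(1, n_samples)     # predicted audio of the target direction

# Concatenate along the channel dimension: [n_mics + 1, n_samples].
model_input = torch.cat([first_audios, predicted_audio], dim=0)

# Stand-in for the first model: maps the stacked channels to one output channel.
first_model = nn.Conv1d(n_mics + 1, 1, kernel_size=1)
second_audio = first_model(model_input.unsqueeze(0))  # [1, 1, n_samples]
print(second_audio.shape)
```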
S204, sending the second audio to the second terminal equipment.
Alternatively, the second terminal device may be other terminal devices than the first terminal device. The first terminal device may send the second audio to the second terminal device. For example, the first terminal device may be connected to the second terminal device via a video connection, and further send the second audio to the second terminal device. Alternatively, the first terminal device may send the second audio to the server, so that the server forwards the second audio to the second terminal device. For example, in a live scene, after the first terminal device acquires the second audio, the second audio may be sent to the server, and the server may send the second audio to the second terminal device that requests the live scene.
The embodiment of the disclosure provides an audio processing method: a target direction is determined in response to a touch operation on a display page of a first terminal device; a plurality of first audios acquired by a plurality of audio acquisition devices are acquired; a predicted audio of the target direction is determined based on the plurality of first audios and the target direction; and a second audio of the target direction is determined based on the plurality of first audios and the predicted audio. In this way, the first terminal device can determine the target direction in response to a touch operation at any position of the display page, which improves the flexibility of audio collection. Moreover, because the target phase information indicates the proportion of the sound source in the target direction in each first audio, the first terminal device can accurately determine the second audio of the target direction based on the plurality of first audios and the predicted audio associated with the target phase information; the audio signals from non-target directions in the second audio are weaker, which improves the accuracy of audio collection.
Based on the embodiment shown in fig. 2, a method for determining predicted audio of a target direction based on a plurality of first audio and the target direction in the above-described audio processing method will be described in detail with reference to fig. 5.
Fig. 5 is a schematic diagram of a method for determining predicted audio according to an embodiment of the disclosure. Referring to fig. 5, the method includes:
S501, determining target phase information based on the plurality of first audios and the target direction.
Optionally, the target phase information indicates the proportion of the sound source in the target direction in each first audio. The first terminal device may determine the target phase information as follows: acquire a plurality of first phase differences between the plurality of first audios and a second phase difference associated with the target direction, determine the similarity between the second phase difference and each first phase difference, and determine the target phase information based on the similarities.
Optionally, the phase differences between the plurality of first audios are obtained as follows: the first terminal device determines an initial audio among the plurality of first audios. The first terminal device may determine any one of the plurality of first audios as the initial audio. For example, if the plurality of first audios acquired by the first terminal device include first audio A, first audio B, and first audio C, the first terminal device may determine any of first audio A, first audio B, or first audio C as the initial audio, which is not limited by the embodiments of the present disclosure.
Optionally, the phase difference between the initial audio and each other first audio is obtained, yielding the plurality of first phase differences. For example, if the plurality of first audios include first audio A, first audio B, and first audio C, and the first terminal device determines first audio A as the initial audio, the first terminal device may obtain phase difference A between first audio A and first audio B and phase difference B between first audio A and first audio C, and determine phase difference A and phase difference B as the first phase differences.
It should be noted that, when determining the target phase information from the plurality of first audios and the target direction, the first terminal device may first perform a Fourier transform on the plurality of first audios to obtain a plurality of spectrograms associated with them, and then determine the plurality of first phase differences from the spectrograms. This improves the accuracy of the first phase differences, yields accurate target phase information, and improves the audio processing effect.
Next, a process of determining a plurality of first phase differences will be described with reference to fig. 6.
Fig. 6 is a schematic diagram of a process for determining a plurality of first phase differences according to an embodiment of the present disclosure. Referring to fig. 6, the figure includes first audio A, first audio B, and first audio C. A Fourier transform is applied to first audio A, first audio B, and first audio C to obtain spectrum A associated with first audio A, spectrum B associated with first audio B, and spectrum C associated with first audio C. First audio A is determined as the initial audio, the phase difference between spectrum A and spectrum B is determined as first phase difference A, and the phase difference between spectrum A and spectrum C is determined as first phase difference B.
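The sketch below walks through this step with SciPy's STFT, taking the first channel as the initial audio; the sampling rate, frame length, and placeholder signals are assumptions for illustration.

```python
import numpy as np
from scipy.signal import stft

fs = 16000
rng = np.random.default_rng(0)
first_audios = rng.standard_normal((3, fs))   # first audio A, B, C (assumed signals)

# Fourier transform each first audio into a complex spectrogram [freq, frames].
specs = [stft(x, fs=fs, nperseg=512)[2] for x in first_audios]

# Take first audio A (index 0) as the initial audio; the first phase
# differences are the inter-channel phase differences against it.
initial = specs[0]
first_phase_diffs = [np.angle(s) - np.angle(initial) for s in specs[1:]]

print(first_phase_diffs[0].shape)  # one phase-difference map per remaining channel
```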
Alternatively, the first terminal device may determine the phase difference between a sound source in the target direction and the audio acquisition devices as the second phase difference associated with the target direction. For example, the first terminal device may obtain the phase difference caused by the path difference with which a sound source in the target direction reaches the microphones, and determine that phase difference as the second phase difference.
Optionally, the similarity between the second phase difference and each first phase difference is determined, and the target phase information is determined based on the similarities, as follows: the cosine similarity between the second phase difference and each first phase difference is determined, yielding a plurality of cosine similarities. For example, if the second phase difference associated with the target direction is phase difference A and the first phase differences include phase difference B and phase difference C, the plurality of cosine similarities include the cosine similarity between phase difference A and phase difference B and the cosine similarity between phase difference A and phase difference C.
The plurality of cosine similarities are then fused to obtain the target phase information. For example, the first terminal device may concatenate the plurality of cosine similarities, or superimpose them, to obtain the target phase information, which is not limited by the embodiments of the present disclosure; e.g., cosine similarity A and cosine similarity B may be concatenated to obtain the target phase information.
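A numpy sketch of these two steps under the common far-field assumption that a source at angle theta, seen by two microphones spaced d apart, produces an expected inter-channel phase difference of 2*pi*f*d*cos(theta)/c at frequency f; the spacing, angle, and per-bin cosine formulation cos(measured - expected) are assumptions for illustration rather than the disclosure's exact formulas.

```python
import numpy as np

c, d = 343.0, 0.1            # speed of sound; assumed mic spacing (m)
theta = np.deg2rad(60)       # assumed target direction
freqs = np.fft.rfftfreq(512, d=1 / 16000)   # STFT bin frequencies

# Second phase difference: expected phase lag per frequency for the
# path difference d*cos(theta) of a far-field source in the target direction.
second_pd = 2 * np.pi * freqs * d * np.cos(theta) / c

# First phase differences measured between channels (placeholder values here).
rng = np.random.default_rng(0)
first_pds = rng.uniform(-np.pi, np.pi, size=(2, len(freqs), 100))  # [pairs, freq, frames]

# Per time-frequency cosine similarity between expected and measured phase
# differences: cos(measured - expected) is 1 when they align exactly.
cos_sims = np.cos(first_pds - second_pd[None, :, None])

# Fuse by concatenating along the pair dimension to form the target phase info.
target_phase_info = np.concatenate(list(cos_sims), axis=0)  # [pairs*freq, frames]
print(target_phase_info.shape)
```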
S502, determining predicted audio based on target phase information and a plurality of first audio.
Alternatively, the first terminal device may determine the predicted audio as follows: determine, based on the target phase information and the plurality of first audios, a target weight associated with each first audio. Optionally, the target weight is the weight of the audio of the target direction within that first audio.
Optionally, the target weight corresponding to each first audio is determined based on the target phase information and the plurality of first audios as follows: the target weights are determined based on a second model, the plurality of first audios, and the target phase information. Optionally, the second model is trained on a plurality of preset sample groups, each group including a plurality of sample audios, target phase information associated with the plurality of sample audios, and a sample weight associated with each sample audio. For example, a plurality of sample audios 1 and target phase information 1 associated with them are obtained (the target direction may be any direction; in actual training, the target phase information of the plurality of sample audios 1 for each direction in the space may be obtained), yielding one preset sample group that includes the plurality of sample audios 1, the sample target phase information 1, and the sample weight 1 associated with each sample audio; a plurality of preset sample groups can be obtained in this way.
Optionally, the second model may consist of an encoder, a dual-path recurrent neural network (DPRNN), and a decoder. The DPRNN can effectively model the correlations between frequencies and between frames of the audio, so the audio can be effectively decomposed. The encoder encodes the target phase information and the features of the plurality of spectrograms. The DPRNN may be composed of stacked SA-DPRNN blocks and an adaptive feature fusion block (AFFB); an SA-DPRNN block comprises an intra-block RNN and an inter-block RNN, and the AFFB can assign different weights to the contributions of different intermediate features to the training task, so that different intermediate features can be adaptively aggregated. An attention mechanism can also be introduced into the DPRNN, which improves the training effect of the model.
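As a sketch of the dual-path idea, the block below runs an intra-chunk BiLSTM and then an inter-chunk BiLSTM, each with a residual connection. It deliberately omits the SA (self-attention) and AFFB components described above, so it illustrates the plain dual-path structure rather than the disclosure's exact model.

```python
import torch
import torch.nn as nn

class DPRNNBlock(nn.Module):
    """One dual-path block: an intra-chunk BiLSTM models correlations within
    each chunk, and an inter-chunk BiLSTM models correlations across chunks."""
    def __init__(self, feat_dim: int, hidden: int):
        super().__init__()
        self.intra = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.intra_proj = nn.Linear(2 * hidden, feat_dim)
        self.inter = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.inter_proj = nn.Linear(2 * hidden, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, n_chunks, chunk_len, feat_dim]
        b, k, s, d = x.shape
        y = self.intra_proj(self.intra(x.reshape(b * k, s, d))[0]).reshape(b, k, s, d)
        x = x + y                                   # residual after the intra pass
        y = x.transpose(1, 2).reshape(b * s, k, d)  # run the RNN across chunks
        y = self.inter_proj(self.inter(y)[0]).reshape(b, s, k, d).transpose(1, 2)
        return x + y                                # residual after the inter pass

block = DPRNNBlock(feat_dim=64, hidden=128)
print(block(torch.randn(1, 10, 20, 64)).shape)  # torch.Size([1, 10, 20, 64])
```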
Optionally, when the second model processes the plurality of first audios and the target phase information, a Fourier transform may first be applied to the plurality of first audios to obtain their spectrograms; the second model extracts fused spectral features from the spectrograms, splices the fused spectral features with the target phase information, and processes the spliced features to obtain the target weights associated with the plurality of first audios. For example, the first terminal device inputs the 2 spectrograms of 2 first audios and the target phase information corresponding to the first audios and the target direction into the second model; the second model extracts a fused spectral feature from the 2 spectrograms, splices the fused spectral feature with the target phase information, and convolves the spliced feature to obtain 2 target weights.
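The sketch below mirrors that flow with a deliberately simplified stand-in for the second model: it encodes the stacked magnitude spectrograms, splices the encoding with the target phase information, and decodes per-microphone time-frequency weights. The layer choices (a 1x1 convolution and a GRU in place of the encoder/DPRNN/decoder stack) and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class WeightEstimator(nn.Module):
    """Hypothetical second model: spectrogram features plus target phase info
    in, one weight per microphone, frequency bin, and frame out."""
    def __init__(self, n_mics: int, n_freq: int, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Conv1d(n_mics * n_freq, hidden, kernel_size=1)
        self.rnn = nn.GRU(hidden + n_freq, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, n_mics * n_freq)

    def forward(self, mags: torch.Tensor, phase_info: torch.Tensor) -> torch.Tensor:
        # mags: [batch, n_mics, n_freq, frames]; phase_info: [batch, n_freq, frames]
        b, m, f, t = mags.shape
        feat = self.encoder(mags.reshape(b, m * f, t))   # fused spectral features
        feat = torch.cat([feat, phase_info], dim=1)      # splice with phase info
        out, _ = self.rnn(feat.transpose(1, 2))          # model correlations over frames
        weights = torch.sigmoid(self.decoder(out))       # weights in [0, 1]
        return weights.transpose(1, 2).reshape(b, m, f, t)

model = WeightEstimator(n_mics=2, n_freq=257)
w = model(torch.randn(1, 2, 257, 100), torch.randn(1, 257, 100))
print(w.shape)  # torch.Size([1, 2, 257, 100])
```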
Next, a process of determining the target weight will be described with reference to fig. 7.
Fig. 7 is a schematic diagram of a process for determining target weights according to an embodiment of the disclosure. Referring to fig. 7, the figure includes first audio A, first audio B, first audio C, the target direction, and the second model. A Fourier transform is applied to first audio A, first audio B, and first audio C to obtain spectrum A associated with first audio A, spectrum B associated with first audio B, and spectrum C associated with first audio C.
Referring to fig. 7, the target phase information is determined from spectrum A, spectrum B, spectrum C, and the target direction; the target phase information, spectrum A, spectrum B, and spectrum C are then input into the second model, which may output weight A, weight B, and weight C. Weight A is the target weight associated with first audio A, weight B the target weight associated with first audio B, and weight C the target weight associated with first audio C.
The predicted audio is then determined based on the plurality of first audios and the plurality of target weights, as follows: each first audio is multiplied by its corresponding target weight to obtain a target audio, and the plurality of target audios are combined to obtain the predicted audio. For example, multiplying each first audio by its corresponding target weight yields the target audio of that first audio in the target direction; in this way a plurality of target audios associated with the plurality of first audios are obtained, and combining them (for example, in the frequency domain) yields the predicted audio.
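A frequency-domain sketch of this combination, assuming weights shaped like the estimator's output above and a simple sum over microphones as the combining step; the inverse STFT back to a waveform is an assumed final step, not stated in this paragraph.

```python
import numpy as np
from scipy.signal import istft

rng = np.random.default_rng(0)
specs = rng.standard_normal((3, 257, 100)) + 1j * rng.standard_normal((3, 257, 100))
weights = rng.uniform(0, 1, size=(3, 257, 100))   # target weights per mic (assumed)

# Multiply each first audio's spectrum by its target weight, then combine
# the target audios in the frequency domain by summation.
predicted_spec = np.sum(weights * specs, axis=0)  # [freq, frames]

# Back to a time-domain predicted audio.
_, predicted_audio = istft(predicted_spec, fs=16000, nperseg=512)
print(predicted_audio.shape)
```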
The embodiment of the disclosure provides a method for determining the predicted audio: the first terminal device determines the phase differences between the plurality of first audios to obtain a plurality of first phase differences, determines a second phase difference associated with the target direction, determines the target phase information based on the second phase difference and the plurality of first phase differences, and determines the predicted audio based on the target phase information and the plurality of first audios. Because the target phase information can indicate the proportion of the sound source in the target direction in each first audio, the audio signal of the target direction can be extracted from the plurality of first audios through the target phase information, which improves the accuracy of the predicted audio.
On the basis of any one of the above embodiments, a procedure of the above audio processing method will be described below with reference to fig. 8.
Fig. 8 is a process schematic diagram of an audio processing method according to an embodiment of the disclosure. Referring to fig. 8, the figure includes first audio A, first audio B, first audio C, the target direction, the first model, and the second model. A Fourier transform is applied to first audio A, first audio B, and first audio C to obtain spectrum A associated with first audio A, spectrum B associated with first audio B, and spectrum C associated with first audio C.
Referring to fig. 8, the target phase information is determined from spectrum A, spectrum B, spectrum C, and the target direction; the target phase information, spectrum A, spectrum B, and spectrum C are input into the second model, which may output weight A, weight B, and weight C. Weight A is the target weight associated with first audio A, weight B the target weight associated with first audio B, and weight C the target weight associated with first audio C.
Referring to fig. 8, the product of first audio A and first weight A is determined as first target audio A, and the product of first audio A and second weight A as interfering audio A; the product of first audio B and first weight B is determined as first target audio B, and the product of first audio B and second weight B as interfering audio B; the product of first audio C and first weight C is determined as first target audio C, and the product of first audio C and second weight C as interfering audio C.
Referring to fig. 8, target audio A may be obtained by multiplying spectrum A by weight A, target audio B by multiplying spectrum B by weight B, and target audio C by multiplying spectrum C by weight C; the predicted audio is obtained by fusing target audio A, target audio B, and target audio C. It should be noted that a target audio may be a spectrum or a piece of audio, which is not limited by the embodiments of the present disclosure.
Referring to fig. 8, first audio A, first audio B, first audio C, and the predicted audio are input into the first model, which may output the second audio corresponding to the target direction. Because the target phase information indicates the proportion of the sound source in the target direction in each first audio, the terminal device can predict the target audio of the target direction based on the plurality of first audios and the target phase information, and accurately determine the second audio of the target direction based on the target audio and the plurality of first audios. Because the interfering signals from other directions in the space are weak in the second audio, the accuracy of audio collection is high.
Fig. 9 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the disclosure. Referring to fig. 9, the audio processing apparatus 90 includes a response module 91, an acquiring module 92, a determining module 93, and a sending module 94, wherein:
The response module 91 is configured to determine a target direction in response to a manipulation of the first terminal device;
the acquiring module 92 is configured to acquire a plurality of first audios acquired by a plurality of audio acquisition devices;
The determining module 93 is configured to determine, based on the plurality of first audios and the target direction, a second audio of the target direction;
the sending module 94 is configured to send the second audio to a second terminal device.
In accordance with one or more embodiments of the present disclosure, the determining module 93 is specifically configured to:
determine a predicted audio of the target direction based on the plurality of first audios and the target direction, wherein the predicted audio is associated with target phase information indicating the proportion of the sound source of the target direction in each first audio;
determine a second audio of the target direction based on the plurality of first audios and the predicted audio.
In accordance with one or more embodiments of the present disclosure, the determining module 93 is specifically configured to:
determine the target phase information based on the plurality of first audios and the target direction;
determine the predicted audio based on the target phase information and the plurality of first audios.
In accordance with one or more embodiments of the present disclosure, the determining module 93 is specifically configured to:
determine, based on the target phase information and the plurality of first audios, a target weight associated with each first audio, wherein the target weight is the weight of the audio of the target direction within that first audio;
determine the predicted audio based on the plurality of first audios and the plurality of target weights.
In accordance with one or more embodiments of the present disclosure, the determining module 93 is specifically configured to:
multiply each first audio by the target weight corresponding to that first audio to obtain a target audio;
combine the plurality of target audios to obtain the predicted audio.
In accordance with one or more embodiments of the present disclosure, the determining module 93 is specifically configured to:
determine the phase differences among the plurality of first audios to obtain a plurality of first phase differences;
determine a second phase difference associated with the target direction;
determine the target phase information based on the plurality of first phase differences and the second phase difference.
In accordance with one or more embodiments of the present disclosure, the determining module 93 is specifically configured to:
determine the cosine similarity between the second phase difference and each first phase difference to obtain a plurality of cosine similarities;
fuse the plurality of cosine similarities to obtain the target phase information.
In accordance with one or more embodiments of the present disclosure, the determining module 93 is specifically configured to:
obtain the second audio of the target direction based on a first model, the plurality of first audios, and the predicted audio;
wherein the first model is trained on a plurality of sample groups, each group including a plurality of sample first audios, a sample predicted audio corresponding to the plurality of sample first audios, and a sample second audio.
According to one or more embodiments of the present disclosure, the manipulation of the first terminal device is a touch operation on a display page of the first terminal device.
According to one or more embodiments of the present disclosure, the manipulation of the first terminal device is a voice operation of the first terminal device.
The audio processing device provided in the embodiments of the present disclosure may be used to execute the technical solutions of the embodiments of the methods, and the implementation principle and the technical effects are similar, and are not repeated here.
Fig. 10 is a schematic structural diagram of a terminal device according to an embodiment of the present disclosure. Referring to fig. 10, a schematic diagram of a terminal device 1000 suitable for implementing embodiments of the present disclosure is shown. The terminal device may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a personal digital assistant (PDA), a tablet computer (PAD), a portable multimedia player (PMP), and a vehicle-mounted terminal (e.g., a car navigation terminal), and fixed terminals such as a digital TV and a desktop computer. The terminal device shown in fig. 10 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in fig. 10, the terminal device 1000 may include a processing apparatus (e.g., a central processor, a graphics processor, etc.) 1001, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage apparatus 1008 into a random access memory (RAM) 1003. The RAM 1003 also stores various programs and data necessary for the operation of the terminal device 1000. The processing apparatus 1001, the ROM 1002, and the RAM 1003 are connected to one another by a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
In general, the following devices may be connected to the I/O interface 1005: input devices 1006 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, and a gyroscope; output devices 1007 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; storage devices 1008 including, for example, a magnetic tape and a hard disk; and a communication device 1009. The communication device 1009 may allow the terminal device 1000 to communicate wirelessly or by wire with other devices to exchange data. While fig. 10 shows the terminal device 1000 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided; more or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 1009, or installed from the storage device 1008, or installed from the ROM 1002. The above-described functions defined in the method of the embodiment of the present disclosure are performed when the computer program is executed by the processing device 1001.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The computer-readable medium may be included in the terminal device, or may exist separately without being assembled into the terminal device.
The computer-readable medium carries one or more programs which, when executed by the terminal device, cause the terminal device to perform the method shown in the above embodiments.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by software or by hardware. The name of a unit does not in any way limit the unit itself; for example, the first acquisition unit may also be described as "a unit that acquires at least two Internet Protocol addresses".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
It should be noted that the modifiers "a", "an", and "a plurality of" in this disclosure are illustrative rather than restrictive, and those of ordinary skill in the art will appreciate that, unless the context clearly indicates otherwise, these modifiers should be understood as "one or more".
The names of messages or information exchanged between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
It will be appreciated that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed, in an appropriate manner and in accordance with relevant laws and regulations, of the type, scope of use, and usage scenarios of the personal information involved in the present disclosure, and the user's authorization should be obtained.
For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly remind the user that the operation the user requests to perform will require the acquisition and use of the user's personal information. The user can thus autonomously choose, according to the prompt information, whether to provide personal information to the software or hardware, such as an electronic device, application program, server, or storage medium, that executes the operations of the technical solution of the present disclosure.
As an optional but non-limiting implementation, in response to receiving an active request from the user, the prompt information may be sent to the user by way of, for example, a popup window, in which the prompt information may be presented as text. In addition, the popup window may carry a selection control allowing the user to choose "agree" or "disagree" to providing personal information to the electronic device.
It will be appreciated that the above-described notification and user authorization process is merely illustrative and does not limit the implementations of the present disclosure; other ways of satisfying relevant laws and regulations may also be applied in the implementations of the present disclosure.
It will be appreciated that the data involved in the present technical solution (including but not limited to the data itself and the acquisition or use of the data) should comply with the requirements of applicable laws and regulations. The data may include information, parameters, messages, etc., such as tangential flow indication information.
The foregoing description is merely of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. Those skilled in the art will appreciate that the scope of the disclosure involved herein is not limited to technical solutions formed by the specific combinations of the features described above, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in the present disclosure (but not limited thereto).
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (13)

1. An audio processing method, comprising:
determining a target direction in response to a manipulation of a first terminal device;
acquiring a plurality of first audios acquired by a plurality of audio acquisition devices;
determining a second audio of the target direction based on the plurality of first audios and the target direction; and
transmitting the second audio to a second terminal device.
2. The method of claim 1, wherein the determining the second audio of the target direction based on the plurality of first audios and the target direction comprises:
determining a predicted audio of the target direction based on the plurality of first audios and the target direction, wherein the predicted audio is associated with target phase information, the target phase information comprising phase information of the plurality of first audios and phase information associated with the target direction; and
determining the second audio of the target direction based on the plurality of first audios and the predicted audio.
3. The method of claim 2, wherein the determining the predicted audio of the target direction based on the plurality of first audios and the target direction comprises:
determining the target phase information based on the plurality of first audios and the target direction; and
determining the predicted audio based on the target phase information and the plurality of first audios.
4. The method of claim 3, wherein the determining the predicted audio based on the target phase information and the plurality of first audios comprises:
determining a target weight associated with each first audio based on the target phase information and the plurality of first audios, wherein the target weight is the weight of the audio in the target direction within that first audio; and
determining the predicted audio based on the plurality of first audios and the plurality of target weights.
5. The method of claim 4, wherein the determining the predicted audio based on the plurality of first audios and the plurality of target weights comprises:
multiplying each first audio by the target weight corresponding to that first audio to obtain a target audio; and
combining the plurality of target audios to obtain the predicted audio.
6. The method of any of claims 3-5, wherein the determining the target phase information based on the plurality of first audios and the target direction comprises:
acquiring a plurality of first phase differences among the plurality of first audios and a second phase difference associated with the target direction; and
determining a degree of similarity between the second phase difference and each first phase difference, and determining the target phase information based on the degrees of similarity.
7. The method of any of claims 2-5, wherein the determining the second audio of the target direction based on the plurality of first audios and the predicted audio comprises:
obtaining the second audio of the target direction based on a first model, the plurality of first audios, and the predicted audio,
wherein the first model is obtained by training on a plurality of sample groups, each sample group comprising a plurality of sample first audios, a sample predicted audio corresponding to the plurality of sample first audios, and a sample second audio.
8. The method according to any of claims 1-5, wherein the manipulation of the first terminal device is a touch operation on a display page of the first terminal device.
9. The method according to any of claims 1-5, wherein the manipulation of the first terminal device is a voice operation on the first terminal device.
10. An audio processing device, comprising a response module, an acquisition module, a determination module, and a sending module, wherein:
the response module is configured to determine a target direction in response to a manipulation of a first terminal device;
the acquisition module is configured to acquire a plurality of first audios acquired by a plurality of audio acquisition devices;
the determination module is configured to determine a second audio of the target direction based on the plurality of first audios and the target direction; and
the sending module is configured to send the second audio to a second terminal device.
11. A terminal device, comprising: a processor and a memory;
The memory stores computer-executable instructions;
The processor executes the computer-executable instructions stored in the memory, causing the processor to perform the audio processing method of any one of claims 1-9.
12. A computer readable storage medium having stored therein computer executable instructions which, when executed by a processor, implement the audio processing method of any of claims 1-9.
13. A computer program product comprising a computer program which, when executed by a processor, implements the audio processing method according to any of claims 1-9.
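To make the signal flow recited in claims 3 to 6 concrete, the following is a minimal sketch of one possible realization: a short-time Fourier transform (STFT) processor that compares the observed inter-microphone phase differences (the first phase differences) with the phase differences expected for the target direction (the second phase difference), turns their degree of similarity into per-frequency target weights, and combines the weighted, phase-aligned channels into a predicted audio. The uniform linear array geometry, the STFT parameters, and all names in the code are assumptions of this sketch rather than part of the claimed method, and the trained first model of claim 7 is omitted.

import numpy as np

SPEED_OF_SOUND = 343.0  # metres per second

def expected_phase_shifts(freqs_hz, target_angle_rad, n_mics, spacing_m):
    # "Second phase difference": the inter-microphone phase shift that a
    # plane wave arriving from the target direction would produce.
    delays = np.arange(n_mics) * spacing_m * np.cos(target_angle_rad) / SPEED_OF_SOUND
    return np.exp(-2j * np.pi * np.outer(delays, freqs_hz))  # (n_mics, n_freqs)

def directional_mix(mics, sample_rate, target_angle_rad, spacing_m, n_fft=512, hop=256):
    # mics holds the "first audios", shape (n_mics, n_samples).
    n_mics, n_samples = mics.shape
    window = np.hanning(n_fft)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    steering = expected_phase_shifts(freqs, target_angle_rad, n_mics, spacing_m)
    out = np.zeros(n_samples)
    for start in range(0, n_samples - n_fft + 1, hop):
        spec = np.fft.rfft(mics[:, start:start + n_fft] * window, axis=1)
        # "First phase differences": observed phases referenced to microphone 0.
        observed = spec * np.conj(spec[0:1, :])
        observed /= np.abs(observed) + 1e-12
        # Degree of similarity between the observed and the expected phase
        # differences, used here as a per-frequency "target weight" in [0, 1].
        weight = np.abs(np.mean(observed * np.conj(steering), axis=0))
        # Align each channel to the target direction, apply the weight, and
        # combine the channels (the weighted combination of claims 4 and 5).
        combined = weight * np.mean(spec * np.conj(steering), axis=0)
        out[start:start + n_fft] += np.fft.irfft(combined, n=n_fft) * window
    return out  # an approximation of the "predicted audio"

For example, with two microphones spaced 0.08 m apart and a broadside target direction, directional_mix(mics, 16000, np.pi / 2, 0.08) attenuates time-frequency bins whose phase pattern does not match the target direction; the output scaling of the overlap-add is deliberately left approximate. A trained first model, as in claim 7, would then take the first audios together with this predicted audio and output the refined second audio.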
CN202211275141.5A 2022-10-18 2022-10-18 Audio processing method and device and terminal equipment Pending CN117953907A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211275141.5A CN117953907A (en) 2022-10-18 2022-10-18 Audio processing method and device and terminal equipment


Publications (1)

Publication Number Publication Date
CN117953907A true CN117953907A (en) 2024-04-30

Family

ID=90802114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211275141.5A Pending CN117953907A (en) 2022-10-18 2022-10-18 Audio processing method and device and terminal equipment

Country Status (1)

Country Link
CN (1) CN117953907A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination