CN113707149A - Audio processing method and device - Google Patents

Audio processing method and device

Info

Publication number
CN113707149A
CN113707149A (application CN202111000865.4A)
Authority
CN
China
Prior art keywords
audio signal
spatial spectrum
sound source
audio
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111000865.4A
Other languages
Chinese (zh)
Inventor
周美林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd filed Critical Vivo Mobile Communication Co Ltd
Priority to CN202111000865.4A priority Critical patent/CN113707149A/en
Publication of CN113707149A publication Critical patent/CN113707149A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/0208 Noise filtering
    • G10L2021/02087 Noise filtering the noise being separate speech, e.g. cocktail party
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Abstract

The present application discloses an audio processing method and apparatus, belonging to the technical field of signal processing. The audio processing method includes: acquiring a first audio signal; determining sound source information of N sound sources of the first audio signal, where N is a positive integer; and separating at least one sub-audio signal from the first audio signal according to the sound source information, where one sub-audio signal is the audio signal of one of the N sound sources.

Description

Audio processing method and device
Technical Field
The present application belongs to the field of signal processing technology, and in particular, relates to an audio processing method and apparatus.
Background
Currently, speech recognition technology can automatically convert audio into text so that the audio can be displayed or stored in text form. In some application scenarios there may be multiple sound sources, and the resulting audio may include the voices of all of them. However, in the prior art it is often difficult to accurately separate the voices of different sound sources in an audio signal, resulting in a poor audio processing effect.
Disclosure of Invention
The embodiments of the present application aim to provide an audio processing method and an audio processing apparatus, so as to solve the problem in the prior art that voices of different sound sources in audio are difficult to separate accurately, which leads to a poor audio processing effect.
In a first aspect, an embodiment of the present application provides an audio processing method, where the method includes:
acquiring a first audio signal;
determining sound source information of N sound sources of a first audio signal, wherein N is a positive integer;
at least one sub audio signal is separated from the first audio signal according to the sound source information, wherein one sub audio signal is an audio signal of one of the N sound sources.
In a second aspect, an embodiment of the present application provides an audio processing apparatus, including:
the acquisition module is used for acquiring a first audio signal;
a determining module, configured to determine sound source information of N sound sources of the first audio signal, where N is a positive integer;
the separation module is used for separating at least one sub-audio signal from the first audio signal according to the sound source information, wherein one sub-audio signal is an audio signal of one sound source in the N sound sources.
In a third aspect, embodiments of the present application provide an electronic device, which includes a processor, a memory, and a program or instructions stored on the memory and executable on the processor, where the program or instructions, when executed by the processor, implement the steps of the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium on which a program or instructions are stored, which when executed by a processor, implement the steps of the method according to the first aspect.
In a fifth aspect, embodiments of the present application provide a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the method according to the first aspect.
The audio processing method provided by the embodiments of the present application acquires a first audio signal, determines sound source information of N sound sources of the first audio signal, and separates at least one sub-audio signal from the first audio signal according to the sound source information, where one sub-audio signal is the audio signal of one of the N sound sources. In the embodiments of the present application, determining the sound source information of the N sound sources of the first audio signal provides prior data for the subsequent separation processing of the first audio signal, which is beneficial to accurately separating the audio signals of the respective sound sources and improves the audio processing effect.
Drawings
Fig. 1 is a schematic flowchart of an audio processing method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of a relationship between an electronic device and a user in an application scenario;
FIG. 3 is an exemplary illustration of a spatial spectrogram;
FIG. 4 is an exemplary graph of a target change curve;
FIG. 5 is a flow chart of an audio processing method in a specific application example;
fig. 6 is a schematic diagram of the sound source information assistance module acquiring sound source information;
fig. 7 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
fig. 9 is a schematic hardware structure diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present disclosure.
The terms "first", "second", and the like in the description and claims of the present application are used to distinguish between similar elements and are not necessarily used to describe a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances, so that the embodiments of the application can be practiced in sequences other than those illustrated or described herein. Objects distinguished by "first", "second", and the like are generally of one class, and the number of such objects is not limited; for example, a first object may be one or more than one. In addition, "and/or" in the specification and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
The following describes in detail an audio processing method and apparatus provided in the embodiments of the present application with reference to the accompanying drawings.
As shown in fig. 1, an audio processing method provided in an embodiment of the present application includes:
step 101, acquiring a first audio signal;
step 102, determining sound source information of N sound sources of a first audio signal, wherein N is a positive integer;
step 103, separating at least one sub-audio signal from the first audio signal according to the sound source information, wherein one sub-audio signal is an audio signal of one sound source of the N sound sources.
The audio processing method provided by the embodiment of the application can be applied to electronic devices such as mobile terminals and tablet computers, and the specific type of the electronic device is not limited herein.
The first audio signal acquired in step 101 may include audio signals from at least one sound source, and the audio signal of any one sound source corresponds to a sub-audio signal.
For simplicity of explanation, an audio signal may be considered to have N sound sources corresponding thereto, N being a positive integer, but the value of N is generally unknown prior to further processing of the audio signal.
In some examples, the main purpose of audio processing the first audio signal may be considered to separate each sub-audio signal from the first audio signal for subsequent accurate speech recognition of the sub-audio signal.
Of course, this is merely an example of an application of the audio processing method provided in the embodiment of the present application, and a specific application form thereof may be set according to actual needs.
In some application scenarios, the first audio signal may be recorded for sounds emitted by N sound sources.
For example, in a meeting or a classroom discussion, there may be multiple users, each corresponding to a sound source. The users may speak in turn, or at least two users may speak simultaneously. Therefore, the first audio signal recorded in such scenes may contain a plurality of sub-audio signals, one for each sound source. And the electronic device may need to recognize the utterances of different users separately.
In the case where multiple users speak at the same time, the corresponding sub-audio signals may overlap on the time axis. The electronic device then often needs to deal with the "cocktail party effect": it must focus on the speech of a particular user while several users are speaking simultaneously.
As can be seen, in the above application scenarios, the electronic device may need to separate the individual sub-audio signals. If the first audio signal is directly input into an audio separation model, the separation result may be inaccurate, or the computational demand on the electronic device may be high.
Of course, the first audio signal acquired in step 101 may be an original audio signal acquired by a sensor such as a microphone, may also be an audio signal stored in the electronic device, or may also be an audio signal obtained by preprocessing the original audio signal, and the like, and is not limited in this respect.
As indicated above, the first audio signal may be recorded for sounds emitted by N sound sources, in other words, the first audio signal may correspond to N sound sources.
In step 102, the electronic device may determine sound source information for N sound sources of the first audio signal.
For example, the sound source information of the N sound sources of the first audio signal may include sound source orientations. Generally, in application scenarios such as the conference or classroom discussion described above, each sound source orientation may correspond to a particular user. Once the sound source orientations are determined, the electronic device effectively knows which users are speaking at various points in time, which helps provide corresponding reference information for the separation of the sub-audio signals.
For another example, the sound source information may include the number of sound sources. It is easy to understand that the determination of the number of sound sources can be regarded as a process of determining the value of N to some extent.
In the case where the number of sound sources is determined, the electronic device may assign a corresponding number of audio processing channels to the first audio signal, each audio processing channel processing the sub-audio signal from one sound source, thereby providing corresponding reference information for the separation or recognition of the sub-audio signals.
It can be seen that the sound source information may actually be used as reference information in a subsequent separation process on the first audio signal, or may also be referred to as auxiliary information or prior information of the audio separation process.
As for the above-described acquisition of the sound source information, it can be achieved by various means.
For example, the sound source direction may be estimated based on Time Difference of Arrival (TDOA), Steered Response Power (SRP), or Multiple Signal Classification (MUSIC).
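As a concrete illustration of the TDOA cue mentioned above, the lag between two microphone signals can be estimated with GCC-PHAT cross-correlation. This is a standard, well-known estimator, not necessarily the method used in this patent, and all names are illustrative:

```python
import numpy as np

def gcc_phat(x, y):
    """Return the lag (in samples) by which y is delayed relative to x,
    using GCC-PHAT (phase-transform-weighted cross-correlation)."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = Y * np.conj(X)
    R /= np.abs(R) + 1e-12                    # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    # reorder so index 0 corresponds to lag -max_shift
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return int(np.argmax(np.abs(cc))) - max_shift

rng = np.random.default_rng(0)
s = rng.standard_normal(4096)
delayed = np.concatenate((np.zeros(5), s[:-5]))   # second mic lags by 5 samples
print(gcc_phat(s, delayed))                       # 5
```

Given the microphone spacing and the speed of sound, such a lag can then be converted into an arrival angle.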
As for the number of sound sources, it may be determined according to the number of sound source orientations, or by extracting audio features from the first audio signal, and so on.
In general, the sound source information determined in step 102 may serve as a priori information to assist in subsequent audio separation processing of the first audio signal.
Accordingly, in step 103, the electronic device may separate at least one sub audio signal from the first audio signal according to the sound source information, wherein one sub audio signal is an audio signal of one of the N sound sources.
As for the manner of the audio separation processing, it can be realized with existing techniques; for example, the audio separation may be performed by a deep-learning-based audio separation model or a speech separation model, which is not detailed here.
In this embodiment, the sub audio signals of all the N sound sources may be separated from the audio signal as needed, or the sub audio signals of some of the N sound sources may be separated from the audio signal, and may be set according to actual needs.
In other words, in theory, N sub-audio signals may be separated from the audio signal, and in practical applications, at least one sub-audio signal may be separated from the audio signal as needed, where the at least one sub-audio signal is at least a part of the N sub-audio signals. And one of the at least one sub audio signal may correspond to one sound source, i.e., one sub audio signal may be an audio signal of one sound source.
The audio processing method provided by the embodiments of the present application acquires a first audio signal, determines sound source information of N sound sources of the first audio signal, and separates at least one sub-audio signal from the first audio signal according to the sound source information, where one sub-audio signal is the audio signal of one of the N sound sources. In the embodiments of the present application, determining the sound source information of the N sound sources of the first audio signal provides prior data for the subsequent separation processing of the first audio signal, which is beneficial to accurately separating the audio signals of the respective sound sources and improves the audio processing effect.
In one example, the electronic device may further perform speech recognition on each of the separated sub-audio signals, respectively, to obtain a speech recognition result.
In this example, each sub-audio signal can be regarded, to a certain extent, as the audio signal produced by an independent sound source. Therefore, when the electronic device performs voice recognition on each sub-audio signal separately, problems such as the cocktail party effect are largely avoided, and the accuracy of voice recognition can be effectively improved.
In one example, the first audio signal described above may be a preprocessed audio signal. Specifically, in step 101, the step of acquiring the first audio signal may include:
acquiring a second audio signal acquired by a sensor;
and preprocessing the second audio signal to obtain a first audio signal.
In other words, in the present example, the second audio signal may be considered as an original audio signal, and the first audio signal may be an audio signal obtained by preprocessing the original audio signal.
The preprocessing method may include at least one of echo cancellation, noise suppression, and dereverberation, and the selection of the specific preprocessing method may be set according to actual needs, which is not illustrated herein.
In this example, the first audio signal obtained through the preprocessing has a higher audio quality, and based on the first audio signal, the sound source information can be determined more accurately, and the effect of subsequent audio separation or speech recognition can be improved.
Optionally, the first audio signal includes a plurality of audio frames, and one audio frame is associated with one piece of spatial spectrum information, where the spatial spectrum information includes a correspondence between a spatial spectrum and a preset orientation;
determining sound source information of N sound sources of the first audio signal specifically includes:
sound source information of N sound sources of the first audio signal is determined from the spatial spectrum information.
It is readily understood that for the first audio signal, a plurality of audio frames may be included. For example, the first audio signal may be sampled at a certain sampling frequency, each sampling point corresponding to an audio frame.
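As a toy illustration of dividing a signal into audio frames (a sketch only; the patent does not fix a frame length, hop size, or the exact frame/sampling-point relationship, so these parameters are assumptions):

```python
import numpy as np

def split_into_frames(signal, frame_len, hop):
    """Split a 1-D signal into (possibly overlapping) frames."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])

x = np.arange(16, dtype=float)
frames = split_into_frames(x, frame_len=8, hop=4)
print(frames.shape)   # (3, 8)
```

Each such frame would then be associated with one piece of spatial spectrum information, as described below.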
In this embodiment, each audio frame may be associated with a piece of spatial spectrum information. The spatial spectrum information may include a correspondence between the spatial spectrum and a preset orientation.
The spatial spectrum may correspond to a certain extent to an energy intensity or power value of the audio signal. Accordingly, in this embodiment, the spatial spectrum may be digitized, and the value of the spatial spectrum may be positively correlated with the power value of the audio signal.
The preset orientation may be an orientation defined according to a preset starting angle and direction, centered on the electronic device that records the first audio signal.
As shown in fig. 2, fig. 2 is an exemplary diagram of an electronic device with a recording function and the preset orientations defined around it. In fig. 2, the direction extending downward from the electronic device is the preset orientation of 0 degrees, the direction extending rightward is the preset orientation of 90 degrees, and so on.
Of course, this is only an example of the preset orientation, and in practical applications, the preset orientation may be set as needed.
As indicated above, the sound source orientation may be estimated based on TDOA, SRP, or MUSIC, and the correspondence between the spatial spectrum and the preset orientation can in fact be obtained with these methods.
As shown in fig. 2, fig. 2 may also be an exemplary diagram of a position relationship between the electronic device and users in a conference scenario, and each user may be numbered and represented by a number in a circle.
As shown in fig. 3, fig. 3 may be a spatial spectrogram obtained by performing audio signal acquisition on the conference scene shown in fig. 2 by using the SRP method. In fig. 3, the abscissa may be time, or a sample point corresponding to an audio frame; the left side ordinate corresponds to a preset position; the scale on the right may correspond to the value of the spatial spectrum.
The relationship between fig. 2 and fig. 3 can be described as follows: in the conference scenario shown in fig. 2, the users numbered 1 to 6 speak in turn in numbered order, and after this round all 6 users speak at the same time. Fig. 3 is the spatial spectrum of the first audio signal acquired during this speaking process.
As can be seen from fig. 3, at any given sampling point there is a spatial spectrum value for each preset orientation. Correspondingly, the sound source orientation can be determined from the preset orientation corresponding to the maximum spatial spectrum at that sampling point.
In fig. 3, the number of sound sources can also be obtained if the orientations of a plurality of sound sources determined from a plurality of sampling points are combined.
The direction of the sound source or the number of sound sources may be considered as the sound source information of the N sound sources.
In combination with the above example, it can be seen that, under the condition that each audio frame in the first audio signal is associated with one piece of spatial spectrum information, the sound source information of the N sound sources of the first audio signal can be determined according to the spatial spectrum information.
In this embodiment, based on the spatial spectrum information, the sound source information of the N sound sources of the first audio signal can be determined with relatively little computation, reducing the consumption of computing power while providing prior information for the audio separation processing.
In some embodiments, in combination with the above example, the sound source information includes at least one of a number of sound sources and a bearing of the sound sources.
In the case where the sound source information includes a sound source bearing, the sound source bearing may provide a corresponding reference for determining which sound source's sound source information is specific during the subsequent audio separation process.
Specifically, in some application scenarios, the positions of the sound sources relative to the recording device are fixed, so each sound source orientation may, to some extent, correspond to one sound source. Therefore, once the sound source orientations are determined, they can be used as prior information in the audio separation process, providing a reference for determining the sound source to which each sub-audio signal belongs and improving the accuracy of audio separation.
In the case where the sound source information includes the number of sound sources, the electronic device may allocate a corresponding number of audio processing channels to the first audio signal, each audio processing channel processing the sub-audio signal from one sound source, which improves the audio separation effect. Meanwhile, the electronic device can focus on each sub-audio signal separately, improving the accuracy of audio recognition.
Optionally, in a case where the sound source information includes a sound source bearing, determining sound source information of N sound sources of the first audio signal according to the spatial spectrum information includes:
determining P target orientations according to P pieces of spatial spectrum information, where the P target orientations are in one-to-one correspondence with the P pieces of spatial spectrum information, each target orientation is the preset orientation corresponding to the maximum spatial spectrum in the corresponding spatial spectrum information, the P pieces of spatial spectrum information are associated with P audio frames among the plurality of audio frames, and P is a positive integer;
and determining the sound source position according to the P target positions.
As shown above, the spatial spectrum information may include the correspondence between the spatial spectrum and the preset orientation. Any audio frame corresponds to a sampling point at a certain time; denote this time by t and a preset orientation by θ. In conjunction with fig. 3, once t and θ are determined, a spatial spectrum value is determined.
That is, the spatial spectrum is a function of t and θ, and may be denoted Spectrum(θ, t).
Each audio frame is associated with one piece of spatial spectrum information, from which one target orientation can be determined. In other words, for a given audio frame, its t is fixed and Spectrum(θ, t) is a function of θ alone. By traversing the spatial spectrum over the preset orientations, the maximum spatial spectrum can be found, and the preset orientation corresponding to it is the target orientation (denoted doa in the figure).
In this case, when the target orientation is denoted θ(t):
θ(t) = argmax_θ Spectrum(θ, t)
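The per-frame target orientation is simply an argmax over the preset orientations. A minimal sketch, assuming the spatial spectrum of one frame is given as an array of values over the preset azimuths (the names and the synthetic test spectrum are illustrative, not from the patent):

```python
import numpy as np

def target_orientation(frame_spectrum, orientations):
    """theta(t) = argmax over theta of Spectrum(theta, t) for one frame."""
    return orientations[np.argmax(frame_spectrum)]

orientations = np.arange(0, 360, 10)                       # preset azimuths (deg)
frame = np.exp(-0.5 * ((orientations - 130) / 15.0) ** 2)  # synthetic spectrum peaking at 130 deg
print(target_orientation(frame, orientations))             # 130
```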
In one example, P described above may be equal to 1, and accordingly, the target bearing determined from one audio frame may be directly determined as the sound source bearing. In other words, in this example, the sound source position may be determined in real time.
In another example, P as described above may be equal to 3, and 3 spatial spectral information may be associated with 3 consecutive audio frames in the first audio signal. The 3 spatial spectral information may determine 3 target orientations, and a mode or an average of the 3 target orientations may be determined as the sound source orientation.
In yet another example, P may be equal to the number of audio frames in the first audio signal, and the electronic device may determine the target bearing in the spatial spectral information associated with each audio frame and determine the sound source bearing according to the distribution of the target bearings.
For example, the preset orientation intervals in which the target orientations are concentrated may be determined, and the average value or mode of the target orientations within each interval may be computed to obtain the sound source orientations.
As can be seen from fig. 3, according to the distribution of the target directions, 6 sound source directions can be determined, which are 230 °, 130 °, 310 °, 10 °, 180 ° and 50 °, respectively, and these sound source directions substantially coincide with the angular positions of the users relative to the electronic device in fig. 2.
Of course, in practical applications, the value of P may be selected as needed, and the specific manner of determining the sound source bearing according to the P target bearings may be the real-time determination of the sound source bearing, or the determination of the sound source bearing according to the average or mode of the sound source bearings.
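The two strategies described above for combining P target orientations, taking the mode or the average, can be sketched as follows (illustrative names; the per-frame estimates are hypothetical):

```python
import numpy as np

def source_bearing(target_orientations, use_mode=True):
    """Combine P per-frame target orientations into one sound source
    bearing, via the mode (most frequent value) or the mean."""
    arr = np.asarray(target_orientations)
    if use_mode:
        values, counts = np.unique(arr, return_counts=True)
        return values[np.argmax(counts)]
    return arr.mean()

doas = [130, 130, 140, 130, 120]             # hypothetical per-frame estimates
print(source_bearing(doas))                  # 130 (mode)
print(source_bearing(doas, use_mode=False))  # 130.0 (mean)
```

With P = 1, the single per-frame estimate is returned directly, corresponding to the real-time case above.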
In this embodiment, the corresponding target orientations are determined according to the respective spatial spectrum information, and the sound source orientation is determined according to the target orientations. Therefore, in the embodiment, the sound source position can be conveniently determined by directly comparing and processing the values of the spatial spectrum information, and the calculation power consumption brought by the determination of the sound source position is saved.
Optionally, in a case where the sound source information includes the number of sound sources, determining the sound source information of the N sound sources of the first audio signal according to the spatial spectrum information includes:
determining a target change curve according to Q pieces of spatial spectrum information, where the target change curve is the curve of the total spatial spectrum as a function of the preset orientation, the total spatial spectrum at each preset orientation equals the sum of the Q spatial spectra associated with that preset orientation, the Q pieces of spatial spectrum information are associated with Q consecutive audio frames among the plurality of audio frames, and Q is an integer greater than 1;
and determining the number of sound sources according to the target change curve, where the number of sound sources equals the number of target peaks in the target change curve, and a target peak is a peak whose peak value is greater than a spatial spectrum threshold.
In this embodiment, the value of Q may be adjusted as needed. If the times corresponding to the Q pieces of spatial spectrum information are denoted t_1, t_2, ..., t_i, ..., t_Q, then in any piece of spatial spectrum information the spatial spectrum is a function of θ and may be denoted Spectrum(θ, t_i).
And adding the spatial spectrums in the same preset direction in the Q pieces of spatial spectrum information to obtain the corresponding relation between the total spatial spectrum and the preset direction.
In this embodiment, the Q pieces of spatial spectrum information are processed as a whole; this whole may be denoted block, and correspondingly the total spatial spectrum may be denoted Spectrum(block). Once the Q times are determined, the correspondence between Spectrum(block) and θ may be written as:
Spectrum(block) = ∑_{i=1}^{Q} Spectrum(θ, t_i)
The correspondence between Spectrum(block) and θ can be represented by a target change curve. As shown in fig. 4, fig. 4 is an exemplary graph of the target change curve. In fig. 4, the abscissa is the preset orientation and the ordinate is the total spatial spectrum.
Based on the target change curve, the peaks therein may be determined, for example by computing local extrema, which is not described in detail here.
It is easy to understand that, for any peak, when its peak value, that is, the maximum value of the total spatial spectrum within that peak, is greater than the spatial spectrum threshold, it can be considered that there is a valid sound source at the corresponding preset orientation. If the peak value of a peak is less than or equal to the spatial spectrum threshold, the corresponding preset orientation may contain only occasional sound, interference, or error, and it can be determined that there is no valid sound source at that orientation.
In fig. 4, the peak whose corresponding peak is larger than the spatial spectrum threshold, that is, the target peak, is circled by an ellipse. The number of target peaks can be represented by Num of peaks (spectrum (block) > T), where T is the spatial spectrum threshold described above.
If the number of sound sources is represented by N, then:
N=Num of peaks(Spectrum(block)>T)。
For example, referring to fig. 4, when the spatial spectrum threshold is equal to 0.24, the number of sound sources determined from the target change curve is equal to 5, corresponding to the five positions circled by ellipses.
In this embodiment, based on the Q pieces of spatial spectrum information, the change curve of the total spatial spectrum along with the preset orientation, that is, the target change curve described above, may be obtained. On the basis of the target change curve, the number of sound sources can be obtained simply by counting the target wave crests whose peak values are greater than the spatial spectrum threshold. The method for determining the number of sound sources is therefore simple, and computing resources can be effectively saved.
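The counting rule N = Num of peaks(Spectrum(block) > T) can be sketched as follows, using a simple local-maximum test in place of whichever extremum search an implementation actually uses (the function name and sample curve are illustrative):

```python
import numpy as np

def count_sources(block_spectrum, threshold):
    """N = Num of peaks(Spectrum(block) > T): count wave crests of the
    target change curve whose peak value exceeds the threshold T."""
    s = np.asarray(block_spectrum, dtype=float)
    n = 0
    for i in range(1, len(s) - 1):
        # local maximum whose peak value is above the threshold
        if s[i] > s[i - 1] and s[i] >= s[i + 1] and s[i] > threshold:
            n += 1
    return n

curve = [0.1, 0.5, 0.2, 0.3, 0.9, 0.3, 0.15, 0.4, 0.1]
n = count_sources(curve, 0.24)   # 3 crests exceed 0.24: 0.5, 0.9, 0.4
```

Crests at or below the threshold are ignored, matching the treatment of occasional sound, interference, or error described above.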
The following describes an audio processing method provided by the embodiments of the present application with reference to a specific application example.
As shown in fig. 5, in this specific application example, the audio processing method may be applied to an electronic device including a plurality of microphones, and the method may roughly include the following steps:
Step 501, inputting a signal;
In this step, conference audio signals acquired by a plurality of microphones, that is, the initial audio signals, may be acquired.
Step 502, front-end voice signal processing;
In this step, the conference audio signal may be subjected to audio processing such as echo cancellation, noise suppression, and dereverberation; that is, the initial audio signal is preprocessed to obtain the first audio signal.
Step 503, acquiring sound source information;
In this step, sound source information, which may include a sound source bearing and a number of sound sources, is obtained from the first audio signal. The manner of acquiring the sound source information is specifically exemplified below.
Step 504, audio separation and voice recognition;
In this step, audio separation and speech recognition are performed on the first audio signal, using the sound source information as prior data.
Generally, audio separation and speech recognition each correspond to a deep learning model. With the prior data of the sound source information, the classification difficulty of the deep learning model can be effectively reduced and the classification accuracy improved, so that a more reliable speech recognition result can be obtained.
The sound source information acquisition in step 503 may be implemented based on a preset sound source information auxiliary module.
As shown in fig. 6, after the first audio signal is input to the sound source information auxiliary module, the sound source information auxiliary module may feed the first audio signal into two paths, named the real-time path and the block path.
The real-time path outputs sound source information of the sound source bearing according to the spatial spectrum information of the current frame; the block path superposes the spatial spectra in the same preset orientation within a preset time period (denoted a block time period) to obtain the target change curve, and outputs sound source information of the number of sound sources based on the target change curve.
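A minimal sketch of the two paths, assuming each frame's spatial spectrum is a vector sampled over the preset orientations (the names and shapes here are assumptions for illustration, not the embodiment's actual interfaces):

```python
import numpy as np

def realtime_bearing(frame_spectrum, azimuths):
    """Real-time path: the sound source bearing of the current frame is the
    preset orientation at which that frame's spatial spectrum is maximal."""
    return azimuths[int(np.argmax(frame_spectrum))]

def block_source_count(frame_spectra, threshold):
    """Block path: superpose the spectra of Q consecutive frames per
    orientation, then count local maxima of the curve above the threshold."""
    total = np.asarray(frame_spectra, dtype=float).sum(axis=0)
    return sum(
        1
        for i in range(1, len(total) - 1)
        if total[i] > total[i - 1]
        and total[i] >= total[i + 1]
        and total[i] > threshold
    )

azimuths = np.arange(0, 360, 10)   # preset orientations in degrees
frame = np.zeros(len(azimuths))
frame[9] = 1.0                     # spectrum peaks at 90 degrees
bearing = realtime_bearing(frame, azimuths)                 # -> 90
n_sources = block_source_count(np.stack([frame] * 3), 0.5)  # -> 1
```

The real-time path needs only the current frame, while the block path accumulates a block time period of frames before deciding.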
As this application example shows, the audio processing method provided by the embodiment of the present application can make full use of the spatial spectrum information to determine more sound source information; the sound source information provides prior information for the sound source separation processing, which simplifies the audio separation algorithm and improves the accuracy of the audio separation result; in addition, the embodiment of the present application can effectively alleviate problems such as the cocktail party effect.
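As one illustration of how an estimated sound source bearing can serve as a prior for separation, a beamformer can be steered toward that bearing; this is a sketch of the general idea only, not the embodiment's deep-learning separation, and the linear array geometry, function name, and integer-sample delays are all simplifying assumptions:

```python
import numpy as np

def delay_and_sum(signals, mic_x, bearing_deg, fs, c=343.0):
    """Steer a linear microphone array toward bearing_deg by aligning the
    channels with integer-sample delays, then averaging them.

    signals: (num_mics, num_samples) time-domain microphone signals.
    mic_x:   microphone positions along the array axis, in metres.
    """
    bearing = np.deg2rad(bearing_deg)
    delays = mic_x * np.cos(bearing) / c                   # seconds per mic
    delays_smp = np.round((delays - delays.min()) * fs).astype(int)
    num_samples = signals.shape[1]
    out = np.zeros(num_samples)
    for sig, d in zip(signals, delays_smp):
        out[: num_samples - d] += sig[d:]                  # advance channel by d
    return out / len(signals)

# Broadside source (90 degrees): both channels are already aligned, so the
# output is simply the channel average.
mics = np.array([0.0, 0.05])
mixture = np.ones((2, 4))
enhanced = delay_and_sum(mixture, mics, 90.0, 16000)
```

Signals arriving from the steered bearing add coherently while sources at other bearings are attenuated, which is one concrete way a bearing prior eases the separation task.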
It should be noted that, in the audio processing method provided in the embodiment of the present application, the execution body may be an audio processing apparatus, or a control module in the audio processing apparatus for executing the audio processing method. In the embodiment of the present application, an audio processing apparatus executing the audio processing method is taken as an example to describe the audio processing apparatus provided in the embodiment of the present application.
As shown in fig. 7, an audio processing apparatus 700 provided in an embodiment of the present application includes:
an obtaining module 701, configured to obtain a first audio signal;
a determining module 702, configured to determine sound source information of N sound sources of the first audio signal, where N is a positive integer;
a separating module 703, configured to separate at least one sub audio signal from the first audio signal according to the sound source information, where one sub audio signal is an audio signal of one sound source of the N sound sources.
Optionally, the first audio signal includes a plurality of audio frames, and one audio frame is associated with one piece of spatial spectrum information, where the spatial spectrum information includes a correspondence between a spatial spectrum and a preset orientation;
the determining module 702 may be specifically configured to:
sound source information of N sound sources of the first audio signal is determined from the spatial spectrum information.
Optionally, the sound source information includes at least one of a number of sound sources and a bearing of the sound sources.
Optionally, the determining module 702 may include:
the first determining unit is used for determining P target orientations according to the P spatial spectrum information, the P target orientations correspond to the P spatial spectrum information one by one, each target orientation is a preset orientation corresponding to the maximum spatial spectrum in the corresponding spatial spectrum information, the P spatial spectrum information is associated with P audio frames in the plurality of audio frames, and P is a positive integer;
and the second determining unit is used for determining the sound source position according to the P target positions.
Optionally, the determining module 702 may include:
a third determining unit, configured to determine a target variation curve according to the Q pieces of spatial spectrum information, where the target variation curve is a variation curve of a total spatial spectrum along with preset orientations, the total spatial spectrum at each preset orientation is equal to a sum of Q spatial spectrums associated with each preset orientation, the Q pieces of spatial spectrum information are associated with Q consecutive audio frames in the multiple audio frames, and Q is an integer greater than 1;
and the fourth determining unit is used for determining the number of sound sources according to the target change curve, wherein the number of the sound sources is equal to the number of target wave crests in the target change curve, and the target wave crests are wave crests of which the corresponding peak values are larger than the spatial spectrum threshold.
The audio processing apparatus provided in the embodiment of the present application acquires a first audio signal and determines sound source information of N sound sources of the first audio signal. The sound source information can serve as prior information applied to separating the sub audio signal of each sound source from the first audio signal, which helps reduce the difficulty of the audio separation processing, improves the accuracy of the separated sub audio signal of each sound source, and improves the audio processing effect.
The audio processing device in the embodiment of the present application may be a device, and may also be a component, an integrated circuit, or a chip in a terminal. The device may be a mobile electronic device or a non-mobile electronic device. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), and the like, and the non-mobile electronic device may be a server, a Network Attached Storage (NAS), a Personal Computer (PC), a Television (TV), a teller machine or a self-service machine, and the like; the embodiments of the present application are not particularly limited.
The audio processing apparatus in the embodiment of the present application may be an apparatus having an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system; the embodiments of the present application are not specifically limited.
The audio processing apparatus provided in the embodiment of the present application can implement each process implemented by the method embodiments in fig. 1 to fig. 6, and is not described herein again to avoid repetition.
Optionally, as shown in fig. 8, an electronic device 800 is further provided in this embodiment of the present application, and includes a processor 801, a memory 802, and a program or an instruction stored in the memory 802 and executable on the processor 801, where the program or the instruction is executed by the processor 801 to implement each process of the foregoing audio processing method embodiment, and can achieve the same technical effect, and no further description is provided here to avoid repetition.
It should be noted that the electronic devices in the embodiments of the present application include the mobile electronic device and the non-mobile electronic device described above.
Fig. 9 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 900 includes, but is not limited to: a radio frequency unit 901, a network module 902, an audio output unit 903, an input unit 904, a sensor 905, a display unit 906, a user input unit 907, an interface unit 908, a memory 909, and a processor 910.
Those skilled in the art will appreciate that the electronic device 900 may further include a power source (e.g., a battery) for supplying power to the various components, and the power source may be logically connected to the processor 910 through a power management system, so as to manage charging, discharging, and power consumption through the power management system. The electronic device structure shown in fig. 9 does not constitute a limitation of the electronic device; the electronic device may include more or fewer components than those shown, combine some components, or arrange the components differently, which is not repeated here.
The processor 910 is configured to obtain a first audio signal;
determining sound source information of N sound sources of a first audio signal, wherein N is a positive integer;
at least one sub audio signal is separated from the first audio signal according to the sound source information, wherein one sub audio signal is an audio signal of one of the N sound sources.
The electronic device provided by the embodiment of the application acquires a first audio signal, determines sound source information of N sound sources of the first audio signal, and separates at least one sub audio signal from the first audio signal according to the sound source information, wherein one sub audio signal is an audio signal of one sound source of the N sound sources. By determining the sound source information of the N sound sources of the first audio signal, the embodiment of the application can provide prior data for the subsequent separation processing of the first audio signal, which facilitates accurate separation of the audio signal of each sound source and improves the audio processing effect.
Optionally, the first audio signal includes a plurality of audio frames, and one audio frame is associated with one piece of spatial spectrum information, where the spatial spectrum information includes a correspondence between a spatial spectrum and a preset orientation;
accordingly, the processor 910 is specifically configured to determine sound source information of N sound sources of the first audio signal according to the spatial spectrum information.
Optionally, the sound source information includes at least one of a number of sound sources and a bearing of the sound sources.
Optionally, the processor 910 may further be configured to:
determining P target orientations according to the P spatial spectrum information, wherein the P target orientations correspond to the P spatial spectrum information one by one, each target orientation is a preset orientation corresponding to the maximum spatial spectrum in the corresponding spatial spectrum information, the P spatial spectrum information is associated with P audio frames in the plurality of audio frames, and P is a positive integer;
and determining the sound source position according to the P target positions.
Optionally, the processor 910 may further be configured to:
determining a target change curve according to the Q pieces of spatial spectrum information, wherein the target change curve is a change curve of a total spatial spectrum along with preset positions, the total spatial spectrum at each preset position is equal to the sum of the Q pieces of spatial spectra associated with each preset position, the Q pieces of spatial spectrum information are associated with continuous Q audio frames in the plurality of audio frames, and Q is an integer greater than 1;
and determining the number of sound sources according to the target change curve, wherein the number of the sound sources is equal to the number of target wave crests in the target change curve, and the target wave crests are wave crests of which the corresponding peak values are larger than a spatial spectrum threshold value.
It should be understood that, in the embodiment of the present application, the input Unit 904 may include a Graphics Processing Unit (GPU) 9041 and a microphone 9042, and the Graphics Processing Unit 9041 processes image data of a still picture or a video obtained by an image capturing device (such as a camera) in a video capturing mode or an image capturing mode. The display unit 906 may include a display panel 9061, and the display panel 9061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 907 includes a touch panel 9071 and other input devices 9072. The touch panel 9071 is also referred to as a touch screen, and may include two parts, a touch detection device and a touch controller. Other input devices 9072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein. The memory 909 can be used to store software programs as well as various data, including but not limited to application programs and an operating system. The processor 910 may integrate an application processor, which primarily handles the operating system, user interfaces, and applications, and a modem processor, which primarily handles wireless communications. It is to be appreciated that the modem processor may not be integrated into the processor 910.
The embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the program or the instruction implements each process of the embodiment of the audio processing method, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The processor is the processor in the electronic device in the above embodiment. The readable storage medium includes a computer-readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to execute a program or an instruction to implement each process of the above-mentioned audio processing method embodiment, and can achieve the same technical effect, and in order to avoid repetition, the description is omitted here.
It should be understood that the chips mentioned in the embodiments of the present application may also be referred to as system-on-chip, system-on-chip or system-on-chip, etc.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may include performing the functions in a substantially simultaneous manner or in a reverse order based on the functions involved; for example, the methods described may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a computer software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, or a network device) to execute the method of the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. An audio processing method, comprising:
acquiring a first audio signal;
determining sound source information of N sound sources of the first audio signal, wherein N is a positive integer;
separating at least one sub audio signal from the first audio signal according to the sound source information, wherein one sub audio signal is an audio signal of one of the N sound sources.
2. The method according to claim 1, wherein the first audio signal comprises a plurality of audio frames, and one audio frame is associated with one spatial spectrum information, and the spatial spectrum information comprises a corresponding relationship between a spatial spectrum and a preset orientation;
the determining sound source information of N sound sources of the first audio signal specifically includes:
and determining sound source information of N sound sources of the first audio signal according to the spatial spectrum information.
3. The method of claim 2, wherein the sound source information comprises at least one of a number of sound sources and a bearing of sound sources.
4. The method of claim 3, wherein in the case that the sound source information includes the sound source bearing, the determining the sound source information of the N sound sources of the first audio signal according to the spatial spectrum information comprises:
determining P target orientations according to the P pieces of spatial spectrum information, wherein the P target orientations correspond to the P pieces of spatial spectrum information one by one, each target orientation is a preset orientation corresponding to the maximum spatial spectrum in the corresponding spatial spectrum information, the P pieces of spatial spectrum information are associated with P audio frames in the plurality of audio frames, and P is a positive integer;
and determining the sound source position according to the P target positions.
5. The method of claim 3, wherein in the case that the sound source information includes the number of sound sources, the determining the sound source information of the N sound sources of the first audio signal according to the spatial spectrum information comprises:
determining a target change curve according to the Q pieces of spatial spectrum information, wherein the target change curve is a change curve of a total spatial spectrum along with preset positions, the total spatial spectrum at each preset position is equal to the sum of the Q pieces of spatial spectrum associated with each preset position, the Q pieces of spatial spectrum information are associated with Q continuous audio frames in the plurality of audio frames, and Q is an integer greater than 1;
and determining the number of the sound sources according to the target change curve, wherein the number of the sound sources is equal to the number of target wave crests in the target change curve, and the target wave crests are wave crests of which the corresponding peak values are larger than a spatial spectrum threshold value.
6. An audio processing apparatus, comprising:
the acquisition module is used for acquiring a first audio signal;
a determining module, configured to determine sound source information of N sound sources of the first audio signal, where N is a positive integer;
a separation module, configured to separate at least one sub audio signal from the first audio signal according to the sound source information, where one sub audio signal is an audio signal of one of the N sound sources.
7. The apparatus according to claim 6, wherein the first audio signal comprises a plurality of audio frames, and one audio frame is associated with one spatial spectrum information, and the spatial spectrum information comprises a corresponding relationship between a spatial spectrum and a preset orientation;
the determining module is specifically configured to:
and determining sound source information of N sound sources of the first audio signal according to the spatial spectrum information.
8. The apparatus of claim 7, wherein the sound source information comprises at least one of a number of sound sources and a bearing of sound sources.
9. The apparatus of claim 8, wherein the determining module comprises:
a first determining unit, configured to determine P target orientations according to P spatial spectrum information, where the P target orientations correspond to P spatial spectrum information one to one, each target orientation is a preset orientation corresponding to a maximum spatial spectrum in the corresponding spatial spectrum information, the P spatial spectrum information is associated with P audio frames in the multiple audio frames, and P is a positive integer;
a second determining unit, configured to determine the sound source bearing according to the P target bearings.
10. The apparatus of claim 8, wherein the determining module comprises:
a third determining unit, configured to determine a target variation curve according to Q pieces of spatial spectrum information, where the target variation curve is a variation curve of a total spatial spectrum along a preset direction, a total spatial spectrum at each preset direction is equal to a sum of Q spatial spectrums associated with each preset direction, Q pieces of spatial spectrum information are associated with Q consecutive audio frames in the multiple audio frames, and Q is an integer greater than 1;
a fourth determining unit, configured to determine the number of sound sources according to the target variation curve, where the number of sound sources is equal to the number of target peaks in the target variation curve, and the target peaks are peaks whose corresponding peak values are greater than a spatial spectrum threshold.
CN202111000865.4A 2021-08-30 2021-08-30 Audio processing method and device Pending CN113707149A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111000865.4A CN113707149A (en) 2021-08-30 2021-08-30 Audio processing method and device


Publications (1)

Publication Number Publication Date
CN113707149A true CN113707149A (en) 2021-11-26


Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111000865.4A Pending CN113707149A (en) 2021-08-30 2021-08-30 Audio processing method and device

Country Status (1)

Country Link
CN (1) CN113707149A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114512141A (en) * 2022-02-09 2022-05-17 腾讯科技(深圳)有限公司 Method, apparatus, device, storage medium and program product for audio separation

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105467364A (en) * 2015-11-20 2016-04-06 百度在线网络技术(北京)有限公司 Method and apparatus for localizing target sound source
CN109859749A (en) * 2017-11-30 2019-06-07 阿里巴巴集团控股有限公司 A kind of voice signal recognition methods and device
CN111445920A (en) * 2020-03-19 2020-07-24 西安声联科技有限公司 Multi-sound-source voice signal real-time separation method and device and sound pick-up
CN111933182A (en) * 2020-08-07 2020-11-13 北京字节跳动网络技术有限公司 Sound source tracking method, device, equipment and storage medium
CN113056925A (en) * 2018-08-06 2021-06-29 阿里巴巴集团控股有限公司 Method and device for detecting sound source position
CN113064118A (en) * 2021-03-19 2021-07-02 维沃移动通信有限公司 Sound source positioning method and device




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination