CN113571033B - Accompaniment step-back detection method, device and computer-readable storage medium - Google Patents

Accompaniment step-back detection method, device and computer-readable storage medium

Info

Publication number
CN113571033B
CN113571033B (application number CN202110791459.8A)
Authority
CN
China
Prior art keywords
audio signal
accompaniment
high frequency
target
power spectrum
Prior art date
Legal status
Active
Application number
CN202110791459.8A
Other languages
Chinese (zh)
Other versions
CN113571033A
Inventor
张超鹏
Current Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202110791459.8A priority Critical patent/CN113571033B/en
Publication of CN113571033A publication Critical patent/CN113571033A/en
Application granted granted Critical
Publication of CN113571033B publication Critical patent/CN113571033B/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 1/00: Details of electrophonic musical instruments
    • G10H 1/36: Accompaniment arrangements
    • G10H 2210/00: Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H 2210/005: Musical accompaniment, i.e. complete instrumental rhythm synthesis added to a performed melody, e.g. as output by drum machines
    • G10H 2210/375: Tempo or beat alterations; Music timing control
    • G10H 2210/391: Automatic tempo adjustment, correction or control

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The application discloses an accompaniment step-back detection method, an electronic device and a computer-readable storage medium. A recorded target audio signal and its corresponding accompaniment audio signal are acquired; the power spectrum of the target audio signal and the power spectrum of the accompaniment audio signal are obtained through Fourier transformation; medium-high frequency point information of the target audio signal and of the accompaniment audio signal is then derived from the two power spectra; and if the target similarity between the medium-high frequency point information of the target audio signal and that of the accompaniment audio signal is greater than a threshold, it is determined that accompaniment step-back exists in the target audio signal. By introducing power spectrum information, medium-high frequency point information and a similarity decision into accompaniment step-back detection, the method performs the detection quickly without relying on an existing AEC tool, with a small amount of calculation and high detection efficiency.

Description

Accompaniment step-back detection method, device and computer-readable storage medium
Technical Field
The present application relates to the field of audio processing technology, and more particularly, to an accompaniment step-back detection method, apparatus, and computer readable storage medium.
Background
During audio recording, users generally wear headphones to listen to the accompaniment. If the headphones leak sound, or the accompaniment is played out loud, the accompaniment is captured in the recorded audio; this accompaniment step-back greatly reduces the quality of the recording. To ensure recording quality, accompaniment step-back detection must be performed on the recorded audio, for example by calculating the energy ratio of the echo signal to the target signal with an acoustic echo cancellation (Acoustic Echo Cancellation, AEC) tool, or by deriving a step-back probability with the Double Talk Detection technique used in such methods. However, these schemes depend strongly on the output quality of the AEC, involve a relatively large amount of calculation, and therefore detect accompaniment step-back inefficiently.
In summary, how to quickly perform accompaniment stepping back detection is a problem to be solved by those skilled in the art.
Disclosure of Invention
Accordingly, the present application is directed to an accompaniment step-back detection method, device and computer-readable storage medium that can effectively increase the speed of accompaniment step-back detection. The specific scheme is as follows:
In a first aspect, the application discloses an accompaniment stepping back detection method, which comprises the following steps:
acquiring a recorded target audio signal and a corresponding accompaniment audio signal thereof;
performing Fourier transformation on the target audio signal and the accompaniment audio signal to obtain a power spectrum of the target audio signal and a power spectrum of the accompaniment audio signal, respectively;
obtaining medium-high frequency point information of the target audio signal and medium-high frequency point information of the accompaniment audio signal based on the power spectrum of the target audio signal and the power spectrum of the accompaniment audio signal, respectively;
if the target similarity between the medium-high frequency point information of the target audio signal and the medium-high frequency point information of the accompaniment audio signal is greater than a threshold, determining that accompaniment step-back exists in the target audio signal.
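For illustration, the sketch below (Python, not part of the patent) wires these four steps together in a deliberately simplified form: the medium-high frequency point information of each frame is reduced to the set of bins above 1000 Hz whose power exceeds the frame mean, and the per-frame Jaccard similarity is averaged and compared with the 0.05 example threshold mentioned later in the description. All function names are illustrative.

```python
# Minimal, simplified sketch of the claimed flow; names and the bin-selection
# rule are illustrative assumptions, not taken verbatim from the patent.
import numpy as np
from scipy.signal import stft

def midhigh_bin_sets(x, fs, f_lo=1000.0, frame_ms=30, hop_ms=10):
    """Per-frame sets of medium-high frequency bins whose power exceeds the frame mean."""
    nper = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    f, _, X = stft(x, fs=fs, window='hann', nperseg=nper, noverlap=nper - hop)
    P = np.abs(X) ** 2                        # power spectrum, shape (bins, frames)
    band = f >= f_lo                          # keep only the medium-high frequency band
    sets = []
    for n in range(P.shape[1]):
        p = P[band, n]
        sets.append(set(np.flatnonzero(p > p.mean())))
    return sets

def accompaniment_stepback(recorded, accomp, fs, threshold=0.05):
    """True if the mean per-frame Jaccard similarity of the bin sets exceeds the threshold."""
    rec_sets = midhigh_bin_sets(recorded, fs)
    acc_sets = midhigh_bin_sets(accomp, fs)
    scores = []
    for r, a in zip(rec_sets, acc_sets):      # frames are compared index by index
        union = r | a
        if union:
            scores.append(len(r & a) / len(union))
    return bool(scores) and float(np.mean(scores)) > threshold
```

A caller would invoke, for example, accompaniment_stepback(recorded, accomp, fs=44100) with two mono arrays recorded and referenced at the same sampling rate.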
Optionally, the obtaining the medium-high frequency point information of the target audio signal and the medium-high frequency point information of the accompaniment audio signal based on the power spectrum of the target audio signal and the power spectrum of the accompaniment audio signal respectively includes:
performing envelope-removal processing on the power spectrum of the target audio signal and the power spectrum of the accompaniment audio signal to obtain a de-envelope power spectrum of the target audio signal and a de-envelope power spectrum of the accompaniment audio signal, respectively;
Respectively obtaining a target medium-high frequency harmonic point of the target audio signal and a target medium-high frequency harmonic point of the accompaniment audio signal based on the de-envelope power spectrum of the target audio signal and the de-envelope power spectrum of the accompaniment audio signal;
Taking a target medium-high frequency harmonic point of the target audio signal as medium-high frequency point information of the target audio signal;
and taking the target medium-high frequency harmonic point of the accompaniment audio signal as medium-high frequency point information of the accompaniment audio signal.
Optionally, the obtaining the target medium-high frequency harmonic point of the target audio signal and the target medium-high frequency harmonic point of the accompaniment audio signal based on the de-envelope power spectrum of the target audio signal and the de-envelope power spectrum of the accompaniment audio signal respectively includes:
for each selected frame of the de-envelope power spectrum of the target audio signal and of the de-envelope power spectrum of the accompaniment audio signal, respectively determining an initial medium-high frequency harmonic point of the target audio signal and an initial medium-high frequency harmonic point of the accompaniment audio signal whose power values are greater than the corresponding average power values;
And respectively selecting the initial medium-high frequency harmonic point of the target audio signal and the harmonic point of the power value in the initial medium-high frequency harmonic point of the accompaniment audio signal at the peak position as the target medium-high frequency harmonic point of the target audio signal and the target medium-high frequency harmonic point of the accompaniment audio signal.
Optionally, for each selected frame of the de-envelope power spectrum of the target audio signal and the de-envelope power spectrum of the accompaniment audio signal, determining an initial mid-high frequency harmonic point of the target audio signal and an initial mid-high frequency harmonic point of the accompaniment audio signal, where power values are greater than corresponding average power values, respectively, including:
determining a medium-high frequency point range in which the density of the target audio signal is smaller than a preset density as the target frequency point range;
And respectively determining an initial medium-high frequency harmonic point of the target audio signal with a power value larger than a corresponding average power value and an initial medium-high frequency harmonic point of the accompaniment audio signal in the target frequency point range for the selected de-envelope power spectrum of the target audio signal and the de-envelope power spectrum of the accompaniment audio signal of each frame.
Optionally, the performing of envelope-removal processing on the power spectrum of the target audio signal and the power spectrum of the accompaniment audio signal to obtain the de-envelope power spectrum of the target audio signal and the de-envelope power spectrum of the accompaniment audio signal respectively includes:
taking the logarithm of the power spectrum of the target audio signal and of the power spectrum of the accompaniment audio signal to obtain a log power spectrum of the target audio signal and a log power spectrum of the accompaniment audio signal;
performing envelope-removal processing on the log power spectrum of the target audio signal and the log power spectrum of the accompaniment audio signal to obtain the de-envelope power spectrum of the target audio signal and the de-envelope power spectrum of the accompaniment audio signal, respectively.
Optionally, the removing the envelope processing on the log power spectrum of the target audio signal and the log power spectrum of the accompaniment audio signal to obtain a de-envelope power spectrum of the target audio signal and a de-envelope power spectrum of the accompaniment audio signal respectively includes:
Based on a de-envelope processing formula, carrying out zero-phase delay filtering processing on the logarithmic power spectrum of the target audio signal and the logarithmic power spectrum of the accompaniment audio signal to respectively obtain a de-envelope power spectrum of the target audio signal and a de-envelope power spectrum of the accompaniment audio signal;
The de-envelope processing formula comprises:
L̂p(k, n) = Lp(k, n) - filtfilt(b, a, Lp(k, n))
where L̂p(k, n) represents the de-envelope power spectrum; Lp(k, n) represents the log power spectrum; filtfilt denotes a zero-phase delay filtering algorithm; b and a represent filtering parameters; n represents the frame index; and k represents the frequency point index within the frame.
Optionally, if the target similarity between the medium-high frequency point information of the target audio signal and the medium-high frequency point information of the accompaniment audio signal is greater than a threshold, the determining that accompaniment step-back exists in the target audio signal includes:
calculating the union value of the medium-high frequency point information of the target audio signal and the medium-high frequency point information of the accompaniment audio signal of each same frame;
Calculating intersection values of the medium-high frequency point information of the target audio signal and the medium-high frequency point information of the accompaniment audio signal of each same frame;
Determining the ratio of the intersection value and the union value corresponding to the same frame as Jaccard similarity values of the middle-high frequency point information of the target audio signal and the middle-high frequency point information of the accompaniment audio signal of the same frame;
determining the target similarity between the middle-high frequency point information of the target audio signal and the middle-high frequency point information of the accompaniment audio signal based on the Jaccard similarity value of the middle-high frequency point information of the target audio signal and the middle-high frequency point information of the accompaniment audio signal of each same frame;
and if the target similarity is larger than the threshold value, determining that accompaniment stepping back exists in the target audio signal.
Optionally, the determining the target similarity between the middle-high frequency point information of the target audio signal and the middle-high frequency point information of the accompaniment audio signal based on the middle-high frequency point information of the target audio signal and the Jaccard similarity value of the middle-high frequency point information of the accompaniment audio signal in the same frame includes:
And determining the average value of Jaccard similarity values of the medium-high frequency point information of the target audio signal and the medium-high frequency point information of the accompaniment audio signal of each same frame in a preset duration as the target similarity.
Optionally, the acquiring the recorded target audio signal includes:
acquiring a recorded initial audio signal;
And performing delay compensation on the initial audio signal to obtain the target audio signal.
In a second aspect, the present application discloses an electronic device, comprising:
A memory for storing a computer program;
and a processor for implementing the steps of the accompaniment stepping back detection method as any one of the above when executing the computer program.
In a third aspect, the present application discloses a computer-readable storage medium having stored therein a computer program which, when executed by a processor, implements the steps of the accompaniment step-back detection method as described in any of the above.
In the application, a recorded target audio signal and its corresponding accompaniment audio signal are first acquired; a power spectrum of the target audio signal and a power spectrum of the accompaniment audio signal are then obtained through Fourier transformation; medium-high frequency point information of the target audio signal and of the accompaniment audio signal is further obtained; and finally, whether the two sets of medium-high frequency point information are similar is judged: if their target similarity is greater than a threshold, it is determined that accompaniment step-back exists in the target audio signal. The application thus introduces power spectrum information, medium-high frequency point information and a similarity decision into accompaniment step-back detection, can perform the detection quickly, does not need an existing AEC tool, and has a small amount of calculation and high detection efficiency. The electronic device and the computer-readable storage medium disclosed by the application solve the corresponding technical problems in the same way.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the description of the embodiments or of the prior art are briefly introduced below. It is apparent that the drawings in the following description show only embodiments of the present application; other drawings can be obtained from them by a person of ordinary skill in the art without inventive effort.
Fig. 1 is a schematic diagram of a system framework to which the accompaniment step-back detection scheme provided by the present application is applicable;
fig. 2 is a flowchart of an accompaniment stepping back detection method according to an embodiment of the present application;
Fig. 3 is a flowchart of a specific accompaniment stepping back detection method according to an embodiment of the present application;
FIG. 4 is a graph showing the original log power spectrum and the trend of the log power spectrum;
fig. 5 is a flowchart of a specific accompaniment stepping back detection method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a target de-envelope power spectrum and target medium-high frequency harmonic points;
fig. 7 is a flowchart of a specific accompaniment stepping back detection method according to an embodiment of the present application;
Fig. 8 is a flowchart of a specific accompaniment stepping back detection method according to an embodiment of the present application;
FIG. 9 is a schematic diagram of accompaniment step-back detection of recorded audio in accordance with aspects of the present application;
fig. 10 is a schematic structural diagram of an accompaniment step-back detecting device according to the present application;
Fig. 11 is a block diagram of an electronic device according to the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
During audio recording, users generally wear headphones to listen to the accompaniment. If the headphones leak sound, or the accompaniment is played out loud, the accompaniment is captured in the recorded audio; this accompaniment step-back greatly reduces the quality of the recording. To ensure recording quality, accompaniment step-back detection must be performed on the recorded audio, for example by calculating the energy ratio of the echo signal to the target signal with an acoustic echo cancellation (Acoustic Echo Cancellation, AEC) tool, or by deriving a step-back probability with the Double Talk Detection technique used in such methods. However, these schemes depend strongly on the output quality of the AEC, involve a relatively large amount of calculation, and therefore detect accompaniment step-back inefficiently. To overcome these technical problems, the application provides an accompaniment step-back detection scheme that can improve the efficiency of accompaniment step-back detection.
The accompaniment step-back detection scheme provided by the application can be flexibly deployed at the front end of an APP on a device, at a server, or the like, depending on the application scenario. For example, while a user records through an APP installed on a client, a client applying the scheme can perform accompaniment step-back detection during the recording process to help the user improve recording quality. On a server that analyses the sound quality of dry (vocal-only) audio, the scheme can identify dry tracks with obvious accompaniment step-back, so that later operations such as sound-effect processing and tone scoring can be performed accurately on the dry vocals, or so that high-quality dry-vocal data can be provided for training neural networks that separate vocals from accompaniment. For ease of understanding, it is assumed below that the accompaniment step-back detection scheme is applied to a server; the system framework adopted may be as shown in fig. 1 and includes a background server 01 and a number of clients 02 that establish communication connections with the background server 01.
In the application, the background server 01 is used for executing the steps of the accompaniment step-back detection method, including: acquiring a recorded target audio signal and its corresponding accompaniment audio signal; performing Fourier transformation on the target audio signal and the accompaniment audio signal to obtain a power spectrum of each; obtaining medium-high frequency point information of the target audio signal and of the accompaniment audio signal based on the respective power spectra; and, if the target similarity between the medium-high frequency point information of the target audio signal and that of the accompaniment audio signal is greater than a threshold, determining that accompaniment step-back exists in the target audio signal.
Furthermore, the background server 01 can also be provided with an audio database, a power spectrum database, a medium-high frequency point information database and an accompaniment step-back audio database. The audio database is used for storing various audio signals, such as accompaniment audio signals and target audio signals. The power spectrum database may be used to store various power spectrum information, such as the calculated power spectrum information of the accompaniment audio signal and of the target audio signal. The medium-high frequency point information database can be used to store various medium-high frequency point information, such as that of the target audio signal and of the accompaniment audio signal. The accompaniment step-back audio database may be used to store target audio signals in which accompaniment step-back has been detected, and so on. Of course, the application can also place the audio database in a third-party service server, which collects the audio content data uploaded to it. In this way, when the background server 01 needs certain audio, it can obtain it by initiating a corresponding audio call request to the service server.
In the application, the background server 01 can respond to accompaniment step-back detection requests from one or more user sides 02. It can be understood that the requests initiated by different user sides 02 may concern different recordings of the same accompaniment audio, or recordings of different accompaniment audios.
Fig. 2 is a flowchart of an accompaniment stepping back detection method according to an embodiment of the present application. Referring to fig. 2, the accompaniment stepping back detection method includes:
step S101: and acquiring the recorded target audio signal and the corresponding accompaniment audio signal thereof.
In this embodiment, in order to perform accompaniment step-back detection, the recorded target audio signal and the accompaniment audio signal corresponding to it need to be acquired first. The accompaniment referred to in the present application is a musical term referring to the instrumental performance that accompanies singing; for vocal music, the part other than the vocals is called the accompaniment.
It should be noted that the type and content of the accompaniment audio may be determined according to the application scenario; for example, during a singer's vocal recording, the accompaniment audio may be the musical accompaniment of the song, etc. The type and content of the target audio can likewise be determined according to the application scenario; still taking the singer's vocal recording as an example, the target audio may be the singer's singing recorded over that accompaniment.
Step S102: and carrying out Fourier transformation on the target audio signal and the accompaniment audio signal to respectively obtain a power spectrum of the target audio signal and a power spectrum of the accompaniment audio signal.
In this embodiment, in the process of performing accompaniment step-back detection, the power spectrum of the accompaniment audio signal and the power spectrum of the target audio signal need to be calculated, so that accompaniment step-back detection can be performed according to the power spectrum of the accompaniment audio signal and the power spectrum of the target audio signal.
It can be understood that the process of calculating the power spectrum information of the accompaniment audio signal and the power spectrum information of the target audio signal, i.e. of calculating the power spectrum of an audio signal, can be determined flexibly according to actual needs; for example, the power spectrum can be calculated through the Fourier transform, the STFT (short-time Fourier transform), etc. Assuming the audio signal to be processed is x(i), i = 1, 2, …, where i denotes the sample position, the specific process of calculating its power spectrum information through the STFT can be as follows:
Based on the formula x_w(n, i) = x(L·n + i)·w_hann(i), the audio signal is segmented into frames with a preset frame shift and a preset frame length, and a Hanning window is applied to obtain a sequence of windowed frame signals, where x_w(n, i) denotes the windowed frame signal sequence; L denotes the preset frame shift, whose value can be determined according to actual needs, for example 10 ms; n denotes the frame index; and the definition of the Hanning window can be determined according to actual needs, for example w_hann(i) = 0.5 - 0.5·cos(2πi/(N - 1)), 0 ≤ i ≤ N - 1, where N denotes the window length, corresponding to the preset frame length, whose value can also be determined according to actual needs, for example 30 ms;
By the formula X(k, n) = FFT(x_w(n, i)), a Fourier transform is performed on each windowed frame signal to obtain the spectral distribution of the current frame, where FFT(·) denotes the Fourier transform and k denotes the frequency point index within the frame;
The power spectrum information of each frame signal is then calculated by the formula P(k, n) = |X(k, n)|².
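As a concrete illustration, the following sketch transcribes the three formulas above directly in Python (NumPy assumed); the 30 ms window and 10 ms hop are the example values mentioned in the text, and the real FFT is used purely for convenience.

```python
# Direct transcription of the framing, Hanning windowing, per-frame FFT and
# squared-magnitude steps described above (illustrative helper, not from the patent).
import numpy as np

def power_spectrogram(x, fs, frame_ms=30, hop_ms=10):
    N = int(fs * frame_ms / 1000)                      # window length (preset frame length)
    L = int(fs * hop_ms / 1000)                        # preset frame shift
    i = np.arange(N)
    w_hann = 0.5 - 0.5 * np.cos(2 * np.pi * i / (N - 1))
    n_frames = max(0, 1 + (len(x) - N) // L)
    P = np.empty((N // 2 + 1, n_frames))
    for n in range(n_frames):
        x_w = x[L * n: L * n + N] * w_hann             # x_w(n, i) = x(L*n + i) * w_hann(i)
        X = np.fft.rfft(x_w)                           # X(k, n)
        P[:, n] = np.abs(X) ** 2                       # P(k, n) = |X(k, n)|^2
    return P
```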
Step S103: and respectively obtaining the medium-high frequency point information of the target audio signal and the medium-high frequency point information of the accompaniment audio signal based on the power spectrum of the target audio signal and the power spectrum of the accompaniment audio signal. Step S104: if the target similarity between the middle and high frequency point information of the target audio signal and the middle and high frequency point information of the accompaniment audio signal is larger than a threshold value, determining that the accompaniment back stepping exists in the target audio signal.
In this embodiment, if accompaniment step-back exists in the target audio signal, the accompaniment audio signal is present within the target audio signal, i.e. the two signals are similar; reflected on the power spectrum, the medium-high frequency point information of the target audio signal will be similar to that of the accompaniment audio signal. Therefore, when performing accompaniment step-back detection, the medium-high frequency point information of the target audio signal and of the accompaniment audio signal can be obtained from their respective power spectra, and if the target similarity between the two is greater than the threshold, it is determined that accompaniment step-back exists in the target audio signal.
It can be appreciated that the setting of the threshold determines the accuracy of accompaniment step-back detection to a certain extent, and the threshold can be flexibly determined according to the application scenario, for example, the threshold is set to 0.05.
It should be noted that the medium-high frequency in the present application refers to the medium-high frequency part of the audio frequency domain, with frequency values from 500 Hz to 20000 Hz, and the range of the medium-high frequency point information acquired in this embodiment may be determined as needed. In addition, when comparing the target similarity between the medium-high frequency point information of the target audio signal and the medium-high frequency point information of the accompaniment audio signal with the threshold, that target similarity needs to be calculated first; the calculation method can be determined according to actual needs, for example Euclidean distance, Manhattan distance, Minkowski distance or cosine similarity.
In the accompaniment step-back detection method provided by the application, the recorded target audio signal and its corresponding accompaniment audio signal are first acquired; the power spectrum of the target audio signal and the power spectrum of the accompaniment audio signal are then obtained through Fourier transformation; the medium-high frequency point information of the two signals is further obtained; and finally, whether the two sets of medium-high frequency point information are similar is judged: if their target similarity is greater than the threshold, it can be determined that accompaniment step-back exists in the target audio signal. The application thus introduces power spectrum information, medium-high frequency point information and a similarity decision into accompaniment step-back detection, can perform the detection quickly, does not need an existing AEC tool, and has a small amount of calculation and high detection efficiency.
Fig. 3 is a flowchart of a specific accompaniment stepping back detection method according to an embodiment of the present application. Referring to fig. 3, the accompaniment stepping back detection method includes:
step S201: and acquiring the recorded target audio signal and the corresponding accompaniment audio signal thereof.
Step S202: and carrying out Fourier transformation on the target audio signal and the accompaniment audio signal to respectively obtain a power spectrum of the target audio signal and a power spectrum of the accompaniment audio.
Step S203: and removing envelope processing is carried out on the power spectrum of the target audio signal and the power spectrum of the accompaniment audio signal, so that a de-envelope power spectrum of the target audio signal and a de-envelope power spectrum of the accompaniment audio signal are respectively obtained.
In this embodiment, an envelope exists in the calculated power spectrum, and this envelope affects the relative variation of the spectrum. Assume that the original log power spectrum and its trend are as shown in fig. 4, where the abscissa represents frequency and the ordinate represents the log power spectrum. As can be seen from fig. 4, harmonic points lying below the envelope should not become target frequency harmonic points; however, if the envelope is retained, those points also take part in the determination of the medium-high frequency point information, so the presence of the envelope reduces the efficiency of determining the medium-high frequency point information. Therefore, to ensure this efficiency, envelope-removal processing is performed on the power spectrum of the target audio signal and the power spectrum of the accompaniment audio signal, so as to obtain the de-envelope power spectrum of the target audio signal and the de-envelope power spectrum of the accompaniment audio signal, respectively.
In a specific application scenario, when performing envelope-removal processing on the power spectrum of the target audio signal and the power spectrum of the accompaniment audio signal to obtain the corresponding de-envelope power spectra, the logarithm of each power spectrum may first be taken for ease of calculation, and the log power spectrum is subsequently used to determine the medium-high frequency point information. That is, the logarithms of the power spectrum of the target audio signal and of the power spectrum of the accompaniment audio signal are taken to obtain the log power spectrum of the target audio signal and the log power spectrum of the accompaniment audio signal, and envelope-removal processing is then performed on these two log power spectra to obtain the de-envelope power spectrum of the target audio signal and the de-envelope power spectrum of the accompaniment audio signal, respectively.
It will be appreciated that, taking the power spectrum information P(k, n) of each frame signal calculated above as an example, the log power spectrum can be obtained by the formula Lp(k, n) = log|P(k, n)|, where Lp(k, n) represents the log power spectrum.
It can be understood that, when performing envelope-removal processing on the log power spectrum of the target audio signal and the log power spectrum of the accompaniment audio signal to obtain the corresponding de-envelope power spectra, zero-phase delay filtering can be applied to the two log power spectra based on a de-envelope processing formula, thereby obtaining the de-envelope power spectrum of the target audio signal and the de-envelope power spectrum of the accompaniment audio signal, respectively.
The de-envelope processing formula includes:
L̂p(k, n) = Lp(k, n) - filtfilt(b, a, Lp(k, n))
where L̂p(k, n) represents the de-envelope power spectrum; filtfilt denotes zero-phase delay filtering; b and a represent filtering parameters, that is, a filtering operation is required in the envelope-removal processing, and the filter type and parameters can be determined according to the application scenario, for example b and a may be the parameters of a low-pass filter with b = [0.0305, 0.0305] and a = [1, -0.9391]; n represents the frame index; and k represents the frequency point index within the frame.
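For reference, a possible implementation of this de-envelope step with SciPy's zero-phase filter is sketched below. The assumption that the low-pass filtfilt output (the spectral trend) is subtracted from the log power spectrum follows the reconstruction of the formula above and the trend removal shown in fig. 4 and fig. 6; the small offset inside the logarithm is only there to avoid log(0).

```python
# Sketch of the de-envelope (trend-removal) step, under the assumptions stated above.
import numpy as np
from scipy.signal import filtfilt

def de_envelope(P, b=(0.0305, 0.0305), a=(1.0, -0.9391)):
    Lp = np.log(np.abs(P) + 1e-12)          # L_p(k, n); offset avoids log(0)
    trend = filtfilt(b, a, Lp, axis=0)      # zero-phase low-pass along the frequency axis k
    return Lp - trend                       # de-envelope (trend-removed) log power spectrum
```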
Step S204: and respectively obtaining a target medium-high frequency harmonic point of the target audio signal and a target medium-high frequency harmonic point of the accompaniment audio signal based on the de-envelope power spectrum of the target audio signal and the de-envelope power spectrum of the accompaniment audio signal.
Step S205: and taking the target medium-high frequency harmonic point of the target audio signal as the medium-high frequency point information of the target audio signal.
Step S206: and taking the target medium-high frequency harmonic point of the accompaniment audio signal as the medium-high frequency point information of the accompaniment audio signal.
In this embodiment, the envelope-removal processing essentially redraws the power spectrum with its overall trend removed as the reference. A harmonic refers to each component at an integer multiple of the fundamental frequency obtained by Fourier-series decomposition of a periodic non-sinusoidal quantity, so harmonic points can easily be determined in the de-envelope power spectrum. The corresponding medium-high frequency harmonic points can therefore be used as the corresponding medium-high frequency point information: based on the de-envelope power spectrum of the target audio signal and the de-envelope power spectrum of the accompaniment audio signal, the target medium-high frequency harmonic points of the target audio signal and of the accompaniment audio signal are obtained respectively, and the target medium-high frequency harmonic points of the target audio signal are used as the medium-high frequency point information of the target audio signal. It should be noted that the selection condition for the target medium-high frequency harmonic points can be determined according to specific needs, and the application is not limited in this respect.
Step S207: if the target similarity between the middle and high frequency point information of the target audio signal and the middle and high frequency point information of the accompaniment audio signal is larger than a threshold value, determining that the accompaniment back stepping exists in the target audio signal.
Fig. 5 is a flowchart of a specific accompaniment stepping back detection method according to an embodiment of the present application. Referring to fig. 5, the accompaniment stepping back detection method includes:
step S301: and acquiring the recorded target audio signal and the corresponding accompaniment audio signal thereof.
Step S302: and carrying out Fourier transformation on the target audio signal and the accompaniment audio signal to respectively obtain a power spectrum of the target audio signal and a power spectrum of the accompaniment audio signal.
Step S303: and removing envelope processing is carried out on the power spectrum of the target audio signal and the power spectrum of the accompaniment audio signal, so that a de-envelope power spectrum of the target audio signal and a de-envelope power spectrum of the accompaniment audio signal are respectively obtained.
Step S304: for the selected de-envelope power spectrum of each frame of target audio signal and the de-envelope power spectrum of the accompaniment audio signal, respectively determining the initial middle-high frequency harmonic point of the target audio signal with the power value larger than the corresponding average power value and the initial middle-high frequency harmonic point of the accompaniment audio signal.
Step S305: and respectively selecting the initial medium-high frequency harmonic point of the target audio signal and the harmonic point of the power value in the initial medium-high frequency harmonic point of the accompaniment audio signal at the peak position as the target medium-high frequency harmonic point of the target audio signal and the target medium-high frequency harmonic point of the accompaniment audio signal.
In this embodiment, because an audio signal consists of multiple frames, its power spectrum likewise consists of multiple frames, and harmonic points can be extracted from the power spectrum of each frame; the selection of target medium-high frequency harmonic points is therefore carried out for each selected frame of the de-envelope power spectrum of the target audio signal and of the de-envelope power spectrum of the accompaniment audio signal. In addition, considering that many harmonic points exist in the de-envelope power spectrum of a single audio signal, in order to reduce the number of target medium-high frequency harmonic points while still letting them reflect the similarity between the target audio signal and the accompaniment audio signal as far as possible, the harmonic points whose power values are greater than the average power value and which lie at peak positions in the de-envelope power spectrum of each frame can be taken as the target medium-high frequency harmonic points. A schematic diagram of the de-envelope power spectrum and the target medium-high frequency harmonic points is shown in fig. 6, where the abscissa represents frequency, the ordinate represents the log power spectrum, the curve is the log power spectrum after the trend has been removed, i.e. the target de-envelope power spectrum, and the marked harmonic position points are the target medium-high frequency harmonic points.
Correspondingly, in the de-envelope power spectrum of a frame of the accompaniment audio signal, the harmonic points whose power values are greater than the average power value of that frame and which lie at peak positions are taken as the target medium-high frequency harmonic points of that frame of the accompaniment audio signal; in the de-envelope power spectrum of a frame of the target audio signal, the harmonic points whose power values are greater than the average power value of that frame and which lie at peak positions are taken as the target medium-high frequency harmonic points of that frame of the target audio signal.
It should be noted that, when determining, for each selected frame of the de-envelope power spectrum of the target audio signal and of the accompaniment audio signal, the initial medium-high frequency harmonic points whose power values are greater than the corresponding average power values, the number of target medium-high frequency harmonic points per frame depends on the audio signal. If all of them were taken every time, the efficiency of accompaniment step-back detection would drop. To guarantee detection efficiency, only a preset number of target medium-high frequency harmonic points may be selected from the de-envelope power spectrum of each frame; this preset number can be determined according to the application scenario, for example 20, in which case the 20 initial medium-high frequency harmonic points with the largest peak values can be taken as the target medium-high frequency harmonic points, and so on.
In addition, it can be understood that the recorded target audio signal and the accompaniment audio signal are distributed differently across the medium-high frequency bands. In some bands the harmonics of the singing voice in the target audio signal and of the music in the accompaniment audio signal are denser, and the number of target medium-high frequency harmonic points there is larger; in such bands, even if accompaniment is present in the target audio signal, it is strongly affected by the singing voice. In bands where the singing voice is sparse, its influence on the accompaniment is weak, the target audio signal and the accompaniment audio signal tend to be more consistent, and the similarity judgement is easier to make. Therefore, the medium-high frequency point range in which the density of the target audio signal is smaller than a preset density can be determined as the target frequency point range, and the initial medium-high frequency harmonic points of the target audio signal and of the accompaniment audio signal whose power values are greater than the corresponding average power values can be determined within that target frequency point range.
It should be noted that the target frequency point range may be determined according to the application scenario; for example, in a vocal recording process, because the singing voice is generally distributed below 1000 Hz, the target frequency point range may be set to 1000 Hz to 20000 Hz, 1500 Hz to 20000 Hz, etc. The upper limit of the target frequency point range can be fixed at 20000 Hz, and the lower limit can be selected between 1000 Hz and 5000 Hz.
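The following sketch shows one way the per-frame selection described above could look in code: restrict to the target frequency point range, keep de-envelope values above the frame mean, keep only local peaks, and retain the strongest ones. The count of 20 and the 1000 Hz to 20000 Hz range are the example values from the text, and find_peaks is used here as a convenient stand-in for "at the peak position"; none of the names are from the patent.

```python
# Illustrative per-frame selection of target medium-high frequency harmonic points.
import numpy as np
from scipy.signal import find_peaks

def target_harmonic_points(dLp_frame, freqs, f_lo=1000.0, f_hi=20000.0, n_keep=20):
    band = (freqs >= f_lo) & (freqs <= f_hi)            # target frequency point range
    idx = np.flatnonzero(band)
    spec = dLp_frame[idx]
    peaks, _ = find_peaks(spec)                          # candidates at peak positions
    peaks = peaks[spec[peaks] > spec.mean()]             # initial points: above the frame mean
    strongest = peaks[np.argsort(spec[peaks])[::-1][:n_keep]]
    return set(idx[strongest])                           # bin indices used as the frame's info
```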
Step S306: and taking the target medium-high frequency harmonic point of the target audio signal as the medium-high frequency point information of the target audio signal.
Step S307: and taking the target medium-high frequency harmonic point of the accompaniment audio signal as the medium-high frequency point information of the accompaniment audio signal.
Step S308: if the target similarity between the middle and high frequency point information of the target audio signal and the middle and high frequency point information of the accompaniment audio signal is larger than a threshold value, determining that the accompaniment back stepping exists in the target audio signal.
Fig. 7 is a flowchart of a specific accompaniment stepping back detection method according to an embodiment of the present application. Referring to fig. 7, the accompaniment stepping back detection method includes:
Step S401: and acquiring the recorded target audio signal and the corresponding accompaniment audio signal thereof.
Step S402: and carrying out Fourier transformation on the target audio signal and the accompaniment audio signal to respectively obtain a power spectrum of the target audio signal and a power spectrum of the accompaniment audio signal.
Step S403: and respectively obtaining the medium-high frequency point information of the target audio signal and the medium-high frequency point information of the accompaniment audio signal based on the power spectrum of the target audio signal and the power spectrum of the accompaniment audio signal.
Step S404: and calculating the union value of the medium-high frequency point information of the target audio signals and the medium-high frequency point information of the accompaniment audio signals of the same frames.
Step S405: and calculating intersection values of the medium-high frequency point information of the target audio signals of the same frames and the medium-high frequency point information of the accompaniment audio signals.
Step S405: and determining the ratio of the intersection value and the union value corresponding to the same frame as the Jaccard similarity value of the middle-high frequency point information of the target audio signal and the middle-high frequency point information of the accompaniment audio signal of the same frame.
Step S406: and determining the target similarity of the middle-high frequency point information of the target audio signal and the middle-high frequency point information of the accompaniment audio signal based on the Jaccard similarity value of the middle-high frequency point information of the target audio signal and the middle-high frequency point information of the accompaniment audio signal of each same frame. In this embodiment, in the process that whether the target similarity between the middle and high frequency point information of the target audio signal and the middle and high frequency point information of the accompaniment audio signal is greater than the threshold value, different amounts of frequency point information are carried in the middle and high frequency point information of the accompaniment audio signal and the middle and high frequency point information of the accompaniment audio signal are considered, and Jaccard (Jaccard similarity coefficient, jaccard coefficient) similarity can calculate the similarity based on the amounts, so in order to calculate the target similarity quickly, a Jaccard (Jaccard similarity coefficient, jaccard coefficient) similarity calculation method can be used to calculate the target similarity between the middle and high frequency point information of the target audio signal and the middle and high frequency point information of the accompaniment audio signal, so as to quickly determine whether the middle and high frequency point information of the accompaniment audio signal and the middle and high frequency point information of the target audio signal are similar.
In this embodiment, suppose the medium-high frequency point information of the accompaniment audio in frame n is the set Ka(n) and the medium-high frequency point information of the target audio signal in frame n is the set Kv(n). The Jaccard similarity value can then be calculated as J(n) = |Ka(n) ∩ Kv(n)| / |Ka(n) ∪ Kv(n)|, where J(n) denotes the Jaccard similarity value of the n-th frame, ∩ denotes the intersection operation and ∪ denotes the union operation. That is, the union value of the medium-high frequency point information of the target audio signal and of the accompaniment audio signal is calculated for each same frame, the intersection value of the two is calculated for each same frame, and the ratio of the intersection value to the union value for the same frame is determined as the Jaccard similarity value of the medium-high frequency point information of the target audio signal and of the accompaniment audio signal for that frame.
It can be understood that, because the audio signal contains many frames and each frame yields one Jaccard similarity value, a large number of Jaccard similarity values are obtained. If all of them were used directly when determining the target similarity between the medium-high frequency point information of the target audio signal and that of the accompaniment audio signal, the amount of calculation would increase greatly. To reduce the amount of calculation, the per-frame Jaccard similarity values can first be selected or aggregated and the subsequent calculation then performed; for example, the average of the Jaccard similarity values of the medium-high frequency point information of the target audio signal and of the accompaniment audio signal over all same frames within a preset duration can be determined as the target similarity. Assuming the preset duration covers frames 1 to M, the target similarity can be calculated as S = (1/M)·Σ J(n), n = 1, …, M, where S denotes the target similarity. Of course, the median of the Jaccard similarity values over the same frames within the preset duration may also be determined as the target similarity, which is not specifically limited here.
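A direct rendering of the per-frame Jaccard value J(n) and of the averaging over a preset duration is given below, with the harmonic-point information represented as Python sets; the alignment of the two per-frame lists is assumed to be handled by the caller.

```python
# Per-frame Jaccard similarity and its mean over the frames of a preset duration
# (illustrative helpers; set contents come from the harmonic-point selection step).
def jaccard(set_a, set_b):
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def target_similarity(rec_sets, acc_sets):
    """Mean J(n) over same-frame pairs; rec_sets and acc_sets are frame-aligned lists of sets."""
    scores = [jaccard(r, a) for r, a in zip(rec_sets, acc_sets)]
    return sum(scores) / len(scores) if scores else 0.0
```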
Step S407: if the target similarity is greater than the threshold value, determining that the accompaniment stepping back exists in the target audio signal.
Fig. 8 is a flowchart of a specific accompaniment step-back detection method according to an embodiment of the present application. Referring to fig. 8, the accompaniment step-back detection method includes:
step S501: a recorded initial audio signal and a corresponding accompaniment audio signal are obtained.
Step S502: and performing delay compensation on the initial audio signal to obtain a target audio signal.
In this embodiment, if a recording delay exists during the recording of the target audio signal, the final detection accuracy will be affected. Although the method of the application has a certain robustness to recording delay, an excessively long delay still degrades the detection accuracy. Therefore, to ensure the accuracy of accompaniment step-back detection, the recorded initial audio signal may be acquired first and delay compensation performed on it to obtain the target audio signal.
It can be understood that, in the process of performing delay compensation on the initial audio signal, considering the robustness of the method to recording delay, a delay duration threshold may be set for deciding whether to perform delay compensation on the audio; for example, the delay duration threshold may be set to 100 ms. Whether the recording delay duration of the initial audio signal is greater than the delay duration threshold is then judged: if the recording delay duration of the initial audio signal is greater than the delay duration threshold, delay compensation is performed on the initial audio signal to obtain the target audio signal; if the recording delay duration of the initial audio signal is less than or equal to the delay duration threshold, the initial audio signal may be used directly as the target audio signal.
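A hedged sketch of this decision, assuming the recording delay has already been estimated (the text does not specify how) and is given in milliseconds; the 100 ms value mirrors the example threshold above.

```python
def compensate_delay(initial_signal, delay_ms, sample_rate, threshold_ms=100.0):
    """Return the target audio signal after optional delay compensation.

    initial_signal: 1-D array-like of recorded samples.
    delay_ms: estimated recording delay in milliseconds (assumed given).
    If the delay exceeds the threshold, drop the leading delayed samples;
    otherwise the initial signal is used directly as the target signal.
    """
    if delay_ms <= threshold_ms:
        return initial_signal
    offset = int(round(delay_ms * sample_rate / 1000.0))
    return initial_signal[offset:]
```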
Step S503: and carrying out Fourier transformation on the target audio signal and the accompaniment audio signal to respectively obtain a power spectrum of the target audio signal and a power spectrum of the accompaniment audio signal.
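One way to obtain the frame-wise power spectra is a short-time Fourier transform; the frame length, hop size, and Hann window below are assumptions for illustration, not values taken from the text.

```python
import numpy as np

def power_spectrum(signal, frame_len=1024, hop=512):
    """Frame-wise power spectrum via a short-time Fourier transform.

    Returns an array of shape (num_frames, frame_len // 2 + 1) where entry
    (n, k) is |X(k, n)|^2 for frame index n and frequency bin index k.
    """
    window = np.hanning(frame_len)
    num_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    spectra = np.empty((num_frames, frame_len // 2 + 1))
    for n in range(num_frames):
        frame = signal[n * hop:n * hop + frame_len] * window
        spectra[n] = np.abs(np.fft.rfft(frame)) ** 2
    return spectra
```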
Step S504: and respectively obtaining the medium-high frequency point information of the target audio signal and the medium-high frequency point information of the accompaniment audio signal based on the power spectrum of the target audio signal and the power spectrum of the accompaniment audio signal.
Step S505: if the target similarity between the middle and high frequency point information of the target audio signal and the middle and high frequency point information of the accompaniment audio signal is larger than a threshold value, determining that the accompaniment back stepping exists in the target audio signal.
Taking the singing recording process of a music client APP as an example, the technical scheme of the present application is described below; the whole process may include the following steps:
Acquiring an accompaniment audio signal;
acquiring singing voice audio signals recorded by a user;
Performing Fourier transformation on the singing voice audio signal and the accompaniment audio signal to obtain a power spectrum of the singing voice audio signal and a power spectrum of the accompaniment audio signal respectively, and taking logarithms of the power spectrum of the singing voice audio signal and the power spectrum of the accompaniment audio signal respectively to obtain a logarithmic power spectrum of the singing voice audio signal and a logarithmic power spectrum of the accompaniment audio signal;
Based on a de-envelope processing formula, carrying out zero-phase delay filtering processing on the logarithmic power spectrum of the singing voice audio signal and the logarithmic power spectrum of the accompaniment audio signal to respectively obtain a de-envelope power spectrum of the singing voice audio signal and a de-envelope power spectrum of the accompaniment audio signal (a code sketch of this step and the harmonic-point selection below is given after this list); the de-envelope processing formula includes:
L̂p(k, n) = filtfilt(b, a, Lp(k, n))
wherein L̂p(k, n) represents the de-envelope power spectrum; Lp(k, n) represents the logarithmic power spectrum; filtfilt denotes the zero-phase delay filtering processing; b and a represent filtering parameters; n represents the frame index; k represents the frequency point index within the frame;
Determining a medium-high frequency point range of which the density of the singing voice audio signal is smaller than a preset density as a target frequency point range;
For the selected de-envelope power spectrum of each frame of the singing voice audio signal and the de-envelope power spectrum of the accompaniment audio signal, respectively determining, within the target frequency point range, the initial medium-high frequency harmonic points of the singing voice audio signal and the initial medium-high frequency harmonic points of the accompaniment audio signal whose power values are larger than the corresponding average power values;
Respectively selecting, from the initial medium-high frequency harmonic points of the singing voice audio signal and the initial medium-high frequency harmonic points of the accompaniment audio signal, the harmonic points whose power values are located at peak positions as the target medium-high frequency harmonic points of the singing voice audio signal and the target medium-high frequency harmonic points of the accompaniment audio signal;
Calculating the union value of the target medium-high frequency harmonic points of the singing voice audio signal and the target medium-high frequency harmonic points of the accompaniment audio signal for each same frame;
Calculating the intersection value of the target medium-high frequency harmonic points of the singing voice audio signal and the target medium-high frequency harmonic points of the accompaniment audio signal for each same frame;
Determining the ratio of the intersection value to the union value corresponding to the same frame as the Jaccard similarity value of the target medium-high frequency harmonic points of the singing voice audio signal and the target medium-high frequency harmonic points of the accompaniment audio signal of the same frame;
Determining the average value of the Jaccard similarity values of the medium-high frequency point information of the singing voice audio signal and the medium-high frequency point information of the accompaniment audio signal of each same frame within a preset duration as the target similarity;
judging whether the target similarity value is larger than 0.05;
if the target similarity value is larger than 0.05, determining that accompaniment stepping back exists in the singing voice audio signal;
If the target similarity value is less than or equal to 0.05, determining that the singing voice audio signal does not have accompaniment stepping back.
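The code sketch referenced in the de-envelope step above illustrates, under stated assumptions, how the listed steps could fit together: zero-phase filtering of the logarithmic power spectra, selection of target medium-high frequency harmonic points, and the 0.05 decision. The Butterworth coefficients, the mid-high frequency bin range, and the peak-picking via local maxima are illustrative choices, not values from the text.

```python
import numpy as np
from scipy.signal import argrelmax, butter, filtfilt

def de_envelope(log_power_spec, b=None, a=None):
    """Zero-phase-filter each frame's log power spectrum along frequency.

    log_power_spec: array of shape (num_frames, num_bins). b, a: filter
    coefficients; a first-order Butterworth high-pass along the bin axis is
    an assumed default that suppresses the slowly varying spectral envelope.
    """
    if b is None or a is None:
        b, a = butter(1, 0.05, btype="highpass")
    return filtfilt(b, a, log_power_spec, axis=1)

def target_harmonic_points(de_env_spec, band):
    """Select per-frame target medium-high frequency harmonic points.

    band: (k_lo, k_hi) bin range standing in for the target frequency point
    range. For each frame, keep bins whose de-enveloped power exceeds the
    frame's mean over the band (initial points) and that lie on local peaks.
    """
    k_lo, k_hi = band
    points = []
    for frame in de_env_spec:
        segment = frame[k_lo:k_hi]
        above_mean = set(np.flatnonzero(segment > segment.mean()) + k_lo)
        peaks = set(argrelmax(segment)[0] + k_lo)
        points.append(above_mean & peaks)
    return points

def detect_back_stepping(vocal_points, accomp_points, threshold=0.05):
    """Flag accompaniment back stepping from per-frame harmonic point sets."""
    j_values = []
    for a, b in zip(vocal_points, accomp_points):
        union = a | b
        j_values.append(len(a & b) / len(union) if union else 0.0)
    return float(np.mean(j_values)) > threshold
```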
In order to intuitively understand the accompaniment back stepping detection scheme provided by the application, refer to fig. 9. In fig. 9, the input signal harmonic points represent the target medium-high frequency harmonic points of the singing voice audio signal, and the reference signal harmonic points represent the target medium-high frequency harmonic points of the accompaniment audio; in the first and second rows of graphs, the abscissa represents frequency and the ordinate represents the logarithmic power spectrum; in the third row of graphs, the abscissa represents frequency and the ordinate represents the target similarity value. The third diagram on the left side of fig. 9 shows that the distributions of the target medium-high frequency harmonic points of the input signal and of the reference signal are obviously different, and the target similarity value is about 0.02, obviously less than 0.05, so the scheme provided by the application determines that no accompaniment back stepping exists in the recorded signal. The third diagram on the right side of fig. 9 shows that the distributions of the target medium-high frequency harmonic points of the input signal and of the reference signal are highly similar, and the target similarity value is about 0.15, obviously greater than 0.05, so the scheme provided by the application determines that accompaniment back stepping exists in the recorded signal.
Referring to fig. 10, an accompaniment step-back detection apparatus according to an embodiment of the present application may include:
an audio acquisition module 101, configured to acquire a recorded target audio signal and a corresponding accompaniment audio signal thereof;
The power spectrum calculation module 102 is configured to obtain a power spectrum of the target audio signal and a power spectrum of the accompaniment audio signal by performing fourier transform on the target audio signal and the accompaniment audio signal, respectively;
a medium-high frequency point information determining module 103, configured to obtain medium-high frequency point information of the target audio signal and medium-high frequency point information of the accompaniment audio signal based on the power spectrum of the target audio signal and the power spectrum of the accompaniment audio signal, respectively;
the judging module 104 is configured to determine that accompaniment stepping back exists in the target audio signal if the target similarity between the middle-high frequency point information of the target audio signal and the middle-high frequency point information of the accompaniment audio signal is greater than a threshold value.
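A minimal structural sketch of how the four modules might be composed; the class name, the injected `extract_points` callable, and the 0.05 default are hypothetical.

```python
import numpy as np

class AccompanimentBackSteppingDetector:
    """Structural sketch of modules 101-104 (names are hypothetical).

    extract_points: a callable mapping an audio signal to per-frame sets of
    middle-high frequency point information (standing in for modules 102
    and 103 combined).
    """

    def __init__(self, extract_points, threshold=0.05):
        self.extract_points = extract_points
        self.threshold = threshold

    def detect(self, target_signal, accomp_signal):
        # Module 101 (audio acquisition) is assumed to have produced the two
        # signals passed in; module 104 is the similarity judgment below.
        pts_t = self.extract_points(target_signal)
        pts_a = self.extract_points(accomp_signal)
        j_values = []
        for a, b in zip(pts_t, pts_a):
            union = a | b
            j_values.append(len(a & b) / len(union) if union else 0.0)
        return float(np.mean(j_values)) > self.threshold
```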
In the accompaniment back stepping detection device provided by the application, the recorded target audio signal and the accompaniment audio signal corresponding to the target audio signal are first obtained; the power spectrum of the target audio signal and the power spectrum of the accompaniment audio signal are then obtained through Fourier transformation; the medium-high frequency point information of the target audio signal and the medium-high frequency point information of the accompaniment audio signal are further obtained; and finally, whether the medium-high frequency point information of the target audio signal is similar to the medium-high frequency point information of the accompaniment audio signal is judged. If the target similarity between the medium-high frequency point information of the target audio signal and the medium-high frequency point information of the accompaniment audio signal is greater than a threshold value, it can be determined that accompaniment back stepping exists in the target audio signal. Therefore, the application introduces the power spectrum information, the medium-high frequency point information and the similarity judgment of the audio into accompaniment back stepping detection, can rapidly perform accompaniment back stepping detection without using an existing AEC tool, and has a small amount of calculation and high detection efficiency.
The description of the corresponding modules in the accompaniment step-back detection device provided by the application can refer to the above embodiment, and is not repeated here.
Further, the embodiment of the application also provides electronic equipment. Fig. 11 is a block diagram of an electronic device 20, according to an exemplary embodiment, and the contents of the diagram should not be construed as limiting the scope of use of the present application in any way.
Fig. 11 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present application. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input output interface 25, and a communication bus 26. The memory 22 is configured to store a computer program, which is loaded and executed by the processor 21 to implement relevant steps in the accompaniment step-back detection method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in the present embodiment may be a server.
In this embodiment, the power supply 23 is configured to provide an operating voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and the communication protocol to be followed is any communication protocol applicable to the technical solution of the present application, which is not specifically limited herein; the input/output interface 25 is used for acquiring external input data or outputting external output data, and the specific interface type thereof may be selected according to the specific application requirement, which is not limited herein.
The memory 22 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk, or an optical disk, and the resources stored thereon may include an operating system 221, a computer program 222, video data 223, and the like, and the storage may be temporary storage or permanent storage.
The operating system 221 is used for managing and controlling various hardware devices on the electronic device 20 and the computer program 222, so as to implement the operation and processing of the processor 21 on the massive video data 223 in the memory 22, which may be Windows Server, netware, unix, linux, etc. The computer program 222 may further include a computer program that can be used to perform other specific tasks in addition to the computer program that can be used to perform the accompaniment step-back detection method performed by the electronic device 20 disclosed in any of the foregoing embodiments. The data 223 may include various video data collected by the electronic device 20.
Further, the embodiment of the application also discloses a storage medium, wherein the storage medium stores a computer program, and when the computer program is loaded and executed by a processor, the steps of the accompaniment stepping back detection method disclosed in any embodiment are realized.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, so that the same or similar parts between the embodiments are referred to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The accompaniment step-back detection method, apparatus, device and medium provided by the invention have been described in detail above, and specific examples are used herein to illustrate the principle and implementation of the invention; the description of the above embodiments is only intended to help understand the method and core idea of the invention. Meanwhile, since those skilled in the art may make changes to the specific embodiments and the application scope in accordance with the idea of the present invention, the contents of this description should not be construed as limiting the present invention.

Claims (11)

1. An accompaniment stepping back detection method, comprising:
acquiring a recorded target audio signal and a corresponding accompaniment audio signal thereof;
The target audio signal and the accompaniment audio signal are subjected to Fourier transformation to respectively obtain a power spectrum of the target audio signal and a power spectrum of the accompaniment audio signal;
Respectively obtaining medium-high frequency point information of the target audio signal and medium-high frequency point information of the accompaniment audio signal based on the power spectrum of the target audio signal and the power spectrum of the accompaniment audio signal;
if the target similarity between the middle-high frequency point information of the target audio signal and the middle-high frequency point information of the accompaniment audio signal is larger than a threshold value, determining that accompaniment back stepping exists in the target audio signal.
2. The method according to claim 1, wherein the obtaining the medium-high frequency point information of the target audio signal and the medium-high frequency point information of the accompaniment audio signal based on the power spectrum of the target audio signal and the power spectrum of the accompaniment audio signal, respectively, comprises:
Removing envelope processing is carried out on the power spectrum of the target audio signal and the power spectrum of the accompaniment audio signal to respectively obtain a de-envelope power spectrum of the target audio signal and a de-envelope power spectrum of the accompaniment audio signal;
Respectively obtaining a target medium-high frequency harmonic point of the target audio signal and a target medium-high frequency harmonic point of the accompaniment audio signal based on the de-envelope power spectrum of the target audio signal and the de-envelope power spectrum of the accompaniment audio signal;
Taking a target medium-high frequency harmonic point of the target audio signal as medium-high frequency point information of the target audio signal;
and taking the target medium-high frequency harmonic point of the accompaniment audio signal as medium-high frequency point information of the accompaniment audio signal.
3. The method of claim 2, wherein the obtaining the target medium-high frequency harmonic points of the target audio signal and the target medium-high frequency harmonic points of the accompaniment audio signal based on the de-envelope power spectrum of the target audio signal and the de-envelope power spectrum of the accompaniment audio signal, respectively, comprises:
for each frame of the selected de-envelope power spectrum of the target audio signal and the de-envelope power spectrum of the accompaniment audio signal, respectively determining an initial medium-high frequency harmonic point of the target audio signal and an initial medium-high frequency harmonic point of the accompaniment audio signal, wherein the power values of the initial medium-high frequency harmonic point and the initial medium-high frequency harmonic point are larger than the corresponding average power values;
And respectively selecting the initial medium-high frequency harmonic point of the target audio signal and the harmonic point of the power value in the initial medium-high frequency harmonic point of the accompaniment audio signal at the peak position as the target medium-high frequency harmonic point of the target audio signal and the target medium-high frequency harmonic point of the accompaniment audio signal.
4. A method according to claim 3, wherein said determining initial mid-high frequency harmonic points of the target audio signal and initial mid-high frequency harmonic points of the accompaniment audio signal having power values greater than the corresponding average power values, respectively, for each frame of the target audio signal and the accompaniment audio signal selected, comprises:
Determining a medium-high frequency point range, wherein the density of the target audio signal is smaller than a preset density, as a target frequency point range;
And respectively determining an initial medium-high frequency harmonic point of the target audio signal with a power value larger than a corresponding average power value and an initial medium-high frequency harmonic point of the accompaniment audio signal in the target frequency point range for the selected de-envelope power spectrum of the target audio signal and the de-envelope power spectrum of the accompaniment audio signal of each frame.
5. The method of claim 2, wherein said subjecting the power spectrum of the target audio signal and the power spectrum of the accompaniment audio signal to a de-envelope process to obtain a de-envelope power spectrum of the target audio signal and a de-envelope power spectrum of the accompaniment audio signal, respectively, comprises:
taking logarithms of the power spectrum of the target audio signal and the power spectrum of the accompaniment audio signal respectively to obtain a logarithm power spectrum of the target audio signal and a logarithm power spectrum of the accompaniment audio signal;
Removing envelope processing is carried out on the logarithmic power spectrum of the target audio signal and the logarithmic power spectrum of the accompaniment audio signal, so that a de-envelope power spectrum of the target audio signal and a de-envelope power spectrum of the accompaniment audio signal are respectively obtained.
6. The method of claim 5, wherein said subjecting the log power spectrum of the target audio signal and the log power spectrum of the accompaniment audio signal to a de-envelope process to obtain a de-envelope power spectrum of the target audio signal and a de-envelope power spectrum of the accompaniment audio signal, respectively, comprises:
Based on a de-envelope processing formula, carrying out zero-phase delay filtering processing on the logarithmic power spectrum of the target audio signal and the logarithmic power spectrum of the accompaniment audio signal to respectively obtain a de-envelope power spectrum of the target audio signal and a de-envelope power spectrum of the accompaniment audio signal;
The de-envelope processing formula comprises:
L̂p(k, n) = filtfilt(b, a, Lp(k, n)), wherein L̂p(k, n) represents the de-envelope power spectrum; Lp(k, n) represents the log power spectrum; filtfilt denotes the zero-phase delay filtering processing algorithm; b and a represent filtering parameters; n represents the frame index; k represents the frequency point index within the frame.
7. The method according to any one of claims 2 to 6, wherein determining that the target audio has accompaniment back stepping if the target similarity between the middle-high frequency point information of the target audio signal and the middle-high frequency point information of the accompaniment audio signal is greater than a threshold value, comprises:
calculating the union value of the medium-high frequency point information of the target audio signal and the medium-high frequency point information of the accompaniment audio signal of each same frame;
Calculating intersection values of the medium-high frequency point information of the target audio signal and the medium-high frequency point information of the accompaniment audio signal of each same frame;
Determining the ratio of the intersection value and the union value corresponding to the same frame as Jaccard similarity values of the middle-high frequency point information of the target audio signal and the middle-high frequency point information of the accompaniment audio signal of the same frame;
determining the target similarity between the middle-high frequency point information of the target audio signal and the middle-high frequency point information of the accompaniment audio signal based on the Jaccard similarity value of the middle-high frequency point information of the target audio signal and the middle-high frequency point information of the accompaniment audio signal of each same frame;
and if the target similarity is larger than the threshold value, determining that accompaniment stepping back exists in the target audio signal.
8. The method according to claim 7, wherein the determining the target similarity of the middle-high frequency point information of the target audio signal and the middle-high frequency point information of the accompaniment audio signal based on the Jaccard similarity value of the middle-high frequency point information of the target audio signal and the middle-high frequency point information of the accompaniment audio signal for each same frame includes:
And determining the average value of Jaccard similarity values of the medium-high frequency point information of the target audio signal and the medium-high frequency point information of the accompaniment audio signal of each same frame in a preset duration as the target similarity.
9. The method of claim 1, wherein the acquiring the recorded target audio signal comprises:
acquiring a recorded initial audio signal;
And performing delay compensation on the initial audio signal to obtain the target audio signal.
10. An electronic device, comprising:
A memory for storing a computer program;
A processor for implementing the steps of the accompaniment pedal-back detection method according to any one of claims 1 to 9 when executing the computer program.
11. A computer readable storage medium, wherein a computer program is stored in the computer readable storage medium, which when executed by a processor, implements the steps of the accompaniment back-stepping detection method according to any one of claims 1 to 9.
CN202110791459.8A 2021-07-13 2021-07-13 Accompaniment stepping back detection method, accompaniment stepping back detection equipment and computer readable storage medium Active CN113571033B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110791459.8A CN113571033B (en) 2021-07-13 2021-07-13 Accompaniment stepping back detection method, accompaniment stepping back detection equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110791459.8A CN113571033B (en) 2021-07-13 2021-07-13 Accompaniment stepping back detection method, accompaniment stepping back detection equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN113571033A CN113571033A (en) 2021-10-29
CN113571033B true CN113571033B (en) 2024-06-14

Family

ID=78164668

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110791459.8A Active CN113571033B (en) 2021-07-13 2021-07-13 Accompaniment stepping back detection method, accompaniment stepping back detection equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113571033B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109087669A (en) * 2018-10-23 2018-12-25 腾讯科技(深圳)有限公司 Audio similarity detection method, device, storage medium and computer equipment

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE2659291C2 (en) * 1976-12-29 1982-02-04 Philips Patentverwaltung Gmbh, 2000 Hamburg Device for the automatic playing of tonal accompaniment in electronic musical instruments
JP4322283B2 (en) * 2007-02-26 2009-08-26 独立行政法人産業技術総合研究所 Performance determination device and program
JP4613923B2 (en) * 2007-03-30 2011-01-19 ヤマハ株式会社 Musical sound processing apparatus and program
JP2015005102A (en) * 2013-06-20 2015-01-08 ヤフー株式会社 Accompanying determination device, accompanying determination method, and program
CN111540374A (en) * 2020-04-17 2020-08-14 杭州网易云音乐科技有限公司 Method and device for extracting accompaniment and voice, and method and device for generating word-by-word lyrics
CN111640411B (en) * 2020-05-29 2023-04-18 腾讯音乐娱乐科技(深圳)有限公司 Audio synthesis method, device and computer readable storage medium
CN111667803B (en) * 2020-07-10 2023-05-16 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method and related products
CN112967738B (en) * 2021-02-01 2024-06-14 腾讯音乐娱乐科技(深圳)有限公司 Human voice detection method and device, electronic equipment and computer readable storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109087669A (en) * 2018-10-23 2018-12-25 腾讯科技(深圳)有限公司 Audio similarity detection method, device, storage medium and computer equipment

Also Published As

Publication number Publication date
CN113571033A (en) 2021-10-29

Similar Documents

Publication Publication Date Title
JP6801023B2 (en) Volume leveler controller and control method
JP6325640B2 (en) Equalizer controller and control method
CN108320730B (en) Music classification method, beat point detection method, storage device and computer device
CN110265064B (en) Audio frequency crackle detection method, device and storage medium
JP2019194742A (en) Device and method for audio classification and processing
US9646592B2 (en) Audio signal analysis
US20210098008A1 (en) A method and system for triggering events
CN112992190B (en) Audio signal processing method and device, electronic equipment and storage medium
CN111383646A (en) Voice signal transformation method, device, equipment and storage medium
CN113470685B (en) Training method and device for voice enhancement model and voice enhancement method and device
CN112712816A (en) Training method and device of voice processing model and voice processing method and device
Tsilfidis et al. Blind single-channel suppression of late reverberation based on perceptual reverberation modeling
EP3573058A1 (en) Dry sound and ambient sound separation
Götz et al. Online reverberation time and clarity estimation in dynamic acoustic conditions
CN112669797B (en) Audio processing method, device, electronic equipment and storage medium
CN112151055B (en) Audio processing method and device
CN113571033B (en) Accompaniment stepping back detection method, accompaniment stepping back detection equipment and computer readable storage medium
CN111312287A (en) Audio information detection method and device and storage medium
CN111782859A (en) Audio visualization method and device and storage medium
CN116364115A (en) Sound breaking detection method and device, electronic equipment and storage medium
CN115641874A (en) Audio processing method, device and storage medium
CN111986694B (en) Audio processing method, device, equipment and medium based on transient noise suppression
CN112233693B (en) Sound quality evaluation method, device and equipment
CN114157254A (en) Audio processing method and audio processing device
CN113593604A (en) Method, device and storage medium for detecting audio quality

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant