CN117765977A - Training method of overlapped voice detection model, overlapped voice detection method and device


Info

Publication number
CN117765977A
CN117765977A
Authority
CN
China
Prior art keywords: voice, audio, overlapping, song, overlapped
Prior art date
Legal status
Pending
Application number
CN202311840989.2A
Other languages
Chinese (zh)
Inventor
罗程方
Current Assignee
Guangzhou Kugou Computer Technology Co Ltd
Original Assignee
Guangzhou Kugou Computer Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Kugou Computer Technology Co Ltd filed Critical Guangzhou Kugou Computer Technology Co Ltd
Priority to CN202311840989.2A
Publication of CN117765977A


Abstract

The embodiment of the application provides a training method of an overlapped voice detection model, an overlapped voice detection method and an overlapped voice detection device, relating to the technical fields of audio detection and audio processing. The training method of the overlapped voice detection model comprises the following steps: acquiring a training sample set of the overlapped voice detection model, wherein the training sample set comprises at least one training sample, and each training sample comprises a section of song audio and an overlapped voice marking result corresponding to the song audio; outputting, through the overlapped voice detection model, an overlapped voice detection result corresponding to the song audio, wherein the overlapped voice detection result indicates an overlapped voice probability value for each frame in the song audio; and adjusting parameters of the overlapped voice detection model according to the difference between the overlapped voice detection result and the overlapped voice marking result to obtain a trained overlapped voice detection model. The technical scheme provided by the embodiment of the application can improve the detection accuracy of overlapped voices in song audio.

Description

Training method of overlapped voice detection model, overlapped voice detection method and device
Technical Field
The embodiment of the application relates to the field of audio detection and audio processing, in particular to a training method of an overlapped voice detection model, an overlapped voice detection method and an overlapped voice detection device.
Background
With the development of computer technology, audio detection and audio processing technologies are also becoming increasingly popular.
In the related art, fundamental frequencies in song audio are identified, and positions where multiple fundamental frequencies occur simultaneously are determined to contain overlapping human voices. However, in most song audio the overlapping voice parts sing in close unison, so their harmonics match to a high degree, which results in low detection accuracy for the related art.
Disclosure of Invention
The embodiment of the application provides a training method of an overlapping voice detection model, an overlapping voice detection method and an overlapping voice detection device, which can improve the detection accuracy of overlapping voice in song audio. The technical scheme is as follows:
according to an aspect of an embodiment of the present application, there is provided a training method of an overlapping human voice detection model, the method including:
acquiring a training sample set of the overlapped voice detection model, wherein the training sample set comprises at least one training sample, each training sample comprises a section of song audio and an overlapped voice marking result corresponding to the song audio, and the overlapped voice marking result is used for indicating whether each frame in the song audio has overlapped voice or not respectively;
outputting an overlapping voice detection result corresponding to the song audio through the overlapping voice detection model, wherein the overlapping voice detection result is used for indicating overlapping voice probability values corresponding to frames in the song audio respectively, and the overlapping voice probability value corresponding to a frame refers to the probability that the frame contains overlapping voice;
and adjusting parameters of the overlapped voice detection model according to the difference between the overlapped voice detection result and the overlapped voice marking result to obtain a trained overlapped voice detection model.
According to an aspect of the embodiments of the present application, there is provided an overlapping voice detection method, the method including:
acquiring song audio to be detected;
outputting an overlapping voice detection result corresponding to the song audio through an overlapping voice detection model, wherein the overlapping voice detection result is used for indicating overlapping voice probability values corresponding to frames in the song audio respectively, and the overlapping voice probability value corresponding to a frame refers to the probability that the frame contains overlapping voice;
and determining the overlapped voice segments in the song audio according to the overlapped voice detection result corresponding to the song audio.
According to an aspect of an embodiment of the present application, there is provided a training apparatus for overlapping human voice detection models, the apparatus including:
the sample acquisition module is used for acquiring a training sample set of the overlapped voice detection model, wherein the training sample set comprises at least one training sample, each training sample comprises a section of song audio and an overlapped voice marking result corresponding to the song audio, and the overlapped voice marking result is used for indicating whether each frame in the song audio has overlapped voice or not respectively;
the result output module is used for outputting an overlapping voice detection result corresponding to the song audio through the overlapping voice detection model, wherein the overlapping voice detection result is used for indicating overlapping voice probability values corresponding to frames in the song audio respectively, and the overlapping voice probability value corresponding to a frame refers to the probability that the frame contains overlapping voice;
and the parameter adjustment module is used for adjusting the parameters of the overlapped voice detection model according to the difference between the overlapped voice detection result and the overlapped voice marking result to obtain the trained overlapped voice detection model.
According to an aspect of the embodiments of the present application, there is provided an overlapping human voice detection apparatus, the apparatus including:
the audio acquisition module is used for acquiring song audio to be detected;
the result output module is used for outputting an overlapping voice detection result corresponding to the song audio through the overlapping voice detection model, wherein the overlapping voice detection result is used for indicating overlapping voice probability values corresponding to frames in the song audio respectively, and the overlapping voice probability value corresponding to a frame refers to the probability that the frame contains overlapping voice;
and the overlapping determining module is used for determining overlapping voice fragments in the song audio according to the overlapping voice detection result corresponding to the song audio.
According to an aspect of the embodiments of the present application, there is provided a computer device, including a processor and a memory, in which a computer program is stored, the computer program being loaded and executed by the processor to implement the training method of the above-mentioned overlapping voice detection model, or to implement the above-mentioned overlapping voice detection method.
According to an aspect of the embodiments of the present application, there is provided a computer readable storage medium having stored therein a computer program loaded and executed by a processor to implement the training method of the above-described overlapping human voice detection model, or to implement the above-described overlapping human voice detection method.
According to an aspect of embodiments of the present application, there is provided a computer program product that is loaded and executed by a processor to implement the training method of the above-described overlapping human voice detection model, or to implement the above-described overlapping human voice detection method.
The technical scheme provided by the embodiment of the application can comprise the following beneficial effects:
by training the overlapping voice detection model to generate overlapping voice detection results from training samples in which each frame of the song audio is marked as having or not having overlapping voice, the impact that the high degree of unison between overlapping voice parts has on overlapping voice detection is reduced or avoided, thereby improving the detection accuracy of overlapping voices in song audio.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an implementation environment for an embodiment provided herein;
FIG. 2 is a flow chart of a training method for an overlapping human voice detection model provided in one embodiment of the present application;
FIG. 3 is a flow chart of a training method for an overlapping human voice detection model provided in another embodiment of the present application;
FIG. 4 is a flow chart of a method for overlapping human voice detection provided in one embodiment of the present application;
FIG. 5 is a flow chart of a method for overlapping human voice detection provided in another embodiment of the present application;
FIG. 6 is a block diagram of a training apparatus for overlapping human voice detection models provided in one embodiment of the present application;
FIG. 7 is a block diagram of an overlapping voice detection apparatus provided in another embodiment of the present application;
FIG. 8 is a block diagram of a computer device provided in one embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of methods that are consistent with some aspects of the present application as detailed in the accompanying claims.
Referring to fig. 1, a schematic diagram of an implementation environment provided in one embodiment of the present application is shown, where the implementation environment may be implemented as an overlapping voice detection system. As shown in fig. 1, the system 10 may include: model training apparatus 11 and model using apparatus 12.
The model training device 11 is a computer device for training the overlapping voice detection model to obtain a trained overlapping voice detection model; the model using device 12 is a computer device that performs overlapping voice detection using the trained overlapping voice detection model. Here, a computer device refers to an electronic device with data computing, processing and storage capabilities. The computer device may be a terminal such as a PC (Personal Computer), tablet, smartphone, wearable device or smart robot; or it may be a server. The server may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing cloud computing services.
In some embodiments, in the training method of the overlapped voice detection model provided in the embodiments of the present application, the execution subject of each step may be the model training device 11; in the method for detecting overlapping voice provided in the embodiment of the present application, the execution subject of each step may be the model using device 12. The model training device 11 and the model using device 12 may be different devices or the same device, which is not particularly limited in the embodiment of the present application.
The following describes the technical scheme of the application through several embodiments.
Referring to fig. 2, a flowchart of a training method of an overlapping voice detection model according to an embodiment of the present application is shown. In the present embodiment, the method is mainly applied to the model training apparatus described above for illustration. The method may include the following steps (210-230):
step 210, a training sample set of the overlapped voice detection model is obtained, wherein the training sample set comprises at least one training sample, each training sample comprises a section of song audio and an overlapped voice marking result corresponding to the song audio, and the overlapped voice marking result is used for indicating whether each frame in the song audio has overlapped voice or not respectively.
In some embodiments, the song audio used to train the overlapping voice detection model may be of any type, such as popular songs, national songs, pure music/pure accompaniment, vocal singing songs, etc., as embodiments of the present application are not specifically limited in this regard. In some embodiments, the song audio (i.e., training samples) used to train the overlapping voice detection model may be audio that includes voice and accompaniment; or voice audio containing only voice; or may be accompaniment audio including only accompaniment, i.e., pure accompaniment/pure music.
In some embodiments, each song audio may be divided into a plurality of audio frames (audio frames may be referred to simply as frames). In some embodiments, the duration of each frame is equal. For example, song audio may be divided at 50 frames per second, i.e., 20 milliseconds per frame.
In some embodiments, the overlapping voice labeling results may be labeled in frames, where the overlapping voice labeling result of each frame of song audio in the training sample is a first labeling result or a second labeling result, where the first labeling result is used to indicate that the corresponding frame has overlapping voice, and the second labeling result is used to indicate that the corresponding frame does not have overlapping voice. In some embodiments, the first labeling result may be represented by a "1" and the second labeling result may be represented by a "0".
In some embodiments, a segment of song audio belongs to a chorus segment if there are multiple voices at each time instant in the segment, i.e., there is an overlapping voice (i.e., multiple voices superimposed) for each frame of the segment. In some embodiments, overlapping voices refer to voices that exist in the same frame of song audio, and overlapping voices may be two voices or more than two voices superimposed.
In some embodiments, overlapping vocal refers to harmony or vocal chorus. The harmony in the embodiments of the present application belongs to human voice, i.e. a sound made by a person or a sound that imitates a person. In some embodiments, harmony plays an auxiliary role in enriching the layers of a song and enhancing the listening experience, so the volume of the harmony may be lower than that of other sounds at corresponding times of the song audio (e.g., other human voice audio at corresponding times). In some embodiments, the pitch of the harmony may be higher or lower than that of other sounds in the song audio at the corresponding time (e.g., other human voice audio at the corresponding time). In some embodiments, a vocal chorus refers to song audio in which a single-part or multi-part vocal work is sung by multiple persons or groups, i.e., two or more human voices exist simultaneously in the same audio frame of an audio segment containing a chorus. In some embodiments, the overlapping vocals in the song audio may include only harmony, only vocal chorus, or both harmony and vocal chorus, which is not particularly limited in the embodiments of the present application. In the embodiments of the present application, the overlapping voice detection model is used to detect harmony and/or vocal chorus segments in song audio, thereby improving the detection accuracy of harmony segments and/or vocal chorus segments in song audio.
In some embodiments, the training sample set includes at least one first training sample including song audio that has overlapping human voices and at least one second training sample including song audio that has no overlapping human voices. In some embodiments, the second training samples account for a proportion a of the training sample set. In some embodiments, a may be 0.8%, 1%, 1.5%, 2%, etc.; of course, a may take other values, which may be set by a person skilled in the relevant art according to the actual situation, and this is not specifically limited in the embodiments of the present application.
In some embodiments, the training sample set includes both training samples with overlapping human voices (i.e., the first training sample) and training samples without overlapping human voices (i.e., the second training sample), so that the anti-noise capability of the overlapping human voice detection model can be improved, and the accuracy of the overlapping human voice detection result can be further improved.
Step 220, outputting an overlapping voice detection result corresponding to the song audio through the overlapping voice detection model, wherein the overlapping voice detection result is used for indicating overlapping voice probability values corresponding to frames in the song audio respectively, and the overlapping voice probability values corresponding to the frames are probability values of overlapping voice.
In some embodiments, the training samples are input into the overlapping voice detection model, and the model generates and outputs the corresponding overlapping voice detection results. In some embodiments, the overlapping voice detection result may be organized by frame, that is, the overlapping voice probability values corresponding to each frame of the song audio in the training sample together form the overlapping voice detection result for that song audio. In some embodiments, the overlapping voice detection model may be an RCRNN (Residual Convolutional Recurrent Neural Network), a network that combines recurrent neural networks with convolutional neural networks and is commonly used in sequence prediction tasks. Of course, the overlapping voice detection model may also be constructed based on other network structures, which is not specifically limited in the embodiments of the present application.
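For illustration only, the following is a minimal sketch of a convolutional recurrent network that maps per-frame spectral features (such as the mel features described below) to per-frame overlapping voice probabilities. The patent names an RCRNN but discloses no layer sizes, so every dimension here, and the omission of residual connections, is an assumption rather than the claimed architecture:

```python
import torch
import torch.nn as nn

class OverlapCRNN(nn.Module):
    """Minimal CRNN sketch; all sizes are illustrative assumptions."""
    def __init__(self, n_mels=80, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d((1, 2)),  # pool over frequency only, keep the time resolution
        )
        self.gru = nn.GRU(32 * (n_mels // 2), hidden,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, mel):                       # mel: (batch, frames, n_mels)
        x = self.conv(mel.unsqueeze(1))           # (batch, 32, frames, n_mels // 2)
        x = x.permute(0, 2, 1, 3).flatten(2)      # (batch, frames, 32 * n_mels // 2)
        x, _ = self.gru(x)
        return torch.sigmoid(self.head(x)).squeeze(-1)  # per-frame probability
```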
In some embodiments, this step 220 may include the steps of:
1. acquiring Mel spectrum characteristics of each frame in song audio;
2. inputting the mel spectrum features of each frame in the song audio to the overlapping voice detection model, and outputting, by the overlapping voice detection model, the overlapping voice detection result corresponding to the song audio.
In some embodiments, the mel spectrum features of each frame of the song audio in the training sample are extracted first, and then the mel spectrum features of each frame are input to the overlapping voice detection model to obtain the corresponding overlapping voice detection result. In some embodiments, the window length of the mel spectrum feature is 25 milliseconds and the window shift is 20 milliseconds, so the frame rate of the mel spectrum feature is 50 FPS (Frames Per Second), i.e., 50 frames per second, each frame being 20 milliseconds in duration. Accordingly, the frame rate of the overlapping voice marking result for the song audio is also 50 FPS. In this embodiment, the mel spectrum feature (Mel-spectrogram) is a commonly used feature extraction method in speech signal (i.e., human voice signal) and audio signal processing, designed based on the auditory perception characteristics of the human ear. The human ear's perception of an audio signal is not linear and is relatively complex. The mel frequency used to obtain the mel spectrum feature is a frequency scale designed according to the auditory characteristics of the human ear; unlike the common linear frequency scale, the mel scale better simulates the perception characteristics of the human ear. In this embodiment, the mel spectrum feature of the song audio is input to the overlapping voice detection model; because the mel spectrum feature better fits human auditory perception, the overlapping voice detection result output by the model also better matches the judgment a human ear would make, further improving the detection accuracy of overlapping voices in song audio.
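As a concrete illustration of the 25 ms window / 20 ms shift described above, the following sketch extracts per-frame mel features with librosa; the sample rate, FFT size, number of mel bands, and log compression are assumptions not specified in the patent:

```python
import librosa

def extract_mel_features(path, sr=16000, n_mels=80):
    """Return a (num_frames, n_mels) log-mel matrix at 50 frames per second."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=512,                       # assumed FFT size
        win_length=int(0.025 * sr),      # 25 ms window, as described above
        hop_length=int(0.020 * sr),      # 20 ms shift -> 50 FPS
        n_mels=n_mels)                   # assumed number of mel bands
    return librosa.power_to_db(mel).T    # log compression is a common choice
```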
And step 230, adjusting parameters of the overlapped voice detection model according to the difference between the overlapped voice detection result and the overlapped voice marking result to obtain the trained overlapped voice detection model.
In some embodiments, the difference between the overlapping voice detection result output by the overlapping voice detection model and the overlapping voice marking result is compared and analyzed, so that the detection accuracy of the overlapping voice detection model can be judged and the parameters of the model can be adjusted accordingly. In some embodiments, a loss function of the overlapping voice detection model is calculated based on the overlapping voice detection result and the overlapping voice marking result to obtain a value of the loss function, and the parameters of the model are adjusted based on that value; when the value of the loss function satisfies a condition, training is stopped to obtain the trained overlapping voice detection model. In some embodiments, the loss function may be a focal loss function. The focal loss function alleviates the training difficulty caused by sample imbalance, so adopting this embodiment improves the detection accuracy of the overlapping voice detection model.
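For reference, a common form of the binary focal loss applied to per-frame probabilities might look as follows; the alpha and gamma values are standard defaults, not values given in the patent:

```python
import torch

def focal_loss(probs, labels, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss; probs and labels are (batch, frames) tensors."""
    probs = probs.clamp(eps, 1.0 - eps)
    # Probability assigned to the true class of each frame.
    pt = torch.where(labels > 0.5, probs, 1.0 - probs)
    # Class-balancing weight: alpha for positive frames, 1 - alpha for negative.
    alpha_t = torch.where(labels > 0.5,
                          torch.full_like(probs, alpha),
                          torch.full_like(probs, 1.0 - alpha))
    # (1 - pt)^gamma down-weights easy, well-classified frames.
    return (-alpha_t * (1.0 - pt) ** gamma * torch.log(pt)).mean()
```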
In some embodiments, the trained overlapped voice detection model may be put into practical use, and the overlapped voice detection result output by the overlapped voice detection model is determined as the overlapped voice detection result of the song audio.
In summary, according to the technical scheme provided by the embodiments of the application, the overlapping voice detection model is trained to generate overlapping voice detection results using training samples in which each frame of the song audio is marked as having or not having overlapping voice. This reduces or avoids the impact that the high degree of unison between overlapping voice parts has on overlapping voice detection, thereby improving the detection accuracy of overlapping voices in song audio.
In some possible implementations, as shown in fig. 3, the step 210 may further include the following steps (211 to 214):
step 211, at least one piece of vocal audio and at least one piece of accompaniment audio are acquired.
In some embodiments, vocal audio refers to audio that contains only vocal sounds, without accompaniment; the accompaniment audio is audio containing only accompaniment and no human voice. In some embodiments, the vocal audio may be audio recorded while the vocal is singing (i.e., without accompaniment). In some embodiments, the accompaniment audio is recorded and/or produced pure accompaniment audio that has not been mixed with human voice. In some embodiments, the song audio including the voice and the accompaniment is subjected to voice accompaniment separation processing, and the voice and the accompaniment are separated to obtain voice audio including only the voice and accompaniment audio including only the accompaniment.
Step 212, for a first voice audio in the at least one voice audio, acquiring an overlapping voice annotation result corresponding to the first voice audio, where the first voice audio is a voice audio with overlapping voice.
In some embodiments, for a first voice audio in which an overlapping voice exists, an overlapping voice annotation result corresponding to the first voice audio is directly obtained.
Step 213, for a second voice audio in the at least one voice audio, adding a voice to be added to the second voice audio to obtain a third voice audio, and obtaining an overlapping voice annotation result corresponding to the third voice audio, where the second voice audio is a voice audio without overlapping voice.
In some embodiments, for the second voice audio without overlapping voice, the voice to be added may be added, so that in the obtained third voice audio, there is an overlapping portion between the second voice audio and the voice to be added, that is, the third voice audio with overlapping voice is obtained; and then, the overlapping voice marking result corresponding to the third voice frequency can be obtained.
In some embodiments, adding the voice to be added to the second voice audio results in a third voice audio, including at least one of the following ways 1-4:
Mode 1:
in some embodiments, the second voice audio is subjected to tone changing processing to obtain tone-changed second voice audio; and superposing the second voice audio and the changed second voice audio to obtain a third voice audio.
In some embodiments, the tone of the second voice audio is changed (i.e., the tone changing process) to obtain a tone-changed second voice audio, and the tone-changed second voice audio is the voice to be added; the second voice audio is then superimposed with the tone-changed second voice audio to obtain the third voice audio. Because only the tone of the second voice audio is changed and its duration is unchanged, the tone-changed second voice audio and the original second voice audio can be completely overlapped according to the start and stop times of the second voice audio to obtain the third voice audio. In this embodiment, by superimposing the tone-changed voice audio onto the original voice audio, song audio containing overlapping voices can be generated, which enriches the number of training samples and improves the training effect and training efficiency of the overlapping voice detection model.
In some embodiments, the tone-changed second voice audio includes an upward voice and/or a downward voice, where an upward voice refers to voice audio whose tone in the corresponding audio frames is higher than that of the second voice audio, and a downward voice refers to voice audio whose tone in the corresponding audio frames is lower than that of the second voice audio. In some embodiments, the upward voice and the downward voice may serve as harmony.
In some embodiments, the pitch of the second voice audio is raised as a whole to obtain the upward voice. In some embodiments, the upward voice may be b octaves higher than the second voice audio, b being a positive integer. In some embodiments, b may take a value of 3, i.e., the upward voice may be 3 octaves higher than the second voice audio. Of course, b may take other values, which is not particularly limited in the embodiments of the present application. In some embodiments, the second voice audio is raised by b octaves; according to the chords of the song audio, the audio segments whose tones fall outside the chords after the melody is raised are found, and the tones of these segments are adjusted to nearby in-chord tones, so that the upward voice is obtained.
In some embodiments, the pitch of the second voice audio is lowered as a whole to obtain the downward voice. In some embodiments, the downward voice may be voice audio c octaves lower than the second voice audio, c being a positive integer. In some embodiments, c may take a value of 1, i.e., the downward voice may be voice audio 1 octave lower than the second voice audio. Of course, c may take other values, which is not particularly limited in the embodiments of the present application. In some embodiments, the second voice audio is lowered by c octaves; according to the chords of the song audio, the audio segments whose tones fall outside the chords after the melody is lowered are found, and the tones of these segments are adjusted to nearby in-chord tones, so that the downward voice is obtained.
In some embodiments, the second voice audio is superimposed with the upward voice to obtain the third voice audio. In some embodiments, the upward voice may be superimposed into the second voice audio, for example by completely overlapping the start and stop times of the second voice audio and the upward voice, resulting in the third voice audio. In some embodiments, only a portion of the upward voice may be added to the second voice audio, and the remaining portion of the upward voice is discarded without being added.
In some embodiments, the second voice audio is superimposed with the downward voice to obtain the third voice audio. In some embodiments, the downward voice may be superimposed into the second voice audio, for example by completely overlapping the start and stop times of the second voice audio and the downward voice, resulting in the third voice audio. In some embodiments, only a portion of the downward voice may be added to the second voice audio, and the remaining portion of the downward voice is discarded without being added.
In some embodiments, the second voice audio, the upward voice and the downward voice are superimposed to obtain the third voice audio. In some embodiments, the upward voice and the downward voice may be superimposed into the second voice audio simultaneously, resulting in the third voice audio. Thus, the third voice audio comprises at least three different voice parts, from low to high: the downward voice, the second voice audio and the upward voice.
In some embodiments, the upward voice or the downward voice is superimposed at different positions of the second voice audio to obtain the third voice audio. In some embodiments, only a portion of the upward voice or the downward voice may be added to the second voice audio. In some embodiments, the corresponding upward voice is added in the lower-pitched audio portions of the second voice audio and the corresponding downward voice is added in the higher-pitched audio portions, while the downward voice corresponding to the lower-pitched portions and the upward voice corresponding to the higher-pitched portions are discarded without being added.
In some embodiments, the opposite arrangement is also possible: the corresponding downward voice is added in the lower-pitched audio portions of the second voice audio and the corresponding upward voice is added in the higher-pitched audio portions, while the upward voice corresponding to the lower-pitched portions and the downward voice corresponding to the higher-pitched portions are discarded without being added.
Mode 2:
in some embodiments, voice conversion processing is performed on the second voice audio to obtain a voice-converted second voice audio, where the second voice audio is audio recording the singing of a first voice, the voice-converted second voice audio is used to simulate the singing effect of a second voice, and the first voice and the second voice are different; the second voice audio and the voice-converted second voice audio are superimposed to obtain the third voice audio.
In some embodiments, the second voice audio sung by the first voice is converted, through voice conversion processing, into audio simulating the singing effect of the second voice, i.e., the voice-converted second voice audio, so that, to the human ear, the voice-converted second voice audio and the second voice audio do not sound like the same person singing. In some embodiments, the voice conversion process converts characteristics such as the timbre and pronunciation habits of the first voice contained in the second voice audio into voice audio having the timbre, pronunciation habits, and other characteristics of the second voice.
Mode 3:
in some embodiments, performing position movement processing on the second voice audio to obtain second voice audio after the position movement; and superposing the second voice audio and the second voice audio after the position movement to obtain a third voice audio.
In some embodiments, after the position of the second voice audio is shifted, it is superimposed with the original second voice audio, i.e., the two second voice audios overlap for part of the time, to obtain the third voice audio.
Mode 4:
in some embodiments, the second voice audio is superimposed with at least one fourth voice audio to obtain a third voice audio, the fourth voice audio being different from the song content of the second voice audio.
In some embodiments, the second vocal audio and the at least one fourth vocal audio from different songs are superimposed to obtain a third vocal audio.
In the above embodiment, through modes 1 to 4, voice audio that has undergone various processing is superimposed onto the original voice audio, so that song audio containing overlapping voices can be generated, which enriches the number of training samples and improves the training effect and training efficiency of the overlapping voice detection model, as illustrated by the sketch below.
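To make modes 1, 3 and 4 concrete, a minimal sketch follows; the one-octave shift, the 0.5 s offset and the simple additive mixing are illustrative assumptions, and mode 2 (voice conversion) is omitted because it requires a dedicated conversion model:

```python
import numpy as np
import librosa

def make_third_voice(voice, sr, mode, other_voice=None):
    """Derive a third voice audio from a second voice audio without overlap."""
    if mode == 1:   # mode 1: superimpose a tone-changed copy
        shifted = librosa.effects.pitch_shift(y=voice, sr=sr, n_steps=12)  # +1 octave (assumed)
        return voice + shifted
    if mode == 3:   # mode 3: superimpose a time-shifted copy (partial overlap)
        offset = int(0.5 * sr)  # 0.5 s shift (assumed)
        moved = np.concatenate([np.zeros(offset), voice])[:len(voice)]
        return voice + moved
    if mode == 4:   # mode 4: superimpose a voice from a different song
        n = min(len(voice), len(other_voice))
        return voice[:n] + other_voice[:n]
    raise ValueError("mode 2 (voice conversion) needs a separate conversion model")
```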
In some embodiments, the method may further comprise the steps of:
1. adding accompaniment audio to target voice audio to generate song audio, wherein the target voice audio is first voice audio or third voice audio;
2. determining an overlapping voice marking result corresponding to the target voice audio as an overlapping voice marking result corresponding to the song audio;
3. and generating at least one training sample based on at least one song audio and the overlapping voice annotation result corresponding to each song audio respectively to obtain a training sample set.
In some embodiments, regardless of whether the target voice audio is accompanied, accompaniment audio may be added to obtain song audio, and the required training samples are generated based on the obtained song audio and the corresponding overlapping voice annotation results. In this embodiment, adding accompaniment audio ensures that the song audio in the training samples includes accompaniment, which raises the detection difficulty for the overlapping voice detection model and thus improves its training efficiency.
In some embodiments, the ratio of the average volume of the accompaniment audio to the average volume of the target voice audio is less than 1, to ensure as far as possible that the accompaniment is quieter than the target voice audio, so that the resulting song audio better matches real-world conditions. In some embodiments, the ratio is randomly determined within a set range. For example, the ratio may range from 0.3 to 0.9.
In some embodiments, a segment of human voice audio of a first duration intercepted from a target human voice audio is obtained; randomly selecting an accompaniment segment from the accompaniment segment set, adding the accompaniment segment to the voice audio segment to generate song audio, wherein the accompaniment segment set comprises a plurality of accompaniment audio segments, each accompaniment audio segment is of a first time length, and the accompaniment audio segments are obtained by intercepting the accompaniment audio.
In some embodiments, the accompaniment audio added to the target voice audio is randomly selected from the accompaniment clip set, so that song audio whose accompaniment does not necessarily harmonize with the voice melody is generated, which raises the detection difficulty for the overlapping voice detection model and thus improves its training efficiency; a mixing sketch is given below.
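A sketch of this mixing step, using RMS level as a stand-in for "average volume" (an assumption; the patent does not define the volume measure):

```python
import numpy as np

def mix_with_accompaniment(voice, accompaniment_set, rng=None):
    """Mix a voice clip with a randomly chosen, equally long accompaniment clip."""
    rng = rng or np.random.default_rng()
    acc = accompaniment_set[rng.integers(len(accompaniment_set))]
    ratio = rng.uniform(0.3, 0.9)   # accompaniment-to-voice volume ratio, from the example above

    def rms(x):
        return np.sqrt(np.mean(x ** 2) + 1e-12)

    # Scale so the accompaniment's average level is `ratio` times the voice's.
    return voice + acc * (ratio * rms(voice) / rms(acc))
```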
Step 214, generating at least one training sample according to the overlapping voice marking result corresponding to the at least one section of the first voice audio, the overlapping voice marking result corresponding to the at least one section of the third voice audio, and the at least one section of accompaniment audio, and obtaining a training sample set.
In some embodiments, each first voice audio and its corresponding overlapping voice annotation result are determined to be a training sample; likewise, each third voice audio and its corresponding overlapping voice annotation result are determined to be a training sample.
In this implementation, the training samples can come both from existing song audio that contains overlapping voices and from synthesized song audio that contains overlapping voices, which increases the number of training samples with overlapping voices and thus improves the training effect of the overlapping voice detection model, i.e., the accuracy with which the model detects overlapping voices in song audio.
Referring to fig. 4, a flowchart of an overlapping voice detection method according to an embodiment of the present application is shown. In the present embodiment, the method is mainly applied to the model using apparatus described above for illustration. The method may include the following steps (410-430):
In step 410, song audio to be detected is obtained.
In some embodiments, the song audio to be detected may be of any type, such as pop songs, ethnic songs, pure music/pure accompaniment, vocal songs, etc., as embodiments of the present application are not specifically limited thereto.
Step 420, outputting an overlapping voice detection result corresponding to the song audio through the overlapping voice detection model, where the overlapping voice detection result is used to indicate overlapping voice probability values corresponding to each frame in the song audio, and the overlapping voice probability value corresponding to the frame refers to a probability value of the frame having overlapping voice.
In some embodiments, the song audio to be detected is input into the overlapped voice detection model, and a corresponding overlapped voice detection result can be obtained. In some embodiments, the overlapping human voice detection model may be an overlapping human voice detection model that was trained based on the embodiments of fig. 2 and 3 described above.
In some embodiments, overlapping vocal refers to harmony or vocal chorus.
Step 430, determining the overlapped voice segments in the song audio according to the overlapped voice detection result corresponding to the song audio.
In some embodiments, the overlapping voice detection results corresponding to the song audio to be detected are analyzed and processed, so that overlapping voice fragments in the song audio can be determined.
In some embodiments, this step 430 may further include the steps of:
1. marking a first numerical value for frames with overlapping voice probability values larger than a first threshold value;
2. marking a second numerical value for the frames with the overlapped voice probability value smaller than or equal to the first threshold value, wherein the second numerical value is different from the first numerical value;
3. consecutive frames marked with a first value and having a number greater than or equal to a second threshold are determined to be overlapping segments of human voice in the song audio.
In some embodiments, the first threshold is a number between 0 and 1, and the first threshold may be 0.5, or may be other values, which may be specifically set by a person skilled in the relevant arts according to the actual situation, which is not specifically limited in the embodiments of the present application. In some embodiments, the first value is used to indicate that there is overlapping voice for the corresponding frame and the second value is used to indicate that there is no overlapping voice for the corresponding frame. In some embodiments, the first value may be 1 and the second value may be 0. Of course, other manners of taking the first value and the second value may also exist, which are not specifically limited in the embodiments of the present application.
In some embodiments, if only a few consecutive frames are marked with the first value while the adjacent frames are marked with the second value, these consecutive frames most likely do not contain overlapping voices, because a genuine overlapping voice segment is unlikely to be so short. Therefore, only consecutive frames marked with the first value whose number is greater than or equal to the second threshold are considered overlapping voice segments in the song audio.
In some embodiments, the second threshold may be 50, 100, 120, etc., and the specific value of the second threshold may be set by a person skilled in the relevant art, which is not specifically limited in the embodiments of the present application.
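Steps 1 to 3 above might be implemented as follows; the threshold values mirror the examples just given (0.5 for the first threshold, and 100 frames, i.e., 2 s at 50 FPS, for the second):

```python
import numpy as np

def find_overlap_segments(probs, first_threshold=0.5, second_threshold=100):
    """Return (start_frame, end_frame) pairs of overlapping voice segments."""
    marks = (np.asarray(probs) > first_threshold).astype(int)  # first value 1, second value 0
    segments, start = [], None
    for i, m in enumerate(marks):
        if m == 1 and start is None:
            start = i                                  # a run of first-value frames begins
        elif m == 0 and start is not None:
            if i - start >= second_threshold:          # keep only sufficiently long runs
                segments.append((start, i))
            start = None
    if start is not None and len(marks) - start >= second_threshold:
        segments.append((start, len(marks)))
    return segments
```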
In some embodiments, the descriptions of the steps 410 to 430 may refer to the contents of the embodiments of fig. 2 and 3, and are not repeated herein.
In summary, in the technical scheme provided by the embodiments of the application, the overlapping voice detection model is trained to generate overlapping voice detection results, which reduces or avoids the impact that the high degree of unison between overlapping voice parts has on overlapping voice detection, thereby improving the detection accuracy of overlapping voices in song audio.
In some possible implementations, as shown in fig. 5, the step 420 may further include the following steps (421 to 423):
in step 421, n audio clips are extracted from the song audio, where for two adjacent audio clips, there is an overlap between the tail of the previous audio clip and the head of the next audio clip, and n is an integer greater than 1.
In some embodiments, considering that the overlapping voice detection model may be relatively inaccurate and unstable when detecting near the edges of its input segments, a head-to-tail overlapping division may be used when dividing the song audio into audio clips. For example, each audio clip has a duration of 25s (seconds) and the overlapping portion between adjacent audio clips is set to 5s; arranging the divided n audio clips in time order, the first audio clip is 0s to 25s, the second audio clip is 20s to 45s, the third audio clip is 40s to 65s, and so on; if the last audio clip is shorter than 25s, it is zero-padded to 25s.
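The division just described could be sketched as follows, with the 25 s clip length and 5 s overlap from the example above:

```python
import numpy as np

def split_into_clips(audio, sr, clip_s=25, overlap_s=5):
    """Split song audio into head-to-tail overlapping clips of equal length."""
    clip_len = clip_s * sr
    step = (clip_s - overlap_s) * sr            # 20 s stride between clip starts
    clips = []
    for start in range(0, max(len(audio) - overlap_s * sr, 1), step):
        clip = audio[start:start + clip_len]
        if len(clip) < clip_len:                # zero-pad the final clip to 25 s
            clip = np.pad(clip, (0, clip_len - len(clip)))
        clips.append(clip)
    return clips
```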
Step 422, outputting the overlapping voice detection results corresponding to the n audio clips respectively through the overlapping voice detection model.
In some embodiments, n audio clips are input into the overlapped voice detection model, so as to obtain overlapped voice detection results respectively corresponding to the n audio clips.
In some embodiments, this step 422 may further include the steps of:
1. for each of the n audio clips, obtaining mel spectrum characteristics of each frame in the audio clip;
2. inputting the mel spectrum features of each frame in the audio clip to the overlapping voice detection model, and outputting, by the overlapping voice detection model, the overlapping voice detection result corresponding to the audio clip.
In some embodiments, the window length of the mel-frequency spectral feature is 25 milliseconds and the window shift is 20 milliseconds, so the frame rate of the mel-frequency spectral feature is 50FPS, i.e., 50 frames per second, each frame being 20 milliseconds.
In this embodiment, the mel spectrum feature of the song audio is input to the overlapped voice detection model, and because the mel spectrum feature is more fit to the human ear perception feature, the overlapped voice detection result output by the overlapped voice detection model also more fits the overlapped voice judgment result of human ear resolution judgment, thereby improving the detection accuracy of the overlapped voice in the song audio.
Step 423, splicing the overlapping voice detection results corresponding to the n audio clips respectively to obtain the overlapping voice detection result corresponding to the song audio.
In some embodiments, this step 423 can be regarded as trimming the head and tail of each of the n audio clips, removing the overlapping portions, and stitching the results together to obtain the overlapping voice probability of each frame of the complete, non-overlapping song audio.
In some embodiments, this step 423 may further comprise the steps of:
1. for each of the n audio clips, trimming the head and tail of the overlapping voice detection result corresponding to the audio clip to obtain a processed overlapping voice detection result corresponding to the audio clip, where trimming the head removes the overlapping voice probability values corresponding to at least one frame at the head of the audio clip, and trimming the tail removes the overlapping voice probability values corresponding to at least one frame at the tail of the audio clip;
2. and splicing the processed overlapping voice detection results corresponding to the n audio clips respectively to obtain the overlapping voice detection result corresponding to the song audio.
In some embodiments, for an audio segment in the middle of the n audio segments, i.e., other audio segments than the first and last audio segments, the overlapping human voice probability values of the audio segment head portion of each audio segment are removed, respectively, and the overlapping human voice probability values of the audio segment tail portion of each audio segment are removed, respectively. In some embodiments, the portion of the head portion that is removed is half the overlapping length of the two adjacent audio segments, and the portion of the tail portion that is removed is half the overlapping length of the two adjacent audio segments; for the first audio segment, only the tail part needs to be removed, and the tail part accounts for half of the overlapping length of two adjacent audio segments; for the last audio segment, only the head portion needs to be removed, which portion accounts for half the overlapping length of the two adjacent audio segments. For example, if two adjacent audio clips overlap by 250 frames, the length of the portion of the head portion removed is 125 frames (i.e., 2.5 s) and the length of the portion of the tail portion removed is 125 frames for the audio clip located in the middle; for the first audio clip, only 125 frames of the tail portion need to be removed; for the last audio clip, only 125 frames of the header portion need to be removed.
In some embodiments, after the head and tail trimming in this step, the processed overlapping voice detection results of two adjacent audio clips join at two consecutive frames of the song audio, namely the last remaining frame at the tail of the previous audio clip and the first remaining frame at the head of the next audio clip, so that overlapping voice detection results are obtained for every frame of the complete, non-overlapping song audio.
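A sketch of this trim-and-stitch step, following the 250-frame-overlap example above (125 frames trimmed from each adjoining edge):

```python
import numpy as np

def stitch_results(clip_probs, overlap_frames=250):
    """Trim clip edges and concatenate per-frame results into one sequence."""
    half = overlap_frames // 2                   # 125 frames = 2.5 s at 50 FPS
    pieces = []
    for i, p in enumerate(clip_probs):
        p = np.asarray(p)
        head = 0 if i == 0 else half             # keep the head of the first clip only
        tail = len(p) if i == len(clip_probs) - 1 else len(p) - half
        pieces.append(p[head:tail])              # middle clips lose both edges
    return np.concatenate(pieces)
```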
In this embodiment, the audio clips are first extracted with overlap, then the heads and tails of the overlapping voice detection results corresponding to the n audio clips are trimmed, and the results are stitched to obtain the overlapping voice detection results corresponding to each frame of the complete, non-overlapping song audio, which largely reduces the negative influence of unstable edge detection on the overlapping voice detection results and improves the detection accuracy of overlapping voices in song audio.
In some embodiments, the part of the explanation of the steps 421 to 423 may refer to the content of the embodiments of fig. 2 and 3, which is not described herein.
In the implementation manner, the audio clips of the song audio are extracted in an overlapping mode, so that the obtained overlapping voice detection result avoids the influence of unstable edge detection as much as possible, and the accuracy of detecting the overlapping voice in the song audio is improved.
The following are device embodiments of the present application, which may be used to perform method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.
Referring to fig. 6, a block diagram of a training apparatus for an overlapping human voice detection model according to an embodiment of the present application is shown. The device has the function of realizing the training method example of the overlapped human voice detection model, and the function can be realized by hardware or can be realized by executing corresponding software by hardware. The device can be the model training equipment introduced above, or can be arranged on the model training equipment. The apparatus 600 may include: a sample acquisition module 610, a result output module 620, and a parameter adjustment module 630.
The sample obtaining module 610 is configured to obtain a training sample set of the overlapping voice detection model, where the training sample set includes at least one training sample, each training sample includes a section of song audio and an overlapping voice marking result corresponding to the song audio, and the overlapping voice marking result is used to indicate whether each frame in the song audio has overlapping voice respectively.
The result output module 620 is configured to output, by using the overlapping voice detection model, an overlapping voice detection result corresponding to the song audio, where the overlapping voice detection result is used to indicate overlapping voice probability values corresponding to frames in the song audio, where the overlapping voice probability value corresponding to a frame is a probability value that the frame has overlapping voice.
The parameter adjustment module 630 is configured to adjust parameters of the overlapping voice detection model according to a difference between the overlapping voice detection result and the overlapping voice labeling result, so as to obtain a trained overlapping voice detection model.
In some embodiments, the sample acquisition module 610 is configured to:
acquiring at least one piece of voice audio and at least one piece of accompaniment audio;
for a first voice audio in the at least one voice audio, acquiring an overlapping voice marking result corresponding to the first voice audio, wherein the first voice audio is voice audio with overlapping voice;
adding a voice to be added to the second voice audio to obtain a third voice audio for a second voice audio in the at least one voice audio, and obtaining an overlapping voice marking result corresponding to the third voice audio, wherein the second voice audio refers to a voice audio without overlapping voice;
and generating at least one training sample according to the overlapping voice marking results corresponding to the at least one piece of first voice audio, the overlapping voice marking results corresponding to the at least one piece of third voice audio, and the at least one piece of accompaniment audio, to obtain the training sample set.
In some embodiments, the sample acquisition module 610 is configured to:
performing tone changing processing on the second voice audio to obtain tone-changed second voice audio; superposing the second voice audio and the changed second voice audio to obtain the third voice audio;
or, performing voice conversion processing on the second voice audio to obtain second voice audio after voice conversion, wherein the second voice audio is audio for recording a singing of a first voice, the second voice audio after voice conversion is used for simulating a singing effect of the second voice, and the first voice and the second voice are different; superposing the second voice audio and the voice-converted second voice audio to obtain the third voice audio;
or, performing position movement processing on the second voice audio to obtain second voice audio after the position movement; superposing the second voice audio and the second voice audio after the position movement to obtain the third voice audio;
or, superposing the second voice audio and at least one fourth voice audio to obtain the third voice audio, wherein the fourth voice audio is different from the second voice audio in song content.
In some embodiments, the tone-changed second voice audio includes an upward voice and/or a downward voice, where the upward voice refers to voice audio whose tone in the corresponding audio frames is higher than that of the second voice audio, and the downward voice refers to voice audio whose tone in the corresponding audio frames is lower than that of the second voice audio; the sample acquisition module 610 is configured to:
superposing the second voice audio and the upward voice to obtain the third voice audio;
or, superposing the second voice audio and the downward voice to obtain the third voice audio;
or, superposing the second voice audio, the upward voice and the downward voice to obtain the third voice audio;
or, superposing the upward voice or the downward voice at different positions of the second voice audio to obtain the third voice audio.
In some embodiments, the sample acquisition module 610 is configured to:
adding the accompaniment audio to target voice audio to generate song audio, wherein the target voice audio is the first voice audio or the third voice audio;
determining an overlapping voice marking result corresponding to the target voice audio as an overlapping voice marking result corresponding to the song audio;
And generating at least one training sample based on at least one song audio and overlapping voice marking results corresponding to each song audio respectively to obtain the training sample set.
In some embodiments, the sample acquisition module 610 is configured to:
acquiring a voice audio fragment of a first duration intercepted from the target voice audio;
randomly selecting an accompaniment segment from an accompaniment segment set, and adding the accompaniment segment to the voice audio segment to generate the song audio, wherein the accompaniment segment set includes a plurality of accompaniment audio segments, each accompaniment audio segment has the first duration, and the accompaniment audio segments are intercepted from the accompaniment audio (a sketch of this step follows).
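A sketch of this sample-assembly step, assuming raw waveforms at a common sample rate. The ten-second first duration, the 0.8 accompaniment gain, and the sample-aligned label mask are assumptions made for illustration only.

```python
import random
import numpy as np

def make_song_sample(vocal: np.ndarray, label_mask: np.ndarray,
                     accompaniment_clips: list[np.ndarray],
                     sr: int, first_duration_s: float = 10.0):
    n = int(first_duration_s * sr)
    # Intercept a vocal clip of the first duration (vocal assumed >= n samples).
    start = random.randint(0, len(vocal) - n)
    clip = vocal[start:start + n]
    clip_mask = label_mask[start:start + n]  # overlapping voice marks carry over
    # Randomly select an equally long accompaniment clip and add it.
    accomp = random.choice(accompaniment_clips)
    song = clip + 0.8 * accomp[:n]
    return song, clip_mask
```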
In some embodiments, the training sample set includes at least one first training sample including song audio having overlapping human voices and at least one second training sample including song audio having no overlapping human voices.
In some embodiments, the result output module 620 is configured to:
acquiring the mel spectrum features of each frame in the song audio;
and inputting the mel spectrum features of each frame in the song audio to the overlapping voice detection model, and outputting, by the overlapping voice detection model, the overlapping voice detection result corresponding to the song audio (a sketch of this feature step follows).
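A plausible realization of the feature step, assuming librosa; the sample rate, FFT size, hop length, and 80 mel bands are illustrative choices the patent does not specify.

```python
import librosa
import numpy as np

def mel_frames(song: np.ndarray, sr: int = 16000) -> np.ndarray:
    # One 80-dimensional log-mel vector per frame; hop_length sets the frame rate.
    mel = librosa.feature.melspectrogram(y=song, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=80)
    log_mel = librosa.power_to_db(mel)
    return log_mel.T  # (n_frames, n_mels), ready to feed the detection model
```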
In some embodiments, the overlapping vocal is a harmony or vocal chorus.
In summary, according to the technical scheme provided by the embodiments of the application, the overlapping voice detection model is trained, with training samples in which each frame of the song audio is marked as having or not having overlapping voice, to generate the overlapping voice detection result; this reduces or avoids the influence that the high uniformity of the vocal parts of the overlapping voice has on overlapping voice detection, thereby improving the detection accuracy of overlapping voice in song audio.
Referring to fig. 7, a block diagram of an overlapping voice detection apparatus according to an embodiment of the present application is shown. The apparatus has the function of implementing the above overlapping voice detection method example; the function may be implemented by hardware, or by hardware executing corresponding software. The apparatus may be the model-using device described above, or may be disposed in the model-using device. The apparatus 700 may include: an audio acquisition module 710, a result output module 720, and an overlap determination module 730.
The audio acquisition module 710 is configured to acquire audio of a song to be detected.
The result output module 720 is configured to output, through an overlapping voice detection model, an overlapping voice detection result corresponding to the song audio, where the overlapping voice detection result is used to indicate overlapping voice probability values respectively corresponding to the frames in the song audio, and the overlapping voice probability value corresponding to a frame refers to the probability that the frame has overlapping voice.
The overlapping determining module 730 is configured to determine overlapping voice segments in the song audio according to the overlapping voice detection result corresponding to the song audio.
In some embodiments, the result output module 720 is configured to:
extracting n audio clips from the song audio, wherein for two adjacent audio clips, the tail of the previous audio clip and the head of the next audio clip overlap, and n is an integer greater than 1;
outputting overlapping voice detection results respectively corresponding to the n audio clips through the overlapping voice detection model;
and splicing the overlapping voice detection results respectively corresponding to the n audio clips to obtain the overlapping voice detection result corresponding to the song audio (a sketch of this windowed inference follows).
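The clip extraction might look like the sketch below; the clip length and the overlap size are assumed parameters, not values from the patent.

```python
import numpy as np

def extract_clips(song: np.ndarray, clip_len: int, overlap: int) -> list[np.ndarray]:
    # Adjacent clips share `overlap` samples: the tail of one clip is the
    # head of the next, so boundary frames are seen with full context.
    hop = clip_len - overlap
    clips = []
    for start in range(0, max(len(song) - overlap, 1), hop):
        clips.append(song[start:start + clip_len])  # last clip may be shorter
    return clips
```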
In some embodiments, the result output module 720 is configured to:
for each audio clip of the n audio clips, performing head-trimming and tail-trimming processing on the overlapping voice detection result corresponding to the audio clip to obtain a processed overlapping voice detection result corresponding to the audio clip, where the head-trimming processing refers to removing the overlapping voice probability values respectively corresponding to at least one frame at the head of the audio clip, and the tail-trimming processing refers to removing the overlapping voice probability values respectively corresponding to at least one frame at the tail of the audio clip;
and splicing the processed overlapping voice detection results respectively corresponding to the n audio clips to obtain the overlapping voice detection result corresponding to the song audio (a sketch of this trim-and-splice step follows).
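A sketch of the trim-and-splice step. It assumes head_trim + tail_trim equals the frame overlap between adjacent clips, so that every frame of the song contributes exactly once; it also keeps the very first head and the very last tail, since those have no neighboring clip.

```python
import numpy as np

def splice_results(clip_probs: list[np.ndarray],
                   head_trim: int, tail_trim: int) -> np.ndarray:
    pieces = []
    for i, probs in enumerate(clip_probs):
        lo = 0 if i == 0 else head_trim                               # trim head
        hi = len(probs) if i == len(clip_probs) - 1 else len(probs) - tail_trim
        pieces.append(probs[lo:hi])                                   # trim tail
    return np.concatenate(pieces)  # song-level overlapping voice detection result
```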
In some embodiments, the result output module 720 is configured to:
for each audio clip of the n audio clips, acquiring the mel spectrum features of each frame in the audio clip;
and inputting the mel spectrum features of each frame in the audio clip to the overlapping voice detection model, and outputting, by the overlapping voice detection model, the overlapping voice detection result corresponding to the audio clip.
In some embodiments, the overlap determination module 730 is configured to:
labeling frames whose overlapping voice probability value is greater than a first threshold with a first value;
labeling frames whose overlapping voice probability value is less than or equal to the first threshold with a second value, the second value being different from the first value;
and determining consecutive frames that are marked with the first value and whose number is greater than or equal to a second threshold as overlapping voice segments in the song audio (a sketch of this post-processing follows).
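A sketch of this post-processing under assumed thresholds (0.5 for the first threshold, 20 frames for the second; both are illustrative, not values from the patent):

```python
import numpy as np

def find_overlap_segments(probs: np.ndarray, first_threshold: float = 0.5,
                          second_threshold: int = 20) -> list[tuple[int, int]]:
    # Binarize: first value 1 above the threshold, second value 0 otherwise.
    marks = (probs > first_threshold).astype(int)
    segments, start = [], None
    for i, m in enumerate(marks):
        if m and start is None:
            start = i                                  # run of 1s begins
        elif not m and start is not None:
            if i - start >= second_threshold:          # enough consecutive frames
                segments.append((start, i))
            start = None
    if start is not None and len(marks) - start >= second_threshold:
        segments.append((start, len(marks)))           # run extends to the end
    return segments  # (start_frame, end_frame) overlapping voice segments
```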
In some embodiments, the overlapping vocal is a harmony or vocal chorus.
In summary, in the technical scheme provided by the embodiments of the application, the trained overlapping voice detection model generates the overlapping voice detection result, which reduces or avoids the influence that the high uniformity of the vocal parts of the overlapping voice has on overlapping voice detection, thereby improving the detection accuracy of overlapping voice in song audio.
It should be noted that, when the apparatus provided in the foregoing embodiments implements its functions, the division into the foregoing functional modules is merely used as an example; in practical applications, the foregoing functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided in the foregoing embodiments belong to the same concept; for details of their specific implementation, refer to the method embodiments, which are not repeated here.
Referring to fig. 8, a block diagram of a computer device according to an embodiment of the present application is shown. The computer device may be implemented as the model training device or the model-using device described above. The computer device is used to implement the training method of the overlapping human voice detection model provided in the above embodiments, or to implement the overlapping human voice detection method on the model-using device side provided in the above embodiments. Specifically:
The computer apparatus 800 includes a CPU (Central Processing Unit ) 801, a system Memory 804 including a RAM (Random Access Memory ) 802 and a ROM (Read-Only Memory) 803, and a system bus 805 connecting the system Memory 804 and the central processing unit 801. The computer device 800 also includes a basic I/O (Input/Output) system 806 that facilitates the transfer of information between various devices within the computer, and a mass storage device 807 for storing an operating system 813, application programs 814, and other program modules 815.
The basic input/output system 806 includes a display 808 for displaying information and an input device 809, such as a mouse or keyboard, through which a user inputs information. The display 808 and the input device 809 are both connected to the central processing unit 801 via an input/output controller 810 connected to the system bus 805. The basic input/output system 806 may also include the input/output controller 810 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 810 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 807 is connected to the central processing unit 801 through a mass storage controller (not shown) connected to the system bus 805. The mass storage device 807 and its associated computer-readable media provide non-volatile storage for the computer device 800. That is, the mass storage device 807 may include a computer readable medium (not shown) such as a hard disk or CD-ROM (Compact Disc Read-Only Memory) drive.
The computer readable medium may include computer storage media and communication media without loss of generality. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media include RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory or other solid state memory, CD-ROM, DVD (Digital Video Disc) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the above. The system memory 804 and the mass storage device 807 described above may be collectively referred to as memory.
According to various embodiments of the present application, the computer device 800 may also operate by being connected to a remote computer on a network, such as the Internet. That is, the computer device 800 may be connected to a network 812 through a network interface unit 811 connected to the system bus 805, or the network interface unit 811 may be used to connect to other types of networks or remote computer systems (not shown).
In an exemplary embodiment, a computer readable storage medium is also provided, in which a computer program is stored, which, when being executed by a processor, implements the above-mentioned training method of the overlapping human voice detection model, or implements the above-mentioned overlapping human voice detection method.
In an exemplary embodiment, a computer program product is also provided, the computer program product comprising a computer program that is loaded and executed by a processor to implement the training method of the above-described overlapping human voice detection model, or to implement the above-described overlapping human voice detection method.
It should be understood that references herein to "a plurality" mean two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates an "or" relationship between the associated objects.
The foregoing description is only of exemplary embodiments of the present application and is not intended to limit the application to the particular embodiments disclosed; on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the application.

Claims (19)

1. A method of training an overlapping human voice detection model, the method comprising:
acquiring a training sample set of the overlapped voice detection model, wherein the training sample set comprises at least one training sample, each training sample comprises a section of song audio and an overlapped voice marking result corresponding to the song audio, and the overlapped voice marking result is used for indicating whether each frame in the song audio has overlapped voice or not respectively;
outputting an overlapping voice detection result corresponding to the song audio through the overlapping voice detection model, wherein the overlapping voice detection result is used for indicating overlapping voice probability values respectively corresponding to frames in the song audio, and the overlapping voice probability value corresponding to a frame refers to the probability that the frame has overlapping voice;
and adjusting parameters of the overlapped voice detection model according to the difference between the overlapped voice detection result and the overlapped voice marking result to obtain a trained overlapped voice detection model.
2. The method of claim 1, wherein the obtaining a training sample set of the overlapping human voice detection model comprises:
acquiring at least one piece of voice audio and at least one piece of accompaniment audio;
for a first voice audio in the at least one voice audio, acquiring an overlapping voice marking result corresponding to the first voice audio, wherein the first voice audio is voice audio with overlapping voice;
for a second voice audio in the at least one voice audio, adding a to-be-added voice to the second voice audio to obtain a third voice audio, and acquiring an overlapping voice marking result corresponding to the third voice audio, wherein the second voice audio refers to voice audio without overlapping voice;
and generating at least one training sample according to the at least one first voice audio and its corresponding overlapping voice marking result, the at least one third voice audio and its corresponding overlapping voice marking result, and the at least one accompaniment audio, to obtain the training sample set.
3. The method of claim 2, wherein adding the to-be-added human voice to the second human voice audio results in a third human voice audio, comprising:
performing pitch-shifting processing on the second voice audio to obtain pitch-shifted second voice audio; superposing the second voice audio and the pitch-shifted second voice audio to obtain the third voice audio;
or,
performing voice conversion processing on the second voice audio to obtain voice-converted second voice audio, wherein the second voice audio is audio recording singing in a first voice, the voice-converted second voice audio is used to simulate the singing effect of a second voice, and the first voice and the second voice are different; superposing the second voice audio and the voice-converted second voice audio to obtain the third voice audio;
or,
performing position movement processing on the second voice audio to obtain second voice audio after the position movement; superposing the second voice audio and the second voice audio after the position movement to obtain the third voice audio;
or,
and superposing the second voice audio and at least one fourth voice audio to obtain the third voice audio, wherein the fourth voice audio is different from the second voice audio in song content.
4. The method according to claim 3, wherein the pitch-shifted second voice audio comprises an ascending voice and/or a descending voice, the ascending voice being voice audio whose corresponding audio frames are higher in pitch than the second voice audio, and the descending voice being voice audio whose corresponding audio frames are lower in pitch than the second voice audio;
the superposing the second voice audio and the pitch-shifted second voice audio to obtain the third voice audio comprises:
superposing the second voice audio and the ascending voice to obtain the third voice audio;
or,
superposing the second voice audio and the descending voice to obtain the third voice audio;
or,
superposing the second voice audio, the ascending voice and the descending voice to obtain the third voice audio;
or,
and superposing the ascending voice or the descending voice at different positions of the second voice audio to obtain the third voice audio.
5. The method of claim 2, wherein generating at least one training sample from at least one overlapping voice marking result corresponding to the first voice audio, at least one overlapping voice marking result corresponding to the third voice audio, and the at least one accompaniment audio, comprises:
adding the accompaniment audio to a target voice audio to generate the song audio, wherein the target voice audio is the first voice audio or the third voice audio;
determining an overlapping voice marking result corresponding to the target voice audio as an overlapping voice marking result corresponding to the song audio;
and generating at least one training sample based on at least one song audio and overlapping voice marking results corresponding to each song audio respectively to obtain the training sample set.
6. The method of claim 5, wherein adding the accompaniment audio to the target human voice audio generates the song audio, comprising:
acquiring a voice audio fragment of a first duration intercepted from the target voice audio;
randomly selecting an accompaniment segment from an accompaniment segment set, adding the accompaniment segment to the voice audio segment, and generating the song audio, wherein the accompaniment segment set comprises a plurality of accompaniment audio segments, each accompaniment audio segment is of the first duration, and the accompaniment audio segment is obtained by intercepting the accompaniment audio.
7. The method of claim 1, wherein the training sample set comprises at least one first training sample comprising song audio having overlapping voices and at least one second training sample comprising song audio having no overlapping voices.
8. The method according to claim 1, wherein outputting, by the overlapping voice detection model, an overlapping voice detection result corresponding to the song audio, comprises:
acquiring the mel spectrum features of each frame in the song audio;
and inputting the mel spectrum features of each frame in the song audio to the overlapping voice detection model, and outputting, by the overlapping voice detection model, the overlapping voice detection result corresponding to the song audio.
9. The method according to any one of claims 1 to 8, wherein the overlapping human voice is harmony or chorus.
10. A method of overlapping human voice detection, the method comprising:
acquiring song audio to be detected;
outputting, through an overlapping voice detection model, an overlapping voice detection result corresponding to the song audio, wherein the overlapping voice detection result is used for indicating overlapping voice probability values respectively corresponding to frames in the song audio, and the overlapping voice probability value corresponding to a frame is the probability that overlapping voice exists in the frame;
and determining the overlapped voice segments in the song audio according to the overlapped voice detection result corresponding to the song audio.
11. The method of claim 10, wherein outputting the overlapping voice detection result corresponding to the song audio through the overlapping voice detection model comprises:
extracting n audio clips from the song audio, wherein for two adjacent audio clips, the tail of the previous audio clip and the head of the next audio clip overlap, and n is an integer greater than 1;
outputting overlapping voice detection results respectively corresponding to the n audio clips through the overlapping voice detection model;
and splicing the overlapping voice detection results corresponding to the n audio clips respectively to obtain the overlapping voice detection result corresponding to the song audio.
12. The method of claim 11, wherein the splicing the overlapping voice detection results respectively corresponding to the n audio clips to obtain the overlapping voice detection result corresponding to the song audio comprises:
for each audio clip of the n audio clips, performing head-trimming and tail-trimming processing on the overlapping voice detection result corresponding to the audio clip to obtain a processed overlapping voice detection result corresponding to the audio clip, wherein the head-trimming processing refers to removing the overlapping voice probability values respectively corresponding to at least one frame at the head of the audio clip, and the tail-trimming processing refers to removing the overlapping voice probability values respectively corresponding to at least one frame at the tail of the audio clip;
and splicing the processed overlapping voice detection results respectively corresponding to the n audio clips to obtain the overlapping voice detection result corresponding to the song audio.
13. The method according to claim 11, wherein outputting, by the overlapping voice detection model, overlapping voice detection results respectively corresponding to the n audio clips, includes:
for each audio clip of the n audio clips, acquiring the mel spectrum features of each frame in the audio clip;
and inputting the mel spectrum features of each frame in the audio clip to the overlapping voice detection model, and outputting, by the overlapping voice detection model, the overlapping voice detection result corresponding to the audio clip.
14. The method of claim 10, wherein the determining the overlapping voice segments in the song audio based on the overlapping voice detection results corresponding to the song audio comprises:
labeling frames whose overlapping voice probability value is greater than a first threshold with a first value;
labeling frames whose overlapping voice probability value is less than or equal to the first threshold with a second value, the second value being different from the first value;
and determining consecutive frames that are marked with the first value and whose number is greater than or equal to a second threshold as overlapping voice segments in the song audio.
15. A method according to any one of claims 10 to 14, wherein the overlapping human voice is harmony or chorus.
16. A training device for overlapping human voice detection models, the device comprising:
the sample acquisition module is used for acquiring a training sample set of the overlapped voice detection model, wherein the training sample set comprises at least one training sample, each training sample comprises a section of song audio and an overlapped voice marking result corresponding to the song audio, and the overlapped voice marking result is used for indicating whether each frame in the song audio has overlapped voice or not respectively;
the result output module is used for outputting, through the overlapping voice detection model, an overlapping voice detection result corresponding to the song audio, wherein the overlapping voice detection result is used for indicating overlapping voice probability values respectively corresponding to frames in the song audio, and the overlapping voice probability value corresponding to a frame is the probability that the frame has overlapping voice;
and the parameter adjustment module is used for adjusting the parameters of the overlapping voice detection model according to the difference between the overlapping voice detection result and the overlapping voice marking result to obtain the trained overlapping voice detection model.
17. An overlapping voice detection apparatus, the apparatus comprising:
the audio acquisition module is used for acquiring song audio to be detected;
the result output module is used for outputting, through the overlapping voice detection model, an overlapping voice detection result corresponding to the song audio, wherein the overlapping voice detection result is used for indicating overlapping voice probability values respectively corresponding to frames in the song audio, and the overlapping voice probability value corresponding to a frame is the probability that overlapping voice exists in the frame;
and the overlapping determining module is used for determining overlapping voice fragments in the song audio according to the overlapping voice detection result corresponding to the song audio.
18. A computer device, characterized in that it comprises a processor and a memory, the memory storing a computer program that is loaded and executed by the processor to implement the training method of the overlapping human voice detection model according to any one of claims 1 to 9, or to implement the overlapping human voice detection method according to any one of claims 10 to 15.
19. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program that is loaded and executed by a processor to implement the training method of the overlapping human voice detection model according to any one of claims 1 to 9, or to implement the overlapping human voice detection method according to any one of claims 10 to 15.
CN202311840989.2A 2023-12-28 2023-12-28 Training method of overlapped voice detection model, overlapped voice detection method and device Pending CN117765977A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311840989.2A CN117765977A (en) 2023-12-28 2023-12-28 Training method of overlapped voice detection model, overlapped voice detection method and device

Publications (1)

Publication Number Publication Date
CN117765977A true CN117765977A (en) 2024-03-26

Family

ID=90325531

Country Status (1)

Country Link
CN (1) CN117765977A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination