CN106970950B - Similar audio data searching method and device - Google Patents

Similar audio data searching method and device

Info

Publication number
CN106970950B
CN106970950B (application CN201710129982.8A)
Authority
CN
China
Prior art keywords
audio data
user account
note
value
clause
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710129982.8A
Other languages
Chinese (zh)
Other versions
CN106970950A (en)
Inventor
孔令城
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN201710129982.8A priority Critical patent/CN106970950B/en
Publication of CN106970950A publication Critical patent/CN106970950A/en
Application granted granted Critical
Publication of CN106970950B publication Critical patent/CN106970950B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63Querying
    • G06F16/635Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/26Pre-filtering or post-filtering
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/54Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a method and a device for searching for similar audio data. The method comprises the following steps: acquiring deductive audio data corresponding to a target song under a first user account and at least one second user account, the deductive audio data comprising audio data segments corresponding to a plurality of clauses of the target song; extracting fundamental frequency data of the audio data segments and acquiring note value sequences of the fundamental frequency data; calculating a note difference value between the note value sequence of the first user account under a target clause and the note value sequence of the second user account under the target clause; calculating a likelihood reference value between first characteristic data of the first user account under the target clause and second characteristic data of the second user account under the target clause; calculating a similarity reference value according to the note difference value and the likelihood reference value; and screening out similar audio data segments from the audio data segments of the second user account according to the similarity reference value. By adopting the method and the device, the accuracy of searching for similar audio data can be improved.

Description

Similar audio data searching method and device
Technical Field
The invention relates to the technical field of internet, in particular to a method and a device for searching similar audio data.
Background
With the continuous development and improvement of terminal technology, terminal devices such as mobile phones and tablet computers have become an indispensable part of people's lives. Users can realize various application functions by installing various application programs, such as music software or karaoke software, on the terminals, so as to meet different requirements in daily life.
In existing music software or karaoke software, besides downloading or playing music files, a user can also sing songs and share the sung works. For example, after a user records, through a terminal, a singing work containing only his or her own voice, or a singing work that also includes the background music accompaniment of the corresponding song, the corresponding work can be uploaded so that the user and other users can view it. Because a large number of users upload a large number of their own deductive works of songs, and because of the personal register, timbre and singing skills of each singer, two works may be similar in some sentences or over the whole song.
In the prior art, new vocal deductive works can be obtained by splicing songs or replacing performed clauses between different deductive works, or works of other users that are close to the current user's deductive work can be recommended to the user. However, in existing music software or karaoke software, the deductive works of other users that are most similar to the work sung by the user, or to its sung clauses, in aspects such as timbre and intonation cannot be accurately found among the large number of deductive works. That is, the technical schemes in the prior art for searching for audio data similar to a deductive work or deductive clause are not sufficiently accurate.
Disclosure of Invention
The embodiment of the invention provides a method for searching similar audio data, which can accurately search audio data similar to the current deduction works or deduction clauses from a large number of deduction works, thereby improving the accuracy of searching similar audio data.
A method for searching similar audio data comprises the following steps:
respectively acquiring deduction audio data corresponding to a target song under a first user account and at least one second user account, wherein the deduction audio data comprises audio data fragments corresponding to a plurality of clauses of the target song;
extracting fundamental frequency data of an audio data segment of each sentence of each deductive audio data, acquiring a note value sequence corresponding to the extracted fundamental frequency data, determining a first note value sequence of the first user account under the target sentence and a second note value sequence of the second user account under the target sentence, and calculating note difference values between the first note value sequence and the second note value sequence;
extracting first characteristic data of an audio data segment of the first user account under the target clause and second characteristic data of an audio data segment of the second user account under the target clause, and calculating a likelihood reference value between the first characteristic data and the second characteristic data;
calculating a similarity reference value of the first user account and the second user account under the target clause according to the note difference value and the likelihood reference value;
and screening out similar audio data segments corresponding to the first user account under the target clause from the audio data segments of the at least one second user account according to the similarity reference value.
Optionally, in one embodiment, after the step of screening out, according to the size of the similarity reference value, a similar audio data segment corresponding to the first user account under the target clause from among the audio data segments of the at least one second user account, the method further includes: determining at least one replacement clause corresponding to the target song; and replacing the audio data segment corresponding to the replacement clause in the first deductive audio data corresponding to the target song and the first user account with the similar audio data segment corresponding to the first user account under the replacement clause, so as to generate replacement audio data corresponding to the target song under the first user account.
Optionally, in one embodiment, the calculating the note difference value between the first note value sequence and the second note value sequence further includes: calculating the sum/average of the distance values between corresponding note values in the first note value sequence and the second note value sequence as the note difference value between the first note value sequence and the second note value sequence.
Optionally, in one embodiment, the extracting fundamental frequency data of the audio data segment of each sentence of each deductive audio data, and the obtaining of the sequence of note values corresponding to the extracted fundamental frequency data includes: respectively extracting fundamental frequency data of the audio data segment of each clause in each deduction audio data according to a preset frame length and a preset frame shift so as to generate at least one fundamental frequency point corresponding to each clause in each deduction audio data; and adjusting the base frequency value of each base frequency point in the at least one base frequency point, and converting the adjusted base frequency value of each base frequency point into a note value corresponding to each base frequency point, thereby obtaining a note value sequence corresponding to the base frequency data of the audio data segments of different users in each clause.
Optionally, in one embodiment, the adjusting the fundamental frequency value of each of the at least one base frequency point includes: carrying out zero setting processing on the fundamental frequency value of a singular fundamental frequency point in the at least one fundamental frequency point; and performing median filtering processing on the fundamental frequency points.
Optionally, in one embodiment, the extracting of the first feature data of the audio data segment of the first user account under the target clause specifically includes: and extracting a first MFCC characteristic corresponding to the audio data segment of the first user account under the target clause as first characteristic data through a preset characteristic extraction algorithm.
Optionally, in one embodiment, the extracting second feature data of the audio data segment of the second user account under the target clause specifically includes: and extracting a second MFCC characteristic corresponding to the audio data fragment from the audio data fragment of the second user account under the target clause through the preset characteristic extraction algorithm to serve as third characteristic data, and training the third characteristic data through a preset training model to obtain data serving as second characteristic data.
Optionally, in one embodiment, the method further includes: training the first characteristic data through a preset training model to obtain data serving as the fourth characteristic data; calculating a first likelihood reference value between the third feature data and the fourth feature data; and calculating an average value of the first likelihood reference value and the likelihood reference values between the first feature data and the second feature data, taking the average value as the likelihood reference value, and updating the likelihood reference value.
Optionally, in one embodiment, the calculating the likelihood reference value between the first feature data and the second feature data specifically includes: and calculating a likelihood reference value between the first characteristic data and the second characteristic data according to a preset likelihood function.
Optionally, in one embodiment, the calculating, according to the note difference value and the likelihood reference value, similarity reference values of the first user account and the second user account in the target clause respectively is specifically: and acquiring a preset weighting coefficient, and weighting the note difference value and the likelihood reference value according to the preset weighting coefficient to obtain the similarity reference value, wherein the weighting coefficient of the note difference value is smaller than 0.
In addition, the embodiment of the invention also provides a searching device of similar audio data, which can accurately search the audio data similar to the current deduction works or deduction clauses from a large number of deduction works, thereby improving the searching accuracy of the similar audio data.
A similar audio data searching apparatus, comprising:
the audio data acquisition module is used for respectively acquiring deductive audio data corresponding to the target song under a first user account and at least one second user account, wherein the deductive audio data comprises audio data segments corresponding to a plurality of clauses of the target song;
a note difference value calculating module, configured to extract fundamental frequency data of an audio data segment of each clause of each deductive audio data, obtain a note value sequence corresponding to the extracted fundamental frequency data, determine a first note value sequence of the first user account in the target clause and a second note value sequence of the second user account in the target clause, and calculate a note difference value between the first note value sequence and the second note value sequence;
a likelihood reference value calculation module, configured to extract first feature data of an audio data segment of the first user account in the target clause and second feature data of an audio data segment of the second user account in the target clause, and calculate a likelihood reference value between the first feature data and the second feature data;
a similarity reference value calculating module, configured to calculate, according to the note difference value and the likelihood reference value, a similarity reference value of the first user account and the second user account in the target clause;
and the similar audio data fragment screening module is used for screening out a similar audio data fragment corresponding to the first user account under the target clause from the audio data fragments of the at least one second user account according to the size of the similarity reference value.
Optionally, in one embodiment, the apparatus further includes an audio data segment replacing module, configured to determine at least one replacing clause corresponding to the target song; and replacing the audio data segment corresponding to the replacing clause in the first deduction audio data corresponding to the target song and the first user with a similar audio data segment corresponding to the first user account and corresponding to the replacing clause, and generating replacing audio data corresponding to the target song and corresponding to the first user account.
Optionally, in one embodiment, the note difference calculation module is further configured to calculate the sum/average of the distance values between corresponding note values in the first note value sequence and the second note value sequence as the note difference value between the first note value sequence and the second note value sequence.
Optionally, in one embodiment, the note difference calculation module is further configured to extract, according to a preset frame length and a preset frame shift, fundamental frequency data of an audio data segment of each clause in each deductive audio data, so as to generate at least one fundamental frequency point corresponding to each clause in each deductive audio data; and adjusting the base frequency value of each base frequency point in the at least one base frequency point, and converting the adjusted base frequency value of each base frequency point into a note value corresponding to each base frequency point, thereby obtaining a note value sequence corresponding to the base frequency data of the audio data segments of different users in each clause.
Optionally, in one embodiment, the note difference value calculating module is further configured to perform zeroing processing on fundamental frequency values of singular fundamental frequency points in the at least one fundamental frequency point; and performing median filtering processing on the fundamental frequency points.
Optionally, in one embodiment, the likelihood reference value calculation module is further configured to extract, as the first feature data, a first MFCC feature corresponding to the audio data segment of the first user account under the target clause from the audio data segment by using a preset feature extraction algorithm.
Optionally, in an embodiment, the likelihood reference value calculating module is further configured to extract, by using the preset feature extraction algorithm, a second MFCC feature corresponding to the audio data segment of the second user account under the target clause as third feature data, and train, by using a preset training model, the third feature data to obtain data serving as second feature data.
Optionally, in one embodiment, the likelihood reference value calculating module is further configured to train the first feature data through a preset training model to obtain data serving as the fourth feature data; calculating a first likelihood reference value between the third feature data and the fourth feature data; and calculating an average value of the first likelihood reference value and the likelihood reference values between the first feature data and the second feature data, taking the average value as the likelihood reference value, and updating the likelihood reference value.
Optionally, in one embodiment, the likelihood reference value calculating module is further configured to calculate a likelihood reference value between the first feature data and the second feature data according to a preset likelihood function.
Optionally, in one embodiment, the similarity reference value calculating module is further configured to obtain a preset weighting coefficient, and weight the note difference value and the likelihood reference value according to the preset weighting coefficient to obtain the similarity reference value, where the weighting coefficient of the note difference value is smaller than 0.
The embodiment of the invention has the following beneficial effects:
After the searching method and device for similar audio data are adopted, for a singing work recorded by a target user through a terminal, audio data segments that are similar in intonation and timbre to the audio data segment of each clause in the target user's singing work can be searched for among all the singing works uploaded by other users. That is to say, when searching for similar audio data, it is considered not only whether the intonation corresponding to each piece of audio data is consistent or close, but also whether the corresponding singers are similar in timbre. Compared with prior-art technical schemes that only consider whether the intonation is consistent, the similar audio data found is therefore closer to the audio data performed by the target user, the similarity between the found similar audio data and the current audio data is improved, and the accuracy of searching for similar audio data is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Wherein:
FIG. 1 is a flowchart illustrating a method for searching for similar audio data according to an embodiment;
FIG. 2 is a diagram illustrating an exemplary structure of a device for searching similar audio data;
fig. 3 is a schematic structural diagram of a computer device for executing the similar audio data searching method in one embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the embodiment of the invention, a method for searching similar audio data is provided, which can accurately find, among a large number of deductive works, audio data similar to the current deductive work or deductive clause, thereby improving the accuracy of searching for similar audio data. In particular, the implementation of the method may rely on a computer program executable on a computer system based on the von Neumann architecture, and the computer program may be a music application program including a singing function, such as a music playback application or a karaoke application, for example a national karaoke application. The computer system may be a server, or a terminal device such as a smart phone, tablet computer or personal computer, that runs the computer program.
It should be noted that, in this embodiment, the search method for the similar audio data may be performed based on the server corresponding to the music application.
Embodiments of the present invention are applicable to any song that may be sung or deduced by a user, and are described in detail below with reference to only one song (i.e., a target song).
As shown in fig. 1, the searching method of similar audio data at least includes the following steps S102-S110:
step S102: deduction audio data corresponding to a target song under a first user account and at least one second user account are obtained respectively, and the deduction audio data comprise audio data fragments corresponding to a plurality of clauses of the target song.
Specifically, in the embodiment of the present invention, the deductive audio data may be audio data recorded by a terminal device with a recording function when the user sings the target song. The deductive audio data may contain only the vocal audio data sung by the user, or may be audio data in which the vocal audio data sung by the user is mixed with background music such as the accompaniment of the target song, which is not specifically limited here.
The target song comprises a plurality of clauses, the clauses can be divided through lyric information of the target song, each sentence of lyrics corresponds to one clause, and each clause corresponds to one audio data fragment. After the user sings the target song, the performance audio data sung by the user can be uploaded to a server corresponding to the music software through the music software installed on the terminal. That is, the deductive audio data uploaded by each user is stored in the server, and other users can view the deductive audio data of other users by requesting the server.
In this embodiment, the first user account is the target user account. The example considered in this embodiment is searching for audio data similar to the deductive audio data of the target song under the target user account, and the search targets are the deductive audio data of the target song under the second user accounts. It should be noted that, in this embodiment, there may be more than one second user account; for example, the second user accounts may be all user accounts except the target user account, or some of all the other user accounts, for example all user accounts other than the target user account whose gender identification is "female".
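As an illustrative sketch only, the deductive audio data handled in the following steps can be thought of as per-account, per-clause sample arrays. The Python code below shows one possible layout, assuming that the start and end times of each clause are available from the lyric information of the target song (all names are hypothetical and not part of the patent text):

    import numpy as np

    def split_into_clauses(samples: np.ndarray, sr: int,
                           clause_times: list[tuple[float, float]]) -> list[np.ndarray]:
        # clause_times holds the (start_sec, end_sec) of each lyric line, assumed
        # to be available from the target song's lyric information.
        return [samples[int(s * sr):int(e * sr)] for s, e in clause_times]

    # segments[account][k] is the audio data segment of clause k sung under `account`.
    segments: dict[str, list[np.ndarray]] = {}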
Step S104: extracting fundamental frequency data of an audio data segment of each clause of each deductive audio data, obtaining a note value sequence corresponding to the extracted fundamental frequency data, determining a first note value sequence of the first user account under the target clause and a second note value sequence of the second user account under the target clause, and calculating note difference values between the first note value sequence and the second note value sequence.
Specifically, for the audio data segments corresponding to each clause in the deductive audio data of the target song under the first user account and under the second user account, the fundamental frequency data corresponding to each audio data segment is extracted respectively, and the note value sequence corresponding to the fundamental frequency data is then obtained. The fundamental frequency data may be the fundamental tone of the audio data segment, which is used to determine the pitch of the audio in the audio data segment; the note value refers to a standard MIDI (Musical Instrument Digital Interface) note value.
In a specific embodiment, said extracting fundamental frequency data of the audio data segment of each sentence of each deductive audio data, and said obtaining a sequence of note values corresponding to said extracted fundamental frequency data comprises: respectively extracting fundamental frequency data of the audio data segment of each clause in each deduction audio data according to a preset frame length and a preset frame shift so as to generate at least one fundamental frequency point corresponding to each clause in each deduction audio data; and adjusting the base frequency value of each base frequency point in the at least one base frequency point, and converting the adjusted base frequency value of each base frequency point into a note value corresponding to each base frequency point, thereby obtaining a note value sequence corresponding to the base frequency data of the audio data segments of different users in each clause.
For example, the frame length may be preset to 30 ms and the frame shift to 10 ms, and the fundamental frequency data of the audio data segment of each clause deduced by each user is collected accordingly, so that at least one fundamental frequency point exists for each clause deduced by each user. Denoising, smoothing and other processing are performed on the at least one fundamental frequency point, and the adjusted fundamental frequency value of each fundamental frequency point is then converted into the note value corresponding to that fundamental frequency point. Each clause deduced by each user thus corresponds to at least one fundamental frequency point, the at least one fundamental frequency point corresponds to at least one note value, and the at least one note value forms a corresponding note value sequence, so that the note value sequence corresponding to the fundamental frequency data of the audio data segment of each clause of each user is obtained.
In a possible implementation scenario, a preset note conversion formula may be adopted, and the note value of each fundamental frequency point is calculated according to the adjusted fundamental frequency value of each fundamental frequency point. The preset note conversion formula may be:
    mi = 69 + 12 · log2(xi / 440), i = 1, 2, …, p

wherein mi denotes the note value of the i-th fundamental frequency point, xi denotes the adjusted fundamental frequency value of the i-th fundamental frequency point, and p denotes the length of the note value sequence.
In addition, when M denotes the note value sequence and X denotes the sequence of fundamental frequency values composed of the fundamental frequency values of the fundamental frequency points in the fundamental frequency data, the preset note conversion formula can be written as:

    M = 69 + 12 · log2(X / 440)

where the conversion is applied element-wise to X.
it should be noted that, when the fundamental frequency value of each fundamental frequency point in at least one fundamental frequency point is adjusted, the fundamental frequency value of a singular fundamental frequency point in at least one fundamental frequency point may be zeroed. For example, if the front fundamental frequency value and the rear fundamental frequency value of a non-0 fundamental frequency point are both 0, the fundamental frequency point is marked as 0; the median filtering processing can also be performed on several continuous fundamental frequency points, and through the median filtering processing (for example, 5-point median filtering), the curve of the fundamental frequency band can be smoothed, and the occurrence of noise points is avoided.
Optionally, in an embodiment, before extracting the fundamental frequency data of the audio data segments of each sentence performed by each user, each audio data segment may further be normalized to a preset format, for example to a 16 kHz, 16-bit PCM format.
After the note value sequences corresponding to the first user account and the second user account under each clause of the target song are determined, note difference values of the first user account and the second user account under the target clause can be calculated, namely the difference between the note values of the audio data segment of the target clause deduced by the first user account and the audio data segment of the target clause deduced by the second user account is calculated.
In a specific embodiment, taking the target clause as the kth clause of the target song as an example, the users under the respective user accounts sing the same sentence, so the lengths of the corresponding audio data segments are the same, that is, the number of fundamental frequency values contained in the corresponding fundamental frequency data is the same. Therefore, the lengths of the note value sequences of the different users under the kth clause are consistent, that is, they contain the same number of note value components. Let the length of the kth clause be p, which means that the kth clause has p frames, or that the note value sequence corresponding to the kth clause contains p note values.
Let the note value sequence of the kth clause deduced under the first user account be M1k = (m1k1, m1k2, …, m1kp), and the note value sequence of the kth clause deduced under the second user account be M2k = (m2k1, m2k2, …, m2kp). The note difference value between M1k and M2k is then obtained by a preset note difference value calculation formula.
For example, in one embodiment, the note difference value may be obtained by summing the absolute values of the differences between corresponding note value components of M1k and M2k, i.e. the preset note difference value calculation formula is:

    S1k2 = Σ(i=1..p) |m1ki − m2ki|

wherein S1k2 denotes the note difference value between the first note value sequence and the second note value sequence of the kth clause.
For another example, in another alternative embodiment, the note difference value may be obtained by averaging the absolute values of the differences between corresponding note value components of M1k and M2k, i.e. the preset note difference value calculation formula is:

    S1k2 = (1/p) · Σ(i=1..p) |m1ki − m2ki|
It should be noted that the note difference value indicates the difference between the pitches of two users when performing the same sentence: the smaller the note difference value, the smaller the pitch difference between the two users' deductions; conversely, the larger the note difference value, the larger the pitch difference between the two users' deductions.
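A minimal sketch of the note difference value computation for a single clause (two note value sequences of equal length p), following the sum/average formulas above:

    import numpy as np

    def note_difference(m1k: np.ndarray, m2k: np.ndarray, average: bool = True) -> float:
        # Sum of absolute differences between corresponding note value components;
        # optionally divided by the sequence length p to obtain the average form.
        diff = float(np.abs(m1k - m2k).sum())
        return diff / len(m1k) if average else diff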
Step S106: extracting first characteristic data of the audio data segment of the first user account under the target clause and second characteristic data of the audio data segment of the second user account under the target clause, and calculating a likelihood reference value between the first characteristic data and the second characteristic data.
In this embodiment, for the audio data segment corresponding to the target clause deduced under the first user account and the audio data segment corresponding to the target clause deduced under the second user account, corresponding feature data are extracted according to a preset feature extraction algorithm, respectively. It should be noted that, in this embodiment, the preset feature extraction algorithm may be a known timbre feature extraction algorithm, for example an MFCC (Mel-Frequency Cepstral Coefficients) feature extraction algorithm that extracts the static MFCC features corresponding to the audio data.
Specifically, in a specific embodiment, the extracting of the first feature data of the audio data segment of the first user account under the target clause specifically includes: and extracting a first MFCC characteristic corresponding to the audio data segment of the first user account under the target clause as first characteristic data through a preset characteristic extraction algorithm.
First, the audio data segment A1k of the kth clause deduced under the first user account is framed according to the preset frame length and the preset frame shift to obtain the framed audio frame data

    Y1k = {y1k1, y1k2, …, y1kp},

wherein p is the length of the audio frame data Y1k (i.e., the number of frames).

Then, for each audio frame data component y1ki, a discrete Fourier transform, modulus squaring, filtering with a triangular band-pass filter bank, taking the logarithm, and a discrete cosine transform are performed in turn to obtain the MFCC feature λ1ki corresponding to y1ki. That is, the MFCC feature of the audio data segment A1k may be expressed as

    λ1k = {λ1k1, λ1k2, …, λ1kp}.
Similarly, the audio data segment A2k of the kth clause deduced by the user corresponding to the second user account is framed according to the preset frame length and the preset frame shift to obtain the framed audio frame data

    Y2k = {y2k1, y2k2, …, y2kp},

wherein p is the length of the audio frame data Y2k (i.e., the number of frames).

Then, for each audio frame data component y2ki, a discrete Fourier transform, modulus squaring, filtering with a triangular band-pass filter bank, taking the logarithm, and a discrete cosine transform are performed in turn to obtain the MFCC feature λ2ki corresponding to y2ki. That is, the MFCC feature of the audio data segment A2k may be expressed as

    λ2k = {λ2k1, λ2k2, …, λ2kp}.
It should be noted that, in the present embodiment, the MFCC feature λ1ki or λ2ki may be a multidimensional array or vector, for example a 13-dimensional vector, or a 39-dimensional feature vector obtained by calculating the first-order and second-order differences of the 13-dimensional MFCC feature vector.
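The MFCC extraction described above (framing, DFT, modulus squaring, triangular band-pass filtering, logarithm, DCT) can be sketched with librosa, which performs those steps internally; the 30 ms frame length and 10 ms frame shift at an assumed 16 kHz sampling rate correspond to n_fft=480 and hop_length=160. This is a sketch under those assumptions, not the patent's own implementation:

    import numpy as np
    import librosa

    def clause_mfcc(segment: np.ndarray, sr: int = 16000, with_deltas: bool = False) -> np.ndarray:
        # 13 static MFCCs per frame; shape (13, p).
        mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=13,
                                    n_fft=480, hop_length=160)
        if with_deltas:
            # 39-dimensional features: static MFCCs plus first and second differences.
            mfcc = np.vstack([mfcc,
                              librosa.feature.delta(mfcc, order=1),
                              librosa.feature.delta(mfcc, order=2)])
        return mfcc.T  # one feature vector per frame, shape (p, dim)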
In this embodiment, when calculating the likelihood reference value Q1k2 between the audio data segment A1k of the kth clause deduced under the first user account and the audio data segment A2k of the target clause deduced under the second user account, the likelihood value between the MFCC features λ1k and λ2k corresponding to A1k and A2k respectively may be calculated.
Further, in an optional implementation, the calculating a likelihood reference value between the first feature data and the second feature data specifically includes: and calculating a likelihood reference value between the first characteristic data and the second characteristic data according to a preset likelihood function. It should be noted that, in the present embodiment, the preset likelihood function may be any known likelihood function, and is not specifically limited herein.
When extracting the feature data of the audio data segment, besides extracting the MFCC features of the audio data segment, other feature extraction methods may also be adopted, for example further processing the extracted MFCC features.
Specifically, in a specific embodiment, the extracting of the second feature data of the audio data segment of the second user account under the target clause specifically includes: and extracting a second MFCC characteristic corresponding to the audio data fragment from the audio data fragment of the second user account under the target clause through the preset characteristic extraction algorithm to serve as third characteristic data, and training the third characteristic data through a preset training model to obtain data serving as new characteristic data corresponding to the audio data fragment.
For example, Gaussian model training is performed on the MFCC feature data λ2k = {λ2k1, λ2k2, …, λ2kp} extracted from the audio data segment A2k of the kth clause deduced under the second user account: a 256-mixture Gaussian mixture model is trained using the EM (Expectation Maximization) algorithm, and the resulting feature data η2k = {η2k1, η2k2, …, η2kp} is taken as the third feature data, where η2ki corresponds to λ2ki.
When calculating the likelihood reference value Q1k2 between the audio data segment A1k of the kth clause deduced under the first user account and the audio data segment A2k of the target clause deduced under the second user account, the likelihood value between the MFCC feature λ1k corresponding to A1k and the third feature data η2k corresponding to A2k may then be calculated.
Further, in order to improve the accuracy of the calculated likelihood reference value Q1k2, in another embodiment, when extracting the feature data, further features also need to be derived from the MFCC features of the audio data segment of the target clause deduced under the first user account. That is, the data obtained by training the first feature data through a preset training model is used as the fourth feature data.
That is, Gaussian model training is performed on the MFCC feature data λ1k = {λ1k1, λ1k2, …, λ1kp} extracted from the audio data segment A1k of the kth clause deduced under the first user account: a 256-mixture Gaussian mixture model is trained using the EM algorithm, and the resulting feature data η1k = {η1k1, η1k2, …, η1kp} is taken as the fourth feature data, where η1ki corresponds to λ1ki.
In one embodiment, the likelihood reference value may be calculated as follows: the likelihood value between λ1k and λ2k is calculated and denoted qλ, and the likelihood value between η1k and η2k is calculated and denoted qη. The likelihood reference value Q1k2 between the audio data segment A1k of the kth clause deduced under the first user account and the audio data segment A2k of the target clause deduced under the second user account is then calculated from qλ and qη; for example, Q1k2 may be the average of qλ and qη, i.e.:

    Q1k2 = (qλ + qη) / 2
In another embodiment, the likelihood value between λ1k and η2k is first calculated and denoted q1k2, and the likelihood value between λ2k and η1k is then calculated and denoted q2k1. Finally, the likelihood reference value Q1k2 between the audio data segment A1k of the kth clause deduced under the first user account and the audio data segment A2k of the target clause deduced under the second user account is calculated from q1k2 and q2k1; for example, Q1k2 may be the average of q1k2 and q2k1, i.e.:

    Q1k2 = (q1k2 + q2k1) / 2
it should be noted that, in this embodiment, the larger the likelihood reference value is, the closer the tone color between the audio data segment of the first user account deduction target clause and the audio data segment of the second user account deduction target clause is, whereas the smaller the likelihood reference value is, the larger the tone color difference between the audio data segment of the first user account deduction target clause and the audio data segment of the second user account deduction target clause is.
Step S108: and calculating a similarity reference value of the first user account and the second user account under the target clause according to the note difference value and the likelihood reference value.
After the note difference value and the likelihood reference value between the audio data segment of the target clause deduced under the first user account and the audio data segment of the target clause deduced under the second user account are obtained through the above steps, the similarity reference value between the two can be calculated from the note difference value and the likelihood reference value.
For example, in one embodiment, the similarity reference value corresponding to the note difference value and the likelihood reference value is calculated according to a preset similarity reference value calculation formula, with the note difference value and the likelihood reference value as arguments.

For another example, in one embodiment, the similarity reference value may be a weighted average of the note difference value and the likelihood reference value. It should be noted that a smaller note difference value indicates a smaller intonation difference between the audio data segment of the target clause deduced under the first user account and that deduced under the second user account, while a larger likelihood reference value indicates a higher timbre similarity between the two segments. Therefore, when calculating the weighted average of the note difference value and the likelihood reference value, the weighting coefficient of the note difference value is a negative number and the weighting coefficient of the likelihood reference value is a positive number. In this way, the larger the finally obtained similarity reference value is, the smaller the difference between the audio data segment of the target clause deduced under the first user account and that deduced under the second user account; conversely, the smaller the similarity reference value is, the larger the difference between the two audio data segments.
Denote by T1k2 the similarity reference value between the audio data segment of the kth clause deduced under the first user account and the audio data segment of the kth clause deduced under the second user account; T1k2 can then be calculated by the following similarity reference value calculation formula:
    T1k2 = α·Q1k2 − β·S1k2,  α > 0, β > 0

wherein α and β are preset coefficients.
Step S110: and screening out similar audio data segments corresponding to the first user account under the target clause from the audio data segments of the at least one second user account according to the similarity reference value.
The similarity reference value represents the degree of similarity between the audio data segment of the target clause deduced under the corresponding second user account and the audio data segment of the target clause deduced by the user corresponding to the first user account. Therefore, according to the similarity reference values, one or more audio data segments closest to the audio data segment of the target clause deduced under the first user account can be selected, from the audio data segments corresponding to the target clause under all the second user accounts, as similar audio data segments.
It should be noted that, in this embodiment, the number of similar audio data segments in the target clause corresponding to the first user account may be one or multiple. For example, the number of similar audio data segments may be any preset number.
In a specific embodiment, all the audio data segments of the second user accounts under the target clause may be sorted in descending order of similarity reference value, and the top N audio data segments are then taken as similar audio data segments, where N is a preset number constant.
Further, the similar audio data segments may be not only the top-N audio data segments ranked by similarity reference value, but also all audio data segments whose similarity reference value satisfies a preset condition; for example, all audio data segments whose similarity reference value is greater than 80% may be regarded as similar audio data segments.
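Steps S108 and S110 can be sketched together as follows: for each second user account, the similarity reference value T1k2 = α·Q1k2 − β·S1k2 is computed from the likelihood reference value and the note difference value, and the top-N accounts are kept (α, β and N are assumed preset values in this sketch):

    def screen_similar_segments(scores: dict[str, tuple[float, float]],
                                alpha: float = 1.0, beta: float = 1.0,
                                top_n: int = 3) -> list[str]:
        # scores maps each second user account to (Q1k2, S1k2) for the target clause.
        t = {acc: alpha * q - beta * s for acc, (q, s) in scores.items()}
        ranked = sorted(t, key=t.get, reverse=True)  # largest similarity first
        return ranked[:top_n]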
It should be noted that the above method for searching similar audio data gives an operation procedure for finding similar audio data segments that resemble the audio data segment of the target clause deduced by the target user. Because a found similar audio data segment is very similar to the target user's own audio data segment of the target clause, some users may be unable, or find it difficult, to distinguish the two by listening. Therefore, in this embodiment, the audio data segment corresponding to the target clause in the target song can be replaced with a similar audio data segment to obtain new audio data.
Specifically, in an optional embodiment, after the step of screening out, from the audio data segments of the at least one second user account, the similar audio data segment corresponding to the first user account under the target clause according to the size of the similarity reference value, the method further includes: determining at least one replacement clause corresponding to the target song; and replacing the audio data segment corresponding to the replacement clause in the first deductive audio data corresponding to the target song and the first user account with the similar audio data segment corresponding to the first user account under the replacement clause, so as to generate replacement audio data corresponding to the target song under the first user account.
When generating new replacement audio data, it is first determined which audio data segments corresponding to clauses in the original deduction of the target song are to be replaced by similar audio data segments corresponding to other user accounts, that is, which clauses are the replacement clauses.
In an alternative embodiment, the replacement clauses may be determined through manual selection by the user; for example, the user may choose to replace a clause if he or she feels that the sentence was not sung well enough. In another alternative embodiment, the replacement clauses may be randomly selected by the server, and the number of replacement clauses is greater than or equal to 1 and less than half of the total number of clauses of the original target song.
After the replacement clauses are determined, similar audio data segments that are similar to the audio data segment of each replacement clause deduced under the first user account may be determined through steps S102-S110 described above. If there is more than one similar audio data segment corresponding to a certain replacement clause, the similar audio data segment with the largest similarity reference value may be used as the target similar audio data segment for that replacement clause, or one of the plurality of similar audio data segments may be selected at random as the target similar audio data segment.
Finally, the audio data segments corresponding to all replacement clauses in the deductive audio data of the target song under the first user account are replaced with the corresponding target similar audio data segments, and new replacement audio data corresponding to the target song is generated.
For example, for a target user, the deductive audio data of a target song is A, and the kth clause is a replacement clause; the audio data segment Ak corresponding to the kth clause now needs to be replaced with the corresponding similar audio data segment. If the similar audio data segment corresponding to Ak that is finally determined by the method described above is denoted Ak^(n), where n is a user identifier, then directly replacing Ak in A with Ak^(n) yields the replacement audio data corresponding to A.
It should be noted that in another alternative embodiment, the audio data segment is also required to be energy-normalized before the replacement of the audio data segment.
Specifically, if the number of sampling points of the kth clause is L, then Ak = {ak1, ak2, …, akL} and the similar audio data segment is Ak^(n) = {ak1^(n), ak2^(n), …, akL^(n)}. The energy values |Ak| and |Ak^(n)| of the audio data segment Ak and of the similar audio data segment Ak^(n) are calculated separately, and energy normalization is then performed on the similar audio data segment Ak^(n) to obtain

    A'k = {a'k1, a'k2, …, a'kL},

wherein a'ki is the sampling point obtained after energy normalization of aki^(n), i.e. after scaling aki^(n) so that the energy of A'k matches the energy value |Ak| of Ak. In this way, the audio data segment A'k obtained by normalizing the similar audio data segment Ak^(n) is obtained. Then, Ak in the original deductive audio data A is replaced by A'k, and the replacement audio data A' corresponding to A is finally obtained.
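The energy normalization and clause replacement can be sketched as follows; the RMS-matching scale factor used here is an assumption, since the patent only states that the similar segment is energy-normalized before it replaces Ak:

    import numpy as np

    def replace_clause(original: np.ndarray, similar: np.ndarray,
                       start: int, end: int) -> np.ndarray:
        # original: full deductive audio data A; similar: the similar segment Ak^(n);
        # [start, end) are the sampling-point indices of the kth clause in A.
        a_k = original[start:end]
        seg = similar[:end - start]  # same number of sampling points as Ak
        # Scale the similar segment so its energy matches that of Ak.
        scale = np.sqrt(np.sum(a_k ** 2) / (np.sum(seg ** 2) + 1e-12))
        out = original.copy()
        out[start:end] = seg * scale
        return out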
In this embodiment, the new replacement audio data obtained by replacing some of the audio data segments in a piece of deductive audio data may be used for an "auditory identification" function in music software: the replacement audio data is played to the user corresponding to the original deductive audio data for trial listening, and that user determines which clauses have been replaced with audio data of other users.
Specifically, when the replacement audio data is played, the audio data is played in a clause form, and when the audio data segment corresponding to each clause is played, a user can input a judgment operation through a terminal, that is, whether the currently played audio data segment is deduced by other users is judged.
After the replacement audio data has been played, whether each judgment operation is correct is determined according to all the judgment operations input by the user, and a corresponding evaluation value is given according to a preset evaluation formula; for example, the title of "hearing person" is given to the user account when the evaluation value reaches 100 points.
In another embodiment, the user account corresponding to a similar audio data segment used for replacement in the process of generating the replacement audio data may also be recommended to the current user account; for example, the current user account may choose to perform a chorus deduction of new audio data with the recommended user account.
In addition, in the embodiment of the invention, a similar audio data searching device is also provided, which can accurately search audio data similar to the current deduction work or deduction clause from a large number of deduction works, thereby improving the accuracy of searching similar audio data. Specifically, as shown in fig. 2, the apparatus for searching similar audio data includes an audio data obtaining module 102, a note difference calculating module 104, a likelihood reference value calculating module 106, a similarity reference value calculating module 108, and a similar audio data segment screening module 110, wherein:
the audio data acquisition module 102 is configured to acquire deductive audio data corresponding to the target song in the first user account and the at least one second user account, where the deductive audio data includes audio data segments corresponding to multiple clauses of the target song;
a note difference calculation module 104, configured to extract fundamental frequency data of an audio data segment of each clause of each deductive audio data, obtain a note value sequence corresponding to the extracted fundamental frequency data, determine a first note value sequence of a first user account in a target clause and a second note value sequence of a second user account in the target clause, and calculate a note difference between the first note value sequence and the second note value sequence;
a likelihood reference value calculation module 106, configured to extract first feature data of an audio data segment of a first user account in a target clause and second feature data of an audio data segment of a second user account in the target clause, and calculate a likelihood reference value between the first feature data and the second feature data;
a similarity reference value calculating module 108, configured to calculate, according to the note difference value and the likelihood reference value, a similarity reference value of the first user account and the second user account in the target clause;
and the similar audio data segment screening module 110 is configured to screen out, from the audio data segments of the at least one second user account, a similar audio data segment corresponding to the first user account in the target clause according to the size of the similarity reference value.
Optionally, in an embodiment, as shown in fig. 2, the apparatus further includes an audio data segment replacing module 112, configured to determine at least one replacement clause corresponding to the target song, and to replace the audio data segment corresponding to the replacement clause in the first deductive audio data (corresponding to the target song under the first user account) with the similar audio data segment corresponding to the first user account under the replacement clause, thereby generating replacement audio data corresponding to the target song under the first user account.
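A minimal sketch of the replacement itself, assuming the first user's deductive audio data is a one-dimensional sample array and the clause boundaries are known as sample index ranges (both assumptions of the example):

```python
# Rough sketch of generating replacement audio data: the samples of the
# replacement clause in the first user's performance are swapped for the
# similar segment found above.
import numpy as np

def replace_clause(full_audio, clause_bounds, clause_idx, similar_segment):
    """full_audio: 1-D sample array of the first user's deductive audio data.
    clause_bounds: list of (start, end) sample indices per clause."""
    start, end = clause_bounds[clause_idx]
    return np.concatenate([full_audio[:start], similar_segment, full_audio[end:]])
```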
Optionally, in one embodiment, the note difference value calculating module 104 is further configured to calculate the sum or average of the distance values between each note value contained in the first note value sequence and the corresponding note value in the second note value sequence, as the note difference value between the first note value sequence and the second note value sequence.
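A minimal sketch of this note difference value, assuming the two note value sequences are aligned position by position and using absolute distance (the embodiment only speaks of "distance values", so the metric here is an assumption):

```python
# Note difference value: per-position distance between two note value
# sequences, reported as a sum or an average.
import numpy as np

def note_difference(first_notes, second_notes, use_average=True):
    first = np.asarray(first_notes, dtype=float)
    second = np.asarray(second_notes, dtype=float)
    n = min(len(first), len(second))          # align to the shorter sequence
    distances = np.abs(first[:n] - second[:n])
    return distances.mean() if use_average else distances.sum()

print(note_difference([60, 62, 64, 65], [60, 61, 64, 67]))  # -> 0.75
```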
Optionally, in an embodiment, the note difference calculation module 104 is further configured to extract the fundamental frequency data of the audio data segment of each clause in each deductive audio data according to a preset frame length and a preset frame shift, so as to generate at least one fundamental frequency point corresponding to each clause in each deductive audio data; and to adjust the fundamental frequency value of each of the at least one fundamental frequency point and convert the adjusted fundamental frequency value of each fundamental frequency point into a corresponding note value, thereby obtaining the note value sequence corresponding to the fundamental frequency data of the audio data segments of different users in each clause.
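The sketch below illustrates one way to realize this step: the clause is framed with a preset frame length and frame shift, a fundamental frequency point is estimated per frame, and each fundamental frequency value is mapped to a note value with the MIDI-style formula 69 + 12·log2(f0/440). The use of librosa's pyin estimator and the specific frame parameters are assumptions; any frame-wise pitch tracker fits the description.

```python
# Frame-wise fundamental frequency extraction followed by conversion of each
# fundamental frequency value to a note value.
import numpy as np
import librosa

def clause_note_sequence(clause_audio, sr, frame_length=1024, hop_length=256):
    f0, voiced, _ = librosa.pyin(
        clause_audio, fmin=80.0, fmax=1000.0, sr=sr,
        frame_length=frame_length, hop_length=hop_length)
    f0 = np.nan_to_num(f0)                    # unvoiced frames -> 0 Hz
    notes = np.zeros_like(f0)
    positive = f0 > 0
    notes[positive] = 69.0 + 12.0 * np.log2(f0[positive] / 440.0)
    return notes                              # one note value per frame
```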
Optionally, in an embodiment, the note difference value calculating module 104 is further configured to zero the fundamental frequency values of anomalous (singular) fundamental frequency points among the at least one fundamental frequency point, and to perform median filtering on the fundamental frequency points.
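A sketch of these two adjustments under stated assumptions: a fundamental frequency point is treated as "singular" when it jumps far from the local median (the threshold and kernel size below are illustrative, not taken from the embodiment), its value is zeroed, and the sequence is then median filtered.

```python
# Zero out anomalous ("singular") fundamental frequency points, then
# median-filter the sequence; outlier rule and kernel size are assumptions.
import numpy as np
from scipy.signal import medfilt

def adjust_f0(f0, kernel_size=5, outlier_semitone_jump=7.0):
    f0 = np.asarray(f0, dtype=float).copy()
    local_median = medfilt(f0, kernel_size=kernel_size)
    voiced = (f0 > 0) & (local_median > 0)
    jump = np.zeros_like(f0)
    jump[voiced] = np.abs(12.0 * np.log2(f0[voiced] / local_median[voiced]))
    f0[jump > outlier_semitone_jump] = 0.0       # zeroing of singular points
    return medfilt(f0, kernel_size=kernel_size)  # median filtering
```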
Optionally, in an embodiment, the likelihood reference value calculating module 106 is further configured to extract, as the first feature data, a first MFCC feature corresponding to the audio data segment of the first user account under the target clause by using a preset feature extraction algorithm.
Optionally, in an embodiment, the likelihood reference value calculating module 106 is further configured to extract, by using a preset feature extraction algorithm, a second MFCC feature corresponding to the audio data segment of the second user account under the target clause as third feature data, and to train the third feature data through a preset training model, using the resulting data as the second feature data.
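For illustration, the sketch below extracts MFCC frames as the feature data and trains a Gaussian mixture model as a stand-in for the "preset training model"; a GMM over MFCC frames is a common choice for modelling vocal timbre, but the embodiment does not fix the model type, so this choice is an assumption.

```python
# MFCC frames as feature data plus a GMM standing in for the preset
# training model used on the third feature data.
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(clause_audio, sr, n_mfcc=13):
    # frames as rows, MFCC coefficients as columns
    return librosa.feature.mfcc(y=clause_audio, sr=sr, n_mfcc=n_mfcc).T

def train_timbre_model(mfcc_frames, n_components=8):
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(mfcc_frames)
    return gmm
```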
Optionally, in an embodiment, the likelihood reference value calculating module 106 is further configured to train the first feature data through a preset training model and use the resulting data as fourth feature data; calculate a first likelihood reference value between the third feature data and the fourth feature data; and calculate an average value of the first likelihood reference value and the likelihood reference value between the first feature data and the second feature data, taking the average value as the updated likelihood reference value.
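In other words, the likelihood reference value is made symmetric by averaging the two cross-likelihoods. A sketch, reusing the GMM stand-in above (GaussianMixture.score, the mean log-likelihood per frame, plays the role of the preset likelihood function; this choice is an assumption):

```python
# Symmetric likelihood reference value: average the likelihood of each
# account's MFCC frames under the other account's model.
def symmetric_likelihood_reference(first_mfcc, second_mfcc,
                                   first_model, second_model):
    first_under_second = second_model.score(first_mfcc)   # likelihood reference value
    second_under_first = first_model.score(second_mfcc)   # first likelihood reference value
    return 0.5 * (first_under_second + second_under_first)
```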
Optionally, in an embodiment, the likelihood reference value calculating module 106 is further configured to calculate a likelihood reference value between the first feature data and the second feature data according to a preset likelihood function.
Optionally, in an embodiment, the similarity reference value calculating module 108 is further configured to obtain a preset weighting coefficient, and weight the note difference value and the likelihood reference value according to the preset weighting coefficient to obtain the similarity reference value, where the weighting coefficient of the note difference value is less than 0 (negative), since a larger note difference indicates lower tonal similarity.
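A sketch of this weighted combination; the concrete coefficient values below are placeholders, and only the sign constraint on the note difference weight comes from the embodiment:

```python
# Combine the two measures into the similarity reference value.
def similarity_reference(note_difference_value, likelihood_reference_value,
                         note_weight=-0.5, likelihood_weight=1.0):
    assert note_weight < 0, "the note difference weighting coefficient is negative"
    return (note_weight * note_difference_value
            + likelihood_weight * likelihood_reference_value)
```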
The embodiment of the invention has the following beneficial effects:
After the above searching method and device for similar audio data are adopted, for a singing work recorded by a target user through a terminal, audio data segments that are similar in both intonation and timbre to the audio data segment of each clause of the target user's work can be found among the singing works uploaded by other users. That is, when searching for similar audio data, the method considers not only whether the intonation of the audio data is consistent or close, but also whether the corresponding singers have similar timbre. Compared with prior-art schemes that only consider whether the intonation is consistent, the similar audio data found in this way is closer to the audio data performed by the target user, which improves the similarity between the found audio data and the current audio data and thus the accuracy of the similar audio data search.
In one embodiment, as shown in fig. 3, the above-described similar audio data searching method runs on a terminal implemented as a von Neumann computer system. The computer system may be a terminal device such as a smart phone, a tablet computer, a palmtop computer, a notebook computer, or a personal computer. Specifically, it may include an external input interface 1001, a processor 1002, a memory 1003, and an output interface 1004 connected through a system bus. The external input interface 1001 may optionally include at least a network interface 10012. The memory 1003 may include an external memory 10032 (e.g., a hard disk, an optical disk, or a floppy disk) and an internal memory 10034. The output interface 1004 may include at least a display 10042.
In this embodiment, the method is executed as a computer program. The program files are stored in the external memory 10032 of the von Neumann computer system, are loaded into the internal memory 10034 at run time, are compiled into machine code, and are then passed to the processor 1002 for execution, so that the audio data acquisition module 102, the note difference value calculation module 104, the likelihood reference value calculation module 106, the similarity reference value calculation module 108, the similar audio data segment screening module 110, and the audio data segment replacement module 112 are formed logically within the computer system. During execution of the similar audio data searching method, the input parameters are received through the external input interface 1001, transferred to the memory 1003 for buffering, and then passed to the processor 1002 for processing; the resulting data is either buffered in the memory 1003 for subsequent processing or passed to the output interface 1004 for output.
Specifically, the processor 1002 is configured to perform the following operations:
respectively acquiring deduction audio data corresponding to the target song under the first user account and the at least one second user account, wherein the deduction audio data comprises audio data fragments corresponding to a plurality of clauses of the target song;
extracting fundamental frequency data of an audio data segment of each clause of each deductive audio data, acquiring a note value sequence corresponding to the extracted fundamental frequency data, determining a first note value sequence of a first user account under the target clause and a second note value sequence of a second user account under the target clause, and calculating a note difference value between the first note value sequence and the second note value sequence;
extracting first characteristic data of an audio data fragment of a first user account under a target clause and second characteristic data of an audio data fragment of a second user account under the target clause, and calculating a likelihood reference value between the first characteristic data and the second characteristic data;
calculating a similarity reference value of the first user account and the second user account under the target clause according to the note difference value and the likelihood reference value;
and screening out similar audio data segments corresponding to the first user account under the target clause from the audio data segments of the at least one second user account according to the size of the similarity reference value.
Optionally, in an embodiment, the processor 1002 is further configured to determine at least one alternative clause corresponding to the target song; and replacing the audio data segment corresponding to the replacing clause in the first deduction audio data corresponding to the target song and the first user with the similar audio data segment corresponding to the first user account and corresponding to the replacing clause, and generating replacing audio data corresponding to the target song and corresponding to the first user account.
Optionally, in one embodiment, the processor 1002 is further configured to calculate the sum or average of the distance values between each note value contained in the first note value sequence and the corresponding note value in the second note value sequence, as the note difference value between the first note value sequence and the second note value sequence.
Optionally, in an embodiment, the processor 1002 is further configured to extract, according to a preset frame length and a preset frame shift, the fundamental frequency data of the audio data segment of each clause in each deductive audio data, so as to generate at least one fundamental frequency point corresponding to each clause in each deductive audio data; and to adjust the fundamental frequency value of each of the at least one fundamental frequency point and convert the adjusted fundamental frequency value of each fundamental frequency point into a corresponding note value, thereby obtaining the note value sequence corresponding to the fundamental frequency data of the audio data segments of different users in each clause.
Optionally, in an embodiment, the processor 1002 is further configured to zero the fundamental frequency values of anomalous (singular) fundamental frequency points among the at least one fundamental frequency point, and to perform median filtering on the fundamental frequency points.
Optionally, in an embodiment, the processor 1002 is further configured to extract, by using a preset feature extraction algorithm, a first MFCC feature corresponding to the audio data segment of the first user account under the target clause, as the first feature data.
Optionally, in an embodiment, the processor 1002 is further configured to extract, by using the preset feature extraction algorithm, a second MFCC feature corresponding to the audio data segment of the second user account under the target clause as third feature data, and to train the third feature data through a preset training model, using the resulting data as the second feature data.
Optionally, in an embodiment, the processor 1002 is further configured to train the first feature data through a preset training model and use the resulting data as fourth feature data; calculate a first likelihood reference value between the third feature data and the fourth feature data; and calculate an average value of the first likelihood reference value and the likelihood reference value between the first feature data and the second feature data, taking the average value as the updated likelihood reference value.
Optionally, in an embodiment, the processor 1002 is further configured to calculate a likelihood reference value between the first feature data and the second feature data according to a preset likelihood function.
Optionally, in an embodiment, the processor 1002 is further configured to obtain a preset weighting coefficient, and weight the note difference value and the likelihood reference value according to the preset weighting coefficient to obtain the similarity reference value, where the weighting coefficient of the note difference value is less than 0.
The above disclosure describes only preferred embodiments of the present invention and is not intended to limit the scope of the invention, which is defined by the appended claims.

Claims (20)

1. A method for searching similar audio data, comprising:
respectively acquiring deductive audio data corresponding to a target song under a first user account and at least one second user account, wherein the deductive audio data comprises audio data fragments corresponding to a plurality of clauses of the target song; the deductive audio data is voice type audio data;
extracting fundamental frequency data of an audio data segment of each clause of each deductive audio data, acquiring a note value sequence corresponding to the extracted fundamental frequency data, determining a first note value sequence of the first user account under a target clause and a second note value sequence of the second user account under the target clause, and calculating note difference values between the first note value sequence and the second note value sequence; the note difference values characterize tonal similarity between audio data segments;
extracting first characteristic data of an audio data segment of the first user account under the target clause and second characteristic data of an audio data segment of the second user account under the target clause, and calculating a likelihood reference value between the first characteristic data and the second characteristic data; the likelihood reference values represent human voice tone similarity among the audio data fragments;
calculating a similarity reference value of the first user account and the second user account under the target clause according to the note difference value and the likelihood reference value;
screening out similar audio data segments corresponding to the first user account under the target clause from the audio data segments of the at least one second user account according to the similarity reference value;
performing an energy normalization operation on the similar audio data segments according to the audio data segments of the first user account under the target clause to obtain the similar audio data segments after the energy normalization operation; the similar audio data segments subjected to the energy normalization operation have energy similar to that of the audio data segments of the first user account under the target clause;
and replacing and optimizing the audio data segment corresponding to the target clause in the deductive audio data under the first user account into the similar audio data segment subjected to the energy normalization operation to obtain replaced audio data of the deductive audio data under the first user account.
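As an aside on the energy normalization operation recited here, one straightforward realization is to scale the similar audio data segment so that its RMS energy matches that of the first user account's segment for the same clause; the claim does not fix the formula, so the sketch below is only illustrative.

```python
# Illustrative energy normalization: scale the similar segment so its RMS
# energy matches the reference segment of the first user account.
import numpy as np

def energy_normalize(similar_segment, reference_segment, eps=1e-12):
    similar = np.asarray(similar_segment, dtype=float)
    reference = np.asarray(reference_segment, dtype=float)
    rms_similar = np.sqrt(np.mean(similar ** 2) + eps)
    rms_reference = np.sqrt(np.mean(reference ** 2) + eps)
    return similar * (rms_reference / rms_similar)
```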
2. The method of claim 1, wherein after the similar audio data segment corresponding to the first user account under the target clause is screened out from the audio data segments of the at least one second user account according to the size of the similarity reference value, the method further comprises:
determining at least one alternative clause corresponding to the target song;
and replacing the audio data segment corresponding to the replacing clause in the first deduction audio data corresponding to the target song and the first user with a similar audio data segment corresponding to the first user account and corresponding to the replacing clause, and generating replacing audio data corresponding to the target song and corresponding to the first user account.
3. A method according to claim 1, wherein said calculating note difference values between said first sequence of note values and said second sequence of note values further comprises:
calculating the sum or average of the distance values between each note value contained in the first note value sequence and the corresponding note value in the second note value sequence, as the note difference value between the first note value sequence and the second note value sequence.
4. The method of claim 1, wherein the extracting of the fundamental frequency data of the audio data segment of each clause of each deductive audio data and the obtaining of the note value sequence corresponding to the extracted fundamental frequency data comprise:
respectively extracting fundamental frequency data of the audio data segment of each clause in each deduction audio data according to a preset frame length and a preset frame shift so as to generate at least one fundamental frequency point corresponding to each clause in each deduction audio data;
and adjusting the base frequency value of each base frequency point in the at least one base frequency point, and converting the adjusted base frequency value of each base frequency point into a note value corresponding to each base frequency point, thereby obtaining a note value sequence corresponding to the base frequency data of the audio data segments of different users in each clause.
5. The method according to claim 4, wherein the adjusting of the fundamental frequency value of each of the at least one fundamental frequency point comprises:
carrying out zero setting processing on the fundamental frequency value of a singular fundamental frequency point in the at least one fundamental frequency point;
and performing median filtering processing on the fundamental frequency points.
6. The method according to claim 1, wherein the extracting of the first feature data of the audio data segment of the first user account under the target clause is specifically:
and extracting a first MFCC characteristic corresponding to the audio data segment of the first user account under the target clause as first characteristic data through a preset characteristic extraction algorithm.
7. The method according to claim 6, wherein the extracting of the second feature data of the audio data segment of the second user account under the target clause is specifically:
and extracting a second MFCC characteristic corresponding to the audio data fragment from the audio data fragment of the second user account under the target clause through the preset characteristic extraction algorithm to serve as third characteristic data, and training the third characteristic data through a preset training model to obtain data serving as second characteristic data.
8. The method of claim 7, further comprising:
training the first characteristic data through a preset training model to obtain data serving as fourth characteristic data;
calculating a first likelihood reference value between the third feature data and the fourth feature data;
and calculating an average value of the first likelihood reference value and the likelihood reference value between the first feature data and the second feature data, and taking the average value as the updated likelihood reference value.
9. The method according to any one of claims 1 to 8, wherein the calculating of the likelihood reference value between the first feature data and the second feature data is specifically:
and calculating a likelihood reference value between the first characteristic data and the second characteristic data according to a preset likelihood function.
10. The method according to any one of claims 1 to 8, wherein the calculating the similarity reference values of the first user account and the second user account under the target clause according to the note difference values and the likelihood reference values is specifically:
and acquiring a preset weighting coefficient, and weighting the note difference value and the likelihood reference value according to the preset weighting coefficient to obtain the similarity reference value, wherein the weighting coefficient of the note difference value is smaller than 0.
11. A device for searching similar audio data, comprising:
the audio data acquisition module is used for respectively acquiring deductive audio data corresponding to the target song under a first user account and at least one second user account, wherein the deductive audio data comprises audio data segments corresponding to a plurality of clauses of the target song; the deductive audio data is voice type audio data;
a note difference value calculating module, configured to extract fundamental frequency data of an audio data segment of each clause of each deductive audio data, obtain a note value sequence corresponding to the extracted fundamental frequency data, determine a first note value sequence of the first user account in a target clause and a second note value sequence of the second user account in the target clause, and calculate a note difference value between the first note value sequence and the second note value sequence; the note difference value characterizes tonal similarity between audio data segments;
a likelihood reference value calculation module, configured to extract first feature data of an audio data segment of the first user account in the target clause and second feature data of an audio data segment of the second user account in the target clause, and calculate a likelihood reference value between the first feature data and the second feature data; the likelihood reference values represent human voice tone similarity among the audio data fragments;
a similarity reference value calculating module, configured to calculate, according to the note difference value and the likelihood reference value, a similarity reference value of the first user account and the second user account in the target clause;
a similar audio data segment screening module, configured to screen, from the audio data segments of the at least one second user account, a similar audio data segment corresponding to the first user account in the target clause according to the size of the similarity reference value;
the similar audio data segment screening module is further configured to perform an energy normalization operation on the similar audio data segments according to the audio data segments of the first user account in the target clause, so as to obtain the similar audio data segments after the energy normalization operation is performed; the similar audio data segments subjected to the energy normalization operation have energy similar to that of the audio data segments of the first user account under the target clause;
and the similar audio data fragment screening module is further configured to replace and optimize the audio data fragment corresponding to the target clause in the deductive audio data in the first user account into the similar audio data fragment subjected to the energy normalization operation, so as to obtain replaced audio data of the deductive audio data in the first user account.
12. The apparatus of claim 11, further comprising an audio data segment replacement module configured to determine at least one replacement clause corresponding to the target song; and replacing the audio data segment corresponding to the replacing clause in the first deduction audio data corresponding to the target song and the first user with a similar audio data segment corresponding to the first user account and corresponding to the replacing clause, and generating replacing audio data corresponding to the target song and corresponding to the first user account.
13. The apparatus as recited in claim 11, wherein the note difference calculation module is further configured to calculate the sum or average of the distance values between each note value contained in the first note value sequence and the corresponding note value in the second note value sequence, as the note difference value between the first note value sequence and the second note value sequence.
14. The apparatus of claim 11, wherein the note difference calculation module is further configured to extract the fundamental frequency data of the audio data segment of each clause in each of the deductive audio data according to a preset frame length and a preset frame shift, respectively, to generate at least one fundamental frequency point corresponding to each clause in each of the deductive audio data; and to adjust the fundamental frequency value of each of the at least one fundamental frequency point, and convert the adjusted fundamental frequency value of each fundamental frequency point into a corresponding note value, thereby obtaining the note value sequence corresponding to the fundamental frequency data of the audio data segments of different users in each clause.
15. The apparatus according to claim 14, wherein the note difference calculation module is further configured to zero the fundamental frequency values of the singular fundamental frequency points in the at least one fundamental frequency point; and performing median filtering processing on the fundamental frequency points.
16. The apparatus of claim 11, wherein the likelihood reference value calculating module is further configured to extract, as the first feature data, a first MFCC feature corresponding to the audio data segment of the first user account under the target clause from the audio data segment through a preset feature extraction algorithm.
17. The apparatus according to claim 16, wherein the likelihood reference value calculating module is further configured to extract, through the preset feature extraction algorithm, a second MFCC feature corresponding to the audio data segment of the second user account under the target clause as third feature data, and to train the third feature data through a preset training model, using the resulting data as the second feature data.
18. The apparatus according to claim 17, wherein the likelihood reference value calculation module is further configured to train the first feature data through a preset training model and use the resulting data as fourth feature data; calculate a first likelihood reference value between the third feature data and the fourth feature data; and calculate an average value of the first likelihood reference value and the likelihood reference value between the first feature data and the second feature data, taking the average value as the updated likelihood reference value.
19. The apparatus according to any one of claims 11 to 18, wherein the likelihood reference value calculating module is further configured to calculate the likelihood reference value between the first feature data and the second feature data according to a preset likelihood function.
20. The apparatus according to any one of claims 11 to 18, wherein the similarity reference value calculating module is further configured to obtain a preset weighting coefficient, and weight the note difference value and the likelihood reference value according to the preset weighting coefficient to obtain the similarity reference value, where the weighting coefficient of the note difference value is smaller than 0.
CN201710129982.8A 2017-03-07 2017-03-07 Similar audio data searching method and device Active CN106970950B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710129982.8A CN106970950B (en) 2017-03-07 2017-03-07 Similar audio data searching method and device

Publications (2)

Publication Number Publication Date
CN106970950A CN106970950A (en) 2017-07-21
CN106970950B true CN106970950B (en) 2021-08-24

Family

ID=59329114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710129982.8A Active CN106970950B (en) 2017-03-07 2017-03-07 Similar audio data searching method and device

Country Status (1)

Country Link
CN (1) CN106970950B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107368609B (en) * 2017-08-10 2018-09-04 广州酷狗计算机科技有限公司 Obtain the method, apparatus and computer readable storage medium of multimedia file
CN111274415A (en) * 2020-01-14 2020-06-12 广州酷狗计算机科技有限公司 Method, apparatus and computer storage medium for determining alternate video material
CN112487940B (en) * 2020-11-26 2023-02-28 腾讯音乐娱乐科技(深圳)有限公司 Video classification method and device

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101627423A (en) * 2006-10-20 2010-01-13 France Telecom Synthesis of lost blocks of a digital audio signal, with pitch period correction
CN103177722A (en) * 2013-03-08 2013-06-26 北京理工大学 Tone-similarity-based song retrieval method
CN103531196A (en) * 2013-10-15 2014-01-22 中国科学院自动化研究所 Sound selection method for waveform concatenation speech synthesis
CN104112444A (en) * 2014-07-28 2014-10-22 中国科学院自动化研究所 Text message based waveform concatenation speech synthesis method
CN104347068A (en) * 2013-08-08 2015-02-11 索尼公司 Audio signal processing device, audio signal processing method and monitoring system
CN104778957A (en) * 2015-03-20 2015-07-15 广东欧珀移动通信有限公司 Song audio processing method and device
CN104778958A (en) * 2015-03-20 2015-07-15 广东欧珀移动通信有限公司 Method and device for splicing noise-containing songs
CN105788589A (en) * 2016-05-04 2016-07-20 腾讯科技(深圳)有限公司 Audio data processing method and device
CN105931634A (en) * 2016-06-15 2016-09-07 腾讯科技(深圳)有限公司 Audio screening method and device
CN106057208A (en) * 2016-06-14 2016-10-26 科大讯飞股份有限公司 Audio correction method and device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050222515A1 (en) * 2004-02-23 2005-10-06 Biosignetics Corporation Cardiovascular sound signature: method, process and format
CN101226558B (en) * 2008-01-29 2011-08-31 福州大学 Method for searching audio data based on MFCCM
CN102053998A (en) * 2009-11-04 2011-05-11 周明全 Method and system device for retrieving songs based on voice modes
US9122753B2 (en) * 2011-04-11 2015-09-01 Samsung Electronics Co., Ltd. Method and apparatus for retrieving a song by hummed query
CN103870466A (en) * 2012-12-10 2014-06-18 哈尔滨网腾科技开发有限公司 Automatic extracting method for audio examples
CN103854646B (en) * 2014-03-27 2018-01-30 成都康赛信息技术有限公司 A kind of method realized DAB and classified automatically
CN105022744A (en) * 2014-04-24 2015-11-04 上海京知信息科技有限公司 Dynamic programming based humming melody extracting and matching search method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
VIDEO ENCODING AND SPLICING FOR TUNE-IN TIME REDUCTION IN IP DATACASTING (IPDC) OVER DVB-H; Mehdi Rezaei et al.; 2006 IEEE International Conference on Multimedia and Expo; 2006-12-26; 601-604 *
Research and Implementation of Audio Watermarking Based on the Transform Domain; Dong Bin; China Master's Theses Full-text Database, Information Science and Technology; 2011-09-15 (No. 09); I138-72 *
Research on the Application of Digital Audio Watermarking in Copyright Protection and Tamper Detection; Chang Lejie; China Master's Theses Full-text Database, Information Science and Technology; 2017-02-15 (No. 02); I138-81 *

Also Published As

Publication number Publication date
CN106970950A (en) 2017-07-21

Similar Documents

Publication Publication Date Title
EP2659482B1 (en) Ranking representative segments in media data
Rocamora et al. Comparing audio descriptors for singing voice detection in music audio files
CN110880329B (en) Audio identification method and equipment and storage medium
Tsunoo et al. Beyond timbral statistics: Improving music classification using percussive patterns and bass lines
CN106991163A (en) A kind of song recommendations method based on singer's sound speciality
Hu et al. Separation of singing voice using nonnegative matrix partial co-factorization for singer identification
CN106997769B (en) Trill recognition method and device
CN106970950B (en) Similar audio data searching method and device
Yu et al. Sparse cepstral codes and power scale for instrument identification
Elowsson et al. Modeling the perception of tempo
Tsunoo et al. Music mood classification by rhythm and bass-line unit pattern analysis
Yang Computational modelling and analysis of vibrato and portamento in expressive music performance
Loni et al. Robust singer identification of Indian playback singers
CN107025902B (en) Data processing method and device
Dittmar et al. Novel mid-level audio features for music similarity
Zhang et al. A novel singer identification method using GMM-UBM
CN112270929B (en) Song identification method and device
Song et al. Implementation of a practical query-by-singing/humming (QbSH) system and its commercial applications
WO2019053544A1 (en) Identification of audio components in an audio mix
CN106548784B (en) Voice data evaluation method and system
Sridhar et al. Music information retrieval of carnatic songs based on carnatic music singer identification
Tang et al. Melody Extraction from Polyphonic Audio of Western Opera: A Method based on Detection of the Singer's Formant.
Shirali-Shahreza et al. Fast and scalable system for automatic artist identification
CN113744721B (en) Model training method, audio processing method, device and readable storage medium
Alvarez et al. Singer identification using convolutional acoustic motif embeddings

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant