CN106970950B - Similar audio data searching method and device - Google Patents

Similar audio data searching method and device

Info

Publication number
CN106970950B
CN106970950B (application CN201710129982.8A)
Authority
CN
China
Prior art keywords
audio data
user account
note
value
clause
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710129982.8A
Other languages
Chinese (zh)
Other versions
CN106970950A (en)
Inventor
孔令城
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN201710129982.8A priority Critical patent/CN106970950B/en
Publication of CN106970950A publication Critical patent/CN106970950A/en
Application granted granted Critical
Publication of CN106970950B publication Critical patent/CN106970950B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63Querying
    • G06F16/635Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/26Pre-filtering or post-filtering
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/54Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a method and a device for searching for similar audio data. The method comprises the following steps: acquiring deductive audio data corresponding to a target song under a first user account and at least one second user account, the deductive audio data comprising audio data segments corresponding to a plurality of clauses of the target song; extracting fundamental frequency data of the audio data segments and acquiring note value sequences of the fundamental frequency data; calculating a note difference value between the note value sequence of the first user account under a target clause and the note value sequence of the second user account under the target clause; calculating a likelihood reference value between first characteristic data of the first user account under the target clause and second characteristic data of the second user account under the target clause; calculating a similarity reference value according to the note difference value and the likelihood reference value; and screening out similar audio data segments from the audio data segments of the second user account according to the similarity reference value. By adopting the method and the device, the accuracy of searching for similar audio data can be improved.

Description

Similar audio data searching method and device
Technical Field
The invention relates to the technical field of internet, in particular to a method and a device for searching similar audio data.
Background
With the continuous development and improvement of terminal technology, terminal devices such as mobile phones and tablet computers have become an indispensable part of people's lives. Users can realize various application functions by installing various application programs, such as music software or karaoke software, on the terminals, so as to meet different requirements in daily life.
In existing music software or karaoke software, besides downloading or playing music files, a user can also sing songs and share the sung works. For example, after a user records, through a terminal, a singing work containing only his or her own voice, or a singing work that also includes the background music accompaniment of the corresponding song, the corresponding work can be uploaded so that the user and other users can view it. Because a large number of users upload a large number of their own deductive works of songs, and because of the personal register, timbre and singing skills of each singer, two works may be similar in some sentences or over the whole song.
In the prior art, new vocal deductive works can be obtained by splicing songs or replacing performed clauses between different deductive works, or works of other users that are close to the current user's deductive work can be recommended to the user. However, in existing music software or karaoke software, the deductive works of other users that are most similar to the work sung by the user, or to its sung clauses, in aspects such as timbre and intonation cannot be accurately found among the large number of deductive works. That is, the technical schemes in the prior art for searching for audio data similar to a deductive work or deductive clause are not sufficiently accurate.
Disclosure of Invention
The embodiment of the invention provides a method for searching similar audio data, which can accurately search audio data similar to the current deduction works or deduction clauses from a large number of deduction works, thereby improving the accuracy of searching similar audio data.
A method for searching similar audio data comprises the following steps:
respectively acquiring deduction audio data corresponding to a target song under a first user account and at least one second user account, wherein the deduction audio data comprises audio data fragments corresponding to a plurality of clauses of the target song;
extracting fundamental frequency data of an audio data segment of each sentence of each deductive audio data, acquiring a note value sequence corresponding to the extracted fundamental frequency data, determining a first note value sequence of the first user account under the target sentence and a second note value sequence of the second user account under the target sentence, and calculating note difference values between the first note value sequence and the second note value sequence;
extracting first characteristic data of an audio data segment of the first user account under the target clause and second characteristic data of an audio data segment of the second user account under the target clause, and calculating a likelihood reference value between the first characteristic data and the second characteristic data;
calculating a similarity reference value of the first user account and the second user account under the target clause according to the note difference value and the likelihood reference value;
and screening out similar audio data segments corresponding to the first user account under the target clause from the audio data segments of the at least one second user account according to the similarity reference value.
Optionally, in one embodiment, after the step of screening out, according to the size of the similarity reference value, a similar audio data segment corresponding to the first user account under the target clause from among the audio data segments of the at least one second user account, the method further includes: determining at least one replacement clause corresponding to the target song; and replacing the audio data segment corresponding to the replacement clause in the first deductive audio data corresponding to the target song and the first user account with the similar audio data segment corresponding to the first user account under the replacement clause, so as to generate replacement audio data corresponding to the target song under the first user account.
Optionally, in one embodiment, the calculating the note difference value between the first note value sequence and the second note value sequence further includes: calculating the sum/average of the distance values between corresponding note values in the first note value sequence and the second note value sequence as the note difference value between the first note value sequence and the second note value sequence.
Optionally, in one embodiment, the extracting fundamental frequency data of the audio data segment of each sentence of each deductive audio data, and the obtaining of the sequence of note values corresponding to the extracted fundamental frequency data includes: respectively extracting fundamental frequency data of the audio data segment of each clause in each deduction audio data according to a preset frame length and a preset frame shift so as to generate at least one fundamental frequency point corresponding to each clause in each deduction audio data; and adjusting the base frequency value of each base frequency point in the at least one base frequency point, and converting the adjusted base frequency value of each base frequency point into a note value corresponding to each base frequency point, thereby obtaining a note value sequence corresponding to the base frequency data of the audio data segments of different users in each clause.
Optionally, in one embodiment, the adjusting the fundamental frequency value of each of the at least one base frequency point includes: carrying out zero setting processing on the fundamental frequency value of a singular fundamental frequency point in the at least one fundamental frequency point; and performing median filtering processing on the fundamental frequency points.
Optionally, in one embodiment, the extracting of the first feature data of the audio data segment of the first user account under the target clause specifically includes: and extracting a first MFCC characteristic corresponding to the audio data segment of the first user account under the target clause as first characteristic data through a preset characteristic extraction algorithm.
Optionally, in one embodiment, the extracting second feature data of the audio data segment of the second user account under the target clause specifically includes: and extracting a second MFCC characteristic corresponding to the audio data fragment from the audio data fragment of the second user account under the target clause through the preset characteristic extraction algorithm to serve as third characteristic data, and training the third characteristic data through a preset training model to obtain data serving as second characteristic data.
Optionally, in one embodiment, the method further includes: training the first characteristic data through a preset training model to obtain data serving as the fourth characteristic data; calculating a first likelihood reference value between the third feature data and the fourth feature data; and calculating an average value of the first likelihood reference value and the likelihood reference values between the first feature data and the second feature data, taking the average value as the likelihood reference value, and updating the likelihood reference value.
Optionally, in one embodiment, the calculating the likelihood reference value between the first feature data and the second feature data specifically includes: and calculating a likelihood reference value between the first characteristic data and the second characteristic data according to a preset likelihood function.
Optionally, in one embodiment, the calculating, according to the note difference value and the likelihood reference value, similarity reference values of the first user account and the second user account in the target clause respectively is specifically: and acquiring a preset weighting coefficient, and weighting the note difference value and the likelihood reference value according to the preset weighting coefficient to obtain the similarity reference value, wherein the weighting coefficient of the note difference value is smaller than 0.
In addition, the embodiment of the invention also provides a searching device of similar audio data, which can accurately search the audio data similar to the current deduction works or deduction clauses from a large number of deduction works, thereby improving the searching accuracy of the similar audio data.
A similar audio data searching apparatus, comprising:
the audio data acquisition module is used for respectively acquiring deductive audio data corresponding to the target song under a first user account and at least one second user account, wherein the deductive audio data comprises audio data segments corresponding to a plurality of clauses of the target song;
a note difference value calculating module, configured to extract fundamental frequency data of an audio data segment of each clause of each deductive audio data, obtain a note value sequence corresponding to the extracted fundamental frequency data, determine a first note value sequence of the first user account in the target clause and a second note value sequence of the second user account in the target clause, and calculate a note difference value between the first note value sequence and the second note value sequence;
a likelihood reference value calculation module, configured to extract first feature data of an audio data segment of the first user account in the target clause and second feature data of an audio data segment of the second user account in the target clause, and calculate a likelihood reference value between the first feature data and the second feature data;
a similarity reference value calculating module, configured to calculate, according to the note difference value and the likelihood reference value, a similarity reference value of the first user account and the second user account in the target clause;
and the similar audio data fragment screening module is used for screening out a similar audio data fragment corresponding to the first user account under the target clause from the audio data fragments of the at least one second user account according to the size of the similarity reference value.
Optionally, in one embodiment, the apparatus further includes an audio data segment replacing module, configured to determine at least one replacing clause corresponding to the target song; and replacing the audio data segment corresponding to the replacing clause in the first deduction audio data corresponding to the target song and the first user with a similar audio data segment corresponding to the first user account and corresponding to the replacing clause, and generating replacing audio data corresponding to the target song and corresponding to the first user account.
Optionally, in one embodiment, the note difference calculation module is further configured to calculate the sum/average of the distance values between corresponding note values in the first note value sequence and the second note value sequence as the note difference value between the first note value sequence and the second note value sequence.
Optionally, in one embodiment, the note difference calculation module is further configured to extract, according to a preset frame length and a preset frame shift, fundamental frequency data of an audio data segment of each clause in each deductive audio data, so as to generate at least one fundamental frequency point corresponding to each clause in each deductive audio data; and adjusting the base frequency value of each base frequency point in the at least one base frequency point, and converting the adjusted base frequency value of each base frequency point into a note value corresponding to each base frequency point, thereby obtaining a note value sequence corresponding to the base frequency data of the audio data segments of different users in each clause.
Optionally, in one embodiment, the note difference value calculating module is further configured to perform zeroing processing on fundamental frequency values of singular fundamental frequency points in the at least one fundamental frequency point; and performing median filtering processing on the fundamental frequency points.
Optionally, in one embodiment, the likelihood reference value calculation module is further configured to extract, as the first feature data, a first MFCC feature corresponding to the audio data segment of the first user account under the target clause from the audio data segment by using a preset feature extraction algorithm.
Optionally, in an embodiment, the likelihood reference value calculating module is further configured to extract, by using the preset feature extraction algorithm, a second MFCC feature corresponding to the audio data segment of the second user account under the target clause as third feature data, and train, by using a preset training model, the third feature data to obtain data serving as second feature data.
Optionally, in one embodiment, the likelihood reference value calculating module is further configured to train the first feature data through a preset training model to obtain data serving as the fourth feature data; calculating a first likelihood reference value between the third feature data and the fourth feature data; and calculating an average value of the first likelihood reference value and the likelihood reference values between the first feature data and the second feature data, taking the average value as the likelihood reference value, and updating the likelihood reference value.
Optionally, in one embodiment, the likelihood reference value calculating module is further configured to calculate a likelihood reference value between the first feature data and the second feature data according to a preset likelihood function.
Optionally, in one embodiment, the similarity reference value calculating module is further configured to obtain a preset weighting coefficient, and weight the note difference value and the likelihood reference value according to the preset weighting coefficient to obtain the similarity reference value, where the weighting coefficient of the note difference value is smaller than 0.
The embodiment of the invention has the following beneficial effects:
After the searching method and device for similar audio data are adopted, for a singing work recorded by a target user through a terminal, audio data segments that are similar in intonation and timbre to the audio data segment of each clause in the target user's singing work can be searched for among all the singing works uploaded by other users. That is to say, when searching for similar audio data, it is considered not only whether the intonation corresponding to each piece of audio data is consistent or close, but also whether the corresponding singers are similar in timbre. Compared with prior-art technical schemes that only consider whether the intonation is consistent, the similar audio data found is therefore closer to the audio data performed by the target user, the similarity between the found similar audio data and the current audio data is improved, and the accuracy of searching for similar audio data is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Wherein:
FIG. 1 is a flowchart illustrating a method for searching for similar audio data according to an embodiment;
FIG. 2 is a diagram illustrating an exemplary structure of a device for searching similar audio data;
fig. 3 is a schematic structural diagram of a computer device for executing the similar audio data searching method in one embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the embodiment of the invention, a method for searching similar audio data is provided, which can accurately find, among a large number of deductive works, audio data similar to the current deductive work or deductive clause, thereby improving the accuracy of searching for similar audio data. In particular, the implementation of the method may rely on a computer program executable on a computer system based on the von Neumann architecture, and the computer program may be a music application program including a singing function, such as a music playback application or a karaoke application, for example a national karaoke application. The computer system may be a server, or a terminal device such as a smart phone, tablet computer or personal computer, that runs the computer program.
It should be noted that, in this embodiment, the search method for the similar audio data may be performed based on the server corresponding to the music application.
Embodiments of the present invention are applicable to any song that may be sung or deduced by a user, and are described in detail below with reference to only one song (i.e., a target song).
As shown in fig. 1, the searching method of similar audio data at least includes the following steps S102-S110:
step S102: deduction audio data corresponding to a target song under a first user account and at least one second user account are obtained respectively, and the deduction audio data comprise audio data fragments corresponding to a plurality of clauses of the target song.
Specifically, in the embodiment of the present invention, the deductive audio data may be audio data recorded by a terminal device with a recording function when the user sings the target song. The deductive audio data may contain only the vocal audio data sung by the user, or may be audio data in which the vocal audio data sung by the user is mixed with background music such as the accompaniment of the target song, which is not specifically limited here.
The target song comprises a plurality of clauses, the clauses can be divided through lyric information of the target song, each sentence of lyrics corresponds to one clause, and each clause corresponds to one audio data fragment. After the user sings the target song, the performance audio data sung by the user can be uploaded to a server corresponding to the music software through the music software installed on the terminal. That is, the deductive audio data uploaded by each user is stored in the server, and other users can view the deductive audio data of other users by requesting the server.
In this embodiment, the first user account is the target user account. The example considered in this embodiment is searching for audio data similar to the deductive audio data of the target song under the target user account, and the search targets are the deductive audio data of the target song under the second user accounts. It should be noted that, in this embodiment, there may be more than one second user account; for example, the second user accounts may be all user accounts except the target user account, or some of all the other user accounts, for example all user accounts other than the target user account whose gender identification is "female".
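As an illustrative sketch only, the deductive audio data handled in the following steps can be thought of as per-account, per-clause sample arrays. The Python code below shows one possible layout, assuming that the start and end times of each clause are available from the lyric information of the target song (all names are hypothetical and not part of the patent text):

    import numpy as np

    def split_into_clauses(samples: np.ndarray, sr: int,
                           clause_times: list[tuple[float, float]]) -> list[np.ndarray]:
        # clause_times holds the (start_sec, end_sec) of each lyric line, assumed
        # to be available from the target song's lyric information.
        return [samples[int(s * sr):int(e * sr)] for s, e in clause_times]

    # segments[account][k] is the audio data segment of clause k sung under `account`.
    segments: dict[str, list[np.ndarray]] = {}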
Step S104: extracting fundamental frequency data of an audio data segment of each clause of each deductive audio data, obtaining a note value sequence corresponding to the extracted fundamental frequency data, determining a first note value sequence of the first user account under the target clause and a second note value sequence of the second user account under the target clause, and calculating note difference values between the first note value sequence and the second note value sequence.
Specifically, for the audio data segments corresponding to each clause in the deductive audio data of the target song under the first user account and under the second user account, the fundamental frequency data corresponding to each audio data segment is extracted respectively, and the note value sequence corresponding to the fundamental frequency data is then obtained. The fundamental frequency data may be the fundamental tone of the audio data segment, which is used to determine the pitch of the audio in the audio data segment; the note value refers to a standard MIDI (Musical Instrument Digital Interface) note value.
In a specific embodiment, said extracting fundamental frequency data of the audio data segment of each sentence of each deductive audio data, and said obtaining a sequence of note values corresponding to said extracted fundamental frequency data comprises: respectively extracting fundamental frequency data of the audio data segment of each clause in each deduction audio data according to a preset frame length and a preset frame shift so as to generate at least one fundamental frequency point corresponding to each clause in each deduction audio data; and adjusting the base frequency value of each base frequency point in the at least one base frequency point, and converting the adjusted base frequency value of each base frequency point into a note value corresponding to each base frequency point, thereby obtaining a note value sequence corresponding to the base frequency data of the audio data segments of different users in each clause.
For example, the frame length may be preset to 30 ms and the frame shift to 10 ms, and the fundamental frequency data of the audio data segment of each clause deduced by each user is collected accordingly, so that at least one fundamental frequency point exists for each clause deduced by each user. Denoising, smoothing and other processing are performed on the at least one fundamental frequency point, and the adjusted fundamental frequency value of each fundamental frequency point is then converted into the note value corresponding to that fundamental frequency point. Each clause deduced by each user thus corresponds to at least one fundamental frequency point, the at least one fundamental frequency point corresponds to at least one note value, and the at least one note value forms a corresponding note value sequence, so that the note value sequence corresponding to the fundamental frequency data of the audio data segment of each clause of each user is obtained.
In a possible implementation scenario, a preset note conversion formula may be adopted, and the note value of each fundamental frequency point is calculated according to the adjusted fundamental frequency value of each fundamental frequency point. The preset note conversion formula may be:
    mi = 69 + 12 · log2(xi / 440), i = 1, 2, …, p

wherein mi denotes the note value of the i-th fundamental frequency point, xi denotes the adjusted fundamental frequency value of the i-th fundamental frequency point, and p denotes the length of the note value sequence.
In addition, when M denotes the note value sequence and X denotes the sequence of fundamental frequency values composed of the fundamental frequency values of the fundamental frequency points in the fundamental frequency data, the preset note conversion formula can be written as:

    M = 69 + 12 · log2(X / 440)

where the conversion is applied element-wise to X.
it should be noted that, when the fundamental frequency value of each fundamental frequency point in at least one fundamental frequency point is adjusted, the fundamental frequency value of a singular fundamental frequency point in at least one fundamental frequency point may be zeroed. For example, if the front fundamental frequency value and the rear fundamental frequency value of a non-0 fundamental frequency point are both 0, the fundamental frequency point is marked as 0; the median filtering processing can also be performed on several continuous fundamental frequency points, and through the median filtering processing (for example, 5-point median filtering), the curve of the fundamental frequency band can be smoothed, and the occurrence of noise points is avoided.
Optionally, in an embodiment, before extracting the fundamental frequency data of the audio data segments of each sentence performed by each user, each audio data segment may further be normalized to a preset format, for example to a 16 kHz, 16-bit PCM format.
After the note value sequences corresponding to the first user account and the second user account under each clause of the target song are determined, note difference values of the first user account and the second user account under the target clause can be calculated, namely the difference between the note values of the audio data segment of the target clause deduced by the first user account and the audio data segment of the target clause deduced by the second user account is calculated.
In a specific embodiment, taking the target clause as the kth clause of the target song as an example, the users under the respective user accounts sing the same sentence, so the lengths of the corresponding audio data segments are the same, that is, the number of fundamental frequency values contained in the corresponding fundamental frequency data is the same. Therefore, the lengths of the note value sequences of the different users under the kth clause are consistent, that is, they contain the same number of note value components. Let the length of the kth clause be p, which means that the kth clause has p frames, or that the note value sequence corresponding to the kth clause contains p note values.
Let the note value sequence of the kth clause deduced under the first user account be M1k = (m1k1, m1k2, …, m1kp), and the note value sequence of the kth clause deduced under the second user account be M2k = (m2k1, m2k2, …, m2kp). The note difference value between M1k and M2k is then obtained by a preset note difference value calculation formula.
For example, in one embodiment, the note difference value may be obtained by summing the absolute values of the differences between corresponding note value components of M1k and M2k, i.e. the preset note difference value calculation formula is:

    S1k2 = Σ(i=1..p) |m1ki − m2ki|

wherein S1k2 denotes the note difference value between the first note value sequence and the second note value sequence of the kth clause.
For another example, in another alternative embodiment, the note difference value may be obtained by averaging the absolute values of the differences between corresponding note value components of M1k and M2k, i.e. the preset note difference value calculation formula is:

    S1k2 = (1/p) · Σ(i=1..p) |m1ki − m2ki|
It should be noted that the note difference value indicates the difference between the pitches of two users when performing the same sentence: the smaller the note difference value, the smaller the pitch difference between the two users' deductions; conversely, the larger the note difference value, the larger the pitch difference between the two users' deductions.
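A minimal sketch of the note difference value computation for a single clause (two note value sequences of equal length p), following the sum/average formulas above:

    import numpy as np

    def note_difference(m1k: np.ndarray, m2k: np.ndarray, average: bool = True) -> float:
        # Sum of absolute differences between corresponding note value components;
        # optionally divided by the sequence length p to obtain the average form.
        diff = float(np.abs(m1k - m2k).sum())
        return diff / len(m1k) if average else diff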
Step S106: extracting first characteristic data of the audio data segment of the first user account under the target clause and second characteristic data of the audio data segment of the second user account under the target clause, and calculating a likelihood reference value between the first characteristic data and the second characteristic data.
In this embodiment, for the audio data segment corresponding to the target clause deduced under the first user account and the audio data segment corresponding to the target clause deduced under the second user account, corresponding feature data are extracted according to a preset feature extraction algorithm, respectively. It should be noted that, in this embodiment, the preset feature extraction algorithm may be a known timbre feature extraction algorithm, for example an MFCC (Mel-Frequency Cepstral Coefficients) feature extraction algorithm that extracts the static MFCC features corresponding to the audio data.
Specifically, in a specific embodiment, the extracting of the first feature data of the audio data segment of the first user account under the target clause specifically includes: and extracting a first MFCC characteristic corresponding to the audio data segment of the first user account under the target clause as first characteristic data through a preset characteristic extraction algorithm.
First, the audio data segment A1k of the kth clause deduced under the first user account is framed according to the preset frame length and the preset frame shift to obtain the framed audio frame data

    Y1k = {y1k1, y1k2, …, y1kp},

wherein p is the length of the audio frame data Y1k (i.e., the number of frames).

Then, for each audio frame data component y1ki, a discrete Fourier transform, modulus squaring, filtering with a triangular band-pass filter bank, taking the logarithm, and a discrete cosine transform are performed in turn to obtain the MFCC feature λ1ki corresponding to y1ki. That is, the MFCC feature of the audio data segment A1k may be expressed as

    λ1k = {λ1k1, λ1k2, …, λ1kp}.
Similarly, the audio data segment A2k of the kth clause deduced by the user corresponding to the second user account is framed according to the preset frame length and the preset frame shift to obtain the framed audio frame data

    Y2k = {y2k1, y2k2, …, y2kp},

wherein p is the length of the audio frame data Y2k (i.e., the number of frames).

Then, for each audio frame data component y2ki, a discrete Fourier transform, modulus squaring, filtering with a triangular band-pass filter bank, taking the logarithm, and a discrete cosine transform are performed in turn to obtain the MFCC feature λ2ki corresponding to y2ki. That is, the MFCC feature of the audio data segment A2k may be expressed as

    λ2k = {λ2k1, λ2k2, …, λ2kp}.
It should be noted that, in the present embodiment, the MFCC feature λ1ki or λ2ki may be a multidimensional array or vector, for example a 13-dimensional vector, or a 39-dimensional feature vector obtained by calculating the first-order and second-order differences of the 13-dimensional MFCC feature vector.
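The MFCC extraction described above (framing, DFT, modulus squaring, triangular band-pass filtering, logarithm, DCT) can be sketched with librosa, which performs those steps internally; the 30 ms frame length and 10 ms frame shift at an assumed 16 kHz sampling rate correspond to n_fft=480 and hop_length=160. This is a sketch under those assumptions, not the patent's own implementation:

    import numpy as np
    import librosa

    def clause_mfcc(segment: np.ndarray, sr: int = 16000, with_deltas: bool = False) -> np.ndarray:
        # 13 static MFCCs per frame; shape (13, p).
        mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=13,
                                    n_fft=480, hop_length=160)
        if with_deltas:
            # 39-dimensional features: static MFCCs plus first and second differences.
            mfcc = np.vstack([mfcc,
                              librosa.feature.delta(mfcc, order=1),
                              librosa.feature.delta(mfcc, order=2)])
        return mfcc.T  # one feature vector per frame, shape (p, dim)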
In this embodiment, when calculating the likelihood reference value Q1k2 between the audio data segment A1k of the kth clause deduced under the first user account and the audio data segment A2k of the target clause deduced under the second user account, the likelihood value between the MFCC features λ1k and λ2k corresponding to A1k and A2k respectively may be calculated.
Further, in an optional implementation, the calculating a likelihood reference value between the first feature data and the second feature data specifically includes: and calculating a likelihood reference value between the first characteristic data and the second characteristic data according to a preset likelihood function. It should be noted that, in the present embodiment, the preset likelihood function may be any known likelihood function, and is not specifically limited herein.
When extracting the feature data of the audio data segment, besides extracting the MFCC features of the audio data segment, other feature extraction methods may also be adopted, for example further processing the extracted MFCC features.
Specifically, in a specific embodiment, the extracting of the second feature data of the audio data segment of the second user account under the target clause specifically includes: and extracting a second MFCC characteristic corresponding to the audio data fragment from the audio data fragment of the second user account under the target clause through the preset characteristic extraction algorithm to serve as third characteristic data, and training the third characteristic data through a preset training model to obtain data serving as new characteristic data corresponding to the audio data fragment.
For example, Gaussian model training is performed on the MFCC feature data λ2k = {λ2k1, λ2k2, …, λ2kp} extracted from the audio data segment A2k of the kth clause deduced under the second user account: a 256-mixture Gaussian mixture model is trained using the EM (Expectation Maximization) algorithm, and the resulting feature data η2k = {η2k1, η2k2, …, η2kp} is taken as the third feature data, where η2ki corresponds to λ2ki.
When calculating the likelihood reference value Q1k2 between the audio data segment A1k of the kth clause deduced under the first user account and the audio data segment A2k of the target clause deduced under the second user account, the likelihood value between the MFCC feature λ1k corresponding to A1k and the third feature data η2k corresponding to A2k may then be calculated.
Further, in order to improve the accuracy of the calculated likelihood reference value Q1k2, in another embodiment, when extracting the feature data, further features also need to be derived from the MFCC features of the audio data segment of the target clause deduced under the first user account. That is, the data obtained by training the first feature data through a preset training model is used as the fourth feature data.
That is, Gaussian model training is performed on the MFCC feature data λ1k = {λ1k1, λ1k2, …, λ1kp} extracted from the audio data segment A1k of the kth clause deduced under the first user account: a 256-mixture Gaussian mixture model is trained using the EM algorithm, and the resulting feature data η1k = {η1k1, η1k2, …, η1kp} is taken as the fourth feature data, where η1ki corresponds to λ1ki.
In one embodiment, the likelihood reference value may be calculated as follows: the likelihood value between λ1k and λ2k is calculated and denoted qλ, and the likelihood value between η1k and η2k is calculated and denoted qη. The likelihood reference value Q1k2 between the audio data segment A1k of the kth clause deduced under the first user account and the audio data segment A2k of the target clause deduced under the second user account is then calculated from qλ and qη; for example, Q1k2 may be the average of qλ and qη, i.e.:

    Q1k2 = (qλ + qη) / 2
In another embodiment, the likelihood value between λ1k and η2k is first calculated and denoted q1k2, and the likelihood value between λ2k and η1k is then calculated and denoted q2k1. Finally, the likelihood reference value Q1k2 between the audio data segment A1k of the kth clause deduced under the first user account and the audio data segment A2k of the target clause deduced under the second user account is calculated from q1k2 and q2k1; for example, Q1k2 may be the average of q1k2 and q2k1, i.e.:

    Q1k2 = (q1k2 + q2k1) / 2
it should be noted that, in this embodiment, the larger the likelihood reference value is, the closer the tone color between the audio data segment of the first user account deduction target clause and the audio data segment of the second user account deduction target clause is, whereas the smaller the likelihood reference value is, the larger the tone color difference between the audio data segment of the first user account deduction target clause and the audio data segment of the second user account deduction target clause is.
Step S108: and calculating a similarity reference value of the first user account and the second user account under the target clause according to the note difference value and the likelihood reference value.
After the note difference value and the likelihood reference value between the audio data segment of the target clause deduced under the first user account and the audio data segment of the target clause deduced under the second user account are obtained through the above steps, the similarity reference value between the two can be calculated from the note difference value and the likelihood reference value.
For example, in one embodiment, the similarity reference value corresponding to the note difference value and the likelihood reference value is calculated according to a preset similarity reference value calculation formula, with the note difference value and the likelihood reference value as arguments.

For another example, in one embodiment, the similarity reference value may be a weighted average of the note difference value and the likelihood reference value. It should be noted that a smaller note difference value indicates a smaller intonation difference between the audio data segment of the target clause deduced under the first user account and that deduced under the second user account, while a larger likelihood reference value indicates a higher timbre similarity between the two segments. Therefore, when calculating the weighted average of the note difference value and the likelihood reference value, the weighting coefficient of the note difference value is a negative number and the weighting coefficient of the likelihood reference value is a positive number. In this way, the larger the finally obtained similarity reference value is, the smaller the difference between the audio data segment of the target clause deduced under the first user account and that deduced under the second user account; conversely, the smaller the similarity reference value is, the larger the difference between the two audio data segments.
Denote by T1k2 the similarity reference value between the audio data segment of the kth clause deduced under the first user account and the audio data segment of the kth clause deduced under the second user account; T1k2 can then be calculated by the following similarity reference value calculation formula:
    T1k2 = α·Q1k2 − β·S1k2,  α > 0, β > 0

wherein α and β are preset coefficients.
Step S110: and screening out similar audio data segments corresponding to the first user account under the target clause from the audio data segments of the at least one second user account according to the similarity reference value.
The similarity reference value represents the degree of similarity between the audio data segment of the target clause deduced under the corresponding second user account and the audio data segment of the target clause deduced by the user corresponding to the first user account. Therefore, according to the similarity reference values, one or more audio data segments closest to the audio data segment of the target clause deduced under the first user account can be selected, from the audio data segments corresponding to the target clause under all the second user accounts, as similar audio data segments.
It should be noted that, in this embodiment, the number of similar audio data segments in the target clause corresponding to the first user account may be one or multiple. For example, the number of similar audio data segments may be any preset number.
In a specific embodiment, all the audio data segments of the second user accounts under the target clause may be sorted in descending order of similarity reference value, and the top N audio data segments are then taken as similar audio data segments, where N is a preset number constant.
Further, the similar audio data segments may be not only the top-N audio data segments ranked by similarity reference value, but also all audio data segments whose similarity reference value satisfies a preset condition; for example, all audio data segments whose similarity reference value is greater than 80% may be regarded as similar audio data segments.
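Steps S108 and S110 can be sketched together as follows: for each second user account, the similarity reference value T1k2 = α·Q1k2 − β·S1k2 is computed from the likelihood reference value and the note difference value, and the top-N accounts are kept (α, β and N are assumed preset values in this sketch):

    def screen_similar_segments(scores: dict[str, tuple[float, float]],
                                alpha: float = 1.0, beta: float = 1.0,
                                top_n: int = 3) -> list[str]:
        # scores maps each second user account to (Q1k2, S1k2) for the target clause.
        t = {acc: alpha * q - beta * s for acc, (q, s) in scores.items()}
        ranked = sorted(t, key=t.get, reverse=True)  # largest similarity first
        return ranked[:top_n]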
It should be noted that the above method for searching similar audio data gives an operation procedure for finding similar audio data segments that resemble the audio data segment of the target clause deduced by the target user. Because a found similar audio data segment is very similar to the target user's own audio data segment of the target clause, some users may be unable, or find it difficult, to distinguish the two by listening. Therefore, in this embodiment, the audio data segment corresponding to the target clause in the target song can be replaced with a similar audio data segment to obtain new audio data.
Specifically, in an optional embodiment, after the step of screening out, from the audio data segments of the at least one second user account, the similar audio data segment corresponding to the first user account under the target clause according to the size of the similarity reference value, the method further includes: determining at least one replacement clause corresponding to the target song; and replacing the audio data segment corresponding to the replacement clause in the first deductive audio data corresponding to the target song and the first user account with the similar audio data segment corresponding to the first user account under the replacement clause, so as to generate replacement audio data corresponding to the target song under the first user account.
When generating new replacement audio data, it is first determined which audio data segments corresponding to clauses in the original deduction of the target song are to be replaced by similar audio data segments corresponding to other user accounts, that is, which clauses are the replacement clauses.
In an alternative embodiment, the replacement clauses may be determined through manual selection by the user; for example, the user may choose to replace a clause if he or she feels that the sentence was not sung well enough. In another alternative embodiment, the replacement clauses may be randomly selected by the server, and the number of replacement clauses is greater than or equal to 1 and less than half of the total number of clauses of the original target song.
After the replacement clauses are determined, similar audio data segments that are similar to the audio data segment of each replacement clause deduced under the first user account may be determined through steps S102-S110 described above. If there is more than one similar audio data segment corresponding to a certain replacement clause, the similar audio data segment with the largest similarity reference value may be used as the target similar audio data segment for that replacement clause, or one of the plurality of similar audio data segments may be selected at random as the target similar audio data segment.
Finally, the audio data segments corresponding to all replacement clauses in the deductive audio data of the target song under the first user account are replaced with the corresponding target similar audio data segments, and new replacement audio data corresponding to the target song is generated.
For example, for a target user, the deductive audio data of a target song is A, and the kth clause is a replacement clause; the audio data segment Ak corresponding to the kth clause now needs to be replaced with the corresponding similar audio data segment. If the similar audio data segment corresponding to Ak that is finally determined by the method described above is denoted Ak^(n), where n is a user identifier, then directly replacing Ak in A with Ak^(n) yields the replacement audio data corresponding to A.
It should be noted that in another alternative embodiment, the audio data segment is also required to be energy-normalized before the replacement of the audio data segment.
Specifically, if the number of sampling points of the kth clause is L, then Ak = {ak1, ak2, …, akL} and the similar audio data segment is Ak^(n) = {ak1^(n), ak2^(n), …, akL^(n)}. The energy values |Ak| and |Ak^(n)| of the audio data segment Ak and of the similar audio data segment Ak^(n) are calculated separately, and energy normalization is then performed on the similar audio data segment Ak^(n) to obtain

    A'k = {a'k1, a'k2, …, a'kL},

wherein a'ki is the sampling point obtained after energy normalization of aki^(n), i.e. after scaling aki^(n) so that the energy of A'k matches the energy value |Ak| of Ak. In this way, the audio data segment A'k obtained by normalizing the similar audio data segment Ak^(n) is obtained. Then, Ak in the original deductive audio data A is replaced by A'k, and the replacement audio data A' corresponding to A is finally obtained.
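The energy normalization and clause replacement can be sketched as follows; the RMS-matching scale factor used here is an assumption, since the patent only states that the similar segment is energy-normalized before it replaces Ak:

    import numpy as np

    def replace_clause(original: np.ndarray, similar: np.ndarray,
                       start: int, end: int) -> np.ndarray:
        # original: full deductive audio data A; similar: the similar segment Ak^(n);
        # [start, end) are the sampling-point indices of the kth clause in A.
        a_k = original[start:end]
        seg = similar[:end - start]  # same number of sampling points as Ak
        # Scale the similar segment so its energy matches that of Ak.
        scale = np.sqrt(np.sum(a_k ** 2) / (np.sum(seg ** 2) + 1e-12))
        out = original.copy()
        out[start:end] = seg * scale
        return out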
In this embodiment, the new replacement audio data obtained by replacing some of the audio data segments in a piece of deductive audio data may be used for an "auditory identification" function in music software: the replacement audio data is played to the user corresponding to the original deductive audio data for trial listening, and that user determines which clauses have been replaced with audio data of other users.
Specifically, when the replacement audio data is played, the audio data is played in a clause form, and when the audio data segment corresponding to each clause is played, a user can input a judgment operation through a terminal, that is, whether the currently played audio data segment is deduced by other users is judged.
After the replacement audio data has been played, whether each judgment operation is correct is determined according to all the judgment operations input by the user, and a corresponding evaluation value is given according to a preset evaluation formula; for example, the title of "hearing person" is given to the user account when the evaluation value reaches 100 points.
In another embodiment, the user account corresponding to a similar audio data segment used for replacement in the process of generating the replacement audio data may also be recommended to the current user account; for example, the current user account may choose to perform a chorus deduction of new audio data with the recommended user account.
In addition, in the embodiment of the invention, a similar audio data searching device is also provided, which can accurately search audio data similar to the current deduction work or deduction clause from a large number of deduction works, thereby improving the accuracy of searching similar audio data. Specifically, as shown in fig. 2, the apparatus for searching similar audio data includes an audio data obtaining module 102, a note difference calculating module 104, a likelihood reference value calculating module 106, a similarity reference value calculating module 108, and a similar audio data segment screening module 110, wherein:
the audio data acquisition module 102 is configured to acquire deductive audio data corresponding to the target song in the first user account and the at least one second user account, where the deductive audio data includes audio data segments corresponding to multiple clauses of the target song;
a note difference calculation module 104, configured to extract fundamental frequency data of an audio data segment of each clause of each deductive audio data, obtain a note value sequence corresponding to the extracted fundamental frequency data, determine a first note value sequence of a first user account in a target clause and a second note value sequence of a second user account in the target clause, and calculate a note difference between the first note value sequence and the second note value sequence;
a likelihood reference value calculation module 106, configured to extract first feature data of an audio data segment of a first user account in a target clause and second feature data of an audio data segment of a second user account in the target clause, and calculate a likelihood reference value between the first feature data and the second feature data;
a similarity reference value calculating module 108, configured to calculate, according to the note difference value and the likelihood reference value, a similarity reference value of the first user account and the second user account in the target clause;
and the similar audio data segment screening module 110 is configured to screen out, from the audio data segments of the at least one second user account, a similar audio data segment corresponding to the first user account in the target clause according to the size of the similarity reference value.
Optionally, in an embodiment, as shown in fig. 2, the apparatus further includes an audio data segment replacing module 112, configured to determine at least one replacement clause corresponding to the target song, and to replace the audio data segment corresponding to the replacement clause in the first deductive audio data (corresponding to the target song under the first user account) with the similar audio data segment corresponding to the first user account under the replacement clause, thereby generating replacement audio data corresponding to the target song under the first user account.
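A minimal sketch of the replacement itself, assuming the first user's deductive audio data is a one-dimensional sample array and the clause boundaries are known as sample index ranges (both assumptions of the example):

```python
# Rough sketch of generating replacement audio data: the samples of the
# replacement clause in the first user's performance are swapped for the
# similar segment found above.
import numpy as np

def replace_clause(full_audio, clause_bounds, clause_idx, similar_segment):
    """full_audio: 1-D sample array of the first user's deductive audio data.
    clause_bounds: list of (start, end) sample indices per clause."""
    start, end = clause_bounds[clause_idx]
    return np.concatenate([full_audio[:start], similar_segment, full_audio[end:]])
```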
Optionally, in one embodiment, the note difference value calculating module 104 is further configured to calculate the sum or average of the distance values between each note value contained in the first note value sequence and the corresponding note value in the second note value sequence, as the note difference value between the first note value sequence and the second note value sequence.
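A minimal sketch of this note difference value, assuming the two note value sequences are aligned position by position and using absolute distance (the embodiment only speaks of "distance values", so the metric here is an assumption):

```python
# Note difference value: per-position distance between two note value
# sequences, reported as a sum or an average.
import numpy as np

def note_difference(first_notes, second_notes, use_average=True):
    first = np.asarray(first_notes, dtype=float)
    second = np.asarray(second_notes, dtype=float)
    n = min(len(first), len(second))          # align to the shorter sequence
    distances = np.abs(first[:n] - second[:n])
    return distances.mean() if use_average else distances.sum()

print(note_difference([60, 62, 64, 65], [60, 61, 64, 67]))  # -> 0.75
```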
Optionally, in an embodiment, the note difference calculation module 104 is further configured to extract the fundamental frequency data of the audio data segment of each clause in each deductive audio data according to a preset frame length and a preset frame shift, so as to generate at least one fundamental frequency point corresponding to each clause in each deductive audio data; and to adjust the fundamental frequency value of each of the at least one fundamental frequency point and convert the adjusted fundamental frequency value of each fundamental frequency point into a corresponding note value, thereby obtaining the note value sequence corresponding to the fundamental frequency data of the audio data segments of different users in each clause.
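The sketch below illustrates one way to realize this step: the clause is framed with a preset frame length and frame shift, a fundamental frequency point is estimated per frame, and each fundamental frequency value is mapped to a note value with the MIDI-style formula 69 + 12·log2(f0/440). The use of librosa's pyin estimator and the specific frame parameters are assumptions; any frame-wise pitch tracker fits the description.

```python
# Frame-wise fundamental frequency extraction followed by conversion of each
# fundamental frequency value to a note value.
import numpy as np
import librosa

def clause_note_sequence(clause_audio, sr, frame_length=1024, hop_length=256):
    f0, voiced, _ = librosa.pyin(
        clause_audio, fmin=80.0, fmax=1000.0, sr=sr,
        frame_length=frame_length, hop_length=hop_length)
    f0 = np.nan_to_num(f0)                    # unvoiced frames -> 0 Hz
    notes = np.zeros_like(f0)
    positive = f0 > 0
    notes[positive] = 69.0 + 12.0 * np.log2(f0[positive] / 440.0)
    return notes                              # one note value per frame
```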
Optionally, in an embodiment, the note difference value calculating module 104 is further configured to zero the fundamental frequency values of anomalous (singular) fundamental frequency points among the at least one fundamental frequency point, and to perform median filtering on the fundamental frequency points.
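A sketch of these two adjustments under stated assumptions: a fundamental frequency point is treated as "singular" when it jumps far from the local median (the threshold and kernel size below are illustrative, not taken from the embodiment), its value is zeroed, and the sequence is then median filtered.

```python
# Zero out anomalous ("singular") fundamental frequency points, then
# median-filter the sequence; outlier rule and kernel size are assumptions.
import numpy as np
from scipy.signal import medfilt

def adjust_f0(f0, kernel_size=5, outlier_semitone_jump=7.0):
    f0 = np.asarray(f0, dtype=float).copy()
    local_median = medfilt(f0, kernel_size=kernel_size)
    voiced = (f0 > 0) & (local_median > 0)
    jump = np.zeros_like(f0)
    jump[voiced] = np.abs(12.0 * np.log2(f0[voiced] / local_median[voiced]))
    f0[jump > outlier_semitone_jump] = 0.0       # zeroing of singular points
    return medfilt(f0, kernel_size=kernel_size)  # median filtering
```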
Optionally, in an embodiment, the likelihood reference value calculating module 106 is further configured to extract, as the first feature data, a first MFCC feature corresponding to the audio data segment of the first user account under the target clause by using a preset feature extraction algorithm.
Optionally, in an embodiment, the likelihood reference value calculating module 106 is further configured to extract, by using a preset feature extraction algorithm, a second MFCC feature corresponding to the audio data segment of the second user account under the target clause as third feature data, and to train the third feature data through a preset training model, using the resulting data as the second feature data.
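For illustration, the sketch below extracts MFCC frames as the feature data and trains a Gaussian mixture model as a stand-in for the "preset training model"; a GMM over MFCC frames is a common choice for modelling vocal timbre, but the embodiment does not fix the model type, so this choice is an assumption.

```python
# MFCC frames as feature data plus a GMM standing in for the preset
# training model used on the third feature data.
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(clause_audio, sr, n_mfcc=13):
    # frames as rows, MFCC coefficients as columns
    return librosa.feature.mfcc(y=clause_audio, sr=sr, n_mfcc=n_mfcc).T

def train_timbre_model(mfcc_frames, n_components=8):
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(mfcc_frames)
    return gmm
```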
Optionally, in an embodiment, the likelihood reference value calculating module 106 is further configured to train the first feature data through a preset training model and use the resulting data as fourth feature data; calculate a first likelihood reference value between the third feature data and the fourth feature data; and calculate an average value of the first likelihood reference value and the likelihood reference value between the first feature data and the second feature data, taking the average value as the updated likelihood reference value.
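In other words, the likelihood reference value is made symmetric by averaging the two cross-likelihoods. A sketch, reusing the GMM stand-in above (GaussianMixture.score, the mean log-likelihood per frame, plays the role of the preset likelihood function; this choice is an assumption):

```python
# Symmetric likelihood reference value: average the likelihood of each
# account's MFCC frames under the other account's model.
def symmetric_likelihood_reference(first_mfcc, second_mfcc,
                                   first_model, second_model):
    first_under_second = second_model.score(first_mfcc)   # likelihood reference value
    second_under_first = first_model.score(second_mfcc)   # first likelihood reference value
    return 0.5 * (first_under_second + second_under_first)
```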
Optionally, in an embodiment, the likelihood reference value calculating module 106 is further configured to calculate a likelihood reference value between the first feature data and the second feature data according to a preset likelihood function.
Optionally, in an embodiment, the similarity reference value calculating module 108 is further configured to obtain a preset weighting coefficient, and weight the note difference value and the likelihood reference value according to the preset weighting coefficient to obtain the similarity reference value, where the weighting coefficient of the note difference value is less than 0 (negative), since a larger note difference indicates lower tonal similarity.
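A sketch of this weighted combination; the concrete coefficient values below are placeholders, and only the sign constraint on the note difference weight comes from the embodiment:

```python
# Combine the two measures into the similarity reference value.
def similarity_reference(note_difference_value, likelihood_reference_value,
                         note_weight=-0.5, likelihood_weight=1.0):
    assert note_weight < 0, "the note difference weighting coefficient is negative"
    return (note_weight * note_difference_value
            + likelihood_weight * likelihood_reference_value)
```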
The embodiment of the invention has the following beneficial effects:
After the above searching method and device for similar audio data are adopted, for a singing work recorded by a target user through a terminal, audio data segments that are similar in both intonation and timbre to the audio data segment of each clause of the target user's work can be found among the singing works uploaded by other users. That is, when searching for similar audio data, the method considers not only whether the intonation of the audio data is consistent or close, but also whether the corresponding singers have similar timbre. Compared with prior-art schemes that only consider whether the intonation is consistent, the similar audio data found in this way is closer to the audio data performed by the target user, which improves the similarity between the found audio data and the current audio data and thus the accuracy of the similar audio data search.
In one embodiment, as shown in fig. 3, the above-described similar audio data searching method runs on a terminal implemented as a von Neumann computer system. The computer system may be a terminal device such as a smart phone, a tablet computer, a palmtop computer, a notebook computer, or a personal computer. Specifically, it may include an external input interface 1001, a processor 1002, a memory 1003, and an output interface 1004 connected through a system bus. The external input interface 1001 may optionally include at least a network interface 10012. The memory 1003 may include an external memory 10032 (e.g., a hard disk, an optical disk, or a floppy disk) and an internal memory 10034. The output interface 1004 may include at least a display 10042.
In this embodiment, the method is executed as a computer program. The program files are stored in the external memory 10032 of the von Neumann computer system, are loaded into the internal memory 10034 at run time, are compiled into machine code, and are then passed to the processor 1002 for execution, so that the audio data acquisition module 102, the note difference value calculation module 104, the likelihood reference value calculation module 106, the similarity reference value calculation module 108, the similar audio data segment screening module 110, and the audio data segment replacement module 112 are formed logically within the computer system. During execution of the similar audio data searching method, the input parameters are received through the external input interface 1001, transferred to the memory 1003 for buffering, and then passed to the processor 1002 for processing; the resulting data is either buffered in the memory 1003 for subsequent processing or passed to the output interface 1004 for output.
Specifically, the processor 1002 is configured to perform the following operations:
respectively acquiring deduction audio data corresponding to the target song under the first user account and the at least one second user account, wherein the deduction audio data comprises audio data fragments corresponding to a plurality of clauses of the target song;
extracting fundamental frequency data of an audio data segment of each clause of each deductive audio data, acquiring a note value sequence corresponding to the extracted fundamental frequency data, determining a first note value sequence of a first user account under the target clause and a second note value sequence of a second user account under the target clause, and calculating a note difference value between the first note value sequence and the second note value sequence;
extracting first characteristic data of an audio data fragment of a first user account under a target clause and second characteristic data of an audio data fragment of a second user account under the target clause, and calculating a likelihood reference value between the first characteristic data and the second characteristic data;
calculating a similarity reference value of the first user account and the second user account under the target clause according to the note difference value and the likelihood reference value;
and screening out similar audio data segments corresponding to the first user account under the target clause from the audio data segments of the at least one second user account according to the size of the similarity reference value.
Optionally, in an embodiment, the processor 1002 is further configured to determine at least one alternative clause corresponding to the target song; and replacing the audio data segment corresponding to the replacing clause in the first deduction audio data corresponding to the target song and the first user with the similar audio data segment corresponding to the first user account and corresponding to the replacing clause, and generating replacing audio data corresponding to the target song and corresponding to the first user account.
Optionally, in one embodiment, the processor 1002 is further configured to calculate the sum or average of the distance values between each note value contained in the first note value sequence and the corresponding note value in the second note value sequence, as the note difference value between the first note value sequence and the second note value sequence.
Optionally, in an embodiment, the processor 1002 is further configured to extract, according to a preset frame length and a preset frame shift, the fundamental frequency data of the audio data segment of each clause in each deductive audio data, so as to generate at least one fundamental frequency point corresponding to each clause in each deductive audio data; and to adjust the fundamental frequency value of each of the at least one fundamental frequency point and convert the adjusted fundamental frequency value of each fundamental frequency point into a corresponding note value, thereby obtaining the note value sequence corresponding to the fundamental frequency data of the audio data segments of different users in each clause.
Optionally, in an embodiment, the processor 1002 is further configured to zero the fundamental frequency values of anomalous (singular) fundamental frequency points among the at least one fundamental frequency point, and to perform median filtering on the fundamental frequency points.
Optionally, in an embodiment, the processor 1002 is further configured to extract, by using a preset feature extraction algorithm, a first MFCC feature corresponding to the audio data segment of the first user account under the target clause, as the first feature data.
Optionally, in an embodiment, the processor 1002 is further configured to extract, by using the preset feature extraction algorithm, a second MFCC feature corresponding to the audio data segment of the second user account under the target clause as third feature data, and to train the third feature data through a preset training model, using the resulting data as the second feature data.
Optionally, in an embodiment, the processor 1002 is further configured to train the first feature data through a preset training model and use the resulting data as fourth feature data; calculate a first likelihood reference value between the third feature data and the fourth feature data; and calculate an average value of the first likelihood reference value and the likelihood reference value between the first feature data and the second feature data, taking the average value as the updated likelihood reference value.
Optionally, in an embodiment, the processor 1002 is further configured to calculate a likelihood reference value between the first feature data and the second feature data according to a preset likelihood function.
Optionally, in an embodiment, the processor 1002 is further configured to obtain a preset weighting coefficient, and weight the note difference value and the likelihood reference value according to the preset weighting coefficient to obtain the similarity reference value, where the weighting coefficient of the note difference value is less than 0.
The above disclosure describes only preferred embodiments of the present invention and is not intended to limit the scope of the invention, which is defined by the appended claims.

Claims (20)

1. A method for searching similar audio data, comprising:
respectively acquiring deductive audio data corresponding to a target song under a first user account and at least one second user account, wherein the deductive audio data comprises audio data fragments corresponding to a plurality of clauses of the target song; the deductive audio data is voice type audio data;
extracting fundamental frequency data of an audio data segment of each clause of each deductive audio data, acquiring a note value sequence corresponding to the extracted fundamental frequency data, determining a first note value sequence of the first user account under a target clause and a second note value sequence of the second user account under the target clause, and calculating note difference values between the first note value sequence and the second note value sequence; the note difference values characterize tonal similarity between audio data segments;
extracting first characteristic data of an audio data segment of the first user account under the target clause and second characteristic data of an audio data segment of the second user account under the target clause, and calculating a likelihood reference value between the first characteristic data and the second characteristic data; the likelihood reference values represent human voice tone similarity among the audio data fragments;
calculating a similarity reference value of the first user account and the second user account under the target clause according to the note difference value and the likelihood reference value;
screening out similar audio data segments corresponding to the first user account under the target clause from the audio data segments of the at least one second user account according to the similarity reference value;
performing an energy normalization operation on the similar audio data segments according to the audio data segments of the first user account under the target clause to obtain the similar audio data segments after the energy normalization operation; the similar audio data segments subjected to the energy normalization operation have energy similar to that of the audio data segments of the first user account under the target clause;
and replacing and optimizing the audio data segment corresponding to the target clause in the deductive audio data under the first user account into the similar audio data segment subjected to the energy normalization operation to obtain replaced audio data of the deductive audio data under the first user account.
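As an aside on the energy normalization operation recited here, one straightforward realization is to scale the similar audio data segment so that its RMS energy matches that of the first user account's segment for the same clause; the claim does not fix the formula, so the sketch below is only illustrative.

```python
# Illustrative energy normalization: scale the similar segment so its RMS
# energy matches the reference segment of the first user account.
import numpy as np

def energy_normalize(similar_segment, reference_segment, eps=1e-12):
    similar = np.asarray(similar_segment, dtype=float)
    reference = np.asarray(reference_segment, dtype=float)
    rms_similar = np.sqrt(np.mean(similar ** 2) + eps)
    rms_reference = np.sqrt(np.mean(reference ** 2) + eps)
    return similar * (rms_reference / rms_similar)
```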
2. The method of claim 1, wherein after the similar audio data segment corresponding to the first user account under the target clause is screened out from the audio data segments of the at least one second user account according to the size of the similarity reference value, the method further comprises:
determining at least one alternative clause corresponding to the target song;
and replacing the audio data segment corresponding to the replacing clause in the first deduction audio data corresponding to the target song and the first user with a similar audio data segment corresponding to the first user account and corresponding to the replacing clause, and generating replacing audio data corresponding to the target song and corresponding to the first user account.
3. A method according to claim 1, wherein said calculating note difference values between said first sequence of note values and said second sequence of note values further comprises:
calculating the sum or average of the distance values between each note value contained in the first note value sequence and the corresponding note value in the second note value sequence, as the note difference value between the first note value sequence and the second note value sequence.
4. The method of claim 1, wherein the extracting of the fundamental frequency data of the audio data segment of each clause of each deductive audio data and the obtaining of the note value sequence corresponding to the extracted fundamental frequency data comprise:
respectively extracting fundamental frequency data of the audio data segment of each clause in each deduction audio data according to a preset frame length and a preset frame shift so as to generate at least one fundamental frequency point corresponding to each clause in each deduction audio data;
and adjusting the base frequency value of each base frequency point in the at least one base frequency point, and converting the adjusted base frequency value of each base frequency point into a note value corresponding to each base frequency point, thereby obtaining a note value sequence corresponding to the base frequency data of the audio data segments of different users in each clause.
5. The method according to claim 4, wherein the adjusting of the fundamental frequency value of each of the at least one fundamental frequency point comprises:
carrying out zero setting processing on the fundamental frequency value of a singular fundamental frequency point in the at least one fundamental frequency point;
and performing median filtering processing on the fundamental frequency points.
6. The method according to claim 1, wherein the extracting of the first feature data of the audio data segment of the first user account under the target clause is specifically:
and extracting a first MFCC characteristic corresponding to the audio data segment of the first user account under the target clause as first characteristic data through a preset characteristic extraction algorithm.
7. The method according to claim 6, wherein the extracting of the second feature data of the audio data segment of the second user account under the target clause is specifically:
and extracting a second MFCC characteristic corresponding to the audio data fragment from the audio data fragment of the second user account under the target clause through the preset characteristic extraction algorithm to serve as third characteristic data, and training the third characteristic data through a preset training model to obtain data serving as second characteristic data.
8. The method of claim 7, further comprising:
training the first characteristic data through a preset training model to obtain data serving as fourth characteristic data;
calculating a first likelihood reference value between the third feature data and the fourth feature data;
and calculating an average value of the first likelihood reference value and the likelihood reference value between the first feature data and the second feature data, and taking the average value as the updated likelihood reference value.
9. The method according to any one of claims 1 to 8, wherein the calculating of the likelihood reference value between the first feature data and the second feature data is specifically:
and calculating a likelihood reference value between the first characteristic data and the second characteristic data according to a preset likelihood function.
10. The method according to any one of claims 1 to 8, wherein the calculating the similarity reference values of the first user account and the second user account under the target clause according to the note difference values and the likelihood reference values is specifically:
and acquiring a preset weighting coefficient, and weighting the note difference value and the likelihood reference value according to the preset weighting coefficient to obtain the similarity reference value, wherein the weighting coefficient of the note difference value is smaller than 0.
11. A device for searching similar audio data, comprising:
the audio data acquisition module is used for respectively acquiring deductive audio data corresponding to the target song under a first user account and at least one second user account, wherein the deductive audio data comprises audio data segments corresponding to a plurality of clauses of the target song; the deductive audio data is voice type audio data;
a note difference value calculating module, configured to extract fundamental frequency data of an audio data segment of each clause of each deductive audio data, obtain a note value sequence corresponding to the extracted fundamental frequency data, determine a first note value sequence of the first user account in a target clause and a second note value sequence of the second user account in the target clause, and calculate a note difference value between the first note value sequence and the second note value sequence; the note difference value characterizes tonal similarity between audio data segments;
a likelihood reference value calculation module, configured to extract first feature data of an audio data segment of the first user account in the target clause and second feature data of an audio data segment of the second user account in the target clause, and calculate a likelihood reference value between the first feature data and the second feature data; the likelihood reference values represent human voice tone similarity among the audio data fragments;
a similarity reference value calculating module, configured to calculate, according to the note difference value and the likelihood reference value, a similarity reference value of the first user account and the second user account in the target clause;
a similar audio data segment screening module, configured to screen, from the audio data segments of the at least one second user account, a similar audio data segment corresponding to the first user account in the target clause according to the size of the similarity reference value;
the similar audio data segment screening module is further configured to perform an energy normalization operation on the similar audio data segments according to the audio data segments of the first user account in the target clause, so as to obtain the similar audio data segments after the energy normalization operation is performed; the similar audio data segments subjected to the energy normalization operation have energy similar to that of the audio data segments of the first user account under the target clause;
and the similar audio data fragment screening module is further configured to replace and optimize the audio data fragment corresponding to the target clause in the deductive audio data in the first user account into the similar audio data fragment subjected to the energy normalization operation, so as to obtain replaced audio data of the deductive audio data in the first user account.
12. The apparatus of claim 11, further comprising an audio data segment replacement module configured to determine at least one replacement clause corresponding to the target song; and replacing the audio data segment corresponding to the replacing clause in the first deduction audio data corresponding to the target song and the first user with a similar audio data segment corresponding to the first user account and corresponding to the replacing clause, and generating replacing audio data corresponding to the target song and corresponding to the first user account.
13. The apparatus as recited in claim 11, wherein the note difference calculation module is further configured to calculate the sum or average of the distance values between each note value contained in the first note value sequence and the corresponding note value in the second note value sequence, as the note difference value between the first note value sequence and the second note value sequence.
14. The apparatus of claim 11, wherein the note difference calculation module is further configured to extract the fundamental frequency data of the audio data segment of each clause in each of the deductive audio data according to a preset frame length and a preset frame shift, respectively, to generate at least one fundamental frequency point corresponding to each clause in each of the deductive audio data; and to adjust the fundamental frequency value of each of the at least one fundamental frequency point, and convert the adjusted fundamental frequency value of each fundamental frequency point into a corresponding note value, thereby obtaining the note value sequence corresponding to the fundamental frequency data of the audio data segments of different users in each clause.
15. The apparatus according to claim 14, wherein the note difference calculation module is further configured to zero the fundamental frequency values of the singular fundamental frequency points in the at least one fundamental frequency point; and performing median filtering processing on the fundamental frequency points.
16. The apparatus of claim 11, wherein the likelihood reference value calculating module is further configured to extract, as the first feature data, a first MFCC feature corresponding to the audio data segment of the first user account under the target clause from the audio data segment through a preset feature extraction algorithm.
17. The apparatus according to claim 16, wherein the likelihood reference value calculating module is further configured to extract, through the preset feature extraction algorithm, a second MFCC feature corresponding to the audio data segment of the second user account under the target clause as third feature data, and to train the third feature data through a preset training model, using the resulting data as the second feature data.
18. The apparatus according to claim 17, wherein the likelihood reference value calculation module is further configured to train the first feature data through a preset training model and use the resulting data as fourth feature data; calculate a first likelihood reference value between the third feature data and the fourth feature data; and calculate an average value of the first likelihood reference value and the likelihood reference value between the first feature data and the second feature data, taking the average value as the updated likelihood reference value.
19. The apparatus according to any one of claims 11 to 18, wherein the likelihood reference value calculating module is further configured to calculate the likelihood reference value between the first feature data and the second feature data according to a preset likelihood function.
20. The apparatus according to any one of claims 11 to 18, wherein the similarity reference value calculating module is further configured to obtain a preset weighting coefficient, and weight the note difference value and the likelihood reference value according to the preset weighting coefficient to obtain the similarity reference value, where the weighting coefficient of the note difference value is smaller than 0.
CN201710129982.8A 2017-03-07 2017-03-07 Similar audio data searching method and device Active CN106970950B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710129982.8A CN106970950B (en) 2017-03-07 2017-03-07 Similar audio data searching method and device

Publications (2)

Publication Number Publication Date
CN106970950A CN106970950A (en) 2017-07-21
CN106970950B true CN106970950B (en) 2021-08-24

Family

ID=59329114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710129982.8A Active CN106970950B (en) 2017-03-07 2017-03-07 Similar audio data searching method and device

Country Status (1)

Country Link
CN (1) CN106970950B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107368609B (en) * 2017-08-10 2018-09-04 广州酷狗计算机科技有限公司 Obtain the method, apparatus and computer readable storage medium of multimedia file
CN111274415A (en) * 2020-01-14 2020-06-12 广州酷狗计算机科技有限公司 Method, apparatus and computer storage medium for determining alternate video material
CN112487940B (en) * 2020-11-26 2023-02-28 腾讯音乐娱乐科技(深圳)有限公司 Video classification method and device

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101627423A (en) * 2006-10-20 2010-01-13 France Telecom Synthesis of lost blocks of a digital audio signal, with pitch period correction
CN103177722A (en) * 2013-03-08 2013-06-26 北京理工大学 Tone-similarity-based song retrieval method
CN103531196A (en) * 2013-10-15 2014-01-22 中国科学院自动化研究所 Sound selection method for waveform concatenation speech synthesis
CN104112444A (en) * 2014-07-28 2014-10-22 中国科学院自动化研究所 Text message based waveform concatenation speech synthesis method
CN104347068A (en) * 2013-08-08 2015-02-11 索尼公司 Audio signal processing device, audio signal processing method and monitoring system
CN104778957A (en) * 2015-03-20 2015-07-15 广东欧珀移动通信有限公司 Song audio processing method and device
CN104778958A (en) * 2015-03-20 2015-07-15 广东欧珀移动通信有限公司 Method and device for splicing noise-containing songs
CN105788589A (en) * 2016-05-04 2016-07-20 腾讯科技(深圳)有限公司 Audio data processing method and device
CN105931634A (en) * 2016-06-15 2016-09-07 腾讯科技(深圳)有限公司 Audio screening method and device
CN106057208A (en) * 2016-06-14 2016-10-26 科大讯飞股份有限公司 Audio correction method and device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050222515A1 (en) * 2004-02-23 2005-10-06 Biosignetics Corporation Cardiovascular sound signature: method, process and format
CN101226558B (en) * 2008-01-29 2011-08-31 福州大学 Method for searching audio data based on MFCCM
CN102053998A (en) * 2009-11-04 2011-05-11 周明全 Method and system device for retrieving songs based on voice modes
US9122753B2 (en) * 2011-04-11 2015-09-01 Samsung Electronics Co., Ltd. Method and apparatus for retrieving a song by hummed query
CN103870466A (en) * 2012-12-10 2014-06-18 哈尔滨网腾科技开发有限公司 Automatic extracting method for audio examples
CN103854646B (en) * 2014-03-27 2018-01-30 成都康赛信息技术有限公司 A kind of method realized DAB and classified automatically
CN105022744A (en) * 2014-04-24 2015-11-04 上海京知信息科技有限公司 Dynamic programming based humming melody extracting and matching search method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
VIDEO ENCODING AND SPLICING FOR TUNE-IN TIME REDUCTION IN IP DATACASTING (IPDC) OVER DVB-H; Mehdi Rezaei et al.; 2006 IEEE International Conference on Multimedia and Expo; 2006-12-26; 601-604 *
Research and Implementation of Audio Watermarking Based on the Transform Domain; Dong Bin; China Master's Theses Full-text Database, Information Science and Technology; 2011-09-15 (No. 09); I138-72 *
Research on the Application of Digital Audio Watermarking in Copyright Protection and Tamper Detection; Chang Lejie; China Master's Theses Full-text Database, Information Science and Technology; 2017-02-15 (No. 02); I138-81 *

Also Published As

Publication number Publication date
CN106970950A (en) 2017-07-21

Similar Documents

Publication Publication Date Title
EP2659482B1 (en) Ranking representative segments in media data
Rocamora et al. Comparing audio descriptors for singing voice detection in music audio files
CN110880329B (en) Audio identification method and equipment and storage medium
Tsunoo et al. Beyond timbral statistics: Improving music classification using percussive patterns and bass lines
CN106991163A (en) A kind of song recommendations method based on singer's sound speciality
Hu et al. Separation of singing voice using nonnegative matrix partial co-factorization for singer identification
CN106997769B (en) Trill recognition method and device
CN106970950B (en) Similar audio data searching method and device
Yu et al. Sparse cepstral codes and power scale for instrument identification
Elowsson et al. Modeling the perception of tempo
Tsunoo et al. Music mood classification by rhythm and bass-line unit pattern analysis
Yang Computational modelling and analysis of vibrato and portamento in expressive music performance
Loni et al. Robust singer identification of Indian playback singers
CN107025902B (en) Data processing method and device
Dittmar et al. Novel mid-level audio features for music similarity
Zhang et al. A novel singer identification method using GMM-UBM
CN112270929B (en) Song identification method and device
Song et al. Implementation of a practical query-by-singing/humming (QbSH) system and its commercial applications
WO2019053544A1 (en) Identification of audio components in an audio mix
CN106548784B (en) Voice data evaluation method and system
Sridhar et al. Music information retrieval of carnatic songs based on carnatic music singer identification
Tang et al. Melody Extraction from Polyphonic Audio of Western Opera: A Method based on Detection of the Singer's Formant.
Shirali-Shahreza et al. Fast and scalable system for automatic artist identification
CN113744721B (en) Model training method, audio processing method, device and readable storage medium
Alvarez et al. Singer identification using convolutional acoustic motif embeddings

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant