CN111782864A - Singing audio classification method, computer program product, server and storage medium - Google Patents


Info

Publication number
CN111782864A
Authority
CN
China
Prior art keywords
sequence
base frequency
matching
subsequence
audio
Prior art date
Legal status
Granted
Application number
CN202010614700.5A
Other languages
Chinese (zh)
Other versions
CN111782864B (en)
Inventor
周宇
林森
Current Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202010614700.5A
Publication of CN111782864A
Application granted
Publication of CN111782864B
Current legal status: Active

Classifications

    • G06F16/65: Information retrieval of audio data; clustering; classification
    • G06F16/685: Information retrieval of audio data using metadata automatically derived from the content, e.g. an automatically derived transcript of the audio data such as lyrics
    • G06N3/084: Computing arrangements based on neural-network models; learning methods using backpropagation, e.g. gradient descent
    • G10H1/0008: Details of electrophonic musical instruments; associated control or indicating means
    • G10H2210/036: Musical analysis of a raw acoustic or encoded audio signal for identifying musical genre, usually for selection, filtering or classification
    • G10H2240/075: Musical metadata derived from musical analysis or for use in electrophonic musical instruments
    • G10H2240/081: Genre classification, i.e. descriptive metadata for classification or selection of musical pieces according to style
    • G10H2240/311: MIDI transmission


Abstract

The embodiment of the application discloses a singing audio classification method, a computer program product, a server and a storage medium. The singing audio classification method comprises the following steps: acquiring a voice base frequency subsequence corresponding to the ith lyric in the singing audio from the voice base frequency sequence of the singing audio; acquiring a reference base frequency subsequence corresponding to the ith lyric in the singing audio from the reference base frequency sequence of the singing audio; determining an optimal matching mapping relation between the voice base frequency subsequence and the reference base frequency subsequence based on the voice base frequency subsequence and the reference base frequency subsequence corresponding to the ith lyric in the singing audio, wherein the optimal matching mapping relation comprises a voice base frequency matching subsequence and a reference base frequency matching subsequence; and determining the melody matching degree of the ith lyric based on the voice base frequency matching subsequence and the reference base frequency matching subsequence, and obtaining the classification result of the singing audio based on the melody matching degree of each lyric. By the method and the device, the accuracy of the singing audio classification result can be improved, and the willingness of a user to publish audio works is improved.

Description

Singing audio classification method, computer program product, server and storage medium
Technical Field
The present application relates to the field of multimedia technologies, and in particular, to a singing audio classification method, a computer program product, a server, and a storage medium.
Background
At present, singing audio is classified by comparing the fundamental frequency sequence of the human voice on an absolute time axis with the MIDI sequence of the standard human voice main melody file of the singing audio. The classification accuracy therefore depends mainly on the accuracy of the MIDI sequence of the standard human voice main melody file, and the two sequences often cannot be aligned accurately because of human error introduced when the MIDI sequence is produced and delay generated at the user's recording device.
Disclosure of Invention
The embodiment of the application provides a singing audio classification method and apparatus, a computer program product, a server and a storage medium, so as to improve the accuracy of the singing audio classification result and increase the willingness of users to publish audio works.
In a first aspect, a method for singing audio classification is provided in an embodiment of the present application, including:
acquiring a voice base frequency subsequence corresponding to the ith lyric in the singing audio from the voice base frequency sequence of the singing audio, wherein the voice base frequency sequence comprises a plurality of voice base frequency subsequences;
acquiring a reference base frequency subsequence corresponding to the ith lyric in the singing audio from the reference base frequency sequence of the singing audio, wherein the reference base frequency sequence comprises a plurality of reference base frequency subsequences, and each reference base frequency subsequence in the plurality of reference base frequency subsequences corresponds to each lyric in the singing audio one by one;
determining an optimal matching mapping relation between the voice base frequency subsequence and the reference base frequency subsequence based on the voice base frequency subsequence and the reference base frequency subsequence corresponding to the ith lyric in the singing audio, wherein the optimal matching mapping relation comprises a voice base frequency matching subsequence and a reference base frequency matching subsequence;
and determining the melody matching degree of the ith lyric based on the human voice base frequency matching subsequence and the reference base frequency matching subsequence, and obtaining the classification result of the singing audio based on the melody matching degree of each lyric.
Optionally, before the obtaining of the voice base frequency subsequence corresponding to the ith lyric in the singing audio from the voice base frequency sequence of the singing audio, the method includes:
framing the voice audio signal of the singing audio to obtain at least one audio frame, and performing wavelet decomposition on each audio frame in the at least one audio frame to obtain a plurality of wavelet decomposition signals corresponding to each audio frame, wherein each wavelet decomposition signal comprises wavelet high-frequency decomposition signals and wavelet low-frequency decomposition signals of a plurality of audio sampling points;
determining the target wavelet decomposition times according to the maximum value in the amplitude values of the wavelet low-frequency decomposition signals corresponding to every two adjacent wavelet decompositions of the at least one audio frame;
and calculating to obtain the human voice base frequency sequence of the singing audio according to the period of the wavelet high-frequency decomposition signal under the target wavelet decomposition times.
Optionally, the obtaining a voice base frequency subsequence corresponding to the ith lyric in the singing audio from the voice base frequency sequence of the singing audio includes:
determining a plurality of lyric pause moments of the human voice base frequency sequence according to a time interval between starting moments corresponding to each two adjacent sequence elements in the human voice base frequency sequence and a first preset time threshold;
determining the starting time and the ending time of the nth lyric according to the plurality of lyric pause moments, wherein n is greater than or equal to 2 and is a positive integer;
determining the starting time of the 1 st lyric in the singing audio according to the time interval between the starting times corresponding to the first two non-zero sequence elements in the voice fundamental frequency sequence and a second preset time threshold;
and extracting a plurality of sequence elements between the starting time and the ending time of the ith lyric from the voice base frequency sequence to obtain a voice base frequency subsequence corresponding to the ith lyric in the singing audio.
Optionally, before the obtaining of the reference fundamental frequency subsequence corresponding to the ith lyric in the singing audio from the reference fundamental frequency sequence of the singing audio, the method includes:
calculating to obtain a musical interval characteristic sequence and a track information entropy sequence of the singing audio according to the audio file of the singing audio;
inputting the characteristic sequence of the musical track interval and the characteristic sequence of the audio track information entropy of the singing audio into a main melody audio track classification model to obtain classification results of a plurality of audio tracks corresponding to the singing audio;
and determining a main melody track of the singing audio according to the classification results of the plurality of tracks, extracting data of the main melody track from an audio file of the singing audio to obtain a reference base frequency sequence of the singing audio, wherein the main melody track classification model is obtained by training a sample song set and a track label sequence corresponding to each song in the sample song set.
Optionally, the determining an optimal matching mapping relationship between the voice fundamental frequency subsequence and the reference fundamental frequency subsequence based on the voice fundamental frequency subsequence and the reference fundamental frequency subsequence corresponding to the ith lyric in the singing audio includes:
calculating to obtain a matching matrix A according to the human voice base frequency sub-sequence, the reference base frequency sub-sequence and a preset matching value, wherein the preset matching value is used for representing a melody matching standard between the human voice base frequency sub-sequence and the reference base frequency sub-sequence, the matching matrix A comprises (j +1) × (k +1) matrix elements, j is the number of the sequence elements of the human voice base frequency sub-sequence, and k is the number of the sequence elements of the reference base frequency sub-sequence;
determining a matrix element A (j +1, k +1) in the matching matrix A as a first target element, and determining a path element of the first target element from a plurality of selectable path elements corresponding to the first target element in the matching matrix A;
determining a path element of the first target element as a second target element, and determining a path element of the second target element from a plurality of selectable path elements corresponding to the second target element in the matching matrix A until a path element of an mth target element is a matrix element A (1,1) in the matching matrix A;
determining an optimal matching path according to the first target element and at least one path element;
and determining the human voice base frequency matching subsequence and the reference base frequency matching subsequence according to the position relationship between the path direction between every two adjacent matrix elements in the optimal matching path and the sequence direction of the human voice base frequency subsequence in the matching matrix A and the sequence direction of the reference base frequency subsequence in the matching matrix A.
Optionally, the human voice base frequency matching sub-sequences correspond to the reference base frequency matching sub-sequences one to one;
the determining the melody matching degree of the ith lyric based on the human voice base frequency matching subsequence and the reference base frequency matching subsequence comprises:
and calculating the melody matching degree of the ith lyric according to the number of sequence elements with preset matching values of the difference value between the sequence elements in the human voice base frequency matching subsequence and the corresponding sequence elements of the sequence elements in the reference base frequency subsequence and the number of the sequence elements in the human voice base frequency matching subsequence.
Optionally, the preset matching value is any one of 12, 7, 5, 0, -12, -7 and -5.
In a second aspect, a singing audio classification device is provided for an embodiment of the present application, including:
the first obtaining unit is used for obtaining a human voice base frequency subsequence corresponding to the ith lyric in the singing audio from a human voice base frequency sequence of the singing audio, wherein the human voice base frequency sequence comprises a plurality of human voice base frequency subsequences;
a second obtaining unit, configured to obtain, from a reference baseband sequence of the singing audio, a reference baseband subsequence corresponding to an ith lyric in the singing audio, where the reference baseband sequence includes multiple reference baseband subsequences, and each reference baseband subsequence in the multiple reference baseband subsequences corresponds to each lyric in the singing audio one to one;
a mapping relation determining unit, configured to determine an optimal matching mapping relation between the voice base frequency subsequence and the reference base frequency subsequence based on the voice base frequency subsequence and the reference base frequency subsequence corresponding to the ith lyric in the singing audio, where the optimal matching mapping relation includes a voice base frequency matching subsequence and a reference base frequency matching subsequence;
and the classification result determining unit is used for determining the melody matching degree of the ith lyric based on the voice fundamental frequency matching subsequence and the reference fundamental frequency matching subsequence, and obtaining the classification result of the singing audio based on the melody matching degree of each lyric.
Optionally, the apparatus further comprises:
the wavelet decomposition unit is used for framing the vocal audio signal of the singing audio to obtain at least one audio frame, performing wavelet decomposition on each audio frame in the at least one audio frame to obtain a plurality of wavelet decomposition signals corresponding to each audio frame, wherein each wavelet decomposition signal comprises a wavelet high-frequency decomposition signal and a wavelet low-frequency decomposition signal of a plurality of audio sampling points;
a decomposition frequency determining unit for determining a target wavelet decomposition frequency according to a maximum value in the amplitudes of the wavelet low-frequency decomposition signals corresponding to every two adjacent wavelet decompositions of the at least one audio frame;
and the computing unit is used for computing the vocal base frequency sequence of the singing audio according to the period of the wavelet high-frequency decomposition signal under the target wavelet decomposition times.
Optionally, the first obtaining unit is specifically configured to:
determining a plurality of lyric pause moments of the human voice base frequency sequence according to a time interval between starting moments corresponding to each two adjacent sequence elements in the human voice base frequency sequence and a first preset time threshold;
determining the starting time and the ending time of the nth lyric according to the plurality of lyric pause moments, wherein n is greater than or equal to 2 and is a positive integer;
determining the starting time of the 1 st lyric in the singing audio according to the time interval between the starting times corresponding to the first two non-zero sequence elements in the voice fundamental frequency sequence and a second preset time threshold;
and extracting a plurality of sequence elements between the starting time and the ending time of the ith lyric from the voice base frequency sequence to obtain a voice base frequency subsequence corresponding to the ith lyric in the singing audio.
Optionally, the apparatus further comprises: and a third acquisition unit.
The third obtaining unit is used for calculating a musical interval characteristic sequence and a track information entropy sequence of the singing audio according to the audio file of the singing audio;
inputting the characteristic sequence of the musical track interval and the characteristic sequence of the audio track information entropy of the singing audio into a main melody audio track classification model to obtain classification results of a plurality of audio tracks corresponding to the singing audio;
and determining a main melody track of the singing audio according to the classification results of the plurality of tracks, extracting data of the main melody track from an audio file of the singing audio to obtain a reference base frequency sequence of the singing audio, wherein the main melody track classification model is obtained by training a sample song set and a track label sequence corresponding to each song in the sample song set.
Optionally, the mapping relationship determining unit is specifically configured to:
calculating to obtain a matching matrix A according to the human voice base frequency sub-sequence, the reference base frequency sub-sequence and a preset matching value, wherein the preset matching value is used for representing a melody matching standard between the human voice base frequency sub-sequence and the reference base frequency sub-sequence, the matching matrix A comprises (j +1) × (k +1) matrix elements, j is the number of the sequence elements of the human voice base frequency sub-sequence, and k is the number of the sequence elements of the reference base frequency sub-sequence;
determining a matrix element A (j +1, k +1) in the matching matrix A as a first target element, and determining a path element of the first target element from a plurality of selectable path elements corresponding to the first target element in the matching matrix A;
determining a path element of the first target element as a second target element, and determining a path element of the second target element from a plurality of selectable path elements corresponding to the second target element in the matching matrix A until a path element of an mth target element is a matrix element A (1,1) in the matching matrix A;
determining an optimal matching path according to the first target element and at least one path element;
and determining the human voice base frequency matching subsequence and the reference base frequency matching subsequence according to the position relationship between the path direction between every two adjacent matrix elements in the optimal matching path and the sequence direction of the human voice base frequency subsequence in the matching matrix A and the sequence direction of the reference base frequency subsequence in the matching matrix A.
Optionally, the human voice base frequency matching sub-sequences correspond to the reference base frequency matching sub-sequences one to one;
the classification result determining unit is specifically configured to: and calculating the melody matching degree of the ith lyric according to the number of sequence elements with preset matching values of the difference value between the sequence elements in the human voice base frequency matching subsequence and the corresponding sequence elements of the sequence elements in the reference base frequency subsequence and the number of the sequence elements in the human voice base frequency matching subsequence.
Optionally, the preset matching value is any one of 12, 7, 5, 0, -12, -7 and -5.
In a third aspect, a computer program product is provided for embodiments of the present application, the computer program product comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the singing audio classification method as described in an aspect of the embodiments of the present application.
In a fourth aspect, a server is provided for an embodiment of the present application, and includes a processor, a memory, and a transceiver, where the processor, the memory, and the transceiver are connected to each other, the memory is used to store a computer program that enables the server to execute the above singing audio classification method, and the computer program includes program instructions; the processor is configured to invoke the program instructions to execute the singing audio classification method as described in an aspect of an embodiment of the present application.
In a fifth aspect, a storage medium is provided for an embodiment of the present application, where the storage medium stores a computer program, and the computer program includes program instructions; the program instructions, when executed by a processor, cause the processor to perform a method of singing audio classification as described in an aspect of an embodiment of the present application.
In the embodiment of the application, the server acquires a voice base frequency subsequence corresponding to the ith lyric in the singing audio from the voice base frequency sequence of the singing audio; acquires a reference base frequency subsequence corresponding to the ith lyric in the singing audio from the reference base frequency sequence of the singing audio; determines an optimal matching mapping relation between the voice base frequency subsequence and the reference base frequency subsequence based on the voice base frequency subsequence and the reference base frequency subsequence corresponding to the ith lyric in the singing audio, wherein the optimal matching mapping relation comprises a voice base frequency matching subsequence and a reference base frequency matching subsequence; and determines the melody matching degree of the ith lyric based on the voice base frequency matching subsequence and the reference base frequency matching subsequence, and obtains the classification result of the singing audio based on the melody matching degree of each lyric. Therefore, the accuracy of the singing audio classification result can be improved, and the willingness of the user to publish audio works is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a singing audio classification method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a process for determining an optimal matching path according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of another singing audio classification method provided in the embodiment of the present application;
fig. 4 is a schematic structural diagram of a singing audio classification apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Please refer to fig. 1, which is a flowchart illustrating a singing audio classification method according to an embodiment of the present application. As shown in fig. 1, this method embodiment comprises the steps of:
s101, acquiring a human voice base frequency subsequence corresponding to the ith lyric in the singing audio from the human voice base frequency sequence of the singing audio.
Before the server executes step S101, the voice audio signal of the singing audio may be processed by using a wavelet transform algorithm to obtain a voice base frequency sequence of the singing audio, and the implementation manner is as follows:
framing the voice audio signal of the singing audio to obtain at least one audio frame, and performing wavelet decomposition on each audio frame in the at least one audio frame to obtain a plurality of wavelet decomposition signals corresponding to each audio frame, wherein each wavelet decomposition signal comprises wavelet high-frequency decomposition signals and wavelet low-frequency decomposition signals of a plurality of audio sampling points;
determining the target wavelet decomposition times according to the maximum value in the amplitude values of the wavelet low-frequency decomposition signals corresponding to every two adjacent wavelet decompositions of the at least one audio frame;
and calculating to obtain the human voice base frequency sequence of the singing audio according to the period of the wavelet high-frequency decomposition signal under the target wavelet decomposition times.
Here, the human voice fundamental frequency sequence of the singing audio can be understood as the pitch sequence of the human voice audio signal. Sound is generated by the vibration of an object; during vibration, instantaneous changes in the air flow cause instantaneous sharp changes in the audio signal, producing mutation points, and the reciprocal of the time interval between two adjacent mutation points is the fundamental tone frequency at that moment. Because the wavelet has a strong ability to detect signal mutation points, the fundamental tone frequency can be determined by locating the maximum value points after wavelet transformation.
In a possible implementation manner, the server frames the human voice audio signal of the singing audio with a sampling frequency of 16 kHz, a frame shift of 10 ms and a frame length of 10 ms, so that each audio frame comprises 160 audio sampling points, and performs wavelet decomposition on each audio frame. After the first high-pass filtering, the wavelet high-frequency decomposition signal contains 160 audio sampling points, and after the first low-pass filtering, the wavelet low-frequency decomposition signal also contains 160 audio sampling points; together they form the level-1 wavelet decomposition signal. In order to keep the number of audio sampling points after wavelet decomposition consistent with the number of audio sampling points of the original audio frame, the signals after high-pass filtering and low-pass filtering can be down-sampled. That is, the wavelet low-frequency decomposition signal after the first low-pass filtering is down-sampled so that its sampling frequency is half the sampling frequency of the original audio frame, leaving 80 audio sampling points; similarly, the wavelet high-frequency decomposition signal after the first high-pass filtering contains 80 audio sampling points after down-sampling. Adding the audio sampling points remaining after the first low-pass and first high-pass down-sampling gives 160 points for the level-1 wavelet decomposition signal, which is consistent with the number of audio sampling points of one audio frame signal.
Further, the server calculates the ratio α of the maximum amplitude of the wavelet low-frequency decomposition signals corresponding to the (l+1)th wavelet decomposition of all audio frames in the human voice audio signal of the singing audio to the maximum amplitude of the wavelet low-frequency decomposition signals corresponding to the lth wavelet decomposition of all audio frames; if the ratio is smaller than a first preset threshold, l+1 is determined as the target wavelet decomposition times. The first preset threshold may be, for example, π/2.
The server then determines, according to the amplitude of the wavelet high-frequency decomposition signal corresponding to the (l+1)th wavelet decomposition, at least one maximum value sampling point among the plurality of audio sampling points and the time M(i) corresponding to each maximum value sampling point; the maximum value sampling points can be understood as the mutation points mentioned above, so the time sequence M corresponding to the mutation points is obtained. The reciprocal of the time interval T(i) between adjacent mutation points is then calculated to obtain the vocal fundamental frequency sequence of the singing audio.
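As an illustration only, and not the claimed implementation, the frame-level extraction described above can be sketched roughly as follows in Python. The Haar wavelet, the treatment of local maxima of the detail signal as mutation points, and the names haar_dwt, frame_pitch and vocal_f0_sequence are assumptions made for this sketch; only the 16 kHz / 10 ms framing and the π/2 ratio test come from the example above.

```python
import numpy as np

def haar_dwt(x):
    """One level of Haar wavelet decomposition with down-sampling by two."""
    x = np.asarray(x, dtype=float)
    x = x[: len(x) // 2 * 2]                    # keep an even number of samples
    low = (x[0::2] + x[1::2]) / np.sqrt(2.0)    # low-frequency (approximation) signal
    high = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # high-frequency (detail) signal
    return low, high

def frame_pitch(frame, sample_rate=16000, ratio_threshold=np.pi / 2):
    """Estimate the fundamental frequency of one audio frame.

    The decomposition level is increased until the ratio between the maximum
    low-frequency amplitudes of two adjacent levels falls below the threshold;
    local maxima of the detail signal at that level stand in for the
    'mutation points', and F0 is the reciprocal of their mean spacing.
    """
    low, high = haar_dwt(frame)
    fs = sample_rate / 2.0                      # effective rate of the detail signal
    prev_max = np.max(np.abs(low)) + 1e-12
    while len(low) >= 4:
        next_low, next_high = haar_dwt(low)
        cur_max = np.max(np.abs(next_low)) + 1e-12
        low, high, fs = next_low, next_high, fs / 2.0
        if cur_max / prev_max < ratio_threshold:
            break                               # target decomposition level reached
        prev_max = cur_max
    h = np.abs(high)
    # sample indices whose detail amplitude is a local maximum (mutation points)
    peaks = np.where((h[1:-1] > h[:-2]) & (h[1:-1] > h[2:]))[0] + 1
    if len(peaks) < 2:
        return 0.0                              # treat the frame as unvoiced/silent
    return float(fs / np.mean(np.diff(peaks)))  # reciprocal of the mean period

def vocal_f0_sequence(signal, sample_rate=16000, frame_len=160):
    """Frame the vocal signal (10 ms frames at 16 kHz), one F0 value per frame."""
    n_frames = len(signal) // frame_len
    return [frame_pitch(signal[i * frame_len:(i + 1) * frame_len], sample_rate)
            for i in range(n_frames)]
```

Note that with 10 ms frames, very low pitches whose period exceeds the frame length cannot be resolved; this is a limitation of the sketch rather than of the described method.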
After obtaining the voice base frequency sequence of the singing audio, the server obtains a voice base frequency subsequence corresponding to the ith lyric in the singing audio from the voice base frequency sequence of the singing audio, and the method comprises the following steps:
determining a plurality of lyric pause moments of the human voice base frequency sequence according to a time interval between starting moments corresponding to each two adjacent sequence elements in the human voice base frequency sequence and a first preset time threshold;
determining the starting time and the ending time of the nth lyric according to the plurality of lyric pause moments, wherein n is greater than or equal to 2 and is a positive integer;
determining the starting time of the 1 st lyric in the singing audio according to the time interval between the starting times corresponding to the first two non-zero sequence elements in the voice fundamental frequency sequence and a second preset time threshold;
and extracting a plurality of sequence elements between the starting time and the ending time of the ith lyric from the voice base frequency sequence to obtain a voice base frequency subsequence corresponding to the ith lyric in the singing audio.
Specifically, the server traverses the time interval between the starting time corresponding to the mth sequence element and the starting time corresponding to the (m-1)th sequence element in the human voice fundamental frequency sequence. If the time interval is greater than a first preset time threshold, the starting time corresponding to the mth sequence element is determined as a lyric pause time, so that a plurality of lyric pause times of the human voice fundamental frequency sequence are obtained; the (n-1)th lyric pause time and the nth lyric pause time among the plurality of lyric pause times are determined as the starting time and the ending time of the nth lyric, where n is greater than or equal to 2 and is a positive integer;
calculating a time interval between a starting time corresponding to a 2 nd non-zero sequence element in a human voice base frequency sequence and a starting time corresponding to a 1 st non-zero sequence element, and if the time interval is smaller than a second preset time threshold, determining the starting time corresponding to the 1 st non-zero sequence element as the starting time of a 1 st lyric in a singing audio, wherein the ending time of the 1 st lyric in the singing audio is the starting time of the 2 nd lyric in the singing audio;
and extracting a plurality of sequence elements between the starting time and the ending time of the ith lyric from the voice base frequency sequence to obtain a voice base frequency subsequence corresponding to the ith lyric in the singing audio.
For example, the server calculates a first time difference between the starting times corresponding to two adjacent sequence elements in the human voice fundamental frequency sequence. If the starting time corresponding to the 10th sequence element is 0 minutes 20 seconds and the starting time corresponding to the 11th sequence element is 0 minutes 21 seconds, the time difference of 1 second is greater than the first preset time threshold of 30 milliseconds, so the starting time of 0 minutes 21 seconds corresponding to the 11th sequence element is determined as a lyric pause time. Suppose the lyric pause times of the singing audio calculated in this manner are 0 minutes 21 seconds, 0 minutes 35 seconds, 0 minutes 50 seconds, 1 minute 10 seconds, 1 minute 30 seconds, 1 minute 55 seconds, 2 minutes 10 seconds and 2 minutes 30 seconds. The 1st lyric pause time of 0 minutes 21 seconds and the 2nd lyric pause time of 0 minutes 35 seconds are then determined as the starting time and ending time of the 2nd lyric of the singing audio, and the starting time and ending time of each lyric from the 2nd lyric onward are obtained in this manner.
Then, the server calculates that the second time difference between the starting time of 0 minutes 5.02 seconds corresponding to the 2nd non-zero sequence element in the human voice base frequency sequence and the starting time of 0 minutes 5 seconds corresponding to the 1st non-zero sequence element is 20 milliseconds, which is less than the second preset time threshold of 40 milliseconds, and therefore determines the starting time of 0 minutes 5 seconds corresponding to the 1st non-zero sequence element as the starting time of the 1st lyric in the singing audio; the ending time of the 1st lyric is the starting time of the 2nd lyric, namely 0 minutes 21 seconds.
A plurality of sequence elements between the starting time of 0 minutes 35 seconds and the ending time of 0 minutes 50 seconds of the 3rd lyric are then extracted from the human voice base frequency sequence, giving the human voice base frequency subsequence corresponding to the 3rd lyric in the singing audio.
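A rough sketch of this segmentation, under the assumption that the fundamental frequency sequence is available as (start time, F0) pairs, might look as follows; the 30 ms and 40 ms defaults simply reuse the thresholds of the example, and all names are illustrative.

```python
def split_lyric_subsequences(f0_events, gap_threshold=0.03, first_threshold=0.04):
    """Split the vocal F0 sequence into per-lyric subsequences.

    f0_events: list of (start_time_seconds, f0_value) pairs, one per sequence
    element. gap_threshold and first_threshold stand in for the first and
    second preset time thresholds (30 ms and 40 ms in the example above).
    """
    if len(f0_events) < 2:
        return []

    # lyric pause times: start time of any element whose gap to the previous
    # element exceeds the first threshold
    pauses = [f0_events[i][0] for i in range(1, len(f0_events))
              if f0_events[i][0] - f0_events[i - 1][0] > gap_threshold]

    # start of the 1st lyric: start time of the first non-zero element, accepted
    # only when the second non-zero element follows within the second threshold
    nonzero = [(t, f) for t, f in f0_events if f > 0]
    boundaries = list(pauses)
    if len(nonzero) >= 2 and nonzero[1][0] - nonzero[0][0] < first_threshold:
        boundaries.insert(0, nonzero[0][0])
    boundaries.append(f0_events[-1][0] + 1e-3)  # closes the last lyric

    # the ith lyric spans [boundaries[i], boundaries[i+1])
    return [[f for t, f in f0_events if start <= t < end]
            for start, end in zip(boundaries[:-1], boundaries[1:])]
```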
S102, obtaining a reference base frequency subsequence corresponding to the ith lyric in the singing audio from the reference base frequency sequence of the singing audio.
Before the server executes step S102, the server may process the audio file of the singing audio by using a BP neural network algorithm to obtain a reference fundamental frequency sequence of the singing audio, and the implementation manner is as follows:
calculating to obtain a musical interval characteristic sequence and a track information entropy sequence of the singing audio according to the audio file of the singing audio;
inputting the characteristic sequence of the musical track interval and the characteristic sequence of the audio track information entropy of the singing audio into a main melody audio track classification model to obtain classification results of a plurality of audio tracks corresponding to the singing audio;
and determining a main melody track of the singing audio according to the classification results of the plurality of tracks, extracting data of the main melody track from an audio file of the singing audio to obtain a reference base frequency sequence of the singing audio, wherein the main melody track classification model is obtained by training a sample song set and a track label sequence corresponding to each song in the sample song set.
Specifically, the server randomly selects a sample song set from the song library, where the sample song set includes at least one song and a Musical Instrument Digital Interface (MIDI) file of each song's audio, and the actual track label sequence corresponding to each song in the sample song set is obtained through manual classification. The server then extracts the track interval characteristic sequence and the track information entropy characteristic sequence of each song in the sample song set; the way these two characteristic sequences are extracted for each sample song is the same as the way the track interval characteristic sequence and the track information entropy characteristic sequence of the singing audio are extracted in this step, and is not repeated here. The track interval characteristic sequences and track information entropy characteristic sequences of the sample song set are thus obtained, and the sample song set is divided into a training set and a verification set according to a preset proportion (such as 7:3). Then, the server inputs the track interval characteristic sequences and track information entropy characteristic sequences of the training set, together with the actual track label sequence corresponding to each song in the training set, into an initial error back propagation (BP) network model for training and learning to obtain a first BP network model. The track interval characteristic sequences and track information entropy characteristic sequences of the verification set are input into the first BP network model to obtain a predicted track label sequence for each song in the verification set, and the proportion of songs in the verification set whose predicted track label sequence is consistent with the actual track label sequence is calculated. Whether the first BP network model has reached a convergence condition is judged according to this proportion; illustratively, the convergence condition may be a preset output accuracy of 90%. When the first BP network model, after adjustment, meets the convergence condition, the adjusted first BP network model is determined as the main melody track classification model.
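A possible, simplified realization of this training procedure is sketched below using scikit-learn's MLPClassifier as a stand-in for the error back propagation network. Treating every track of every sample song as one training example with a binary main-melody label is a simplification of the per-song label sequences in the text; the 7:3 split and the 90% accuracy criterion follow the text, while the hidden layer size and iteration count are assumptions.

```python
from sklearn.neural_network import MLPClassifier          # stand-in for the BP network
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def train_main_melody_classifier(interval_features, entropy_features, track_labels):
    """interval_features / entropy_features: one value per track of every sample song;
    track_labels: 1 if that track is the main melody track, 0 otherwise."""
    X = list(zip(interval_features, entropy_features))     # two features per track
    y = track_labels

    # 7:3 split between training set and verification set, as in the text
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
    model.fit(X_train, y_train)                             # back-propagation training

    accuracy = accuracy_score(y_val, model.predict(X_val))  # proportion of correct labels
    converged = accuracy >= 0.9                             # the 90% convergence criterion
    return model, accuracy, converged
```

If the returned accuracy falls short of the criterion, the model parameters would be adjusted and training repeated, mirroring the adjustment step described above.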
Since most melodic intervals are concentrated between 0 and 6 degrees and intervals above an octave rarely occur, the interval can be used as a feature for distinguishing the main melody track from the accompaniment tracks, where an interval is the absolute value of the difference between two adjacent pitches. The server extracts the pitch sequences of the plurality of tracks of the singing audio from the MIDI file of the singing audio and calculates the absolute value of the difference between two adjacent notes Pn and Pn+1 in the pitch sequence of each track to obtain the interval In of that track. The calculation formula is as follows:

In = | Pn+1 - Pn |
The ratio of the number of intervals whose value is less than or equal to 6 in track i to the total number of intervals of track i (only intervals whose value is less than or equal to 25 are counted) is then calculated and determined as the interval characteristic quantity of track i. The interval characteristic quantities of all tracks of the singing audio are obtained in this manner, thereby yielding the track interval characteristic sequence of the singing audio.
Then, the server calculates the information entropy characteristic quantity of track i according to the pitch sequence of track i, where the information entropy characteristic quantity is the pitch entropy characteristic quantity H(P), and the calculation formula is as follows:

H(P) = -Σ p(x)·log p(x)
where x is a pitch value in the pitch sequence of track i and p(x) is the probability that the pitch value x occurs in the pitch sequence of track i. The pitch entropy characteristic quantities of all tracks of the singing audio are obtained in the above manner, thereby yielding the track information entropy characteristic sequence of the singing audio; it can be understood that the track information entropy characteristic sequence is the track pitch entropy characteristic sequence.
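The two per-track features can be computed along the following lines; the exclusion of intervals above 25 and the base-2 logarithm in the entropy are assumptions consistent with one reading of the text.

```python
import numpy as np

def interval_feature(pitches, max_interval=25):
    """Ratio of intervals no larger than 6 among all counted intervals of one track
    (intervals above max_interval are discarded; this reading of the text is an assumption)."""
    intervals = np.abs(np.diff(np.asarray(pitches, dtype=float)))
    intervals = intervals[intervals <= max_interval]
    if intervals.size == 0:
        return 0.0
    return float(np.sum(intervals <= 6) / intervals.size)

def pitch_entropy(pitches):
    """Pitch entropy H(P) = -sum_x p(x) * log2 p(x); the base-2 logarithm is assumed."""
    values, counts = np.unique(np.asarray(pitches), return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))
```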
The track interval characteristic sequence and the track information entropy characteristic sequence of the singing audio are then input into the main melody track classification model to obtain the classification results of the plurality of tracks of the singing audio, and the data of the track whose classification result is the main melody track is extracted to obtain the reference fundamental frequency sequence of the singing audio.
Further, obtaining a reference base frequency subsequence corresponding to the ith lyric in the singing audio from the reference base frequency sequence of the singing audio, comprising:
and acquiring the starting time and the ending time of the ith lyric in the lyric text sequence according to the lyric text sequence of the singing audio, and extracting a plurality of sequence elements between the starting time and the ending time from the reference base frequency sequence to obtain a reference base frequency subsequence corresponding to the ith lyric in the singing audio.
For example, the server obtains the starting time and the ending time of the 2 nd lyric from the lyric text sequence of the singing audio as 0 minute 20 seconds and 0 minute 26 seconds respectively, and extracts a plurality of sequence elements in the time period from 0 minute 20 seconds to 0 minute 26 seconds from the reference base frequency sequence, thereby obtaining the reference base frequency subsequence corresponding to the 2 nd lyric in the singing audio.
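Assuming the reference fundamental frequency sequence is likewise stored as (start time, value) pairs, the per-lyric extraction reduces to a time-window slice, for example (names illustrative):

```python
def reference_subsequence(ref_f0_events, lyric_start, lyric_end):
    """ref_f0_events: list of (start_time_seconds, value) pairs from the main melody track;
    lyric_start / lyric_end come from the lyric text sequence (e.g. 20.0 s and 26.0 s
    for the 2nd lyric in the example above)."""
    return [value for t, value in ref_f0_events if lyric_start <= t < lyric_end]
```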
S103, determining an optimal matching mapping relation between the voice fundamental frequency sub-sequence and the reference fundamental frequency sub-sequence based on the voice fundamental frequency sub-sequence and the reference fundamental frequency sub-sequence corresponding to the ith lyric in the singing audio, wherein the optimal matching mapping relation comprises a voice fundamental frequency matching sub-sequence and a reference fundamental frequency matching sub-sequence.
In a possible implementation manner, the server calculates a matching matrix A according to the human voice fundamental frequency subsequence, the reference fundamental frequency subsequence and a preset matching value, where the matching matrix A includes (j+1) × (k+1) matrix elements, j is the number of sequence elements of the human voice fundamental frequency subsequence, and k is the number of sequence elements of the reference fundamental frequency subsequence;
determining a matrix element A (j +1, k +1) in the matching matrix A as a first target element, and determining a path element of the first target element from a plurality of selectable path elements corresponding to the first target element in the matching matrix A;
determining a path element of the first target element as a second target element, and determining a path element of the second target element from a plurality of selectable path elements corresponding to the second target element in the matching matrix A until a path element of an mth target element is a matrix element A (1,1) in the matching matrix A;
determining an optimal matching path according to the first target element and at least one path element;
and determining the human voice base frequency matching subsequence and the reference base frequency matching subsequence according to the position relationship between the path direction between every two adjacent matrix elements in the optimal matching path and the sequence direction of the human voice base frequency subsequence in the matching matrix A and the sequence direction of the reference base frequency subsequence in the matching matrix A.
Firstly, the server calculates a matching matrix A according to the human voice base frequency subsequence, the reference base frequency subsequence and a preset matching value, wherein the matching matrix A comprises (j +1) × (k +1) matrix elements, j is the number of the sequence elements of the human voice base frequency subsequence, and k is the number of the sequence elements of the reference base frequency subsequence, and the specific implementation process is as follows:
the server determines a first row matrix element A (1,: and a first column matrix element A (: 1) of the matching matrix A according to a preset matrix element A (1,1), a preset vacancy fraction, the voice base frequency subsequence and the reference base frequency subsequence, determines a first candidate matrix element of A (m +1, n +1) according to a difference value, a preset matching value, a matrix element A (m, n), a preset matching fraction and a preset mismatch fraction between an mth sequence element of the voice base frequency subsequence and an nth sequence element of the reference base frequency subsequence, wherein m is an integer greater than or equal to 1 and less than or equal to j, and n is an integer greater than or equal to 1 and less than or equal to k, it can be understood that if the difference value between the mth sequence element of the voice base frequency subsequence and the nth sequence element of the reference base frequency subsequence is equal to the preset matching value, the first candidate matrix element of a (m +1, n +1) is the sum between a (m, n) and the preset matching score, otherwise, the first candidate matrix element of a (m +1, n +1) is the sum between a (m, n) and the preset mismatching score; determining the sum of the matrix element A (m, n +1) and the preset deletion score as a second candidate matrix element of A (m +1, n + 1); determining the sum of the matrix element A (m +1, n) and the preset insertion score as a third candidate matrix element of A (m +1, n + 1); determining a (m +1, n +1), i.e. the matrix element values of the m +1 th row and the n +1 th column in the matching matrix a, according to the first candidate matrix element, the second candidate matrix element and the third candidate matrix element, and illustratively, if the preset matching score is smaller than the preset mismatching score, determining the minimum value of the first candidate matrix element, the second candidate matrix element and the third candidate matrix element as a (m +1, n + 1). And calculating to obtain a matching matrix A according to the mode.
In this embodiment, A(1,1), the preset vacancy score, the preset matching score, the preset mismatching score, the preset insertion score and the preset deletion score are all set manually and are not limited here; as an example, A(1,1) = 0, preset vacancy score = preset mismatching score = preset insertion score = preset deletion score = 1, and preset matching score = 0. Furthermore, the preset matching value includes any one of 12, 7, 5, 0, -12, -7 and -5, and the preset matching value is used to represent a melody matching criterion between the human voice base frequency subsequence and the reference base frequency subsequence.
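A sketch of this matrix construction, using the example scores (vacancy, mismatching, insertion and deletion scores of 1, matching score of 0) and a preset matching value of 12, could look as follows; the exact-equality match test assumes quantized pitch values, and the sketch is an illustration rather than the claimed implementation.

```python
import numpy as np

def matching_matrix(vocal_sub, ref_sub, match_value=12,
                    match_score=0, mismatch_score=1,
                    vacancy_score=1, insert_score=1, delete_score=1):
    """Build the (j+1) x (k+1) matching matrix A described above.

    vocal_sub / ref_sub are the subsequences of one lyric; two elements are
    considered matched when their difference equals the preset matching value.
    """
    j, k = len(vocal_sub), len(ref_sub)
    A = np.zeros((j + 1, k + 1))
    A[0, :] = np.arange(k + 1) * vacancy_score   # first row: accumulated vacancy scores
    A[:, 0] = np.arange(j + 1) * vacancy_score   # first column: accumulated vacancy scores
    for m in range(1, j + 1):
        for n in range(1, k + 1):
            matched = (vocal_sub[m - 1] - ref_sub[n - 1]) == match_value
            diag = A[m - 1, n - 1] + (match_score if matched else mismatch_score)
            deletion = A[m - 1, n] + delete_score     # second candidate matrix element
            insertion = A[m, n - 1] + insert_score    # third candidate matrix element
            A[m, n] = min(diag, deletion, insertion)  # lower score means better match
    return A
```

Lower matrix values indicate better alignment, matching the convention of the worked example below.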
A detailed example is given here of how the server determines the matching matrix between the human voice base frequency subsequence corresponding to the 2nd lyric and the reference base frequency subsequence when the preset matching value is 12; please refer to fig. 2, which is a schematic diagram of a process of determining the optimal matching path provided in the embodiment of the present application. As shown in fig. 2, the human voice base frequency subsequence of the 2nd lyric in the singing audio is S = ACGC and the reference base frequency subsequence is T = catg, where the differences between the sequence elements A, C and G in the human voice base frequency subsequence and the sequence elements a, c and g in the reference base frequency subsequence are equal to the preset matching value 12, that is, the sequence elements A, C and G in the human voice base frequency subsequence match the sequence elements a, c and g in the reference base frequency subsequence respectively. According to the human voice base frequency subsequence S = ACGC and the reference base frequency subsequence T = catg, the server determines that the matching matrix comprises 5 × 5 matrix elements. The sum of the matrix element A(1,1) = 0 in the first row and first column and the preset vacancy score of 1 gives the matrix element A(1,2) = 1 in the first row and second column and the matrix element A(2,1) = 1 in the second row and first column; the sum of A(1,2) = 1 and the preset vacancy score of 1 gives the matrix element A(1,3) = 2 in the first row and third column, and the sum of A(2,1) = 1 and the preset vacancy score of 1 gives the matrix element A(3,1) = 2 in the third row and first column. Continuing in this manner, the first row matrix elements are 0, 1, 2, 3, 4 and the first column matrix elements are 0, 1, 2, 3, 4. The matrix element in the 2nd row and 2nd column of the matching matrix is then determined, and the specific calculation process can be as follows: the difference between the first sequence element A in the human voice base frequency subsequence S and the first sequence element c in the reference base frequency subsequence T is calculated; this difference is not equal to the preset matching value 12, that is, the first sequence element A in the human voice base frequency subsequence S does not match the first sequence element c in the reference base frequency subsequence T, so the sum of A(1,1) = 0 and the preset mismatching score of 1 is calculated, giving the first candidate matrix element of the 2nd row, 2nd column matrix element A(2,2) as 1. The sum of A(1,2) = 1 and the preset deletion score of 1 and the sum of A(2,1) = 1 and the preset insertion score of 1 are calculated respectively, giving the second and third candidate matrix elements of A(2,2), both equal to 2. Since the preset matching score is 0, that is, the higher the matching degree between the sequence elements in the human voice base frequency subsequence S and the sequence elements in the reference base frequency subsequence T, the smaller the corresponding matching score in the matching matrix, the minimum value 1 among the first candidate matrix element 1, the second candidate matrix element 2 and the third candidate matrix element 2 is determined as the matrix element A(2,2) in the matching matrix A. Each matrix element in the matching matrix is calculated according to the above method, and the matching matrix between the human voice base frequency subsequence corresponding to the 2nd lyric in the singing audio and the reference base frequency subsequence is shown in the dotted-line area in fig. 2.
Further, the server may calculate, in the above manner, a plurality of matching matrices corresponding to the preset matching values of -12, ±7, ±5 and 0, respectively.
Then, the server can find the optimal matching path in the matching matrix A by a backtracking method.
For example, referring to fig. 2 again, the matching matrix A may contain a plurality of matching paths from the first matrix element A(1,1) = 0 to the last matrix element A(5,5) = 3. The server determines the last matrix element A(5,5) = 3 as the first target element. Because the sequence element g in the reference base frequency subsequence and the sequence element C in the human voice base frequency subsequence corresponding to the first target element in the matching matrix A are mismatched, the server calculates the difference between the first target element A(5,5) = 3 and the preset mismatch score 1, obtaining a backtracking value of 2 corresponding to the first optional path element A(4,4) = 3 of the first target element; calculates the difference between the first target element A(5,5) = 3 and the preset deletion score 1, obtaining a backtracking value of 2 corresponding to the second optional path element A(4,5) = 2 of the first target element; calculates the difference between the first target element A(5,5) = 3 and the preset insertion score 1, obtaining a backtracking value of 2 corresponding to the third optional path element A(5,4) = 3 of the first target element; and determines, among the first optional path element, the second optional path element and the third optional path element, the matrix element A(4,5) = 2 that is equal to its corresponding backtracking value as the path element of the first target element. A(4,5) = 2 is then determined as the second target element. Because the sequence element G in the human voice base frequency subsequence and the sequence element g in the reference base frequency subsequence corresponding to the second target element in the matching matrix A are matched, the server calculates the difference between the second target element A(4,5) = 2 and the preset matching score 0, obtaining a backtracking value of 2 corresponding to the first optional path element A(3,4) = 2 of the second target element; calculates the difference between the second target element A(4,5) = 2 and the preset deletion score 1, obtaining a backtracking value of 1 corresponding to the second optional path element A(3,5) = 3 of the second target element; calculates the difference between the second target element A(4,5) = 2 and the preset insertion score 1, obtaining a backtracking value of 1 corresponding to the third optional path element A(4,4) = 3 of the second target element; and determines, among the first optional path element, the second optional path element and the third optional path element of the second target element, the matrix element A(3,4) = 2 that is equal to its corresponding backtracking value as the path element of the second target element. The path element of each subsequent target element is obtained in the above manner until the last obtained path element is A(1,1) = 0, and the path elements are connected in the order determined by the first target element and the plurality of path elements to form the optimal matching path, that is, the path indicated by the arrows in fig. 2.
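A sketch of this backtracking step follows, under the same assumptions as the matrix sketch above. For each target element it computes the backtracking values (target value minus the match/mismatch, deletion or insertion score) and moves to the candidate element that equals its backtracking value, until A(1,1) is reached; the function name trace_optimal_path is illustrative.

def trace_optimal_path(a, voice_seq, ref_seq, match_value=12,
                       match_score=0, mismatch_score=1, gap_score=1):
    # Returns the optimal matching path as a list of 0-based (row, column)
    # positions ordered from A(1,1) to the last matrix element.
    i, j = len(voice_seq), len(ref_seq)
    path = [(i, j)]
    while (i, j) != (0, 0):
        if i > 0 and j > 0:
            diff = voice_seq[i - 1] - ref_seq[j - 1]
            sub = match_score if diff == match_value else mismatch_score
            if a[i - 1][j - 1] == a[i][j] - sub:      # diagonal candidate
                i, j = i - 1, j - 1
                path.append((i, j))
                continue
        if i > 0 and a[i - 1][j] == a[i][j] - gap_score:    # deletion candidate
            i -= 1
        elif j > 0 and a[i][j - 1] == a[i][j] - gap_score:  # insertion candidate
            j -= 1
        else:                                               # first row or column
            i, j = (i - 1, j) if i > 0 else (i, j - 1)
        path.append((i, j))
    path.reverse()
    return path

Applied to the matrix of fig. 2, this sketch yields the path A(1,1), A(1,2), A(2,3), A(3,4), A(4,5), A(5,5) indicated by the arrows.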
And then, the server determines an optimal matching mapping relation according to the optimal matching path, wherein the optimal matching mapping relation comprises a voice base frequency matching subsequence and a reference base frequency matching subsequence.
Specifically, the server traverses, for every two adjacent matrix elements in the optimal matching path, the direction of the line connecting their positions, namely the path direction, and compares it with the direction of the human voice base frequency subsequence and the direction of the reference base frequency subsequence in the matching matrix. If the line direction between the position of the lth matrix element and the position of the (l+1)th matrix element in the optimal matching path is perpendicular to the direction of the human voice base frequency subsequence or to the direction of the reference base frequency subsequence, the lth sequence element of the matching subsequence corresponding to the subsequence whose direction is perpendicular to the line direction is determined as a blank;

if the line direction between the position of the lth matrix element and the position of the (l+1)th matrix element in the optimal matching path is not perpendicular to the direction of the human voice base frequency subsequence or to the direction of the reference base frequency subsequence, the lth sequence element of the matching subsequence corresponding to the subsequence whose direction is not perpendicular to the line direction is determined as the sequence element of that subsequence indicated by the row or column of the (l+1)th matrix element, so that the human voice base frequency matching subsequence and the reference base frequency matching subsequence corresponding to the ith lyric are obtained.
For example, when the preset matching value in fig. 2 is 12, the line direction (path direction) between the matrix element A(1,1) = 0 and the matrix element A(1,2) = 1 of the optimal matching path of the 2nd lyric is horizontal; it is perpendicular to the direction of the human voice base frequency subsequence S (vertical) and not perpendicular to the direction of the reference base frequency subsequence T (horizontal). Therefore, the first sequence element of the human voice base frequency matching subsequence corresponding to S, whose direction is perpendicular to the line direction, is determined as a blank, i.e., "_", and the first sequence element of the reference base frequency matching subsequence corresponding to T, whose direction is not perpendicular to the line direction, is determined as the first sequence element c of the reference base frequency subsequence T. The line direction between the matrix element A(1,2) = 1 and the matrix element A(2,3) = 1 of the optimal matching path is diagonal, i.e., at an angle to the horizontal direction, and is therefore not perpendicular to the direction of the human voice base frequency subsequence S (vertical) or to the direction of the reference base frequency subsequence T (horizontal); the second sequence element of the human voice base frequency matching subsequence is thus determined as the first sequence element A of S, and the second sequence element of the reference base frequency matching subsequence is determined as the second sequence element a of T. Proceeding in this manner, the human voice base frequency matching subsequence "_ACGC" and the reference base frequency matching subsequence "catg_" corresponding to the 2nd lyric are obtained when the preset matching value is 12.
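The perpendicular / non-perpendicular rule above amounts to reading the path one step at a time: a horizontal step inserts a blank into the human voice base frequency matching subsequence, a vertical step inserts a blank into the reference base frequency matching subsequence, and a diagonal step takes one element from each subsequence. A minimal sketch of this reading, assuming the path produced by the backtracking sketch above and using None to represent a blank, might look as follows.

def path_to_matching_subsequences(path, voice_seq, ref_seq):
    # Walk the optimal matching path step by step and build the two
    # matching subsequences; None stands for a blank ("_") element.
    voice_match, ref_match = [], []
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        if i1 == i0 + 1 and j1 == j0 + 1:   # diagonal step: both advance
            voice_match.append(voice_seq[i1 - 1])
            ref_match.append(ref_seq[j1 - 1])
        elif j1 == j0 + 1:                  # horizontal step: blank in voice
            voice_match.append(None)
            ref_match.append(ref_seq[j1 - 1])
        else:                               # vertical step: blank in reference
            voice_match.append(voice_seq[i1 - 1])
            ref_match.append(None)
    return voice_match, ref_match

For the path of fig. 2 this yields the pair "_ACGC" and "catg_" described above.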
Further, the server can obtain, in the above manner, the optimal matching paths of the 2nd lyric corresponding to the preset matching values -12, ±7, ±5 and 0 respectively, so as to obtain the human voice base frequency matching subsequences and the reference base frequency matching subsequences corresponding to the 2nd lyric under the different preset matching values.
And S104, determining the melody matching degree of the ith lyric based on the human voice fundamental frequency matching subsequence and the reference fundamental frequency matching subsequence, and obtaining the classification result of the singing audio based on the melody matching degree of each lyric.
And the human voice base frequency matching subsequences correspond to the reference base frequency matching subsequences one to one. Here, the classification result of the singing audio may be a singing score of the singing audio.
In a possible implementation manner, the determining the melody matching degree of the ith lyric based on the human voice fundamental frequency matching subsequence and the reference fundamental frequency matching subsequence includes:
and calculating the melody matching degree of the ith lyric according to the number of sequence elements in the human voice base frequency matching subsequence whose difference from the corresponding sequence element in the reference base frequency matching subsequence is equal to the preset matching value, and the total number of sequence elements in the human voice base frequency matching subsequence.
For example, when the preset matching value is 12, the differences between the second sequence element A and the fourth sequence element G in the human voice base frequency matching subsequence "_ACGC" corresponding to the 2nd lyric and the second sequence element a and the fourth sequence element g in the reference base frequency matching subsequence "catg_" are equal to the preset matching value 12, so the number of sequence elements in the human voice base frequency matching subsequence whose difference from the corresponding sequence element in the reference base frequency matching subsequence equals the preset matching value is 2. The ratio between this number 2 and the total number of sequence elements in the human voice base frequency matching subsequence is then calculated. Here, in order to compensate for the situation that the base frequency values of some transition parts in the human voice base frequency subsequence cannot match the base frequency values in the reference base frequency subsequence, 95% of the total number of sequence elements of the human voice base frequency matching subsequence, namely 4.75, is used as the denominator when calculating the ratio, so that the ratio of 2 to 4.75, approximately 0.42, is obtained, i.e., the melody matching degree of the 2nd lyric is approximately 0.42 when the preset matching value is 12. In the same manner, the melody matching degrees of the 2nd lyric corresponding to the preset matching values 7, 5, 0, -12, -7 and -5 are obtained respectively, so that the melody matching degrees of each lyric in the singing audio corresponding to the preset matching values ±12, ±7, ±5 and 0 are obtained respectively. The candidate singing scores of the singing audio corresponding to the preset matching values 12, 7, 5, 0, -12, -7 and -5 are obtained by calculating the product of the sum of the melody matching degrees of all lyrics in the singing audio and 100, and the highest score among the candidate singing scores of the singing audio corresponding to the different preset matching values is determined as the singing score of the singing audio, namely the classification result.
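A sketch of the melody matching degree and candidate score computation in this example follows; the 95% compensation factor and the "sum of matching degrees times 100" combination are taken from the example above, while the function names and the dictionary layout of the per-value results are assumptions made for illustration.

def melody_matching_degree(voice_match, ref_match, match_value):
    # Count positions where both elements are present and their
    # difference equals the preset matching value.
    matched = sum(1 for v, r in zip(voice_match, ref_match)
                  if v is not None and r is not None and v - r == match_value)
    # Divide by 95% of the subsequence length to compensate for transition
    # parts whose base frequency cannot match the reference.
    return matched / (0.95 * len(voice_match))

def singing_score(degrees_per_value):
    # degrees_per_value maps each preset matching value to the list of
    # per-lyric melody matching degrees; the best candidate score wins.
    candidates = [100 * sum(degrees) for degrees in degrees_per_value.values()]
    return max(candidates)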
In the embodiment of the application, the server matches the human voice base frequency subsequence and the reference base frequency subsequence of each lyric in the singing audio by a dynamic programming method to obtain the human voice base frequency matching subsequence and the reference base frequency matching subsequence corresponding to each lyric, calculates the singing score of each lyric according to the human voice base frequency matching subsequence and the reference base frequency matching subsequence corresponding to that lyric, and calculates the sum of the singing scores of all lyrics in the singing audio to obtain the classification result of the singing audio, namely the singing score. This adapts to the situation that the singer slightly changes the rhythm within a sentence due to fluctuation of mood. In addition, the human voice base frequency subsequence and the reference base frequency subsequence can be automatically aligned while they are matched by the dynamic programming method, so that the problem that the human voice base frequency sequence is not aligned with the reference base frequency sequence is solved, thereby improving the accuracy of the singing audio classification result and the willingness of users to release their audio works.
Please refer to fig. 3, which is a flowchart illustrating another singing audio classification method according to an embodiment of the present application. As shown in fig. 3, this method embodiment includes the steps of:
S201, acquiring a voice base frequency sequence of the singing audio, and aligning the voice base frequency sequence with a reference base frequency sequence of the singing audio to obtain an initial voice base frequency sequence.
Here, for a specific implementation manner of obtaining the vocal fundamental frequency sequence of the singing audio in step S201, reference may be made to the description of obtaining the vocal fundamental frequency sequence of the singing audio by the server in step S101 in the embodiment corresponding to fig. 1 and processing the vocal audio signal of the singing audio by using a wavelet transform algorithm, which is not described herein again.
Then, the server aligns the voice base frequency sequence with the reference base frequency sequence of the singing audio to obtain an initial voice base frequency sequence, which comprises the following steps:
and calculating a time difference value between the starting time corresponding to the second non-zero sequence element and the starting time corresponding to the first non-zero sequence element in the voice base frequency sequence, and aligning the starting time corresponding to the first non-zero sequence element with the starting time of the first word in the reference base frequency sequence if the time difference value is smaller than a preset time threshold value to obtain the initial voice base frequency sequence.
For example, the server calculates that the time difference between the start time 0 minutes 5.02 seconds corresponding to the 2nd non-zero sequence element in the human voice base frequency sequence and the start time 0 minutes 5 seconds corresponding to the 1st non-zero sequence element is 20 milliseconds, which is less than the preset time threshold of 40 milliseconds, and aligns the start time 0 minutes 5 seconds corresponding to the 1st non-zero sequence element with the start time 0 minutes 6 seconds of the first word in the reference base frequency sequence of the singing audio, that is, the time corresponding to each sequence element in the human voice base frequency sequence is shifted backwards by 1 second, to obtain the initial human voice base frequency sequence.
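A minimal sketch of this alignment step follows, assuming the human voice base frequency sequence is available as a list of (start time in seconds, base frequency value) pairs; the representation and the function name are assumptions made for illustration.

def align_to_reference(voice_seq, first_word_start, time_threshold=0.04):
    # Find the start times of the first two non-zero sequence elements.
    nonzero_times = [t for t, value in voice_seq if value != 0]
    if len(nonzero_times) >= 2 and nonzero_times[1] - nonzero_times[0] < time_threshold:
        # Shift every element so the first non-zero element starts at the
        # start time of the first word in the reference base frequency sequence.
        offset = first_word_start - nonzero_times[0]
        return [(t + offset, value) for t, value in voice_seq]
    return voice_seq

With the numbers of the example, first_word_start = 6.0 and a first non-zero element at 5.0 seconds give an offset of 1 second.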
S202, obtaining an initial human voice base frequency subsequence corresponding to the jth word in the ith lyric of the singing audio from the initial human voice base frequency sequence.
Specifically, the server extracts a plurality of sequence elements within a time period of the start time and the end time corresponding to each word from the initial voice base frequency sequence according to the detected start time and end time corresponding to each word in the initial voice base frequency sequence, so as to obtain an initial voice base frequency subsequence corresponding to each word.
S203, obtaining an initial reference base frequency subsequence corresponding to the jth word in the ith lyric of the singing audio from the initial reference base frequency sequence of the singing audio.
Here, before the server executes step S203, a specific implementation manner of obtaining the initial reference fundamental frequency sequence of the singing audio may refer to that in the embodiment corresponding to fig. 1, the server processes the audio file of the singing audio by using a BP neural network algorithm in step S102, and obtains a description of the reference fundamental frequency sequence of the singing audio, which is not described herein again.
Specifically, the server extracts a plurality of sequence elements within a time period of the start time and the end time corresponding to each word from the initial reference baseband frequency sequence according to the start time and the end time corresponding to each word in the initial reference baseband frequency sequence, so as to obtain an initial reference baseband frequency subsequence corresponding to each word.
S204, calculating a matching matrix between an initial voice base frequency subsequence corresponding to the jth word in the ith lyric of the singing audio and an initial reference base frequency subsequence, and obtaining a voice base frequency matching subsequence and a reference base frequency matching subsequence corresponding to the jth word in the ith lyric according to the matching matrix.
Here, a specific implementation manner of step S204 may refer to that in step S103 in the embodiment corresponding to fig. 1, the server calculates a matching matrix between the human voice fundamental frequency subsequence corresponding to the ith lyric in the singing audio and the reference fundamental frequency subsequence, and obtains descriptions of the human voice fundamental frequency matching subsequence corresponding to the ith lyric and the reference fundamental frequency matching subsequence according to the matching matrix, which is not described herein again.
S205, calculating the melody matching degree of the jth character in the ith sentence of lyrics of the singing audio according to the human voice fundamental frequency matching sub-sequence and the reference fundamental frequency matching sub-sequence, and obtaining the classification result of the singing audio based on the melody matching degree of each character in the singing audio.
Specifically, the server calculates the ratio of the number of sequence elements in the human voice base frequency matching subsequence whose difference from the corresponding sequence element in the reference base frequency matching subsequence is equal to the preset matching value (any one of 12, 7, 5, 0, -12, -7, and -5) to the total number of sequence elements in the human voice base frequency matching subsequence, and obtains the melody matching degree of the jth word in the ith lyric of the singing audio according to the ratio. In the above manner, the melody matching degrees of each word in the singing audio corresponding to the preset matching values 12, 7, 5, 0, -12, -7, and -5 are obtained respectively, and the sum of the melody matching degrees of all words in the singing audio corresponding to each preset matching value is calculated, so that the candidate singing scores of the singing audio corresponding to the preset matching values 12, 7, 5, 0, -12, -7, and -5 are obtained respectively, and the highest score among the candidate singing scores corresponding to the different preset matching values is determined as the singing score of the singing audio, namely the classification result.
In the embodiment of the application, the server aligns the human voice base frequency sequence of the singing audio with the reference base frequency sequence to obtain an initial human voice base frequency sequence, and extracts the initial human voice base frequency subsequence of each word in the singing audio from the initial human voice base frequency sequence. The initial human voice base frequency subsequence of each word is matched with the corresponding initial reference base frequency subsequence under different preset matching values respectively, so as to obtain a plurality of human voice base frequency matching subsequences and a plurality of reference base frequency matching subsequences corresponding to each word in the singing audio under the different preset matching values. A plurality of singing scores of each word corresponding to the different preset matching values are then calculated, the sum of the singing scores of all words in the singing audio is calculated to obtain candidate singing scores of the singing audio corresponding to the different preset matching values, and the highest score among the candidate singing scores is determined as the classification result of the singing audio. Because the initial human voice base frequency subsequence corresponding to each word is matched with the initial reference base frequency subsequence by a dynamic programming method, automatic alignment is realized, the obtained classification result of the singing audio is more accurate, and the willingness of users to release their audio works is further improved.
Please refer to fig. 4, which provides a schematic structural diagram of a singing audio classification apparatus according to an embodiment of the present application. As shown in fig. 4, the singing audio classification apparatus includes a first acquisition unit 401, a second acquisition unit 402, a mapping relationship determination unit 403, a classification result determination unit 404, a wavelet decomposition unit 405, a decomposition number determination unit 406, a calculation unit 407, and a third acquisition unit 408.
A first obtaining unit 401, configured to obtain, from a voice base frequency sequence of a singing audio, a voice base frequency subsequence corresponding to an ith lyric in the singing audio, where the voice base frequency sequence includes multiple voice base frequency subsequences;
a second obtaining unit 402, configured to obtain a reference baseband subsequence corresponding to an ith lyric in the singing audio from the reference baseband sequence of the singing audio, where the reference baseband sequence includes multiple reference baseband subsequences, and each reference baseband subsequence in the multiple reference baseband subsequences corresponds to each lyric in the singing audio one to one;
a mapping relation determining unit 403, configured to determine, based on the voice fundamental frequency sub-sequence and the reference fundamental frequency sub-sequence corresponding to the ith lyric in the singing audio, an optimal matching mapping relation between the voice fundamental frequency sub-sequence and the reference fundamental frequency sub-sequence, where the optimal matching mapping relation includes a voice fundamental frequency matching sub-sequence and a reference fundamental frequency matching sub-sequence;
and the classification result determining unit 404 is configured to determine a melody matching degree of the ith lyric based on the human voice fundamental frequency matching subsequence and the reference fundamental frequency matching subsequence, and obtain a singing score of the singing audio based on the melody matching degree of each lyric.
Optionally, the apparatus further comprises:
a wavelet decomposition unit 405, configured to perform framing on a vocal audio signal of the singing audio to obtain at least one audio frame, perform wavelet decomposition on each audio frame in the at least one audio frame to obtain a plurality of wavelet decomposition signals corresponding to each audio frame, where each wavelet decomposition signal includes a wavelet high-frequency decomposition signal and a wavelet low-frequency decomposition signal of a plurality of audio sampling points;
a decomposition number determining unit 406, configured to determine a target wavelet decomposition number according to a maximum value in the amplitudes of the wavelet low-frequency decomposition signals corresponding to each two adjacent wavelet decompositions of the at least one audio frame;
and the calculating unit 407 is configured to calculate a vocal fundamental frequency sequence of the singing audio according to the period of the wavelet high-frequency decomposition signal under the target wavelet decomposition frequency.
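As a rough sketch of the kind of per-frame processing these three units describe, the following code uses the PyWavelets package; the rule used here to pick the target decomposition count (the level whose low-frequency decomposition signal has the largest amplitude) and the autocorrelation-based period estimate are assumptions, since only the outline of the procedure is given here.

import numpy as np
import pywt

def frame_fundamental_frequency(frame, sr, wavelet="db4", max_level=6):
    # Repeated single-level wavelet decompositions of one audio frame.
    approx, details, amplitudes = np.asarray(frame, dtype=float), [], []
    for _ in range(max_level):
        approx, detail = pywt.dwt(approx, wavelet)
        details.append(detail)
        amplitudes.append(np.max(np.abs(approx)))
    # Assumed rule: the target number of decompositions is the level with
    # the largest low-frequency (approximation) amplitude.
    level = int(np.argmax(amplitudes)) + 1
    detail = details[level - 1]
    sub_sr = sr / (2 ** level)  # the detail signal is downsampled per level
    # Estimate the period of the high-frequency decomposition signal by
    # autocorrelation, restricted to a plausible vocal pitch range.
    min_lag = max(1, int(sub_sr / 500))
    max_lag = min(len(detail) - 1, max(min_lag + 1, int(sub_sr / 70)))
    ac = np.correlate(detail, detail, mode="full")[len(detail) - 1:]
    lag = min_lag + int(np.argmax(ac[min_lag:max_lag + 1]))
    return sub_sr / lag  # fundamental frequency estimate for this frame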
Optionally, the first obtaining unit 401 is specifically configured to:
determining a plurality of lyric pause moments of the human voice base frequency sequence according to a time interval between starting moments corresponding to each two adjacent sequence elements in the human voice base frequency sequence and a first preset time threshold;
determining the starting time and the ending time of the nth lyric according to the plurality of lyric stopping times, wherein n is greater than or equal to 2 and is a positive integer;
determining the starting time of the 1 st lyric in the singing audio according to the time interval between the starting times corresponding to the first two non-zero sequence elements in the voice fundamental frequency sequence and a second preset time threshold;
and extracting a plurality of sequence elements between the starting time and the ending time of the ith lyric from the voice base frequency sequence to obtain a voice base frequency subsequence corresponding to the ith lyric in the singing audio.
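A simplified sketch of the splitting performed by this unit follows, again assuming the human voice base frequency sequence is a list of (start time, base frequency value) pairs; the sketch only covers the pause-based splitting, and the threshold value is illustrative.

def split_into_lyric_subsequences(voice_seq, pause_threshold=0.5):
    # A lyric pause moment is assumed wherever the interval between the
    # start times of two adjacent sequence elements exceeds the threshold.
    # Assumes a non-empty sequence.
    lyrics, current = [], []
    for prev, cur in zip(voice_seq, voice_seq[1:]):
        current.append(prev)
        if cur[0] - prev[0] > pause_threshold:
            lyrics.append(current)
            current = []
    current.append(voice_seq[-1])
    lyrics.append(current)
    return lyrics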
Optionally, the apparatus further comprises: a third acquisition unit 408.
The third obtaining unit 408 is configured to obtain a musical interval characteristic sequence and a track information entropy sequence of the singing audio by calculating according to the audio file of the singing audio;
inputting the characteristic sequence of the musical track interval and the characteristic sequence of the audio track information entropy of the singing audio into a main melody audio track classification model to obtain classification results of a plurality of audio tracks corresponding to the singing audio;
and determining a main melody track of the singing audio according to the classification results of the plurality of tracks, extracting data of the main melody track from an audio file of the singing audio to obtain a reference base frequency sequence of the singing audio, wherein the main melody track classification model is obtained by training a sample song set and a track label sequence corresponding to each song in the sample song set.
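The track-selection step can be sketched as follows, under fairly strong assumptions made only for illustration: that the audio file carries note-level track data readable with the pretty_midi package, that a single interval statistic and a pitch entropy per track stand in for the feature sequences, and that a trained classifier exposing a predict_proba method plays the role of the main melody track classification model.

import numpy as np
import pretty_midi

def track_feature_rows(path):
    # One (mean absolute interval, pitch entropy) row per track.
    midi = pretty_midi.PrettyMIDI(path)
    rows = []
    for inst in midi.instruments:
        pitches = np.array([note.pitch for note in inst.notes], dtype=int)
        intervals = np.abs(np.diff(pitches)) if pitches.size > 1 else np.zeros(1)
        counts = np.bincount(pitches, minlength=128) if pitches.size else np.ones(128)
        probs = counts / counts.sum()
        entropy = -np.sum(probs[probs > 0] * np.log2(probs[probs > 0]))
        rows.append((intervals.mean(), entropy))
    return np.array(rows)

def pick_main_melody_track(path, main_melody_model):
    scores = main_melody_model.predict_proba(track_feature_rows(path))[:, 1]
    return int(np.argmax(scores))  # index of the assumed main melody track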
Optionally, the mapping relationship determining unit 403 is specifically configured to:
calculating to obtain a matching matrix A according to the human voice base frequency sub-sequence, the reference base frequency sub-sequence and a preset matching value, wherein the preset matching value is used for representing a melody matching standard between the human voice base frequency sub-sequence and the reference base frequency sub-sequence, the matching matrix A comprises (j +1) × (k +1) matrix elements, j is the number of the sequence elements of the human voice base frequency sub-sequence, and k is the number of the sequence elements of the reference base frequency sub-sequence;
determining a matrix element A (j +1, k +1) in the matching matrix A as a first target element, and determining a path element of the first target element from a plurality of selectable path elements corresponding to the first target element in the matching matrix A;
determining a path element of the first target element as a second target element, and determining a path element of the second target element from a plurality of selectable path elements corresponding to the second target element in the matching matrix A until a path element of an mth target element is a matrix element A (1,1) in the matching matrix A;
determining an optimal matching path according to the first target element and at least one path element;
and determining the human voice base frequency matching subsequence and the reference base frequency matching subsequence according to the position relationship between the path direction between every two adjacent matrix elements in the optimal matching path and the sequence direction of the human voice base frequency subsequence in the matching matrix A and the sequence direction of the reference base frequency subsequence in the matching matrix A.
Optionally, the human voice base frequency matching sub-sequences correspond to the reference base frequency matching sub-sequences one to one;
the classification result determining unit 404 is specifically configured to: calculate the melody matching degree of the ith lyric according to the number of sequence elements in the human voice base frequency matching subsequence whose difference from the corresponding sequence element in the reference base frequency matching subsequence is equal to the preset matching value, and the total number of sequence elements in the human voice base frequency matching subsequence.
Optionally, the preset matching value is any one of 12, 7, 5, 0, -12, -7, and -5.
It is understood that the singing audio classification apparatus 400 is used for implementing the steps performed by the server in the embodiments of fig. 1 and 3. For specific implementation and corresponding advantageous effects of the functional blocks included in the singing audio classification apparatus 400 of fig. 4, reference may be made to the detailed description of the embodiments of fig. 1 and fig. 3, which is not repeated herein.
The singing audio classification apparatus 400 in the embodiment shown in fig. 4 can be implemented by the server 500 shown in fig. 5. Please refer to fig. 5, which provides a schematic structural diagram of a server according to an embodiment of the present application. As shown in fig. 5, the server 500 may include: one or more processors 501 and memory 502. The processor 501 and the memory 502 are connected by a bus 503. The memory 502 is used for storing a computer program, and the computer program includes program instructions; the processor 501 is configured to execute the program instructions stored in the memory 502, and perform the following operations:
acquiring a voice base frequency subsequence corresponding to the ith lyric in the singing audio from the voice base frequency sequence of the singing audio, wherein the voice base frequency sequence comprises a plurality of voice base frequency subsequences;
acquiring a reference base frequency subsequence corresponding to the ith lyric in the singing audio from the reference base frequency sequence of the singing audio, wherein the reference base frequency sequence comprises a plurality of reference base frequency subsequences, and each reference base frequency subsequence in the plurality of reference base frequency subsequences corresponds to each lyric in the singing audio one by one;
determining an optimal matching mapping relation between the voice base frequency subsequence and the reference base frequency subsequence based on the voice base frequency subsequence and the reference base frequency subsequence corresponding to the ith lyric in the singing audio, wherein the optimal matching mapping relation comprises a voice base frequency matching subsequence and a reference base frequency matching subsequence;
and determining the melody matching degree of the ith lyric based on the human voice base frequency matching subsequence and the reference base frequency matching subsequence, and obtaining the classification result of the singing audio based on the melody matching degree of each lyric.

Optionally, before the processor 501 obtains the voice base frequency subsequence corresponding to the ith lyric in the singing audio from the voice base frequency sequence of the singing audio, the following operations are specifically performed:
framing the voice audio signal of the singing audio to obtain at least one audio frame, and performing wavelet decomposition on each audio frame in the at least one audio frame to obtain a plurality of wavelet decomposition signals corresponding to each audio frame, wherein each wavelet decomposition signal comprises wavelet high-frequency decomposition signals and wavelet low-frequency decomposition signals of a plurality of audio sampling points;
determining the target wavelet decomposition times according to the maximum value in the amplitude values of the wavelet low-frequency decomposition signals corresponding to every two adjacent wavelet decompositions of the at least one audio frame;
and calculating to obtain the human voice base frequency sequence of the singing audio according to the period of the wavelet high-frequency decomposition signal under the target wavelet decomposition times.
Optionally, the processor 501 obtains a voice base frequency subsequence corresponding to the ith lyric in the singing audio from the voice base frequency sequence of the singing audio, and specifically performs the following operations:
determining a plurality of lyric pause moments of the human voice base frequency sequence according to a time interval between starting moments corresponding to each two adjacent sequence elements in the human voice base frequency sequence and a first preset time threshold;
determining the starting time and the ending time of the nth lyric according to the plurality of lyric stopping times, wherein n is greater than or equal to 2 and is a positive integer;
determining the starting time of the 1 st lyric in the singing audio according to the time interval between the starting times corresponding to the first two non-zero sequence elements in the voice fundamental frequency sequence and a second preset time threshold;
and extracting a plurality of sequence elements between the starting time and the ending time of the ith lyric from the voice base frequency sequence to obtain a voice base frequency subsequence corresponding to the ith lyric in the singing audio.
Optionally, before the processor 501 obtains the reference fundamental frequency subsequence corresponding to the ith lyric in the singing audio from the reference fundamental frequency sequence of the singing audio, the following operations are specifically performed:
calculating to obtain a musical interval characteristic sequence and a track information entropy sequence of the singing audio according to the audio file of the singing audio;
inputting the characteristic sequence of the musical track interval and the characteristic sequence of the audio track information entropy of the singing audio into a main melody audio track classification model to obtain classification results of a plurality of audio tracks corresponding to the singing audio;
and determining a main melody track of the singing audio according to the classification results of the plurality of tracks, extracting data of the main melody track from an audio file of the singing audio to obtain a reference base frequency sequence of the singing audio, wherein the main melody track classification model is obtained by training a sample song set and a track label sequence corresponding to each song in the sample song set.
Optionally, the processor 501 determines an optimal matching mapping relationship between the voice base frequency sub-sequence and the reference base frequency sub-sequence based on the voice base frequency sub-sequence and the reference base frequency sub-sequence corresponding to the ith lyric in the singing audio, and specifically performs the following operations:
calculating to obtain a matching matrix A according to the human voice base frequency sub-sequence, the reference base frequency sub-sequence and a preset matching value, wherein the preset matching value is used for representing a melody matching standard between the human voice base frequency sub-sequence and the reference base frequency sub-sequence, the matching matrix A comprises (j +1) × (k +1) matrix elements, j is the number of the sequence elements of the human voice base frequency sub-sequence, and k is the number of the sequence elements of the reference base frequency sub-sequence;
determining a matrix element A (j +1, k +1) in the matching matrix A as a first target element, and determining a path element of the first target element from a plurality of selectable path elements corresponding to the first target element in the matching matrix A;
determining a path element of the first target element as a second target element, and determining a path element of the second target element from a plurality of selectable path elements corresponding to the second target element in the matching matrix A until a path element of an mth target element is a matrix element A (1,1) in the matching matrix A;
determining an optimal matching path according to the first target element and at least one path element;
and determining the human voice base frequency matching subsequence and the reference base frequency matching subsequence according to the position relationship between the path direction between every two adjacent matrix elements in the optimal matching path and the sequence direction of the human voice base frequency subsequence in the matching matrix A and the sequence direction of the reference base frequency subsequence in the matching matrix A.
The human voice base frequency matching subsequence corresponds to the reference base frequency matching subsequence one by one;
optionally, the processor 501 determines the melody matching degree of the lyric of the ith sentence based on the human voice fundamental frequency matching subsequence and the reference fundamental frequency matching subsequence, and specifically performs the following operations:
and calculating the melody matching degree of the ith lyric according to the number of sequence elements in the human voice base frequency matching subsequence whose difference from the corresponding sequence element in the reference base frequency matching subsequence is equal to the preset matching value, and the total number of sequence elements in the human voice base frequency matching subsequence.
Optionally, the preset matching value is any one of 12, 7, 5, 0, -12, -7, and -5.
In the embodiment of the present application, a computer storage medium may be provided, which may be used to store computer software instructions for the singing audio classification apparatus in the embodiment shown in fig. 4, and which includes a program designed for executing the singing audio classification apparatus in the embodiment described above. The storage medium includes, but is not limited to, flash memory, hard disk, solid state disk.
Also provided in an embodiment of the present application is a computer program product comprising computer instructions stored in a computer readable storage medium; the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that when the computer program product or the computer program is executed by the computer device, the singing audio classification apparatus designed in the embodiment shown in fig. 4 can be executed.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
In the present application, "A and/or B" means one of the following cases: A, B, or A and B. "At least one of ..." refers to any combination of the listed items or any number of the listed items, e.g., "at least one of A, B and C" refers to any one of seven cases: A, B, C, A and B, B and C, A and C, or A, B and C.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure is only for the purpose of illustrating the preferred embodiments of the present application and is not to be construed as limiting the scope of the present application, so that the present application is not limited thereto, and all equivalent variations and modifications can be made to the present application.

Claims (10)

1. A method for singing audio classification, comprising:
acquiring a voice base frequency subsequence corresponding to the ith lyric in the singing audio from the voice base frequency sequence of the singing audio, wherein the voice base frequency sequence comprises a plurality of voice base frequency subsequences;
acquiring a reference base frequency subsequence corresponding to the ith lyric in the singing audio from the reference base frequency sequence of the singing audio, wherein the reference base frequency sequence comprises a plurality of reference base frequency subsequences, and each reference base frequency subsequence in the plurality of reference base frequency subsequences corresponds to each lyric in the singing audio one by one;
determining an optimal matching mapping relation between the voice base frequency subsequence and the reference base frequency subsequence based on the voice base frequency subsequence and the reference base frequency subsequence corresponding to the ith lyric in the singing audio, wherein the optimal matching mapping relation comprises a voice base frequency matching subsequence and a reference base frequency matching subsequence;
and determining the melody matching degree of the ith lyric based on the human voice base frequency matching subsequence and the reference base frequency matching subsequence, and obtaining the classification result of the singing audio based on the melody matching degree of each lyric.
2. The method of claim 1, wherein before the obtaining of the sub-sequence of fundamental human voice frequency corresponding to the ith lyric in the singing audio from the sequence of fundamental human voice frequency in the singing audio, the method comprises:
framing the voice audio signal of the singing audio to obtain at least one audio frame, and performing wavelet decomposition on each audio frame in the at least one audio frame to obtain a plurality of wavelet decomposition signals corresponding to each audio frame, wherein each wavelet decomposition signal comprises wavelet high-frequency decomposition signals and wavelet low-frequency decomposition signals of a plurality of audio sampling points;
determining the target wavelet decomposition times according to the maximum value in the amplitude values of the wavelet low-frequency decomposition signals corresponding to every two adjacent wavelet decompositions of the at least one audio frame;
and calculating to obtain the human voice base frequency sequence of the singing audio according to the period of the wavelet high-frequency decomposition signal under the target wavelet decomposition times.
3. The method of claim 1, wherein the obtaining of the sub-sequence of fundamental human voice frequency corresponding to the ith lyric in the singing audio from the sequence of fundamental human voice frequency in the singing audio comprises:
determining a plurality of lyric pause moments of the human voice base frequency sequence according to a time interval between starting moments corresponding to each two adjacent sequence elements in the human voice base frequency sequence and a first preset time threshold;
determining the starting time and the ending time of the nth lyric according to the plurality of lyric stopping times, wherein n is greater than or equal to 2 and is a positive integer;
determining the starting time of the 1 st lyric in the singing audio according to the time interval between the starting times corresponding to the first two non-zero sequence elements in the voice fundamental frequency sequence and a second preset time threshold;
and extracting a plurality of sequence elements between the starting time and the ending time of the ith lyric from the voice base frequency sequence to obtain a voice base frequency subsequence corresponding to the ith lyric in the singing audio.
4. The method of claim 1, wherein before the obtaining the reference fundamental frequency subsequence corresponding to the ith lyric in the singing audio from the reference fundamental frequency sequence of the singing audio, the method comprises:
calculating to obtain a musical interval characteristic sequence and a track information entropy sequence of the singing audio according to the audio file of the singing audio;
inputting the characteristic sequence of the musical track interval and the characteristic sequence of the audio track information entropy of the singing audio into a main melody audio track classification model to obtain classification results of a plurality of audio tracks corresponding to the singing audio;
and determining a main melody track of the singing audio according to the classification results of the plurality of tracks, extracting data of the main melody track from an audio file of the singing audio to obtain a reference base frequency sequence of the singing audio, wherein the main melody track classification model is obtained by training a sample song set and a track label sequence corresponding to each song in the sample song set.
5. The method of claim 1, wherein the determining the optimal matching mapping relationship between the sub-sequence of fundamental human voice frequencies and the sub-sequence of reference fundamental voice frequencies based on the sub-sequence of fundamental human voice frequencies and the sub-sequence of reference fundamental voice frequencies corresponding to the ith lyric in the singing audio comprises:
calculating to obtain a matching matrix A according to the human voice base frequency sub-sequence, the reference base frequency sub-sequence and a preset matching value, wherein the preset matching value is used for representing a melody matching standard between the human voice base frequency sub-sequence and the reference base frequency sub-sequence, the matching matrix A comprises (j +1) × (k +1) matrix elements, j is the number of the sequence elements of the human voice base frequency sub-sequence, and k is the number of the sequence elements of the reference base frequency sub-sequence;
determining a matrix element A (j +1, k +1) in the matching matrix A as a first target element, and determining a path element of the first target element from a plurality of selectable path elements corresponding to the first target element in the matching matrix A;
determining a path element of the first target element as a second target element, and determining a path element of the second target element from a plurality of selectable path elements corresponding to the second target element in the matching matrix A until a path element of an mth target element is a matrix element A (1,1) in the matching matrix A;
determining an optimal matching path according to the first target element and at least one path element;
and determining the human voice base frequency matching subsequence and the reference base frequency matching subsequence according to the position relationship between the path direction between every two adjacent matrix elements in the optimal matching path and the sequence direction of the human voice base frequency subsequence in the matching matrix A and the sequence direction of the reference base frequency subsequence in the matching matrix A.
6. The method according to claim 1 or 5, wherein the human voice fundamental frequency matching sub-sequence and the reference fundamental frequency matching sub-sequence are in one-to-one correspondence;
the determining the melody matching degree of the ith lyric based on the human voice base frequency matching subsequence and the reference base frequency matching subsequence comprises:
and calculating the melody matching degree of the ith lyric according to the number of sequence elements with preset matching values of the difference value between the sequence elements in the human voice base frequency matching subsequence and the corresponding sequence elements of the sequence elements in the reference base frequency subsequence and the number of the sequence elements in the human voice base frequency matching subsequence.
7. The method according to claim 5 or 6, wherein the preset matching value is any one of 12, 7, 5, 0, -12, -7 and -5.
8. A computer program product, characterized in that the computer program product comprises computer instructions, the computer instructions being stored in a computer readable storage medium; a processor of a computer device reads the computer instructions from the computer-readable storage medium, the processor executing the computer instructions causing the computer device to perform the method of singing audio classification of any of claims 1-7.
9. A server, comprising a processor and a memory, the processor and the memory being interconnected, wherein the memory is configured to store program code, and wherein the processor is configured to invoke the program code to perform the method of singing audio classification of any of claims 1-7.
10. A storage medium, characterized in that the storage medium stores a computer program comprising program instructions; the program instructions, when executed by a processor, cause the processor to perform the method of singing audio classification of any one of claims 1-7.
CN202010614700.5A 2020-06-30 2020-06-30 Singing audio classification method, computer program product, server and storage medium Active CN111782864B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010614700.5A CN111782864B (en) 2020-06-30 2020-06-30 Singing audio classification method, computer program product, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010614700.5A CN111782864B (en) 2020-06-30 2020-06-30 Singing audio classification method, computer program product, server and storage medium

Publications (2)

Publication Number Publication Date
CN111782864A true CN111782864A (en) 2020-10-16
CN111782864B CN111782864B (en) 2023-11-07

Family

ID=72761292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010614700.5A Active CN111782864B (en) 2020-06-30 2020-06-30 Singing audio classification method, computer program product, server and storage medium

Country Status (1)

Country Link
CN (1) CN111782864B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070131094A1 (en) * 2005-11-09 2007-06-14 Sony Deutschland Gmbh Music information retrieval using a 3d search algorithm
WO2018200268A1 (en) * 2017-04-26 2018-11-01 Microsoft Technology Licensing, Llc Automatic song generation
CN111326171A (en) * 2020-01-19 2020-06-23 成都嗨翻屋科技有限公司 Human voice melody extraction method and system based on numbered musical notation recognition and fundamental frequency extraction

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
鲁帆; 王民: "A Query-by-Humming Music Retrieval System" (一个基于哼唱的音乐检索系统), 中国西部科技, no. 04 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597331A (en) * 2020-12-25 2021-04-02 腾讯音乐娱乐科技(深圳)有限公司 Method, device, equipment and storage medium for displaying range matching information

Also Published As

Publication number Publication date
CN111782864B (en) 2023-11-07

Similar Documents

Publication Publication Date Title
US6633845B1 (en) Music summarization system and method
US7035742B2 (en) Apparatus and method for characterizing an information signal
CA1337728C (en) Method for automatically transcribing music and apparatus therefore
Molina et al. Evaluation framework for automatic singing transcription
US20120132056A1 (en) Method and apparatus for melody recognition
CN104978962A (en) Query by humming method and system
Mesaros et al. Automatic alignment of music audio and lyrics
JP3776673B2 (en) Music information analysis apparatus, music information analysis method, and recording medium recording music information analysis program
CN105895079B (en) Voice data processing method and device
CN111782864A (en) Singing audio classification method, computer program product, server and storage medium
Putri et al. Music information retrieval using Query-by-humming based on the dynamic time warping
CN112837698A (en) Singing or playing evaluation method and device and computer readable storage medium
CN105244021B (en) Conversion method of the humming melody to MIDI melody
Konev et al. The program complex for vocal recognition
KR970009939B1 (en) Method for transcribing music and apparatus therefor
Every Discriminating between pitched sources in music audio
Sinith et al. Pattern recognition in South Indian classical music using a hybrid of HMM and DTW
Jaczyńska et al. Music recognition algorithms using queries by example
JP2010054535A (en) Chord name detector and computer program for chord name detection
Desblancs Self-supervised beat tracking in musical signals with polyphonic contrastive learning
CN111383620B (en) Audio correction method, device, equipment and storage medium
KR101302568B1 (en) Fast music information retrieval system based on query by humming and method thereof
US11749237B1 (en) System and method for generation of musical notation from audio signal
CN115171729B (en) Audio quality determination method and device, electronic equipment and storage medium
KR101481060B1 (en) Device and method for automatic Pansori transcription

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant