CN113516971B - Lyric conversion point detection method, device, computer equipment and storage medium - Google Patents


Info

Publication number
CN113516971B
CN113516971B (application CN202110775920.0A)
Authority
CN
China
Prior art keywords
audio data
target
target audio
waveform
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110775920.0A
Other languages
Chinese (zh)
Other versions
CN113516971A (en)
Inventor
萧博耀
高旋
Current Assignee
Shenzhen Wondershare Software Co Ltd
Original Assignee
Shenzhen Wondershare Software Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Wondershare Software Co Ltd filed Critical Shenzhen Wondershare Software Co Ltd
Priority to CN202110775920.0A priority Critical patent/CN113516971B/en
Publication of CN113516971A publication Critical patent/CN113516971A/en
Application granted granted Critical
Publication of CN113516971B publication Critical patent/CN113516971B/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/04 - Segmentation; Word boundary detection
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/81 - Detection of presence or absence of voice signals for discriminating voice from music
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Abstract

The embodiments of this application disclose a lyric conversion point detection method, a lyric conversion point detection device, computer equipment, and a storage medium, relating to the technical field of audio processing. The method comprises the following steps: acquiring target audio data; detecting the target audio data to obtain its beat; performing human voice separation on the target audio data to obtain human voice data; calculating the amplitude of the human voice data to obtain a human voice energy waveform; preprocessing the human voice energy waveform to obtain a target waveform; and detecting the target waveform according to the beat of the target audio data and a preset conversion condition to determine the lyric conversion point. The method enables machine equipment to distinguish music from human voice effectively: by checking the processed human voice data against the beat of the target audio data and the preset conversion condition, the lyric conversion point is determined accurately, which greatly improves the accuracy and efficiency of locating lyric conversion points.

Description

Lyric conversion point detection method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to a method and apparatus for detecting a lyric conversion point, a computer device, and a storage medium.
Background
The beat-sync video ("click video") is a recent function of audio/video editing software: the user adds static and dynamic images and selects a piece of music, and the software automatically generates a video whose transition and rendering time points have a specifically designed relationship to the selected music, for example falling on its drum points, downbeats, and special-effect points. As a result, the automatically generated video looks as coherent as one the user had spent a great deal of time elaborating by hand.
Based on the demand for beat-sync videos, common music video editing results can be generalized: besides traditional feature points such as the downbeats and drum points of the music, the lyric conversion point (the time point at which the instrumental part of a song ends and the voice starts singing) is also very suitable as a transition or rendering time point.
However, detecting the human voice in music has long been a difficult and challenging problem in the field of MIR (Music Information Retrieval). A song contains both music and human voice, and their frequency spectra overlap and interfere with each other. Although the human ear can easily pick out the voice in music that contains it, machine equipment such as a computer cannot effectively tell music and human voice apart. In the prior art, lyric conversion points are mainly located manually, and the accuracy and efficiency of this approach are low.
Disclosure of Invention
The embodiment of the application provides a lyric conversion point detection method, a lyric conversion point detection device, computer equipment and a storage medium, and aims to solve the problems of low accuracy and low efficiency of positioning a lyric conversion point in the existing manual mode.
In a first aspect, an embodiment of the present application provides a lyric conversion point detection method, where the lyric conversion point detection method includes:
acquiring target audio data; detecting the target audio data to obtain beats of the target audio data; performing voice separation processing on the target audio data to obtain voice data; calculating the amplitude of the voice data to obtain a voice energy waveform; preprocessing the human voice energy waveform to obtain a target waveform; and detecting the target waveform according to the beat of the target audio data and preset conversion conditions to determine the conversion point of the lyrics.
In a second aspect, an embodiment of the present application further provides a lyric conversion point detection device, where the device includes:
an acquisition unit configured to acquire target audio data;
the detection unit is used for detecting the target audio data to obtain beats of the target audio data;
the separation unit is used for carrying out voice separation processing on the target audio data to obtain voice data;
the computing unit is used for computing the amplitude of the voice data to obtain a voice energy waveform;
the preprocessing unit is used for preprocessing the human voice energy waveform to obtain a target waveform;
and the determining unit is used for detecting the target waveform according to the beat of the target audio data and preset conversion conditions so as to determine the conversion point of the lyrics.
In a third aspect, an embodiment of the present application further provides a computer device, where the computer device includes a memory and a processor, where the memory stores a computer program, and the processor implements the method when executing the computer program.
In a fourth aspect, embodiments of the present application also provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above-described method.
The embodiments of this application provide a lyric conversion point detection method, a lyric conversion point detection device, computer equipment, and a storage medium, wherein the method comprises the following steps: acquiring target audio data; detecting the target audio data to obtain its beat; performing human voice separation on the target audio data to obtain human voice data; calculating the amplitude of the human voice data to obtain a human voice energy waveform; preprocessing the human voice energy waveform to obtain a target waveform; and detecting the target waveform according to the beat of the target audio data and a preset conversion condition to determine the lyric conversion point. The method enables machine equipment to distinguish music from human voice effectively: by checking the processed human voice data against the beat of the target audio data and the preset conversion condition, the lyric conversion point is determined accurately, which greatly improves the accuracy and efficiency of locating lyric conversion points.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a method for detecting a lyric conversion point according to an embodiment of the present application;
FIG. 2 is a schematic sub-flowchart of a lyric conversion point detection method according to an embodiment of the present application;
FIG. 3 is a schematic sub-flowchart of a lyric conversion point detection method according to an embodiment of the present application;
FIG. 4 is a schematic sub-flowchart of a lyric conversion point detection method according to an embodiment of the present application;
FIG. 5 is a schematic sub-flowchart of a lyric conversion point detection method according to an embodiment of the present application;
FIG. 6 is a schematic sub-flowchart of a lyric conversion point detection method according to an embodiment of the present application;
FIG. 7 is a schematic block diagram of a lyric conversion point detection device according to an embodiment of the present application;
FIG. 8 is a schematic block diagram of a computer device provided in an embodiment of the present application;
FIG. 9 is a waveform diagram of a target audio waveform and a waveform diagram of voice data separated therefrom in one embodiment;
FIG. 10 is a diagram of a voice data waveform and a voice energy waveform in an embodiment;
FIG. 11 is a diagram of intermediate waveforms and lyric conversion points processed in one embodiment.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be understood that the terms "comprising" and "including", when used in this specification and the appended claims, specify the presence of the stated features but do not preclude the presence or addition of other features. It should also be understood that the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted as "when," "once," "in response to a determination," or "in response to detection," depending on the context. Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, to mean "upon determining," "in response to determining," "upon detecting [the described condition or event]," or "in response to detecting [the described condition or event]."
Referring to fig. 1, fig. 1 is a flowchart illustrating a lyric conversion point detection method according to an embodiment of the application. As shown in FIG. 1, the method includes the following steps S1-S6.
S1, acquiring target audio data.
In specific implementation, target audio data is acquired. The target audio data comprises human voice data and background music data. In an embodiment, the target audio data may be in a common audio format such as mp3, wav, or ogg; the format of the target audio data is not particularly limited here.
It will be appreciated that, to achieve detection of the lyric conversion point, separation processing is required for the vocal data and the background music data in the target audio data, i.e. step S3.
S2, detecting the target audio data to obtain the beat of the target audio data.
In specific implementation, the target audio data is detected to obtain the beat of the target audio data. In one embodiment, the target audio data is input into a beat detection model for beat detection to obtain the beat of the target audio data.
The beat is the organization of fixed unit time values and accent patterns in a musical composition, and is also called the meter. Beats have two characteristics: periodicity and continuity. The periodicity of the beat is embodied in the beat structure, a rhythmic sequence that recurs periodically in the composition. Common meters are 1/4, 2/4, 3/4, 4/4, 3/8, 6/8, 7/8, 9/8, and 12/8, with the length of each bar fixed. The meter of a piece of music is fixed when the music is composed and does not change. Therefore, accurate beat detection helps improve the accuracy of lyric conversion point detection.
The length of a beat may differ between different target audio data, and even within the same target audio data the duration of a beat may differ between music passages. It must therefore be calculated in combination with the tempo in beats per minute (BPM): if the tempo is 120 BPM, one beat lasts 60/120 = 0.5 seconds; if the tempo is 80 BPM, one beat lasts 60/80 = 0.75 seconds; and so on.
It should be noted that, in an embodiment, the tempo (BPM) of the target audio data is also estimated: the target audio data is input into a music analysis module to obtain its tempo, from which the duration of each beat can be calculated; the duration of one beat is then used, together with the detected beat, to judge the lyric conversion point. The detected tempo of the target audio data in this embodiment is 108 beats per minute.
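The BPM-to-seconds arithmetic above can be sketched as a one-line helper (the function name is ours, for illustration only, not from the patent):

```python
def beat_seconds(bpm: float) -> float:
    """Duration of one beat in seconds for a given tempo in beats per minute."""
    return 60.0 / bpm

# 120 BPM -> 0.5 s per beat; 80 BPM -> 0.75 s per beat;
# the embodiment's 108 BPM -> roughly 0.556 s per beat.
```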
Referring to fig. 2, in an embodiment, the step S2 specifically includes: steps S201 to S202.
S201, extracting audio features of the target audio data to obtain the audio features of the target audio data.
In specific implementation, audio feature extraction is performed on the target audio data to obtain its audio features. In one embodiment, this may be implemented as follows: performing low-pass filtering on the target audio data to obtain a low-pass audio signal; framing the low-pass audio signal according to a preset frame shift and at least one frame length threshold to obtain at least one framed audio signal set, where different framed audio signal sets correspond to different frame length thresholds, each framed audio signal set comprises at least two sub-audio signals, and the frame length of each sub-audio signal equals the frame length threshold of the set to which it belongs; extracting features from each framed audio signal set to obtain the sub-audio features corresponding to each set; and splicing the sub-audio features corresponding to each framed audio signal set to obtain the audio features of the target audio data.
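The framing operation described above (a fixed frame shift, with the frame length set to a chosen threshold) can be sketched in NumPy as follows; `frame_signal` is a hypothetical helper for illustration, not code from the patent:

```python
import numpy as np

def frame_signal(signal: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    """Split a 1-D signal into overlapping frames of length frame_len,
    advancing by a frame shift of hop samples."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])
```

Running this with several frame-length thresholds yields one framed signal set per threshold, whose per-set features could then be concatenated as the patent describes.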
S202, detecting the beat of the audio features of the target audio data by using a beat detection model to obtain the beat of the target audio data.
In specific implementation, beat detection is performed on the audio features of the target audio data using a beat detection model to obtain the beat of the target audio data. In one embodiment, the beat detection model is trained based on training samples and the beat labels corresponding to those samples; the trained model then detects the beat from the audio features to obtain the beat of the target audio data.
Referring to fig. 3, the above step S202 specifically includes: steps S2021 to S2022.
And S2021, stacking the audio features of the target audio data to obtain output features.
In specific implementation, the audio features of the target audio data are stacked to obtain output features. After the processing unit performs the stacking, the resulting output features are time series data of the same length as the audio features of the target audio data.
S2022, inputting the output features into a classifier to obtain the tempo of the target audio data.
In specific implementation, the output features are input into a classifier, which maps the output feature of each frame to its time point along the time sequence, producing a beat detection result for each time point; these results constitute the beat of the target audio data.
In an embodiment, the beat detection model is trained based on training samples and beat labels corresponding to the training samples. In specific implementation, a training sample is obtained, and the training sample is provided with a corresponding beat label; extracting audio features of the training samples to obtain audio features of the training samples; calling a beat detection model to detect audio characteristics to obtain a prediction result; and carrying out optimization training on the beat detection model based on the beat label and the prediction result to obtain an optimized beat detection model.
S3, performing voice separation processing on the target audio data to obtain voice data.
In specific implementation, human voice separation is performed on the target audio data to obtain human voice data. In an embodiment, the target audio data is input into an audio separation tool to extract the human voice data. The application may use a track separator based on artificial intelligence as the audio separation tool, for example the interface provided by the open-source project Spleeter (an AI audio track separation tool released under the MIT license), to perform track separation on the target audio data and obtain its human voice data. This track separator is only one possible implementation; the audio separation tool used for the human voice separation is not particularly limited by the application.
As shown in fig. 9, where a curve W1 is a waveform of the target audio data, and a curve W2 is a waveform of the voice data separated from the target audio data.
And S4, calculating the amplitude of the voice data to obtain a voice energy waveform.
In specific implementation, the amplitude of the human voice data is calculated to obtain a human voice energy waveform. In one embodiment, the human voice energy waveform is obtained by calculating the dBFS (decibels relative to full scale) of the human voice data. The calculation formula is as follows:
value_dBFS = 20*log10(rms(signal)*sqrt(2)) = 20*log10(rms(signal)) + 3.0103
wherein signal is voice data.
As shown in fig. 10, wherein curve W2 is a human voice data waveform and curve W3 is a human voice energy waveform.
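The dBFS formula above can be sketched in NumPy as follows. This is an illustrative helper, not the patent's implementation; applied to each analysis frame of the human voice data in turn, it would trace out the energy waveform:

```python
import numpy as np

def dbfs(signal: np.ndarray) -> float:
    """Full-scale decibel level per the formula
    20*log10(rms(signal)*sqrt(2)) = 20*log10(rms(signal)) + 3.0103,
    so that a full-scale sine wave measures 0 dBFS."""
    rms = np.sqrt(np.mean(np.square(signal)))
    return 20.0 * np.log10(rms * np.sqrt(2.0))
```

The sqrt(2) factor (the +3.0103 dB term) references the level to a full-scale sine rather than a full-scale square wave.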
S5, preprocessing the human voice energy waveform to obtain a target waveform.
In specific implementation, the human voice energy waveform is preprocessed to obtain a target waveform. The preprocessing converts the human voice energy into a square wave of fixed amplitude, which facilitates detection of the lyric conversion point.
Referring to fig. 4, in an embodiment, the step S5 specifically includes: steps S501-S503.
S501, performing smoothing processing on the voice energy waveform to obtain a smoothed energy waveform.
In specific implementation, the human voice energy waveform is smoothed to obtain a smoothed energy waveform. In actual processing, the human voice energy waveform obtained in step S4 tends to contain high-frequency burrs that would interfere with the subsequent detection of the lyric conversion point, so the waveform is smoothed to eliminate these burrs and stabilize its amplitude.
Referring to fig. 5, in an embodiment, the step S501 specifically includes: steps S5011-S5012.
S5011, calling a window function to calculate the weight.
In specific implementation, a window function is invoked to calculate the weights. Different window functions have different effects on the signal spectrum, because they produce different amounts of leakage and different frequency resolutions. Truncating a signal produces energy leakage, and computing the spectrum with a Fourier algorithm produces the picket fence effect; neither error can be eliminated in principle, but their influence can be suppressed by choosing a suitable window function. In one embodiment, a Hanning window of length 0.8 seconds is selected as the window function to calculate the weights. The window function may be chosen by the user according to the actual situation; the application does not particularly limit it.
S5012, carrying out convolution operation on the human voice energy waveform according to the weight to obtain the smooth energy waveform.
In specific implementation, a convolution operation is performed on the human voice energy waveform according to the weights to obtain the smoothed energy waveform. In one embodiment, the smoothed energy waveform is obtained by convolving the human voice energy waveform with the normalized window weights.
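A minimal sketch of steps S5011-S5012, assuming a Hanning window of 0.8 seconds whose weights are normalized to sum to one (the function name and the normalization are our assumptions, not stated in the patent):

```python
import numpy as np

def smooth(energy: np.ndarray, sample_rate: int, window_s: float = 0.8) -> np.ndarray:
    """Smooth the human voice energy waveform by convolving it with
    normalized Hanning-window weights of the given duration."""
    win_len = max(int(window_s * sample_rate), 3)
    weights = np.hanning(win_len)
    weights /= weights.sum()          # normalize so a flat signal stays flat
    return np.convolve(energy, weights, mode="same")
```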
S502, performing threshold limiting processing on the smooth energy waveform according to a preset threshold value to obtain a threshold limiting waveform.
In specific implementation, the smoothed energy waveform is thresholded according to a preset threshold to obtain a threshold-limited waveform. In one embodiment, the thresholding simplifies the irregular smoothed energy waveform into a square wave that facilitates decision making: the threshold-limited waveform.
It should be noted that, in one embodiment, the preset threshold is -34 dBFS. The user may set the preset threshold according to actual conditions; the application does not particularly limit it.
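The binarization at the -34 dBFS threshold can be sketched as follows (an illustrative helper, names our own):

```python
import numpy as np

def threshold_waveform(smoothed: np.ndarray, threshold_dbfs: float = -34.0) -> np.ndarray:
    """Binarize the smoothed energy waveform into a square wave:
    1 where the vocal energy exceeds the preset threshold, 0 elsewhere."""
    return (smoothed > threshold_dbfs).astype(np.int8)
```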
And S503, carrying out holding processing on the threshold-limited waveform to obtain a target waveform.
In particular implementations, the threshold-limited waveform is subjected to a hold process to obtain a target waveform. The number of detection and judgment time points can be reduced by carrying out holding processing on the threshold-limited waveform, so that the detection efficiency of the lyric conversion point is improved.
Referring to fig. 6, in an embodiment, the step S503 specifically includes: steps S5031 to S5032.
S5031, identifying peaks with time intervals smaller than a preset time interval in the threshold-limited waveform as target peaks.
In specific implementation, peaks whose time interval is smaller than a preset time interval are identified in the threshold-limited waveform as target peaks. In one embodiment, the interval between the rising edges of two adjacent peaks is taken as the time interval between the two peaks. Holding the threshold-limited waveform in this way improves the detection precision of the lyric conversion point. The preset time interval may be set to 2 s, typically less than the duration of one beat of the target audio data. The holding time can be set by the user according to the actual situation; the application does not particularly limit it.
S5032, connecting all the target wave peaks to obtain a target waveform.
In specific implementation, all the target peaks are connected to obtain the target waveform. If the time interval between two adjacent peaks in the threshold-limited waveform is smaller than the preset time interval, the two peaks are connected into one; this avoids checking time points whose interval is smaller than the preset interval and improves detection efficiency.
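One simple reading of the holding step is to fill in any silent gap between voiced samples that is shorter than the preset interval. The patent measures intervals between rising edges; this sketch measures the gap itself, which is an assumption on our part:

```python
import numpy as np

def hold_peaks(square: np.ndarray, sample_rate: int, min_gap_s: float = 2.0) -> np.ndarray:
    """Merge adjacent peaks of a binary square wave: any run of zeros
    between two ones that is shorter than min_gap_s is filled with ones."""
    out = square.copy()
    max_gap = int(min_gap_s * sample_rate)
    ones = np.flatnonzero(square)
    for left, right in zip(ones[:-1], ones[1:]):
        if right - left > 1 and right - left - 1 < max_gap:
            out[left:right] = 1   # fill the short silent gap
    return out
```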
And S6, detecting the target waveform according to the beat of the target audio data and preset conversion conditions to determine the conversion point of the lyrics.
In specific implementation, the target waveform is detected according to the beat of the target audio data and a preset conversion condition to determine a conversion point of lyrics.
It should be noted that, before the target audio data is obtained, a preset conversion condition is received. In one embodiment, the preset transition conditions are:
1) No human voice appears at the last time point;
2) The voice appears at the current time point;
3) No human voice has appeared throughout the past beat-length period;
4) Continuous human voice appears throughout the coming beat-length period.
The target waveform is then detected according to the beat of the target audio data; a detected time point meeting all four conditions is a lyric conversion time point.
As shown in fig. 11, curve W3 is the human voice energy waveform, curve W4 the smoothed energy waveform, curve W5 the threshold-limited waveform, curve W6 the target waveform, and curve W7 the lyric conversion point waveform, from which the lyric conversion point can be obtained.
Detection shows that the target audio is in 8-beat measures at 108 beats per minute, so the duration of one beat-length period is 8×60/108 ≈ 4.44 s. As shown in fig. 11, human voice appears at two time points that satisfy conditions 1) and 2), namely the rising edges of the first and second peaks of the target waveform. It is then judged whether conditions 3) and 4) are satisfied. Because the duration of the first peak is less than 4.44 s, its rising edge is judged not to be a lyric conversion point; the duration of the second peak exceeds 4.44 s, and no human voice appears during the beat-length period before its rising edge, so the rising edge of the second peak is the lyric conversion point.
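The four-condition scan above can be sketched as a linear pass over the binary target waveform. This is an illustrative reading, with hypothetical names; `beat_s` is the beat-length period (about 4.44 s in the embodiment):

```python
import numpy as np

def lyric_conversion_points(target: np.ndarray, sample_rate: int, beat_s: float) -> list:
    """Return the times (seconds) of rising edges of a binary target waveform
    that satisfy the four preset conversion conditions."""
    n = int(beat_s * sample_rate)          # samples in one beat-length period
    points = []
    for t in range(1, len(target) - n):
        if target[t] == 1 and target[t - 1] == 0:       # conditions 1) and 2): rising edge
            if t >= n and not target[t - n:t].any():    # condition 3): past period silent
                if target[t:t + n].all():               # condition 4): coming period voiced
                    points.append(t / sample_rate)
    return points
```

In the fig. 11 example, only the rising edge of a peak longer than the beat-length period and preceded by a beat-length silence would be reported, matching the judgment described above.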
The lyric conversion point detection method provided by the embodiments of this application comprises: acquiring target audio data; detecting the target audio data to obtain its beat; performing human voice separation on the target audio data to obtain human voice data; calculating the amplitude of the human voice data to obtain a human voice energy waveform; preprocessing the human voice energy waveform to obtain a target waveform; and detecting the target waveform according to the beat of the target audio data and a preset conversion condition to determine the lyric conversion point. The method enables machine equipment to distinguish music from human voice effectively: by checking the processed human voice data against the beat of the target audio data and the preset conversion condition, the lyric conversion point is determined accurately, which greatly improves the accuracy and efficiency of locating lyric conversion points.
Fig. 7 is a schematic block diagram of a lyric conversion point detection device according to an embodiment of the present application. As shown in fig. 7, the present application also provides a lyric conversion point detection device 100 corresponding to the above lyric conversion point detection method. The lyric conversion point detection apparatus 100 includes a unit for performing the lyric conversion point detection method described above, and may be configured in a desktop computer, a tablet computer, a portable computer, or the like. Specifically, referring to fig. 7, the lyric conversion point detection device 100 includes an acquisition unit 101, a detection unit 102, a separation unit 103, a calculation unit 104, a preprocessing unit 105, and a determination unit 106.
An acquisition unit 101 for acquiring target audio data;
a detecting unit 102, configured to detect the target audio data to obtain a beat of the target audio data;
a separation unit 103, configured to perform a voice separation process on the target audio data to obtain voice data;
a calculating unit 104, configured to calculate an amplitude of the voice data to obtain a voice energy waveform;
a preprocessing unit 105, configured to preprocess the voice energy waveform to obtain a target waveform;
and the determining unit 106 is used for detecting the target waveform according to the beat of the target audio data and a preset conversion condition so as to determine the conversion point of the lyrics.
In an embodiment, the detecting the target audio data to obtain the beat of the target audio data includes:
extracting audio characteristics of the target audio data to obtain audio characteristics of the target audio data;
and detecting the beat of the audio features of the target audio data by using a beat detection model to obtain the beat of the target audio data.
In an embodiment, the detecting the beat of the audio feature of the target audio data by using a beat detection model to obtain the beat of the target audio data includes:
stacking the audio features of the target audio data to obtain output features, wherein the output features are time-series data of equal length to the audio features of the target audio data;
and inputting the output features into a classifier to obtain the beat of the target audio data.
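The stacking and classification steps can be sketched as follows. Stacking each frame with its neighbours keeps the output the same length as the input, as the embodiment requires; the context width is an illustrative assumption, and `classifier` is a stand-in for the trained beat detection model the application refers to:

```python
import numpy as np

def stack_features(features, context=2):
    """Stack each frame with `context` neighbours on each side, so the
    output is time-series data of the same length as the input features
    (edge frames are padded by repetition)."""
    feats = np.asarray(features, dtype=float)
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[i:i + 2 * context + 1].reshape(-1)
                     for i in range(len(feats))])

def detect_beats(features, classifier, threshold=0.5):
    """Classify each stacked frame as beat / non-beat.

    `classifier` is hypothetical here: any callable mapping one stacked
    frame to a beat probability stands in for the trained model.
    """
    stacked = stack_features(features)
    probs = np.array([classifier(frame) for frame in stacked])
    return np.flatnonzero(probs >= threshold)  # frame indices judged as beats
```

With a real model, `classifier` would wrap the model's forward pass; the surrounding plumbing stays the same.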
In an embodiment, the preprocessing the human voice energy waveform to obtain a target waveform includes:
smoothing the voice energy waveform to obtain a smoothed energy waveform;
performing threshold limiting processing on the smoothed energy waveform according to a preset threshold value to obtain a threshold-limited waveform;
and performing holding processing on the threshold-limited waveform to obtain a target waveform.
In an embodiment, the smoothing the human voice energy waveform to obtain a smoothed energy waveform includes:
calling a window function to calculate weights;
and performing a convolution operation on the human voice energy waveform with the weights to obtain the smoothed energy waveform.
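The window-function smoothing can be sketched with a normalized Hann window supplying the weights; the particular window function and its length are illustrative assumptions, since the application only says a window function is called to calculate the weights:

```python
import numpy as np

def smooth_energy(energy, window_len=11):
    """Smooth the human voice energy waveform by convolving it with
    normalized window-function weights (a Hann window here; both the
    choice of window and its length are illustrative)."""
    weights = np.hanning(window_len)
    weights /= weights.sum()          # normalize so the overall level is kept
    return np.convolve(energy, weights, mode="same")
```

Normalizing the weights to sum to one keeps the smoothed waveform at the same overall level as the input, so the later preset threshold keeps its meaning.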
In one embodiment, the holding the threshold-limited waveform to obtain a target waveform includes:
identifying, in the threshold-limited waveform, wave peaks whose time intervals are smaller than a preset time interval as target wave peaks;
and connecting all the target wave peaks to obtain a target waveform.
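One way to read the holding step: wave peaks of the threshold-limited waveform that lie closer together than the preset time interval belong to the same voiced passage, so the gap between them is held high. The sketch below assumes a binary threshold-limited waveform and a gap measured in samples, neither of which the application pins down:

```python
import numpy as np

def hold_waveform(limited, max_gap=3):
    """Connect peaks of a binary threshold-limited waveform whose time
    interval is smaller than max_gap samples, filling the gap with 1s.

    max_gap is an illustrative stand-in for the preset time interval.
    """
    held = np.asarray(limited).copy()
    peaks = np.flatnonzero(held > 0)
    for a, b in zip(peaks[:-1], peaks[1:]):
        if b - a < max_gap:            # peaks close enough: hold between them
            held[a:b + 1] = 1
    return held
```

Holding prevents short dips inside one sung phrase from being mistaken for the silence that precedes a lyric conversion point.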
In an embodiment, the performing the voice separation processing on the target audio data to obtain voice data includes:
inputting the target audio data into an audio separation tool to extract the human voice data from the target audio data.
It should be noted that, as those skilled in the art can clearly understand, the specific implementation process of the lyric conversion point detection device and each unit may refer to the corresponding description in the foregoing method embodiment, and for convenience and brevity of description, the detailed description is omitted herein.
The lyric conversion point detection device described above may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 8.
Referring to fig. 8, fig. 8 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 300 is an upper computer, which may be an electronic device such as a tablet computer, a notebook computer, or a desktop computer.
With reference to fig. 8, the computer device 300 includes a processor 302, a memory, and a network interface 305 connected by a system bus 301, wherein the memory may include a non-volatile storage medium 303 and an internal memory 304.
The non-volatile storage medium 303 may store an operating system 3031 and a computer program 3032. The computer program 3032, when executed, may cause the processor 302 to perform a lyric conversion point detection method.
The processor 302 is used to provide computing and control capabilities to support the operation of the overall computer device 300.
The internal memory 304 provides an environment for the execution of a computer program 3032 in the non-volatile storage medium 303, which computer program 3032, when executed by the processor 302, causes the processor 302 to perform a lyric conversion point detection method.
The network interface 305 is used for network communication with other devices. It will be appreciated by those skilled in the art that the structure shown in FIG. 8 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device 300 to which the present inventive arrangements may be applied, and that a particular computer device 300 may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
Wherein the processor 302 is configured to execute a computer program 3032 stored in a memory to implement the following steps:
acquiring target audio data;
detecting the target audio data to obtain beats of the target audio data;
performing voice separation processing on the target audio data to obtain voice data;
calculating the amplitude of the voice data to obtain a voice energy waveform;
preprocessing the human voice energy waveform to obtain a target waveform;
and detecting the target waveform according to the beat of the target audio data and preset conversion conditions to determine the conversion point of the lyrics.
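Under the preset conversion condition stated in the claims (no voice at the current and previous time points and over the past beat, continuous voice over the coming beat), the detection over a binary target waveform can be sketched as follows; representing the beat length in samples is an assumption of this sketch:

```python
import numpy as np

def find_conversion_points(voice, beat_len):
    """Find lyric conversion points on a binary voice waveform.

    A point t qualifies when there is no voice at t and for one beat
    before it, and continuous voice for one beat after it (the condition
    stated in the claims). beat_len, the beat length in samples, is an
    illustrative assumption.
    """
    v = np.asarray(voice)
    points = []
    for t in range(beat_len, len(v) - beat_len):
        if v[t] == 0 and not v[t - beat_len:t + 1].any() \
                and v[t + 1:t + 1 + beat_len].all():
            points.append(t)
    return points
```

Each detected point marks the last silent sample before a sustained burst of voice, i.e. the onset of a new lyric line aligned to the beat grid.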
In an embodiment, the detecting the target audio data to obtain the beat of the target audio data includes:
performing feature extraction on the target audio data to obtain audio features of the target audio data;
and detecting the beat of the audio features of the target audio data by using a beat detection model to obtain the beat of the target audio data.
In an embodiment, the detecting the beat of the audio feature of the target audio data by using a beat detection model to obtain the beat of the target audio data includes:
stacking the audio features of the target audio data to obtain output features, wherein the output features are time-series data of equal length to the audio features of the target audio data;
and inputting the output features into a classifier to obtain the beat of the target audio data.
In an embodiment, the preprocessing the human voice energy waveform to obtain a target waveform includes:
smoothing the voice energy waveform to obtain a smoothed energy waveform;
performing threshold limiting processing on the smoothed energy waveform according to a preset threshold value to obtain a threshold-limited waveform;
and performing holding processing on the threshold-limited waveform to obtain a target waveform.
In an embodiment, the smoothing the human voice energy waveform to obtain a smoothed energy waveform includes:
calling a window function to calculate weights;
and performing a convolution operation on the human voice energy waveform with the weights to obtain the smoothed energy waveform.
In one embodiment, the holding the threshold-limited waveform to obtain a target waveform includes:
identifying, in the threshold-limited waveform, wave peaks whose time intervals are smaller than a preset time interval as target wave peaks;
and connecting all the target wave peaks to obtain a target waveform.
In an embodiment, the performing the voice separation processing on the target audio data to obtain voice data includes:
inputting the target audio data into an audio separation tool to extract the human voice data from the target audio data.
It should be appreciated that in embodiments of the present application, the processor 302 may be a central processing unit (Central Processing Unit, CPU), and the processor 302 may also be another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
Those skilled in the art will appreciate that all or part of the flow in a method embodying the above described embodiments may be accomplished by computer programs instructing the relevant hardware. The computer program may be stored in a storage medium that is a computer readable storage medium. The computer program is executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.
Accordingly, the present application also provides a storage medium. The storage medium may be a computer readable storage medium. The storage medium stores a computer program. The computer program, when executed by a processor, causes the processor to perform the steps of:
acquiring target audio data;
detecting the target audio data to obtain beats of the target audio data;
performing voice separation processing on the target audio data to obtain voice data;
calculating the amplitude of the voice data to obtain a voice energy waveform;
preprocessing the human voice energy waveform to obtain a target waveform;
and detecting the target waveform according to the beat of the target audio data and preset conversion conditions to determine the conversion point of the lyrics.
In an embodiment, the detecting the target audio data to obtain the beat of the target audio data includes:
performing feature extraction on the target audio data to obtain audio features of the target audio data;
and detecting the beat of the audio features of the target audio data by using a beat detection model to obtain the beat of the target audio data.
In an embodiment, the detecting the beat of the audio feature of the target audio data by using a beat detection model to obtain the beat of the target audio data includes:
stacking the audio features of the target audio data to obtain output features, wherein the output features are time-series data of equal length to the audio features of the target audio data;
and inputting the output features into a classifier to obtain the beat of the target audio data.
In an embodiment, the preprocessing the human voice energy waveform to obtain a target waveform includes:
smoothing the voice energy waveform to obtain a smoothed energy waveform;
performing threshold limiting processing on the smoothed energy waveform according to a preset threshold value to obtain a threshold-limited waveform;
and performing holding processing on the threshold-limited waveform to obtain a target waveform.
In an embodiment, the smoothing the human voice energy waveform to obtain a smoothed energy waveform includes:
calling a window function to calculate weights;
and performing a convolution operation on the human voice energy waveform with the weights to obtain the smoothed energy waveform.
In one embodiment, the holding the threshold-limited waveform to obtain a target waveform includes:
identifying, in the threshold-limited waveform, wave peaks whose time intervals are smaller than a preset time interval as target wave peaks;
and connecting all the target wave peaks to obtain a target waveform.
In an embodiment, the performing the voice separation processing on the target audio data to obtain voice data includes:
inputting the target audio data into an audio separation tool to extract the human voice data from the target audio data.
The storage medium is a physical, non-transitory storage medium, and may be, for example, a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the device embodiments described above are merely illustrative. For example, the division of each unit is only one logic function division, and there may be another division manner in actual implementation. For example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed.
The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs. The units in the device of the embodiment of the application can be combined, divided and deleted according to actual needs. In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a terminal, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present application.
In the foregoing embodiments, the descriptions of the embodiments are focused on, and for those portions of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
While the application has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made and equivalents will be apparent to those skilled in the art without departing from the scope of the application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (10)

1. A lyric conversion point detection method, comprising:
acquiring target audio data;
detecting the target audio data to obtain beats of the target audio data;
performing voice separation processing on the target audio data to obtain voice data;
calculating the amplitude of the voice data to obtain a voice energy waveform;
preprocessing the human voice energy waveform to obtain a target waveform;
detecting the target waveform according to the beats of the target audio data and preset conversion conditions to determine a conversion point of lyrics, wherein the preset conversion conditions are that no human voice appears at the previous time point, no human voice appears at the current time point, no human voice appears in the past time reaching one beat, and continuous human voice appears in the future time reaching one beat;
and detecting the target waveform according to the beat of the target audio data, wherein the detected time point which simultaneously meets the preset conversion condition is the lyric conversion point.
2. The lyrics conversion point detection method of claim 1, wherein the detecting the target audio data to obtain a beat of the target audio data comprises:
performing feature extraction on the target audio data to obtain audio features of the target audio data;
and detecting the beat of the audio features of the target audio data by using a beat detection model to obtain the beat of the target audio data.
3. The lyrics conversion point detection method of claim 2, wherein the detecting the beat of the audio feature of the target audio data using a beat detection model to obtain the beat of the target audio data comprises:
stacking the audio features of the target audio data to obtain output features, wherein the output features are time-series data of equal length to the audio features of the target audio data;
and inputting the output features into a classifier to obtain the beat of the target audio data.
4. The lyrics conversion point detection method of claim 1, wherein the preprocessing the human voice energy waveform to obtain a target waveform comprises:
smoothing the voice energy waveform to obtain a smoothed energy waveform;
performing threshold limiting processing on the smoothed energy waveform according to a preset threshold value to obtain a threshold-limited waveform;
and performing holding processing on the threshold-limited waveform to obtain a target waveform.
5. The lyrics transformation point detection method of claim 4, wherein the smoothing the human voice energy waveform to obtain a smoothed energy waveform comprises:
calling a window function to calculate weights;
and performing a convolution operation on the human voice energy waveform with the weights to obtain the smoothed energy waveform.
6. The lyrics switch point detection method of claim 4, wherein the holding the threshold limited waveform to obtain a target waveform comprises:
identifying, in the threshold-limited waveform, wave peaks whose time intervals are smaller than a preset time interval as target wave peaks;
and connecting all the target wave peaks to obtain a target waveform.
7. The lyric conversion point detection method of claim 1, wherein the performing a human voice separation process on the target audio data to obtain human voice data comprises:
inputting the target audio data into an audio separation tool to extract the human voice data from the target audio data.
8. A lyric conversion point detection device, comprising:
an acquisition unit configured to acquire target audio data;
the detection unit is used for detecting the target audio data to obtain beats of the target audio data;
the separation unit is used for carrying out voice separation processing on the target audio data to obtain voice data;
the computing unit is used for computing the amplitude of the voice data to obtain a voice energy waveform;
the preprocessing unit is used for preprocessing the human voice energy waveform to obtain a target waveform;
a determining unit, configured to detect the target waveform according to a beat of the target audio data and a preset conversion condition to determine a conversion point of lyrics, where the preset conversion condition is that no voice occurs at a previous time point, no voice occurs at a current time point, no voice occurs in a time period of one beat in the past, and continuous voice occurs in a time period of one beat in the future;
and the determining subunit is used for detecting the target waveform according to the beat of the target audio data, and the detected time point which simultaneously meets the preset conversion condition is the lyric conversion point.
9. A computer device, characterized in that it comprises a memory on which a computer program is stored and a processor which, when executing the computer program, implements the method according to any of claims 1-7.
10. A computer readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method according to any of claims 1-7.
CN202110775920.0A 2021-07-09 2021-07-09 Lyric conversion point detection method, device, computer equipment and storage medium Active CN113516971B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110775920.0A CN113516971B (en) 2021-07-09 2021-07-09 Lyric conversion point detection method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110775920.0A CN113516971B (en) 2021-07-09 2021-07-09 Lyric conversion point detection method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113516971A CN113516971A (en) 2021-10-19
CN113516971B true CN113516971B (en) 2023-09-29

Family

ID=78066502

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110775920.0A Active CN113516971B (en) 2021-07-09 2021-07-09 Lyric conversion point detection method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113516971B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10187147A (en) * 1996-10-23 1998-07-14 Yamaha Corp Device and method for voice inputting, and storage medium
JP2001125582A (en) * 1999-10-26 2001-05-11 Victor Co Of Japan Ltd Method and device for voice data conversion and voice data recording medium
JP2002082665A (en) * 2000-09-11 2002-03-22 Toshiba Corp Device, method and processing program for assigning lyrics
JP2006048808A (en) * 2004-08-03 2006-02-16 Fujitsu Ten Ltd Audio apparatus
CN101751914A (en) * 2008-12-04 2010-06-23 江亮都 Lyric display system and method
CN104252872A (en) * 2014-09-23 2014-12-31 深圳市中兴移动通信有限公司 Lyric generating method and intelligent terminal
CN105096987A (en) * 2015-06-01 2015-11-25 努比亚技术有限公司 Audio data processing method and terminal
JP2016050974A (en) * 2014-08-29 2016-04-11 株式会社第一興商 Karaoke scoring system
CN108206029A (en) * 2016-12-16 2018-06-26 北京酷我科技有限公司 A kind of method and system for realizing the word for word lyrics
CN112399247A (en) * 2020-11-18 2021-02-23 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method, audio processing device and readable storage medium
CN112669811A (en) * 2020-12-23 2021-04-16 腾讯音乐娱乐科技(深圳)有限公司 Song processing method and device, electronic equipment and readable storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8173883B2 (en) * 2007-10-24 2012-05-08 Funk Machine Inc. Personalized music remixing
US9721551B2 (en) * 2015-09-29 2017-08-01 Amper Music, Inc. Machines, systems, processes for automated music composition and generation employing linguistic and/or graphical icon based musical experience descriptions

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10187147A (en) * 1996-10-23 1998-07-14 Yamaha Corp Device and method for voice inputting, and storage medium
JP2001125582A (en) * 1999-10-26 2001-05-11 Victor Co Of Japan Ltd Method and device for voice data conversion and voice data recording medium
JP2002082665A (en) * 2000-09-11 2002-03-22 Toshiba Corp Device, method and processing program for assigning lyrics
JP2006048808A (en) * 2004-08-03 2006-02-16 Fujitsu Ten Ltd Audio apparatus
CN101751914A (en) * 2008-12-04 2010-06-23 江亮都 Lyric display system and method
JP2016050974A (en) * 2014-08-29 2016-04-11 株式会社第一興商 Karaoke scoring system
CN104252872A (en) * 2014-09-23 2014-12-31 深圳市中兴移动通信有限公司 Lyric generating method and intelligent terminal
CN105096987A (en) * 2015-06-01 2015-11-25 努比亚技术有限公司 Audio data processing method and terminal
CN108206029A (en) * 2016-12-16 2018-06-26 北京酷我科技有限公司 A kind of method and system for realizing the word for word lyrics
CN112399247A (en) * 2020-11-18 2021-02-23 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method, audio processing device and readable storage medium
CN112669811A (en) * 2020-12-23 2021-04-16 腾讯音乐娱乐科技(深圳)有限公司 Song processing method and device, electronic equipment and readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
人工智能语音技术在广电媒体的应用;刘晓曦;;广播电视信息(第03期);全文 *

Also Published As

Publication number Publication date
CN113516971A (en) 2021-10-19

Similar Documents

Publication Publication Date Title
US6721699B2 (en) Method and system of Chinese speech pitch extraction
US8543387B2 (en) Estimating pitch by modeling audio as a weighted mixture of tone models for harmonic structures
CN110599987A (en) Piano note recognition algorithm based on convolutional neural network
Dressler Pitch estimation by the pair-wise evaluation of spectral peaks
EP2962299B1 (en) Audio signal analysis
CN103824565A (en) Humming music reading method and system based on music note and duration modeling
CN109817191B (en) Tremolo modeling method, device, computer equipment and storage medium
CN101740025A (en) Singing score evaluation method and karaoke apparatus using the same
JP2004538525A (en) Pitch determination method and apparatus by frequency analysis
CN103915093A (en) Method and device for realizing voice singing
JP5569228B2 (en) Tempo detection device, tempo detection method and program
Kumar et al. Musical onset detection on carnatic percussion instruments
Hsu et al. Robust voice activity detection algorithm based on feature of frequency modulation of harmonics and its DSP implementation
CN113516971B (en) Lyric conversion point detection method, device, computer equipment and storage medium
US20110166857A1 (en) Human Voice Distinguishing Method and Device
JP4128848B2 (en) Pitch pitch determination method and apparatus, pitch pitch determination program and recording medium recording the program
Tang et al. Melody Extraction from Polyphonic Audio of Western Opera: A Method based on Detection of the Singer's Formant.
Vincent et al. Predominant-F0 estimation using Bayesian harmonic waveform models
CN112185338B (en) Audio processing method, device, readable storage medium and electronic equipment
CN109817205B (en) Text confirmation method and device based on semantic analysis and terminal equipment
CN110827859B (en) Method and device for vibrato recognition
Mishra et al. Language vs Speaker Change: A Comparative Study
Collins An automated event analysis system with compositional applications
Zien et al. Monophonic piano music transcription
Castro et al. Musical beat recognition using a MLP-HMM hybrid classifier

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20211118

Address after: 518000 1001, block D, building 5, software industry base, Yuehai street, Nanshan District, Shenzhen City, Guangdong Province

Applicant after: Shenzhen Wanxing Software Co.,Ltd.

Address before: 518000 1002, block D, building 5, software industry base, Yuehai street, Nanshan District, Shenzhen City, Guangdong Province

Applicant before: SHENZHEN SIBO TECHNOLOGY Co.,Ltd.

GR01 Patent grant
GR01 Patent grant