CN112133327B - Audio sample extraction method, device, terminal and storage medium

Info

Publication number
CN112133327B
CN112133327B (application CN202010984280.XA)
Authority
CN
China
Prior art keywords
lyric
file
lyrics
audio
audio file
Prior art date
Legal status
Active
Application number
CN202010984280.XA
Other languages
Chinese (zh)
Other versions
CN112133327A
Inventor
鲁霄
赵伟峰
Current Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202010984280.XA priority Critical patent/CN112133327B/en
Publication of CN112133327A publication Critical patent/CN112133327A/en
Application granted granted Critical
Publication of CN112133327B publication Critical patent/CN112133327B/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/27: characterised by the analysis technique
    • G10L25/30: using neural networks
    • G10L25/48: specially adapted for particular use
    • G10L25/51: for comparison or discrimination
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods


Abstract

The embodiment of the invention discloses an audio sample extraction method, device, terminal and readable storage medium. The method comprises: acquiring a first lyric and a second lyric whose similarity, computed between each lyric in a first lyric file and each lyric in a second lyric file, is greater than a preset similarity threshold; determining first time information of the first lyric according to a first mapping relationship between lyrics and time information in the first lyric file, and second time information of the second lyric according to a second mapping relationship between lyrics and time information in the second lyric file; cutting the first audio file according to the first time information to obtain a first sub-audio file, and cutting the second audio file according to the second time information to obtain a second sub-audio file; and determining the first sub-audio file with first annotation information added and the second sub-audio file with second annotation information added to be audio samples of the same lyric fragment. This meets the demand for automated and intelligent audio sample extraction and improves the efficiency of extracting audio samples.

Description

Audio sample extraction method, device, terminal and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a method, an apparatus, a terminal, and a storage medium for extracting an audio sample.
Background
In the audio field, most machine learning algorithms require a large number of audio training samples for iterative training, which makes the ability to efficiently collect large numbers of audio samples indispensable.
At present, in the field of computer audition, ready-made audio sample resources are scarce and the channels for collecting them are limited, so audio samples are mainly collected through manual annotation. However, manual annotation involves an extremely heavy workload, and it is difficult to maintain a unified annotation standard. How to improve the efficiency of acquiring audio samples is therefore very important.
Disclosure of Invention
The embodiment of the invention provides an audio sample extraction method, device, terminal and storage medium, which can extract audio samples of the same lyric fragment based on lyric files, reduce the workload, meet the demand for automated and intelligent audio sample extraction, and improve the efficiency of extracting audio samples.
In a first aspect, an embodiment of the present invention provides a method for extracting an audio sample, including:
Acquiring a first lyric file and a second lyric file, wherein the first lyric file comprises a first mapping relation between lyrics and time information, and the second lyric file comprises a second mapping relation between lyrics and time information;
calculating the similarity of each lyric in the first lyric file and each lyric in the second lyric file, and acquiring the first lyric in the first lyric file and the second lyric in the second lyric file, wherein the similarity is larger than a preset similarity threshold;
determining first time information corresponding to the first lyrics according to the first mapping relation, and determining second time information corresponding to the second lyrics according to the second mapping relation;
cutting a first audio file corresponding to the first lyric file according to the first time information to obtain a first sub audio file, and cutting a second audio file corresponding to the second lyric file according to the second time information to obtain a second sub audio file;
adding first annotation information to the first sub-audio file, adding second annotation information to the second sub-audio file, and determining that the first sub-audio file added with the first annotation information and the second sub-audio file added with the second annotation information are audio samples of the same lyric fragment, wherein the first annotation information and the second annotation information are used for training a neural network model.
In a second aspect, an embodiment of the present invention provides an apparatus for extracting an audio sample, including:
an acquisition unit, configured to acquire a first lyric file and a second lyric file, wherein the first lyric file comprises a first mapping relationship between lyrics and time information, and the second lyric file comprises a second mapping relationship between lyrics and time information;
the calculating unit is used for calculating the similarity between each lyric in the first lyric file and each lyric in the second lyric file and obtaining the first lyric in the first lyric file and the second lyric in the second lyric file, wherein the similarity is larger than a preset similarity threshold;
a first determining unit, configured to determine first time information corresponding to the first lyrics according to the first mapping relationship, and determine second time information corresponding to the second lyrics according to the second mapping relationship;
the clipping unit is used for clipping a first audio file corresponding to the first lyric file according to the first time information to obtain a first sub audio file, and clipping a second audio file corresponding to the second lyric file according to the second time information to obtain a second sub audio file;
The second determining unit is used for adding first annotation information to the first sub-audio file, adding second annotation information to the second sub-audio file, and determining that the first sub-audio file added with the first annotation information and the second sub-audio file added with the second annotation information are audio samples of the same lyric fragment, wherein the first annotation information and the second annotation information are used for training a neural network model.
In a third aspect, an embodiment of the present invention provides a terminal, including: a processor and a memory, the processor to perform:
acquiring a first lyric file and a second lyric file, wherein the first lyric file comprises a first mapping relation between lyrics and time information, and the second lyric file comprises a second mapping relation between lyrics and time information;
calculating the similarity of each lyric in the first lyric file and each lyric in the second lyric file, and acquiring the first lyric in the first lyric file and the second lyric in the second lyric file, wherein the similarity is larger than a preset similarity threshold;
determining first time information corresponding to the first lyrics according to the first mapping relation, and determining second time information corresponding to the second lyrics according to the second mapping relation;
cutting a first audio file corresponding to the first lyric file according to the first time information to obtain a first sub audio file, and cutting a second audio file corresponding to the second lyric file according to the second time information to obtain a second sub audio file;
adding first annotation information to the first sub-audio file, adding second annotation information to the second sub-audio file, and determining that the first sub-audio file added with the first annotation information and the second sub-audio file added with the second annotation information are audio samples of the same lyric fragment, wherein the first annotation information and the second annotation information are used for training a neural network model.
In a fourth aspect, an embodiment of the present invention further provides a computer readable storage medium, where program instructions are stored, the program instructions being configured to implement the method according to the first aspect.
According to the embodiment of the invention, the similarity between each lyric in the first lyric file and each lyric in the second lyric file is calculated by acquiring the first lyric file and the second lyric file, so that the first lyric in the first lyric file and the second lyric in the second lyric file with the similarity larger than the preset similarity threshold value are acquired; determining first time information corresponding to the first lyrics according to a first mapping relation between the lyrics and the time information in the first lyrics file, and determining second time information corresponding to the second lyrics according to a second mapping relation between the lyrics and the time information in the second lyrics file; cutting a first audio file corresponding to a first lyric file according to first time information to obtain a first sub-audio file, cutting a second audio file corresponding to a second lyric file according to second time information to obtain a second sub-audio file, adding first annotation information to the first sub-audio file, adding second annotation information to the second sub-audio file, and determining that the first sub-audio file added with the first annotation information and the second sub-audio file added with the second annotation information are audio samples of the same lyric fragment, wherein the first annotation information and the second annotation information are used for training a neural network model. By the implementation mode, the audio samples of the same lyric fragment can be extracted based on the lyric file, workload is reduced, the requirements on automation and intellectualization of extracting the audio samples are met, and the efficiency of extracting the audio samples is improved.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic structural diagram of an audio sample extraction system according to an embodiment of the present invention;
fig. 2 is a flow chart of an audio sample extraction method according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an audio sample extracting apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
The technical solutions of the embodiments of the present invention will be clearly described below with reference to the drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The audio sample extraction method provided by the embodiment of the invention can be applied to an audio sample extraction system, where the system comprises an audio sample extraction device and a server, and the audio sample extraction device may be arranged in a terminal. In some embodiments, the terminal may comprise intelligent terminal devices such as a smartphone, a tablet computer, a notebook computer, a desktop computer, an in-vehicle intelligent terminal, and a smart watch. In some embodiments, the server contains one or more databases, which may be used to store the content of audio files such as songs. In some embodiments, the server may be a cloud server. In some embodiments, the audio samples extracted by the method provided by the embodiment of the present invention may be applied to various scenarios: for example, training a cover song recognition model, where the trained model can recognize whether a song is a cover version; for another example, training a timbre conversion model, where the trained model can convert timbre A into timbre B; or training a song medley model, where the trained model can splice the first N lyric lines of song A with the last M lyric lines of song B. Of course, the above application scenarios are merely examples, and the audio samples extracted by the embodiment of the present invention may be applied to any scenario in which audio processing is performed according to the correspondence between lyric fragments.
An audio sample extraction system according to an embodiment of the present invention is schematically illustrated in the following with reference to fig. 1.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an audio sample extraction system according to an embodiment of the present invention, where the system includes a terminal 11 and a server 12, and in some embodiments, the terminal 11 and the server 12 may establish a communication connection through a wireless communication manner; in some scenarios, the communication connection between the terminal 11 and the server 12 may also be established through a wired communication manner. In some embodiments, the terminal 11 may include, but is not limited to, a smart terminal device such as a smart phone, tablet, notebook, desktop, in-vehicle smart terminal, smart watch, etc.
In the embodiment of the present invention, the terminal 11 may obtain 2 audio files, i.e. song files, from the server 12, and the terminal 11 may parse the obtained 2 audio files to obtain a corresponding first lyric file and a corresponding second lyric file, where the first lyric file includes a first mapping relationship between lyrics and time information, and the second lyric file includes a second mapping relationship between lyrics and time information. The terminal 11 obtains the first lyrics in the first lyrics file and the second lyrics in the second lyrics file, the similarity of which is larger than a preset similarity threshold, by calculating the similarity of each lyric in the first lyrics file and each lyric in the second lyrics file, determines first time information corresponding to the first lyrics according to the first mapping relation, and determines second time information corresponding to the second lyrics according to the second mapping relation; cutting a first audio file corresponding to the first lyric file according to the first time information to obtain a first sub audio file, and cutting a second audio file corresponding to the second lyric file according to the second time information to obtain a second sub audio file; and adding first annotation information to the first sub-audio file, adding second annotation information to the second sub-audio file, and determining that the first sub-audio file added with the first annotation information and the second sub-audio file added with the second annotation information are audio samples of the same lyric fragment, wherein the first annotation information and the second annotation information are used for training a neural network model.
By the method, the audio samples of the same lyric fragment can be extracted based on the lyric file, workload is reduced, the requirements on automation and intellectualization of extracting the audio samples are met, and the efficiency of extracting the audio samples is improved.
The method for extracting an audio sample according to an embodiment of the present invention is schematically described below with reference to fig. 2.
Referring specifically to fig. 2, fig. 2 is a flowchart of an audio sample extraction method provided by an embodiment of the present invention, where the audio sample extraction method according to the embodiment of the present invention may be performed by an audio sample extraction device, and the audio sample extraction device is disposed in a terminal, where a specific explanation of the terminal is as described above. Specifically, the method of the embodiment of the invention comprises the following steps.
S201: obtaining a first lyric file and a second lyric file, wherein the first lyric file comprises a first mapping relation between lyrics and time information, and the second lyric file comprises a second mapping relation between lyrics and time information.
In the embodiment of the invention, the audio sample extracting device may obtain a first lyric file and a second lyric file, where the first lyric file includes a first mapping relationship between lyrics and time information, and the second lyric file includes a second mapping relationship between lyrics and time information. In some embodiments, the audio sample extracting device may obtain the first lyrics file and the second lyrics file from other terminal devices, platforms or servers, where the first lyrics file and the second lyrics file may be lyrics files of different song segments obtained from the same song, or may be lyrics files of song segments obtained from different songs.
In some embodiments, the obtained first lyric file and second lyric file may be stored in a specified data structure, for example, a dictionary in the Python language or a map structure in C++.
In one embodiment, in the process of obtaining the first lyric file and the second lyric file, the extracting device of the audio sample may obtain the first audio file and the second audio file, parse the first audio file to obtain the first lyric file, and parse the second audio file to obtain the second lyric file.
In one embodiment, in the process of analyzing the first audio file to obtain the first lyrics file and analyzing the second audio file to obtain the second lyrics file, the extracting device of the audio sample may perform traversal analysis on the first audio file according to a specified file format to obtain time information of each lyric corresponding to the first audio file, and determine the first lyrics file according to each lyric and the time information of each lyric, where the time information includes a start time and a lyric duration corresponding to each lyric; and performing traversal analysis on the second audio file according to the specified file format to obtain time information of each lyric corresponding to the second audio file, and determining the second lyric file according to each lyric and the time information of each lyric, wherein the time information comprises starting time and lyric duration corresponding to each lyric.
In some embodiments, the specified file format may include, but is not limited to, LRC and QRC, where LRC is a lyric file format carrying time-axis mark information, from which the time information corresponding to each lyric can be obtained, and QRC is a modified lyric file format based on LRC.
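To make this parsing step concrete, the following minimal Python sketch (an illustration, not part of the original disclosure; the regular expression and dictionary layout are assumptions) reads LRC-style lines of the form `[mm:ss.xx]lyric` into the lyric-to-time mapping described above, deriving each lyric's duration from the start time of the next line:

```python
import re

LRC_TAG = re.compile(r"\[(\d+):(\d+(?:\.\d+)?)\](.*)")  # matches "[mm:ss.xx]lyric text"

def parse_lrc(lrc_text):
    """Parse LRC text into a mapping: lyric line -> (start time, duration), in seconds."""
    entries = []
    for line in lrc_text.splitlines():
        m = LRC_TAG.match(line.strip())
        if not m:
            continue  # skip metadata tags such as [ti:...] or [ar:...]
        minutes, seconds, lyric = m.groups()
        if lyric.strip():
            entries.append((60 * int(minutes) + float(seconds), lyric.strip()))
    mapping = {}
    for (start, lyric), (next_start, _) in zip(entries, entries[1:]):
        mapping[lyric] = (start, next_start - start)  # duration: gap to the next line
    if entries:
        last_start, last_lyric = entries[-1]
        mapping[last_lyric] = (last_start, None)  # last line: duration unknown from LRC alone
    return mapping
```

The dictionary returned here plays the role of the specified data structure mentioned earlier; a real implementation would also need to disambiguate repeated lyric lines and lines carrying multiple time tags.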
S202: calculating the similarity of each lyric in the first lyric file and each lyric in the second lyric file, and acquiring the first lyric in the first lyric file and the second lyric in the second lyric file, wherein the similarity is larger than a preset similarity threshold value.
In the embodiment of the invention, the extracting device of the audio sample can calculate the similarity between each lyric in the first lyric file and each lyric in the second lyric file, and acquire the first lyric in the first lyric file and the second lyric in the second lyric file, wherein the similarity is greater than a preset similarity threshold.
In one embodiment, in the process of calculating the similarity between each lyric in the first lyric file and each lyric in the second lyric file, the extracting device of the audio sample may determine a third mapping relationship between each lyric in the first lyric file and each lyric in the second lyric file according to a preset rule, and calculate the similarity between each lyric in the first lyric file and each lyric in the second lyric file according to the third mapping relationship.
In one embodiment, when determining a third mapping relationship between each lyric in the first lyric file and each lyric in the second lyric file according to a preset rule and calculating the similarity according to that relationship, a third mapping relationship between the N-th lyric in the first lyric file and the [N-M, N+M]-th lyrics in the second lyric file may be determined, and the similarity between each lyric in the first lyric file and each lyric in the second lyric file may be calculated according to the third mapping relationship, where N is an integer greater than or equal to 1, M is a number greater than or equal to 0, and N is greater than M. For example, the 4th lyric in the first lyric file is compared with the 3rd, 4th, and 5th lyrics in the second lyric file, so as to calculate the similarity between the 4th lyric in the first lyric file and each of the 3rd, 4th, and 5th lyrics in the second lyric file.
In one embodiment, in the process of calculating the similarity according to the third mapping relationship, the audio sample extraction device may compare each lyric in the first lyric file with the corresponding lyrics in the second lyric file according to the third mapping relationship, determine from the comparison result the target lyric sequence over which a third lyric in the first lyric file and a fourth lyric in the second lyric file are identical, and calculate the similarity between the third lyric and the fourth lyric according to the sequence length of the target lyric sequence, the sequence length of the third lyric, and the sequence length of the fourth lyric.
In one embodiment, when calculating this similarity, the audio sample extraction device may compare the sequence length of the third lyric with the sequence length of the fourth lyric to obtain the maximum sequence length, and determine the similarity between the third lyric and the fourth lyric as the ratio of the sequence length of the target lyric sequence to the maximum sequence length.
Specifically, assuming that the sequence length of the third lyrics is L (a), the sequence length of the fourth lyrics is L (B), and the sequence length of the target lyrics sequence of the third lyrics identical to the fourth lyrics is L (N), the calculation formula of the similarity S (a, B) between the third lyrics and the fourth lyrics is as follows:
S(A,B)=L(N)/max(L(A),L(B)) (1)
where max (L (a), L (B)) represents the maximum sequence length between the sequence length of the third lyrics and the sequence length of the fourth lyrics.
For example, assuming that the sequence length of the third lyrics is 10, the sequence length of the fourth lyrics is 8, and the sequence length of the target lyrics sequence in which the third lyrics are identical to the fourth lyrics is 5, the similarity S (a, B) =5/10=0.5=50% of the third lyrics to the fourth lyrics, and thus, it can be determined that the similarity of the third lyrics to the fourth lyrics is 50%.
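A minimal Python sketch of this similarity computation, under the assumption that the target lyric sequence is the matched subsequence found by difflib's SequenceMatcher (the patent does not fix a particular matching algorithm), together with the windowed [N-M, N+M] pairing described above; all names and the threshold value are illustrative:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """S(A, B) = L(N) / max(L(A), L(B)): ratio of the matched (target) sequence
    length to the longer of the two lyric lengths, as in formula (1)."""
    if not a or not b:
        return 0.0
    matched = sum(blk.size for blk in SequenceMatcher(None, a, b).get_matching_blocks())
    return matched / max(len(a), len(b))

def matched_pairs(lyrics_a, lyrics_b, m=1, threshold=0.3):
    """Compare the n-th lyric of file A against lyrics [n-m, n+m] of file B
    (the preset rule above) and keep pairs above the similarity threshold."""
    pairs = []
    for n, line_a in enumerate(lyrics_a):
        for k in range(max(0, n - m), min(len(lyrics_b), n + m + 1)):
            score = similarity(line_a, lyrics_b[k])
            if score > threshold:
                pairs.append((n, k, score))
    return pairs

# Matches the worked example: lengths 10 and 8 with a matched length of 5 give 5/10 = 0.5.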
S203: and determining first time information corresponding to the first lyrics according to the first mapping relation, and determining second time information corresponding to the second lyrics according to the second mapping relation.
In the embodiment of the invention, the extracting device of the audio sample can determine the first time information corresponding to the first lyrics according to the first mapping relation, and determine the second time information corresponding to the second lyrics according to the second mapping relation. In some embodiments, the first time information and the second time information include, but are not limited to, a start time, a lyric duration, a lyric end time, and the like, corresponding to each lyric.
S204: cutting a first audio file corresponding to the first lyric file according to the first time information to obtain a first sub audio file, and cutting a second audio file corresponding to the second lyric file according to the second time information to obtain a second sub audio file.
In the embodiment of the invention, the audio sample extracting device may cut the first audio file corresponding to the first lyric file according to the first time information to obtain a first sub-audio file, and cut the second audio file corresponding to the second lyric file according to the second time information to obtain a second sub-audio file.
In one embodiment, when the extracting device of the audio sample clips a first audio file corresponding to the first lyric file according to the first time information to obtain a first sub audio file, and clips a second audio file corresponding to the second lyric file according to the second time information to obtain a second sub audio file, the extracting device may clip the first audio file according to a start time and a lyric duration corresponding to the first lyric in the first time information to obtain the first sub audio file, and clip the second audio file according to a start time and a lyric duration corresponding to the second lyric in the second time information to obtain the second sub audio file.
In one embodiment, a clipping tool may be used to clip the first audio file and the second audio file, and the first sub-audio file and second sub-audio file obtained by clipping may be named and stored according to user requirements.
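As one way to realize this clipping step, the sketch below uses pydub, an assumed choice of clipping tool (the patent does not name one); pydub slices audio segments by milliseconds, so the start time and lyric duration in seconds are converted accordingly:

```python
from pydub import AudioSegment  # pip install pydub; decoding mp3 etc. needs ffmpeg

def clip_audio(audio_path, start_s, duration_s, out_path):
    """Cut the [start, start + duration) span out of an audio file and save it."""
    song = AudioSegment.from_file(audio_path)
    clip = song[int(start_s * 1000):int((start_s + duration_s) * 1000)]  # pydub uses ms
    clip.export(out_path, format="wav")
    return out_path

# e.g. if the first lyric starts at 42.3 s and lasts 5.1 s:
# clip_audio("first_audio.mp3", 42.3, 5.1, "first_sub_audio.wav")
```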
S205: adding first annotation information to the first sub-audio file, adding second annotation information to the second sub-audio file, and determining that the first sub-audio file added with the first annotation information and the second sub-audio file added with the second annotation information are audio samples of the same lyric fragment, wherein the first annotation information and the second annotation information are used for training a neural network model.
In the embodiment of the invention, the extracting device of the audio sample can add the first annotation information to the first sub-audio file, add the second annotation information to the second sub-audio file, and determine that the first sub-audio file added with the first annotation information and the second sub-audio file added with the second annotation information are audio samples of the same lyric fragment, wherein the first annotation information and the second annotation information are used for training a neural network model.
For example, assume the first lyric in the first lyric file of the first audio file is "my love my country, which cannot be split for one moment", with sequence length 14; the second lyric in the second lyric file of the second audio file is "my love my hometown, which cannot be split at any moment", with sequence length 15; and the target lyric sequence common to the two lyrics has sequence length 11. The similarity of the two lyrics is then 11/15 = 73%; if the preset similarity threshold is 30%, the two lyrics can be determined to be similar. The first audio file and the second audio file are cut according to the start time and lyric duration corresponding to each of the two lyrics, yielding the first sub-audio file and the second sub-audio file. First annotation information (for example, the identifier corresponding to the first sub-audio file and the start time and lyric duration corresponding to the first lyric) is added to the first sub-audio file, second annotation information (analogously for the second) is added to the second sub-audio file, and the two annotated sub-audio files are determined to be audio samples of the same lyric fragment.
In some embodiments, the audio sample extracting device may obtain the identifier of the first sub-audio file, and obtain the first lyrics and a start time, a lyric duration, and the like of the first lyrics in the first time information of the first lyrics, so as to determine the first labeling information according to one or more of the identifier of the first sub-audio file, the start time of the first lyrics, and the lyric duration. In some embodiments, the identification of the first sub-audio file may be determined according to the identification of the first audio file, including but not limited to a file name, a file number, etc., for example, assuming that the identification of the first audio file is 1, the identification of the first sub-audio file may be 12.
In some embodiments, the audio sample extracting device may obtain the identifier of the second sub-audio file, and obtain the second lyrics and a start time, a lyric duration, and the like of the second lyrics in the second time information of the second lyrics, so as to determine the second labeling information according to one or more of the identifier of the second sub-audio file, the start time of the second lyrics, and the lyric duration. In some embodiments, the identification of the second sub-audio file may be determined based on the identification of the second audio file, including, but not limited to, a file name, a file number, and the like. Wherein the identification of the first sub-audio file is different from the identification of the second sub-audio file.
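For illustration, annotation records like those described above could be assembled as follows; the field names, the identifier scheme "1_1"/"2_1", and the JSON layout are assumptions for the sketch, not a format defined by the patent:

```python
import json

def make_annotation(sub_audio_id, lyric, start_s, duration_s):
    """Bundle the annotation information for one sub-audio file."""
    return {
        "id": sub_audio_id,      # identifier derived from the parent audio file's identifier
        "lyric": lyric,          # included or omitted depending on the model to be trained
        "start": start_s,        # start time of the lyric within the parent audio file
        "duration": duration_s,  # lyric duration
    }

# The two annotated sub-audio files form one sample pair for the same lyric fragment:
sample_pair = [
    make_annotation("1_1", "my love my country ...", 42.3, 5.1),
    make_annotation("2_1", "my love my hometown ...", 37.8, 5.4),
]
print(json.dumps(sample_pair, ensure_ascii=False, indent=2))
```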
In one implementation manner, when the extracting device of the audio sample adds the first annotation information to the first sub-audio file and adds the second annotation information to the second sub-audio file, the type of the annotation information required by the neural network model to be trained can be determined first, and then the first annotation information and the second annotation information corresponding to the type of the annotation information are determined according to the determined type of the annotation information.
In some embodiments, when determining the type of annotation information required by the neural network model to be trained, the type may be determined according to the application scenario of that model. For example, when the application scenario is cover song recognition, the annotation information types may include a song identifier (such as the first audio file identifier or the second audio file identifier), the start time of the lyrics, and the lyric duration; for another example, when the application scenario is lyric recognition, the annotation information types may include the song identifier, the lyrics, the start time of the lyrics, and the lyric duration.
In another implementation, annotation information may first be added according to preset annotation information types and treated as candidate annotation information, and the target annotation information actually used for model training may then be selected from these candidates according to the application scenario of the neural network model to be trained in practical use. For example, the preset annotation information types include, but are not limited to, the song identifier, the lyrics, the start time of the lyrics, and the lyric duration. In some embodiments, the annotation information types include, but are not limited to, the identifier of the audio file, the lyrics corresponding to the audio file, the start time of the lyrics, and the lyric duration.
In one embodiment, when the first annotation information and the second annotation information are used for training a music recognition model, the first annotation information includes any one or more of the identifier corresponding to the first sub-audio file, the start time corresponding to the first lyric, and the lyric duration, and the second annotation information includes any one or more of the identifier corresponding to the second sub-audio file, the start time corresponding to the second lyric, and the lyric duration. The music recognition model is obtained by training on the first sub-audio file with the first annotation information added and the second sub-audio file with the second annotation information added. During training, the audio samples can be segmented (for example, into pieces of 1 to 5 seconds each), the features of each piece extracted, and the features of each annotated piece input into the neural network model for training to obtain the music recognition model.
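A sketch of this segment-and-extract step, assuming librosa as the audio library and MFCCs as the per-piece features (the patent specifies neither); the piece length and all parameters are illustrative:

```python
import librosa
import numpy as np

def sample_features(path, piece_s=3.0, sr=22050, n_mfcc=20):
    """Split one audio sample into fixed-length pieces (1-5 s per the text above)
    and extract an MFCC feature vector for each piece."""
    y, sr = librosa.load(path, sr=sr)
    piece_len = int(piece_s * sr)
    features = []
    for start in range(0, len(y) - piece_len + 1, piece_len):
        piece = y[start:start + piece_len]
        mfcc = librosa.feature.mfcc(y=piece, sr=sr, n_mfcc=n_mfcc)
        features.append(mfcc.mean(axis=1))  # average over time: one vector per piece
    return np.stack(features) if features else np.empty((0, n_mfcc))
```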
In one embodiment, when the first annotation information and the second annotation information are used for audio lyric recognition training, the first annotation information includes any one or more of the identifier corresponding to the first sub-audio file, the first lyric, the start time corresponding to the first lyric, and the lyric duration, and the second annotation information includes any one or more of the identifier corresponding to the second sub-audio file, the start time corresponding to the second lyric, and the lyric duration. An audio lyric recognition model is obtained by training with the first annotation information and the second annotation information, and the model is used to recognize different audio signals corresponding to the same lyrics.
For example, a target song and a cover version with the same lyrics are input into the trained audio lyric recognition model, which recognizes them as two different audio signals (namely, the audio signal corresponding to the target song and the audio signal corresponding to the cover version). With this audio lyric recognition model, different audio signals under the same lyric text can be distinguished.
In one embodiment, when the first annotation information and the second annotation information are used for training a song medley model, the first annotation information includes any one or more of the identifier corresponding to the first sub-audio file, the start time corresponding to the first lyric, and the lyric duration, and the second annotation information includes any one or more of the identifier corresponding to the second sub-audio file, the start time corresponding to the second lyric, and the lyric duration. A song medley model is obtained by training with the first annotation information and the second annotation information, and the model is used to identify different audio signals corresponding to the same lyrics.
For example, the lyrics of an audio file under test are divided into part A and part B; the part-B lyrics are input into the song medley model to obtain the identifier, lyric start time, and lyric duration of the matched song corresponding to the part-B lyrics; and the part-A audio of the file under test and the matched part-B audio are spliced together through audio cutting to form a medley.
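A sketch of this final splicing step, again assuming pydub as the cutting tool; in this illustration the matched song's path, start time, and duration would come from the song medley model's output described above:

```python
from pydub import AudioSegment

def splice_medley(test_path, a_end_s, matched_path, b_start_s, b_duration_s, out_path):
    """Join part A of the file under test with the matched song's part-B audio."""
    part_a = AudioSegment.from_file(test_path)[:int(a_end_s * 1000)]
    matched = AudioSegment.from_file(matched_path)
    part_b = matched[int(b_start_s * 1000):int((b_start_s + b_duration_s) * 1000)]
    medley = part_a + part_b  # '+' concatenates pydub segments
    medley.export(out_path, format="wav")
    return out_path
```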
According to the embodiment of the invention, the similarity between each lyric in the first lyric file and each lyric in the second lyric file is calculated by acquiring the first lyric file and the second lyric file, so that the first lyric in the first lyric file and the second lyric in the second lyric file with the similarity larger than the preset similarity threshold value are acquired; determining first time information corresponding to the first lyrics according to a first mapping relation between the lyrics and the time information in the first lyrics file, and determining second time information corresponding to the second lyrics according to a second mapping relation between the lyrics and the time information in the second lyrics file; cutting a first audio file corresponding to a first lyric file according to first time information to obtain a first sub-audio file, cutting a second audio file corresponding to a second lyric file according to second time information to obtain a second sub-audio file, adding first annotation information to the first sub-audio file, adding second annotation information to the second sub-audio file, and determining that the first sub-audio file added with the first annotation information and the second sub-audio file added with the second annotation information are audio samples of the same lyric fragment, wherein the first annotation information and the second annotation information are used for training a neural network model. By the implementation mode, the audio samples of the same lyric fragment can be extracted based on the lyric file, workload is reduced, the requirements on automation and intellectualization of extracting the audio samples are met, and the efficiency of extracting the audio samples is improved.
Referring to fig. 3, fig. 3 is a schematic structural diagram of an audio sample extracting apparatus according to an embodiment of the present invention. Specifically, the device for extracting the audio sample is arranged in the terminal, and the device comprises: an acquisition unit 301, a calculation unit 302, a first determination unit 303, a clipping unit 304, a second determination unit 305;
an obtaining unit 301, configured to obtain a first lyric file and a second lyric file, where the first lyric file includes a first mapping relationship between lyrics and time information, and the second lyric file includes a second mapping relationship between lyrics and time information;
a calculating unit 302, configured to calculate a similarity between each lyric in the first lyric file and each lyric in the second lyric file, and obtain a first lyric in the first lyric file and a second lyric in the second lyric file, where the similarity is greater than a preset similarity threshold;
a first determining unit 303, configured to determine first time information corresponding to the first lyrics according to the first mapping relationship, and determine second time information corresponding to the second lyrics according to the second mapping relationship;
the clipping unit 304 is configured to clip a first audio file corresponding to the first lyric file according to the first time information to obtain a first sub audio file, and clip a second audio file corresponding to the second lyric file according to the second time information to obtain a second sub audio file;
A second determining unit 305, configured to add first annotation information to the first sub-audio file, add second annotation information to the second sub-audio file, and determine that the first sub-audio file to which the first annotation information is added and the second sub-audio file to which the second annotation information is added are audio samples of the same lyrics fragments, where the first annotation information and the second annotation information are used for training of a neural network model.
Further, when the obtaining unit 301 obtains the first lyric file and the second lyric file, the obtaining unit is specifically configured to:
acquiring the first audio file and the second audio file;
analyzing the first audio file to obtain the first lyrics file, and analyzing the second audio file to obtain the second lyrics file.
Further, when the obtaining unit 301 parses the first audio file to obtain the first lyrics file and parses the second audio file to obtain the second lyrics file, the obtaining unit is specifically configured to:
performing traversal analysis on the first audio file according to a specified file format to obtain time information of each lyric corresponding to the first audio file, and determining the first lyric file according to each lyric and the time information of each lyric, wherein the time information comprises a starting time and a lyric duration corresponding to each lyric;
performing traversal analysis on the second audio file according to the specified file format to obtain time information of each lyric corresponding to the second audio file, and determining the second lyric file according to each lyric and the time information of each lyric, wherein the time information comprises the starting time and lyric duration corresponding to each lyric.
Further, when the clipping unit 304 clips the first audio file corresponding to the first lyric file according to the first time information to obtain a first sub audio file, and clips the second audio file corresponding to the second lyric file according to the second time information to obtain a second sub audio file, the clipping unit is specifically configured to:
cutting the first audio file according to the starting time and lyric duration corresponding to the first lyrics in the first time information to obtain the first sub audio file;
and cutting the second audio file according to the starting time and the lyric duration corresponding to the second lyrics in the second time information to obtain the second sub audio file.
Further, when the calculating unit 302 calculates the similarity between each lyric in the first lyric file and each lyric in the second lyric file, the calculating unit is specifically configured to:
Determining a third mapping relation between each lyric in the first lyric file and each lyric in the second lyric file according to a preset rule;
comparing each lyric in the first lyric file with each corresponding lyric in the second lyric file according to the third mapping relation;
determining, according to the comparison result, a target lyric sequence over which a third lyric in the first lyric file is identical to a fourth lyric in the second lyric file;
and calculating the similarity of the third lyrics and the fourth lyrics according to the sequence length of the target lyrics sequence, the sequence length of the third lyrics and the sequence length of the fourth lyrics.
Further, the calculating unit 302 is specifically configured to, when calculating the similarity between the third lyrics and the fourth lyrics according to the sequence length of the target lyrics sequence, the sequence length of the third lyrics, and the sequence length of the fourth lyrics:
comparing the sequence length of the third lyrics with the sequence length of the fourth lyrics to obtain the maximum sequence length;
and determining the similarity of the third lyrics and the fourth lyrics according to the ratio of the sequence length of the target lyrics sequence to the maximum sequence length.
Further, when the first annotation information and the second annotation information are used for training a music recognition model, the first annotation information comprises any one or more of an identifier corresponding to the first sub-audio file, a starting time corresponding to first lyrics and a lyric duration, and the second annotation information comprises any one or more of an identifier corresponding to the second sub-audio file, a starting time corresponding to second lyrics and a lyric duration.
According to the embodiment of the invention, the similarity between each lyric in the first lyric file and each lyric in the second lyric file is calculated by acquiring the first lyric file and the second lyric file, so that the first lyric in the first lyric file and the second lyric in the second lyric file with the similarity larger than the preset similarity threshold value are acquired; determining first time information corresponding to the first lyrics according to a first mapping relation between the lyrics and the time information in the first lyrics file, and determining second time information corresponding to the second lyrics according to a second mapping relation between the lyrics and the time information in the second lyrics file; cutting a first audio file corresponding to a first lyric file according to first time information to obtain a first sub-audio file, cutting a second audio file corresponding to a second lyric file according to second time information to obtain a second sub-audio file, adding first annotation information to the first sub-audio file, adding second annotation information to the second sub-audio file, and determining that the first sub-audio file added with the first annotation information and the second sub-audio file added with the second annotation information are audio samples of the same lyric fragment, wherein the first annotation information and the second annotation information are used for training a neural network model. By the implementation mode, the audio samples of the same lyric fragment can be extracted based on the lyric file, workload is reduced, the requirements on automation and intellectualization of extracting the audio samples are met, and the efficiency of extracting the audio samples is improved.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a terminal according to an embodiment of the present invention. Specifically, the terminal includes: memory 401, and processor 402.
In an embodiment, the terminal further comprises a data interface 403, the data interface 403 being for transferring data information between the terminal and other devices.
The memory 401 may include volatile memory (volatile memory); memory 401 may also include non-volatile memory (nonvolatile memory); memory 401 may also include a combination of the above types of memory. The processor 402 may be a central processing unit (central processing unit, CPU). The processor 402 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (programmable logic device, PLD), or a combination thereof. The PLD may be a complex programmable logic device (complex programmable logic device, CPLD), a field-programmable gate array (field-programmable gate array, FPGA), or any combination thereof.
The memory 401 is used for storing a program, and the processor 402 may call the program stored in the memory 401, for performing the following steps:
Acquiring a first lyric file and a second lyric file, wherein the first lyric file comprises a first mapping relation between lyrics and time information, and the second lyric file comprises a second mapping relation between lyrics and time information;
calculating the similarity of each lyric in the first lyric file and each lyric in the second lyric file, and acquiring the first lyric in the first lyric file and the second lyric in the second lyric file, wherein the similarity is larger than a preset similarity threshold;
determining first time information corresponding to the first lyrics according to the first mapping relation, and determining second time information corresponding to the second lyrics according to the second mapping relation;
cutting a first audio file corresponding to the first lyric file according to the first time information to obtain a first sub audio file, and cutting a second audio file corresponding to the second lyric file according to the second time information to obtain a second sub audio file;
adding first annotation information to the first sub-audio file, adding second annotation information to the second sub-audio file, and determining that the first sub-audio file added with the first annotation information and the second sub-audio file added with the second annotation information are audio samples of the same lyric fragment, wherein the first annotation information and the second annotation information are used for training a neural network model.
Further, when obtaining the first lyric file and the second lyric file, the processor 402 is specifically configured to:
acquiring the first audio file and the second audio file;
analyzing the first audio file to obtain the first lyrics file, and analyzing the second audio file to obtain the second lyrics file.
Further, when parsing the first audio file to obtain the first lyric file and parsing the second audio file to obtain the second lyric file, the processor 402 is specifically configured to:
performing traversal analysis on the first audio file according to a specified file format to obtain time information of each lyric corresponding to the first audio file, and determining the first lyric file according to each lyric and the time information of each lyric, wherein the time information comprises a starting time and a lyric duration corresponding to each lyric;
performing traversal analysis on the second audio file according to the specified file format to obtain time information of each lyric corresponding to the second audio file, and determining the second lyric file according to each lyric and the time information of each lyric, wherein the time information comprises starting time and lyric duration corresponding to each lyric.
Further, when clipping the first audio file corresponding to the first lyric file according to the first time information to obtain a first sub audio file, and clipping the second audio file corresponding to the second lyric file according to the second time information to obtain a second sub audio file, the processor 402 is specifically configured to:
cutting the first audio file according to the starting time and lyric duration corresponding to the first lyrics in the first time information to obtain the first sub audio file;
and cutting the second audio file according to the starting time and the lyric duration corresponding to the second lyrics in the second time information to obtain the second sub audio file.
Further, when the processor 402 calculates the similarity between each lyric in the first lyric file and each lyric in the second lyric file, the processor is specifically configured to:
determining a third mapping relation between each lyric in the first lyric file and each lyric in the second lyric file according to a preset rule;
comparing each lyric in the first lyric file with each corresponding lyric in the second lyric file according to the third mapping relation;
determining, according to the comparison result, a target lyric sequence over which a third lyric in the first lyric file is identical to a fourth lyric in the second lyric file;
and calculating the similarity of the third lyrics and the fourth lyrics according to the sequence length of the target lyrics sequence, the sequence length of the third lyrics and the sequence length of the fourth lyrics.
Further, when the processor 402 calculates the similarity between the third lyrics and the fourth lyrics according to the sequence length of the target lyrics sequence, the sequence length of the third lyrics and the sequence length of the fourth lyrics, the processor is specifically configured to:
comparing the sequence length of the third lyrics with the sequence length of the fourth lyrics to obtain the maximum sequence length;
and determining the similarity of the third lyrics and the fourth lyrics according to the ratio of the sequence length of the target lyrics sequence to the maximum sequence length.
Further, when the first annotation information and the second annotation information are used for training a music recognition model, the first annotation information comprises any one or more of an identifier corresponding to the first sub-audio file, a starting time corresponding to first lyrics and a lyric duration, and the second annotation information comprises any one or more of an identifier corresponding to the second sub-audio file, a starting time corresponding to second lyrics and a lyric duration.
According to the embodiment of the invention, a first lyric file and a second lyric file are acquired, the similarity between each lyric in the first lyric file and each lyric in the second lyric file is calculated, and a first lyric in the first lyric file and a second lyric in the second lyric file whose similarity is greater than a preset similarity threshold are obtained; first time information corresponding to the first lyric is determined according to a first mapping relation between lyrics and time information in the first lyric file, and second time information corresponding to the second lyric is determined according to a second mapping relation between lyrics and time information in the second lyric file; a first audio file corresponding to the first lyric file is clipped according to the first time information to obtain a first sub-audio file, and a second audio file corresponding to the second lyric file is clipped according to the second time information to obtain a second sub-audio file; first annotation information is added to the first sub-audio file, second annotation information is added to the second sub-audio file, and the first sub-audio file with the first annotation information and the second sub-audio file with the second annotation information are determined to be audio samples of the same lyric fragment, wherein the first annotation information and the second annotation information are used for training a neural network model. In this implementation, audio samples of the same lyric fragment can be extracted based on the lyric files, which reduces manual workload, meets the requirements for automated and intelligent audio sample extraction, and improves the efficiency of extracting audio samples.
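Tying the sketches above together, the overall flow of this implementation might look as follows; the nested pairing loop, the concrete threshold value, and the output naming scheme are assumptions for illustration, and parse_lyric_file, clip_sub_audio, lyric_similarity, and AnnotationInfo are the hypothetical helpers defined earlier.

    SIM_THRESHOLD = 0.8  # "preset similarity threshold"; the value is an assumption

    def extract_same_lyric_samples(audio1, lyrics1, audio2, lyrics2):
        """Pair up clips of the same lyric fragment from two recordings of a song."""
        samples = []
        for i, (text1, start1, dur1) in enumerate(parse_lyric_file(lyrics1)):
            for j, (text2, start2, dur2) in enumerate(parse_lyric_file(lyrics2)):
                if lyric_similarity(text1, text2) > SIM_THRESHOLD:
                    clip1 = clip_sub_audio(audio1, start1, dur1, f"a_{i}.wav")
                    clip2 = clip_sub_audio(audio2, start2, dur2, f"b_{j}.wav")
                    ann1 = AnnotationInfo(f"a_{i}", start1, dur1)
                    ann2 = AnnotationInfo(f"b_{j}", start2, dur2)
                    # Each pair is an audio sample of the same lyric fragment.
                    samples.append(((clip1, ann1), (clip2, ann2)))
        return samples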
The embodiment of the present invention further provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, it implements the method described in the embodiment corresponding to fig. 2 of the present invention, and can also implement the apparatus according to the embodiment corresponding to fig. 3 of the present invention, which is not described herein again.
The computer-readable storage medium may be an internal storage unit of the device according to any of the foregoing embodiments, for example, a hard disk or a memory of the device. The computer-readable storage medium may also be an external storage device of the device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the device. The computer-readable storage medium is used to store the computer program and other programs and data required by the terminal, and may also be used to temporarily store data that has been output or is to be output.
Those skilled in the art will appreciate that all or part of the methods in the above embodiments may be implemented by a computer program stored on a computer-readable storage medium; when executed, the program may include the flows of the method embodiments described above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above disclosure describes only some embodiments of the present invention and is not intended to limit the scope of the present invention; those skilled in the art will understand that all or part of the processes implementing the above embodiments, and equivalent modifications thereof, remain within the scope covered by the claims of the present invention.

Claims (10)

1. A method for extracting an audio sample, comprising:
acquiring a first lyric file and a second lyric file, wherein the first lyric file comprises a first mapping relation between lyrics and time information, and the second lyric file comprises a second mapping relation between lyrics and time information;
calculating the similarity between each lyric in the first lyric file and each lyric in the second lyric file, and acquiring a first lyric in the first lyric file and a second lyric in the second lyric file whose similarity is greater than a preset similarity threshold;
determining first time information corresponding to the first lyrics according to the first mapping relation, and determining second time information corresponding to the second lyrics according to the second mapping relation;
cutting a first audio file corresponding to the first lyric file according to the first time information to obtain a first sub-audio file, and cutting a second audio file corresponding to the second lyric file according to the second time information to obtain a second sub-audio file;
adding first annotation information to the first sub-audio file, adding second annotation information to the second sub-audio file, and determining that the first sub-audio file added with the first annotation information and the second sub-audio file added with the second annotation information are audio samples of the same lyric fragment, wherein the first annotation information and the second annotation information are used for training a neural network model.
2. The method of claim 1, wherein the obtaining the first lyrics file and the second lyrics file comprises:
acquiring the first audio file and the second audio file;
analyzing the first audio file to obtain the first lyrics file, and analyzing the second audio file to obtain the second lyrics file.
3. The method of claim 2, wherein the parsing the first audio file to obtain the first lyrics file and the parsing the second audio file to obtain the second lyrics file comprises:
performing traversal analysis on the first audio file according to a specified file format to obtain time information of each lyric corresponding to the first audio file, and determining the first lyric file according to each lyric and the time information of each lyric, wherein the time information comprises a starting time and a lyric duration corresponding to each lyric;
performing traversal analysis on the second audio file according to the specified file format to obtain time information of each lyric corresponding to the second audio file, and determining the second lyric file according to each lyric and the time information of each lyric, wherein the time information comprises a starting time and a lyric duration corresponding to each lyric.
4. The method of claim 3, wherein the clipping the first audio file corresponding to the first lyric file according to the first time information to obtain a first sub-audio file, and clipping the second audio file corresponding to the second lyric file according to the second time information to obtain a second sub-audio file, comprises:
cutting the first audio file according to the starting time and the lyric duration corresponding to the first lyrics in the first time information to obtain the first sub-audio file;
and cutting the second audio file according to the starting time and the lyric duration corresponding to the second lyrics in the second time information to obtain the second sub-audio file.
5. The method of claim 1, wherein the calculating similarity of each lyric in the first lyrics file to each lyric in the second lyrics file comprises:
determining a third mapping relation between each lyric in the first lyric file and each lyric in the second lyric file according to a preset rule;
comparing each lyric in the first lyric file with each corresponding lyric in the second lyric file according to the third mapping relation;
determining, according to the comparison result, a target lyric sequence in which third lyrics in the first lyric file are the same as fourth lyrics in the second lyric file;
and calculating the similarity of the third lyrics and the fourth lyrics according to the sequence length of the target lyrics sequence, the sequence length of the third lyrics and the sequence length of the fourth lyrics.
6. The method of claim 5, wherein the calculating the similarity of the third lyrics to the fourth lyrics based on the sequence length of the target lyrics sequence, the sequence length of the third lyrics, and the sequence length of the fourth lyrics comprises:
comparing the sequence length of the third lyrics with the sequence length of the fourth lyrics to obtain the maximum sequence length;
and determining the similarity of the third lyrics and the fourth lyrics according to the ratio of the sequence length of the target lyrics sequence to the maximum sequence length.
7. The method of claim 1, wherein
when the first annotation information and the second annotation information are used for training a music recognition model, the first annotation information comprises any one or more of an identifier corresponding to the first sub-audio file, a starting time corresponding to the first lyrics, and a lyric duration, and the second annotation information comprises any one or more of an identifier corresponding to the second sub-audio file, a starting time corresponding to the second lyrics, and a lyric duration.
8. An audio sample extraction apparatus, comprising:
an acquisition unit, configured to acquire a first lyric file and a second lyric file, wherein the first lyric file comprises a first mapping relation between lyrics and time information, and the second lyric file comprises a second mapping relation between lyrics and time information;
a calculating unit, configured to calculate the similarity between each lyric in the first lyric file and each lyric in the second lyric file, and to acquire a first lyric in the first lyric file and a second lyric in the second lyric file whose similarity is greater than a preset similarity threshold;
a first determining unit, configured to determine first time information corresponding to the first lyrics according to the first mapping relationship, and determine second time information corresponding to the second lyrics according to the second mapping relationship;
a clipping unit, configured to clip a first audio file corresponding to the first lyric file according to the first time information to obtain a first sub-audio file, and to clip a second audio file corresponding to the second lyric file according to the second time information to obtain a second sub-audio file;
a second determining unit, configured to add first annotation information to the first sub-audio file, add second annotation information to the second sub-audio file, and determine that the first sub-audio file with the first annotation information and the second sub-audio file with the second annotation information are audio samples of the same lyric fragment, wherein the first annotation information and the second annotation information are used for training a neural network model.
9. A terminal comprising a processor, an input device, an output device and a memory, the processor, the input device, the output device and the memory being interconnected, wherein the memory is configured to store a computer program comprising program instructions, and the processor is configured to invoke the program instructions to perform the method of any of claims 1-7.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores program instructions which, when executed, cause the method according to any of claims 1-7 to be carried out.
CN202010984280.XA 2020-09-17 2020-09-17 Audio sample extraction method, device, terminal and storage medium Active CN112133327B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010984280.XA CN112133327B (en) 2020-09-17 2020-09-17 Audio sample extraction method, device, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN112133327A CN112133327A (en) 2020-12-25
CN112133327B true CN112133327B (en) 2024-02-13

Family

ID=73841929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010984280.XA Active CN112133327B (en) 2020-09-17 2020-09-17 Audio sample extraction method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN112133327B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW201214412A (en) * 2010-09-24 2012-04-01 Hon Hai Prec Ind Co Ltd Electronic device capable of displaying synchronized lyrics when playing a song and method thereof
CN102467939A (en) * 2010-11-04 2012-05-23 Beijing Caiyun Online Technology Development Co Ltd Song audio cutting apparatus and method thereof
CN108829751A (en) * 2018-05-25 2018-11-16 Tencent Music Entertainment Technology (Shenzhen) Co Ltd Method, apparatus, electronic device and storage medium for generating and displaying lyrics
CN109326270A (en) * 2018-09-18 2019-02-12 Ping An Technology (Shenzhen) Co Ltd Audio file generation method, terminal device and medium
WO2019104890A1 (en) * 2017-12-01 2019-06-06 Shenzhen OneConnect Smart Technology Co Ltd Fraud identification method and device combining audio analysis and video analysis and storage medium
JP2020003537A (en) * 2018-06-25 2020-01-09 Casio Computer Co Ltd Audio extraction device, learning device, karaoke device, audio extraction method, learning method and program
CN110867180A (en) * 2019-10-15 2020-03-06 Beijing Leishi Tiandi Electronic Technology Co Ltd System and method for generating word-by-word lyric file based on K-means clustering algorithm

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11475887B2 (en) * 2018-10-29 2022-10-18 Spotify Ab Systems and methods for aligning lyrics using a neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Statistical characteristics of Chinese lyrics and their application in retrieval; Zheng Yabin; Liu Zhiyuan; Sun Maosong; Journal of Chinese Information Processing (05); full text *

Also Published As

Publication number Publication date
CN112133327A (en) 2020-12-25

Similar Documents

Publication Publication Date Title
CN111145737B (en) Voice test method and device and electronic equipment
CN111292751B (en) Semantic analysis method and device, voice interaction method and device, and electronic equipment
CN110008378B (en) Corpus collection method, device, equipment and storage medium based on artificial intelligence
CN110175334B (en) Text knowledge extraction system and method based on custom knowledge slot structure
JP2020030408A (en) Method, apparatus, device and medium for identifying key phrase in audio
CN108027814B (en) Stop word recognition method and device
CN110209809B (en) Text clustering method and device, storage medium and electronic device
CN112053692B (en) Speech recognition processing method, device and storage medium
CN109300474B (en) Voice signal processing method and device
CN109101491B (en) Author information extraction method and device, computer device and computer readable storage medium
CN111160445B (en) Bid file similarity calculation method and device
CN107506407B (en) File classification and calling method and device
CN111354354B (en) Training method, training device and terminal equipment based on semantic recognition
CN111401034B (en) Semantic analysis method, semantic analysis device and terminal for text
CN111325031A (en) Resume parsing method and device
CN105550308B (en) A kind of information processing method, search method and electronic equipment
CN113761137B (en) Method and device for extracting address information
CN107680598B (en) Information interaction method, device and equipment based on friend voiceprint address list
CN111723235A (en) Music content identification method, device and equipment
CN112133327B (en) Audio sample extraction method, device, terminal and storage medium
CN111198965B (en) Song retrieval method, song retrieval device, server and storage medium
CN112765963A (en) Sentence segmentation method and device, computer equipment and storage medium
CN114691907B (en) Cross-modal retrieval method, device and medium
CN112802495A (en) Robot voice test method and device, storage medium and terminal equipment
CN111782868B (en) Audio processing method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant