CN111816157A - Music score intelligent video-singing method and system based on voice synthesis - Google Patents

Music score intelligent video-singing method and system based on voice synthesis

Info

Publication number
CN111816157A
CN111816157A (application CN202010590726.0A)
Authority
CN
China
Prior art keywords
music score
audio
abc
splicing
notes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010590726.0A
Other languages
Chinese (zh)
Other versions
CN111816157B (en)
Inventor
刘昆宏
吴清强
吴苏悦
张敬峥
詹旺平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University filed Critical Xiamen University
Priority to CN202010590726.0A priority Critical patent/CN111816157B/en
Publication of CN111816157A publication Critical patent/CN111816157A/en
Application granted granted Critical
Publication of CN111816157B publication Critical patent/CN111816157B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — not used; see below
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers

Abstract

The invention provides a music score intelligent video-singing method and system based on speech synthesis. The method comprises the following steps: step one, data preparation: inputting and parsing an abc music score to obtain the pitch and duration of each note in the score; step two, parameter training: generating note sequences of length at most 5 when the training data are made, i.e. dividing all notes of a complete abc music score into groups of 5 when the score is processed; step three, synthesized-audio splicing, specifically comprising the three sub-steps of score segmentation and recognition, segment splicing, and waveform alignment with blank-segment filling; step four, visually displaying the synthesized audio. The invention solves the technical problems of a large amount of computation during training, obvious splicing traces and splicing noise under direct concatenation, and so on; comparing the generated audio with the original data, the difference is difficult to distinguish by ear.

Description

Music score intelligent video-singing method and system based on voice synthesis
Technical Field
The invention belongs to the field of computers, and particularly relates to a music score intelligent video-singing (sight-singing) method and system based on speech synthesis.
Background
In recent years, with the continuous development of networks and the continuous iteration of mobile and desktop applications, online education has developed rapidly. Art subjects are special in that they emphasize communication and mutual feedback between teacher and student: a student must receive timely feedback to correctly recognize his or her own mistakes. In music study, for example, if every student simply reads a score and sings it alone, the teacher cannot attend to every student at the same time; music is best learned with multiple senses cooperating, the eyes reading, the ears listening, and the mouth singing, which also makes such stimulated-sense learning suit other subjects.
The invention studies the sight-singing teaching module of an online music-teaching platform. The hope is that score singing can shift from the traditional professional teacher to a computer that imitates a real human voice singing the score, so that students can flexibly study a variety of scores and the teacher or platform can greatly expand its database. The research problem of the invention is therefore to complete the score singing task with a computer.
Score singing is a problem similar to speech synthesis, and current speech-synthesis technology has already reached a marketable level. A music score can be regarded as a special language: all the computer needs to learn is the pronunciation of each note. A score places information such as the beat in its header, so during processing this information must be brought into every note of the string; the string then carries the duration and pitch of each note. In practice, however, technical problems arise: the training data can be too large to train on, and direct concatenation leaves obvious splicing traces and produces noise.
Disclosure of Invention
The invention provides a music score intelligent video-singing method and system based on speech synthesis, which translate a music score into a new language that fits the character-to-pronunciation mapping of an ordinary language. The invention solves the technical problems of a large amount of computation during training, obvious splicing traces and splicing noise under direct concatenation, and so on; comparing the generated audio with the original data, the difference is difficult to distinguish by ear.
In order to solve the above problems, the present invention provides a music score intelligent video-singing method based on speech synthesis, the method comprising:
step one, data preparation: inputting and parsing an abc music score to obtain the pitch and duration of each note in the score;
step two, parameter training: generating note sequences of length at most 5 when the training data are made, i.e. dividing all notes of a complete abc music score into groups of 5 when the score is processed;
step three, synthesized-audio splicing, specifically comprising the three sub-steps of score segmentation and recognition, segment splicing, and waveform alignment with blank-segment filling;
step four, visually displaying the synthesized audio.
In a second aspect, an embodiment of the present application provides a music score intelligent video-song system based on speech synthesis, where the system includes:
the data preparation module is used for inputting and analyzing the abc music score to obtain the pitch and duration information of each note in the specific abc music score;
the training parameter module, which generates note sequences of length at most 5 when the training data are produced, i.e. divides all notes of a complete abc music score into groups of 5 when the score is processed;
the synthesis audio splicing module comprises a music score segmentation identification sub-module, a segment splicing sub-module and a waveform alignment and blank segment filling sub-module;
and the visualization module is used for visually displaying the synthesized audio.
In a third aspect, the present application provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the method described in the present application.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; the computer program, when executed by a processor, implements a method as described in the embodiments of the present application.
Drawings
FIG. 1 is a flow chart of music score intelligent video-song based on speech synthesis according to the present invention;
fig. 2 is a diagram of the effect of synthesizing audio results based on a score according to the present invention.
Detailed Description
In order to make the objects, technical processes and technical innovation points of the present invention more clearly illustrated, the present invention is further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
In order to achieve the above purpose, the invention provides a music score intelligent video-singing method based on voice synthesis. The main process is shown in fig. 1, and the method comprises the following steps:
step one, data preparation: inputting and parsing an abc music score to obtain the pitch and duration of each note in the score;
the data to be processed in the application is an abc file, the abc file comprises two parts of information, the first five parts of information are rhythm, tone and the like of a music score, the lower part of the information is notes, the music score is different from a language, the notation of the whole music score is specified in the head of the music score, and the language is that each word has an independent pronunciation rule, so when the abc music score is processed, a similar processing method needs to be considered, and the score information in the head is brought into each note.
step two, parameter training: generating note sequences of length at most 5 when the training data are made, i.e. dividing all notes of a complete abc music score into groups of 5 when the score is processed;
step three, synthesized-audio splicing, specifically comprising the three sub-steps of score segmentation and recognition, segment splicing, and waveform alignment with blank-segment filling;
step four, visually displaying the synthesized audio.
Preferably, the data-preparation step further comprises: each note is represented by three parts, duration, note name and pitch. The shortest duration unit is 1/8 beat, and note durations are 1/8, 1/4, 3/8, 1/2, 3/4 and 1 beat; the pitch range is f3 to f#5. Because real vocal data are difficult to collect, the vocaloid software is used to synthesize audio in place of a human voice. For accurate labelling, a lowered semitone (flat) is rewritten as the raised semitone (sharp) of the note below it, so the pitches of all black keys are represented uniformly as sharps. With this combined representation, the score header information is carried into every note, which solves the abc score translation problem.
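The labelling convention above can be sketched as follows; the label format, the sharp-only pitch names, and the MIDI-style note numbering are assumptions made for illustration:

```python
# Sketch of the note-labelling convention described above: every label is
# (pitch, duration), durations are multiples of the 1/8-beat unit, and
# all black-key pitches are written uniformly as sharps (never flats) so
# that each sound has exactly one label. Names and numbering are assumed.

SHARP_NAMES = ["c", "c#", "d", "d#", "e", "f", "f#", "g", "g#", "a", "a#", "b"]
DURATIONS = ("1/8", "1/4", "3/8", "1/2", "3/4", "1")   # allowed note lengths

def label(midi: int, duration: str) -> str:
    assert duration in DURATIONS, "duration outside the trained set"
    name = SHARP_NAMES[midi % 12] + str(midi // 12 - 1)  # black keys -> sharps
    return f"{name}_{duration}"

print(label(66, "1/4"))   # f#4_1/4  (F#/Gb is always labelled as a sharp)
print(label(53, "1/8"))   # f3_1/8   (bottom of the f3..f#5 range)
```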
Preferably, the training process further comprises: a script was used to generate 11000 samples in total; the data set is divided into 10000 samples as training data, with the remaining 1000 reserved for testing the model. The test method is manual listening, judging whether the duration and pronunciation of each sound are accurate. A checkpoint, together with the current encoder-decoder alignment, is generated every 1000 training steps, so that training can resume from the checkpoint after an interruption. The sharp sign # and the Arabic numerals are replaced with English letters; the bar lengths remain fixed, and if a pronunciation word is short, variable-length data are used; the 'r' sound is added directly to the synthesized audio during later synthesis.
The method adopts an 80-band mel-scale spectrogram as the target output of the decoder; the sample rate is 20000 Hz, the frame length 50 ms and the frame shift 12.5 ms. The optimizer is conventional Adam with a batch size of 32, and a delayed (decayed) learning rate effectively reduces overfitting of the model.
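A minimal numpy sketch of such a decoder target follows, using the stated parameters (20000 Hz sample rate, 50 ms frame, 12.5 ms shift, 80 mel bands). A real pipeline would more likely use librosa or torchaudio; the filterbank construction here is a common textbook form, not taken from the patent:

```python
import numpy as np

# Sketch of the decoder target described above: an 80-band mel spectrogram
# at 20000 Hz with a 50 ms frame and a 12.5 ms hop, built with numpy only.

SR, N_FFT, HOP, N_MELS = 20000, 1000, 250, 80   # 50 ms frame, 12.5 ms hop

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mels = np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2)
    hz = 700.0 * (10.0 ** (mels / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mel_spectrogram(wav):
    n_frames = 1 + (len(wav) - N_FFT) // HOP
    frames = np.stack([wav[i * HOP:i * HOP + N_FFT] * np.hanning(N_FFT)
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return power @ mel_filterbank(N_MELS, N_FFT, SR).T   # (frames, 80)

wav = np.sin(2 * np.pi * 440 * np.arange(SR) / SR)       # 1 s test tone
print(mel_spectrogram(wav).shape)                        # (77, 80)
```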
Preferably, the synthetic audio splicing specifically includes:
performing segmentation and recognition on the music score: the abc file is split, taking 5 notes as a group, into a number of measures; the file is also split wherever the symbol r is met, with the remaining notes forming a measure; each measure is processed with the trained model to synthesize audio, generating one wav file per measure;
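The segmentation rule just described can be sketched as follows; the token names are illustrative, and 'r'-prefixed tokens stand for the rest symbol:

```python
# Sketch of the segmentation rule described above: walk through the note
# tokens, cut a measure whenever 5 notes have accumulated or an 'r' rest
# is met, and keep any remaining notes as a final measure. Each measure
# would then be fed to the trained model to produce one wav file.

def split_measures(tokens, group=5):
    measures, cur = [], []
    for tok in tokens:
        if tok.startswith("r"):          # rest symbol: close current measure
            if cur:
                measures.append(cur)
            cur = []
        else:
            cur.append(tok)
            if len(cur) == group:        # a full group of 5 notes
                measures.append(cur)
                cur = []
    if cur:                              # remaining notes form a measure
        measures.append(cur)
    return measures

print(split_measures(["C", "D", "E", "r2", "F", "G", "A", "B", "c", "d"]))
# [['C', 'D', 'E'], ['F', 'G', 'A', 'B', 'c'], ['d']]
```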
segment splicing: the audio synthesized for the segments is spliced together;
waveform alignment and blank-segment filling: for a continuously sung score, direct splicing produces a rough seam. Simple concatenation was first observed with audio software, and two basic problems must be handled. The first is trimming the head and tail: a short blank interval exists at the head and tail of each synthesized clip, so the blank positions must be cut; since the blank time is nearly the same for every clip, this can be done uniformly. The second problem is the 'pop' sound of a rough splice, produced mainly when one sound follows immediately after another (for example the four sounds 'Do', 'Re', 'Mi' and 'La'); after magnifying the waveform, the noise was found to be caused by a sudden jump in the waveform at the seam. To solve these problems, the application adopts the following scheme: 'r' in an abc score denotes an unvoiced blank segment; when synthesizing the audio, a blank-duration clip is added directly as a buffer wherever one is needed, and sox is used to join the measures. An r sound consists of two parts, a pronunciation length and r, and the 'r' audio matching the requirements of the abc score is selected for processing; each splice point is processed and output with a very short r sound that eliminates the noise without affecting the perceived duration. This method gives unexpectedly good results in eliminating the noise. Experimentally, as shown in fig. 2, after magnifying the waveform of a synthesized result, the blank r segment added at the splice can be seen, while it is not perceptible to the ear; the practical effect is the same as audio sung directly by a person.
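The trimming-and-buffering scheme can be sketched in numpy as follows. The silence threshold and the 10 ms buffer length are assumptions for illustration; the patent performs the actual joining with sox:

```python
import numpy as np

# Sketch of the splicing scheme described above: trim the near-constant
# blank time at the head and tail of each synthesized measure, then join
# measures with a very short silent 'r' buffer so the waveform does not
# jump at the seam (the stated cause of the click noise).

SR = 20000
R_BUFFER = np.zeros(int(0.01 * SR))          # 10 ms silent 'r' segment

def trim(wav, thresh=1e-3):
    """Remove leading/trailing samples below the silence threshold."""
    idx = np.flatnonzero(np.abs(wav) > thresh)
    return wav[idx[0]:idx[-1] + 1] if idx.size else wav

def splice(measures):
    parts = []
    for i, m in enumerate(measures):
        parts.append(trim(m))
        if i < len(measures) - 1:
            parts.append(R_BUFFER)           # buffer hides the waveform jump
    return np.concatenate(parts)

a = np.concatenate([np.zeros(100), np.ones(500), np.zeros(100)])
b = np.concatenate([np.zeros(100), np.ones(300), np.zeros(100)])
out = splice([a, b])
print(len(out))   # 500 + 200 + 300 = 1000
```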
As another aspect, the present application further provides a music score intelligent video-song system based on speech synthesis, the system including:
the data preparation module is used for inputting and analyzing the abc music score to obtain the pitch and duration information of each note in the specific abc music score;
the training parameter module, which generates note sequences of length at most 5 when the training data are produced, i.e. divides all notes of a complete abc music score into groups of 5 when the score is processed;
the synthesis audio splicing module comprises a music score segmentation identification sub-module, a segment splicing sub-module and a waveform alignment and blank segment filling sub-module;
and the visualization module is used for visually displaying the synthesized audio.
Preferably, the data preparation module further comprises: each note is represented by three parts, duration, note name and pitch; the shortest duration unit is 1/8 beat, and note durations are 1/8, 1/4, 3/8, 1/2, 3/4 and 1 beat; the pitch range is f3 to f#5; because real vocal data are difficult to collect, the vocaloid software is used to synthesize audio in place of a human voice; for accurate labelling, a lowered semitone (flat) is rewritten as the raised semitone (sharp) of the note below it, so the pitches of all black keys are represented uniformly as sharps.
Preferably, the training parameter module is further configured to:
the alignment condition of the check point and the current encoder and decoder can be generated every 1000 training steps, and the training can be recovered from the check point again after the training is interrupted in the midway;
the number # and Arabic numerals are replaced by English letters, the lengths of the bars are still fixed, if the pronunciation words are short, data with unfixed lengths are used, and the 'r' sound is directly added into the synthesized audio in the subsequent synthesis.
Preferably, the synthesized audio splicing module specifically includes:
the music score segmentation recognition sub-module is used for splitting an abc file, splitting the abc file into a group of 5 notes and a plurality of measures, splitting the abc file when the r symbol is met, using the rest notes as a measure, processing each measure by using a trained model to synthesize audio, and generating a wav file corresponding to each measure;
the segment splicing submodule is used for splicing the audio synthesized by segments;
the waveform alignment and blank section filling submodule is used for 'r' in an abc music score to represent an unvoiced blank section, when the audio is synthesized, the audio with blank duration is directly added to a part needing to be added to serve as a buffer zone, sox is used for synthesizing a measure, r sound is composed of two parts, namely pronunciation length and r, the corresponding 'r' audio is selected according to the requirements of the abc music score to be processed, and each section of spliced part is processed and output by using a very short r sound which can eliminate the noise and does not influence the duration feeling.
As another aspect, the present application further provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method as described in the embodiments of the present application when executing the computer program.
As another aspect, the present application also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the foregoing device in the foregoing embodiment; or it may be a separate computer readable storage medium not incorporated into the device. The computer readable storage medium stores one or more programs for use by one or more processors in performing the methods described in the embodiments of the present application.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the steps or methods may be implemented by software or firmware stored in memory and executed by a suitable instruction-execution system. If implemented in hardware, as in another embodiment, any one or a combination of the following techniques known in the art may be used: discrete logic circuits with logic gates implementing logic functions on data signals, an application-specific integrated circuit (ASIC) with suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (10)

1. A music score intelligent video-singing method based on voice synthesis comprises the following steps:
step one, data preparation: inputting and parsing an abc music score to obtain the pitch and duration of each note in the score;
step two, parameter training: generating note sequences of length at most 5 when the training data are made, i.e. dividing all notes of a complete abc music score into groups of 5 when the score is processed;
step three, synthesized-audio splicing, specifically comprising the three sub-steps of score segmentation and recognition, segment splicing, and waveform alignment with blank-segment filling;
step four, visually displaying the synthesized audio.
2. The method of claim 1, wherein the data preparation step further comprises: each note is represented by three parts, duration, note name and pitch; the shortest duration unit is 1/8 beat, and note durations are 1/8, 1/4, 3/8, 1/2, 3/4 and 1 beat; the pitch range is f3 to f#5; because real vocal data are difficult to collect, the vocaloid software is used to synthesize audio in place of a human voice; for accurate labelling, a lowered semitone (flat) is rewritten as the raised semitone (sharp) of the note below it, so the pitches of all black keys are represented uniformly as sharps.
3. The method of claim 1, wherein the training process further comprises:
the alignment condition of the check point and the current encoder and decoder can be generated every 1000 training steps, and the training can be recovered from the check point again after the training is interrupted in the midway;
the number # and Arabic numerals are replaced by English letters, the lengths of the bars are still fixed, if the pronunciation words are short, data with unfixed lengths are used, and the 'r' sound is directly added into the synthesized audio in the subsequent synthesis.
4. The method of claim 1, wherein the synthetic audio concatenation specifically comprises:
performing segmentation and recognition on the music score: the abc file is split, taking 5 notes as a group, into a number of measures; the file is also split wherever the symbol r is met, with the remaining notes forming a measure; each measure is processed with the trained model to synthesize audio, generating one wav file per measure;
segment splicing: the audio synthesized for the segments is spliced together;
waveform alignment and blank-segment filling: 'r' in the abc music score denotes an unvoiced blank segment; when the audio is synthesized, a blank-duration clip is added directly as a buffer wherever one is needed, and sox is used to join the measures; an r sound consists of two parts, a pronunciation length and r, and the 'r' audio matching the requirements of the abc score is selected for processing; each splice point is processed and output with a very short r sound that eliminates the noise without affecting the perceived duration.
5. A music score intelligent video-song system based on speech synthesis, the system comprising:
the data preparation module is used for inputting and analyzing the abc music score to obtain the pitch and duration information of each note in the specific abc music score;
the training parameter module, which generates note sequences of length at most 5 when the training data are produced, i.e. divides all notes of a complete abc music score into groups of 5 when the score is processed;
the synthesis audio splicing module comprises a music score segmentation identification sub-module, a segment splicing sub-module and a waveform alignment and blank segment filling sub-module;
and the visualization module is used for visually displaying the synthesized audio.
6. The system of claim 5, wherein the data preparation module further comprises: each note is represented by three parts, duration, note name and pitch; the shortest duration unit is 1/8 beat, and note durations are 1/8, 1/4, 3/8, 1/2, 3/4 and 1 beat; the pitch range is f3 to f#5; because real vocal data are difficult to collect, the vocaloid software is used to synthesize audio in place of a human voice; for accurate labelling, a lowered semitone (flat) is rewritten as the raised semitone (sharp) of the note below it, so the pitches of all black keys are represented uniformly as sharps.
7. The system of claim 5, wherein the training parameter module is further configured to:
the alignment condition of the check point and the current encoder and decoder can be generated every 1000 training steps, and the training can be recovered from the check point again after the training is interrupted in the midway;
the number # and Arabic numerals are replaced by English letters, the lengths of the bars are still fixed, if the pronunciation words are short, data with unfixed lengths are used, and the 'r' sound is directly added into the synthesized audio in the subsequent synthesis.
8. The system of claim 5, wherein the synthesized audio splicing module specifically comprises:
the music score segmentation and recognition sub-module, which splits the abc file, taking 5 notes as a group, into a number of measures; the file is also split wherever the symbol r is met, with the remaining notes forming a measure; each measure is processed with the trained model to synthesize audio, generating one wav file per measure;
the segment splicing sub-module, which splices the audio synthesized for the segments;
the waveform alignment and blank-segment filling sub-module: 'r' in the abc music score denotes an unvoiced blank segment; when the audio is synthesized, a blank-duration clip is added directly as a buffer wherever one is needed, and sox is used to join the measures; an r sound consists of two parts, a pronunciation length and r, and the 'r' audio matching the requirements of the abc score is selected for processing; each splice point is processed and output with a very short r sound that eliminates the noise without affecting the perceived duration.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 4 when executing the computer program.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1-4.
CN202010590726.0A 2020-06-24 2020-06-24 Music score intelligent video-singing method and system based on voice synthesis Active CN111816157B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010590726.0A CN111816157B (en) 2020-06-24 2020-06-24 Music score intelligent video-singing method and system based on voice synthesis


Publications (2)

Publication Number Publication Date
CN111816157A true CN111816157A (en) 2020-10-23
CN111816157B CN111816157B (en) 2023-01-31

Family

ID=72854997

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010590726.0A Active CN111816157B (en) 2020-06-24 2020-06-24 Music score intelligent video-singing method and system based on voice synthesis

Country Status (1)

Country Link
CN (1) CN111816157B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111816148A (en) * 2020-06-24 2020-10-23 厦门大学 Virtual human voice and video singing method and system based on generation countermeasure network

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN2101314U (en) * 1991-05-09 1992-04-08 北京电子专科学校 Numbered musical notation recorded automatically playing music apparatus
CN103902647A (en) * 2013-12-27 2014-07-02 上海斐讯数据通信技术有限公司 Music score identifying method used on intelligent equipment and intelligent equipment
CN104978884A (en) * 2015-07-18 2015-10-14 呼和浩特职业学院 Teaching system of preschool education profession student music theory and solfeggio learning
US20180308382A1 (en) * 2015-10-25 2018-10-25 Morel KOREN A system and method for computer-assisted instruction of a music language
CN110148394A (en) * 2019-04-26 2019-08-20 平安科技(深圳)有限公司 Song synthetic method, device, computer equipment and storage medium
JP2019219569A (en) * 2018-06-21 2019-12-26 カシオ計算機株式会社 Electronic music instrument, control method of electronic music instrument, and program
CN110738980A (en) * 2019-09-16 2020-01-31 平安科技(深圳)有限公司 Singing voice synthesis model training method and system and singing voice synthesis method


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111816148A (en) * 2020-06-24 2020-10-23 厦门大学 Virtual human voice and video singing method and system based on generation countermeasure network
CN111816148B (en) * 2020-06-24 2023-04-07 厦门大学 Virtual human voice and video singing method and system based on generation countermeasure network

Also Published As

Publication number Publication date
CN111816157B (en) 2023-01-31

Similar Documents

Publication Publication Date Title
US10347238B2 (en) Text-based insertion and replacement in audio narration
RU2690863C1 (en) System and method for computerized teaching of a musical language
CN102360543B (en) HMM-based bilingual (mandarin-english) TTS techniques
US6865533B2 (en) Text to speech
Svantesson et al. Tone production, tone perception and Kammu tonogenesis
US9196240B2 (en) Automated text to speech voice development
EP2337006A1 (en) Speech processing and learning
Munson et al. The phonetics of sex and gender
KR20190041105A (en) Learning system and method using sentence input and voice input of the learner
Heyne et al. Native language influence on brass instrument performance: An application of generalized additive mixed models (GAMMs) to midsagittal ultrasound images of the tongue
Yang et al. The perception of Mandarin Chinese tones and intonation by American learners
CN111816157B (en) Music score intelligent video-singing method and system based on voice synthesis
Zhang Current trends in research of Chinese sound acquisition
KR20070103095A (en) System for studying english using bandwidth of frequency and method using thereof
Moosmüller Vowels in Standard Austrian German
Ferris Techniques and challenges in speech synthesis
Pyshkin et al. Multimodal modeling of the mora-timed rhythm of Japanese and its application to computer-assisted pronunciation training
Murphy Controlling the voice quality dimension of prosody in synthetic speech using an acoustic glottal model
Sparvoli From phonological studies to teaching Mandarin tone: Some perspectives on the revision of the tonal inventory
TWI806703B (en) Auxiliary method and system for voice correction
KR102610871B1 (en) Speech Training System For Hearing Impaired Person
CN114758560B (en) Humming pitch evaluation method based on dynamic time warping
Hill et al. Unrestricted text-to-speech revisited: rhythm and intonation.
Hussein et al. Towards a computer-aided pronunciation training system for German learners of Mandarin-prosodic analysis
Gregová Comparative Phonetics and Phonology of the English and the Slovak Language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant