CN115862603A - Song voice recognition method, system, storage medium and electronic equipment - Google Patents

Song voice recognition method, system, storage medium and electronic equipment

Info

Publication number
CN115862603A
CN115862603A
Authority
CN
China
Prior art keywords
data
song
original
voice
song voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211397956.0A
Other languages
Chinese (zh)
Other versions
CN115862603B
Inventor
周晓桐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shumei Tianxia Beijing Technology Co ltd
Beijing Nextdata Times Technology Co ltd
Original Assignee
Shumei Tianxia Beijing Technology Co ltd
Beijing Nextdata Times Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shumei Tianxia Beijing Technology Co ltd and Beijing Nextdata Times Technology Co ltd
Priority to CN202211397956.0A
Publication of CN115862603A
Application granted
Publication of CN115862603B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Abstract

The invention relates to a song voice recognition method, a song voice recognition system, a storage medium and electronic equipment. The method comprises: acquiring and fusing the voice feature data, text prosody feature data and acoustic prosody feature data in each original song voice sample to obtain fused feature data corresponding to each original song voice sample; training a preset ASR model for song voice recognition based on the fused feature data to obtain a target song voice recognition model; and inputting the song voice data to be recognized into the target song voice recognition model for recognition to obtain a target translation text corresponding to the song voice data to be recognized. By fusing the multiple prosodic features and voice features in each song voice sample and training the speech recognition model on the fused features, the method improves the model's recognition accuracy on song voice, so that a high-precision translation text of the song voice can be obtained.

Description

Song voice recognition method, system, storage medium and electronic equipment
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular to a song speech recognition method, a song speech recognition system, a storage medium, and an electronic device.
Background
With the development of the internet and AI technology, automatic speech recognition is widely used across vertical domains, and demand is especially strong in live-streaming scenarios, which involve a large volume of song recognition. Because traditional acoustic feature extraction retains little prosodic information, speech recognition models translate songs poorly and their recognition accuracy on song voice is low.
Therefore, it is desirable to provide a technical solution to solve the above technical problems.
Disclosure of Invention
In order to solve the above technical problem, the invention provides a song voice recognition method, a song voice recognition system, a storage medium and an electronic device.
The technical scheme of the song voice recognition method of the invention is as follows:
acquiring and respectively fusing voice characteristic data, text prosody characteristic data and acoustic prosody characteristic data in each original song voice sample containing original song voice data and original text data to obtain fusion characteristic data corresponding to each original song voice sample;
training a preset ASR model for song voice recognition based on the plurality of fusion characteristic data to obtain a target song voice recognition model;
and inputting the song voice data to be recognized into the target song voice recognition model for recognition to obtain a target translation text corresponding to the song voice data to be recognized.
The song voice recognition method has the beneficial effects that:
according to the method, the prosodic features and the voice features in the song voice sample are fused, and the fused features are input into the voice recognition model for training, so that the accuracy of the voice recognition model for song voice recognition is improved, and the high-precision translation text of the song voice can be obtained.
On the basis of the scheme, the song voice recognition method can be further improved as follows.
Further, the step of obtaining the voice feature data, text prosody feature data and acoustic prosody feature data in any original song voice sample containing the original song voice data and the original text data includes:
preprocessing the original text data of any original song voice sample to obtain first text data of any original song voice sample, and extracting text prosody feature data of any original song voice sample from the first text data of any original song voice sample;
decoupling the original song voice data of any original song voice sample through a Mel filter to obtain the sound feature data of that sample, and inputting the sound feature data into a preset GMM model to obtain a phoneme corresponding to each frame of sound feature data;
and acquiring the acoustic prosody feature data of any original song voice sample from each frame of sound feature data of that sample and the phoneme corresponding to each frame of sound feature data.
Further, the step of fusing the voice feature data, the text prosody feature data and the acoustic prosody feature data of any original song voice sample includes:
and performing feature fusion on the voice feature data, the text prosody feature data and the acoustic prosody feature data of any original song voice sample through an attention mechanism to obtain fusion feature data of any original song voice sample.
Further, the text prosody feature data of any original song voice sample includes: the initial consonant information, final information and tone information of any original song voice sample; the acoustic prosody feature data of any original song voice sample comprises: the pronunciation duration, pronunciation speed and pronunciation tone of any original song voice sample.
Further, the step of training a preset ASR model for song speech recognition based on the plurality of fusion feature data to obtain a target song speech recognition model includes:
inputting each fusion characteristic data into the preset ASR model respectively for training to obtain a loss value of each fusion characteristic data;
optimizing the parameters of the preset ASR model according to all the loss values to obtain an optimized ASR model;
and taking the optimized ASR model as the preset ASR model, returning and executing the step of inputting each fusion feature data into the preset ASR model for training respectively, and determining the optimized ASR model as the target song voice recognition model when the optimized ASR model meets preset conditions.
The technical scheme of the song voice recognition system is as follows:
the method comprises the following steps: the system comprises a processing module, a training module and an identification module;
the processing module is used for: acquiring and respectively fusing voice characteristic data, text prosody characteristic data and acoustic prosody characteristic data in each original song voice sample containing original song voice data and original text data to obtain fusion characteristic data corresponding to each original song voice sample;
the training module is configured to: training a preset ASR model for song voice recognition based on the plurality of fusion characteristic data to obtain a target song voice recognition model;
the recognition module is configured to: inputting the song voice data to be recognized into the target song voice recognition model for recognition to obtain a target translation text corresponding to the song voice data to be recognized.
The song voice recognition system has the following beneficial effects:
according to the system, the plurality of prosodic features and the plurality of voice features in the song voice sample are fused, and the fused features are input into the voice recognition model for training, so that the accuracy of the voice recognition model for song voice recognition is improved, and the high-precision translation text of the song voice can be obtained.
On the basis of the scheme, the song voice recognition system can be further improved as follows.
Further, the processing module is specifically configured to:
the method comprises the following steps of obtaining voice characteristic data, text prosody characteristic data and acoustic prosody characteristic data in any original song voice sample containing original song voice data and original text data, wherein the steps comprise:
preprocessing the original text data of any original song voice sample to obtain first text data of any original song voice sample, and extracting text prosody feature data of any original song voice sample from the first text data of any original song voice sample;
decoupling the original song voice data of any original song voice sample through a Mel filter to obtain the sound feature data of that sample, and inputting the sound feature data into a preset GMM model to obtain a phoneme corresponding to each frame of sound feature data;
and acquiring the acoustic prosody feature data of any original song voice sample from each frame of sound feature data of that sample and the phoneme corresponding to each frame of sound feature data.
Further, the processing module is specifically further configured to:
and performing feature fusion on the voice feature data, the text prosody feature data and the acoustic prosody feature data of any original song voice sample through an attention mechanism to obtain fusion feature data of any original song voice sample.
The technical scheme of the storage medium of the invention is as follows:
the storage medium has stored therein instructions which, when read by a computer, cause the computer to carry out the steps of a song speech recognition method according to the invention.
The technical scheme of the electronic equipment is as follows:
comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, causes the computer to carry out the steps of the song speech recognition method according to the invention.
Drawings
Fig. 1 is a flowchart illustrating a song speech recognition method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a song speech recognition system according to an embodiment of the present invention.
Detailed Description
As shown in fig. 1, a song speech recognition method according to an embodiment of the present invention includes the following steps:
s1, acquiring and respectively fusing voice feature data, text prosody feature data and acoustic prosody feature data in each original song voice sample containing original song voice data and original text data to obtain fusion feature data corresponding to each original song voice sample.
Wherein, (1) each original song voice sample comprises: original song voice data and original text data. (2) The original song voice data is a sound signal from which FBANK sound features can be extracted. (3) The original text data is the labeled text corresponding to the original song voice data, i.e., the text transcribed according to the content heard in the original song voice data (audio). (4) The sound feature data is the acoustic feature data (FBANK features) extracted from the original song voice data (sound signal). (5) The text prosody feature data comprises the prosodic features in the labeled text, mainly: initial consonant information, final information, tone information, and the like. (6) The acoustic prosody feature data mainly concerns emotional prosody, which covers attributes such as the design of the prosodic pronunciation unit and the unit's pronunciation duration, pronunciation speed and pronunciation tone. (7) The fused feature data is the feature data obtained by fusing the multiple features of an original song voice sample; it is used to train the speech recognition model so as to improve the precision of the translation text the model produces when recognizing song voice.
S2, training a preset ASR model for song voice recognition based on the plurality of fusion feature data to obtain a target song voice recognition model.
Wherein, (1) the preset ASR model is an automatic speech recognition model, i.e., a model that converts human speech into editable text. (2) The target song voice recognition model is the trained ASR model, which can accurately recognize the song voice data to be recognized and obtain a high-precision translated text.
S3, inputting the song voice data to be recognized into the target song voice recognition model for recognition to obtain a target translation text corresponding to the song voice data to be recognized.
Wherein, the song voice data to be recognized is any song voice data, provided as FBANK-type voice data. The target translation text is the translated text, output by the target song voice recognition model, that corresponds to the song voice data to be recognized.
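As a concrete illustration of step S3, the following is a minimal inference sketch in Python. The `target_model` and `vocab` objects are hypothetical stand-ins (the patent fixes neither an API nor a decoding scheme), and simple greedy decoding is used only for illustration:

```python
# Minimal sketch of step S3: FBANK extraction plus greedy decoding.
# `target_model` and `vocab` are assumed, hypothetical objects.
import torch
import torchaudio

def recognize(wav_path, target_model, vocab):
    waveform, sample_rate = torchaudio.load(wav_path)
    # FBANK features, matching the patent's FBANK-type input
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, num_mel_bins=80, sample_frequency=sample_rate)
    with torch.no_grad():
        # (1, T, vocab) -> per-frame best token IDs
        token_ids = target_model(fbank.unsqueeze(0)).argmax(dim=-1).squeeze(0)
    # Greedy readout (a real decoder would also collapse repeats/blanks)
    return "".join(vocab[i] for i in token_ids.tolist())
```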
Preferably, the step of obtaining the voice feature data, text prosody feature data and acoustic prosody feature data in any original song voice sample containing the original song voice data and the original text data includes:
preprocessing the original text data of any original song voice sample to obtain first text data of any original song voice sample, and extracting text prosody feature data of any original song voice sample from the first text data of any original song voice sample.
Wherein, (1) the first text data is the text data obtained after text preprocessing of the original text data. (2) The text preprocessing deletes the punctuation in the original text data and performs format conversion through text mapping, yielding plain text data; this plain text data is the first text data in this embodiment. (3) The text prosody feature data of each original song voice sample is obtained according to a text prosody information rule base and the first text data. The rule base specifies that the initial consonant information, final information and tone information of each character in the text data are to be extracted; the information is extracted through corresponding script processing, and the extraction process is not repeated here.
Note that the rule settings of the text prosody information rule base are shown in Table 1 below. For example, when the first text data is "你好" ("hello"), the text prosody feature data (comprising initial consonant information, final information and tone information) obtained from the rule base is: "n, i, 2, h, ao, 3", where "n" is the initial consonant information of "你", "i" is its final information, and "2" is its tone information. Only one rule of the text prosody information rule base is listed above; information may be added to or deleted from the rules.
Table 1:
| Text prosodic information type | Initial consonant information | Final information | Tone information |
| Example: 你 | n | i | 2 |
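For illustration, the rule base's extraction can be approximated with the open-source pypinyin package. This is an assumption (the patent only says "corresponding script processing"), and pypinyin's tone digits follow its own convention and may differ from the Table 1 example:

```python
# Sketch of Table 1 extraction using pypinyin as a stand-in for the
# patent's own rule-base scripts (an assumption, not the patent's method).
from pypinyin import pinyin, Style

def text_prosody_features(first_text):
    initials = [p[0] for p in pinyin(first_text, style=Style.INITIALS, strict=False)]
    finals = [p[0] for p in pinyin(first_text, style=Style.FINALS, strict=False)]
    # Style.TONE3 appends the tone digit (e.g. "ni3"); keep the digit only,
    # with "0" as a fallback for neutral-tone syllables
    tones = [p[0][-1] if p[0][-1].isdigit() else "0"
             for p in pinyin(first_text, style=Style.TONE3)]
    return list(zip(initials, finals, tones))

print(text_prosody_features("你好"))  # [('n', 'i', '3'), ('h', 'ao', '3')]
```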
Decoupling the original song voice data of any original song voice sample through a Mel filter to obtain the sound feature data of that sample, and inputting the sound feature data into a preset GMM model to obtain the phoneme corresponding to each frame of sound feature data.
Wherein, (1) the Mel filter is used to decouple the sound features from the original song voice data (sound signal), thereby obtaining the sound features contained in the sound signal. (2) The GMM model is a Gaussian mixture model, used to obtain the phoneme corresponding to each frame of sound features; this embodiment uses a trained Gaussian mixture model.
It should be noted that the process of training the GMM model and the process of extracting the phoneme corresponding to each frame of sound feature through the GMM model are the prior art, and are not described herein again.
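As a rough illustration of this front end, the sketch below computes log-Mel (FBANK) features with librosa and uses scikit-learn's GaussianMixture as a stand-in for the patent's pre-trained GMM. The per-frame component index merely stands in for a phoneme label; a real system would map GMM states to phonemes via supervised alignment, and the file name is hypothetical:

```python
# Sketch of the Mel-filter front end plus frame-level GMM labeling.
# The GMM is fitted on the fly as a stand-in for a pre-trained model;
# its component IDs are pseudo-phoneme labels only.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

y, sr = librosa.load("song_sample.wav", sr=16000)   # hypothetical file
# 80-band Mel spectrogram: 25 ms windows (400 samples), 10 ms hop (160 samples)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=80)
fbank = np.log(mel + 1e-6).T                 # (num_frames, 80) FBANK features

gmm = GaussianMixture(n_components=64, covariance_type="diag").fit(fbank)
frame_phonemes = gmm.predict(fbank)          # one pseudo-phoneme ID per frame
```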
And acquiring the acoustic prosody feature data of any original song voice sample from each frame of sound feature data of that sample and the phoneme corresponding to each frame of sound feature data.
Specifically, based on an acoustic prosody rule base, the acoustic prosody feature data of each original song voice sample is obtained from each frame of sound feature data of that sample and the phoneme corresponding to each frame of sound feature data.
Note that the rule settings of the acoustic prosody rule base in this embodiment are shown in Table 2 below. For example, suppose the GMM model outputs the following alignment for the per-frame sound feature data: the first, second and third frames are aligned to phoneme "n", and the fourth and fifth frames are aligned to phoneme "i3". The acoustic prosody feature data obtained according to the acoustic prosody rule base is then: pronunciation duration of the previous prosodic unit: 3 frames; pronunciation duration of the current prosodic unit: 2 frames; pronunciation duration of the next prosodic unit: 4 frames.
Table 2:
[Table 2 appears in the original only as an image (Figure BDA0003933908700000071). Per the surrounding text, it maps each acoustic prosodic information type, at a chosen pronunciation-unit granularity, to attribute values A, B, C and D such as pronunciation durations in frames.]
In Table 2, granularity refers to the labeling method of the pronunciation unit; one frame of data can be labeled, from coarse to fine, as a word, a phoneme, a triphone, etc. A, B, C and D in Table 2 can be the frame counts of the actual pronunciation durations (or pitch or speed values); alternatively, the frame-count range is divided into intervals A to D and the interval matching the actual number of pronunciation frames is selected.
In addition, by inputting the phoneme alignment result and the voice features, different prosodic information can be extracted from the voice features according to different rules. The description above uses only the pronunciation duration in the acoustic prosody rule base as an example; other acoustic prosody features such as pronunciation pitch and pronunciation speed can be extracted in the same way and are not detailed here.
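Following the worked example above, the duration rule reduces to run-length encoding of the frame-level alignment. A minimal sketch (the unit boundaries and frame counts mirror the example, nothing more):

```python
# Sketch of the Table 2 duration rule: collapse the frame-level phoneme
# alignment into prosodic units and read off each unit's duration in frames.
from itertools import groupby

frame_phonemes = ["n", "n", "n", "i3", "i3"]   # GMM alignment from the example
units = [(ph, sum(1 for _ in grp)) for ph, grp in groupby(frame_phonemes)]
print(units)  # [('n', 3), ('i3', 2)] -> previous unit: 3 frames, current: 2
```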
Preferably, the step of fusing the voice feature data, the text prosody feature data and the acoustic prosody feature data of any original song voice sample includes:
and performing feature fusion on the voice feature data, the text prosody feature data and the acoustic prosody feature data of any original song voice sample through an attention mechanism to obtain fusion feature data of any original song voice sample.
It should be noted that the process of performing feature fusion on multiple features through an attention mechanism is the prior art, and is not described herein in detail.
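For concreteness, one standard way to realize such attention-based fusion is cross-attention in which the speech (FBANK) stream queries the two prosody streams. The sketch below uses PyTorch's MultiheadAttention; all dimensions and the final concatenation are assumptions, since the patent does not specify them:

```python
# Sketch of attention-based fusion of sound, text-prosody and
# acoustic-prosody features. Dimensions and the concatenation are assumed.
import torch
import torch.nn as nn

class ProsodyFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, speech_feats, text_prosody, acoustic_prosody):
        # Query: speech features; keys/values: the two prosody streams
        prosody = torch.cat([text_prosody, acoustic_prosody], dim=1)
        fused, _ = self.attn(speech_feats, prosody, prosody)
        # Fusion feature data: speech features plus attended prosody
        return torch.cat([speech_feats, fused], dim=-1)

fusion = ProsodyFusion()
out = fusion(torch.randn(2, 100, 256),   # 100 speech frames
             torch.randn(2, 20, 256),    # 20 text-prosody tokens
             torch.randn(2, 20, 256))    # 20 acoustic-prosody units
print(out.shape)  # torch.Size([2, 100, 512])
```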
Preferably, step S2 comprises:
and S21, inputting each fusion characteristic data into the preset ASR model respectively for training to obtain a loss value of each fusion characteristic data.
Specifically, each fusion feature data is input into a preset ASR model to obtain a predicted value corresponding to the fusion feature data, the predicted value corresponding to each fusion feature data is compared with a true value, and a loss value of each fusion feature data is calculated.
S22, optimizing the parameters of the preset ASR model according to all the loss values to obtain the optimized ASR model.
The process of optimizing the model parameters based on the loss value (loss function) is the prior art, and is not described herein in detail.
S23, taking the optimized ASR model as the preset ASR model, returning to execute the step of inputting each fusion feature data into the preset ASR model for training respectively, and determining the optimized ASR model as the target song speech recognition model when the optimized ASR model meets preset conditions.
Wherein, the preset conditions are, for example: the model reaches the maximum number of training iterations, or the loss function converges; no limitation is imposed here.
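A minimal training-loop sketch covering S21 to S23 follows. The CTC criterion, the Adam optimizer, and the `model`/`dataset` objects are all assumptions, since the patent specifies neither the loss nor the optimization method:

```python
# Sketch of the S21-S23 loop: per-sample loss (S21), parameter update (S22),
# and a stopping test on iteration count or loss convergence (S23).
# `model` and `dataset` are assumed objects; CTC is an assumed criterion.
import torch

def train(model, dataset, max_epochs=50, tol=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = torch.nn.CTCLoss(blank=0, zero_infinity=True)
    prev_loss = float("inf")
    for epoch in range(max_epochs):            # preset condition 1: max iterations
        total = 0.0
        for fused_feats, targets, in_lens, tgt_lens in dataset:
            log_probs = model(fused_feats).log_softmax(-1).transpose(0, 1)  # (T, N, C)
            loss = criterion(log_probs, targets, in_lens, tgt_lens)         # S21
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                                                # S22
            total += loss.item()
        if abs(prev_loss - total) < tol:       # preset condition 2: convergence
            break
        prev_loss = total
    return model                               # S23: the target recognition model
```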
According to the technical scheme of the embodiment, the plurality of prosodic features and the plurality of voice features in the song voice sample are fused, and the fused features are input into the voice recognition model for training, so that the accuracy of the voice recognition model for song voice recognition is improved, and the high-precision translated text of the song voice can be obtained.
As shown in fig. 2, a song speech recognition system 200 according to an embodiment of the present invention includes: a processing module 210, a training module 220, and a recognition module 230;
the processing module 210 is configured to: acquiring and respectively fusing voice characteristic data, text prosody characteristic data and acoustic prosody characteristic data in each original song voice sample containing original song voice data and original text data to obtain fusion characteristic data corresponding to each original song voice sample;
the training module 220 is configured to: training a preset ASR model for song voice recognition based on the plurality of fusion characteristic data to obtain a target song voice recognition model;
the recognition module 230 is configured to: inputting the song voice data to be recognized into the target song voice recognition model for recognition to obtain a target translation text corresponding to the song voice data to be recognized.
Preferably, the processing module 210 is specifically configured to:
the method comprises the steps of obtaining voice characteristic data, text prosody characteristic data and acoustic prosody characteristic data in any original song voice sample containing original song voice data and original text data, and comprises the following steps:
preprocessing the original text data of any original song voice sample to obtain first text data of any original song voice sample, and extracting text prosody feature data of any original song voice sample from the first text data of any original song voice sample;
decoupling the original song voice data of any original song voice sample through a Mel filter to obtain the sound feature data of that sample, and inputting the sound feature data into a preset GMM model to obtain a phoneme corresponding to each frame of sound feature data;
and acquiring the acoustic prosody feature data of any original song voice sample from each frame of sound feature data of that sample and the phoneme corresponding to each frame of sound feature data.
Preferably, the processing module 210 is further configured to:
and performing feature fusion on the voice feature data, the text prosody feature data and the acoustic prosody feature data of any original song voice sample through an attention mechanism to obtain fusion feature data of any original song voice sample.
According to the technical scheme of the embodiment, the plurality of prosodic features and the plurality of voice features in the song voice sample are fused, and the fused features are input into the voice recognition model for training, so that the accuracy of the voice recognition model for song voice recognition is improved, and the high-precision translated text of the song voice can be obtained.
For the steps by which the parameters and modules of the song speech recognition system 200 of this embodiment realize their functions, reference may be made to the parameters and steps in the above embodiments of the song speech recognition method, which are not repeated here.
An embodiment of the present invention provides a storage medium, including: the storage medium stores instructions, and when the instructions are read by the computer, the computer is caused to execute the steps of the song speech recognition method, which may specifically refer to the parameters and steps in the above embodiment of the song speech recognition method, and details are not described here.
The computer storage medium may be, for example, a flash drive, a portable hard disk, or the like.
An electronic device provided in an embodiment of the present invention includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and is characterized in that when the processor executes the computer program, the computer executes steps of a song speech recognition method, for which specific reference may be made to parameters and steps in an embodiment of the song speech recognition method, which are not described herein again.
As will be appreciated by one skilled in the art, the present invention may be embodied as methods, systems, storage media, and electronic devices.
Thus, the present invention may be embodied entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.), or in a combination of hardware and software, which may be referred to herein generally as a "circuit," "module," or "system." Furthermore, in some embodiments, the invention may also be embodied in the form of a computer program product in one or more computer-readable media having computer-readable program code embodied therein.

Any combination of one or more computer-readable media may be employed. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A song speech recognition method, comprising:
acquiring and respectively fusing voice characteristic data, text prosody characteristic data and acoustic prosody characteristic data in each original song voice sample containing original song voice data and original text data to obtain fusion characteristic data corresponding to each original song voice sample;
training a preset ASR model for song voice recognition based on the plurality of fusion characteristic data to obtain a target song voice recognition model;
and inputting the song voice data to be recognized into the target song voice recognition model for recognition to obtain a target translation text corresponding to the song voice data to be recognized.
2. The song speech recognition method of claim 1, wherein the step of obtaining the voice feature data, text prosody feature data, and acoustic prosody feature data in any original song speech sample containing original song speech data and original text data comprises:
preprocessing the original text data of any original song voice sample to obtain first text data of any original song voice sample, and extracting text prosody feature data of any original song voice sample from the first text data of any original song voice sample;
decoupling the original song voice data of any original song voice sample through a Mel filter to obtain the sound feature data of that sample, and inputting the sound feature data into a preset GMM model to obtain a phoneme corresponding to each frame of sound feature data;
and acquiring the acoustic prosody feature data of any original song voice sample from each frame of sound feature data of that sample and the phoneme corresponding to each frame of sound feature data.
3. The song speech recognition method of claim 2, wherein the step of fusing the voice feature data, the text prosody feature data, and the acoustic prosody feature data of any one of the original song speech samples comprises:
and performing feature fusion on the voice feature data, the text prosody feature data and the acoustic prosody feature data of any original song voice sample through an attention mechanism to obtain fusion feature data of any original song voice sample.
4. The song speech recognition method of claim 2 or 3, wherein the text prosody feature data of any original song speech sample comprises: initial information, final information and tone information of any original song voice sample; the acoustic prosody feature data of any original song voice sample comprises: the pronunciation duration, pronunciation speed and pronunciation tone of any original song voice sample.
5. The song speech recognition method according to claim 1, wherein the step of training a preset ASR model for song speech recognition based on the plurality of fusion feature data to obtain a target song speech recognition model comprises:
inputting each fusion characteristic data into the preset ASR model respectively for training to obtain a loss value of each fusion characteristic data;
optimizing the parameters of the preset ASR model according to all the loss values to obtain an optimized ASR model;
and taking the optimized ASR model as the preset ASR model, returning to execute the step of inputting each fusion characteristic data into the preset ASR model for training, and determining the optimized ASR model as the target song speech recognition model when the optimized ASR model meets preset conditions.
6. A song speech recognition system, comprising: a processing module, a training module and a recognition module;
the processing module is used for: acquiring and respectively fusing voice characteristic data, text prosody characteristic data and acoustic prosody characteristic data in each original song voice sample containing original song voice data and original text data to obtain fusion characteristic data corresponding to each original song voice sample;
the training module is configured to: training a preset ASR model for song voice recognition based on the fusion feature data to obtain a target song voice recognition model;
the recognition module is configured to: inputting the song voice data to be recognized into the target song voice recognition model for recognition to obtain a target translation text corresponding to the song voice data to be recognized.
7. The song speech recognition system of claim 6, wherein the processing module is specifically configured to:
the method comprises the following steps of obtaining voice characteristic data, text prosody characteristic data and acoustic prosody characteristic data in any original song voice sample containing original song voice data and original text data, wherein the steps comprise:
preprocessing the original text data of any original song voice sample to obtain first text data of any original song voice sample, and extracting text prosody feature data of any original song voice sample from the first text data of any original song voice sample;
decoupling the original song voice data of any original song voice sample through a Mel filter to obtain the sound feature data of that sample, and inputting the sound feature data into a preset GMM model to obtain a phoneme corresponding to each frame of sound feature data;
and acquiring the acoustic prosody feature data of any original song voice sample from each frame of sound feature data of that sample and the phoneme corresponding to each frame of sound feature data.
8. The song speech recognition system of claim 7, wherein the processing module is further specifically configured to:
and performing feature fusion on the voice feature data, the text prosody feature data and the acoustic prosody feature data of any original song voice sample through an attention mechanism to obtain fusion feature data of any original song voice sample.
9. A storage medium having stored therein instructions which, when read by a computer, cause the computer to execute the song speech recognition method according to any one of claims 1 to 5.
10. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, causes the computer to perform the song speech recognition method of any one of claims 1 to 5.
CN202211397956.0A 2022-11-09 2022-11-09 Song voice recognition method, system, storage medium and electronic equipment Active CN115862603B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202211397956.0A (CN115862603B) | 2022-11-09 | 2022-11-09 | Song voice recognition method, system, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202211397956.0A (CN115862603B) | 2022-11-09 | 2022-11-09 | Song voice recognition method, system, storage medium and electronic equipment

Publications (2)

Publication Number | Publication Date
CN115862603A | 2023-03-28
CN115862603B | 2023-06-20

Family

ID=85662859

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202211397956.0A (Active, CN115862603B) | Song voice recognition method, system, storage medium and electronic equipment | 2022-11-09 | 2022-11-09

Country Status (1)

Country Link
CN (1) CN115862603B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11338868A (en) * 1998-05-25 1999-12-10 Nippon Telegr & Teleph Corp <Ntt> Method and device for retrieving rhythm pattern by text, and storage medium stored with program for retrieving rhythm pattern by text
EP1785891A1 (en) * 2005-11-09 2007-05-16 Sony Deutschland GmbH Music information retrieval using a 3D search algorithm
CN106228977A (en) * 2016-08-02 2016-12-14 合肥工业大学 The song emotion identification method of multi-modal fusion based on degree of depth study
CN112750421A (en) * 2020-12-23 2021-05-04 出门问问(苏州)信息科技有限公司 Singing voice synthesis method and device and readable storage medium
CN115083397A (en) * 2022-05-31 2022-09-20 腾讯音乐娱乐科技(深圳)有限公司 Training method of lyric acoustic model, lyric recognition method, equipment and product
CN115169472A (en) * 2022-07-19 2022-10-11 腾讯科技(深圳)有限公司 Music matching method and device for multimedia data and computer equipment
CN115240656A (en) * 2022-07-22 2022-10-25 腾讯音乐娱乐科技(深圳)有限公司 Training of audio recognition model, audio recognition method and device and computer equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Souha Ayadi et al.: "Multiple Neural Network architectures for visual emotion recognition using Song-Speech modality", 2022 IEEE Information Technologies & Smart Industrial Systems (ITSIS)
Chen Yingcheng et al.: "Cover song recognition model based on the fusion of audio content and lyric text similarity", Journal of East China University of Science and Technology (Natural Science Edition)

Also Published As

Publication number Publication date
CN115862603B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
CN109065031B (en) Voice labeling method, device and equipment
US8478591B2 (en) Phonetic variation model building apparatus and method and phonetic recognition system and method thereof
US8731926B2 (en) Spoken term detection apparatus, method, program, and storage medium
US7472061B1 (en) Systems and methods for building a native language phoneme lexicon having native pronunciations of non-native words derived from non-native pronunciations
EP3734595A1 (en) Methods and systems for providing speech recognition systems based on speech recordings logs
CN111369974B (en) Dialect pronunciation marking method, language identification method and related device
JP5897718B2 (en) Voice search device, computer-readable storage medium, and voice search method
Marasek et al. System for automatic transcription of sessions of the Polish senate
CN112133285B (en) Speech recognition method, device, storage medium and electronic equipment
CN111640423B (en) Word boundary estimation method and device and electronic equipment
Sasmal et al. Isolated words recognition of Adi, a low-resource indigenous language of Arunachal Pradesh
KR20130126570A (en) Apparatus for discriminative training acoustic model considering error of phonemes in keyword and computer recordable medium storing the method thereof
CN111933116A (en) Speech recognition model training method, system, mobile terminal and storage medium
CN111785256A (en) Acoustic model training method and device, electronic equipment and storage medium
Sasmal et al. Robust automatic continuous speech recognition for'Adi', a zero-resource indigenous language of Arunachal Pradesh
CN115862603B (en) Song voice recognition method, system, storage medium and electronic equipment
CN114783424A (en) Text corpus screening method, device, equipment and storage medium
CN114203160A (en) Method, device and equipment for generating sample data set
CN114203180A (en) Conference summary generation method and device, electronic equipment and storage medium
JP2001312293A (en) Method and device for voice recognition, and computer- readable storage medium
CN112686041A (en) Pinyin marking method and device
JP4705535B2 (en) Acoustic model creation device, speech recognition device, and acoustic model creation program
Ahmed et al. Non-native accent pronunciation modeling in automatic speech recognition
Seman et al. Hybrid methods of Brandt’s generalised likelihood ratio and short-term energy for Malay word speech segmentation
CN111696530B (en) Target acoustic model obtaining method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant