CN115862603B - Song voice recognition method, system, storage medium and electronic equipment - Google Patents


Info

Publication number
CN115862603B
CN115862603B (application CN202211397956.0A)
Authority
CN
China
Prior art keywords: data, song, characteristic data, original, voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211397956.0A
Other languages
Chinese (zh)
Other versions
CN115862603A (en)
Inventor
周晓桐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shumei Tianxia Beijing Technology Co ltd
Beijing Nextdata Times Technology Co ltd
Original Assignee
Shumei Tianxia Beijing Technology Co ltd
Beijing Nextdata Times Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shumei Tianxia Beijing Technology Co ltd, Beijing Nextdata Times Technology Co ltd filed Critical Shumei Tianxia Beijing Technology Co ltd
Priority to CN202211397956.0A
Publication of CN115862603A
Application granted
Publication of CN115862603B


Abstract

The invention relates to a song voice recognition method, a song voice recognition system, a storage medium and an electronic device. The song voice recognition method comprises the following steps: acquiring and fusing the sound characteristic data, text prosody characteristic data and acoustic prosody characteristic data in each original song voice sample to obtain fused characteristic data corresponding to each original song voice sample; training a preset ASR model for song speech recognition on the fused characteristic data to obtain a target song voice recognition model; and inputting song voice data to be recognized into the target song voice recognition model for recognition to obtain the target translation text corresponding to the song voice data to be recognized. By fusing multiple prosodic features with the sound features of each song voice sample and training the voice recognition model on the fused features, the invention improves the accuracy of the voice recognition model on song voice and yields high-precision translation text for song voice.

Description

Song voice recognition method, system, storage medium and electronic equipment
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a song speech recognition method, system, storage medium, and electronic device.
Background
With the development of the internet and AI technology, automatic speech recognition is widely used in many subdivided fields; live-streaming scenes in particular involve a great number of song recognition requirements. Because traditional acoustic feature extraction retains little prosodic information, speech recognition models translate songs poorly and song voice recognition accuracy is low.
Therefore, a technical solution is needed to solve the above technical problems.
Disclosure of Invention
In order to solve the technical problems, the invention provides a song voice recognition method, a song voice recognition system, a storage medium and electronic equipment.
The technical scheme of the song voice recognition method is as follows:
acquiring and respectively fusing sound characteristic data, text prosody characteristic data and acoustic prosody characteristic data in each original song voice sample containing original song voice data and original text data to obtain fused characteristic data corresponding to each original song voice sample;
training a preset ASR model for song speech recognition based on the fusion characteristic data to obtain a target song speech recognition model;
and inputting the song voice data to be recognized into the target song voice recognition model for recognition to obtain target translation text corresponding to the song voice data to be recognized.
The song voice recognition method has the following beneficial effects:
according to the method, the multiple prosodic features and the sound features in the song voice sample are fused, and the fused features are input into the voice recognition model for training, so that the accuracy of the voice recognition model for song voice recognition is improved, and the translation text of high-precision song voice can be obtained.
On the basis of the scheme, the song voice recognition method can be improved as follows.
Further, the step of acquiring sound characteristic data, text prosody characteristic data, and acoustic prosody characteristic data in any of the original song speech samples containing the original song speech data and the original text data, includes:
preprocessing the original text data of any original song voice sample to obtain first text data of any original song voice sample, and extracting text prosody characteristic data of any original song voice sample from the first text data of any original song voice sample;
decoupling original song voice data of any original song voice sample through a Mel filter to obtain and input voice characteristic data of the any original song voice sample into a preset GMM model to obtain phonemes corresponding to each frame of voice characteristic data in the voice characteristic data of the any original song voice sample;
and acquiring acoustic prosody characteristic data of any original song voice sample from each frame of voice characteristic data of any original song voice sample and phonemes corresponding to each frame of voice characteristic data.
Further, the step of fusing the voice characteristic data, the text prosody characteristic data and the acoustic prosody characteristic data of the voice sample of any one of the original songs includes:
and carrying out feature fusion on the sound feature data, the text prosody feature data and the acoustic prosody feature data of the voice sample of any original song through an attention mechanism to obtain fusion feature data of the voice sample of any original song.
Further, the text prosody characteristic data of any one of the original song voice samples includes: the initial consonant information, the final sound information and the tone information of any original song voice sample; the acoustic prosody characteristic data of any original song voice sample comprises: the pronunciation duration, pronunciation speed and pronunciation tone of any original song voice sample.
Further, the step of training a preset ASR model for song speech recognition based on the plurality of fusion feature data to obtain a target song speech recognition model includes:
respectively inputting each fusion characteristic data into the preset ASR model for training to obtain a loss value of each fusion characteristic data;
optimizing parameters of the preset ASR model according to all the loss values to obtain an optimized ASR model;
and taking the optimized ASR model as the preset ASR model, and returning to execute the step of respectively inputting each fusion characteristic data into the preset ASR model for training until the optimized ASR model meets preset conditions, and determining the optimized ASR model as the target song speech recognition model.
The technical scheme of the song voice recognition system is as follows:
comprising the following steps: the system comprises a processing module, a training module and an identification module;
the processing module is used for: acquiring and respectively fusing sound characteristic data, text prosody characteristic data and acoustic prosody characteristic data in each original song voice sample containing original song voice data and original text data to obtain fused characteristic data corresponding to each original song voice sample;
the training module is used for: training a preset ASR model for song speech recognition based on the fusion characteristic data to obtain a target song speech recognition model;
the identification module is used for: and inputting the song voice data to be recognized into the target song voice recognition model for recognition to obtain target translation text corresponding to the song voice data to be recognized.
The song voice recognition system has the following beneficial effects:
according to the system, the multiple prosodic features and the sound features in the song voice sample are fused, and the fused features are input into the voice recognition model for training, so that the accuracy of the voice recognition model for song voice recognition is improved, and the translation text of high-precision song voice can be obtained.
Based on the scheme, the song voice recognition system can be improved as follows.
Further, the processing module is specifically configured to:
the step of acquiring sound characteristic data, text prosodic characteristic data and acoustic prosodic characteristic data in any of the original song speech samples comprising the original song speech data and the original text data comprises:
preprocessing the original text data of any original song voice sample to obtain first text data of any original song voice sample, and extracting text prosody characteristic data of any original song voice sample from the first text data of any original song voice sample;
decoupling original song voice data of any original song voice sample through a Mel filter to obtain and input voice characteristic data of the any original song voice sample into a preset GMM model to obtain phonemes corresponding to each frame of voice characteristic data in the voice characteristic data of the any original song voice sample;
and acquiring acoustic prosody characteristic data of any original song voice sample from each frame of voice characteristic data of any original song voice sample and phonemes corresponding to each frame of voice characteristic data.
Further, the processing module is specifically further configured to:
and carrying out feature fusion on the sound feature data, the text prosody feature data and the acoustic prosody feature data of the voice sample of any original song through an attention mechanism to obtain fusion feature data of the voice sample of any original song.
The technical scheme of the storage medium is as follows:
the storage medium has instructions stored therein which, when read by a computer, cause the computer to perform the steps of a song speech recognition method according to the invention.
The technical scheme of the electronic equipment is as follows:
comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, causes the computer to perform the steps of a song speech recognition method according to the invention.
Drawings
FIG. 1 is a schematic flow chart of a song speech recognition method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a song speech recognition system according to an embodiment of the present invention.
Detailed Description
As shown in fig. 1, a song voice recognition method according to an embodiment of the present invention includes the following steps:
s1, acquiring and respectively fusing sound characteristic data, text prosody characteristic data and acoustic prosody characteristic data in each original song voice sample containing original song voice data and original text data to obtain fused characteristic data corresponding to each original song voice sample.
Wherein:
(1) Each original song voice sample comprises original song voice data and original text data.
(2) The original song voice data is a sound signal containing FBANK sound features.
(3) The original text data is the label text corresponding to the original song voice data, i.e. the text obtained by labelling the content heard in the original song voice data (audio).
(4) The sound characteristic data is the sound feature data (FBANK features) extracted from the original song voice data (sound signal).
(5) The text prosody characteristic data marks prosodic features in the text, mainly comprising initial consonant information, final (vowel) information, tone information, and the like.
(6) The acoustic prosody characteristic data primarily relates to emotional prosody, mainly comprising the design of the prosodic sounding unit and attributes such as its pronunciation duration, pronunciation speed and pronunciation tone.
(7) The fusion characteristic data is the feature data obtained by fusing multiple features of an original song voice sample; it is used to train the voice recognition model so as to improve the accuracy of the translation text obtained when the model recognizes song voice.
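As a purely illustrative sketch of the data decomposition in (1)–(7) above — the class and field names are hypothetical, not from the patent — the sample and its three feature streams can be modelled as:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SongSample:
    """One original song voice sample: the sound signal plus its label text."""
    audio: np.ndarray   # original song voice data (sound signal)
    text: str           # original text data (labelled transcript)

@dataclass
class SampleFeatures:
    """The three feature streams extracted from one sample before fusion."""
    fbank: np.ndarray        # sound feature data (FBANK), shape (T, n_mels)
    text_prosody: list       # initials / finals / tones from the text
    acoustic_prosody: list   # per-unit duration, speed, pitch attributes

sample = SongSample(audio=np.zeros(16000), text="你好")
feats = SampleFeatures(fbank=np.zeros((98, 40)),
                       text_prosody=["n", "i", "2", "h", "ao", "3"],
                       acoustic_prosody=[("n", 3), ("i3", 2)])
```

The fusion step (S1) combines the three `SampleFeatures` streams into one training input per sample.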
S2, training a preset ASR model for song speech recognition based on the fusion characteristic data to obtain a target song speech recognition model.
Wherein, (1) the preset ASR model is: an automatic speech recognition model is a model that converts human speech into editable text. (2) The target song speech recognition model is: the trained ASR model can be used for accurately identifying song voice data to be identified, and high-precision translation text is obtained.
And S3, inputting the song voice data to be recognized into the target song voice recognition model for recognition, and obtaining a target translation text corresponding to the song voice data to be recognized.
The song voice data to be recognized is any song voice data of the FBANK type. The target translation text is the translation text corresponding to the song voice data to be recognized, as output by the target song voice recognition model.
Preferably, the step of acquiring sound characteristic data, text prosodic characteristic data, and acoustic prosodic characteristic data in any of the original song speech samples containing the original song speech data and the original text data comprises:
preprocessing the original text data of any original song voice sample to obtain first text data of any original song voice sample, and extracting text prosody characteristic data of any original song voice sample from the first text data of any original song voice sample.
Wherein:
(1) The first text data is the text data obtained by preprocessing the original text data.
(2) The text preprocessing comprises deleting punctuation from the original text data and performing format conversion through text mapping to obtain plain text data, which is the first text data in this embodiment.
(3) The text prosody characteristic data of each original song voice sample is obtained from the first text data according to a text prosody information rule base. The rule base records the initial consonant, final and tone information of each text unit; the extraction is performed by corresponding scripts and is not repeated here.
Note that the rule settings of the text prosody information rule base are shown in Table 1 below. For example, when the first text data is "你好" ("hello"), the text prosody characteristic data (comprising initial consonant, final and tone information) obtained from the rule base is: "n, i, 2, h, ao, 3", where "n" is the initial consonant information of "你", "i" is its final information, and "2" is its tone information. The above is only one example rule of the text prosody information rule base; information may be added to or deleted from the rules without limitation.
Table 1: Text prosodic information types — initial consonant information, final (vowel) information, tone information.
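A minimal sketch of such a rule base, using only the two-character example from the document — the real rule base is built by scripts and is not disclosed, so the dictionary below is a hypothetical stand-in:

```python
# Hypothetical miniature text-prosody rule base; entries follow the Table 1
# example ("你好" -> "n, i, 2, h, ao, 3" in the document).
RULE_BASE = {
    "你": ("n", "i", "2"),   # (initial, final, tone) — tone 2 per the document
    "好": ("h", "ao", "3"),
}

def text_prosody_features(first_text_data):
    """Flatten per-character (initial, final, tone) triples into one feature list."""
    feats = []
    for ch in first_text_data:
        initial, final, tone = RULE_BASE[ch]
        feats.extend([initial, final, tone])
    return feats
```

For the first text data "你好" this reproduces the document's example sequence.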
And decoupling the original song voice data of any original song voice sample through a Mel filter to obtain and input the voice characteristic data of any original song voice sample into a preset GMM model to obtain phonemes corresponding to each frame of voice characteristic data in the voice characteristic data of any original song voice sample.
Wherein: (1) the Mel filter is used to decouple the sound features from the original song voice data (sound signal), thereby obtaining the sound features in the sound signal. (2) The GMM is a Gaussian mixture model used to obtain the phoneme corresponding to each frame of sound features; in this embodiment a trained Gaussian mixture model is used.
It should be noted that, the process of training the GMM model and the process of extracting phonemes corresponding to the sound features of each frame through the GMM model are related art, and are not repeated herein.
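As a rough sketch of the Mel-filter step, FBANK features can be computed by applying a triangular mel filterbank to the framed power spectrum. The frame length, hop size, and filter count below are common defaults, not values taken from the patent:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def fbank(signal, sr=16000, n_fft=512, frame_len=400, hop=160, n_mels=40):
    """Log mel-filterbank (FBANK) features, one row per frame."""
    # Frame the signal and apply a Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular filters equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)   # falling slope
    return np.log(power @ fb.T + 1e-10)
```

Each row of the result is one frame of sound feature data of the kind the GMM then aligns to a phoneme.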
And acquiring acoustic prosody characteristic data of any original song voice sample from each frame of voice characteristic data of any original song voice sample and phonemes corresponding to each frame of voice characteristic data.
Specifically, based on an acoustic prosody rule base, acoustic prosody feature data of each original song voice sample is obtained from each frame of voice feature data of any original song voice sample and phonemes corresponding to each frame of voice feature data.
Note that the rule settings of the acoustic prosody rule base in this embodiment are shown in Table 2 below. For example, suppose the GMM model aligns five frames of sound feature data to the following phonemes — first frame: n, second frame: n, third frame: n, fourth frame: i3, fifth frame: i3. The acoustic prosody characteristic data obtained according to the acoustic prosody rule base is then: pronunciation duration of the previous prosodic unit ("n"): 3 frames; pronunciation duration of the current prosodic unit ("i3"): 2 frames.
Table 2: (rendered as an image in the original; it defines the pronunciation-unit granularity and the duration intervals A–D described below)
In Table 2, granularity refers to the labelling method of the pronunciation unit: from coarse to fine, a frame of data may correspondingly be represented as a word, a phoneme, a triphone, and so on. A, B, C and D in Table 2 denote intervals of the actual pronunciation frame count (and likewise of pitch and speed of sound); the frame counts are divided into the intervals A–D, and the interval corresponding to an actual pronunciation frame count can be looked up.
In addition, given the phoneme alignment result and the sound features as input, different prosodic information can be extracted from the sound features according to different rules. The above describes the acoustic prosody characteristic data using only pronunciation duration as an example; other acoustic prosodic features, such as pronunciation tone and pronunciation speed, are extracted analogously and are not repeated here.
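The duration rule above can be sketched as a run-length collapse of the per-frame alignment. The interval thresholds in `duration_interval` are hypothetical, since the patent does not disclose the actual A–D boundaries:

```python
from itertools import groupby

def prosodic_unit_durations(frame_phonemes):
    """Collapse the GMM's per-frame phoneme alignment into
    (prosodic unit, duration in frames) pairs."""
    return [(p, sum(1 for _ in g)) for p, g in groupby(frame_phonemes)]

def duration_interval(n_frames, edges=(2, 4, 8)):
    """Map a duration to one of the Table 2 intervals A-D.
    The edges are illustrative placeholders, not the patent's values."""
    for label, edge in zip("ABC", edges):
        if n_frames <= edge:
            return label
    return "D"
```

For the five-frame example above, the alignment `["n", "n", "n", "i3", "i3"]` yields the two units ("n", 3 frames) and ("i3", 2 frames).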
Preferably, the step of fusing the voice characteristic data, the text prosody characteristic data and the acoustic prosody characteristic data of the voice sample of any original song includes:
and carrying out feature fusion on the sound feature data, the text prosody feature data and the acoustic prosody feature data of the voice sample of any original song through an attention mechanism to obtain fusion feature data of the voice sample of any original song.
It should be noted that, the process of feature fusion of multiple features through the attention mechanism is the prior art, and is not repeated here.
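A minimal numerical sketch of attention-based fusion, assuming the three streams have already been frame-aligned and projected to a common dimension; the scoring vector `w` is a stand-in for learned attention parameters, and this is not the patent's actual fusion network:

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(streams, w):
    """streams: (n, T, d) — n frame-aligned feature streams (e.g. sound,
    text prosody, acoustic prosody) in a common dimension d.
    w: (d,) scoring vector. Returns (T, d) fused features: per frame,
    an attention-weighted sum over the n streams."""
    scores = np.einsum('ntd,d->nt', streams, w)   # relevance of each stream per frame
    alpha = softmax(scores, axis=0)               # weights over the n streams
    return np.einsum('nt,ntd->td', alpha, streams)
```

Because the weights form a convex combination per frame, the fused vector always lies between the per-stream minima and maxima.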
Preferably, step S2 includes:
s21, respectively inputting each fusion characteristic data into the preset ASR model for training to obtain a loss value of each fusion characteristic data.
Specifically, each piece of fusion characteristic data is input into a preset ASR model to obtain a predicted value corresponding to the fusion characteristic data, the predicted value corresponding to each piece of fusion characteristic data is compared with a true value, and a loss value of each piece of fusion characteristic data is calculated.
S22, optimizing parameters of the preset ASR model according to all the loss values to obtain an optimized ASR model.
The process of optimizing the model parameters based on the loss value (loss function) is the prior art, and is not repeated here.
S23, taking the optimized ASR model as the preset ASR model, and returning to execute the step of respectively inputting each fusion characteristic data into the preset ASR model for training until the optimized ASR model meets preset conditions, and determining the optimized ASR model as the target song speech recognition model.
Wherein, the preset conditions are: the model reaches the maximum number of iterative training or loss function convergence, etc., without limitation.
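Steps S21–S23 can be sketched with a toy differentiable model in place of the real ASR network — the least-squares model, learning rate, and tolerance below are illustrative assumptions, but the loop structure (compute loss, update parameters, stop on convergence or an iteration cap) mirrors the described training procedure:

```python
import numpy as np

def train_asr_stub(fused_feats, targets, lr=0.1, max_iters=500, tol=1e-6):
    """Toy stand-in for S21-S23: a linear least-squares 'model' trained by
    gradient descent until the loss converges or the iteration cap is hit
    (the patent's 'preset conditions')."""
    rng = np.random.default_rng(0)
    W = rng.normal(size=(fused_feats.shape[1], targets.shape[1]))
    prev = np.inf
    for _ in range(max_iters):
        pred = fused_feats @ W                      # S21: forward pass
        loss = np.mean((pred - targets) ** 2)       # S21: loss value
        if abs(prev - loss) < tol:                  # S23: convergence condition
            break
        prev = loss
        grad = 2 * fused_feats.T @ (pred - targets) / len(fused_feats)
        W -= lr * grad                              # S22: parameter optimization
    return W, loss
```

On exactly linear data the loop drives the loss close to zero well before the iteration cap.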
According to the technical scheme, the multiple prosodic features and the sound features in the song voice sample are fused, and the fused features are input into the voice recognition model for training, so that the accuracy of the voice recognition model for song voice recognition is improved, and the translation text of high-precision song voice can be obtained.
As shown in fig. 2, a song voice recognition system 200 according to an embodiment of the present invention includes: a processing module 210, a training module 220, and an identification module 230;
the processing module 210 is configured to: acquiring and respectively fusing sound characteristic data, text prosody characteristic data and acoustic prosody characteristic data in each original song voice sample containing original song voice data and original text data to obtain fused characteristic data corresponding to each original song voice sample;
the training module 220 is configured to: training a preset ASR model for song speech recognition based on the fusion characteristic data to obtain a target song speech recognition model;
the identification module 230 is configured to: and inputting the song voice data to be recognized into the target song voice recognition model for recognition to obtain target translation text corresponding to the song voice data to be recognized.
Preferably, the processing module 210 is specifically configured to:
the step of acquiring sound characteristic data, text prosodic characteristic data and acoustic prosodic characteristic data in any of the original song speech samples comprising the original song speech data and the original text data comprises:
preprocessing the original text data of any original song voice sample to obtain first text data of any original song voice sample, and extracting text prosody characteristic data of any original song voice sample from the first text data of any original song voice sample;
decoupling original song voice data of any original song voice sample through a Mel filter to obtain and input voice characteristic data of the any original song voice sample into a preset GMM model to obtain phonemes corresponding to each frame of voice characteristic data in the voice characteristic data of the any original song voice sample;
and acquiring acoustic prosody characteristic data of any original song voice sample from each frame of voice characteristic data of any original song voice sample and phonemes corresponding to each frame of voice characteristic data.
Preferably, the processing module 210 is specifically further configured to:
and carrying out feature fusion on the sound feature data, the text prosody feature data and the acoustic prosody feature data of the voice sample of any original song through an attention mechanism to obtain fusion feature data of the voice sample of any original song.
According to the technical scheme, the multiple prosodic features and the sound features in the song voice sample are fused, and the fused features are input into the voice recognition model for training, so that the accuracy of the voice recognition model for song voice recognition is improved, and the translation text of high-precision song voice can be obtained.
The steps for implementing the corresponding functions by the parameters and the modules in the song voice recognition system 200 according to the present embodiment are referred to the parameters and the steps in the embodiments of the song voice recognition method according to the present embodiment, and are not described herein.
The storage medium provided by the embodiment of the invention comprises: the storage medium stores instructions that, when read by a computer, cause the computer to perform steps such as a song speech recognition method, and specific reference may be made to the parameters and steps in the embodiments of a song speech recognition method described above, which are not described herein.
Computer storage media include, for example, flash drives and portable hard disks.
The electronic device provided in the embodiment of the present invention includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor executes the computer program to make the computer execute steps of a song voice recognition method, and specific reference may be made to each parameter and step in the above embodiments of a song voice recognition method, which are not described herein.
Those skilled in the art will appreciate that the present invention may be implemented as a method, system, storage medium, and electronic device.
Thus, the invention may be embodied in the following forms: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software, referred to herein generally as a "circuit," "module," or "system." Furthermore, in some embodiments, the invention may also be embodied as a computer program product in one or more computer-readable media containing computer-readable program code.
Any combination of one or more computer readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention; variations, modifications, alternatives and variants may be made to the above embodiments by those of ordinary skill in the art within the scope of the invention.

Claims (7)

1. A song speech recognition method, comprising:
acquiring and respectively fusing sound characteristic data, text prosody characteristic data and acoustic prosody characteristic data in each original song voice sample containing original song voice data and original text data to obtain fused characteristic data corresponding to each original song voice sample;
training a preset ASR model for song speech recognition based on the fusion characteristic data to obtain a target song speech recognition model;
inputting song voice data to be recognized into the target song voice recognition model for recognition to obtain target translation text corresponding to the song voice data to be recognized;
the step of acquiring sound characteristic data, text prosodic characteristic data and acoustic prosodic characteristic data in any of the original song speech samples comprising the original song speech data and the original text data comprises:
preprocessing the original text data of any original song voice sample to obtain first text data of any original song voice sample, and extracting text prosody characteristic data of any original song voice sample from the first text data of any original song voice sample;
decoupling original song voice data of any original song voice sample through a Mel filter to obtain and input voice characteristic data of the any original song voice sample into a preset GMM model to obtain phonemes corresponding to each frame of voice characteristic data in the voice characteristic data of the any original song voice sample;
acquiring acoustic prosody characteristic data of any original song voice sample from each frame of voice characteristic data of the any original song voice sample and phonemes corresponding to each frame of voice characteristic data;
the text prosody characteristic data of any original song voice sample comprises: the initial consonant information, the final sound information and the tone information of any original song voice sample; the acoustic prosody characteristic data of any original song voice sample comprises: the pronunciation duration, pronunciation speed and pronunciation tone of any original song voice sample.
2. The song speech recognition method according to claim 1, wherein the step of fusing the voice characteristic data, the text prosodic characteristic data, and the acoustic prosodic characteristic data of any one of the original song speech samples comprises:
and carrying out feature fusion on the sound feature data, the text prosody feature data and the acoustic prosody feature data of the voice sample of any original song through an attention mechanism to obtain fusion feature data of the voice sample of any original song.
3. The song speech recognition method according to claim 1, wherein training the preset ASR model for song speech recognition based on the plurality of fusion characteristic data to obtain the target song speech recognition model comprises:
inputting each piece of fusion characteristic data into the preset ASR model for training, to obtain a loss value for each piece of fusion characteristic data;
optimizing the parameters of the preset ASR model according to all the loss values, to obtain an optimized ASR model; and
taking the optimized ASR model as the preset ASR model and returning to the step of inputting each piece of fusion characteristic data into the preset ASR model for training, until the optimized ASR model meets a preset condition, at which point the optimized ASR model is determined as the target song speech recognition model.
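The claim-3 loop (one loss per fused sample, a parameter update from all losses, repeat until a preset condition) can be sketched with a toy linear model and squared-error loss. The model, learning rate, and stopping threshold are illustrative assumptions; the patent does not specify them.

```python
import numpy as np

def train_until_converged(model_w, samples, targets, lr=0.1,
                          loss_threshold=1e-3, max_rounds=1000):
    """Iterate: score each fused feature vector, collect per-sample losses,
    update the parameters from all losses, and repeat until the preset
    condition (here: mean loss below a threshold) is met."""
    w = model_w.astype(float).copy()
    for _ in range(max_rounds):
        losses, grads = [], []
        for x, y in zip(samples, targets):    # one loss value per sample
            pred = w @ x
            losses.append((pred - y) ** 2)
            grads.append(2 * (pred - y) * x)
        if np.mean(losses) < loss_threshold:  # 'preset condition' met
            break
        w -= lr * np.mean(grads, axis=0)      # optimize from all losses
    return w, float(np.mean(losses))
```

A production ASR model would replace the linear scorer with a neural network and the squared error with, e.g., a CTC or cross-entropy loss, but the control flow matches the claimed iterate-until-condition structure.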
4. A song speech recognition system, comprising: a processing module, a training module and a recognition module;
the processing module is configured to: acquire and respectively fuse the sound characteristic data, text prosody characteristic data and acoustic prosody characteristic data of each original song voice sample containing original song voice data and original text data, to obtain fusion characteristic data corresponding to each original song voice sample;
the training module is configured to: train a preset ASR model for song speech recognition based on the plurality of fusion characteristic data, to obtain a target song speech recognition model;
the recognition module is configured to: input song voice data to be recognized into the target song speech recognition model for recognition, to obtain a target translation text corresponding to the song voice data to be recognized;
wherein the processing module is specifically configured so that:
the step of acquiring the sound characteristic data, the text prosody characteristic data and the acoustic prosody characteristic data of any original song voice sample containing original song voice data and original text data comprises:
preprocessing the original text data of the original song voice sample to obtain first text data of the sample, and extracting the text prosody characteristic data of the sample from the first text data;
processing the original song voice data of the original song voice sample through a Mel filter to obtain the sound characteristic data of the sample, and inputting the sound characteristic data into a preset GMM model to obtain a phoneme corresponding to each frame of the sound characteristic data;
acquiring the acoustic prosody characteristic data of the original song voice sample from each frame of the sound characteristic data and the phoneme corresponding to that frame;
wherein the text prosody characteristic data of the original song voice sample comprises: initial (shengmu) information, final (yunmu) information and tone information of the sample; and the acoustic prosody characteristic data of the original song voice sample comprises: pronunciation duration, pronunciation speed and pronunciation tone of the sample.
5. The song speech recognition system of claim 4, wherein the processing module is further configured to:
perform feature fusion on the sound characteristic data, the text prosody characteristic data and the acoustic prosody characteristic data of the original song voice sample through an attention mechanism, to obtain the fusion characteristic data of the sample.
6. A storage medium having instructions stored therein which, when read by a computer, cause the computer to perform the song speech recognition method of any one of claims 1 to 3.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, causes the computer to perform the song speech recognition method of any one of claims 1 to 3.
CN202211397956.0A 2022-11-09 2022-11-09 Song voice recognition method, system, storage medium and electronic equipment Active CN115862603B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211397956.0A CN115862603B (en) 2022-11-09 2022-11-09 Song voice recognition method, system, storage medium and electronic equipment


Publications (2)

Publication Number Publication Date
CN115862603A CN115862603A (en) 2023-03-28
CN115862603B true CN115862603B (en) 2023-06-20

Family

ID=85662859

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211397956.0A Active CN115862603B (en) 2022-11-09 2022-11-09 Song voice recognition method, system, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN115862603B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112750421A (en) * 2020-12-23 2021-05-04 出门问问(苏州)信息科技有限公司 Singing voice synthesis method and device and readable storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11338868A (en) * 1998-05-25 1999-12-10 Nippon Telegr & Teleph Corp <Ntt> Method and device for retrieving rhythm pattern by text, and storage medium stored with program for retrieving rhythm pattern by text
EP1785891A1 (en) * 2005-11-09 2007-05-16 Sony Deutschland GmbH Music information retrieval using a 3D search algorithm
CN106228977B (en) * 2016-08-02 2019-07-19 合肥工业大学 Multi-mode fusion song emotion recognition method based on deep learning
CN115083397A (en) * 2022-05-31 2022-09-20 腾讯音乐娱乐科技(深圳)有限公司 Training method of lyric acoustic model, lyric recognition method, equipment and product
CN115169472A (en) * 2022-07-19 2022-10-11 腾讯科技(深圳)有限公司 Music matching method and device for multimedia data and computer equipment
CN115240656A (en) * 2022-07-22 2022-10-25 腾讯音乐娱乐科技(深圳)有限公司 Training of audio recognition model, audio recognition method and device and computer equipment

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112750421A (en) * 2020-12-23 2021-05-04 出门问问(苏州)信息科技有限公司 Singing voice synthesis method and device and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Multiple Neural Network architectures for visual emotion recognition using Song-Speech modality; Souha Ayadi et al.; 2022 IEEE Information Technologies & Smart Industrial Systems (ITSIS); entire document *
Cover song recognition model based on the fusion of audio content and lyric text similarity; Chen Yingcheng et al.; Journal of East China University of Science and Technology (Natural Science Edition); entire document *

Also Published As

Publication number Publication date
CN115862603A (en) 2023-03-28

Similar Documents

Publication Publication Date Title
CN109065031B (en) Voice labeling method, device and equipment
CN110148427B (en) Audio processing method, device, system, storage medium, terminal and server
CN110211565B (en) Dialect identification method and device and computer readable storage medium
US11514891B2 (en) Named entity recognition method, named entity recognition equipment and medium
US20180349495A1 (en) Audio data processing method and apparatus, and computer storage medium
CN111369974B (en) Dialect pronunciation marking method, language identification method and related device
CN111862954A (en) Method and device for acquiring voice recognition model
CN112259083B (en) Audio processing method and device
WO2021169825A1 (en) Speech synthesis method and apparatus, device and storage medium
CN113593522A (en) Voice data labeling method and device
CN112133285B (en) Speech recognition method, device, storage medium and electronic equipment
Marasek et al. System for automatic transcription of sessions of the Polish senate
CN111640423B (en) Word boundary estimation method and device and electronic equipment
CN115862603B (en) Song voice recognition method, system, storage medium and electronic equipment
Sasmal et al. Isolated words recognition of Adi, a low-resource indigenous language of Arunachal Pradesh
KR20130126570A (en) Apparatus for discriminative training acoustic model considering error of phonemes in keyword and computer recordable medium storing the method thereof
CN111933116A (en) Speech recognition model training method, system, mobile terminal and storage medium
Sasmal et al. Robust automatic continuous speech recognition for'Adi', a zero-resource indigenous language of Arunachal Pradesh
Cahyaningtyas et al. Development of under-resourced Bahasa Indonesia speech corpus
Deshwal et al. A Structured Approach towards Robust Database Collection for Language Identification
CN112820281B (en) Voice recognition method, device and equipment
CN114203160A (en) Method, device and equipment for generating sample data set
CN114203180A (en) Conference summary generation method and device, electronic equipment and storage medium
JP4705535B2 (en) Acoustic model creation device, speech recognition device, and acoustic model creation program
CN114678040B (en) Voice consistency detection method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant