CN115862603A - Song voice recognition method, system, storage medium and electronic equipment - Google Patents

Song voice recognition method, system, storage medium and electronic equipment

Info

Publication number
CN115862603A
CN115862603A
Authority
CN
China
Prior art keywords
data
song
original
voice
song voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211397956.0A
Other languages
Chinese (zh)
Other versions
CN115862603B
Inventor
周晓桐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shumei Tianxia Beijing Technology Co ltd
Beijing Nextdata Times Technology Co ltd
Original Assignee
Shumei Tianxia Beijing Technology Co ltd
Beijing Nextdata Times Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shumei Tianxia Beijing Technology Co ltd and Beijing Nextdata Times Technology Co ltd
Priority to CN202211397956.0A
Publication of CN115862603A
Application granted
Publication of CN115862603B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Abstract

The invention relates to a song voice recognition method, a song voice recognition system, a storage medium and electronic equipment. The method comprises: acquiring and fusing the voice feature data, text prosody feature data and acoustic prosody feature data in each original song voice sample to obtain fused feature data corresponding to each original song voice sample; training a preset ASR model for song voice recognition based on the fused feature data to obtain a target song voice recognition model; and inputting the song voice data to be recognized into the target song voice recognition model for recognition to obtain a target translation text corresponding to the song voice data to be recognized. By fusing the multiple prosodic features and voice features in each song voice sample and training the speech recognition model on the fused features, the method improves the model's recognition accuracy on song voice, so that a high-precision translation text of the song voice can be obtained.

Description

Song voice recognition method, system, storage medium and electronic equipment
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular to a song speech recognition method, a song speech recognition system, a storage medium, and an electronic device.
Background
With the development of the internet and AI technology, automatic speech recognition is widely used across vertical domains, and demand is especially strong in live-streaming scenarios, which involve a large volume of song recognition. Because traditional acoustic feature extraction retains little prosodic information, speech recognition models translate songs poorly and their recognition accuracy on song voice is low.
Therefore, it is desirable to provide a technical solution to solve the above technical problems.
Disclosure of Invention
In order to solve the above technical problem, the invention provides a song voice recognition method, a song voice recognition system, a storage medium and an electronic device.
The technical scheme of the song voice recognition method of the invention is as follows:
acquiring and respectively fusing voice characteristic data, text prosody characteristic data and acoustic prosody characteristic data in each original song voice sample containing original song voice data and original text data to obtain fusion characteristic data corresponding to each original song voice sample;
training a preset ASR model for song voice recognition based on the plurality of fusion characteristic data to obtain a target song voice recognition model;
and inputting the song voice data to be recognized into the target song voice recognition model for recognition to obtain a target translation text corresponding to the song voice data to be recognized.
The song voice recognition method has the beneficial effects that:
according to the method, the prosodic features and the voice features in the song voice sample are fused, and the fused features are input into the voice recognition model for training, so that the accuracy of the voice recognition model for song voice recognition is improved, and the high-precision translation text of the song voice can be obtained.
On the basis of the scheme, the song voice recognition method can be further improved as follows.
Further, the step of obtaining the voice feature data, text prosody feature data and acoustic prosody feature data in any original song voice sample containing the original song voice data and the original text data includes:
preprocessing the original text data of any original song voice sample to obtain first text data of any original song voice sample, and extracting text prosody feature data of any original song voice sample from the first text data of any original song voice sample;
decoupling the original song voice data of any original song voice sample through a Mel filter to obtain the sound feature data of that sample, and inputting the sound feature data into a preset GMM model to obtain a phoneme corresponding to each frame of sound feature data;
and acquiring the acoustic prosody feature data of any original song voice sample from each frame of sound feature data of that sample and the phoneme corresponding to each frame of sound feature data.
Further, the step of fusing the voice feature data, the text prosody feature data and the acoustic prosody feature data of any original song voice sample includes:
and performing feature fusion on the voice feature data, the text prosody feature data and the acoustic prosody feature data of any original song voice sample through an attention mechanism to obtain fusion feature data of any original song voice sample.
Further, the text prosody feature data of any original song voice sample includes: the initial consonant information, final information and tone information of any original song voice sample; the acoustic prosody feature data of any original song voice sample comprises: the pronunciation duration, pronunciation speed and pronunciation tone of any original song voice sample.
Further, the step of training a preset ASR model for song speech recognition based on the plurality of fusion feature data to obtain a target song speech recognition model includes:
inputting each fusion characteristic data into the preset ASR model respectively for training to obtain a loss value of each fusion characteristic data;
optimizing the parameters of the preset ASR model according to all the loss values to obtain an optimized ASR model;
and taking the optimized ASR model as the preset ASR model, returning and executing the step of inputting each fusion feature data into the preset ASR model for training respectively, and determining the optimized ASR model as the target song voice recognition model when the optimized ASR model meets preset conditions.
The technical scheme of the song voice recognition system is as follows:
the method comprises the following steps: the system comprises a processing module, a training module and an identification module;
the processing module is used for: acquiring and respectively fusing voice characteristic data, text prosody characteristic data and acoustic prosody characteristic data in each original song voice sample containing original song voice data and original text data to obtain fusion characteristic data corresponding to each original song voice sample;
the training module is configured to: training a preset ASR model for song voice recognition based on the plurality of fusion characteristic data to obtain a target song voice recognition model;
the recognition module is configured to: inputting the song voice data to be recognized into the target song voice recognition model for recognition to obtain a target translation text corresponding to the song voice data to be recognized.
The song voice recognition system has the following beneficial effects:
according to the system, the plurality of prosodic features and the plurality of voice features in the song voice sample are fused, and the fused features are input into the voice recognition model for training, so that the accuracy of the voice recognition model for song voice recognition is improved, and the high-precision translation text of the song voice can be obtained.
On the basis of the scheme, the song voice recognition system can be further improved as follows.
Further, the processing module is specifically configured to:
the method comprises the following steps of obtaining voice characteristic data, text prosody characteristic data and acoustic prosody characteristic data in any original song voice sample containing original song voice data and original text data, wherein the steps comprise:
preprocessing the original text data of any original song voice sample to obtain first text data of any original song voice sample, and extracting text prosody feature data of any original song voice sample from the first text data of any original song voice sample;
decoupling the original song voice data of any original song voice sample through a Mel filter to obtain the sound feature data of that sample, and inputting the sound feature data into a preset GMM model to obtain a phoneme corresponding to each frame of sound feature data;
and acquiring the acoustic prosody feature data of any original song voice sample from each frame of sound feature data of that sample and the phoneme corresponding to each frame of sound feature data.
Further, the processing module is specifically further configured to:
and performing feature fusion on the voice feature data, the text prosody feature data and the acoustic prosody feature data of any original song voice sample through an attention mechanism to obtain fusion feature data of any original song voice sample.
The technical scheme of the storage medium of the invention is as follows:
the storage medium has stored therein instructions which, when read by a computer, cause the computer to carry out the steps of a song speech recognition method according to the invention.
The technical scheme of the electronic equipment is as follows:
comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, causes the computer to carry out the steps of the song speech recognition method according to the invention.
Drawings
Fig. 1 is a flowchart illustrating a song speech recognition method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a song speech recognition system according to an embodiment of the present invention.
Detailed Description
As shown in fig. 1, a song speech recognition method according to an embodiment of the present invention includes the following steps:
s1, acquiring and respectively fusing voice feature data, text prosody feature data and acoustic prosody feature data in each original song voice sample containing original song voice data and original text data to obtain fusion feature data corresponding to each original song voice sample.
Wherein, (1) each original song voice sample comprises: original song voice data and original text data. (2) The original song voice data is a sound signal from which FBANK sound features can be extracted. (3) The original text data is the labeled text corresponding to the original song voice data, i.e., the text transcribed according to the content heard in the original song voice data (audio). (4) The sound feature data is the acoustic feature data (FBANK features) extracted from the original song voice data (sound signal). (5) The text prosody feature data comprises the prosodic features in the labeled text, mainly: initial consonant information, final information, tone information, and the like. (6) The acoustic prosody feature data mainly concerns emotional prosody, which covers attributes such as the design of the prosodic pronunciation unit and the unit's pronunciation duration, pronunciation speed and pronunciation tone. (7) The fused feature data is the feature data obtained by fusing the multiple features of an original song voice sample; it is used to train the speech recognition model so as to improve the precision of the translation text the model produces when recognizing song voice.
S2, training a preset ASR model for song voice recognition based on the plurality of fusion feature data to obtain a target song voice recognition model.
Wherein, (1) the preset ASR model is an automatic speech recognition model, i.e., a model that converts human speech into editable text. (2) The target song voice recognition model is the trained ASR model, which can accurately recognize the song voice data to be recognized and obtain a high-precision translated text.
S3, inputting the song voice data to be recognized into the target song voice recognition model for recognition to obtain a target translation text corresponding to the song voice data to be recognized.
Wherein, the song voice data to be recognized is any song voice data, provided as FBANK-type voice data. The target translation text is the translated text, output by the target song voice recognition model, that corresponds to the song voice data to be recognized.
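As a concrete illustration of step S3, the following is a minimal inference sketch in Python. The `target_model` and `vocab` objects are hypothetical stand-ins (the patent fixes neither an API nor a decoding scheme), and simple greedy decoding is used only for illustration:

```python
# Minimal sketch of step S3: FBANK extraction plus greedy decoding.
# `target_model` and `vocab` are assumed, hypothetical objects.
import torch
import torchaudio

def recognize(wav_path, target_model, vocab):
    waveform, sample_rate = torchaudio.load(wav_path)
    # FBANK features, matching the patent's FBANK-type input
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, num_mel_bins=80, sample_frequency=sample_rate)
    with torch.no_grad():
        # (1, T, vocab) -> per-frame best token IDs
        token_ids = target_model(fbank.unsqueeze(0)).argmax(dim=-1).squeeze(0)
    # Greedy readout (a real decoder would also collapse repeats/blanks)
    return "".join(vocab[i] for i in token_ids.tolist())
```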
Preferably, the step of obtaining the voice feature data, text prosody feature data and acoustic prosody feature data in any original song voice sample containing the original song voice data and the original text data includes:
preprocessing the original text data of any original song voice sample to obtain first text data of any original song voice sample, and extracting text prosody feature data of any original song voice sample from the first text data of any original song voice sample.
Wherein, (1) the first text data is the text data obtained after text preprocessing of the original text data. (2) The text preprocessing deletes the punctuation in the original text data and performs format conversion through text mapping, yielding plain text data; this plain text data is the first text data in this embodiment. (3) The text prosody feature data of each original song voice sample is obtained according to a text prosody information rule base and the first text data. The rule base specifies that the initial consonant information, final information and tone information of each character in the text data are to be extracted; the information is extracted through corresponding script processing, and the extraction process is not repeated here.
Note that the rule settings of the text prosody information rule base are shown in Table 1 below. For example, when the first text data is "你好" ("hello"), the text prosody feature data (comprising initial consonant information, final information and tone information) obtained from the rule base is: "n, i, 2, h, ao, 3", where "n" is the initial consonant information of "你", "i" is its final information, and "2" is its tone information. Only one rule of the text prosody information rule base is listed above; information may be added to or deleted from the rules.
Table 1:
| Text prosodic information type | Initial consonant information | Final information | Tone information |
| Example: 你 | n | i | 2 |
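For illustration, the rule base's extraction can be approximated with the open-source pypinyin package. This is an assumption (the patent only says "corresponding script processing"), and pypinyin's tone digits follow its own convention and may differ from the Table 1 example:

```python
# Sketch of Table 1 extraction using pypinyin as a stand-in for the
# patent's own rule-base scripts (an assumption, not the patent's method).
from pypinyin import pinyin, Style

def text_prosody_features(first_text):
    initials = [p[0] for p in pinyin(first_text, style=Style.INITIALS, strict=False)]
    finals = [p[0] for p in pinyin(first_text, style=Style.FINALS, strict=False)]
    # Style.TONE3 appends the tone digit (e.g. "ni3"); keep the digit only,
    # with "0" as a fallback for neutral-tone syllables
    tones = [p[0][-1] if p[0][-1].isdigit() else "0"
             for p in pinyin(first_text, style=Style.TONE3)]
    return list(zip(initials, finals, tones))

print(text_prosody_features("你好"))  # [('n', 'i', '3'), ('h', 'ao', '3')]
```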
Decoupling the original song voice data of any original song voice sample through a Mel filter to obtain the sound feature data of that sample, and inputting the sound feature data into a preset GMM model to obtain the phoneme corresponding to each frame of sound feature data.
Wherein, (1) the Mel filter is used to decouple the sound features from the original song voice data (sound signal), thereby obtaining the sound features contained in the sound signal. (2) The GMM model is a Gaussian mixture model, used to obtain the phoneme corresponding to each frame of sound features; this embodiment uses a trained Gaussian mixture model.
It should be noted that the process of training the GMM model and the process of extracting the phoneme corresponding to each frame of sound feature through the GMM model are the prior art, and are not described herein again.
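As a rough illustration of this front end, the sketch below computes log-Mel (FBANK) features with librosa and uses scikit-learn's GaussianMixture as a stand-in for the patent's pre-trained GMM. The per-frame component index merely stands in for a phoneme label; a real system would map GMM states to phonemes via supervised alignment, and the file name is hypothetical:

```python
# Sketch of the Mel-filter front end plus frame-level GMM labeling.
# The GMM is fitted on the fly as a stand-in for a pre-trained model;
# its component IDs are pseudo-phoneme labels only.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

y, sr = librosa.load("song_sample.wav", sr=16000)   # hypothetical file
# 80-band Mel spectrogram: 25 ms windows (400 samples), 10 ms hop (160 samples)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=80)
fbank = np.log(mel + 1e-6).T                 # (num_frames, 80) FBANK features

gmm = GaussianMixture(n_components=64, covariance_type="diag").fit(fbank)
frame_phonemes = gmm.predict(fbank)          # one pseudo-phoneme ID per frame
```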
And acquiring the acoustic prosody feature data of any original song voice sample from each frame of sound feature data of that sample and the phoneme corresponding to each frame of sound feature data.
Specifically, based on an acoustic prosody rule base, the acoustic prosody feature data of each original song voice sample is obtained from each frame of sound feature data of that sample and the phoneme corresponding to each frame of sound feature data.
Note that the rule settings of the acoustic prosody rule base in this embodiment are shown in Table 2 below. For example, suppose the GMM model outputs the following alignment for the per-frame sound feature data: the first, second and third frames are aligned to phoneme "n", and the fourth and fifth frames are aligned to phoneme "i3". The acoustic prosody feature data obtained according to the acoustic prosody rule base is then: pronunciation duration of the previous prosodic unit: 3 frames; pronunciation duration of the current prosodic unit: 2 frames; pronunciation duration of the next prosodic unit: 4 frames.
Table 2:
[Table 2 appears in the original only as an image (Figure BDA0003933908700000071). Per the surrounding text, it maps each acoustic prosodic information type, at a chosen pronunciation-unit granularity, to attribute values A, B, C and D such as pronunciation durations in frames.]
In Table 2, granularity refers to the labeling method of the pronunciation unit; one frame of data can be labeled, from coarse to fine, as a word, a phoneme, a triphone, etc. A, B, C and D in Table 2 can be the frame counts of the actual pronunciation durations (or pitch or speed values); alternatively, the frame-count range is divided into intervals A to D and the interval matching the actual number of pronunciation frames is selected.
In addition, by inputting the phoneme alignment result and the voice features, different prosodic information can be extracted from the voice features according to different rules. The description above uses only the pronunciation duration in the acoustic prosody rule base as an example; other acoustic prosody features such as pronunciation pitch and pronunciation speed can be extracted in the same way and are not detailed here.
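Following the worked example above, the duration rule reduces to run-length encoding of the frame-level alignment. A minimal sketch (the unit boundaries and frame counts mirror the example, nothing more):

```python
# Sketch of the Table 2 duration rule: collapse the frame-level phoneme
# alignment into prosodic units and read off each unit's duration in frames.
from itertools import groupby

frame_phonemes = ["n", "n", "n", "i3", "i3"]   # GMM alignment from the example
units = [(ph, sum(1 for _ in grp)) for ph, grp in groupby(frame_phonemes)]
print(units)  # [('n', 3), ('i3', 2)] -> previous unit: 3 frames, current: 2
```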
Preferably, the step of fusing the voice feature data, the text prosody feature data and the acoustic prosody feature data of any original song voice sample includes:
and performing feature fusion on the voice feature data, the text prosody feature data and the acoustic prosody feature data of any original song voice sample through an attention mechanism to obtain fusion feature data of any original song voice sample.
It should be noted that the process of performing feature fusion on multiple features through an attention mechanism is the prior art, and is not described herein in detail.
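For concreteness, one standard way to realize such attention-based fusion is cross-attention in which the speech (FBANK) stream queries the two prosody streams. The sketch below uses PyTorch's MultiheadAttention; all dimensions and the final concatenation are assumptions, since the patent does not specify them:

```python
# Sketch of attention-based fusion of sound, text-prosody and
# acoustic-prosody features. Dimensions and the concatenation are assumed.
import torch
import torch.nn as nn

class ProsodyFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, speech_feats, text_prosody, acoustic_prosody):
        # Query: speech features; keys/values: the two prosody streams
        prosody = torch.cat([text_prosody, acoustic_prosody], dim=1)
        fused, _ = self.attn(speech_feats, prosody, prosody)
        # Fusion feature data: speech features plus attended prosody
        return torch.cat([speech_feats, fused], dim=-1)

fusion = ProsodyFusion()
out = fusion(torch.randn(2, 100, 256),   # 100 speech frames
             torch.randn(2, 20, 256),    # 20 text-prosody tokens
             torch.randn(2, 20, 256))    # 20 acoustic-prosody units
print(out.shape)  # torch.Size([2, 100, 512])
```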
Preferably, step S2 comprises:
and S21, inputting each fusion characteristic data into the preset ASR model respectively for training to obtain a loss value of each fusion characteristic data.
Specifically, each fusion feature data is input into a preset ASR model to obtain a predicted value corresponding to the fusion feature data, the predicted value corresponding to each fusion feature data is compared with a true value, and a loss value of each fusion feature data is calculated.
S22, optimizing the parameters of the preset ASR model according to all the loss values to obtain the optimized ASR model.
The process of optimizing the model parameters based on the loss value (loss function) is the prior art, and is not described herein in detail.
S23, taking the optimized ASR model as the preset ASR model, returning to execute the step of inputting each fusion feature data into the preset ASR model for training respectively, and determining the optimized ASR model as the target song speech recognition model when the optimized ASR model meets preset conditions.
Wherein, the preset conditions are, for example: the model reaches the maximum number of training iterations, or the loss function converges; no limitation is imposed here.
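A minimal training-loop sketch covering S21 to S23 follows. The CTC criterion, the Adam optimizer, and the `model`/`dataset` objects are all assumptions, since the patent specifies neither the loss nor the optimization method:

```python
# Sketch of the S21-S23 loop: per-sample loss (S21), parameter update (S22),
# and a stopping test on iteration count or loss convergence (S23).
# `model` and `dataset` are assumed objects; CTC is an assumed criterion.
import torch

def train(model, dataset, max_epochs=50, tol=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = torch.nn.CTCLoss(blank=0, zero_infinity=True)
    prev_loss = float("inf")
    for epoch in range(max_epochs):            # preset condition 1: max iterations
        total = 0.0
        for fused_feats, targets, in_lens, tgt_lens in dataset:
            log_probs = model(fused_feats).log_softmax(-1).transpose(0, 1)  # (T, N, C)
            loss = criterion(log_probs, targets, in_lens, tgt_lens)         # S21
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                                                # S22
            total += loss.item()
        if abs(prev_loss - total) < tol:       # preset condition 2: convergence
            break
        prev_loss = total
    return model                               # S23: the target recognition model
```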
According to the technical scheme of the embodiment, the plurality of prosodic features and the plurality of voice features in the song voice sample are fused, and the fused features are input into the voice recognition model for training, so that the accuracy of the voice recognition model for song voice recognition is improved, and the high-precision translated text of the song voice can be obtained.
As shown in fig. 2, a song speech recognition system 200 according to an embodiment of the present invention includes: a processing module 210, a training module 220, and a recognition module 230;
the processing module 210 is configured to: acquiring and respectively fusing voice characteristic data, text prosody characteristic data and acoustic prosody characteristic data in each original song voice sample containing original song voice data and original text data to obtain fusion characteristic data corresponding to each original song voice sample;
the training module 220 is configured to: training a preset ASR model for song voice recognition based on the plurality of fusion characteristic data to obtain a target song voice recognition model;
the recognition module 230 is configured to: inputting the song voice data to be recognized into the target song voice recognition model for recognition to obtain a target translation text corresponding to the song voice data to be recognized.
Preferably, the processing module 210 is specifically configured to:
the method comprises the steps of obtaining voice characteristic data, text prosody characteristic data and acoustic prosody characteristic data in any original song voice sample containing original song voice data and original text data, and comprises the following steps:
preprocessing the original text data of any original song voice sample to obtain first text data of any original song voice sample, and extracting text prosody feature data of any original song voice sample from the first text data of any original song voice sample;
decoupling the original song voice data of any original song voice sample through a Mel filter to obtain the sound feature data of that sample, and inputting the sound feature data into a preset GMM model to obtain a phoneme corresponding to each frame of sound feature data;
and acquiring the acoustic prosody feature data of any original song voice sample from each frame of sound feature data of that sample and the phoneme corresponding to each frame of sound feature data.
Preferably, the processing module 210 is further configured to:
and performing feature fusion on the voice feature data, the text prosody feature data and the acoustic prosody feature data of any original song voice sample through an attention mechanism to obtain fusion feature data of any original song voice sample.
According to the technical scheme of the embodiment, the plurality of prosodic features and the plurality of voice features in the song voice sample are fused, and the fused features are input into the voice recognition model for training, so that the accuracy of the voice recognition model for song voice recognition is improved, and the high-precision translated text of the song voice can be obtained.
For the steps by which the parameters and modules of the song speech recognition system 200 of this embodiment realize their functions, reference may be made to the parameters and steps in the above embodiments of the song speech recognition method, which are not repeated here.
An embodiment of the present invention provides a storage medium, including: the storage medium stores instructions, and when the instructions are read by the computer, the computer is caused to execute the steps of the song speech recognition method, which may specifically refer to the parameters and steps in the above embodiment of the song speech recognition method, and details are not described here.
The computer storage medium may be, for example, a flash drive, a portable hard disk, or the like.
An electronic device provided in an embodiment of the present invention includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and is characterized in that when the processor executes the computer program, the computer executes steps of a song speech recognition method, for which specific reference may be made to parameters and steps in an embodiment of the song speech recognition method, which are not described herein again.
As will be appreciated by one skilled in the art, the present invention may be embodied as methods, systems, storage media, and electronic devices.
Thus, the present invention may be embodied entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.), or in a combination of hardware and software, which may be referred to herein generally as a "circuit," "module," or "system." Furthermore, in some embodiments, the invention may also be embodied in the form of a computer program product in one or more computer-readable media having computer-readable program code embodied therein.

Any combination of one or more computer-readable media may be employed. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A song speech recognition method, comprising:
acquiring and respectively fusing voice characteristic data, text prosody characteristic data and acoustic prosody characteristic data in each original song voice sample containing original song voice data and original text data to obtain fusion characteristic data corresponding to each original song voice sample;
training a preset ASR model for song voice recognition based on the plurality of fusion characteristic data to obtain a target song voice recognition model;
and inputting the song voice data to be recognized into the target song voice recognition model for recognition to obtain a target translation text corresponding to the song voice data to be recognized.
2. The song speech recognition method of claim 1, wherein the step of obtaining the voice feature data, text prosody feature data, and acoustic prosody feature data in any original song speech sample containing original song speech data and original text data comprises:
preprocessing the original text data of any original song voice sample to obtain first text data of any original song voice sample, and extracting text prosody feature data of any original song voice sample from the first text data of any original song voice sample;
decoupling the original song voice data of any original song voice sample through a Mel filter to obtain the sound feature data of that sample, and inputting the sound feature data into a preset GMM model to obtain a phoneme corresponding to each frame of sound feature data;
and acquiring the acoustic prosody feature data of any original song voice sample from each frame of sound feature data of that sample and the phoneme corresponding to each frame of sound feature data.
3. The song speech recognition method of claim 2, wherein the step of fusing the voice feature data, the text prosody feature data, and the acoustic prosody feature data of any one of the original song speech samples comprises:
and performing feature fusion on the voice feature data, the text prosody feature data and the acoustic prosody feature data of any original song voice sample through an attention mechanism to obtain fusion feature data of any original song voice sample.
4. The song speech recognition method of claim 2 or 3, wherein the text prosody feature data of any original song speech sample comprises: initial information, final information and tone information of any original song voice sample; the acoustic prosody feature data of any original song voice sample comprises: the pronunciation duration, pronunciation speed and pronunciation tone of any original song voice sample.
5. The song speech recognition method according to claim 1, wherein the step of training a preset ASR model for song speech recognition based on the plurality of fusion feature data to obtain a target song speech recognition model comprises:
inputting each fusion characteristic data into the preset ASR model respectively for training to obtain a loss value of each fusion characteristic data;
optimizing the parameters of the preset ASR model according to all the loss values to obtain an optimized ASR model;
and taking the optimized ASR model as the preset ASR model, returning to execute the step of inputting each fusion characteristic data into the preset ASR model for training, and determining the optimized ASR model as the target song speech recognition model when the optimized ASR model meets preset conditions.
6. A song speech recognition system, comprising: a processing module, a training module and a recognition module;
the processing module is used for: acquiring and respectively fusing voice characteristic data, text prosody characteristic data and acoustic prosody characteristic data in each original song voice sample containing original song voice data and original text data to obtain fusion characteristic data corresponding to each original song voice sample;
the training module is configured to: training a preset ASR model for song voice recognition based on the fusion feature data to obtain a target song voice recognition model;
the recognition module is configured to: inputting the song voice data to be recognized into the target song voice recognition model for recognition to obtain a target translation text corresponding to the song voice data to be recognized.
7. The song speech recognition system of claim 6, wherein the processing module is specifically configured to:
the method comprises the following steps of obtaining voice characteristic data, text prosody characteristic data and acoustic prosody characteristic data in any original song voice sample containing original song voice data and original text data, wherein the steps comprise:
preprocessing the original text data of any original song voice sample to obtain first text data of any original song voice sample, and extracting text prosody feature data of any original song voice sample from the first text data of any original song voice sample;
decoupling the original song voice data of any original song voice sample through a Mel filter to obtain the sound feature data of that sample, and inputting the sound feature data into a preset GMM model to obtain a phoneme corresponding to each frame of sound feature data;
and acquiring the acoustic prosody feature data of any original song voice sample from each frame of sound feature data of that sample and the phoneme corresponding to each frame of sound feature data.
8. The song speech recognition system of claim 7, wherein the processing module is further specifically configured to:
and performing feature fusion on the voice feature data, the text prosody feature data and the acoustic prosody feature data of any original song voice sample through an attention mechanism to obtain fusion feature data of any original song voice sample.
9. A storage medium having stored therein instructions which, when read by a computer, cause the computer to execute the song speech recognition method according to any one of claims 1 to 5.
10. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, causes the computer to perform the song speech recognition method of any one of claims 1 to 5.
CN202211397956.0A 2022-11-09 2022-11-09 Song voice recognition method, system, storage medium and electronic equipment Active CN115862603B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202211397956.0A (CN115862603B) | 2022-11-09 | 2022-11-09 | Song voice recognition method, system, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202211397956.0A (CN115862603B) | 2022-11-09 | 2022-11-09 | Song voice recognition method, system, storage medium and electronic equipment

Publications (2)

Publication Number | Publication Date
CN115862603A | 2023-03-28
CN115862603B | 2023-06-20

Family

ID=85662859

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202211397956.0A (Active, CN115862603B) | Song voice recognition method, system, storage medium and electronic equipment | 2022-11-09 | 2022-11-09

Country Status (1)

Country Link
CN (1) CN115862603B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11338868A (en) * 1998-05-25 1999-12-10 Nippon Telegr & Teleph Corp <Ntt> Method and device for retrieving rhythm pattern by text, and storage medium stored with program for retrieving rhythm pattern by text
EP1785891A1 (en) * 2005-11-09 2007-05-16 Sony Deutschland GmbH Music information retrieval using a 3D search algorithm
CN106228977A (en) * 2016-08-02 2016-12-14 合肥工业大学 The song emotion identification method of multi-modal fusion based on degree of depth study
CN112750421A (en) * 2020-12-23 2021-05-04 出门问问(苏州)信息科技有限公司 Singing voice synthesis method and device and readable storage medium
CN115083397A (en) * 2022-05-31 2022-09-20 腾讯音乐娱乐科技(深圳)有限公司 Training method of lyric acoustic model, lyric recognition method, equipment and product
CN115169472A (en) * 2022-07-19 2022-10-11 腾讯科技(深圳)有限公司 Music matching method and device for multimedia data and computer equipment
CN115240656A (en) * 2022-07-22 2022-10-25 腾讯音乐娱乐科技(深圳)有限公司 Training of audio recognition model, audio recognition method and device and computer equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Souha Ayadi et al.: "Multiple Neural Network architectures for visual emotion recognition using Song-Speech modality", 2022 IEEE Information Technologies & Smart Industrial Systems (ITSIS)
Chen Yingcheng et al.: "Cover song recognition model based on the fusion of audio content and lyric text similarity", Journal of East China University of Science and Technology (Natural Science Edition)

Also Published As

Publication number Publication date
CN115862603B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
CN109065031B (en) Voice labeling method, device and equipment
US8478591B2 (en) Phonetic variation model building apparatus and method and phonetic recognition system and method thereof
US8731926B2 (en) Spoken term detection apparatus, method, program, and storage medium
US7472061B1 (en) Systems and methods for building a native language phoneme lexicon having native pronunciations of non-native words derived from non-native pronunciations
EP3734595A1 (en) Methods and systems for providing speech recognition systems based on speech recordings logs
CN111369974B (en) Dialect pronunciation marking method, language identification method and related device
JP5897718B2 (en) Voice search device, computer-readable storage medium, and voice search method
Marasek et al. System for automatic transcription of sessions of the Polish senate
CN112133285B (en) Speech recognition method, device, storage medium and electronic equipment
CN111640423B (en) Word boundary estimation method and device and electronic equipment
Sasmal et al. Isolated words recognition of Adi, a low-resource indigenous language of Arunachal Pradesh
KR20130126570A (en) Apparatus for discriminative training acoustic model considering error of phonemes in keyword and computer recordable medium storing the method thereof
CN111933116A (en) Speech recognition model training method, system, mobile terminal and storage medium
CN111785256A (en) Acoustic model training method and device, electronic equipment and storage medium
Sasmal et al. Robust automatic continuous speech recognition for'Adi', a zero-resource indigenous language of Arunachal Pradesh
CN115862603B (en) Song voice recognition method, system, storage medium and electronic equipment
CN114783424A (en) Text corpus screening method, device, equipment and storage medium
CN114203160A (en) Method, device and equipment for generating sample data set
CN114203180A (en) Conference summary generation method and device, electronic equipment and storage medium
JP2001312293A (en) Method and device for voice recognition, and computer- readable storage medium
CN112686041A (en) Pinyin marking method and device
JP4705535B2 (en) Acoustic model creation device, speech recognition device, and acoustic model creation program
Ahmed et al. Non-native accent pronunciation modeling in automatic speech recognition
Seman et al. Hybrid methods of Brandt’s generalised likelihood ratio and short-term energy for Malay word speech segmentation
CN111696530B (en) Target acoustic model obtaining method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant