CN115862603B - Song voice recognition method, system, storage medium and electronic equipment - Google Patents


Info

Publication number
CN115862603B
CN115862603B (application CN202211397956.0A)
Authority
CN
China
Prior art keywords: data, song, characteristic data, original, voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211397956.0A
Other languages
Chinese (zh)
Other versions
CN115862603A (en)
Inventor
周晓桐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shumei Tianxia Beijing Technology Co ltd
Beijing Nextdata Times Technology Co ltd
Original Assignee
Shumei Tianxia Beijing Technology Co ltd
Beijing Nextdata Times Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shumei Tianxia Beijing Technology Co ltd, Beijing Nextdata Times Technology Co ltd filed Critical Shumei Tianxia Beijing Technology Co ltd
Priority to CN202211397956.0A
Publication of CN115862603A
Application granted
Publication of CN115862603B


Abstract

The invention relates to a song voice recognition method, a song voice recognition system, a storage medium and an electronic device. The song voice recognition method comprises the following steps: acquiring and fusing the sound characteristic data, text prosody characteristic data and acoustic prosody characteristic data in each original song voice sample to obtain fused characteristic data corresponding to each original song voice sample; training a preset ASR model for song speech recognition on the fused characteristic data to obtain a target song voice recognition model; and inputting song voice data to be recognized into the target song voice recognition model for recognition to obtain the target translation text corresponding to the song voice data to be recognized. By fusing multiple prosodic features with the sound features of each song voice sample and training the voice recognition model on the fused features, the invention improves the accuracy of the voice recognition model on song voice and yields high-precision translation text for song voice.

Description

Song voice recognition method, system, storage medium and electronic equipment
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a song speech recognition method, system, storage medium, and electronic device.
Background
With the development of the internet and AI technology, automatic speech recognition is widely used in many subdivided fields; live-streaming scenes in particular involve a great number of song recognition requirements. Because traditional acoustic feature extraction retains little prosodic information, speech recognition models translate songs poorly and song voice recognition accuracy is low.
Therefore, a technical solution is needed to solve the above technical problems.
Disclosure of Invention
In order to solve the technical problems, the invention provides a song voice recognition method, a song voice recognition system, a storage medium and electronic equipment.
The technical scheme of the song voice recognition method is as follows:
acquiring and respectively fusing sound characteristic data, text prosody characteristic data and acoustic prosody characteristic data in each original song voice sample containing original song voice data and original text data to obtain fused characteristic data corresponding to each original song voice sample;
training a preset ASR model for song speech recognition based on the fusion characteristic data to obtain a target song speech recognition model;
and inputting the song voice data to be recognized into the target song voice recognition model for recognition to obtain target translation text corresponding to the song voice data to be recognized.
The song voice recognition method has the following beneficial effects:
according to the method, the multiple prosodic features and the sound features in the song voice sample are fused, and the fused features are input into the voice recognition model for training, so that the accuracy of the voice recognition model for song voice recognition is improved, and the translation text of high-precision song voice can be obtained.
On the basis of the scheme, the song voice recognition method can be improved as follows.
Further, the step of acquiring sound characteristic data, text prosody characteristic data, and acoustic prosody characteristic data in any of the original song speech samples containing the original song speech data and the original text data, includes:
preprocessing the original text data of any original song voice sample to obtain first text data of any original song voice sample, and extracting text prosody characteristic data of any original song voice sample from the first text data of any original song voice sample;
decoupling original song voice data of any original song voice sample through a Mel filter to obtain and input voice characteristic data of the any original song voice sample into a preset GMM model to obtain phonemes corresponding to each frame of voice characteristic data in the voice characteristic data of the any original song voice sample;
and acquiring acoustic prosody characteristic data of any original song voice sample from each frame of voice characteristic data of any original song voice sample and phonemes corresponding to each frame of voice characteristic data.
Further, the step of fusing the voice characteristic data, the text prosody characteristic data and the acoustic prosody characteristic data of the voice sample of any one of the original songs includes:
and carrying out feature fusion on the sound feature data, the text prosody feature data and the acoustic prosody feature data of the voice sample of any original song through an attention mechanism to obtain fusion feature data of the voice sample of any original song.
Further, the text prosody characteristic data of any one of the original song voice samples includes: the initial consonant information, the final sound information and the tone information of any original song voice sample; the acoustic prosody characteristic data of any original song voice sample comprises: the pronunciation duration, pronunciation speed and pronunciation tone of any original song voice sample.
Further, the step of training a preset ASR model for song speech recognition based on the plurality of fusion feature data to obtain a target song speech recognition model includes:
respectively inputting each fusion characteristic data into the preset ASR model for training to obtain a loss value of each fusion characteristic data;
optimizing parameters of the preset ASR model according to all the loss values to obtain an optimized ASR model;
and taking the optimized ASR model as the preset ASR model, and returning to execute the step of respectively inputting each fusion characteristic data into the preset ASR model for training until the optimized ASR model meets preset conditions, and determining the optimized ASR model as the target song speech recognition model.
The technical scheme of the song voice recognition system is as follows:
comprising the following steps: the system comprises a processing module, a training module and an identification module;
the processing module is used for: acquiring and respectively fusing sound characteristic data, text prosody characteristic data and acoustic prosody characteristic data in each original song voice sample containing original song voice data and original text data to obtain fused characteristic data corresponding to each original song voice sample;
the training module is used for: training a preset ASR model for song speech recognition based on the fusion characteristic data to obtain a target song speech recognition model;
the identification module is used for: and inputting the song voice data to be recognized into the target song voice recognition model for recognition to obtain target translation text corresponding to the song voice data to be recognized.
The song voice recognition system has the following beneficial effects:
according to the system, the multiple prosodic features and the sound features in the song voice sample are fused, and the fused features are input into the voice recognition model for training, so that the accuracy of the voice recognition model for song voice recognition is improved, and the translation text of high-precision song voice can be obtained.
Based on the scheme, the song voice recognition system can be improved as follows.
Further, the processing module is specifically configured to:
the step of acquiring sound characteristic data, text prosodic characteristic data and acoustic prosodic characteristic data in any of the original song speech samples comprising the original song speech data and the original text data comprises:
preprocessing the original text data of any original song voice sample to obtain first text data of any original song voice sample, and extracting text prosody characteristic data of any original song voice sample from the first text data of any original song voice sample;
decoupling original song voice data of any original song voice sample through a Mel filter to obtain and input voice characteristic data of the any original song voice sample into a preset GMM model to obtain phonemes corresponding to each frame of voice characteristic data in the voice characteristic data of the any original song voice sample;
and acquiring acoustic prosody characteristic data of any original song voice sample from each frame of voice characteristic data of any original song voice sample and phonemes corresponding to each frame of voice characteristic data.
Further, the processing module is specifically further configured to:
and carrying out feature fusion on the sound feature data, the text prosody feature data and the acoustic prosody feature data of the voice sample of any original song through an attention mechanism to obtain fusion feature data of the voice sample of any original song.
The technical scheme of the storage medium is as follows:
the storage medium has instructions stored therein which, when read by a computer, cause the computer to perform the steps of a song speech recognition method according to the invention.
The technical scheme of the electronic equipment is as follows:
comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, causes the computer to perform the steps of a song speech recognition method according to the invention.
Drawings
FIG. 1 is a schematic flow chart of a song speech recognition method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a song speech recognition system according to an embodiment of the present invention.
Detailed Description
As shown in fig. 1, a song voice recognition method according to an embodiment of the present invention includes the following steps:
s1, acquiring and respectively fusing sound characteristic data, text prosody characteristic data and acoustic prosody characteristic data in each original song voice sample containing original song voice data and original text data to obtain fused characteristic data corresponding to each original song voice sample.
Wherein:
(1) Each original song voice sample comprises original song voice data and original text data.
(2) The original song voice data is a sound signal containing FBANK sound features.
(3) The original text data is the label text corresponding to the original song voice data, i.e. the text obtained by labelling the content heard in the original song voice data (audio).
(4) The sound characteristic data is the sound feature data (FBANK features) extracted from the original song voice data (sound signal).
(5) The text prosody characteristic data marks prosodic features in the text, mainly comprising initial consonant information, final (vowel) information, tone information, and the like.
(6) The acoustic prosody characteristic data primarily relates to emotional prosody, mainly comprising the design of the prosodic sounding unit and attributes such as its pronunciation duration, pronunciation speed and pronunciation tone.
(7) The fusion characteristic data is the feature data obtained by fusing multiple features of an original song voice sample; it is used to train the voice recognition model so as to improve the accuracy of the translation text obtained when the model recognizes song voice.
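As a purely illustrative sketch of the data decomposition in (1)–(7) above — the class and field names are hypothetical, not from the patent — the sample and its three feature streams can be modelled as:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SongSample:
    """One original song voice sample: the sound signal plus its label text."""
    audio: np.ndarray   # original song voice data (sound signal)
    text: str           # original text data (labelled transcript)

@dataclass
class SampleFeatures:
    """The three feature streams extracted from one sample before fusion."""
    fbank: np.ndarray        # sound feature data (FBANK), shape (T, n_mels)
    text_prosody: list       # initials / finals / tones from the text
    acoustic_prosody: list   # per-unit duration, speed, pitch attributes

sample = SongSample(audio=np.zeros(16000), text="你好")
feats = SampleFeatures(fbank=np.zeros((98, 40)),
                       text_prosody=["n", "i", "2", "h", "ao", "3"],
                       acoustic_prosody=[("n", 3), ("i3", 2)])
```

The fusion step (S1) combines the three `SampleFeatures` streams into one training input per sample.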
S2, training a preset ASR model for song speech recognition based on the fusion characteristic data to obtain a target song speech recognition model.
Wherein, (1) the preset ASR model is: an automatic speech recognition model is a model that converts human speech into editable text. (2) The target song speech recognition model is: the trained ASR model can be used for accurately identifying song voice data to be identified, and high-precision translation text is obtained.
And S3, inputting the song voice data to be recognized into the target song voice recognition model for recognition, and obtaining a target translation text corresponding to the song voice data to be recognized.
The song voice data to be recognized is any song voice data of the FBANK type. The target translation text is the translation text corresponding to the song voice data to be recognized, as output by the target song voice recognition model.
Preferably, the step of acquiring sound characteristic data, text prosodic characteristic data, and acoustic prosodic characteristic data in any of the original song speech samples containing the original song speech data and the original text data comprises:
preprocessing the original text data of any original song voice sample to obtain first text data of any original song voice sample, and extracting text prosody characteristic data of any original song voice sample from the first text data of any original song voice sample.
Wherein:
(1) The first text data is the text data obtained by preprocessing the original text data.
(2) The text preprocessing comprises deleting punctuation from the original text data and performing format conversion through text mapping to obtain plain text data, which is the first text data in this embodiment.
(3) The text prosody characteristic data of each original song voice sample is obtained from the first text data according to a text prosody information rule base. The rule base records the initial consonant, final and tone information of each text unit; the extraction is performed by corresponding scripts and is not repeated here.
Note that the rule settings of the text prosody information rule base are shown in Table 1 below. For example, when the first text data is "你好" ("hello"), the text prosody characteristic data (comprising initial consonant, final and tone information) obtained from the rule base is: "n, i, 2, h, ao, 3", where "n" is the initial consonant information of "你", "i" is its final information, and "2" is its tone information. The above is only one example rule of the text prosody information rule base; information may be added to or deleted from the rules without limitation.
Table 1: Text prosodic information types — initial consonant information, final (vowel) information, tone information.
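A minimal sketch of such a rule base, using only the two-character example from the document — the real rule base is built by scripts and is not disclosed, so the dictionary below is a hypothetical stand-in:

```python
# Hypothetical miniature text-prosody rule base; entries follow the Table 1
# example ("你好" -> "n, i, 2, h, ao, 3" in the document).
RULE_BASE = {
    "你": ("n", "i", "2"),   # (initial, final, tone) — tone 2 per the document
    "好": ("h", "ao", "3"),
}

def text_prosody_features(first_text_data):
    """Flatten per-character (initial, final, tone) triples into one feature list."""
    feats = []
    for ch in first_text_data:
        initial, final, tone = RULE_BASE[ch]
        feats.extend([initial, final, tone])
    return feats
```

For the first text data "你好" this reproduces the document's example sequence.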
And decoupling the original song voice data of any original song voice sample through a Mel filter to obtain and input the voice characteristic data of any original song voice sample into a preset GMM model to obtain phonemes corresponding to each frame of voice characteristic data in the voice characteristic data of any original song voice sample.
Wherein: (1) the Mel filter is used to decouple the sound features from the original song voice data (sound signal), thereby obtaining the sound features in the sound signal. (2) The GMM is a Gaussian mixture model used to obtain the phoneme corresponding to each frame of sound features; in this embodiment a trained Gaussian mixture model is used.
It should be noted that, the process of training the GMM model and the process of extracting phonemes corresponding to the sound features of each frame through the GMM model are related art, and are not repeated herein.
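As a rough sketch of the Mel-filter step, FBANK features can be computed by applying a triangular mel filterbank to the framed power spectrum. The frame length, hop size, and filter count below are common defaults, not values taken from the patent:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def fbank(signal, sr=16000, n_fft=512, frame_len=400, hop=160, n_mels=40):
    """Log mel-filterbank (FBANK) features, one row per frame."""
    # Frame the signal and apply a Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular filters equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)   # falling slope
    return np.log(power @ fb.T + 1e-10)
```

Each row of the result is one frame of sound feature data of the kind the GMM then aligns to a phoneme.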
And acquiring acoustic prosody characteristic data of any original song voice sample from each frame of voice characteristic data of any original song voice sample and phonemes corresponding to each frame of voice characteristic data.
Specifically, based on an acoustic prosody rule base, acoustic prosody feature data of each original song voice sample is obtained from each frame of voice feature data of any original song voice sample and phonemes corresponding to each frame of voice feature data.
Note that the rule settings of the acoustic prosody rule base in this embodiment are shown in Table 2 below. For example, suppose the GMM model aligns five frames of sound feature data to the following phonemes — first frame: n, second frame: n, third frame: n, fourth frame: i3, fifth frame: i3. The acoustic prosody characteristic data obtained according to the acoustic prosody rule base is then: pronunciation duration of the previous prosodic unit ("n"): 3 frames; pronunciation duration of the current prosodic unit ("i3"): 2 frames.
Table 2: (rendered as an image in the original; it defines the pronunciation-unit granularity and the duration intervals A–D described below)
In Table 2, granularity refers to the labelling method of the pronunciation unit: from coarse to fine, a frame of data may correspondingly be represented as a word, a phoneme, a triphone, and so on. A, B, C and D in Table 2 denote intervals of the actual pronunciation frame count (and likewise of pitch and speed of sound); the frame counts are divided into the intervals A–D, and the interval corresponding to an actual pronunciation frame count can be looked up.
In addition, given the phoneme alignment result and the sound features as input, different prosodic information can be extracted from the sound features according to different rules. The above describes the acoustic prosody characteristic data using only pronunciation duration as an example; other acoustic prosodic features, such as pronunciation tone and pronunciation speed, are extracted analogously and are not repeated here.
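The duration rule above can be sketched as a run-length collapse of the per-frame alignment. The interval thresholds in `duration_interval` are hypothetical, since the patent does not disclose the actual A–D boundaries:

```python
from itertools import groupby

def prosodic_unit_durations(frame_phonemes):
    """Collapse the GMM's per-frame phoneme alignment into
    (prosodic unit, duration in frames) pairs."""
    return [(p, sum(1 for _ in g)) for p, g in groupby(frame_phonemes)]

def duration_interval(n_frames, edges=(2, 4, 8)):
    """Map a duration to one of the Table 2 intervals A-D.
    The edges are illustrative placeholders, not the patent's values."""
    for label, edge in zip("ABC", edges):
        if n_frames <= edge:
            return label
    return "D"
```

For the five-frame example above, the alignment `["n", "n", "n", "i3", "i3"]` yields the two units ("n", 3 frames) and ("i3", 2 frames).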
Preferably, the step of fusing the voice characteristic data, the text prosody characteristic data and the acoustic prosody characteristic data of the voice sample of any original song includes:
and carrying out feature fusion on the sound feature data, the text prosody feature data and the acoustic prosody feature data of the voice sample of any original song through an attention mechanism to obtain fusion feature data of the voice sample of any original song.
It should be noted that, the process of feature fusion of multiple features through the attention mechanism is the prior art, and is not repeated here.
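A minimal numerical sketch of attention-based fusion, assuming the three streams have already been frame-aligned and projected to a common dimension; the scoring vector `w` is a stand-in for learned attention parameters, and this is not the patent's actual fusion network:

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(streams, w):
    """streams: (n, T, d) — n frame-aligned feature streams (e.g. sound,
    text prosody, acoustic prosody) in a common dimension d.
    w: (d,) scoring vector. Returns (T, d) fused features: per frame,
    an attention-weighted sum over the n streams."""
    scores = np.einsum('ntd,d->nt', streams, w)   # relevance of each stream per frame
    alpha = softmax(scores, axis=0)               # weights over the n streams
    return np.einsum('nt,ntd->td', alpha, streams)
```

Because the weights form a convex combination per frame, the fused vector always lies between the per-stream minima and maxima.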
Preferably, step S2 includes:
s21, respectively inputting each fusion characteristic data into the preset ASR model for training to obtain a loss value of each fusion characteristic data.
Specifically, each piece of fusion characteristic data is input into a preset ASR model to obtain a predicted value corresponding to the fusion characteristic data, the predicted value corresponding to each piece of fusion characteristic data is compared with a true value, and a loss value of each piece of fusion characteristic data is calculated.
S22, optimizing parameters of the preset ASR model according to all the loss values to obtain an optimized ASR model.
The process of optimizing the model parameters based on the loss value (loss function) is the prior art, and is not repeated here.
S23, taking the optimized ASR model as the preset ASR model, and returning to execute the step of respectively inputting each fusion characteristic data into the preset ASR model for training until the optimized ASR model meets preset conditions, and determining the optimized ASR model as the target song speech recognition model.
Wherein, the preset conditions are: the model reaches the maximum number of iterative training or loss function convergence, etc., without limitation.
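Steps S21–S23 can be sketched with a toy differentiable model in place of the real ASR network — the least-squares model, learning rate, and tolerance below are illustrative assumptions, but the loop structure (compute loss, update parameters, stop on convergence or an iteration cap) mirrors the described training procedure:

```python
import numpy as np

def train_asr_stub(fused_feats, targets, lr=0.1, max_iters=500, tol=1e-6):
    """Toy stand-in for S21-S23: a linear least-squares 'model' trained by
    gradient descent until the loss converges or the iteration cap is hit
    (the patent's 'preset conditions')."""
    rng = np.random.default_rng(0)
    W = rng.normal(size=(fused_feats.shape[1], targets.shape[1]))
    prev = np.inf
    for _ in range(max_iters):
        pred = fused_feats @ W                      # S21: forward pass
        loss = np.mean((pred - targets) ** 2)       # S21: loss value
        if abs(prev - loss) < tol:                  # S23: convergence condition
            break
        prev = loss
        grad = 2 * fused_feats.T @ (pred - targets) / len(fused_feats)
        W -= lr * grad                              # S22: parameter optimization
    return W, loss
```

On exactly linear data the loop drives the loss close to zero well before the iteration cap.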
According to the technical scheme, the multiple prosodic features and the sound features in the song voice sample are fused, and the fused features are input into the voice recognition model for training, so that the accuracy of the voice recognition model for song voice recognition is improved, and the translation text of high-precision song voice can be obtained.
As shown in fig. 2, a song voice recognition system 200 according to an embodiment of the present invention includes: a processing module 210, a training module 220, and an identification module 230;
the processing module 210 is configured to: acquiring and respectively fusing sound characteristic data, text prosody characteristic data and acoustic prosody characteristic data in each original song voice sample containing original song voice data and original text data to obtain fused characteristic data corresponding to each original song voice sample;
the training module 220 is configured to: training a preset ASR model for song speech recognition based on the fusion characteristic data to obtain a target song speech recognition model;
the identification module 230 is configured to: and inputting the song voice data to be recognized into the target song voice recognition model for recognition to obtain target translation text corresponding to the song voice data to be recognized.
Preferably, the processing module 210 is specifically configured to:
the step of acquiring sound characteristic data, text prosodic characteristic data and acoustic prosodic characteristic data in any of the original song speech samples comprising the original song speech data and the original text data comprises:
preprocessing the original text data of any original song voice sample to obtain first text data of any original song voice sample, and extracting text prosody characteristic data of any original song voice sample from the first text data of any original song voice sample;
decoupling original song voice data of any original song voice sample through a Mel filter to obtain and input voice characteristic data of the any original song voice sample into a preset GMM model to obtain phonemes corresponding to each frame of voice characteristic data in the voice characteristic data of the any original song voice sample;
and acquiring acoustic prosody characteristic data of any original song voice sample from each frame of voice characteristic data of any original song voice sample and phonemes corresponding to each frame of voice characteristic data.
Preferably, the processing module 210 is specifically further configured to:
and carrying out feature fusion on the sound feature data, the text prosody feature data and the acoustic prosody feature data of the voice sample of any original song through an attention mechanism to obtain fusion feature data of the voice sample of any original song.
According to the technical scheme, the multiple prosodic features and the sound features in the song voice sample are fused, and the fused features are input into the voice recognition model for training, so that the accuracy of the voice recognition model for song voice recognition is improved, and the translation text of high-precision song voice can be obtained.
The steps for implementing the corresponding functions by the parameters and the modules in the song voice recognition system 200 according to the present embodiment are referred to the parameters and the steps in the embodiments of the song voice recognition method according to the present embodiment, and are not described herein.
The storage medium provided by the embodiment of the invention comprises: the storage medium stores instructions that, when read by a computer, cause the computer to perform steps such as a song speech recognition method, and specific reference may be made to the parameters and steps in the embodiments of a song speech recognition method described above, which are not described herein.
Computer storage media include, for example, flash drives and portable hard disks.
The electronic device provided in the embodiment of the present invention includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor executes the computer program to make the computer execute steps of a song voice recognition method, and specific reference may be made to each parameter and step in the above embodiments of a song voice recognition method, which are not described herein.
Those skilled in the art will appreciate that the present invention may be implemented as a method, system, storage medium, and electronic device.
Thus, the invention may be embodied in the following forms: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software, referred to herein generally as a "circuit," "module," or "system." Furthermore, in some embodiments, the invention may also be embodied as a computer program product in one or more computer-readable media containing computer-readable program code.
Any combination of one or more computer readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention; variations, modifications, alternatives and variants may be made to the above embodiments by those of ordinary skill in the art within the scope of the invention.

Claims (7)

1. A song speech recognition method, comprising:
acquiring and respectively fusing sound characteristic data, text prosody characteristic data and acoustic prosody characteristic data in each original song voice sample containing original song voice data and original text data to obtain fused characteristic data corresponding to each original song voice sample;
training a preset ASR model for song speech recognition based on the fusion characteristic data to obtain a target song speech recognition model;
inputting song voice data to be recognized into the target song voice recognition model for recognition to obtain target translation text corresponding to the song voice data to be recognized;
the step of acquiring sound characteristic data, text prosodic characteristic data and acoustic prosodic characteristic data in any of the original song speech samples comprising the original song speech data and the original text data comprises:
preprocessing the original text data of any original song voice sample to obtain first text data of any original song voice sample, and extracting text prosody characteristic data of any original song voice sample from the first text data of any original song voice sample;
decoupling original song voice data of any original song voice sample through a Mel filter to obtain and input voice characteristic data of the any original song voice sample into a preset GMM model to obtain phonemes corresponding to each frame of voice characteristic data in the voice characteristic data of the any original song voice sample;
acquiring acoustic prosody characteristic data of any original song voice sample from each frame of voice characteristic data of the any original song voice sample and phonemes corresponding to each frame of voice characteristic data;
the text prosody characteristic data of any original song voice sample comprises: the initial consonant information, the final sound information and the tone information of any original song voice sample; the acoustic prosody characteristic data of any original song voice sample comprises: the pronunciation duration, pronunciation speed and pronunciation tone of any original song voice sample.
2. The song speech recognition method according to claim 1, wherein the step of fusing the voice characteristic data, the text prosodic characteristic data, and the acoustic prosodic characteristic data of any one of the original song speech samples comprises:
and carrying out feature fusion on the sound feature data, the text prosody feature data and the acoustic prosody feature data of the voice sample of any original song through an attention mechanism to obtain fusion feature data of the voice sample of any original song.
3. The song speech recognition method according to claim 1, wherein training the preset ASR model for song speech recognition based on the plurality of fusion characteristic data to obtain the target song speech recognition model comprises:
inputting each piece of fusion characteristic data into the preset ASR model for training, to obtain a loss value for each piece of fusion characteristic data;
optimizing the parameters of the preset ASR model according to all the loss values, to obtain an optimized ASR model; and
taking the optimized ASR model as the preset ASR model and returning to the step of inputting each piece of fusion characteristic data into the preset ASR model for training, until the optimized ASR model meets a preset condition, at which point the optimized ASR model is determined as the target song speech recognition model.
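The claim-3 loop (one loss per fused sample, a parameter update from all losses, repeat until a preset condition) can be sketched with a toy linear model and squared-error loss. The model, learning rate, and stopping threshold are illustrative assumptions; the patent does not specify them.

```python
import numpy as np

def train_until_converged(model_w, samples, targets, lr=0.1,
                          loss_threshold=1e-3, max_rounds=1000):
    """Iterate: score each fused feature vector, collect per-sample losses,
    update the parameters from all losses, and repeat until the preset
    condition (here: mean loss below a threshold) is met."""
    w = model_w.astype(float).copy()
    for _ in range(max_rounds):
        losses, grads = [], []
        for x, y in zip(samples, targets):    # one loss value per sample
            pred = w @ x
            losses.append((pred - y) ** 2)
            grads.append(2 * (pred - y) * x)
        if np.mean(losses) < loss_threshold:  # 'preset condition' met
            break
        w -= lr * np.mean(grads, axis=0)      # optimize from all losses
    return w, float(np.mean(losses))
```

A production ASR model would replace the linear scorer with a neural network and the squared error with, e.g., a CTC or cross-entropy loss, but the control flow matches the claimed iterate-until-condition structure.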
4. A song speech recognition system, comprising: a processing module, a training module and a recognition module;
the processing module is configured to: acquire and respectively fuse the sound characteristic data, text prosody characteristic data and acoustic prosody characteristic data of each original song voice sample containing original song voice data and original text data, to obtain fusion characteristic data corresponding to each original song voice sample;
the training module is configured to: train a preset ASR model for song speech recognition based on the plurality of fusion characteristic data, to obtain a target song speech recognition model;
the recognition module is configured to: input song voice data to be recognized into the target song speech recognition model for recognition, to obtain a target translation text corresponding to the song voice data to be recognized;
wherein the processing module is specifically configured so that:
the step of acquiring the sound characteristic data, the text prosody characteristic data and the acoustic prosody characteristic data of any original song voice sample containing original song voice data and original text data comprises:
preprocessing the original text data of the original song voice sample to obtain first text data of the sample, and extracting the text prosody characteristic data of the sample from the first text data;
processing the original song voice data of the original song voice sample through a Mel filter to obtain the sound characteristic data of the sample, and inputting the sound characteristic data into a preset GMM model to obtain a phoneme corresponding to each frame of the sound characteristic data;
acquiring the acoustic prosody characteristic data of the original song voice sample from each frame of the sound characteristic data and the phoneme corresponding to that frame;
wherein the text prosody characteristic data of the original song voice sample comprises: initial (shengmu) information, final (yunmu) information and tone information of the sample; and the acoustic prosody characteristic data of the original song voice sample comprises: pronunciation duration, pronunciation speed and pronunciation tone of the sample.
5. The song speech recognition system of claim 4, wherein the processing module is further configured to:
perform feature fusion on the sound characteristic data, the text prosody characteristic data and the acoustic prosody characteristic data of the original song voice sample through an attention mechanism, to obtain the fusion characteristic data of the sample.
6. A storage medium having instructions stored therein which, when read by a computer, cause the computer to perform the song speech recognition method of any one of claims 1 to 3.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, causes the computer to perform the song speech recognition method of any one of claims 1 to 3.
CN202211397956.0A 2022-11-09 2022-11-09 Song voice recognition method, system, storage medium and electronic equipment Active CN115862603B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211397956.0A CN115862603B (en) 2022-11-09 2022-11-09 Song voice recognition method, system, storage medium and electronic equipment


Publications (2)

Publication Number Publication Date
CN115862603A CN115862603A (en) 2023-03-28
CN115862603B true CN115862603B (en) 2023-06-20

Family

ID=85662859

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211397956.0A Active CN115862603B (en) 2022-11-09 2022-11-09 Song voice recognition method, system, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN115862603B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112750421A (en) * 2020-12-23 2021-05-04 出门问问(苏州)信息科技有限公司 Singing voice synthesis method and device and readable storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11338868A (en) * 1998-05-25 1999-12-10 Nippon Telegr & Teleph Corp <Ntt> Method and device for retrieving rhythm pattern by text, and storage medium stored with program for retrieving rhythm pattern by text
EP1785891A1 (en) * 2005-11-09 2007-05-16 Sony Deutschland GmbH Music information retrieval using a 3D search algorithm
CN106228977B (en) * 2016-08-02 2019-07-19 合肥工业大学 Multi-mode fusion song emotion recognition method based on deep learning
CN115083397A (en) * 2022-05-31 2022-09-20 腾讯音乐娱乐科技(深圳)有限公司 Training method of lyric acoustic model, lyric recognition method, equipment and product
CN115169472A (en) * 2022-07-19 2022-10-11 腾讯科技(深圳)有限公司 Music matching method and device for multimedia data and computer equipment
CN115240656A (en) * 2022-07-22 2022-10-25 腾讯音乐娱乐科技(深圳)有限公司 Training of audio recognition model, audio recognition method and device and computer equipment

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112750421A (en) * 2020-12-23 2021-05-04 出门问问(苏州)信息科技有限公司 Singing voice synthesis method and device and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Multiple Neural Network architectures for visual emotion recognition using Song-Speech modality; Souha Ayadi et al.; 2022 IEEE Information Technologies & Smart Industrial Systems (ITSIS); entire document *
Cover song recognition model based on the fusion of audio content and lyric text similarity; Chen Yingcheng et al.; Journal of East China University of Science and Technology (Natural Science Edition); entire document *

Also Published As

Publication number Publication date
CN115862603A (en) 2023-03-28

Similar Documents

Publication Publication Date Title
CN109065031B (en) Voice labeling method, device and equipment
CN110148427B (en) Audio processing method, device, system, storage medium, terminal and server
CN110211565B (en) Dialect identification method and device and computer readable storage medium
US11514891B2 (en) Named entity recognition method, named entity recognition equipment and medium
US20180349495A1 (en) Audio data processing method and apparatus, and computer storage medium
CN111369974B (en) Dialect pronunciation marking method, language identification method and related device
CN111862954A (en) Method and device for acquiring voice recognition model
CN112259083B (en) Audio processing method and device
WO2021169825A1 (en) Speech synthesis method and apparatus, device and storage medium
CN113593522A (en) Voice data labeling method and device
CN112133285B (en) Speech recognition method, device, storage medium and electronic equipment
Marasek et al. System for automatic transcription of sessions of the Polish senate
CN111640423B (en) Word boundary estimation method and device and electronic equipment
CN115862603B (en) Song voice recognition method, system, storage medium and electronic equipment
Sasmal et al. Isolated words recognition of Adi, a low-resource indigenous language of Arunachal Pradesh
KR20130126570A (en) Apparatus for discriminative training acoustic model considering error of phonemes in keyword and computer recordable medium storing the method thereof
CN111933116A (en) Speech recognition model training method, system, mobile terminal and storage medium
Sasmal et al. Robust automatic continuous speech recognition for'Adi', a zero-resource indigenous language of Arunachal Pradesh
Cahyaningtyas et al. Development of under-resourced Bahasa Indonesia speech corpus
Deshwal et al. A Structured Approach towards Robust Database Collection for Language Identification
CN112820281B (en) Voice recognition method, device and equipment
CN114203160A (en) Method, device and equipment for generating sample data set
CN114203180A (en) Conference summary generation method and device, electronic equipment and storage medium
JP4705535B2 (en) Acoustic model creation device, speech recognition device, and acoustic model creation program
CN114678040B (en) Voice consistency detection method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant