CN112489646B - Speech recognition method and device thereof - Google Patents

Speech recognition method and device thereof

Info

Publication number
CN112489646B
Authority
CN
China
Prior art keywords
word
language model
voice recognition
sequence
intermediate result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011295150.1A
Other languages
Chinese (zh)
Other versions
CN112489646A (en)
Inventor
沈来信
朱相宇
王映新
孙明东
贾师惠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Thunisoft Information Technology Co ltd
Original Assignee
Beijing Thunisoft Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Thunisoft Information Technology Co ltd
Priority to CN202011295150.1A
Publication of CN112489646A
Application granted
Publication of CN112489646B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/06 - Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/08 - Speech classification or search
    • G10L 2015/088 - Word spotting
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/12 - Speech or voice analysis techniques in which the extracted parameters are prediction coefficients

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a speech recognition method and a speech recognition device. The method comprises the following steps: acquiring input voice data; decoding the voice data through a decoding model to generate a speech recognition intermediate result; matching the speech recognition intermediate result against the core-word pinyin and tone sequences in a core word database; and outputting a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result. Matching the speech recognition intermediate result against the core-word pinyin and tone sequences in the core word database mitigates the problem of the speech recognition result deviating from the normal context.

Description

Speech recognition method and device thereof
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a speech recognition method and apparatus.
Background
Speech recognition decoding is strongly tied to the application scenario, and users expect the recognition model to decode their scenario-specific corpus with a certain directivity. At present, speech recognition is adapted through user hotwords: when hotwords are uploaded, they are defined manually and each is assigned a weight value. If the weight values differ too widely, the recognition result deviates severely from the normal context; in addition, the number of hotwords that can be uploaded is limited, and selecting suitable hotwords is difficult for users.
Disclosure of Invention
The embodiment of the application provides a speech recognition method, which is used for solving the problem in the prior art that the speech recognition result deviates from the normal context. The method specifically comprises the following steps:
acquiring input voice data;
decoding the voice data through a decoding model to generate a speech recognition intermediate result;
matching the speech recognition intermediate result against the core-word pinyin and tone sequences in a core word database;
and outputting a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result.
Further, in a preferred embodiment provided in the present application, the decoding model is composed of an acoustic model, a dictionary, and a language model.
Further, in a preferred embodiment provided in the present application, the language model is a new language model generated by interpolation fitting of a foreground language model and a background language model, both based on the preprocessed text corpus;
the foreground language model is the user's language model, its weight value is preset to 0.5-0.8, and it contains the user-specified scenario corpus; the background language model is the language model of the original speech recognition engine and contains corpora for all scenarios.
Further, in a preferred embodiment provided herein, smoothing and pruning operations are performed on the new language model;
the pruning operation, guided by the foreground language model, deletes irrelevant scenario corpora from the background language model while retaining the foreground-language branches; the smoothing operation redistributes the conditional probabilities of all scenario corpora in the newly generated language model so that, after smoothing, they sum to 1.
Further, in a preferred embodiment provided in the present application, the core word database is built by segmenting the preprocessed text corpus into words, counting word frequencies, and generating corresponding word weights from those frequencies;
each word's weight is calculated by dividing its word frequency by the sum of the maximum word frequency and a constant, where the constant is the median of all word frequencies.
Further, in a preferred embodiment provided in the present application, the core word database can be matched against core word information uploaded by a user and automatically recommends a corresponding weight value, which the user can adjust according to actual requirements to increase the accuracy of speech recognition;
if the user's core word is not found by the search, the median weight of all words in the current core word database is taken as the recommended value.
Further, in a preferred embodiment provided in the present application, when the matching result indicates that the speech recognition intermediate result contains a pinyin-and-tone sequence with a corresponding entry in the database, core word replacement is performed on that sequence.
Further, in a preferred embodiment provided in the present application, during core word replacement, if the language-model perplexity of the sentence containing the replacement sequence is lower than that of the original sentence by at least a threshold value, the replacement of the core word sequence is completed and a speech recognition intermediate result containing the replacement sequence is output;
wherein the reduction threshold can be adjusted according to the actual environment.
Further, in a preferred embodiment provided in the present application, before the step of outputting the sentence containing the replacement sequence as the speech recognition result, the method further includes performing sentence breaking and punctuation prediction on the sentence containing the replacement sequence.
The embodiment of the application provides a speech recognition device, which comprises:
a voice receiving module, used for receiving voice data;
a voice decoding module, used for decoding the voice data and generating a speech recognition intermediate result;
a speech recognition intermediate result matching module, used for matching the speech recognition intermediate result against the core-word pinyin and tone sequences in the database;
and a speech recognition result output module, used for outputting a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result.
The embodiment provided by the application has at least the following beneficial effects:
the voice recognition method and the device can solve the problem that the voice recognition result deviates from the normal context.
Drawings
Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of a speech recognition device according to an embodiment of the present application.
100. Speech recognition device
11. Voice receiving module
12. Speech decoding module
13. Speech recognition intermediate result matching module
14. Speech recognition result output module
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the technical solutions of the present application will be described clearly and completely below with reference to specific embodiments of the present application and the corresponding drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without inventive effort shall fall within the protection scope of the present application.
Referring to fig. 1, the present application discloses a speech recognition method, which includes:
s100: input voice data is acquired.
The voice data is either voice stream data input in real time or file stream data read from an audio file.
Voice stream data can be acquired by recording in real time through a microphone, a sound card, or other hardware with a real-time recording function. File stream data can be obtained by reading an audio file in which recorded audio data is stored; supported audio file suffix formats include: .WAV, .AIF, .AIFF, .AU, .MP1, .MP2, .MP3, .RA, .RM, and .RAM.
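For illustration only, the file-stream input path can be sketched with Python's standard wave module as follows; the chunked-reading scheme and the decoder call in the usage note are assumptions made for the sketch, not the disclosed implementation.
```python
import wave

def read_wav_file_stream(path, chunk_frames=4000):
    """Yield raw PCM chunks from a .wav file, emulating file-stream input."""
    with wave.open(path, "rb") as wav:
        while True:
            chunk = wav.readframes(chunk_frames)
            if not chunk:
                break
            yield chunk

# Hypothetical usage: feed chunks to a decoder as they arrive.
# for chunk in read_wav_file_stream("recording.wav"):
#     decoder.accept_waveform(chunk)  # decoder API assumed, not disclosed
```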
S200: decoding the voice data through a decoding model to generate a speech recognition intermediate result.
Further, in a preferred embodiment provided in the present application, the speech decoding model is composed of an acoustic model, a dictionary, and a language model.
The acoustic model establishes the mapping between acoustic features in the voice data and phonemes; the dictionary establishes the mapping between phonemes and words; the language model establishes the mapping between words and sentences. Using the mappings established by the acoustic model, the dictionary, and the language model, the computer completes the decoding of the voice data and generates the corresponding speech recognition intermediate result.
Specifically, the acoustic model is a knowledge representation of variability in acoustics, phonetics, environmental conditions, speaker gender, accent, and the like; the language model is a knowledge representation of sets of word sequences; the dictionary is the set of phoneme indexes corresponding to words.
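As a purely conceptual sketch of how these three mappings compose, consider the following; a real decoder searches all three knowledge sources jointly (for example, with a beam search over a decoding graph), so every object and method name here is hypothetical.
```python
# Conceptual pipeline only: acoustic model (features -> phonemes), dictionary
# (phonemes -> candidate word sequences), language model (scores sentences).
def decode(features, acoustic_model, dictionary, language_model):
    phonemes = acoustic_model.best_phoneme_sequence(features)
    candidate_sentences = dictionary.sentences_for(phonemes)
    return max(candidate_sentences, key=language_model.score)
```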
Further, in a preferred embodiment provided in the present application, the language model is a new language model generated by interpolation fitting of a foreground language model and a background language model, both based on the preprocessed text corpus;
the foreground language model is the user's language model, its weight value is preset to 0.5-0.8, and it contains the user-specified scenario corpus; the background language model is the language model of the original speech recognition engine and contains corpora for all scenarios.
Specifically, interpolation fitting is used to combine the two language models and improve the modeling effect; when the foreground weight is set to 0.6, the corpus distribution of the newly generated language model is optimized and the processing effect is best.
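For illustration, the interpolation can be sketched as a per-word linear combination of the two models; the model objects and their prob() method are assumptions, not the disclosed engine API.
```python
FOREGROUND_WEIGHT = 0.6  # preset inside the disclosed 0.5-0.8 range

def interpolated_prob(word, history, fg_model, bg_model, lam=FOREGROUND_WEIGHT):
    """P(w | h) = lam * P_foreground(w | h) + (1 - lam) * P_background(w | h)."""
    return lam * fg_model.prob(word, history) + (1.0 - lam) * bg_model.prob(word, history)
```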
Specifically, the preprocessed text corpus is the user's full text corpus from which punctuation marks, meaningless filler words, and stop words have been removed, and in which numbers have been converted by a number conversion module into the written form of the corresponding corpus text.
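A minimal sketch of this preprocessing follows, assuming a simplified stop-word list and a character-level digit conversion in place of the disclosed number conversion module.
```python
import re

STOP_WORDS = {"的", "了", "啊", "嗯"}                 # illustrative subset only
DIGIT_TO_TEXT = dict(zip("0123456789", "零一二三四五六七八九"))

def preprocess(text):
    text = re.sub(r"[^\w\s]", "", text)                        # strip punctuation
    text = "".join(DIGIT_TO_TEXT.get(ch, ch) for ch in text)   # digits -> written form
    # Character-level filtering suffices here because the sample stop words are
    # single characters; a real module would filter segmented words instead.
    return "".join(ch for ch in text if ch not in STOP_WORDS)
```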
Further, in a preferred embodiment provided herein, smoothing and pruning operations are performed on the new language model;
the pruning operation, guided by the foreground language model, deletes irrelevant scenario corpora from the background language model while retaining the foreground-language branches; the smoothing operation redistributes the conditional probabilities of all scenario corpora in the newly generated language model so that, after smoothing, they sum to 1.
S300: performing a matching operation on the speech recognition intermediate result based on the core-word pinyin and tone sequences in the core word database.
Further, in a preferred embodiment provided in the present application, the core word database is built by segmenting the preprocessed text corpus into words, counting word frequencies, and generating corresponding word weights from those frequencies.
Each word's weight is calculated by dividing its word frequency by the sum of the maximum word frequency and a constant, where the constant is the median of all word frequencies.
Specifically, word segmentation uses the dictionary of the decoding model together with a reverse (backward) maximum matching algorithm, which gives the best segmentation effect. Word frequency counting then counts the number of occurrences of each word in the segmentation results.
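The segmentation, frequency counting, and weight formula can be sketched together as follows; the reverse maximum matching routine is a textbook version assumed for illustration, not code taken from the disclosure.
```python
from collections import Counter
from statistics import median

def rmm_segment(sentence, dictionary, max_len=6):
    """Reverse maximum matching: scan from the sentence end, always taking the
    longest dictionary word; unmatched single characters pass through."""
    words, end = [], len(sentence)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = sentence[end - size:end]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    return list(reversed(words))

def segmentation_weights(sentences, dictionary):
    """Disclosed formula: weight = freq / (max_freq + constant), where the
    constant is the median of all word frequencies."""
    freq = Counter(w for s in sentences for w in rmm_segment(s, dictionary))
    constant = median(freq.values())
    max_freq = max(freq.values())
    return {w: f / (max_freq + constant) for w, f in freq.items()}
```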
Furthermore, in a preferred embodiment provided in the present application, the core word database can be matched against the core word information uploaded by the user and automatically recommends a corresponding weight value, which the user can adjust according to actual requirements to increase the accuracy of speech recognition;
if the user's core word is not found by the search, the median weight of all words in the current core word database is taken as the recommended value.
Specifically, the core word input by the user is matched against the words in the core word database. If a corresponding core word is found in the database, its weight is recommended to the user as the recommended value. The user can then increase or decrease the recommended weight according to the actual scenario, thereby improving the accuracy of speech recognition in that scenario.
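A minimal sketch of this recommendation logic, assuming the core word database is a simple word-to-weight mapping:
```python
from statistics import median

def recommend_weight(core_word, core_word_db):
    """Return the stored weight when the uploaded core word is found;
    otherwise fall back to the median weight across the database."""
    if core_word in core_word_db:
        return core_word_db[core_word]
    return median(core_word_db.values())
```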
S400: outputting a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result.
Further, in a preferred embodiment provided in the present application, when the matching result indicates that a sequence in the speech recognition intermediate result has a corresponding pinyin-and-tone entry in the database, core word replacement is performed on that sequence.
Specifically, if no sequence in the speech recognition intermediate result matches a pinyin-and-tone sequence in the database, the speech recognition intermediate result can be output directly as the speech recognition result.
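For illustration, the matching scan might look like the sketch below; the to_pinyin_tones() converter and the database entry layout are assumptions, not the disclosed structures.
```python
def find_core_word_matches(intermediate_result, core_word_db, to_pinyin_tones):
    """Return (start, end, core_word) spans whose pinyin+tone sequence matches a
    core-word entry; to_pinyin_tones() yields syllables such as ["fa3", "yuan4"]."""
    syllables = to_pinyin_tones(intermediate_result)
    matches = []
    for core_word, entry in core_word_db.items():
        target = entry["pinyin_tones"]  # stored pinyin+tone sequence for the word
        for i in range(len(syllables) - len(target) + 1):
            if syllables[i:i + len(target)] == target:
                matches.append((i, i + len(target), core_word))
    return matches
```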
Further, in a preferred embodiment provided in the present application, during core word replacement, if the language-model perplexity of the sentence containing the replacement sequence is lower than that of the original sentence by at least a threshold value, the replacement of the core word sequence is completed and a speech recognition intermediate result containing the replacement sequence is output;
wherein the reduction threshold can be adjusted according to the actual environment.
Specifically, the smaller the perplexity value of the language model, the better the replacement sequence matches the sentence after core word replacement. The threshold is set to 0.1 by default and can be adjusted to control how strictly the replacement sequence must match the sentence.
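The replacement test then reduces to a perplexity comparison, sketched below; the perplexity() method stands in for a hypothetical language model interface.
```python
DEFAULT_THRESHOLD = 0.1  # default reduction threshold from the description

def accept_replacement(lm, original_sentence, replaced_sentence,
                       threshold=DEFAULT_THRESHOLD):
    """Keep the core word replacement only when language-model perplexity drops
    by at least the threshold relative to the original sentence."""
    drop = lm.perplexity(original_sentence) - lm.perplexity(replaced_sentence)
    return drop >= threshold
```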
Further, in a preferred embodiment provided in the present application, before the step of outputting the sentence containing the replacement sequence as the speech recognition result, the method further includes performing sentence breaking and punctuation prediction on the sentence containing the replacement sequence.
A speech recognition apparatus 100, comprising:
a voice receiving module 11, configured to receive voice data;
a voice decoding module 12, configured to decode the voice data and generate a speech recognition intermediate result;
a speech recognition intermediate result matching module 13, configured to match the speech recognition intermediate result against the core-word pinyin and tone sequences in the database;
and a speech recognition result output module 14, configured to output a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result.
In a typical configuration, a computer may include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims (9)

1. A method of speech recognition, comprising:
acquiring input voice data;
decoding the voice data through a decoding model to generate a speech recognition intermediate result;
matching the speech recognition intermediate result against the core-word pinyin and tone sequences in a core word database;
outputting a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result;
wherein the core word database is built by segmenting a preprocessed text corpus into words, counting word frequencies, and generating corresponding word weights from those frequencies;
and each word's weight is calculated by dividing its word frequency by the sum of the maximum word frequency and a constant, the constant being the median of all word frequencies.
2. The speech recognition method of claim 1, wherein the decoding model is composed of an acoustic model, a dictionary, and a language model.
3. The speech recognition method of claim 2, wherein the language model is a new language model generated by interpolation fitting of a foreground language model and a background language model, both based on the preprocessed text corpus;
the foreground language model is the user's language model, its weight value is preset to 0.5-0.8, and it contains the user-specified scenario corpus; the background language model is the language model of the original speech recognition engine and contains corpora for all scenarios.
4. The speech recognition method of claim 3, wherein smoothing and pruning operations are performed on the newly generated language model;
the pruning operation, guided by the foreground language model, deletes irrelevant scenario corpora from the background language model while retaining the foreground-language branches; the smoothing operation redistributes the conditional probabilities of all scenario corpora in the newly generated language model so that, after smoothing, they sum to 1.
5. The speech recognition method of claim 1, wherein the core word database can be matched against core word information uploaded by a user and automatically recommends a corresponding weight value, which the user can adjust according to actual requirements to increase the accuracy of speech recognition;
and if the user's core word is not found by the search, the median weight of all words in the current core word database is taken as the recommended value.
6. The method of claim 1, wherein when the matching result indicates that the speech recognition intermediate result contains a pinyin-and-tone sequence with a corresponding entry in the database, core word replacement is performed on that pinyin-and-tone sequence.
7. The method of claim 6, wherein during core word replacement, if the language-model perplexity of the sentence containing the replacement sequence is lower than that of the original sentence by at least a threshold value, the replacement of the core word sequence is completed and the speech recognition intermediate result containing the replacement sequence is output;
wherein the reduction threshold is adjusted according to the actual environment.
8. The method of claim 7, further comprising, before the step of outputting the sentence containing the replacement sequence as a speech recognition result, performing sentence breaking and punctuation prediction on the sentence containing the replacement sequence.
9. A speech recognition apparatus, comprising:
a voice receiving module, configured to receive voice data;
a voice decoding module, configured to decode the voice data and generate a speech recognition intermediate result;
a speech recognition intermediate result matching module, configured to match the speech recognition intermediate result against the core-word pinyin and tone sequences in a database;
a speech recognition result output module, configured to output a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result;
wherein the core word database is built by segmenting a preprocessed text corpus into words, counting word frequencies, and generating corresponding word weights from those frequencies;
and each word's weight is calculated by dividing its word frequency by the sum of the maximum word frequency and a constant, the constant being the median of all word frequencies.
CN202011295150.1A 2020-11-18 2020-11-18 Speech recognition method and device thereof Active CN112489646B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011295150.1A CN112489646B (en) 2020-11-18 2020-11-18 Speech recognition method and device thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011295150.1A CN112489646B (en) 2020-11-18 2020-11-18 Speech recognition method and device thereof

Publications (2)

Publication Number Publication Date
CN112489646A (en) 2021-03-12
CN112489646B (en) 2024-04-02

Family

Family ID: 74931400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011295150.1A Active CN112489646B (en) 2020-11-18 2020-11-18 Speech recognition method and device thereof

Country Status (1)

Country Link
CN (1) CN112489646B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113470619B (en) * 2021-06-30 2023-08-18 北京有竹居网络技术有限公司 Speech recognition method, device, medium and equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101296128A (en) * 2007-04-24 2008-10-29 北京大学 Method for monitoring abnormal state of internet information
CN105654945B (en) * 2015-10-29 2020-03-06 乐融致新电子科技(天津)有限公司 Language model training method, device and equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063900A (en) * 2010-11-26 2011-05-18 北京交通大学 Speech recognition method and system for overcoming confusing pronunciation
KR20160078703A (en) * 2014-12-24 2016-07-05 한국전자통신연구원 Method and Apparatus for converting text to scene
CN109635081A (en) * 2018-11-23 2019-04-16 上海大学 A kind of text key word weighing computation method based on word frequency power-law distribution characteristic
CN110675855A (en) * 2019-10-09 2020-01-10 出门问问信息科技有限公司 Voice recognition method, electronic equipment and computer readable storage medium
CN110970026A (en) * 2019-12-17 2020-04-07 用友网络科技股份有限公司 Voice interaction matching method, computer device and computer-readable storage medium

Also Published As

Publication number Publication date
CN112489646A (en) 2021-03-12

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant