CN112489646A - Speech recognition method and device - Google Patents

Speech recognition method and device

Info

Publication number
CN112489646A
CN112489646A (application CN202011295150.1A)
Authority
CN
China
Prior art keywords
language model
speech recognition
word
voice
intermediate result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011295150.1A
Other languages
Chinese (zh)
Other versions
CN112489646B (en)
Inventor
沈来信
朱相宇
王映新
孙明东
贾师惠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Thunisoft Information Technology Co ltd
Original Assignee
Beijing Thunisoft Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Thunisoft Information Technology Co ltd filed Critical Beijing Thunisoft Information Technology Co ltd
Priority to CN202011295150.1A priority Critical patent/CN112489646B/en
Publication of CN112489646A publication Critical patent/CN112489646A/en
Application granted granted Critical
Publication of CN112489646B publication Critical patent/CN112489646B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/12 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being prediction coefficients
    • G10L2015/088 Word spotting

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a speech recognition method and an apparatus therefor. The method comprises the following steps: acquiring input voice data; decoding the voice data through a decoding model to generate a speech recognition intermediate result; matching the speech recognition intermediate result against the core-word pinyin-and-tone sequences in a core word database; and outputting a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result. By matching the speech recognition intermediate result against the core-word pinyin-and-tone sequences in the core word database, the problem of the speech recognition result deviating from the normal context can be solved.

Description

Speech recognition method and device
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a speech recognition method and apparatus.
Background
Speech recognition decoding is highly dependent on the application scenario, and users expect a speech recognition model to decode with a certain bias toward their own scenario corpus. At present, speech recognition is adapted with user hotwords: when uploading, the user manually defines the hotwords and sets a weight value for each. If the weight values are set too large, the speech recognition result deviates severely from the normal context; furthermore, the number of hotwords that can be uploaded is limited, and users find it difficult to select suitable hotwords.
Disclosure of Invention
The embodiments of the present application provide a speech recognition method for solving the prior-art problem that a speech recognition result deviates from the normal context. The method specifically comprises the following steps:
acquiring input voice data;
decoding the voice data through a decoding model to generate a speech recognition intermediate result;
matching the speech recognition intermediate result against the core-word pinyin-and-tone sequences in a core word database;
and outputting a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result.
Further, in a preferred embodiment provided by the present application, the decoding model is formed by an acoustic model, a dictionary, and a language model.
Further, in a preferred embodiment provided by the present application, the language model is a new language model generated by interpolation fitting of a foreground language model and a background language model on the basis of the preprocessed text corpus;
the foreground language model is the user language model, with a weight value preset to 0.5-0.8, and contains the user-specified scenario corpus; the background language model is the language model of the original speech recognition engine and contains its scenario corpora.
Further, in a preferred embodiment provided by the present application, smoothing and pruning operations are performed on the newly generated language model;
the pruning operation, guided by the foreground language model, deletes irrelevant scenario corpora from the background language model while retaining the branches of the foreground language model; the smoothing operation redistributes the conditional probabilities of all scenario corpora in the newly generated language model so that, after smoothing, they sum to 1.
Further, in a preferred embodiment provided by the present application, the core word database is constructed by performing word segmentation and word-frequency statistics on the preprocessed text corpus and generating a weight for each segmented word according to its frequency;
the weight of each word is calculated by dividing its word frequency by the sum of the maximum word frequency and a constant, where the constant is the median of all word frequencies.
Further, in a preferred embodiment provided by the present application, the core word database can be matched against the core-word information uploaded by the user, and a corresponding weight value is automatically recommended; the user may adjust the weight value according to actual needs to increase the accuracy of speech recognition;
if the user's core word is not found by retrieval, the median weight of all words currently in the core word database is used as the recommended value.
Further, in a preferred embodiment provided by the present application, when the matching result indicates that the speech recognition intermediate result has a corresponding pinyin-and-tone sequence in the database, core-word replacement is performed on that sequence.
Further, in a preferred embodiment provided by the present application, during core-word replacement, if the language-model perplexity of the sentence containing the replacement sequence is lower than that of the original sentence by at least a threshold, the core-word sequence replacement is committed and the speech recognition intermediate result containing the replacement sequence is output;
the required perplexity drop (the threshold) can be adjusted according to the actual environment.
Further, in a preferred embodiment provided by the present application, before the step of outputting the sentence containing the replacement sequence as a speech recognition result, the method further comprises performing sentence-break and punctuation prediction on that sentence.
An embodiment of the present application provides a speech recognition apparatus, including:
the voice receiving module, configured to receive voice data;
the voice decoding module, configured to decode the voice data and generate a speech recognition intermediate result;
the speech recognition intermediate result matching module, configured to match the speech recognition intermediate result against the core-word pinyin-and-tone sequences in the database;
and the speech recognition result output module, configured to output a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result.
The embodiment provided by the application has at least the following beneficial effects:
the problem that the speech recognition result deviates from the normal context can be solved by the speech recognition method and the speech recognition device.
Drawings
Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application.
100 speech recognition device
11 voice receiving module
12 voice decoding module
13 speech recognition intermediate result matching module
14 speech recognition result output module
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, the present application discloses a speech recognition method, including:
s100: input voice data is acquired.
The voice data is either voice-stream data input in real time or file-stream data from an audio file.
Voice-stream data can be acquired through hardware with a real-time recording function, such as a microphone or a sound card, recording the speech as it is produced. File-stream data can be obtained by reading an audio file that stores recorded audio data; common audio file suffixes include .WAV, .AIF, .AIFF, .AU, .MP1, .MP2, .MP3, .RA, .RM, and .RAM.
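By way of illustration only, a minimal Python sketch of obtaining file-stream data from a recorded .WAV file follows; the chunked loop mimics feeding a decoder the same way a real-time voice stream would. The file name and chunk size are placeholders introduced here, and other listed formats (e.g. .MP3) would require an additional decoding library.

```python
import wave

def read_file_stream(path, chunk_frames=4000):
    """Yield raw PCM chunks from an audio file storing recorded audio (.WAV here)."""
    with wave.open(path, "rb") as wf:
        while True:
            frames = wf.readframes(chunk_frames)
            if not frames:  # end of the file stream
                break
            yield frames

# Each chunk would be handed to the decoding model of step S200.
for chunk in read_file_stream("recording.wav"):
    pass
```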
S200: the voice data is decoded through a decoding model to generate a speech recognition intermediate result.
Further, in a preferred embodiment provided by the present application, the speech decoding model is formed by an acoustic model, a dictionary, and a language model.
Through the acoustic model, a mapping between the acoustic features of the voice data and phonemes can be established; through the dictionary, a mapping between phonemes and words; and through the language model, a mapping between words and sentences. Using the mappings established by the acoustic model, the dictionary, and the language model, the computer completes the decoding of the voice data and generates the corresponding speech recognition intermediate result.
Specifically, the acoustic model is a knowledge representation of differences in acoustics, phonetics, environmental variables, speaker gender, accent, and the like; the language model is a knowledge representation of word sequences; the dictionary is the set of phoneme indices corresponding to words.
Further, in a preferred embodiment provided by the present application, the language model is a new language model generated by interpolation fitting of a foreground language model and a background language model on the basis of the preprocessed text corpus;
the foreground language model is the user language model, with a weight value preset to 0.5-0.8, and contains the user-specified scenario corpus; the background language model is the language model of the original speech recognition engine and contains its scenario corpora.
Specifically, interpolation fitting merges the two language models to improve the resulting model; when the foreground language weight is set to 0.6, the corpus distribution of the newly generated language model is optimal and the processing effect is best.
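As an illustration of the interpolation described above, the Python sketch below linearly combines two n-gram probability tables with the foreground weight of 0.6. The dict-based tables are a simplification introduced here; production toolkits operate on full ARPA-format language models.

```python
def interpolate_lm(foreground, background, fg_weight=0.6):
    """Linear interpolation of two n-gram language models.

    foreground / background: dicts mapping an n-gram tuple to its
    conditional probability P(word | history).
    fg_weight: preset foreground weight (0.5-0.8; 0.6 per the text above).
    """
    merged = {}
    for ngram in set(foreground) | set(background):
        p_fg = foreground.get(ngram, 0.0)
        p_bg = background.get(ngram, 0.0)
        merged[ngram] = fg_weight * p_fg + (1.0 - fg_weight) * p_bg
    return merged
```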
Specifically, text preprocessing removes punctuation marks, meaningless expressions, and stop words from the user's full text corpus, and converts digits into the corresponding written form of the corpus text through a number conversion module.
Further, in a preferred embodiment provided by the present application, smoothing and pruning operations are performed on the newly generated language model;
the pruning operation, guided by the foreground language model, deletes irrelevant scenario corpora from the background language model while retaining the branches of the foreground language model; the smoothing operation redistributes the conditional probabilities of all scenario corpora in the newly generated language model so that, after smoothing, they sum to 1.
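The pruning and smoothing operations might be sketched as follows. The pruning criterion used here (keeping only n-grams whose history occurs in the foreground model) is an assumption standing in for the "irrelevant scenario corpus deletion" above; the per-history renormalization realizes the stated requirement that conditional probabilities sum to 1.

```python
def prune_and_smooth(merged, foreground):
    """Prune background-only branches, then renormalize per history."""
    # Pruning: keep n-grams whose history also appears in the foreground
    # model, so foreground branches are retained (assumed criterion).
    fg_histories = {ngram[:-1] for ngram in foreground}
    kept = {ng: p for ng, p in merged.items() if ng[:-1] in fg_histories}

    # Smoothing: redistribute conditional probability within each history
    # so the probabilities of all words following it sum to 1.
    by_history = {}
    for ng in kept:
        by_history.setdefault(ng[:-1], []).append(ng)
    smoothed = {}
    for hist, ngrams in by_history.items():
        total = sum(kept[ng] for ng in ngrams) or 1.0
        for ng in ngrams:
            smoothed[ng] = kept[ng] / total
    return smoothed
```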
S300: the speech recognition intermediate result is matched against the core-word pinyin-and-tone sequences in the core word database.
Further, in a preferred embodiment provided by the present application, the core word database is constructed by performing word segmentation and word-frequency statistics on the preprocessed text corpus and generating a weight for each segmented word according to its frequency.
The weight of each word is calculated by dividing its word frequency by the sum of the maximum word frequency and a constant, where the constant is the median of all word frequencies.
Specifically, word segmentation uses the dictionary of the decoding model together with a reverse maximum matching algorithm, which yields the best segmentation effect. Word-frequency statistics then count the occurrences of each identical word in the segmentation results.
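The weight formula above can be written down directly; a minimal sketch under the stated definition follows, where the sample words and counts are hypothetical.

```python
from statistics import median

def core_word_weights(word_freqs):
    """word_freqs: {word: count} from segmenting the preprocessed corpus.

    weight(w) = freq(w) / (max_freq + constant),
    where constant is the median of all word frequencies.
    """
    max_freq = max(word_freqs.values())
    constant = median(word_freqs.values())
    return {w: f / (max_freq + constant) for w, f in word_freqs.items()}

weights = core_word_weights({"证据": 120, "质证": 45, "合议庭": 30})  # hypothetical counts
```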
Further, in a preferred embodiment provided by the present application, the core word database can be matched against the core-word information uploaded by the user, and a corresponding weight value is automatically recommended; the user may adjust the weight value according to actual needs to increase the accuracy of speech recognition;
if the user's core word is not found by retrieval, the median weight of all words currently in the core word database is used as the recommended value.
Specifically, the core words entered by the user are matched in the core word database. If a corresponding core word is matched in the database, its weight is recommended to the user as the recommended value. The user can raise or lower the recommended weight value according to the actual scenario, which improves the accuracy of speech recognition in that scenario.
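Continuing the sketch, weight recommendation with the median fallback described above might look like this:

```python
from statistics import median

def recommend_weight(core_word, weights):
    """Recommend the stored weight for a user core word; if the word is not
    found in the database, fall back to the median weight of all entries."""
    if core_word in weights:
        return weights[core_word]
    return median(weights.values())
```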
S400: a matching result is output according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result.
Further, in a preferred embodiment provided by the present application, when the matching result indicates that a sequence in the speech recognition intermediate result has a corresponding pinyin-and-tone sequence in the database, core-word replacement is performed on that sequence.
Specifically, if no corresponding pinyin-and-tone sequence in the database matches the speech recognition intermediate result sequence, the speech recognition intermediate result can be output directly as the speech recognition result.
Further, in a preferred embodiment provided by the present application, during core-word replacement, if the language-model perplexity of the sentence containing the replacement sequence is lower than that of the original sentence by at least a threshold, the core-word sequence replacement is committed and the speech recognition intermediate result containing the replacement sequence is output;
the required perplexity drop (the threshold) can be adjusted according to the actual environment.
Specifically, the smaller the language-model perplexity after core-word replacement, the better the replacement sequence fits within the sentence. The threshold defaults to 0.1 and can be tuned to control how strictly the replacement sequence must fit within the sentence.
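A sketch of the matching and replacement decision follows. The pypinyin library is used here as one possible way to obtain tone-numbered pinyin (an assumption, not named in the application), lm_perplexity stands for a hypothetical function scoring a sentence with the language model, and the default threshold of 0.1 follows the text above.

```python
from pypinyin import lazy_pinyin, Style

def tone_key(text):
    """Tone-numbered pinyin sequence, e.g. '北京' -> ('bei3', 'jing1')."""
    return tuple(lazy_pinyin(text, style=Style.TONE3))

def try_core_word_replace(sentence, span, core_words, lm_perplexity, threshold=0.1):
    """Replace `span` with a core word sharing its pinyin-and-tone sequence,
    committing only if the sentence perplexity drops by at least `threshold`."""
    key = tone_key(span)
    for word in core_words:
        if tone_key(word) == key:
            candidate = sentence.replace(span, word)
            if lm_perplexity(candidate) <= lm_perplexity(sentence) - threshold:
                return candidate  # replacement committed
    return sentence  # no match, or insufficient perplexity drop
```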
Further, in a preferred embodiment provided by the present application, before the step of outputting the sentence containing the replacement sequence as a speech recognition result, the method further comprises performing sentence-break and punctuation prediction on that sentence.
A speech recognition apparatus 100 comprising:
a voice receiving module 11, configured to receive voice data;
a voice decoding module 12, configured to decode the voice data and generate a speech recognition intermediate result;
a speech recognition intermediate result matching module 13, configured to match the speech recognition intermediate result against the core-word pinyin-and-tone sequences in the database;
and a speech recognition result output module 14, configured to output a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result.
In a typical configuration, a computer may include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a/an …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A speech recognition method, comprising:
acquiring input voice data;
decoding the voice data through a decoding model to generate a voice recognition intermediate result;
matching the speech recognition intermediate result against core-word pinyin-and-tone sequences in a core word database;
and outputting a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result.
2. The speech recognition method of claim 1, wherein the decoding model is collectively formed of an acoustic model, a dictionary, and a language model.
3. The speech recognition method of claim 2, wherein the language model is a new language model generated by interpolation fitting of a foreground language model and a background language model on the basis of a preprocessed text corpus;
the foreground language model is the user language model, with a weight value preset to 0.5-0.8, and contains the user-specified scenario corpus; the background language model is the language model of the original speech recognition engine and contains its scenario corpora.
4. The speech recognition method of claim 3, wherein smoothing and pruning operations are performed on the newly generated language model;
the pruning operation, guided by the foreground language model, deletes irrelevant scenario corpora from the background language model while retaining the branches of the foreground language model; the smoothing operation redistributes the conditional probabilities of all scenario corpora in the newly generated language model so that, after smoothing, they sum to 1.
5. The speech recognition method of claim 1, wherein the core word database is built by performing word segmentation and word-frequency statistics on a preprocessed text corpus and generating a weight for each segmented word according to its frequency;
the weight of each word is calculated by dividing its word frequency by the sum of the maximum word frequency and a constant, where the constant is the median of all word frequencies.
6. The speech recognition method of claim 5, wherein the core word database is adapted to match the core-word information uploaded by the user and automatically recommend a corresponding weight value, which the user can adjust according to actual needs to increase the accuracy of speech recognition;
and if the user's core word is not found by retrieval, the median weight of all words currently in the core word database is used as the recommended value.
7. The speech recognition method of claim 1, wherein, when the matching result indicates that the speech recognition intermediate result has a corresponding pinyin-and-tone sequence in the database, core-word replacement is performed on that sequence.
8. The speech recognition method of claim 7, wherein, during core-word replacement, if the language-model perplexity of the sentence containing the replacement sequence is lower than that of the original sentence by at least a threshold, the core-word sequence replacement is committed and the speech recognition intermediate result containing the replacement sequence is output;
wherein the required perplexity drop (the threshold) can be adjusted according to the actual environment.
9. The speech recognition method of claim 8, further comprising, before the step of outputting the sentence containing the replacement sequence as a speech recognition result, performing sentence-break and punctuation prediction on the sentence containing the replacement sequence.
10. A speech recognition apparatus, comprising:
the voice receiving module, configured to receive voice data;
the voice decoding module, configured to decode the voice data and generate a speech recognition intermediate result;
the speech recognition intermediate result matching module, configured to match the speech recognition intermediate result against the core-word pinyin-and-tone sequences in the database;
and the speech recognition result output module, configured to output a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result.
CN202011295150.1A 2020-11-18 2020-11-18 Speech recognition method and device thereof Active CN112489646B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011295150.1A CN112489646B (en) 2020-11-18 2020-11-18 Speech recognition method and device thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011295150.1A CN112489646B (en) 2020-11-18 2020-11-18 Speech recognition method and device thereof

Publications (2)

Publication Number Publication Date
CN112489646A true CN112489646A (en) 2021-03-12
CN112489646B CN112489646B (en) 2024-04-02

Family

ID=74931400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011295150.1A Active CN112489646B (en) 2020-11-18 2020-11-18 Speech recognition method and device thereof

Country Status (1)

Country Link
CN (1) CN112489646B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023273578A1 (en) * 2021-06-30 2023-01-05 北京有竹居网络技术有限公司 Speech recognition method and apparatus, and medium and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063900A (en) * 2010-11-26 2011-05-18 北京交通大学 Speech recognition method and system for overcoming confusing pronunciation
US20110191355A1 (en) * 2007-04-24 2011-08-04 Peking University Method for monitoring abnormal state of internet information
KR20160078703A (en) * 2014-12-24 2016-07-05 한국전자통신연구원 Method and Apparatus for converting text to scene
US20170125013A1 (en) * 2015-10-29 2017-05-04 Le Holdings (Beijing) Co., Ltd. Language model training method and device
CN109635081A (en) * 2018-11-23 2019-04-16 上海大学 A kind of text key word weighing computation method based on word frequency power-law distribution characteristic
CN110675855A (en) * 2019-10-09 2020-01-10 出门问问信息科技有限公司 Voice recognition method, electronic equipment and computer readable storage medium
CN110970026A (en) * 2019-12-17 2020-04-07 用友网络科技股份有限公司 Voice interaction matching method, computer device and computer-readable storage medium

Also Published As

Publication number Publication date
CN112489646B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
US11443733B2 (en) Contextual text-to-speech processing
JP5768093B2 (en) Speech processing system
US20140114663A1 (en) Guided speaker adaptive speech synthesis system and method and computer program product
US11727922B2 (en) Systems and methods for deriving expression of intent from recorded speech
CN106875936B (en) Voice recognition method and device
CN110599998B (en) Voice data generation method and device
Fendji et al. Automatic speech recognition using limited vocabulary: A survey
US11170763B2 (en) Voice interaction system, its processing method, and program therefor
JP6622681B2 (en) Phoneme Breakdown Detection Model Learning Device, Phoneme Breakdown Interval Detection Device, Phoneme Breakdown Detection Model Learning Method, Phoneme Breakdown Interval Detection Method, Program
US20230298564A1 (en) Speech synthesis method and apparatus, device, and storage medium
CN112489688A (en) Neural network-based emotion recognition method, device and medium
EP4266306A1 (en) A speech processing system and a method of processing a speech signal
CN112885335B (en) Speech recognition method and related device
CN112489646B (en) Speech recognition method and device thereof
US20190088258A1 (en) Voice recognition device, voice recognition method, and computer program product
Rashmi et al. Hidden Markov Model for speech recognition system—a pilot study and a naive approach for speech-to-text model
Norouzian et al. An approach for efficient open vocabulary spoken term detection
CN112837688B (en) Voice transcription method, device, related system and equipment
US20150269927A1 (en) Text-to-speech device, text-to-speech method, and computer program product
Kirkedal Danish stød and automatic speech recognition
Seki et al. Diversity-based core-set selection for text-to-speech with linguistic and acoustic features
JP2020129015A (en) Voice recognizer, voice recognition method and program
CN112786004B (en) Speech synthesis method, electronic equipment and storage device
JP2020046551A (en) Learning device and program for learning statistical model used for voice synthesis
JP6220733B2 (en) Voice classification device, voice classification method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant