CN112489646A - Speech recognition method and device - Google Patents
Speech recognition method and device
- Publication number
- CN112489646A (application CN202011295150.1A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications (all within G10L — speech analysis, recognition, and processing techniques)
- G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/063 — Training (creation of reference templates; adaptation to speaker characteristics)
- G10L15/08 — Speech classification or search
- G10L25/12 — Speech or voice analysis techniques in which the extracted parameters are prediction coefficients
- G10L2015/088 — Word spotting
Abstract
The application discloses a speech recognition method and a corresponding apparatus. The method comprises the following steps: acquiring input voice data; decoding the voice data through a decoding model to generate an intermediate speech recognition result; matching the intermediate result against the core-word pinyin and tone sequences in a core word database; and outputting a matching result according to the match state between the pinyin and tone sequences and the intermediate result. Matching the intermediate speech recognition result against the core-word pinyin and tone sequences in the database mitigates the problem of the recognition result deviating from the normal context.
Description
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a speech recognition method and apparatus.
Background
The decoding stage of speech recognition depends heavily on the application scenario, and users expect a speech recognition model to decode with a certain bias toward their own scenario corpus. Current systems address this with user hotwords: a hotword is defined manually when uploaded and assigned a weight value. If the weight value is set too large, the recognition result deviates severely from the normal context; in addition, the number of hotwords that can be uploaded is limited, and users find it difficult to select suitable hotwords.
Disclosure of Invention
The embodiments of the application provide a speech recognition method that addresses the prior-art problem of recognition results deviating from the normal context. The method comprises the following steps:
acquiring input voice data;
decoding the voice data through a decoding model to generate an intermediate speech recognition result;
matching the intermediate speech recognition result against the core-word pinyin and tone sequences in a core word database;
and outputting a matching result according to the match state between the pinyin and tone sequences and the intermediate speech recognition result.
Further, in a preferred embodiment provided by the present application, the decoding model is formed by an acoustic model, a dictionary, and a language model.
Further, in a preferred embodiment provided by the present application, the language model is a new language model generated by interpolation fitting of a foreground language model and a background language model based on a text-preprocessed corpus;
the foreground language model is the user language model, with a weight value preset to 0.5-0.8, and contains the user-specified scenario corpora; the background language model is the language model of the original speech recognition engine and contains scenario corpora.
Further, in a preferred embodiment provided by the present application, smoothing and pruning operations are performed on the newly generated language model;
the pruning operation, guided by the foreground language model, deletes irrelevant scenario corpora from the background language model while retaining the branches of the foreground language; the smoothing operation, applied to the newly generated language model, redistributes the conditional probabilities of all scenario corpora in the model so that they sum to 1 after smoothing.
Further, in a preferred embodiment provided by the present application, the core word database is built by performing word segmentation and word-frequency statistics on the text-preprocessed corpus and generating a segmentation weight for each word according to its frequency;
the segmentation weight of each word is its word frequency divided by the sum of the maximum word frequency and a constant, where the constant is the median of all word frequencies.
Further, in a preferred embodiment provided by the present application, the core word database can be matched against the core word information uploaded by the user and automatically recommend a corresponding weight value, which the user may adjust according to actual needs to increase the accuracy of speech recognition;
if the user's core word is not found by retrieval, the median weight of all words in the current core word database is used as the recommended value.
Further, in a preferred embodiment provided by the present application, when the matching result is that the intermediate speech recognition result has corresponding pinyin and tone sequences in the database, core word replacement is performed on those sequences.
Further, in a preferred embodiment provided by the present application, during core word replacement, if the language model perplexity of the sentence containing the replacement sequence is lower than that of the original sentence by at least a threshold, the core word sequence replacement is completed and an intermediate speech recognition result containing the replacement sequence is output;
the threshold for this perplexity reduction can be adjusted according to the actual environment.
Further, in a preferred embodiment provided by the present application, before the sentence containing the replacement sequence is output as the speech recognition result, the method further includes performing sentence-break and punctuation prediction on that sentence.
An embodiment of the present application provides a speech recognition apparatus, including:
the voice receiving module is used for receiving voice data;
the voice decoding module is used for decoding the voice data and generating a voice recognition intermediate result;
the voice recognition intermediate result matching module is used for matching the voice recognition intermediate result with the core word pinyin and tone sequences in the database;
and the voice recognition result output module is used for outputting a matching result according to the match state between the pinyin and tone sequences and the intermediate speech recognition result.
The embodiment provided by the application has at least the following beneficial effects:
the problem that the speech recognition result deviates from the normal context can be solved by the speech recognition method and the speech recognition device.
Drawings
Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application.
100 speech recognition device
11 voice receiving module
12 voice decoding module
13 voice recognition intermediate result matching module
14 speech recognition result output module
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, the present application discloses a speech recognition method, including:
s100: input voice data is acquired.
The voice data is either voice stream data input in real time or file stream data read from an audio file.
Voice stream data can be acquired through hardware with a real-time recording function, such as a microphone or a sound card, recording and generating speech in real time. File stream data can be obtained by reading an audio file that stores recorded audio data; common audio file extensions include .WAV, .AIF, .AIFF, .AU, .MP1, .MP2, .MP3, .RA, .RM, and .RAM.
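As a rough illustration of the file-stream input path, the sketch below reads PCM frames from a WAV file in fixed-size chunks using Python's standard `wave` module; the helper name and chunk size are illustrative choices, not part of the disclosure.

```python
import wave

def read_audio_chunks(path, chunk_frames=1600):
    """Yield successive chunks of PCM frames from a recorded audio file.

    Illustrative sketch of the 'file stream' path described above; the
    chunk size (1600 frames = 100 ms at 16 kHz) is an assumption.
    """
    with wave.open(path, "rb") as wf:
        while True:
            frames = wf.readframes(chunk_frames)
            if not frames:
                break
            yield frames
```

A real-time microphone path would feed the same chunk interface from an audio capture device instead of a file.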
S200: and decoding the voice data through a decoding model to generate a voice recognition intermediate result.
Further, in a preferred embodiment provided by the present application, the speech decoding model is formed by an acoustic model, a dictionary, and a language model.
Mapping between voice features and phonemes in the voice data can be established through an acoustic model; mapping between phonemes and words can be established through a dictionary; the mapping of words and sentences can be established through the language model. The computer can complete the decoding operation of the voice data according to the mapping established by the acoustic model, the dictionary and the language model, thereby generating a corresponding voice recognition intermediate result.
Specifically, the acoustic model is a knowledge representation of acoustic differences arising from phonetics, environmental variables, speaker gender, accent, and so on; the language model is a knowledge representation of word sequences; the dictionary is the mapping from words to their corresponding phoneme sequences.
Further, in a preferred embodiment provided by the present application, the language model is a new language model generated by interpolation fitting of a foreground language model and a background language model based on a text-preprocessed corpus;
the foreground language model is the user language model, with a weight value preset to 0.5-0.8, and contains the user-specified scenario corpora; the background language model is the language model of the original speech recognition engine and contains scenario corpora.
Specifically, interpolation fitting merges the language models to improve their effect; when the foreground language weight is set to 0.6, the corpus distribution of the newly generated language model is optimal and the processing effect is best.
Specifically, text preprocessing removes punctuation marks, meaningless expressions, and stop words from the user's full text corpus, and converts digits into the corresponding textual form through a number conversion module.
Further, in a preferred embodiment provided by the present application, smoothing and pruning operations are performed on the newly generated language model;
the pruning operation, guided by the foreground language model, deletes irrelevant scenario corpora from the background language model while retaining the branches of the foreground language; the smoothing operation, applied to the newly generated language model, redistributes the conditional probabilities of all scenario corpora in the model so that they sum to 1 after smoothing.
S300: and matching the voice recognition intermediate result based on the core word pinyin and the tone sequence in the core word database.
Further, in a preferred embodiment provided by the present application, the core word database is built by performing word segmentation and word-frequency statistics on the text-preprocessed corpus and generating a segmentation weight for each word according to its frequency.
The segmentation weight of each word is its word frequency divided by the sum of the maximum word frequency and a constant, where the constant is the median of all word frequencies.
Specifically, word segmentation uses the dictionary of the decoding model together with a backward maximum matching algorithm, which gives the best segmentation results. Word-frequency statistics then count the number of occurrences of each word in the segmentation results.
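The weight formula above — word frequency divided by the sum of the maximum frequency and the median frequency — can be written directly; the helper name is illustrative.

```python
from statistics import median

def segmentation_weights(word_freq):
    """Compute per-word segmentation weights as described in the text:
    weight(w) = freq(w) / (max_freq + constant),
    where the constant is the median of all word frequencies.
    """
    freqs = list(word_freq.values())
    denom = max(freqs) + median(freqs)
    return {w: f / denom for w, f in word_freq.items()}
```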
Further, in a preferred embodiment provided by the present application, the core word database can be matched against the core word information uploaded by the user and automatically recommend a corresponding weight value, which the user may adjust according to actual needs to increase the accuracy of speech recognition;
if the user's core word is not found by retrieval, the median weight of all words in the current core word database is used as the recommended value.
Specifically, the core words input by the user are matched in the core word database. If a corresponding core word is matched in the database, its weight is recommended to the user as the recommended value. The user can then increase or decrease this weight value according to the actual scenario, improving the accuracy of speech recognition in that scenario.
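A minimal sketch of this recommendation logic, assuming the database is a simple word-to-weight mapping: return the stored weight when the core word is found, otherwise the median weight of all words in the database.

```python
from statistics import median

def recommend_weight(core_word, db_weights):
    """Recommend a weight for a user-uploaded core word.

    db_weights maps each word in the core word database to its weight;
    the dict representation is an assumption for illustration.
    """
    if core_word in db_weights:
        return db_weights[core_word]
    # Fallback: median weight of all words currently in the database.
    return median(db_weights.values())
```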
S400: and outputting a matching result according to the matching state of the pinyin and tone sequence and the intermediate result in the voice recognition.
Further, in a preferred embodiment provided by the present application, when the matching result is that the intermediate speech recognition result sequence has a corresponding pinyin and tone sequence in the database, core word replacement is performed on that sequence.
Specifically, if the intermediate result sequence matches no corresponding pinyin and tone sequence in the database, the intermediate speech recognition result can be output directly as the speech recognition result.
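The matching of an intermediate result against core-word pinyin and tone sequences might be sketched as a substring scan over toned syllables. The data representation here (lists of tone-numbered syllables such as 'yu3') is an assumption, and conversion of the recognized text to pinyin is taken to happen upstream.

```python
def match_core_words(result_pinyin, core_db):
    """Scan the intermediate result's pinyin-and-tone sequence for
    core-word matches.

    result_pinyin: list of toned syllables, e.g. ["yu3", "yin1", ...].
    core_db: maps core words to their toned-syllable tuples.
    Returns (start, end, word) spans; a toy sketch, not the disclosed
    matching procedure.
    """
    matches = []
    for word, seq in core_db.items():
        n = len(seq)
        for i in range(len(result_pinyin) - n + 1):
            if tuple(result_pinyin[i:i + n]) == tuple(seq):
                matches.append((i, i + n, word))
    return matches
```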
Further, in a preferred embodiment provided by the present application, during core word replacement, if the language model perplexity of the sentence containing the replacement sequence is lower than that of the original sentence by at least a threshold, the core word sequence replacement is completed and an intermediate speech recognition result containing the replacement sequence is output;
the threshold for this perplexity reduction can be adjusted according to the actual environment.
Specifically, the smaller the language model perplexity value, the better the replacement sequence fits the sentence after core word replacement. The reduction threshold defaults to 0.1, and its setting can be lowered if the fit of the replacement sequence in the sentence is to be improved.
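One reading of this replacement criterion — accept the replacement only when perplexity drops by at least the threshold — can be expressed as a one-line check. Whether the required drop is absolute or relative is not specified, so the absolute form below is an assumption, as is the function name.

```python
def accept_replacement(ppl_original, ppl_replaced, threshold=0.1):
    """Accept a core-word replacement only if the perplexity of the
    sentence containing the replacement sequence is lower than the
    original sentence's by at least `threshold` (default 0.1 per the
    text; adjustable to the actual environment)."""
    return ppl_original - ppl_replaced >= threshold
```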
Further, in a preferred embodiment provided by the present application, before the sentence containing the replacement sequence is output as the speech recognition result, the method further includes performing sentence-break and punctuation prediction on that sentence.
A speech recognition apparatus 100 comprising:
a voice receiving module 11, configured to receive voice data;
a voice decoding module 12, configured to decode the voice data and generate a voice recognition intermediate result;
the voice recognition intermediate result matching module 13 is used for matching the voice recognition intermediate result with the core word pinyin and tone sequences in the database;
and the voice recognition result output module 14 is used for outputting a matching result according to the match state between the pinyin and tone sequences and the intermediate speech recognition result.
In a typical configuration, a computer may include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.
Claims (10)
1. A speech recognition method, comprising:
acquiring input voice data;
decoding the voice data through a decoding model to generate an intermediate speech recognition result;
matching the intermediate speech recognition result against the core-word pinyin and tone sequences in a core word database;
and outputting a matching result according to the match state between the pinyin and tone sequences and the intermediate speech recognition result.
2. The speech recognition method of claim 1, wherein the decoding model is collectively formed of an acoustic model, a dictionary, and a language model.
3. The speech recognition method of claim 2, wherein the language model is a new language model generated by interpolation fitting of a foreground language model and a background language model based on a text-preprocessed corpus;
the foreground language model is the user language model, with a weight value preset to 0.5-0.8, and contains the user-specified scenario corpora; the background language model is the language model of the original speech recognition engine and contains scenario corpora.
4. The speech recognition method of claim 3, wherein smoothing and pruning operations are performed on the newly generated language model;
the pruning operation, guided by the foreground language model, deletes irrelevant scenario corpora from the background language model while retaining the branches of the foreground language; the smoothing operation, applied to the newly generated language model, redistributes the conditional probabilities of all scenario corpora in the model so that they sum to 1 after smoothing.
5. The speech recognition method of claim 1, wherein the core word database is built by performing word segmentation and word-frequency statistics on a text-preprocessed corpus and generating a segmentation weight for each word according to its frequency;
the segmentation weight of each word is its word frequency divided by the sum of the maximum word frequency and a constant, where the constant is the median of all word frequencies.
6. The speech recognition method of claim 5, wherein the core word database is adapted to match the core word information uploaded by the user and automatically recommend a corresponding weight value, which the user can adjust according to actual needs to increase the accuracy of speech recognition;
if the user's core word is not found by retrieval, the median weight of all words in the current core word database is used as the recommended value.
7. The speech recognition method of claim 1, wherein when the matching result is that the intermediate speech recognition result has corresponding pinyin and tone sequences in the database, core word replacement is performed on those sequences.
8. The speech recognition method of claim 7, wherein during core word replacement, if the language model perplexity of the sentence containing the replacement sequence is lower than that of the original sentence by at least a threshold, the core word sequence replacement is completed and an intermediate speech recognition result containing the replacement sequence is output;
the threshold for this perplexity reduction can be adjusted according to the actual environment.
9. The speech recognition method of claim 8, further comprising performing sentence-break and punctuation prediction on the sentence containing the replacement sequence before outputting that sentence as the speech recognition result.
10. A speech recognition apparatus, comprising:
the voice receiving module is used for receiving voice data;
the voice decoding module is used for decoding the voice data and generating a voice recognition intermediate result;
the voice recognition intermediate result matching module is used for matching the voice recognition intermediate result with the core word pinyin and tone sequences in the database;
and the voice recognition result output module is used for outputting a matching result according to the match state between the pinyin and tone sequences and the intermediate speech recognition result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011295150.1A CN112489646B (en) | 2020-11-18 | 2020-11-18 | Speech recognition method and device thereof |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112489646A true CN112489646A (en) | 2021-03-12 |
CN112489646B CN112489646B (en) | 2024-04-02 |