CN113327587A - Method and device for voice recognition in specific scene, electronic equipment and storage medium - Google Patents
- Publication number: CN113327587A
- Application number: CN202110616948.XA
- Authority
- CN
- China
- Prior art keywords
- domain
- audio data
- speech
- database
- decoding network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L15/26—Speech to text systems
Abstract
The invention relates to a method for speech recognition in a specific scenario, an electronic device, and a storage medium. The method comprises the following steps: acquiring audio data to be recognized; extracting features of the audio data; and inputting the features of the audio data into a first decoding network to obtain a recognized text. The first decoding network is determined by training an acoustic model on the features of a domain speech database to obtain a domain acoustic model; the domain acoustic model, a dictionary, and a language model together form the first decoding network. After feature extraction, the audio data to be recognized are input into the first decoding network, where the corresponding domain acoustic model is found. Because the domain acoustic model has learned the acoustic signal characteristics of the specific scenario, it matches the audio data to be recognized more closely, so the recognized text is more accurate. This improves speech recognition accuracy in specific application domains while saving time and resource investment.
Description
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a method for speech recognition in a specific scenario, an electronic device, and a storage medium.
Background
A conventional ASR system includes a training phase and a decoding phase. In the training phase, an Acoustic Model (AM) is trained on a speech database using techniques such as deep neural networks, and a Language Model (LM) is trained on a text database using techniques such as n-grams and deep neural networks. In the decoding phase, the acoustic model, the language model, and the pronunciation dictionary obtained during training form a decoding network. After feature extraction, a decoding algorithm searches the decoding network for the optimal path through the input audio, yielding the final recognition result.
In general-purpose application scenarios, the massive datasets used to train the acoustic model are collected from a variety of everyday scenes or drawn from open-source corpora, so the test data seen in the decoding stage are, in acoustic terms, largely matched to the training data, and recognition works very well.
In a domain-specific application scenario, however, the test data in the decoding stage are usually mismatched with the training data in acoustic terms, and this acoustic mismatch can cause a drastic degradation in performance.
Disclosure of Invention
The invention provides a method, an apparatus, an electronic device, and a storage medium for speech recognition in a specific scenario, which address the technical problem of sharply degraded speech recognition performance.
The technical scheme for solving the technical problems is as follows:
in a first aspect, an embodiment of the present invention provides a method for speech recognition in a specific scenario, including:
acquiring audio data to be recognized;
extracting features of the audio data;
inputting the features of the audio data into a first decoding network to obtain a recognized text;
wherein the first decoding network is determined by:
training an acoustic model on the features of a domain speech database to obtain a domain acoustic model;
the domain acoustic model, a dictionary, and a language model together form the first decoding network.
In some embodiments, the features of the domain speech database in the above method are determined by:
acquiring a domain speech database; and
extracting the features of the domain speech database.
In some embodiments, the domain speech database in the above method is a labeled domain speech database.
In some embodiments, extracting the features of the domain speech database in the above method includes at least: pre-emphasis, framing, windowing, and discrete Fourier transform.
In some embodiments, the domain speech databases in the above method are domain speech databases corresponding to different domains.
In a second aspect, an embodiment of the present invention further provides an apparatus for speech recognition in a specific scenario, including:
an acquisition module, for acquiring audio data to be recognized;
an extraction module, for extracting features of the audio data;
an input module, for inputting the features of the audio data into the first decoding network to obtain a recognized text;
wherein the first decoding network is determined by:
training an acoustic model on the features of a domain speech database to obtain a domain acoustic model;
the domain acoustic model, a dictionary, and a language model together form the first decoding network.
In some embodiments, the features of the domain speech database in the apparatus are determined by:
acquiring a domain speech database; and
extracting the features of the domain speech database.
In some embodiments, the domain speech database in the apparatus is a labeled domain speech database.
In some embodiments, extracting the features of the domain speech database in the apparatus includes at least: pre-emphasis, framing, windowing, and discrete Fourier transform.
In some embodiments, the domain speech databases in the apparatus are domain speech databases corresponding to different domains.
In a third aspect, an embodiment of the present invention further provides an electronic device, including: a processor and a memory;
the processor is configured to execute any of the above methods for speech recognition in a specific scenario by calling a program or instructions stored in the memory.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a program or instructions that cause a computer to execute any of the above methods for speech recognition in a specific scenario.
The beneficial effects of the invention are as follows: audio data to be recognized are acquired; features of the audio data are extracted; and the features are input into a first decoding network to obtain a recognized text, where the first decoding network is determined by training an acoustic model on the features of a domain speech database to obtain a domain acoustic model, with the domain acoustic model, a dictionary, and a language model together forming the first decoding network. After feature extraction, the audio data to be recognized are input into the first decoding network, where the corresponding domain acoustic model is found. Because the domain acoustic model has learned the acoustic signal characteristics of the specific scenario, it matches the audio data more closely, the recognized text is more accurate, speech recognition accuracy in the specific application domain is improved, and time and resource investment are saved.
Drawings
FIG. 1 is a diagram illustrating a method for speech recognition in a specific scenario according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a second method for speech recognition in a specific scenario according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a third method for speech recognition in a specific scenario according to an embodiment of the present invention;
FIG. 4 is a diagram of an apparatus for speech recognition in a specific scenario according to an embodiment of the present invention;
fig. 5 is a schematic block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
In order that the above objects, features, and advantages of the present application may be more clearly understood, the present disclosure is described in further detail below with reference to the accompanying drawings and embodiments. The embodiments described are only some, not all, of the embodiments of the present disclosure; they are merely illustrative and not limiting. All other embodiments that a person of ordinary skill in the art can derive from the described embodiments fall within the scope of the present disclosure.
It is noted that, in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions.
The accuracy of automatic speech recognition (ASR) systems is greatly affected by environmental factors. Different application scenarios differ enormously in their environments; for example, a home scene and an in-vehicle scene produce two completely different environmental signals. Acoustic signals can also vary widely within a single application scenario: in an in-vehicle scene, for instance, one must consider the vehicle being stationary, driving at low speed, driving at high speed, the air conditioner being turned on or off, the doors being opened or closed, and so on.
In certain environments, the performance of a speech recognition system is often unsatisfactory, largely because of data mismatch. Deep customization for a specific domain traditionally requires a large investment: collecting large-scale, high-quality training data so that an acoustic model for that specific environment can be trained and a good recognition effect achieved. In view of this, the present application provides a method, an apparatus, an electronic device, and a storage medium for speech recognition in a specific scenario that address the above technical problem of poor speech recognition performance.
Fig. 1 is a diagram of a method for speech recognition in a specific scenario according to an embodiment of the present invention.
In a first aspect, with reference to fig. 1, an embodiment of the present invention provides a method for speech recognition in a specific scenario, including three steps S101, S102, and S103.
S101: acquire the audio data to be recognized.
Specifically, the audio data to be recognized may be a command sentence spoken by the user, such as "turn on the air conditioner" or "set the temperature to 25 degrees Celsius" in a voice-controlled air conditioner scenario, or "open the meeting PPT" in a conference scenario.
S102: features of the audio data are extracted.
Specifically, extracting the features of the audio data converts the audio into feature vectors that a computer can process. The most common features are MFCCs (Mel-frequency cepstral coefficients) and filter-bank (Fbank) features.
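As a hedged sketch of the two feature types just mentioned, the code below computes log-Mel filter-bank (Fbank) features from a waveform with NumPy, then derives MFCCs by applying a DCT-II to them. The frame sizes, filter count, and cepstral-coefficient count are illustrative choices, not values prescribed by the patent.

```python
import numpy as np

def fbank(signal, sr=16000, n_fft=512, n_mels=8, frame_len=400, hop=160):
    """Log-Mel filter-bank (Fbank) features of a waveform (illustrative)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)           # frame + window
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2        # power spectrum
    mel = lambda f: 2595 * np.log10(1 + f / 700)           # Hz -> Mel
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)          # Mel -> Hz
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))                # triangular filters
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return np.log(power @ fb.T + 1e-10)

def mfcc(log_mel, n_ceps=4):
    """MFCCs as the (unnormalized) DCT-II of the log-Mel features."""
    n = log_mel.shape[1]
    basis = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :]
                   * np.arange(n_ceps)[:, None])
    return log_mel @ basis.T

wave = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz tone
feats = fbank(wave)
ceps = mfcc(feats)
print(feats.shape, ceps.shape)   # (98, 8) (98, 4)
```

Production systems would use a tuned library implementation; this sketch only shows how a waveform becomes a per-frame feature-vector sequence suitable for the decoding network.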
S103: features of the audio data are input into the first decoding network to obtain the recognized text.
Specifically, after feature extraction, the audio data to be recognized are input into the first decoding network, where the corresponding domain acoustic model is found. Because the domain acoustic model has learned the acoustic signal characteristics of the specific scenario, it matches the audio data more closely, the recognized text is more accurate, speech recognition accuracy in the specific application domain is improved, and time and resource investment are saved.
Fig. 2 is a diagram of a second method for speech recognition in a specific scenario according to an embodiment of the present invention.
With reference to fig. 2, the first decoding network in step S103 is determined by the following steps S201 and S202:
S201: train an acoustic model on the features of the domain speech database to obtain a domain acoustic model.
S202: the domain acoustic model, the dictionary, and the language model form the first decoding network.
It should be understood that the domain acoustic model in the present application is obtained by taking an existing general-purpose model and learning, from limited domain speech data, the acoustic signal characteristics of a specific scenario. This "domain acoustic model" replaces the "general acoustic model" of the conventional framework and, together with the dictionary and the language model, forms the first decoding network. Because the domain acoustic model matches the audio data to be recognized in the specific scenario, recognition accuracy in that scenario is improved.
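The adaptation idea in step S201 can be sketched schematically. Below, a tiny softmax classifier over acoustic states stands in for the "general acoustic model"; it is further trained on a small labeled domain set, and the domain loss is checked to drop. The model, data, and hyperparameters are all hypothetical toys; the patent does not specify a particular training algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

n_feat, n_states = 8, 4
W = rng.normal(0, 0.1, (n_feat, n_states))      # "general model" weights

# Small labeled domain dataset: feature vectors + frame-level state labels
X = rng.normal(size=(64, n_feat))
y = rng.integers(0, n_states, size=64)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss(W):
    """Mean cross-entropy of the state classifier on the domain data."""
    p = softmax(X @ W)
    return -np.log(p[np.arange(len(y)), y] + 1e-12).mean()

before = loss(W)
for _ in range(200):                             # fine-tune on domain data
    p = softmax(X @ W)
    p[np.arange(len(y)), y] -= 1                 # gradient of cross-entropy
    W -= 0.1 * (X.T @ p) / len(y)
after = loss(W)
print(after < before)                            # domain loss decreases
```

The design point mirrors the patent's argument: only a small domain dataset is needed to adapt an existing model, rather than retraining from scratch on massive general data.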
Fig. 3 is a diagram of a third method for speech recognition in a specific scenario according to an embodiment of the present invention.
In some embodiments, in conjunction with fig. 3, the features of the domain speech database in the above method are determined by the following two steps, S301 and S302.
S301: acquire a domain speech database.
Specifically, the domain speech database in the embodiments of the present application differs from a general-purpose speech database in that its scale is much smaller: depending on task difficulty, only a few hundred to a few tens of thousands of sentences are needed. For applications with simpler texts, such as command-style scenarios (e.g., "turn on the sound" or "play a song"), a few hundred sentences can already greatly improve recognition performance; for scenarios with free-form speech and special acoustic characteristics, tens of thousands of sentences may be required. In a conference scenario, for example, the speech content is unconstrained and the relative positions of speakers and microphones vary, so more training data are needed to learn the acoustic features of the scene. In general, the larger the dataset, the greater the improvement in speech recognition performance in the specific scenario.
S302: extract the features of the domain speech database.
Specifically, extracting the features of the domain speech database converts the audio in the database into feature vectors that a computer can process, such as MFCCs (Mel-frequency cepstral coefficients) and filter-bank (Fbank) features.
In some embodiments, the domain speech database in the above method is a labeled domain speech database.
Specifically, the domain speech database in the embodiments of the present application carries text labels: for example, a smart-appliance-control speech database carries a smart-control label, a home-scene speech database carries a home-scene label, an in-vehicle-scene speech database carries an in-vehicle-scene label, and so on.
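The labeled domain speech database described above can be pictured as a simple collection of utterance records, each pairing audio with a transcript and a domain tag. The file names, transcripts, and tags below are hypothetical examples only, not data from the patent.

```python
# Illustrative structure for a labeled domain speech database.
domain_db = [
    {"audio": "utt_0001.wav", "text": "turn on the air conditioner", "domain": "smart-home"},
    {"audio": "utt_0002.wav", "text": "set the temperature to 25 degrees", "domain": "smart-home"},
    {"audio": "utt_0003.wav", "text": "open the meeting PPT", "domain": "conference"},
]

# Selecting the sub-database for one domain:
smart_home = [u for u in domain_db if u["domain"] == "smart-home"]
print(len(smart_home))  # -> 2
```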
In some embodiments, extracting the features of the domain speech database in the above method includes at least: pre-emphasis, framing, windowing, and discrete Fourier transform.
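The four feature-extraction steps just listed can be sketched in order with NumPy. The frame length, hop, pre-emphasis coefficient, and FFT size are common illustrative choices, not values fixed by the patent.

```python
import numpy as np

sr = 16000
x = np.sin(2 * np.pi * 300 * np.arange(sr) / sr)        # 1 s toy signal

# 1. Pre-emphasis: boost high frequencies, y[n] = x[n] - 0.97 * x[n-1]
y = np.append(x[0], x[1:] - 0.97 * x[:-1])

# 2. Framing: 25 ms frames with a 10 ms hop
frame_len, hop = int(0.025 * sr), int(0.010 * sr)
n_frames = 1 + (len(y) - frame_len) // hop
frames = np.stack([y[i * hop: i * hop + frame_len] for i in range(n_frames)])

# 3. Windowing: taper each frame to reduce spectral leakage
frames = frames * np.hamming(frame_len)

# 4. Discrete Fourier transform: per-frame magnitude spectrum
spectrum = np.abs(np.fft.rfft(frames, n=512))
print(spectrum.shape)   # -> (98, 257)
```

Each row of `spectrum` is one frame's spectral slice; a Mel filter bank applied on top of these would yield the Fbank/MFCC features discussed earlier.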
In some embodiments, the domain speech databases in the above method are domain speech databases corresponding to different domains.
It should be understood that the smart-appliance-control domain corresponds to a smart-appliance-control speech database, the conference-scene domain corresponds to a conference-scene domain database, the in-vehicle scene corresponds to an in-vehicle-scene domain database, and so on; these examples are not exhaustive and do not limit the scope of the embodiments of the present application. It should also be appreciated that the finer the segmentation into distinct domain speech databases, the more accurate the resulting domain acoustic model and, in turn, the speech recognition.
Fig. 4 is a diagram of a device for speech recognition in a specific scenario according to an embodiment of the present invention.
In a second aspect, with reference to fig. 4, an embodiment of the present invention further provides an apparatus for speech recognition in a specific scenario, including:
the acquisition module 401: for obtaining audio data to be identified.
Specifically, the audio data to be recognized, which is acquired by the acquisition module 401 in the present application, may be a command statement spoken by a user, such as "turn on the air conditioner", "adjust the temperature to 25 degrees celsius" spoken by the user in a speech control air conditioner scene, or "turn on the conference PPT" spoken by the user in a conference scene.
The extraction module 402: for extracting features of the audio data.
Specifically, the extraction module 402 in the present application extracts features of audio data, and converts the audio data into feature vectors that can be processed by a computer. Such as MFCC, i.e. mel-frequency cepstral coefficients, and Filterbank, i.e. filter bank based, features.
The input module 403: for inputting the characteristics of the audio data into the first decoding network to obtain the recognized text.
Specifically, after the audio data to be recognized in the application is subjected to feature extraction, the input module 403 inputs the audio data to the first decoding network, and a corresponding field acoustic model can be found out from the first decoding network, the field acoustic model is more matched with the audio data to be recognized due to learning of acoustic signal features in a specific scene, the performance of the obtained recognized text is better, the speech recognition accuracy in a specific application field is improved, and time and resource investment are saved.
The first decoding network in the above apparatus is determined by:
training an acoustic model on the features of the domain speech database to obtain a domain acoustic model;
the domain acoustic model, the dictionary, and the language model form the first decoding network.
It should be understood that the domain acoustic model in the present application is obtained by taking an existing general-purpose model and learning, from limited domain speech data, the acoustic signal characteristics of a specific scenario. This "domain acoustic model" replaces the "general acoustic model" of the conventional framework and, together with the dictionary and the language model, forms the first decoding network. Because the domain acoustic model matches the audio data to be recognized in the specific scenario, recognition accuracy in that scenario is improved.
In some embodiments, the features of the domain speech database in the apparatus are determined by:
acquiring a domain speech database.
Specifically, the domain speech database in the embodiments of the present application differs from a general-purpose speech database in that its scale is much smaller: depending on task difficulty, only a few hundred to a few tens of thousands of sentences are needed. For applications with simpler texts, such as command-style scenarios (e.g., "turn on the sound" or "play a song"), a few hundred sentences can already greatly improve recognition performance; for scenarios with free-form speech and special acoustic characteristics, tens of thousands of sentences may be required. In a conference scenario, for example, the speech content is unconstrained and the relative positions of speakers and microphones vary, so more training data are needed to learn the acoustic features of the scene. In general, the larger the dataset, the greater the improvement in speech recognition performance in the specific scenario.
And extracting the features of the domain speech database.
Specifically, extracting the features of the domain speech database converts the audio in the database into feature vectors that a computer can process. The most common features are MFCCs (Mel-frequency cepstral coefficients) and filter-bank (Fbank) features.
In some embodiments, the domain speech database in the apparatus is a labeled domain speech database.
In some embodiments, the features of the domain speech database extracted by the extraction module 402 include at least: pre-emphasis, framing, windowing, and discrete Fourier transform.
In some embodiments, the domain speech databases in the apparatus are domain speech databases corresponding to different domains.
It should be understood that the smart-appliance-control domain corresponds to a smart-appliance-control speech database, the conference-scene domain corresponds to a conference-scene domain database, the in-vehicle scene corresponds to an in-vehicle-scene domain database, and so on; these examples are not exhaustive and do not limit the scope of the embodiments of the present application. It should also be appreciated that the finer the segmentation into distinct domain speech databases, the more accurate the resulting domain acoustic model and, in turn, the speech recognition.
In a third aspect, an embodiment of the present invention further provides an electronic device, including: a processor and a memory;
the processor is configured to execute any of the above methods for speech recognition in a specific scenario by calling a program or instructions stored in the memory.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a program or instructions that cause a computer to execute any of the above methods for speech recognition in a specific scenario.
Fig. 5 is a schematic block diagram of an electronic device provided by an embodiment of the present disclosure.
As shown in fig. 5, the electronic device includes at least one processor 501, at least one memory 502, and at least one communication interface 503. The components of the electronic device are coupled together by a bus system 504, and the communication interface 503 handles information exchange with external devices. It is understood that the bus system 504 enables communication among these components and includes a power bus, a control bus, and a status signal bus in addition to the data bus. For clarity of illustration, all buses are labeled as the bus system 504 in fig. 5.
It will be appreciated that the memory 502 in this embodiment can be volatile memory, nonvolatile memory, or a combination of both.
In some embodiments, memory 502 stores elements, executable units or data structures, or a subset thereof, or an expanded set thereof as follows: an operating system and an application program.
The operating system includes various system programs, such as a framework layer, a core library layer, and a driver layer, which implement basic services and handle hardware-based tasks. The application programs, such as a media player (Media Player) and a browser (Browser), implement various application services. The program implementing any of the methods for speech recognition in a specific scenario provided by the embodiments of the present application may be included in an application program.
In this embodiment of the present application, the processor 501 executes the steps of the method embodiments for speech recognition in a specific scenario provided herein by calling a program or instructions stored in the memory 502 (specifically, a program or instructions stored in an application program), namely:
acquiring audio data to be recognized;
extracting features of the audio data;
inputting the features of the audio data into a first decoding network to obtain a recognized text;
wherein the first decoding network is determined by: training an acoustic model on the features of a domain speech database to obtain a domain acoustic model; the domain acoustic model, a dictionary, and a language model together form the first decoding network.
Any of the methods for speech recognition in a specific scenario provided in the embodiments of the present application may be applied to, or implemented by, the processor 501. The processor 501 may be an integrated circuit chip with signal processing capabilities; in implementation, the steps of the above method may be completed by integrated hardware logic circuits in the processor 501 or by instructions in the form of software. The processor 501 may be a general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor or any conventional processor.
The steps of any method for speech recognition in a specific scenario provided in the embodiments of the present application may be performed directly by a hardware decoding processor, or by a combination of hardware and software units within a decoding processor. The software units may reside in storage media well known in the art, such as RAM, flash memory, ROM, PROM, or EPROM, and registers. The storage medium is located in the memory 502; the processor 501 reads the information in the memory 502 and, in combination with its hardware, performs the steps of the method for speech recognition in a specific scenario.
Those skilled in the art will appreciate that although some embodiments described herein include some features that other embodiments lack, and vice versa, combinations of features from different embodiments are within the scope of the application and form further embodiments.
Those skilled in the art will appreciate that the description of each embodiment has a respective emphasis, and reference may be made to the related description of other embodiments for those parts of an embodiment that are not described in detail.
Although the embodiments of the present application have been described with reference to the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the application, and such modifications and variations fall within the scope defined by the appended claims. Any person skilled in the art can readily conceive of equivalent modifications and substitutions within the technical scope of the present disclosure, and these are likewise intended to be covered. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A method for speech recognition in a specific scene, comprising:
acquiring audio data to be recognized;
extracting features of the audio data;
inputting the features of the audio data into a first decoding network to obtain a recognized text;
wherein the first decoding network is determined by:
training an acoustic model on the features of a domain speech database to obtain a domain acoustic model;
the domain acoustic model, a lexicon, and a language model together constituting the first decoding network.
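For illustration only (not part of the claims), the structure recited in claim 1 — a decoding network composed of a domain acoustic model, a lexicon, and a language model — can be sketched as follows. All class, function, and variable names here are hypothetical, and the greedy decoder stands in for a real WFST/beam-search decoder:

```python
import numpy as np

class DecodingNetwork:
    """Toy sketch of a decoding network built from three components."""

    def __init__(self, acoustic_model, lexicon, language_model):
        self.acoustic_model = acoustic_model    # maps a feature frame to a phone label
        self.lexicon = lexicon                  # maps phone labels to words
        self.language_model = language_model    # scores a candidate word string

    def decode(self, feature_frames):
        # Acoustic model: one phone label per feature frame (greedy stand-in).
        phones = [self.acoustic_model(f) for f in feature_frames]
        # Lexicon: map phone labels to words, skipping out-of-lexicon labels.
        words = [self.lexicon[p] for p in phones if p in self.lexicon]
        # Language model: pick the best-scoring candidate (trivial single candidate here).
        candidates = [" ".join(words)]
        return max(candidates, key=self.language_model)

# Hypothetical usage with toy components:
net = DecodingNetwork(
    acoustic_model=lambda f: "a" if f.mean() > 0 else "b",
    lexicon={"a": "hello", "b": "world"},
    language_model=len,
)
```

In a real system the three components would be composed offline into a single search graph rather than applied sequentially, but the claim only recites that the three together constitute the network.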
2. The method of claim 1, wherein the features of the domain speech database are determined by:
acquiring a domain speech database;
and extracting features of the domain speech database.
3. The method of claim 2, wherein the domain speech database is a labeled domain speech database.
4. The method of claim 2, wherein extracting features of the domain speech database comprises at least: pre-emphasis, framing, windowing, and discrete Fourier transform.
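For illustration only (not part of the claims), the four steps recited in claim 4 can be sketched with numpy; the frame length, hop size, and pre-emphasis coefficient below are conventional assumed values, not taken from the disclosure:

```python
import numpy as np

def extract_features(signal, frame_len=400, hop=160, alpha=0.97):
    """Sketch of the claimed pipeline: pre-emphasis, framing, windowing, DFT."""
    # Pre-emphasis: boost high frequencies, y[n] = x[n] - alpha * x[n-1].
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing: split the signal into overlapping frames.
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # Windowing: apply a Hamming window to each frame to reduce spectral leakage.
    frames = frames * np.hamming(frame_len)
    # Discrete Fourier transform: magnitude spectrum per frame (real-input FFT).
    return np.abs(np.fft.rfft(frames, axis=1))
```

With 16 kHz audio these defaults correspond to 25 ms frames and a 10 ms hop; each frame yields `frame_len // 2 + 1` spectral magnitudes.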
5. The method of claim 2, wherein the domain speech database comprises domain speech databases corresponding to different domains.
6. An apparatus for speech recognition in a specific scene, comprising:
an acquisition module, configured to acquire audio data to be recognized;
an extraction module, configured to extract features of the audio data;
an input module, configured to input the features of the audio data into a first decoding network to obtain a recognized text;
wherein the first decoding network is determined by:
training an acoustic model on the features of a domain speech database to obtain a domain acoustic model;
the domain acoustic model, a lexicon, and a language model together constituting the first decoding network.
7. The apparatus of claim 6, wherein the features of the domain speech database are determined by:
acquiring a domain speech database;
and extracting features of the domain speech database.
8. The apparatus of claim 7, wherein the domain speech database is a labeled domain speech database.
9. An electronic device, comprising: a processor and a memory;
the processor is used for executing the method for speech recognition under the specific scene according to any one of claims 1 to 5 by calling the program or the instructions stored in the memory.
10. A computer-readable storage medium storing a program or instructions for causing a computer to execute a method of speech recognition in a specific scenario according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110616948.XA CN113327587A (en) | 2021-06-02 | 2021-06-02 | Method and device for voice recognition in specific scene, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110616948.XA CN113327587A (en) | 2021-06-02 | 2021-06-02 | Method and device for voice recognition in specific scene, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113327587A true CN113327587A (en) | 2021-08-31 |
Family
ID=77419419
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110616948.XA Pending CN113327587A (en) | 2021-06-02 | 2021-06-02 | Method and device for voice recognition in specific scene, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113327587A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114120979A (en) * | 2022-01-25 | 2022-03-01 | 荣耀终端有限公司 | Optimization method, training method, device and medium of voice recognition model |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110024026A (en) * | 2016-11-28 | 2019-07-16 | 谷歌有限责任公司 | Structured text content is generated using speech recognition modeling |
CN110379415A (en) * | 2019-07-24 | 2019-10-25 | 出门问问(苏州)信息科技有限公司 | The training method of domain-adaptive acoustic model |
CN111402862A (en) * | 2020-02-28 | 2020-07-10 | 问问智能信息科技有限公司 | Voice recognition method, device, storage medium and equipment |
CN112002308A (en) * | 2020-10-30 | 2020-11-27 | 腾讯科技(深圳)有限公司 | Voice recognition method and device |
CN112669851A (en) * | 2021-03-17 | 2021-04-16 | 北京远鉴信息技术有限公司 | Voice recognition method and device, electronic equipment and readable storage medium |
CN112802461A (en) * | 2020-12-30 | 2021-05-14 | 深圳追一科技有限公司 | Speech recognition method and device, server, computer readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20240038218A1 (en) | Speech model personalization via ambient context harvesting | |
JP2017058674A (en) | Apparatus and method for speech recognition, apparatus and method for training transformation parameter, computer program and electronic apparatus | |
US20220115002A1 (en) | Speech recognition method, speech recognition device, and electronic equipment | |
CN110097870A (en) | Method of speech processing, device, equipment and storage medium | |
CN112562640B (en) | Multilingual speech recognition method, device, system, and computer-readable storage medium | |
JP6189818B2 (en) | Acoustic feature amount conversion device, acoustic model adaptation device, acoustic feature amount conversion method, acoustic model adaptation method, and program | |
CN112420050B (en) | Voice recognition method and device and electronic equipment | |
WO2023030235A1 (en) | Target audio output method and system, readable storage medium, and electronic apparatus | |
Chao et al. | Speaker-targeted audio-visual models for speech recognition in cocktail-party environments | |
Shahnawazuddin et al. | Improvements in IITG Assamese spoken query system: Background noise suppression and alternate acoustic modeling | |
CN113327587A (en) | Method and device for voice recognition in specific scene, electronic equipment and storage medium | |
CN110648669B (en) | Multi-frequency shunt voiceprint recognition method, device and system and computer readable storage medium | |
Ponting | Computational Models of Speech Pattern Processing | |
CN111798838A (en) | Method, system, equipment and storage medium for improving speech recognition accuracy | |
Sinha et al. | AI based Desktop Voice Assistant for Visually Impared Persons | |
CN112668704B (en) | Training method and device of audio recognition model and audio recognition method and device | |
Rasipuram et al. | Grapheme and multilingual posterior features for under-resourced speech recognition: a study on scottish gaelic | |
CN115132170A (en) | Language classification method and device and computer readable storage medium | |
Shah et al. | Speaker recognition for pashto speakers based on isolated digits recognition using accent and dialect approach | |
CN114203180A (en) | Conference summary generation method and device, electronic equipment and storage medium | |
Yadava et al. | An end-to-end continuous Kannada ASR system under uncontrolled environment | |
Dey et al. | Enhancements in Assamese spoken query system: Enabling background noise suppression and flexible queries | |
CN115699170A (en) | Text echo cancellation | |
CN113782005A (en) | Voice recognition method and device, storage medium and electronic equipment | |
Mann et al. | Tamil talk: What you speak is what you get! |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |