CN113077786A - Voice recognition method, device, equipment and storage medium - Google Patents

Voice recognition method, device, equipment and storage medium

Info

Publication number
CN113077786A
Authority
CN
China
Prior art keywords
language
dictionary
acoustic model
new
new language
Prior art date
Legal status
Granted
Application number
CN202110308487.XA
Other languages
Chinese (zh)
Other versions
CN113077786B (en)
Inventor
徐燃
Current Assignee
Beijing Rubu Technology Co ltd
Original Assignee
Beijing Roobo Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Roobo Technology Co., Ltd.
Priority claimed to CN202110308487.XA
Publication of CN113077786A
Application granted
Publication of CN113077786B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/005: Language recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223: Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a voice recognition method, device, equipment, and storage medium. The method comprises: acquiring a command voice in a new language; obtaining a grammar for the command voice according to a first dictionary of the new language and a first acoustic model of the new language, wherein the first dictionary is determined from a second dictionary of a first language and a phoneme mapping between the new language and the first language, the first acoustic model is determined from a second acoustic model of the first language and the first dictionary, the first language is the base language whose pronunciation phonemes are closest to those of the new language, and the base languages comprise a non-tonal language and a tonal language; and decoding according to the grammar of the new language and the first acoustic model to obtain the command word of the command voice. With the scheme of the embodiments of the application, multi-language recognition can be realized quickly, at low cost, and with high cost-effectiveness.

Description

Voice recognition method, device, equipment and storage medium
Technical Field
The present application relates to the field of speech recognition, and in particular, to a speech recognition method, apparatus, device, and storage medium.
Background
With the development of globalization, many products controlled by voice commands are required to support multiple major languages, which would normally require training an acoustic model for each language individually. However, training an acoustic model from scratch for each language is expensive in data acquisition and has a long development cycle.
Disclosure of Invention
The application provides a voice recognition method, device, equipment, and storage medium that can quickly realize multi-language recognition at low cost and with high cost-effectiveness.
In order to achieve the above object, an embodiment of the present application provides a speech recognition method, including:
acquiring command voice of a new language;
obtaining a grammar for the command voice according to a first dictionary of the new language and a first acoustic model of the new language, wherein the first dictionary is determined from a second dictionary of a first language and a phoneme mapping between the new language and the first language, the first acoustic model is determined from the second acoustic model of the first language and the first dictionary, the first language is the base language whose pronunciation phonemes are closest to those of the new language, and the base languages comprise a non-tonal language and a tonal language;
and decoding according to the grammar of the new language and the first acoustic model to obtain the command word of the command voice.
Further, before obtaining the grammar of the command speech according to the first dictionary of the new language and the first acoustic model of the new language, the method further includes:
obtaining training samples of the new language, wherein the number of speakers sampled for the training samples is less than or equal to 200, each speaker is recorded the same number of times, and the number of recordings per speaker is less than or equal to 10.
Furthermore, the sampled speakers comprise equal numbers of men and women.
Further, the training samples comprise N simultaneous recordings of the same speaker captured at equally spaced distances, where N is less than or equal to 3.
Further, before obtaining the grammar of the command speech according to the first dictionary of the new language and the first acoustic model of the new language, the method further includes:
determining the first dictionary of the new language according to the phoneme mapping between the new language and the first language, wherein each first phoneme in the first dictionary is a second phoneme in the second dictionary used to represent the corresponding pronunciation of the new language.
Further, before obtaining the grammar of the command speech according to the first dictionary of the new language and the first acoustic model of the new language, the method further includes:
aligning the phonemes in the first dictionary;
and, according to the alignment result, performing fine-tuning iterations on the second acoustic model at a preset learning rate to obtain the first acoustic model, wherein the preset learning rate is less than or equal to a low-learning-rate threshold and the second acoustic model is based on a neural network.
Furthermore, the non-tonal language is English and the tonal language is Chinese.
In order to achieve the above object, an embodiment of the present application provides a speech recognition apparatus, including:
an acquisition unit configured to acquire a command voice of a new language;
a grammar unit configured to obtain a grammar for the command voice according to a first dictionary of the new language and a first acoustic model of the new language, wherein the first dictionary is determined from a second dictionary of a first language and a phoneme mapping between the new language and the first language, the first acoustic model is determined from the second acoustic model of the first language and the first dictionary, the first language is the base language whose pronunciation phonemes are closest to those of the new language, and the base languages comprise a non-tonal language and a tonal language;
and a decoding unit configured to decode the command word of the command voice according to the grammar of the new language and the first acoustic model.
To achieve the above object, an embodiment of the present application provides an apparatus, including:
one or more processors;
a memory arranged to store one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described above.
To achieve the above object, an embodiment of the present application provides a storage medium storing a computer program which, when executed by a processor, implements the method described above.
The speech recognition method, device, equipment, and storage medium of the embodiments of the application do not require labeling and learning hundreds of thousands of hours of new-language training data; command words of the new language are recognized quickly using the dictionary and acoustic model of an existing base language, at low cost and with high cost-effectiveness.
Drawings
FIG. 1 is a flow chart of a speech recognition method according to an embodiment of the present application;
FIG. 2 is a flowchart of a speech recognition method according to a second embodiment of the present application;
fig. 3 is a block diagram of a speech recognition apparatus according to a third embodiment of the present application;
fig. 4 is a block diagram of an apparatus provided in an embodiment of the present application.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an", and "the" in this disclosure are intended to be illustrative rather than limiting; those skilled in the art will understand that they mean "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Example one
Fig. 1 is a flowchart of a speech recognition method according to a first embodiment of the present disclosure. The method may be executed by a speech recognition device, which may be implemented in software and/or hardware. As shown in Fig. 1, the method specifically comprises steps S110, S120, and S130.
S110, acquiring the command voice of the new language.
S120, obtaining a grammar for the command voice according to a first dictionary of the new language and a first acoustic model of the new language, wherein the first dictionary is determined from a second dictionary of a first language and a phoneme mapping between the new language and the first language, the first acoustic model is determined from a second acoustic model of the first language and the first dictionary, the first language is the base language whose pronunciation phonemes are closest to those of the new language, and the base languages comprise a non-tonal language and a tonal language.
The base language refers to a language for which an acoustic model has already been completed. In the embodiments of the present application, the example non-tonal language is English and the example tonal language is Chinese; of course, the non-tonal language may be any other non-tonal language for which acoustic modeling has been completed, and the tonal language may be any other tonal language for which acoustic modeling has been completed.
S130, decoding according to the grammar of the new language and the first acoustic model to obtain the command word of the command voice.
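The decoding step can be sketched as follows. This is an illustrative sketch under assumed data structures (per-frame phoneme log-probabilities from the acoustic model, and a grammar mapping each command word to its phoneme sequence from the first dictionary); the naive uniform frame-to-phoneme alignment is a stand-in for a real decoder, not the patent's implementation.

```python
# Illustrative sketch: a command grammar restricts decoding to a closed set
# of command words, so recognition reduces to scoring each candidate's
# phoneme sequence against the acoustic model's per-frame phoneme scores.

def decode_command(frame_scores, grammar):
    """frame_scores: list of dicts mapping phoneme -> log-probability per frame.
    grammar: dict mapping command word -> its phoneme sequence (from the
    first dictionary). Returns the best-scoring command word."""
    best_word, best_score = None, float("-inf")
    n = len(frame_scores)
    for word, phonemes in grammar.items():
        # Naive alignment: stretch the phoneme sequence uniformly over frames.
        score = 0.0
        for i, frame in enumerate(frame_scores):
            ph = phonemes[min(i * len(phonemes) // n, len(phonemes) - 1)]
            score += frame.get(ph, float("-inf"))
        if score > best_score:
            best_word, best_score = word, score
    return best_word
```

A real system would use dynamic-programming alignment (e.g. Viterbi) instead of uniform stretching, but the grammar-constrained search space is the point illustrated here.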
The voice recognition method provided by this embodiment of the application does not require labeling and learning hundreds of thousands of hours of new-language training data; it quickly recognizes command words of the new language using the dictionary and acoustic model of an existing base language, at low cost and with high cost-effectiveness.
Example two
Fig. 2 is a flowchart of a speech recognition method according to a second embodiment of the disclosure. As shown in Fig. 2, on the basis of the first embodiment the method comprises steps S210 to S260.
S210, obtaining training samples of the new language.
The command words of the new language that need to be recognized are recorded to serve as training samples. To keep the number of training samples as small as possible, the following requirements may optionally be imposed on them:
(1) The number of speakers sampled is less than or equal to 200; further optionally, the number of speakers is in [100, 200].
(2) Equal numbers of men and women are sampled, so that the gender characteristics of the training samples are balanced.
(3) Each speaker is recorded the same number of times, and the number of recordings per speaker is less than or equal to 10.
(4) If speech in the new language needs to be recognized at a distance (i.e. the distance between the speech recognition device and the user is greater than or equal to a preset threshold), recording devices are placed at equal intervals. For example, if voice interaction is supported at up to 3 m, one device is placed every 1 m, i.e. the same speaker is recorded simultaneously at 1 m, 2 m, and 3 m from the speech recognition device.
The number of training samples is the product of the number of command-word sentences in the new language, the number of speakers, the number of recordings per speaker, and N.
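The sample-count arithmetic above can be stated as a small helper. The function name and the bound checks are illustrative, not from the patent; the numbers in the usage line are an invented example.

```python
# Hedged example of the sample-count arithmetic described above: the total
# number of training recordings is sentences x speakers x takes x N devices.
def training_sample_count(num_sentences, num_speakers, takes_per_speaker, n_devices):
    # Bounds taken from the requirements listed above.
    assert num_speakers <= 200 and takes_per_speaker <= 10 and n_devices <= 3
    return num_sentences * num_speakers * takes_per_speaker * n_devices

# e.g. 20 command sentences, 100 speakers, 5 takes each, devices at 1m/2m/3m
total = training_sample_count(20, 100, 5, 3)  # 30000 recordings
```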
S220, determining the first dictionary of the new language according to the phoneme mapping between the new language and the first language, wherein each first phoneme in the first dictionary is a second phoneme in the second dictionary used to represent the corresponding pronunciation of the new language.
In one embodiment, the new language is Japanese. The non-tonal language among the base languages is English (American English in this embodiment) and the tonal language is Chinese; since the pronunciation phonemes of Japanese are closer to those of English, the first language is English.
Referring to Table 1, this embodiment takes the International Phonetic Alphabet (IPA) phoneme table as an example; a phoneme mapping between Japanese (the new language) and English (the first language) is constructed according to the similarity of their pronunciations.
TABLE 1 phoneme mapping relationship between Japanese and English
According to a phoneme mapping such as that shown in Table 1, the pronunciation of each new-language command word is represented by second phonemes from the second dictionary of the first language; the second phonemes used for this representation become the first phonemes that constitute the first dictionary of the new language. Table 2 shows two Japanese command words and their corresponding Japanese-to-English phoneme mappings.
TABLE 2 Japanese command words and their corresponding Japanese-to-English phoneme mapping relationships
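The dictionary-construction step (S220) can be sketched as follows. Since the actual contents of Table 1 and Table 2 are images in the original, the Japanese phonemes, the mapping entries, and the command word below are invented placeholders; only the mapping mechanism follows the text.

```python
# Illustrative sketch of building the new language's first dictionary from
# the base language's second dictionary via a phoneme mapping. The mapping
# entries and the command word are hypothetical placeholders.

# Hypothetical Japanese phoneme -> English (first-language) phoneme mapping.
ja_to_en = {"a": "AA", "i": "IY", "u": "UW", "k": "K", "o": "OW", "n": "N"}

def build_first_dictionary(command_words, ja_to_en):
    """command_words: dict mapping command word -> list of Japanese phonemes.
    Returns a dictionary whose entries use only base-language phonemes,
    so the existing English acoustic model can score them."""
    first_dict = {}
    for word, ja_phones in command_words.items():
        first_dict[word] = [ja_to_en[p] for p in ja_phones]
    return first_dict

commands = {"kaion": ["k", "a", "i", "o", "n"]}  # hypothetical command word
print(build_first_dictionary(commands, ja_to_en))
# {'kaion': ['K', 'AA', 'IY', 'OW', 'N']}
```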
S230, aligning the phonemes in the first dictionary, and, according to the alignment result, performing fine-tuning (finetune) iterations on the second acoustic model at a preset learning rate to obtain the first acoustic model, wherein the preset learning rate is less than or equal to a low-learning-rate threshold and the second acoustic model is based on a neural network.
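The low-learning-rate fine-tuning in S230 can be illustrated with a toy update rule. This is a sketch only: the model here is a one-weight regressor standing in for the neural-network acoustic model, and the threshold value is an assumed placeholder, not a figure from the patent.

```python
# Minimal sketch of the low-learning-rate fine-tuning step: the base-language
# acoustic model's weights are nudged toward the aligned new-language data
# with a learning rate capped at a small threshold, so the model adapts
# without drifting far from its base-language knowledge.

LOW_LR_THRESHOLD = 1e-4  # assumed placeholder value

def finetune_step(weight, grad, lr=1e-4):
    assert lr <= LOW_LR_THRESHOLD, "fine-tuning must use a low learning rate"
    return weight - lr * grad

w = 0.5                           # weight from the base-language model
for _ in range(10):               # a few fine-tune iterations
    grad = 2 * (w - 0.48)         # toy gradient toward the new-language optimum
    w = finetune_step(w, grad)
```

After a few iterations `w` has moved only slightly toward the new-language optimum, which is exactly the behavior a capped learning rate is meant to enforce.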
S240, acquiring command voice of the new language.
S250, obtaining grammar of the command voice according to the first dictionary of the new language and the first acoustic model of the new language.
S260, decoding according to the grammar of the new language and the first acoustic model to obtain the command word of the command voice.
The voice recognition method provided by this embodiment of the application does not require labeling and learning hundreds of thousands of hours of new-language training data; it quickly recognizes command words of the new language using the dictionary and acoustic model of an existing base language, at low cost and with high cost-effectiveness.
EXAMPLE III
Fig. 3 is a structural diagram of a speech recognition apparatus according to a third embodiment of the present disclosure. As shown in fig. 3, the voice recognition apparatus includes: an acquisition unit 310, a syntax unit 320 and a decoding unit 330.
An acquisition unit 310 configured to acquire a command voice of a new language;
a grammar unit 320 configured to obtain a grammar for the command voice according to a first dictionary of the new language and a first acoustic model of the new language, wherein the first dictionary is determined from a second dictionary of a first language and a phoneme mapping between the new language and the first language, the first acoustic model is determined from a second acoustic model of the first language and the first dictionary, the first language is the base language whose pronunciation phonemes are closest to those of the new language, and the base languages comprise a non-tonal language and a tonal language;
and a decoding unit 330 configured to decode the command word of the command voice according to the grammar of the new language and the first acoustic model.
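The three units above can be wired together as one recognizer, sketched below. The class and method names, and the pipeline shape, are assumptions for illustration, not the patent's code; the acoustic model is passed in as a callable that scores candidate command words.

```python
# Hedged structural sketch of the three units of Fig. 3: acquisition (310),
# grammar (320), and decoding (330), composed into a single recognizer.

class SpeechRecognizer:
    def __init__(self, first_dictionary, acoustic_model):
        self.first_dictionary = first_dictionary   # new-language first dictionary
        self.acoustic_model = acoustic_model       # fine-tuned first acoustic model

    def acquire(self, audio):                      # acquisition unit (310)
        return audio

    def build_grammar(self):                       # grammar unit (320)
        # Grammar: each command word with its base-language phoneme sequence.
        return {w: phs for w, phs in self.first_dictionary.items()}

    def decode(self, audio):                       # decoding unit (330)
        grammar = self.build_grammar()
        scores = self.acoustic_model(self.acquire(audio), grammar)
        return max(scores, key=scores.get)         # best-scoring command word
```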
Further, the obtaining unit 310 is further configured to obtain training samples of the new language before the grammar for the command voice is obtained, wherein the number of speakers sampled for the training samples is less than or equal to 200, each speaker is recorded the same number of times, and the number of recordings per speaker is less than or equal to 10.
Furthermore, the sampled speakers comprise equal numbers of men and women.
Further, the training samples comprise N simultaneous recordings of the same speaker captured at equally spaced distances, where N is less than or equal to 3.
Further, the obtaining unit 310 is further configured to, before the grammar for the command voice is obtained, determine the first dictionary of the new language according to the phoneme mapping between the new language and the first language, wherein each first phoneme in the first dictionary is a second phoneme in the second dictionary used to represent the corresponding pronunciation of the new language.
Further, the obtaining unit 310 is further configured to align the phonemes in the first dictionary before the grammar for the command voice is obtained, and, according to the alignment result, perform fine-tuning iterations on the second acoustic model at a preset learning rate to obtain the first acoustic model, wherein the preset learning rate is less than or equal to a low-learning-rate threshold and the second acoustic model is based on a neural network.
Furthermore, the non-tonal language is English and the tonal language is Chinese.
The speech recognition device provided by this embodiment of the application does not require labeling and learning hundreds of thousands of hours of new-language training data; it quickly recognizes command words of the new language using the dictionary and acoustic model of an existing base language, at low cost and with high cost-effectiveness.
An embodiment of the present application further provides an apparatus. Fig. 4 is a structural diagram of the apparatus. As shown in Fig. 4, the apparatus comprises a processor 71, a memory 72, an input device 73, and an output device 74. The number of processors 71 in the apparatus may be one or more (one processor 71 is taken as an example); the processor 71, memory 72, input device 73, and output device 74 may be connected by a bus or in other ways, with a bus connection taken as the example in this embodiment.
The memory 72 is a computer-readable storage medium, and can be used for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the voice recognition device in the embodiment of the present application (for example, the obtaining unit 310, the grammar unit 320, and the decoding unit 330 in the voice recognition device), and the processor 71 executes various functional applications and data processing of the apparatus by executing the software programs, instructions, and modules stored in the memory 72, so as to implement any method provided in the embodiment of the present application.
The memory 72 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required for at least one function, and the data storage area may store data created through use of the device. Further, the memory 72 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 72 may further include memory located remotely from the processor 71, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 73 may be used to receive input numeric or character information and generate key signal inputs relating to user settings and function control of the apparatus. The output device 74 may include a display device such as a display screen.
Embodiments of the present application also provide a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a speech recognition method comprising:
acquiring command voice of a new language;
obtaining a grammar for the command voice according to a first dictionary of the new language and a first acoustic model of the new language, wherein the first dictionary is determined from a second dictionary of a first language and a phoneme mapping between the new language and the first language, the first acoustic model is determined from the second acoustic model of the first language and the first dictionary, the first language is the base language whose pronunciation phonemes are closest to those of the new language, and the base languages comprise a non-tonal language and a tonal language;
and decoding according to the grammar of the new language and the first acoustic model to obtain the command word of the command voice.
Of course, the computer-executable instructions contained in the storage medium provided by the embodiments of the present application are not limited to the operations of the speech recognition method described above, and may also perform related operations of the speech recognition method provided by any embodiment of the present application.
From the above description of the embodiments, it will be clear to those skilled in the art that the present application can be implemented by software plus necessary general-purpose hardware, or by hardware alone, although the former is the preferred embodiment in many cases. Based on this understanding, the technical solutions of the present application may be embodied in the form of a software product stored in a computer-readable storage medium, such as a floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk, or optical disk, and including several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments of the present application.
The above description is only exemplary embodiments of the present application, and is not intended to limit the scope of the present application.
It will be clear to a person skilled in the art that the term user terminal covers any suitable type of wireless user equipment, such as a mobile phone, a portable data processing device, a portable web browser, or a vehicle-mounted mobile station.
In general, the various embodiments of the application may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the application is not limited thereto.
Embodiments of the application may be implemented by a data processor of a mobile device executing computer program instructions, for example in a processor entity, by hardware, or by a combination of software and hardware. The computer program instructions may be assembly instructions, instruction set architecture (ISA) instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages.
Any logic-flow block diagrams in the figures of this application may represent program steps, interconnected logic circuits, modules, and functions, or a combination of program steps and logic circuits, modules, and functions. The computer program may be stored on a memory, which may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as, but not limited to, read-only memory (ROM), random access memory (RAM), and optical storage devices and systems (DVDs or CDs). The computer-readable medium may include a non-transitory storage medium. The data processor may be of any type suitable to the local technical environment, such as, but not limited to, general-purpose computers, special-purpose computers, microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), and processors based on a multi-core architecture.
The foregoing has provided, by way of exemplary and non-limiting examples, a detailed description of exemplary embodiments of the present application. Various modifications and adaptations of the foregoing embodiments may become apparent to those skilled in the relevant arts in view of the description, the drawings, and the appended claims without departing from the scope of the invention. Therefore, the proper scope of the invention is to be determined according to the claims.

Claims (10)

1. A speech recognition method, characterized in that the method comprises the following steps:
acquiring command voice of a new language;
obtaining a grammar for the command voice according to a first dictionary of the new language and a first acoustic model of the new language, wherein the first dictionary is determined from a second dictionary of a first language and a phoneme mapping between the new language and the first language, the first acoustic model is determined from the second acoustic model of the first language and the first dictionary, the first language is the base language whose pronunciation phonemes are closest to those of the new language, and the base languages comprise a non-tonal language and a tonal language;
and decoding according to the grammar of the new language and the first acoustic model to obtain the command word of the command voice.
2. The method of claim 1, further comprising, before obtaining the grammar of the command voice according to the first dictionary of the new language and the first acoustic model of the new language:
acquiring a training sample of the new language, wherein the number of sampled speakers is less than or equal to 200, each speaker makes the same number of recordings, and the number of recordings is less than or equal to 10.
3. The method of claim 2, wherein the sampled speakers comprise equal numbers of men and women.
4. The method of claim 2, wherein the training samples comprise N simultaneous recordings of the same speaker captured at equally spaced distances, wherein N is less than or equal to 3.
5. The method of any of claims 2 to 4, further comprising, before obtaining the grammar of the command voice according to the first dictionary of the new language and the first acoustic model of the new language:
determining the first dictionary of the new language according to the phoneme mapping relation between the new language and the first language, wherein a first phoneme in the first dictionary is a second phoneme in the second dictionary that is used to represent the pronunciation of the new language.
6. The method of claim 5, further comprising, before obtaining the grammar of the command voice according to the first dictionary of the new language and the first acoustic model of the new language:
aligning the phonemes in the first dictionary; and
performing fine-tuning iterations on the second acoustic model at a preset learning rate according to the alignment result to obtain the first acoustic model, wherein the preset learning rate is less than or equal to a low learning-rate threshold, and the second acoustic model is based on a neural network.
7. The method of claim 1, wherein the non-tonal language is English and the tonal language is Chinese.
8. A speech recognition apparatus, comprising:
an acquisition unit configured to acquire a command voice in a new language;
a grammar unit configured to obtain a grammar of the command voice according to a first dictionary of the new language and a first acoustic model of the new language, wherein the first dictionary is determined according to a second dictionary of a first language and a phoneme mapping relation between the new language and the first language, the first acoustic model is determined according to a second acoustic model of the first language and the first dictionary, the first language is the base language whose phonemes are closest to the pronunciation of the new language, and the base languages comprise a non-tonal language and a tonal language; and
a decoding unit configured to decode, according to the grammar of the new language and the first acoustic model, a command word of the command voice.
9. An apparatus, comprising:
one or more processors; and
a memory arranged to store one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A storage medium storing a computer program which, when executed by a processor, implements the method of any one of claims 1-7.
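For orientation, the pipeline recited in claims 1, 5 and 6 — spell the new language with a well-resourced first language's phonemes, fine-tune the first language's acoustic model at a capped learning rate, then decode command words against a grammar — can be sketched in toy form. Every function name, phoneme symbol, lexicon entry and the 1e-4 learning-rate threshold below is an illustrative assumption; the patent specifies none of these values.

```python
# Toy sketch of the claimed pipeline (claims 1, 5 and 6). All symbols and
# values here are assumptions made for illustration only.

def build_first_dictionary(new_lang_lexicon, phoneme_map):
    """Claim 5: spell each new-language word using the first-language
    phonemes closest to its native pronunciation."""
    return {
        word: [phoneme_map.get(p, p) for p in phones]
        for word, phones in new_lang_lexicon.items()
    }

def capped_learning_rate(requested_lr, low_lr_threshold=1e-4):
    """Claim 6: the preset learning rate used to fine-tune the second
    acoustic model must not exceed a low learning-rate threshold."""
    return min(requested_lr, low_lr_threshold)

def decode_command(phoneme_hypothesis, grammar):
    """Claim 1: decode by matching the acoustic model's phoneme hypothesis
    against the pronunciations the command grammar allows, returning the
    closest command word (Levenshtein distance over phoneme sequences)."""
    def edit_distance(a, b):
        prev = list(range(len(b) + 1))
        for i, pa in enumerate(a, 1):
            cur = [i]
            for j, pb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (pa != pb)))   # substitution
            prev = cur
        return prev[-1]
    return min(grammar, key=lambda w: edit_distance(grammar[w], phoneme_hypothesis))
```

For example, a hypothetical command lexicon entry {"kai deng": ["k", "ai", "d", "eng"]} mapped through {"ai": "AY", "eng": "ENG"} yields the first-dictionary entry ["k", "AY", "d", "ENG"], which decode_command can then match against a phoneme hypothesis emitted by the fine-tuned acoustic model.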
CN202110308487.XA 2021-03-23 2021-03-23 Voice recognition method, device, equipment and storage medium Active CN113077786B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110308487.XA CN113077786B (en) 2021-03-23 2021-03-23 Voice recognition method, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113077786A true CN113077786A (en) 2021-07-06
CN113077786B CN113077786B (en) 2022-12-02

Family

ID=76613500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110308487.XA Active CN113077786B (en) 2021-03-23 2021-03-23 Voice recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113077786B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115240640A * 2022-07-20 2022-10-25 iFLYTEK Co., Ltd. Dialect speech recognition method, apparatus, device and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007155833A * 2005-11-30 2007-06-21 Advanced Telecommunication Research Institute International Acoustic model development system and computer program
CN101901599A * 2009-05-19 2010-12-01 Tata Consultancy Services Ltd. System and method for rapid prototyping of existing speech recognition schemes in different languages
CN107195296A * 2016-03-15 2017-09-22 Alibaba Group Holding Ltd. Speech recognition method, apparatus, terminal and system
CN109616096A * 2018-12-29 2019-04-12 Beijing Intelligent Housekeeper Technology Co., Ltd. Method, apparatus, server and medium for constructing a multilingual speech decoding graph
CN110070855A * 2018-01-23 2019-07-30 Institute of Acoustics, Chinese Academy of Sciences Speech recognition system and method based on a transfer-learning neural network acoustic model
CN110675855A * 2019-10-09 2020-01-10 Mobvoi Information Technology Co., Ltd. Speech recognition method, electronic device and computer-readable storage medium
US20200111484A1 * 2018-10-04 2020-04-09 Google Llc Cross-lingual speech recognition


Also Published As

Publication number Publication date
CN113077786B (en) 2022-12-02

Similar Documents

Publication Publication Date Title
CN109887497B (en) Modeling method, device and equipment for speech recognition
CN110556093B (en) Voice marking method and system
US11043213B2 (en) System and method for detection and correction of incorrectly pronounced words
CN111402862B (en) Speech recognition method, device, storage medium and equipment
JP2017058674A (en) Apparatus and method for speech recognition, apparatus and method for training transformation parameter, computer program and electronic apparatus
CN111369974B (en) Dialect pronunciation marking method, language identification method and related device
CN103680498A (en) Speech recognition method and speech recognition equipment
CN110503956B (en) Voice recognition method, device, medium and electronic equipment
CN110852075B (en) Voice transcription method and device capable of automatically adding punctuation marks and readable storage medium
CN109448704A Method, apparatus, server and storage medium for constructing a speech decoding graph
EP3791388A1 (en) Computer implemented method and apparatus for recognition of speech patterns and feedback
CN112818680B (en) Corpus processing method and device, electronic equipment and computer readable storage medium
CN110010136A Prosody prediction model training and text analysis method, apparatus, medium and device
CN113053390B (en) Text processing method and device based on voice recognition, electronic equipment and medium
CN111916062A (en) Voice recognition method, device and system
CN112015872A (en) Question recognition method and device
CN110647613A (en) Courseware construction method, courseware construction device, courseware construction server and storage medium
Płaza et al. Call transcription methodology for contact center systems
CN113077786B (en) Voice recognition method, device, equipment and storage medium
CN113836945A (en) Intention recognition method and device, electronic equipment and storage medium
CN113012683A (en) Speech recognition method and device, equipment and computer readable storage medium
Hanzlíček et al. LSTM-based speech segmentation trained on different foreign languages
Zellou et al. Linguistic disparities in cross-language automatic speech recognition transfer from Arabic to Tashlhiyt
CN113053415B (en) Method, device, equipment and storage medium for detecting continuous reading
CN113053409B (en) Audio evaluation method and device

Legal Events

Date Code Title Description
PB01 Publication
TA01 Transfer of patent application right

Effective date of registration: 20210902

Address after: 301-112, floor 3, building 2, No. 18, YANGFANGDIAN Road, Haidian District, Beijing 100038

Applicant after: Beijing Rubu Technology Co.,Ltd.

Address before: Room 508-598, Xitian Gezhuang Town Government Office Building, No. 8 Xitong Road, Miyun District Economic Development Zone, Beijing 101500

Applicant before: BEIJING ROOBO TECHNOLOGY Co.,Ltd.

SE01 Entry into force of request for substantive examination
GR01 Patent grant