CN113077786A - Voice recognition method, device, equipment and storage medium

Info

Publication number: CN113077786A
Application number: CN202110308487.XA
Authority: CN
Inventors: 徐燃
Original assignee: Beijing Roobo Technology Co ltd
Current assignee: Beijing Rubu Technology Co ltd
Priority date: 2021-03-23
Filing date: 2021-03-23
Publication date: 2021-07-06
Anticipated expiration: 2041-03-23
Also published as: CN113077786B

Abstract

The present application provides a speech recognition method, apparatus, device, and storage medium. The method includes: acquiring command speech in a new language; obtaining a grammar of the command speech according to a first dictionary of the new language and a first acoustic model of the new language, where the first dictionary is determined according to a second dictionary of a first language and a phoneme mapping relationship between the new language and the first language, the first acoustic model is determined according to a second acoustic model of the first language and the first dictionary, the first language is the base language whose phonemes are closer to those of the new language, and the base languages include a non-tonal language and a tonal language; and decoding according to the grammar of the new language and the first acoustic model to obtain the command words of the command speech. With the solution of the embodiments of the present application, multi-language recognition can be implemented quickly, at low cost, and cost-effectively.

Description

Voice recognition method, device, equipment and storage medium

Technical Field

The present application relates to the field of speech recognition, and in particular, to a speech recognition method, apparatus, device, and storage medium.

Background

With globalization, many voice-controlled products are required to support multiple major languages, which requires training an acoustic model for each language individually. However, training an acoustic model from scratch for each language involves expensive data acquisition and a long development cycle.

Disclosure of Invention

The application provides a speech recognition method, apparatus, device, and storage medium that can implement multi-language recognition quickly, at low cost, and cost-effectively.

In order to achieve the above object, an embodiment of the present application provides a speech recognition method, including:

acquiring command voice of a new language;

obtaining a grammar of the command voice according to a first dictionary of the new language and a first acoustic model of the new language, wherein the first dictionary is determined according to a second dictionary of a first language and a phoneme mapping relationship between the new language and the first language, the first acoustic model is determined according to a second acoustic model of the first language and the first dictionary, the first language is the one of the base languages whose pronunciation phonemes are closer to those of the new language, and the base languages comprise a non-tonal language and a tonal language;

and decoding according to the grammar of the new language and the first acoustic model to obtain the command word of the command voice.

Further, before obtaining the grammar of the command speech according to the first dictionary of the new language and the first acoustic model of the new language, the method further includes:

obtaining training samples of the new language, wherein the number of sampled speakers is less than or equal to 200, every speaker records the same number of times, and the number of recordings per speaker is less than or equal to 10.

Further, the sampled speakers include equal numbers of men and women.

Further, the training samples comprise N synchronous recordings of the same speaker captured at equally spaced distances, where N is less than or equal to 3.

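Further, before obtaining the grammar of the command speech according to the first dictionary of the new language and the first acoustic model of the new language, the method further includes:

determining the first dictionary of the new language according to the phoneme mapping relationship between the new language and the first language, wherein a first phoneme in the first dictionary is a second phoneme in the second dictionary that is used to represent a pronunciation of the new language.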

Further, before obtaining the grammar of the command speech according to the first dictionary of the new language and the first acoustic model of the new language, the method further includes:

aligning the phonemes in the first dictionary;

and, according to the alignment result, iteratively fine-tuning the second acoustic model at a preset learning rate to obtain the first acoustic model, wherein the preset learning rate is less than or equal to a low learning rate threshold, and the second acoustic model is based on a neural network.

Furthermore, the non-tonal language is English, and the tonal language is Chinese.

In order to achieve the above object, an embodiment of the present application provides a speech recognition apparatus, including:

an acquisition unit configured to acquire a command voice of a new language;

a grammar unit configured to obtain a grammar of the command speech according to a first dictionary of the new language and a first acoustic model of the new language, wherein the first dictionary is determined according to a second dictionary of a first language and a phoneme mapping relationship between the new language and the first language, the first acoustic model is determined according to a second acoustic model of the first language and the first dictionary, the first language is the one of the base languages whose pronunciation phonemes are closer to those of the new language, and the base languages include a non-tonal language and a tonal language;

and a decoding unit configured to decode according to the grammar of the new language and the first acoustic model to obtain the command words of the command voice.

To achieve the above object, an embodiment of the present application provides an apparatus, including:

one or more processors;

a memory arranged to store one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as described above.

To achieve the above object, an embodiment of the present application provides a storage medium storing a computer program, and the computer program is executed by a processor to implement the method as described above.

The speech recognition method, apparatus, device, and storage medium provided by the present application do not require labeling and training on hundreds of thousands of hours of new-language data. Instead, command words of a new language are recognized quickly using the dictionary and acoustic model of an existing base language, at low cost and cost-effectively.

Drawings

FIG. 1 is a flowchart of a speech recognition method according to a first embodiment of the present application;

FIG. 2 is a flowchart of a speech recognition method according to a second embodiment of the present application;

FIG. 3 is a block diagram of a speech recognition apparatus according to a third embodiment of the present application;

FIG. 4 is a block diagram of a device provided in an embodiment of the present application.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.

It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.

The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.

It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.

It is noted that the modifiers "a", "an", and "the" in this disclosure are illustrative rather than limiting, and those skilled in the art will understand them to mean "one or more" unless the context clearly dictates otherwise.

The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.

Example one

Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present disclosure, where the speech recognition method may be executed by a speech recognition device, and the speech recognition device may be implemented in software and/or hardware. As shown in fig. 1, the method specifically includes step S110, step S120, and step S130.

S110, acquiring command voice of the new language.

S120, obtaining a grammar of the command voice according to a first dictionary of the new language and a first acoustic model of the new language, wherein the first dictionary is determined according to a second dictionary of a first language and a phoneme mapping relationship between the new language and the first language, the first acoustic model is determined according to a second acoustic model of the first language and the first dictionary, the first language is the one of the base languages whose phonemes are closer to those of the new language, and the base languages comprise a non-tonal language and a tonal language.

A base language is a language for which acoustic modeling has already been completed. In the embodiments of the present application, English is the example of a non-tonal language and Chinese is the example of a tonal language. Of course, the non-tonal language may be any other non-tonal language, besides English, for which acoustic modeling has been completed, and the tonal language may be any other tonal language, besides Chinese, for which acoustic modeling has been completed.

S130, decoding according to the grammar of the new language and the first acoustic model to obtain the command word of the command voice.

The speech recognition method provided by the embodiment of the application does not require labeling and training on hundreds of thousands of hours of new-language data. Instead, it recognizes command words of a new language quickly using the dictionary and acoustic model of an existing base language, at low cost and cost-effectively.

Example two

Fig. 2 is a flowchart of a speech recognition method according to a second embodiment of the disclosure. As shown in fig. 2, building on the first embodiment, the method includes steps S210 to S260.

S210, obtaining training samples of the new language.

The command words of the new language that require speech recognition are recorded and used as the training samples of the new language. To keep the number of training samples as small as possible, the following requirements may optionally be imposed on the training samples:

(1) The number of sampled speakers is less than or equal to 200 and, further optionally, lies in the range [100, 200].

(2) The sampled speakers include equal numbers of men and women, so that the gender characteristics of the training samples are balanced.

(3) Every speaker records the same number of times, and the number of recordings per speaker is less than or equal to 10.

(4) If speech in the new language needs to be recognized at a distance (that is, the distance between the speech recognition device and the user to be recognized is greater than or equal to a preset threshold), the same speaker is recorded synchronously at N equally spaced positions, where N is less than or equal to 3. For example, if voice interaction is supported at up to 3 m, one recording device is placed every 1 m, so that the same speaker is recorded synchronously at positions 1 m, 2 m, and 3 m away from the speech recognition device (N = 3).

The number of training samples is the product of the number of command-word sentences in the new language, the number of sampled speakers, the number of recordings per speaker, and N.
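The requirements and the product above can be checked with a short Python sketch. It is only an illustration: the class name `RecordingPlan`, its field names, and the example figures (20 sentences, 50 men, 50 women, 5 recordings, N = 3) are hypothetical and do not come from the application, and the lower bound N ≥ 1 is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordingPlan:
    """Hypothetical container for recording requirements (1) to (4)."""
    num_command_sentences: int   # sentences of new-language command words
    num_male_speakers: int
    num_female_speakers: int
    recordings_per_speaker: int  # identical for every speaker
    num_distances: int           # N synchronous recordings per utterance

    @property
    def num_speakers(self) -> int:
        return self.num_male_speakers + self.num_female_speakers

    def violations(self) -> list[str]:
        """Return the requirements this plan breaks (empty list if none)."""
        problems = []
        if self.num_speakers > 200:
            problems.append("more than 200 speakers")
        if self.num_male_speakers != self.num_female_speakers:
            problems.append("male and female speaker counts differ")
        if self.recordings_per_speaker > 10:
            problems.append("more than 10 recordings per speaker")
        if not 1 <= self.num_distances <= 3:  # N >= 1 is an assumption
            problems.append("N must be between 1 and 3")
        return problems

    def num_training_samples(self) -> int:
        """Product of sentences, speakers, recordings per speaker, and N."""
        return (self.num_command_sentences * self.num_speakers
                * self.recordings_per_speaker * self.num_distances)


plan = RecordingPlan(num_command_sentences=20, num_male_speakers=50,
                     num_female_speakers=50, recordings_per_speaker=5,
                     num_distances=3)
print(plan.violations())            # []
print(plan.num_training_samples())  # 20 * 100 * 5 * 3 = 30000
```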

S220, determining the first dictionary of the new language according to the phoneme mapping relationship between the new language and the first language, wherein a first phoneme in the first dictionary is a second phoneme in the second dictionary that is used to represent a pronunciation of the new language.

In one embodiment, the new language is Japanese, the non-tonal base language is English (American English in this embodiment), and the tonal base language is Chinese. Because the pronunciation phonemes of Japanese are closer to those of English, the first language is English.

Referring to Table 1, this embodiment takes the International Phonetic Alphabet (IPA) phoneme table as an example and constructs the phoneme mapping relationship between Japanese (the new language) and English (the first language) according to the similarity between Japanese and English pronunciations.

TABLE 1 phoneme mapping relationship between Japanese and English

According to a phoneme mapping relationship such as the one shown in Table 1, the pronunciation of each new-language command word is represented by second phonemes from the second dictionary of the first language; the second phonemes used for this representation serve as the first phonemes and constitute the first dictionary of the new language. Table 2 shows two Japanese command words and their corresponding Japanese-to-English phoneme mappings.

TABLE 2 Japanese command words and their corresponding Japanese-to-English phoneme mapping relationships
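The dictionary construction described above can be sketched in Python as follows. This is a minimal illustration and does not reproduce the entries of Table 1 or the command words of Table 2: the mapping entries, the phoneme labels, the placeholder command words `cmd_1` and `cmd_2`, and the function name `build_first_dictionary` are all hypothetical.

```python
# Hypothetical, illustrative mapping from new-language phonemes to
# first-language phonemes; these are NOT the entries of Table 1.
PHONEME_MAP = {
    "a": "AA",
    "i": "IY",
    "k": "K",
    "o": "OW",
    "t": "T",
}


def build_first_dictionary(new_lang_lexicon, phoneme_map):
    """Express each command word's pronunciation with first-language phonemes.

    new_lang_lexicon: {command word: [new-language phonemes]}
    phoneme_map:      {new-language phoneme: first-language phoneme}
    Raises KeyError if a phoneme has no mapping, so gaps are not silently dropped.
    """
    first_dictionary = {}
    for word, phonemes in new_lang_lexicon.items():
        missing = [p for p in phonemes if p not in phoneme_map]
        if missing:
            raise KeyError(f"no mapping for {missing} in {word!r}")
        first_dictionary[word] = [phoneme_map[p] for p in phonemes]
    return first_dictionary


# Placeholder command words with made-up pronunciations, for illustration only.
lexicon = {"cmd_1": ["t", "a", "k", "i"], "cmd_2": ["o", "t", "o"]}
first_dictionary = build_first_dictionary(lexicon, PHONEME_MAP)
print(first_dictionary)
# {'cmd_1': ['T', 'AA', 'K', 'IY'], 'cmd_2': ['OW', 'T', 'OW']}
```

Raising an error on an unmapped phoneme is a design choice of this sketch: every new-language phoneme needs a counterpart in the first language before the first dictionary is usable.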

S230, aligning the phonemes in the first dictionary; and, according to the alignment result, iteratively fine-tuning (Finetune) the second acoustic model at a preset learning rate to obtain the first acoustic model, wherein the preset learning rate is less than or equal to a low learning rate threshold, and the second acoustic model is based on a neural network.
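The idea of fine-tuning at a capped learning rate can be illustrated with a toy Python sketch. It makes several assumptions that are not in the application: the second acoustic model is reduced to a single linear-softmax layer, the alignment result is taken to be one phoneme index per feature frame, and the threshold value `1e-3`, the function name `fine_tune`, and all numbers are hypothetical.

```python
import math

LOW_LR_THRESHOLD = 1e-3  # assumed value; the source only requires a low threshold


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fine_tune(weights, frames, labels, learning_rate, epochs=5):
    """Continue training a pretrained linear-softmax layer on aligned frames.

    weights: [num_phonemes][num_features] pretrained parameters (second model)
    frames:  feature vectors; labels: aligned phoneme index for each frame
    Returns a new weight matrix (first model); the input is left untouched.
    """
    if learning_rate > LOW_LR_THRESHOLD:
        raise ValueError("learning rate exceeds the low learning rate threshold")
    tuned = [row[:] for row in weights]
    for _ in range(epochs):
        for x, y in zip(frames, labels):
            scores = [sum(w * xi for w, xi in zip(row, x)) for row in tuned]
            probs = softmax(scores)
            for k, row in enumerate(tuned):
                # Gradient of the cross-entropy loss with respect to score k.
                grad = probs[k] - (1.0 if k == y else 0.0)
                for j, xj in enumerate(x):
                    row[j] -= learning_rate * grad * xj
    return tuned


pretrained = [[0.5, -0.2], [-0.3, 0.4]]        # toy "second acoustic model"
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy feature frames
labels = [0, 1, 0]                             # toy alignment result
tuned = fine_tune(pretrained, frames, labels, learning_rate=1e-3)
largest_change = max(abs(a - b)
                     for r, s in zip(tuned, pretrained) for a, b in zip(r, s))
print(largest_change)  # stays small because the learning rate is small
```

With a small learning rate, the tuned weights stay close to the pretrained ones, so the model adapts to the new-language samples while retaining most of what it learned from the first language.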

S240, acquiring command voice of the new language.

S250, obtaining grammar of the command voice according to the first dictionary of the new language and the first acoustic model of the new language.

S260, decoding according to the grammar of the new language and the first acoustic model to obtain the command word of the command voice.
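One way to picture the decoding step is the Python sketch below. It is an assumption-laden illustration, not the decoder of the application: the grammar is modelled as a closed set of command words with their first-dictionary phoneme sequences, the first acoustic model is assumed to output per-frame phoneme log-probabilities, and a simple Viterbi forced alignment scores each command word. The names `alignment_score` and `decode` and all numbers are hypothetical.

```python
import math


def alignment_score(frame_logprobs, phonemes):
    """Best log-probability of aligning the frames to the phonemes in order.

    frame_logprobs: list of {phoneme: log-probability}, one dict per frame
    Each phoneme must cover at least one frame (Viterbi forced alignment).
    """
    neg_inf = -math.inf
    best = [neg_inf] * len(phonemes)
    best[0] = frame_logprobs[0].get(phonemes[0], neg_inf)
    for frame in frame_logprobs[1:]:
        new = [neg_inf] * len(phonemes)
        for i, p in enumerate(phonemes):
            stay = best[i]                                 # remain in phoneme i
            advance = best[i - 1] if i > 0 else neg_inf    # move on from phoneme i-1
            prev = max(stay, advance)
            if prev > neg_inf:
                new[i] = prev + frame.get(p, neg_inf)
        best = new
    return best[-1]  # the path must end in the last phoneme


def decode(frame_logprobs, grammar):
    """Return the command word in the grammar with the highest alignment score."""
    return max(grammar, key=lambda w: alignment_score(frame_logprobs, grammar[w]))


# Toy grammar: two placeholder command words and their phoneme sequences.
grammar = {"cmd_1": ["T", "AA"], "cmd_2": ["OW", "T"]}
# Toy acoustic model output for three frames (posteriors converted to logs).
posteriors = [
    {"T": 0.7, "AA": 0.1, "OW": 0.2},
    {"T": 0.2, "AA": 0.7, "OW": 0.1},
    {"T": 0.1, "AA": 0.8, "OW": 0.1},
]
frames = [{p: math.log(v) for p, v in frame.items()} for frame in posteriors]
print(decode(frames, grammar))  # cmd_1
```

Restricting the search to the command words in the grammar keeps decoding cheap, which suits command-word recognition on a small, fixed vocabulary.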

Example three

Fig. 3 is a structural diagram of a speech recognition apparatus according to a third embodiment of the present disclosure. As shown in fig. 3, the speech recognition apparatus includes: an acquisition unit 310, a grammar unit 320, and a decoding unit 330.

An acquisition unit 310 configured to acquire a command voice of a new language;

a grammar unit 320 configured to obtain a grammar of the command speech according to a first dictionary of the new language and a first acoustic model of the new language, the first dictionary being determined according to a second dictionary of a first language and a phoneme mapping relationship between the new language and the first language, the first acoustic model being determined according to a second acoustic model of the first language and the first dictionary, the first language being the one of the base languages whose phonemes are closer to those of the new language, and the base languages including a non-tonal language and a tonal language;

and the decoding unit 330 is configured to decode the command word of the command voice according to the grammar of the new language and the first acoustic model.

Further, the acquisition unit 310 is further configured to obtain training samples of the new language before the grammar of the command speech is obtained according to the first dictionary of the new language and the first acoustic model of the new language, where the number of sampled speakers is less than or equal to 200, every speaker records the same number of times, and the number of recordings per speaker is less than or equal to 10.

Further, the acquisition unit 310 is further configured to, before the grammar of the command speech is obtained according to the first dictionary of the new language and the first acoustic model of the new language, determine the first dictionary of the new language according to the phoneme mapping relationship between the new language and the first language, where a first phoneme in the first dictionary is a second phoneme in the second dictionary that is used to represent a pronunciation of the new language.

Further, the acquisition unit 310 is further configured to, before the grammar of the command speech is obtained according to the first dictionary of the new language and the first acoustic model of the new language, align the phonemes in the first dictionary and, according to the alignment result, iteratively fine-tune the second acoustic model at a preset learning rate to obtain the first acoustic model, where the preset learning rate is less than or equal to a low learning rate threshold, and the second acoustic model is based on a neural network.

The speech recognition apparatus provided by the embodiment of the application does not require labeling and training on hundreds of thousands of hours of new-language data. Instead, it recognizes command words of a new language quickly using the dictionary and acoustic model of an existing base language, at low cost and cost-effectively.

An embodiment of the present application further provides a device. Fig. 4 is a structural diagram of the device. As shown in fig. 4, the device includes a processor 71, a memory 72, an input device 73, and an output device 74. The device may include one or more processors 71, for example one processor 71. The processor 71, the memory 72, the input device 73, and the output device 74 may be connected by a bus or by other means; connection by a bus is taken as an example in this embodiment.

The memory 72, as a computer-readable storage medium, can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the speech recognition apparatus in the embodiments of the present application (for example, the acquisition unit 310, the grammar unit 320, and the decoding unit 330 in the speech recognition apparatus). By running the software programs, instructions, and modules stored in the memory 72, the processor 71 executes the various functional applications and data processing of the device, thereby implementing any method provided in the embodiments of the present application.

The memory 72 may mainly include a program storage area and a data storage area. The program storage area may store an operating system and an application program required for at least one function; the data storage area may store data created through use of the device, and the like. Further, the memory 72 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 72 may further include memory located remotely from the processor 71, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 73 may be used to receive input numeric or character information and generate key signal inputs relating to user settings and function control of the apparatus. The output device 74 may include a display device such as a display screen.

Embodiments of the present application also provide a storage medium containing computer-executable instructions, which when executed by a computer processor, perform a speech recognition method, comprising:

acquiring command voice of a new language;

Of course, the storage medium provided in the embodiments of the present application contains computer-executable instructions, and the computer-executable instructions are not limited to the operations of the speech recognition method described above, and may also perform related operations in the speech recognition method provided in any embodiments of the present application.

From the above description of the embodiments, it is obvious for those skilled in the art that the present application can be implemented by software and necessary general hardware, and certainly can be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments of the present application.

The above description is only exemplary embodiments of the present application, and is not intended to limit the scope of the present application.

It will be clear to a person skilled in the art that the term user terminal covers any suitable type of wireless user node, such as a mobile phone, a portable data processing device, a portable web browser or a vehicle mounted mobile station.

In general, the various embodiments of the application may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the application is not limited thereto.

Embodiments of the application may be implemented by a data processor of a mobile device executing computer program instructions, for example in a processor entity, or by hardware, or by a combination of software and hardware. The computer program instructions may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages.

Any logic flow block diagrams in the figures of this application may represent program steps, or may represent interconnected logic circuits, modules, and functions, or may represent a combination of program steps and logic circuits, modules, and functions. The computer program may be stored on a memory. The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as, but not limited to, Read Only Memory (ROM), Random Access Memory (RAM), and optical storage devices and systems (digital versatile discs (DVDs) or compact discs (CDs)). The computer readable medium may include a non-transitory storage medium. The data processor may be of any type suitable to the local technical environment, such as but not limited to general purpose computers, special purpose computers, microprocessors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), programmable logic devices (FPGAs), and processors based on a multi-core processor architecture.

The foregoing has provided, by way of exemplary and non-limiting examples, a detailed description of exemplary embodiments of the present application. Various modifications and adaptations to the foregoing embodiments may become apparent to those skilled in the relevant arts in view of the accompanying drawings and the appended claims without departing from the scope of the invention. Therefore, the proper scope of the invention is to be determined according to the claims.

Claims

1. A speech recognition method, characterized in that the method comprises:

acquiring command voice of a new language;

2. The method of claim 1, further comprising, prior to said deriving the grammar for the command speech from the first dictionary of the new language and the first acoustic model of the new language:

3. The method of claim 2, wherein the sampled speakers include equal numbers of men and women.

4. The method of claim 2, wherein the training samples comprise N synchronous recordings of the same speaker captured at equally spaced distances, wherein N is less than or equal to 3.

5. The method of any of claims 2 to 4, further comprising, prior to said deriving the grammar of the command speech from the first dictionary of the new language and the first acoustic model of the new language:

6. The method of claim 5, further comprising, prior to said deriving the grammar for the command speech from the first dictionary of the new language and the first acoustic model of the new language:

aligning the phonemes in the first dictionary;

7. The method of claim 1, wherein the non-tonal language is English and the tonal language is Chinese.

8. A speech recognition apparatus, comprising:

an acquisition unit configured to acquire a command voice of a new language;

9. An apparatus, comprising:

one or more processors;

a memory arranged to store one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.

10. A storage medium, characterized in that the storage medium stores a computer program which is executed by a processor to implement the method according to any one of claims 1-7.