CN112447168A - Voice recognition system and method, sound box, display device and interaction platform - Google Patents

Voice recognition system and method, sound box, display device and interaction platform

Info

Publication number
CN112447168A
Authority
CN
China
Prior art keywords
modeling unit
translator
unit sequence
speech recognition
acoustic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910838408.9A
Other languages
Chinese (zh)
Inventor
张仕良 (Shiliang Zhang)
雷鸣 (Ming Lei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201910838408.9A priority Critical patent/CN112447168A/en
Publication of CN112447168A publication Critical patent/CN112447168A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 1/00 Details of transducers, loudspeakers or microphones
    • H04R 1/02 Casings; Cabinets; Supports therefor; Mountings therein
    • H04R 1/028 Casings; Cabinets; Supports therefor; Mountings therein associated with devices performing functions other than acoustics, e.g. electric candles
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 2015/027 Syllables being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a voice recognition system and method, a sound box, a display device and an interaction platform. The system includes: an acoustic model for mapping the obtained acoustic features to modeling units to obtain a modeling unit sequence of the acoustic features; and a translator for translating the modeling unit sequence to obtain a translation result. The invention solves the technical problem that the complex operation of existing voice recognition systems results in low operation performance.

Description

Voice recognition system and method, sound box, display device and interaction platform
Technical Field
The invention relates to the technical field of the Internet, and in particular to a voice recognition system and method, a sound box, a display device and an interaction platform.
Background
Mainstream speech recognition systems at present adopt a hybrid framework of a deep neural network and a hidden Markov model: an acoustic model and a language model are trained independently and then combined with a pronunciation dictionary for joint modeling. This entire procedure is relatively cumbersome.
As a result, existing voice recognition systems suffer from low operation performance caused by their complex operation. Although some studies have proposed building end-to-end speech recognition systems to simplify this pipeline, such systems show no performance advantage over traditional hybrid systems.
Disclosure of Invention
The embodiment of the invention provides a voice recognition system, a voice recognition method and a sound box, which at least solve the technical problem that the complex operation of existing voice recognition systems results in low operation performance.
According to an aspect of an embodiment of the present invention, there is provided a speech recognition system including: an acoustic model for mapping the obtained acoustic features to modeling units to obtain a modeling unit sequence of the acoustic features; and a translator for translating the modeling unit sequence to obtain a translation result.
Optionally, the system further comprises: a text dictionary, wherein the modeling unit sequence is corrected using its context information in combination with the text dictionary, and the translator translates the modeling unit sequence in combination with the text dictionary to obtain the translation result.
Optionally, the acoustic model and the translator are trained separately in a training phase.
Optionally, the acoustic model is composed of a deep learning model and a Connectionist Temporal Classification (CTC) model.
Optionally, the modeling unit sequence includes: syllables and characters.
Optionally, the translator is a Transformer model.
Optionally, the system is applied in an end-to-end architecture.
According to another aspect of the embodiments of the present invention, there is provided a speech recognition method including: acquiring acoustic features through an acoustic model; mapping the acoustic features to modeling units through the acoustic model to obtain a modeling unit sequence of the acoustic features; and translating the modeling unit sequence through a translator to obtain a translation result.
Optionally, the method further includes: before the acoustic features are acquired, training the acoustic model and the translator separately in a training phase.
Optionally, translating the modeling unit sequence through the translator to obtain the translation result includes: correcting the modeling unit sequence by combining a text dictionary with its context information, and translating the modeling unit sequence through the translator in combination with the text dictionary to obtain the translation result.
Optionally, the acoustic model is composed of a deep learning model and a Connectionist Temporal Classification (CTC) model.
Optionally, the modeling unit sequence includes: syllables and characters.
According to still another aspect of embodiments of the present invention, there is provided a storage medium including a stored program, wherein, when the program is executed, an apparatus in which the storage medium is located is controlled to perform the above-described speech recognition method.
According to a further aspect of the embodiments of the present invention, there is provided an acoustic enclosure including a speech recognition system, wherein the speech recognition system is the above system.
According to an aspect of another embodiment of the present invention, there is provided a display apparatus including: a device having image and video interaction functions, wherein the device includes: a speech recognition system, wherein the speech recognition system comprises the system described above.
According to an aspect of another embodiment of the present invention, there is provided an interactive platform including: receiving voice data input online by a user; and returning recognition data according to the voice data, wherein the recognition data is obtained by translation through a speech recognition system, and the speech recognition system comprises the system described above.
Optionally, the system architecture of the interaction platform includes: an end-to-end architecture.
In the embodiment of the invention, the acoustic model maps the obtained acoustic features to a modeling unit sequence, and the translator translates the modeling unit sequence to obtain a translation result. This improves the performance of the voice recognition system, achieves the technical effect of improving its recognition accuracy, and thereby solves the technical problem that the complex operation of existing voice recognition systems results in low operation performance.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic block diagram of a speech recognition system according to a first embodiment of the present invention;
FIG. 2 is a schematic block diagram of another speech recognition system according to a first embodiment of the present invention;
FIG. 3 is a block diagram of the hardware structure of a computer terminal for a speech recognition method according to a second embodiment of the present invention; and
FIG. 4 is a flowchart of a speech recognition method according to the second embodiment of the invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Technical terms related to the present application
Acoustic feature: the acoustic feature extracted from input speech;
CTC: Connectionist Temporal Classification;
Syllable: a syllable, one possible modeling unit;
Character: a character, one possible modeling unit;
Character sequence: a sequence of character modeling units;
Text lexicon: a text dictionary;
DFSMN: Deep Feed-forward Sequential Memory Networks;
DNN-HMM: Deep Neural Networks-Hidden Markov Models, a hybrid of deep neural network and hidden Markov acoustic models.
Example 1
FIG. 1 is a schematic block diagram of a speech recognition system according to a first embodiment of the present invention. As shown in FIG. 1, according to an aspect of an embodiment of the present invention, there is provided a speech recognition system including:
an acoustic model 12 and a translator 14, wherein the acoustic model 12 is used for mapping the obtained acoustic features to modeling units to obtain a modeling unit sequence of the acoustic features, and the translator 14 is used for translating the modeling unit sequence to obtain a translation result.
The speech recognition system provided by the embodiment of the application further comprises: a text dictionary 16, wherein the modeling unit sequence is corrected using its context information through the text dictionary 16, and the translator 14 translates the modeling unit sequence in combination with the text dictionary 16 to obtain the translation result.
Specifically, FIG. 2 is a schematic block diagram of another speech recognition system according to the first embodiment of the present invention. The speech recognition system provided in the embodiment of the present application may be an end-to-end speech recognition system. As shown in FIG. 2, the end-to-end speech recognition system mainly includes two major components: an acoustic model (the Listener) and a Translator. The Listener is an acoustic model optimized based on the CTC training criterion, and implements the mapping from acoustic features to a character modeling unit sequence (taking Chinese as an example, Chinese syllables, characters, etc.). The Translator further translates the modeling unit sequence obtained by the Listener into the required translation result that is finally output, automatically correcting some of the Listener's recognition errors.
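For illustration only, the cooperation of the two components can be sketched in Python as follows. The Listener and Translator objects, their decode and translate methods, and the RecognitionResult container are hypothetical names introduced to clarify the data flow; they are not prescribed by the present application.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RecognitionResult:
    units: List[str]  # modeling unit sequence produced by the Listener
    text: str         # final translation result produced by the Translator

def recognize(listener, translator, acoustic_features) -> RecognitionResult:
    # Stage 1 (Listener): map acoustic features to a modeling unit sequence.
    units = listener.decode(acoustic_features)
    # Stage 2 (Translator): translate the unit sequence into the final text,
    # correcting Listener recognition errors using context information.
    text = translator.translate(units)
    return RecognitionResult(units=units, text=text)
```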
In the embodiment of the invention, the acoustic model maps the obtained acoustic features to a modeling unit sequence, and the translator translates this initially recognized modeling unit sequence into the final recognition result while automatically correcting recognition errors. This improves the performance of the voice recognition system, achieves the technical effect of improving its recognition precision, and thereby solves the technical problem that the complex operation of existing voice recognition systems results in low operation performance.
Optionally, the acoustic model 12 and translator 14 are trained separately during a training phase.
Optionally, the acoustic model 12 is composed of a deep learning model and a Connectionist Temporal Classification (CTC) model.
The acoustic model (AM) may be, for example: RNN+CTC, DFSMN+CTC, FSMN+CTC, and the like.
The first part (the Listener) of the end-to-end architecture provided by the embodiments of the present application is composed of a deep learning model (e.g., CNN, RNN, FSMN, DFSMN) plus CTC, where such deep learning models are referred to as acoustic models (AM) in the speech field and can be selected from the models above, and CTC is a loss function.
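As a non-limiting sketch of how such a Listener can be trained, the following PyTorch fragment pairs a generic deep learning encoder (a bidirectional LSTM standing in for the CNN/RNN/FSMN/DFSMN options above) with the CTC loss function; the layer sizes, feature dimension, and modeling-unit vocabulary size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CTCAcousticModel(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, num_units=1000):
        super().__init__()
        # Deep learning model part (an LSTM here; CNN/FSMN/DFSMN also fit).
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3,
                               bidirectional=True, batch_first=True)
        # One extra output class for the CTC blank symbol (index 0).
        self.proj = nn.Linear(2 * hidden, num_units + 1)

    def forward(self, feats):                  # feats: (batch, time, feat_dim)
        out, _ = self.encoder(feats)
        return self.proj(out).log_softmax(-1)  # (batch, time, num_units + 1)

model = CTCAcousticModel()
ctc = nn.CTCLoss(blank=0)                      # CTC as the loss function

feats = torch.randn(4, 200, 80)                # dummy batch of acoustic features
targets = torch.randint(1, 1001, (4, 30))      # dummy modeling-unit labels
log_probs = model(feats).transpose(0, 1)       # CTCLoss expects (time, batch, C)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 200),
           target_lengths=torch.full((4,), 30))
loss.backward()
```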
Optionally, the modeling unit sequence includes: syllables and characters.
In summary, the speech recognition system provided by the embodiment of the present application is an end-to-end speech recognition system: in the training stage, the Listener and the Translator are trained separately, and in the usage stage, the Listener and the Translator can be combined into one model, thereby implementing the end-to-end speech recognition system.
Based on the above, the end-to-end speech recognition system in the embodiment of the present application can directly recognize an original input speech signal to obtain the final output text, so recognition and translation of speech input by a user can be realized without a separate language model.
The speech recognition system provided by the embodiment of the application combines models from the speech recognition field and the machine translation field: the Listener can adopt a DFSMN-CTC-based acoustic model, and the Translator can adopt the Transformer model from the machine translation field. Meanwhile, massive amounts of text can additionally be utilized when training the Translator, which overcomes the drawback that existing end-to-end speech recognition systems cannot utilize additional text.
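A minimal sketch of such a Transformer-based Translator is given below; the syllable and character vocabulary sizes and layer counts are illustrative assumptions, and the causal target mask required for real training is omitted for brevity.

```python
import torch
import torch.nn as nn

class TransformerTranslator(nn.Module):
    def __init__(self, src_vocab=1000, tgt_vocab=5000, d_model=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_model)   # syllable embeddings
        self.tgt_emb = nn.Embedding(tgt_vocab, d_model)   # character embeddings
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=3,
                                          num_decoder_layers=3,
                                          batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, syllable_ids, char_ids):
        # Encode the modeling unit (syllable) sequence, decode characters.
        hidden = self.transformer(self.src_emb(syllable_ids),
                                  self.tgt_emb(char_ids))
        return self.out(hidden)                # (batch, tgt_len, tgt_vocab)

model = TransformerTranslator()
logits = model(torch.randint(0, 1000, (2, 20)),   # batch of syllable ids
               torch.randint(0, 5000, (2, 18)))   # shifted character ids
print(logits.shape)                                # torch.Size([2, 18, 5000])
```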
It should be noted that the DFSMN-CTC acoustic model and the Transformer model are only examples; they do not specifically limit the implementation of the speech recognition system provided in the embodiment of the present application.
The speech recognition system provided by the embodiment of the application obtains only a coarse modeling unit sequence using the CTC-AM; this sequence may contain errors, and many of them can be corrected by the back-end translator using context information, so the effect can be significantly improved. Moreover, because the attention-based encoder-decoder of the related art does not effectively utilize additional text, the scheme provided by the embodiment of the application can use additional text when training the translator, thereby improving system performance.
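The following sketch shows one way text alone can yield Translator training pairs: each sentence is converted to its syllable sequence with a pronunciation lexicon, producing (syllable sequence, character sequence) pairs without any audio. The choice of the pypinyin package as the lexicon is an assumption made here for illustration, not a requirement of the present application.

```python
# pip install pypinyin
from pypinyin import lazy_pinyin, Style

def make_translator_pair(sentence: str):
    """Return (source syllables, target characters) for one sentence."""
    syllables = lazy_pinyin(sentence, style=Style.TONE3)  # tone-numbered syllables
    return syllables, list(sentence)

src, tgt = make_translator_pair("今天天气很好")
# src -> ['jin1', 'tian1', 'tian1', 'qi4', 'hen3', 'hao3']
# tgt -> ['今', '天', '天', '气', '很', '好']
```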
Example 2
There is also provided, in accordance with an embodiment of the present invention, a speech recognition method embodiment. It should be noted that the steps illustrated in the flowchart of the figure may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in a different order.
The method provided by the embodiment of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking operation on a computer terminal as an example, FIG. 3 is a block diagram of the hardware structure of a computer terminal for a speech recognition method according to a second embodiment of the present invention. As shown in FIG. 3, the computer terminal 30 may include one or more processors 302 (only one is shown; the processors 302 may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 304 for storing data, and a transmission module 306 for communication functions. It will be understood by those skilled in the art that the structure shown in FIG. 3 is only an illustration and is not intended to limit the structure of the electronic device. For example, the computer terminal 30 may also include more or fewer components than shown in FIG. 3, or have a different configuration than shown in FIG. 3.
The memory 304 may be used to store software programs and modules of application software, such as program instructions/modules corresponding to the voice recognition method in the embodiment of the present invention, and the processor 302 executes various functional applications and data processing by executing the software programs and modules stored in the memory 304, that is, implementing the voice recognition method of the application program. The memory 304 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 304 may further include memory located remotely from the processor 302, which may be connected to the computer terminal 30 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission module 306 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 30. In one example, the transmission module 306 includes a Network Interface Controller (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission module 306 can be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
Under the above operating environment, the present application provides a speech recognition method as shown in fig. 4. Fig. 4 is a flowchart of a speech recognition method according to a second embodiment of the invention.
Step S402, acquiring acoustic features through an acoustic model;
Step S404, mapping the acoustic features to modeling units through the acoustic model to obtain a modeling unit sequence of the acoustic features;
Step S406, translating the modeling unit sequence through the translator to obtain a translation result.
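By way of illustration, step S404 is commonly realized with the standard greedy CTC decoding rule: take the best class per frame, collapse repeated labels, then remove blanks. The sketch below assumes this standard rule; the present application does not itself fix a particular decoder.

```python
import torch

def greedy_ctc_decode(log_probs: torch.Tensor, blank: int = 0):
    """log_probs: (time, num_classes) -> list of modeling unit ids."""
    best = log_probs.argmax(dim=-1).tolist()   # best class for each frame
    units, prev = [], blank
    for label in best:
        if label != blank and label != prev:   # collapse repeats, drop blanks
            units.append(label)
        prev = label
    return units

# Frame labels [0, 7, 7, 0, 7, 3, 3] decode to the unit sequence [7, 7, 3].
demo = torch.randn(50, 11).log_softmax(-1)
print(greedy_ctc_decode(demo))
```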
Specifically, the speech recognition method provided in the embodiment of the present application can be applied to the speech recognition system in Embodiment 1, that is, an end-to-end speech recognition system. As shown in FIG. 2, the end-to-end speech recognition system mainly includes two major components: an acoustic model (the Listener) and a Translator, where the Listener is an acoustic model optimized based on the CTC training criterion and realizes the mapping from acoustic features to a sequence of modeling units (for example, Chinese syllables, characters, etc.). The Translator further translates the modeling unit sequence obtained by the Listener into the required translation result that is finally output, automatically correcting some of the Listener's recognition errors.
In the embodiment of the invention, acoustic features are acquired through an acoustic model; the acoustic model maps the acoustic features to a modeling unit sequence; and the translator translates the modeling unit sequence to obtain a translation result. This improves the performance of the voice recognition system, achieves the technical effect of improving recognition precision, and thereby solves the technical problem that the complex operation of existing voice recognition systems results in low operation performance.
Optionally, the speech recognition method provided in the embodiment of the present application further includes: before the acoustic features are acquired, training the acoustic model and the translator separately in a training phase.
Optionally, translating the modeling unit sequence through the translator to obtain the translation result includes: correcting the modeling unit sequence by combining a text dictionary with its context information, and translating the modeling unit sequence through the translator in combination with the text dictionary to obtain the translation result.
Optionally, the acoustic model is composed of a deep learning model and a Connectionist Temporal Classification (CTC) model.
Optionally, the modeling unit sequence includes: syllables and characters.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the speech recognition method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 3
According to still another aspect of embodiments of the present invention, there is provided a storage medium including a stored program, wherein, when the program is executed, an apparatus in which the storage medium is located is controlled to perform the above-described speech recognition method.
Example 4
According to a further aspect of an embodiment of the present invention, there is provided an acoustic enclosure including a speech recognition system, wherein the speech recognition system is the system of embodiment 1.
Example 5
According to a further aspect of the embodiments of the present invention, there is provided a processor for executing a program, wherein the program executes the above-mentioned speech recognition method.
Example 6
According to an aspect of another embodiment of the present invention, there is provided a display apparatus including: a device having image and video interaction functions, wherein the device includes: a speech recognition system, wherein the speech recognition system comprises the system of Embodiment 1.
The display device in the embodiment of the present application may be any device having a voice interaction function, such as a vehicle-mounted display device, an intelligent robot, or a terminal of the Internet of Things (e.g., a vending machine); the display device provided in the embodiment of the present application is therefore not particularly limited.
Example 7
According to an aspect of another embodiment of the present invention, there is provided an interactive platform including: receiving voice data input online by a user; and returning recognition data according to the voice data, wherein the recognition data is obtained by translation through a speech recognition system, and the speech recognition system comprises the system of Embodiment 1.
Optionally, the system architecture of the interaction platform includes: an end-to-end architecture.
Specifically, based on an end-to-end architecture, the embodiment of the present application provides an online interaction platform that obtains voice data input online by a user, translates the voice data through the speech recognition system to obtain recognition data, and returns the recognition data to the user, thereby implementing intelligent online interaction. The online interaction platform can be applied to a consultation website, an e-commerce transaction website, or a language translation terminal (such as a translation app or an online translation website).
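A minimal sketch of this online interaction flow is shown below, assuming a Flask HTTP endpoint and a placeholder recognizer object; the endpoint path and all names are illustrative and not part of the present application.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

class PlaceholderRecognizer:
    """Stand-in for the Listener + Translator system described above."""
    def recognize(self, audio_bytes: bytes) -> str:
        return "<recognized text>"

recognizer = PlaceholderRecognizer()

@app.route("/recognize", methods=["POST"])
def recognize_voice():
    voice_data = request.get_data()               # raw voice data from the user
    recognition = recognizer.recognize(voice_data)
    return jsonify({"recognition": recognition})  # return the recognition data

if __name__ == "__main__":
    app.run()
```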
Example 8
The embodiment of the invention also provides a storage medium. Optionally, in this embodiment, the storage medium may be configured to store the program code for executing the speech recognition method provided in the above embodiment.
Optionally, in this embodiment, the storage medium may be located in any one of computer terminals in a computer terminal group in a computer network, or in any one of mobile terminals in a mobile terminal group.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring acoustic features through an acoustic model; mapping the acoustic features to modeling units through the acoustic model to obtain a modeling unit sequence of the acoustic features; and translating the modeling unit sequence through a translator to obtain a translation result.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: before the acoustic features are acquired, in a training phase, the acoustic model and the translator are trained respectively.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: translating the modeling unit sequence through the translator to obtain a translation result, which includes: correcting the modeling unit sequence by combining a text dictionary with its context information, and translating the modeling unit sequence through the translator in combination with the text dictionary to obtain the translation result.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: the acoustic model is composed of a deep learning model and a Connectionist Temporal Classification (CTC) model.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: the sequence of modeling units includes: syllables and characters.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (17)

1. A speech recognition system comprising:
an acoustic model and a translator, wherein,
the acoustic model is used for mapping the obtained acoustic features to modeling units to obtain the modeling unit sequence of the acoustic features;
and the translator is used for translating the modeling unit sequence to obtain a translation result.
2. The system of claim 1, wherein the system further comprises: a text dictionary, wherein the modeling unit sequence is corrected using its context information in combination with the text dictionary, and the translator translates the modeling unit sequence in combination with the text dictionary to obtain the translation result.
3. The system of claim 1, wherein the acoustic model and the translator are trained separately in a training phase.
4. The system of claim 1, wherein the acoustic model is composed of a deep learning model and a Connectionist Temporal Classification (CTC) model.
5. The system of claim 1, wherein the sequence of modeling units comprises: syllables and characters.
6. The system of claim 1, wherein the translator is a Transformer model.
7. The system of claim 1, wherein the system is applied in an end-to-end architecture.
8. A speech recognition method comprising:
acquiring acoustic features through an acoustic model;
mapping the acoustic features to modeling units through the acoustic model to obtain the modeling unit sequence of the acoustic features;
and translating the modeling unit sequence through a translator to obtain a translation result.
9. The method of claim 8, wherein the method further comprises:
before the acoustic features are acquired, in a training phase, the acoustic model and the translator are trained respectively.
10. The method of claim 8, wherein translating the modeling unit sequence through a translator to obtain a translation result comprises: correcting the modeling unit sequence by combining a text dictionary with its context information, and translating the modeling unit sequence through the translator in combination with the text dictionary to obtain the translation result.
11. The method of claim 8, wherein the acoustic model is composed of a deep learning model and a Connectionist Temporal Classification (CTC) model.
12. The method of claim 8, wherein the sequence of modeling units comprises: syllables and characters.
13. A storage medium comprising a stored program, wherein the apparatus on which the storage medium is located is controlled to perform the speech recognition method of claim 8 when the program is run.
14. An acoustic enclosure, comprising: a speech recognition system, wherein the speech recognition system comprises the system of claim 1.
15. A display device, comprising: the equipment that possesses image, video interaction function, wherein, the equipment includes: a speech recognition system, wherein the speech recognition system comprises the system of claim 1.
16. An interaction platform, comprising:
receiving voice data input online by a user;
returning recognition data according to the voice data, wherein the recognition data is obtained by translation through a speech recognition system; wherein the speech recognition system comprises the system of claim 1.
17. The interaction platform of claim 16, wherein the system architecture of the interaction platform comprises: an end-to-end architecture.
CN201910838408.9A 2019-09-05 2019-09-05 Voice recognition system and method, sound box, display device and interaction platform Pending CN112447168A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910838408.9A CN112447168A (en) 2019-09-05 2019-09-05 Voice recognition system and method, sound box, display device and interaction platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910838408.9A CN112447168A (en) 2019-09-05 2019-09-05 Voice recognition system and method, sound box, display device and interaction platform

Publications (1)

Publication Number Publication Date
CN112447168A true CN112447168A (en) 2021-03-05

Family

ID=74733255

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910838408.9A Pending CN112447168A (en) 2019-09-05 2019-09-05 Voice recognition system and method, sound box, display device and interaction platform

Country Status (1)

Country Link
CN (1) CN112447168A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111968629A (en) * 2020-07-08 2020-11-20 重庆邮电大学 Chinese speech recognition method combining Transformer and CNN-DFSMN-CTC
CN113160827A (en) * 2021-04-07 2021-07-23 深圳鱼亮科技有限公司 Voice transcription system and method based on multi-language model
CN113555016A (en) * 2021-06-24 2021-10-26 北京房江湖科技有限公司 Voice interaction method, electronic equipment and readable storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1670728A (en) * 2003-10-23 2005-09-21 微软公司 Full-form lexicon with tagged data and methods of constructing and using the same
CN103941870A (en) * 2014-04-21 2014-07-23 百度在线网络技术(北京)有限公司 Head-mounted display device
CN105068998A (en) * 2015-07-29 2015-11-18 百度在线网络技术(北京)有限公司 Translation method and translation device based on neural network model
CN105869624A (en) * 2016-03-29 2016-08-17 腾讯科技(深圳)有限公司 Method and apparatus for constructing speech decoding network in digital speech recognition
CN108268452A (en) * 2018-01-15 2018-07-10 东北大学 A kind of professional domain machine synchronous translation device and method based on deep learning
CN108417202A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 Audio recognition method and system
CN108460027A (en) * 2018-02-14 2018-08-28 广东外语外贸大学 A kind of spoken language instant translation method and system
CN108628859A (en) * 2018-04-25 2018-10-09 合肥上城信息技术有限公司 A kind of real-time voice translation system
CN108829685A (en) * 2018-05-07 2018-11-16 内蒙古工业大学 A kind of illiteracy Chinese inter-translation method based on single language training
CN109346064A (en) * 2018-12-13 2019-02-15 苏州思必驰信息科技有限公司 Training method and system for end-to-end speech identification model
CN109976702A (en) * 2019-03-20 2019-07-05 青岛海信电器股份有限公司 A kind of audio recognition method, device and terminal

Similar Documents

Publication Publication Date Title
JP2021089705A (en) Method and device for evaluating translation quality
CN109101545A (en) Natural language processing method, apparatus, equipment and medium based on human-computer interaction
CN110942763B (en) Speech recognition method and device
CN110930980B (en) Acoustic recognition method and system for Chinese and English mixed voice
CN113505205A (en) System and method for man-machine conversation
CN109543200A (en) A kind of text interpretation method and device
CN112447168A (en) Voice recognition system and method, sound box, display device and interaction platform
CN111523324B (en) Named entity recognition model training method and device
CN111523305A (en) Text error correction method, device and system
CN111144137B (en) Method and device for generating corpus of machine post-translation editing model
CN111209740A (en) Text model training method, text error correction method, electronic device and storage medium
CN108304376B (en) Text vector determination method and device, storage medium and electronic device
CN112765981A (en) Text information generation method and device
CN111477212A (en) Content recognition, model training and data processing method, system and equipment
CN112837674B (en) Voice recognition method, device, related system and equipment
CN112242132A (en) Data labeling method, device and system in speech synthesis
CN108427671B (en) Information conversion method and apparatus, storage medium, and electronic apparatus
CN112687271B (en) Voice translation method and device, electronic equipment and storage medium
CN112002325B (en) Multi-language voice interaction method and device
CN107423293A (en) The method and apparatus of data translation
US20220262353A1 (en) Method and device for Processing Voice Information, Storage Medium and Electronic Apparatus
CN110890085B (en) Voice recognition method and system
CN107291676B (en) Method for cutting off voice file, terminal equipment and computer storage medium
CN111161737A (en) Data processing method and device, electronic equipment and storage medium
CN110929504A (en) Statement diagnosis method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination