CN112447168A - Voice recognition system and method, sound box, display device and interaction platform - Google Patents

Voice recognition system and method, sound box, display device and interaction platform

Info

Publication number
CN112447168A
Authority
CN
China
Prior art keywords
modeling unit
translator
unit sequence
speech recognition
acoustic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910838408.9A
Other languages
Chinese (zh)
Inventor
张仕良 (Shiliang Zhang)
雷鸣 (Ming Lei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201910838408.9A priority Critical patent/CN112447168A/en
Publication of CN112447168A publication Critical patent/CN112447168A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 1/00 Details of transducers, loudspeakers or microphones
    • H04R 1/02 Casings; Cabinets; Supports therefor; Mountings therein
    • H04R 1/028 Casings; Cabinets; Supports therefor; Mountings therein associated with devices performing functions other than acoustics, e.g. electric candles
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 2015/027 Syllables being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a voice recognition system and method, a sound box, a display device and an interaction platform. The system includes: an acoustic model for mapping the obtained acoustic features to modeling units to obtain a modeling unit sequence of the acoustic features; and a translator for translating the modeling unit sequence to obtain a translation result. The invention solves the technical problem that the complex operation of existing voice recognition systems results in low operation performance.

Description

Voice recognition system and method, sound box, display device and interaction platform
Technical Field
The invention relates to the technical field of the Internet, and in particular to a voice recognition system and method, a sound box, a display device and an interaction platform.
Background
Mainstream speech recognition systems at present adopt a hybrid framework of a deep neural network and a hidden Markov model: an acoustic model and a language model are trained independently and then combined with a pronunciation dictionary for joint modeling. This entire procedure is relatively cumbersome.
As a result, existing voice recognition systems suffer from low operation performance caused by their complex operation. Although some studies have proposed building end-to-end speech recognition systems to simplify this pipeline, such systems show no performance advantage over traditional hybrid systems.
Disclosure of Invention
The embodiment of the invention provides a voice recognition system, a voice recognition method and a sound box, which at least solve the technical problem that the complex operation of existing voice recognition systems results in low operation performance.
According to an aspect of an embodiment of the present invention, there is provided a speech recognition system including: an acoustic model for mapping the obtained acoustic features to modeling units to obtain a modeling unit sequence of the acoustic features; and a translator for translating the modeling unit sequence to obtain a translation result.
Optionally, the system further comprises: a text dictionary, wherein the modeling unit sequence is corrected using its context information in combination with the text dictionary, and the translator translates the modeling unit sequence in combination with the text dictionary to obtain the translation result.
Optionally, the acoustic model and the translator are trained separately in a training phase.
Optionally, the acoustic model is composed of a deep learning model and a Connectionist Temporal Classification (CTC) model.
Optionally, the modeling unit sequence includes: syllables and characters.
Optionally, the translator is a Transformer model.
Optionally, the system is applied in an end-to-end architecture.
According to another aspect of the embodiments of the present invention, there is provided a speech recognition method including: acquiring acoustic features through an acoustic model; mapping the acoustic features to modeling units through the acoustic model to obtain a modeling unit sequence of the acoustic features; and translating the modeling unit sequence through a translator to obtain a translation result.
Optionally, the method further includes: before the acoustic features are acquired, training the acoustic model and the translator separately in a training phase.
Optionally, translating the modeling unit sequence through the translator to obtain the translation result includes: correcting the modeling unit sequence by combining a text dictionary with its context information, and translating the modeling unit sequence through the translator in combination with the text dictionary to obtain the translation result.
Optionally, the acoustic model is composed of a deep learning model and a Connectionist Temporal Classification (CTC) model.
Optionally, the modeling unit sequence includes: syllables and characters.
According to still another aspect of embodiments of the present invention, there is provided a storage medium including a stored program, wherein, when the program is executed, an apparatus in which the storage medium is located is controlled to perform the above-described speech recognition method.
According to a further aspect of the embodiments of the present invention, there is provided an acoustic enclosure including a speech recognition system, wherein the speech recognition system is the above system.
According to an aspect of another embodiment of the present invention, there is provided a display apparatus including: a device having image and video interaction functions, wherein the device includes: a speech recognition system, wherein the speech recognition system comprises the system described above.
According to an aspect of another embodiment of the present invention, there is provided an interactive platform including: receiving voice data input online by a user; and returning recognition data according to the voice data, wherein the recognition data is obtained by translation through a speech recognition system, and the speech recognition system comprises the system described above.
Optionally, the system architecture of the interaction platform includes: an end-to-end architecture.
In the embodiment of the invention, the acoustic model maps the obtained acoustic features to a modeling unit sequence, and the translator translates the modeling unit sequence to obtain a translation result. This improves the performance of the voice recognition system, achieves the technical effect of improving its recognition accuracy, and thereby solves the technical problem that the complex operation of existing voice recognition systems results in low operation performance.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic block diagram of a speech recognition system according to a first embodiment of the present invention;
FIG. 2 is a schematic block diagram of another speech recognition system according to a first embodiment of the present invention;
FIG. 3 is a block diagram of the hardware structure of a computer terminal for a speech recognition method according to a second embodiment of the present invention; and
FIG. 4 is a flowchart of a speech recognition method according to the second embodiment of the invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Technical terms related to the present application
Acoustic feature: the acoustic feature extracted from input speech;
CTC: Connectionist Temporal Classification;
Syllable: a syllable, one possible modeling unit;
Character: a character, one possible modeling unit;
Character sequence: a sequence of character modeling units;
Text lexicon: a text dictionary;
DFSMN: Deep Feed-forward Sequential Memory Networks;
DNN-HMM: Deep Neural Networks-Hidden Markov Models, a hybrid of deep neural network and hidden Markov acoustic models.
Example 1
FIG. 1 is a schematic block diagram of a speech recognition system according to a first embodiment of the present invention. As shown in FIG. 1, according to an aspect of an embodiment of the present invention, there is provided a speech recognition system including:
an acoustic model 12 and a translator 14, wherein the acoustic model 12 is used for mapping the obtained acoustic features to modeling units to obtain a modeling unit sequence of the acoustic features, and the translator 14 is used for translating the modeling unit sequence to obtain a translation result.
The speech recognition system provided by the embodiment of the application further comprises: a text dictionary 16, wherein the modeling unit sequence is corrected using its context information through the text dictionary 16, and the translator 14 translates the modeling unit sequence in combination with the text dictionary 16 to obtain the translation result.
Specifically, FIG. 2 is a schematic block diagram of another speech recognition system according to the first embodiment of the present invention. The speech recognition system provided in the embodiment of the present application may be an end-to-end speech recognition system. As shown in FIG. 2, the end-to-end speech recognition system mainly includes two major components: an acoustic model (the Listener) and a Translator. The Listener is an acoustic model optimized based on the CTC training criterion, and implements the mapping from acoustic features to a character modeling unit sequence (taking Chinese as an example, Chinese syllables, characters, etc.). The Translator further translates the modeling unit sequence obtained by the Listener into the required translation result that is finally output, automatically correcting some of the Listener's recognition errors.
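For illustration only, the cooperation of the two components can be sketched in Python as follows. The Listener and Translator objects, their decode and translate methods, and the RecognitionResult container are hypothetical names introduced to clarify the data flow; they are not prescribed by the present application.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RecognitionResult:
    units: List[str]  # modeling unit sequence produced by the Listener
    text: str         # final translation result produced by the Translator

def recognize(listener, translator, acoustic_features) -> RecognitionResult:
    # Stage 1 (Listener): map acoustic features to a modeling unit sequence.
    units = listener.decode(acoustic_features)
    # Stage 2 (Translator): translate the unit sequence into the final text,
    # correcting Listener recognition errors using context information.
    text = translator.translate(units)
    return RecognitionResult(units=units, text=text)
```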
In the embodiment of the invention, the acoustic model maps the obtained acoustic features to a modeling unit sequence, and the translator translates this initially recognized modeling unit sequence into the final recognition result while automatically correcting recognition errors. This improves the performance of the voice recognition system, achieves the technical effect of improving its recognition precision, and thereby solves the technical problem that the complex operation of existing voice recognition systems results in low operation performance.
Optionally, the acoustic model 12 and translator 14 are trained separately during a training phase.
Optionally, the acoustic model 12 is composed of a deep learning model and a Connectionist Temporal Classification (CTC) model.
The acoustic model (AM) may be, for example: RNN+CTC, DFSMN+CTC, FSMN+CTC, and the like.
The first part (the Listener) of the end-to-end architecture provided by the embodiments of the present application is composed of a deep learning model (e.g., CNN, RNN, FSMN, DFSMN) plus CTC, where such deep learning models are referred to as acoustic models (AM) in the speech field and can be selected from the models above, and CTC is a loss function.
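As a non-limiting sketch of how such a Listener can be trained, the following PyTorch fragment pairs a generic deep learning encoder (a bidirectional LSTM standing in for the CNN/RNN/FSMN/DFSMN options above) with the CTC loss function; the layer sizes, feature dimension, and modeling-unit vocabulary size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CTCAcousticModel(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, num_units=1000):
        super().__init__()
        # Deep learning model part (an LSTM here; CNN/FSMN/DFSMN also fit).
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3,
                               bidirectional=True, batch_first=True)
        # One extra output class for the CTC blank symbol (index 0).
        self.proj = nn.Linear(2 * hidden, num_units + 1)

    def forward(self, feats):                  # feats: (batch, time, feat_dim)
        out, _ = self.encoder(feats)
        return self.proj(out).log_softmax(-1)  # (batch, time, num_units + 1)

model = CTCAcousticModel()
ctc = nn.CTCLoss(blank=0)                      # CTC as the loss function

feats = torch.randn(4, 200, 80)                # dummy batch of acoustic features
targets = torch.randint(1, 1001, (4, 30))      # dummy modeling-unit labels
log_probs = model(feats).transpose(0, 1)       # CTCLoss expects (time, batch, C)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 200),
           target_lengths=torch.full((4,), 30))
loss.backward()
```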
Optionally, the modeling unit sequence includes: syllables and characters.
In summary, the speech recognition system provided by the embodiment of the present application is an end-to-end speech recognition system: in the training stage, the Listener and the Translator are trained separately, and in the usage stage, the Listener and the Translator can be combined into one model, thereby implementing the end-to-end speech recognition system.
Based on the above, the end-to-end speech recognition system in the embodiment of the present application can directly recognize an original input speech signal to obtain the final output text, so recognition and translation of speech input by a user can be realized without a separate language model.
The speech recognition system provided by the embodiment of the application combines models from the speech recognition field and the machine translation field: the Listener can adopt a DFSMN-CTC-based acoustic model, and the Translator can adopt the Transformer model from the machine translation field. Meanwhile, massive amounts of text can additionally be utilized when training the Translator, which overcomes the drawback that existing end-to-end speech recognition systems cannot utilize additional text.
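A minimal sketch of such a Transformer-based Translator is given below; the syllable and character vocabulary sizes and layer counts are illustrative assumptions, and the causal target mask required for real training is omitted for brevity.

```python
import torch
import torch.nn as nn

class TransformerTranslator(nn.Module):
    def __init__(self, src_vocab=1000, tgt_vocab=5000, d_model=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_model)   # syllable embeddings
        self.tgt_emb = nn.Embedding(tgt_vocab, d_model)   # character embeddings
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=3,
                                          num_decoder_layers=3,
                                          batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, syllable_ids, char_ids):
        # Encode the modeling unit (syllable) sequence, decode characters.
        hidden = self.transformer(self.src_emb(syllable_ids),
                                  self.tgt_emb(char_ids))
        return self.out(hidden)                # (batch, tgt_len, tgt_vocab)

model = TransformerTranslator()
logits = model(torch.randint(0, 1000, (2, 20)),   # batch of syllable ids
               torch.randint(0, 5000, (2, 18)))   # shifted character ids
print(logits.shape)                                # torch.Size([2, 18, 5000])
```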
It should be noted that the DFSMN-CTC acoustic model and the Transformer model are only examples; they do not specifically limit the implementation of the speech recognition system provided in the embodiment of the present application.
The speech recognition system provided by the embodiment of the application obtains only a coarse modeling unit sequence using the CTC-AM; this sequence may contain errors, and many of them can be corrected by the back-end translator using context information, so the effect can be significantly improved. Moreover, because the attention-based encoder-decoder of the related art does not effectively utilize additional text, the scheme provided by the embodiment of the application can use additional text when training the translator, thereby improving system performance.
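The following sketch shows one way text alone can yield Translator training pairs: each sentence is converted to its syllable sequence with a pronunciation lexicon, producing (syllable sequence, character sequence) pairs without any audio. The choice of the pypinyin package as the lexicon is an assumption made here for illustration, not a requirement of the present application.

```python
# pip install pypinyin
from pypinyin import lazy_pinyin, Style

def make_translator_pair(sentence: str):
    """Return (source syllables, target characters) for one sentence."""
    syllables = lazy_pinyin(sentence, style=Style.TONE3)  # tone-numbered syllables
    return syllables, list(sentence)

src, tgt = make_translator_pair("今天天气很好")
# src -> ['jin1', 'tian1', 'tian1', 'qi4', 'hen3', 'hao3']
# tgt -> ['今', '天', '天', '气', '很', '好']
```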
Example 2
There is also provided, in accordance with an embodiment of the present invention, a speech recognition method embodiment. It should be noted that the steps illustrated in the flowchart of the figure may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in a different order.
The method provided by the embodiment of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking operation on a computer terminal as an example, FIG. 3 is a block diagram of the hardware structure of a computer terminal for a speech recognition method according to a second embodiment of the present invention. As shown in FIG. 3, the computer terminal 30 may include one or more processors 302 (only one is shown; the processors 302 may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 304 for storing data, and a transmission module 306 for communication functions. It will be understood by those skilled in the art that the structure shown in FIG. 3 is only an illustration and is not intended to limit the structure of the electronic device. For example, the computer terminal 30 may also include more or fewer components than shown in FIG. 3, or have a different configuration than shown in FIG. 3.
The memory 304 may be used to store software programs and modules of application software, such as program instructions/modules corresponding to the voice recognition method in the embodiment of the present invention, and the processor 302 executes various functional applications and data processing by executing the software programs and modules stored in the memory 304, that is, implementing the voice recognition method of the application program. The memory 304 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 304 may further include memory located remotely from the processor 302, which may be connected to the computer terminal 30 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission module 306 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 30. In one example, the transmission module 306 includes a Network Interface Controller (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission module 306 can be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
Under the above operating environment, the present application provides a speech recognition method as shown in fig. 4. Fig. 4 is a flowchart of a speech recognition method according to a second embodiment of the invention.
Step S402, acquiring acoustic features through an acoustic model;
Step S404, mapping the acoustic features to modeling units through the acoustic model to obtain a modeling unit sequence of the acoustic features;
Step S406, translating the modeling unit sequence through the translator to obtain a translation result.
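By way of illustration, step S404 is commonly realized with the standard greedy CTC decoding rule: take the best class per frame, collapse repeated labels, then remove blanks. The sketch below assumes this standard rule; the present application does not itself fix a particular decoder.

```python
import torch

def greedy_ctc_decode(log_probs: torch.Tensor, blank: int = 0):
    """log_probs: (time, num_classes) -> list of modeling unit ids."""
    best = log_probs.argmax(dim=-1).tolist()   # best class for each frame
    units, prev = [], blank
    for label in best:
        if label != blank and label != prev:   # collapse repeats, drop blanks
            units.append(label)
        prev = label
    return units

# Frame labels [0, 7, 7, 0, 7, 3, 3] decode to the unit sequence [7, 7, 3].
demo = torch.randn(50, 11).log_softmax(-1)
print(greedy_ctc_decode(demo))
```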
Specifically, the speech recognition method provided in the embodiment of the present application can be applied to the speech recognition system in Embodiment 1, that is, an end-to-end speech recognition system. As shown in FIG. 2, the end-to-end speech recognition system mainly includes two major components: an acoustic model (the Listener) and a Translator, where the Listener is an acoustic model optimized based on the CTC training criterion and realizes the mapping from acoustic features to a sequence of modeling units (for example, Chinese syllables, characters, etc.). The Translator further translates the modeling unit sequence obtained by the Listener into the required translation result that is finally output, automatically correcting some of the Listener's recognition errors.
In the embodiment of the invention, acoustic features are acquired through an acoustic model; the acoustic model maps the acoustic features to a modeling unit sequence; and the translator translates the modeling unit sequence to obtain a translation result. This improves the performance of the voice recognition system, achieves the technical effect of improving recognition precision, and thereby solves the technical problem that the complex operation of existing voice recognition systems results in low operation performance.
Optionally, the speech recognition method provided in the embodiment of the present application further includes: before the acoustic features are acquired, training the acoustic model and the translator separately in a training phase.
Optionally, translating the modeling unit sequence through the translator to obtain the translation result includes: correcting the modeling unit sequence by combining a text dictionary with its context information, and translating the modeling unit sequence through the translator in combination with the text dictionary to obtain the translation result.
Optionally, the acoustic model is composed of a deep learning model and a Connectionist Temporal Classification (CTC) model.
Optionally, the modeling unit sequence includes: syllables and characters.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the speech recognition method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 3
According to still another aspect of embodiments of the present invention, there is provided a storage medium including a stored program, wherein, when the program is executed, an apparatus in which the storage medium is located is controlled to perform the above-described speech recognition method.
Example 4
According to a further aspect of an embodiment of the present invention, there is provided an acoustic enclosure including a speech recognition system, wherein the speech recognition system is the system of embodiment 1.
Example 5
According to a further aspect of the embodiments of the present invention, there is provided a processor for executing a program, wherein the program executes the above-mentioned speech recognition method.
Example 6
According to an aspect of another embodiment of the present invention, there is provided a display apparatus including: a device having image and video interaction functions, wherein the device includes: a speech recognition system, wherein the speech recognition system comprises the system of Embodiment 1.
The display device in the embodiment of the present application may be any device having a voice interaction function, such as a vehicle-mounted display device, an intelligent robot, or a terminal of the Internet of Things (e.g., a vending machine); the display device provided in the embodiment of the present application is therefore not particularly limited.
Example 7
According to an aspect of another embodiment of the present invention, there is provided an interactive platform including: receiving voice data input online by a user; and returning recognition data according to the voice data, wherein the recognition data is obtained by translation through a speech recognition system, and the speech recognition system comprises the system of Embodiment 1.
Optionally, the system architecture of the interaction platform includes: an end-to-end architecture.
Specifically, based on an end-to-end architecture, the embodiment of the present application provides an online interaction platform that obtains voice data input online by a user, translates the voice data through the speech recognition system to obtain recognition data, and returns the recognition data to the user, thereby implementing intelligent online interaction. The online interaction platform can be applied to a consultation website, an e-commerce transaction website, or a language translation terminal (such as a translation app or an online translation website).
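A minimal sketch of this online interaction flow is shown below, assuming a Flask HTTP endpoint and a placeholder recognizer object; the endpoint path and all names are illustrative and not part of the present application.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

class PlaceholderRecognizer:
    """Stand-in for the Listener + Translator system described above."""
    def recognize(self, audio_bytes: bytes) -> str:
        return "<recognized text>"

recognizer = PlaceholderRecognizer()

@app.route("/recognize", methods=["POST"])
def recognize_voice():
    voice_data = request.get_data()               # raw voice data from the user
    recognition = recognizer.recognize(voice_data)
    return jsonify({"recognition": recognition})  # return the recognition data

if __name__ == "__main__":
    app.run()
```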
Example 8
The embodiment of the invention also provides a storage medium. Optionally, in this embodiment, the storage medium may be configured to store the program code for executing the speech recognition method provided in the above embodiment.
Optionally, in this embodiment, the storage medium may be located in any one of computer terminals in a computer terminal group in a computer network, or in any one of mobile terminals in a mobile terminal group.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring acoustic features through an acoustic model; mapping the acoustic features to modeling units through the acoustic model to obtain a modeling unit sequence of the acoustic features; and translating the modeling unit sequence through a translator to obtain a translation result.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: before the acoustic features are acquired, in a training phase, the acoustic model and the translator are trained respectively.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: translating the modeling unit sequence through the translator to obtain a translation result, which includes: correcting the modeling unit sequence by combining a text dictionary with its context information, and translating the modeling unit sequence through the translator in combination with the text dictionary to obtain the translation result.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: the acoustic model is composed of a deep learning model and a Connectionist Temporal Classification (CTC) model.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: the sequence of modeling units includes: syllables and characters.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (17)

1. A speech recognition system comprising:
an acoustic model and a translator, wherein,
the acoustic model is used for mapping the obtained acoustic features to modeling units to obtain the modeling unit sequence of the acoustic features;
and the translator is used for translating the modeling unit sequence to obtain a translation result.
2. The system of claim 1, wherein the system further comprises: a text dictionary, wherein the modeling unit sequence is corrected using its context information in combination with the text dictionary, and the translator translates the modeling unit sequence in combination with the text dictionary to obtain the translation result.
3. The system of claim 1, wherein the acoustic model and the translator are trained separately in a training phase.
4. The system of claim 1, wherein the acoustic model is composed of a deep learning model and a Connectionist Temporal Classification (CTC) model.
5. The system of claim 1, wherein the sequence of modeling units comprises: syllables and characters.
6. The system of claim 1, wherein the translator is a Transformer model.
7. The system of claim 1, wherein the system is applied in an end-to-end architecture.
8. A speech recognition method comprising:
acquiring acoustic features through an acoustic model;
mapping the acoustic features to modeling units through the acoustic model to obtain the modeling unit sequence of the acoustic features;
and translating the modeling unit sequence through a translator to obtain a translation result.
9. The method of claim 8, wherein the method further comprises:
before the acoustic features are acquired, in a training phase, the acoustic model and the translator are trained respectively.
10. The method of claim 8, wherein translating the modeling unit sequence through a translator to obtain a translation result comprises: correcting the modeling unit sequence by combining a text dictionary with its context information, and translating the modeling unit sequence through the translator in combination with the text dictionary to obtain the translation result.
11. The method of claim 8, wherein the acoustic model is composed of a deep learning model and a Connectionist Temporal Classification (CTC) model.
12. The method of claim 8, wherein the sequence of modeling units comprises: syllables and characters.
13. A storage medium comprising a stored program, wherein the apparatus on which the storage medium is located is controlled to perform the speech recognition method of claim 8 when the program is run.
14. An acoustic enclosure, comprising: a speech recognition system, wherein the speech recognition system comprises the system of claim 1.
15. A display device, comprising: the equipment that possesses image, video interaction function, wherein, the equipment includes: a speech recognition system, wherein the speech recognition system comprises the system of claim 1.
16. An interaction platform, comprising:
receiving voice data input online by a user;
returning recognition data according to the voice data, wherein the recognition data is obtained by translation through a speech recognition system; wherein the speech recognition system comprises the system of claim 1.
17. The interaction platform of claim 16, wherein the system architecture of the interaction platform comprises: an end-to-end architecture.
CN201910838408.9A 2019-09-05 2019-09-05 Voice recognition system and method, sound box, display device and interaction platform Pending CN112447168A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910838408.9A CN112447168A (en) 2019-09-05 2019-09-05 Voice recognition system and method, sound box, display device and interaction platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910838408.9A CN112447168A (en) 2019-09-05 2019-09-05 Voice recognition system and method, sound box, display device and interaction platform

Publications (1)

Publication Number Publication Date
CN112447168A true CN112447168A (en) 2021-03-05

Family

ID=74733255

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910838408.9A Pending CN112447168A (en) 2019-09-05 2019-09-05 Voice recognition system and method, sound box, display device and interaction platform

Country Status (1)

Country Link
CN (1) CN112447168A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111968629A (en) * 2020-07-08 2020-11-20 重庆邮电大学 Chinese speech recognition method combining Transformer and CNN-DFSMN-CTC
CN113160827A (en) * 2021-04-07 2021-07-23 深圳鱼亮科技有限公司 Voice transcription system and method based on multi-language model
CN113555016A (en) * 2021-06-24 2021-10-26 北京房江湖科技有限公司 Voice interaction method, electronic equipment and readable storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1670728A (en) * 2003-10-23 2005-09-21 微软公司 Full-form lexicon with tagged data and methods of constructing and using the same
CN103941870A (en) * 2014-04-21 2014-07-23 百度在线网络技术(北京)有限公司 Head-mounted display device
CN105068998A (en) * 2015-07-29 2015-11-18 百度在线网络技术(北京)有限公司 Translation method and translation device based on neural network model
CN105869624A (en) * 2016-03-29 2016-08-17 腾讯科技(深圳)有限公司 Method and apparatus for constructing speech decoding network in digital speech recognition
CN108268452A (en) * 2018-01-15 2018-07-10 东北大学 A kind of professional domain machine synchronous translation device and method based on deep learning
CN108417202A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 Audio recognition method and system
CN108460027A (en) * 2018-02-14 2018-08-28 广东外语外贸大学 A kind of spoken language instant translation method and system
CN108628859A (en) * 2018-04-25 2018-10-09 合肥上城信息技术有限公司 A kind of real-time voice translation system
CN108829685A (en) * 2018-05-07 2018-11-16 内蒙古工业大学 A kind of illiteracy Chinese inter-translation method based on single language training
CN109346064A (en) * 2018-12-13 2019-02-15 苏州思必驰信息科技有限公司 Training method and system for end-to-end speech identification model
CN109976702A (en) * 2019-03-20 2019-07-05 青岛海信电器股份有限公司 A kind of audio recognition method, device and terminal

Similar Documents

Publication Publication Date Title
JP2021089705A (en) Method and device for evaluating translation quality
CN109101545A (en) Natural language processing method, apparatus, equipment and medium based on human-computer interaction
CN110942763B (en) Speech recognition method and device
CN110930980B (en) Acoustic recognition method and system for Chinese and English mixed voice
CN113505205A (en) System and method for man-machine conversation
CN109543200A (en) A kind of text interpretation method and device
CN112447168A (en) Voice recognition system and method, sound box, display device and interaction platform
CN111523324B (en) Named entity recognition model training method and device
CN111523305A (en) Text error correction method, device and system
CN111144137B (en) Method and device for generating corpus of machine post-translation editing model
CN111209740A (en) Text model training method, text error correction method, electronic device and storage medium
CN108304376B (en) Text vector determination method and device, storage medium and electronic device
CN112765981A (en) Text information generation method and device
CN111477212A (en) Content recognition, model training and data processing method, system and equipment
CN112837674B (en) Voice recognition method, device, related system and equipment
CN112242132A (en) Data labeling method, device and system in speech synthesis
CN108427671B (en) Information conversion method and apparatus, storage medium, and electronic apparatus
CN112687271B (en) Voice translation method and device, electronic equipment and storage medium
CN112002325B (en) Multi-language voice interaction method and device
CN107423293A (en) The method and apparatus of data translation
US20220262353A1 (en) Method and device for Processing Voice Information, Storage Medium and Electronic Apparatus
CN110890085B (en) Voice recognition method and system
CN107291676B (en) Method for cutting off voice file, terminal equipment and computer storage medium
CN111161737A (en) Data processing method and device, electronic equipment and storage medium
CN110929504A (en) Statement diagnosis method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination