CN111681661B - Speech recognition method, apparatus, electronic device and computer readable medium - Google Patents


Info

Publication number
CN111681661B
CN111681661B (application CN202010512892.9A)
Authority
CN
China
Prior art keywords
syllable
sequence
syllable probability
probability set
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010512892.9A
Other languages
Chinese (zh)
Other versions
CN111681661A (en
Inventor
蔡猛
梁镇麟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202010512892.9A priority Critical patent/CN111681661B/en
Publication of CN111681661A publication Critical patent/CN111681661A/en
Application granted granted Critical
Publication of CN111681661B publication Critical patent/CN111681661B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/26 - Speech to text systems
    • G10L2015/027 - Syllables being the recognition units

Abstract

Embodiments of the present disclosure disclose a speech recognition method, apparatus, electronic device, and computer-readable medium. One embodiment of the method comprises: for each speech frame in a speech frame sequence of a target speech, generating a syllable probability set of the speech frame to obtain a syllable probability set sequence, where the syllable probability set represents the probability that the speech frame expresses each syllable of a preset plurality of syllables; and generating target text based on the syllable probability set sequence and a pre-trained language model. Because this embodiment decodes with an external language model, the speech recognition result is more accurate, which in turn improves the user experience.

Description

Speech recognition method, apparatus, electronic device and computer readable medium
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and in particular, to a method, an apparatus, an electronic device, and a computer readable medium for speech recognition.
Background
Currently, speech recognition is widely used in daily life. However, related methods often produce inaccurate recognition results, leading to a poor user experience.
Disclosure of Invention
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Some embodiments of the present disclosure propose methods, apparatuses, devices and computer-readable media for speech recognition to solve the technical problems mentioned in the background section above.
In a first aspect, some embodiments of the present disclosure provide a speech recognition method, the method comprising: for each speech frame in a speech frame sequence of a target speech, generating a syllable probability set of the speech frame to obtain a syllable probability set sequence, where the syllable probability set represents the probability that the speech frame expresses each syllable of a preset plurality of syllables; and generating target text based on the syllable probability set sequence and a pre-trained language model.
In a second aspect, some embodiments of the present disclosure provide an apparatus for speech recognition, the apparatus comprising: a first generation unit configured to generate, for each speech frame in a speech frame sequence of a target speech, a syllable probability set of the speech frame to obtain a syllable probability set sequence, where the syllable probability set is used to represent a probability that the speech frame expresses each syllable of a preset plurality of syllables; and a second generation unit configured to generate a target text based on the syllable probability set sequence and a pre-trained language model.
In a third aspect, some embodiments of the present disclosure provide an electronic device comprising: one or more processors; a storage device having one or more programs stored thereon, which when executed by one or more processors, cause the one or more processors to implement the method as in any of the first aspects.
In a fourth aspect, some embodiments of the present disclosure provide a computer readable medium having a computer program stored thereon, wherein the program when executed by a processor implements a method as in any of the first aspects.
One of the above embodiments of the present disclosure has the following advantageous effects. First, a syllable probability set is determined for each speech frame included in the target speech, yielding a syllable probability set sequence. Then, an external pre-trained language model is used to decode the syllable probability set sequence, finally obtaining the target text. The external language model can make better use of a text corpus; as an example, the probability of certain words (e.g., hot words) can be changed by adjusting the external language model. As a result, the decoded target text is more accurate.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
FIG. 1 is a schematic illustration of one application scenario of a method of speech recognition according to some embodiments of the present disclosure;
FIG. 2 is a flow chart of some embodiments of a method of speech recognition according to the present disclosure;
FIG. 3 is a flow chart of other embodiments of a method of speech recognition according to the present disclosure;
FIG. 4 is a flow chart of still other embodiments of a method of speech recognition according to the present disclosure;
FIG. 5 is a schematic diagram of a greedy search method according to some embodiments of the method of speech recognition of the present disclosure;
FIG. 6 is a schematic illustration of speech decoding according to some embodiments of a method of speech recognition of the present disclosure;
FIG. 7 is a schematic diagram of the structure of some embodiments of an apparatus for speech recognition according to the present disclosure;
fig. 8 is a schematic structural diagram of an electronic device suitable for use in implementing some embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings. Embodiments of the present disclosure and features of embodiments may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that the modifiers "a", "an", and "a plurality of" in this disclosure are illustrative rather than limiting; those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 is a schematic diagram of one application scenario of a method of speech recognition according to some embodiments of the present disclosure.
As shown in fig. 1, the target speech includes: speech frame 1, speech frame 2, speech frame 3, and speech frame 4. The electronic device 101 may generate a syllable probability set for each speech frame separately, resulting in a syllable probability set sequence. For example, the syllable probability set of speech frame 1 may include: syllable probability 1, syllable probability 2, and syllable probability 3. The syllable probability set of speech frame 2 may include: syllable probability 4, syllable probability 5, and syllable probability 6. Here, the syllable probability sets corresponding to speech frames 1 to 4 form the syllable probability set sequence. Decoding is then performed on the syllable probability set sequence in combination with a pre-trained language model 103, finally determining the target text 102.
It will be appreciated that the speech recognition method may be performed by the electronic device 101 described above. The electronic device 101 may be hardware or software. When it is hardware, it may be any of various electronic devices having information processing capabilities, including but not limited to smartphones, tablets, e-book readers, laptop computers, desktop computers, servers, and the like. When it is software, it may be installed in the electronic devices listed above and implemented as multiple pieces of software or software modules (e.g., for providing distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
It should be understood that the number of electronic devices in fig. 1 is merely illustrative. There may be any number of electronic devices as desired for an implementation.
With continued reference to fig. 2, a flow 200 of some embodiments of a method of speech recognition according to the present disclosure is shown. The method for voice recognition comprises the following steps:
step 201, for each speech frame in the speech frame sequence of the target speech, generating a syllable probability set of the speech frame, and obtaining a syllable probability set sequence.
In some embodiments, for each speech frame in the speech frame sequence of the target speech, the executing body of the speech recognition method (e.g., the electronic device shown in fig. 1) may generate the syllable probability set of the speech frame in various ways, thereby obtaining the syllable probability set sequence. For example, each speech frame may be input into a pre-trained deep learning network (e.g., an LSTM (Long Short-Term Memory) network) to generate the syllable probability set of the speech frame. Here, the syllable probability set represents the probability that the speech frame expresses each syllable of a preset plurality of syllables (e.g., 400 syllables). The target speech may be a sentence spoken by an ordinary user, for example, "What's the weather like today?"
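To make the shape of this step concrete, the following Python sketch stands in for the pre-trained acoustic model with a single random linear layer followed by a softmax; the feature dimension, frame count, and 400-syllable inventory are assumptions taken from the example above, not the patent's actual network.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax along the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def syllable_posteriors(frames, weights, bias):
    # frames: (T, feat_dim) acoustic features, one row per speech frame.
    # A real system would use an LSTM or RNN-T here; a single linear
    # layer merely illustrates the frame -> probability-set mapping.
    logits = frames @ weights + bias
    return softmax(logits)  # (T, n_syllables), each row sums to 1

rng = np.random.default_rng(0)
T, feat_dim, n_syl = 4, 8, 400  # 4 frames, 400 preset syllables
probs = syllable_posteriors(rng.normal(size=(T, feat_dim)),
                            rng.normal(size=(feat_dim, n_syl)),
                            np.zeros(n_syl))
print(probs.shape)  # (4, 400)
```

Each row of `probs` is one speech frame's syllable probability set, and the rows together form the syllable probability set sequence.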
As an example, suppose a segment of speech includes speech frame A and speech frame B. Through the pre-trained deep neural network, a syllable probability set can be generated for each frame. The syllable probability set of speech frame A may include: hu (80%), lu (10%), liu (5%), dou (5%). The syllable probability set of speech frame B may include: su (80%), lao (10%), luo (5%), dun (5%). Together, the two sets form a syllable probability set sequence.
Step 202, generating target characters based on syllable probability set sequences and a pre-trained language model.
In some embodiments, based on the syllable probability set sequence generated in step 201, the executing body may decode in various ways in combination with a pre-trained language model to finally generate the target text. For example, multiple paths may be generated from the syllable probability set sequence: for each syllable probability set in the sequence, one syllable probability is extracted in turn, producing multiple paths and their corresponding scores. Using the multiple paths, their scores, the pre-trained language model, and a preset dictionary, a decoding algorithm (e.g., the Viterbi algorithm) may then be used to decode and finally generate the target text.
Here, a path is a sequence of syllables; for example, "jin tian xing qi si" is a path. The score of a path is the product of the probabilities of the syllables it contains.
For example, if each syllable in the path "jin tian xing qi si" has probability 90%, the score of the path is the product 0.9 × 0.9 × 0.9 × 0.9 × 0.9 ≈ 0.59.
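The path-score computation above can be sketched in a few lines of Python; the function name is illustrative, not from the patent.

```python
def path_score(probabilities):
    """Score of a path = product of its syllable probabilities."""
    score = 1.0
    for p in probabilities:
        score *= p
    return score

# "jin tian xing qi si": each of the 5 syllables recognised with
# probability 0.9, so the score is 0.9 ** 5.
print(path_score([0.9] * 5))  # ~0.59049
```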
Here, the pre-trained language model may be one of the following: a statistical language model or a neural network language model. The target text characterizes the content of the target speech. For example, if the end user speaks the target speech "hello earth", then with high probability the target text is the text corresponding to "hello earth".
One of the above embodiments of the present disclosure has the following advantageous effects. First, a syllable probability set is determined for each speech frame included in the target speech, yielding a syllable probability set sequence. Then, an external pre-trained language model is used to decode the syllable probability set sequence, finally obtaining the target text. The external language model can make better use of a text corpus; as an example, the probability of certain words (e.g., hot words) can be changed by adjusting the external language model. As a result, the decoded target text is more accurate.
With further reference to fig. 3, a flow 300 of further embodiments of a method of speech recognition is shown. The flow 300 of the method of speech recognition comprises the steps of:
step 301, inputting the speech frame into a pre-trained recurrent neural network converter RNNT for each speech frame in the speech frame sequence of the target speech, generating a syllable probability set of the speech frame.
In some embodiments, for each speech frame in the speech frame sequence of the target speech, the executing body may input the speech frame into a pre-trained RNN-T (Recurrent Neural Network Transducer) to generate a syllable probability set of the speech frame. Because the RNN-T model itself contains a prediction network, the syllable probability set of the speech frame can be better determined from context. In practice, the generation of a speech frame's syllable probability set is often affected by the few speech frames preceding it.
Step 302, for each syllable probability set in the syllable probability set sequence, selecting the syllable probabilities meeting a predetermined condition from the syllable probability set to obtain a syllable probability subset, thereby obtaining a syllable probability subset sequence.
In some embodiments, for each syllable probability set in the syllable probability set sequence, the executing entity may select the syllable probabilities meeting a predetermined condition from the syllable probability set, obtaining a syllable probability subset for the corresponding speech frame; the subsets for the plurality of speech frames form a syllable probability subset sequence. Here, the predetermined condition may be that the syllable probability is greater than a preset threshold (e.g., 80%). Note that each syllable probability subset is a subset of the corresponding syllable probability set.
For example, the syllable probability set for a certain speech frame includes: hu (80%), lu (10%), liu (5%). If the predetermined condition is that the syllable probability is greater than 8%, the resulting subset of syllable probabilities may include: hu (80%), lu (10%).
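The threshold selection in the example above can be sketched as follows; the dictionary representation of a syllable probability set is an assumption for illustration.

```python
def select_by_threshold(syllable_probs, threshold=0.08):
    # syllable_probs: {syllable: probability} for one speech frame.
    # Keep only the syllables whose probability exceeds the threshold.
    return {s: p for s, p in syllable_probs.items() if p > threshold}

frame = {"hu": 0.80, "lu": 0.10, "liu": 0.05}
print(select_by_threshold(frame))  # {'hu': 0.8, 'lu': 0.1}
```

Applying this per frame turns the syllable probability set sequence into the syllable probability subset sequence of step 302.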
Step 303, generating a plurality of paths and scores corresponding to the paths based on the syllable probability subset sequence and a pruning algorithm, where a path is a sequence of syllables and its score is the product of the syllable probabilities in the sequence.
In some embodiments, for each subset in the syllable probability subset sequence obtained in step 302, a syllable is selected in turn, thereby forming a path. It should be emphasized that this generates a large number of paths. Therefore, during path generation, a pruning algorithm is needed to prune the generated paths according to their scores. The pruning algorithm may include, but is not limited to: the Viterbi algorithm and beam search. Here, a path is a sequence of syllables; for example, "jin tian xing qi si" is a path. The score of a path is the product of the probabilities of the syllables it contains.
For example, referring to fig. 5: from p11, p12, p13; p21, p22, p23; and p31, p32, p33, multiple paths can be obtained. During path generation, a pruning algorithm is often needed to remove path branches whose probability is smaller than a preset threshold; 5 paths are thus obtained.
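A minimal beam-search sketch of this path generation and pruning is shown below. It keeps only the highest-scoring partial paths at each frame; the beam width, data layout, and syllable values are illustrative assumptions, not the patent's parameters.

```python
def beam_search(subset_sequence, beam_width=5):
    # subset_sequence: per-frame list of (syllable, probability) pairs,
    # i.e. the syllable probability subset sequence of step 302.
    beams = [([], 1.0)]  # (path so far, score so far)
    for subset in subset_sequence:
        # Extend every surviving path by every syllable in this frame's subset.
        expanded = [(path + [syl], score * p)
                    for path, score in beams
                    for syl, p in subset]
        # Prune: keep only the beam_width best-scoring paths.
        expanded.sort(key=lambda x: x[1], reverse=True)
        beams = expanded[:beam_width]
    return beams

frames = [[("hu", 0.8), ("lu", 0.1), ("liu", 0.05)],
          [("su", 0.8), ("lao", 0.1), ("luo", 0.05)]]
best_path, best_score = beam_search(frames, beam_width=5)[0]
print(best_path)  # ['hu', 'su'], score ~0.64
```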
Step 304, obtaining the target text based on the paths, the scores corresponding to the paths, the target dictionary and the pre-trained language model.
In some embodiments, the executing entity may obtain the target text in various ways according to the paths, the scores corresponding to the paths, the target dictionary, and the pre-trained language model. The target dictionary is used to look up the Chinese characters corresponding to syllables.
By way of example, a dynamic search space may be constructed from the multiple paths, their corresponding scores, the target dictionary, and the pre-trained language model, and searched using a search algorithm (e.g., the Viterbi algorithm) to find the optimal path (e.g., the path with the highest probability product), completing the decoding. The characters or words on the optimal path are taken as the target text.
As an example, reference may be made to fig. 6. The target speech 601 is passed through the RNNT model 602 to obtain multiple paths. For example, path 1, path 2, path 3, path 4, and path 5, as indicated by reference numeral 603. Here, the score corresponding to each of the plurality of paths may be obtained while the paths are obtained.
The 5 paths and their corresponding scores are combined with the language model 604 and the target dictionary to construct a dynamic search space, which is then searched and decoded using a search algorithm, for example, the Viterbi algorithm. Finally, the decoding result, i.e., the target text 605, is obtained.
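One simple way to combine acoustic path scores with an external language model is log-linear rescoring, sketched below. The `dictionary` mapping, the `lm_score` interface, and the weight value are hypothetical illustrations of the combination, not the patent's actual search-space construction.

```python
import math

def rescore(paths, dictionary, lm_score, lm_weight=0.5):
    # paths: list of (syllable_sequence, acoustic_score) pairs.
    # dictionary: maps a syllable tuple to a candidate text.
    # lm_score: text -> language-model probability (hypothetical interface).
    best_text, best = None, float("-inf")
    for syllables, acoustic in paths:
        text = dictionary[tuple(syllables)]
        combined = math.log(acoustic) + lm_weight * math.log(lm_score(text))
        if combined > best:
            best_text, best = text, combined
    return best_text

dictionary = {("jin", "tian"): "今天", ("jing", "tian"): "惊天"}
paths = [(["jin", "tian"], 0.50), (["jing", "tian"], 0.45)]
lm = {"今天": 0.9, "惊天": 0.01}  # the external LM prefers the common word
print(rescore(paths, dictionary, lm.get))  # 今天
```

This illustrates the advantage stated earlier: raising a word's language-model probability (e.g., for a hot word) changes which decoded text wins, even when the acoustic scores are close.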
In some optional implementations of some embodiments, the language model may include an n-gram language model. The n-gram model is a commonly used statistical language model; for Chinese it may be referred to as a Chinese Language Model (CLM). The n-gram model is based on the assumption that the occurrence of the n-th word depends only on the previous n-1 words and on no other word, so that the probability of a whole sentence is the product of the conditional probabilities of its words; these probabilities can be estimated by directly counting how often the n words co-occur in a corpus. Here, n may be equal to 5. Decoding with the n-gram language model makes the final speech recognition result more accurate.
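The counting estimate described above can be sketched directly: P(w_n | w_1..w_{n-1}) = count(w_1..w_n) / count(w_1..w_{n-1}). The bigram (n = 2) toy corpus below is an assumption for illustration; the patent suggests n = 5.

```python
from collections import Counter

def ngram_probs(corpus_sentences, n=2):
    # Maximum-likelihood n-gram estimates from raw co-occurrence counts.
    ngrams, contexts = Counter(), Counter()
    for sent in corpus_sentences:
        tokens = ["<s>"] * (n - 1) + sent + ["</s>"]
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
            contexts[tuple(tokens[i:i + n - 1])] += 1
    def prob(gram):
        c = contexts[gram[:-1]]
        return ngrams[gram] / c if c else 0.0
    return prob

# Two-sentence toy corpus: "今天 天气 好" and "今天 星期 四".
p = ngram_probs([["今天", "天气", "好"], ["今天", "星期", "四"]])
print(p(("今天", "天气")))  # 0.5: "天气" follows "今天" in 1 of 2 cases
```

A real deployment would add smoothing for unseen n-grams; this sketch returns 0 for them.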
As can be seen from fig. 3, compared with the description of some embodiments corresponding to fig. 2, the flow 300 of the speech recognition method in some embodiments corresponding to fig. 3 embodies the use of the RNN-T model: because the RNN-T model itself has a prediction network, it better incorporates context into syllable probability prediction. Meanwhile, at each moment only the syllable probabilities meeting the condition are selected from the syllable probability set for the next calculation, which reduces the amount of computation.
With further reference to fig. 4, a flow 400 of further embodiments of a method of speech recognition is shown. The flow 400 of the method of speech recognition comprises the steps of:
step 401, for each speech frame in the speech frame sequence of the target speech, inputting the speech frame into the pre-trained recurrent neural network converter RNNT, generating a syllable probability set of the speech frame.
In some embodiments, the specific implementation of step 401 and the technical effects thereof may refer to step 301 in those embodiments corresponding to fig. 3, which are not described herein.
Step 402, for each syllable probability set in the syllable probability set sequence, selecting a predetermined number of syllable probabilities based on a greedy search algorithm to obtain a syllable probability subset, thereby obtaining a syllable probability subset sequence.
In some embodiments, for each syllable probability set in the syllable probability set sequence, the executing entity may use a greedy search algorithm to select a predetermined number (e.g., 5) of syllable probabilities from the set, obtaining a syllable probability subset; finally, a syllable probability subset sequence is obtained. Here, the greedy search selects from each syllable probability set the predetermined number of syllable probabilities with the highest probability to form the subset.
As an example, reference may be made to fig. 5. Suppose a speech frame sequence includes speech frame A, speech frame B, and speech frame C. The three speech frames may be input into the RNN-T model in turn at three time points (e.g., t1, t2, and t3). Taking t1 as an example: at time t1, speech frame A is input into the RNN-T model, which outputs the syllable probability set of speech frame A. This set may include p11, p12, p13, p14, ..., p1n, where n is 2 or more; for example, n may equal 400. Specifically, p11 through p1n might be: hu (80%), lu (2%), liu (5%), dou (3%), ..., gou (2%). On this basis, a greedy search can be performed on the syllable probability set of speech frame A at time t1 to select the three highest syllable probabilities, obtaining p11, p12, and p13. Similarly, p21, p22, p23 and p31, p32, p33 are obtained for times t2 and t3, respectively.
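The greedy top-k selection in this example can be sketched with the standard library; the dictionary layout and the k = 3 choice mirror the example above and are not the patent's fixed parameters.

```python
import heapq

def top_k_syllables(syllable_probs, k=3):
    # Greedy selection: keep only the k most probable syllables
    # from one speech frame's syllable probability set.
    return heapq.nlargest(k, syllable_probs.items(), key=lambda kv: kv[1])

# Syllable probability set of speech frame A at time t1.
frame_a = {"hu": 0.80, "lu": 0.02, "liu": 0.05, "dou": 0.03, "gou": 0.02}
print(top_k_syllables(frame_a))  # [('hu', 0.8), ('liu', 0.05), ('dou', 0.03)]
```

Unlike the threshold rule of step 302, this variant always returns a fixed-size subset, which bounds the branching factor of the subsequent path search.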
Step 403, generating a plurality of paths and scores corresponding to the paths based on the syllable probability subset sequence and a pruning algorithm, where a path is a sequence of syllables and its score is the product of the syllable probabilities in the sequence.
Step 404, obtaining the target text based on the paths, the scores corresponding to the paths, the target dictionary and the pre-trained language model.
In some embodiments, the specific implementation of steps 403 to 404 and the technical effects thereof may refer to steps 303 to 304 in those embodiments corresponding to fig. 3, which are not described herein.
As can be seen from fig. 4, compared with the description of some embodiments corresponding to fig. 3, the flow 400 of the speech recognition method in some embodiments corresponding to fig. 4 reflects the use of a greedy search algorithm: at each moment, the syllable probabilities with higher probability are selected from the syllable probability set for the next calculation, which not only improves the accuracy of the speech recognition result but also reduces the amount of computation.
With further reference to fig. 7, as an implementation of the method shown in the above figures, the present disclosure provides some embodiments of a speech recognition apparatus, which apparatus embodiments correspond to those method embodiments shown in fig. 2, and which apparatus is particularly applicable in various electronic devices.
As shown in fig. 7, the speech recognition apparatus 700 of some embodiments includes: a first generation unit 701 and a second generation unit 702. The first generation unit 701 is configured to generate, for each speech frame in a speech frame sequence of a target speech, a syllable probability set of the speech frame to obtain a syllable probability set sequence, where the syllable probability set represents the probability that the speech frame expresses each syllable of a preset plurality of syllables. The second generation unit 702 is configured to generate target text based on the syllable probability set sequence and a pre-trained language model.
In some optional implementations of some embodiments, the first generation unit 701 may be further configured to input each speech frame in the speech frame sequence of the target speech into a pre-trained Recurrent Neural Network Transducer (RNN-T) to generate the syllable probability set of the speech frame.
In some optional implementations of some embodiments, the second generating unit 702 may be further configured to, for each syllable probability set in the syllable probability set sequence, select a syllable probability meeting a predetermined condition from the syllable probability sets to obtain a syllable probability subset, and obtain a syllable probability subset sequence; generating a plurality of paths and scores corresponding to the paths based on the syllable probability subset sequence and a pruning algorithm, wherein the paths are sequences containing a plurality of syllables, and the scores are products of the syllable probabilities in the sequences; and obtaining the target text based on the paths, the scores corresponding to the paths, the target dictionary and the pre-trained language model.
In some optional implementations of some embodiments, the second generation unit 702 may be further configured to select a predetermined number of syllable probabilities based on a greedy search algorithm, resulting in a subset of syllable probabilities.
In some alternative implementations of some embodiments, the pre-trained language model includes an n-gram language model.
It will be appreciated that the elements described in the apparatus 700 correspond to the various steps in the method described with reference to fig. 2. Thus, the operations, features and resulting benefits described above for the method are equally applicable to the apparatus 700 and the units contained therein, and are not described in detail herein.
Referring now to fig. 8, a schematic diagram of an electronic device (e.g., the electronic device of fig. 1) 800 suitable for use in implementing some embodiments of the present disclosure is shown. The electronic device shown in fig. 8 is merely an example, and should not impose any limitations on the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 8, the electronic device 800 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 801, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage device 808 into a random access memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the electronic device 800 are also stored. The processing device 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
In general, the following devices may be connected to the I/O interface 805: input devices 806, including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 807, including, for example, a liquid crystal display (LCD), speakers, and vibrators; storage devices 808, including, for example, magnetic tape and hard disks; and a communication device 809. The communication device 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While fig. 8 shows an electronic device 800 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided; more or fewer devices may be implemented or provided instead. Each block shown in fig. 8 may represent one device or multiple devices as needed.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via communication device 809, or from storage device 808, or from ROM 802. The above-described functions defined in the methods of some embodiments of the present disclosure are performed when the computer program is executed by the processing device 801.
It should be noted that, in some embodiments of the present disclosure, the computer readable medium may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, the computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, the clients and servers may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
The computer readable medium may be embodied in the electronic device described above, or it may exist separately without being incorporated into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: for each speech frame in a speech frame sequence of a target speech, generate a syllable probability set of the speech frame to obtain a syllable probability set sequence, wherein the syllable probability set represents the probability that the speech frame expresses each syllable of a preset plurality of syllables; and generate the target text based on the syllable probability set sequence and a pre-trained language model.
Computer program code for carrying out operations of some embodiments of the present disclosure may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in some embodiments of the present disclosure may be implemented by software or by hardware. The described units may also be provided in a processor, for example described as: a processor comprising a first generation unit and a second generation unit. The names of these units do not, in some cases, limit the units themselves; for example, the first generation unit may also be described as "a unit that generates, for each speech frame in a speech frame sequence of a target speech, a syllable probability set of the speech frame to obtain a syllable probability set sequence".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In accordance with one or more embodiments of the present disclosure, there is provided a speech recognition method, comprising: for each speech frame in a speech frame sequence of a target speech, generating a syllable probability set of the speech frame to obtain a syllable probability set sequence, wherein the syllable probability set represents the probability that the speech frame expresses each syllable of a preset plurality of syllables; and generating the target text based on the syllable probability set sequence and a pre-trained language model.
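The first step above, computing a syllable probability set for every speech frame, can be pictured as applying a softmax over per-frame acoustic scores. The sketch below is a minimal illustration, not the patent's implementation: the syllable inventory and the score values are invented, and a real system would obtain the scores from the acoustic model.

```python
import math

# Hypothetical syllable inventory; a real system would cover the full
# preset plurality of syllables (e.g. all toneless Mandarin syllables).
SYLLABLES = ["ni", "hao", "ma", "shi"]

def syllable_probability_set(frame_scores):
    """Turn one frame's raw acoustic scores into a probability
    distribution over the syllable inventory via a softmax."""
    m = max(frame_scores)                       # subtract max for stability
    exps = [math.exp(s - m) for s in frame_scores]
    total = sum(exps)
    return {syl: e / total for syl, e in zip(SYLLABLES, exps)}

def syllable_probability_set_sequence(score_sequence):
    """One syllable probability set per speech frame."""
    return [syllable_probability_set(scores) for scores in score_sequence]

# Two invented frames of acoustic scores, one list per frame.
sets = syllable_probability_set_sequence([[2.0, 1.0, 0.1, -1.0],
                                          [0.5, 2.5, 0.0, 0.3]])
```

Each resulting dictionary sums to one, so downstream steps can treat its entries as the probabilities that the frame expresses each preset syllable.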
According to one or more embodiments of the present disclosure, the generating, for each speech frame in the speech frame sequence of the target speech, a syllable probability set of the speech frame to obtain a syllable probability set sequence includes: for each speech frame in the speech frame sequence of the target speech, inputting the speech frame into a pre-trained recurrent neural network transducer (RNN-T) to generate the syllable probability set of the speech frame.
According to one or more embodiments of the present disclosure, the generating the target text based on the syllable probability set sequence and the pre-trained language model includes: for each syllable probability set in the syllable probability set sequence, selecting syllable probabilities meeting a predetermined condition from the syllable probability set to obtain a syllable probability subset, thereby obtaining a syllable probability subset sequence; generating a plurality of paths and a score corresponding to each path based on the syllable probability subset sequence and a pruning algorithm, wherein each path is a sequence containing a plurality of syllables and its score is the product of the syllable probabilities in the sequence; and obtaining the target text based on the paths, the scores corresponding to the paths, a target dictionary, and the pre-trained language model.
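The path-generation step above can be pictured as a pruned (beam) search in which partial syllable paths are extended frame by frame and scored by the product of their syllable probabilities. The following sketch assumes a beam-width pruning rule and invented probabilities; the patent does not fix a particular pruning algorithm.

```python
def generate_paths(subset_sequence, beam_width=3):
    """Extend partial syllable paths one frame at a time; each path's
    score is the product of its syllable probabilities, and pruning
    keeps only the best `beam_width` partial paths per frame."""
    beams = [((), 1.0)]  # (syllable tuple, product-of-probabilities score)
    for subset in subset_sequence:
        candidates = [(path + (syl,), score * p)
                      for path, score in beams
                      for syl, p in subset.items()]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # prune to the best paths
    return beams

# Invented two-frame syllable probability subset sequence.
subsets = [{"ni": 0.7, "li": 0.2},
           {"hao": 0.6, "hou": 0.3}]
paths = generate_paths(subsets, beam_width=2)
```

With a beam width of 2, the search keeps the two highest-scoring paths (here "ni hao" with score 0.7 × 0.6 = 0.42, then "ni hou" with 0.21) and discards the rest, which is what keeps the number of candidate paths bounded as frames accumulate.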
According to one or more embodiments of the present disclosure, the selecting, for each syllable probability set in the syllable probability set sequence, syllable probabilities meeting a predetermined condition from the syllable probability set to obtain a syllable probability subset includes: selecting a predetermined number of syllable probabilities in descending order of probability from the syllable probability set to obtain the syllable probability subset.
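The descending-order selection described above amounts to keeping the top-k entries of each syllable probability set. A minimal sketch, with invented syllables and values:

```python
def syllable_probability_subset(prob_set, k):
    """Keep the k largest syllable probabilities of one set,
    in descending order of probability."""
    ranked = sorted(prob_set.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])

# Invented syllable probability set; keep the top 2 entries.
subset = syllable_probability_subset(
    {"ni": 0.5, "li": 0.3, "mi": 0.15, "yi": 0.05}, k=2)
```

Applying this to every set in the sequence yields the syllable probability subset sequence that the pruning search then operates on.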
According to one or more embodiments of the present disclosure, the pre-trained language model described above includes an n-gram language model.
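To illustrate how the paths, their scores, a target dictionary, and an n-gram language model could combine into target text, the sketch below rescores one syllable path with a toy bigram model: the dictionary maps each syllable to candidate characters, and the chosen character string maximizes the product of the acoustic path score and the bigram probabilities. All syllables, characters, and probability values here are invented for illustration and are not from the patent.

```python
from itertools import product

# Toy bigram language model over characters and a toy dictionary
# mapping syllables to candidate characters; all values are invented.
BIGRAM = {("<s>", "你"): 0.4, ("你", "好"): 0.5,
          ("<s>", "泥"): 0.1, ("泥", "好"): 0.05}
DICTIONARY = {"ni": ["你", "泥"], "hao": ["好"]}

def lm_probability(chars, bigram, floor=1e-6):
    """Bigram probability of a character sequence, using a small
    floor value for unseen bigrams."""
    p, prev = 1.0, "<s>"
    for c in chars:
        p *= bigram.get((prev, c), floor)
        prev = c
    return p

def rescore_path(path, acoustic_score):
    """Expand a syllable path into candidate character strings via the
    dictionary and pick the one maximizing acoustic * LM probability."""
    best_text, best_score = None, 0.0
    for chars in product(*(DICTIONARY[s] for s in path)):
        s = acoustic_score * lm_probability(chars, BIGRAM)
        if s > best_score:
            best_text, best_score = "".join(chars), s
    return best_text, best_score

text, score = rescore_path(("ni", "hao"), acoustic_score=0.42)
```

Here the homophone candidates "你好" and "泥好" share the same acoustic score, and the bigram model resolves the ambiguity in favor of the fluent string, which is the role the pre-trained language model plays in the method above.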
According to one or more embodiments of the present disclosure, there is provided a speech recognition apparatus, comprising: a first generation unit configured to generate, for each speech frame in a speech frame sequence of a target speech, a syllable probability set of the speech frame to obtain a syllable probability set sequence, wherein the syllable probability set represents the probability that the speech frame expresses each syllable of a preset plurality of syllables; and a second generation unit configured to generate the target text based on the syllable probability set sequence and a pre-trained language model.
According to one or more embodiments of the present disclosure, the first generation unit may be further configured to, for each speech frame in the speech frame sequence of the target speech, input the speech frame into a pre-trained recurrent neural network transducer (RNN-T) to generate the syllable probability set of the speech frame.
According to one or more embodiments of the present disclosure, the second generation unit may be further configured to: for each syllable probability set in the syllable probability set sequence, select syllable probabilities meeting a predetermined condition from the syllable probability set to obtain a syllable probability subset, thereby obtaining a syllable probability subset sequence; generate a plurality of paths and a score corresponding to each path based on the syllable probability subset sequence and a pruning algorithm, wherein each path is a sequence containing a plurality of syllables and its score is the product of the syllable probabilities in the sequence; and obtain the target text based on the paths, the scores corresponding to the paths, a target dictionary, and the pre-trained language model.
According to one or more embodiments of the present disclosure, the second generation unit may be further configured to select a predetermined number of syllable probabilities based on a greedy search algorithm, resulting in a subset of syllable probabilities.
According to one or more embodiments of the present disclosure, the pre-trained language model described above includes an n-gram language model.
According to one or more embodiments of the present disclosure, there is provided an electronic device including: one or more processors; and a storage device having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to implement the method as described in any of the embodiments above.
According to one or more embodiments of the present disclosure, there is provided a computer readable medium having stored thereon a computer program, wherein the program, when executed by a processor, implements a method as described in any of the embodiments above.
The foregoing description is merely of preferred embodiments of the present disclosure and an explanation of the technical principles employed. Those skilled in the art will appreciate that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.

Claims (7)

1. A method of speech recognition, comprising:
for each speech frame in a speech frame sequence of a target speech, generating a syllable probability set of the speech frame to obtain a syllable probability set sequence, wherein the syllable probability set represents the probability that the speech frame expresses each syllable of a preset plurality of syllables;
for each syllable probability set in the syllable probability set sequence, selecting syllable probabilities meeting a predetermined condition from the syllable probability set to obtain a syllable probability subset, thereby obtaining a syllable probability subset sequence;
generating a plurality of paths and a score corresponding to each path based on the syllable probability subset sequence and a pruning algorithm, wherein each path is a sequence containing a plurality of syllables and its score is the product of the syllable probabilities in the sequence;
and acquiring a target dictionary, and obtaining target text based on the paths, the scores corresponding to the paths, the target dictionary, and a pre-trained language model.
2. The method of claim 1, wherein the generating, for each speech frame in the speech frame sequence of the target speech, a syllable probability set of the speech frame to obtain a syllable probability set sequence comprises:
for each speech frame in the speech frame sequence of the target speech, inputting the speech frame into a pre-trained recurrent neural network transducer (RNN-T) to generate the syllable probability set of the speech frame.
3. The method of claim 2, wherein the selecting, for each syllable probability set in the sequence of syllable probability sets, a syllable probability meeting a predetermined condition from the syllable probability sets, resulting in a subset of syllable probabilities, comprises:
selecting, based on a greedy search algorithm, a predetermined number of syllable probabilities to obtain the syllable probability subset.
4. A method according to one of claims 1-3, wherein the pre-trained language model comprises an n-gram language model.
5. An apparatus for speech recognition, comprising:
a first generation unit configured to generate, for each speech frame in a speech frame sequence of a target speech, a syllable probability set of the speech frame to obtain a syllable probability set sequence, wherein the syllable probability set represents the probability that the speech frame expresses each syllable of a preset plurality of syllables;
a second generation unit configured to: for each syllable probability set in the syllable probability set sequence, select syllable probabilities meeting a predetermined condition from the syllable probability set to obtain a syllable probability subset, thereby obtaining a syllable probability subset sequence;
generate a plurality of paths and a score corresponding to each path based on the syllable probability subset sequence and a pruning algorithm, wherein each path is a sequence containing a plurality of syllables and its score is the product of the syllable probabilities in the sequence;
and acquire a target dictionary, and obtain target text based on the paths, the scores corresponding to the paths, the target dictionary, and a pre-trained language model.
6. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-4.
7. A computer readable medium having stored thereon a computer program, wherein the program when executed by a processor implements the method of any of claims 1-4.
CN202010512892.9A 2020-06-08 2020-06-08 Speech recognition method, apparatus, electronic device and computer readable medium Active CN111681661B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010512892.9A CN111681661B (en) 2020-06-08 2020-06-08 Speech recognition method, apparatus, electronic device and computer readable medium


Publications (2)

Publication Number Publication Date
CN111681661A CN111681661A (en) 2020-09-18
CN111681661B true CN111681661B (en) 2023-08-08

Family

ID=72454931

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010512892.9A Active CN111681661B (en) 2020-06-08 2020-06-08 Speech recognition method, apparatus, electronic device and computer readable medium

Country Status (1)

Country Link
CN (1) CN111681661B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634904A (en) * 2020-12-22 2021-04-09 北京有竹居网络技术有限公司 Hot word recognition method, device, medium and electronic equipment
CN113763960B (en) * 2021-11-09 2022-04-26 深圳市友杰智新科技有限公司 Post-processing method and device for model output and computer equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108428446A (en) * 2018-03-06 2018-08-21 北京百度网讯科技有限公司 Audio recognition method and device
CN108735201A (en) * 2018-06-29 2018-11-02 广州视源电子科技股份有限公司 Continuous speech recognition method, apparatus, equipment and storage medium
CN110808032A (en) * 2019-09-20 2020-02-18 平安科技(深圳)有限公司 Voice recognition method and device, computer equipment and storage medium
CN110838289A (en) * 2019-11-14 2020-02-25 腾讯科技(深圳)有限公司 Awakening word detection method, device, equipment and medium based on artificial intelligence

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103578465B (en) * 2013-10-18 2016-08-17 威盛电子股份有限公司 Speech identifying method and electronic installation
KR102371188B1 (en) * 2015-06-30 2022-03-04 삼성전자주식회사 Apparatus and method for speech recognition, and electronic device


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chao Hao. An improved syllable-based acoustic model algorithm for Chinese speech recognition. Journal of Computer Applications. 2013, Vol. 33 (No. 33), 1742-1745. *


Similar Documents

Publication Publication Date Title
CN108630190B (en) Method and apparatus for generating speech synthesis model
US10388284B2 (en) Speech recognition apparatus and method
JP7066349B2 (en) Translation method, translation equipment and computer program
US11093813B2 (en) Answer to question neural networks
CN112489620B (en) Speech synthesis method, device, readable medium and electronic equipment
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN112489621B (en) Speech synthesis method, device, readable medium and electronic equipment
CN111368559A (en) Voice translation method and device, electronic equipment and storage medium
CN112509562B (en) Method, apparatus, electronic device and medium for text post-processing
CN111489735B (en) Voice recognition model training method and device
CN113470619B (en) Speech recognition method, device, medium and equipment
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN111428010A (en) Man-machine intelligent question and answer method and device
CN111681661B (en) Speech recognition method, apparatus, electronic device and computer readable medium
CN111368560A (en) Text translation method and device, electronic equipment and storage medium
US20230410794A1 (en) Audio recognition method, method of training audio recognition model, and electronic device
CN111104796B (en) Method and device for translation
CN112364653A (en) Text analysis method, apparatus, server and medium for speech synthesis
CN112785667A (en) Video generation method, device, medium and electronic equipment
CN111653261A (en) Speech synthesis method, speech synthesis device, readable storage medium and electronic equipment
CN113689866B (en) Training method and device of voice conversion model, electronic equipment and medium
CN111933119B (en) Method, apparatus, electronic device, and medium for generating voice recognition network
CN111754984B (en) Text selection method, apparatus, device and computer readable medium
CN114765025A (en) Method for generating and recognizing speech recognition model, device, medium and equipment
CN114613351A (en) Rhythm prediction method, device, readable medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant