CN113012683A - Speech recognition method and device, equipment and computer readable storage medium - Google Patents


Info

Publication number
CN113012683A
CN113012683A
Authority
CN
China
Prior art keywords
pronunciation dictionary
words
speech recognition
acoustic
text corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110150327.7A
Other languages
Chinese (zh)
Inventor
陈文明
冯兵兵
邓高锋
张世明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wormhole Innovation Platform Shenzhen Co ltd
Original Assignee
Wormhole Innovation Platform Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wormhole Innovation Platform Shenzhen Co ltd filed Critical Wormhole Innovation Platform Shenzhen Co ltd
Priority to CN202110150327.7A priority Critical patent/CN113012683A/en
Publication of CN113012683A publication Critical patent/CN113012683A/en
Legal status: Pending

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/0631 Creating reference templates; Clustering
    • G10L2015/0633 Creating reference templates; Clustering using lexical or orthographic knowledge sources

Abstract

The invention relates to the technical field of speech recognition and discloses a speech recognition method, apparatus, device, and computer-readable storage medium. If receipt of speech information is detected, acoustic feature extraction is performed on the speech information to obtain first acoustic feature information; a target decoder is then used to decode and recognize the first acoustic feature information to obtain a decoding recognition result, where the target decoder is constructed from a normalized pronunciation dictionary, a language model trained on normalized text corpora, and an acoustic model; the decoding recognition result is then output to realize speech recognition. This solves the problem of poor speech recognition accuracy in the related art.

Description

Speech recognition method and device, equipment and computer readable storage medium
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, device, and computer-readable storage medium.
Background
With the rapid development of computer technology and signal processing technology, robust speech recognition has reached practical application and can support free human-computer interaction. However, current speech recognition accuracy is low; for example, recognition accuracy is poor for proper-noun compound words such as Editor-in-Chief, acronyms such as UFO, personal names such as Jessie, and place names such as Beijing, which greatly degrades the user experience.
Therefore, how to improve the accuracy of speech recognition is an urgent problem to be solved.
Disclosure of Invention
The main object of the present invention is to provide a speech recognition method, apparatus, device, and computer-readable storage medium, with the aim of improving the accuracy of speech recognition.
In order to achieve the above object, the present invention provides a speech recognition method, comprising the steps of:
if receipt of speech information is detected, performing acoustic feature extraction on the speech information to obtain first acoustic feature information;
decoding and recognizing the first acoustic feature information by using a target decoder to obtain a decoding recognition result, wherein the target decoder is constructed from a normalized pronunciation dictionary, a language model trained on normalized text corpora, and an acoustic model;
and outputting the decoding recognition result to realize speech recognition.
Optionally, before the step of decoding and recognizing the first acoustic feature information by using a target decoder to obtain a decoding recognition result, the speech recognition method further includes:
normalizing the pronunciation dictionary to obtain a normalized pronunciation dictionary;
normalizing a text corpus to obtain a normalized text corpus, and training on the normalized text corpus to obtain a language model;
and constructing a decoder from the normalized pronunciation dictionary, the language model, and the acoustic model to obtain the target decoder.
Optionally, before the step of constructing a decoder from the normalized pronunciation dictionary, the language model, and the acoustic model to obtain the target decoder, the method further includes:
performing acoustic feature extraction on a speech corpus to obtain second acoustic feature information;
and training on the second acoustic feature information to obtain the acoustic model.
Optionally, before the step of constructing a decoder from the normalized pronunciation dictionary, the language model, and the acoustic model to obtain the target decoder, the method further includes:
obtaining the acoustic model and the language model;
establishing a mapping between phonemes and Chinese words according to the phonemes in the acoustic model and the Chinese words in the language model, and establishing a mapping between phonemes and words according to the phonemes in the acoustic model and the words in the language model;
and obtaining the pronunciation dictionary from the mapping between phonemes and Chinese words and the mapping between phonemes and words.
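As an illustrative sketch of the mapping step above (not the patent's implementation), the two mappings can be merged into one lexicon; the example words and phoneme symbols below are hypothetical placeholders:

```python
def build_pronunciation_dictionary(chinese_map, word_map):
    """Merge the phoneme-to-Chinese-word and phoneme-to-word mappings into one lexicon."""
    lexicon = {}
    lexicon.update(chinese_map)  # entries derived from pinyin via a pinyin-to-phoneme rule
    lexicon.update(word_map)     # entries for (e.g. English) words
    return lexicon

# Hypothetical entries; the phoneme inventories are placeholders, not the patent's.
chinese_map = {"北京": ["b", "ei3", "j", "ing1"]}
word_map = {"UFO": ["y", "uw", "eh", "f", "ow"]}
lexicon = build_pronunciation_dictionary(chinese_map, word_map)
```

In practice each entry connects a language-model modeling unit (the word) to acoustic-model modeling units (the phonemes), which is what lets the decoder search over both.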
Optionally, the step of normalizing the pronunciation dictionary to obtain a normalized pronunciation dictionary includes:
training on the pronunciation dictionary to obtain a word-to-phoneme model;
generating a supplementary pronunciation dictionary according to the word-to-phoneme model, wherein the phonemes contained in the supplementary pronunciation dictionary are not in the pronunciation dictionary, while the Chinese words and words corresponding to those phonemes are in the language model;
obtaining a combined pronunciation dictionary from the supplementary pronunciation dictionary and the pronunciation dictionary;
and obtaining a normalized pronunciation dictionary from the combined pronunciation dictionary.
Optionally, the step of obtaining a normalized pronunciation dictionary from the combined pronunciation dictionary includes:
unifying the case of the phonemes contained in the combined pronunciation dictionary;
adjusting the case of the phonemes contained in the combined pronunciation dictionary according to preset case rules for proper nouns;
and adding phonemes corresponding to silence words and/or noise words and/or out-of-vocabulary words to the combined pronunciation dictionary to obtain the normalized pronunciation dictionary.
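The three normalization steps above can be sketched as follows; lowercase as the unified case, uppercase for proper-noun phonemes, and the special-word symbols are all assumed conventions, not the patent's:

```python
def normalize_combined_lexicon(lexicon, proper_nouns=(), special_entries=None):
    """Sketch of the three dictionary-normalization steps (assumed conventions)."""
    # Step 1: unify phoneme case; lowercase is chosen here as one possible convention.
    out = {w: [p.lower() for p in prons] for w, prons in lexicon.items()}
    # Step 2: re-case the phonemes of preset proper nouns (uppercase, per a preset rule).
    for word in proper_nouns:
        if word in out:
            out[word] = [p.upper() for p in out[word]]
    # Step 3: add silence / noise / out-of-vocabulary entries.
    out.update(special_entries or {})
    return out

lex = {"hello": ["HH", "AH", "L", "OW"], "Beijing": ["b", "ei", "j", "ing"]}
norm = normalize_combined_lexicon(
    lex,
    proper_nouns=["Beijing"],
    special_entries={"<sil>": ["sil"], "<noise>": ["nsn"], "<unk>": ["spn"]},
)
```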
Optionally, the step of normalizing the text corpus to obtain a normalized text corpus includes:
collecting text corpora from a plurality of fields;
normalizing the text corpora to obtain a normalized text corpus;
and the step of training on the normalized text corpus to obtain a language model includes:
acquiring the Chinese words and/or words whose usage frequency is higher than a preset threshold from the normalized text corpus;
generating a constructed vocabulary from the Chinese words and/or words whose usage frequency is higher than the preset threshold;
and training on the normalized text corpus and the constructed vocabulary to obtain the language model.
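A minimal sketch of building the constructed vocabulary from usage frequencies; whitespace tokenization and the strictly-greater-than threshold semantics are assumptions:

```python
from collections import Counter

def build_vocabulary(normalized_corpus_lines, threshold):
    """Collect words whose usage count exceeds a preset threshold (illustrative sketch)."""
    counts = Counter(tok for line in normalized_corpus_lines for tok in line.split())
    return sorted(w for w, c in counts.items() if c > threshold)

corpus = ["speech recognition test", "speech model test", "speech test"]
vocab = build_vocabulary(corpus, threshold=1)
# only words occurring more than once survive the threshold
```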
Optionally, the step of normalizing the text corpus to obtain a normalized text corpus includes:
deleting characters and/or character strings from the text corpus according to a preset deletion rule;
and/or converting non-ASCII characters in the text corpus into ASCII characters;
and/or converting Roman numerals in the text corpus into decimal numbers;
and/or performing semantic segmentation on the text in the text corpus;
and/or unifying the case of the words in the text corpus;
and/or performing error correction on the Chinese words and/or words in the text corpus;
and/or constructing compound words and/or acronyms and/or personal names and/or place names and adding them to the text corpus.
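A few of the normalization rules above can be sketched as follows; the concrete deletion rule (bracketed markup) and the naive Roman-numeral matcher are illustrative assumptions, not the patent's preset rules:

```python
import re
import unicodedata

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_decimal(s):
    """Convert a Roman numeral string to a decimal integer."""
    total = 0
    for i, ch in enumerate(s):
        v = ROMAN[ch]
        # subtract when a smaller symbol precedes a larger one (e.g. IV = 4)
        total += -v if i + 1 < len(s) and ROMAN[s[i + 1]] > v else v
    return total

def normalize_text(line):
    """Sketch of a few text-normalization steps (assumed rules)."""
    # delete strings matching a preset rule; bracketed markup is used as an example rule
    line = re.sub(r"\[[^\]]*\]", "", line)
    # fold non-ASCII characters to ASCII where a decomposition exists (e.g. é -> e)
    line = unicodedata.normalize("NFKD", line).encode("ascii", "ignore").decode()
    # convert standalone Roman numerals to decimal (naive: would also match a bare "I")
    line = re.sub(r"\b[IVXLCDM]+\b", lambda m: str(roman_to_decimal(m.group())), line)
    # unify case
    return line.lower().strip()
```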
In addition, to achieve the above object, the present invention provides a speech recognition apparatus, including:
an extraction module, configured to perform acoustic feature extraction on received speech information to obtain first acoustic feature information if receipt of the speech information is detected;
a decoding module, configured to decode and recognize the first acoustic feature information by using a target decoder to obtain a decoding recognition result, wherein the target decoder is constructed from a normalized pronunciation dictionary, a language model trained on normalized text corpora, and an acoustic model;
and an output module, configured to output the decoding recognition result to realize speech recognition.
Further, to achieve the above object, the present invention also provides a device, including a memory, a processor, and a speech recognition program stored on the memory and executable on the processor, wherein the speech recognition program, when executed by the processor, implements the steps of the speech recognition method described above.
Furthermore, to achieve the above object, the present invention also provides a computer-readable storage medium on which a speech recognition program is stored, wherein the speech recognition program, when executed by a processor, implements the steps of the speech recognition method described above.
According to the technical solution provided by the present invention, if receipt of speech information is detected, acoustic feature extraction is performed on the speech information to obtain first acoustic feature information; a target decoder is then used to decode and recognize the first acoustic feature information to obtain a decoding recognition result, where the target decoder is constructed from a normalized pronunciation dictionary, a language model trained on normalized text corpora, and an acoustic model; the decoding recognition result is then output to realize speech recognition. This solves the problem of poor speech recognition accuracy in the related art.
That is, in the technical solution provided by the present invention, the target decoder used to decode and recognize the extracted first acoustic feature information is constructed from a normalized pronunciation dictionary, a language model trained on normalized text corpora, and an acoustic model. Because the pronunciation dictionary and the language model are normalized, the target decoder constructed from them is more standard, so decoding and recognition with this decoder is more accurate; this improves the accuracy of speech recognition and thus the user's satisfaction.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an apparatus architecture of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a first embodiment of a speech recognition method according to the present invention;
FIG. 3 is a flowchart illustrating a speech recognition method according to a second embodiment of the present invention;
FIG. 4 is a flowchart illustrating a speech recognition method according to a third embodiment of the present invention;
FIG. 5 is a flowchart illustrating a fourth embodiment of a speech recognition method according to the present invention;
FIG. 6 is a block diagram of a first embodiment of a speech recognition apparatus according to the present invention;
FIG. 7 is a diagram illustrating a speech recognition method performed by the speech recognition apparatus according to the first embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.
The apparatus comprises: at least one processor 101, a memory 102, and a speech recognition program stored on the memory and executable on the processor, the speech recognition program being configured to implement the steps of the speech recognition method of any of the following embodiments.
Processor 101 may include one or more processing cores, such as a 4-core or 8-core processor. The processor 101 may be implemented in at least one of the hardware forms DSP (Digital Signal Processor), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 101 may also include a main processor and a coprocessor: the main processor, also called the CPU (Central Processing Unit), processes data in the awake state; the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 101 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. The processor 101 may further include an AI (Artificial Intelligence) processor for handling operations related to the speech recognition method, so that the speech recognition model can be trained and learn autonomously, improving efficiency and accuracy.
Memory 102 may include one or more computer-readable storage media, which may be non-transitory. Memory 102 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 102 is used to store at least one instruction for execution by processor 101 to implement a speech recognition method provided by method embodiments herein.
In some embodiments, the apparatus may further include: a communication interface 103 and at least one peripheral device. The processor 101, memory 102 and communication interface 103 may be connected by a bus or signal lines. Various peripheral devices may be connected to communication interface 103 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 104, display screen 105, and power supply 106.
The communication interface 103 can be used to connect at least one peripheral device related to I/O (Input/Output) to the processor 101 and the memory 102. In some embodiments, the processor 101, memory 102, and communication interface 103 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 101, the memory 102 and the communication interface 103 may be implemented on a single chip or circuit board, which is not limited in this embodiment.
The Radio Frequency circuit 104 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 104 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 104 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 104 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuitry 104 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 104 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 105 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 105 is a touch display screen, it can also capture touch signals on or over its surface; a touch signal may be input to the processor 101 as a control signal for processing. In this case, the display screen 105 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 105, set in the front panel of the device; in other embodiments, there may be at least two display screens 105, respectively disposed on different surfaces of the device or in a folded design; in some embodiments, the display screen 105 may be a flexible display disposed on a curved or folded surface of the device. The display screen 105 may even be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The display screen 105 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The power supply 106 is used to power the various components in the device. The power source 106 may be alternating current, direct current, disposable batteries, or rechargeable batteries. When the power source 106 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
Those skilled in the art will appreciate that the configuration shown in fig. 1 does not constitute a limitation of the device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
Based on the above hardware structure, embodiments of the present invention are proposed.
Referring to fig. 2, fig. 2 is a flowchart illustrating a voice recognition method according to a first embodiment of the present invention, where the voice recognition method includes the following steps:
step S21: and if the received voice information is monitored, extracting acoustic features of the voice information to obtain first acoustic feature information.
The received voice information in this embodiment may be issued by the user in real time, may also be recorded and uploaded by the user in advance, and may also be downloaded and uploaded from the internet or the like; in practical application, the method can be flexibly adjusted according to specific application scenes.
It is understood that the acoustic features extracted from the speech information in this embodiment refer to physical quantities representing acoustic characteristics of speech, which are general terms of acoustic expressions of sound elements, such as energy concentration areas representing timbre, formant frequency, formant intensity and bandwidth, and duration, fundamental frequency, average speech power representing prosodic characteristics of speech, and the like.
In some examples, performing acoustic feature extraction on the voice information to obtain first acoustic feature information may be performing feature extraction through Mel Frequency Cepstrum Coefficients (MFCC); specifically, the MFCC feature extraction comprises the steps of A/D conversion, pre-emphasis, windowing and framing, DFT + squaring, Mel filtering, logarithm taking, IDFT, dynamic feature and the like.
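The MFCC steps listed above can be sketched as follows; the frame length, hop, filter count, and coefficient count are common defaults assumed for illustration, and the final transform is written as a DCT, the usual realization of the IDFT step:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, n_mels=26, n_ceps=13):
    """Minimal MFCC sketch (typical parameter choices assumed, not the patent's)."""
    # Pre-emphasis
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing: 25 ms frames with a 10 ms hop, then Hamming windowing
    flen, hop = int(0.025 * sr), int(0.010 * sr)
    n_frames = 1 + (len(emphasized) - flen) // hop
    frames = np.stack([emphasized[i * hop:i * hop + flen] for i in range(n_frames)])
    frames = frames * np.hamming(flen)
    # DFT + squaring: power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Mel filterbank: triangular filters spaced evenly on the mel scale
    mel_max = 2595 * np.log10(1 + (sr / 2) / 700.0)
    mel_pts = np.linspace(0, mel_max, n_mels + 2)
    bin_pts = np.floor((n_fft + 1) * 700 * (10 ** (mel_pts / 2595) - 1) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bin_pts[m - 1], bin_pts[m], bin_pts[m + 1]
        for k in range(lo, c):
            fbank[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - c, 1)
    # Log filterbank energies, then DCT keeping the first n_ceps coefficients
    feats = np.log(np.maximum(power @ fbank.T, 1e-10))
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return feats @ dct.T
```

For one second of 16 kHz audio this yields a (98, 13) feature matrix: 98 frames, 13 cepstral coefficients each. Dynamic (delta) features would be appended in a further step.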
In some examples, acoustic feature extraction may instead be performed via deep-learning feature extraction to obtain the first acoustic feature information; specifically, deep-learning feature extraction includes the steps of sampling, framing, Fourier transform, character recognition, and obtaining a mapped image.
In some examples, whether speech information has been received may be checked at preset intervals, for example every 10 s; when receipt of speech information is detected, acoustic feature extraction is performed on the speech information to obtain the first acoustic feature information. This reduces system consumption to a certain extent and saves power.
In other examples, whether speech information has been received may be monitored continuously, and acoustic feature extraction performed on the speech information as soon as receipt is detected; this improves monitoring accuracy to a certain extent, allows the speech information to be acquired immediately, and increases the speech recognition rate.
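The two monitoring modes above (periodic polling versus continuous listening) can be sketched with a simple polling loop; `poll_fn` and `handle_fn` are hypothetical stand-ins for the audio source and the feature-extraction step:

```python
import time

def listen_loop(poll_fn, handle_fn, interval_s=10, max_polls=None):
    """Poll for speech every interval_s seconds; interval_s=0 approximates continuous monitoring."""
    polls = 0
    while max_polls is None or polls < max_polls:
        audio = poll_fn()          # returns received speech, or None if nothing arrived
        if audio is not None:
            handle_fn(audio)       # e.g. run acoustic feature extraction
        polls += 1
        time.sleep(interval_s)

received = []
listen_loop(lambda: "speech-chunk", received.append, interval_s=0, max_polls=3)
```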
Step S22: decode and recognize the first acoustic feature information by using a target decoder to obtain a decoding recognition result; the target decoder is constructed from the normalized pronunciation dictionary, a language model trained on normalized text corpora, and an acoustic model.
In this embodiment, after acoustic feature extraction is performed on the received speech information to obtain the first acoustic feature information, the target decoder is used to decode and recognize it to obtain a decoding recognition result. It should be noted that the target decoder in this embodiment is constructed from the normalized pronunciation dictionary, the language model trained on normalized text corpora, and the acoustic model; that is, the pronunciation dictionary and the language model are first normalized, and the target decoder is then constructed from the normalized pronunciation dictionary, the normalized language model, and the acoustic model. A target decoder constructed this way is more standard, so decoding and recognition with it is more accurate, which improves the accuracy of speech recognition.
The language model in this embodiment contains a plurality of word strings together with the probabilities of those word strings occurring in the text corpus; for example, the probability of an ungrammatical word string is close to 0, while the probability of a grammatical word string is high. The language model is trained on text corpora collected from various fields: these may be general-domain news-style corpora covering politics, economy, society, culture, religion, sports, formatted documents, catalogs, and the like, or domain-specific dialogue-style corpora covering message types and the like. In general, the wider the range of fields the collected text corpora come from, the more accurate the trained language model.
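As a toy illustration of how a language model assigns high probability to grammatical word strings and near-zero probability to ungrammatical ones, here is a minimal bigram model estimated from counts (the corpus and the smoothing-free estimator are illustrative assumptions, not the patent's training procedure):

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Estimate P(w2 | w1) from bigram and unigram counts over a tiny corpus."""
    bigrams, unigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]   # sentence-boundary markers
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks[:-1], toks[1:]))
    def prob(w1, w2):
        return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    return prob

lm = train_bigram_lm(["the cat sat", "the dog sat"])
# "cat" follows "the" in half the observed cases; "the the" never occurs
```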
The acoustic model in this embodiment contains a plurality of models for recognizing individual phonemes; for example, the model for phoneme a can determine whether a short segment of speech is a, the model for phoneme b can determine whether it is b, and so on. The acoustic model is trained on a large collected batch of speech corpora, which may cover different regional accents, different ages, different sexes, different speech rates, different loudness levels, and so on. In general, the more such factors the collected speech corpora cover, the more accurate the trained acoustic model.
The pronunciation dictionary in this embodiment contains the set of words the system can process, with their pronunciations annotated. Through the pronunciation dictionary, a mapping is obtained between the modeling unit of the acoustic model and the modeling unit of the language model; this connects the acoustic model and the language model into a searchable state space in which the decoder decodes. Specifically, the mapping comprises a mapping between phonemes and Chinese words and a mapping between phonemes and words; the mapping between phonemes and Chinese words is obtained by processing the mapping between pinyin and Chinese words with a preset pinyin-to-phoneme conversion rule.
In this embodiment, the target decoder decodes and recognizes the first acoustic feature information as follows: all possible word strings corresponding to the first acoustic feature information are determined according to the language model and expanded into phoneme strings according to the pronunciation dictionary; a decoding graph is then obtained according to the acoustic model, and the Viterbi algorithm is run on the decoding graph to obtain the optimal sequence and thus the recognition result.
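The Viterbi search mentioned above can be sketched generically over per-frame emission scores and transition scores (a simplified HMM-style formulation in log space, not the patent's actual decoding graph):

```python
import numpy as np

def viterbi(log_emissions, log_trans, log_init):
    """Best state sequence for a (T, S) emission matrix via dynamic programming."""
    T, S = log_emissions.shape
    score = log_init + log_emissions[0]          # best score ending in each state at t=0
    back = np.zeros((T, S), dtype=int)           # backpointers to the best predecessor
    for t in range(1, T):
        cand = score[:, None] + log_trans        # cand[i, j]: come from state i into state j
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(S)] + log_emissions[t]
    path = [int(np.argmax(score))]               # best final state, then trace back
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

In a full decoder the states would be arcs of the decoding graph built from the lexicon and language model; the dynamic program is the same.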
Step S23: output the decoding recognition result to realize speech recognition.
In this embodiment, the target decoder is used to decode and recognize the first acoustic feature information to obtain a decoding recognition result, which is then output. Ways of outputting the decoding recognition result include, but are not limited to, displaying it on the screen of a terminal, which may be the same terminal that received the speech information or a different one; in practical applications, this can be flexibly adjusted according to the specific application scenario.
In this embodiment, the target decoder used to decode and recognize the extracted first acoustic feature information is constructed from the normalized pronunciation dictionary, the language model trained on normalized text corpora, and the acoustic model. Because the pronunciation dictionary and the language model are normalized, the constructed target decoder is more standard, so decoding and recognition with it is more accurate; this improves the accuracy of speech recognition and thus the user's satisfaction.
Based on the above embodiment, a second embodiment of the speech recognition method of the present invention is provided; please refer to fig. 3, which is a flowchart of the second embodiment of the speech recognition method of the present invention.
In this embodiment, before the step in step S22 of decoding and recognizing the first acoustic feature information by using the target decoder to obtain the decoding recognition result, the speech recognition method may further include the following steps:
Step S31: normalize the pronunciation dictionary to obtain a normalized pronunciation dictionary.
In this embodiment, the step of normalizing the pronunciation dictionary to obtain a normalized pronunciation dictionary may include the following steps:
first, training on the pronunciation dictionary to obtain a word-to-phoneme model;
then, generating a supplementary pronunciation dictionary according to the word-to-phoneme model, where the phonemes contained in the supplementary pronunciation dictionary are not in the pronunciation dictionary, while the Chinese words and words corresponding to those phonemes are in the language model;
next, obtaining a combined pronunciation dictionary from the supplementary pronunciation dictionary and the pronunciation dictionary;
and further, obtaining a normalized pronunciation dictionary from the combined pronunciation dictionary.
To be clear, in this embodiment the obtained pronunciation dictionary is trained to produce a grapheme-to-phoneme (G2P) model, and a supplementary pronunciation dictionary is then generated with the G2P model. The phonemes contained in the supplementary pronunciation dictionary are not in the original pronunciation dictionary, while the Chinese words and words corresponding to those phonemes are in the language model; in other words, a supplementary dictionary is generated on top of the original one precisely to cover what the original lacks. Combining the supplementary and original dictionaries then yields a combined pronunciation dictionary whose phoneme coverage is comprehensive and complete, and the normalized pronunciation dictionary is subsequently derived from this combined dictionary.
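As a rough illustration of the supplement-and-merge step, the sketch below generates pronunciations only for language-model words missing from the base dictionary and merges the result into a combined lexicon. The naive letter-to-phoneme rule is a hypothetical stand-in for a trained G2P model (real systems typically train one with a tool such as Phonetisaurus or Sequitur on the base lexicon).

```python
def naive_g2p(word):
    """Hypothetical stand-in for a trained grapheme-to-phoneme model:
    map each letter to an uppercase pseudo-phoneme."""
    return [ch.upper() for ch in word]

def build_supplementary_dict(pron_dict, lm_vocab, g2p=naive_g2p):
    """Generate pronunciations only for LM words missing from the
    base pronunciation dictionary (the 'supplementary' lexicon)."""
    return {w: g2p(w) for w in lm_vocab if w not in pron_dict}

def merge_dicts(pron_dict, supplement):
    """Combine base and supplementary lexicons; base entries win."""
    combined = dict(supplement)
    combined.update(pron_dict)   # keep hand-crafted pronunciations
    return combined

base = {"hello": ["HH", "AH", "L", "OW"]}   # illustrative base lexicon
lm_vocab = ["hello", "kaldi"]               # illustrative LM vocabulary
supp = build_supplementary_dict(base, lm_vocab)
combined = merge_dicts(base, supp)
```

The merge deliberately prefers the base dictionary's hand-crafted pronunciations so that G2P guesses never overwrite them.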
In this embodiment, the step of obtaining the normalized pronunciation dictionary according to the combined pronunciation dictionary may include the following steps:
First, the case of the phonemes contained in the combined pronunciation dictionary is unified.
It can be understood that, after the combined pronunciation dictionary is obtained in this embodiment, the case of the phonemes it contains can first be unified: the phonemes may all be converted to uppercase, or all to lowercase. Performing this unification first makes the subsequent operations more convenient.
Then, case processing is applied to the phonemes contained in the combined pronunciation dictionary according to preset capitalization rules for proper nouns.
It can be understood that different proper nouns have corresponding capitalization rules; for example, the first letter of a person's or place's name is capitalized. Therefore, in this embodiment, after the case of the phonemes in the combined pronunciation dictionary has been unified, case processing can further be applied according to the preset proper-noun rules: if the entries were unified to uppercase, everything except the initial letters of names of people and places is adjusted back to lowercase; if the entries were unified to lowercase, the initial letters of names of people and places are adjusted to uppercase.
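The two-pass case normalization described above might look like the following sketch, where the proper-noun list, the choice of lowercase as the unified case, and the uppercase phoneme convention are all assumptions for illustration.

```python
PROPER_NOUNS = {"london", "alice"}   # assumed preset proper-noun list

def normalize_case(pron_dict):
    """Pass 1: unify word case (lowercase here). Pass 2: re-apply
    capitalization for preset proper nouns. Phonemes are uppercased."""
    normalized = {}
    for word, phones in pron_dict.items():
        w = word.lower()                      # pass 1: unify case
        if w in PROPER_NOUNS:
            w = w.capitalize()                # pass 2: proper-noun rule
        normalized[w] = [p.upper() for p in phones]
    return normalized

lex = {"LONDON": ["l", "ah", "n", "d", "ah", "n"],
       "HELLO":  ["hh", "ah", "l", "ow"]}     # illustrative entries
norm = normalize_case(lex)
```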
Finally, phonemes corresponding to silence words and/or noise words and/or out-of-vocabulary words are added to the combined pronunciation dictionary to obtain the normalized pronunciation dictionary.
It can be understood that, to make the phonemes contained in the combined pronunciation dictionary more comprehensive and complete, phonemes corresponding to silence words and/or noise words and/or out-of-vocabulary words may be constructed in advance in this embodiment and then added to the combined pronunciation dictionary. Here, silence words refer to pause-related tokens, noise words refer to filler interjections such as "oh" and "ah", and out-of-vocabulary words refer to words not present in the pronunciation dictionary.
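Adding the special entries might look like the sketch below; the token and phone names (`<sil>`, `SIL`, `NSN`, `SPN`) follow common Kaldi lexicon conventions and are assumptions rather than anything mandated by the text.

```python
SPECIAL_ENTRIES = {
    "<sil>":   ["SIL"],   # silence / pause-related token
    "<noise>": ["NSN"],   # filler interjections and non-speech noise
    "<unk>":   ["SPN"],   # out-of-vocabulary ("spoken noise") words
}

def add_special_entries(combined_dict):
    """Return a copy of the lexicon with the special tokens added;
    existing entries are never overwritten."""
    final = dict(combined_dict)
    for token, phones in SPECIAL_ENTRIES.items():
        final.setdefault(token, phones)
    return final

normalized_lexicon = add_special_entries({"hello": ["HH", "AH", "L", "OW"]})
```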
It should be noted that, after the combined pronunciation dictionary has been processed through the series of steps above, it is more standard, comprehensive, and complete; for the sake of distinction, it is therefore referred to as the normalized pronunciation dictionary.
Step S32: normalizing the text corpus to obtain a normalized text corpus, and training the normalized text corpus to obtain a language model.
The step of normalizing the text corpus to obtain the normalized text corpus in this embodiment may include the following steps:
first, text corpora are collected from a plurality of fields;
then, the collected text corpora are normalized to obtain the normalized text corpus.
To be clear, in this embodiment the text corpora are collected from multiple fields. These may be general fields, such as news-type corpora covering politics, economy, society, culture, religion, sports, formatted documents, and catalogs, or specific fields, such as conversation-type corpora from messaging; collecting from multiple fields in this way makes the text corpora more comprehensive and complete. The collected text corpora are then normalized to obtain the normalized text corpus.
In this embodiment, the step of performing normalization processing on the text corpus to obtain a normalized text corpus may include the following steps:
deleting characters and/or character strings in the text corpus according to a preset deletion rule;
and/or converting non-ASCII characters in the text corpus into ASCII characters;
and/or, converting Roman numerals in the text corpus into decimal;
and/or performing semantic segmentation on the text in the text corpus;
and/or, unifying the case of the words in the text corpus;
and/or, performing error correction processing on Chinese words and/or words in the text corpus;
and/or, constructing compound words and/or acronyms and/or names of people and/or places and adding them to the text corpus.
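A few of the normalization rules above can be sketched as follows; the deletion regex, the rule ordering, and the Roman-numeral matching (which is naive and would also match a bare "I") are assumptions, not the patent's exact rules.

```python
import re
import unicodedata

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(s):
    """Convert a Roman numeral string to a decimal integer."""
    total = 0
    for i, ch in enumerate(s):
        v = ROMAN[ch]
        # subtractive notation: IV = 4, IX = 9, ...
        total += -v if i + 1 < len(s) and v < ROMAN[s[i + 1]] else v
    return total

def normalize_text(line):
    line = unicodedata.normalize("NFKD", line)                # decompose accents
    line = line.encode("ascii", "ignore").decode("ascii")     # non-ASCII -> ASCII
    line = re.sub(r"[^\w\s]", " ", line)                      # deletion rule (assumed)
    line = re.sub(r"\b[IVXLCDM]+\b",                          # Roman numerals -> decimal
                  lambda m: str(roman_to_int(m.group())), line)
    return " ".join(line.lower().split())                     # unify case & whitespace

sample = normalize_text("Chapter IV: café!")
```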
Correspondingly, the step of training the normalized text corpus to obtain the language model in this embodiment may include the following steps:
first, Chinese words and/or words with a usage frequency higher than a preset threshold are acquired from the normalized text corpus;
then, a constructed vocabulary is generated according to these Chinese words and/or words;
finally, the normalized text corpus and the constructed vocabulary are trained together to obtain the language model.
To be clear, after the normalized text corpus is obtained in this embodiment, the Chinese words and/or words whose usage frequency exceeds a preset threshold can be extracted from it, a constructed vocabulary can be generated from those words, and the language model can then be trained on the normalized text corpus together with the constructed vocabulary. Because the language model is trained jointly on the normalized corpus and a vocabulary built from its high-frequency words, the accuracy of the trained language model improves, which in turn improves the accuracy of speech recognition.
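The vocabulary-construction step can be sketched as follows, with the whitespace tokenization and the threshold value chosen purely for illustration.

```python
from collections import Counter

def build_vocab(corpus_lines, min_count=2):
    """Keep tokens whose corpus frequency reaches the preset threshold."""
    counts = Counter(tok for line in corpus_lines for tok in line.split())
    return sorted(w for w, c in counts.items() if c >= min_count)

corpus = ["the cat sat", "the cat ran", "a dog ran"]   # toy normalized corpus
vocab = build_vocab(corpus, min_count=2)
```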
Step S33: and constructing a decoder according to the normalized pronunciation dictionary, the language model and the acoustic model to obtain a target decoder.
To be clear, in this embodiment, after the pronunciation dictionary has been normalized and the normalized text corpus has been trained into a language model, a decoder can be constructed from the normalized pronunciation dictionary, that language model, and the acoustic model to obtain the target decoder.
In this embodiment, the decoder is constructed from the normalized pronunciation dictionary, the language model trained on the normalized text corpus, and the acoustic model. The resulting target decoder is more standard, comprehensive, and complete, so decoding the acoustic feature information with it produces a more accurate recognition result and improves speech recognition accuracy; moreover, the whole process is simple and easy to develop and implement.
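In practice the pronunciation dictionary, language model, and acoustic model are composed into a decoding graph with weighted finite-state transducers (for example, Kaldi's HCLG construction). The toy sketch below only illustrates the roles of the lexicon and language model in decoding: it segments a hypothesized phoneme sequence into words with the lexicon and lets a bigram language model choose among segmentations. All entries and probabilities are invented for illustration.

```python
import math

LEXICON = {("HH", "AY"): "hi", ("DH", "EH", "R"): "there",
           ("HH", "AY", "DH"): "hide"}                  # hypothetical lexicon
BIGRAM = {("<s>", "hi"): 0.5, ("hi", "there"): 0.6}     # assumed LM probabilities

def lm_logprob(prev, word):
    return math.log(BIGRAM.get((prev, word), 1e-4))     # crude backoff floor

def decode(phones):
    """Best-scoring segmentation of `phones` into lexicon words.
    DP keeps only the best hypothesis per position, exact enough
    for this toy setup."""
    n = len(phones)
    best = {0: (0.0, ["<s>"])}                          # position -> (score, words)
    for i in range(n):
        if i not in best:
            continue
        score, words = best[i]
        for j in range(i + 1, n + 1):
            w = LEXICON.get(tuple(phones[i:j]))
            if w is None:
                continue
            s = score + lm_logprob(words[-1], w)
            if j not in best or s > best[j][0]:
                best[j] = (s, words + [w])
    return best[n][1][1:] if n in best else None        # drop "<s>"

hyp = decode(["HH", "AY", "DH", "EH", "R"])
```

Here the language model resolves the ambiguity between "hide ..." (a dead end) and "hi there", which is exactly the role it plays inside a real decoder.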
Based on the above embodiments, a third embodiment of the speech recognition method of the present invention is provided, please refer to fig. 4, and fig. 4 is a flowchart illustrating the third embodiment of the speech recognition method of the present invention.
In this embodiment, before step S33 (constructing a decoder according to the normalized pronunciation dictionary, the language model, and the acoustic model to obtain the target decoder), the method may further include the following steps:
Step S41: extracting acoustic features from the speech corpus to obtain second acoustic feature information;
Step S42: training on the second acoustic feature information to obtain an acoustic model.
To be clear, in this embodiment a large batch of speech corpora may first be collected. The corpora may cover regional accents, different ages, different genders, different speech rates, different volumes, and so on, so that the collection accounts for these different factors and is more comprehensive and complete. Acoustic features are then extracted from the collected speech corpora to obtain the second acoustic feature information, which is trained to obtain the acoustic model.
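As an illustration of acoustic feature extraction, the sketch below computes a toy log-energy feature over overlapping frames. Production systems extract MFCC or filterbank features over roughly 25 ms frames; the tiny frame and hop sizes here are purely illustrative.

```python
import math

def frame_log_energy(samples, frame_len=4, hop=2):
    """Split samples into overlapping frames and return the log
    energy of each frame (a 1-dimensional toy acoustic feature)."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame)
        feats.append(math.log(energy + 1e-10))   # floor avoids log(0)
    return feats

feats = frame_log_energy([0.0, 0.1, -0.2, 0.1, 0.3, -0.1, 0.0, 0.2])
```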
In this embodiment, because the collected speech corpora account for these different factors, they are more comprehensive and complete, and so is the acoustic feature information extracted from them; this improves the accuracy of the trained acoustic model and, in turn, the accuracy of speech recognition.
Based on the above embodiments, a fourth embodiment of the speech recognition method of the present invention is provided, please refer to fig. 5, and fig. 5 is a flowchart illustrating the fourth embodiment of the speech recognition method of the present invention.
In this embodiment, before step S33 (constructing a decoder according to the normalized pronunciation dictionary, the language model, and the acoustic model to obtain the target decoder), the method may further include the following steps:
Step S51: acquiring an acoustic model and a language model;
Step S52: establishing a mapping relationship between phonemes and Chinese words according to the phonemes in the acoustic model and the Chinese words in the language model, and establishing a mapping relationship between phonemes and words according to the phonemes in the acoustic model and the words in the language model;
Step S53: obtaining a pronunciation dictionary according to the mapping relationship between phonemes and Chinese words and the mapping relationship between phonemes and words.
To be clear, in this embodiment an acoustic model and a language model may first be obtained, the language model having been trained on the normalized text corpus. A mapping between phonemes and Chinese words is then established according to the phonemes in the acoustic model and the Chinese words in the language model, and a mapping between phonemes and words is established according to the phonemes in the acoustic model and the words in the language model; the pronunciation dictionary is then obtained from these two mappings.
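A minimal sketch of keeping the lexicon consistent with both models: retain only candidate pronunciations whose word occurs in the language-model vocabulary and whose phonemes all belong to the acoustic model's phone set. All entries below are hypothetical.

```python
AM_PHONES = {"HH", "AH", "L", "OW", "K"}           # phones the AM can emit
LM_VOCAB = {"hello", "cola"}                       # words the LM scores
CANDIDATES = {"hello": ["HH", "AH", "L", "OW"],
              "cola":  ["K", "OW", "L", "AH"],
              "zebra": ["Z", "IY", "B", "R", "AH"]}  # Z not in AM phone set

def build_lexicon(candidates, am_phones, lm_vocab):
    """Keep entries mapped on both sides: word in the LM vocabulary,
    every phone in the acoustic model's phone set."""
    return {w: ph for w, ph in candidates.items()
            if w in lm_vocab and all(p in am_phones for p in ph)}

lexicon = build_lexicon(CANDIDATES, AM_PHONES, LM_VOCAB)
```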
In this embodiment, the pronunciation dictionary is derived from the acoustic model and from a language model trained on the normalized text corpus, so the dictionary obtained on this basis is already relatively standard, comprehensive, and complete, and it is then itself normalized to become more so. Because the starting dictionary is already well-formed, its normalization takes less time, which improves the efficiency of pronunciation dictionary normalization.
In addition, referring to fig. 6, an embodiment of the present invention further provides a speech recognition apparatus based on the speech recognition method, where the speech recognition apparatus includes:
the extracting module 601 is configured to, upon detecting that voice information has been received, perform acoustic feature extraction on the voice information to obtain first acoustic feature information;
the decoding module 602 is configured to decode the first acoustic feature information by using a target decoder to obtain a decoding recognition result, wherein the target decoder is constructed according to the normalized pronunciation dictionary, a language model obtained by training the normalized text corpus, and an acoustic model;
and an output module 603, configured to output a decoding recognition result to implement speech recognition.
It should be noted that, in this embodiment, the speech recognition apparatus further optionally includes other corresponding modules to implement the steps of the speech recognition method; for better understanding, please refer to fig. 7, which is a schematic diagram illustrating steps of the speech recognition method implemented by the speech recognition apparatus according to the present invention.
The speech recognition apparatus of the present invention adopts all the technical solutions of all the embodiments described above, and therefore achieves at least all the beneficial effects brought by those technical solutions; details are not repeated here.
In addition, an embodiment of the present invention further provides a computer-readable storage medium, where a speech recognition program is stored on the computer-readable storage medium, and the speech recognition program, when executed by a processor, implements the steps of the speech recognition method as described above.
The computer-readable storage media include volatile or nonvolatile, removable or non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, computer program modules, or other data. Computer-readable storage media include, but are not limited to, RAM (Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory or other memory technology, CD-ROM (Compact Disc Read-Only Memory), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or any other medium that can be used to store the desired information and that can be accessed by a computer.
It will be apparent to one skilled in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (11)

1. A speech recognition method, characterized in that it comprises the steps of:
if it is detected that voice information is received, performing acoustic feature extraction on the voice information to obtain first acoustic feature information;
decoding the first acoustic feature information by using a target decoder to obtain a decoding recognition result; wherein the target decoder is constructed according to a pronunciation dictionary after normalization processing, a language model obtained by training a text corpus after normalization processing, and an acoustic model;
and outputting the decoding recognition result to realize speech recognition.
2. The speech recognition method of claim 1, wherein before the step of decoding the first acoustic feature information by using a target decoder to obtain a decoded recognition result, the speech recognition method further comprises:
normalizing the pronunciation dictionary to obtain a normalized pronunciation dictionary;
normalizing a text corpus to obtain a normalized text corpus, and training the normalized text corpus to obtain a language model;
and constructing a decoder according to the normalized pronunciation dictionary, the language model and the acoustic model to obtain a target decoder.
3. The speech recognition method of claim 2, wherein the step of constructing a decoder from the normalized pronunciation dictionary, the language model, and the acoustic model to obtain a target decoder is preceded by the step of:
extracting acoustic features of the voice corpus to obtain second acoustic feature information;
and training the second acoustic characteristic information to obtain an acoustic model.
4. The speech recognition method of claim 2, wherein the step of constructing a decoder from the normalized pronunciation dictionary, the language model, and the acoustic model to obtain a target decoder is preceded by the step of:
obtaining an acoustic model and the language model;
establishing a mapping relation between phonemes and Chinese words according to the phonemes in the acoustic model and the Chinese words in the language model, and establishing a mapping relation between phonemes and words according to the phonemes in the acoustic model and the words in the language model;
and obtaining a pronunciation dictionary according to the mapping relation between the phonemes and the Chinese words and the mapping relation between the phonemes and the words.
5. The speech recognition method according to any one of claims 2 to 4, wherein the step of normalizing the pronunciation dictionary to obtain a normalized pronunciation dictionary comprises:
training the pronunciation dictionary to obtain a grapheme-to-phoneme model;
generating a supplementary pronunciation dictionary according to the word-to-phoneme model; wherein phonemes contained in the supplementary pronunciation dictionary are not in the pronunciation dictionary, and Chinese words corresponding to the phonemes contained in the supplementary pronunciation dictionary and words corresponding to the phonemes are in the language model;
obtaining a combined pronunciation dictionary according to the supplementary pronunciation dictionary and the pronunciation dictionary;
and obtaining a normalized pronunciation dictionary according to the combined pronunciation dictionary.
6. The speech recognition method of claim 5, wherein the step of deriving a normalized pronunciation dictionary based on the combined pronunciation dictionary comprises:
unifying the case of phonemes contained in the combined pronunciation dictionary;
applying case processing to the phonemes contained in the combined pronunciation dictionary according to preset capitalization rules for proper nouns;
and adding phonemes corresponding to silence words and/or noise words and/or out-of-vocabulary words to the combined pronunciation dictionary to obtain a normalized pronunciation dictionary.
7. The speech recognition method according to any one of claims 2-4, wherein the step of normalizing the text corpus to obtain a normalized text corpus comprises:
collecting text corpora from a plurality of fields;
carrying out standardization processing on the text corpus to obtain a standardized text corpus;
the step of training the normalized text corpus to obtain a language model comprises:
acquiring Chinese words and/or words with the use frequency higher than a preset threshold value from the normalized text corpus;
generating a constructed vocabulary list according to the Chinese words and/or words with the use frequency higher than a preset threshold value;
and training the normalized text corpus and the construction vocabulary to obtain a language model.
8. The speech recognition method of claim 7, wherein the step of normalizing the corpus of text to obtain a normalized corpus of text comprises:
deleting characters and/or character strings in the text corpus according to a preset deletion rule;
and/or converting non-ASCII characters in the text corpus into ASCII characters;
and/or converting Roman numerals in the text corpus into decimal system;
and/or performing semantic segmentation on the text in the text corpus;
and/or, unifying the case of the words in the text corpus;
and/or, performing error correction processing on Chinese words and/or words in the text corpus;
and/or constructing compound words and/or acronyms and/or names of people and/or places and adding the compound words and/or acronyms and/or names of people and/or places to the text corpus.
9. A speech recognition apparatus, characterized in that the speech recognition apparatus comprises:
the extraction module is used for extracting acoustic features of the voice information to obtain first acoustic feature information if the voice information is monitored to be received;
the decoding module is used for decoding the first acoustic characteristic information by using a target decoder to obtain a decoding identification result; the target decoder is constructed according to a pronunciation dictionary after normalization processing, a language model obtained by training text corpora after normalization processing and an acoustic model;
and the output module is used for outputting the decoding recognition result so as to realize voice recognition.
10. An apparatus, characterized in that the apparatus comprises: a memory, a processor, and a speech recognition program stored on the memory and runnable on the processor, wherein the speech recognition program, when executed by the processor, implements the steps of the speech recognition method according to any one of claims 1-8.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a speech recognition program which, when being executed by a processor, carries out the steps of the speech recognition method according to any one of claims 1-8.
CN202110150327.7A 2021-02-02 2021-02-02 Speech recognition method and device, equipment and computer readable storage medium Pending CN113012683A (en)

Publications (1)

Publication Number Publication Date
CN113012683A true CN113012683A (en) 2021-06-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination