CN114495905A - Speech recognition method, apparatus and storage medium - Google Patents

Speech recognition method, apparatus and storage medium

Info

Publication number
CN114495905A
Authority
CN
China
Prior art keywords
language
language model
score
recognition result
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111509507.6A
Other languages
Chinese (zh)
Inventor
郭震
李智勇
陈孝良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority to CN202111509507.6A priority Critical patent/CN114495905A/en
Publication of CN114495905A publication Critical patent/CN114495905A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/01 - Assessment or evaluation of speech recognition systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/28 - Constructional details of speech recognition systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a speech recognition method, apparatus, and storage medium. The method includes: acquiring speech to be recognized; inputting the speech to be recognized into an acoustic model to obtain the probabilities of different pronunciations output by the acoustic model; inputting the probabilities of the different pronunciations into N language models for different application scenarios to obtain the probabilities of different phrases output by each of the N language models; for each of the N language models, obtaining the recognition result of that language model and the score of that recognition result according to the probabilities of the different pronunciations and the probabilities of the different phrases output by that language model; and taking the speech recognition result corresponding to the highest score among the recognition result scores of the N language models as the target speech recognition result, thereby improving the accuracy and efficiency of speech recognition.

Description

Speech recognition method, apparatus and storage medium
Technical Field
The embodiments of the present application relate to the field of computer technology, and in particular to a speech recognition method, apparatus, and storage medium.
Background
Speech recognition, also known as Automatic Speech Recognition (ASR), computer speech recognition, or Speech-to-Text (STT), is a technology that converts human speech into text or other computer-readable input. It draws on knowledge and research from linguistics, computer science, and electrical engineering.
At present, speech recognition proceeds as follows: the speech to be recognized is input into an acoustic model and a language model to obtain a recognition result.
However, this approach uses the same language model to recognize speech from different application scenarios, which yields poor recognition accuracy and low efficiency.
Disclosure of Invention
The embodiments of the present application provide a speech recognition method, apparatus, and storage medium to improve the accuracy and efficiency of speech recognition.
In a first aspect, the present application provides a speech recognition method, including:
acquiring speech to be recognized;
inputting the speech to be recognized into an acoustic model to obtain the probabilities of different pronunciations output by the acoustic model;
inputting the probabilities of the different pronunciations into N language models to obtain the probabilities of different phrases output by each of the N language models, where the N language models correspond one-to-one to N different application scenarios;
for each of the N language models, obtaining the recognition result of that language model and the score of that recognition result according to the probabilities of the different pronunciations and the probabilities of the different phrases output by that language model; and
taking the speech recognition result corresponding to the highest score among the recognition result scores of the N language models as the target speech recognition result.
In a second aspect, an embodiment of the present application provides a speech recognition apparatus, including:
an acquisition unit, configured to acquire speech to be recognized;
a speech recognition unit, configured to input the speech to be recognized into an acoustic model to obtain the probabilities of different pronunciations output by the acoustic model; input the probabilities of the different pronunciations into N language models to obtain the probabilities of different phrases output by each of the N language models; for each of the N language models, obtain the recognition result of that language model and the score of that recognition result according to the probabilities of the different pronunciations and the probabilities of the different phrases output by that language model; and take the speech recognition result corresponding to the highest score among the recognition result scores of the N language models as the target speech recognition result.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor and a memory, the memory for storing a computer program, the processor for invoking and executing the computer program stored in the memory to perform the method of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium for storing a computer program, where the computer program causes a computer to execute the method of the first aspect or any implementation thereof.
To sum up, in the embodiments provided by the present application, N language models are set up for N different application scenarios, the recognition result and recognition result score of each of the N language models are obtained, and the speech recognition result corresponding to the highest of those scores is taken as the target speech recognition result. In other words, a single speech recognition pass uses multiple language models for different application scenarios simultaneously, for example N language models, and taking the highest-scoring result among the N language models as the target result improves recognition accuracy. In addition, the N language models work in parallel, which improves recognition efficiency. That is, the present application improves both the accuracy and the efficiency of speech recognition.
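By way of illustration only (not part of the claimed method), the flow above can be sketched in Python; acoustic_model.infer and decode_and_score are hypothetical interfaces standing in for the components described in this application:

    from concurrent.futures import ThreadPoolExecutor

    def recognize(speech, acoustic_model, language_models):
        # One acoustic pass: probabilities of different pronunciations.
        pron_probs = acoustic_model.infer(speech)

        # Each scenario-specific language model decodes and scores
        # independently, so the N models can run in parallel.
        def run(lm):
            result, score = lm.decode_and_score(pron_probs)
            return score, result

        with ThreadPoolExecutor(max_workers=len(language_models)) as pool:
            scored = list(pool.map(run, language_models))

        # The result whose recognition score P(i) is highest wins.
        best_score, best_result = max(scored, key=lambda pair: pair[0])
        return best_result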
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic view of an application scenario according to an embodiment of the present application;
fig. 2 is a schematic flow chart of a speech recognition method according to an embodiment of the present application;
fig. 3 is another schematic flow chart of a speech recognition method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
fig. 5 is a schematic flowchart of another speech recognition method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 7 is a flowchart of a speech recognition method according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used to distinguish between similar elements and not necessarily to describe a particular sequential or chronological order. It is to be understood that the data so used are interchangeable under appropriate circumstances, such that the embodiments of the application described herein can be practiced in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments of the present application apply to the technical field of speech recognition, for example to speech recognition in different application scenarios, so as to improve recognition efficiency.
In order to facilitate understanding of the embodiments of the present application, the related concepts related to the embodiments of the present application are first briefly described as follows:
in the related art, voice recognition as a new interactive mode may be applied to various electronic devices or terminals, for example, the voice recognition is applied to a smart speaker, and the smart speaker can receive a voice instruction sent by a user and perform related operations (e.g., playing a song, playing news, or playing a weather forecast); for another example, the voice recognition is applied to a smart home system, and the smart home system can receive a voice command sent by a user and enable home equipment corresponding to the voice command to perform related operations (e.g., opening/closing an air conditioner, opening a curtain, opening/closing a television, etc.). As another example, voice recognition is applied to an in-vehicle apparatus that can receive a voice instruction (e.g., navigation) issued by a user and provide a travel route to the user. For example, voice recognition is applied to the mobile device, and voice received by the mobile device can be converted into text and output to a display of the mobile device.
Fig. 1 is a schematic view of an application scenario according to an embodiment of the present application, and as shown in fig. 1, includes a user 101, a terminal device 102, and a server 103.
The terminal device 102 has a microphone, and a user can input a sound signal to the microphone by sounding near the microphone. The terminal device 102 may be provided with at least one microphone. Optionally, the microphone may also implement a noise reduction function in addition to collecting the sound signal.
In the embodiment of the present application, the terminal device 102 may be any terminal, for example, the terminal device 102 may be a user equipment for machine type communication. The terminal 102 may also be referred to as a User Equipment (UE), a Mobile Station (MS), a mobile terminal (mobile terminal), a terminal (terminal), and so on.
The terminal device 102 may also be referred to as a wireless terminal, which may be a device that provides voice and/or data connectivity to a user, a handheld device having wireless connection capability, or other processing device connected to a wireless modem.
For example, terminal device 102 may be a cellular telephone, a cordless telephone, a Session Initiation Protocol (SIP) phone, a Wireless Local Loop (WLL) station, a Personal Digital Assistant (PDA), a handheld device having wireless communication capabilities, a computing device or other processing device connected to a wireless modem, an in-vehicle device, or a wearable device, a Virtual Reality (VR) terminal device, an Augmented Reality (AR) terminal device, a wireless terminal in industrial control (industrial control), a wireless terminal in self driving (self driving), a wireless terminal in remote medical (remote medical), a wireless terminal in smart grid, a wireless terminal in transportation safety, a wireless terminal in smart city (smart city), a wireless terminal in smart home (smart home), and the like. The embodiments of the present application are not particularly limited.
As another example, terminal devices 102 include, but are not limited to, connections via wireline, such as Public Switched Telephone Network (PSTN), Digital Subscriber Line (DSL), Digital cable, direct cable connection; and/or another data connection/network; and/or via a Wireless interface, e.g., to a cellular Network, a Wireless Local Area Network (WLAN), a digital television Network such as a DVB-H Network, a satellite Network, an AM-FM broadcast transmitter; and/or means of another terminal device arranged to receive/transmit communication signals; and/or Internet of Things (IoT) devices. A terminal device arranged to communicate over a wireless interface may be referred to as a "wireless communication terminal", "wireless terminal", or "mobile terminal". Examples of mobile terminals include, but are not limited to, satellite or cellular telephones; personal Communications Systems (PCS) terminals that may combine cellular radiotelephones with data processing, facsimile, and data Communications capabilities; PDAs that may include radiotelephones, pagers, internet/intranet access, Web browsers, notepads, calendars, and/or Global Positioning System (GPS) receivers; and conventional laptop and/or palmtop receivers or other electronic devices that include a radiotelephone transceiver.
Alternatively, the terminal device 102 may be deployed on land, including indoors or outdoors, hand-held or vehicle-mounted; can also be deployed on the water surface; it may also be deployed on airborne airplanes, balloons and satellite vehicles. The embodiment of the present application does not limit the application scenario of the terminal device 102.
The server 103 may be a single server or may be one server in a cloud data center. The cloud data center belongs to a cloud environment. The cloud environment is an entity which provides cloud services to users by using basic resources in a cloud computing mode. A cloud environment includes a cloud data center that includes a large number of infrastructure resources (including computing resources, storage resources, and network resources) owned by a cloud service provider, which may include a large number of computing devices (e.g., servers), and a cloud service platform. For example, taking an example that the computing resources included in the cloud data center are servers running virtual machines, the acoustic model and the N language models of the present application may be deployed independently on the servers or virtual machines in the cloud data center. Optionally, the acoustic model and the N language models of the present application may also be deployed in a distributed manner on multiple servers in the cloud data center, or in a distributed manner on multiple virtual machines in the cloud data center, or in a distributed manner on servers and virtual machines in the cloud data center.
In practical applications, the user 101 utters speech to be recognized to the terminal device 102, for example into a microphone of the terminal device 102. The terminal device 102 transmits the speech to be recognized to the server 103. The server 103 executes the method of the embodiments of the present application through the acoustic model and the N language models to recognize the speech and obtain a target recognition result, which it sends back to the terminal device 102. Optionally, the terminal device 102 executes a corresponding action according to the target recognition result.
The terminal device 102 and the server 103 are connected through a network.
In current speech recognition methods, the speech to be recognized is acquired, a decoder searches the decoding graph corresponding to the acoustic model and the language model for the optimal path matching that speech, and the text or command corresponding to the optimal path is taken as the recognition result of the speech to be recognized.
However, as speech recognition technology reaches more and more electronic devices and terminals, users are no longer satisfied with recognition for audio and video scenarios such as songs and dramas, and place high demands on speech recognition for many application scenarios such as education, medical care, and finance. The increase in application scenarios enlarges the speech samples and text data used for training the language model, which makes training and decoding more complex. Therefore, using a single language model for speech recognition yields poor accuracy and takes longer to complete, making it ill-suited to situations with strict real-time requirements.
To solve this technical problem, the present application uses multiple language models for different application scenarios simultaneously in a single recognition pass, for example N language models, and takes the highest-scoring recognition result among the N language models as the target speech recognition result, thereby improving recognition accuracy. In addition, the N language models work in parallel, which improves recognition efficiency. That is, the present application improves both the accuracy and the efficiency of speech recognition.
The technical solutions of the embodiments of the present application are described in detail below with reference to some embodiments. The following several embodiments may be combined with each other and may not be described in detail in some embodiments for the same or similar concepts or processes.
Fig. 2 is a schematic flowchart of a speech recognition method according to an embodiment of the present application, and fig. 3 is a schematic diagram of a speech recognition process according to an embodiment of the present application. As shown in fig. 2 and 3, the method of the embodiment of the present application includes:
and S100, acquiring the voice to be recognized.
The execution subject of this embodiment is a device with a speech recognition function, such as a speech recognition apparatus, which may be the server in fig. 1 or a part of that server; other computing devices are also possible, and the present application is not limited in this respect.
For ease of description, the present application takes an electronic device as the execution subject by way of example.
Optionally, the speech to be recognized may be any type of speech information to be recognized, such as a control instruction sent by a user to the terminal device.
In some embodiments, if the terminal device itself has the speech recognition function, the terminal device acts as the electronic device and performs the speech recognition operation; that is, the electronic device obtains the speech to be recognized directly from the user.
In some embodiments, if the terminal device does not have the speech recognition function, it may send the speech to be recognized to an electronic device that does, for example the server in fig. 1; that is, the electronic device obtains the speech to be recognized from the terminal device.
S200, inputting the speech to be recognized into the acoustic model to obtain the probabilities of different pronunciations output by the acoustic model.
The acoustic model is built from knowledge of acoustics and phonetics and is trained on collected speech samples (e.g., raw corpora).
The speech to be recognized is input into the trained acoustic model, which converts it into an acoustic representation as output, i.e., the probability that the speech belongs to each acoustic symbol (that is, each pronunciation).
In some embodiments, to further improve recognition accuracy, features are first extracted from the speech to be recognized before it enters the acoustic model, and those features are then input into the acoustic model.
Feature extraction transforms the speech to be recognized from the time domain to the frequency domain to extract suitable features.
Alternatively, the acoustic model may be any one of a Hidden Markov Model (HMM), a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM), a Deep Neural Network (DNN) model, a Deep Neural Network-Hidden Markov Model (DNN-HMM), and the like.
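As a minimal illustration of the time-domain-to-frequency-domain step above (illustrative only; the application does not specify the exact features, and practical systems often use MFCC or filter-bank features instead of raw log spectra):

    import numpy as np

    def extract_features(signal, frame_len=400, hop=160):
        """Frame the waveform with overlap and take log FFT magnitudes per frame."""
        window = np.hanning(frame_len)
        frames = [signal[i:i + frame_len] * window
                  for i in range(0, len(signal) - frame_len + 1, hop)]
        # One log-magnitude spectrum per frame: a crude frequency-domain feature.
        return np.array([np.log(np.abs(np.fft.rfft(f)) + 1e-8) for f in frames])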
S300, respectively inputting the probabilities of different pronunciations into the N language models to obtain the probabilities of different phrases output by each language model in the N language models.
The language model computes the probability of a phrase occurring. It mainly captures the habits of human language expression, emphasizing the internal relations between words in their arrangement. Simply put, the language model computes the probability that a sentence is grammatically correct, thereby improving the recognition rate and narrowing the search space. Because sentences are usually built according to rules, the words already seen foreshadow the words likely to follow; the language model defines which words can follow an already recognized word (matching is a sequential process) and thus eliminates unlikely words from the matching process.
A good language model not only improves the efficiency of speech recognition but also improves its accuracy to some extent. Language models fall into two categories, rule-based models and statistical models. A statistical language model uses probability and statistics to describe the inherent statistical regularities among words and is widely used in fields such as speech recognition and machine translation.
The language model effectively combines grammatical and semantic knowledge of the language and describes the internal relations between words, thereby narrowing the search space and improving recognition accuracy. Grammatical and semantic analysis of the speech samples yields a language model based on a statistical model.
In the embodiments provided by this application, speech samples are collected separately for each of N application scenarios and used to train N language models, one per scenario, so that the N language models correspond one-to-one to the N application scenarios, where N is a positive integer greater than 1. For example, suppose N is 3 and the 3 application scenarios are education, medical care, and finance. The training process then collects speech samples for the education, medical, and financial scenarios separately, trains a language model on each scenario's samples, and obtains a language model for the education scenario, one for the medical scenario, and one for the financial scenario.
Alternatively, the language model may be one of an N-gram model, a Hidden Markov Model (HMM), a Maximum Entropy Model (MEM), and the like.
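For the N-gram option, a toy bigram sketch (illustrative only; the actual scenario-specific models are trained on the collected speech samples described above):

    from collections import Counter

    def train_bigram(corpus):
        """corpus: tokenized sentences from one application scenario's samples."""
        unigrams, bigrams = Counter(), Counter()
        for sent in corpus:
            tokens = ["<s>"] + sent
            unigrams.update(tokens)
            bigrams.update(zip(tokens, tokens[1:]))
        vocab = len(unigrams)
        # P(w2 | w1) with add-one smoothing so unseen pairs keep a small probability.
        return lambda w1, w2: (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)

    prob = train_bigram([["speech", "recognition", "works"]])
    print(prob("speech", "recognition"))   # a seen pair scores higher than unseen ones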
S400, for each of the N language models, obtaining the recognition result of that language model and the score of that recognition result according to the probabilities of the different pronunciations and the probabilities of the different phrases output by that language model.
In some embodiments, as shown in fig. 4, obtaining the recognition result of the language model in S400 according to the probabilities of the different pronunciations and the probabilities of the different phrases output by the language model includes the following steps S410 and S420:
S410, acquiring the target decoding graph corresponding to the language model from N preset decoding graphs, where the N decoding graphs correspond one-to-one to the N language models.
S420, decoding through the target decoding graph, according to the probabilities of the different pronunciations and the probabilities of the different phrases output by the language model, to obtain the recognition result of the language model.
The decoding graph is a state network compiled from the acoustic model, the pronunciation dictionary, and the language model. Each phrase (Chinese character) and its occurrence probability form a node, the paths between nodes are called edges, and many nodes together form this complex state network. Decoding means searching the decoding graph for the best path, i.e., finding the word string formed from different phrases with the maximum occurrence probability and taking that word string as the recognition result of the language model. A commonly used decoding algorithm is the Viterbi algorithm, which uses dynamic programming to determine the optimal path quickly; it is not described in detail in this embodiment. The pronunciation dictionary maps pinyin to Chinese characters in a Chinese context and phonetic symbols to words in an English context; its purpose is to find the Chinese characters (phrases) or words corresponding to the pronunciations recognized by the acoustic model, building a bridge that links the acoustic model and the language model.
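A compact sketch of Viterbi decoding over a generic state graph (assumed interfaces; the real decoding graph compiled from the acoustic model, dictionary, and language model is far larger):

    import math

    def viterbi(obs_probs, transitions, states):
        """obs_probs: one dict {state: P(frame | state)} per frame.
        transitions: {(prev, cur): P(cur | prev)}. Returns the best state path."""
        logp = lambda x: math.log(x) if x > 0 else float("-inf")
        best = {s: (logp(obs_probs[0].get(s, 0.0)), [s]) for s in states}
        for frame in obs_probs[1:]:
            prev, best = best, {}
            for s in states:
                # Dynamic programming: keep only the best-scoring path into s.
                score, path = max(
                    ((ps + logp(transitions.get((p, s), 0.0)), pp)
                     for p, (ps, pp) in prev.items()),
                    key=lambda t: t[0])
                best[s] = (score + logp(frame.get(s, 0.0)), path + [s])
        return max(best.values(), key=lambda t: t[0])[1]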
In some embodiments, as shown in fig. 5, obtaining the score of the recognition result in S400 according to the probabilities of the different pronunciations and the probabilities of the different phrases output by the language model includes the following steps S401 to S403:
S401, obtaining an acoustic score Ps according to the probabilities of the different pronunciations.
The acoustic score may include probability or state information for a phone, pronunciation, morpheme, syllable, or word. The acoustic score is not limited to these, however, and may include probability or state information for any linguistic unit into which words can be divided. The pronunciation score (i.e., the probability of a pronunciation) is used here as the example.
In some embodiments, obtaining the acoustic score Ps according to the probabilities of the different pronunciations in step S401 includes:
determining the sum of the probabilities of the different pronunciations output by the acoustic model as the acoustic score Ps.
In other embodiments, obtaining the acoustic score Ps according to the probabilities of the different pronunciations in step S401 includes:
assigning different weights to the probabilities of the different pronunciations according to the specific language environment, multiplying each pronunciation's probability by its corresponding weight, adding the resulting products, and determining the sum as the acoustic score Ps.
By way of example of a specific language environment: the speech to be recognized may come from different regions of China, such as the south and the northwest. For the text actually corresponding to the same speech, southern and northwestern pronunciations are not exactly alike, and neither matches Mandarin exactly. To reduce the influence of regional accents on recognition as much as possible, the probabilities of different pronunciations can be weighted differently according to the specific language environment.
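Both variants of S401 reduce to a plain or weighted sum; a minimal sketch (the weight table is a hypothetical input reflecting the language environment):

    def acoustic_score(pron_probs, weights=None):
        """pron_probs: {pronunciation: probability} from the acoustic model.
        weights: optional {pronunciation: weight}, e.g. tuned for regional accents."""
        if weights is None:
            return sum(pron_probs.values())                    # plain sum -> Ps
        return sum(p * weights.get(k, 1.0) for k, p in pron_probs.items())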
S402, obtaining the language score Pi corresponding to the language model according to the probabilities of the different phrases output by the language model.
In some embodiments, obtaining the language score Pi corresponding to the language model from the probabilities Pmi of the different phrases output by the language model in step S402 includes:
determining the sum of the probabilities of the different phrases output by the language model as the language score Pi corresponding to the language model.
For example, assume that the language model corresponding to the ith application scenario outputs M different phrases and that the probability of its mth phrase is Pmi, where m is an integer from 1 to M. Then Pi = P1i + P2i + … + PMi, i.e., the language score corresponding to the language model for the ith application scenario is Pi.
In some other embodiments, obtaining the language score corresponding to the language model from the probabilities of the different phrases output by the language model in step S402 includes:
assigning different weights to the probabilities of the different phrases according to the specific application scenario, multiplying each phrase's probability by its corresponding weight, adding the resulting products, and determining the sum as the language score Pi.
For example, regarding the application scenario: assume the two candidate outputs of the language model are "王小波" (Wang Xiaobo, a writer) and "王波" (Wang Bo, a political figure), and the application scenario corresponding to the language model is literature. Then, when calculating the language score corresponding to this language model, the weight given to "王小波" is greater than the weight given to "王波".
On this basis, obtaining the score of the recognition result in step S400 further includes:
S403, obtaining the score P(i) of the recognition result of the language model according to the acoustic score Ps and the language score Pi corresponding to the language model.
In some embodiments, obtaining the score of the recognition result of the language model from the acoustic score and the language score corresponding to the language model in step S403 includes:
determining the sum of the acoustic score and the language score corresponding to the language model as the score of the recognition result of the language model.
In still other embodiments, obtaining the score of the recognition result of the language model from the acoustic score Ps and the language score Pi in step S403 includes:
S4031, acquiring the preset weight corresponding to the language model.
The preset weight is set according to the language model's specific application scenario, the experience of the developers of the speech recognition method, or the frequency with which phrases occur in the language model, and the optimal value is ultimately selected by checking candidates against a test set.
For example, the preset weight corresponding to the language model for the ith application scenario is λi.
S4032, multiplying the language score Pi by the preset weight λi to obtain a product.
S4033, adding the acoustic score Ps to that product to obtain the score of the recognition result of the language model.
For example, the recognition result score of the language model for the ith application scenario is P(i) = Pi·λi + Ps.
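Putting S4031 to S4033 together, a short numeric illustration (all values hypothetical):

    p_s = 0.8                        # acoustic score Ps, shared across models
    lang_scores = [0.6, 0.9, 0.3]    # language scores Pi for three scenario models
    lambdas = [1.0, 1.2, 0.9]        # preset weights λi per scenario

    scores = [p_i * lam + p_s for p_i, lam in zip(lang_scores, lambdas)]
    print(scores)                    # ≈ [1.4, 1.88, 1.07] -> the second model wins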
S500, taking the speech recognition result corresponding to the highest score among the recognition result scores P(1), P(2), …, P(N) of the N language models as the target speech recognition result.
The speech recognition method provided by this embodiment sets up N language models for N application scenarios, obtains the recognition results of the N language models and the score of each model's recognition result, and then takes the speech recognition result corresponding to the highest of those scores as the target speech recognition result. Because each language model covers only one scenario, its complexity is reduced, each model outputs its phrase probabilities faster, and the phrase probabilities of all N language models can be obtained simultaneously. This greatly increases recognition speed and makes the method suitable for situations with strict real-time requirements.
The present application uses multiple language models for different application scenarios, for example N of them, and takes the highest-scoring recognition result among the N language models as the target speech recognition result, thereby improving recognition accuracy. In addition, the N language models work in parallel, which improves recognition efficiency. That is, the present application improves both the accuracy and the efficiency of speech recognition.
In some embodiments, before the speech to be recognized is recognized by the acoustic model and the N language models to obtain each language model's recognition result, the method further includes:
S110, performing voice activity detection on the speech to be recognized to obtain the target speech to be recognized.
The speech to be recognized may contain both speech signals and non-speech signals (e.g., silent or unvoiced portions). Voice Activity Detection (VAD), also called voice endpoint detection or voice boundary detection, aims to identify and remove long silent periods from the received speech, separating the speech signal from the non-speech signal so that the separated speech signal can serve as the target speech to be recognized.
Voice activity detection can be done in three ways: first, by framing the speech to be recognized and judging, from the energy or zero-crossing rate of each frame, whether that frame belongs to the target speech; second, by detecting whether a pitch period exists in each frame; third, by training a model with a Deep Neural Network (DNN) to classify whether each frame belongs to the target speech.
Framing means cutting the speech to be recognized into short segments, each called a frame. Framing is not a simple cut but is implemented with a sliding window function, so adjacent frames generally overlap; this is not described in detail in the embodiments of the present application.
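A minimal sketch of the first, energy-based detection method (assuming a fixed threshold; practical implementations adapt the threshold and smooth the frame-level decision):

    import numpy as np

    def energy_vad(signal, frame_len=400, hop=160, threshold=1e-3):
        """Flag each overlapping frame as speech if its mean energy exceeds a threshold."""
        flags = []
        for start in range(0, len(signal) - frame_len + 1, hop):
            frame = signal[start:start + frame_len]
            flags.append(float(np.mean(frame ** 2)) > threshold)
        return flags   # True = frame kept as part of the target speech

    audio = np.concatenate([np.zeros(1600), 0.5 * np.random.randn(1600)])
    print(energy_vad(audio))   # the leading silent frames come out False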
On this basis, inputting the speech to be recognized into the acoustic model in step S200 to obtain the probabilities of the different pronunciations output by the acoustic model includes:
inputting the target speech to be recognized into the acoustic model to obtain the probabilities of the different pronunciations output by the acoustic model.
In some embodiments, the speech recognition method further comprises:
s600, determining a target language model corresponding to the target voice recognition result from the N language models.
S601, adding the voice to be recognized to a training set corresponding to the target language model.
And adding the speech to be recognized as newly added data into a training set corresponding to the target language model, so that the complexity of collecting the newly added data can be reduced, and the speech to be recognized can be used for training the language model to obtain a better target language model.
In some embodiments, the speech recognition method further comprises a process of model training:
s700, when the condition that the target language model is updated is detected to be satisfied, updating the target language model by using data in a training set corresponding to the target language model.
The update condition may be set by the user. For example, it may be set to update the target language model on the last day of each calendar month; to adapt the model to its application scenario faster, it may instead be the last day of each week; or it may be to update the target language model once the newly added data in its training set reaches a certain amount, for example 100 phrases. The embodiments of the present application do not limit this.
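The count-based variant of this update condition amounts to a trivial check (the 100-phrase threshold is taken from the example above; calendar-based variants would test the date instead):

    def should_update(new_samples, threshold=100):
        """Count-based update condition: retrain once the newly added data
        in the target language model's training set reaches the threshold."""
        return len(new_samples) >= threshold

    print(should_update(["phrase"] * 120))   # True -> trigger a model update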
On this basis, as application scenarios multiply, the proportion of a newly added scenario's speech samples and text data relative to the language model's original training set must be considered, so the language model is usually updated only once the new data reaches a certain magnitude. Because the speech samples and text data are large and a statistical language model is complex, a single update takes a long time, for example one or even two days for a complete update, and the updated language model must also be tested before it can be used in speech recognition. A single update of such a language model therefore takes a long time, which slows the model's update cadence.
In the embodiments of the present application, N language models are set up for the N application scenarios, and only the language model for the scenario matching the newly added data is updated. Since a single scenario's model is simple, the update speed of the language model is increased.
In some embodiments, as shown in fig. 6, the model training involved in the method may be: acquiring the language samples corresponding to each of the N language models, then inputting each model's samples into the language model for the corresponding application scenario and training each model on its scenario's samples.
In some embodiments, as shown in fig. 7, acquiring the language samples corresponding to each of the N language models may include: acquiring a number of language samples and classifying them by application scenario to obtain language sample set 1, language sample set 2, …, language sample set N. The language models for the different scenarios are then trained with the samples for those scenarios, as in the sketch below.
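A sketch of this classify-then-train flow (train_language_model is a hypothetical placeholder; any of the language model types listed earlier could fill it):

    from collections import defaultdict

    def train_per_scenario(samples, train_language_model):
        """samples: (scenario_label, tokenized_text) pairs.
        Returns one trained language model per application scenario."""
        buckets = defaultdict(list)
        for scenario, text in samples:
            buckets[scenario].append(text)      # language sample sets 1..N
        return {s: train_language_model(corpus) for s, corpus in buckets.items()}

    models = train_per_scenario(
        [("education", ["solve", "equation"]), ("finance", ["stock", "price"])],
        train_language_model=lambda corpus: corpus)   # placeholder trainer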
The speech recognition method of the embodiments of the present application sets up N language models for N application scenarios, obtains the recognition results of the N language models and the score of each model's recognition result, and takes the speech recognition result corresponding to the highest of those scores as the target speech recognition result. In other words, a single recognition pass uses multiple language models simultaneously, for example N of them, and taking the highest-scoring result among the N language models as the target result improves recognition accuracy. In addition, the N language models work in parallel, which improves recognition efficiency. That is, the present application improves both the accuracy and the efficiency of speech recognition.
It should be understood that fig. 1-7 are only examples of the present application and should not be construed as limiting the present application.
The preferred embodiments of the present application have been described in detail with reference to the accompanying drawings. However, the present application is not limited to the details of the above embodiments: various simple modifications can be made to the technical solution of the present application within its technical idea, and all such simple modifications fall within the protection scope of the present application. For example, the features described in the foregoing embodiments may be combined in any suitable manner where no contradiction arises; to avoid unnecessary repetition, the possible combinations are not separately described. Likewise, the various embodiments of the present application may be combined arbitrarily, and such combinations should also be regarded as disclosed herein as long as they do not depart from the idea of the present application.
Method embodiments of the present application are described in detail above in conjunction with figs. 1-7; embodiments of the apparatus 10 of the present application are described in detail below in conjunction with fig. 8.
In a second aspect, as shown in fig. 8, an embodiment of the present application provides a speech recognition apparatus 10, including:
an acquisition unit 11, configured to acquire speech to be recognized;
the voice recognition unit 12 is configured to input the voice to be recognized into an acoustic model, so as to obtain probabilities of different pronunciations output by the acoustic model; respectively inputting the probabilities of different pronunciations into N language models to obtain the probabilities of different phrases output by each language model in the N language models, wherein the N language models are respectively in one-to-one correspondence with N different application scenes, and N is a positive integer greater than 1; aiming at each language model in the N language models, obtaining the recognition result of the language model and the score of the recognition result according to the probability of different pronunciations and the probability of different phrases output by the language model; and taking the voice recognition result corresponding to the highest recognition result score in the recognition result scores corresponding to each language model in the N language models as a target voice recognition result.
In some embodiments, the speech recognition unit 12 is specifically configured to obtain an acoustic score according to the probabilities of the different pronunciations; obtaining a language score corresponding to the language model according to the probabilities of different phrases output by the language model; and obtaining the score of the recognition result of the language model according to the acoustic score and the language score corresponding to the language model.
In some embodiments, the speech recognition unit 12 is specifically configured to determine the sum of the probabilities of the different utterances as the acoustic score.
In some embodiments, the speech recognition unit 12 is specifically configured to determine a sum of probabilities of different phrases output by the language model as a language score corresponding to the language model.
In some embodiments, the speech recognition unit 12 is specifically configured to determine a sum of the acoustic score and the language score corresponding to the language model as a score of the recognition result of the language model.
In some embodiments, the speech recognition unit 12 is specifically configured to obtain a preset weight corresponding to the language model; multiplying the language score by the preset weight to obtain a product result; and adding the acoustic score and the product result to obtain the score of the recognition result of the language model.
In some embodiments, the speech recognition unit 12 is further configured to perform speech activity detection on the speech to be recognized, so as to obtain a target speech to be recognized; and inputting the target voice to be recognized into the acoustic model to obtain the probabilities of different pronunciations output by the acoustic model.
In some embodiments, the speech recognition unit 12 is specifically configured to obtain a target decoding graph corresponding to the language model from preset N decoding graphs, where the N decoding graphs correspond to the N language models one to one; and decoding through the target decoding graph according to the probabilities of different pronunciations and the probabilities of different phrases output by the language model to obtain the recognition result of the language model.
In some embodiments, the speech recognition unit 12 is further configured to determine, from the N language models, a target language model corresponding to the target speech recognition result; and adding the speech to be recognized to a training set corresponding to the target language model.
In some embodiments, the speech recognition unit 12 is further configured to, when it is detected that the update condition of the target language model is satisfied, update the target language model using data in the training set corresponding to the target language model.
It is to be understood that apparatus embodiments and method embodiments may correspond to one another and that similar descriptions may refer to method embodiments. To avoid repetition, further description is omitted here. Specifically, the speech recognition apparatus shown in fig. 8 may correspond to a corresponding main body in executing the speech recognition method according to the embodiment of the present application, and the foregoing and other operations and/or functions of each module in the speech recognition apparatus are respectively for implementing a corresponding flow in the speech recognition method, and are not described herein again for brevity.
The apparatus of the embodiments of the present application is described above in connection with the drawings from the perspective of functional modules. It should be understood that the functional modules may be implemented by hardware, by instructions in software, or by a combination of hardware and software modules. Specifically, the steps of the method embodiments in the present application may be implemented by integrated logic circuits of hardware in a processor and/or instructions in the form of software, and the steps of the method disclosed in conjunction with the embodiments in the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in a processor. Alternatively, the software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, electrically erasable programmable memory, registers, or other storage medium known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps in the above method embodiments in combination with hardware thereof.
In a third aspect, as shown in fig. 9, an embodiment of the present application provides an electronic device 20, including: a processor 21 and a memory 22, said memory being arranged to store a computer program, said processor being arranged to call and run the computer program stored in said memory to perform the method of the first aspect.
The memory 22 is configured to store a computer program and to transfer the program code to the processor 21. In other words, the processor 21 can call and run the computer program from the memory 22 to implement the method in the embodiments of the present application.
For example, the processor 21 may be configured to execute the above-described method embodiments according to the instructions in the computer program.
In some embodiments of the present application, the processor 21 may include, but is not limited to:
a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic devices, discrete hardware components, and the like.
In some embodiments of the present application, the memory 22 includes, but is not limited to:
volatile memory and/or non-volatile memory. The non-volatile Memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash Memory. Volatile Memory can be Random Access Memory (RAM), which acts as external cache Memory. By way of example, but not limitation, many forms of RAM are available, such as Static random access memory (Static RAM, SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic random access memory (Synchronous DRAM, SDRAM), Double Data Rate Synchronous Dynamic random access memory (DDR SDRAM), Enhanced Synchronous SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
In some embodiments of the present application, the computer program may be partitioned into one or more modules, which are stored in the memory 22 and executed by the processor 21 to perform the methods provided herein. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, the instruction segments describing the execution of the computer program in the electronic device.
As shown in fig. 9, the electronic device 20 may further include:
a transceiver 23, which may be connected to the processor 21 or the memory 22.
The processor 21 may control the transceiver 23 to communicate with other devices, specifically to send information or data to other devices or to receive information or data sent by other devices. The transceiver 23 may include a transmitter and a receiver, and may further include one or more antennas.
It should be understood that the various components of the electronic device are connected by a bus system that includes, in addition to a data bus, a power bus, a control bus, and a status signal bus.
In a fourth aspect, the present application further provides a computer storage medium having a computer program stored thereon; when executed by a computer, the computer program enables the computer to perform the method of the above method embodiments. The present application likewise provides a computer program product containing instructions that, when executed by a computer, cause the computer to perform the method of the above method embodiments.
When implemented in software, the above embodiments may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another via a wired link (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or a wireless link (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a Digital Video Disc (DVD)), a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules is merely a logical function division, and other divisions may be used in practice; for instance, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be electrical, mechanical or in other forms.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. For example, functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules are integrated into one module.
The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive of within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (12)

1. A speech recognition method, comprising:
acquiring a voice to be recognized;
inputting the voice to be recognized into an acoustic model to obtain the probabilities of different pronunciations output by the acoustic model;
respectively inputting the probabilities of different pronunciations into N language models to obtain the probability of different phrases output by each language model in the N language models, wherein N is a positive integer greater than 1, and the N language models are respectively in one-to-one correspondence with N different application scenes;
for each language model in the N language models, obtaining a recognition result of the language model and a score of the recognition result according to the probabilities of the different pronunciations and the probabilities of the different phrases output by the language model;
and taking the voice recognition result corresponding to the highest recognition result score among the recognition result scores corresponding to each language model in the N language models as a target voice recognition result.
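
By way of illustration only, the overall flow of claim 1 can be sketched as follows. This is a minimal sketch, not the patented implementation: the model interfaces shown (predict, decode_and_score) are hypothetical stand-ins for whatever acoustic and language models a concrete system would use.

```python
# Illustrative sketch of the claim-1 pipeline; all model interfaces here
# (predict, decode_and_score) are hypothetical, not from the patent.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str      # decoded recognition result
    score: float   # combined acoustic + language score

def recognize(speech, acoustic_model, language_models):
    """Run one utterance through N scene-specific language models and
    return the recognition result with the highest score."""
    # Acoustic model: speech -> probabilities of different pronunciations.
    pron_probs = acoustic_model.predict(speech)

    best = None
    for lm in language_models:  # one language model per application scene
        # Language model: pronunciation probabilities -> phrase probabilities.
        phrase_probs = lm.predict(pron_probs)
        # Each model yields its own recognition result and score.
        text, score = lm.decode_and_score(pron_probs, phrase_probs)
        if best is None or score > best.score:
            best = Hypothesis(text, score)
    return best  # target voice recognition result
```
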
2. The method of claim 1, wherein obtaining the score of the recognition result according to the probabilities of the different pronunciations and the probabilities of the different phrases output by the language model comprises:
obtaining an acoustic score according to the probabilities of the different pronunciations;
obtaining a language score corresponding to the language model according to the probabilities of different phrases output by the language model;
and obtaining the score of the recognition result of the language model according to the acoustic score and the language score corresponding to the language model.
3. The method of claim 2, wherein deriving an acoustic score based on the probabilities of the different pronunciations comprises:
and determining the sum of the probabilities of the different pronunciations as the acoustic score.
4. The method according to claim 2, wherein obtaining the language score corresponding to the language model according to the probabilities of different phrases output by the language model comprises:
and determining the sum of the probabilities of different phrases output by the language model as a language score corresponding to the language model.
5. The method according to any one of claims 2-4, wherein obtaining the score of the recognition result of the language model according to the acoustic score and the language score corresponding to the language model comprises:
and determining the sum of the acoustic score and the language score corresponding to the language model as the score of the recognition result of the language model.
6. The method according to any one of claims 2-4, wherein obtaining the score of the recognition result of the language model according to the acoustic score and the language score corresponding to the language model comprises:
acquiring a preset weight corresponding to the language model;
multiplying the language score by the preset weight to obtain a product result;
and adding the acoustic score and the product result to obtain the score of the recognition result of the language model.
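
For illustration, the scoring of claims 2-6 can be condensed into a few lines; the probability values and the preset weight below are hypothetical examples, not values from the patent.

```python
def recognition_score(pron_probs, phrase_probs, lm_weight=1.0):
    """Acoustic score = sum of pronunciation probabilities (claim 3);
    language score = sum of phrase probabilities (claim 4); combined as a
    plain sum (claim 5, lm_weight=1.0) or with a per-model preset weight
    applied to the language score (claim 6)."""
    acoustic_score = sum(pron_probs)
    language_score = sum(phrase_probs)
    return acoustic_score + lm_weight * language_score

# Hypothetical numbers for a language model whose preset weight is 0.8:
print(recognition_score([0.9, 0.7, 0.8], [0.6, 0.5], lm_weight=0.8))
# acoustic 2.4 + 0.8 * language 1.1 -> approximately 3.28
```
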
7. The method according to any one of claims 1-4, wherein before the inputting of the voice to be recognized into the acoustic model to obtain the probabilities of different pronunciations output by the acoustic model, the method further comprises:
performing voice activity detection on the voice to be recognized to obtain a target voice to be recognized;
and the inputting of the voice to be recognized into the acoustic model to obtain the probabilities of different pronunciations output by the acoustic model comprises:
and inputting the target voice to be recognized into the acoustic model to obtain the probabilities of different pronunciations output by the acoustic model.
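
As one possible front end for claim 7 (the patent does not name a specific detector), the open-source webrtcvad package could be used to keep only voiced frames; the frame length and aggressiveness below are illustrative choices.

```python
# Sketch of voice activity detection for claim 7 using webrtcvad
# (one possible detector; the patent does not prescribe an algorithm).
import webrtcvad

def trim_to_speech(pcm_bytes, sample_rate=16000, frame_ms=30):
    """Drop non-speech frames from 16-bit mono PCM, returning the
    'target voice to be recognized'."""
    vad = webrtcvad.Vad(2)  # aggressiveness: 0 (least) to 3 (most)
    frame_len = sample_rate * frame_ms // 1000 * 2  # bytes per frame
    return b"".join(
        pcm_bytes[i:i + frame_len]
        for i in range(0, len(pcm_bytes) - frame_len + 1, frame_len)
        if vad.is_speech(pcm_bytes[i:i + frame_len], sample_rate)
    )
```
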
8. The method according to any one of claims 1 to 4, wherein obtaining the recognition result of the language model according to the probabilities of the different pronunciations and the probabilities of the different phrases output by the language model comprises:
acquiring a target decoding graph corresponding to the language model from N preset decoding graphs, wherein the N decoding graphs are in one-to-one correspondence with the N language models;
and decoding through the target decoding graph according to the probabilities of different pronunciations and the probabilities of different phrases output by the language model to obtain the recognition result of the language model.
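
A minimal sketch of claim 8's graph lookup follows; decoding_graphs and the decode interface are hypothetical stand-ins for a real decoder (e.g., a WFST decoder), which the patent does not specify.

```python
# Hypothetical one-to-one mapping from language-model id to its preset
# decoding graph, per claim 8.
decoding_graphs = {}  # lm_id -> preset decoding graph

def decode_with_model(lm_id, pron_probs, phrase_probs):
    graph = decoding_graphs[lm_id]  # target decoding graph for this model
    # Decode using both probability sets to obtain the recognition result.
    return graph.decode(pron_probs, phrase_probs)
```
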
9. The method according to any one of claims 1-4, further comprising:
determining a target language model corresponding to the target voice recognition result from the N language models;
adding the speech to be recognized to a training set corresponding to the target language model;
and when the update condition of the target language model is detected to be met, updating the target language model by using the data in the training set corresponding to the target language model.
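
The adaptive update of claim 9 might look like the sketch below; the claim does not specify the update condition, so the size threshold here is purely illustrative.

```python
# Sketch of claim 9: collect recognized speech per target language model
# and retrain when an (illustrative) update condition is met.
from collections import defaultdict

training_sets = defaultdict(list)  # lm_id -> collected training speech
UPDATE_THRESHOLD = 1000            # hypothetical update condition

def record_and_maybe_update(target_lm_id, speech, retrain_fn):
    training_sets[target_lm_id].append(speech)
    if len(training_sets[target_lm_id]) >= UPDATE_THRESHOLD:
        retrain_fn(target_lm_id, training_sets[target_lm_id])
        training_sets[target_lm_id].clear()
```
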
10. A speech recognition apparatus, comprising:
an acquisition unit, configured to acquire a voice to be recognized; and
a recognition unit, configured to: input the voice to be recognized into an acoustic model to obtain probabilities of different pronunciations output by the acoustic model; respectively input the probabilities of the different pronunciations into N language models to obtain probabilities of different phrases output by each language model in the N language models, wherein N is a positive integer greater than 1, and the N language models are in one-to-one correspondence with N different application scenes; for each language model in the N language models, obtain a recognition result of the language model and a score of the recognition result according to the probabilities of the different pronunciations and the probabilities of the different phrases output by the language model; and take the voice recognition result corresponding to the highest recognition result score among the recognition result scores corresponding to each language model in the N language models as a target voice recognition result.
11. An electronic device, comprising:
a processor and a memory, the memory for storing a computer program, the processor for invoking and executing the computer program stored in the memory to perform the method of any of claims 1 to 9.
12. A computer-readable storage medium for storing a computer program which causes a computer to perform the method of any one of claims 1 to 9.
CN202111509507.6A 2021-12-10 2021-12-10 Speech recognition method, apparatus and storage medium Pending CN114495905A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111509507.6A CN114495905A (en) 2021-12-10 2021-12-10 Speech recognition method, apparatus and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111509507.6A CN114495905A (en) 2021-12-10 2021-12-10 Speech recognition method, apparatus and storage medium

Publications (1)

Publication Number Publication Date
CN114495905A true CN114495905A (en) 2022-05-13

Family

ID=81492842

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111509507.6A Pending CN114495905A (en) 2021-12-10 2021-12-10 Speech recognition method, apparatus and storage medium

Country Status (1)

Country Link
CN (1) CN114495905A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116013256A (en) * 2022-12-19 2023-04-25 镁佳(北京)科技有限公司 Speech recognition model construction and speech recognition method, device and storage medium
CN116013256B (en) * 2022-12-19 2024-01-30 镁佳(北京)科技有限公司 Speech recognition model construction and speech recognition method, device and storage medium
CN116312488A (en) * 2023-02-13 2023-06-23 镁佳(北京)科技有限公司 Speech recognition system, method, electronic device and storage medium

Similar Documents

Publication Publication Date Title
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN109817213B (en) Method, device and equipment for performing voice recognition on self-adaptive language
US11062699B2 (en) Speech recognition with trained GMM-HMM and LSTM models
CN110473531B (en) Voice recognition method, device, electronic equipment, system and storage medium
JP6802005B2 (en) Speech recognition device, speech recognition method and speech recognition system
CN107195296B (en) Voice recognition method, device, terminal and system
CN109036391B (en) Voice recognition method, device and system
WO2021051544A1 (en) Voice recognition method and device
CN110797016B (en) Voice recognition method and device, electronic equipment and storage medium
CN108899013B (en) Voice search method and device and voice recognition system
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN111402861B (en) Voice recognition method, device, equipment and storage medium
CN104157285A (en) Voice recognition method and device, and electronic equipment
CN109754809A (en) Audio recognition method, device, electronic equipment and storage medium
CN110019741B (en) Question-answering system answer matching method, device, equipment and readable storage medium
CN114495905A (en) Speech recognition method, apparatus and storage medium
CN112562640B (en) Multilingual speech recognition method, device, system, and computer-readable storage medium
CN112017645A (en) Voice recognition method and device
CN112071310B (en) Speech recognition method and device, electronic equipment and storage medium
CN112017648A (en) Weighted finite state converter construction method, speech recognition method and device
CN111489735A (en) Speech recognition model training method and device
CN111354343A (en) Voice wake-up model generation method and device and electronic equipment
TWI752406B (en) Speech recognition method, speech recognition device, electronic equipment, computer-readable storage medium and computer program product
EP3867901B1 (en) Speech processing
CN115132196A (en) Voice instruction recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination