CN113990325A - Streaming voice recognition method and device, electronic equipment and storage medium - Google Patents

Streaming voice recognition method and device, electronic equipment and storage medium

Info

Publication number
CN113990325A
CN113990325A
Authority
CN
China
Prior art keywords
voice
recognition
block
speech
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111150034.5A
Other languages
Chinese (zh)
Inventor
洪密
王旭阳
汪俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN202111150034.5A priority Critical patent/CN113990325A/en
Publication of CN113990325A publication Critical patent/CN113990325A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the technical field of speech recognition, and in particular to a streaming voice recognition method and device, an electronic device and a storage medium, wherein the method comprises the following steps: acquiring a voice block to be recognized; performing object recognition processing on the voice block based on a connectionist temporal classification (CTC) model to obtain an object recognition result, and determining the number of objects in the voice block according to the object recognition result; and determining a number of recognition times according to the number of objects in the voice block, and performing voice recognition processing on the voice block that number of times based on an attention model, thereby obtaining a voice recognition result corresponding to the voice block. The method uses the CTC model to predict the number of recognition objects contained in the current voice block and takes it as the number of recognition passes of the attention model, then recognizes the voice block the corresponding number of times with the attention model, so that speech can be recognized more accurately and efficiently and converted into corresponding information such as text for output.

Description

Streaming voice recognition method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a streaming speech recognition method and apparatus, an electronic device, and a storage medium.
Background
Speech recognition, also known as Automatic Speech Recognition (ASR), aims to enable machines to convert received speech signals into text for output through recognition and understanding, and is an important branch of modern artificial intelligence.
Traditional speech recognition technology builds an acoustic model based on hidden Markov models, Gaussian mixture models and deep neural network-hidden Markov models, and recognizes with a network composed of a language model, an acoustic model and a dictionary model. This approach requires training the different models separately and then fusing them together with a decoder such as a Weighted Finite-State Transducer (WFST). The training or design of each model requires specialized knowledge and accumulated expertise, and the training and recognition process of each model is cumbersome, with low recognition efficiency, low accuracy and high latency. There is therefore a need for a new speech recognition technology to solve the above problems in the prior art.
Disclosure of Invention
The application aims to provide a streaming voice recognition method and device, an electronic device and a storage medium.
According to an aspect of the present application, there is provided a streaming voice recognition method, including the steps of:
acquiring a voice block to be recognized;
performing object recognition processing on the voice block based on a connectionist temporal classification (CTC) model to obtain an object recognition result, and determining the number of objects in the voice block according to the object recognition result;
and determining a number of recognition times according to the number of objects in the voice block, and performing voice recognition processing on the voice block that number of times based on an attention model, thereby obtaining a voice recognition result corresponding to the voice block.
In an exemplary embodiment of the present application, the performing object recognition processing on the voice block based on the connectionist temporal classification model to obtain an object recognition result, and determining the number of objects in the voice block according to the object recognition result, includes:
performing object recognition processing on the voice block based on the connectionist temporal classification model to obtain at least one group of object recognition processing results and corresponding accuracies;
determining the result of the group of object recognition processing with the highest accuracy as the object recognition result;
and determining the number of objects in the voice block according to the object recognition result.
In an exemplary embodiment of the present application, the performing object recognition processing on the voice block based on the connectionist temporal classification model to obtain an object recognition result includes:
encoding the voice block to obtain a feature sequence of the voice block;
and performing object recognition processing on the feature sequence of the voice block based on the connectionist temporal classification model to obtain the object recognition result.
In an exemplary embodiment of the present application, the determining a number of recognition times according to the number of objects in the voice block, and performing voice recognition processing on the voice block that number of times based on an attention model to obtain a voice recognition result corresponding to the voice block, includes:
determining the number of recognition times according to the number of objects in the voice block, and performing voice recognition processing on the voice block that number of times based on the attention model to obtain at least one group of voice recognition processing results and corresponding accuracies;
and determining the result of the group of voice recognition processing with the highest accuracy as the voice recognition result corresponding to the voice block.
In an exemplary embodiment of the present application, the obtaining a speech block to be recognized includes:
and if it is detected that the previous voice block to be recognized has an unrecognized object, performing object recognition processing on the previous voice block to be recognized.
In an exemplary embodiment of the present application, the streaming voice recognition method further includes:
and if the accuracies corresponding to the results of the at least one group of voice recognition processing are less than a threshold value, performing object recognition processing or voice recognition processing on the voice block again.
In an exemplary embodiment of the present application, the obtaining a speech block to be recognized includes:
the speech block is extracted from the speech signal to be recognized according to a specified time range or a specified speech block size.
According to another aspect of the present application, there is provided a streaming voice recognition apparatus including:
the acquisition module is used for acquiring a voice block to be recognized;
the first recognition module is used for performing object recognition processing on the voice block based on the connectionist temporal classification model to obtain an object recognition result and determining the number of objects in the voice block according to the object recognition result;
and the second recognition module is used for determining a number of recognition times according to the number of objects in the voice block, and performing voice recognition processing on the voice block that number of times based on the attention model to obtain a voice recognition result corresponding to the voice block.
According to another aspect of the present application, there is provided an electronic device including:
at least one processor; and the number of the first and second groups,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the streaming speech recognition method described above.
According to another aspect of the present application, there is provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the streaming speech recognition method described above.
The application provides a streaming voice recognition method and device, an electronic device and a storage medium. The number of recognition objects contained in the current voice block is predicted by a connectionist temporal classification (CTC) model and used as the number of recognition passes of an attention model, and the voice block is recognized the corresponding number of times by the attention model, so that the final voice recognition result is obtained more accurately and efficiently.
Drawings
FIG. 1 is a flow chart of a speech recognition method in the related art;
FIG. 2 is a flow chart illustrating another speech recognition method of the related art;
FIG. 3 is a flow chart of a streaming speech recognition method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a streaming speech recognition apparatus in an embodiment of the present application.
Detailed Description
In order to make the objects, features and advantages of the present application more apparent and understandable, embodiments and technical solutions of the present application will be clearly and completely described below with reference to the accompanying drawings. Example embodiments and examples, however, may be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments and examples are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments and examples to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments and examples. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments and examples of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present application.
Furthermore, the drawings are merely schematic illustrations of the present application and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Although the steps of the methods in this application are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the steps. For example, some steps may be decomposed, and some steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
Speech recognition, also known as Automatic Speech Recognition (ASR), aims to enable a machine to convert a received speech signal into text for output through recognition and understanding, and is an important branch of modern artificial intelligence. Traditional speech recognition technology builds an acoustic model based on hidden Markov models (HMM), Gaussian mixture models (GMM) and deep neural network-hidden Markov models (DNN-HMM). In this method of recognizing with a network composed of a language model, an acoustic model and a dictionary model, the different models need to be trained separately and then fused together by a decoder such as a Weighted Finite-State Transducer (WFST); the training or design of each model requires specialized knowledge and accumulated expertise, and the training and recognition process is very cumbersome. With the development and application of the related art, traditional speech recognition technology has gradually been replaced by more advanced end-to-end streaming speech recognition (End-to-End Streaming ASR) because of its low recognition efficiency, low accuracy and high latency. End-to-end (E2E) recognition can directly realize the conversion from input speech to output text, requiring only the speech features at the input end and the text information at the output end. Streaming speech recognition is a recognition mode opposed to non-streaming speech recognition: the latter returns a result only after processing a complete sentence of audio, while the former returns recognition results in real time during processing of the audio stream. Streaming speech recognition is therefore better suited to scenarios that need to obtain recognition results in real time, such as live real-time subtitles, real-time meeting transcription, voice input and voice wake-up.
One related technology performs streaming speech recognition based on the connectionist temporal classification (CTC) model, as shown in FIG. 1. This recognition method encodes the streaming speech input and extracts its feature information, then feeds it into the CTC model for prefix beam search recognition, outputting a group of preselected recognition results when recognition is complete; the preselected recognition result group is then decoded and re-ranked to obtain another preselected recognition result group, and the two recognition result groups are weighted and summed according to their scores to take the highest-scoring preselected result as the final recognition result. Another streaming speech recognition approach based on the CTC model is shown in FIG. 2: after the current recognition result is obtained by weighting and summing two groups of recognition results in the same way as in FIG. 1, the recognition result is fed back into the CTC model to optimize the CTC model, so that more accurate recognition results can be obtained in subsequent speech recognition. Both schemes perform speech recognition based on the CTC model, but the CTC model has a strong independence assumption: it recognizes each word only by its pronunciation characteristics and cannot consider the relationships between words, so the accuracy of the recognition result is low. On the other hand, both methods require relatively complicated weighted summation and decode-and-rank operations on the recognition results during recognition, which introduces higher latency and reduces the synchronism of the output recognition results.
In view of the above problems in the related art, the present application provides a streaming voice recognition method and apparatus, an electronic device and a storage medium. The method and apparatus are mainly applied to end-to-end streaming speech recognition scenarios, where end-to-end refers to the span from the input end to the output end of speech recognition. The input-end feature commonly used for end-to-end speech recognition is the FBank (filter banks) feature, whose processing procedure includes pre-emphasis, framing, windowing, short-time Fourier transform (STFT), Mel filtering and mean removal of the speech signal; the output end can be a recognition object such as a letter, a subword or a word. The method comprises the following steps: acquiring a voice block to be recognized; performing object recognition processing on the voice block based on a connectionist temporal classification (CTC) model to obtain an object recognition result, and determining the number of objects in the voice block according to the object recognition result; and determining a number of recognition times according to the number of objects in the voice block, and performing voice recognition processing on the voice block that number of times based on an attention model, thereby obtaining a voice recognition result corresponding to the voice block. The method uses the CTC model to predict the number of recognition objects contained in the current voice block as the number of recognition passes of the attention model, and recognizes the voice block the corresponding number of times with the attention model, thereby obtaining the final voice recognition result more accurately and efficiently.
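As a concrete illustration of the FBank processing chain just described, the following is a minimal sketch in Python; the use of librosa, the 16 kHz sample rate, the frame settings (n_fft=400, hop_length=160) and the 80-bin Mel filter bank are illustrative assumptions, not values specified by the application.

```python
import numpy as np
import librosa

def fbank_features(wav: np.ndarray, sr: int = 16000, n_mels: int = 80) -> np.ndarray:
    # Pre-emphasis boosts high frequencies before analysis.
    emphasized = np.append(wav[0], wav[1:] - 0.97 * wav[:-1])
    # Framing + windowing + short-time Fourier transform (STFT), as a power spectrum.
    spec = np.abs(librosa.stft(emphasized, n_fft=400, hop_length=160)) ** 2
    # Mel filtering maps the power spectrum onto a Mel-scaled filter bank.
    mel = librosa.feature.melspectrogram(S=spec, sr=sr, n_mels=n_mels)
    logmel = np.log(mel + 1e-10)
    # Mean removal (per feature dimension) normalizes the features.
    return logmel - logmel.mean(axis=1, keepdims=True)
```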
An exemplary embodiment of the present application provides a streaming voice recognition method, and FIG. 3 shows a flowchart of the streaming voice recognition method in an exemplary embodiment of the present application. The streaming voice recognition method may be implemented by a terminal device, i.e. the terminal device may perform the steps of the following method, in which case the streaming voice recognition apparatus may be included in the terminal device. The terminal devices may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players) and in-vehicle terminals (e.g., car navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. As shown in FIG. 3, the streaming voice recognition method includes:
step S31: acquiring a voice block to be recognized;
in an exemplary embodiment, the audio information may be collected by a microphone or the like, and the audio information collected by the microphone may be obtained through a corresponding data transmission interface for subsequent identification. In the process of inputting voice to the audio acquisition equipment end by a user, the audio acquisition equipment can detect voice activity of continuous voice signals. Wherein the voice activity detection may determine the nature of the detected audio data by a preset detection means. Taking an energy detection mode as an example, when the energy of an audio segment is greater than a preset threshold value, determining that the audio segment is voice; and when the energy of the audio segment is less than or equal to a preset threshold value, determining that the audio segment is noise.
In an exemplary embodiment, voice blocks may be extracted from the speech signal to be recognized according to a specified time range or a specified voice block size. The continuous speech signal may be divided into multiple voice blocks according to detected endpoint times; for example, while the user continuously inputs a speech signal into the audio acquisition device, the device processes the continuous input into voice blocks of a specified size according to a preset data frame length. For example, the input speech may be divided into voice blocks in units of 10 ms, or into voice blocks to be recognized in units of 10 KB. As another example, each clause of the speech input may be determined according to the detected endpoint times and treated as one voice block: if voice detection shows that there is a speech signal between time A and time B, no speech signal between time B and time C, and a speech signal between time C and time D, then the signal between time A and time B may serve as a first voice block and the signal between time C and time D as a second voice block.
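A minimal sketch of extracting voice blocks by a specified size; the 16 kHz sample rate is an assumption, under which a 10 ms unit corresponds to 160 samples. Endpoint-based splitting would instead cut at the detected speech boundaries.

```python
import numpy as np

def split_into_blocks(wav: np.ndarray, sr: int = 16000, block_ms: int = 10):
    # A 10 ms block at 16 kHz covers 160 samples.
    block_len = sr * block_ms // 1000
    return [wav[i:i + block_len] for i in range(0, len(wav), block_len)]
```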
Step S33: performing object recognition processing on the voice block based on a connectionist temporal classification (CTC) model to obtain an object recognition result, and determining the number of objects in the voice block according to the object recognition result;
A connectionist temporal classification (CTC) model is an algorithm model commonly used in fields such as speech recognition and text recognition; it can solve the problems of inconsistent lengths and misalignment between input sequences and output sequences. The CTC model has two distinctive features. First, an additional output node is added to the network output to represent a "blank" symbol, while each other output node of the neural network represents an acoustic modeling unit during speech recognition. Depending on the modeling granularity, the modeling unit may be a monophone or a triphone, and the network output at each time step represents the posterior probability of each unit at that time. The "blank" symbol represents the state when the network output is uncertain: when the input is an unrecognizable feature such as noise, or lies at the critical boundary between two different phonemes, the network can output the "blank" symbol and avoid committing to a particular phoneme. Second, the training method of the CTC model optimizes over the whole input sentence: it aims to maximize the output probability of the correct text sequence for the whole sentence, rather than maximizing the output probability of each frame as cross-entropy training does.
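The sentence-level training objective described above can be sketched with a CTC loss; the use of PyTorch and all tensor shapes here are illustrative assumptions (the application does not name a framework). Index 0 stands for the extra "blank" symbol.

```python
import torch
import torch.nn as nn

T, N, C = 50, 1, 30                       # frames, batch size, symbols (incl. blank)
logits = torch.randn(T, N, C, requires_grad=True)
log_probs = logits.log_softmax(dim=2)     # per-frame posterior probabilities
targets = torch.tensor([[7, 3, 12, 5]])   # whole-sentence label sequence
ctc_loss = nn.CTCLoss(blank=0)            # index 0 reserved for "blank"
loss = ctc_loss(log_probs, targets,
                torch.tensor([T]), torch.tensor([4]))
loss.backward()  # optimizes the probability of the whole correct sequence
```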
In an exemplary embodiment, the performing object recognition processing on the voice block based on the CTC model to obtain an object recognition result, and determining the number of objects in the voice block according to the object recognition result, may include: performing object recognition processing on the voice block based on the CTC model to obtain at least one group of object recognition processing results and corresponding accuracies; determining the result of the group of object recognition processing with the highest accuracy as the object recognition result; and determining the number of objects in the voice block according to the object recognition result. Specifically, in the process of performing object recognition processing on the voice block based on the CTC model, the obtained recognition result is not a uniquely determined set of results, on the one hand because of the limited recognition accuracy of the CTC model, and on the other hand because recognition results in speech recognition are inherently non-unique. When multiple groups of object recognition results may be obtained, the accuracy of each group can be evaluated; the higher the accuracy, the greater the likelihood that it is the true object recognition result. Therefore, when more than one group of object recognition results is obtained, the most likely group must be determined, and the number of recognized objects in the voice block is then determined from that group.
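A minimal sketch of selecting the highest-accuracy group of object recognition results and taking its length as the object count; the hypothesis/score pairs are assumed to come from the CTC model's search.

```python
def count_objects(hypotheses):
    # hypotheses: list of (token_sequence, score) pairs from the CTC search.
    best_tokens, _ = max(hypotheses, key=lambda h: h[1])
    # The length of the best group is the number of recognition objects.
    return len(best_tokens), best_tokens

count, best = count_objects([(["I", "love", "you", "China"], -2.1),
                             (["I", "love", "u", "China"], -3.4)])
```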
In an exemplary embodiment, the performing object recognition processing on the voice block based on the CTC model to obtain an object recognition result may include: encoding the voice block to obtain a feature sequence of the voice block; and performing object recognition processing on the feature sequence of the voice block based on the CTC model to obtain the object recognition result. For a given input, the CTC system can compute a loss function over the probability distribution of all possible outputs, from which the output with the maximum probability, or the probability of a particular output, can be predicted. Therefore, the first step of recognizing a voice block in this speech recognition method is to encode the voice block to be recognized, or extract its features, to obtain the corresponding feature sequence, which serves as the input of the CTC model; that is, the recognizable features in the audio signal are extracted as the input for the voice block to be recognized. For example, in CTC recognition a segment of 10 milliseconds is usually taken as one voice block. Taking a voice block of "I love you, China" as an example, the feature sequence of the voice block obtained by a feature extraction method such as MFCC or LPC is input into the CTC model; the CTC model converts the pronunciations contained in the voice block into characters through decoding and recognition and outputs them, and the number of recognized objects contained in the voice block is then determined according to the output result.
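A sketch of how per-frame CTC outputs become recognized objects: take the most probable symbol per frame, collapse repeats, and drop blanks. The blank index of 0 is an assumption consistent with the sketch above.

```python
import numpy as np

def ctc_greedy_decode(frame_posteriors: np.ndarray, blank: int = 0):
    # frame_posteriors: (frames, symbols) posterior matrix from the CTC model.
    path = frame_posteriors.argmax(axis=1)    # best symbol per frame
    decoded, prev = [], blank
    for s in path:
        if s != blank and s != prev:          # collapse repeats, skip blanks
            decoded.append(int(s))
        prev = s
    return decoded                            # len(decoded) = object count
```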
One feature extraction method in an exemplary implementation is MFCC (Mel-Frequency Cepstral Coefficients) feature extraction. In this method, the speech spectrum is passed through a group of triangular filters and a Discrete Cosine Transform (DCT) is applied to obtain the MFCC coefficients, which represent the energy distribution of the signal spectrum over different frequency intervals. The spectral energy of each frequency interval is obtained by setting the corresponding filter; for example, 26 triangular filters may be set to obtain 26 MFCC coefficients, and the lower-order coefficients are then retained to obtain feature information that characterizes the vocal tract.
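A minimal sketch of MFCC extraction; the 26 filters and 13 retained lower-order coefficients follow the description above, while the use of librosa and the file name are illustrative assumptions.

```python
import librosa

# "speech_block.wav" is a hypothetical file standing in for one voice block.
y, sr = librosa.load("speech_block.wav", sr=16000)
# 26 triangular Mel filters, keeping the 13 lower-order coefficients.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_mels=26)
```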
Another feature extraction method is LPC (Linear Predictive Coding), which is generally used to encode the pitch, formants, short-time spectrum and other parameters of speech and can estimate speech parameters accurately with easy computation. Specifically, the speech signal can be modeled as the output of a linear time-varying system whose input excitation signal is a periodic pulse train (during voiced speech) or random noise (during unvoiced speech). According to the difference equation of the speech signal, each sample can be approximated by a linear combination of past samples, and a set of prediction coefficients can then be derived by locally minimizing the sum of squared differences between the actual samples and the linearly predicted samples. Linear predictive analysis for modeling speech signals can be implemented in a number of ways; for example, it can be performed by algorithms such as the covariance method, the autocorrelation method, the lattice method, the inverse filter method, the spectral estimation method, the maximum likelihood method and the inner product method.
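A minimal sketch of LPC analysis; the use of librosa (whose lpc routine fits the prediction coefficients via Burg's method) and the order of 16 are illustrative assumptions.

```python
import librosa

# "speech_block.wav" is a hypothetical file standing in for one voice block.
y, sr = librosa.load("speech_block.wav", sr=16000)
# Prediction coefficients minimizing the squared error between actual and
# linearly predicted samples; by convention the leading coefficient is 1.
lpc_coeffs = librosa.lpc(y, order=16)
```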
Step S35: determining a number of recognition times according to the number of objects in the voice block, and performing voice recognition processing on the voice block that number of times based on an attention model, thereby obtaining a voice recognition result corresponding to the voice block.
The attention model (attention mechanism) is a mechanism for improving the effect of encoder-decoder models based on recurrent neural networks (RNN) (or long short-term memory (LSTM) networks, GRU networks), and is widely applied in machine translation, speech recognition, image captioning and many other fields. The attention model has the ability to distinguish the objects being recognized; for example, in machine translation and speech recognition applications the attention model gives a different weight to each word in a sentence, which makes the learning of the neural network model more flexible, and the attention weights can also serve as an alignment that explains the correspondence between input and output sentences and the learned content. The attention model can give different weights to different parts of the input speech signal and extract the key and important information in it, allowing the model to make more accurate judgments without incurring a large amount of computation or excessive memory usage. In the attention model, an encoder encodes the input speech signal into a sequence of vectors, and during decoding the decoder selectively attends to a subset of this vector sequence for further processing. The attention model can therefore make full use of the information carried by the input sequence when generating each output.
When the attention model generates words for speech recognition, it not only attends to the global semantic encoding feature vector but also maintains an attention range indicating which parts of the input sequence the next output should focus on, and generates the next output according to the attended region, processing one object to be recognized in the block per step. Therefore, in the streaming voice recognition method of this embodiment, the number of objects to be recognized contained in a voice block is first determined by the CTC model, and the attention model then performs recognition the corresponding number of times, so that all recognition results corresponding to the voice block can be determined.
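The core idea of this embodiment can be sketched as a decoder loop that runs exactly as many steps as the CTC-predicted object count; attention_decoder is a hypothetical callable standing in for the attention model and is not an API defined by the application.

```python
def recognize_block(encoder_states, object_count, attention_decoder):
    outputs = []
    for _ in range(object_count):
        # Each pass attends over the encoded block and emits one object,
        # conditioned on the objects already emitted.
        outputs.append(attention_decoder(encoder_states, outputs))
    return outputs
```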
In an exemplary embodiment, obtaining the voice block to be recognized may further include: if it is detected that the previous voice block to be recognized has an unrecognized object, performing object recognition processing on the previous voice block to be recognized. Specifically, in this exemplary embodiment, by exploiting the attention model's ability to recognize a specific object with global context, a specific delay may be set in the recognition process. For example, if the number of recognition objects contained in the current voice block is five, the current attention-model pass may recognize the voice block only four times, and the last object, which is close to the block boundary, is recognized in the delayed recognition pass of the next voice block. This embodiment solves the problem that a recognition object at a voice block boundary cannot be recognized with global and contextual information in attention-model recognition.
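A minimal sketch of the delayed-boundary behavior under the assumptions above: when the block boundary is near, one recognition step is deferred and carried into the next block's pass; all names here are illustrative.

```python
def plan_recognition(object_count, near_boundary, carried_over=0):
    # Total steps owed for this block, including any object deferred earlier.
    steps_owed = object_count + carried_over
    # Defer the boundary-adjacent object to the next block's pass.
    deferred = 1 if (near_boundary and steps_owed > 0) else 0
    return steps_owed - deferred, deferred

# e.g. five objects detected near a boundary: recognize four now, one later.
steps_now, carry = plan_recognition(5, near_boundary=True)   # -> (4, 1)
```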
Further, the streaming voice recognition method of this embodiment also evaluates the accuracy of the recognition result. In an exemplary embodiment, determining the number of recognition times according to the number of objects in the voice block, and performing voice recognition processing on the voice block that number of times based on an attention model to obtain the voice recognition result corresponding to the voice block, includes: determining the number of recognition times according to the number of objects in the voice block, and performing voice recognition processing on the voice block that number of times based on the attention model to obtain at least one group of voice recognition processing results and corresponding accuracies; and determining the result of the group of voice recognition processing with the highest accuracy as the voice recognition result corresponding to the voice block. A recognition result here is a group of sentences composed of multiple recognition objects. In some cases speech recognition does not determine a unique recognition object, so when multiple groups of recognition results may be obtained for one voice block, the result of voice recognition by the attention model is not uniquely determined; when more than one group of recognition results is obtained, the most likely group can be determined using the accuracy of each group as the evaluation criterion. In an embodiment, a score-based evaluation criterion may be preset to screen the accuracy of the recognized objects: if the score is lower than a preset threshold, the accuracy is considered too low and the recognition result is not adopted, or the voice block may be re-recognized to obtain a more accurate recognition result.
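A minimal sketch of the threshold-based screening and re-recognition described above; the threshold value, retry count and the recognize callable are illustrative assumptions.

```python
def best_or_retry(recognize, block, threshold=0.6, retries=1):
    # recognize(block) returns a list of (result, accuracy) groups.
    for _ in range(retries + 1):
        groups = recognize(block)
        best, score = max(groups, key=lambda g: g[1])
        if score >= threshold:
            return best            # accept the highest-accuracy group
    return None                    # still below threshold: do not adopt
```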
Another exemplary embodiment of the present application provides a streaming voice recognition apparatus, and fig. 4 is a schematic structural diagram of a streaming voice recognition apparatus in an embodiment of the present application. As shown in fig. 4, the streaming voice recognition apparatus 40 includes:
an obtaining module 42, configured to obtain a speech block to be recognized;
a first recognition module 44, configured to perform object recognition processing on the speech block based on a connectionist temporal classification model to obtain an object recognition result, and to determine the number of recognition objects in the speech block according to the optimal object recognition processing result obtained by the object recognition processing;
and a second recognition module 46, configured to determine a number of recognition times according to the number of recognized objects in the speech block, and to perform speech recognition processing on the speech block that number of times based on an attention model to obtain the target speech recognition result corresponding to the speech block.
The details of each module/unit in the above device have been described in detail in the corresponding method section, and are not described herein again. It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
In addition to the above-described methods and apparatus, embodiments of the present application may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the methods according to the various embodiments of the present application described in the "exemplary methods" section of this specification, above.
The computer program product may be written with program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Another embodiment of the present application provides an electronic device, which may be used to perform all or part of the steps of the streaming speech recognition method described in this exemplary embodiment. The device comprises: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the steps in the method according to various embodiments of the present application described in the "exemplary method" set forth above in the specification.
Another implementation of the present application provides a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the method according to various embodiments of the present application described in the "exemplary method" described above in this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the foregoing disclosure is not intended to be exhaustive or to limit the disclosure to the precise details disclosed.
The block diagrams of devices, apparatuses and systems referred to in this application are only given as illustrative examples and are not intended to require or imply that the connections, arrangements and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses and systems may be connected, arranged and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including", "comprising", "having" and the like are open-ended words that mean "including, but not limited to" and are used interchangeably therewith. The words "or" and "and" as used herein mean, and are used interchangeably with, the word "and/or", unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A streaming speech recognition method, comprising the steps of:
acquiring a voice block to be recognized;
performing object recognition processing on the voice block based on a connectionist temporal classification model to obtain an object recognition result, and determining the number of objects in the voice block according to the object recognition result;
and determining a number of recognition times according to the number of objects in the voice block, and performing voice recognition processing on the voice block for the number of recognition times based on an attention model, thereby obtaining a voice recognition result corresponding to the voice block.
2. The streaming speech recognition method of claim 1, wherein the performing object recognition processing on the voice block based on the connectionist temporal classification model to obtain an object recognition result, and determining the number of objects in the voice block according to the object recognition result comprises:
performing object recognition processing on the voice block based on the connectionist temporal classification model to obtain at least one group of object recognition processing results and corresponding accuracies;
determining the result of the group of object recognition processing with the highest accuracy as the object recognition result;
and determining the number of objects in the voice block according to the object recognition result.
3. The streaming speech recognition method of claim 1, wherein the performing object recognition processing on the voice block based on the connectionist temporal classification model to obtain an object recognition result comprises:
encoding the voice block to obtain a feature sequence of the voice block;
and performing object recognition processing on the feature sequence of the voice block based on the connectionist temporal classification model to obtain the object recognition result.
4. The streaming voice recognition method of claim 1, wherein the determining the number of recognition times according to the number of objects in the voice block, and performing the voice recognition processing on the voice block for the number of recognition times based on an attention model to obtain a voice recognition result corresponding to the voice block comprises:
determining the number of recognition times according to the number of objects in the voice block, and performing voice recognition processing on the voice block for the number of recognition times based on the attention model to obtain at least one group of voice recognition processing results and corresponding accuracies;
and determining the result of the group of speech recognition processing with the highest accuracy as the speech recognition result corresponding to the speech block.
5. The streaming speech recognition method of claim 1, wherein the obtaining the speech block to be recognized comprises:
and if it is detected that the previous voice block to be recognized has an unrecognized object, performing object recognition processing on the previous voice block to be recognized.
6. The streaming speech recognition method of claim 4, further comprising:
and if the accuracies corresponding to the results of the at least one group of voice recognition processing are less than a threshold value, performing object recognition processing or voice recognition processing on the voice block again.
7. The streaming speech recognition method of claim 1, wherein the obtaining the speech block to be recognized comprises:
the speech block is extracted from the speech signal to be recognized according to a specified time range or a specified speech block size.
8. A streaming speech recognition apparatus, comprising:
the acquisition module is used for acquiring a voice block to be recognized;
the first recognition module is used for performing object recognition processing on the voice block based on the connectionist temporal classification model to obtain an object recognition result and determining the number of objects in the voice block according to the object recognition result;
and the second recognition module is used for determining the recognition times according to the number of the objects in the voice block, and performing voice recognition processing on the voice block for the recognition times based on the attention model to obtain a voice recognition result corresponding to the voice block.
9. An electronic device, comprising:
at least one processor; and the number of the first and second groups,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the streaming speech recognition method of any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, carries out the streaming speech recognition method of any one of claims 1 to 7.
CN202111150034.5A 2021-09-29 2021-09-29 Streaming voice recognition method and device, electronic equipment and storage medium Pending CN113990325A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111150034.5A CN113990325A (en) 2021-09-29 2021-09-29 Streaming voice recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111150034.5A CN113990325A (en) 2021-09-29 2021-09-29 Streaming voice recognition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113990325A true CN113990325A (en) 2022-01-28

Family

ID=79737221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111150034.5A Pending CN113990325A (en) 2021-09-29 2021-09-29 Streaming voice recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113990325A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114822540A (en) * 2022-06-29 2022-07-29 广州小鹏汽车科技有限公司 Vehicle voice interaction method, server and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination