CN113096642A - Speech recognition method and device, computer readable storage medium, electronic device - Google Patents

Speech recognition method and device, computer readable storage medium, electronic device

Info

Publication number
CN113096642A
Authority
CN
China
Prior art keywords
prediction
voice
library
features
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110351353.6A
Other languages
Chinese (zh)
Inventor
黄明运
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Horizon Robotics Technology Co Ltd
Original Assignee
Nanjing Horizon Robotics Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Horizon Robotics Technology Co Ltd filed Critical Nanjing Horizon Robotics Technology Co Ltd
Priority to CN202110351353.6A priority Critical patent/CN113096642A/en
Publication of CN113096642A publication Critical patent/CN113096642A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 Speech or audio signals analysis-synthesis techniques using predictive techniques
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L 2015/223 Execution procedure of a spoken command

Abstract

Embodiments of the disclosure disclose a speech recognition method and apparatus, a computer-readable storage medium, and an electronic device. The method includes the following steps: processing speech signals acquired in an application scene into multiple groups of speech features; processing the multiple groups of speech features through a prediction library to obtain multiple prediction results; and determining a speech recognition result in the application scene based on the multiple prediction results. Because a single prediction library predicts all the groups of speech features, the memory footprint is reduced; the method provided by this embodiment can therefore be applied in systems with little memory, which widens the applicable range of the speech recognition method.

Description

Speech recognition method and device, computer readable storage medium, electronic device
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a speech recognition method and apparatus, a computer-readable storage medium, and an electronic device.
Background
In keyword wake-up scenes, different scenes emphasize different noise-reduction effects, so a multi-channel decoder is often started to achieve a better recognition effect, which increases the memory usage of the system. On platforms with limited system resources, memory is a scarce resource, so such decoding may not be realizable.
Disclosure of Invention
The present disclosure is proposed to solve the above technical problems. The embodiment of the disclosure provides a voice recognition method and device, a computer-readable storage medium and an electronic device.
According to an aspect of an embodiment of the present disclosure, there is provided a speech recognition method including:
processing the voice signals acquired in the application scene into a plurality of groups of voice features;
processing the multiple groups of voice features through a prediction library to obtain multiple prediction results;
determining a speech recognition result in the application scene based on the plurality of prediction results.
According to another aspect of the embodiments of the present disclosure, there is provided a speech recognition apparatus including:
the signal processing module is used for processing the voice signals acquired in the application scene into a plurality of groups of voice features;
the feature prediction module is used for respectively processing the multiple groups of voice features obtained by the signal processing module through a prediction library to obtain multiple prediction results;
and the voice recognition module is used for determining a voice recognition result in the application scene based on a plurality of prediction results obtained by the characteristic prediction module.
According to still another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the voice recognition method of the above-described embodiments.
According to still another aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the speech recognition method according to the above embodiment.
Based on the speech recognition method and apparatus, the computer-readable storage medium, and the electronic device provided by the above embodiments of the present disclosure, the speech signals acquired in the application scene are processed into multiple groups of speech features; the multiple groups of speech features are processed through a prediction library to obtain multiple prediction results; and a speech recognition result in the application scene is determined based on the multiple prediction results. Because a single prediction library predicts all the groups of speech features, the memory footprint is reduced; the method provided by this embodiment can therefore be applied in systems with little memory, which widens the applicable range of the speech recognition method.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in more detail embodiments of the present disclosure with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 is a schematic structural diagram of a speech recognition system according to an exemplary embodiment of the present disclosure.
Fig. 2 is a flowchart illustrating a speech recognition method according to an exemplary embodiment of the present disclosure.
Fig. 3 is a flowchart illustrating a speech recognition method according to another exemplary embodiment of the present disclosure.
Fig. 4 is a schematic flow chart of step 302 in the embodiment shown in fig. 3 of the present disclosure.
Fig. 5 is a schematic flow chart of step 202 in the embodiment shown in fig. 2 of the present disclosure.
Fig. 6 is another flow chart illustrating step 202 in the embodiment shown in fig. 2 of the present disclosure.
Fig. 7 is a schematic flow chart of step 203 in the embodiment shown in fig. 2 of the present disclosure.
Fig. 8 is a flowchart illustrating a speech recognition method according to another exemplary embodiment of the present disclosure.
Fig. 9 is a schematic structural diagram of a speech recognition apparatus according to an exemplary embodiment of the present disclosure.
Fig. 10 is a schematic structural diagram of a speech recognition apparatus according to another exemplary embodiment of the present disclosure.
Fig. 11 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Hereinafter, example embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those of skill in the art that the terms "first," "second," and the like in the embodiments of the present disclosure are used merely to distinguish one element from another, and are not intended to imply any particular technical meaning, nor is the necessary logical order between them.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more and "at least one" may refer to one, two or more.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.
In addition, the term "and/or" in the present disclosure is only one kind of association relationship describing an associated object, and means that three kinds of relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" in the present disclosure generally indicates that the former and latter associated objects are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
The disclosed embodiments may be applied to electronic devices such as terminal devices, computer systems, servers, etc., which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with electronic devices such as terminal devices, computer systems, servers, and the like include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above systems, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Summary of the application
In the process of implementing the present disclosure, the inventor found that in keyword wake-up scenes, because different scenes emphasize different noise-reduction effects, a multi-channel decoder is often started to achieve a better recognition effect; this technical scheme has at least the following problem: the memory usage of the system is increased.
Exemplary System
In a keyword wake-up system, multiple decoders are often started to adapt to the noise-reduction effects of different scenes. Each decoder needs to load a model (a deep neural network, i.e., its network structure and parameters) and to initialize a prediction library (a set of operators; prediction is performed through the prediction library, whose basic computing units are determined by the operations and parameters of the network). Starting multiple decoders therefore increases the memory usage of the system.
The embodiment of the disclosure reduces memory use by having multiple decoders share a single prediction library.
Fig. 1 is a schematic structural diagram of a speech recognition system according to an exemplary embodiment of the present disclosure. The system provided by this embodiment includes a prediction library 102 and a plurality of decoders 103; for ease of understanding the system's processing flow, the figure also shows the processed speech signals 101, the speech features 105, and the speech recognition result 104 obtained from the speech signals 101.
Feature extraction is performed on each of the multiple groups of speech signals 101 to obtain multiple groups of speech features 105; for example, the speech signals may be acquired by a microphone array in the set application scene. The collected speech signals are usually time-domain signals; so that the neural network can process them, this embodiment converts the speech signals to the frequency domain (for example, by Fourier transform) and obtains spectral features from the multiple groups of speech signals as the multiple groups of speech features 105.
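For illustration, the following minimal Python sketch (not part of the original disclosure; the 25 ms frame length, 10 ms hop, Hann window, and log-magnitude spectrum are assumed choices) converts one channel's time-domain signal into a sequence of spectral feature frames:

```python
import numpy as np

def extract_spectral_features(signal, frame_len=400, hop=160):
    """Frame the time-domain signal, window each frame, and take the
    log-magnitude FFT, yielding one spectral feature vector per frame.
    Frame/hop sizes correspond to 25 ms / 10 ms at 16 kHz (assumed)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.abs(np.fft.rfft(frames, axis=1))  # magnitude spectrum
    return np.log(spectrum + 1e-8)                  # log compression

# One group of features per microphone channel (stand-in audio shown here).
channels = [np.random.randn(16000), np.random.randn(16000)]
feature_groups = [extract_spectral_features(c) for c in channels]  # 98 x 201 each
```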
The multiple groups of speech features 105 are respectively input into the prediction library 102, which performs probability value prediction on them: for each frame in the multi-frame signal contained in each group of speech signals, the prediction library determines a phoneme probability value for each of at least one candidate phoneme, i.e., the probability that the frame corresponds to each possible phoneme. Then, for each phoneme, the phoneme probability values over the consecutive frames it spans (at least one frame) are added, and the sum is taken as the probability value of that phoneme.
When more than one decoder performs decoding, the prediction library 102 and the model-initialization function are extracted as a separate, shared step that is completed first; each decoder then initializes only its own decoding-related parts (the parameters inside the decoder).
The system also includes a number of decoders 103 matching the number of groups of speech signals 101, each decoder 103 corresponding to one group of speech signals 101; when decoding, the decoders 103 can call the prediction library in turn. Since the audio is generated in real time and the real-time rate of processing is typically small, no significant delay is introduced. Memory holds only one model and one prediction library, so memory occupation is reduced.
The prediction library 102 itself occupies a certain amount of code space and requires loading the model into system memory, so this part generally accounts for a large share of the memory used. The disclosed embodiment extracts this part when initializing the decoders 103, so that all decoders 103 multiplex a single prediction library 102. After feature extraction and prediction, the decoding and recognition processes are completed independently, without mutual interference. Memory is thus saved without affecting the original recognition effect.
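A minimal sketch of this sharing scheme follows (the class and method names, the toy linear model, and the per-frame argmax standing in for path search are illustrative assumptions, not the patent's API). The prediction library is built once, and every decoder holds a reference to the same instance:

```python
import numpy as np

class PredictionLibrary:
    """Single shared inference engine: the model is loaded once, so memory
    holds one model plus one library regardless of the decoder count."""

    def __init__(self, weights):
        self.weights = weights                        # model parameters, loaded once

    def predict(self, features):
        # Toy forward pass: linear layer + softmax -> per-frame phoneme probs.
        logits = features @ self.weights
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

class Decoder:
    """One decoder per group of speech signals; all share the same library."""

    def __init__(self, shared_library):
        self.library = shared_library                 # a reference, not a copy

    def decode(self, features):
        posteriors = self.library.predict(features)
        return posteriors.argmax(axis=1)              # placeholder for path search

# One initialization, N decoders, called in turn on their feature groups.
library = PredictionLibrary(np.random.randn(201, 40))  # 201 feature dims, 40 phonemes
decoders = [Decoder(library) for _ in range(4)]
feature_groups = [np.random.randn(98, 201) for _ in range(4)]
results = [d.decode(f) for d, f in zip(decoders, feature_groups)]
```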
Each decoder 103 determines at least one path for its group of speech signals based on the phoneme probability values, output by the prediction library 102, for each frame of the multi-frame signal in that group. A speech recognition result 104 is then determined from the paths corresponding to the groups of speech signals; for example, the path with the largest path probability value among all paths of all groups is taken as the speech recognition result in the application scene.
The embodiment of the disclosure reduces the amount of memory required when running on an embedded platform; it improves multi-channel decoding performance while lowering memory occupation, so the wake-up system can run on platforms with more limited resources.
Exemplary method
Fig. 2 is a flowchart illustrating a speech recognition method according to an exemplary embodiment of the present disclosure. The embodiment can be applied to an electronic device, as shown in fig. 2, and includes the following steps:
step 201, processing the voice signals acquired in the application scene into a plurality of groups of voice features.
The speech signals can be collected at different positions in the application scene through multiple groups of microphone arrays, and the acquired speech signals can be time-domain signals. Application scenarios may include, but are not limited to, keyword wake-up.
Step 202, processing the multiple groups of voice features through a prediction library respectively to obtain multiple prediction results.
The prediction library is a code library that supports forward computation of a model, such as MXNet or TensorFlow. The prediction result may be, for each frame of the speech signal, a probability value for each of a plurality of phonemes.
Step 203, determining a speech recognition result in the application scene based on the plurality of prediction results.
The voice recognition result may be a determined one of a plurality of paths determined based on the plurality of prediction results, for example, the path having the highest probability value.
In this embodiment, the prediction results for the speech signals at multiple positions in the application scene are combined to determine a more accurately recognized speech recognition result; for example, the prediction result with the highest confidence among the multiple prediction results is taken as the speech recognition result.
In the speech recognition method provided by the above embodiment of the present disclosure, the speech signals acquired in the application scene are processed into multiple groups of speech features; the multiple groups of speech features are processed through a prediction library to obtain multiple prediction results; and a speech recognition result in the application scene is determined based on the multiple prediction results. Because a single prediction library predicts all the groups of speech features, the memory footprint is reduced; the method provided by this embodiment can therefore be applied in systems with little memory, which widens the applicable range of the speech recognition method.
Fig. 3 is a flowchart illustrating a speech recognition method according to another exemplary embodiment of the present disclosure. As shown in fig. 3, the method comprises the following steps:
in step 301, a plurality of identical decoders to be applied for decoding are determined according to an application scenario.
Each decoder corresponds to one group of speech features, and one decoder processes one group of speech features. As shown in fig. 1, the group of speech features input on the left is processed by the prediction library and then fed into the decoder on the left, and the group of speech features input on the right is processed by the prediction library and then fed into the decoder on the right; the correspondence between features and decoders is the same as it was before the prediction library was interposed.
Step 302, a prediction library is initialized based on a prediction model included in a decoder.
Initialization is realized by adding computation logic to the prediction library; since the prediction model is a deep neural network, the computation logic loaded into the prediction library is the computation logic corresponding to that neural network.
Then, steps 201 to 203 in the embodiment shown in fig. 2 are performed.
In this embodiment, when more than one decoder is used for decoding, the functions of the prediction library and of model initialization may be extracted separately, and the initialization process is completed first; the initialization process may include each decoder initializing the parameters of its decoding-related parts. Because the prediction library is initialized from the prediction models corresponding to the decoders before decoding, the initialized prediction library can predict all the groups of speech features, which improves the initialization efficiency of the prediction library.
As shown in fig. 4, based on the embodiment shown in fig. 3, step 302 may include the following steps:
step 3021, determining an operation logic corresponding to the prediction model according to the prediction model included in the decoder.
The operation logic expresses the operation formulas involved in the prediction process of the prediction model, i.e., the logical relationships through which the prediction model realizes prediction. For example, when the prediction model is a two-dimensional convolution model, the corresponding operation formula may be y = W × x + B.
And step 3022, initializing the prediction library according to the operation logic, so that the prediction library performs prediction processing according to the operation logic of the prediction model when running.
In this embodiment, the structure of the prediction model (which may be any kind of neural network) is analyzed, and the parameters of each layer are loaded into memory according to the node names; the input and output shape and type of each layer are deduced from the operation logic of the prediction model, and corresponding space is opened up in memory. Through this initialization, the prediction library can, at run time, accurately predict all the speech signals covered by the operation logic (corresponding to at least one decoder), which avoids the long initialization time of initializing multiple prediction libraries; initializing by operation logic speeds up initialization while keeping it accurate.
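A minimal sketch of such an initialization pass follows (the graph and parameter-store layout, the field names, and the single fully connected layer are assumptions made for illustration). For each layer, the parameters are loaded by node name, the output shape is inferred from the layer's operation, and the output buffer is pre-allocated:

```python
import numpy as np

class PredictionLibrary:
    def initialize(self, model_graph, param_store):
        """One-time setup shared by all decoders: load each layer's
        parameters by node name, deduce its output shape, and open up
        the memory the forward pass will write into."""
        self.params, self.buffers = {}, {}
        shape = model_graph["input_shape"]
        for node in model_graph["layers"]:
            name = node["name"]
            self.params[name] = param_store[name]             # load by node name
            shape = node["infer_shape"](shape)                # deduce output shape
            self.buffers[name] = np.empty(shape, np.float32)  # pre-allocate space

# Toy graph: one fully connected layer mapping 201 features to 40 phonemes.
graph = {
    "input_shape": (1, 201),
    "layers": [{"name": "fc0", "infer_shape": lambda s: (s[0], 40)}],
}
params = {"fc0": np.random.randn(201, 40).astype(np.float32)}
library = PredictionLibrary()
library.initialize(graph, params)
```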
As shown in fig. 5, based on the embodiment shown in fig. 2, step 202 may include the following steps:
and step 2021, operating the prediction library to sequentially predict probability values of the multiple groups of voice features.
Step 2022, determining a phoneme probability value of each frame signal in the multi-frame signals included in the speech signal as each phoneme in at least one phoneme for each group of speech features in the plurality of groups of speech features.
Optionally, the process in which the decoder predicts the speech signal through the prediction library is a process of recognizing the speech signal through a neural network, identifying for each frame a probability value for each of at least one phoneme; for example, a frame may be recognized as phoneme d with a probability value of 80% and as phoneme t with a probability value of 20%, and so on.
In this embodiment, each decoder may call the prediction library in turn to realize decoding. Since the audio is generated in real time and the real-time rate of processing is typically small, no significant delay is introduced. This embodiment realizes prediction for multiple groups of speech features while memory holds only one prediction model and one prediction library, thereby reducing memory occupation.
As shown in fig. 6, based on the embodiment shown in fig. 2, step 202 may further include the following steps:
step 2023, combine similar operations in the operation logic included in the prediction library into the same operation.
Similar operations are computed by the same operation logic, and received similar operations are queued and executed in order. For example, if two groups of speech features both require a convolution operation, the convolution is performed on one group of speech features and then on the other, in the order in which they were received.
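The queue-and-merge idea can be sketched as follows (the OperationQueue class, the kernel registry, and the moving-average stand-in for convolution are illustrative assumptions). Requests of the same operation type share one implementation and run back to back in arrival order:

```python
from collections import defaultdict, deque
import numpy as np

class OperationQueue:
    """Queue requests per operation type; each type's queue runs through a
    single merged implementation, in the order the requests were received."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def submit(self, op_name, tensor):
        self.queues[op_name].append(tensor)    # same op type -> same queue

    def run(self, kernels):
        results = []
        for op_name, queue in self.queues.items():
            kernel = kernels[op_name]          # one implementation per op type
            while queue:
                results.append(kernel(queue.popleft()))
        return results

# Two feature groups both need a convolution: they share one kernel and are
# executed one after the other, in the order received.
conv = lambda x: np.convolve(x, np.ones(3) / 3, mode="same")
q = OperationQueue()
q.submit("conv", np.random.randn(100))
q.submit("conv", np.random.randn(100))
out_a, out_b = q.run({"conv": conv})
```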
Step 2024, running the prediction library after operation merging on the embedded platform, and sequentially predicting probability values of the multiple groups of speech features.
Step 2022, determining a phoneme probability value of each frame signal in the multi-frame signals included in the speech signal as each phoneme in at least one phoneme for each group of speech features in the plurality of groups of speech features.
In this embodiment, when the prediction library runs on an embedded platform, it is appropriately trimmed: its size is reduced by changing data formats or by other means, so that the data volume of the prediction library meets the needs of the embedded platform; a certain amount of performance is sacrificed to reduce the memory occupied by the prediction library. Trimming may include, for example but not limited to, quantizing float32 to int8. The computation can also be optimized for the CPU instructions of the target platform to speed up the model. By merging similar computations, this embodiment makes full use of parallel instructions, speeds up memory access, makes the structure of the prediction library more compact, and speeds up caching.
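As an illustration of the float32-to-int8 trimming mentioned above, here is a minimal quantization sketch (the patent does not specify the scheme; per-tensor symmetric scaling is an assumption):

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric float32 -> int8 quantization: map the largest-magnitude
    weight to 127 and round everything else onto the int8 grid."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale        # approximate original weights

w = np.random.randn(201, 40).astype(np.float32)
q, s = quantize_int8(w)
print("size ratio:", q.nbytes / w.nbytes)      # 0.25: one quarter of the memory
```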
As shown in fig. 7, on the basis of the embodiment shown in fig. 2, step 203 may further include the following steps:
step 2031, for each group of voice features in the multiple groups of voice features, determining probability values of phonemes based on a sum of phoneme probability values of at least one frame of signal corresponding to each phoneme.
Step 2032, determining at least one path corresponding to each group of voice signals based on the probability value corresponding to each phoneme.
Wherein each path comprises a plurality of phonemes.
Step 2033, determining a voice recognition result based on the plurality of paths corresponding to the plurality of voice signals.
In this embodiment, a path is the phoneme sequence obtained by assigning each frame of the multi-frame signal one phoneme and connecting the phonemes in frame order. Referring to the decoding process of the decoder 103 in the embodiment provided in fig. 1: because each frame may correspond to several candidate phonemes, the neural network's prediction yields multiple paths; connecting the possible phonemes of each frame in the order of the multi-frame signal gives one path. For example, take the wake-up word "horizon" (in its Chinese pronunciation): the phonemes constituting the wake-up word are d, i, p, ing, x, and ian. If the neural network predicts two phonemes, d and t, for the first frame, two paths are available: d-i-p-ing-x-ian and t-i-p-ing-x-ian. The same holds for the other phonemes: the more candidate phonemes per frame, the more paths. Each path corresponds to a path probability value, which is the sum of the probability values of the phonemes on that path; since each phoneme has a corresponding probability value, each path's probability value is determined by adding the probability values of its phonemes. One or more paths with higher (e.g., the largest) path probability values can then be determined from the path probability values and taken as the speech recognition result.
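The following sketch works through exactly this example (the per-frame probability values are invented for illustration; the scoring rule, summing the phoneme probability values along each path, is the one described above):

```python
from itertools import product

# Per-frame phoneme candidates with probability values: frame 1 may be
# "d" (0.8) or "t" (0.2); the remaining frames are unambiguous here.
frame_candidates = [
    {"d": 0.8, "t": 0.2},
    {"i": 0.9}, {"p": 0.9}, {"ing": 0.9}, {"x": 0.9}, {"ian": 0.9},
]

# Enumerate every path (one phoneme per frame) and score it by adding the
# probability values of its phonemes.
paths = {}
for combo in product(*[c.items() for c in frame_candidates]):
    phones = "-".join(p for p, _ in combo)
    paths[phones] = sum(prob for _, prob in combo)

best = max(paths, key=paths.get)
print(best, paths[best])   # d-i-p-ing-x-ian beats t-i-p-ing-x-ian
```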
In some alternative embodiments, step 201 may include:
and performing Fourier transform on the voice signals acquired in the application scene at least once to obtain a plurality of groups of frequency spectrum characteristics as a plurality of groups of voice characteristics.
In general, a speech signal acquired by a speech acquisition device such as a microphone is a time-domain signal. In this embodiment, the acquired time-domain signal is Fourier-transformed into the frequency domain, and the resulting spectral features are used as the speech features.
Fig. 8 is a flowchart illustrating a speech recognition method according to another exemplary embodiment of the present disclosure. As shown in fig. 8, the method comprises the following steps:
step 301 in the embodiment of fig. 3 described above is performed before step 801 is performed.
Step 801, load the prediction model and prediction library included in the decoder into the memory.
After step 801 is executed, step 302 in the embodiment shown in fig. 3 is executed, and then steps 201 to 203 in the embodiment shown in fig. 2 are executed.
In this embodiment, since the prediction library itself occupies a certain amount of code space, the model must be loaded into the system memory before the prediction library is initialized; the initialization of the prediction library is then completed in memory, so the initialized prediction library resides directly in memory and need not be reloaded.
The voice recognition method provided by any of the above embodiments of the present disclosure may be applied to the application fields such as voice wakeup.
Any of the speech recognition methods provided by the embodiments of the present disclosure may be performed by any suitable device having data processing capabilities, including but not limited to terminal devices and servers. Alternatively, any of the speech recognition methods provided by the embodiments of the present disclosure may be executed by a processor; for example, the processor may call corresponding instructions stored in a memory to execute any of the speech recognition methods mentioned in the embodiments of the present disclosure. Details are not repeated below.
Exemplary devices
Fig. 9 is a schematic structural diagram of a speech recognition apparatus according to an exemplary embodiment of the present disclosure. As shown in fig. 9, the apparatus provided in this embodiment includes:
and the signal processing module 91 is configured to process the voice signals acquired in the application scene into multiple groups of voice features.
The feature prediction module 92 is configured to process the multiple sets of speech features obtained by the signal processing module 91 through a prediction library, respectively, to obtain multiple prediction results.
And the speech recognition module 93 is configured to determine a speech recognition result in the application scenario based on the plurality of prediction results obtained by the feature prediction module 92.
In the speech recognition device provided by the above embodiment of the present disclosure, speech signals acquired in an application scene are processed into multiple groups of speech features; processing the multiple groups of voice features through a prediction library to obtain multiple prediction results; determining a speech recognition result in the application scene based on the plurality of prediction results; according to the method, the prediction library is used for predicting the multiple groups of voice features, so that the occupied space of the memory is reduced, the method provided by the embodiment can be applied to a system with a small memory, and the application range of the voice recognition method is widened.
Fig. 10 is a schematic structural diagram of a speech recognition apparatus according to another exemplary embodiment of the present disclosure. As shown in fig. 10, the apparatus provided in this embodiment includes:
in this embodiment, before the signal processing module 91, the method further includes:
a decoder determining module 11, configured to determine a plurality of identical decoders to be applied for decoding according to the application scenario. Wherein each decoder corresponds to a set of speech features.
And a memory loading module 12, configured to load the prediction model and the prediction library included in the decoder into a memory.
An initialization module 13 for initializing the prediction library based on the prediction model included in the decoder.
Optionally, the initialization module 13 is specifically configured to determine an operation logic corresponding to the prediction model according to the prediction model included in the decoder; and initializing the prediction library according to the operation logic, so that the prediction library performs prediction processing according to the operation logic of the prediction model when in operation.
In this embodiment, the feature prediction module 92 includes:
the operation merging unit 921 merges similar operations in the operation logic included in the prediction base into the same operation.
And a probability prediction unit 922, configured to run the prediction library to sequentially perform probability value prediction on multiple groups of voice features.
A probability value determining unit 923 configured to determine, for each group of the plurality of groups of speech features, a phoneme probability value for each of at least one phoneme for each frame signal in a multi-frame signal included in the speech signal.
Optionally, the probability prediction unit 922 is specifically configured to run a prediction library after operation and merging on the embedded platform, and sequentially perform probability value prediction on multiple groups of voice features.
In this embodiment, the speech recognition module 93 includes:
a phoneme probability determining unit 931, configured to determine, for each group of the multiple groups of speech features, probability values of the phonemes based on a sum of the phoneme probability values of at least one frame of signal corresponding to each phoneme.
A path determining unit 932, configured to determine at least one path corresponding to each group of voice signals based on the probability value corresponding to each phoneme; wherein each path comprises a plurality of phonemes.
A recognition result determining unit 933 configured to determine a voice recognition result based on a plurality of paths corresponding to a plurality of voice signals.
In this embodiment, the signal processing module 91 is specifically configured to perform at least one fourier transform on a voice signal acquired in an application scenario, and obtain multiple groups of spectrum features as multiple groups of voice features.
Exemplary electronic device
Next, an electronic apparatus according to an embodiment of the present disclosure is described with reference to fig. 11. The electronic device may be either or both of the first device 100 and the second device 200, or a stand-alone device separate from them that may communicate with the first device and the second device to receive the collected input signals therefrom.
FIG. 11 illustrates a block diagram of an electronic device in accordance with an embodiment of the disclosure.
As shown in fig. 11, electronic device 110 includes one or more processors 111 and memory 112.
Processor 111 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in electronic device 110 to perform desired functions.
Memory 112 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by processor 111 to implement the speech recognition methods of the various embodiments of the present disclosure described above and/or other desired functions. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.
In one example, the electronic device 110 may further include: an input device 113 and an output device 114, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
For example, when the electronic device is the first device 100 or the second device 200, the input device 113 may be a microphone or a microphone array as described above for capturing an input signal of a sound source. When the electronic device is a stand-alone device, the input means 113 may be a communication network connector for receiving the acquired input signals from the first device 100 and the second device 200.
The input device 113 may also include, for example, a keyboard, a mouse, and the like.
The output device 114 may output various information including the determined distance information, direction information, and the like to the outside. The output devices 114 may include, for example, a display, speakers, a printer, and a communication network and its connected remote output devices, among others.
Of course, for simplicity, only some of the components of the electronic device 110 relevant to the present disclosure are shown in fig. 11, omitting components such as buses, input/output interfaces, and the like. In addition, electronic device 110 may include any other suitable components, depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the speech recognition methods according to various embodiments of the present disclosure described in the "exemplary methods" section of this specification above.
The computer program product may write program code for carrying out operations of embodiments of the present disclosure in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform steps in a speech recognition method according to various embodiments of the present disclosure described in the "exemplary methods" section above of this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present disclosure are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The block diagrams of devices, apparatuses, and systems referred to in this disclosure are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As will be appreciated by those skilled in the art, these devices, apparatuses, and systems may be connected, arranged, and configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (10)

1. A speech recognition method comprising:
processing the voice signals acquired in the application scene into a plurality of groups of voice features;
processing the multiple groups of voice features through a prediction library to obtain multiple prediction results;
determining a speech recognition result in the application scene based on the plurality of prediction results.
2. The method according to claim 1, further comprising, before processing the speech signals acquired in the application scenario into a plurality of sets of speech features:
determining a plurality of identical decoders applied to decoding according to the application scenario; wherein each of the decoders corresponds to a set of the speech features;
initializing the prediction library based on a prediction model included in the decoder.
3. The method of claim 2, wherein the initializing the prediction library based on a prediction model included in the decoder comprises:
determining an operation logic corresponding to the prediction model according to the prediction model included in the decoder;
and initializing the prediction library according to the operation logic, so that the prediction library carries out prediction processing according to the operation logic of the prediction model when in operation.
4. The method of claim 2, further comprising, prior to initializing the prediction library based on a prediction model included in the decoder:
and loading the prediction model and the prediction library included by the decoder into a memory.
5. The method according to any of claims 1-4, wherein the processing the plurality of groups of speech features by a prediction library to obtain a plurality of prediction results comprises:
operating the prediction library to sequentially predict probability values of the multiple groups of voice features;
for each group of the plurality of groups of voice features, determining a phoneme probability value of each frame signal in a multi-frame signal included in the voice signal as each phoneme in at least one phoneme.
6. The method of claim 5, wherein the operating the prediction library to sequentially predict probability values for the plurality of groups of speech features comprises:
merging similar operations in the operation logic included in the prediction library into the same operation;
and operating the prediction library after the operation combination on the embedded platform, and sequentially predicting probability values of the multiple groups of voice characteristics.
7. The method of claim 6, wherein the determining speech recognition results in the application scenario based on the plurality of prediction results comprises:
for each group of voice features in the multiple groups of voice features, determining probability values of phonemes based on the sum of phoneme probability values of at least one frame signal corresponding to each phoneme;
determining at least one path corresponding to each group of the voice signals based on the probability value corresponding to each phoneme; wherein each of the paths comprises a plurality of phonemes;
and determining a voice recognition result based on a plurality of paths corresponding to the plurality of voice signals.
8. A speech recognition apparatus comprising:
the signal processing module is used for processing the voice signals acquired in the application scene into a plurality of groups of voice features;
the feature prediction module is used for respectively processing the multiple groups of voice features obtained by the signal processing module through a prediction library to obtain multiple prediction results;
and the voice recognition module is used for determining a voice recognition result in the application scene based on a plurality of prediction results obtained by the characteristic prediction module.
9. A computer-readable storage medium, which stores a computer program for executing the speech recognition method according to any one of claims 1 to 7.
10. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is used for reading the executable instructions from the memory and executing the instructions to realize the speech recognition method of any one of the above claims 1 to 7.
CN202110351353.6A 2021-03-31 2021-03-31 Speech recognition method and device, computer readable storage medium, electronic device Pending CN113096642A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110351353.6A CN113096642A (en) 2021-03-31 2021-03-31 Speech recognition method and device, computer readable storage medium, electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110351353.6A CN113096642A (en) 2021-03-31 2021-03-31 Speech recognition method and device, computer readable storage medium, electronic device

Publications (1)

Publication Number Publication Date
CN113096642A true CN113096642A (en) 2021-07-09

Family

ID=76672293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110351353.6A Pending CN113096642A (en) 2021-03-31 2021-03-31 Speech recognition method and device, computer readable storage medium, electronic device

Country Status (1)

Country Link
CN (1) CN113096642A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170154033A1 (en) * 2015-11-30 2017-06-01 Samsung Electronics Co., Ltd. Speech recognition apparatus and method
CN106816148A (en) * 2015-11-30 2017-06-09 三星电子株式会社 Speech recognition apparatus and method
CN107507613A (en) * 2017-07-26 2017-12-22 合肥美的智能科技有限公司 Towards Chinese instruction identification method, device, equipment and the storage medium of scene
CN108510977A (en) * 2018-03-21 2018-09-07 清华大学 Language Identification and computer equipment
CN109036391A (en) * 2018-06-26 2018-12-18 华为技术有限公司 Audio recognition method, apparatus and system
CN111862943A (en) * 2019-04-30 2020-10-30 北京地平线机器人技术研发有限公司 Speech recognition method and apparatus, electronic device, and storage medium
CN112216307A (en) * 2019-07-12 2021-01-12 华为技术有限公司 Speech emotion recognition method and device
CN111489737A (en) * 2020-04-13 2020-08-04 深圳市友杰智新科技有限公司 Voice command recognition method and device, storage medium and computer equipment
CN112102816A (en) * 2020-08-17 2020-12-18 北京百度网讯科技有限公司 Speech recognition method, apparatus, system, electronic device and storage medium
CN112259077A (en) * 2020-10-20 2021-01-22 网易(杭州)网络有限公司 Voice recognition method, device, terminal and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114333799A (en) * 2022-03-09 2022-04-12 深圳市友杰智新科技有限公司 Detection method and device for phase-to-phase sound misidentification and computer equipment
CN114333799B (en) * 2022-03-09 2022-08-02 深圳市友杰智新科技有限公司 Detection method and device for phase-to-phase sound misidentification and computer equipment

Similar Documents

Publication Publication Date Title
US10891944B2 (en) Adaptive and compensatory speech recognition methods and devices
JP7023934B2 (en) Speech recognition method and equipment
US20180350351A1 (en) Feature extraction using neural network accelerator
US11093755B2 (en) Video segmentation based on weighted knowledge graph
WO2021206804A1 (en) Sequence-to-sequence speech recognition with latency threshold
US20210390970A1 (en) Multi-modal framework for multi-channel target speech seperation
CN112509600A (en) Model training method and device, voice conversion method and device and storage medium
CN110929505B (en) Method and device for generating house source title, storage medium and electronic equipment
US10629184B2 (en) Cepstral variance normalization for audio feature extraction
US9959887B2 (en) Multi-pass speech activity detection strategy to improve automatic speech recognition
CN110288974B (en) Emotion recognition method and device based on voice
JP7329393B2 (en) Audio signal processing device, audio signal processing method, audio signal processing program, learning device, learning method and learning program
CN113096642A (en) Speech recognition method and device, computer readable storage medium, electronic device
CN110890098B (en) Blind signal separation method and device and electronic equipment
Fazliddinovich et al. Parallel processing capabilities in the process of speech recognition
CN113053377A (en) Voice wake-up method and device, computer readable storage medium and electronic equipment
CN111858916B (en) Method and device for clustering sentences
CN110874343B (en) Method for processing voice based on deep learning chip and deep learning chip
CN114333769B (en) Speech recognition method, computer program product, computer device and storage medium
CN111783431A (en) Method and device for predicting word occurrence probability by using language model and training language model
JP2023517004A (en) Unsupervised Singing-to-Speech Conversion Using Pitch Adversarial Networks
Chunwijitra et al. Distributing and sharing resources for automatic speech recognition applications
CN113409802B (en) Method, device, equipment and storage medium for enhancing voice signal
CN112017662A (en) Control instruction determination method and device, electronic equipment and storage medium
CN112802458B (en) Wake-up method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination