CN114566156A - Keyword speech recognition method and device - Google Patents

Keyword speech recognition method and device

Info

Publication number
CN114566156A
Authority
CN
China
Prior art keywords
voice signal
keyword
voice
probability
target keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210191909.4A
Other languages
Chinese (zh)
Inventor
陈锦明
吴涛
李倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bestechnic Shanghai Co Ltd
Original Assignee
Bestechnic Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bestechnic Shanghai Co Ltd filed Critical Bestechnic Shanghai Co Ltd
Priority to CN202210191909.4A priority Critical patent/CN114566156A/en
Publication of CN114566156A publication Critical patent/CN114566156A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L2015/0631 Creating reference templates; Clustering
    • G10L2015/088 Word spotting

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a keyword speech recognition method and apparatus for improving the accuracy of keyword speech recognition and avoiding false wake-up. The method comprises the following steps: acquiring a speech signal of a certain duration and computing the speech recognition features of the speech signal; inputting the speech recognition features into a neural network model, and determining, through the neural network model, the probability that each of N classification labels of a target keyword is present in the speech signal, where N is a positive integer; determining, from those per-label probabilities, the probability that the N classification labels of the target keyword are present together in the speech signal; and, if the probability that the N classification labels are present together in the speech signal is greater than or equal to a set threshold, determining that the target keyword is present in the speech signal.

Description

Keyword speech recognition method and device
Technical Field
The present application relates to the field of speech recognition technologies, and in particular to a keyword speech recognition method and apparatus.
Background
With the development of technology, the application scenarios of intelligent speech recognition have become increasingly broad. Keyword wake-up is the first step of speech recognition; a more robust keyword recognition scheme improves the human-computer interaction experience and provides a basis for subsequent intelligent applications.
In the prior art, keyword recognition is generally performed by pattern recognition, as follows: a time window is selected, a short-time Fourier transform is applied to the sampled data within the window, the corresponding cepstral coefficients (obtained with a discrete cosine transform) are computed and input into a neural network as speech features for classification, and the probability that a target keyword is present is finally determined.
This scheme has the following problems. The fixed time window must be at least as long as the estimated duration of the longest keyword, so the position of speech within a window is not fixed. During multi-keyword training, if the duration gap between the longest and shortest keywords is too large, the pattern recognition task can hardly learn the complete pronunciation characteristics of each keyword, resulting in frequent false wake-ups. Moreover, when speech of several keywords is present in the same window, recognition of the keyword category is not accurate enough.
Disclosure of Invention
The application provides a keyword speech recognition method and apparatus for improving the accuracy of keyword speech recognition and avoiding false wake-up.
In a first aspect, an embodiment of the present application provides a keyword speech recognition method, comprising: acquiring a speech signal of a certain duration and computing the speech recognition features of the speech signal; inputting the speech recognition features into a neural network model, and determining, through the neural network model, the probability that each of N classification labels of a target keyword is present in the speech signal, where N is a positive integer; determining, from those per-label probabilities, the probability that the N classification labels of the target keyword are present together in the speech signal; and, if that joint probability is greater than or equal to a set threshold, determining that the target keyword is present in the speech signal.
Keyword recognition based on full speech recognition, as in the prior art, must train all possible pronunciations as classification outputs; the classification target is typically large and requires large model parameters, which is unsuitable for resource-constrained scenarios. The present scheme instead divides each target keyword into several classification labels and uses those labels as the classification targets, markedly improving keyword recognition accuracy and reducing false recognition without increasing model size. A classification label may be a phoneme, character, or word occurring in the target keyword. Because the position and occurrence pattern of speech within a short time window are stationary, reducing the detection granularity makes it easier for the neural network model to recognize the differences between classification labels, and hence to distinguish the different categories, i.e., the keywords.
The method performs particularly well when training on similar keywords. If keywords with many similar pronunciations are trained directly as whole-keyword classes, the model can hardly attend to all of a keyword's characteristics, so the training target diverges from the actual features, or background noise in the dataset is learned as keyword features, increasing the model's misrecognition.
In one possible design, the N classification labels of the target keyword are obtained by dividing the target keyword at phoneme, character, or word granularity.
In one possible design, the probability that the N classification labels of the target keyword are present together in the speech signal is associated with the probabilities that the speech recognition features correspond to each of the N classification labels.
In one possible design, computing the speech recognition features of the speech signal includes: framing the speech signal according to a set window length and step size; for each frame, determining the time-frequency features of the frame by applying a short-time Fourier transform to it; and determining filter-bank features and Mel-frequency cepstral coefficients from the time-frequency features.
In one possible design, the window length is greater than or equal to the longest pronunciation length of a classification label of the target keyword, and the step size is less than or equal to half the window length.
In one possible design, the method further includes: training the neural network model with a multi-label training method until the neural network model converges.
In a second aspect, embodiments of the present application provide a keyword speech recognition apparatus, which may include a module/unit for performing any one of the possible methods according to the first aspect. These modules/units may be implemented by hardware or by hardware executing corresponding software.
Illustratively, the apparatus may include a communication module and a processing module; wherein:
the communication module is used for acquiring a speech signal of a certain duration;
the processing module is used for computing the speech recognition features of the speech signal; inputting the speech recognition features into a neural network model, and determining, through the neural network model, the probability that each of N classification labels of a target keyword is present in the speech signal, where N is a positive integer; determining, from those per-label probabilities, the probability that the N classification labels are present together in the speech signal; and, if that joint probability is greater than or equal to a set threshold, determining that the target keyword is present in the speech signal.
In one possible design, the N classification labels of the target keyword are obtained by dividing the target keyword at phoneme, character, or word granularity.
In one possible design, the probability that the N classification labels of the target keyword are present together in the speech signal is associated with the probabilities that the speech recognition features correspond to each of the N classification labels.
In one possible design, the processing module is specifically configured to: frame the speech signal according to a set window length and step size; for each frame, determine the time-frequency features of the frame by applying a short-time Fourier transform to it; and determine filter-bank features and Mel-frequency cepstral coefficients from the time-frequency features.
In one possible design, the window length is greater than or equal to the longest pronunciation length of a classification label of the target keyword, and the step size is less than or equal to half the window length.
In one possible design, the processing module is further configured to: train the neural network model with a multi-label training method until the neural network model converges.
In a third aspect, an embodiment of the present application further provides a computing device, including:
a memory for storing program instructions;
a processor for calling the program instructions stored in said memory and for executing the method as described in the various possible designs of the first aspect according to the obtained program instructions.
In a fourth aspect, embodiments of the present application further provide a chip on which a multi-label-trained neural network model and its weights are deployed, the chip being configured to perform the method described in the various possible designs of the first aspect when it receives a speech signal of a certain duration.
In a fifth aspect, the present application further provides a computer-readable storage medium, in which computer-readable instructions are stored, and when the computer-readable instructions are read and executed by a computer, the computer-readable instructions cause the method described in any one of the possible designs of the first aspect to be implemented.
In a sixth aspect, the present application further provides a computer program product including computer readable instructions, which when executed by a processor, enable the method described in any one of the possible designs of the first aspect to be implemented.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic flowchart of a method for recognizing a keyword by using speech according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a neural network model according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a keyword speech recognition apparatus according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a computing device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Multiple keywords present many categories with high mutual similarity; classifying them directly places high demands on the model and yields low keyword recognition accuracy and speed, and recognition methods that rely on large models are unsuitable when edge devices are offline or computing resources are limited. To solve these problems, the embodiments of the present application provide a keyword speech recognition method.
The core of the method is to change a keyword's classification label from a one-hot code into a multi-label category and perform a multi-label classification task. In this task, the classification labels are obtained by dividing keywords at phoneme, character, or word granularity; the neural network model then predicts the probability that each classification label is present, and whether the target keyword is present in a segment of speech is judged from the probability that all of its classification labels are present simultaneously.
Unlike conventional keyword technology based on full speech recognition, the input here, a segment of speech whose length is related to the keyword length, is classified with a multi-label task; no frame-level alignment is needed, which saves labeling effort.
Fig. 1 schematically illustrates a flow chart of a keyword speech recognition method provided in an embodiment of the present application, where as shown in fig. 1, the method includes:
step 101, acquiring a speech signal with a certain duration, and calculating speech recognition characteristics of the speech signal.
Computing the speech recognition features of the speech signal may include: framing the speech signal according to a set window length and step size; then, for each frame, applying a short-time Fourier transform to determine the frame's time-frequency features, i.e., converting the time-domain samples into time-frequency features; and further determining filter-bank features (fbank) and Mel-frequency cepstral coefficients (mfcc) from the time-frequency features. Specifically, the frequency-domain coordinates can be converted to logarithmic coordinates matched to human hearing to obtain the fbank features, and an inverse Fourier transform of the log spectrum yields the cepstrum, giving the mfcc features. The fbank and mfcc features can be input into the neural network model as the speech recognition features in a subsequent step.
Framing means dividing a segment of speech into short segments according to the short-time stationarity of speech: within one short segment the speech production pattern is similar and approximately stationary. Each short segment is called a frame. In a specific implementation, signals of adjacent frames may overlap to some degree to preserve the continuity of speech.
The window length is the duration of each frame when the speech signal is divided; the step size is the fixed stride by which the window slides along the time axis to produce successive frames. In the present application, the window length is greater than or equal to the longest pronunciation length of a classification label of the target keyword, and the step size is less than or equal to half the window length. For example, the frame length (i.e., the window length) can be 20-40 ms and the step size 7-16 ms; both can be chosen flexibly for different tasks.
Illustratively, the feature extraction process may include the following. First, the speech signal is windowed and divided into frames, consecutive frames following each other in time. A fast Fourier transform (FFT) is applied to the signal within each frame to obtain the per-frame power spectrum. The power spectrum is then processed with a Mel-scale filter bank. After the result is transformed into the log domain, a discrete cosine transform is applied to compute the MFCC coefficients.
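As a hedged illustration rather than the patented implementation, the framing, FFT, Mel filter-bank, and MFCC steps can be sketched with librosa; the 32 ms window, 16 ms step, and filter counts below are assumed values chosen from the ranges given above.

```python
# Minimal feature-extraction sketch (assumed parameters, not the patented
# implementation): frame the signal, take the per-frame power spectrum,
# apply a Mel filter bank, and derive MFCCs from the log-Mel spectrum.
import numpy as np
import librosa

def extract_features(wav_path, sr=16000, win_ms=32, step_ms=16,
                     n_mels=40, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(sr * win_ms / 1000)   # window >= longest label pronunciation
    hop = int(sr * step_ms / 1000)    # step <= half the window length
    # Short-time Fourier transform -> per-frame power spectrum
    power = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2
    # Mel filter bank, then log domain -> "fbank" features
    fbank = librosa.power_to_db(
        librosa.feature.melspectrogram(S=power, sr=sr, n_mels=n_mels))
    # Discrete cosine transform of the log-Mel spectrum -> MFCC features
    mfcc = librosa.feature.mfcc(S=fbank, n_mfcc=n_mfcc)
    return fbank, mfcc   # shapes: (n_mels, frames), (n_mfcc, frames)
```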
The formula for converting any frequency to mels is:
mel(f) = 2595 × log10(1 + f/700)
where mel(f) is the frequency in mels and f is the frequency in Hz.
The formula for the MFCCs is:
C(n) = Σ_{k=1}^{K} log(S(k)) × cos[n × (k - 0.5) × π / K], n = 1, 2, ...
where K is the number of Mel filter-bank channels, S(k) is the output of the k-th filter-bank channel, and C(n) is the n-th mfcc coefficient.
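Both formulas translate directly into code; a small NumPy sketch follows, with the coefficient count as an illustrative value. Note that 1000 Hz maps to roughly 1000 mel by construction of the first formula.

```python
# Numeric sketch of the two formulas above (pure NumPy, illustrative).
import numpy as np

def hz_to_mel(f):
    # mel(f) = 2595 * log10(1 + f/700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mfcc_from_fbank(log_S, n_coeffs=13):
    # C(n) = sum_{k=1..K} log(S(k)) * cos[n * (k - 0.5) * pi / K]
    K = len(log_S)                    # number of filter-bank channels
    k = np.arange(1, K + 1)
    return np.array([np.sum(log_S * np.cos(n * (k - 0.5) * np.pi / K))
                     for n in range(1, n_coeffs + 1)])

print(hz_to_mel(1000.0))              # ~1000.0 mel
```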
The input to the Fourier transform is the time-domain sample points; its function is to convert time-domain features into time-frequency features. After the Fourier transform, the time-frequency features can be converted into fbank and mfcc features according to human pronunciation and hearing rules; this conversion follows the nonlinear response of human hearing. The frequency-domain coordinate is the number of frequency bins after the Fourier transform. Concatenating the frequency bins obtained from consecutive frames of the speech signal yields a feature matrix of frequency features (y) over the time sequence (x).
Step 102: input the speech recognition features into a neural network model, and determine, through the neural network model, the probability that each of N classification labels of the target keyword is present in the speech signal, where N is a positive integer.
In the present application, one target keyword can have multiple classification labels, and the number N of labels is determined by the classification granularity. The N classification labels may be obtained by dividing the target keyword at phoneme, character, or word granularity; that is, a classification label may be a phoneme, a character, or a word. For example, the keyword "Xiao Ai Tong Xue" (小爱同学) is one category; it can be subdivided by Chinese character into several classification labels, each label being one classification target of the neural network model and a subclass of the category. The four characters 小 ("small"), 爱 ("love"), 同 ("same"), 学 ("learn") of "Xiao Ai Tong Xue" are four classification labels.
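At character granularity the label split is direct; a hypothetical sketch, with the keyword list as an illustrative assumption:

```python
# Hypothetical sketch: derive character-granularity classification labels
# from a keyword list (the list itself is illustrative).
keywords = ["小爱同学"]
labels = sorted({ch for kw in keywords for ch in kw})
print(list("小爱同学"))  # ['小', '爱', '同', '学'] -> 4 classification labels
print(labels)            # label set used as the model's classification targets
```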
Fig. 2 illustrates the structure of the neural network model used in the present application. As shown in fig. 2, the model is a convolutional neural network in which stacked convolution layers are paired with fully connected layers; the blocks from left to right in fig. 2 represent, respectively, the input layer, four convolutional layers (CONV), two fully connected layers (FC), and the output layer of the model. Optionally, a BN (batch normalization) operation may be inserted between convolutional layers, i.e., the values are normalized by their mean and variance.
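A hedged PyTorch sketch of this topology is given below. The channel counts, kernel sizes, pooling, and input feature shape are assumptions, since fig. 2 does not fix them; the sigmoid output corresponds to the multi-label activation option described next.

```python
# Hedged sketch of the fig. 2 topology: input layer, 4 CONV layers with
# BN in between, 2 FC layers, and one output unit per classification
# label. Channel counts, kernel sizes, and input shape are assumptions.
import torch
import torch.nn as nn

class KeywordNet(nn.Module):
    def __init__(self, n_labels, n_feats=40, n_frames=100):
        super().__init__()
        chans = [1, 16, 32, 32, 64]          # assumed channel progression
        convs = []
        for i in range(4):                   # 4 CONV layers (fig. 2)
            convs += [nn.Conv2d(chans[i], chans[i + 1], 3, padding=1),
                      nn.BatchNorm2d(chans[i + 1]),   # optional BN step
                      nn.ReLU(),
                      nn.MaxPool2d(2)]
        self.convs = nn.Sequential(*convs)
        flat = chans[-1] * (n_feats // 16) * (n_frames // 16)
        self.fc = nn.Sequential(nn.Linear(flat, 128), nn.ReLU(),  # 2 FC layers
                                nn.Linear(128, n_labels))

    def forward(self, x):                    # x: (batch, 1, n_feats, n_frames)
        logits = self.fc(self.convs(x).flatten(1))
        return torch.sigmoid(logits)         # per-label presence probability
```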
The number of units in the output layer equals the total number of classification labels of the target keywords, and the classification activation function used at the output layer is a softmax or sigmoid function. The neural network extracts the features required for classification; the classification activation function performs a nonlinear mapping on them, and the final activation yields the classification result, i.e., the output of the neural network model passes directly into the activation function to obtain the mapping.
The neural network model can be trained with a multi-label training method until the neural network model converges. During training, a cross-entropy loss or a mean-squared-error loss can be used as the model training loss function; the output layer of the neural network model uses a softmax or sigmoid function as the classification activation function.
Illustratively, the training process of the neural network model may include the following. First, a corpus and the corresponding annotation labels are prepared, a label being the ground-truth result obtained by manually listening to the speech content. The time-sequence frequency feature matrix is obtained with the feature extraction method described above, and the corpus and corresponding labels are used to train the parameters of the neural network model. After multiple iterations, the trained weights predict the most likely annotation label on the annotated data; the trained weights can then be applied to real speech to obtain the probability that each label is present.
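A hedged sketch of this multi-label training loop follows; binary cross-entropy is one instance of the cross-entropy loss mentioned above, and the optimizer, learning rate, epoch count, and data loader interface are illustrative assumptions.

```python
# Hedged multi-label training sketch. BCELoss is one instance of the
# cross-entropy loss the description allows; optimizer settings and the
# data loader interface are illustrative assumptions.
import torch
import torch.nn as nn

def train(model, loader, epochs=20, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()               # model already outputs sigmoid probs
    model.train()
    for _ in range(epochs):
        for feats, labels in loader:     # labels: (batch, n_labels) multi-hot
            opt.zero_grad()
            probs = model(feats)         # per-label presence probabilities
            loss = loss_fn(probs, labels.float())
            loss.backward()
            opt.step()
    return model
```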
It should be noted that the neural network model in the present application is lightweight, so it is applicable not only to general keyword speech recognition scenarios but also to keyword recognition scenarios in which an edge terminal is offline and cannot recognize through a cloud, or in which the edge terminal's computing power is limited.
The neural network model has roughly 100k parameters, and with the model structure unchanged, the method achieves a better recognition effect on keywords with high similarity. Intuitively, in scenarios with many keywords, every keyword must be classified; reducing the classification granularity can make the number of final classes smaller than the number of keywords, which lowers the classification difficulty. For keywords with similar parts, multi-label training attends to the differing and shared parts of two keywords better than single-label training does, so model training converges faster.
Step 103: determine the probability that the N classification labels of the target keyword are present together in the speech signal from the probabilities that the N classification labels are each present in the speech signal.
In the present application, a threshold can be set for each of the N classification labels of the target keyword. For a given classification label, whether the label is present in the speech signal is judged from the presence probability output by the neural network model and the corresponding threshold. If all N classification labels of the target keyword are present in the speech signal, the probability that they are present together, i.e., the joint probability of their simultaneous presence, is further computed. The joint probability refers to the probability that the different classification labels are predicted in different frames of the speech signal; that is, the final probability that the classification labels combine into a word is the combination of the individual label probabilities.
The probability that the N classification labels of the target keyword are present together in the speech signal is associated with the probabilities that the speech recognition features correspond to each of the N classification labels. Specifically, the joint probability equals the product of the probabilities that the speech recognition features correspond to the N classification labels of the target keyword.
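This decision rule, together with the threshold test of step 104 below, reduces to a few lines; in the sketch that follows, the per-label threshold and the joint threshold are assumed values.

```python
# Sketch of the steps 103-104 decision rule: gate each label by its own
# threshold, then compare the product of the per-label probabilities
# (the joint probability) with a set threshold. Thresholds are assumed.
import numpy as np

def keyword_present(probs, label_idx, per_label_thr=0.5, joint_thr=0.1):
    """probs: per-label probabilities output by the model;
    label_idx: indices of the target keyword's N classification labels."""
    p = np.asarray(probs)[list(label_idx)]
    if np.any(p < per_label_thr):            # some label judged absent
        return False
    return float(np.prod(p)) >= joint_thr    # joint-probability test

print(keyword_present([0.9, 0.8, 0.95, 0.7], label_idx=[0, 1, 2, 3]))  # True
```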
Step 104: if the probability that the N classification labels of the target keyword are present together in the speech signal is greater than or equal to a set threshold, determine that the target keyword is present in the speech signal.
Here the set threshold is the threshold corresponding to the probability that the N classification labels are present together in the speech signal.
Optionally, whether the N classification labels can be combined into a word may be judged from the probability that they are present together, and whether that word is a keyword may then be determined.
It can be seen that the conventional classification method directly takes keywords as the classification targets, one keyword being one category. The multi-label classification in the present application further divides each keyword (i.e., category) into finer-grained classification labels, the subclasses of the category, so that each category is a combination of different subclasses. The prediction results of the neural network model are compared against the subclass composition of each keyword, and the joint probability of the subclasses serves as the probability of keyword recognition. The subclasses may be phonemes, characters, words, and the like.
In summary, the overall flow of the application mainly comprises two parts: training the neural network model, and recognizing keywords with the trained model. Recognizing keywords with the trained neural network model includes: obtaining a segment of speech and computing its speech features; inputting those features into the neural network model, which outputs the probability that the features correspond to each classification label of the target keyword; determining from each label's probability and the set threshold whether the corresponding classification label is present; and determining from the kinds and number of labels present whether the target keyword is present in the segment of speech.
The technical solution in the application can also have the following technical effects:
1) The misrecognition rate of keyword recognition is reduced. The method adopts a multi-target classification task, so the classification target is no longer the target keyword itself; instead, phonemes, characters, words, and the like serve as the classification targets. More concretely, keyword classification is simplified into different tokens (a single phoneme, a single English word, or a single Chinese character), and each keyword is a combination of several labels. The individual labels being predicted have similar pronunciation lengths, and whether the target keyword is present is judged from the probability that several labels are present simultaneously. This reduces the influence of length differences between keywords on the model's recognition effect, lets several classification targets jointly determine the probability that the target keyword is present, and lowers the misrecognition rate of the target keyword.
2) The method recognizes similar keywords with high accuracy. It has the effect of fine-grained classification, and because the pronunciation of a single word, phoneme, Chinese character, or English letter is usually fixed, it is easy for the neural network model to learn the features that distinguish the categories. If highly similar keywords are instead trained directly as whole-keyword classes, the model easily focuses on information from the speech background, and the learned classification features are not features of the keywords, reducing recognition accuracy.
Based on the same inventive concept, the present application further provides a keyword speech recognition apparatus for implementing the method in the foregoing method embodiments.
As shown in fig. 3, the apparatus 300 includes: a communication module 310 and a processing module 320.
A communication module 310, configured to acquire a speech signal of a certain duration;
A processing module 320, configured to compute the speech recognition features of the speech signal; input the speech recognition features into a neural network model, and determine, through the neural network model, the probability that each of N classification labels of a target keyword is present in the speech signal, where N is a positive integer; determine, from those per-label probabilities, the probability that the N classification labels are present together in the speech signal; and, if that joint probability is greater than or equal to a set threshold, determine that the target keyword is present in the speech signal.
In one possible design, the N classification labels of the target keyword are obtained by dividing the target keyword at phoneme, character, or word granularity.
In one possible design, the probability that the N classification labels of the target keyword are present together in the speech signal is associated with the probabilities that the speech recognition features correspond to each of the N classification labels.
In one possible design, the processing module 320 is specifically configured to: frame the speech signal according to a set window length and step size; for each frame, determine the time-frequency features of the frame by applying a short-time Fourier transform to it; and determine filter-bank features and Mel-frequency cepstral coefficients from the time-frequency features.
In one possible design, the window length is greater than or equal to the longest pronunciation length of a classification label of the target keyword, and the step size is less than or equal to half the window length.
In one possible design, the processing module 320 is further configured to: train the neural network model with a multi-label training method until the neural network model converges.
Embodiments of the present application also provide a chip on which a multi-label-trained neural network model and its weights are deployed, the chip being configured to execute the keyword speech recognition method described above when it receives a speech signal of a certain duration.
Based on the same technical concept, the embodiment of the present application further provides a computing device, as shown in fig. 4, including at least one processor 401 and a memory 402 connected to the at least one processor, where a specific connection medium between the processor 401 and the memory 402 is not limited in the embodiment of the present application, and the processor 401 and the memory 402 are connected through a bus in fig. 4 as an example. The bus may be divided into an address bus, a data bus, a control bus, etc.
In this embodiment, the memory 402 stores instructions executable by the at least one processor 401, and the at least one processor 401 can implement the steps of the keyword speech recognition method described above by executing the instructions stored in the memory 402.
The processor 401 is a control center of the computer device, and may connect various parts of the computer device by using various interfaces and lines, and perform resource setting by executing or executing instructions stored in the memory 402 and calling data stored in the memory 402. Optionally, the processor 401 may include one or more processing units, and the processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly handles an operating system, a user interface, an application program, and the like, and the modem processor mainly handles wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processor 401. In some embodiments, processor 401 and memory 402 may be implemented on the same chip, or in some embodiments, they may be implemented separately on separate chips.
The processor 401 may be a general-purpose processor, such as a Central Processing Unit (CPU), a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present Application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in a processor.
The memory 402, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 402 may include at least one type of storage medium, for example a flash memory, a hard disk, a multimedia card, a card-type memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a magnetic memory, a magnetic disk, an optical disc, and so on. The memory 402 may also be any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 402 in the embodiments of the present application may also be a circuit or any other device capable of performing a storage function, for storing program instructions and/or data.
Based on the same technical concept, embodiments of the present application further provide a computer-readable storage medium, where computer-readable instructions are stored, and when the computer reads and executes the computer-readable instructions, the method in the foregoing method embodiments is implemented.
Based on the same technical concept, the embodiment of the present application further provides a computer program product, which includes computer readable instructions, and when the computer readable instructions are executed by a processor, the method in the above method embodiment is implemented.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A keyword speech recognition method, the method comprising:
acquiring a speech signal of a certain duration, and computing the speech recognition features of the speech signal;
inputting the speech recognition features into a neural network model, and determining, through the neural network model, the probability that each of N classification labels of a target keyword is present in the speech signal, wherein N is a positive integer;
determining the probability that the N classification labels of the target keyword are present together in the speech signal according to the probabilities that the N classification labels are each present in the speech signal;
and if the probability that the N classification labels of the target keyword are present together in the speech signal is greater than or equal to a set threshold, determining that the target keyword is present in the speech signal.
2. The method of claim 1, wherein the N classification labels of the target keyword are obtained by dividing the target keyword at phoneme, character, or word granularity.
3. The method of claim 1, wherein the probabilities of the N class labels of the target keyword co-existing in the speech signal are associated with the probabilities of the speech recognition features corresponding to the N class labels of the target keyword, respectively.
4. The method of claim 1, wherein computing the speech recognition features of the speech signal comprises:
framing the speech signal according to a set window length and step size;
for each frame of the speech signal, determining the time-frequency features of the frame by applying a short-time Fourier transform to it;
and determining filter-bank features and Mel-frequency cepstral coefficients from the time-frequency features.
5. The method of claim 4, wherein the window length is greater than or equal to the longest pronunciation length of a classification label of the target keyword, and the step size is less than or equal to half the window length.
6. The method according to any one of claims 1 to 5, further comprising:
and training the neural network model by adopting a multi-label training method until the neural network model converges.
7. An apparatus for speech recognition of a keyword, the apparatus comprising:
the communication module is used for acquiring a speech signal of a certain duration;
the processing module is used for computing the speech recognition features of the speech signal; inputting the speech recognition features into a neural network model, and determining, through the neural network model, the probability that each of N classification labels of a target keyword is present in the speech signal, wherein N is a positive integer; determining the probability that the N classification labels are present together in the speech signal according to the probabilities that they are each present; and if the probability that the N classification labels of the target keyword are present together in the speech signal is greater than or equal to a set threshold, determining that the target keyword is present in the speech signal.
8. A chip, characterized in that it is deployed with a multi-label trained neural network model and weights, and is configured to:
the method of speech recognition of a keyword according to any one of claims 1 to 5 is performed when the chip receives a speech signal for a period of time.
9. A computing device, comprising:
a memory for storing program instructions;
a processor for calling program instructions stored in said memory and for executing the method of any one of claims 1 to 6 in accordance with the obtained program instructions.
10. A computer-readable storage medium comprising computer-readable instructions which, when read and executed by a computer, cause the method of any one of claims 1 to 6 to be carried out.
CN202210191909.4A 2022-02-28 2022-02-28 Keyword speech recognition method and device Pending CN114566156A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210191909.4A CN114566156A (en) 2022-02-28 2022-02-28 Keyword speech recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210191909.4A CN114566156A (en) 2022-02-28 2022-02-28 Keyword speech recognition method and device

Publications (1)

Publication Number Publication Date
CN114566156A true CN114566156A (en) 2022-05-31

Family

ID=81716159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210191909.4A Pending CN114566156A (en) 2022-02-28 2022-02-28 Keyword speech recognition method and device

Country Status (1)

Country Link
CN (1) CN114566156A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114863915A (en) * 2022-07-05 2022-08-05 中科南京智能技术研究院 Voice awakening method and system based on semantic preservation


Similar Documents

Publication Publication Date Title
CN107680582B (en) Acoustic model training method, voice recognition method, device, equipment and medium
Zeng et al. Effective combination of DenseNet and BiLSTM for keyword spotting
CN111276131B (en) Multi-class acoustic feature integration method and system based on deep neural network
US20190266998A1 (en) Speech recognition method and device, computer device and storage medium
US9728183B2 (en) System and method for combining frame and segment level processing, via temporal pooling, for phonetic classification
CN107731233B (en) Voiceprint recognition method based on RNN
CN110706690A (en) Speech recognition method and device
CN111429946A (en) Voice emotion recognition method, device, medium and electronic equipment
CN107093422B (en) Voice recognition method and voice recognition system
CN111462756B (en) Voiceprint recognition method and device, electronic equipment and storage medium
CN106875936A (en) Audio recognition method and device
CN111798840A (en) Voice keyword recognition method and device
KR102655791B1 (en) Speaker authentication method, learning method for speaker authentication and devices thereof
CN112017648A (en) Weighted finite state converter construction method, speech recognition method and device
US20230031733A1 (en) Method for training a speech recognition model and method for speech recognition
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN113327575B (en) Speech synthesis method, device, computer equipment and storage medium
CN111128174A (en) Voice information processing method, device, equipment and medium
CN114566156A (en) Keyword speech recognition method and device
CN112542173A (en) Voice interaction method, device, equipment and medium
CN114999463B (en) Voice recognition method, device, equipment and medium
CN111048068A (en) Voice wake-up method, device and system and electronic equipment
CN115132170A (en) Language classification method and device and computer readable storage medium
CN112037772B (en) Response obligation detection method, system and device based on multiple modes
CN111883109B (en) Voice information processing and verification model training method, device, equipment and medium

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination