CN114783438B - Adaptive decoding method, apparatus, computer device and storage medium - Google Patents


Info

Publication number
CN114783438B
CN114783438B
Authority
CN
China
Prior art keywords
decoding matrix
decoding
sliding window
probability value
preset threshold
Prior art date
Legal status
Active
Application number
CN202210687945.XA
Other languages
Chinese (zh)
Other versions
CN114783438A (en)
Inventor
李�杰
王广新
杨汉丹
Current Assignee
Shenzhen Youjie Zhixin Technology Co ltd
Original Assignee
Shenzhen Youjie Zhixin Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Youjie Zhixin Technology Co ltd
Priority to CN202210687945.XA
Publication of CN114783438A
Application granted
Publication of CN114783438B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques using predictive techniques
    • G10L19/16 Vocoder architecture
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Abstract

The invention provides an adaptive decoding method, an adaptive decoding apparatus, computer equipment, and a storage medium. The adaptive decoding method comprises the following steps: acquiring voice data and preprocessing the voice data to obtain a decoding matrix; and performing sliding-window processing on the decoding matrix according to an adaptive step-size strategy, and outputting a decoding result when the confidence of the keyword appearing in the sliding window meets a preset threshold. Because the sliding-window processing follows an adaptive step-size strategy, the step size of the sliding window changes on every slide, which avoids the problem of the sliding windows overlapping too much or too little, reduces the number of decoding passes and the processing time, and reduces the possibility of losing voice information.

Description

Adaptive decoding method, apparatus, computer device and storage medium
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to an adaptive decoding method, apparatus, computer device, and storage medium.
Background
Wake-up word and command word recognition belongs to the field of speech recognition and is widely applied in scenarios such as smart homes and intelligent terminals. When existing wake-up word and command word models are deployed, the user's voice is detected in real time, and feedback is given when a specific vocabulary is detected. Accurately detecting a specific vocabulary in streaming voice data depends not only on the performance of the model but also, critically, on the streaming-processing algorithm. For example, a common streaming approach selects a fixed window length and then slides the window by a fixed step size. With this fixed-step sliding approach, if the sliding windows overlap too much, the number of decoding passes increases and the processing time grows; if they overlap too little, voice information may be lost.
Disclosure of Invention
The main object of the present invention is to provide an adaptive decoding method, an adaptive decoding apparatus, computer equipment, and a storage medium, which can solve the technical problem in the prior art that sliding windows overlap too much or too little when sliding is performed with a fixed step size.
The invention provides an adaptive decoding method, which comprises the following steps:
acquiring voice data, and preprocessing the voice data to obtain a decoding matrix;
and performing sliding-window processing on the decoding matrix according to an adaptive step-size strategy, and outputting a decoding result when the confidence of the keyword appearing in the sliding window meets a preset threshold.
Further, the step of acquiring the voice data and preprocessing the voice data to obtain a decoding matrix includes:
inputting the voice data into a neural network model to obtain an initial decoding matrix;
obtaining a probability value corresponding to a blank label in the initial decoding matrix;
and, if the probability value is greater than a preset threshold, removing the feature data frame corresponding to the probability value to obtain a simplified decoding matrix.
Further, after the step of removing the feature data frame corresponding to the probability value to obtain a simplified decoding matrix if the probability value is greater than a preset threshold, the method includes:
caching the decoding matrix in a cache area;
and, when the caching time of the cache area meets a time threshold, copying the decoding matrix in the cache area to a processing area for sliding-window processing.
Further, before the step of performing sliding-window processing on the decoding matrix according to the adaptive step-size strategy, the method includes:
if the length of the decoding matrix is smaller than a preset processing length, receiving the next decoding matrix and splicing the two decoding matrices;
and, if the length of the spliced decoding matrix is greater than the preset processing length, performing sliding-window processing.
Further, the step of performing sliding-window processing on the decoding matrix according to the adaptive step-size strategy and outputting a decoding result when the confidence of the keyword appearing in the sliding window meets a preset threshold includes:
searching the decoding matrix with the sliding window within a preset step-size range, and judging whether one of the following conditions is met: condition one, whether a probability value corresponding to the first phoneme of the keyword within the preset step-size range is greater than a preset threshold; condition two, whether the maximum probability value in any column of the decoding matrix within the preset step-size range belongs to the first phoneme of the keyword;
if condition one is met, sliding the window to the first position where the probability value is greater than the preset threshold, judging whether the confidence of the keyword appearing in the window now meets the preset threshold, and outputting a decoding result if it does; if condition two is met, sliding the window to the first position where the column maximum belongs to the first phoneme of the keyword, judging whether the confidence of the keyword appearing in the window meets the preset threshold, and outputting a decoding result if it does.
Further, after the step of performing sliding-window processing on the decoding matrix according to the adaptive step-size strategy and outputting a decoding result when the confidence of the keyword appearing in the sliding window meets a preset threshold, the method includes:
if the confidence of the keyword appearing in the sliding window does not meet the preset threshold, judging whether a probability value of the first phoneme of the keyword is obtained within a preset length range at the tail end of the decoding matrix;
if not, receiving the next decoding matrix segment;
if so, judging whether one of the following conditions is met: condition one, whether the probability value is greater than a preset threshold; condition two, whether the probability value is the maximum value of the decoding matrix within the preset length range;
if neither is met, receiving the next decoding matrix segment;
if condition one is met, intercepting the decoding matrix from the first position where the probability value is greater than the preset threshold to the tail end as a first decoding matrix, and splicing it with the next decoding matrix segment to be processed to obtain a second decoding matrix; if condition two is met, intercepting the decoding matrix from the position where the probability value is the maximum to the tail end as the first decoding matrix, and splicing it with the next decoding matrix segment to be processed to obtain the second decoding matrix.
The present invention also provides an adaptive decoding apparatus, comprising:
an acquisition module, configured to acquire voice data and preprocess the voice data to obtain a decoding matrix;
and a processing module, configured to perform sliding-window processing on the decoding matrix according to an adaptive step-size strategy and output a decoding result when the confidence of the keyword appearing in the sliding window meets a preset threshold.
Further, the acquisition module includes:
a first acquisition submodule, configured to input the voice data into a neural network model to obtain an initial decoding matrix;
a second acquisition submodule, configured to obtain a probability value corresponding to a blank label in the initial decoding matrix;
and a simplification submodule, configured to remove, if the probability value is greater than a preset threshold, the feature data frame corresponding to the probability value to obtain a simplified decoding matrix.
The invention also provides a computer device comprising a memory and a processor, wherein the memory stores a computer program and the processor implements the steps of any of the above methods when executing the computer program.
The invention also provides a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of any of the methods described above.
Compared with the prior art, the present invention provides an adaptive decoding method, apparatus, computer device, and storage medium in which the decoding matrix undergoes sliding-window processing according to an adaptive step-size strategy, so that the step size of the sliding window changes on every slide. This avoids the problem of the sliding windows overlapping too much or too little, reduces the number of decoding passes and the processing time, and reduces the possibility of losing voice information.
Drawings
FIG. 1 is a schematic diagram illustrating steps of an adaptive decoding method according to an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating steps of an adaptive decoding method according to another embodiment of the present application;
FIG. 3 is a block diagram of an adaptive decoding apparatus according to an embodiment of the present application;
FIG. 4 is a block diagram illustrating a computer device according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 1, an adaptive decoding method according to an embodiment of the present invention includes:
S1, acquiring voice data, and preprocessing the voice data to obtain a decoding matrix;
S2, performing sliding-window processing on the decoding matrix according to an adaptive step-size strategy, and outputting a decoding result when the confidence of the keyword appearing in the sliding window meets a preset threshold.
In step S1, the voice data may be voice data input by a user to the terminal device through a microphone; or voice data downloaded from the Internet by the user; or voice data in an audio/video file stored locally on the terminal device. The source of the voice data is not limited in this application.
In step S2, the length of the decoding matrix output after neural network prediction is generally fixed, and the length changes after preprocessing. For example, for 1.6 s of speech data (10 ms per frame, 160 frames), the length of the decoding matrix is 40 (the output of a neural network with stride = 4); after the decoding matrix is preprocessed, however, the length is variable, for example between 0 and 40. A common streaming process selects a fixed window length and then slides by a fixed step size. Such processing has the following disadvantages. First, a fixed window length defines the range of detection: for example, if the window length is set to 2 s, the wake-up word or command word must be spoken within 2 s to be detected accurately, so a user who speaks slowly and takes more than 2 s is easily missed, which degrades the user experience. Second, the step size must be chosen carefully: if it is too large, detections are easily missed; if it is too small, the processing frequency increases, the latency grows, real-time feedback is delayed, and the user experience suffers. Third, the input windows overlap: for example, with a window length of 3 s and stride = 1 s, the first pass processes 1-3 s and the second pass processes 2-4 s, so the first and second windows overlap rather than being disjoint. The unit processing time is the sum of the model inference time and the post-processing time, and model inference is generally the main part of the total running time of the algorithm, which greatly limits the selectable range of the step size: a small step size causes serious latency, while a large one greatly increases missed recognitions.
Performing sliding-window processing on the decoding matrix with an adaptive step-size strategy means that, during sliding-window processing, the next step size is set according to the search result within a preset step-size range, so the sliding window takes different step sizes for different search results, and the decoding result is output once the confidence of the keyword appearing in the sliding window meets a preset threshold. Specifically, the search inside the sliding window is performed by an f-function, and the confidence of the keyword may be obtained with a greedy search, beam search, or CTC search algorithm. For example, in score = f(window), the input of the f-function is a window, and the returned result indicates whether the confidence of the keyword meets the preset threshold.
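As an illustrative sketch of such an f-function (the patent names greedy search, beam search, and CTC search as options; this shows only a greedy variant, and the blank index, keyword phoneme ids, and the mean-probability confidence score are assumptions made here, not details from the patent):

```python
import numpy as np

BLANK = 0              # assumed index of the CTC blank label
KEYWORD = [1, 2, 3]    # hypothetical phoneme-id sequence of the keyword

def window_confidence(window: np.ndarray) -> float:
    """Greedy-search confidence that the keyword appears in this window.

    window: (frames, labels) matrix of per-frame posterior probabilities.
    """
    best = window.argmax(axis=1)
    # Standard CTC greedy decode: collapse repeated labels, then drop blanks.
    collapsed = [best[0]] + [b for a, b in zip(best, best[1:]) if b != a]
    decoded = [label for label in collapsed if label != BLANK]
    if decoded != KEYWORD:
        return 0.0
    # Confidence proxy: mean of the winning per-frame probabilities.
    return float(window.max(axis=1).mean())
```

Here window_confidence plays the role of score = f(window) above, and the caller compares the returned score against the preset threshold.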
Further, the step of acquiring the voice data and preprocessing the voice data to obtain a decoding matrix includes:
S11, inputting the voice data into a neural network model to obtain an initial decoding matrix;
S12, obtaining a probability value corresponding to the blank label in the initial decoding matrix;
S13, if the probability value is greater than a preset threshold, removing the feature data frame corresponding to the probability value to obtain a simplified decoding matrix.
In this embodiment, after the neural network model outputs the initial decoding matrix, the recognition system simplifies the initial decoding matrix according to a preset rule to reduce the amount of data output by the model. Specifically, the recognition system obtains the probability values at the positions corresponding to all blank labels in the initial decoding matrix, retrieves a preset probability threshold, and compares the probability value at each blank-label position with the preset probability threshold. If the probability value at a blank-label position is greater than the probability threshold, the feature data frame at that position is an invalid frame, and the feature data frames corresponding to blank labels whose probability values exceed the probability threshold are removed from the initial decoding matrix, yielding the simplified initial decoding matrix. The recognition system applies this simplification to every initial decoding matrix output by the speech recognition model, and then splices the initial decoding matrix simplified at the current moment with a preset number of initial decoding matrices simplified at previous moments to obtain a secondary decoding matrix (that is, the secondary decoding matrix is spliced from simplified initial decoding matrices, so the data volume is greatly reduced).
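The blank-frame removal described above can be sketched as follows (a minimal illustration; the blank-label index, the threshold value, and the one-row-per-frame matrix layout are conventions assumed here):

```python
import numpy as np

BLANK = 0  # assumed index of the blank label in the label set

def prune_blank_frames(matrix: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Drop frames whose blank-label probability exceeds the threshold.

    matrix: (frames, labels) per-frame posteriors of the initial decoding
    matrix. Frames dominated by the blank label carry no keyword evidence,
    so removing them yields the simplified decoding matrix.
    """
    keep = matrix[:, BLANK] <= threshold
    return matrix[keep]

# Example: 4 frames over 3 labels (blank, phoneme a, phoneme b).
m = np.array([
    [0.95, 0.03, 0.02],   # mostly blank: removed
    [0.10, 0.80, 0.10],   # informative: kept
    [0.92, 0.05, 0.03],   # mostly blank: removed
    [0.20, 0.10, 0.70],   # informative: kept
])
print(prune_blank_frames(m).shape[0])  # prints 2
```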
Further, after the step of removing the feature data frame corresponding to the probability value to obtain a simplified decoding matrix if the probability value is greater than a preset threshold, the method includes:
S111, caching the decoding matrix in a cache area;
S121, when the caching time of the cache area meets a time threshold, copying the decoding matrix in the cache area to a processing area for sliding-window processing.
When processing an audio stream, a streaming speech recognition model supports returning recognition results in real time, whereas a non-streaming speech recognition model must process the whole sentence before returning a result. In the above steps, processing voice data by real-time streaming decoding, such as caching discontinuous voice segments, allows a speech recognition result to be obtained from less information. Real-time streaming speech decoding can reduce the time required to output a recognition result by keeping history information in a cache or by restricting itself to a small amount of local history. Specifically, the preset time threshold in this embodiment is 100 milliseconds (ms): from the moment voice data starts to be output, a decoding matrix is intercepted every 100 ms, and after one decoding matrix is intercepted, or several decoding matrices are received within 100 ms, the intercepted decoding matrix or the spliced decoding matrix is copied to the processing area for sliding-window processing. The preset time threshold therefore reflects the speech recognition delay of the decoding matrix: with a 100 ms threshold, speech recognition is performed every 100 ms, so the user perceives the recognized text as lagging the actual speech by only 100 ms, and recognition does not wait until the user finishes the whole sentence.
Different speech recognition application scenarios place different requirements on the time threshold of the recognition process; therefore, in practical applications, the time threshold corresponding to the scenario currently triggering the speech recognition requirement may be determined and used to segment or splice the voice data stream into blocks. A streaming voice processing model suitable for different time thresholds can thus be trained in advance, making it applicable to different speech recognition scenarios. Briefly, during training, several time thresholds such as 100 ms, 150 ms, and 300 ms may be preset. During iterative training of the streaming voice processing model, the time threshold used in each iteration can be chosen arbitrarily from the preset thresholds; that is, the voice data blocks serving as training samples are cut to the corresponding duration according to the selected threshold, and the resulting blocks are fed into the streaming voice processing model to train it.
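The cache-area behaviour with a 100 ms time threshold might be sketched like this (the buffer class and its fields are invented for illustration; only the threshold semantics come from the embodiment above):

```python
class StreamingBuffer:
    """Cache area that flushes to the processing area once the cached
    audio duration reaches the time threshold (100 ms in the embodiment)."""

    def __init__(self, time_threshold_ms: int = 100):
        self.time_threshold_ms = time_threshold_ms
        self.cache = []      # decoding-matrix segments waiting in the cache
        self.cached_ms = 0   # audio duration currently cached

    def push(self, segment, duration_ms):
        """Cache a segment; return the spliced cache contents when it is
        time to flush them to the processing area, else None."""
        self.cache.append(segment)
        self.cached_ms += duration_ms
        if self.cached_ms >= self.time_threshold_ms:
            flushed = [row for seg in self.cache for row in seg]
            self.cache, self.cached_ms = [], 0
            return flushed
        return None
```

Pushing three 40 ms segments, the first two calls return None and the third returns the spliced matrix for sliding-window processing.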
Further, before the step of performing sliding-window processing on the decoding matrix according to the adaptive step-size strategy, the method includes:
S101, if the length of the decoding matrix is smaller than a preset processing length, receiving the next decoding matrix and splicing the two decoding matrices;
S102, if the length of the spliced decoding matrix is greater than the preset processing length, performing sliding-window processing.
When the caching time of the cache area meets the time threshold and the decoding matrix in the cache area is copied to the processing area for sliding-window processing, it may be necessary to splice decoding matrix blocks. The splicing results of multiple voice data blocks are input into the voice processing model; if the spliced decoding matrix is too short, the voice data it contains may be incomplete, or the processing efficiency of the voice processing model may be low. For each voice data block, the voice processing model needs context information over a longer range, so that the keyword sequences output for the voice data blocks are more accurate. Therefore, when the length of the decoding matrix is smaller than the preset processing length, the next decoding matrix is received and spliced with it before sliding-window processing.
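The length check and block splicing described above can be sketched as follows (the preset processing length of 20 frames is an assumed value):

```python
from typing import Optional

import numpy as np

MIN_LEN = 20  # assumed preset processing length, in frames

def maybe_splice(current: np.ndarray, nxt: np.ndarray) -> Optional[np.ndarray]:
    """Return a matrix long enough for sliding-window processing.

    If the current decoding matrix is already long enough it is used as is;
    otherwise it is spliced with the next decoding matrix, and None is
    returned if even the spliced result is still too short (the caller then
    keeps waiting for more data).
    """
    if current.shape[0] >= MIN_LEN:
        return current
    spliced = np.vstack([current, nxt])
    return spliced if spliced.shape[0] > MIN_LEN else None
```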
Further, referring to FIG. 2, the step of performing sliding-window processing on the decoding matrix according to the adaptive step-size strategy and outputting a decoding result when the confidence of the keyword appearing in the sliding window meets a preset threshold includes:
S21, searching the decoding matrix with the sliding window within a preset step-size range, and judging whether one of the following conditions is met: condition one, whether a probability value corresponding to the first phoneme of the keyword within the preset step-size range is greater than a preset threshold; condition two, whether the maximum probability value in any column of the decoding matrix within the preset step-size range belongs to the first phoneme of the keyword;
S23, if condition one is met, sliding the window to the first position where the probability value is greater than the preset threshold, judging whether the confidence of the keyword appearing in the window now meets the preset threshold, and outputting a decoding result if it does; if condition two is met, sliding the window to the first position where the column maximum belongs to the first phoneme of the keyword, judging whether the confidence of the keyword appearing in the window meets the preset threshold, and outputting a decoding result if it does.
When the other judgment result occurs in each of the above judgment steps, the method further includes the following steps:
S21, searching the decoding matrix with the sliding window within a preset step-size range, and judging whether one of the following conditions is met: condition one, whether a probability value corresponding to the first phoneme of the keyword within the preset step-size range is greater than a preset threshold; condition two, whether the maximum probability value in any column of the decoding matrix within the preset step-size range belongs to the first phoneme of the keyword;
S22, if the confidence of the keyword appearing in the current sliding window meets the preset threshold, outputting a decoding result;
S23, if condition one is met, sliding the window to the first position where the probability value is greater than the preset threshold, judging whether the confidence of the keyword appearing in the window now meets the preset threshold, and outputting a decoding result if it does; if condition two is met, sliding the window to the first position where the column maximum belongs to the first phoneme of the keyword, judging whether the confidence of the keyword appearing in the window meets the preset threshold, and outputting a decoding result if it does;
S24, if neither condition one nor condition two is met, sliding the window once along the decoding matrix by the preset step size.
For example, in this embodiment the length of the decoding matrix is 40, the window length of the sliding window is 20, and the preset step size is 10. The sliding window searches the decoding matrix within the preset step-size range of 10; at this point the leftmost end of the sliding window is at the position of the 1st column vector, and it is judged whether the confidence of the keyword appearing in the current window meets the preset threshold. If it does, a decoding result is output. If it does not, it is judged whether one of the following conditions is met: condition one, whether the probability value of the first phoneme of the keyword within the preset step-size range is greater than the preset threshold; condition two, whether the maximum probability value in any column of the decoding matrix within the preset step-size range belongs to the first phoneme of the keyword. If neither condition one nor condition two is satisfied, the sliding window slides along the decoding matrix once with the preset step size of 10, so that its leftmost end is now at the position of the 11th column vector, and step S21 is executed again.
If, while the sliding window searches the decoding matrix within the preset step-size range of 10, the probability value corresponding to the first phoneme of the keyword is greater than the preset threshold, or a column maximum of the decoding matrix belongs to the first phoneme of the keyword, and the first position where this probability value or maximum appears is the 5th column vector, then the leftmost end of the sliding window moves to the position of the 5th column vector; the step size of the sliding window is now no longer the previous 10 but 5. If the confidence of the keyword appearing in the current window meets the preset threshold, a decoding result is output. If it does not, the search is performed again within the preset step-size range of 10, that is, within the 5th to 15th column vectors; the specific steps are as described in step S21 and are not repeated here. Because the step size is adjusted according to the search result within the preset step-size range during sliding, the step size of the sliding window changes on every slide, which avoids the problem of the windows overlapping too much or too little, reduces the number of decoding passes and the processing time, and reduces the possibility of losing voice information.
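The adaptive step-size choice in this example might be sketched as follows (zero-based indices; the keyword-initial phoneme id and the threshold are assumed values, and the two conditions mirror step S21):

```python
import numpy as np

START_PHONE = 1   # assumed id of the keyword's first phoneme
THRESH = 0.5      # assumed preset threshold

def next_step(matrix: np.ndarray, left: int, preset_step: int = 10) -> int:
    """Adaptive step for a sliding window whose left edge sits at `left`.

    Searches the next preset-step range for a column where the keyword's
    first phoneme either exceeds the threshold (condition one) or is the
    column maximum (condition two); the window jumps to the first such
    column, otherwise it slides by the full preset step.
    """
    for offset in range(preset_step):
        col = left + offset
        if col >= matrix.shape[0]:
            break
        p = matrix[col, START_PHONE]
        if p > THRESH or p == matrix[col].max():
            return max(offset, 1)  # adapted step: land on the firing column
    return preset_step             # neither condition met: full step
```

With a matrix that fires at the 5th column the step shrinks from 10 to the distance to that column, which is the behaviour described in the example above.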
Further, after the step of performing sliding-window processing on the decoding matrix according to the adaptive step-size strategy and outputting a decoding result when the confidence of the keyword appearing in the sliding window meets a preset threshold, the method includes:
S31, if the confidence of the keyword appearing in the sliding window does not meet the preset threshold, judging whether a probability value of the first phoneme of the keyword is obtained within a preset length range at the tail end of the decoding matrix;
S32, if not, receiving the next decoding matrix segment;
S33, if so, judging whether one of the following conditions is met: condition one, whether the probability value is greater than a preset threshold; condition two, whether the probability value is the maximum value of the decoding matrix within the preset length range;
S34, if neither is met, receiving the next decoding matrix segment;
S35, if condition one is met, intercepting the decoding matrix from the first position where the probability value is greater than the preset threshold to the tail end as a first decoding matrix, and splicing it with the next decoding matrix segment to be processed to obtain a second decoding matrix; if condition two is met, intercepting the decoding matrix from the position where the probability value is the maximum to the tail end as the first decoding matrix, and splicing it with the next decoding matrix segment to be processed to obtain the second decoding matrix.
Because the processing area in this application processes a decoding matrix intercepted from the cache area, a decoding matrix containing complete semantic data may be cut apart; if each section of the decoding matrix were processed directly on its own, the final recognition result could be wrong. Therefore, when the confidence of the keyword appearing in the sliding window does not meet the preset threshold, the next speech segment needs to be processed. At this time, it is necessary to determine whether the probability value of the phoneme at the beginning of the keyword is obtained within a preset length range at the tail end of the current decoding matrix. For example, if the length of the decoding matrix is 40 and the preset length is 10, the check covers the 31st to 40th column vectors. If the position where the probability value is greater than the preset threshold is the 36th column, or the maximum probability value appears at the 36th column, the column vectors from the 36th to the 40th column are intercepted starting from the 36th column and spliced with the next decoding matrix to be processed to obtain a second decoding matrix. Splicing in this way preserves the integrity of the voice data and ensures the accuracy of recognizing it.
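The tail check and splice of steps S31 to S35 might look like the following sketch, again over assumed data structures; `splice_for_continuity` and its handling of condition two are illustrative, not the patent's actual logic.

```python
def splice_for_continuity(matrix, next_matrix, start_phoneme, threshold, tail_len=10):
    """Check the last `tail_len` columns of `matrix` (a list of
    {phoneme: probability} columns) for the keyword-initial phoneme and,
    if condition one or condition two holds, carry the tail over into the
    next segment so a keyword split across segments is not lost."""
    tail = matrix[-tail_len:]
    base = len(matrix) - len(tail)
    cut = None
    for i, col in enumerate(tail):
        if col.get(start_phoneme, 0.0) > threshold:     # condition one
            cut = base + i
            break
    if cut is None:
        # condition two: column in the tail where the initial phoneme peaks,
        # accepted only if it is also that column's maximum (an assumption)
        best = max(range(len(tail)), key=lambda i: tail[i].get(start_phoneme, 0.0))
        if max(tail[best], key=tail[best].get) == start_phoneme:
            cut = base + best
    if cut is None:
        return next_matrix                  # nothing to carry over
    return matrix[cut:] + next_matrix       # the "second decoding matrix"
```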
Referring to fig. 3, the present invention further provides an adaptive decoding apparatus, including:
the acquisition module 1, which is used for acquiring voice data and preprocessing the voice data to obtain a decoding matrix;
and the processing module 2 is used for performing sliding window processing on the decoding matrix according to a self-adaptive step length strategy and outputting a decoding result when the confidence coefficient of the key word appearing in the sliding window meets a preset threshold value.
Further, the obtaining module includes:
the obtaining submodule I is used for inputting the voice data into a neural network model to obtain an initial decoding matrix;
the obtaining submodule II is used for obtaining a probability value corresponding to a blank label in the initial decoding matrix;
and the simplification submodule is used for removing the characteristic data frame corresponding to the probability value to obtain a simplified decoding matrix if the probability value is greater than a preset threshold value.
Referring to fig. 4, an embodiment of the present application further provides a computer device, whose internal structure may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, a display device and an input device connected through a system bus. The network interface of the computer device is used for communicating with an external terminal through a network connection. The display device of the computer device is used for displaying the interactive page. The input device of the computer device is used for receiving input from a user. The processor of the computer device is used for providing computing and control capabilities. The memory of the computer device includes a non-volatile storage medium, which stores an operating system, a computer program and a database. The database of the computer device is used for storing the original data. The computer program is executed by the processor to implement the adaptive decoding method.
In executing the adaptive decoding method, the processor acquires voice data and preprocesses the voice data to obtain a decoding matrix; it then performs sliding window processing on the decoding matrix according to an adaptive step size strategy, and outputs a decoding result when the confidence of the keyword appearing in the sliding window meets a preset threshold. Performing sliding window processing according to the adaptive step size strategy allows the step length of the sliding window to change on each slide, which avoids the sliding windows overlapping too much or too little, reduces the number of decoding passes and the processing time, and reduces the possibility of losing voice information.
Further, the step of obtaining the voice data and preprocessing the voice data to obtain a decoding matrix includes:
inputting the voice data into a neural network model to obtain an initial decoding matrix;
obtaining a probability value corresponding to a blank label in the initial decoding matrix;
and if the probability value is greater than a preset threshold value, removing the characteristic data frame corresponding to the probability value to obtain a simplified decoding matrix.
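The blank-label pruning above, reminiscent of dropping blank-dominated frames in CTC-style network outputs, can be sketched as follows; the label name and the threshold value are assumptions for illustration.

```python
def prune_blank_frames(initial_matrix, blank="<blank>", threshold=0.9):
    """Return a simplified decoding matrix with feature frames removed
    wherever the blank label's probability exceeds the preset threshold.
    `initial_matrix` is a list of {label: probability} columns; the blank
    label name and 0.9 threshold are illustrative assumptions."""
    return [col for col in initial_matrix if col.get(blank, 0.0) <= threshold]
```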
Further, after the step of removing the feature data frame corresponding to the probability value to obtain a simplified decoding matrix if the probability value is greater than a preset threshold, the method further includes:
caching the decoding matrix to a cache region;
and when the caching time of the caching area meets a time threshold, copying the decoding matrix of the caching area to a processing area for sliding window processing.
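The cache-region to processing-region hand-off might be sketched as below. The time threshold value, the copy-then-clear semantics, and all names are assumptions for illustration; the patent only states that the buffered matrix is copied once the caching time meets a time threshold.

```python
import time

class DecodeBuffer:
    """Minimal sketch of the cache-region / processing-region hand-off."""

    def __init__(self, time_threshold_s=0.5):
        self.time_threshold_s = time_threshold_s
        self.cache = []              # cache region: pruned decoding-matrix columns
        self.cached_since = None     # when the current batch started buffering

    def push(self, columns, now=None):
        """Append newly pruned columns to the cache region."""
        now = time.monotonic() if now is None else now
        if not self.cache:
            self.cached_since = now
        self.cache.extend(columns)

    def take_for_processing(self, now=None):
        """Copy the buffered matrix to the processing region once the caching
        time meets the threshold; otherwise return None.  Clearing the cache
        after the copy is an assumption, not stated in the patent."""
        now = time.monotonic() if now is None else now
        if self.cache and now - self.cached_since >= self.time_threshold_s:
            processing = list(self.cache)    # copy to the processing region
            self.cache.clear()
            self.cached_since = None
            return processing
        return None
```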
Further, before the step of performing sliding window processing on the decoding matrix according to the adaptive step size strategy, the method includes:
if the length of the decoding matrix is smaller than the preset processing length, receiving the next decoding matrix and splicing the two decoding matrices;
and if the length of the spliced decoding matrix is greater than the preset processing length, performing sliding window processing.
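The length check and concatenation before windowing can be sketched as follows; `min_len` stands in for the preset processing length, and the accumulate-until-long-enough behavior for the too-short case is an assumption.

```python
def ready_for_window(pending, incoming, min_len=20):
    """Concatenate the pending (too short) decoding matrix with the next one.
    Returns (matrix_to_window, leftover): if the spliced matrix reaches the
    preset processing length `min_len` (an assumed value), it is handed to
    sliding-window processing; otherwise it keeps accumulating."""
    merged = pending + incoming
    if len(merged) >= min_len:
        return merged, []        # long enough: perform sliding window processing
    return None, merged          # still too short: wait for the next matrix
```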
Further, the step of performing sliding window processing on the decoding matrix according to the adaptive step size strategy and outputting a decoding result when the confidence of the keyword appearing in the sliding window meets a preset threshold includes:
searching the decoding matrix with the sliding window within a preset step range, and judging whether one of the following conditions is met: condition one, whether the probability value corresponding to the phoneme at the beginning of the keyword within the preset step range is greater than a preset threshold; condition two, whether, in any column of the decoding matrix within the preset step range, the phoneme at the beginning of the keyword has the maximum probability value;
if condition one is met, sliding the window to the position where the probability value is greater than the preset threshold, judging whether the confidence of the keyword appearing in the sliding window at that moment meets the preset threshold, and if so, outputting the decoding result; if condition two is met, sliding the window to the first position where the phoneme at the beginning of the keyword appears with the maximum probability value, judging whether the confidence of the keyword appearing in the sliding window meets the preset threshold, and if so, outputting the decoding result.
Further, after the step of performing sliding window processing on the decoding matrix according to the adaptive step size strategy and outputting a decoding result when the confidence of the keyword appearing in the sliding window meets a preset threshold, the method further includes:
if the confidence of the keyword appearing in the sliding window does not meet the preset threshold, judging whether the probability value of the phoneme at the beginning of the keyword is obtained within a preset length range at the tail end of the decoding matrix;
if not, receiving the next section of the decoding matrix;
if yes, judging whether one of the following conditions is met: condition one, whether the probability value is greater than a preset threshold; condition two, whether the probability value is the maximum value of the decoding matrix within the preset length range;
if neither condition is met, receiving the next section of the decoding matrix;
if condition one is met, intercepting, starting from the first position where the probability value is greater than the preset threshold, a section of the decoding matrix from that position to the tail end as a first decoding matrix, and splicing the first decoding matrix with the next section of the decoding matrix to be processed to obtain a second decoding matrix; if condition two is met, intercepting, starting from the position where the probability value is the maximum, a section of the decoding matrix from that position to the tail end as the first decoding matrix, and splicing it with the next section of the decoding matrix to be processed to obtain the second decoding matrix.
The present application further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the adaptive decoding method: acquiring voice data and preprocessing the voice data to obtain a decoding matrix; performing sliding window processing on the decoding matrix according to an adaptive step size strategy, and outputting a decoding result when the confidence of the keyword appearing in the sliding window meets a preset threshold. Performing sliding window processing according to the adaptive step size strategy allows the step length of the sliding window to change on each slide, which avoids the sliding windows overlapping too much or too little, reduces the number of decoding passes and the processing time, and reduces the possibility of losing voice information.
Further, the step of obtaining the voice data and preprocessing the voice data to obtain the decoding matrix includes:
inputting the voice data into a neural network model to obtain an initial decoding matrix;
obtaining a probability value corresponding to a blank label in the initial decoding matrix;
and if the probability value is greater than a preset threshold value, removing the characteristic data frame corresponding to the probability value to obtain a simplified decoding matrix.
Further, after the step of removing the feature data frame corresponding to the probability value to obtain a simplified decoding matrix if the probability value is greater than a preset threshold, the method further includes:
caching the decoding matrix to a cache region;
and when the caching time of the caching area meets a time threshold, copying the decoding matrix of the caching area to a processing area for sliding window processing.
Further, the step of performing sliding window processing on the decoding matrix according to the adaptive step size strategy and outputting a decoding result when the confidence of the keyword appearing in the sliding window meets a preset threshold includes:
searching the decoding matrix with the sliding window within a preset step range, and judging whether one of the following conditions is met: condition one, whether the probability value corresponding to the phoneme at the beginning of the keyword within the preset step range is greater than a preset threshold; condition two, whether, in any column of the decoding matrix within the preset step range, the phoneme at the beginning of the keyword has the maximum probability value;
if condition one is met, sliding the window to the first position where the probability value is greater than the preset threshold, judging whether the confidence of the keyword appearing in the sliding window at that moment meets the preset threshold, and if so, outputting the decoding result; if condition two is met, sliding the window to the first position where the phoneme at the beginning of the keyword appears with the maximum probability value, judging whether the confidence of the keyword appearing in the sliding window meets the preset threshold, and if so, outputting the decoding result.
Further, after the step of performing sliding window processing on the decoding matrix according to the adaptive step size strategy and outputting a decoding result when the confidence of the keyword appearing in the sliding window meets a preset threshold, the method further includes:
if the confidence of the keyword appearing in the sliding window does not meet the preset threshold, judging whether the probability value of the phoneme at the beginning of the keyword is obtained within a preset length range at the tail end of the decoding matrix;
if not, receiving the next section of the decoding matrix;
if yes, judging whether one of the following conditions is met: condition one, whether the probability value is greater than a preset threshold; condition two, whether the probability value is the maximum value of the decoding matrix within the preset length range;
if neither condition is met, receiving the next section of the decoding matrix;
if condition one is met, intercepting, starting from the first position where the probability value is greater than the preset threshold, a section of the decoding matrix from that position to the tail end as a first decoding matrix, and splicing the first decoding matrix with the next section of the decoding matrix to be processed to obtain a second decoding matrix; if condition two is met, intercepting, starting from the position where the probability value is the maximum, a section of the decoding matrix from that position to the tail end as the first decoding matrix, and splicing it with the next section of the decoding matrix to be processed to obtain the second decoding matrix.
It will be understood by those skilled in the art that all or part of the processes of the methods in the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or other media provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), and direct Rambus dynamic RAM (DRDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.

Claims (9)

1. An adaptive decoding method, comprising:
acquiring voice data, and preprocessing the voice data to obtain a decoding matrix;
performing sliding window processing on the decoding matrix according to a self-adaptive step length strategy, and outputting a decoding result when the confidence coefficient of the key word appearing in the sliding window meets a preset threshold value;
wherein the step of performing sliding window processing on the decoding matrix according to the adaptive step size strategy and outputting a decoding result when the confidence of the keyword appearing in the sliding window meets a preset threshold comprises:
searching the decoding matrix with the sliding window within a preset step range, and judging whether one of the following conditions is met: condition one, whether the probability value corresponding to the phoneme at the beginning of the keyword within the preset step range is greater than a preset threshold; condition two, whether, in any column of the decoding matrix within the preset step range, the phoneme at the beginning of the keyword has the maximum probability value;
if condition one is met, sliding the window to the first position where the probability value is greater than the preset threshold, judging whether the confidence of the keyword appearing in the sliding window at that moment meets the preset threshold, and if so, outputting the decoding result; if condition two is met, sliding the window to the first position where the phoneme at the beginning of the keyword appears with the maximum probability value, judging whether the confidence of the keyword appearing in the sliding window meets the preset threshold, and if so, outputting the decoding result.
2. The adaptive decoding method according to claim 1, wherein the step of obtaining the speech data and preprocessing the speech data to obtain the decoding matrix comprises:
inputting the voice data into a neural network model to obtain an initial decoding matrix;
obtaining a probability value corresponding to a blank label in the initial decoding matrix;
and if the probability value is greater than a preset threshold value, removing the characteristic data frame corresponding to the probability value to obtain a simplified decoding matrix.
3. The adaptive decoding method according to claim 2, wherein after the step of removing the feature data frame corresponding to the probability value to obtain the simplified decoding matrix if the probability value is greater than a preset threshold, the method further comprises:
caching the decoding matrix to a cache region;
and when the caching time of the caching area meets a time threshold, copying the decoding matrix of the caching area to a processing area for sliding window processing.
4. The adaptive decoding method according to claim 1, wherein the step of performing sliding window processing on the decoding matrix according to the adaptive step size strategy is preceded by:
if the length of the decoding matrix is smaller than the preset processing length, receiving the next decoding matrix and splicing the two decoding matrices;
and if the length of the spliced decoding matrix is greater than the preset processing length, performing sliding window processing.
5. The adaptive decoding method according to claim 1, wherein after the step of performing sliding window processing on the decoding matrix according to an adaptive step size strategy and outputting a decoding result when the confidence of the keyword appearing in the sliding window meets a preset threshold, the method further comprises:
if the confidence of the keyword appearing in the sliding window does not meet the preset threshold, judging whether the probability value of the phoneme at the beginning of the keyword is obtained within a preset length range at the tail end of the decoding matrix;
if not, receiving the next section of the decoding matrix;
if yes, judging whether one of the following conditions is met: condition one, whether the probability value is greater than a preset threshold; condition two, whether the probability value is the maximum value of the decoding matrix within the preset length range;
if neither condition one nor condition two is met, receiving the next section of the decoding matrix;
if condition one is met, intercepting, starting from the first position where the probability value is greater than the preset threshold, a section of the decoding matrix from that position to the tail end as a first decoding matrix, and splicing the first decoding matrix with the next section of the decoding matrix to be processed to obtain a second decoding matrix; if condition two is met, intercepting, starting from the position where the probability value is the maximum, a section of the decoding matrix from that position to the tail end as the first decoding matrix, and splicing it with the next section of the decoding matrix to be processed to obtain the second decoding matrix.
6. An adaptive decoding apparatus, comprising:
the acquisition module is used for acquiring voice data and preprocessing the voice data to obtain a decoding matrix;
the processing module is used for performing sliding window processing on the decoding matrix according to a self-adaptive step length strategy and outputting a decoding result when the confidence coefficient of the key word appearing in the sliding window meets a preset threshold value;
wherein the step of performing sliding window processing on the decoding matrix according to the adaptive step size strategy and outputting a decoding result when the confidence of the keyword appearing in the sliding window meets a preset threshold comprises:
searching the decoding matrix with the sliding window within a preset step range, and judging whether one of the following conditions is met: condition one, whether the probability value corresponding to the phoneme at the beginning of the keyword within the preset step range is greater than a preset threshold; condition two, whether, in any column of the decoding matrix within the preset step range, the phoneme at the beginning of the keyword has the maximum probability value;
if condition one is met, sliding the window to the first position where the probability value is greater than the preset threshold, judging whether the confidence of the keyword appearing in the sliding window at that moment meets the preset threshold, and if so, outputting the decoding result; if condition two is met, sliding the window to the first position where the phoneme at the beginning of the keyword appears with the maximum probability value, judging whether the confidence of the keyword appearing in the sliding window meets the preset threshold, and if so, outputting the decoding result.
7. The adaptive decoding apparatus of claim 6, wherein the obtaining module comprises:
the obtaining submodule I is used for inputting the voice data into a neural network model to obtain an initial decoding matrix;
the obtaining submodule II is used for obtaining a probability value corresponding to a blank label in the initial decoding matrix;
and the simplification submodule is used for removing the characteristic data frame corresponding to the probability value to obtain a simplified decoding matrix if the probability value is greater than a preset threshold value.
8. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 5 when executing the computer program.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.
CN202210687945.XA 2022-06-17 2022-06-17 Adaptive decoding method, apparatus, computer device and storage medium Active CN114783438B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210687945.XA CN114783438B (en) 2022-06-17 2022-06-17 Adaptive decoding method, apparatus, computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210687945.XA CN114783438B (en) 2022-06-17 2022-06-17 Adaptive decoding method, apparatus, computer device and storage medium

Publications (2)

Publication Number Publication Date
CN114783438A CN114783438A (en) 2022-07-22
CN114783438B true CN114783438B (en) 2022-09-27

Family

ID=82421083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210687945.XA Active CN114783438B (en) 2022-06-17 2022-06-17 Adaptive decoding method, apparatus, computer device and storage medium

Country Status (1)

Country Link
CN (1) CN114783438B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114822539A (en) * 2022-06-24 2022-07-29 深圳市友杰智新科技有限公司 Method, device, equipment and storage medium for decoding double-window voice
CN115101063B (en) * 2022-08-23 2023-01-06 深圳市友杰智新科技有限公司 Low-computation-power voice recognition method, device, equipment and medium
CN115497484B (en) * 2022-11-21 2023-03-28 深圳市友杰智新科技有限公司 Voice decoding result processing method, device, equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106940998A (en) * 2015-12-31 2017-07-11 阿里巴巴集团控股有限公司 A kind of execution method and device of setting operation
CN107871499A (en) * 2017-10-27 2018-04-03 珠海市杰理科技股份有限公司 Audio recognition method, system, computer equipment and computer-readable recording medium
CN110033758A (en) * 2019-04-24 2019-07-19 武汉水象电子科技有限公司 A kind of voice wake-up implementation method based on small training set optimization decoding network
CN111462777A (en) * 2020-03-30 2020-07-28 厦门快商通科技股份有限公司 Keyword retrieval method, system, mobile terminal and storage medium
CN113192501A (en) * 2021-04-12 2021-07-30 青岛信芯微电子科技股份有限公司 Instruction word recognition method and device
CN113470646A (en) * 2021-06-30 2021-10-01 北京有竹居网络技术有限公司 Voice wake-up method, device and equipment
CN113506575A (en) * 2021-09-09 2021-10-15 深圳市友杰智新科技有限公司 Processing method and device for streaming voice recognition and computer equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9275637B1 (en) * 2012-11-06 2016-03-01 Amazon Technologies, Inc. Wake word evaluation
US10490209B2 (en) * 2016-05-02 2019-11-26 Google Llc Automatic determination of timing windows for speech captions in an audio stream
US11615786B2 (en) * 2019-03-05 2023-03-28 Medyug Technology Private Limited System to convert phonemes into phonetics-based words
US11443734B2 (en) * 2019-08-26 2022-09-13 Nice Ltd. System and method for combining phonetic and automatic speech recognition search
KR20210047173A (en) * 2019-10-21 2021-04-29 엘지전자 주식회사 Artificial intelligence apparatus and method for recognizing speech by correcting misrecognized word
CN114530141A (en) * 2020-11-23 2022-05-24 北京航空航天大学 Chinese and English mixed offline voice keyword recognition method under specific scene and system implementation thereof

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106940998A (en) * 2015-12-31 2017-07-11 阿里巴巴集团控股有限公司 A kind of execution method and device of setting operation
CN107871499A (en) * 2017-10-27 2018-04-03 珠海市杰理科技股份有限公司 Audio recognition method, system, computer equipment and computer-readable recording medium
CN110033758A (en) * 2019-04-24 2019-07-19 武汉水象电子科技有限公司 A kind of voice wake-up implementation method based on small training set optimization decoding network
CN111462777A (en) * 2020-03-30 2020-07-28 厦门快商通科技股份有限公司 Keyword retrieval method, system, mobile terminal and storage medium
CN113192501A (en) * 2021-04-12 2021-07-30 青岛信芯微电子科技股份有限公司 Instruction word recognition method and device
CN113470646A (en) * 2021-06-30 2021-10-01 北京有竹居网络技术有限公司 Voice wake-up method, device and equipment
CN113506575A (en) * 2021-09-09 2021-10-15 深圳市友杰智新科技有限公司 Processing method and device for streaming voice recognition and computer equipment

Also Published As

Publication number Publication date
CN114783438A (en) 2022-07-22

Similar Documents

Publication Publication Date Title
CN114783438B (en) Adaptive decoding method, apparatus, computer device and storage medium
CN112100349B (en) Multi-round dialogue method and device, electronic equipment and storage medium
CN112102815B (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN108899013B (en) Voice search method and device and voice recognition system
US9934452B2 (en) Pruning and label selection in hidden Markov model-based OCR
CN111797632B (en) Information processing method and device and electronic equipment
CN113506575B (en) Processing method and device for streaming voice recognition and computer equipment
CN110910903B (en) Speech emotion recognition method, device, equipment and computer readable storage medium
CN110675861B (en) Method, device and equipment for speech sentence interruption and storage medium
CN111223476A (en) Method and device for extracting voice feature vector, computer equipment and storage medium
CN114639386A (en) Text error correction and text error correction word bank construction method
CN112016319A (en) Pre-training model obtaining method, disease entity labeling method, device and storage medium
CN113254613A (en) Dialogue question-answering method, device, equipment and storage medium
CN112201275A (en) Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium
CN110503943B (en) Voice interaction method and voice interaction system
CN114944149A (en) Speech recognition method, speech recognition apparatus, and computer-readable storage medium
US20120116765A1 (en) Speech processing device, method, and storage medium
CN115101063B (en) Low-computation-power voice recognition method, device, equipment and medium
CN115497484B (en) Voice decoding result processing method, device, equipment and storage medium
CN115512687B (en) Voice sentence-breaking method and device, storage medium and electronic equipment
CN108694939B (en) Voice search optimization method, device and system
CN111739506A (en) Response method, terminal and storage medium
CN112397053B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN112669836A (en) Command recognition method and device and computer readable storage medium
CN112420054A (en) Speech recognition system and method based on speaker vector multiplexing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant