CN117409778A - Decoding processing method, device, equipment and storage medium - Google Patents

Decoding processing method, device, equipment and storage medium

Info

Publication number
CN117409778A
CN117409778A (application CN202311714025.3A)
Authority
CN
China
Prior art keywords
result
phoneme
decoding
preset
path
Prior art date
Legal status
Granted
Application number
CN202311714025.3A
Other languages
Chinese (zh)
Other versions
CN117409778B (en)
Inventor
李�杰
Current Assignee
Shenzhen Youjie Zhixin Technology Co ltd
Original Assignee
Shenzhen Youjie Zhixin Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Youjie Zhixin Technology Co ltd filed Critical Shenzhen Youjie Zhixin Technology Co ltd
Priority to CN202311714025.3A priority Critical patent/CN117409778B/en
Publication of CN117409778A publication Critical patent/CN117409778A/en
Application granted granted Critical
Publication of CN117409778B publication Critical patent/CN117409778B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/02 Feature extraction for speech recognition; selection of recognition unit
    • G10L15/063 Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L15/26 Speech to text systems
    • H04L12/282 Controlling appliance services of a home automation network based on user interaction within the home
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L2015/0638 Interactive procedures (training)
    • G10L2015/223 Execution procedure of a spoken command


Abstract

The present invention relates to the field of speech decoding, and in particular to a decoding processing method, device, apparatus, and storage medium. The method includes: obtaining a predicted text result based on a decoding matrix corresponding to a first voice command word; judging whether the edit distance between the predicted text result and a preset result is smaller than a first preset threshold; if it is smaller than the first preset threshold, performing path alignment on the decoding matrix with the preset result as the decoding path; if, in the phoneme column corresponding to a certain time point in the decoding matrix, the phoneme score of the preset result is lower than a second preset threshold and the phoneme with the maximum phoneme score differs in part of speech from the phoneme of the preset result, correcting the corresponding phoneme score; and if the recognition result of the corrected decoding matrix is greater than a third preset threshold, judging that the recognition is valid. The method improves the accuracy of speech recognition with controllable computation time and a simple algorithmic procedure.

Description

Decoding processing method, device, equipment and storage medium
Technical Field
The present invention relates to the field of speech decoding, and in particular, to a decoding processing method, apparatus, device, and storage medium.
Background
Command word recognition is a form of speech recognition widely applied in the smart home field, for example in smart speakers, smart earphones, smart lamps, and smart fans. With the development of deep learning, the recognition accuracy of command words has improved markedly and basically meets user needs. In speech decoding, a CTC (Connectionist Temporal Classification) decoder is generally used; it is lightweight, requires no frame-level alignment information for training, and is easy to train, so it is widely applied to speech recognition on embedded devices. However, because the CTC decoder's outputs at different time steps are conditionally independent, some outputs do not conform to the actual grammatical structure, which reduces final recognition accuracy and increases the probability of misrecognition.
Therefore, how to effectively handle the cases where CTC's independent outputs fail to conform to the actual grammatical structure, and thereby improve the accuracy of speech recognition, is a problem to be solved.
Disclosure of Invention
The main purpose of the present application is to provide a decoding processing method, device, equipment, and storage medium, aiming to solve the technical problem of improving the accuracy of speech recognition when the independence of CTC outputs causes some outputs not to conform to the actual grammatical structure.
In order to achieve the above object, the present application proposes a decoding processing method, the method comprising: obtaining a predicted text result based on a decoding matrix corresponding to the first voice command word;
judging whether the edit distance between the predicted text result and a preset result is smaller than a first preset threshold;
if it is smaller than the first preset threshold, performing path alignment on the decoding matrix with the preset result as the decoding path;
if, in the phoneme column corresponding to a certain time point in the decoding matrix, the phoneme score of the preset result is lower than a second preset threshold and the phoneme with the maximum phoneme score differs in part of speech from the phoneme of the preset result, correcting the corresponding phoneme score;
and if the recognition result of the corrected decoding matrix is greater than a third preset threshold, judging that the recognition is valid.
Further, the step of obtaining a predicted text result based on the decoding matrix corresponding to the first voice command word includes:
performing acoustic modeling and language modeling on the first voice command word through a voice recognition model to obtain a corresponding decoding matrix;
and selecting a preset decoding algorithm, decoding step by step from the first time point of the decoding matrix, and after the last time point is decoded, taking the word with the highest probability as the predicted text result.
Further, the step of judging whether the edit distance between the predicted text result and the preset result is smaller than a first preset threshold includes:
acquiring character string information of the predicted text result and the preset result;
creating an edit distance matrix, and filling the edit distance matrix based on the character string information;
and obtaining the edit distance between the predicted text result and the preset result based on the filled edit distance matrix.
Further, if the edit distance is smaller than the first preset threshold, the step of performing path alignment on the decoding matrix with the preset result as the decoding path includes:
initializing a path pointer matrix according to the size of the decoding matrix, wherein the path pointer matrix is used for recording the path pointer of each position;
traversing the time point and the phonemes of the decoding matrix, and calculating a path score and a path pointer of each position;
comparing the path score and the path pointer of each position with a preset result, and updating the path score and the path pointer to optimal values according to the comparison result;
and (3) reversely backtracking from the last time point according to the path pointer matrix, acquiring an alignment path, and finishing path alignment.
Further, the step of correcting the corresponding phoneme score value includes:
traversing the phoneme columns corresponding to each time point in the decoding matrix, and finding the phoneme columns corresponding to the preset result;
based on the phoneme columns corresponding to the preset results, checking the score value of each phoneme, and finding a phoneme position with the phoneme score value lower than a second preset threshold value;
determining a phoneme with the largest phoneme score value in the corresponding phoneme column, and comparing the phoneme with a phoneme corresponding to the preset result;
if the part of speech of the maximum-score phoneme differs from that of the phoneme corresponding to the preset result, judging that the maximum-score phoneme is misrecognized;
subtracting a fourth preset threshold from the score of the misrecognized phoneme, and adding the deducted score to the phoneme corresponding to the preset result.
Further, if the recognition result of the corrected decoding matrix is greater than a third preset threshold, the step of judging that the recognition is valid includes:
calculating a score for each candidate word path or phoneme path based on CTC criteria;
selecting the command word with the highest score from all the candidate command words as a recognition result;
and comparing the score of the recognition result with a third preset threshold; if the score is greater than the third preset threshold, the recognition is considered valid.
The second aspect of the present application further includes a decoding processing apparatus, including:
the prediction result acquisition module is used for acquiring a prediction text result based on a decoding matrix corresponding to the first voice command word;
the editing distance judging module is used for judging whether the editing distance between the predicted text result and the preset result is smaller than a first preset threshold value or not;
the path alignment module is used for performing path alignment on the decoding matrix with the preset result as the decoding path if the edit distance is smaller than the first preset threshold;
the correction module is used for correcting the corresponding phoneme score if, in the phoneme column corresponding to a certain time point in the decoding matrix, the phoneme score of the preset result is lower than a second preset threshold and the phoneme with the maximum phoneme score differs in part of speech from the phoneme corresponding to the preset result;
and the validity judging module is used for judging that the recognition is valid if the recognition result of the corrected decoding matrix is greater than a third preset threshold.
Further, the prediction result obtaining module includes:
the matrix building unit is used for carrying out acoustic modeling and language modeling on the first voice command word through a voice recognition model to obtain a corresponding decoding matrix;
and the decoding calculation unit is used for selecting a preset decoding algorithm, gradually decoding from the first time point of the decoding matrix, and obtaining the word with the highest probability as a predicted text result after the last time point is decoded.
A third aspect of the present application also comprises a computer device comprising a memory storing a computer program and a processor implementing the steps of any of the methods described above when the processor executes the computer program.
A fourth aspect of the present application also includes a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any of the methods described above.
According to the decoding processing method, only a simple string-matching approach is used to correct misrecognized phonemes: no complex post-processing steps are needed, and neither the network structure nor the framework needs to be changed. The implementation is simple and effective, reduces misrecognition without introducing complexity or extra computational cost, and improves user experience and recognition accuracy.
Drawings
FIG. 1 is a flow chart of a decoding method according to an embodiment of the present disclosure;
FIG. 2 is a schematic block diagram of a decoding processing apparatus according to an embodiment of the present application;
fig. 3 is a block diagram schematically illustrating a structure of a computer device according to an embodiment of the present application.
The realization, functional characteristics and advantages of the present application will be further described with reference to the embodiments, referring to the attached drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, modules, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, modules, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Referring to fig. 1, an embodiment of the present invention provides a decoding processing method, including steps S1 to S5, specifically:
s1, obtaining a predicted text result based on a decoding matrix corresponding to a first voice command word;
and obtaining a candidate predicted text result through decoding and a path searching algorithm according to the decoding matrix corresponding to the first voice command word. A specific implementation may be to convert an input speech signal into a series of candidate text results using a speech recognition model and decoding algorithm. Then, according to the path score in the decoding matrix, selecting the text result corresponding to the path with the highest score as the predicted text result.
S2, judging whether the editing distance between the predicted text result and the preset result is smaller than a first preset threshold value;
the edit distance is a measure for measuring the degree of similarity between two character strings, and represents the minimum number of operations required to convert one character string into another by insert, delete, and replace operations. And comparing the predicted text result with a preset result. The step calculates the edit distance between the predicted text result and the character string of the preset result, namely the operation times required for converting the predicted text result into the preset result through the least insertion, deletion and replacement operations, and judges whether the edit distance is smaller than a first preset threshold value. If the editing distance is smaller than the first preset threshold value, the predicted text result is indicated to be closer to the preset result, otherwise, if the editing distance is larger than or equal to the first preset threshold value, the predicted result is indicated to be larger than the preset result, the possibility that the predicted result is the correct voice command word is directly eliminated, and the next correction is not necessary.
S3, if the edit distance is smaller than the first preset threshold, performing path alignment on the decoding matrix with the preset result as the decoding path;
for each time point in the decoding matrix, selecting a path with the highest score of characters (phonemes) corresponding to the position in the preset result, taking the path as an alignment path of the decoding matrix, and performing path alignment.
S4, if, in the phoneme column corresponding to a certain time point in the decoding matrix, the phoneme score of the preset result is lower than a second preset threshold, and the phoneme with the maximum phoneme score differs in part of speech from the phoneme corresponding to the preset result, correcting the corresponding phoneme score;
Along the aligned path with the highest score for the preset result's characters (phonemes), the score and part of speech of each phoneme are recorded, and it is judged whether the phoneme column at the current time point contains a preset-result phoneme whose score is lower than the second preset threshold. If so, the phoneme with the highest score in that phoneme column is found and its part of speech is recorded, and it is judged whether this part of speech is the same as that of the phoneme corresponding to the preset result. If not, the phoneme score is corrected. In this way, misrecognized phonemes can be screened out and corrected.
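By way of a non-limiting sketch, step S4 can be expressed in code as follows. The function name, the representation of the decoding matrix as a (time point, phoneme) score array, and the part-of-speech lookup table are assumptions of the sketch; the `penalty` argument plays the role of the fourth preset threshold that is deducted from the misrecognized phoneme and supplemented to the preset-result phoneme:

```python
import numpy as np

def correct_phoneme_scores(decode_matrix, target_ids, pos_tags,
                           second_threshold=0.3, penalty=0.2):
    """Sketch of step S4 (illustrative names and values).

    decode_matrix: (time, phoneme) score array, one phoneme column per time point.
    target_ids:    phoneme index of the preset result at each time point (aligned path).
    pos_tags:      part-of-speech tag for each phoneme index.
    """
    corrected = decode_matrix.copy()
    for t, target in enumerate(target_ids):
        col = corrected[t]
        if col[target] >= second_threshold:
            continue                      # preset-result phoneme scored well enough
        best = int(np.argmax(col))        # phoneme with the maximum score
        if best != target and pos_tags[best] != pos_tags[target]:
            col[best] -= penalty          # deduct from the misrecognized phoneme
            col[target] += penalty        # supplement the preset-result phoneme
    return corrected
```

For instance, with a second preset threshold of 0.3, a time point where the preset phoneme scores 0.1 while a different-part-of-speech phoneme scores 0.8 would have 0.2 moved from the latter to the former.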
S5, if the recognition result of the corrected decoding matrix is greater than a third preset threshold, judging that the recognition is valid.
For the corrected decoding matrix, the recognition result computed from it is still checked against a threshold, and only a result that meets the threshold requirement is judged correct and output. On the premise of ensuring overall recognition accuracy, this step corrects local errors. By setting the third preset threshold, the acceptable range of recognition results can be flexibly controlled according to application requirements, filtering out recognition errors of single phonemes or phoneme sequences and improving the reliability and stability of the overall recognition.
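A minimal sketch of the step S5 validity check follows, assuming the recognition score is taken as the geometric mean of the per-frame probabilities of the preset-result phonemes in the corrected matrix; the disclosed method computes the score from CTC criteria, so this scoring function is an illustrative stand-in:

```python
import numpy as np

def is_valid_recognition(corrected_matrix, target_ids, third_threshold=0.5):
    """Sketch of step S5 (illustrative scoring).

    Scores the aligned preset-result path in the corrected decoding matrix
    as the geometric mean of its per-frame probabilities, and accepts the
    recognition only if the score exceeds the third preset threshold.
    """
    probs = [corrected_matrix[t][p] for t, p in enumerate(target_ids)]
    score = float(np.exp(np.mean(np.log(probs))))  # geometric mean (probs > 0)
    return score > third_threshold
```

A path whose every frame strongly supports the preset result passes the check, while a path with even a few near-zero frames is rejected, which is the filtering behavior this step describes.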
Through steps S1-S5, the method obtains a candidate predicted text result using decoding and path-search algorithms based on the decoding matrix corresponding to the first voice command word, and judges the degree of similarity by calculating the edit distance between the predicted text result and the preset result. If the edit distance is smaller than the first preset threshold, path alignment is performed on the decoding matrix with the preset result as the decoding path, so the decoding matrix can be fine-tuned according to the preset result and accuracy improved. If, in the phoneme column corresponding to a certain time point, the phoneme score of the preset result is lower than the second preset threshold and the phoneme with the maximum score differs in part of speech from the preset result's phoneme, the corresponding phoneme is corrected; screening out and correcting misrecognized phonemes repairs partial errors and improves overall recognition accuracy. Finally, whether the recognition is valid is judged by comparing the recognition result of the corrected decoding matrix with the third preset threshold, which filters out local errors while ensuring the accuracy and reliability of the overall recognition.
In one embodiment, the step of obtaining the predicted text result based on the decoding matrix corresponding to the first voice command word includes:
s11, performing acoustic modeling and language modeling on the first voice command word through a voice recognition model to obtain a corresponding decoding matrix;
s12, selecting a preset decoding algorithm, starting from the first time point of the decoding matrix, decoding gradually, and obtaining the word with the highest probability as a predicted text result after the last time point is decoded.
In this embodiment, acoustic modeling and language modeling are performed on the first voice command word through the speech recognition model to obtain the corresponding decoding matrix. Acoustic modeling converts the speech signal into a probability sequence over phonemes or subwords using feature extraction, acoustic models, and training; language modeling models the occurrence probability of phonemes or subwords according to a pre-trained language model, using the structure and rules of the language. Together these models convert the speech signal into corresponding words or instructions. A preset decoding algorithm is then selected and decoding proceeds step by step, yielding the word with the highest probability as the predicted text result. The decoding algorithm may use dynamic programming, the Viterbi algorithm, or similar methods: it computes an optimal path from the phoneme or subword probabilities at each time point in the decoding matrix and selects the word with the highest probability. It may also take into account transition probabilities between phonemes or subwords, prior knowledge from the language model, and other contextual information to determine the optimal word sequence.
For example, suppose we want to turn on the living room light via a voice-controlled smart home device. We speak "turn on the living room light"; this is the first voice command word. The speech recognition model then performs acoustic modeling and language modeling on this command word to obtain the corresponding decoding matrix. Next, a preset decoding algorithm (e.g. a greedy algorithm) is chosen; starting from the first time point of the decoding matrix, it decodes step by step, computing the best choice at each time point, and yields the word sequence with the highest probability as the predicted text result, "turn on the living room light". Finally, the smart home device executes the corresponding operation and turns on the living room light. This embodiment improves the accuracy and efficiency of voice command recognition: acoustic and language modeling identify command words more accurately, and the decoding algorithm uses contextual information to obtain more appropriate instructions or words.
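The greedy best-path decoding described in steps S11 and S12 can be sketched as follows; the function name, the blank index, and the label mapping are assumptions for the example:

```python
import numpy as np

def greedy_ctc_decode(decode_matrix, id_to_label, blank_id=0):
    """Greedy (best-path) CTC decoding sketch for steps S11-S12.

    Picks the highest-probability symbol at each time point of the decoding
    matrix, then applies the standard CTC collapse rule: merge consecutive
    repeats and drop blanks.
    """
    best = np.argmax(decode_matrix, axis=1)   # best symbol per time point
    result, prev = [], None
    for idx in best:
        if idx != prev and idx != blank_id:   # merge repeats, skip blanks
            result.append(id_to_label[int(idx)])
        prev = idx
    return result
```

On a four-frame matrix whose per-frame maxima are symbol 1, symbol 1, blank, symbol 2, the collapse rule yields the two-symbol sequence for 1 followed by 2.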
In an embodiment, the step of determining whether the edit distance between the predicted text result and the preset result is smaller than a first preset threshold value includes:
s21, acquiring the predicted text result and the character string information of the preset result;
s22, creating an editing distance matrix, and filling the editing distance matrix based on the character string information;
s23, obtaining the edit distance between the predicted text result and the preset result based on the filled edit distance matrix.
In this embodiment, the character strings of the predicted text result and the preset result are obtained: the predicted text result comes from the speech recognition model, and the preset result can come from manual input or preset information, such as a pre-stored correct command word. An edit distance matrix is then created and filled based on the strings. The edit distance matrix is a two-dimensional matrix whose rows correspond to the characters of the predicted text result and whose columns correspond to the characters of the preset result. Its first row and first column are initialized to the costs of pure insertions and deletions, and the edit distance between each pair of prefixes is then computed progressively. Based on the filled matrix, the edit distance between the predicted text result and the preset result is obtained: by dynamic programming, starting from the upper-left corner, the edit distance of each position is computed step by step, and the value in the lower-right corner is the edit distance between the two strings. Comparing this edit distance with the preset threshold indicates whether the similarity between the predicted text result and the preset result meets the requirement. If the edit distance is smaller than the threshold, the predicted text result is considered similar to the preset result; although not completely correct, it leaves room for correction and recognition. If the edit distance is greater than the threshold, the similarity is insufficient, the result does not tend toward the correct command word or the recognition gap is too large, and there is neither room nor need for correction.
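The edit distance computation of steps S21-S23 is the classic Levenshtein dynamic program; a self-contained sketch (function name assumed):

```python
def edit_distance(pred, ref):
    """Minimum number of insert/delete/replace operations turning pred into ref."""
    m, n = len(pred), len(ref)
    # dp[i][j] = edit distance between pred[:i] and ref[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                              # delete all of pred[:i]
    for j in range(n + 1):
        dp[0][j] = j                              # insert all of ref[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete
                           dp[i][j - 1] + 1,          # insert
                           dp[i - 1][j - 1] + cost)   # replace or match
    return dp[m][n]                               # lower-right cell
```

The step S2 check is then simply whether `edit_distance(predicted, preset)` is smaller than the first preset threshold.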
In an embodiment, the step of performing path alignment on the decoding matrix by using the preset result as a decoding path if the decoding result is smaller than a first preset threshold value includes:
s31, initializing a path pointer matrix according to the size of the decoding matrix, wherein the path pointer matrix is used for recording path pointers of all positions;
s32, traversing time points and phonemes of the decoding matrix, and calculating a path score and a path pointer of each position;
s33, comparing the path score and the path pointer of each position with a preset result, and updating the path score and the path pointer to optimal values according to the comparison result;
s34, back tracing from the last time point according to the path pointer matrix to acquire an aligned path, and finishing path alignment.
In the present embodiment, first, a path pointer matrix is initialized according to the size of the decoding matrix. The path pointer matrix is used to record the previous position of the optimal path for each position. Then, the time points and phonemes of the decoding matrix are traversed, and the path score and path pointer of each position are calculated. For each position, the path score is composed of the path score of the previous position, the transition score, and the phoneme score of the current position; the path pointer points to the previous position that maximizes the path score. Then, the path score and the path pointer of each position are compared with the preset result, and updated to the optimal values according to the comparison result. If the phoneme corresponding to the current position is the same as that of the preset result, the path score is kept unchanged; otherwise, a large penalty value is subtracted from the path score, and the path pointer is updated to the previous position. Finally, reverse backtracking is performed from the last time point according to the path pointer matrix, the alignment path is acquired, and path alignment is completed. Each position in the optimal path can be found by backtracking through the path pointer matrix, thereby constructing the alignment path.
For example: assume that the decoding matrix is a 3x3 matrix and the preset result is "ABC". The path score matrix and the path pointer matrix are initialized, and then each position of the decoding matrix is traversed. In calculating the path score, for example for the position (2, 2), the path score is calculated as 10 based on the path score of the previous position, the transition score, and the phoneme score of the current position. The path pointer points to the previous position, namely (1, 2), because the path score is greatest through that position. Then, the path score and the path pointer are compared with the preset result. If the phoneme corresponding to the current position is the same as that of the preset result, the path score is kept unchanged; otherwise, a large penalty value is subtracted from the path score and the path pointer is updated to the previous position. In the backtracking process, each position in the optimal path is found in turn from the last time point through the path pointer matrix until the initial position is reached. Thus, an alignment path is obtained, and path alignment is completed. Through the path alignment process, the decoding matrix can be finely adjusted according to the preset result, and accuracy is improved. For example, if the preset result is "ABC" and the recognition result in the decoding matrix is "AXC", the alignment path obtained after path alignment is "A-B-C". With this correct reference path, the recognition result in the matrix can be compared and corrected in the next step, so that the decoding matrix can be optimized according to the preset result, and the accuracy of speech recognition is effectively improved.
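The traversal and backtracking steps above can be sketched as a forced alignment over the decoding matrix. This is a simplified, hypothetical sketch: it keeps a path score matrix and a path pointer matrix and backtracks from the last time point, but it replaces the penalty-based update with the equivalent constraint that the path may only stay on the current phoneme of the preset result or advance to the next one; the function and variable names are assumptions.

```python
import numpy as np

def align_path(decode_mat, ref_ids):
    """Align a T x P score matrix to the preset result (forced alignment).

    decode_mat[t, p] is the score of phoneme p at time point t, and
    ref_ids lists the preset result as phoneme indices (T >= len(ref_ids)).
    Returns, for each time point, the aligned reference position.
    """
    T = decode_mat.shape[0]
    N = len(ref_ids)
    score = np.full((T, N), -np.inf)   # path score matrix
    ptr = np.zeros((T, N), dtype=int)  # path pointer matrix
    score[0, 0] = decode_mat[0, ref_ids[0]]
    for t in range(1, T):               # traverse time points
        for j in range(min(t + 1, N)):  # traverse reference phonemes
            stay = score[t - 1, j]                            # same phoneme
            move = score[t - 1, j - 1] if j > 0 else -np.inf  # next phoneme
            if stay >= move:
                score[t, j], ptr[t, j] = stay, j
            else:
                score[t, j], ptr[t, j] = move, j - 1
            score[t, j] += decode_mat[t, ref_ids[j]]
    # reverse backtracking from the last time point
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(int(ptr[t, path[-1]]))
    return path[::-1]
```

Here `align_path` returns, for each time point, the index of the preset-result phoneme it is aligned to, which plays the role of the alignment path in the embodiment.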
In one embodiment, the step of correcting the corresponding phoneme score value includes:
s41, traversing a phoneme column corresponding to each time point in the decoding matrix, and finding a phoneme column corresponding to the preset result;
s42, checking the score value of each phoneme based on the phoneme column corresponding to the preset result, and finding the phoneme position with the phoneme score value lower than a second preset threshold value;
s43, determining a phoneme with the largest phoneme score value in the corresponding phoneme column, and comparing the phoneme with a phoneme corresponding to the preset result;
s44, if the part of speech of the maximum score phoneme is different from that of the phoneme corresponding to the preset result, judging that the phoneme is misidentified;
s45, subtracting a fourth preset threshold value from the score value of the misrecognized phonemes, and supplementing the corrected score value to the corresponding phonemes.
In the present embodiment, the phoneme column corresponding to the preset result is found, and the score value of each phoneme in it is checked. If the score value of a certain phoneme is lower than the second preset threshold value, that is, lower than the set lower score limit, its position is recorded. The phoneme with the largest score value in the phoneme column is then determined and compared with the phoneme corresponding to the preset result. If the part of speech of the maximum score phoneme is different from that of the phoneme corresponding to the preset result, misrecognition is determined. A fourth preset threshold value is subtracted from the score value of the misrecognized phoneme, and the corrected score value is supplemented to the corresponding phoneme.
For example: assume that the preset result is "ABC", and the recognized phoneme sequence in the decoding matrix is "A-X-C", where "X" is a misrecognized phoneme. According to steps S41 and S42, the phoneme columns are traversed, and it is found that the score value of the preset phoneme "B" in the second column is lower than the second preset threshold value, so its position is recorded. Next, according to step S43, the phoneme with the largest score value in that column is determined to be "X", and it is compared with the phoneme "B" of the preset result. For the first column, by contrast, the maximum score phoneme "A" is the same as "A" in the preset result, so it is not determined as misrecognition.
In step S44, if the part of speech of the maximum score phoneme is found to be different from that of the phoneme corresponding to the preset result, misrecognition is determined. For example, if the phoneme sequence in the decoding matrix is "A-X-C" and the preset result is "A-B-C", where "X" and "B" have different parts of speech, misrecognition is determined in step S44.
In step S45, a fourth preset threshold value is subtracted from the score value of the misrecognized phoneme, and the corrected score value is supplemented to the corresponding phoneme. In this way, the score error caused by misrecognition can be corrected, and the accuracy of the decoding process is improved. The fourth preset threshold value can be determined by dividing the data into a training set and a test set: the performance of the model on the test set is measured under different values of the fourth preset threshold, and the best-performing value is selected as the fourth preset threshold. It can also be set according to the actual application scenario and requirements. For example, if the application is very sensitive to misrecognition, the fourth preset threshold may be set to a smaller value to reduce the influence of misrecognition; if it is not very sensitive to misrecognition, the fourth preset threshold may be set to a larger value to improve the overall efficiency and performance of the system.
According to this embodiment, in the decoding process, the phoneme score values are checked and the misrecognized phonemes are corrected, so that decoding accuracy can be improved and the misrecognition rate reduced. The corrected scores can better reflect the characteristics of the actual speech, thereby improving the quality and stability of speech recognition.
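Steps S41 to S45 can be sketched as a score-correction pass over the decoding matrix. This is an illustrative sketch under stated assumptions: phonemes are compared by identity rather than by part of speech, `low_thresh` stands in for the second preset threshold and `penalty` for the fourth preset threshold, and all names and values are hypothetical.

```python
import numpy as np

def correct_scores(decode_mat, ref_ids, low_thresh=0.2, penalty=0.3):
    """Correct misrecognized phoneme scores in a T x P decoding matrix.

    ref_ids[t] is the phoneme index expected from the preset result at
    time point t. low_thresh stands in for the second preset threshold,
    penalty for the fourth preset threshold (both illustrative values).
    """
    mat = decode_mat.copy()
    for t, ref_p in enumerate(ref_ids):
        if mat[t, ref_p] >= low_thresh:
            continue  # the preset phoneme already scores well enough
        best_p = int(np.argmax(mat[t]))
        if best_p == ref_p:
            continue  # the top-scoring phoneme matches the preset result
        # misrecognition: subtract the penalty from the wrong top phoneme
        # and supplement the preset phoneme's score by the same amount
        mat[t, best_p] -= penalty
        mat[t, ref_p] += penalty
    return mat
```

The pass leaves well-scored columns untouched and only adjusts columns where the preset phoneme scores below the threshold while a different phoneme holds the maximum score.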
In an embodiment, the step of determining that the identification is valid if the identification result of the modified decoding matrix is greater than a third preset threshold value includes:
s51, calculating the score of the path of each candidate word or phoneme based on the CTC rule;
s52, finding out the candidate word or phoneme with the largest score as a recognition result;
and S53, comparing the score of the identification result with a third preset threshold value, and if the score is larger than the third preset threshold value, considering that the identification is effective.
In this embodiment, CTC (Connectionist Temporal Classification) is a commonly used end-to-end speech recognition method, which trains the model by aligning the input sequence with the output sequence using a back-propagation algorithm. In the decoding process, the score of each candidate word or phoneme path can be calculated according to the CTC rule. Next, the candidate word or phoneme with the highest score among all candidates is selected as the recognition result. Finally, whether the recognition is valid is judged by comparing the score of the selected candidate word or phoneme with the third preset threshold value. If the score of the corrected recognition result is larger than the third preset threshold value, the correction is effective and the similarity with the preset result (the correct command word) is high, so the recognition is judged to be valid. The embodiment can perform re-recognition based on the corrected decoding matrix, verify the accuracy of the recognition result of the current decoding matrix, and screen out correct voice command words, thereby improving the recognition accuracy and recognition efficiency of voice command words.
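The scoring and thresholding in steps S51 to S53 can be sketched with greedy CTC best-path decoding. This is a simplified sketch, not the patent's exact procedure: it scores only the single best path per time step rather than summing over all CTC paths, and the blank index and threshold value are assumptions.

```python
import numpy as np

def best_path_decode(log_probs, blank=0, score_thresh=-5.0):
    """Greedy CTC best-path decoding over a T x P log-probability matrix.

    Picks the highest-probability symbol at each time point, collapses
    repeats, removes blanks, and accepts the result only if the summed
    path score clears score_thresh (standing in for the third preset
    threshold; the value is illustrative).
    """
    path = np.argmax(log_probs, axis=1)
    score = float(log_probs[np.arange(len(path)), path].sum())
    result, prev = [], blank
    for p in path:
        if p != blank and p != prev:
            result.append(int(p))
        prev = p
    return result, score, score > score_thresh
```

The path score here is the sum of the chosen log-probabilities; a fuller implementation would use the CTC forward algorithm to sum over all paths that collapse to the same label sequence before applying the threshold.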
Referring to fig. 2, a block diagram of a decoding processing device according to an embodiment of the present application, the device includes:
the prediction result obtaining module 100 is configured to obtain a prediction text result based on a decoding matrix corresponding to the first voice command word;
the edit distance judging module 200 is configured to judge whether an edit distance between the predicted text result and a preset result is smaller than a first preset threshold;
the path alignment module 300 is configured to perform path alignment on the decoding matrix by using the preset result as a decoding path if the edit distance is smaller than the first preset threshold;
a correction module 400, configured to correct the corresponding phoneme score value if, in a phoneme column corresponding to a certain time point in the decoding matrix, there is a phoneme score value of the preset result lower than a second preset threshold and a phoneme corresponding to the maximum phoneme score value is different from a phoneme part of speech corresponding to the preset result;
the correct determination module 500 is configured to determine that the recognition is valid if the corrected recognition result of the decoding matrix is greater than the third preset threshold.
In one embodiment, the prediction result acquisition module 100 includes:
the matrix building unit is used for carrying out acoustic modeling and language modeling on the first voice command word through a voice recognition model to obtain a corresponding decoding matrix;
and the decoding calculation unit is used for selecting a preset decoding algorithm, gradually decoding from the first time point of the decoding matrix, and obtaining the word with the highest probability as a predicted text result after the last time point is decoded.
In one embodiment, the edit distance judging module 200 includes:
A character string obtaining unit, configured to obtain character string information of the predicted text result and the preset result;
the distance matrix unit is used for creating an editing distance matrix and filling the editing distance matrix based on the character string information;
and the distance acquisition unit is used for acquiring the edit distance between the predicted text result and the preset result based on the filled edit distance matrix.
In one embodiment, the path alignment module 300 includes:
an initialization unit, configured to initialize a path pointer matrix according to the size of the decoding matrix, wherein the path pointer matrix is used to record the path pointer of each position;
a traversal calculating unit for traversing the time points and phonemes of the decoding matrix and calculating a path score and a path pointer for each position;
the optimal value updating unit is used for comparing the path score and the path pointer of each position with a preset result and updating the path score and the path pointer to the optimal value according to the comparison result;
and the backtracking acquisition unit is used for reversely backtracking from the last time point according to the path pointer matrix to acquire an aligned path and finish path alignment.
In one embodiment, the correction module 400 includes:
a phoneme column obtaining unit, configured to traverse a phoneme column corresponding to each time point in the decoding matrix, and find a phoneme column corresponding to the preset result;
a score checking unit for checking the score value of each phoneme based on the phoneme column corresponding to the preset result, and finding the phoneme position with the phoneme score value lower than a second preset threshold value;
the phoneme comparison unit is used for determining a phoneme with the largest phoneme score value in the corresponding phoneme column and comparing the phoneme with a phoneme corresponding to the preset result;
the misidentification judging unit is used for judging misidentification if the part of speech of the maximum score phoneme is different from that of the phoneme corresponding to the preset result;
and the score correction unit is used for subtracting a fourth preset threshold value from the score value of the misrecognized phoneme and supplementing the corrected score value to the corresponding phoneme.
In one embodiment, the correctness determination module 500 includes:
a path score calculation unit for calculating the score of the path of each candidate word or phoneme based on the CTC rule;
the maximum candidate unit is used for finding out candidate words or phonemes with the maximum score as a recognition result;
and the effective judging unit is used for comparing the score of the identification result with a third preset threshold value, and if the score is larger than the third preset threshold value, the effective judgment unit is used for judging that the identification is effective.
Referring to fig. 3, a computer device is further provided in the embodiment of the present application, where the computer device may be a server, and the internal structure of the computer device may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing usage data and the like during the decoding processing method. The network interface of the computer device is used for communicating with an external terminal through a network connection. Further, the above-mentioned computer device may be further provided with an input device, a display screen, and the like.
The computer program is executed by a processor to realize a decoding processing method, and comprises the following steps: obtaining a predicted text result based on a decoding matrix corresponding to the first voice command word; judging whether the edit distance between the predicted text result and the preset result is smaller than a first preset threshold value or not; if the edit distance is smaller than the first preset threshold value, carrying out path alignment on the decoding matrix by taking the preset result as a decoding path; if the phoneme score value of the preset result is lower than a second preset threshold value in the phoneme column corresponding to a certain time point in the decoding matrix and the phoneme corresponding to the maximum phoneme score value is different from the phoneme part of speech corresponding to the preset result, correcting the corresponding phoneme score value; and if the recognition result of the corrected decoding matrix is larger than a third preset threshold value, judging that the recognition is valid.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is merely a block diagram of a portion of the architecture in connection with the present application and is not intended to limit the computer device to which the present application is applied.
An embodiment of the present application further provides a computer readable storage medium having a computer program stored thereon, which when executed by a processor, implements a decoding processing method, including the steps of: obtaining a predicted text result based on a decoding matrix corresponding to the first voice command word; judging whether the edit distance between the predicted text result and the preset result is smaller than a first preset threshold value or not; if the edit distance is smaller than the first preset threshold value, carrying out path alignment on the decoding matrix by taking the preset result as a decoding path; if the phoneme score value of the preset result is lower than a second preset threshold value in the phoneme column corresponding to a certain time point in the decoding matrix and the phoneme corresponding to the maximum phoneme score value is different from the phoneme part of speech corresponding to the preset result, correcting the corresponding phoneme score value; and if the recognition result of the corrected decoding matrix is larger than a third preset threshold value, judging that the recognition is valid. It is understood that the computer readable storage medium in this embodiment may be a volatile readable storage medium or a nonvolatile readable storage medium.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided herein and used in embodiments may include non-volatile and/or volatile memory. The non-volatile memory can include Read Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM), among others.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, apparatus, article, or method that comprises the element.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the claims, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application, or direct or indirect application in other related technical fields are included in the scope of the claims of the present application.

Claims (10)

1. A decoding processing method, the method comprising:
obtaining a predicted text result based on a decoding matrix corresponding to the first voice command word;
judging whether the editing distance between the predicted text result and the preset result is smaller than a first preset threshold value or not;
if the edit distance is smaller than a first preset threshold value, carrying out path alignment on the decoding matrix by taking the preset result as a decoding path;
if the phoneme score value of the preset result is lower than a second preset threshold value in the phoneme column corresponding to a certain time point in the decoding matrix and the phoneme corresponding to the maximum phoneme score value is different from the phoneme part of speech corresponding to the preset result, correcting the corresponding phoneme score value;
and if the identification result of the corrected decoding matrix is larger than a third preset threshold value, judging that the identification is effective.
2. The decoding method according to claim 1, wherein the step of obtaining the predicted text result based on the decoding matrix corresponding to the first voice command word includes:
performing acoustic modeling and language modeling on the first voice command word through a voice recognition model to obtain a corresponding decoding matrix;
and selecting a preset decoding algorithm, starting from the first time point of the decoding matrix, gradually decoding, and obtaining a word with highest probability as a predicted text result after the last time point is decoded.
3. The decoding method according to claim 1, wherein the step of determining whether the edit distance between the predicted text result and the preset result is smaller than a first preset threshold value comprises:
acquiring character string information of the predicted text result and the preset result;
creating an edit distance matrix, and filling the edit distance matrix based on the character string information;
and obtaining the edit distance between the predicted text result and the preset result based on the filled edit distance matrix.
4. The decoding method according to claim 1, wherein the step of aligning the decoding matrix with the preset result as a decoding path if the edit distance is smaller than a first preset threshold value includes:
initializing a path pointer matrix according to the size of the decoding matrix, wherein the path pointer matrix is used for recording the path pointer of each position;
traversing the time point and the phonemes of the decoding matrix, and calculating a path score and a path pointer of each position;
comparing the path score and the path pointer of each position with a preset result, and updating the path score and the path pointer to optimal values according to the comparison result;
and (3) reversely backtracking from the last time point according to the path pointer matrix, acquiring an alignment path, and finishing path alignment.
5. The decoding processing method according to claim 1, wherein the step of correcting the corresponding phoneme score value includes:
traversing the phoneme columns corresponding to each time point in the decoding matrix, and finding the phoneme columns corresponding to the preset result;
based on the phoneme columns corresponding to the preset results, checking the score value of each phoneme, and finding a phoneme position with the phoneme score value lower than a second preset threshold value;
determining a phoneme with the largest phoneme score value in the corresponding phoneme column, and comparing the phoneme with a phoneme corresponding to the preset result;
if the part of speech of the maximum score phoneme is different from that of the phoneme corresponding to the preset result, judging that the maximum score phoneme is misidentified;
subtracting a fourth preset threshold value from the score value of the misrecognized phonemes, and supplementing the corrected score value to the corresponding phonemes.
6. The decoding method according to claim 1, wherein the step of determining that the identification is valid if the identification result of the modified decoding matrix is greater than a third predetermined threshold value, comprises:
calculating a score for each candidate word path or phoneme path based on CTC criteria;
selecting the command word with the highest score from all the candidate command words as a recognition result;
and comparing the score of the identification result with a third preset threshold value, and if the score is larger than the third preset threshold value, considering that the identification is effective.
7. A decoding processing apparatus, comprising:
the prediction result acquisition module is used for acquiring a prediction text result based on a decoding matrix corresponding to the first voice command word;
the editing distance judging module is used for judging whether the editing distance between the predicted text result and the preset result is smaller than a first preset threshold value or not;
the path alignment module is used for performing path alignment on the decoding matrix by taking the preset result as a decoding path if the edit distance is smaller than a first preset threshold value;
the correction module is used for correcting the corresponding phoneme score value if the phoneme score value of the preset result is lower than a second preset threshold value and the phoneme corresponding to the maximum phoneme score value is different from the phoneme part of speech corresponding to the preset result in a phoneme column corresponding to a certain time point in the decoding matrix;
and the correct judging module is used for judging that the decoding matrix is effectively recognized if the recognition result of the corrected decoding matrix is larger than a third preset threshold value.
8. The decoding processing apparatus according to claim 7, wherein the prediction result acquisition module includes:
the matrix building unit is used for carrying out acoustic modeling and language modeling on the first voice command word through a voice recognition model to obtain a corresponding decoding matrix;
and the decoding calculation unit is used for selecting a preset decoding algorithm, gradually decoding from the first time point of the decoding matrix, and obtaining the word with the highest probability as a predicted text result after the last time point is decoded.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
CN202311714025.3A 2023-12-14 2023-12-14 Decoding processing method, device, equipment and storage medium Active CN117409778B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311714025.3A CN117409778B (en) 2023-12-14 2023-12-14 Decoding processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117409778A true CN117409778A (en) 2024-01-16
CN117409778B CN117409778B (en) 2024-03-19

Family

ID=89492886

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311714025.3A Active CN117409778B (en) 2023-12-14 2023-12-14 Decoding processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117409778B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112614510A (en) * 2020-12-23 2021-04-06 北京猿力未来科技有限公司 Audio quality evaluation method and device
CN113327595A (en) * 2021-06-16 2021-08-31 北京语言大学 Pronunciation deviation detection method and device and storage medium
CN114678013A (en) * 2022-03-21 2022-06-28 苏州奇梦者科技有限公司 Method and device for evaluating sentence pronunciation and readable storage medium
CN115293139A (en) * 2022-08-03 2022-11-04 北京中科智加科技有限公司 Training method of voice transcription text error correction model and computer equipment
CN116757184A (en) * 2023-08-18 2023-09-15 昆明理工大学 Vietnam voice recognition text error correction method and system integrating pronunciation characteristics
CN116778914A (en) * 2022-03-11 2023-09-19 广州视源电子科技股份有限公司 Training method of command word recognition model, command word recognition method and device


Also Published As

Publication number Publication date
CN117409778B (en) 2024-03-19

Similar Documents

Publication Publication Date Title
KR100925479B1 (en) The method and apparatus for recognizing voice
US10714076B2 (en) Initialization of CTC speech recognition with standard HMM
US7324941B2 (en) Method and apparatus for discriminative estimation of parameters in maximum a posteriori (MAP) speaker adaptation condition and voice recognition method and apparatus including these
KR20160066441A (en) Voice recognizing method and voice recognizing appratus
US20070185713A1 (en) Recognition confidence measuring by lexical distance between candidates
CN111429887B (en) Speech keyword recognition method, device and equipment based on end-to-end
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
US8738378B2 (en) Speech recognizer, speech recognition method, and speech recognition program
CN112331229B (en) Voice detection method, device, medium and computing equipment
CN113948066B (en) Error correction method, system, storage medium and device for real-time translation text
CN116127953B (en) Chinese spelling error correction method, device and medium based on contrast learning
CN111951825A (en) Pronunciation evaluation method, medium, device and computing equipment
CN115293138B (en) Text error correction method and computer equipment
CN115293139B (en) Training method of speech transcription text error correction model and computer equipment
CN115455946A (en) Voice recognition error correction method and device, electronic equipment and storage medium
CN110808049A (en) Voice annotation text correction method, computer device and storage medium
CN117409778B (en) Decoding processing method, device, equipment and storage medium
CN111477212B (en) Content identification, model training and data processing method, system and equipment
CN112259084A (en) Speech recognition method, apparatus and storage medium
CN115270771B (en) Fine-grained self-adaptive Chinese spelling error correction method assisted by word-sound prediction task
KR20180062859A (en) Speech recognition device and method thereof
JP4533160B2 (en) Discriminative learning method, apparatus, program, and recording medium on which discriminative learning program is recorded
CN111883109B (en) Voice information processing and verification model training method, device, equipment and medium
JP5447382B2 (en) Speech recognition hypothesis verification device, speech recognition device, method and program used therefor
CN114596843A (en) Fusion method based on end-to-end voice recognition model and language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant