CN115206324A - Speech recognition method and apparatus, computer readable storage medium - Google Patents

Speech recognition method and apparatus, computer readable storage medium

Info

Publication number
CN115206324A
Authority
CN
China
Prior art keywords
voice
speech
text
word
segment
Prior art date
Legal status
Pending
Application number
CN202110313911.XA
Other languages
Chinese (zh)
Inventor
孙宇嘉
陈家胜
耿杰
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202110313911.XA
Publication of CN115206324A


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training

Abstract

A speech recognition method and apparatus, and a computer-readable storage medium are disclosed. In an embodiment of the present application, a speech recognition method may include: acquiring a speech to be recognized; segmenting the speech to be recognized to obtain a plurality of speech segments, wherein the tail of the preceding segment in each pair of adjacent speech segments overlaps the head of the succeeding segment; acquiring attention data and an initial text segment for each of the plurality of speech segments by using an attention-based speech recognition model; extracting human voice data from the attention data of each speech segment; obtaining a corrected text segment for each speech segment according to the initial text segment, the human voice data, and the overlap duration of each speech segment; and splicing the corrected text segments of the plurality of speech segments to obtain the recognized text of the speech to be recognized. The method and apparatus can achieve continuous long-speech recognition with high accuracy without requiring high-complexity models such as voice activity detection (VAD).

Description

Speech recognition method and apparatus, computer readable storage medium
Technical Field
The present application relates to a speech recognition method and apparatus, and a computer-readable storage medium.
Background
With the development of research and engineering technology in the field of artificial intelligence, Automatic Speech Recognition (ASR) technology, as one of the technical means of human-computer interaction, is gradually entering people's daily lives. Speech recognition capability has become one of the indispensable capabilities of terminal devices such as mobile phones, smart speakers, in-vehicle systems, and televisions.
In practical applications, speech recognition can be further divided into short speech recognition and long speech recognition according to the application scenario. In long speech recognition, the runtime memory and computing power of the computing device are limited, and excessively long audio cannot be directly sent to a speech recognition engine for recognition. The conventional way to deal with this problem is to deploy an endpoint detection engine, i.e., a Voice Activity Detection (VAD) engine, alongside the speech recognition engine. However, the VAD model suffers from rigid audio segmentation, limited accuracy in recognizing speech endpoints, susceptibility to noise interference, and semantically fuzzy segmentation, so the accuracy of the long speech recognition result is low.
Disclosure of Invention
In view of the above problems of the related art, the present application provides a speech recognition method and apparatus, and a computer-readable storage medium, which can achieve long speech continuous recognition with higher accuracy without requiring a high-complexity model such as VAD.
To achieve the above object, a first aspect of the present application provides a speech recognition method, including:
acquiring a voice to be recognized;
segmenting the voice to be recognized to obtain a plurality of voice sections, wherein the tail of a front voice section in adjacent voice sections of the plurality of voice sections is overlapped with the head of a rear voice section;
acquiring attention data and initial text segments of each voice segment in the plurality of voice segments by using a voice recognition model based on an attention mechanism;
extracting voice data from the attention data of each voice section;
obtaining a corrected text section of each voice section according to the initial text section, the voice data and the overlapping duration of each voice section, wherein the text corresponding to the tail in the corrected text section of the previous voice section in the adjacent voice sections of the plurality of voice sections is the same as the text corresponding to the head in the corrected text section of the next voice section;
and splicing the corrected text sections of each of the plurality of voice sections to obtain the recognition text of the voice to be recognized.
Therefore, high-accuracy long-speech continuous recognition can be realized without high-complexity models such as VAD (voice activity detection) and the like.
As a possible implementation manner of the first aspect, the segmenting the speech to be recognized to obtain a plurality of speech segments specifically includes: segmenting the speech to be recognized according to a fixed window length and/or a fixed overlap duration to obtain the plurality of speech segments of equal duration and/or with overlapping portions of equal duration.

Therefore, the parallelism of speech segment processing can be improved, thereby improving the processing efficiency of the method of the embodiment of the application, and the risk that the speech recognition model crashes because a segmented audio is too long can be eliminated.
As a possible implementation manner of the first aspect, the attention-based speech recognition model is a model of an encoder-decoder structure, the model of the encoder-decoder structure includes an encoder and a decoder, an attention module is disposed in a plurality of decoding layers of the decoder, and the attention data is obtained through an attention matrix output by the attention module in a last layer of the plurality of decoding layers.
Therefore, better human voice characteristics can be extracted with less computation.
As a possible implementation manner of the first aspect, the attention-based speech recognition model is trained by multi-objective loss functions, and the multi-objective loss functions include at least one loss function with frame alignment capability.
Therefore, the speech recognition model has higher recognition accuracy, and meanwhile, the hidden state output by the encoder contains the frame alignment information, so that the attention data obtained by the decoder contains clearer human voice interval information, and the attention data can obtain high-accuracy human voice data.
As a possible implementation of the first aspect, the attention data has a word dimension and a frame dimension; the extracting of the vocal data from the attention data of each voice segment specifically includes:
traversing the attention data according to the word dimension to extract an attention vector of the word dimension;
obtaining a voice vector of the word dimension according to the attention vector of the word dimension and a preset threshold value;
and accumulating and summing numerical values in the voice vectors of the word dimension corresponding to each voice segment to obtain a voice sequence of each voice segment, wherein the voice sequence comprises voice information of each audio frame in the voice segment, and the voice information is used for indicating whether the audio frame belongs to voice or does not belong to voice.
Therefore, the voice data with the data granularity being the audio frame can be extracted without high-complexity models such as VAD models, the continuity and the accuracy of voice recognition can be ensured, the problem of false recognition of the VAD models in noise scenes is avoided, and the development and maintenance cost of the VAD models is reduced.
As a possible implementation manner of the first aspect, the human voice vector of the word dimension is obtained by the following formula:
M_s[l, t] = 1, if M_a[l, t] ≥ thred_a; M_s[l, t] = 0, otherwise
wherein thred_a represents the threshold value, M_s[l, t] represents the value for audio frame t in the human voice vector of word dimension l, and M_a[l, t] represents the value for audio frame t in the attention vector of word dimension l.
Thus, the human voice vector can be obtained from the attention vector by simple threshold judgment.
As a possible implementation manner of the first aspect, the obtaining a modified text segment of each speech segment according to the initial text segment, the voice data, and the overlap length of each speech segment specifically includes:
for each pair of adjacent speech segments of the plurality of speech segments, performing the steps of:
extracting overlapped texts from the initial text segments of the adjacent speech segments, wherein the overlapped texts comprise the overlapped text of the preceding speech segment and the overlapped text of the succeeding speech segment in the adjacent speech segments, the overlapped text of the preceding speech segment corresponds to the tail portion, whose length is the overlap duration, of the human voice data, and the overlapped text of the succeeding speech segment corresponds to the head portion, whose length is the overlap duration, of the human voice data;
aligning the overlapped text of the preceding speech segment with the overlapped text of the succeeding speech segment to obtain aligned text of the adjacent speech segments, the aligned text comprising the aligned text of the preceding speech segment and the aligned text of the succeeding speech segment;
obtaining a corrected text of the adjacent voice sections according to the confidence degrees of the words in the aligned text of the previous voice section and the confidence degrees of the words in the aligned text of the later voice section, wherein the corrected text of the previous voice section is the same as the corrected text of the later voice section;
and obtaining a corrected text section of a previous speech section and a corrected text section of a next speech section in the adjacent speech sections by using the corrected texts of the adjacent speech sections.
Therefore, the recognition accuracy of the segmentation boundary of the voice sections can be improved by aligning and correcting the recognition results of the voice section overlapping regions, and the overall recognition accuracy of the voice to be recognized is further improved.
As a possible implementation manner of the first aspect, the confidence level of the word includes at least one of: a frame alignment confidence for the word, an attention confidence for the word, a language confidence for the word, and a location confidence for the word.
Thus, it is possible to correct a text by combining various factors such as the performance of a speech recognition model, language logic, and word position.
As a possible implementation manner of the first aspect, obtaining the corrected texts of the adjacent speech segments according to the confidence degrees of the words in the aligned text of the previous speech segment and the confidence degrees of the words in the aligned text of the subsequent speech segment, specifically includes: obtaining the corrected texts of the adjacent voice sections according to the comprehensive scores of the words in the aligned texts of the previous voice sections and the comprehensive scores of the words in the aligned texts of the next voice sections; wherein the comprehensive score of the word is determined by taking the position confidence of the word as a penalty item.
Therefore, the position confidence of the character is introduced as a punishment item of the comprehensive character score, and the influence of poor recognition effect at the end point of the phrase sound on the overall recognition precision of the long voice can be corrected.
As a possible implementation manner of the first aspect, the position confidence of the word is calculated by the following formula:
Posscore=-β|l-L/2|
wherein Posscore represents the position confidence value of the word, l represents the position of the word in the aligned text segment, L represents the number of words contained in the aligned text segment, and β represents a preset position weight; the aligned text segment is the text segment obtained by replacing the overlapped text in the initial text segment with the aligned text.
Therefore, the position confidence coefficient is represented by a negative number, so that certain punishment can be carried out on the recognition result close to the segmentation boundary, the condition that the voice recognition result at the segmentation boundary is poor can be made up, and the overall recognition accuracy of the voice recognition result to be recognized is effectively improved.
As a possible implementation manner of the first aspect, the composite score of the word is obtained by calculating according to the following formula:
Jointscore=α×CTCscore+λ×Attscore+η×LMscore+Posscore
wherein Jointscore represents the composite score of a word, CTCscore represents the frame alignment confidence of the word, Attscore represents the attention confidence of the word, LMscore represents the language confidence value of the word, Posscore represents the position confidence value of the word, α represents the weight of the frame alignment confidence, λ represents the weight of the attention confidence, and η represents the weight of the language confidence.
Therefore, the comprehensive score of the word can be determined by combining various factors such as frame alignment, attention, language, word position and the like through addition and multiplication by utilizing the preset weight, and the method is low in hardware cost and easy to implement.
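For illustration, the following is a minimal Python sketch of how the position confidence and the composite score could be computed for a word in an aligned text segment; the weight values α, λ, η, and β used here are assumptions for demonstration only and are not values prescribed by the present application.

```python
# Illustrative sketch (not the reference code of the present application):
# position confidence Posscore = -beta * |l - L/2| and composite score
# Jointscore = alpha*CTCscore + lambda*Attscore + eta*LMscore + Posscore.
# The weight values below are assumptions for demonstration only.

def position_confidence(l: int, L: int, beta: float = 0.1) -> float:
    """Penalty that grows as the word position l moves away from the middle
    of an aligned text segment containing L words."""
    return -beta * abs(l - L / 2)

def joint_score(ctc_score: float, att_score: float, lm_score: float,
                l: int, L: int,
                alpha: float = 0.3, lam: float = 0.5, eta: float = 0.2,
                beta: float = 0.1) -> float:
    """Composite score of a word: weighted sum of the frame-alignment (CTC),
    attention and language confidences, plus the position penalty."""
    return (alpha * ctc_score + lam * att_score + eta * lm_score
            + position_confidence(l, L, beta))

# Example: a word at the segmentation boundary (l = 0) is penalised more
# heavily than a word in the middle of the aligned segment (l = L // 2).
L = 20
print(joint_score(0.8, 0.9, 0.7, l=0, L=L))
print(joint_score(0.8, 0.9, 0.7, l=L // 2, L=L))
```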
As a possible implementation manner of the first aspect, the obtaining the modified texts of the adjacent speech segments according to the comprehensive score of the words in the aligned text of the preceding speech segment and the comprehensive score of the words in the aligned text of the succeeding speech segment specifically includes:
adjusting each word in the aligned text in adjacent speech segments by:
Uri[l] = UPAi[l], if Jointscore(UPAi[l]) ≥ Jointscore(UPAi+1[l]); Uri[l] = UPAi+1[l], otherwise
where Uri[l] (i = 1, ..., N-1) represents the word at position l in the corrected text of the speech segment Ai, N is the number of the speech segments, UPAi[l] represents the word at position l in the aligned text of the speech segment Ai, UPAi+1[l] represents the word at position l in the aligned text of the speech segment Ai+1, Jointscore(UPAi[l]) represents the composite score of the word at position l in the aligned text of the speech segment Ai, and Jointscore(UPAi+1[l]) represents the composite score of the word at position l in the aligned text of the speech segment Ai+1; positions introduced by the alignment are filled with a placeholder symbol.
Therefore, text correction can be realized only by adjusting the text of the overlapped part in the adjacent voice sections, the recognition accuracy of the segmentation boundary of the voice sections is improved, the overall recognition accuracy of the voice to be recognized is improved, the calculation amount is small, the calculation complexity is low, and the hardware cost is reduced and the processing efficiency is improved.
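The word-level adjustment described above can be sketched as follows; representing the aligned texts and their per-word composite scores as parallel Python lists is an assumption made purely for illustration.

```python
# Illustrative sketch: adjusting the aligned overlap of two adjacent speech
# segments Ai and Ai+1 by keeping, at each position l, the word whose
# composite score is higher. The data layout (parallel lists of words and
# scores) is an assumption for demonstration only.

from typing import List

def correct_overlap(aligned_prev: List[str], scores_prev: List[float],
                    aligned_next: List[str], scores_next: List[float]
                    ) -> List[str]:
    """Return the corrected overlap text shared by both adjacent segments."""
    assert len(aligned_prev) == len(aligned_next)
    corrected = []
    for w_prev, s_prev, w_next, s_next in zip(aligned_prev, scores_prev,
                                              aligned_next, scores_next):
        # Keep the word from whichever segment scores this position higher.
        corrected.append(w_prev if s_prev >= s_next else w_next)
    return corrected

# Example: the corrected overlap replaces the tail of Ui and the head of Ui+1,
# so the two corrected text segments agree on their shared region.
print(correct_overlap(["深", "度", "学", "习"], [0.9, 0.4, 0.8, 0.7],
                      ["深", "度", "学", "席"], [0.8, 0.6, 0.7, 0.3]))
```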
As a possible implementation manner of the first aspect, the method further includes: and obtaining the confidence coefficient of the recognized text of the speech to be recognized according to the confidence coefficient of the words in the corrected text segment.
Therefore, the confidence of the recognized text of the speech to be recognized can be provided while the recognized text is provided for the user, and the user can generate the transcribed text of the speech by referring to the confidence.
As a possible implementation manner of the first aspect, the method further includes: and obtaining the voice confidence of each voice section in the plurality of voice sections by using the attention data.
Therefore, the voice data of the voice to be recognized can be provided for the user or the device thereof, and meanwhile, the corresponding voice confidence coefficient is provided, so that the user or the device thereof can generate the transcription text of the voice by referring to the voice confidence coefficient and the voice data.
A second aspect of the present application provides a computing device comprising: at least one processor; and at least one memory storing program instructions that, when executed by the at least one processor, cause the at least one processor to perform the speech recognition method described above.
A third aspect of the present application provides a computer-readable storage medium having stored thereon program instructions that, when executed by a computer, cause the computer to execute the above-described speech recognition method.
These and other aspects of the invention will be apparent from and elucidated with reference to the embodiment(s) described hereinafter.
Drawings
The various features and the connections between the various features of the present application are further described below with reference to the drawings. The figures are exemplary, some features are not shown to scale, and some of the figures may omit features that are conventional in the art to which the application relates and are not essential to the application, or show additional features that are not essential to the application, and the combination of features shown in the figures is not intended to limit the application. In addition, the same reference numerals are used throughout the specification to designate the same components. The specific drawings are illustrated as follows:
FIG. 1 is a flow chart of a speech recognition method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an exemplary implementation process of a speech recognition method according to an embodiment of the present application;
FIG. 3 is an exemplary illustration of fixed-window-length overlapping segmentation of speech in an embodiment of the present application;
FIG. 4 is an exemplary network structure of a speech recognition model in a speech recognition method according to an embodiment of the present application;
FIG. 5 is a diagram illustrating attention data in a speech recognition method according to an embodiment of the present application;
FIG. 6 is a schematic view of an exemplary flow chart of extracting human voice data in a voice recognition method according to an embodiment of the present application;
FIG. 7 is a flowchart illustrating an exemplary process of aligning and correcting an initial text segment in a speech recognition method according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a computing device according to an embodiment of the present application.
Detailed Description
The terms "first, second, third, etc. in the description and in the claims, or the like, may be used solely to distinguish one from another and are not intended to imply a particular order to the objects, but rather are to be construed in a manner that permits interchanging particular sequences or orderings where permissible such that embodiments of the present application may be practiced otherwise than as specifically illustrated or described herein.
In the following description, reference numbers indicating steps, such as S110, S120, etc., do not necessarily indicate that the steps are executed in this order, and the order of the steps may be interchanged, or executed simultaneously, where the case allows.
The term "comprising" as used in the specification and claims should not be construed as being limited to the contents listed thereafter; it does not exclude other elements or steps. It should therefore be interpreted as specifying the presence of the stated features, integers, steps or components as referred to, but does not preclude the presence or addition of one or more other features, integers, steps or components, and groups thereof. Thus, the expression "an apparatus comprising the devices a and B" should not be limited to an apparatus consisting of only the components a and B.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the application. Thus, appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment, but may. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments, as would be apparent to one of ordinary skill in the art from this disclosure.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. In the case of inconsistency, the meaning described in the present specification or the meaning derived from the content described in the present specification shall control. In addition, the terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
To accurately describe the technical contents in the present application and to accurately understand the present application, the terms used in the present specification are given the following explanations or definitions before the description of the specific embodiments.
Long speech, i.e., longer speech, is speech that the ASR model cannot directly recognize within the limits of the computing device's available runtime memory and computing power. For example, speech of 10 seconds or more.
And (4) long voice recognition, namely recognition of long voice. For example, speech recognition for audio of tens of minutes, or even hours, such as conference recordings, telephone recordings, or movies, television audio, and the like.
Short speech, i.e., shorter speech, is speech that the ASR model can directly recognize within the limits of the computing device's available runtime memory and computing power. For example, speech within 10 seconds.
The short voice recognition is the recognition of short voice, and the voice text content is mainly the vertical field content such as calling, playing programs, playing music, navigating and the like. Voice assistants for cell phones, speakers, and televisions are a typical short voice recognition application.
ASR model, a machine learning model that is capable of recognizing short speech as text.
The VAD model identifies the voice starting point and the voice finishing point of the long voice by using an end point detection algorithm, further divides the long voice into a plurality of continuous phrase voice sections, can realize the discrimination of the voice section and the non-voice section, but ignores the semantic integrity of the long voice. When a pause occurs in the user voice due to thinking or hesitation, the VAD model can wrongly identify the pause of the user as the end point of the voice, so that a complete voice is divided into two short voice paragraphs with incomplete semantics, and the voice identification effect of the short voice paragraphs is reduced.
Endpoint detection algorithm, which includes short-time energy and short-time zero-crossing rate analysis methods, or neural-network-based methods for classifying speech and non-speech. The short-time energy and short-time zero-crossing rate method involves the selection of multiple thresholds, requires considerable manpower and time for tuning to achieve a good effect, and the thresholds differ greatly across usage scenarios. For the neural-network-based speech/non-speech classification method, the detection precision depends to a large extent on the training corpus of the model; when the actual deployment scenario of the model differs greatly from the corpus, it is difficult for the endpoint detection algorithm to achieve effective precision.
Shuffling algorithm (Fisher), one of dynamic programming algorithms.
The Connectionist Temporal Classification (CTC) loss function, a loss function with frame alignment capability, can automatically align unaligned data and is mainly used for training on serialized data that has not been aligned in advance, such as speech recognition and Optical Character Recognition (OCR).
The Transducer loss function, a loss function with frame alignment capability, is commonly used for sequence-to-sequence model training.
The KL divergence (Kullback-Leibler divergence) loss function calculates the KL divergence between the input and the target value. The KL divergence is an index for measuring the degree of matching between two probability distributions; the closer the two probability distributions are, the smaller the KL divergence.
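For reference, for discrete probability distributions P and Q, the KL divergence computed by such a loss function has the standard form

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

which is non-negative and equals zero only when P and Q are identical.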
A neural network model of an encoder-decoder structure comprises an encoder (also called a coding module or coding network) and a decoder (also called a decoding module or decoding network), for example a recurrent-neural-network-based sequence-to-sequence model in which the input of the next layer depends on the output of the previous layer.
The attention (attention) mechanism determines the attention probability distribution (i.e. attention distribution) of the decoder output according to the matching degree of the current input and output of the model such as the decoder, and the higher the matching degree, the higher the score of the attention distribution concentration point is. The attention mechanism is used more as a component of neural networks.
Multi-headed attention, using multiple queries to select multiple sets of information in parallel from input information, each focused on a different portion of the input information.
The Transformer module is a neural network model built based on an attention mechanism, can be an encoder-decoder structure based on a multi-head attention mechanism, and can also only contain one of an encoder and a decoder.
The human voice data indicates which audio frames in a piece of speech belong to human voice and which do not. It comprises a plurality of pieces of human voice information, each piece corresponding to one audio frame in the speech segment, and the value of the human voice information can be used to indicate whether the corresponding audio frame belongs to human voice.
The first related art is as follows: a continuous voice man-machine interaction method and a system thereof disclose the technical scheme as follows: receiving a continuous voice signal input by a user; segmenting the continuous voice signal into a plurality of short voices based on a long voice segmentation technology; recognizing a plurality of short voices based on the dynamic language model, and generating a corresponding recognition result for each short voice; performing semantic completion on the recognition result based on a context semantic analysis technology, and generating a semantic completion result; and generating a question-answer sentence corresponding to the semantic completion result based on the dialogue management technology.
The defects of the first related art include: it relies on a VAD model to segment the long speech, the precision of the VAD model is limited, and the audio segmentation is rigid, so the overall precision of the speech recognition result under this scheme is poor; moreover, whole sentences of the long speech are frequently segmented in the wrong place, so the recognition accuracy of each short speech is also poor.
The second related art is: a long voice continuous recognition and real-time recognition result feedback method and system, the disclosed technical scheme is: increasing the expansion possibility from the end point of the sentence to the starting point of the recognition in the recognition network; in the decoding process, the identification path can generate a single sentence or a plurality of continuous sentences during expansion, and a large space voice signal is segmented through the overall optimization of acoustics and language probability; regularly detecting the common part of the optimal historical paths of all active nodes; obtaining a recognition word sequence which is fixed at the current moment; feeding back the updated local identification result to the user in real time; and recovering the decoding space corresponding to the determined identification part.
The defects of the related art II comprise: on the basis of the traditional speech recognition method based on the hidden Markov model, a decoding network is improved, the semantics, the intonation and the mute duration of each frame are fused to recognize speech endpoints, and the method is no longer applicable to an end-to-end deep learning speech recognition model (such as ASR).
The third related art: due to the limitation of the runtime memory and computing power of the computing device, a long speech signal cannot be directly input into the ASR model for speech recognition. To address this problem, an endpoint detection engine, i.e., a VAD engine, may be deployed at the same time as the speech recognition engine. The VAD engine can recognize the speech start points and speech end points in the long speech and further divide the long speech into a plurality of continuous short speech paragraphs; the obtained short speech paragraphs are sent into the ASR engine one by one for speech recognition, and finally the texts recognized by the ASR engine are spliced together to obtain the recognition result of the long speech.
The defects of the third related art mainly include the following three items:
1) The voice endpoint recognition accuracy of the VAD engine is limited, and more manpower and time are needed for tuning and optimizing to obtain a better effect.
2) The VAD engine realizes the discrimination of the voice section and the non-voice section, and ignores the semantic integrity in the voice. When a pause occurs in a long voice due to thinking or hesitation of a user, the VAD engine often mistakenly recognizes the pause of the user as an end point of the voice, and then divides a complete long voice into two paragraphs with incomplete semantics, and the short voice with incomplete semantics is input into the ASR engine, so that the recognition result is generally poor.
3) Besides the problems of limited recognition precision and fuzzy semantic segmentation, the VAD model is also easily interfered by noise, which also has adverse effect on the accuracy of the long voice recognition result.
In view of this, a basic idea of an embodiment of the present application is to provide a speech recognition method and apparatus, and a computer-readable storage medium, where overlapping segmentation is performed on a speech to be recognized to obtain multiple speech segments with segments overlapping each other, attention data and initial text segments of the multiple speech segments are obtained through a speech recognition model based on an attention mechanism, vocal data of the speech segments are extracted through the attention data of the speech segments, alignment correction is performed on each initial text segment based on the vocal data and overlapping duration to obtain a corrected text segment, and finally, a recognition text of the speech to be recognized is obtained by using the corrected text segment. Therefore, the voice segmentation can be realized and the voice data can be obtained without depending on a VAD model, the problem of low identification accuracy of the end point caused by segmentation is solved through the combination of overlapped segmentation and character correction, the accuracy of long voice continuous identification is effectively improved, and the character accuracy of the embodiment of the application can reach 89% through experimental verification. Therefore, the embodiment of the application can realize the long voice continuous recognition with higher accuracy without VAD model.
The embodiments of the present application are suitable for various scenarios in which continuous speech needs to be converted into text. They are particularly suitable for scenarios in which very long audio data needs to be automatically converted into text, thereby replacing or reducing manual transcription of the audio. For example, the embodiments of the application can be applied to transcription scenarios such as conference recordings, court trial recordings, and interview recordings, as well as to application scenarios such as audio and video subtitle generation, and can effectively improve the efficiency of long-speech-to-text transcription in these scenarios.
Fig. 1 shows an exemplary flow of a speech recognition method provided by an embodiment of the present application. Referring to fig. 1, an exemplary speech recognition method of an embodiment of the present application may include the steps of:
step S101, obtaining a voice to be recognized;
step S102, segmenting the voice to be recognized to obtain a plurality of voice sections, wherein the tail part of the front voice section in the adjacent voice sections of the plurality of voice sections is overlapped with the head part of the rear voice section;
step S103, acquiring attention data and initial text segments of each voice segment in a plurality of voice segments by using a voice recognition model based on an attention mechanism;
step S104, extracting voice data from the attention data of each voice segment;
step S105, obtaining a corrected text segment of each voice segment according to the initial text segment, the voice data and the overlapping duration of each voice segment, wherein the text corresponding to the tail in the corrected text segment of the previous voice segment in the adjacent voice segments of the plurality of voice segments is the same as the text corresponding to the head in the corrected text segment of the next voice segment;
and S106, splicing the corrected text segments of each of the plurality of voice segments to obtain the recognition text of the voice to be recognized.
The speech recognition method of the embodiment of the application can realize high-accuracy continuous long speech recognition without high-complexity models such as VAD models. Experiments show that the character accuracy rate of the method in the embodiment of the application can reach 89%, whereas the character accuracy rate of the conventional method is only 72%.
The embodiment of the application is not only suitable for the transcription of long voice, but also suitable for the transcription of short voice. That is, the "speech to be recognized" herein may be a long speech or a short speech.
All kinds of segmentation modes can be applicable to the embodiment of the application, and only a plurality of voice segments obtained by segmentation need to be overlapped at the segmentation position. The method has the advantages that the texts at the corresponding segmentation boundaries in the initial text segments of the adjacent voice segments can be overlapped through an overlapped segmentation mode, so that the continuity of voice recognition can be ensured, the overlapped parts of the initial text segments can be used for aligning and correcting the initial text segments, the condition of poor recognition accuracy at the segmentation boundaries of the voice segments is effectively improved, and the accuracy of the voice recognition is improved while the voice continuous recognition is realized.
In some embodiments, the speech to be recognized may be segmented according to a predetermined window length and a predetermined overlap duration in step S102. The predetermined window length may be a preset fixed value or variable, which may be dynamically adjusted according to different application scenarios, characteristics of the speech to be recognized (e.g., length, language, type of utterer, etc.), hardware performance (e.g., memory size, processor performance, memory capacity, read/write performance, etc.), model performance (e.g., the longest speech length that can be processed by the speech recognition model at a time), and/or any other factors related to speech recognition. Similarly, the predetermined overlap period may also be a fixed value or variable that is preset and may be determined based on the predetermined window length in combination with the above factors.
In one implementation, the predetermined window length and the predetermined overlap duration may be determined by the performance of a speech recognition model (e.g., an ASR model) and the audio scene. The values of the predetermined window length and the predetermined overlap period may be determined using various methods applicable to a speech recognition scenario. For example, it can be obtained by a method of performing a grid search (grid search) experiment on a small sample. For another example, the empirical value may be set directly.
In one implementation manner, in step S102, the speech to be recognized may be segmented according to a fixed window length and/or a fixed overlap duration, so as to obtain a plurality of speech segments of equal duration and/or with overlapping portions of equal duration. In this embodiment, speech segments of equal length and/or equal overlap duration can be obtained by adopting a segmentation mode with a fixed length or a fixed overlap duration, so the parallelism of speech segment processing can be improved, the processing efficiency of the method in the embodiment of the application can be further improved, and the risk that the speech recognition model crashes because a segmented audio is too long can be eliminated.
In step S103, the attention-based speech recognition model may be any sequence-to-sequence model applicable to the embodiments of the present application. In some embodiments, the speech recognition model may be an attention-based neural network model. In some implementations, the speech recognition model can be, but is not limited to, a recurrent neural network model based on an attention mechanism. In this implementation, the speech recognition model may be a model of an encoder-decoder structure, which includes an encoder and a decoder; an attention module is disposed in the decoder, which may have a single-layer or multi-layer structure and may employ a multi-head or single-head attention mechanism. Alternatively, the speech recognition model may also employ an encoder-only structure or any other network structure that is applicable to speech recognition. The specific implementation of the speech recognition model can be found in the related examples below. It is to be understood that the description here and in the following detailed description regarding speech recognition models is by way of example only and is not intended to limit the present application. In practice, the speech recognition model may be any model with speech recognition capability and is not limited to neural networks.
The attention-based speech recognition model may be trained by a predefined loss function.
In some embodiments, the attention-based speech recognition model may be trained by multi-objective loss functions (i.e., hybrid loss functions) including at least one loss function with frame alignment capability, which not only helps to improve the recognition accuracy of the speech recognition model, but also simultaneously improves the frame alignment capability of the speech recognition model to obtain more accurate human voice data (e.g., human voice data at audio frame granularity as described below).
Taking an encoder-decoder model based on a multi-head attention mechanism as an example, an encoder and a decoder in a speech recognition model can adopt different loss functions, and the loss function of the encoder can adopt a loss function with a forced alignment characteristic, so that the speech recognition model has higher recognition accuracy, and meanwhile, a hidden state output by the encoder contains frame alignment information, so that attention data obtained by the decoder contains clearer human voice interval information, and high-accuracy human voice data can be obtained by the attention data. In some examples, the decoder may employ a KL divergence loss function and the encoder may employ a CTC loss function or a Transducer loss function, as described in more detail below with respect to specific embodiments.
In some embodiments, the speech recognition model may be trained using a single loss function. For example, the single loss function may be, but is not limited to, a Transducer loss function, a KL divergence loss function, a CTC loss function, or others.
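As an illustration of such multi-objective training, the following is a PyTorch-style sketch in which the encoder branch is trained with a CTC loss (which has frame alignment capability) and the decoder branch with a KL-divergence loss against a label-smoothed target distribution; the function names, the label-smoothing scheme, and the interpolation weight ctc_weight are assumptions for demonstration, not the training recipe of the present application.

```python
# Illustrative sketch of multi-objective (hybrid loss) training, not the
# reference code of the present application: CTC loss on the encoder output
# plus KL-divergence loss on the decoder output, combined with an assumed
# interpolation weight ctc_weight.

import torch
import torch.nn as nn
import torch.nn.functional as F

ctc_loss_fn = nn.CTCLoss(blank=0, zero_infinity=True)
kl_loss_fn = nn.KLDivLoss(reduction="batchmean")

def hybrid_loss(encoder_log_probs,   # (T, B, V) log-probs from the encoder head
                feat_lengths,        # (B,) number of frames per utterance
                targets,             # (B, U) target token ids (LongTensor)
                target_lengths,      # (B,) target lengths
                decoder_logits,      # (B, U, V) logits from the decoder
                label_smoothing=0.1,
                ctc_weight=0.3):
    vocab = decoder_logits.size(-1)
    # 1) CTC loss on the encoder output enforces frame alignment.
    ctc = ctc_loss_fn(encoder_log_probs, targets, feat_lengths, target_lengths)
    # 2) KL divergence between the decoder distribution and a label-smoothed
    #    one-hot target distribution.
    smooth = torch.full_like(decoder_logits, label_smoothing / (vocab - 1))
    smooth.scatter_(-1, targets.unsqueeze(-1), 1.0 - label_smoothing)
    kl = kl_loss_fn(F.log_softmax(decoder_logits, dim=-1), smooth)
    # 3) Weighted combination of the two objectives.
    return ctc_weight * ctc + (1.0 - ctc_weight) * kl
```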
In step S103, each speech segment is subjected to speech recognition by the speech recognition model to obtain an initial text segment. Meanwhile, an attention module in the speech recognition model obtains attention data of each speech segment. The attention data may indicate the probability that the word corresponding to each audio frame in the speech segment is each word in the preset word list, and the higher the probability value is, the higher the probability that the word corresponding to the audio frame is the corresponding word in the preset word list is, and the higher the probability that the audio frame belongs to human voice is. That is, the attention data includes the vocal features of each audio frame in the speech segment.
The attention data may be a matrix having a word dimension, which may be equal to the number of words in the preset vocabulary, and a frame dimension, which may be equal to the number of audio frames in the speech segment. If the fixed window length segmentation is adopted in step S102, the lengths of the voice segments are the same, and the number of dimensions of the attention data of each voice segment is also the same, which may improve the parallelism of the method according to the embodiment of the present application.
Where the speech recognition model employs a single-headed attention mechanism, the attention data may be an attention matrix obtained based on the single-headed attention mechanism. In the embodiment, the attention data is obtained by adopting a single-head attention mechanism, the data volume is small, the operation is less, the calculation complexity is lower, the requirement on the hardware performance is lower, and the hardware cost can be reduced under the condition that the overall processing efficiency of the method in the embodiment of the application is improved.
When the speech recognition model adopts a multi-head attention mechanism, the attention data may be a matrix obtained by performing an operation (e.g., averaging) on a set of attention matrices obtained by the multi-head attention mechanism; the number of attention matrices in the set is equal to the number of heads of the multi-head attention mechanism, and each attention matrix may include the human voice features of the speech segment in a different aspect. In this embodiment, the attention data is obtained by the multi-head attention mechanism, and the human voice features of each audio frame of the speech segment can be extracted in multiple aspects through the plurality of attention matrices, so the obtained attention data can relatively comprehensively and accurately characterize the human voice features of the speech segment. Human voice data with higher accuracy can therefore be obtained, the accuracy of text correction is improved, and the accuracy of the recognition result of the speech to be recognized is further improved.
When attention modules are provided at multiple layers of the speech recognition model, the attention data can be obtained from the attention matrix output by the attention module of a selected layer. The selected layer may be any intermediate layer or the final layer of the multiple layers in the speech recognition model. The inventor has found through experiments that, in a multi-layer network structure, the features extracted from the last layer of the network have better human voice characteristics than those of other layers. In view of this, in an example, when the speech recognition model adopts an encoder-decoder structure and a plurality of decoding layers in its decoder are provided with attention modules, the selected layer may be the last layer of the plurality of decoding layers, that is, the attention data may be obtained through the attention matrix output by the attention module of the last decoding layer. Here, if the attention module of the selected layer employs a single-head attention mechanism, the attention data may be the attention matrix obtained by that attention module; if the attention module of the selected layer employs a multi-head attention mechanism, the attention data may be a matrix obtained by performing an operation (e.g., averaging) on the plurality of attention matrices obtained by that attention module.
In step S104, the human voice data may be obtained by: step a1, traversing the attention data according to word dimensions to extract attention vectors of the word dimensions; step a2, obtaining a voice vector of a word dimension according to the attention vector of the word dimension and a preset threshold value; and a3, accumulating and summing numerical values in the voice vectors of the word dimension corresponding to each voice section to obtain a voice sequence of each voice section, wherein the voice sequence comprises voice information of each audio frame in the voice section, and the voice information is used for indicating whether the audio frame belongs to voice or does not belong to voice. Therefore, the voice data with the data granularity being the audio frame can be extracted without high-complexity models such as VAD models, the continuity and the accuracy of voice recognition can be ensured, the problem of false recognition of the VAD models in noise scenes is avoided, and the development and maintenance cost of the VAD models is reduced.
The threshold in step a2 may be a hyper-parameter, and the specific value thereof may be determined through experiments. The specific implementation manner of step a2 and the threshold may refer to the following description of the embodiments, and are not described herein again. It should be noted that the specific implementation manner of step S104 is not limited to this, and any human voice data extraction method applicable to the embodiment of the present application may be used to implement step S104.
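A minimal sketch of steps a1 to a3, assuming the attention data of one speech segment is available as a NumPy matrix of shape (number of words, number of frames); the threshold value used here and the final binarization of the accumulated sequence are assumptions for illustration, not values or steps prescribed by the present application.

```python
# Illustrative sketch of steps a1-a3 (one assumed implementation): each
# word-dimension row of the attention data is thresholded into a binary human
# voice vector, and the rows are then accumulated over the word dimension to
# obtain a per-frame human voice sequence.

import numpy as np

def extract_voice_sequence(attention: np.ndarray, thred_a: float = 0.05) -> np.ndarray:
    """attention: (L_words, T_frames) attention data of one speech segment.
    Returns a (T_frames,) 0/1 sequence; 1 means the frame belongs to human voice."""
    # Steps a1/a2: traverse the word dimension and threshold each attention vector.
    voice_vectors = (attention >= thred_a).astype(np.int32)   # (L, T)
    # Step a3: accumulate over the word dimension; a frame attended to by at
    # least one word is marked as human voice (binarization is an assumption).
    voice_sequence = voice_vectors.sum(axis=0)
    return (voice_sequence > 0).astype(np.int32)

# Example with random attention data, just to exercise the function:
# a segment of 800 audio frames and 30 output words.
rng = np.random.default_rng(0)
att = rng.random((30, 800)) * 0.1
print(extract_voice_sequence(att, thred_a=0.05)[:20])
```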
The inventor has found through analysis and experiments that the fixed-window-length segmentation may lose the voice endpoint information, and the voice recognition result at the segmentation may be poor. Therefore, the method of the embodiment of the present application solves these problems by aligning and correcting the text of the overlapping portions of the speech segments in step S105.
In practical applications, step S105 can be implemented by various methods suitable for the embodiments of the present application. In some embodiments, an exemplary implementation of step S105 may include: for each pair of adjacent speech segments of the plurality of speech segments, performing the steps of: step b1, extracting overlapped texts from initial text sections of adjacent speech sections, wherein the overlapped texts comprise overlapped texts of a previous speech section and overlapped texts of a later speech section in the adjacent speech sections, the overlapped texts of the previous speech section correspond to the tail part of the overlapped duration with the length in the human voice data, and the overlapped texts of the later speech section correspond to the head part of the human voice data with the length of the overlapped duration; b2, aligning the overlapped texts of the previous voice segment and the subsequent voice segment to obtain aligned texts of the adjacent voice segments, wherein the aligned texts comprise the aligned texts of the previous voice segment and the aligned texts of the subsequent voice segment; b3, obtaining the corrected texts of the adjacent speech sections according to the confidence degrees of the words in the aligned texts of the previous speech section and the confidence degrees of the words in the aligned texts of the later speech section, wherein the corrected texts of the previous speech section are the same as the corrected texts of the later speech section; and b4, obtaining the corrected text segment of the previous speech segment and the corrected text segment of the next speech segment in the adjacent speech segments by using the corrected texts of the adjacent speech segments. Therefore, the text alignment is assisted through the voice data, the integrity of the long voice recognition is guaranteed, and meanwhile the manual verification efficiency of the long voice recognition text is improved. Moreover, the recognition results of the voice segment overlapping regions are aligned and corrected, so that the recognition accuracy of the segmentation boundary of the voice segments is improved, and the overall recognition accuracy of the voice to be recognized is further improved.
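The alignment of step b2 can be sketched as follows, assuming a simple dynamic-programming style matcher and a placeholder symbol for positions at which one of the two overlap texts has no word; this is an illustrative assumption rather than the specific alignment algorithm of the present application.

```python
# Illustrative sketch of step b2 (an assumed implementation): the overlapped
# texts of two adjacent segments are aligned, and positions missing in one
# text are padded with a placeholder so both aligned texts have equal length.

from difflib import SequenceMatcher

PLACEHOLDER = "*"  # assumed placeholder symbol for alignment gaps

def align(prev_text: str, next_text: str):
    """Return equal-length aligned versions of the two overlap texts."""
    aligned_prev, aligned_next = [], []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, prev_text, next_text).get_opcodes():
        a, b = list(prev_text[i1:i2]), list(next_text[j1:j2])
        # Pad the shorter side so both aligned texts keep the same length.
        width = max(len(a), len(b))
        aligned_prev += a + [PLACEHOLDER] * (width - len(a))
        aligned_next += b + [PLACEHOLDER] * (width - len(b))
    return aligned_prev, aligned_next

# Example (illustrative output):
prev_aligned, next_aligned = align("今天天气真好", "天天气真好呀")
print(prev_aligned)  # e.g. ['今', '天', '天', '气', '真', '好', '*']
print(next_aligned)  # e.g. ['*', '天', '天', '气', '真', '好', '呀']
```

The corrected overlap can then be chosen word by word from the aligned texts according to the word confidences, in the manner sketched earlier for the composite-score-based adjustment.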
The confidence level of a word may include, but is not limited to, a speech recognition confidence level, a language confidence level, and/or a location confidence level of the word. In practical applications, the specific content of the word confidence may be pre-configured according to different application scenarios, the characteristics of the speech to be recognized, user requirements, or other various factors. Therefore, the initial text segment is corrected by accurately obtaining the confidence degree of the word dimension in the embodiment of the application, and the word accuracy of the recognized text of the speech to be recognized can be obviously improved.
The speech recognition confidence level refers to a confidence level associated with the speech recognition model, which is associated with the structure of the speech recognition model. Taking the speech recognition model of the encoder-decoder structure above as an example, the speech recognition confidence level may include one or both of a frame alignment confidence level and an attention confidence level, and the frame alignment confidence level may be obtained by a frame alignment evaluation module integrated in the speech recognition model, which may be a CTC network, a predefined function, a pre-trained mapping relationship, or other various forms. Similarly, the attention confidence may be obtained by a recognition text evaluation module integrated in the speech recognition model, and the recognition text evaluation module may be a neural network module, a predefined function, a pre-trained mapping relationship, or any other form. The frame alignment evaluation module and the recognition text evaluation module can be obtained by training together with the speech recognition model.
The Language confidence is a confidence related to Natural Language logic, and the Language confidence can be obtained by performing semantic parsing on the initial text segment through a Language parsing module or a Language model integrated in the Language recognition module, and the Language model or the Language parsing module can be, but is not limited to, a Natural Language Understanding (NLU) model or other models with Natural Language Understanding capability.
The position confidence is a confidence associated with the position of the word in the initial text segment, and may be determined based on the length of the initial text segment after alignment (i.e., the initial text segment obtained by replacing overlapping text in the initial text segment with aligned text) and the position of the word in the initial text segment, the closer the position of the word is to the segmentation boundary of the initial text segment, the lower the position confidence score thereof, and the closer the position of the word is to the middle position of the initial text segment, the higher the position confidence score thereof. The position confidence coefficient can be represented by a negative number in the following specific implementation mode, and the position confidence coefficient can perform certain punishment on the recognition result close to the segmentation boundary, so that the condition of poor voice recognition result at the segmentation boundary can be compensated, and the overall recognition accuracy of the voice recognition result to be recognized is effectively improved.
The speech recognition confidence of the word may be obtained synchronously in the speech recognition of step S103, and the language confidence of the word may be obtained after step S103 and before step b 3. Through the alignment processing of the step b2, the length of the initial text segment is effectively corrected, so that the position confidence of the word calculated after the alignment of the step b2 is more accurate.
In some embodiments, the speech recognition method may further include: and obtaining the confidence coefficient of the recognized text of the speech to be recognized according to the confidence coefficient of the words in the corrected text segment. In this way, the confidence of the recognized text of the speech to be recognized can be provided while the recognized text is provided to the user, so that the user can generate the transcribed text of the speech with reference to the confidence. Here, the confidence of recognizing the text may include the confidence of recognizing each word in the text. In practical applications, the recognized text and its confidence level may be determined synchronously in step S106.
In some embodiments, the speech recognition method may further include: determining the human voice confidence of each speech segment by using the attention data. Likewise, while providing the human voice data of the speech to be recognized to the user or the user's device, the corresponding human voice confidence can be provided, so that the user or the device produces the transcribed text of the speech with reference to the human voice confidence and the human voice data. In practical applications, the human voice confidence may be obtained in step S104 in synchronization with the human voice data.
It should be noted that a "word" in this document may be a Chinese character, an English word, or a word in another language.
Based on the above description of the exemplary overall flow of the speech recognition method provided by the embodiment of the present application, an exemplary specific implementation manner of the speech recognition method provided by the embodiment of the present application is described in detail below.
Referring to fig. 2, a specific implementation flow of the speech recognition method in this embodiment may include the following steps:
Step S201, receiving a target audio from a sound pickup device, and performing fixed-window-length overlapping segmentation on the target audio to obtain a speech segment set [A1, ..., AN] in which adjacent speech segments overlap;
here, the target audio may be a short audio (10 seconds) or a long audio (the time length may be several tens of minutes or even several hours).
Fig. 3 illustrates an exemplary implementation of fixed-window-length overlapping segmentation. Referring to fig. 3, the target audio may be segmented according to a fixed time length wl to obtain a speech segment set [A1, ..., AN] containing N speech segments, where every two consecutive speech segments overlap by a duration wo. N is an integer representing the total number of speech segments obtained by segmentation. Referring to fig. 3, the speech segment duration wl may take 16 seconds and the overlap duration wo may take 6 seconds. In practical applications, the overlap duration wo is less than or equal to one half of the speech segment duration wl. The specific value of the overlap duration can be determined according to actual requirements, the application scenario, the characteristics of the target audio, the performance of the speech recognition model, and so on. For example, where high processing efficiency is required and the recognition accuracy requirement is modest, the overlap duration wo may be set to one third to one quarter of the speech segment duration wl. Where higher recognition accuracy is required, the overlap duration wo may be set to one half to one third of the speech segment duration wl.
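The following Python sketch illustrates this fixed-window-length overlapping segmentation; the 16 kHz sample rate, the function name, and the decision to keep a shorter final segment are assumptions made for illustration, not requirements of the method.

```python
import numpy as np

def overlap_segment(waveform: np.ndarray, sample_rate: int = 16000,
                    wl: float = 16.0, wo: float = 6.0):
    """Split audio into fixed-length segments where adjacent segments overlap.

    waveform: 1-D array of audio samples.
    wl: segment duration in seconds (window length).
    wo: overlap duration in seconds (wo <= wl / 2 in this document's examples).
    """
    win = int(wl * sample_rate)          # samples per segment
    hop = int((wl - wo) * sample_rate)   # stride between segment starts
    segments = []
    start = 0
    while start < len(waveform):
        segments.append(waveform[start:start + win])
        if start + win >= len(waveform):
            break
        start += hop
    return segments  # [A1, ..., AN]

# Example: a 60-second mono recording at 16 kHz yields 6 segments,
# each sharing a 6-second overlap with its neighbours.
audio = np.zeros(60 * 16000, dtype=np.float32)
print(len(overlap_segment(audio)))
```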
Step S202, performing speech recognition on the speech segment set [A1, ..., AN] with a speech recognition model to obtain an initial text segment Ui (i = 1, ..., N) and attention data Mia[L, T] (i = 1, ..., N) of each speech segment.
In one implementation, speech recognition may be performed on all speech segments in the speech segment set [A1, ..., AN] in parallel to improve processing efficiency.
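A minimal sketch of this parallel processing, assuming a hypothetical recognize_segment function that wraps the trained ASR model and returns the initial text segment and the attention data of one speech segment; the thread pool is only one possible scheduling choice.

```python
from concurrent.futures import ThreadPoolExecutor

def recognize_all(segments, recognize_segment, max_workers=4):
    """Run ASR on every speech segment Ai concurrently.

    recognize_segment(Ai) is assumed to return (Ui, Mia), i.e. the initial
    text segment and the attention data of that segment.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(recognize_segment, segments))
    texts = [r[0] for r in results]       # [U1, ..., UN]
    attentions = [r[1] for r in results]  # [M1a, ..., MNa]
    return texts, attentions
```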
In this step, each speech segment Ai (i = 1, ..., N) is input into the trained ASR model to obtain the initial text segment Ui and the confidence Uci of the initial text segment Ui; at the same time, the multi-head attention matrix Mi[L, T×Nh] is extracted from the attention module (assuming the number of heads is Nh) of the selected layer in the decoder of the ASR model, and the attention data Mia[L, T] of the speech segment Ai is obtained from the attention matrix Mi[L, T×Nh].
In step S203, the human voice data Si[T] of each speech segment Ai is extracted from its attention data Mia[L, T] (i = 1, ..., N).
In this step, the human voice data of the speech segments can be extracted in parallel. That is, the human voice data Si[T] of each speech segment Ai is extracted from the attention data Mia[L, T] (i = 1, ..., N) in parallel.
Step S204, aligning and correcting the initial text segment Ui according to the human voice data Si[T] and the overlap duration wo to obtain a corrected text segment Pi.

In one implementation, only the text UPi corresponding to the overlapping portion of the initial text segment Ui is aligned and corrected using the human voice data Si[T] and the overlap duration wo. This improves the recognition accuracy at the segmentation boundary while keeping the computational complexity and the amount of processed data as low as possible, so that the improvement is achieved at a lower hardware cost.

In another implementation, the entire initial text segment Ui may be aligned and corrected using the human voice data Si[T] and the overlap duration wo.
In step S205, a recognition result P of the target audio is obtained from the corrected text segments Pi (i = 1, ..., N) obtained by the alignment correction.

Here, the recognition result of the target audio may include the recognition text, the text confidence, and the human voice interval of the target audio.

In one implementation, the recognition text P[T] of the target audio may be obtained by concatenating the corrected text segments Pi (i = 1, ..., N).

In one implementation, the text confidence Pc of the target audio may be obtained by concatenating the word confidences or word composite scores of the corrected text segments Pi.

In one implementation, the human voice interval S[T] of the target audio can be obtained by splicing the human voice data Si[T] of the speech segments.
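A brief sketch of this splicing step, under the assumptions that the overlapping words of adjacent corrected segments are already identical after alignment correction, that the number of overlapping words per boundary is known, and that the human voice data Si[T] are per-frame indicators; these simplifications are illustrative and not the patent's exact procedure.

```python
def splice_results(corrected_texts, overlap_word_counts, voice_seqs, overlap_frames):
    """Splice per-segment outputs into the whole-audio recognition result.

    corrected_texts: list of word lists Pi; after alignment correction the last
        overlap_word_counts[i] words of Pi equal the first words of Pi+1.
    voice_seqs: list of per-frame human-voice indicators Si[T].
    overlap_frames: number of audio frames shared by adjacent segments.
    """
    text = list(corrected_texts[0])
    voice = list(voice_seqs[0])
    for i in range(1, len(corrected_texts)):
        # skip the words and frames that repeat the tail of the previous segment
        text.extend(corrected_texts[i][overlap_word_counts[i - 1]:])
        voice.extend(voice_seqs[i][overlap_frames:])
    return text, voice  # recognition text P and human voice interval S[T]
```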
Assume that the initial text segments Ui and Ui+1 of the speech segments Ai and Ai+1 obtained in step S202 are as follows:

Ui: [Chinese example sentence shown as an image in the original publication]

Ui+1: [Chinese example sentence shown as an image in the original publication]

The recognition text P obtained in step S205 is:

P: [Chinese example sentence shown as an image in the original publication]
In the above example, "when there is no intersection special item" in Ui and "when there is no intersection turning such as road" in Ui+1 are the recognition results produced by ASR for the overlapping portion of the two consecutive speech segments Ai and Ai+1. Because the overlapping portion lies close to the segmentation boundary and therefore lacks context and semantic information, the recognition accuracy of the ASR model near the boundary of a speech segment is low and recognition errors occur; for example, "turn" is incorrectly recognized as "special item", and "intersection" is incorrectly recognized as "Liu Kou".
According to the method provided by the embodiment of the present application, Ui and Ui+1 are aligned and corrected in step S204, so that the recognition errors corresponding to the overlapping portion of the two consecutive speech segments Ai and Ai+1 (i.e., the errors near the segmentation boundaries of the speech segments) are corrected efficiently and accurately. The erroneous "special item" is successfully corrected to "turn", the erroneous "Liu Kou" is successfully corrected to "intersection", and the character that may have been dropped between "street lamps" and "the intersection" is filled in. The recognition text obtained after alignment correction not only conforms to the context and overall semantics of the long speech formed by Ai and Ai+1, but is also fluent and coherent and conforms to Chinese natural language logic. It can be seen that the overall recognition accuracy of the long-speech recognition text is significantly improved.
It can be seen from the above examples that the present embodiment significantly improves the accuracy of text recognition while realizing continuous speech recognition.
This embodiment will describe in detail an exemplary implementation of the speech recognition model in the embodiment of the present application.
In this embodiment, the speech recognition model is an ASR model, and the ASR model uses a neural network model based on an encoder-decoder structure of a multi-head attention mechanism.
Fig. 4 shows an exemplary network structure of the ASR model in this embodiment. Referring to fig. 4, the ASR model includes an encoder and a decoder. The encoder performs feature extraction on the data X = {x1, ..., xT} of the target audio (T represents the number of frames of the target audio) to obtain hidden feature data H = {h1, ..., hL} (L represents the total number of words in a vocabulary supported by the ASR). The decoder performs feature extraction on the previously obtained text data Y = {SOS, y1, ..., yU} (U represents the length of the text, i.e., the number of words of the text Y, which may include punctuation symbols) to obtain the current text data Y, and the features of the text data (not shown) and the hidden features H of the audio data are fused together by an attention mechanism (source interaction). The encoder may include Ne encoding layers (Encoder Layers), where Ne is an integer not less than 1, and each encoding layer may be connected to a multi-head attention module (Multi-head Attention). In one example, the encoding layer may be implemented by a Transformer module that includes a multi-head attention model. The decoder includes Nd decoding layers (Decoder Layers), where Nd is an integer not less than 1; each decoding layer is connected to a multi-head attention module (Multi-head Attention), and a feed-forward layer (Feed Forward) may be arranged in front of the multi-head attention module. In one example, the decoding layer may likewise be implemented by a Transformer module that includes a multi-head attention model. Considering model processing efficiency, accuracy, computational complexity and so on, the decoder in the ASR model may typically include 6 decoding layers, and the number of heads of each multi-head attention module may be 8.
Referring to fig. 4, in addition to the encoding layers, the encoder of the ASR model may further include a convolution activation module (Conv + ReLU), a fully connected layer (Dense), a position encoding module (Positional Encoding), and a feature normalization layer (LayerNorm) having a 2-layer structure. In addition to the decoding layers, the decoder of the ASR model may further include a word encoding module (Character Encoding), a position encoding module (Positional Encoding), a feature normalization layer (LayerNorm), and a normalization layer (Softmax). It should be noted that the structure shown in fig. 4 is merely an example. Those skilled in the art will understand that the specific structure of the ASR model can be freely configured according to the actual application requirements, the application scenario, the characteristics of the target audio, and so on.
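For orientation only, the following PyTorch-style sketch mirrors the kind of encoder-decoder structure described above (convolution + ReLU front-end, dense projection, positional encoding, Ne Transformer encoder layers, Nd decoder layers with multi-head attention, and a softmax output head plus a CTC branch). Layer sizes, module names, the learned positional embedding, and the use of torch.nn Transformer building blocks are illustrative assumptions, not the patent's exact network.

```python
import torch
import torch.nn as nn

class ASRModel(nn.Module):
    def __init__(self, vocab_size, feat_dim=80, d_model=256,
                 num_heads=8, ne=12, nd=6):
        super().__init__()
        # Encoder front-end: Conv + ReLU subsampling, Dense projection, positional encoding
        self.subsample = nn.Sequential(
            nn.Conv2d(1, d_model, 3, stride=2), nn.ReLU(),
            nn.Conv2d(d_model, d_model, 3, stride=2), nn.ReLU(),
        )
        self.proj = nn.LazyLinear(d_model)
        self.pos = nn.Embedding(5000, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, ne)
        # Decoder: character embedding, positional encoding, Nd decoding layers, output head
        self.char_emb = nn.Embedding(vocab_size, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, nd)
        self.out = nn.Linear(d_model, vocab_size)       # softmax head (attention branch)
        self.ctc_out = nn.Linear(d_model, vocab_size)   # CTC branch on the encoder output

    def forward(self, feats, tokens):
        # feats: (B, T, feat_dim) filterbank features; tokens: (B, U) previous text Y
        x = self.subsample(feats.unsqueeze(1))            # (B, C, T', F')
        x = x.permute(0, 2, 1, 3).flatten(2)              # (B, T', C*F')
        x = self.proj(x) + self.pos(torch.arange(x.size(1), device=x.device))
        h = self.encoder(x)                               # hidden features H
        y = self.char_emb(tokens) + self.pos(torch.arange(tokens.size(1),
                                                          device=tokens.device))
        dec = self.decoder(y, h)                          # cross-attention fuses Y and H
        return self.out(dec), self.ctc_out(h)             # attention logits, CTC logits
```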
In this embodiment, the ASR model may be obtained by multi-objective loss function training. The ASR model obtained through the multi-target loss function not only has higher recognition accuracy, but also has better frame alignment capability, and can obtain human voice data with higher accuracy.
Referring to fig. 4, the encoder may employ a CTC loss function (CTC loss), the decoder may employ a KL divergence loss function (CE loss in fig. 4), and the multi-objective loss function value (i.e., the mixed loss function value) of the ASR model may be determined as shown in formula (1); training of the ASR model is completed by minimizing this mixed loss function value:
loss=λ×CTC_loss+(1-λ)×KL_loss (1)
In formula (1), loss represents the value of the mixed loss function, CTC_loss represents the CTC loss function value, KL_loss represents the KL divergence function value, and λ is a preset weight, which may be a fixed value greater than 0 and less than 1 and may be determined through experiments or set to an empirical value. For example, λ may take the value 0.7, 0.6, or another value.
In the above example, the CTC loss function has frame alignment capability thanks to the forced-alignment property of the CTC algorithm. Training the encoder with the CTC loss function gives the hidden state H of the encoder frame-by-frame alignment information, so that the attention matrix output by the multi-head attention model in the decoder contains clearer human voice interval information; that is, the attention data obtained by the ASR model contains clearer human voice information. Experiments also show that training the ASR model with the multi-objective loss function improves the recognition accuracy of the ASR model.
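As a rough illustration of formula (1), the following sketch combines a CTC loss on the encoder output with a KL-divergence loss on the decoder output. The function name, the label-smoothing construction of the KL target, and the use of a single padded target tensor for both branches are assumptions made for brevity.

```python
import torch
import torch.nn.functional as F

def mixed_loss(ctc_logits, att_logits, targets, enc_lens, target_lens,
               lam=0.7, blank_id=0, smoothing=0.1):
    """loss = lam * CTC_loss + (1 - lam) * KL_loss  (formula (1)).

    ctc_logits: (B, T', V) encoder outputs; att_logits: (B, U, V) decoder outputs;
    targets: (B, U) padded token ids (assumed shared by both branches for brevity);
    enc_lens / target_lens: valid lengths of encoder outputs / target sequences.
    """
    # CTC branch on the encoder output (provides frame alignment capability)
    log_probs = ctc_logits.log_softmax(-1).transpose(0, 1)  # (T', B, V)
    ctc = F.ctc_loss(log_probs, targets, enc_lens, target_lens, blank=blank_id)
    # KL-divergence branch on the decoder output, against a label-smoothed target
    vocab = att_logits.size(-1)
    with torch.no_grad():
        smooth = torch.full_like(att_logits, smoothing / (vocab - 1))
        smooth.scatter_(-1, targets.unsqueeze(-1), 1.0 - smoothing)
    kl = F.kl_div(att_logits.log_softmax(-1), smooth, reduction="batchmean")
    return lam * ctc + (1 - lam) * kl
```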
FIG. 5 shows a schematic visualization of the attention data obtained via the ASR model. The diagram in fig. 5 is obtained by visualizing the attention data obtained by the ASR model and intuitively shows that this attention data contains clear human voice interval information. The horizontal axis represents the audio frame index, the vertical axis represents the word dimension of the attention data, and the gray level of each point indicates the probability that the corresponding audio frame belongs to human voice: the darker the point, the higher the probability. In fig. 5, the gray level of the region corresponding to an unvoiced segment (i.e., an audio segment without human voice) is the lightest (almost zero), indicating that the unvoiced segment contains essentially no audio frames belonging to human voice. The region corresponding to a human voice segment (i.e., a speech segment containing human voice; the region between the unvoiced segments in fig. 5) is clearly delineated, indicating that audio frames belonging to human voice and audio frames not belonging to human voice are clearly separated within the segment. Within a human voice segment, the region of audio frames corresponding to human voice is darkest, while the region of audio frames corresponding to non-human voice (e.g., sentence breaks and hesitation) is lighter, which matches the actual distribution of human voice in the speech. In other words, the attention data obtained from the ASR model contains clear human voice interval information, which fully shows that accurate human voice data can be extracted from the attention data.
In this embodiment, the corresponding valid human voice sequence S, i.e., the human voice sequence, is extracted from the attention matrix M of each short speech obtained from the ASR model. At the same time, the corresponding human voice confidence Sc can be extracted.
Fig. 6 shows an exemplary implementation flow of human voice data extraction in the present embodiment. Referring to fig. 6, an exemplary specific process for extracting a voice sequence in this embodiment may include the following steps:
In step S601, the attention matrix Mi[L, T×Nh] formed by the Nh attention matrices of the multi-head attention module (assuming the number of heads is Nh) in the last layer of the decoder of the ASR model is extracted, and the average of these Nh attention matrices is calculated to obtain the attention data Mia[L, T] (i = 1, ..., N).
The attention data Mia[L, T] represents the correspondence between each word in the initial text segment Ui of the short speech and each audio frame of the short speech; its two dimensions respectively carry the feature information of the short speech in the word dimension and the time dimension. L represents the number of characters contained in the initial text segment Ui obtained by the ASR model recognizing the short speech, and T represents the number of audio frames of the speech segment Ai, which can be calculated from the duration of the short speech.
Experiments show that in a multi-layer network structure, the features extracted in the last layer of the network have better human voice characteristics than those of the other layers, so in this step, the attention data is determined directly from the attention matrix Mi[L, T×Nh] of the multi-head attention module in the last of the decoding layers.
In step S602, the human voice vectors Mis[l, T] are extracted from the attention data Mia[L, T] to form the human voice matrix Mis[L, T].
The human voice matrix Mis[L, T] contains the human voice feature information of the speech segment Ai in the word dimension and the frame dimension, where L represents the number of words contained in the initial text segment Ui obtained by the ASR model recognizing the speech segment Ai, and T represents the number of frames of the speech segment Ai. Each element Mis[l, t] of the human voice matrix Mis[L, T] indicates whether the audio frame t belongs to human voice in word dimension l: a value of 0 indicates that the audio frame t does not belong to human voice, and a non-zero value indicates that it does.
Repeated experiments and data analysis by the inventors found that the value of an audio frame in the attention vector is higher than a certain threshold when the audio frame belongs to human voice. Thus, in one implementation, obtaining the human voice matrix Mis[L, T] may include: traversing the matrix Mia[L, T] along the word dimension, extracting the attention vector Mia[l, T] corresponding to each word dimension l, calculating the human voice vector Mis[l, T] corresponding to each attention vector Mia[l, T] according to the following formula (2), and splicing the human voice vectors Mis[l, T] along the word dimension to obtain the human voice matrix Mis[L, T].
Mis[l,t] = 1, if Mia[l,t] ≥ Thred_a; Mis[l,t] = 0, if Mia[l,t] < Thred_a    (2)
where Thred_a denotes a preset threshold, Mis[l, t] denotes the value of audio frame t in the human voice vector of word dimension l, and Mia[l, t] denotes the value of audio frame t in the attention vector of word dimension l. Thred_a is a hyper-parameter, usually taking a value such as 0.09, 0.1, or 0.16; it may be determined through experiments or set to an empirical value.
In this way, the human voice vectors can be extracted from the attention data through a hyper-parameter, and the human voice sequence is then obtained. The computational complexity is low and the hardware performance requirement is modest, which reduces hardware cost and improves processing efficiency.
In this step, the human voice confidence can be calculated at the same time using the attention vectors Mia[l, T]. Specifically, the maximum value of Mia[l, T] can be selected in each word dimension l and recorded as the valid human voice confidence Sic[L], which is the human voice confidence.
Step S603, traversing the human voice matrix Mis[L, T] along the word dimension, accumulating and summing the human voice vectors, and normalizing the accumulated sum based on the following formula (3) to obtain the valid human voice sequence Si[T].
Si[t] = 1, if Σl Mis[l,t] > 0; Si[t] = 0, otherwise    (3)
Here, Si[T] indicates whether each audio frame in the speech segment Ai belongs to human voice: if the value of an element Si[t] of Si[T] is a number greater than 0, the corresponding audio frame t belongs to human voice; if the value of Si[t] is 0, the audio frame t does not belong to human voice.
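The following NumPy sketch strings steps S601 to S603 together (head averaging, thresholding, accumulation and normalization). The binary forms assumed here for formulas (2) and (3), as well as the function and variable names, are illustrative assumptions.

```python
import numpy as np

def extract_voice_data(multi_head_attention, thred_a=0.1):
    """Extract the human voice sequence and confidence from decoder attention.

    multi_head_attention: array of shape (Nh, L, T) — the Nh attention matrices
        of the last decoding layer for one speech segment Ai.
    Returns (Si, Sic): per-frame voice indicator Si[T] and per-word confidence Sic[L].
    """
    # Step S601: average over the head dimension -> attention data Mia[L, T]
    mia = multi_head_attention.mean(axis=0)
    # Step S602: threshold each attention vector to get the voice matrix Mis[L, T]
    mis = (mia >= thred_a).astype(np.float32)          # formula (2), assumed binary
    sic = mia.max(axis=1)                              # human voice confidence Sic[L]
    # Step S603: accumulate over the word dimension and normalize -> Si[T]
    si = (mis.sum(axis=0) > 0).astype(np.float32)      # formula (3), assumed binary
    return si, sic

# Usage: attn has shape (8 heads, L words, T frames)
attn = np.random.rand(8, 20, 400)
si, sic = extract_voice_data(attn)
```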
In this embodiment, for the multi-layer attention matrices of the speech recognition model, the attention matrix of the last layer is selected, and the mean of the multi-head attention matrix is calculated along the head dimension to obtain the attention data. The human voice matrix is obtained by thresholding each attention vector along the word dimension of the attention data, and the human voice confidence is obtained at the same time by taking the maximum. After thresholding, the resulting human voice matrix is accumulated and summed, and the accumulated sum is normalized to finally obtain the valid human voice sequence. This solves the problem of poor endpoint recognition under fixed-length segmentation and, by exploiting the noise resistance of the speech recognition model, improves the accuracy of endpoint recognition in noisy environments.
This embodiment provides an exemplary implementation of initial text segment alignment correction.
Fig. 7 shows an exemplary implementation flow of text correction in the present embodiment. Referring to fig. 7, an exemplary implementation flow of text modification in the present embodiment may include the following steps:
In step S701, for each pair of adjacent speech segments Ai and Ai+1 (i = 1, ..., N-1) in the speech segment set [A1, ..., AN], the overlapped text UPi of each speech segment is extracted from its initial text segment Ui according to the valid human voice sequence Si[T] and the overlap duration wo.
Consider adjacent speech segments Ai and Ai+1, whose initial text segments Ui and Ui+1 are obtained through recognition by the ASR model, and assume that the duration of the overlapping portion of the speech segments Ai and Ai+1 is wo. The text corresponding to the tail portion of duration wo of the human voice sequence Si[T] is extracted from the initial text segment Ui as the overlapped text UPi of the speech segment Ai, and the text corresponding to the head portion of duration wo of the human voice sequence Si+1[T] is extracted from the initial text segment Ui+1 as the overlapped text UPi+1 of the speech segment Ai+1.
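A rough sketch of this extraction, under the assumption that each word of the initial text segment can be associated with the frame where its attention peaks; this per-word frame mapping is an illustrative choice and is not spelled out in the text above.

```python
import numpy as np

def extract_overlap_text(words, mia, total_frames, overlap_frames, tail=True):
    """Return the words of an initial text segment that fall in the overlap region.

    words: list of words in the initial text segment Ui (length L).
    mia: attention data Mia[L, T] of the segment.
    overlap_frames: number of frames covered by the overlap duration wo.
    tail=True selects the trailing overlap (preceding segment Ai);
    tail=False selects the leading overlap (succeeding segment Ai+1).
    """
    word_frames = mia.argmax(axis=1)  # assumed: frame where each word's attention peaks
    if tail:
        keep = word_frames >= total_frames - overlap_frames
    else:
        keep = word_frames < overlap_frames
    return [w for w, k in zip(words, keep) if k]
```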
For example, Ui and Ui+1 are as follows (shown in italics in the original publication):

Ui: [Chinese example sentence shown as an image in the original publication]

Ui+1: [Chinese example sentence shown as an image in the original publication]

The extracted overlapped texts UPi and UPi+1 are as follows:

UPi: [Chinese example text shown as an image in the original publication]

UPi+1: [Chinese example text shown as an image in the original publication]
step S702, aligning the overlapped texts UPi of the preceding speech sections Ai and the overlapped texts UPi +1 of the following speech sections Ai +1 in the adjacent speech sections by adopting a dynamic programming algorithm to obtain the aligned texts UPai and UPai +1 of the adjacent speech sections;
In one implementation, a standard dynamic programming algorithm such as Fisher may be used to align the lengths of the overlapped texts of adjacent speech segments, filling in missing words with a predetermined placeholder, so as to obtain the aligned text of each short speech.
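A small sketch of such a length alignment based on edit-distance dynamic programming (Wagner-Fischer style); the unit costs and the '*' placeholder symbol are assumptions for illustration rather than the exact algorithm of the text above.

```python
def align_overlap(a, b, gap="*"):
    """Align two word lists to equal length, inserting a placeholder for missing words."""
    n, m = len(a), len(b)
    # dp[i][j] = minimal edit cost between a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    # Trace back to build the two aligned sequences UPai and UPai+1
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            out_a.append(a[i - 1]); out_b.append(gap); i -= 1
        else:
            out_a.append(gap); out_b.append(b[j - 1]); j -= 1
    return out_a[::-1], out_b[::-1]

# Example: the shorter sequence is padded with '*' where a word is missing.
print(align_overlap(list("abde"), list("abcde")))
```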
Still taking the above adjacent speech segments Ai and Ai+1 as an example, the following aligned texts UPai and UPai+1 can be obtained in this step from the overlapped texts UPi and UPi+1:

UPai: [Chinese example text shown as an image in the original publication]

UPai+1: [Chinese example text shown as an image in the original publication]
As can be seen, the overlapped text UPi contains 11 characters while UPi+1 contains 12 characters, i.e., UPi has one word fewer than UPi+1. The aligned texts UPai and UPai+1 above are therefore obtained by filling the missing word in UPi with the preset placeholder. In this way, missing or extra characters at the positions of the corresponding speech segments in the recognized text can be effectively eliminated.
In other implementations, the overlapping text of adjacent speech segments may be matched based on the confidence level of each word in the initial segment of text obtained by the ASR model to obtain aligned text of the overlapping portion of adjacent speech segments.
Step S703, obtaining the corrected text UPxi of the adjacent speech segments Ai and Ai+1 according to the composite scores of the words in the aligned text UPai of the preceding speech segment Ai and the composite scores of the words in the aligned text UPai+1 of the succeeding speech segment Ai+1, and then obtaining the corrected text segments Pi and Pi+1 (i = 1, ..., N-1) of the adjacent speech segments Ai and Ai+1 from the corrected text UPxi.
As previously described, the confidence level for each word may include a speech recognition confidence level, a language confidence level, and a location confidence level. Taking the ASR model shown in fig. 4 as an example, the speech recognition confidence level for each word may include a frame alignment confidence level and an attention confidence level, and details related to the frame alignment confidence level and the attention confidence level may refer to the above description and are not repeated. In this embodiment, the position confidence is used as a penalty term to determine the composite score of each word in the aligned texts UPai and UPai +1 respectively.
In one implementation, the composite score for each word may be obtained by the following equation (4):
Jointscore=α×CTCscore+λ×Attscore+η×LMscore+Posscore (4)
where Jointscore represents the composite score of a word, CTCscore represents the frame alignment confidence of the word, Attscore represents the attention confidence of the word, LMscore represents the language confidence value of the word, Posscore represents the position confidence value of the word, α represents the weight of the frame alignment confidence, λ represents the weight of the attention confidence, and η represents the weight of the language confidence. α, λ and η are hyper-parameters whose values can be determined in advance through experiments or set to empirical values.
Wherein, the Posscore can be calculated by the following formula (5):
Posscore=-β|l-L/2| (5)
where Posscore represents the position confidence value of a word, l represents the position of the word in the aligned text segment, L represents the number of words contained in the aligned text segment (the aligned text segment being the text segment obtained by replacing the overlapped text in the initial text segment with the aligned text), and β represents a preset position weight. β is a hyper-parameter whose value can be determined through experiments or set to an empirical value.
Experiments show that for a speech segment obtained by segmentation, the recognition results at the beginning and the end of the segment often suffer from degraded accuracy due to disordered semantic information. This embodiment therefore introduces a position score for words. As can be seen from formula (5), the closer a word is to the beginning or the end of the initial text segment of a speech segment, the smaller its position confidence value, and the closer the word is to the middle of the initial text segment, the larger its position confidence value. Introducing the position confidence of a word as a penalty term of the word's composite score corrects the impact that poor recognition at the endpoints of a short speech has on the overall recognition accuracy of the long speech.
In one implementation, step S703 may be performed according to the following formula (6); that is, each word in the aligned texts of the adjacent speech segments is adjusted by formula (6):
Uri[l] = UPai[l], if Jointscore(UPai[l]) ≥ Jointscore(UPai+1[l]); Uri[l] = UPai+1[l], otherwise    (6)
where Uri[l] (i = 1, ..., N-1) represents the word at position l in the corrected text of the speech segment Ai, N is the number of speech segments, UPai[l] represents the word at position l in the aligned text of the speech segment Ai, UPai+1[l] represents the word at position l in the aligned text of the speech segment Ai+1, Jointscore(UPai[l]) represents the composite score of the word at position l in the aligned text of the speech segment Ai, Jointscore(UPai+1[l]) represents the composite score of the word at position l in the aligned text of the speech segment Ai+1, and * represents a placeholder.
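The following sketch follows formulas (4) to (6) as described above: a position-penalized composite score per word and a per-position comparison between the two aligned texts. The per-word confidence inputs, the placeholder handling, and the function names are assumptions for illustration; the numbers in the closing comment refer to the worked example that follows.

```python
def joint_score(ctc, att, lm, pos, length, alpha=1.0, lam=1.0, eta=1.0, beta=1.0):
    """Formula (4) with the position penalty of formula (5)."""
    posscore = -beta * abs(pos - length / 2.0)
    return alpha * ctc + lam * att + eta * lm + posscore

def correct_overlap(upai, upai1, scores_a, scores_b, gap="*"):
    """Formula (6): keep, at each overlap position, the word with the higher score.

    upai, upai1: aligned overlapped texts of the preceding / succeeding segment.
    scores_a, scores_b: composite scores of the corresponding words, each computed
        within its own aligned text segment; a placeholder position is assumed
        to carry a very low score.
    """
    corrected = []
    for wa, wb, sa, sb in zip(upai, upai1, scores_a, scores_b):
        if wa == gap:
            sa = float("-inf")
        if wb == gap:
            sb = float("-inf")
        corrected.append(wa if sa >= sb else wb)
    return corrected

# Worked example below: the composite score of "lamp"
# (-0.235 - 0.525 - 0.216 + 0 = -0.976) exceeds that of "equal"
# (-0.287 - 0.672 - 0.228 - 0.871 = -2.058), so "lamp" is selected.
```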
Taking the above speech segments Ai and Ai+1 as an example, assume that the confidences of "lamp" in the aligned text UPai are (-0.235, -0.525, -0.216, 0), i.e., the frame alignment confidence of "lamp" is -0.235, its attention confidence is -0.525, its language confidence is -0.216, and its position confidence is 0; and that the confidences of "equal" in the aligned text UPai+1 are (-0.287, -0.672, -0.228, -0.871). Assuming that α, λ, η and β are all 1, the composite score of "lamp" is (-0.235) + (-0.525) + (-0.216) + 0 = -0.976, and the composite score of "equal" is (-0.287) + (-0.672) + (-0.228) + (-0.871) = -2.058. Comparing the two, the composite score of "lamp" is clearly greater than that of "equal", so "equal" in UPai+1 is revised to "lamp", and the word at that position in UPxi is selected as "lamp". Similarly, the word at each position where the recognition results differ in the aligned texts is corrected, yielding the corrected text UPri&i+1 of the adjacent speech segments Ai and Ai+1; the corrected text UPri of the speech segment Ai and the corrected text UPri+1 of the speech segment Ai+1 are the same, namely UPri&i+1.
UPri&i+1: [Chinese example text shown as an image in the original publication]
After the corrected text of each voice segment is obtained, the corrected text can be used for replacing the corresponding text in the initial text segment, and then the corrected text segment can be obtained.
Taking the above speech segments Ai and Ai+1 as an example, the corrected text segments Pi and Pi+1 are:

Pi: [Chinese example sentence shown as an image in the original publication]

Pi+1: [Chinese example sentence shown as an image in the original publication]
In this embodiment, the initial text segments of adjacent speech segments are aligned by using a dynamic programming algorithm, each word in the aligned text is scored by combining the speech recognition model, the language model and the position of the word in the text, and the aligned text is corrected according to these word scores. In this way, the accuracy of the words at the segmentation endpoints of the speech segments is significantly improved, the problem of poor recognition results at the speech endpoints under fixed-window-length segmentation is solved, the recognition accuracy at the segmentation boundaries of the speech segments is improved, and the overall recognition accuracy of the long speech is improved.
Fig. 8 shows an exemplary structure of a speech recognition apparatus 800 provided in an embodiment of the present application. Referring to fig. 8, the exemplary speech recognition apparatus 800 of an embodiment of the present application may comprise:
A voice acquiring unit 810 configured to acquire a voice to be recognized;
a speech segmentation unit 820 configured to segment the speech to be recognized to obtain a plurality of speech segments, wherein a tail of a preceding speech segment overlaps a head of a succeeding speech segment in adjacent ones of the plurality of speech segments;
a speech recognition unit 830 configured to obtain attention data and an initial text segment of each of the plurality of speech segments using an attention-based speech recognition model;
a voice data extracting unit 840 configured to extract voice data of each voice segment from the attention data;
an alignment correction unit 850 configured to obtain a corrected text segment of each speech segment according to the initial text segment, the voice data, and the overlap duration of each speech segment, wherein a text corresponding to the tail in the corrected text segment of the previous speech segment in the adjacent speech segments of the plurality of speech segments is the same as a text corresponding to the head in the corrected text segment of the next speech segment;
a text splicing unit 860 configured to splice the modified text segments of each of the plurality of speech segments to obtain the recognition text of the speech to be recognized.
In some embodiments, the speech segmentation unit 820 is configured to segment the speech to be recognized according to a fixed window length and/or a fixed overlap duration to obtain the plurality of speech segments with equal time length and/or overlap length.
In some embodiments, the attention mechanism-based speech recognition model is a model of an encoder-decoder structure, the model of the encoder-decoder structure includes an encoder and a decoder, attention modules are arranged in a plurality of decoding layers of the decoder, and the attention data is obtained through an attention matrix output by the attention module in the last layer of the plurality of decoding layers.
In some embodiments, the attention-based speech recognition model is trained using multi-objective loss functions, including at least one loss function with frame alignment capability.
In some embodiments, the attention data has a word dimension and a frame dimension; the voice data extracting unit 840 is configured to obtain the voice data by: traversing the attention data according to the word dimension to extract an attention vector of the word dimension; obtaining a voice vector of the word dimension according to the attention vector of the word dimension and a preset threshold value; and accumulating and summing numerical values in the voice vectors of the word dimension corresponding to each voice segment to obtain a voice sequence of each voice segment, wherein the voice sequence comprises voice information of each audio frame in the voice segment, and the voice information is used for indicating whether the audio frame belongs to voice or does not belong to voice.
In some embodiments, the vocal vector for the word dimension is obtained by equation (2).
In some embodiments, the alignment correction unit 850 is configured to obtain the corrected text segments of each speech segment by:
for each pair of adjacent speech segments of the plurality of speech segments, performing the steps of:
extracting overlapped texts from initial text sections of adjacent speech sections, wherein the overlapped texts comprise overlapped texts of a previous speech section and overlapped texts of a later speech section in the adjacent speech sections, the overlapped texts of the previous speech section correspond to a tail part of the overlapped duration in the human voice data, and the overlapped texts of the later speech section correspond to a head part of the human voice data, the length of which is the overlapped duration;
aligning the overlapped text of the preceding speech segment with the overlapped text of the succeeding speech segment to obtain aligned text of the adjacent speech segments, the aligned text comprising the aligned text of the preceding speech segment and the aligned text of the succeeding speech segment;
obtaining the corrected texts of the adjacent voice sections according to the confidence degrees of the characters in the aligned texts of the previous voice sections and the confidence degrees of the characters in the aligned texts of the later voice sections, wherein the corrected texts of the previous voice sections are the same as the corrected texts of the later voice sections;
and obtaining a corrected text section of a previous speech section and a corrected text section of a next speech section in the adjacent speech sections by using the corrected texts of the adjacent speech sections.
In some embodiments, the confidence level of a word includes at least one of: a frame alignment confidence for the word, an attention confidence, a language confidence for the word, and a location confidence for the word.
In some embodiments, the alignment modification unit 850 is configured to obtain the modified texts of the adjacent speech segments according to the comprehensive scores of the words in the aligned texts of the preceding speech segments and the comprehensive scores of the words in the aligned texts of the following speech segments; wherein, the comprehensive score of the word is determined by taking the position confidence of the word as a penalty term.
In some embodiments, the confidence in the location of the word is calculated by equation (5).
In some embodiments, the composite score of the word is calculated by equation (4).
In some embodiments, the alignment modification unit 850 is configured to obtain the modified texts of the adjacent speech segments according to the composite score of the words in the aligned texts of the preceding speech segments and the composite score of the words in the aligned texts of the following speech segments, and specifically includes:
adjusting each word in the aligned text of adjacent speech segments by:
Uri[l] = UPai[l], if Jointscore(UPai[l]) ≥ Jointscore(UPai+1[l]); Uri[l] = UPai+1[l], otherwise
where Uri[l] (i = 1, ..., N-1) represents the word at position l in the corrected text of the speech segment Ai, N is the number of the speech segments, UPai[l] represents the word at position l in the aligned text of the speech segment Ai, UPai+1[l] represents the word at position l in the aligned text of the speech segment Ai+1, Jointscore(UPai[l]) represents the composite score of the word at position l in the aligned text of the speech segment Ai, Jointscore(UPai+1[l]) represents the composite score of the word at position l in the aligned text of the speech segment Ai+1, and * represents a placeholder.
In some embodiments, the text concatenation unit 860 is further configured to obtain a confidence level of the recognized text of the speech to be recognized according to the confidence level of the word in the corrected text segment.
In some embodiments, the vocal data extraction unit 840 is further configured to obtain the vocal confidence of each of the plurality of speech segments by using the attention data.
The speech recognition apparatus 800 of the embodiment of the present application may be implemented by software, hardware, or a combination of both. In some examples, the speech recognition apparatus 800 may be implemented by the following computing device 900.
Traditionally, important information from a meeting is preserved by manual note-taking for subsequent archiving, retrieval and review. Manual recording is time-consuming and labor-intensive and prone to recording errors. To improve the efficiency of meeting records, the speech recognition method of the embodiments of the present application can automatically convert a meeting recording into text and output the confidence and the human voice intervals of the text at the same time, helping reviewers quickly complete the transcription and verification of the meeting recording.
With the rapid development of streaming media, short videos, live broadcasts, movies and television series occupy a large amount of people's daily entertainment time. Subtitles are an indispensable part of video resources, and the traditional way of adding subtitles to video resources is manual dictation.
Given how rapidly video resources are produced today, manual subtitle dictation is far too inefficient, and it is especially impractical in the field of live video. With the speech recognition method of the embodiments of the present application, subtitles for the audio stream of a video resource can be generated automatically and efficiently, and the subtitles can be automatically added to the video according to the human voice intervals.
1. The method does not depend on a VAD model to segment the ultra-long audio, which avoids the misrecognition problem of VAD models in noisy scenarios and reduces the development and maintenance cost of a VAD model.

2. Speech endpoint recognition is performed based on the attention data, which can effectively improve the accuracy of long speech recognition in noisy environments; the noise resistance of the speech recognition engine is used to improve the accuracy of endpoint recognition in noisy environments.

3. The problems of low recognition accuracy, high deletion errors and model collapse in ultra-long speech recognition are solved. The accuracy is improved from 72% to 89%.
Fig. 9 is a schematic structural diagram of a computing device 900 provided in an embodiment of the present application. The computing device 900 includes: a processor 910, a memory 920. In addition, the method can also comprise the following steps: a communication interface 930, and a bus 940.
It is to be appreciated that the communication interface 930 in the computing device 900 shown in FIG. 9 may be used to communicate with other devices.
The processor 910 may be connected to the memory 920. The memory 920 may be used to store the program codes and data. Accordingly, the memory 920 may be a storage unit inside the processor 910, an external storage unit independent of the processor 910, or a component including a storage unit inside the processor 910 and an external storage unit independent of the processor 910.
Optionally, computing device 900 may also include a bus 940. The memory 920 and the communication interface 930 may be connected to the processor 910 through a bus 940. The bus 940 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 940 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one line is shown in FIG. 9, but this does not represent only one bus or one type of bus.
It should be understood that, in the embodiment of the present application, the processor 910 may employ a Central Processing Unit (CPU). The processor may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. Alternatively, the processor 910 may employ one or more integrated circuits for executing related programs to implement the technical solutions provided in the embodiments of the present application.
The memory 920 may include a read-only memory and a random access memory, and provides instructions and data to the processor 910. A portion of the processor 910 may also include non-volatile random access memory. For example, the processor 910 may also store information of the device type.
When the computing device 900 is running, the processor 910 executes the computer-executable instructions in the memory 920 to perform the operational steps of the above-described method.
It should be understood that the computing device 900 according to the embodiment of the present application may correspond to a corresponding main body executing a method according to each embodiment of the present application, and the above and other operations and/or functions of each module in the computing device 900 are respectively for implementing a corresponding flow of each method of the embodiment, and are not described herein again for brevity.
The computing device 900 in the embodiments of the present application may be, but is not limited to, a mobile phone, a notebook computer, a voice transcription device, or other various types. Of course, the computing device 900 of the embodiments of the present application may also be embodied as a device such as a server.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that contribute to the related art in essence may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The present embodiments also provide a computer-readable storage medium, on which a computer program is stored, where the computer program is used to execute a speech recognition method when executed by a processor, and the method includes at least one of the solutions described in the above embodiments.
The computer storage media of the embodiments of the present application may take any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It should be noted that the foregoing is only illustrative of the preferred embodiments of the present application and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present application has been described in more detail with reference to the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention.

Claims (16)

1. A method of speech recognition, the method comprising:
acquiring a voice to be recognized;
segmenting the voice to be recognized to obtain a plurality of voice sections, wherein the tail of a front voice section in adjacent voice sections of the plurality of voice sections is overlapped with the head of a rear voice section;
acquiring attention data and initial text segments of each voice segment in the plurality of voice segments by using a voice recognition model based on an attention mechanism;
extracting human voice data from the attention data of each voice segment;
obtaining a corrected text section of each voice section according to the initial text section, the voice data and the overlapping duration of each voice section, wherein the text corresponding to the tail in the corrected text section of the previous voice section in the adjacent voice sections of the plurality of voice sections is the same as the text corresponding to the head in the corrected text section of the next voice section;
and splicing the corrected text sections of each of the plurality of voice sections to obtain the recognition text of the voice to be recognized.
2. The speech recognition method according to claim 1, wherein the segmenting the speech to be recognized to obtain a plurality of speech segments specifically comprises: and segmenting the speech to be recognized according to a fixed window length and/or a fixed overlapping duration to obtain the plurality of speech segments with equal time length and/or overlapping part length.
3. The speech recognition method according to claim 1 or 2, wherein the attention-based speech recognition model is a model of a coder-decoder structure, the model of the coder-decoder structure comprises a coder and a decoder, attention modules are arranged in a plurality of decoding layers of the decoder, and the attention data is obtained through an attention matrix output by the attention module in a last layer of the plurality of decoding layers.
4. The speech recognition method of any one of claims 1-3, wherein the attention-based speech recognition model is trained using multi-objective loss functions, the multi-objective loss functions including at least one loss function with frame alignment capability.
5. The speech recognition method of any one of claims 1-4, wherein the attention data has a word dimension and a frame dimension; the extracting of the vocal data from the attention data of each voice segment specifically includes:
traversing the attention data according to the word dimension to extract an attention vector of the word dimension;
obtaining a voice vector of the word dimension according to the attention vector of the word dimension and a preset threshold value;
and accumulating and summing numerical values in the voice vectors of the word dimension corresponding to each voice segment to obtain a voice sequence of each voice segment, wherein the voice sequence comprises voice information of each audio frame in the voice segment, and the voice information is used for indicating that the audio frame belongs to voice or does not belong to voice.
6. The speech recognition method of claim 5, wherein the vocal vectors for the word dimension are obtained by:
Ms[l,t] = 1, if Ma[l,t] ≥ Thred_a; Ms[l,t] = 0, if Ma[l,t] < Thred_a
wherein, thred a Is representative of the threshold value(s),M s [l,t]value of the audio frame t in the human voice vector representing the word dimension l, M a [l,t]The value of the audio frame t in the attention vector representing the word dimension l.
7. The speech recognition method according to any one of claims 1 to 6, wherein the obtaining the modified text segment of each speech segment according to the initial text segment of each speech segment, the vocal data, and the overlap length specifically comprises:
for each pair of adjacent speech segments of the plurality of speech segments, performing the steps of:
extracting overlapped texts from initial text sections of adjacent speech sections, wherein the overlapped texts comprise overlapped texts of a previous speech section and overlapped texts of a later speech section in the adjacent speech sections, the overlapped texts of the previous speech section correspond to a tail part of the overlapped duration in the human voice data, and the overlapped texts of the later speech section correspond to a head part of the human voice data, the length of which is the overlapped duration;
aligning the overlapped text of the preceding speech segment with the overlapped text of the succeeding speech segment to obtain aligned text of the adjacent speech segments, the aligned text comprising the aligned text of the preceding speech segment and the aligned text of the succeeding speech segment;
obtaining a corrected text of the adjacent voice sections according to the confidence degrees of the words in the aligned text of the previous voice section and the confidence degrees of the words in the aligned text of the later voice section, wherein the corrected text of the previous voice section is the same as the corrected text of the later voice section;
and obtaining a corrected text section of a previous speech section and a corrected text section of a next speech section in the adjacent speech sections by using the corrected texts of the adjacent speech sections.
8. The speech recognition method of claim 7, wherein the confidence level of the word comprises at least one of: a frame alignment confidence for the word, an attention confidence for the word, a language confidence for the word, and a location confidence for the word.
9. The speech recognition method of claim 8, wherein the obtaining of the corrected text of the adjacent speech segments according to the confidences of the words in the aligned text of the preceding speech segment and the confidences of the words in the aligned text of the succeeding speech segment specifically comprises:
obtaining the corrected text of the adjacent speech segments according to the composite scores of the words in the aligned text of the preceding speech segment and the composite scores of the words in the aligned text of the succeeding speech segment, wherein the composite score of a word is determined by taking the position confidence of the word as a penalty term.
10. The speech recognition method of claim 8 or 9, wherein the position confidence of a word is calculated by:
Posscore = -β × |l - L/2|
wherein Posscore represents the position confidence value of the word, l represents the position of the word in the aligned text segment, L represents the number of words contained in the aligned text segment, β represents a preset position weight, and the aligned text segment is the text segment obtained by replacing the overlapped text in the initial text segment with the aligned text.
11. The speech recognition method of claim 10, wherein the composite score of the word is calculated by:
Jointscore=α×CTCscore+λ×Attscore+η×LMscore+Posscore
wherein Jointscore represents the composite score of a word, CTCscore represents the frame alignment confidence of the word, Attscore represents the attention confidence of the word, LMscore represents the language confidence of the word, Posscore represents the position confidence of the word, α represents the weight of the frame alignment confidence, λ represents the weight of the attention confidence, and η represents the weight of the language confidence.
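Read together, claims 10 and 11 amount to the small scoring helpers sketched below in Python. The function and parameter names are illustrative, and the weights alpha, lam, eta and beta are assumed to be preset externally, as the claims only describe them as weights.

```python
def position_confidence(l: int, L: int, beta: float) -> float:
    """Claim 10: Posscore = -beta * |l - L/2|. Words near the middle of the
    aligned text segment receive the smallest penalty."""
    return -beta * abs(l - L / 2)

def composite_score(ctc: float, att: float, lm: float, pos: float,
                    alpha: float, lam: float, eta: float) -> float:
    """Claim 11: Jointscore = alpha*CTCscore + lam*Attscore + eta*LMscore + Posscore,
    with the position confidence acting as the penalty term of claim 9."""
    return alpha * ctc + lam * att + eta * lm + pos
```

Assuming β ≥ 0, Posscore is never positive, so it can only lower a word's composite score, which matches its description in claim 9 as a penalty term.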
12. The speech recognition method of claim 9, wherein the obtaining of the corrected text of the adjacent speech segments according to the composite scores of the words in the aligned text of the preceding speech segment and the composite scores of the words in the aligned text of the succeeding speech segment comprises:
adjusting each word in the aligned text of adjacent speech segments by:
Uri[l] = UPai[l] if Jointscore(UPai[l]) ≥ Jointscore(UPai+1[l]), and Uri[l] = UPai+1[l] otherwise
wherein Uri[l] (i = 1, …, N-1) represents the word at position l in the corrected text of speech segment Ai, N represents the number of speech segments, UPai[l] represents the word at position l in the aligned text of speech segment Ai, UPai+1[l] represents the word at position l in the aligned text of speech segment Ai+1, Jointscore(UPai[l]) represents the composite score of the word at position l in the aligned text of speech segment Ai, Jointscore(UPai+1[l]) represents the composite score of the word at position l in the aligned text of speech segment Ai+1, and placeholder.
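One plausible reading of claim 12, consistent with the selection rule reconstructed above, is sketched below: at each aligned position the word with the higher composite score is kept. The handling of ties and of placeholder positions is an assumption, since that part of the claim text is not fully recoverable here.

```python
def merge_aligned_overlap(aligned_prev, aligned_next, scores_prev, scores_next):
    """Sketch of claim 12: build the shared corrected overlap text of two
    adjacent segments by keeping, at each aligned position, the word whose
    composite score (Jointscore) is higher."""
    corrected = []
    for w_prev, w_next, s_prev, s_next in zip(aligned_prev, aligned_next,
                                              scores_prev, scores_next):
        corrected.append(w_prev if s_prev >= s_next else w_next)
    return corrected
```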
13. The speech recognition method of any one of claims 1, 8, 10 and 11, further comprising: obtaining a confidence of the recognition text of the speech to be recognized according to the confidences of the words in the corrected text segments.
14. The speech recognition method according to any one of claims 1-13, wherein the method further comprises: obtaining a voice confidence of each voice segment in the plurality of voice segments by using the attention data.
15. A computing device, comprising:
at least one processor; and
at least one memory storing program instructions that, when executed by the at least one processor, cause the at least one processor to perform the method of any of claims 1-14.
16. A computer-readable storage medium having stored thereon program instructions that, when executed by a computer, cause the computer to perform the method of any of claims 1-14.
CN202110313911.XA 2021-03-24 2021-03-24 Speech recognition method and apparatus, computer readable storage medium Pending CN115206324A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110313911.XA CN115206324A (en) 2021-03-24 2021-03-24 Speech recognition method and apparatus, computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110313911.XA CN115206324A (en) 2021-03-24 2021-03-24 Speech recognition method and apparatus, computer readable storage medium

Publications (1)

Publication Number Publication Date
CN115206324A true CN115206324A (en) 2022-10-18

Family

ID=83571526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110313911.XA Pending CN115206324A (en) 2021-03-24 2021-03-24 Speech recognition method and apparatus, computer readable storage medium

Country Status (1)

Country Link
CN (1) CN115206324A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116682432A (en) * 2022-09-23 2023-09-01 荣耀终端有限公司 Speech recognition method, electronic device and readable medium
CN115828907A (en) * 2023-02-16 2023-03-21 南昌航天广信科技有限责任公司 Intelligent conference management method, system, readable storage medium and computer equipment
CN116758902A (en) * 2023-06-01 2023-09-15 镁佳(北京)科技有限公司 Audio and video recognition model training and recognition method under multi-person speaking scene
CN117219067A (en) * 2023-09-27 2023-12-12 北京华星酷娱文化传媒有限公司 Method and system for automatically generating subtitles by short video based on speech understanding
CN117219067B (en) * 2023-09-27 2024-04-09 北京华星酷娱文化传媒有限公司 Method and system for automatically generating subtitles by short video based on speech understanding
CN117253485A (en) * 2023-11-20 2023-12-19 翌东寰球(深圳)数字科技有限公司 Data processing method, device, equipment and storage medium
CN117253485B (en) * 2023-11-20 2024-03-08 翌东寰球(深圳)数字科技有限公司 Data processing method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110534095B (en) Speech recognition method, apparatus, device and computer readable storage medium
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
Xiong Fundamentals of speech recognition
CN115206324A (en) Speech recognition method and apparatus, computer readable storage medium
US11314921B2 (en) Text error correction method and apparatus based on recurrent neural network of artificial intelligence
US20200105280A1 (en) Diarization using linguistic labeling
CN111429889A (en) Method, apparatus, device and computer readable storage medium for real-time speech recognition based on truncated attention
CN110706690A (en) Speech recognition method and device
US11158307B1 (en) Alternate utterance generation
Momeni et al. Seeing wake words: Audio-visual keyword spotting
US11227579B2 (en) Data augmentation by frame insertion for speech data
CN111613215B (en) Voice recognition method and device
US20230110205A1 (en) Alternate natural language input generation
CN115004296A (en) Two-wheeled end-to-end speech recognition based on consultation model
KR20230147685A (en) Word-level reliability learning for subword end-to-end automatic speech recognition
Doutre et al. Improving streaming automatic speech recognition with non-streaming model distillation on unsupervised data
US20230368796A1 (en) Speech processing
CN112802444A (en) Speech synthesis method, apparatus, device and storage medium
CN113488028A (en) Speech transcription recognition training decoding method and system based on rapid skip decoding
CN113327574A (en) Speech synthesis method, device, computer equipment and storage medium
CN115827854A (en) Voice abstract generation model training method, voice abstract generation method and device
CN115455946A (en) Voice recognition error correction method and device, electronic equipment and storage medium
Khurana et al. DARTS: Dialectal Arabic transcription system
Yang et al. Keyword search using attention-based end-to-end asr and frame-synchronous phoneme alignments
CN114420104A (en) Method for automatically generating subtitles and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination