CN114120975A - Method, apparatus and storage medium for speech recognition punctuation recovery

Method, apparatus and storage medium for speech recognition punctuation recovery

Info

Publication number
CN114120975A
Authority
CN
China
Prior art keywords
text
audio
samples
punctuation
sample
Prior art date
Legal status
Pending
Application number
CN202111335102.5A
Other languages
Chinese (zh)
Inventor
吴礼蔚
朱耀明
程善伯
王明轩
Current Assignee
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202111335102.5A priority Critical patent/CN114120975A/en
Publication of CN114120975A publication Critical patent/CN114120975A/en
Priority to PCT/CN2022/125163 priority patent/WO2023082931A1/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/26 - Speech to text systems

Abstract

The present disclosure relates to a method, apparatus, and storage medium for speech recognition punctuation recovery. A training method of a model for speech recognition punctuation recovery is provided, comprising the following steps: acquiring text samples and corresponding audio samples for model training, wherein, for text samples obtained from non-audio text, the corresponding audio samples are virtual samples; and training a model for speech recognition punctuation recovery based on the obtained text samples and corresponding audio samples for model training.

Description

Method, apparatus and storage medium for speech recognition punctuation recovery
Technical Field
The present disclosure relates to speech recognition, including punctuation recovery in speech recognition.
Background
Automatic Speech Recognition (ASR) is a technology that converts human speech into text. It has wide application and can serve as an upstream component for multiple tasks, such as speech assistants and speech translation, among others. Existing commercial speech recognition systems often output text without punctuation, which may cause human misunderstanding and affect the performance of downstream tasks such as machine translation and information extraction. Specifically, unpunctuated text, on the one hand, is difficult to read, with unclear and ambiguous sentence breaks; on the other hand, downstream tasks like machine translation and information extraction assume that the input is punctuated, and unpunctuated text degrades their performance.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
According to some embodiments of the present disclosure, there is provided a training method of a model for speech recognition punctuation recovery, comprising the steps of: acquiring text samples and corresponding audio samples for model training, wherein, for text samples obtained from non-audio text, the corresponding audio samples are virtual samples; and training a model for speech recognition punctuation recovery based on the obtained text samples and corresponding audio samples for model training.
According to further embodiments of the present disclosure, there is provided a training apparatus of a model for speech recognition punctuation recovery, comprising: a sample acquisition unit configured to acquire text samples for model training and corresponding audio samples, wherein, for a text sample obtained from non-audio text, the corresponding audio sample is a virtual sample; and a training unit configured to train a model for speech recognition punctuation recovery based on the obtained text samples and corresponding audio samples for model training.
According to some embodiments of the present disclosure, there is provided an electronic device including: a memory; and a processor coupled to the memory, the processor configured to perform the method of any of the embodiments described in the present disclosure based on instructions stored in the memory.
According to some embodiments of the disclosure, a computer-readable storage medium is provided, on which a computer program is stored, which program, when executed by a processor, implements the method of any of the embodiments described in the disclosure.
According to some embodiments of the disclosure, there is provided a computer program product comprising instructions which, when executed by a processor, implement the method of any of the embodiments described in the disclosure.
Other features, aspects, and advantages of the present disclosure will become apparent from the following detailed description of exemplary embodiments thereof, which is to be read in connection with the accompanying drawings.
Drawings
Preferred embodiments of the present disclosure are described below with reference to the accompanying drawings. The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure. It is to be understood that the drawings in the following description are directed to only some embodiments of the disclosure and are not limiting of the disclosure. In the drawings:
fig. 1A to 1C illustrate a flow diagram of a method of training a model for speech recognition punctuation recovery according to some embodiments of the present disclosure.
FIG. 2 illustrates a flow diagram of a method of speech recognition punctuation recovery according to some embodiments of the present disclosure.
FIG. 3 illustrates an exemplary implementation of model training for speech recognition punctuation recovery according to some embodiments of the present disclosure.
Fig. 4A illustrates a block diagram of a training apparatus for a model of speech recognition punctuation recovery according to some embodiments of the present disclosure, and fig. 4B illustrates a block diagram of a speech recognition punctuation recovery apparatus according to some embodiments of the present disclosure.
Fig. 5 illustrates a block diagram of some embodiments of an electronic device of the present disclosure.
Fig. 6 shows a block diagram of further embodiments of the electronic device of the present disclosure.
It should be understood that the dimensions of the various features shown in the drawings are not necessarily drawn to scale for ease of illustration. The same or similar reference numbers are used throughout the drawings to refer to the same or like parts. Thus, once an item is defined in one drawing, it may not be further discussed in subsequent drawings.
Detailed Description
Technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, but it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. The following description of the embodiments is merely exemplary in nature and is in no way intended to limit the disclosure, its application, or uses. It is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect. Unless specifically stated otherwise, the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments should be construed as merely illustrative, and not limiting the scope of the present disclosure.
The term "comprising" and variations thereof as used in this disclosure is intended to be open-ended terms that include at least the following elements/features, but do not exclude other elements/features, i.e., "including but not limited to". Furthermore, the term "comprising" and variations thereof as used in this disclosure is intended to be an open term that includes at least the following elements/features, but does not exclude other elements/features, i.e., "including but not limited to". Thus, including is synonymous with including. The term "based on" means "based at least in part on".
Reference throughout this specification to "one embodiment," "some embodiments," or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. For example, the term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Moreover, the appearances of the phrases "in one embodiment," "in some embodiments," or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment, but may be.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units. Unless otherwise specified, the notions "first", "second", etc. are not intended to imply that the objects so described must be in a given order, either temporally, spatially, in ranking, or in any other manner.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of data, messages or information interacted with in the disclosed embodiments are for illustrative purposes only and are not intended to limit the scope of such data, messages or information.
In order to solve the problem of no punctuation of a speech recognition output text, a punctuation recovery task is provided and a corresponding punctuation model is designed. The punctuation recovery task aims at correctly adding punctuation marks to the output results of an automatic speech recognition system, such as an output text sentence, and in implementation performs punctuation recovery by employing the relevant punctuation models. Conventional punctuation models either use only text or require the provision of text-corresponding audio, but are often limited by the real scene.
On the one hand, conventional punctuation models perform punctuation recovery based only on textual information. However, text information alone is ambiguous, which easily causes problems: the same sentence can take different punctuation marks and thereby carry different meanings. For example, "I don't want any more kids" differs significantly in meaning from "I don't want any more, kids", which shows the importance of the comma; improper punctuation can lead to significant ambiguity or error. Furthermore, because the speaker's mood is unknown, the model may have difficulty determining whether a sentence should end with a period or a question mark.
On the other hand, considering the rich information in speech audio, such as pauses and intonation, sound signals can help alleviate the ambiguity of punctuation models. Therefore, some multi-modal punctuation models have been proposed, which extract acoustic features from speech audio and fuse the acoustic and lexical features by addition/concatenation to facilitate the addition of punctuation symbols. Here, the modalities may relate to text and audio, respectively. However, conventional multi-modal approaches face the problem of missing modalities in practical applications. First, due to storage limitations or privacy policies, the corresponding audio sometimes cannot be accessed, and previous multi-modal methods cannot perform punctuation recovery on text sentences without audio; second, manually labeled text-audio data is very costly and difficult to obtain, which leaves these multi-modal models with sparse training sets and makes it difficult to obtain good multi-modal models.
In view of this, the present disclosure proposes an improvement that enables punctuation model training with both audio-rich text and non-audio text. In particular, noting that audio-free text is readily available, the solution of the present disclosure can make full use of the audio-free text to construct a large number of training sets for training of punctuation models, and provide further optimization training with limited audio, thereby improving the accuracy of punctuation models, and providing punctuation models that can enhance the punctuation recovery effect of text using alternative audio.
In particular, the present disclosure proposes a so-called unified multi-modal punctuation framework, which may be referred to as UniPunc, that enables punctuation model training with both audio-text and non-audio-text. Specifically, for audio-containing text, corresponding text and audio pairs can be obtained, while for non-audio-containing text, virtual content can be constructed to reconstruct corresponding audio thereof, and corresponding text and audio pairs can also be constructed. This allows punctuation model training based on text and audio pairs obtained from both audio-bearing text and non-audio text. Thus, a large amount of easily available non-audio text/corpus can be used as a training set, and ambiguities in punctuation can be reduced using voice input, resulting in an improved accuracy of the obtained punctuation model.
In addition, the scheme of the disclosure can further improve the training and application of the punctuation model in a multilingual scenario. Conventional multi-modal methods can only process single-language input, while data for some languages are rare and difficult to obtain; for example, in many languages a large amount of high-quality text and audio data is hard to acquire, the requirements of model training cannot be met, and direct training yields very poor models. In view of this, the present disclosure addresses the problem of insufficient data in small languages using a multilingual collaborative training method. In particular, the present disclosure may utilize multilingual audio and text together for model training, which enables performance in a small language to be enhanced using readily available data in common languages, resulting in an improved multilingual punctuation model.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings, but the present disclosure is not limited to these specific embodiments. These particular embodiments may be combined with each other below, and details of the same or similar concepts or processes may not be repeated in some embodiments. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner as would be apparent to one of ordinary skill in the art from this disclosure in one or more embodiments.
FIG. 1A illustrates a method of training a model for speech recognition punctuation recovery according to some embodiments of the present disclosure. In the method 100, in step S101, a text sample and a corresponding audio sample for model training are obtained, where for a text sample obtained from an audioless text, the corresponding audio sample is a virtual sample; and in step S102, training a model for speech recognition punctuation recovery based on the obtained text samples and corresponding audio samples for model training.
According to some embodiments of the present disclosure, the text samples and corresponding audio samples used for model training essentially correspond to a set of paired text and audio samples: for each text sample, its corresponding audio sample is obtained, which is either the audio corresponding to the content of the text sample or a virtual sample standing in for the audio of the text sample.
According to some embodiments of the present disclosure, text samples and corresponding audio samples for model training are obtained from both audio-text and non-audio-text, e.g., derived from an initial set containing both audio-text and non-audio-text. The pairs of text samples and audio samples obtained from the audio text may be obtained by, for example, converting/extracting/dividing in various suitable manners, and will not be described in detail herein. Further, from the non-audio text, a text sample is obtained (such a text sample may be referred to as a single text sample) and a virtual sample is obtained as an audio sample corresponding to the text sample. In this way, on the one hand, a large amount of plain text can be utilized without audio; on the other hand, the acoustic features can be effectively utilized in the presence of audio. Thus, a training set can be constructed using a large number of easily-acquired non-audio text/single text samples, and in particular, for a large number of single text samples, pairs of text and audio samples can still be constructed using virtual samples, thereby helping to optimize the model training effect.
In some embodiments, the virtual sample may be a predetermined sample standing in for the audio of the text sample. The virtual sample may take various suitable forms. In particular, it may be a vector with predetermined content, for example a fixed-length, fixed-content vector, preferably an all-zero vector. Of course, it may take other suitable forms. In some embodiments, virtual samples may be constructed for all of the single text samples used for model training, and their virtual samples may be the same as or different from each other.
In some embodiments, the initial set containing both audio text and non-audio text may be obtained in any suitable manner. For example, they may be derived from a conventional database, such as an existing training database in a storage device; or may be acquired using a suitable type of acquisition device, for example, a microphone may be used to acquire audio data, a video device or a suitable text input device may be used to input text data; or user entered text and/or manually labeled audio. In particular, the audio-present text and the audio-absent text may be stored in any suitable manner for obtaining the text samples and the audio samples. For example, in some embodiments, the texts are provided with indication information of whether the associated texts have audio, such as indicators in binary form, etc., so that the model training apparatus can determine whether to generate virtual samples for the text samples based on the indication information, e.g., generate virtual samples only if the indication information indicates that the text samples have no audio.
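As a purely illustrative, non-limiting sketch of the sample acquisition described above (the field names, the binary has-audio indicator, and the fixed-length all-zero virtual sample are assumptions made for illustration):

```python
from dataclasses import dataclass
from typing import List, Optional

VIRTUAL_AUDIO_LEN = 128  # assumed fixed length for the virtual sample

@dataclass
class TrainingPair:
    text: str
    audio: List[float]   # raw audio features, or a virtual sample
    has_audio: bool      # binary indicator described above

def make_pair(text: str, audio: Optional[List[float]]) -> TrainingPair:
    if audio is None:
        # Audio-free text: substitute a predetermined virtual sample,
        # e.g. a fixed-length all-zero vector.
        return TrainingPair(text, [0.0] * VIRTUAL_AUDIO_LEN, has_audio=False)
    return TrainingPair(text, audio, has_audio=True)

pairs = [
    make_pair("how are you today", [0.1, 0.3, 0.2]),   # text with audio
    make_pair("i do not want any more kids", None),    # audio-free text
]
```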
According to some embodiments of the present disclosure, training a punctuation model for speech recognition punctuation recovery may comprise generating a multimodal mixed representation based on the obtained text samples and audio samples for model training, as in step S1021 in fig. 1B, and performing model training based on the generated multimodal mixed representation, as in step S1022 in fig. 1B. In this way, the obtained multi-modal mixed representation may contain information of text and audio obtained from a training sample set efficiently created with both audio text and non-audio text, such multi-modal mixed representation may well support model training. In some embodiments, for each text sample and audio sample pair used for model training, a corresponding multimodal mixture representation is obtained, whereby a multimodal mixture representation of all text samples and audio samples can be obtained.
According to some embodiments of the present disclosure, a multimodal hybrid representation may be obtained by performing processing based on text samples and audio samples, respectively, and combining the results of the processing based on the text samples and the results of the processing based on the audio samples. In some embodiments, the mixed representation is generated based on a sum of processing results of the text samples and the audio samples. It should be noted that the mixed representation may also be generated by performing other suitable processing on the processing results of the text samples and the audio samples, such as a weighted sum or other suitable mathematical operation, which will not be described in detail herein.
According to some embodiments of the present disclosure, the processing performed based on the text samples and the audio samples may be attention (attention) -based processing, and thus, the multimodal mixture representation may be a multimodal mixture representation obtained by performing attention-based processing based on the acquired text samples and audio samples for model training. In particular, attention-based processing is performed based on the text samples and the audio samples, respectively, and the processing results of the text samples and the processing results of the audio samples are combined to obtain a multimodal mixed representation.
According to some embodiments of the present disclosure, the attention-based processing applied to the text samples and the audio samples may be any suitable processing, preferably they are different from each other. In some embodiments, a self-attention (self-attention) mechanism is employed to process the text samples. In some embodiments, a cross-attention (cross-attention) mechanism is employed to process the audio samples. The self-attentive mechanism and the cross-attentive mechanism may employ various suitable architectures/algorithms/models, etc., which will not be described in detail herein. Thus, a multimodal hybrid representation can be generated by combining the self-attentional processing results of the text samples and the cross-attentional processing results of the audio samples.
According to some embodiments of the present disclosure, textual information related to the text samples is also considered when performing attention-based processing of the audio samples for model training. That is, in the processing for an audio sample, text information related to a text sample corresponding to the audio sample will also be an input parameter to perform the processing. In some embodiments, the text information associated with the text sample may be a feature derived from conversion/extraction of the text sample or a result of processing performed on the text sample based on attention, such as self-attention processing.
According to some embodiments of the present disclosure, training a model for speech recognition punctuation recovery based on text samples and corresponding audio samples for model training further comprises performing punctuation model training based on intermediate feature values/sequences converted from the text samples and the audio samples. In some embodiments, the text samples and the audio samples may be separately transformed to obtain intermediate feature values/sequences, and then a multimodal hybrid representation may be generated based on the intermediate feature values/sequences transformed from the text samples and the audio samples for punctuation model training. Preferably, attention-based processing may be performed on the intermediate feature values/sequences converted from the text samples and the audio samples to generate a multi-modal mixture representation, and punctuation model training may be performed based on the multi-modal mixture, as shown in fig. 1C. The attention-based processing performed on the intermediate feature values/sequences may be performed as previously described and will not be described in detail herein.
In some embodiments, the intermediate characteristic values/sequences may be in any suitable form and may be obtained by corresponding operations.
According to some embodiments of the present disclosure, a lexical embedding sequence may be converted from a text sample as its intermediate feature values/sequences. In some embodiments, a text sample may be lexically encoded to obtain a lexical embedding sequence. It should be noted that the encoding of the text samples may be handled in various suitable ways, such as with a pre-trained vocabulary encoder, or with various encoding methods known in the art, which will not be described in detail herein. In some embodiments, self-attention processing may be performed on the vocabulary embedding sequence. In some embodiments, the vocabulary embedding sequence or the result of its self-attention processing may also be applied, as text information related to the text sample, to the attention-based processing for the audio sample, for example as an input parameter to the cross-attention processing for the audio sample.
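As an illustrative sketch of obtaining the lexical embedding sequence with a pre-trained encoder (the specific backbone model and library shown here are examples, not requirements of the present disclosure):

```python
# A hedged sketch: encode an unpunctuated sentence into a lexical embedding
# sequence H_l with an off-the-shelf pre-trained NLP backbone (assumed here
# to be multilingual BERT via the HuggingFace transformers library).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

sentence = "how are you today i am fine"            # unpunctuated ASR-style input
inputs = tokenizer(sentence, return_tensors="pt")   # split into subwords
with torch.no_grad():
    H_l = encoder(**inputs).last_hidden_state       # lexical embedding sequence (1, T, d)
```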
According to some embodiments of the present disclosure, an audio sample may be processed to obtain acoustically embedded content as its intermediate feature values/sequences. In some embodiments, the processing of the audio samples may be implemented in a variety of suitable ways. In some embodiments, the processing of the audio samples may include encoding and/or downsampling of the audio samples. In some embodiments, the processing may be performed accordingly for the text sample with or without audio and for the corresponding audio sample. For example, for audio samples derived from audio text, encoding and/or downsampling may be performed to generate acoustically embedded content, while for virtual samples corresponding to non-audio text, it may be directly used as acoustically embedded content. In an implementation, indication information about whether the text sample has audio, such as an indicator in a binary form, etc., may be input to the model training apparatus along with the text sample and the audio sample, so that the model training apparatus may perform corresponding processing on the audio sample corresponding to the text sample according to the indication information to perform model training. In some embodiments, attention-based processing, such as cross-attention processing, may be performed on the acoustically embedded content.
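As an illustrative sketch of the acoustic branch (the self-supervised feature extractor and the down-sampling network below are assumptions; the present disclosure does not mandate specific models or strides):

```python
# A hedged sketch: extract acoustic features with a pre-trained self-supervised
# model (wav2vec 2.0 is used only as an example) and shorten them with a small
# multi-layer convolutional down-sampling network to form the acoustic embedding.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

acoustic_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

downsample = nn.Sequential(                      # assumed down-sampling network
    nn.Conv1d(768, 768, kernel_size=5, stride=4, padding=2),
    nn.GELU(),
    nn.Conv1d(768, 768, kernel_size=5, stride=4, padding=2),
)

waveform = torch.randn(1, 16000)                 # one second of 16 kHz audio (dummy)
with torch.no_grad():
    feats = acoustic_model(waveform).last_hidden_state       # (1, T_audio, 768)
H_a = downsample(feats.transpose(1, 2)).transpose(1, 2)      # shortened acoustic embedding
```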
Training a model for speech recognition punctuation recovery may include punctuation symbol learning/prediction based on the obtained text samples and audio samples for model training, thereby performing punctuation model training, according to some embodiments of the present disclosure. In some embodiments, punctuation model training can be performed based on punctuation learning/prediction based on the multi-modal hybrid representation described previously. In some embodiments, a classifier may be employed for punctuation learning/prediction. The classifier may be of any suitable form, such as a Linear classifier (Linear + Softmax), but may of course be of any other suitable form and will not be described in detail here.
In this way, the scheme of the disclosure can simultaneously use single text data and paired text and audio data for training to solve the problem of insufficient paired text and audio data, and particularly for single text data without corresponding audio content, virtual samples are used to reconstruct audio as the corresponding audio data for model training, so that a large amount of pure text data can be used for training, and the audio data can be effectively used for enhancing the model training effect.
According to some embodiments of the present disclosure, text and audio samples used to train punctuation models may include multilingual text and audio samples. In some embodiments, multilingual text and audio samples for model training may be obtained as described above, and in particular, for each language text sample, its corresponding audio sample, which is either an audio-form sample of the text sample content, or a virtual sample, may be obtained. Thus, an improved punctuation model can be trained, which can more accurately perform punctuation recovery in a multilingual scenario.
As an example, in the multilingual case there is often a problem of insufficient data for small languages. For many languages it is difficult to obtain a large amount of high-quality text data, and a model trained directly on a small amount of data for such a language performs very poorly; using a method of simultaneous multilingual training, the performance for these languages can be enhanced with the readily available data of common languages. Therefore, a sample library for multilingual model training can be effectively constructed, and a punctuation model more accurately applicable to multiple languages can be obtained by training with the multilingual samples simultaneously.
According to some embodiments of the present disclosure, the multilingual text and audio samples used for model training may be equalized to further optimize the samples for multilingual model training. In some embodiments, the proportion of text samples and/or audio samples of a language with a low proportion, such as samples of languages that are not easily obtained, small languages, or low-resource languages, may be increased by performing expansion processing on those samples. In some embodiments, data expansion may be performed in various suitable manners, such as repeating, interpolating, and so forth, for text samples and/or audio samples of a low-resource language. In some embodiments, equalization may also be achieved by processing the multilingual text samples and/or audio samples with temperature sampling or a similar sampling algorithm. As an example, since data of small languages is rare, temperature sampling increases the proportion of small-language data and decreases the proportion of high-resource-language data according to the data amounts of the different languages. In this way, the proportions of the data of the various languages in the total multilingual data are balanced as much as possible, and the training effect of the model can be further optimized, for example improving the punctuation recovery of the trained model for text in the various languages.
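A minimal sketch of the temperature-sampling idea mentioned above (the counts and temperature are made-up example values):

```python
# Temperature sampling: the probability of drawing data from language i is made
# proportional to n_i ** (1 / T), which raises the share of low-resource ("small")
# languages and lowers the share of high-resource languages as T grows.
counts = {"en": 200_000, "de": 50_000, "pt": 5_000}   # example sentence counts per language
T = 5.0                                               # temperature; T = 1 keeps the raw proportions

weights = {lang: n ** (1.0 / T) for lang, n in counts.items()}
total = sum(weights.values())
probs = {lang: w / total for lang, w in weights.items()}
print(probs)   # the low-resource language's sampling share rises as T increases
```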
An exemplary implementation of punctuation model training according to some embodiments of the present disclosure will be described in detail below with reference to fig. 3. The disclosed solution receives text input and optionally audio input to perform a punctuation recovery task to add punctuation to the text. The text input and the optional audio input may be the aforementioned text samples and audio samples for training, and may also be the aforementioned text samples and audio samples in multiple languages.
The scheme of the present disclosure mainly comprises three stages of operation: the first stage processes the text input and the audio input, respectively; the second stage performs attention-based processing on the processed text input and audio input, respectively, so as to incorporate audio information into the representation of the text and obtain a multi-modal mixed representation; and the third stage performs punctuation prediction based on the multi-modal mixed representation, thereby performing model training. The training process of the present disclosure uses the classic stochastic gradient descent (SGD) algorithm with an Adam optimizer to optimize the cross-entropy loss function.
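As a minimal sketch of the optimization set-up described above (Adam optimizer, cross-entropy loss over per-token punctuation labels); the model here is a stand-in placeholder, not the full architecture described in the following stages:

```python
import torch
import torch.nn as nn

NUM_CLASSES = 4                                  # e.g. comma, period, question mark, none
model = nn.Linear(768, NUM_CLASSES)              # placeholder for the full punctuation model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

hidden = torch.randn(2, 16, 768)                 # (batch, seq_len, hidden): mixed representation
labels = torch.randint(0, NUM_CLASSES, (2, 16))  # per-token punctuation labels

logits = model(hidden)                           # (batch, seq_len, NUM_CLASSES)
loss = criterion(logits.view(-1, NUM_CLASSES), labels.view(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```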
Punctuation recovery is typically modeled as a sequence tagging task. In general, a multimodal punctuation corpus is a set of sentence-audio-punctuation triplets, denoted as S = {x, a, y}, where x is an unpunctuated sentence of length T and a is the corresponding speech audio. The output of the model should be the predicted punctuation sequence y, given x and a. Due to the nature of the sequence tagging task, the length of the punctuation sequence y is the same as that of the unpunctuated sentence x, i.e., |x| = |y|. The punctuation may be any suitable punctuation available for text, for example of four types, i.e., comma (,), period (.), question mark (?), and no punctuation.
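A small illustration of the sequence-tagging view (the tokens and labels are invented for illustration; "O" denotes no punctuation after the token):

```python
# |x| = |y|: one punctuation label per token of the unpunctuated sentence.
x = ["i", "am", "fine", "how", "are", "you"]
y = ["O", "O", ".",    "O",   "O",   "?"]
assert len(x) == len(y)
```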
In the first stage, the text input may be subjected to an encoding process. The encoding process here may be performed in various suitable ways; in particular, the unpunctuated text sentence may be split into a sequence of subwords and converted into a lexical embedding sequence H_l, with one embedding vector per subword.
As an example, the vocabulary encoder may be built using a pre-trained Natural Language Processing (NLP) model as a backbone, and the vocabulary encoder may be fine-tuned on the data of a particular task. In some embodiments, the encoding process of the text may be performed by a text encoder, which may be included in a solution according to an embodiment of the present disclosure.
The audio input may also be processed in the first stage; in particular, the audio input may be converted into an acoustic embedding H_a or a virtual embedding H_i depending on the type of audio input. In some embodiments, the audio processing may be performed by various suitable acoustic processing components, which may be included in solutions consistent with embodiments of the present disclosure.
In particular, for audio input that is an audio representation of the content of the text input, such as a transcription with audio annotations, it may be converted into the acoustic embedding H_a. Specifically, the audio input may be converted into acoustic features by a pre-trained acoustic model/acoustic feature extractor. In general, the acoustic feature extractor may first be pre-trained on an unlabeled audio data set through self-supervised training and may be fine-tuned according to the downstream punctuation recovery task. A down-sampling network can then be further applied to shorten the length of the extracted acoustic features, resulting in the acoustic embedding H_a. The goal is to have the length of the acoustic embedding be close or equal to the length of the sentence embedding so that the model can better align the cross-modal information. As an example, a multi-layer convolutional network may be selected as the core component of the down-sampling network.
For non-audio text, a virtual embedding H_i is used to make up for the possibly missing acoustic features, i.e., if the audio a is absent, then H_a = H_i. As an example, the virtual embedding H_i may be set to a learnable parameter array of fixed length that is expected to learn a representation of the missing audio. That is, the virtual audio sample corresponding to non-audio text as described above may be a predetermined vector sequence, e.g., an all-zero sequence, and the virtual embedding may be derived therefrom, e.g., used directly as the virtual embedding, or shortened such that the length of the virtual embedding is close or equal to that of the sentence embedding, with the shortened sequence acting as the virtual embedding.
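A sketch of such a learnable, fixed-length virtual embedding (the length and dimension are illustrative assumptions):

```python
# When the audio a is absent, the learnable virtual embedding H_i stands in
# for the acoustic embedding H_a.
from typing import Optional
import torch
import torch.nn as nn

d_model, virtual_len = 768, 8
H_i = nn.Parameter(torch.zeros(virtual_len, d_model))   # learnable parameter array

def acoustic_or_virtual(H_a: Optional[torch.Tensor]) -> torch.Tensor:
    return H_i if H_a is None else H_a                  # if a is empty, H_a = H_i
```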
The acoustic and lexical features can then be combined, based on the lexical embedding sequence and the acoustic or virtual embedding, to generate the multi-modal hybrid representation. This operation may be performed by a coordinate bootstrapper (coordinator, also referred to herein as the coordination director), which trains jointly on both audio-bearing text and non-audio text to overcome the modality-missing problem. In particular, the coordination director jointly utilizes the acoustic and lexical features and applies attention-based operations to learn a hybrid representation of the two modalities.
Specifically, a self-attention operation is first performed on the lexical embedding sequence H_l to capture long-range dependencies S_l in the unpunctuated sentence, and a cross-attention operation is applied between the lexical embedding sequence H_l and the acoustic embedding sequence H_a to form a cross-modal representation S_a:
S_l = Att(H_l, H_l, H_l) (1)
S_a = Att(H_a, H_a, H_l) (2)
Here, Att(K, V, Q) = Softmax(Q K^T / √d_k) V is the attention operation, where d_k is the dimension size of the model. Note that for modality-missing samples, the virtual embedding H_i is used in place of the acoustic embedding H_a; in this case, if the audio a is absent, then S_a = Att(H_i, H_i, H_l).
Then, a mixed representation H_h is obtained by adding the attention-processed representations with a residual connection:
H_h = S_l + S_a + H_l (3)
In implementations, the coordination director may be stacked into multiple layers to further increase model capacity.
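A minimal single-layer sketch of the mixing in Eqs. (1) to (3) (the shapes are illustrative; a stacked, batched implementation would follow the same pattern):

```python
# Att(K, V, Q) = softmax(Q K^T / sqrt(d_k)) V, as described above.
import math
import torch

def att(K: torch.Tensor, V: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V

def coordinate(H_l: torch.Tensor, H_a: torch.Tensor) -> torch.Tensor:
    S_l = att(H_l, H_l, H_l)    # Eq. (1): self-attention over the lexical sequence
    S_a = att(H_a, H_a, H_l)    # Eq. (2): cross-attention with text queries, audio keys/values
    return S_l + S_a + H_l      # Eq. (3): residual mixing -> hybrid representation H_h

H_l = torch.randn(16, 768)      # lexical embedding, sentence length T = 16
H_a = torch.randn(40, 768)      # acoustic (or virtual) embedding
H_h = coordinate(H_l, H_a)      # (16, 768): same length as the sentence
```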
Finally, a prediction is made from the mixed representation by an output classifier layer. The output classifier layer consists of a linear projection and a softmax activation function; the mixed representation H_h is input to the classifier layer to predict the punctuation sequence ŷ.
In this way, the representations of samples with audio and samples without audio share the same embedding space. Thus, the model may receive mixed data in the same training batch, and the trained model is able to punctuate both audio-bearing text and non-audio text.
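A sketch of the output classifier layer (the dimensions are illustrative):

```python
# Linear projection + softmax over the hybrid representation H_h, one
# punctuation label per token.
import torch
import torch.nn as nn

num_classes = 4                             # e.g. comma, period, question mark, none
classifier = nn.Linear(768, num_classes)

H_h = torch.randn(1, 16, 768)               # hybrid representation from the coordination director
probs = torch.softmax(classifier(H_h), dim=-1)
y_hat = probs.argmax(dim=-1)                # predicted punctuation sequence, shape (1, 16)
```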
It should be noted that, according to some embodiments of the present disclosure, the aforementioned text encoder, acoustic processing component, and coordination director may all be included as sub-modules/frameworks in a model training apparatus, i.e., UniPunc framework, according to embodiments of the present disclosure. Thus, the UniPunc framework according to the present disclosure provides a universal framework for resolving modal deficiencies in multi-modal punctuation recovery tasks. Furthermore, the approach according to the present disclosure may also be effectively applied to, and advantageously supplement, some punctuation models at present. As an example, the acoustic processing component and coordination director according to the present disclosure may be applied/added to the current punctuation model for modification so that they can be refined to handle modal missing samples.
Furthermore, the above examples of model training are equally applicable to multi-language scenarios, and in particular, models may be trained using data and audio in different languages simultaneously. In particular, the processing of the various stages in the above-described solution according to the present disclosure is performed with data of a plurality of different languages, including text data and/or audio data, as input, for example, audio of each language is input to an audio component, and a vocabulary encoder can be uniformly used for text of different languages, thereby enabling a performance-enhanced model suitable for a multilingual scene to be obtained.
Also, although not shown, before the above-described processing is performed on the text and audio, equalization processing may be performed on the input in multiple languages, and the above-described processing, such as the above-described three-stage processing, is further performed on the text and audio in each language after equalization. And will not be described in detail herein.
Punctuation models according to some embodiments of the present disclosure may be used in a variety of suitable punctuation recovery applications. Punctuation models according to some embodiments of the present disclosure may have good universality and may be applied to any speech recognition system, for example, the text output by the speech recognition system may be further processed to optimize punctuation recovery of the text output by the speech recognition system. For example, punctuation addition may be performed on text to which punctuation has not been added, or verification may be performed on text to which punctuation has been added for further correction.
Fig. 2 illustrates a flow diagram of a punctuation recovery method according to some embodiments of the present disclosure. In the method 200, in step S201, a text to which punctuation marks are to be added, for example, a speech recognition text output, is acquired, and in step S202, punctuation models trained according to the model training method of the present disclosure are applied to the acquired text output to restore punctuation in the text output. Punctuation marks can be properly added to the text, thereby realizing accurate punctuation recovery or punctuation verification/correction.
In some embodiments, the input text may be entered into the model along with associated audio, so that punctuation may be appropriately added to the input text based on both. If only text is input to the model, meaning that the text has no audio, a virtual sample may be generated and input to the model together with the text, and punctuation marks may be added to the input text based thereon. In other embodiments, audio information for the text entered into the punctuation model may be obtained, for example, based on an indicator indicating whether the text has audio: if the indicator indicates that the text has audio, the punctuation model may obtain the corresponding audio together with the text entered into the model, or from a predetermined storage location; if the indicator indicates that the text does not have audio, e.g., is a single text, a virtual audio sample may be generated and input to the punctuation model, whereby the punctuation model predicts from the mixed representation and the punctuation-restored text is obtained. This process may be performed in a similar manner as the model training process described previously, such as the three-stage process described above, and will not be described in detail herein.
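A hedged end-to-end illustration of the recovery step (the label-to-symbol mapping and the label values are invented for illustration; in practice the labels come from the trained punctuation model):

```python
from typing import List

PUNCT = {0: "", 1: ",", 2: ".", 3: "?"}      # assumed label-to-symbol mapping

def attach_punctuation(tokens: List[str], labels: List[int]) -> str:
    # Re-insert the predicted marks after each token of the ASR output.
    return " ".join(tok + PUNCT[lab] for tok, lab in zip(tokens, labels))

tokens = ["how", "are", "you", "i", "am", "fine"]
labels = [0, 0, 3, 0, 0, 2]                   # e.g. output of the punctuation model
print(attach_punctuation(tokens, labels))     # -> "how are you? i am fine."
```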
FIG. 4A illustrates a punctuation model training device 400 according to some embodiments of the present disclosure. The apparatus 400 includes a sample obtaining unit 401 configured to obtain text samples for model training and corresponding audio samples, where for a text sample obtained from an audioless text, its corresponding audio sample is a virtual sample; and a training unit 402 configured to train a model for speech recognition punctuation recovery based on the obtained text samples and corresponding audio samples for model training.
In some embodiments, training unit 402 is further configured to generate a multimodal hybrid representation by performing attention-based processing based on the obtained text samples and audio samples for model training, whereby training unit 402 may perform model training based on the generated multimodal hybrid representation. The above-described multi-modal hybrid representation may be generated by a hybrid representation generation unit 4021. Although not shown, in an exemplary implementation, model training may be performed by a component for performing punctuation prediction, such as a classifier or similar model training component, in which case such a component may be included in the training unit.
In some embodiments, the mixed representation generating unit 4021 may further include a text conversion unit 4022 configured to convert the text samples into vocabulary embedding sequences; an audio conversion unit 4023 configured to convert the audio samples into acoustic embedding sequences; and a joint processing unit 4024 configured to perform attention-based operations on the vocabulary embedding sequence and the acoustic embedding sequence, respectively, and to combine the processing results of the vocabulary embedding sequence and the acoustic embedding sequence to generate the multi-modal hybrid representation. It should be noted that, although not shown, the text conversion unit and the audio conversion unit may be located outside the mixed representation generation unit, or even outside the training unit; that is, the text conversion and the audio conversion may be performed on the samples before model training, so that the samples actually input for model training are the converted samples.
The above units may be implemented in various suitable ways. In an exemplary implementation, the text conversion unit 4022 and the audio conversion unit 4023 may correspond at least to the text encoder and the acoustic processing component described above, respectively; the joint processing unit 4024 may correspond to the coordination director described above; and the mixed representation generating unit 4021 may include or correspond at least to the coordination director described above, and may further include the vocabulary encoder and the acoustic processing component. The training unit 402 may include or correspond at least to the corresponding modules/devices of all the processing stages shown in fig. 3, such as the vocabulary encoder and acoustic processing components, the coordination director, the classifier, and the like, described above.
Fig. 4B illustrates a punctuation recovery device 410 according to some embodiments of the present disclosure. The apparatus 410 includes an obtaining unit 411 configured to obtain a text output of speech recognition, and a punctuation restoration unit 412 configured to apply a punctuation model trained according to a training method of an embodiment of the present disclosure to the obtained text output to restore punctuation in the text output. The punctuation recovery unit herein may perform similar operations/processes as the training unit described above, e.g. may comprise the text encoder and acoustic processing components, coordination director, classifier, etc. described above.
It should be noted that the above units are only logic modules divided according to the specific functions implemented by the units, and are not used for limiting the specific implementation manner, and may be implemented in software, hardware or a combination of software and hardware, for example. In actual implementation, the above units may be implemented as separate physical entities, or may also be implemented by a single entity (e.g., a processor (CPU or DSP, etc.), an integrated circuit, etc.). Furthermore, the various elements described above are shown in dashed lines in the figures to indicate that these elements may not actually be present, but that the operations/functions that they implement may be implemented by the processing circuitry itself.
Further, although not shown, the apparatus may further include a memory which can store various information generated in operation by the apparatus, respective units included in the apparatus, programs and data for operation, data to be transmitted by the communication unit, and the like. The memory may be volatile memory and/or non-volatile memory. For example, memory may include, but is not limited to, Random Access Memory (RAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Read Only Memory (ROM), flash memory. Of course, the memory may also be located outside the device. Optionally, although not shown, the device may also include a communication unit, which may be used to communicate with other devices. In one example, the communication unit may be implemented in a suitable manner as known in the art, e.g., including communication components such as antenna arrays and/or radio frequency links, various types of interfaces, communication units, and so forth. And will not be described in detail herein. Furthermore, the apparatus may also include other components not shown, such as radio frequency links, baseband processing units, network interfaces, processors, controllers, and so forth. And will not be described in detail herein.
The present disclosure presents training and application of an improved punctuation model for punctuation recovery that can be trained using both single text data and pairs of text and audio data, thereby addressing the problem of insufficient pairs of text and audio data. In particular, by using a predefined vector to replace audio as input when no audio exists, a training set can be constructed by using a large amount of plain text data, and the audio data is used for assistance, so that the model training effect is further improved, and the accuracy of a trained model is improved. In addition, the present disclosure may also be trained synergistically with multilingual data, in particular, by using language data with large data volumes to enhance model performance for a small number of languages, resulting in a further optimized punctuation model. Therefore, the plain text data, the data with aligned text and audio and the data of different languages can be trained together, and the performance of the model on the plain text data, the data with aligned text and audio and the data of the small languages is improved. Moreover, such a model may be suitable for various application tasks, in particular for speech recognition tasks, and achieve a better punctuation recovery effect.
The effectiveness of the scheme of the present disclosure will be further demonstrated below in conjunction with examples.
For training and testing data sets, experiments are performed mainly on two real-world corpora, MuST-C and multilingual TEDx (mTEDx), whose audio originates from TED talks. Data sets were constructed as follows: 1) English-Audio: this set contains the English audio and sentences in MuST-C, with audio for each sample. 2) English-Mixed: this set contains all English sentences, with and without audio, from both corpora. Note that English-Audio is a subset of English-Mixed.
On the above data sets, UniPunc according to the scheme of the present disclosure is tested against multi-modal models in the related art to compare performance. The tests show that UniPunc according to embodiments of the present disclosure outperforms the related-art multi-modal models on the English-Audio set, and performs even better on the English-Mixed set, which means that UniPunc of the embodiments of the present disclosure can outperform existing multi-modal models even without using audio-free sentences, and that the performance of the punctuation model according to the present disclosure can be further improved by using audio-free sentences.
On the above English-Mixed data set, tests comparing UniPunc according to the present disclosure with single-modal models in the related art show that UniPunc according to embodiments of the present disclosure can effectively obtain a multi-modal hybrid representation and effectively represent the acoustic features in speech, which significantly improves punctuation recovery for text.
Furthermore, testing on the multiple language data of mTEDx shows that the punctuation recovery achieved by UniPunc according to embodiments of the present disclosure can be closer to human performance, better distinguishing the pauses of commas and periods and the interrogative mood of questions. Furthermore, UniPunc according to the present disclosure outperforms other baselines in multilingual punctuation, which indicates that UniPunc has better robustness and generalization capability.
Moreover, UniPunc of the present disclosure can be readily adapted to existing punctuation recovery methods. In particular, by introducing modules of the UniPunc framework of the present disclosure, particularly the acoustic processing component and/or the coordination director, into an existing punctuation recovery scheme, the performance of that scheme can be further optimized. Experiments show that the UniPunc scheme of the present disclosure, particularly modules such as the acoustic processing component and/or the coordination director, is broadly applicable for handling modality missing in punctuation recovery, allowing a previously single-modal model to process a multi-modal corpus and thereby further improving overall performance.
Some embodiments of the present disclosure also provide an electronic device that is operable to implement the aforementioned operations/functions of the model training apparatus and/or the punctuation recovery apparatus. Fig. 5 illustrates a block diagram of some embodiments of an electronic device of the present disclosure. For example, in some embodiments, the electronic device 5 may be various types of devices, which may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., car navigation terminals), and the like, and fixed terminals such as digital TVs, desktop computers, and the like. For example, the electronic device 5 may comprise a display panel for displaying data and/or execution results utilized in the solution according to the present disclosure. For example, the display panel may be of various shapes, such as a rectangular panel, an elliptical panel, or a polygonal panel. In addition, the display panel can be not only a flat panel, but also a curved panel, or even a spherical panel.
As shown in fig. 5, the electronic apparatus 5 of this embodiment includes: a memory 51 and a processor 52 coupled to the memory 51. It should be noted that the components of the electronic device 5 shown in fig. 5 are only exemplary and not limiting, and the electronic device 5 may have other components according to the actual application. The processor 52 may control other components in the electronic device 5 to perform desired functions.
In some embodiments, memory 51 is used to store one or more computer readable instructions. The processor 52 is configured to execute computer readable instructions, which when executed by the processor 52 implement the method according to any of the embodiments described above. For specific implementation and related explanation of each step of the method, reference may be made to the above-mentioned embodiments, and repeated details are not described herein.
For example, the processor 52 and the memory 51 may be in direct or indirect communication with each other. For example, the processor 52 and the memory 51 may communicate over a network. The network may include a wireless network, a wired network, and/or any combination of wireless and wired networks. The processor 52 and the memory 51 may also communicate with each other via a system bus, which is not limited by the present disclosure.
For example, processor 52 may be embodied as various suitable processors, Processing devices, and the like, such as a Central Processing Unit (CPU), Graphics Processing Unit (GPU), Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The Central Processing Unit (CPU) may be an X86 or ARM architecture, etc. For example, the memory 51 may include any combination of various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The memory 51 may include, for example, a system memory in which an operating system, application programs, a Boot Loader (Boot Loader), databases, and other programs are stored, for example. Various application programs and various data and the like can also be stored in the storage medium.
In addition, according to some embodiments of the present disclosure, where various operations/processes according to the present disclosure are implemented by software and/or firmware, a program constituting the software may be installed from a storage medium or a network onto a computer system having a dedicated hardware structure, for example the computer system 600 shown in fig. 6, which is capable of performing various functions, including the functions described above, when the various programs are installed. Fig. 6 is a block diagram illustrating an example structure of a computer system that may be employed in accordance with some embodiments of the present disclosure.
In fig. 6, a Central Processing Unit (CPU) 601 performs various processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, data required when the CPU 601 executes various processes is also stored as needed. The central processing unit is merely exemplary; other types of processors, such as the various processors described above, may also be used. The ROM 602, RAM 603, and storage section 608 may be various forms of computer-readable storage media, as described below. It is noted that although the ROM 602, RAM 603, and storage section 608 are shown separately in fig. 6, one or more of them may be combined or located in the same or different memory or storage modules.
The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output interface 605 is also connected to bus 604.
The following components are connected to the input/output interface 605: an input section 606 such as a touch screen, a touch pad, a keyboard, a mouse, an image sensor, a microphone, an accelerometer, a gyroscope, or the like; an output section 607 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; a storage section 608 including a hard disk, a magnetic tape, and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 allows communication processing to be performed via a network such as the internet. It will be readily appreciated that while the various devices or modules in the computer system 600 are shown in fig. 6 as communicating via the bus 604, they may also communicate via a network or otherwise, where a network may include a wireless network, a wired network, and/or any combination of wireless and wired networks.
A drive 610 is also connected to the input/output interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 610 as necessary, so that a computer program read out therefrom is installed into the storage section 608 as needed.
In the case where the series of processes described above is implemented by software, a program constituting the software may be installed from a network such as the internet or a storage medium such as the removable medium 611.
According to some embodiments of the disclosure, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer readable medium, the computer program containing program code for performing a method according to some embodiments of the present disclosure. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 609, or may be installed from the storage section 608, or may be installed from the ROM 602. The computer program, when executed by the CPU 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that in the context of this disclosure, a computer-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable medium may be a computer readable signal medium or a computer readable storage medium or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to: an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
In some embodiments, there is also provided a computer program comprising: instructions which, when executed by a processor, cause the processor to perform the method of any of the embodiments described above. For example, the instructions may be embodied as computer program code.
In embodiments of the present disclosure, computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules, components or units described in the embodiments of the present disclosure may be implemented by software, or may be implemented by hardware. The name of a module, component or unit does not, in some cases, constitute a limitation on the module, component or unit itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
According to some embodiments of the present disclosure, a training method of a model for speech recognition punctuation recovery is proposed, comprising the steps of: acquiring text samples and corresponding audio samples for model training, wherein for the text samples obtained from the non-audio text, the corresponding audio samples are virtual samples; and training a model for speech recognition punctuation recovery based on the obtained text samples and audio samples for model training.
In some embodiments, the text samples and corresponding audio samples used for model training may be obtained from both audio-bearing text and non-audio text, where pairs of text samples and audio samples are obtained from audio-bearing text, and from non-audio text, a text sample is obtained and a virtual sample is obtained as the audio sample to which the text sample corresponds.
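For illustration only, the following Python sketch shows one possible way to assemble such training pairs, assuming a single predetermined placeholder object stands in as the virtual audio sample; the names build_training_pairs and VIRTUAL_AUDIO are hypothetical and are not prescribed by the present disclosure.

```python
# A minimal sketch, assuming one fixed placeholder serves as the virtual audio sample.
VIRTUAL_AUDIO = object()  # predetermined stand-in reused for every text-only sample

def build_training_pairs(audio_text_pairs, text_only_corpus):
    """Assemble (text sample, audio sample) pairs from both data sources."""
    pairs = []
    # Audio-bearing text: the transcript and its audio form a real pair.
    for transcript, audio in audio_text_pairs:
        pairs.append((transcript, audio))
    # Non-audio text: the virtual sample stands in as the corresponding audio sample.
    for sentence in text_only_corpus:
        pairs.append((sentence, VIRTUAL_AUDIO))
    return pairs
```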
In some embodiments, the virtual sample is a predetermined sample that stands in for the missing audio of the text sample.
In some embodiments, training a model for speech recognition punctuation recovery may comprise: a hybrid representation is generated based on the obtained text samples and audio samples for model training, and model training is performed based on the hybrid representation.
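For example, assuming the hybrid representation feeds a per-token punctuation classifier trained with cross-entropy (an illustrative choice, not one required by the present disclosure), one training step could be sketched in Python (PyTorch) as follows:

```python
import torch
import torch.nn as nn

def training_step(hybrid_repr, punctuation_labels, classifier, optimizer):
    """One illustrative optimization step over a batch of hybrid representations.

    hybrid_repr:        (batch, seq_len, embed_dim) hybrid representation
    punctuation_labels: (batch, seq_len) gold punctuation label ids, one per token
    classifier:         nn.Module mapping embed_dim to the number of punctuation labels
    """
    logits = classifier(hybrid_repr)                  # (batch, seq_len, num_labels)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),          # flatten tokens
        punctuation_labels.reshape(-1),               # flatten labels
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```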
In some embodiments, the hybrid representation may be a multi-modal hybrid representation obtained by performing attention-based processing based on the acquired text samples and audio samples for model training.
In some embodiments, attention-based processing may be performed on audio samples for model training based on textual information associated with the text samples.
In some embodiments, the text information associated with the text sample may be a lexical feature resulting from conversion of the text sample or a result of processing performed on the text sample based on attention.
In some embodiments, the text sample is converted to a vocabulary embedding sequence, the audio sample is converted to an acoustic embedding sequence, and the attention-based operation is performed on the vocabulary embedding sequence and the acoustic embedding sequence, respectively.
In some embodiments, where the audio sample is in the form of audio corresponding to the text content of the text sample, an acoustic embedding sequence is obtained based on acoustic features extracted from the audio sample; and where the audio sample is a virtual sample, the virtual sample is treated as the acoustic embedding sequence.
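For example, a minimal Python (PyTorch) sketch of such an encoder is given below; it assumes frame-level acoustic features (e.g., filterbanks) for real audio and a learned placeholder vector used directly as the acoustic embedding sequence of a virtual sample. The class name, dimensions, and the learned-placeholder choice are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class SampleEncoder(nn.Module):
    """Turns a text sample and its audio sample into embedding sequences."""

    def __init__(self, vocab_size=10000, embed_dim=256, acoustic_dim=80):
        super().__init__()
        self.word_embedding = nn.Embedding(vocab_size, embed_dim)
        # Projects frame-level acoustic features to the shared embedding dimension.
        self.acoustic_proj = nn.Linear(acoustic_dim, embed_dim)
        # Learned placeholder used as the acoustic embedding when no real audio exists.
        self.virtual_embedding = nn.Parameter(torch.zeros(1, embed_dim))

    def forward(self, token_ids, acoustic_features=None):
        # token_ids: (text_len,) long tensor -> (text_len, embed_dim)
        lexical = self.word_embedding(token_ids)
        if acoustic_features is None:
            # Virtual sample: the placeholder itself is the acoustic embedding sequence.
            acoustic = self.virtual_embedding
        else:
            # Real audio: derive the acoustic embedding sequence from extracted features.
            acoustic = self.acoustic_proj(acoustic_features)
        return lexical, acoustic
```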
In some embodiments, the attention-based processing performed on the text samples for model training may be self-attention-based processing, and the attention-based processing performed on the audio samples for model training may be cross-attention-based processing.
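A minimal sketch of such a fusion step, assuming standard multi-head attention modules and a simple additive combination of the two attended sequences (the present disclosure does not prescribe either choice), might look as follows:

```python
import torch.nn as nn

class HybridFusion(nn.Module):
    """Fuses lexical and acoustic embedding sequences into a multi-modal hybrid representation."""

    def __init__(self, embed_dim=256, num_heads=4):
        super().__init__()
        # Self-attention over the vocabulary embedding sequence.
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Cross-attention: text-side queries attend over the acoustic embedding sequence.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, lexical, acoustic):
        # lexical: (batch, text_len, embed_dim); acoustic: (batch, audio_len, embed_dim)
        text_repr, _ = self.self_attn(lexical, lexical, lexical)
        audio_repr, _ = self.cross_attn(text_repr, acoustic, acoustic)
        # One possible combination of the attended sequences into the hybrid representation.
        return text_repr + audio_repr
```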
In some embodiments, the text samples and corresponding audio samples obtained for model training may include multilingual text samples and audio samples.
In some embodiments, multilingual text samples and audio samples may be equalized to increase the fraction of low-resource language samples.
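For example, one simple equalization strategy is to up-sample (duplicate with replacement) the samples of low-resource languages until every language reaches a comparable share. The sketch below assumes this strategy and hypothetical tuple-shaped samples; neither is mandated by the present disclosure.

```python
import random
from collections import defaultdict

def equalize_by_language(samples, target_per_language=None):
    """Up-sample low-resource languages so each language has a comparable share.

    samples: iterable of (language, text_sample, audio_sample) tuples.
    """
    by_language = defaultdict(list)
    for lang, text, audio in samples:
        by_language[lang].append((lang, text, audio))

    # Default target: bring every language up to the size of the largest one.
    if target_per_language is None:
        target_per_language = max(len(items) for items in by_language.values())

    equalized = []
    for items in by_language.values():
        equalized.extend(items)
        shortfall = target_per_language - len(items)
        if shortfall > 0:
            # Duplicate low-resource samples (sampling with replacement) to reach the target.
            equalized.extend(random.choices(items, k=shortfall))
    random.shuffle(equalized)
    return equalized
```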
According to some embodiments of the present disclosure, a method for speech recognition punctuation recovery is proposed, comprising the steps of: obtaining a text output of speech recognition, and applying a punctuation model obtained by training according to the model training method of any embodiment in the present disclosure to the obtained text output to recover punctuation in the text output.
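For illustration, the sketch below assumes the trained punctuation model exposes a token-level prediction interface and a small punctuation label set; the interface (punct_model.predict), the tokenizer, and the label set are hypothetical assumptions rather than parts of the present disclosure.

```python
# Hypothetical label set mapping model outputs to punctuation marks.
PUNCT_LABELS = {0: "", 1: ",", 2: ".", 3: "?"}

def restore_punctuation(asr_text, punct_model, tokenize):
    """Apply a trained punctuation model to unpunctuated ASR text output."""
    tokens = tokenize(asr_text)               # e.g. whitespace or subword tokenization
    label_ids = punct_model.predict(tokens)   # one punctuation label id per token
    pieces = []
    for token, label_id in zip(tokens, label_ids):
        # Append the predicted punctuation mark (if any) directly after the token.
        pieces.append(token + PUNCT_LABELS.get(label_id, ""))
    return " ".join(pieces)
```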
According to other embodiments of the present disclosure, a training apparatus for a model for speech recognition punctuation recovery is provided, comprising: a sample acquisition unit configured to acquire text samples for model training and corresponding audio samples, wherein, for text samples obtained from non-audio text, the corresponding audio samples are virtual samples; and a training unit configured to train a model for speech recognition punctuation recovery based on the acquired text samples and audio samples for model training.
In some embodiments, the training unit may be further configured to perform attention-based processing based on the acquired text samples and audio samples for model training to generate a multimodal hybrid representation.
In some embodiments, the apparatus may further comprise: a text conversion unit configured to convert the text sample into a vocabulary embedding sequence, and an audio conversion unit configured to convert the audio sample into an acoustic embedding sequence, wherein the training unit is configured to perform attention-based operations on the vocabulary embedding sequence and the acoustic embedding sequence, respectively.
According to further embodiments of the present disclosure, there is provided an apparatus for speech recognition punctuation recovery, comprising: an acquisition unit configured to acquire a text output of speech recognition, and a punctuation restoration unit configured to apply a punctuation model obtained by training according to the model training method of any embodiment described in the present disclosure to the acquired text output to restore punctuation in the text output.
According to still further embodiments of the present disclosure, there is provided an electronic device including: a memory; and a processor coupled to the memory, the memory having instructions stored therein that, when executed by the processor, cause the electronic device to perform the method of any of the embodiments described in this disclosure.
According to further embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor, implements the method of any of the embodiments described in the present disclosure.
According to still further embodiments of the present disclosure, there is provided a computer program comprising: instructions that when executed by a processor cause the processor to perform the method of any of the embodiments described in the present disclosure.
According to some embodiments of the disclosure, there is provided a computer program product comprising instructions which, when executed by a processor, implement the method of any of the embodiments described in the disclosure.
The foregoing description is only exemplary of some embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combinations of the features described above, but also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure. For example, it encompasses technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in this disclosure.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although some specific embodiments of the present disclosure have been described in detail by way of example, it should be understood by those skilled in the art that the foregoing examples are for purposes of illustration only and are not intended to limit the scope of the present disclosure. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the present disclosure. The scope of the present disclosure is defined by the appended claims.

Claims (20)

1. A method of training a model for speech recognition punctuation recovery, comprising the steps of:
acquiring text samples and corresponding audio samples for model training, wherein for the text samples obtained from the non-audio text, the corresponding audio samples are virtual samples; and
training a model for speech recognition punctuation recovery based on the obtained text samples and corresponding audio samples for model training.
2. The method of claim 1, wherein text samples and corresponding audio samples for model training are obtained from both audio-bearing text and non-audio text,
wherein, from the audio-bearing text, paired text samples and audio samples are obtained, and from the non-audio text, a text sample is obtained and a virtual sample is obtained as the audio sample corresponding to the text sample.
3. The method of claim 1 or 2, wherein the virtual sample is a predetermined sample that stands in for the missing audio of the text sample.
4. The method of claim 1, wherein training a model for speech recognition punctuation recovery comprises:
generating a multimodal hybrid representation based on the obtained text samples and audio samples for model training, and
model training is performed based on the multi-modal hybrid representation.
5. The method of claim 4, wherein the multimodal hybrid representation is obtained by performing attention-based processing based on the obtained text samples and audio samples for model training.
6. The method of claim 5, wherein the attention-based processing is performed on the audio samples for model training based on text information associated with the text samples.
7. The method of claim 6, wherein the text information related to the text sample is a lexical feature obtained by converting the text sample or a processing result obtained by performing attention-based processing on the text sample.
8. The method of claim 1, wherein,
the text sample is converted into a vocabulary embedding sequence,
the audio sample is converted into an acoustic embedding sequence,
attention-based operations are performed on the vocabulary embedding sequence and the acoustic embedding sequence, respectively; and
the operated-on vocabulary embedding sequence and acoustic embedding sequence are combined to generate a multi-modal hybrid representation.
9. The method of claim 8, wherein,
in the case where the audio sample is in an audio form corresponding to the text content of the text sample, obtaining an acoustic embedding sequence based on acoustic features extracted from the audio sample;
in the case where the audio sample is a virtual sample, the virtual sample is taken as an acoustic embedding sequence.
10. The method according to any of claims 5-9, wherein the attention-based processing performed on the text samples for model training is self-attention-based processing and the attention-based processing performed on the audio samples for model training is cross-attention-based processing.
11. The method of any of claims 1-10, wherein the text samples and corresponding audio samples obtained for model training comprise multilingual text samples and corresponding audio samples.
12. The method of claim 11, wherein the multilingual text samples and corresponding audio samples are equalized so as to increase the fraction of low-resource language samples.
13. A method for speech recognition punctuation recovery, comprising the steps of:
obtaining a text output of speech recognition, and
applying a punctuation model trained according to the method of any one of claims 1-12 to the obtained text output to recover punctuation in the text output.
14. A training apparatus for a model for speech recognition punctuation recovery, comprising:
a sample acquisition unit configured to acquire a text sample for model training and a corresponding audio sample, wherein, for a text sample obtained from non-audio text, the corresponding audio sample is a virtual sample; and
a training unit configured to train a model for speech recognition punctuation recovery based on the obtained text samples and corresponding audio samples for model training.
15. The apparatus of claim 14, wherein the training unit further comprises a hybrid representation generating unit configured to:
perform attention-based processing based on the acquired text samples and audio samples for model training to generate a multimodal hybrid representation.
16. The apparatus of claim 15, wherein the training unit further comprises:
a text conversion unit configured to convert the text sample into a vocabulary embedding sequence,
an audio conversion unit configured to convert the audio sample into an acoustic embedding sequence, and
a joint processing unit configured to perform attention-based operations on the vocabulary embedding sequence and the acoustic embedding sequence, respectively, and to combine the operated-on vocabulary embedding sequence and acoustic embedding sequence to generate a multi-modal hybrid representation.
17. An apparatus for speech recognition punctuation recovery, comprising:
an acquisition unit configured to acquire a text output of speech recognition, and
a punctuation restoration unit configured to apply a punctuation model trained according to the method of any one of claims 1-12 to the obtained text output to restore punctuation in the text output.
18. An electronic device, comprising:
a memory; and
a processor coupled to the memory, the memory having stored therein instructions that, when executed by the processor, cause the electronic device to perform the method of any of claims 1-12.
19. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-12.
20. A computer program product comprising instructions which, when executed by a processor, cause the method according to any of claims 1-12 to be carried out.
CN202111335102.5A 2021-11-11 2021-11-11 Method, apparatus and storage medium for speech recognition punctuation recovery Pending CN114120975A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111335102.5A CN114120975A (en) 2021-11-11 2021-11-11 Method, apparatus and storage medium for speech recognition punctuation recovery
PCT/CN2022/125163 WO2023082931A1 (en) 2021-11-11 2022-10-13 Method for punctuation recovery in speech recognition, and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111335102.5A CN114120975A (en) 2021-11-11 2021-11-11 Method, apparatus and storage medium for speech recognition punctuation recovery

Publications (1)

Publication Number Publication Date
CN114120975A true CN114120975A (en) 2022-03-01

Family

ID=80378660

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111335102.5A Pending CN114120975A (en) 2021-11-11 2021-11-11 Method, apparatus and storage medium for speech recognition punctuation recovery

Country Status (2)

Country Link
CN (1) CN114120975A (en)
WO (1) WO2023082931A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023082931A1 (en) * 2021-11-11 2023-05-19 北京有竹居网络技术有限公司 Method for punctuation recovery in speech recognition, and device and storage medium
CN116229994A (en) * 2023-05-08 2023-06-06 北京爱数智慧科技有限公司 Construction method and device of label prediction model of Arabic language

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102231278B (en) * 2011-06-10 2013-08-21 安徽科大讯飞信息科技股份有限公司 Method and system for realizing automatic addition of punctuation marks in speech recognition
US20190043486A1 (en) * 2017-08-04 2019-02-07 EMR.AI Inc. Method to aid transcribing a dictated to written structured report
CN109859760A (en) * 2019-02-19 2019-06-07 成都富王科技有限公司 Phone robot voice recognition result bearing calibration based on deep learning
CN113450756A (en) * 2020-03-13 2021-09-28 Tcl科技集团股份有限公司 Training method of voice synthesis model and voice synthesis method
WO2021215262A1 (en) * 2020-04-20 2021-10-28 株式会社Nttドコモ Punctuation mark delete model training device, punctuation mark delete model, and determination device
CN112712804B (en) * 2020-12-23 2022-08-26 哈尔滨工业大学(威海) Speech recognition method, system, medium, computer device, terminal and application
CN113284485B (en) * 2021-07-09 2021-11-09 中国科学院自动化研究所 End-to-end system for unified Chinese and English mixed text generation and voice recognition
CN113488061B (en) * 2021-08-05 2024-02-23 国网江苏省电力有限公司 Distribution network dispatcher identity verification method and system based on improved Synth2Aug
CN114120975A (en) * 2021-11-11 2022-03-01 北京有竹居网络技术有限公司 Method, apparatus and storage medium for speech recognition punctuation recovery

Also Published As

Publication number Publication date
WO2023082931A1 (en) 2023-05-19

Similar Documents

Publication Publication Date Title
JP7112536B2 (en) Method and apparatus for mining entity attention points in text, electronic device, computer-readable storage medium and computer program
US20190005946A1 (en) Method and apparatus for correcting speech recognition result, device and computer-readable storage medium
KR20190073525A (en) Implicit bridging of machine learning tasks
WO2023082931A1 (en) Method for punctuation recovery in speech recognition, and device and storage medium
CN112509562B (en) Method, apparatus, electronic device and medium for text post-processing
CN111382261B (en) Abstract generation method and device, electronic equipment and storage medium
CN112633947B (en) Text generation model generation method, text generation method, device and equipment
CN111563390B (en) Text generation method and device and electronic equipment
CN111489735B (en) Voice recognition model training method and device
CN113139391B (en) Translation model training method, device, equipment and storage medium
WO2022037419A1 (en) Audio content recognition method and apparatus, and device and computer-readable medium
US20240029709A1 (en) Voice generation method and apparatus, device, and computer readable medium
CN112270200B (en) Text information translation method and device, electronic equipment and storage medium
CN112417902A (en) Text translation method, device, equipment and storage medium
CN112883968B (en) Image character recognition method, device, medium and electronic equipment
CN109408834A (en) Auxiliary machinery interpretation method, device, equipment and storage medium
CN112380876A (en) Translation method, device, equipment and medium based on multi-language machine translation model
CN111400454A (en) Abstract generation method and device, electronic equipment and storage medium
CN111339789B (en) Translation model training method and device, electronic equipment and storage medium
WO2023138361A1 (en) Image processing method and apparatus, and readable storage medium and electronic device
CN112364653A (en) Text analysis method, apparatus, server and medium for speech synthesis
CN115983294B (en) Translation model training method, translation method and translation equipment
CN112580343A (en) Model generation method, question and answer quality judgment method, device, equipment and medium
WO2023011260A1 (en) Translation processing method and apparatus, device and medium
CN111815274A (en) Information processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination